TL;DR: A/B Testing Sample Size
- A/B testing sample size is the number of visitors each variation needs before your A/B test results are statistically reliable.
- Running a test with too few visitors produces misleading results – even if one variant looks like a winner.
- To calculate sample size, you need your baseline conversion rate, minimum detectable effect, statistical power, and significance level.
- Most tests need thousands to tens of thousands of visitors per variation to reach reliable conclusions, and pages with low conversion rates need more.
- Tools like Evan Miller’s calculator and AB Testguide make the math straightforward – but understanding the inputs matters more than the tool.
A/B Testing Sample Size Explained for Beginners
Before diving into formulas, let’s get the fundamentals out of the way – because most sample size mistakes start with a shaky understanding of what “enough data” actually means.
An A/B test splits your audience into two groups. Group A sees the original (the control). Group B sees a changed version (the variant). You measure which one produces a better outcome – more signups, more purchases, lower bounce rate, whatever the goal is.
The problem? Random variation exists in every dataset. On any given day, your conversion rate will fluctuate slightly – not because of what you changed, but because of who happened to visit your site. Sample size is how you account for that noise.
Think of it this way. Flip a coin ten times and you might get seven heads. Flip it a thousand times and you’ll land close to 500. The more flips, the less random variation distorts your picture. The same logic applies to A/B tests. Small samples amplify noise. Large samples reveal signal.
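To put rough numbers on the coin-flip intuition, here is a minimal sketch using the standard binomial approximation for the standard error of a conversion rate. The 3% true rate and the visitor counts are illustrative assumptions, not figures from any real test.

```python
import math

def standard_error(rate: float, visitors: int) -> float:
    """Standard error of an observed conversion rate (binomial approximation)."""
    return math.sqrt(rate * (1 - rate) / visitors)

true_rate = 0.03  # assumed true conversion rate, for illustration only

for visitors in (200, 2_000, 20_000):
    se = standard_error(true_rate, visitors)
    # About 95% of observed rates land within 1.96 standard errors of the true rate.
    low = max(0.0, true_rate - 1.96 * se)
    high = true_rate + 1.96 * se
    print(f"{visitors:>6} visitors: observed rate typically between {low:.2%} and {high:.2%}")
```

Each tenfold increase in visitors narrows the band by a factor of about 3.2 (the square root of 10), which is why small samples look so much noisier than large ones.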
This is where statistical significance comes in: a measure of how unlikely the difference you observed would be if random chance were the only thing at work. The standard threshold most teams use is 95% confidence (p < 0.05), meaning that if your change truly had no effect, a difference this large would show up only 5% of the time.
But significance alone isn’t enough. You also need statistical power – typically set at 80% – which is the probability that your test will actually detect a real effect when one exists. Low power means you’ll often run a test that shows “no difference” when one is actually there. Both matter. Together, they determine your sample size.
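To make power concrete, the sketch below estimates the chance that a test of a given size detects a real lift. It uses a normal approximation for a two-sided two-proportion test; the function name `approx_power` and the example rates are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def approx_power(p1: float, p2: float, n_per_variation: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed rates.
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

for n in (5_000, 10_000, 25_000):
    print(f"{n:>6} visitors per variation: power is about {approx_power(0.032, 0.0368, n):.0%}")
```

With these assumed rates (3.2% versus 3.68%), a test with 5,000 visitors per variation would detect the lift only about a quarter of the time, so a "no difference" result from it says very little.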
How Sample Size Affects A/B Testing Results and Accuracy

Sample size isn’t just a technicality. It’s the difference between a test you can act on and one that wastes weeks of traffic.
Here’s what happens when you get it wrong in each direction.
Too small: You declare a winner prematurely. The variant looks 15% better after 200 visitors, you call it done, ship it – and the “lift” disappears within a week. This is called a false positive, or Type I error. It’s surprisingly common. Research from Evan Miller found that stopping tests early can inflate false positive rates to as high as 26% even when using a 5% significance threshold.
Too large: You keep a test running well past the point where it’s statistically valid, wasting traffic and delaying a decision you could have made weeks ago. Oversized tests also create their own distortions – seasonal trends, external events, and audience mix shifts can muddy results if a test runs for months.
The target is a sample size that gives you:
- Enough power to detect the effect you’re looking for
- A controlled false positive rate
- A realistic runtime given your actual traffic volume
One more thing to understand: effect size matters enormously. If you’re looking for a 30% lift in conversions, you need far fewer visitors than if you’re trying to detect a 5% improvement. Smaller effects require larger samples. This is why teams with modest traffic should either test bolder changes or accept longer runtimes – not skip the math and call it early.
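A rough way to see why smaller effects are so expensive: under the standard two-proportion approximation, and assuming the two rates are close enough to share one variance term, the required sample scales with the inverse square of the relative lift r. The symbols p (baseline rate) and r (relative lift) are introduced here for illustration.

```latex
n \approx \frac{2\,(Z_{\alpha/2} + Z_{\beta})^2\, p(1-p)}{(p\,r)^2}
  = \frac{2\,(Z_{\alpha/2} + Z_{\beta})^2\,(1-p)}{p\,r^2}
\quad\Longrightarrow\quad
\frac{n_{r=0.05}}{n_{r=0.30}} \approx \left(\frac{0.30}{0.05}\right)^2 = 36
```

Under this approximation, detecting a 5% lift takes about 36 times the traffic of detecting a 30% lift. The exact ratio is somewhat lower once the variance terms are computed separately for each rate, but the order of magnitude holds.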
A/B Testing Sample Size: How to Calculate It Step by Step
You don’t need a statistics degree to do this right. Here’s how to work through it.
Step 1: Find Your Baseline Conversion Rate
This is your current conversion rate – before any changes. Pull it from your analytics tool for a representative period. Use at least 30 days, ideally 60–90, to smooth out weekly fluctuation.
Example: Your landing page converts at 3.2%.
Step 2: Define Your Minimum Detectable Effect (MDE)
The MDE is the smallest improvement worth detecting. If a 2% relative lift (moving from 3.2% to 3.26%) wouldn’t change any business decision, don’t design the test around it.
Most teams use an MDE of 10–20% relative improvement. Higher MDEs = smaller required sample sizes. Lower MDEs = larger required sample sizes.
Example: You want to detect a 15% relative improvement – moving your baseline from 3.2% to 3.68%.
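Relative and absolute lifts are easy to mix up, so here is a minimal sketch of the conversion between them. The helper names `target_rate` and `absolute_mde` are hypothetical.

```python
def target_rate(baseline: float, relative_mde: float) -> float:
    """Conversion rate the variant must reach for a given relative lift."""
    return baseline * (1 + relative_mde)

def absolute_mde(baseline: float, relative_mde: float) -> float:
    """The same lift expressed as an absolute difference in conversion rate."""
    return baseline * relative_mde

baseline = 0.032      # 3.2% baseline from the example above
relative_mde = 0.15   # 15% relative improvement

print(f"Target rate:  {target_rate(baseline, relative_mde):.2%}")
print(f"Absolute MDE: {absolute_mde(baseline, relative_mde) * 100:.2f} percentage points")
```

Most calculators ask which kind of lift you are entering. Typing 15% into an absolute field by mistake would describe a jump from 3.2% to 18.2%, a very different test.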
Step 3: Set Your Statistical Power and Significance Level
Power is conventionally 80% (sometimes 90% for high-stakes tests). The confidence level is conventionally 95% (α = 0.05), sometimes 90% (α = 0.10) for lower-stakes tests.
These are defaults you can adjust – but understand the tradeoff. Raising power from 80% to 90% requires roughly a third more visitors.
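The cost of changing these defaults can be estimated from the (Z_α/2 + Z_β)² term that appears in standard sample size formulas, since the required sample is proportional to it. This is a sketch under that assumption; `z_multiplier` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_multiplier(confidence: float = 0.95, power: float = 0.80) -> float:
    """(Z_alpha/2 + Z_beta)^2 for a two-tailed test; sample size is proportional to this."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2

default = z_multiplier(0.95, 0.80)

for confidence, power in [(0.95, 0.80), (0.95, 0.90), (0.90, 0.80), (0.99, 0.80)]:
    ratio = z_multiplier(confidence, power) / default
    print(f"confidence {confidence:.0%}, power {power:.0%}: {ratio:.2f}x the default sample")
```

Running it prints each setting's sample requirement relative to the 95% confidence, 80% power default, which makes the price of a stricter test explicit before you commit traffic to it.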
Step 4: Plug Into a Sample Size Formula (or Calculator)
The simplified formula for a two-sample proportion test, where n is the required number of visitors per variation, is:
n = (Z_α/2 + Z_β)² × (p1(1 − p1) + p2(1 − p2)) / (p1 − p2)²
Where:
- Z_α/2 = 1.96 (for 95% significance)
- Z_β = 0.84 (for 80% power)
- p1 = baseline conversion rate
- p2 = expected conversion rate with variant
This gets unwieldy quickly. Use a calculator instead.
Reliable free calculators:
- Evan Miller’s A/B Test Sample Size Calculator – the most-cited, simplest to use
- AB Testguide Sample Size Calculator – includes power and significance controls
- VWO’s A/B Testing Calculator – also estimates test duration
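As a scripted alternative to the calculators above, the Step 4 formula can be sketched in a few lines of Python. The function name and defaults are assumptions for illustration, and web calculators may use slightly different variance approximations, so expect small differences from their output.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(
    baseline: float,
    relative_mde: float,
    confidence: float = 0.95,
    power: float = 0.80,
) -> int:
    """Visitors needed per variation for a two-tailed two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # expected rate with the variant
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    z_beta = NormalDist().inv_cdf(power)                      # about 0.84 at 80%
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2
    return math.ceil(n)

# Inputs from Steps 1 and 2: 3.2% baseline, 15% relative MDE.
print(sample_size_per_variation(0.032, 0.15))
```

For the 3.2% baseline and 15% relative MDE from Steps 1 and 2, this comes out at roughly 22,600 visitors per variation.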
Step 5: Translate Sample Size Into Runtime
Once you have your per-variation sample size, multiply it by the number of variations and divide by the daily unique visitors entering the test to get the test duration.
Example:
- Required sample per variation: 4,800
- Two variations = 9,600 total visitors needed
- Daily traffic to test page: 320 visitors
- Estimated runtime: ~30 days
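The same arithmetic as a small helper, assuming every daily visitor enters the test and traffic is split evenly across variations. The function name is hypothetical.

```python
import math

def estimated_runtime_days(sample_per_variation: int, variations: int, daily_visitors: int) -> int:
    """Days needed to collect the full sample, rounded up to whole days."""
    total_needed = sample_per_variation * variations
    return math.ceil(total_needed / daily_visitors)

# Figures from the example above.
print(estimated_runtime_days(4_800, 2, 320))  # 9,600 / 320 = 30 days
```

If only part of your traffic is eligible for the test, pass that smaller daily figure instead. The runtime grows in proportion.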
If that feels too long, you have two options: increase the MDE (test bigger changes) or wait and collect more traffic before running the test. Don’t shorten the timeline by loosening your significance threshold (say, from 95% to 80% confidence) – that just increases the chance of a false positive.
How Many Visitors You Need for a Reliable A/B Test
There’s no single answer – but there are useful benchmarks.
| Baseline Conversion Rate | Target MDE (Relative) | Sample per Variation |
| --- | --- | --- |
| 1% | 20% | ~42,700 |
| 2% | 15% | ~36,700 |
| 3% | 15% | ~24,200 |
| 5% | 10% | ~31,200 |
| 10% | 10% | ~14,700 |
Assumes 95% significance, 80% power, two-tailed test, computed with the two-sample proportion formula from Step 4.

A few patterns worth noting from the table:
Lower baseline conversion rates demand dramatically more traffic. At the same relative improvement, a landing page converting at 1% needs roughly three times the visitors of one converting at 3%.
This is why A/B testing low-traffic pages is so painful. A page with 50 visitors per day converting at 1% – assuming you want to detect a 20% lift – needs about 42,700 visitors per variation. Split across two variations, that is roughly 1,700 days of traffic, or more than four and a half years. That’s not a test. That’s a guess with extra steps.
What can you do if you don’t have enough traffic?
- Test bigger changes. A redesign that might move the needle 30–40% is more detectable than a button color tweak.
- Combine similar pages into a single test if the audience is consistent.
- Use a one-tailed test if you have a strong directional hypothesis – this reduces the required sample by about 20%, though it comes with its own caveats.
- Consider pre-validation before a live test. Platforms like Articos use synthetic A/B testing – where AI personas evaluate your variants before any live traffic is involved. This isn’t a replacement for a live test when you have the traffic, but it’s a practical way to rule out weak variants early and go into your live test with stronger hypotheses.
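For the one-tailed option in the list above, the saving can be checked from the (Z_α + Z_β)² term that the required sample is proportional to. This is a sketch assuming the usual normal approximation; `multiplier` is a hypothetical helper name.

```python
from statistics import NormalDist

def multiplier(alpha: float = 0.05, power: float = 0.80, tails: int = 2) -> float:
    """(Z_alpha + Z_beta)^2, with alpha split across the given number of tails."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / tails)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2

reduction = 1 - multiplier(tails=1) / multiplier(tails=2)
print(f"One-tailed test needs about {reduction:.0%} fewer visitors")
```

The saving comes entirely from giving up the ability to detect a variant that performs worse than the control, which is the main caveat to weigh.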
Common A/B Testing Sample Size Mistakes to Avoid
Most bad A/B tests aren’t bad because of the variant – they’re bad because of how the test was run. These are the mistakes that show up repeatedly.
1. Peeking at Results and Stopping Early
The most common mistake, by a wide margin. You launch a test, check it on day three, see a 22% lift, and call it done. But that lift wasn’t real – it was noise that happens to look clean this early in the data.
A study by Ronny Kohavi and colleagues at Microsoft found that the majority of A/B tests run by teams without statistical discipline produce false winners – largely because of early stopping.
The fix is straightforward: calculate your required sample size before the test, then don’t look at the data until you’ve hit it.
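The effect of peeking can be reproduced with a small simulation of A/A tests, where both arms share the same true conversion rate and any declared winner is a false positive by construction. This is a minimal sketch: the 5% rate, ten looks, 200 visitors per arm per look, and the pooled z-test are all illustrative assumptions.

```python
import math
import random

def z_test_significant(conv_a: int, conv_b: int, n: int, z_crit: float = 1.96) -> bool:
    """Two-proportion z-test with pooled variance; n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > z_crit

def simulate_aa_tests(rate=0.05, looks=10, visitors_per_look=200, runs=500, seed=1):
    """Return (false positive rate with one final look, rate when stopping at any significant look)."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        any_look_significant = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            significant = z_test_significant(conv_a, conv_b, look * visitors_per_look)
            if significant:
                any_look_significant = True  # a peeker would have stopped here
        fixed_hits += significant            # outcome of the final look only
        peeking_hits += any_look_significant
    return fixed_hits / runs, peeking_hits / runs

fixed_rate, peeking_rate = simulate_aa_tests()
print(f"Single look at the planned sample size: {fixed_rate:.1%} false positives")
print(f"Stop at the first significant look:     {peeking_rate:.1%} false positives")
```

The single-look rate should hover near the nominal 5%, while the stop-when-significant rate comes out several times higher, even though nothing about the page changed.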
2. Running the Test for Days, Not Weeks
Even if you hit your sample size quickly, running a test for less than a full week (or stopping partway through one) introduces day-of-week bias. Most sites see very different user behavior on weekdays vs. weekends. A test that runs Monday to Thursday is measuring a non-representative audience.
Aim for full business cycles – at least one to two complete weeks, ideally more.
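One way to apply this rule is to round the sample-size runtime up to whole weeks and enforce a floor. A minimal sketch, with a hypothetical helper name and a two-week minimum as the assumed default:

```python
import math

def runtime_in_full_weeks(days_for_sample: int, minimum_weeks: int = 2) -> int:
    """Round a runtime up to whole weeks, never below the minimum number of weeks."""
    weeks = max(minimum_weeks, math.ceil(days_for_sample / 7))
    return weeks * 7

print(runtime_in_full_weeks(30))  # 30 days rounds up to 5 full weeks = 35 days
print(runtime_in_full_weeks(4))   # sample reached in 4 days, still run 14
```

Rounding up means each weekday appears the same number of times in both arms' data.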
3. Testing Too Many Variations
Every variation you add dilutes your traffic and multiplies your testing timeline. A five-variant test on moderate-traffic pages can take months. Worse, multiple comparisons inflate your false positive rate – if you’re running five variants against a control at 95% significance each, the chance that at least one shows a false win is roughly 23% (1 − 0.95⁵, treating the comparisons as independent), not 5%.
Run the minimum number of variants needed to answer your question. Usually that’s two – control and one challenger.
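The inflation from multiple comparisons is easy to compute if the comparisons are treated as independent. That is an approximation, since variants compared against a shared control are correlated. The sketch also shows a Bonferroni correction, which is one common and conservative adjustment that this article does not itself prescribe.

```python
def family_wise_error_rate(comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_alpha(comparisons: int, alpha: float = 0.05) -> float:
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / comparisons

for k in (1, 2, 5):
    print(
        f"{k} comparison(s): {family_wise_error_rate(k):.1%} overall false positive rate, "
        f"Bonferroni threshold p < {bonferroni_alpha(k):.3f}"
    )
```

A stricter per-comparison threshold raises the required sample size for every variant, which is another reason to keep the variant count low.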
4. Ignoring Segment-Level Effects
Overall results hide a lot. A variant that “wins” on aggregate might be losing with mobile users, new visitors, or a specific traffic source – all of which matter if they represent a large chunk of revenue.
Segment your results after the test, not during. Looking at segments mid-test risks the same peeking problem as looking at overall results early.
5. Setting an Unrealistic MDE
Teams often set their MDE based on what would be “a good result” rather than what’s actually detectable given their traffic. Setting an MDE of 5% relative improvement on a low-traffic page means you’ll need years of data. Set the MDE based on the traffic you have and the timeline you can sustain, then design your test accordingly.
6. Not Accounting for Multiple Tests Running Simultaneously
If multiple tests are live at the same time and the test audiences overlap, results contaminate each other. One test might be influencing behavior that shows up in another test’s metrics. Keep test segments isolated, especially if tests touch the same user journey.
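One common way to keep overlapping tests isolated is deterministic, hash-based assignment: each user is placed in exactly one of the tests that share a user journey, then assigned a variant within that test. The sketch below is a generic illustration, not any specific platform's API; the layer name, test names, and user ID are hypothetical.

```python
import hashlib

def bucket(user_id: str, salt: str, buckets: int) -> int:
    """Map a user to a stable bucket number using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def assign_exclusive_test(user_id: str, tests: list[str], layer: str = "checkout-layer") -> str:
    """Place each user in exactly one of the tests sharing a layer."""
    return tests[bucket(user_id, layer, len(tests))]

def assign_variant(user_id: str, test_name: str, variants=("control", "variant")) -> str:
    """Pick the user's variant within their test, salted by the test name."""
    return variants[bucket(user_id, test_name, len(variants))]

tests = ["new-pricing-table", "shorter-signup-form"]  # hypothetical test names
user = "user-1234"                                    # hypothetical user ID
test = assign_exclusive_test(user, tests)
print(user, "->", test, "/", assign_variant(user, test))
```

The tradeoff is traffic: two mutually exclusive tests each see half the visitors, so each one's runtime roughly doubles.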
How Articos Can Help With A/B Testing Before You Have Enough Traffic
Traditional A/B testing requires a waiting game – you need sufficient live traffic before results mean anything. For teams with modest visitor volumes, this isn’t just inconvenient; it means months pass before a decision gets made.
Articos approaches A/B testing differently. Instead of waiting for live traffic, you upload your variants, select your test goals (Conversion Clarity, Value Proposition, CTA Effectiveness, Message Resonance, and others), and synthetic AI personas – trained on real behavioral data – conduct structured comparative interviews across your variants. The output is a research report showing how each variant performs on your chosen dimensions.
This isn’t a substitute for a live traffic test when you have the volume. What it does well is front-load the decision-making. You can eliminate weak variants before a live test, sharpen your hypothesis, and go into the experiment with higher confidence – which means you’re testing fewer ideas for longer, rather than cycling through variants that probably wouldn’t have worked anyway.
For growth teams at earlier-stage SaaS companies or product managers running research on low-traffic features, this pre-validation layer tends to make live A/B testing more efficient. Not a replacement – a filter.
FAQs: A/B Testing Sample Size
What is the minimum sample size for an A/B test?
There’s no universal minimum, but a common rule of thumb treats anything under 100 conversions per variation as unreliable. In practice, you need enough to detect the effect size you’re targeting – which for most realistic conversion rates and MDEs means thousands to tens of thousands of visitors per variation. Use a sample size calculator with your specific inputs rather than assuming a fixed floor.
How long should an A/B test run?
Long enough to collect the required sample size and cover at least one to two full business cycles (typically one to two weeks minimum). If your required sample size would take six months of traffic to collect, the test design needs to change – not the timeline.
What is statistical significance in A/B testing?
Statistical significance is a measure of how confident you can be that the difference between your control and variant wasn’t caused by random chance. At 95% significance (the standard), a difference as large as the one you observed would appear only 5% of the time if the variant truly had no effect. It doesn’t tell you how big the effect is – only how unlikely the result would be under pure chance.
What happens if you stop an A/B test early?
Your false positive rate goes up sharply. When you stop a test the moment results look good, you’re exploiting a moment of favorable random variation – not a genuine effect. Repeatedly checking and stopping at the first significant result can push the false positive rate to around 26% even with a nominal 5% significance level. Determine your required sample size before the test and commit to running it until you hit that number.
Why do low conversion rates need larger samples?
Because the noise is larger relative to the effect you’re trying to detect. Mathematically, variance in a proportion is p(1-p) – which peaks at 50% and shrinks toward 0, so a 1% rate actually has less absolute variance than a 10% rate. But a 20% relative lift on a 1% baseline is only 0.2 percentage points, and that gap shrinks faster than the noise does. More data is required to separate signal from noise.
Can you run an A/B test with low traffic?
You can run a test, but reaching statistical validity will take a long time for most conversion goals. Options include: testing larger, more impactful changes that are easier to detect; testing on higher-volume metrics like clicks rather than purchases; or using pre-validation tools like Articos to screen ideas before committing to a live test. For startups especially, the goal should be maximizing learning per unit of traffic – not running tests that can’t reach conclusions.