How Many Samples are Needed?
When we conduct A/B tests, it is important to know the minimum number of
samples required to reach sufficient statistical power; in other words, to be
able to reliably detect a difference between the treatment and the control
group when one truly exists.
Let us first make some definitions clear:
Type I Error ($\alpha$). A Type I error
happens when we reject the null hypothesis even though it is true (analogous
to a false positive). Its probability is also known as the significance
level, and is usually set to 5%, meaning that when there is truly no
difference, 95 out of 100 times we will correctly fail to reject.
Type II Error ($\beta$). A Type II
error happens when we fail to reject the null hypothesis even though it is
false (analogous to a false negative). Its probability equals one minus the
statistical power, and is usually set to 20% (i.e., 80% power).
This is why people commonly use a 5% significance level and 80% power; since
20% is four times 5%, in a business setting this means we would rather miss
four good products or features than launch one bad one.
Again, the reason to determine the sample size up front is to make sure we
have enough statistical power, which is the probability of detecting a
meaningful difference when there really is one. The larger the sample size,
the more power we have.
The general formula is as follows,
$n = \frac{\sigma^2}{\Delta^2}(Z_{\alpha/2}+Z_{\beta})^2$
1. Sample Size for Comparing Two Means
Sample size needed for comparing the means of two normally distributed
samples (~$N(\mu_i, \sigma_i)$) of equal size using a two-sided test with
significance level $\alpha$ and power $1-\beta$:
$n=\frac{(\sigma_1^2+\sigma_2^2)(Z_{\alpha/2}+Z_{\beta})^2}{\Delta^2}$
for each group, where $\Delta = |\mu_1-\mu_2|$.
Example. Suppose our OEC (overall
evaluation criterion) is the mean daily conversion rate; then the sample size
for each group is a number of days. Each data point is the conversion rate of
a specific day, and we compare the mean daily conversion rates between the
two groups. In this case, we assume the two groups' daily conversion rates
follow normal distributions and use the formula above to calculate the
required sample sizes.
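As a sketch, the two-means formula can be computed with Python's standard library (`statistics.NormalDist` supplies the z-quantiles); the function name and default values here are illustrative:

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_means(sigma1, sigma2, delta, alpha=0.05, beta=0.20):
    """Per-group sample size for comparing two normal means with a
    two-sided test at significance alpha and power 1 - beta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2} ~ 1.96
    z_beta = NormalDist().inv_cdf(1 - beta)        # Z_{beta} ~ 0.84
    n = (sigma1**2 + sigma2**2) * (z_alpha + z_beta)**2 / delta**2
    return ceil(n)  # round up to a whole number of samples

print(sample_size_two_means(1, 1, 0.5))  # 63 per group
```

With the default $\alpha = 5\%$ and $\beta = 20\%$, detecting a difference of half a standard deviation requires 63 samples per group.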
2. Sample Size for Comparing Two Proportions
Sample size needed to compare two binomial proportions (~$B(p_i)$) using a
two-sided test with significance level $\alpha$ and power $1-\beta$, where one
sample $n_2$ is $k$ times as large as the other sample $n_1$
(independent-sample case):
$n_1=\frac{[\sqrt{\bar{p}\bar{q}(1+\frac{1}{k})}Z_{\alpha/2}+\sqrt{p_1q_1+\frac{p_2q_2}{k}}Z_{\beta}]^2}{\Delta^2}$
where $\Delta = |p_1-p_2|$, $\bar{p}=\frac{p_1+kp_2}{1+k}$, and
$\bar{q}=1-\bar{p}$.
Example. Suppose our OEC is conversion
rate and we want to know the number of visits/sessions needed for each group,
then each data point would be a visit/session, which has only two outcomes
(purchase vs not purchase). Since each session is a Bernoulli trial, we assume
each group follows a binomial distribution, and the formula above should be
used for calculating the required sample sizes.
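The two-proportion formula can likewise be sketched in Python; the function name and example rates ($p_1 = 10\%$, $p_2 = 12\%$) are illustrative, not from the original text:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, k=1, alpha=0.05, beta=0.20):
    """Sample size n1 for comparing two proportions, where the second
    group has n2 = k * n1 samples (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    delta = abs(p1 - p2)
    p_bar = (p1 + k * p2) / (1 + k)  # weighted pooled proportion
    q_bar = 1 - p_bar
    n1 = (sqrt(p_bar * q_bar * (1 + 1 / k)) * z_alpha
          + sqrt(p1 * (1 - p1) + p2 * (1 - p2) / k) * z_beta) ** 2 / delta ** 2
    return ceil(n1)

print(sample_size_two_proportions(0.10, 0.12))  # 3841 per group when k = 1
```

Note how a small absolute lift (2 percentage points) on a 10% baseline already demands thousands of sessions per group.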
Other Formula Variants
1. Rule of Thumb
This is the simplest way of approximating the sample size. It is derived by plugging $Z_{0.025} \approx 1.96$ and $Z_{0.2} \approx 0.84$ into the formula for comparing two proportions (with equal group sizes and $p_1 \approx p_2$), since $2(Z_{\alpha/2}+Z_{\beta})^2 \approx 16$:
$16p_1(1-p_1)/\Delta^2$
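A quick sanity check of the approximation (baseline rate and effect size below are illustrative):

```python
# Rule of thumb: with Z_{0.025} ~ 1.96 and Z_{0.2} ~ 0.84,
# 2 * (1.96 + 0.84)^2 ~ 15.7, which rounds up to 16.
p1, delta = 0.10, 0.02      # baseline rate and minimum detectable effect
n = 16 * p1 * (1 - p1) / delta ** 2
print(round(n))  # 3600 samples per group
```

This lands close to the exact two-proportion answer, which is why the shortcut is popular for back-of-the-envelope planning.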
2. Evan Miller
This formula is used in Evan Miller's online sample size calculator, which is also used in Udacity's A/B testing class offered by Google:
$[Z_{\alpha/2}\sqrt{2p_1(1-p_1)}+Z_{\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)}]^2/\Delta^2$
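A sketch of this variant in Python, using the same illustrative rates as before (the function name is an assumption, not from the calculator):

```python
from math import ceil, sqrt
from statistics import NormalDist

def evan_miller_sample_size(p1, p2, alpha=0.05, beta=0.20):
    """Per-group sample size following the formula quoted above."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    delta = abs(p1 - p2)
    n = (z_alpha * sqrt(2 * p1 * (1 - p1))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    return ceil(n)

print(evan_miller_sample_size(0.10, 0.12))  # 3623 per group
```

The result differs slightly from the pooled two-proportion formula because the first term uses $2p_1(1-p_1)$ rather than the pooled variance $\bar{p}\bar{q}$.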
How Long Should the A/B Test Last?
Now that we know how to calculate the required sample sizes, we can estimate
the duration needed for running the A/B test.
This depends on the traffic of the app/website being tested, i.e., how many
data points we can collect each day. Suppose we have 2,000 visitors per day
on our tested pages, and our total number of samples needed (for the two
groups combined) is 6,000; then the minimum number of days we need is:
$d=6000/2000=3$ (days)
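The same arithmetic as a tiny script (the numbers restate the example above):

```python
from math import ceil

daily_visitors = 2000   # traffic reaching the tested pages each day
total_samples = 6000    # required sample size for both groups combined
days = ceil(total_samples / daily_visitors)  # round up partial days
print(days)  # 3
```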
However, there are usually other considerations when running A/B tests. For
example, we want the testing period to span both weekdays and weekends, when
users might behave differently. We usually recommend running an A/B test for
at least 1 week and no more than 4 weeks.