FLanqi: Sample Sizes and Duration for A/B Testing

How Many Samples are Needed?

When we conduct A/B tests, it is important to know the minimum number of samples required to reach sufficient statistical power, in other words, to be able to detect a difference between the treatment and control groups when one really exists.

Let us first make some definitions clear:

Type I Error ($\alpha$). A Type I error happens when we reject a null hypothesis that we should not have rejected (a false positive). Its probability is also known as the significance level and is usually set to 5%, meaning that if there is truly no difference, then 95 out of 100 times we will correctly fail to reject the null hypothesis.

Type II Error ($\beta$). A Type II error happens when we fail to reject a null hypothesis that we should have rejected (a false negative). Its probability equals 1 minus the statistical power and is usually set to 20%.

Therefore, people commonly use 5% for the significance level and 80% for statistical power. Because $\beta$ is four times $\alpha$, in a business setting this means we would rather miss 4 good products or features than launch a bad one.

Again, the reason to determine the sample size is that we want to make sure we have enough statistical power, which is the probability of detecting a meaningful difference when there really is one. The larger the sample size, the more power we get.

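The general formula is as follows, where $\sigma^2$ is the variance of the metric, $\Delta$ is the difference we want to detect, and $Z_{\alpha/2}$ and $Z_{\beta}$ are the upper $\alpha/2$ and $\beta$ quantiles of the standard normal distribution (about 1.96 and 0.84 for $\alpha=5\%$ and $\beta=20\%$):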

$n = \frac{\sigma^2}{\Delta^2}(Z_{\alpha/2}+Z_{\beta})^2$
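As a minimal sketch of the general formula, the snippet below computes $n$ with Python's standard library. The function name `general_sample_size` and the example values of `sigma` and `delta` are illustrative assumptions, not part of the original post.

```python
import math
from statistics import NormalDist

def general_sample_size(sigma, delta, alpha=0.05, beta=0.20):
    """Hypothetical helper: n = sigma^2 / delta^2 * (Z_{alpha/2} + Z_beta)^2."""
    # Upper alpha/2 and beta quantiles of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(1 - beta)        # about 0.84 for beta = 0.20
    n = sigma ** 2 / delta ** 2 * (z_alpha + z_beta) ** 2
    # Round up, since we cannot collect a fraction of a sample.
    return math.ceil(n)

# Assumed example: standard deviation 1.0, difference to detect 0.5.
n_example = general_sample_size(sigma=1.0, delta=0.5)
print(n_example)
```

Halving `delta` roughly quadruples the result, which is why small effects are expensive to detect.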

1. Sample Size for Comparing Two Means

Sample size needed for comparing the means of two normally distributed samples (~$N(\mu_i, \sigma_i^2)$) of equal size using a two-sided test with significance level $\alpha$ and power $1-\beta$:

$n=\frac{(\sigma_1^2+\sigma_2^2)(Z_{\alpha/2}+Z_{\beta})^2}{\Delta^2}$ for each group

, where $\Delta = |\mu_1-\mu_2|$.

Example. Suppose our OEC (overall evaluation criterion) is the mean daily conversion rate. Then the sample size for each group is the number of days: each data point is the conversion rate of a specific day, and we need to compare the mean daily conversion rates between the two groups. In this case, we assume each group's daily conversion rate follows a normal distribution and use the formula above to calculate the required sample sizes.
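The two-means formula can be sketched as follows. The function name `two_means_sample_size` and the daily conversion rate numbers are made up for illustration; real standard deviations would come from historical data.

```python
import math
from statistics import NormalDist

def two_means_sample_size(sigma1, sigma2, delta, alpha=0.05, beta=0.20):
    """Hypothetical helper: per-group n for a two-sided comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    n = (sigma1 ** 2 + sigma2 ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)

# Assumed example: both groups have a daily conversion rate with standard
# deviation 0.01, and we want to detect a difference of 0.005 in the means.
days_per_group = two_means_sample_size(sigma1=0.01, sigma2=0.01, delta=0.005)
print(days_per_group)
```

Because each data point here is one day, the result is the number of days of data needed per group.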

2. Sample Size for Comparing Two Proportions

Sample size needed to compare two binomial proportions (~$B(p_i)$) using a two-sided test with significance level $\alpha$ and power $1-\beta$, where one sample $n_2$ is $k$ times as large as the other sample $n_1$ (independent-sample case):

$n_1=\frac{[\sqrt{\bar{p}\bar{q}(1+\frac{1}{k})}Z_{\alpha/2}+\sqrt{p_1q_1+\frac{p_2q_2}{k}}Z_{\beta}]^2}{\Delta^2}$

, where $\Delta = |p_1-p_2|$, $\bar{p}=\frac{p_1+kp_2}{1+k}$ and $\bar{q}=1-\bar{p}$.

Example. Suppose our OEC is conversion rate and we want to know the number of visits/sessions needed for each group. Then each data point is a visit/session, which has only two outcomes (purchase vs. no purchase). Since each session is a Bernoulli trial, we assume each group follows a binomial distribution, and the formula above should be used for calculating the required sample sizes.
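Here is a minimal sketch of the two-proportions formula, including the allocation ratio $k$. The function name `two_proportions_sample_size` and the 10% vs. 12% conversion rates are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def two_proportions_sample_size(p1, p2, k=1.0, alpha=0.05, beta=0.20):
    """Hypothetical helper: returns n1; the second group needs n2 = k * n1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    q1, q2 = 1 - p1, 1 - p2
    # Pooled proportion, weighted by the relative group sizes.
    p_bar = (p1 + k * p2) / (1 + k)
    q_bar = 1 - p_bar
    delta = abs(p1 - p2)
    numerator = (
        math.sqrt(p_bar * q_bar * (1 + 1 / k)) * z_alpha
        + math.sqrt(p1 * q1 + p2 * q2 / k) * z_beta
    ) ** 2
    return math.ceil(numerator / delta ** 2)

# Assumed example: baseline conversion 10%, treatment conversion 12%.
n1_equal = two_proportions_sample_size(0.10, 0.12, k=1)
n1_double = two_proportions_sample_size(0.10, 0.12, k=2)
print(n1_equal, n1_double)
```

With `k=2` the smaller group needs fewer sessions, but the second group then needs twice that many, so the total is larger than with an even split.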

Other Formula Variants

1. Rule of Thumb

This is the simplest way of approximating the sample size. To derive it, plug $Z_{0.025} \approx 1.96$ and $Z_{0.2} \approx 0.84$ into the formula for comparing proportions, assuming equal group sizes ($k=1$) and $p_1 \approx p_2$. The coefficient becomes $2(1.96+0.84)^2 \approx 16$, so for each group:

$n \approx 16p_1(1-p_1)/\Delta^2$
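To see where the 16 comes from, the sketch below computes the coefficient from the normal quantiles and then applies the rule of thumb. The function name `rule_of_thumb_n` and the example inputs are assumptions for illustration.

```python
from statistics import NormalDist

# Quantiles for a 5% two-sided significance level and 80% power.
z_alpha = NormalDist().inv_cdf(1 - 0.05 / 2)
z_beta = NormalDist().inv_cdf(1 - 0.20)

# With k = 1 and p1 close to p2, both square roots in the proportions
# formula reduce to sqrt(2 * p1 * (1 - p1)), leaving this coefficient.
coefficient = 2 * (z_alpha + z_beta) ** 2  # close to 16

def rule_of_thumb_n(p1, delta):
    """Hypothetical helper: approximate per-group n (round up before use)."""
    return 16 * p1 * (1 - p1) / delta ** 2

# Assumed example: 10% baseline conversion, 2 percentage point difference.
approx_n = rule_of_thumb_n(0.10, 0.02)
print(round(coefficient, 2), round(approx_n))
```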

2. Evan Miller

This formula is used in Evan Miller's online sample size calculator, which is also used in Udacity's A/B testing class offered by Google:

$n=[Z_{\alpha/2}\sqrt{2p_1(1-p_1)}+Z_{\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)}]^2/\Delta^2$
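The variant above can be sketched in the same way. This is only an implementation of the formula as written in this post, not the calculator's actual source code; the function name `variant_sample_size` and the example rates are assumptions.

```python
import math
from statistics import NormalDist

def variant_sample_size(p1, p2, alpha=0.05, beta=0.20):
    """Hypothetical helper: per-group n using the baseline variance 2*p1*(1-p1)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    delta = abs(p1 - p2)
    numerator = (
        z_alpha * math.sqrt(2 * p1 * (1 - p1))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / delta ** 2)

# Assumed example: baseline conversion 10%, treatment conversion 12%.
n_variant = variant_sample_size(0.10, 0.12)
print(n_variant)
```

The only difference from the equal-size ($k=1$) proportions formula is that the first square root uses $p_1$ in place of the pooled $\bar{p}$, so the two results are close but not identical.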

How Long Should the A/B Test Last?

Now that we know how to calculate the required sample sizes, we can estimate the duration needed for running the A/B test.

This depends on the traffic of the app/website to be tested, that is, how many data points we can get each day. Suppose we have 2,000 visitors per day on the tested pages, and the total number of samples needed (for both groups) is 6,000. Then the minimum number of days we need is:

$d=6000/2000=3$ (days)

However, there are usually other considerations when running A/B tests. For example, we want the testing period to span both weekdays and weekends, when users might behave differently. We usually recommend running an A/B test for at least 1 week and not more than 4 weeks.
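The duration estimate can be sketched as below. Rounding up to whole weeks is an assumption made here to cover full weekday/weekend cycles; the function name `estimate_duration` and its return shape are illustrative, not part of the original post.

```python
import math

def estimate_duration(total_samples, daily_visitors, min_days=7, max_days=28):
    """Hypothetical helper: returns (raw_days, planned_days, within_max)."""
    # Minimum number of days implied by traffic alone.
    raw_days = math.ceil(total_samples / daily_visitors)
    # Assumption: round up to whole weeks, with at least min_days in total.
    weeks = max(math.ceil(raw_days / 7), math.ceil(min_days / 7))
    planned_days = weeks * 7
    # Flag plans that exceed the recommended maximum of 4 weeks.
    return raw_days, planned_days, planned_days <= max_days

# The example from the text: 6,000 samples needed, 2,000 visitors per day.
raw_days, planned_days, within_max = estimate_duration(6000, 2000)
print(raw_days, planned_days, within_max)
```

If `within_max` comes back `False`, the test likely needs more traffic or a larger detectable difference rather than a longer run.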
