Kyle Carlson
Notes on economics, data science, etc.
Boosting A/B test power with panel data models
You can boost statistical power by modeling the heterogeneity of your users.
The initial draft benefited greatly from the thoughtful feedback of these reviewers: Julian Schuessler (@juli_schuess), Adam Haber (@_adam_haber), Carl Nadler, and YuHsin (@_yuhsin).
Comments and feedback welcome on Twitter: @KyleCSN.
Summary
In A/B testing we want to accelerate our experiments and product improvements by using methods with high statistical power. This post explains how to boost power using basic tools from econometrics: ordinary regression, fixed effects, and random effects. These methods are all “correct” [^{1}], but the best one depends on specific characteristics of your data. We will learn about those characteristics using simulation (notebook) and theory.
Where do these methods apply? Imagine a website where people post photos. During our experiment numerous people visit the website one or more times. Our data has two important characteristics. First, the observations are naturally grouped by person. We can count each individual visit or, alternatively, group our data at the person-day level. Second, we suspect that there is heterogeneity. In other words, some people love to post photos while others rarely do. Panel data models let us incorporate these factors to estimate the treatment effect more precisely, that is, with more statistical power.
However, we need to think carefully about two practical complications in our A/B test setting. First, the number of visits per person will be skewed, with many people visiting just once. Their data will be thrown away by the popular fixed effects model. Second, in A/B tests we typically use a simplistic form of randomization: independent coin flips. Therefore, some people who visit multiple times will still see only the treatment or only the control experience. Fixed effects also throws away that data.[^{2}] We will better understand these limitations after learning about “between variation”, “within variation”, and the random effects model.
Key lessons
- Fixed effects can be much less efficient than a simple regression. To choose between them, you should consider two key factors: (1) the number of visits per person and (2) the heterogeneity of the individuals’ propensity to post. Both factors increase the relative performance of fixed effects. You should simulate your experiment’s data-generating process and carefully look at your standard errors to evaluate which model to use.
- Use the random effects estimator to automatically adjust for those factors. It often outperforms the alternative models.
- If you include observational covariates in your model, use fixed effects. Random effects with observational covariates requires unrealistic assumptions.[^{3}]
In-depth discussion
Code and implementation
The graphs and simulation results are available in this Colab notebook. I fit the models using the `linearmodels` package and implemented them manually with `statsmodels`. For implementation details, good sources include the `linearmodels` docs and source along with the Stata docs for `xtreg`. The notebook is also instructive about using `multiprocessing` to parallelize the simulations for a massive speed-up.
Introducing the models
Fixed effects and random effects are both ways of modeling unobserved heterogeneity. In our example scenario, this means each person has an individual propensity to post a photo, but we have no data directly telling us that propensity. Generically, the model looks like
\[y_{it} = d_{it} \beta + c_i + \epsilon_{it}\]where \(i\) indexes persons and \(t\) indexes the visits of each person. The variable \(y_{it}\) is a dummy indicating whether the person posted on that visit. So, \(y_{it}=1\) if person \(i\) posted in visit \(t\) and 0 otherwise. The treatment dummy is \(d_{it}\). So, \(d_{it}=1\) if person \(i\) was in the treatment in visit \(t\) and 0 otherwise. The corresponding treatment effect is \(\beta\), our parameter of interest. It tells us how much the treatment version of the site boosts the probability of posting. Finally, each individual’s propensity to post is represented by \(c_i\). Think of this as person \(i\)’s personal tendency to share photos.
The variation in the data can be split into two parts: between-person and within-person. Between-person variation refers to the fact that some persons post more than others and some persons are more exposed to the treatment than others. Within-person variation refers to the fact that a given person posts on some visits but not on others and sees the treatment version of the site on some visits and not on others.
Between-person and within-person variation: A Python demo
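The snippets below assume a panel DataFrame `df` with columns `i` (person), `t` (visit), `y` (posted), and `d` (treatment). A minimal toy generator, with hypothetical person counts, visit ranges, and propensities (not the post's simulation settings), might look like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_persons = 40
visits = rng.integers(1, 5, n_persons)              # 1-4 visits each (hypothetical)
df = pd.DataFrame({'i': np.repeat(np.arange(n_persons), visits)})
df['t'] = df.groupby('i').cumcount()                # visit index within person
df['d'] = rng.integers(0, 2, len(df))               # coin-flip treatment per visit
c = rng.uniform(0.1, 0.8, n_persons)                # person-specific propensity to post
df['y'] = rng.binomial(1, np.clip(c[df['i']] + 0.04 * df['d'], 0, 1))
```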
```python
df[['i', 't', 'y', 'd']].head()
#    i  t  y  d
# 0  0  0  0  1
# 1  0  1  0  0
# 2  1  0  0  0
# 3  1  1  1  0
# 4  1  2  0  0
```
```python
# Between-person variation
df.groupby('i')[['y', 'd']].mean().std()
# y    0.404658
# d    0.372494

# Within-person variation
df.groupby('i')[['y', 'd']].transform(lambda x: x - x.mean()).std()
# y    0.351786
# d    0.400976
```
Given this underlying model and variation, let’s consider four ways to estimate \(\beta\). The key difference between them is how they treat variation between and within persons.[^{4}]
For precise details on these models, see the section “Demonstrative implementations” in the Colab notebook. Each model from the `linearmodels` package is reimplemented manually with data transformations and OLS.
Simple regression (pooled OLS)
Let’s call this estimator \(\hat{\beta}_{\text{OLS}}\). We can simply run a regression of \(y\) on \(d\). This is mathematically equivalent to comparing the averages of \(y\) between the control and treatment groups. It combines both within and between variation. The regression we are fitting combines \(c_i + \epsilon_{it}\) into a single error term: \(y_{it} = d_{it} \beta + \nu_{it}\). Because we are ignoring the individual heterogeneity, it goes into our residuals and inflates their variance. In turn, the variance of our estimator increases and power decreases.
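With a binary treatment dummy, the pooled OLS slope is literally the difference in group means. A quick sketch on hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: visit-level treatment dummy d and posting outcome y
d = rng.integers(0, 2, 500)
y = rng.binomial(1, 0.3 + 0.04 * d)

# Pooled OLS slope of y on d (with intercept): cov(d, y) / var(d)
beta_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# Identical to the treatment-control difference in means
diff_in_means = y[d == 1].mean() - y[d == 0].mean()
```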
Fixed effects (“within estimator”)
The intuition of fixed effects is that each individual is treated as their own control group, thus exploiting only the within variation. This makes each person’s individual propensity irrelevant. We can implement the fixed effects estimator \(\hat{\beta}_{\text{FE}}\) by subtracting the person-specific averages from all of our data. We would regress \(y_{it} - \bar{y}_i\) on \(d_{it} - \bar{d}_i\). The “demeaning” also subtracts away the term \(c_i\). This potentially reduces the residual variance and increases precision.
For fixed effects to work, some individuals must see both the control and treatment sites. In our experiment scenario we face the problem that some individuals may see only one variant. That happens if they visit just one time or happen to get the same coin flip on every visit. These persons have no withinperson variation in the treatment \(d_{it}\)! The fixed effects estimator ignores these individuals, which decreases our effective sample size.
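A sketch of the within transformation on a hypothetical balanced toy panel. Because demeaning wipes out anything constant within a person, the FE estimate is unchanged if we shift each person’s outcomes by an arbitrary constant:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, T = 200, 5
df = pd.DataFrame({'i': np.repeat(np.arange(n), T),
                   'd': rng.integers(0, 2, n * T).astype(float)})
c = rng.uniform(0.1, 0.8, n)                         # hypothetical person propensities
df['y'] = rng.binomial(1, np.clip(c[df['i']] + 0.04 * df['d'], 0, 1)).astype(float)

def fe_slope(df):
    # Within transformation: subtract each person's mean from y and d, then OLS
    w = df.groupby('i')[['y', 'd']].transform(lambda x: x - x.mean())
    return (w['d'] @ w['y']) / (w['d'] @ w['d'])

beta_fe = fe_slope(df)

# Adding an arbitrary person-specific constant to y leaves the FE estimate unchanged
shifted = df.assign(y=df['y'] + rng.normal(0, 10, n)[df['i']])
```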
The fixed effects model is the workhorse of applied microeconometrics because it is valid under relatively weak assumptions. Many economists will see it as their go-to model for any grouped data. However, in an A/B test setting we have strong enough conditions that we can potentially do better with a simple regression or random effects.
Between estimator
The between estimator \(\hat{\beta}_{\text{BE}}\) is the complement of the within estimator. It uses only the variation between individuals and discards the within-person variation. We can implement it by averaging all our data to the person level and then running a regression. That is, we fit the regression \(\bar{y}_i = \alpha + \bar{d}_i \beta + \bar{\nu}_i\). For this model to work, we must have variation in \(\bar{d}_i\): the proportion of visits to the treatment site must be higher for some persons than for others. That will happen as a consequence of the control/treatment assignment being made by independent coin flips, generating a mixture of binomial distributions. The between estimator is rarely used in practice, but it is an ingredient in the random effects estimator.
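A sketch of the between estimator on a hypothetical toy panel: average to the person level, then regress the person means:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n, T = 100, 4
df = pd.DataFrame({'i': np.repeat(np.arange(n), T),
                   'd': rng.integers(0, 2, n * T).astype(float)})
c = rng.uniform(0.1, 0.8, n)                          # hypothetical person propensities
df['y'] = rng.binomial(1, np.clip(c[df['i']] + 0.04 * df['d'], 0, 1)).astype(float)

# Average everything to the person level, then run OLS of y_bar on d_bar
person_means = df.groupby('i')[['y', 'd']].mean()
d_bar = person_means['d'] - person_means['d'].mean()
beta_be = (d_bar @ person_means['y']) / (d_bar @ d_bar)
```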
Random effects
We said that the fixed effects and between estimators use different sources of variation in the data to estimate the treatment effect. What if we could combine both to get a better estimate? This is what random effects does! In fact, the random effects estimator is an average of the fixed effects and between estimators:[^{5}]
\[\hat{\beta}_{\text{RE}} = \text{WeightedAverage}(\hat{\beta}_{\text{FE}}, \hat{\beta}_{\text{BE}})\]The weighting of the average adapts to the relative amounts of between and within variation and number of observations per group. The key weighting parameter[^{6}] is:
\[\hat{\omega} = \tfrac{\hat{\sigma}_{\epsilon}}{\sqrt{T\hat{\sigma}^2_{c} + \hat{\sigma}^2_{\epsilon}}} = \tfrac{\text{Within residual std. dev.}}{\sqrt{\text{Number of visits per person}\times \text{Heterogeneity} + \text{Within residual variance}}}.\]The parameter \(\hat{\omega}\) controls how much of the within and between variation is used by \(\hat{\beta}_{\text{RE}}.\) (Think inverse-variance weighting!) As \(\hat{\omega}\rightarrow 1\), the random effects estimator approaches the between estimator. But as \(\hat{\omega}\rightarrow 0\), the random effects estimator approaches the fixed effects estimator. To make \(\hat{\omega}\) small, we need a high number of visits per person (\(T\)) or large heterogeneity between people (\(\hat{\sigma}^2_{c}\)). Intuitively, if there is a lot of heterogeneity between people, we should weight more heavily the fixed effects estimator, which subtracts away that heterogeneity.
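The weighting parameter is easy to compute directly. A small sketch (the variance values plugged in are hypothetical) confirming the limiting behavior described above:

```python
import numpy as np

def omega(sigma2_c, sigma2_eps, T):
    # Balanced-panel weighting parameter:
    # omega = sigma_eps / sqrt(T * sigma2_c + sigma2_eps)
    return np.sqrt(sigma2_eps) / np.sqrt(T * sigma2_c + sigma2_eps)

# No heterogeneity at all pushes omega to its maximum of 1
no_het = omega(sigma2_c=0.0, sigma2_eps=0.1, T=5)

# More heterogeneity, or more visits per person, pushes omega toward 0 (toward FE)
more_het = omega(0.5, 0.1, 5) < omega(0.05, 0.1, 5)
more_visits = omega(0.1, 0.1, 50) < omega(0.1, 0.1, 2)
```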
The downside of the random effects estimator, and why it is used far less frequently than fixed effects, is that it requires a very strong exogeneity assumption. In particular, the regressors must be unrelated to the individual heterogeneity. For an A/B test, \(d_{it}\) is an independent coin flip, so it is independent of \(c_i\). That’s great, but we will probably be in trouble if we want to include any other covariates in the model. In that case, fixed effects will be preferable.
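One standard way to compute the random effects point estimate is “quasi-demeaning”: regress \(y_{it} - (1-\omega)\bar{y}_i\) on the analogously transformed \(d\). A sketch on a hypothetical toy panel, checking the two extremes: \(\omega = 1\) leaves the data untouched (pooled OLS), while \(\omega = 0\) fully demeans (fixed effects):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n, T = 80, 4
df = pd.DataFrame({'i': np.repeat(np.arange(n), T),
                   'd': rng.integers(0, 2, n * T).astype(float)})
c = rng.uniform(0.1, 0.8, n)                          # hypothetical person propensities
df['y'] = rng.binomial(1, np.clip(c[df['i']] + 0.04 * df['d'], 0, 1)).astype(float)

def quasi_demeaned_slope(df, omega):
    # Subtract (1 - omega) of each person's mean, then run OLS with an intercept
    g = df.groupby('i')[['y', 'd']].transform('mean')
    y = df['y'] - (1 - omega) * g['y']
    d = df['d'] - (1 - omega) * g['d']
    dc = d - d.mean()
    return (dc @ y) / (dc @ dc)

beta_pooled = quasi_demeaned_slope(df, 1.0)   # omega = 1: no demeaning, pooled OLS
beta_fe = quasi_demeaned_slope(df, 0.0)       # omega = 0: full demeaning, fixed effects
```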
Simulations
To see these models in action, we will run Monte Carlo simulations of thousands of experiments. On each “experiment” data set we will fit each of the models to generate the sampling distributions of \(\hat{\beta}_{\text{OLS}}, \hat{\beta}_{\text{FE}},\) and \(\hat{\beta}_{\text{RE}}\). Our outcome of interest is the standard deviation of each estimator, that is, the true standard error.
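The Monte Carlo recipe itself is simple. A deliberately stripped-down sketch (one visit per person, pooled OLS only; this toy DGP is hypothetical and much simpler than the one used in the post's simulations):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_experiment(n=500, beta=0.04):
    # Simulate one experiment and return the pooled OLS estimate of beta
    d = rng.integers(0, 2, n)
    y = rng.binomial(1, 0.3 + beta * d)
    dc = d - d.mean()
    return (dc @ y) / (dc @ dc)

# The sampling distribution of the estimator across many simulated experiments
estimates = np.array([one_experiment() for _ in range(2000)])
true_se = estimates.std(ddof=1)   # its standard deviation is the "true" standard error
```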
The datagenerating process in the simulations uses the linear probability model
\[P(Y_{it} = 1) = d_{it} \beta + c_i\]where
- There are 100 persons, \(i \in \{1,2,3,\ldots,100\}\),
- The distribution of visits per person is a mixture distribution: \(T_i \sim \text{Poisson}(\mu_i), \mu_i \sim \text{Exponential}(\lambda)\),
- The treatment is assigned by coin flip on each visit, \(d_{it} \sim \text{Bernoulli}(0.5)\),
- There is a constant additive treatment effect \(\beta = 0.04\), and
- The individual propensities to post are distributed symmetrically, \(c_i \sim \text{Beta}(\alpha, \alpha)\) rescaled s.t. \(c_i \in [0.05, 0.95]\).
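A sketch of this data-generating process (the \(\lambda\) and \(\alpha\) values are hypothetical, and I add 1 to \(T_i\) so every person visits at least once, matching the one-or-more-visits scenario):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
beta, n_persons, lam, alpha = 0.04, 100, 3.0, 2.0    # lam and alpha are hypothetical

# Propensities: symmetric Beta rescaled to [0.05, 0.95]
c = 0.05 + 0.9 * rng.beta(alpha, alpha, n_persons)

# Visits: Poisson with an Exponential mean; +1 ensures at least one visit
T = rng.poisson(rng.exponential(lam, n_persons)) + 1

i = np.repeat(np.arange(n_persons), T)
d = rng.integers(0, 2, T.sum())                      # Bernoulli(0.5) per visit
y = rng.binomial(1, np.clip(c[i] + beta * d, 0, 1))  # linear probability model

panel = pd.DataFrame({'i': i, 'd': d, 'y': y})
panel['t'] = panel.groupby('i').cumcount()
```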
Individual heterogeneity
What happens if we adjust the variance of \(c_i\)? In the first panel below we have three example distributions of \(c_i\) showing low, medium, and high heterogeneity. The second panel shows the precision of the estimators as the heterogeneity ranges from low to high. At low levels of heterogeneity the standard error of the fixed effects estimator is much larger than those of the other estimators. However, as the heterogeneity increases, both the fixed effects and random effects estimators increase greatly in precision, while the simple regression shows no change in performance.[^{7}]
You can estimate the heterogeneity after fitting a random effects model using the `variance_decomposition` property. The results would look something like this. The value after “Effects” is \(\hat{\sigma}^2_c\).
```python
r = RandomEffects(y, exog).fit()
print(r.variance_decomposition)
# Effects                   0.168079
# Residual                  0.086765
# Percent due to Effects    0.659536
# Name: Variance Decomposition, dtype: float64
```
Visits per person
The other factor of interest is the number of visits per person. Below we vary the average number of visits from 2 to 33. The first panel again shows example distributions, one with a small number of visits per person and the other with a larger number. The second panel reports the efficiency of each estimator relative[^{8}] to the random effects estimator, in particular, the “Fixed effects” plot is \(\tfrac{\text{Var}(\hat{\beta}_\text{RE})}{\text{Var}(\hat{\beta}_\text{FE})}\).
Conclusion
When our experiment has grouped data and heterogeneity, we can increase our statistical power by applying the standard panel data models from our econometrics training.[^{9}] The familiar fixed effects model can help or hurt significantly. To understand when, we developed simulations and intuition about the key factors: group size and degree of heterogeneity. We also saw that the random effects model automatically adapts based on those factors, in effect making the choice for us.
This post only scratches the surface of these types of models. Practitioners should use simulations with realistic data-generating processes inspired by their data, rather than the stylized, toy processes here. We also made simplifications, for example choosing a constant additive treatment effect rather than a heterogeneous treatment effect (one that could also be correlated with the heterogeneity in intercepts!). There are also well-developed Bayesian approaches to these models. Finally, we ignored the complicated matter of estimating standard errors. These matters are left to future posts.
Notes

1. By “correct” we mean that they give us consistent estimates of the treatment effect. ↩
2. This lack of within variation will be even worse if the randomization is not 50/50. ↩
3. When you are including covariates, the question of the best model is complex. We are recommending fixed effects merely as the safe choice. ↩
4. A comprehensive reference for these models is Econometric Analysis of Cross Section and Panel Data, Second Edition (2010) by Jeffrey M. Wooldridge. See also An Introduction to Classical Econometric Theory (2000) by Paul A. Ruud for more depth on random effects. ↩
5. For more details on the math of the RE estimator see Ruud (2000) mentioned above or these lecture notes from James Powell at Berkeley. ↩
6. The formula for \(\hat{\omega}\) is simplified for presentation by assuming \(T_i=T\), i.e., that all persons visited the same number of times (a balanced panel). ↩
7. It may be surprising that adjusting the heterogeneity has no effect on the pooled OLS estimator. Adding variation should increase the variance of the estimator. However, since this is a linear probability model, the combined residual variance is constant. Adjusting the heterogeneity simply apportions variance between the individual effects and the idiosyncratic effects. ↩
8. We report relative efficiency because in this case it makes the pattern easier to see than reporting the raw standard errors. ↩
9. In other contexts these models are called hierarchical models or multilevel models. ↩