We consider a randomized experiment as per Quant 2, lecture 2. From a large population we draw a sample of units indexed by \(i=1,..,n\). For each unit we have a treatment, \(D_i=0,1\), potential outcomes \(Y_{1i}\) and \(Y_{0i}\), observed outcomes \(Y_i = D_i Y_{1i} + (1-D_i)Y_{0i}\), and then a single covariate \(X\). The experiment uses complete random assignment of \(n_1\) units to treatment (\(D_i=1\)) and the rest to control (\(D_i=0\)).
We consider three estimators: (i) the simple difference in means, equivalent to the OLS regression of \(Y_i\) on \(D_i\), (ii) OLS regression of \(Y_i\) on \(D_i\), \(X_i\) and a constant, and (iii) the interacted OLS regression of \(Y_i\) on \(D_i\), \(X_i\), \(D_iX_i\), and a constant. Estimator (i) is unbiased, as per the lecture notes. Lin (2013) provides expressions for the leading terms in the bias of (ii) and (iii) relative to the SATE. For (ii), we have, \[ ltb(SATE)_{(ii)} = - \frac{1}{n}\frac{1}{\sigma^2_X} \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n[\rho_i - \rho](X_i - \bar{X})^2, \] where \(\sigma^2_X = (1/N)\sum_{i=1}^n(X_i - \bar{X})^2\). Then for (iii), we have \[ ltb(SATE)_{(iii)} = -\frac{1}{\sigma^2_X}\left[\left(\frac{1}{n_1} - \frac{1}{n} \right) \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n e_{1i}(X_i - \bar{X})^2 - \left(\frac{1}{n-n_1} - \frac{1}{n} \right) \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n e_{0i}(X_i - \bar{X})^2 \right] , \] where \(e_{di}\) is the prediction error for \(Y_{di}\) from regression (iii).
Examining these expressions, it is clear that the bias depends strongly on the size of the sample. In small samples, the bias is driven by correlation between the squared, centered covariate values and the potential outcomes/unit-level effects.
We simulate a population with data distribution such that the small sample bias from the adjusted regressions will be pretty bad in small samples—that is, where there is strong correlation between the centered, squared covariate and the unit-level effects:
N <- 1000
x <- runif(N, 0, 2)
y0 <- .5*x - .25*(x-mean(x))^2 +.25*rnorm(N)
y1 <- 2*y0 + .25 + (x-mean(x))^2 + .25*rnorm(N)
index <- 1:N
potcov <- data.frame(index, y1,y0,x)
Now plot these data: