## Introduction

What follows is an example I worked up when trying to figure out how to communicate potential outcomes in a regression framework to students graphically. This discussion is derived from Morgan and Winship’s 2015 book^{1}, especially pages 122-123. The goal is to represent the potential outcomes framework using standard regression notation, and then to discuss endogenous selection bias using this framework.

## Math

Define *Y*_{i} = *μ*_{0} + (*μ*_{1} − *μ*_{0})*D*_{i} + {*v*_{i}^{0} + *D*_{i}(*v*_{i}^{1} − *v*_{i}^{0})} where *D*_{i} is binary assignment to treatment for individual *i*, *μ*_{0} is the expected outcome for control, *μ*_{1} is the expected outcome for treated, *μ*_{1} − *μ*_{0} is the Average Treatment Effect or *δ*. The *v* terms are subset by *i* denoting that they are individual heterogeneity. *v*_{i}^{0} is an individual’s deviation from *μ*_{0} under control, or *Y*_{i}^{0} = *μ*_{0} + *v*_{i}^{0}. The same can be said for *v*_{i}^{1}, it is an individuals deviation from *μ*_{1} under treatment or *Y*_{i}^{1} = *μ*_{1} + *v*_{i}^{1}. Treatment minus control is equal to (*Y*_{i}^{1})−(*Y*_{i}^{0})=(*μ*_{1} + *v*_{i}^{1})−(*μ*_{0} + *v*_{i}^{0}) which is equal to the average treatment effect *μ*_{1} − *μ*_{0} and the individual heterogeneity in the treatment effect *v*_{i}^{1} − *v*_{i}^{0}.

For *μ*_{0} to properly represent the mean of *Y*_{i} under control, the *E*[*v*_{i}^{0}] must be equal to zero, or $\dfrac{\Sigma_i^N v^0_i}{N} = \Sigma_i^n v^0_i= 0$. You can get here from the linearity of expectation, such that if *μ*_{0} = *E*[*Y*_{i}] for control, we can substitute such that *μ*_{0} = *E*[*μ*_{0} + *v*_{i}^{0}]=*E*[*μ*_{0}]+*E*[*v*_{i}^{0}]=*μ*_{0} + *E*[*v*_{i}^{0}], therefore *E*[*v*_{i}^{0}]=0. The same is true for *E*[*v*_{i}^{1}], *μ*_{1} = *E*[*μ*_{1} + *v*_{i}^{1}]=*E*[*μ*_{1}]+*E*[*v*_{i}^{1}]=*μ*_{1} + *E*[*v*_{i}^{1}]. By the same token, then *E*[*v*_{i}^{1} − *v*_{i}^{0}]=*E*[*v*_{i}^{1}]−*E*[*v*_{i}^{0}]=0.

Morgan and Winship argue that issues arise when “*D*_{i} is correlated with the population-level variant of the error term, {*v*_{i}^{0} + *D*_{i}(*v*_{i}^{1} − *v*_{i}^{0})}, as would be the case when the size of the individual-level treatment effect, in this case (*μ*_{1} − *μ*_{0})+{*v*_{i}^{0} + *D*_{i}(*v*_{i}^{1} − *v*_{i}^{0})}, differs among those who select the treatment and those who do not.” For *E*[(*μ*_{1} − *μ*_{0})+{*v*_{i}^{0} + *D*_{i}(*v*_{i}^{1} − *v*_{i}^{0})}|*D*_{i} = 1] to be greater than *E*[(*μ*_{1} − *μ*_{0})+{*v*_{i}^{0} + *D*_{i}(*v*_{i}^{1} − *v*_{i}^{0})}|*D*_{i} = 0], we need to have *E*[*v*_{1}^{1}|*D*_{i} = 1]>*E*[*v*_{1}^{0}|*D*_{i} = 0] which does not imply *E*[*v*_{i}^{1}]>*E*[*v*_{i}^{0}] which would have violated our previous statements. If *E*[*v*_{i}^{1}|*D*_{i} = 1]=*E*[*v*_{i}^{1}] and *E*[*v*_{i}^{0}|*D*_{i} = 0]=*E*[*v*_{i}^{0}] as with random assignment, we can get an unbiased estimate of *δ* with a simple regression of *Y*_{i} on *D*_{i}. When *E*[*v*_{i}^{1}|*D*_{i} = 1]≠*E*[*v*_{i}^{1}] and/or *E*[*v*_{i}^{0}|*D*_{i} = 0]≠*E*[*v*_{i}^{0}], as when *D*_{i} is assigned to those with high individual treatment effects, we cannot get an unbiased estimate of *δ* with a simple regression.

## Worked Examples (Code in R)

Take, for example, the case of catholic school (following Morgan and Winships examples). Assume for the moment that the average test score of a public school student is equal to 100, such that *μ*_{0} = 100, and the average test score of a catholic school student is equal to 110, such that *μ*_{1} = 110. The Average Treatment Effect, *δ*, equals *μ*_{1} − *μ*_{0} = 10. Next, assume random variability in test performance such that *v*^{1} and *v*^{0} follow a normal distribution with mean zero and variance 100. Let this represent individual level variability in test taking. A graph of potential outcomes for students under different school types is shown below. The outcomes for a single individual, *i*, are connected with a line.

If we first assume selection into catholic school or public school (i.e. *D*_{i} is randomly assigned), we would no correlation between treatment and errors, and our simple regression correctly estimates the Average Treatment Effect. See below.

```
mu0 <- 100
mu1 <- 110
v0 <- rnorm(1000, mean=0, sd=10)
v1 <- rnorm(1000, mean=0, sd=10)
d <- rbinom(1000,1,.5)
y <- mu0 + (mu1 - mu0)*d + v0+d*(v1-v0)
mean(y[d==1])
## [1] 110.0734
mean(y[d==0])
## [1] 99.81277
summary(lm(y~d))
##
## Call:
## lm(formula = y ~ d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.639 -7.357 -0.213 7.181 39.384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.8128 0.4478 222.90 <2e-16 ***
## d 10.2606 0.6581 15.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 998 degrees of freedom
## Multiple R-squared: 0.1959, Adjusted R-squared: 0.1951
## F-statistic: 243.1 on 1 and 998 DF, p-value: < 2.2e-16
cor(d, v1)
## [1] 0.01767286
cor(d, v0)
## [1] 0.008900606
cor(d, (v1-v0))
## [1] 0.006432687
```

A graphical representation is below. Red lines represent kids in catholic school, blue lines represent kids in public school. Note: our observational data would only include those scores for red dots in catholic school and blue dots in public school.

Say, instead, that public school kids with a history of bad test scores are selected to attend catholic school. Their expected *v*^{0} is less than negative 5. Further assume that these kids only under perform in public school such that we would not expect their average deviations from *E*[*v*_{i}^{1}] to be biased, or *E*[*v*_{i}^{1}|*D* = 1]=0. We’ve now selected out the low performers from the public school population, increasing the baseline public school test scores, and left the catholic school test scores unchanged. Further, we’ve induced a correlation between *D*_{i} and *v*_{i}^{0} and between *D*_{i} and *v*_{i}^{1} − *v*_{i}^{0} but not between *D*_{i} and *v*_{i}^{1}. The estimated causal effect would be lower than the average treatment effect for this scenario. See below for the worked example.

```
mu0 <- 100
mu1 <- 110
v0 <- rnorm(1000, mean=0, sd=10)
v1 <- rnorm(1000, mean=0, sd=10)
d <- v0< -5
y <- mu0 + (mu1 - mu0)*d + v0+d*(v1-v0)
mean(y[d==1])
## [1] 110.5723
mean(y[d==0])
## [1] 104.9255
summary(lm(y~d))
##
## Call:
## lm(formula = y ~ d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.0500 -6.0554 -0.8261 4.8245 26.4238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.9255 0.3108 337.63 <2e-16 ***
## dTRUE 5.6467 0.5394 10.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.032 on 998 degrees of freedom
## Multiple R-squared: 0.09896, Adjusted R-squared: 0.09806
## F-statistic: 109.6 on 1 and 998 DF, p-value: < 2.2e-16
cor(d, v1)
## [1] 0.02248338
cor(d, v0)
## [1] -0.7704901
cor(d, (v1-v0))
## [1] 0.5620342
```

If we were to graph the potential outcomes, we would see the following:

We can do the same for an example where we only select kids who we think will prosper in catholic school for treatment. We only choose those who have a *v*_{i}^{1} > 5, but we ignore *v*_{i}^{0} in our treatment assignment. We have left the baseline performance unchanged (we’ve randomly selected on *v*_{i}^{0} after all), but we’ve increased the performance of catholic school children on average. We’ve thus induced a correlation between *D*_{i} and *v*_{i}^{1} and between *D*_{i} and *v*_{i}^{1} − *v*_{i}^{0} but not between *D*_{i} and *v*_{i}^{0}. See below for the worked example.

```
mu0 <- 100
mu1 <- 110
v0 <- rnorm(1000, mean=0, sd=10)
v1 <- rnorm(1000, mean=0, sd=10)
d <- v1 > 5
y <- mu0 + (mu1 - mu0)*d + v0+d*(v1-v0)
mean(y[d==1])
## [1] 121.4591
mean(y[d==0])
## [1] 99.93334
summary(lm(y~d))
##
## Call:
## lm(formula = y ~ d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.9563 -5.4148 -0.4595 5.8804 29.9768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.9333 0.3439 290.62 <2e-16 ***
## dTRUE 21.5258 0.6107 35.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.987 on 998 degrees of freedom
## Multiple R-squared: 0.5545, Adjusted R-squared: 0.5541
## F-statistic: 1242 on 1 and 998 DF, p-value: < 2.2e-16
cor(d, v1)
## [1] 0.7707782
cor(d, v0)
## [1] 0.02700884
cor(d, (v1-v0))
## [1] 0.509484
```

The plot of this example would look like this:

We can also select those kids who would do well in either catholic school or in public school. Say we assign treatment to those with *v*_{i}^{0} and *v*_{i}^{1} greater than 5. We’ve thus pushed down the public school scores while increasing the catholic school scores. We’ve induced a correlation between *D*_{i} and both *v*_{i}^{0} and *v*_{i}^{1} but not between *D*_{i} and *v*_{i}^{1} − *v*_{i}^{0}. See below for the worked example.

```
mu0 <- 100
mu1 <- 110
v0 <- rnorm(1000, mean=0, sd=10)
v1 <- rnorm(1000, mean=0, sd=10)
d <- v0 > 10 & v1 > 5
y <- mu0 + (mu1 - mu0)*d + v0+d*(v1-v0)
mean(y[d==1])
## [1] 121.842
mean(y[d==0])
## [1] 98.6236
summary(lm(y~d))
##
## Call:
## lm(formula = y ~ d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.732 -6.639 -0.066 6.724 32.485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.6236 0.3127 315.40 <2e-16 ***
## dTRUE 23.2184 1.3713 16.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.628 on 998 degrees of freedom
## Multiple R-squared: 0.2232, Adjusted R-squared: 0.2224
## F-statistic: 286.7 on 1 and 998 DF, p-value: < 2.2e-16
cor(d, v1)
## [1] 0.2733187
cor(d, v0)
## [1] 0.3571695
cor(d, (v1-v0))
## [1] -0.06421799
```

And the plot:

## End Notes

Thanks to Monica Alexander who looked over an early draft.

Morgan, Stephen L. and Christopher Winship. 2015. *Counterfactuals and Causal Inference. Methods and Principles for Social Research.* Second Edition. Cambridge University Press.