+++
date = 2018-02-19 # lastmod = 2018-02-19
draft = false
tags = ["potential outcomes", "causal", "regression", "endogenous selection bias"]
title = "Regression and Potential Outcomes"
summary = "What follows is an example I worked up when trying to figure out how to communicate potential outcomes in a regression framework to students graphically. This discussion is derived from Morgan and Winship's 2015 book, especially pages 122-123. The goal is to represent the potential outcomes framework using standard regression notation, and then to discuss endogenous selection bias using this framework."
+++

Introduction

What follows is an example I worked up when trying to figure out how to graphically communicate potential outcomes in a regression framework to students. This discussion is derived from Morgan and Winship's 2015 book[^1], especially pages 122-123. The goal is to represent the potential outcomes framework using standard regression notation, and then to discuss endogenous selection bias using this framework.

Math

Define \(Y_i = \mu_0 + (\mu_1 - \mu_0)D_i + \{v_i^0 + D_i(v_i^1 - v_i^0)\}\), where \(D_i\) is binary assignment to treatment for individual \(i\), \(\mu_0\) is the expected outcome under control, \(\mu_1\) is the expected outcome under treatment, and \(\mu_1 - \mu_0\) is the Average Treatment Effect, \(\delta\). The \(v\) terms are subscripted by \(i\), denoting that they are individual heterogeneity: \(v_i^0\) is an individual's deviation from \(\mu_0\) under control, so that \(Y_i^0 = \mu_0 + v_i^0\). The same can be said for \(v_i^1\); it is an individual's deviation from \(\mu_1\) under treatment, so that \(Y_i^1 = \mu_1 + v_i^1\). The equation above is just the switching regression \(Y_i = D_i Y_i^1 + (1 - D_i)Y_i^0\) written out. Treatment minus control is \(Y_i^1 - Y_i^0 = (\mu_1 + v_i^1) - (\mu_0 + v_i^0)\), which equals the average treatment effect \(\mu_1 - \mu_0\) plus the individual heterogeneity in the treatment effect, \(v_i^1 - v_i^0\).

For \(\mu_0\) to properly represent the mean of \(Y_i\) under control, \(E[v_i^0]\) must be equal to zero, or \(\dfrac{\sum_i^N v_i^0}{N} = 0\). You can get here from the linearity of expectation: if \(\mu_0 = E[Y_i]\) under control, we can substitute such that \(\mu_0 = E[\mu_0 + v_i^0] = E[\mu_0] + E[v_i^0] = \mu_0 + E[v_i^0]\), and therefore \(E[v_i^0] = 0\). The same is true for \(E[v_i^1]\): \(\mu_1 = E[\mu_1 + v_i^1] = E[\mu_1] + E[v_i^1] = \mu_1 + E[v_i^1]\), so \(E[v_i^1] = 0\). By the same token, \(E[v_i^1 - v_i^0] = E[v_i^1] - E[v_i^0] = 0\).
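As a quick numeric sanity check (a sketch of my own, not from the book), the sample mean of a large draw of mean-zero deviations is approximately zero:

# sample analogue of E[v_i^0] = 0
v0 <- rnorm(1e6, mean=0, sd=10)
mean(v0)  # approximately 0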

Morgan and Winship argue that issues arise when "\(D_i\) is correlated with the population-level variant of the error term, \(\{v_i^0 + D_i(v_i^1 - v_i^0)\}\), as would be the case when the size of the individual-level treatment effect, in this case \((\mu_1 - \mu_0) + \{v_i^0 + D_i(v_i^1 - v_i^0)\}\), differs among those who select the treatment and those who do not." For \(E[(\mu_1 - \mu_0) + \{v_i^0 + D_i(v_i^1 - v_i^0)\} \mid D_i = 1]\) to be greater than \(E[(\mu_1 - \mu_0) + \{v_i^0 + D_i(v_i^1 - v_i^0)\} \mid D_i = 0]\), we need \(E[v_i^1 \mid D_i = 1] > E[v_i^0 \mid D_i = 0]\), which does not imply \(E[v_i^1] > E[v_i^0]\); that would have violated our previous statements. If \(E[v_i^1 \mid D_i = 1] = E[v_i^1]\) and \(E[v_i^0 \mid D_i = 0] = E[v_i^0]\), as with random assignment, we can get an unbiased estimate of \(\delta\) with a simple regression of \(Y_i\) on \(D_i\). When \(E[v_i^1 \mid D_i = 1] \neq E[v_i^1]\) and/or \(E[v_i^0 \mid D_i = 0] \neq E[v_i^0]\), as when \(D_i\) is assigned to those with high individual treatment effects, we cannot get an unbiased estimate of \(\delta\) with a simple regression.
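To see the bias mechanically, here is a minimal sketch (my own, not from the book) showing that the naive difference in means equals \(\delta\) plus the selection terms \(E[v_i^1 \mid D_i = 1] - E[v_i^0 \mid D_i = 0]\):

# a sketch of the bias decomposition: select into treatment on v1
mu0 <- 100; mu1 <- 110; delta <- mu1 - mu0
v0 <- rnorm(1e5, mean=0, sd=10)
v1 <- rnorm(1e5, mean=0, sd=10)
d <- v1 > 0                                 # selection on the treated-state error
y <- mu0 + delta*d + v0 + d*(v1 - v0)
naive <- mean(y[d==1]) - mean(y[d==0])      # naive difference in means
decomp <- delta + mean(v1[d==1]) - mean(v0[d==0])
all.equal(naive, decomp)                    # TRUE: identical by construction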

Worked Examples (Code in R)

Take, for example, the case of Catholic schooling (following Morgan and Winship's examples). Assume for the moment that the average test score of a public school student is 100, so that \(\mu_0 = 100\), and the average test score of a Catholic school student is 110, so that \(\mu_1 = 110\). The Average Treatment Effect, \(\delta\), equals \(\mu_1 - \mu_0 = 10\). Next, assume random variability in test performance such that \(v_i^1\) and \(v_i^0\) follow a normal distribution with mean zero and variance 100 (a standard deviation of 10). Let this represent individual-level variability in test taking. A graph of potential outcomes for students under different school types is shown below. The outcomes for a single individual, \(i\), are connected with a line.

If we first assume random selection into Catholic or public school (i.e., \(D_i\) is randomly assigned), we would see no correlation between treatment and the errors, and our simple regression correctly estimates the Average Treatment Effect. See below.

mu0 <- 100                                 # expected test score in public school
mu1 <- 110                                 # expected test score in catholic school
v0 <- rnorm(1000, mean=0, sd=10)           # individual deviations under control
v1 <- rnorm(1000, mean=0, sd=10)           # individual deviations under treatment
d <- rbinom(1000, 1, .5)                   # random assignment to catholic school
y <- mu0 + (mu1 - mu0)*d + v0 + d*(v1-v0)  # observed outcomes
mean(y[d==1])

## [1] 110.0734

mean(y[d==0])

## [1] 99.81277

summary(lm(y~d))

## 
## Call:
## lm(formula = y ~ d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.639  -7.357  -0.213   7.181  39.384 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  99.8128     0.4478  222.90   <2e-16 ***
## d            10.2606     0.6581   15.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 998 degrees of freedom
## Multiple R-squared:  0.1959, Adjusted R-squared:  0.1951 
## F-statistic: 243.1 on 1 and 998 DF,  p-value: < 2.2e-16

cor(d, v1)

## [1] 0.01767286

cor(d, v0)

## [1] 0.008900606

cor(d, (v1-v0))

## [1] 0.006432687

A graphical representation is below. Red lines represent kids in Catholic school, blue lines represent kids in public school. Note: our observational data would only include the scores for red dots in Catholic school and blue dots in public school.
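The figure is not reproduced here, but a minimal base-R sketch along these lines would draw it (the helper name plot_po and its defaults are my own, not from the original post). Because each scenario below reassigns v0, v1, and d, the same call can reproduce the later figures as well:

plot_po <- function(v0, v1, d, mu0=100, mu1=110, n=50) {
  # each individual's two potential outcomes, joined by a line:
  # public school at x = 0, catholic school at x = 1
  y0 <- mu0 + v0
  y1 <- mu1 + v1
  cols <- ifelse(d == 1, "red", "blue")  # red = catholic, blue = public
  idx <- seq_len(min(n, length(v0)))     # plot a subset to limit overplotting
  plot(NULL, xlim=c(-0.25, 1.25), ylim=range(c(y0, y1)),
       xaxt="n", xlab="School type", ylab="Test score")
  axis(1, at=c(0, 1), labels=c("Public", "Catholic"))
  segments(0, y0[idx], 1, y1[idx], col=cols[idx])
  points(rep(0, length(idx)), y0[idx], col=cols[idx], pch=19)
  points(rep(1, length(idx)), y1[idx], col=cols[idx], pch=19)
}
plot_po(v0, v1, d)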

Say, instead, that public school kids with a history of bad test scores are selected to attend Catholic school; specifically, kids with \(v_i^0 < -5\) are sent to Catholic school. Further assume that these kids only underperform in public school, so that we would not expect their average deviations from \(\mu_1\) to be biased, or \(E[v_i^1 \mid D_i = 1] = 0\). We've now selected the low performers out of the public school population, increasing the baseline public school test scores, and left the Catholic school test scores unchanged. Further, we've induced a correlation between \(D_i\) and \(v_i^0\) and between \(D_i\) and \(v_i^1 - v_i^0\), but not between \(D_i\) and \(v_i^1\). The estimated causal effect in this scenario would be lower than the Average Treatment Effect. See below for the worked example.

mu0 <- 100
mu1 <- 110
v0 <- rnorm(1000, mean=0, sd=10)
v1 <- rnorm(1000, mean=0, sd=10)
d <- v0 < -5                               # select kids who underperform in public school
y <- mu0 + (mu1 - mu0)*d + v0 + d*(v1-v0)  # observed outcomes
mean(y[d==1])

## [1] 110.5723

mean(y[d==0])

## [1] 104.9255

summary(lm(y~d))

## 
## Call:
## lm(formula = y ~ d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.0500  -6.0554  -0.8261   4.8245  26.4238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 104.9255     0.3108  337.63   <2e-16 ***
## dTRUE         5.6467     0.5394   10.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.032 on 998 degrees of freedom
## Multiple R-squared:  0.09896,    Adjusted R-squared:  0.09806 
## F-statistic: 109.6 on 1 and 998 DF,  p-value: < 2.2e-16

cor(d, v1)

## [1] 0.02248338

cor(d, v0)

## [1] -0.7704901

cor(d, (v1-v0))

## [1] 0.5620342

If we were to graph the potential outcomes, we would see the following:
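With the plot_po sketch from above (the scenario's v0, v1, and d are already in scope):

plot_po(v0, v1, d)  # red lines start low on the public school side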

We can do the same for an example where we select only those kids who we think will prosper in Catholic school for treatment. We choose only those who have \(v_i^1 > 5\), but we ignore \(v_i^0\) in our treatment assignment. We have left the baseline performance unchanged (selection is random with respect to \(v_i^0\), after all), but we've increased the average performance of the Catholic school children. We've thus induced a correlation between \(D_i\) and \(v_i^1\) and between \(D_i\) and \(v_i^1 - v_i^0\), but not between \(D_i\) and \(v_i^0\). See below for the worked example.

mu0 <- 100
mu1 <- 110
v0 <- rnorm(1000, mean=0, sd=10)
v1 <- rnorm(1000, mean=0, sd=10)
d <- v1 > 5                                # select kids who would prosper in catholic school
y <- mu0 + (mu1 - mu0)*d + v0 + d*(v1-v0)  # observed outcomes
mean(y[d==1])

## [1] 121.4591

mean(y[d==0])

## [1] 99.93334

summary(lm(y~d))

## 
## Call:
## lm(formula = y ~ d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.9563  -5.4148  -0.4595   5.8804  29.9768 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  99.9333     0.3439  290.62   <2e-16 ***
## dTRUE        21.5258     0.6107   35.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.987 on 998 degrees of freedom
## Multiple R-squared:  0.5545, Adjusted R-squared:  0.5541 
## F-statistic:  1242 on 1 and 998 DF,  p-value: < 2.2e-16

cor(d, v1)

## [1] 0.7707782

cor(d, v0)

## [1] 0.02700884

cor(d, (v1-v0))

## [1] 0.509484

The plot of this example would look like this:
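Again, with the sketched helper:

plot_po(v0, v1, d)  # red lines end high on the catholic school side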

We can also select those kids who would do well in either Catholic school or public school. Say we assign treatment to those with \(v_i^0 > 10\) and \(v_i^1 > 5\). We've thus pushed down the observed public school scores while increasing the observed Catholic school scores. We've induced a correlation between \(D_i\) and both \(v_i^0\) and \(v_i^1\), but not between \(D_i\) and \(v_i^1 - v_i^0\). See below for the worked example.

mu0 <- 100
mu1 <- 110
v0 <- rnorm(1000, mean=0, sd=10)
v1 <- rnorm(1000, mean=0, sd=10)
d <- v0 > 10 & v1 > 5                      # select kids who would do well in either school
y <- mu0 + (mu1 - mu0)*d + v0 + d*(v1-v0)  # observed outcomes
mean(y[d==1])

## [1] 121.842

mean(y[d==0])

## [1] 98.6236

summary(lm(y~d))

## 
## Call:
## lm(formula = y ~ d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.732  -6.639  -0.066   6.724  32.485 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  98.6236     0.3127  315.40   <2e-16 ***
## dTRUE        23.2184     1.3713   16.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.628 on 998 degrees of freedom
## Multiple R-squared:  0.2232, Adjusted R-squared:  0.2224 
## F-statistic: 286.7 on 1 and 998 DF,  p-value: < 2.2e-16

cor(d, v1)

## [1] 0.2733187

cor(d, v0)

## [1] 0.3571695

cor(d, (v1-v0))

## [1] -0.06421799

And the plot:
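Once more with the sketched helper:

plot_po(v0, v1, d)  # red lines sit high at both endpoints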

End Notes

Thanks to Monica Alexander, who looked over an early draft.


[^1]: Morgan, Stephen L., and Christopher Winship. 2015. *Counterfactuals and Causal Inference: Methods and Principles for Social Research*. 2nd ed. Cambridge University Press.