This site is for everyone who reads, reviews, or implements difference-in-differences studies. It is an evolving resource that highlights both the fundamentals and new method developments for diff-in-diff. Thanks to the Laura and John Arnold Foundation for funding this work and our generous colleagues for their comments. We welcome your constructive feedback as well.

Sincerely,
Bret Zeldow and Laura Hatfield

# Introduction

Simple to understand and easy to implement, difference-in-differences (diff-in-diff) is a method to estimate causal effects of non-randomized interventions such as statewide policy changes.

For example, say California (treated) enacts a new health care law, but neighboring state Nevada (control) does not. We evaluate the effect of California’s new law by comparing how the difference in outcomes between the two states changed after the California law was enacted.

Thanks to its simplicity, diff-in-diff can be mistaken for a “quick and easy” way to draw causal conclusions. Here, we peer under the hood of diff-in-diff and illuminate its inner workings, which are more complex than sometimes appreciated.

# Notation

A note before we embark on our journey: the table below provides a reference for our notation.

| Symbol | Meaning |
| --- | --- |
| $$Y(t)$$ | Observed outcome at time $$t$$ |
| $$A=0$$ | Control |
| $$A=1$$ | Treated |
| $$t=1,\ldots,T_0$$ | Pre-intervention times |
| $$t=T_0+1,\ldots,T$$ | Post-intervention times |
| $$Y^a(t)$$ | Potential outcome with treatment $$A = a$$ at time $$t$$ |
| $$X$$ | Observed covariates |
| $$U$$ | Unobserved covariates |

# Target estimand

At the outset of any analysis, we must define a clear study question, such as “How does inpatient spending change after California enacts a new health care law?” Then we must translate this policy question into a statistical question about an estimand, such as “What is the differential change in spending in California versus Nevada before and after the law changed?” Finally, we define a method to estimate this using data, such as a linear model fit to inpatient spending in California and Nevada before and after the law change.

In this example, the target estimand might be a regression coefficient, $$\beta$$, that quantifies the differential change in California spending after the new law compared to the change in Nevada spending. We could use the ordinary least squares estimator to get an estimate, $$\hat{\beta}$$, from observed data.

To summarize,

• The quantity we care about is called the estimand. We determine the target estimand using the policy question and our statistical knowledge.

• The mathematical function (or algorithm) that takes data as input and produces a value of the estimand is called the estimator.

• The estimator’s output, given some actual data input, is called the estimate. This value represents our best guess at the thing we care about, given the data we have.

Instead of a regression coefficient, we can define the target estimand as the difference between potential outcomes under treatment versus no treatment. For example, the average effect of treatment on the treated (ATT) compares, in the treated group, the potential outcomes with treatment to the potential outcomes with no treatment. For a diff-in-diff, the ATT is the effect of treatment on the treated group in the post-treatment period. Written mathematically, the ATT is

Average effect of treatment on the treated (ATT) $\begin{equation*} ATT \equiv \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1\right] \end{equation*}$

Here, $$t = 2$$ represents the post-treatment period. We will see later that this is not the only causal estimand that is relevant to diff-in-diff.

If both the regression coefficient $$\beta$$ and the ATT are possible target estimands, why do we prefer the ATT?

First, the ATT is explicitly causal. Most diff-in-diff studies address a causal question (e.g., “what is the causal effect of the new law on inpatient spending in California?”). In contrast, the regression coefficient $$\beta$$ is not as clear. Does it describe a causal relationship or just an association?

Second, the ATT is agnostic among possible estimators. We can estimate the ATT with many statistical methods: parametric, non-parametric and everything in between. The $$\beta$$ coefficient, on the other hand, is specific to the regression estimator.

If we could observe the potential outcomes both with treatment and with no treatment, estimating the ATT would be easy. We would simply calculate the difference in these two potential outcomes for each treated unit, and take the average. However, we cannot observe potential outcomes both with and without treatment. In the treated group, the potential outcomes with treatment are factual (we can observe them), but the potential outcomes with no treatment are counterfactual (we cannot observe them).

So how do we estimate the ATT when some of the potential outcomes are unobservable? We use the control group to estimate untreated outcomes in the treated group in the post-intervention period. This is a good idea but requires some strong assumptions. Next, we discuss these assumptions and whether they are reasonable.

# Assumptions

### Consistency

For diff-in-diff, the treatment status of a unit can vary over time. However, we only permit two treatment histories: never treated (the control group) and treated in the post-intervention period only (the treated group). Thus, we will use $$A=0$$ and $$A=1$$ to represent the control and treated groups, with the understanding that the treated group only receives treatment whenever $$t > T_0$$ (see notation).

Every unit has two potential outcomes, but we only observe one — the one corresponding to the actual treatment status. The consistency assumption links the potential outcomes $$Y^a(t)$$ at time $$t$$ with treatment $$a$$ to the observed outcomes $$Y(t)$$.

Consistency Assumption
$Y(t) = (1 - A) \cdot Y^0(t) + A \cdot Y^1(t)$

If a unit is treated ($$A=1$$), then the observed outcome is the potential outcome with treatment $$Y(t) = Y^1(t)$$ and the potential outcome with no treatment $$Y^0(t)$$ is unobserved. If a unit is not treated ($$A=0$$), then $$Y(t) = Y^0(t)$$ and $$Y^1(t)$$ is unobserved.

However, we also assume that future treatment does not affect past outcomes. Thus, in the pre-intervention period, the potential outcome with (future) treatment and the potential outcome with no (future) treatment are the same. We write this assumption mathematically as

Arrow of time $Y(t) = Y^0(t) = Y^1(t),\; \mbox{for}\ t \leq T_0$

Thus, causal inference is a missing data problem: the unobserved counterfactual potential outcomes are missing data.

### Counterfactual assumption

We assume that the change in outcomes from pre- to post-intervention in the control group is a good proxy for the counterfactual change in untreated potential outcomes in the treated group. When we observe the treated and control units only once before treatment ($$t=1$$) and once after treatment ($$t=2$$), we write this as:

Counterfactual Assumption (1) \begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] = \\ \nonumber \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] \end{align*}

This is an assumption — not something we can test — because it involves unobserved counterfactual outcomes.

Traditionally, this assumption is called the parallel trends assumption, but as we will see, that phrase can be ambiguous.

## Identification

Using the assumptions above, we can rewrite the target estimand (which involved unobserved counterfactuals) in a form that depends only on observed outcomes. This process is called “identification”.

For diff-in-diff, identification begins with the ATT and applies Counterfactual Assumption (1) and the Consistency Assumption. The result is the familiar diff-in-diff estimator:

\begin{align*} ATT &\equiv \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1\right] \\ &= \lbrace \mathbb{E}\left[Y(2) \mid A = 1\right] - \mathbb{E}\left[Y(1) \mid A = 1\right] \rbrace - \\ & \ \ \ \ \ \ \lbrace \mathbb{E}\left[Y(2) \mid A = 0\right] - \mathbb{E}\left[Y(1) \mid A = 0\right] \rbrace \end{align*}
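A sketch of the intermediate steps may help here. It relies on the arrow-of-time condition from the Consistency section in addition to the two assumptions already named, and the justification for each line is given after the display.

\begin{align*} ATT &= \mathbb{E}\left[Y^1(2) \mid A = 1\right] - \mathbb{E}\left[Y^0(2) \mid A = 1\right] \\ &= \mathbb{E}\left[Y^1(2) \mid A = 1\right] - \mathbb{E}\left[Y^0(1) \mid A = 1\right] - \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] \\ &= \mathbb{E}\left[Y^1(2) \mid A = 1\right] - \mathbb{E}\left[Y^0(1) \mid A = 1\right] - \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] \\ &= \lbrace \mathbb{E}\left[Y(2) \mid A = 1\right] - \mathbb{E}\left[Y(1) \mid A = 1\right] \rbrace - \lbrace \mathbb{E}\left[Y(2) \mid A = 0\right] - \mathbb{E}\left[Y(1) \mid A = 0\right] \rbrace \end{align*}

The second line adds and subtracts $$\mathbb{E}\left[Y^0(1) \mid A = 1\right]$$. The third line applies Counterfactual Assumption (1). The fourth line replaces potential outcomes with observed outcomes: in the treated group, $$Y(2) = Y^1(2)$$ by consistency and $$Y(1) = Y^0(1)$$ by the arrow of time; in the control group, $$Y(t) = Y^0(t)$$ at both time points by consistency.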

We can now estimate this ATT by simply plugging in sample averages for the four expectations on the right-hand side:

1. The post-intervention average of the treated group for $$\mathbb{E}\left[Y(2) \mid A = 1\right]$$
2. The pre-intervention average of the treated group for $$\mathbb{E}\left[Y(1) \mid A = 1\right]$$
3. The post-intervention average of the control group for $$\mathbb{E}\left[Y(2) \mid A = 0\right]$$
4. The pre-intervention average of the control group for $$\mathbb{E}\left[Y(1) \mid A = 0\right]$$.

Finding the standard error for this estimator is a little more complex, but we could estimate it by bootstrapping, for example.
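The plug-in estimator and a bootstrap standard error can be sketched in a few lines of Python. This is a minimal illustration, not a recommended implementation: the data are hypothetical, the function names `did_estimate` and `bootstrap_se` are ours, and we assume panel data in which each unit contributes one pre-intervention and one post-intervention outcome, so that resampling whole units within each group is reasonable.

```python
import random
from statistics import fmean, stdev


def did_estimate(treated, control):
    """Plug-in diff-in-diff from lists of (pre, post) outcome pairs."""
    treated_post = fmean(post for _, post in treated)  # E[Y(2) | A = 1]
    treated_pre = fmean(pre for pre, _ in treated)     # E[Y(1) | A = 1]
    control_post = fmean(post for _, post in control)  # E[Y(2) | A = 0]
    control_pre = fmean(pre for pre, _ in control)     # E[Y(1) | A = 0]
    return (treated_post - treated_pre) - (control_post - control_pre)


def bootstrap_se(treated, control, n_boot=2000, seed=0):
    """Bootstrap standard error, resampling units with replacement within group.

    Assumption: units are independent, so whole (pre, post) pairs are resampled.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        treated_star = rng.choices(treated, k=len(treated))
        control_star = rng.choices(control, k=len(control))
        draws.append(did_estimate(treated_star, control_star))
    return stdev(draws)


# Hypothetical data: one (pre, post) pair per unit.
treated = [(10.0, 15.0), (12.0, 16.0), (11.0, 17.0), (13.0, 18.0)]
control = [(9.0, 11.0), (8.0, 10.0), (10.0, 11.0), (9.0, 12.0)]

estimate = did_estimate(treated, control)  # (16.5 - 11.5) - (11 - 9) = 3
se = bootstrap_se(treated, control)
print(f"estimate = {estimate:.2f}, bootstrap SE = {se:.2f}")
```

With only four units per group the bootstrap standard error is itself very noisy; the example is meant to show the mechanics only.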

Sometimes the counterfactual assumption may hold only after conditioning on some observed covariates, and the identification becomes more complex. More on this in the Confounding section.

# Multiple time periods

When we observe the treated and control units multiple times before and after treatment, we must adapt the target estimand and identifying assumptions accordingly. Let’s start by looking at possible target estimands.

## Target Estimands

We can calculate the ATT at any of the post-treatment time points

Time-varying ATT
Individual time points
For some $$t > T_0$$, $\begin{equation*} ATT(t) \equiv \mathbb{E}\left[Y^1(t) - Y^0(t) \mid A = 1\right] \end{equation*}$

or we can compute the average ATT across the post-treatment time points

Time-varying ATT
Averaged over time points
$\begin{equation*} ATT \equiv \mathbb{E}\left[\overline{Y^1}_{\{t>T_0\}} - \overline{Y^0}_{\{t>T_0\}} \mid A = 1\right] \end{equation*}$

Here, the overbar $$\overline{{\color{white} Y}}$$ indicates averaging and the subscript $$_{\{t>T_0\}}$$ refers to the time points over which the outcome is averaged.

Athey and Imbens (2018) and Goodman-Bacon (2018) discuss the weighted estimands that arise when the timing of the intervention varies across treated units. We will not address this further, and encourage you to read these papers for more.

## Assumptions

What kind of assumptions do we need to estimate the ATTs above? We consider several counterfactual assumptions that may require:

1. parallel average outcomes in pre- to post-intervention periods
2. parallel outcome trends across certain time points, or
3. parallel outcome trends across all time points.

First, consider an assumption that averages over the pre- and post-intervention time points, effectively collapsing back to the simple two-period case.

Counterfactual Assumption (2a)
Avg pre, avg post
\begin{align*} \mathbb{E} \left[\overline{Y^0}_{\{t > T_0\}} - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 0\right] = \\ \mathbb{E} \left[\overline{Y^0}_{\{t > T_0\}} - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 1\right] \end{align*}

Here, we assume that the difference between the average of the pre-intervention outcomes and the average of the untreated post-intervention outcomes is the same for both treated and control groups. To identify the time-averaged ATT using this assumption, we use the same identification process as in the simple case with only one observation in each of the pre- and post-intervention periods.

In our next proposed assumption, we restrict our focus to only two time points: one pre-intervention and one post-intervention.

Counterfactual Assumption (2b)
One pre, one post
For some $$t^* > T_0$$, there exists a $$t' \leq T_0$$ such that \begin{align*} \mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 1\right] = \\ \mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 0\right] \end{align*}

Counterfactual Assumption (2b) is a restriction on the data at two time points, one before and one after treatment. In a sense, time points other than these two are not relevant. Or at least, the other time points need not satisfy the “parallel trends” assumption. While this assumption is perfectly valid if true, using such an assumption requires justification. For instance, why do we believe this assumption is satisfied for two time points but not the rest? To identify the ATT using this assumption, we again use the same identification process as in the simple case, since we are back to considering only one time point pre-intervention and one time point post-intervention.

Counterfactual Assumption (2c)
Avg pre, one post
For some post-treatment time point $$t^* > T_0$$, \begin{align*} \mathbb{E}\left[Y^0(t^*) - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 0\right] = \\ \mathbb{E}\left[Y^0(t^*) - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 1\right] \end{align*}

In this version we assume that there are “parallel trends” between one post-intervention time point and the average of the pre-intervention outcomes.

Counterfactual Assumption (2d)
All pre, one post
For some $$t^* > T_0$$ and each $$t' \leq T_0$$: \begin{align*} \mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 1\right] = \\ \mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 0\right] \end{align*}

Counterfactual Assumption (2d) is a stricter version of (2c), where parallel trends holds at post-intervention time $$t^*$$ and every possible pre-intervention time point. Note that if Counterfactual Assumption (2d) holds, then Counterfactual Assumption (2c) also must hold, but the reverse is not necessarily true.

Finally, we get to the assumption we’ve been waiting for, in which the untreated potential outcomes evolve in parallel in the treatment and control groups at every pre- and post-intervention time point. This is the strictest version of parallel trends and is what researchers often mean by “parallel trends”.

Counterfactual Assumption (2e)
All pre, all post
For each $$t^* > T_0$$ and each $$t' \leq T_0$$: \begin{align*} \mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 1\right] = \\ \mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 0\right] \end{align*}

This is the most restrictive because it requires parallel evolution of the untreated outcomes at all pre- and post-intervention time points.
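Each of these assumptions can be restated in terms of the gap between the treated and control groups in mean untreated outcomes: the expectations on the two sides are equal exactly when the corresponding gaps (or averages of gaps) are equal. The sketch below uses that restatement to check which of (2a) through (2e) hold for a given pair of mean trajectories. It only makes sense for simulated or hypothetical data, because the post-intervention untreated means of the treated group are never observed; the function name and the example numbers are ours.

```python
from math import isclose
from statistics import fmean


def check_counterfactual_assumptions(y0_treated, y0_control, T0, t_star, tol=1e-9):
    """Check assumptions (2a)-(2e) for mean untreated outcomes at t = 1, ..., T.

    y0_treated, y0_control: mean untreated potential outcomes by time point.
    T0: last pre-intervention time; t_star: a post-intervention time (1-indexed).
    """
    if not T0 < t_star <= len(y0_treated):
        raise ValueError("t_star must be a post-intervention time point")
    gap = [yt - yc for yt, yc in zip(y0_treated, y0_control)]
    pre, post = gap[:T0], gap[T0:]
    gap_star = gap[t_star - 1]

    def same(a, b):
        return isclose(a, b, abs_tol=tol)

    return {
        "2a": same(fmean(post), fmean(pre)),             # avg pre, avg post
        "2b": any(same(gap_star, g) for g in pre),       # one pre, one post
        "2c": same(gap_star, fmean(pre)),                # avg pre, one post
        "2d": all(same(gap_star, g) for g in pre),       # all pre, one post
        "2e": all(same(gs, g) for gs in post for g in pre),  # all pre, all post
    }


# Hypothetical means with T0 = 3 and T = 5; the gaps are [1, 3, 2, 2, 2].
y0_control = [1.0, 2.0, 3.0, 4.0, 5.0]
y0_treated = [2.0, 5.0, 5.0, 6.0, 7.0]

result = check_counterfactual_assumptions(y0_treated, y0_control, T0=3, t_star=4)
print(result)
```

In this example the pre-intervention gaps average to the post-intervention gap, so (2a), (2b), and (2c) hold even though the pre-intervention trends are visibly not parallel, while (2d) and (2e) fail.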

Most diff-in-diff applications have a line or two stating that they assume “parallel trends” without further elaboration. As the above assumptions illustrate, the counterfactual assumptions are more diverse and more specific than this general statement conveys.

The “parallel trends assumption”, as commonly understood, is usually paired with a second assumption (see Dimick and Ryan 2014; Ryan, Burgess, and Dimick 2015):

Parallel pre-trends
In the pre-intervention period, time trends in the outcome are the same in treated and control units.

Common shocks
In the post-intervention period, exogenous forces affect treated and control groups equally.

Stating the assumptions this way is misleading for two reasons. First, not all identifying assumptions require parallel pre-intervention trends. For example, Counterfactual Assumptions (2a), (2b), and (2c) place no such requirement on the pre-intervention period. Counterfactual Assumption (2d) does require parallel trends in the pre-intervention period, but only Counterfactual Assumption (2e) demands parallel trends throughout the study.

Second, parallel pre-intervention trends is not an assumption at all! It is a testable empirical fact about the pre-intervention outcomes, involving no counterfactuals. By contrast, common shocks is an untestable assumption involving exogenous forces that are likely unknown to the researcher. See below for more discussion of parallel trends testing.

We prefer the counterfactual assumptions above because they are explicitly stated in terms of counterfactual outcomes, identify the diff-in-diff estimator, and avoid this false sense of security.

Which assumptions are reasonable in the data you see? Use the app below to explore potential outcomes that satisfy each of the above assumptions. The app randomly generates outcomes for the control group then randomly generates untreated outcomes (counterfactuals in the post-intervention period) for a treated group that satisfy each assumption above.

What do you have in mind when you say that you assume “parallel trends”? Does this match what you see in the app?

### Equivalence tests

Our primary concern with (the usual) hypothesis tests of parallel trends is that we can never actually prove what we set out to prove. The only conclusions that can emerge from a conventional frequentist null hypothesis test are “fail to reject the null” or “reject the null.” The decision to “fail to reject” is decidedly different than accepting the null. And in tests for parallel trends, the null is typically that the trends are parallel. So we can never actually say that our trends are parallel using the default infrastructure. Maybe this is a problem for some and perhaps not for others. However, there is another problem with hypotheses for testing assumptions. Let’s delve briefly into a thought experiment where the “parallelness” of trends is captured by a single parameter $$\theta$$ (where $$\theta = 0$$ denotes two lines that are perfectly parallel). Deviations from zero (either negative or positive) denote departures from “parallelness” at varying magnitudes. The hypotheses for testing parallel trends look something like:

$$H_0:$$ $$\theta = 0$$

$$H_1:$$ $$\theta \neq 0$$.

If we have a big enough sample size we can reject the null if the true value of $$\theta$$ is 5 or 3 or 1 or 0.01. But do we really care about deviations of magnitude 0.01 compared to deviations of 5? It would be better if we could insert expert knowledge into this test and incorporate some margin for deviation in our test. Equivalence tests do just this, while at the same time reversing the order of the hypotheses. Let $$\tau$$ denote an acceptable margin for deviations from parallel trends so that if $$|\theta| \leq \tau$$, we feel OK saying that the trends are parallel (or close enough). The hypotheses for an equivalence test are:

$$H_0:$$ $$|\theta| > \tau$$

$$H_1:$$ $$|\theta| \leq \tau$$.

Equivalence tests are nothing new. They are sometimes used in clinical trials to determine if a new drug is no worse than a standard-of-care drug, for example. They also happen to provide an intuitive approach to testing for parallel trends in the pre-treatment periods. Unfortunately, this setup won’t solve all our (diff-in-diff) problems. Sample size considerations can be a hindrance in assumption testing, for one. However, this sort of issue arises no matter how we construct our testing framework, so we might as well set up our tests in a way that is more intuitive.
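One standard way to carry out an equivalence test is the two one-sided tests (TOST) procedure, in which the null hypothesis is that the deviation is at least $$\tau$$ in magnitude and rejecting it supports "close enough to parallel." The sketch below is an illustration under assumptions of our own: it takes an estimate $$\hat{\theta}$$ of the difference in pre-intervention trends and its standard error as given, and it uses a normal approximation. The function name and the numbers are hypothetical.

```python
from statistics import NormalDist


def tost_p_value(theta_hat, se, tau):
    """TOST p-value for H0: |theta| >= tau versus H1: |theta| < tau.

    Assumption: theta_hat is approximately normal with standard error se.
    """
    std_normal = NormalDist()
    # One-sided test of theta <= -tau against theta > -tau.
    p_lower = 1.0 - std_normal.cdf((theta_hat + tau) / se)
    # One-sided test of theta >= tau against theta < tau.
    p_upper = std_normal.cdf((theta_hat - tau) / se)
    # Equivalence is declared only if both one-sided tests reject.
    return max(p_lower, p_upper)


# Same point estimate and margin, two different levels of precision.
p_precise = tost_p_value(theta_hat=0.02, se=0.05, tau=0.2)
p_noisy = tost_p_value(theta_hat=0.02, se=0.20, tau=0.2)
print(f"precise: p = {p_precise:.4f}; noisy: p = {p_noisy:.4f}")
```

With the small standard error the test rejects non-equivalence at the 5% level; with the large one it does not, even though the point estimate is identical. This is the sample-size issue mentioned above, but here an imprecise estimate can no longer be read as evidence of parallel trends. Choosing $$\tau$$ remains a substantive judgment.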

# Model and Estimation

Using sample means to estimate the ATT works well when there are two time periods and no covariates. To go beyond this, we will specify a model that can readily be extended to more complex settings.

A typical linear model for the untreated outcomes $$Y^0_{it}$$ (Athey and Imbens (2006) or Angrist and Pischke (2008) p. 228, for example) is written $\begin{equation*} Y^0_{it} = \alpha + \delta_t + \gamma I(a_i = 1) + \epsilon_{it}\;, \end{equation*}$

that is, the counterfactual untreated outcomes are presented as a sum of an intercept $$\alpha$$, main effects for time $$\delta_t$$, a main effect for the treated group $$\gamma$$, and an error term $$\epsilon_{it}$$.

Now we simply connect the untreated outcomes to the observed outcomes $$Y_{it}$$ using the relation $\begin{equation*} Y_{it} = Y^0_{it} + \beta D_{it}\;, \end{equation*}$

where $$D_{it}$$ is an indicator of the treatment status of the $$i^{th}$$ unit at time $$t$$, and $$\beta$$ is the traditional diff-in-diff parameter. Note that $$D_{it}$$ is an interaction between indicators for the treatment group and the post-treatment period, $$D_{it} = a_i \cdot I(t > T_0)$$.

These models impose a constant diff-in-diff effect across units. For more about this strict assumption, please see our discussion of Athey and Imbens (2006).

Let’s return to the simple scenario of two groups and two time periods $$\left(t \in \{1,2\}\right)$$. The model for $$Y^0_{it}$$ reduces to $\begin{equation*} Y^0_{it} = \alpha + \delta I(t = 2) + \gamma I(a_i = 1) + \epsilon_{it}\;. \end{equation*}$

If this model is correctly specified, Counterfactual Assumption (1) holds since

\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] &= (\alpha + \delta + \gamma) - (\alpha + \gamma) \\ &= \delta \end{align*}

and

\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] &= (\alpha + \delta ) - (\alpha) \\ &= \delta\;. \end{align*}

Now, let’s introduce the effect of a covariate and see how it affects our counterfactual assumption. For example, write our model for $$Y^0$$ including an additive effect of a covariate $$X$$, $\begin{equation*} Y^0_{it} = \alpha + \delta_t + \gamma I(a_i = 1) + \lambda_t x_i + \epsilon_{it}\;. \end{equation*}$

Here, the effect of $$X$$ on $$Y^0$$ may vary across time, so $$\lambda$$ is indexed by $$t$$.

Initially, we assume a constant effect of $$X$$ on $$Y^0$$ at $$t = 1$$ and $$t = 2$$, so $$\lambda_t = \lambda$$. In this case, Counterfactual Assumption (1) is still satisfied even if the distribution of $$X$$ differs by treatment group because these group-specific means cancel out:

\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] &= (\alpha + \delta + \gamma + \lambda \mathbb{E}\left\{X \mid A = 1\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \gamma + \lambda \mathbb{E}\left\{X \mid A = 1\right\}) \\ &= \delta \end{align*}

and

\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] &= (\alpha + \delta + \lambda \mathbb{E}\left\{X \mid A = 0\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \lambda \mathbb{E}\left\{X \mid A = 0\right\}) \\ &= \delta\;. \end{align*}

Lastly, we let the effect of $$X$$ on $$Y^0$$ vary across time ($$\lambda$$ indexed by $$t$$), after which we have a different story:

\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] &= (\alpha + \delta + \gamma + \lambda_2 \mathbb{E}\left\{X \mid A = 1\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \gamma + \lambda_1 \mathbb{E}\left\{X \mid A = 1\right\}) \\ &= \delta + \lambda_2 \mathbb{E}\left\{X \mid A = 1\right\} - \\ & \ \ \ \ \ \ \lambda_1 \mathbb{E}\left\{X \mid A = 1\right\} \end{align*}

and

\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] &= (\alpha + \delta + \lambda_2 \mathbb{E}\left\{X \mid A = 0\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \lambda_1 \mathbb{E}\left\{X \mid A = 0\right\}) \\ &= \delta + \lambda_2 \mathbb{E}\left\{X \mid A = 0\right\} - \\ & \ \ \ \ \ \ \lambda_1 \mathbb{E}\left\{X \mid A = 0\right\} \end{align*}

are not necessarily equal. They are only equal if the effect of $$X$$ on $$Y^0$$ is constant over time (i.e., $$\lambda_1 = \lambda_2$$) or the mean of the covariate in the two groups is the same (i.e., $$\mathbb{E}\left\{X \mid A = 1\right\} = \mathbb{E}\left\{X \mid A = 0\right\}$$). This illustrates an important connection between the counterfactual assumption and the regression model and introduces the notion of confounding in diff-in-diff.
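Subtracting the control-group expression from the treated-group expression gives a differential change of $$(\lambda_2 - \lambda_1)\left(\mathbb{E}\left\{X \mid A = 1\right\} - \mathbb{E}\left\{X \mid A = 0\right\}\right)$$, which is zero exactly in the two cases just described. The short sketch below evaluates this quantity for a few hypothetical parameter values; the function names and numbers are ours and serve only to illustrate the algebra.

```python
from math import isclose


def expected_y0_change(delta, lam1, lam2, mean_x):
    """E[Y0(2) - Y0(1) | group] under the model with covariate effects lam1, lam2.

    The intercept and group effect cancel, leaving delta + (lam2 - lam1) * E[X | group].
    """
    return delta + (lam2 - lam1) * mean_x


def differential_change(delta, lam1, lam2, mean_x_treated, mean_x_control):
    """Treated-minus-control difference in expected untreated change.

    Counterfactual Assumption (1) holds when this is zero.
    """
    return expected_y0_change(delta, lam1, lam2, mean_x_treated) - expected_y0_change(
        delta, lam1, lam2, mean_x_control
    )


# Hypothetical values: covariate means differ by group (0.8 vs 0.3).
constant_effect = differential_change(1.0, lam1=2.0, lam2=2.0, mean_x_treated=0.8, mean_x_control=0.3)
varying_effect = differential_change(1.0, lam1=2.0, lam2=3.0, mean_x_treated=0.8, mean_x_control=0.3)
balanced_x = differential_change(1.0, lam1=2.0, lam2=3.0, mean_x_treated=0.5, mean_x_control=0.5)

print(constant_effect, varying_effect, balanced_x)
```

Only the middle case, where the covariate effect changes over time and the covariate means differ by group, yields a nonzero differential change (here about 0.5).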

To better visualize this, use the app below to explore time-varying confounding in simulated data. The y-axis is the mean of the untreated potential outcomes ($$Y^0$$) and the x-axis is time.

Remember: for Counterfactual Assumption (1) to hold, the lines connecting $$Y^0$$ values in the treated and control groups must be parallel.

Whenever the lines are not parallel (i.e., the differential change over time is not 0), Counterfactual Assumption (1) is violated.

• What happens when the covariate distributions are different in the treated and control groups? (hint: change the values of $$Pr(X=1|A=0)$$ and $$Pr(X=1|A=1)$$)
• What happens when the covariate effect varies over time? (hint: change the effects of $$X$$ on $$Y^0$$ at $$t = 1$$ and $$t = 2$$)

As you may have discovered in the app, $$X$$ is a confounder if two conditions hold:

1. $$X$$ is associated with treatment ($$A$$) and
2. the effect of $$X$$ on $$Y$$ varies across time.

We tackle confounding more thoroughly later.

### Fixed effects in diff-in-diff

Before we discuss inference and confounding, let’s talk about fixed effects briefly (see Mummolo and Peterson (2018) for a more in-depth discussion of fixed effects models and their interpretation). Fixed effects, particularly unit-level fixed effects, are used in causal inference to adjust for unmeasured time-invariant confounders. Of course, there are trade-offs. The discussion from Imai and Kim (In Press) explains that using unit fixed effects comes at the cost of capturing the dynamic relationship between the treatment and the outcome. By dynamic relationships we refer to ideas such as past treatments affecting future outcomes or past outcomes affecting future treatments. On the other hand some causal methods (marginal structural models, for instance) cannot account for time-invariant unmeasured confounders but do capture these dynamics. Basically, we can have one or the other: either we adjust for time-invariant unmeasured confounders and assume no dynamic relationship between treatment and outcome or we assume that there are no unmeasured confounders and allow for more complicated relationships between treatment and outcome.

Bai (2009) describes an interactive fixed effects model in which time fixed effects are multiplied by unit fixed effects in a factor structure. This can incorporate dynamics, but it has the identification challenges of any factor model.

So how do fixed effects pertain to diff-in-diff? Kropko and Kubinec (2018) discuss two-way fixed effects (unit and time) models in the context of diff-in-diff estimation. Their main point is that estimates coming from two-way fixed effect models are difficult to interpret when we have many time periods. When we have the canonical (two-period, binary treatment) diff-in-diff setup, the $$\beta$$ coefficient from the two-way fixed effect model $$\left(y_{it} = \alpha_i + \delta_t + \beta D_{it} + \epsilon_{it}\right)$$ equals the usual estimate. As more time periods are added within the fixed-effects framework, we implicitly add supplementary assumptions. In particular, the diff-in-diff effect is assumed homogeneous across time and cases. Homogeneity across time is a stringent assumption that says the diff-in-diff effect is the same no matter how close or far apart the time periods are. We say this not to discourage use of two-way fixed effect models, but to discourage automatic use of them. True, they work well for some cases (when we need to adjust for unmeasured time-invariant confounders), but we really need to examine our research goals on an application-by-application basis, consider the assumptions implicit in the models we’re thinking of using, and adjust our tools accordingly.
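The equivalence in the canonical two-period setup can be checked numerically. The sketch below is a minimal illustration on hypothetical data, assuming a balanced panel: in that case the two-way fixed effects coefficient can be computed by subtracting unit means and time means (and adding back the overall mean) from both $$D_{it}$$ and $$y_{it}$$, then regressing one on the other. The helper names are ours.

```python
from statistics import fmean


def two_way_demean(panel, idx):
    """Subtract unit and time means, add back the overall mean (balanced panels only)."""
    overall = fmean(row[idx] for row in panel)
    units = {row[0] for row in panel}
    times = {row[1] for row in panel}
    unit_mean = {u: fmean(r[idx] for r in panel if r[0] == u) for u in units}
    time_mean = {t: fmean(r[idx] for r in panel if r[1] == t) for t in times}
    return [r[idx] - unit_mean[r[0]] - time_mean[r[1]] + overall for r in panel]


def twfe_beta(panel):
    """OLS coefficient on D in y = alpha_i + delta_t + beta * D + error."""
    d_tilde = two_way_demean(panel, 2)
    y_tilde = two_way_demean(panel, 3)
    return sum(d * y for d, y in zip(d_tilde, y_tilde)) / sum(d * d for d in d_tilde)


# Hypothetical data: one (pre, post) outcome pair per unit.
treated = [(10.0, 15.0), (12.0, 16.0), (11.0, 17.0), (13.0, 18.0)]
control = [(9.0, 11.0), (8.0, 10.0), (10.0, 11.0), (9.0, 12.0)]

# Rows are (unit, time, D, y); D = 1 only for treated units at t = 2.
panel = []
for i, (pre, post) in enumerate(treated + control):
    a = 1 if i < len(treated) else 0
    panel.append((i, 1, 0, pre))
    panel.append((i, 2, a, post))

beta = twfe_beta(panel)

# The usual diff-in-diff of four sample means, for comparison.
did = (fmean(p for _, p in treated) - fmean(p for p, _ in treated)) - (
    fmean(p for _, p in control) - fmean(p for p, _ in control)
)
print(beta, did)
```

Both numbers equal 3 here (up to floating-point error). The equality is specific to two periods with a single treatment date; with more periods or staggered timing the two-way fixed effects coefficient carries the extra homogeneity assumptions described above.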

What if treated units are treated at different times? Or what if we don’t have a control group, only variation in treatment timing? Goodman-Bacon (2018) examines the two-way fixed effect regression model ($$Y_i(t) = \alpha_i + \delta_t + \beta D_{it} + \epsilon_{it}$$) as a diff-in-diff estimator when treatment timing varies across units. It turns out that the diff-in-diff parameter $$\beta$$ is a weighted combination of all possible $$2 \times 2$$ diff-in-diff estimators found in the data. So each treatment group can be compared to the untreated group (if one exists), but each treatment group also serves as a control to every other treatment group. The global diff-in-diff estimate is a weighted average of all $$2 \times 2$$ estimates. The weights are determined by sample sizes in each group and the variance in the treatment variable.

# Confounding

In general, a confounder is a factor associated with both treatment and outcomes. This is why randomized trials are not subject to bias through confounders — no factor is associated with the randomly assigned treatment. In other words, the potential outcomes and treatment are independent.

Unconditionally unconfounded
$Y^a \perp A$

Sometimes, treatment may be randomized within levels of a covariate $$X$$ (conditionally randomized) and we write this relation:

Conditionally unconfounded
$Y^a \perp A \mid X$

In both of these versions, the treatment $$A$$ is independent of the potential outcomes $$Y^a$$, either unconditionally or conditional on $$X$$. In practice, these relations are only satisfied in randomized trials; otherwise, there is no guarantee that $$X$$ is sufficient to make $$A$$ and $$Y^a$$ conditionally independent. Even if we continue collecting covariates, it is likely that some unmeasured covariates $$U$$ are still a common cause of $$A$$ and $$Y^a$$.

In diff-in-diff studies, the notion of confounding is fundamentally different. As alluded to in the previous section, a covariate confounds a diff-in-diff analysis (that is, it leads to a violation of the counterfactual assumption) when (1) it is associated with treatment and (2) either its relationship with the outcome varies over time or its distribution evolves differently over time in the treated and control populations (in which case the covariate must also have an effect on the outcome).

Below, we briefly discuss confounding in linear and non-linear settings. For a very lucid discussion of confounding in diff-in-diff, we recommend Wing, Simon, and Bello-Gomez (2018).

## Confounding in linear settings

If we know how confounding arises, we can address it. For example, if the truth is a linear data-generating model, we can use a linear regression model to address confounding. The flowchart below outlines six linear data-generating models and the appropriate linear regression adjustment for each.

Of these six scenarios, two require no adjustment at all. Of the four that require adjustment, only one requires the regression adjustment type nearly always found in the literature, i.e., adjusting for a time-varying covariate without any interaction with time. In the other three scenarios with confounding bias, the issue is due, in whole or in part, to time-varying covariate effects. For these cases, including an interaction of covariates with time is crucial to addressing confounding bias.

The directed acyclic graphs (DAGs) for these scenarios are described, with a brief discussion of each, in the paragraphs below:

In this scenario, the covariate $$X$$ does not vary over time. The arrow from $$X$$ to $$A$$ indicates that $$X$$ is a cause of $$A$$, satisfying the first requirement of a confounder. Additionally, there are arrows from $$X$$ to $$Y(1)$$ and to $$Y(2)$$ as well as an arrow from $$A$$ to $$Y(2)$$. [Note: there is no arrow from $$A$$ to $$Y(1)$$ because treatment is administered after $$Y(1)$$.] $$\alpha$$ is the effect of $$X$$ on $$Y(1)$$, and $$\beta$$ is the effect of $$X$$ on $$Y(2)$$. When $$\alpha = \beta$$, the effect of $$X$$ is time-invariant and we do not require covariate adjustment. When $$\alpha \neq \beta$$, we must adjust for the interaction of $$X$$ with time.

In this scenario, the time-varying covariate $$X$$ in periods 1 and 2 is denoted $$X(1)$$ and $$X(2)$$. There is no arrow connecting $$A$$ to $$X(2)$$, indicating that treatment does not affect the evolution of $$X(1)$$ to $$X(2)$$. When $$\alpha = \beta$$, the effect of $$X$$ is time-invariant and we do not need to adjust for the covariate. When $$\alpha \neq \beta$$, we must adjust for the interaction of $$X$$ with time.

In this scenario, the time-varying covariate $$X$$ evolves differentially by treatment group. However, most diff-in-diff analyses implicitly or explicitly assume that $$X$$ does not evolve based on treatment group. See our nonparametric section below. One diff-in-diff estimator that directly accounts for this phenomenon is Stuart et al. (2014), which we discuss in more detail below. When $$\alpha = \beta$$, the effect of $$X_t$$ on $$Y^0$$ is time-invariant and it suffices to adjust only for $$X_t$$. When $$\alpha \neq \beta$$, we must adjust for the interaction of $$X_t$$ with time.

## Confounding in non-linear settings

The main challenge of fitting diff-in-diff using nonlinear models is the care required in interpreting regression coefficients as causal estimates. In a logistic regression model, for example, adding independent variables to the model can change the coefficients on existing independent variables, even if the added variables are unrelated to the existing ones (Mood 2010). Understanding confounding is also more complicated in nonlinear diff-in-diff models. See the nonlinear models section below for more.

# Inference

Typically, data used in diff-in-diff studies are complex and cannot be assumed to be iid (i.e., independently and identically distributed). For example, we may have hierarchical data, in which individual observations are nested within larger units (e.g., individuals in a US state) or longitudinal data, in which repeated measures are obtained for units. In both of these cases, assuming iid data will result in standard errors that are too small.

Bertrand, Duflo, and Mullainathan (2004) and Rokicki et al. (2018) discuss diff-in-diff inference in the presence of serial correlation. The authors consider methods to accommodate this issue such as collapsing the data, modeling the covariance structure, and permutation inference.

### Collapsing the data

Collapsing or aggregating the data returns us to the simple two-period setting, obviating the need to consider longitudinal correlation in the data. When treatment is administered at the same time point for all treated units, we can perform ordinary least squares on the aggregated data. On the other hand, when treatment start times are staggered (e.g., states pass the same health care law in different years), Bertrand, Duflo, and Mullainathan (2004) suggest aggregating the residuals from a regression model and then analyzing those. See Goodman-Bacon (2018) and Athey and Imbens (2018) for more about varying treatment start times.

In simulation studies, Bertrand, Duflo, and Mullainathan (2004) and Rokicki et al. (2018) find that aggregation has good Type I error and coverage, but it does lose some information (and thus power).
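A minimal sketch of collapsing in the simple case where all treated units start treatment at the same time. The data layout (tuples of unit, treatment indicator, time, outcome), the function name `collapse_did`, and the toy numbers are assumptions made for illustration; they are not taken from the cited papers.

```python
from collections import defaultdict
from statistics import mean

def collapse_did(rows, t0):
    """Collapse a panel to one pre and one post mean per unit, then return
    the difference in mean changes (treated minus control).

    rows: iterable of (unit, treated, time, outcome) tuples (hypothetical layout)
    t0:   last pre-intervention time, so times <= t0 count as "pre"
    """
    pre, post, group = defaultdict(list), defaultdict(list), {}
    for unit, treated, time, y in rows:
        group[unit] = treated
        (pre if time <= t0 else post)[unit].append(y)

    # One number per unit: its post-period mean minus its pre-period mean.
    change = {u: mean(post[u]) - mean(pre[u]) for u in group}

    treated_changes = [c for u, c in change.items() if group[u] == 1]
    control_changes = [c for u, c in change.items() if group[u] == 0]
    return mean(treated_changes) - mean(control_changes)

# Toy panel: 2 treated and 2 control units, times 1 to 4, intervention after t = 2.
rows = [
    ("A", 1, 1, 10.0), ("A", 1, 2, 11.0), ("A", 1, 3, 15.0), ("A", 1, 4, 16.0),
    ("B", 1, 1, 20.0), ("B", 1, 2, 21.0), ("B", 1, 3, 25.0), ("B", 1, 4, 26.0),
    ("C", 0, 1, 5.0), ("C", 0, 2, 6.0), ("C", 0, 3, 7.0), ("C", 0, 4, 8.0),
    ("D", 0, 1, 8.0), ("D", 0, 2, 9.0), ("D", 0, 3, 10.0), ("D", 0, 4, 11.0),
]
estimate = collapse_did(rows, t0=2)
print(estimate)
```

After collapsing, each unit contributes a single change score, so the repeated measures within a unit no longer enter the analysis as separate observations.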

### Clustered standard errors

The most popular way to account for serial correlation in diff-in-diff is clustered standard errors (Cameron and Miller 2015; Abadie et al. 2017). In practice, this is typically done in Stata using the cluster option of the regress command. Similar adjustments are available in any common statistical software. We declare which variable or variables constitute our clusters, and the software adjusts the usual standard errors to account for within-cluster correlation.

This type of adjustment fails with only one treated unit (Conley and Taber 2011), for example, when a single state implements a policy of interest. There is no hard and fast rule on the number of treated units needed for clustered standard errors to be appropriate. Generally, it is better to have a balanced treated-control ratio than a lopsided one. As we mentioned in the preface to this section, Rokicki et al. (2018) examined DID inference when there were few groups. In particular, see Figures 2 (panel A) and 4 in that paper. Figure 2 examines 95% confidence interval coverage for the DID parameter under various inference techniques. In panel A, we can see that when the number of groups is small, clustering standard errors results in undercoverage, and this undercoverage is worse when the treated-to-control ratio is unbalanced. In Figure 4, coverage is presented as a function of the proportion of treated to control units. Whenever there is an unbalanced proportion of treated to control units, coverage suffers.

Fortunately, other choices exist if clustered standard errors are untenable, and they may even be preferred in many situations. Donald and Lang (2007) developed a two-part procedure for estimation and inference in simple models that works well even when the number of groups is small.

Mixed models with random effects at the cluster level can account for serial correlation. This is what we used in our demonstration of confounding in a previous section. Generalized estimating equations (GEE) take into account covariance structure and use a robust sandwich estimator for the standard errors (see Figure 2, panel D and Figure 4 in Rokicki et al. (2018)). Both of these methods are widely available in statistical software. In particular, GEE is attractive because it is robust to misspecification of the correlation structure: if we guess incorrectly, our estimate is not biased as long as the underlying regression model is correct. Specifying the correct covariance structure will increase the efficiency of our estimate. However, note that Rokicki et al. (2018) also found undercoverage of the GEE confidence intervals when the ratio of treated to control units was lopsided.

### Arbitrary covariance structures

Throughout the diff-in-diff literature, we find simulations and inference techniques based on an autoregressive covariance AR(1) structure for the residuals within a cluster. The AR(1) covariance structure is

$\text{Cov}(Y_i) = \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix}$

with an unknown variance parameter $$\sigma^2$$ and an unknown autoregressive parameter $$0 \leq \rho \leq 1$$. When $$\rho$$ is larger, clustered values are more highly correlated; whereas when $$\rho = 0$$, observations are independent. This structure assumes that correlation is positive (or zero) across all observations and that observations closer to each other are the most strongly correlated, and observations that are more distant are more weakly correlated.
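To make the structure concrete, here is a small sketch that builds the AR(1) covariance matrix entry by entry, using the fact that entry $$(i, j)$$ equals $$\sigma^2 \rho^{|i-j|}$$. The function name `ar1_cov` and the parameter values are illustrative assumptions.

```python
def ar1_cov(n, sigma2, rho):
    """Return the n x n AR(1) covariance matrix as a list of rows.
    Entry (i, j) is sigma2 * rho ** |i - j|."""
    return [[sigma2 * rho ** abs(i - j) for j in range(n)] for i in range(n)]

# Four time points, variance 2, autoregressive parameter 0.5 (made-up values).
cov = ar1_cov(4, sigma2=2.0, rho=0.5)
for row in cov:
    print(row)
```

Each step away from the diagonal multiplies the covariance by another factor of $$\rho$$, which is why more distant observations are more weakly correlated.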

The AR(1) correlation structure is pervasive in diff-in-diff simulation studies and inference techniques. Bertrand, Duflo, and Mullainathan (2004) considered this correlation structure in simulations and found that “this technique [assuming AR(1)] does little to solve the serial correlation problem” due to the difficulty in estimating $$\rho$$. Rokicki et al. (2018) used an AR(1) in their simulations.

McKenzie (2012) also discusses autocorrelation, emphasizing how statistical power relates to the number of time points and to autocorrelation, ultimately concluding that ANCOVA is more powerful than diff-in-diff.

The correlation structure in diff-in-diff applications may take many different forms. For example, after de-meaning and de-trending, outcomes may have a weak positive correlation at adjacent time points but a negative correlation at time points that are far apart. In the shiny app below, we present correlation structures for simulated data and real data. The real datasets are from (a) the Dartmouth Health Atlas, (b) MarketScan claims data, and (c) Medicare claims, which are described in more detail within the app. Play around with the settings to simulate data that look like your applications. Does the correlation structure look the way you expect?

### Permutation tests

Permutation tests are a resampling method that can be used to test statistical hypotheses. In the diff-in-diff setting, permutation tests comprise the following steps:

1. Compute the test statistic of interest on the original data. For example, calculate the interaction term between time and treatment from a regression model. Call this $$\hat{\delta}$$.

2. For a large positive integer $$K$$, randomly permute the treatment assignment in the original data, so that the data are the same save for a new treatment assignment. Do this $$K$$ times.

3. For each of the $$K$$ new datasets, compute the same test statistic. In our example, we compute a new interaction term from a regression model. Call these $$\hat{\delta}^{(k)}$$ for permutation $$k \in \{1, \dots, K\}$$.

4. Compare the test statistic $$\hat{\delta}$$ found in the first step to the test statistics $$\hat{\delta}^{(1)}, \dots, \hat{\delta}^{(K)}$$ found in the third step.

The fourth step is where we can get a nonparametric p-value for the parameter of interest. If, for instance, $$\hat{\delta}$$ is more extreme than 95% of $$\hat{\delta}^{(k)}$$ then the permutation test p-value is 0.05.
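The four steps can be sketched as follows. This is a simplified illustration, not the procedure of any cited paper: it assumes the data have already been reduced to one change score (post minus pre) per unit, it uses the difference in mean changes as the test statistic instead of a regression interaction term, and it permutes treatment labels across whole units. The function names and toy data are made up.

```python
import random
from statistics import mean

def did(changes, treated):
    """Diff-in-diff from unit-level changes (post minus pre)."""
    t = [c for c, a in zip(changes, treated) if a == 1]
    c0 = [c for c, a in zip(changes, treated) if a == 0]
    return mean(t) - mean(c0)

def permutation_pvalue(changes, treated, K=2000, seed=0):
    rng = random.Random(seed)
    observed = did(changes, treated)        # step 1: statistic on the original data
    labels = list(treated)
    extreme = 0
    for _ in range(K):                      # steps 2 and 3: permute and recompute
        rng.shuffle(labels)                 # reassign treatment across whole units
        if abs(did(changes, labels)) >= abs(observed):
            extreme += 1
    return observed, extreme / K            # step 4: two-sided p-value

# Toy data: 5 treated units with large changes, 5 control units with none.
changes = [5.1, 4.8, 5.3, 4.9, 5.2, 0.2, -0.1, 0.1, 0.0, -0.2]
treated = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
observed, p_value = permutation_pvalue(changes, treated)
print(observed, p_value)
```

Shuffling labels across units keeps each unit's own outcomes together, so within-unit correlation is preserved in every permuted dataset. A common variant reports `(extreme + 1) / (K + 1)` so that the p-value can never be exactly zero.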

For more on randomization inference for difference-in-differences, see Conley and Taber (2011) and MacKinnon and Webb (2018).

# Robustness

### Diff-in-diff as a negative outcome control

Negative controls are a useful tool in epidemiology to detect and adjust for unobserved confounding (Lipsitch, Tchetgen, and Cohen (2010)). Sofer et al. (2016) link the negative outcome control (NOC) approach to diff-in-diff. In short, a negative outcome is an outcome that is unaffected by treatment. This is precisely the pre-treatment outcome in the diff-in-diff setup. With two time points $$t \in \{1, 2\}$$ there are two outcomes — $$Y(1)$$ and $$Y(2)$$. $$Y(1)$$ is the negative outcome.

Let’s reconsider the typical counterfactual (parallel trends) assumption conditional on observed $$X$$.

$\mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1, X\right] = \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0, X\right]$

We reorganize terms:

$\mathbb{E}\left[Y^0(2)\mid A = 1, X\right] - \mathbb{E}\left[Y^0(2) \mid A = 0, X\right] = \mathbb{E}\left[Y^0(1) \mid A = 1, X\right] - \mathbb{E}\left[Y^0(1) \mid A = 0, X\right]$

The left hand side encodes confounding bias for the effect of $$A$$ on $$Y(2)$$ and the right hand side encodes the bias of the effect of $$A$$ on $$Y(1)$$. The authors show various formulae to calculate the ATT with adjustment for unmeasured confounders based on NOC. All pertinent formulas for the ATT are found in Section 3 of Sofer et al. (2016) with applications in subsequent sections.

For more discussion of robustness checks in diff-in-diff and similar models, see Athey and Imbens (2016).

# Matching

Matching estimators adjust for confounding by balancing the treatment groups on measured covariates. Rather than using the entire sample population to estimate the diff-in-diff effect, units in the control group are selected based on their “closeness” to units in the treated group. We introduce this section with a series of Tweets (enhanced with GIFs!) by Laura Hatfield about a recent Daw and Hatfield (2018a) paper on matching and regression to the mean.

The argument focuses on estimators that match on outcomes in the pre-treatment period. Matching on pre-treatment outcomes is attractive in diff-in-diff because it improves comparability of the groups and possibly of their outcome trends. The crux of the argument in Daw and Hatfield (2018a) is that matching estimators can be dangerous in diff-in-diff settings due to regression to the mean. Regression to the mean is a notorious phenomenon in which extreme values tend to revert to the group mean on subsequent measurements. For example, if we select the ten students who score highest on an exam, at a subsequent exam, the average score for these ten students would drop towards the class mean.

For diff-in-diff, the effect is similar. By constraining the pre-treatment outcomes to be similar, we are more likely to select units of the group that are higher or lower than their respective group means. Once the matching constraint is dropped (in the post-treatment period), these units’ means can revert back to their respective group’s mean and possibly yield a spurious diff-in-diff effect. So in some cases, matching can actually introduce bias.

So how can we know whether matching is useful or harmful in our diff-in-diff study? Unfortunately, sometimes we can’t know. Take Ryan, Burgess, and Dimick (2015), which presents a simulation study using matching estimators and shows that matching can reduce confounding bias. In their paper, they sampled the treated and control groups from the same population, but the probability of being part of the treated group increased for high pre-treatment outcomes. In contrast, Daw and Hatfield (2018a) set up similar simulations but with treated and control groups coming from different, but overlapping, populations.

For the following thought experiment, assume no diff-in-diff effect is present. If the populations are drawn as in Ryan, Burgess, and Dimick (2015), the two populations have different pre-treatment means. In a diff-in-diff study without matching, these units will regress to the mean in the post-treatment period (but they will regress to the same value since they are drawn from the same population!). This yields a non-zero diff-in-diff effect. Matching actually fixes this issue. If the populations are drawn as in Daw and Hatfield (2018a), the opposite is true. Without matching, the populations are different in the pre-intervention period and remain that way in the post-intervention period (since they are representative of their true populations, there is no regression to the mean). With matching, the populations are constrained to be the same in the pre-treatment period, and once the constraint is released in the post period, the two groups regress back to their group means. So in the Ryan, Burgess, and Dimick (2015) setup, matching is the solution to regression to the mean bias; in the Daw and Hatfield setup, matching is the cause of the regression to the mean bias.
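The thought experiment can be mimicked with a stylized simulation. This sketch is loosely inspired by the two setups and is not a replication of either paper. It assumes independent standard normal noise in each period with no persistent unit effects (so regression to the mean is complete), no treatment effect, and nearest-neighbor matching with replacement on the pre-period outcome. All names and parameter values are hypothetical.

```python
import random
from bisect import bisect_left
from statistics import mean

def did(treated, control):
    """Diff-in-diff from lists of (pre, post) pairs."""
    return (mean(post - pre for pre, post in treated)
            - mean(post - pre for pre, post in control))

def match_controls(treated, control):
    """Nearest-neighbor matching on the pre-period outcome, with replacement."""
    pool = sorted(control)                  # sorted by pre-period outcome
    pres = [pre for pre, _ in pool]
    matched = []
    for pre, _ in treated:
        i = bisect_left(pres, pre)
        # Candidates: the neighbor on each side of the insertion point.
        cands = [j for j in (i - 1, i) if 0 <= j < len(pool)]
        matched.append(pool[min(cands, key=lambda j: abs(pres[j] - pre))])
    return matched

def simulate(same_population, n=4000, seed=1):
    rng = random.Random(seed)
    treated, control = [], []
    for k in range(n):
        if same_population:
            # One population; a high pre-period outcome raises the chance of treatment.
            mu = 0.0
            pre = rng.gauss(0, 1)
            a = pre + rng.gauss(0, 1) > 0
        else:
            # Two populations with different means; assignment ignores outcomes.
            a = k % 2 == 0
            mu = 1.0 if a else 0.0
            pre = mu + rng.gauss(0, 1)
        post = mu + rng.gauss(0, 1)         # no treatment effect in either setup
        (treated if a else control).append((pre, post))
    return did(treated, control), did(treated, match_controls(treated, control))

same_unmatched, same_matched = simulate(same_population=True)
diff_unmatched, diff_matched = simulate(same_population=False)
print("same population:       unmatched", same_unmatched, "matched", same_matched)
print("different populations: unmatched", diff_unmatched, "matched", diff_matched)
```

Under these assumptions, the unmatched estimate is far from zero in the same-population setup and matching pulls it toward zero, while in the different-populations setup the unmatched estimate is near zero and matching pushes it away from zero.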

Using real life data, there is no way to check empirically whether our groups come from the same population or from different populations. Determining this must come from expert knowledge from how the treatment assignment mechanisms work. To quote Daw and Hatfield (2018b) in the follow-up to their own paper:

(R)esearchers must carefully think through the possible treatment assignment mechanisms that may be operating in the real-life situation they are investigating. For example, if researchers are aware that a pay-for-performance incentive was assigned to physicians within a state based on average per-patient spending in the past year, one may be comfortable assuming that treatment assignment mechanism is operating at the unit level (i.e., the potential treatment and control units are from the same population). In contrast, if the same incentive was assigned to all physicians within a state and a researcher chooses a control state based on geographic proximity, it may be more reasonable to assume that treatment assignment is operating at the population level (i.e., the potential treatment and control units are from separate populations).

Other researchers have also noted the lurking bias of some matching diff-in-diff estimators. Lindner and McConnell (2018), for example, found in simulations that the bias of the estimator was correlated with the standard deviation of the error term. As the standard deviation increased, so did the bias.

In a pair of papers, Chabé-Ferret (Chabé-Ferret 2015; Chabé-Ferret 2017) similarly concluded that matching on pre-treatment outcomes can be problematic and is dominated by symmetric diff-in-diff.

# Synthetic Control

The first instance of synthetic control is taken from Abadie and Gardeazabal (2003) which looked at the effects of terrorism on economic growth across the regions of Spain. Beginning in the 1960s the Basque Country experienced a rash of terrorism unique within Spain. The authors used a 1998 truce and subsequent breaking of that truce as a natural experiment. The other regions of Spain were weighted to form a synthetic Basque Country, similar to the real Basque Country in demographics. The results showed that per capita GDP increased in Basque Country after the truce (relative to the synthetic Basque country with no truce). And when the cease-fire ended, GDP decreased.

Synthetic control methods are a close cousin to matching methods. The idea behind synthetic control is that a weighted combination of control units can form a closer match to the treated group than any one (or several) control unit (Abadie, Diamond, and Hainmueller (2010)). The weights are chosen to minimize the distance between treated and control on a set of matching variables, which can include covariates and pre-treatment outcomes. The post-period outcomes for the synthetic control are calculated by taking a weighted average of the control groups’ outcomes. Many authors have extended synthetic control work recently (Kreif et al. 2016; Xu 2017; Ferman, Pinto, and Possebom 2017; Kaul et al. 2015).
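As a toy illustration of the weighting idea only, the sketch below handles the special case of two control ("donor") units and matches on pre-treatment outcomes alone. It finds the weight by a simple grid search; real implementations handle many donors and covariates with constrained optimization. The function name and all numbers are made up.

```python
def synthetic_weight(treated_pre, donor1_pre, donor2_pre, grid=100):
    """Grid-search the weight w in [0, 1] on donor 1 (and 1 - w on donor 2)
    that minimizes the squared pre-period distance to the treated unit."""
    def loss(w):
        return sum((y - (w * d1 + (1 - w) * d2)) ** 2
                   for y, d1, d2 in zip(treated_pre, donor1_pre, donor2_pre))
    return min((k / grid for k in range(grid + 1)), key=loss)

# Pre-period outcomes (three time points) and one post-period outcome per unit.
donor1_pre, donor1_post = [10.0, 12.0, 14.0], 16.0
donor2_pre, donor2_post = [20.0, 21.0, 22.0], 23.0
treated_pre, treated_post = [17.0, 18.3, 19.6], 23.0

w = synthetic_weight(treated_pre, donor1_pre, donor2_pre)

# Post-period outcome of the synthetic control: weighted average of the donors.
synthetic_post = w * donor1_post + (1 - w) * donor2_post
gap = treated_post - synthetic_post
print(w, synthetic_post, gap)
```

Restricting the weights to be non-negative and to sum to one keeps the synthetic control inside the range spanned by the donors, which is the sense in which it is a weighted average of control units.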

Synthetic control has many upsides. We don’t need parallel trends because we can construct a control group with beautifully parallel outcomes in the pre-intervention period. It has its own packages in R, Stata, and Matlab (Abadie, Diamond, and Hainmueller (2011)). However, its downsides are similar to those seen in matching estimators for diff-in-diff. By matching on pre-period outcomes, we may invite similar biases. While synthetic control is an influential method, it still requires care.

# Semi- and Nonparametric Estimators

As we’ve seen, diff-in-diff analysis is a four-step process:

1. make assumptions about how our data were generated
2. suggest a sensible model for the untreated outcomes
3. connect the untreated outcomes to the observed outcomes
4. estimate the diff-in-diff parameter and make inference

While the template is simple, our estimates and inference can crumble if we’re wrong at any step along the way. We’ve discussed the importance of counterfactual assumptions and inferential procedures. We now turn our attention to the modeling aspect of diff-in-diff. So far, we have discussed only parametric models. Below, we present some semiparametric and nonparametric estimators for diff-in-diff. These give us more flexibility when we don’t believe in linearity in the regression model or fixed unit effects.

### Semi-parametric estimation with baseline covariates

Abadie (2005) addresses diff-in-diff when a pre-treatment covariate differs by treatment status and also affects the dynamics of the outcome variable. In our confounding section above, this is the “Time-invariant X with time-varying effect” scenario.

Let’s return to the two-period setting. When a covariate $$X$$ is associated with both treatment $$A$$ and changes in the outcome $$Y$$, the Counterfactual Assumption 1 no longer holds.

Thus, Abadie (2005) specifies an identifying assumption that conditions on $$X$$. That is,

Counterfactual Assumption (1a) $\mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1, X\right] = \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0, X\right].$

This assumption does not identify the ATT, but it can identify the CATT, that is, the conditional ATT:

Conditional average effect of treatment on the treated (CATT) $CATT \equiv \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1, X\right].$

The CATT itself may be of interest, or we may want to average the CATT over the distribution of $$X$$ to get back the ATT. To identify the CATT, repeat the identification steps above with expectations conditional on $$X$$. As expected, it turns out that

\begin{align*} \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1, X\right] &= \lbrace \mathbb{E}\left[Y(2) \mid A = 1, X\right] - \mathbb{E}\left[Y(1) \mid A = 1, X \right] \rbrace - \\ & \ \ \ \ \ \lbrace \mathbb{E}\left[Y(2) \mid A = 0, X\right] - \mathbb{E}\left[Y(1) \mid A = 0, X \right] \rbrace. \end{align*}

Nonparametric estimators for these quantities are easy when $$X$$ is a single categorical variable. They are simply sample averages for groups defined by combinations of $$X$$ and $$A$$:

1. The post-treatment average of the treated group with $$X=x$$ for $$\mathbb{E}\left[Y(2) \mid A = 1, X=x\right]$$
2. The pre-treatment average of the treated group with $$X=x$$ for $$\mathbb{E}\left[Y(1) \mid A = 1, X=x\right]$$
3. The post-treatment average of the control group with $$X=x$$ for $$\mathbb{E}\left[Y(2) \mid A = 0, X=x\right]$$
4. The pre-treatment average of the control group with $$X=x$$ for $$\mathbb{E}\left[Y(1) \mid A = 0, X=x\right]$$

However, if $$X$$ is high-dimensional or contains continuous covariates, these get tricky. Abadie proposes a semiparametric solution using propensity scores. Recall that a propensity score is the estimated probability of treatment given pre-treatment covariate $$X$$, $$P(A = 1 \mid X)$$. For this approach to work, we need a new assumption: positivity. For every possible value (or values) of $$X$$, there must be a positive probability of receiving the treatment and of receiving the control. That is,

Positivity Assumption $0 < P(A = 1 \mid X) < 1 \; \text{ for all } X.$

This assumption ensures the estimand is defined at all $$X$$. If some values of $$X$$ lead to guaranteed treatment or control (i.e., propensity scores of $$0$$ or $$1$$) we should reconsider the study population.

With positivity in hand, consider the weighted estimator of Abadie (2005):

$\mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1\right] = \mathbb{E}\left[\frac{Y(2) - Y(1)}{P(A = 1)} \cdot \frac{A - P(A = 1 \mid X)}{1 - P(A = 1 \mid X)}\right].$

To estimate these quantities, we need fitted values of the propensity scores for each unit, i.e., $$\hat{P}(A=1 \mid X= x_i)$$, and then we use sample averages in the treated and control groups.
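One way to see how the weighted expression reduces to group averages is the following sketch of the algebra. The shorthand $$\pi(X) = P(A = 1 \mid X)$$ and $$\Delta Y = Y(2) - Y(1)$$ is introduced here for brevity and is not Abadie's notation. Because $$A$$ is binary, the factor $$\frac{A - \pi(X)}{1 - \pi(X)}$$ equals $$1$$ for treated units and $$-\frac{\pi(X)}{1 - \pi(X)}$$ for control units, so

\begin{align*} \mathbb{E}\left[\frac{\Delta Y}{P(A = 1)} \cdot \frac{A - \pi(X)}{1 - \pi(X)}\right] &= \frac{\mathbb{E}\left[A \, \Delta Y\right]}{P(A = 1)} - \frac{1}{P(A = 1)} \mathbb{E}\left[(1 - A) \frac{\pi(X)}{1 - \pi(X)} \Delta Y\right] \\ &= \mathbb{E}\left[\Delta Y \mid A = 1\right] - \frac{1}{P(A = 1)} \mathbb{E}\left[(1 - A) \frac{\pi(X)}{1 - \pi(X)} \Delta Y\right]. \end{align*}

Replacing expectations with sample averages, $$P(A = 1)$$ with $$n_1 / n$$ (where $$n_1$$ is the number of treated units), and $$\pi(x_i)$$ with the fitted $$\hat{\pi}(x_i)$$ gives

\begin{align*} \frac{1}{n_1} \sum_{i: A_i = 1} \Delta Y_i - \frac{1}{n_1} \sum_{i: A_i = 0} \frac{\hat{\pi}(x_i)}{1 - \hat{\pi}(x_i)} \Delta Y_i. \end{align*}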

We only need the average change in outcomes among the treated units and the weighted average change in outcomes among the control units. The weights are $$\frac{\hat{P}(A=1 \mid X=x_i)}{1 - \hat{P}(A=1 \mid X=x_i)}$$. Why are these weights sensible? Well, outcome changes among control units with $$x_i$$ that resemble treated units (i.e., with large $$\hat{P}(A=1 \mid X=x_i)$$) will get more weight. Outcome changes among control units with $$x_i$$ that resemble control units will get less weight.

To model the propensity scores, we could use a parametric model like logistic regression or something more flexible like machine learning. As usual, extending the model to multiple time points in the pre- and post-treatment periods is more complicated.
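A minimal sketch of the weighted estimator in the two-period setting. To keep it self-contained, it assumes a single discrete covariate and estimates the propensity score as the share of treated units within each covariate value; with continuous or high-dimensional $$X$$ one would substitute fitted values from a model such as logistic regression. The function name `abadie_att` and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean

def abadie_att(units):
    """units: list of (a, x, y_pre, y_post) with binary treatment a and a
    discrete covariate x. Returns the weighted diff-in-diff estimate of the ATT.
    Requires positivity: every x must have both treated and control units."""
    n = len(units)
    p_treated = sum(a for a, _, _, _ in units) / n      # estimate of P(A = 1)

    # Propensity score by stratum: share of treated among units with X = x.
    n_x = Counter(x for _, x, _, _ in units)
    n_treated_x = Counter(x for a, x, _, _ in units if a == 1)
    pscore = {x: n_treated_x[x] / n_x[x] for x in n_x}

    terms = []
    for a, x, y_pre, y_post in units:
        pi = pscore[x]
        terms.append((y_post - y_pre) / p_treated * (a - pi) / (1 - pi))
    return mean(terms)

# Toy data: untreated outcomes grow by 1 when x = 0 and by 3 when x = 1;
# treatment adds 2; treated units are more likely to have x = 1.
units = ([(1, 0, 10.0, 13.0)] * 2 + [(0, 0, 10.0, 11.0)] * 6
         + [(1, 1, 20.0, 25.0)] * 6 + [(0, 1, 20.0, 23.0)] * 2)

weighted = abadie_att(units)
naive = (mean(post - pre for a, _, pre, post in units if a == 1)
         - mean(post - pre for a, _, pre, post in units if a == 0))
print(weighted, naive)
```

In this constructed example the unadjusted diff-in-diff mixes the treatment effect with the different outcome dynamics of the two covariate groups, while the weighted version re-weights control units so their covariate mix resembles that of the treated units.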

### Nonparametric estimation with empirical distributions

Athey and Imbens (2006) developed a generalization of diff-in-diff called “changes-in-changes” (of which diff-in-diff is a special case). This method drops many of the parametric assumptions of diff-in-diff and allows both time and treatment effects to vary across individuals. Again we are in a two-period, two-group setting. The Athey and Imbens (2006) model is much less restrictive than the usual parametric model. It only assumes two things:

1. $$Y_i^0 = h(u_i, t)$$ for unobservable characteristics $$u_i$$ and an unknown function $$h$$ (increasing in $$u$$)
2. Within groups, the distribution of $$u_i$$ does not change over time

Note that the distribution of $$u_i$$ can differ between treatment groups so long as this difference remains constant. Below, we discuss a method for estimating diff-in-diff without this assumption. To estimate the target parameter, Athey and Imbens (2006) estimate empirical outcome distributions in a familiar list of samples:

1. The post-treatment distribution of $$Y$$ in the treated group,
2. The pre-treatment distribution of $$Y$$ in the treated group,
3. The post-treatment distribution of $$Y$$ in the control group, and
4. The pre-treatment distribution of $$Y$$ in the control group.

What we are missing is

5. The post-treatment distribution of $$Y^0$$ in the treated group (i.e., counterfactual, untreated outcomes)

Since we cannot observe this distribution, Athey and Imbens (2006) estimate it through a combination of the empirical distributions for (2), (3), and (4). After estimating (5), the effect of treatment is the difference between observed (1) and estimated (5). Let $$F_{Y^0_{12}}$$ be the counterfactual distribution for the untreated outcomes for the treated group at $$t = 2$$. We estimate this quantity through the relation:

$F_{Y^0_{12}}(y) = F_{Y_{11}}(F^{-1}_{Y_{01}}(F_{Y_{02}}(y)))\;,$ where $$F_{Y_{11}}$$ is the distribution function for the (observed) outcomes for the treated group at $$t = 1$$; $$F_{Y_{01}}$$ is the distribution function for the (observed) outcomes for the untreated group at $$t = 1$$; and $$F_{Y_{02}}$$ is the distribution function for the (observed) outcomes for the untreated group at $$t = 2$$.

The other distribution of note, $$F_{Y^1_{12}}(y)$$, is actually observed since these are the treated outcomes for the treated group at $$t = 2$$. Since we have estimates for both $$F_{Y^1_{12}}$$ and $$F_{Y^0_{12}}$$, we can also estimate the diff-in-diff effect. MATLAB code for this estimator is available on the authors’ website.
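A rough sketch of the idea with empirical distributions. The relation above implies that the counterfactual outcome distribution for the treated group at $$t = 2$$ is the distribution of $$F^{-1}_{Y_{02}}(F_{Y_{01}}(y))$$ applied to the treated group's outcomes at $$t = 1$$, so the sketch maps each treated pre-period outcome through the control group's empirical distributions and compares means. This is an illustration under simplifying assumptions (simple empirical quantiles, values below the control support clamped to the smallest control value), not the authors' MATLAB code. The function name and data are made up.

```python
from bisect import bisect_right
from statistics import mean

def cic_att(treated_pre, treated_post, control_pre, control_post):
    """Changes-in-changes estimate of the average effect on the treated."""
    c_pre, c_post = sorted(control_pre), sorted(control_post)
    n_pre, n_post = len(c_pre), len(c_post)

    def transform(y):
        # Rank of y among control pre-period outcomes: n_pre * F_01(y).
        rank = bisect_right(c_pre, y)
        # Empirical quantile of the control post-period outcomes at
        # q = rank / n_pre, i.e. index ceil(q * n_post) - 1 in integer arithmetic.
        idx = (rank * n_post + n_pre - 1) // n_pre - 1
        return c_post[max(idx, 0)]          # clamp values below the control support

    counterfactual = [transform(y) for y in treated_pre]
    return mean(treated_post) - mean(counterfactual)

# Toy data in which control outcomes double between periods.
control_pre, control_post = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
treated_pre, treated_post = [2.0, 3.0, 4.0], [7.0, 9.0, 11.0]

effect = cic_att(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

On these numbers the usual diff-in-diff of means gives $$(9 - 3) - (5 - 2.5) = 3.5$$, whereas the changes-in-changes sketch gives a different answer because it transports each treated unit's pre-period outcome through the control group's change in distribution instead of adding the control group's mean change.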

Bonhomme and Sauder (2011) extend this idea to allow the shape of the outcome distributions to differ in the pre- and post-intervention periods. The cost of this additional flexibility is that they must assume additivity.

### Semiparametric estimation with time-varying covariates

One of the key identifying assumptions from Athey and Imbens (2006) is that the distribution of $$u$$ (all non-treatment and non-time factors) is invariant across time within the treated and control groups. That is, the distributions do not change with time. Stuart et al. (2014) circumvent this restriction by considering four distinct groups (control/pre-treatment, control/post-treatment, treated/pre-treatment, treated/post-treatment) rather than just two groups (control and treated) observed in two time periods. With the four groups, the distribution of $$u$$ can change over time. A consequence of this setup is that it no longer makes sense to talk about the diff-in-diff parameter as the effect of treatment on the treated; instead, the estimand is defined as the effect of treatment on the treated in the pre-treatment period.

The estimator uses propensity scores for predicting the probabilities of each observation being in each of the four groups. We can use some kind of multinomial regression. The treatment effect is then calculated as a weighted average of the observed outcomes (see section 2.4 of Stuart et al. (2014)).

# Nonlinear models

Much of diff-in-diff theory and its applications focus on continuous outcomes, but nonlinear outcomes are common too. Nonlinear outcomes include binary outcomes such as death status or count outcomes such as the number of hospitalizations. If we have a binary outcome, we can model the probability directly with a linear probability model; the downside to this approach is that predicted probabilities can fall outside of the $$[0, 1]$$ range. We can restrict predicted probabilities within $$[0, 1]$$ using an appropriate transformation — logit and probit transformations are perhaps the most common. However, in doing so we lose a lot of nice properties that come with the linear model. Ai and Norton (2003) first pointed out this vexing occurrence in relation to diff-in-diff. In particular, they showed that the cross-partial effect can be nonzero even when the treatment/post-period interaction term is 0.

Puhani (2012) noted that while the point from Ai and Norton (2003) is true, the true diff-in-diff estimate is still taken directly from the interaction term in the model. He shows that diff-in-diff is actually a difference of two cross-partial derivatives so the interaction term always has the same sign as the diff-in-diff effect (not necessarily the case in Ai and Norton (2003)). Thus, inference on the treatment effect can be conducted through the usual test of the interaction parameter.

The Karaca-Mandic, Norton, and Dowd (2012) paper ties the previous two papers together. They show in Figures 3 and 4 how the diff-in-diff effect (on the probability scale) can change as the value of the linear predictor $$X\beta$$ changes, even when the model does not include an interaction term. The authors then go through an interactive example using Stata, which might be useful to researchers intending to do a diff-in-diff analysis with a nonlinear model.
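A small numeric sketch of the scale issue discussed in these papers: in a logit model with no treatment-by-period interaction, the double difference is zero on the log-odds scale but not on the probability scale. The coefficient values below are made up for illustration and do not come from any of the cited papers.

```python
from math import exp

def expit(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1 / (1 + exp(-z))

# Hypothetical logit coefficients: intercept, treated, post, treated x post.
b0, b_treat, b_post, b_inter = -2.0, 1.0, 1.0, 0.0

def prob(treated, post):
    return expit(b0 + b_treat * treated + b_post * post + b_inter * treated * post)

# Double difference on the log-odds scale is just the interaction coefficient.
did_logit = b_inter

# Double difference of predicted probabilities across the four cells.
did_prob = (prob(1, 1) - prob(1, 0)) - (prob(0, 1) - prob(0, 0))
print(did_logit, did_prob)
```

The probability-scale difference is nonzero here only because the two groups start at different points on the S-shaped logistic curve, which is the kind of dependence on the linear predictor that the figures in Karaca-Mandic, Norton, and Dowd (2012) illustrate.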

## Acknowledgments

Thank you to Savannah Bergquist, Austin Denteh, Alex McDowell, Arman Oganisian, Toyya Pujol-Mitchell, and Kathy Swartz for their helpful comments.

# References

Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” Review of Economic Studies 72: 1–19. doi:10.1111/0034-6527.00321.

Abadie, Alberto, and Matias D. Cattaneo. 2018. “Econometric Methods for Program Evaluation.” Annual Review of Economics 10 (1): 465–503. doi:10.1146/annurev-economics-080217-053402.

Abadie, Alberto, and Javier Gardeazabal. 2003. “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review 93 (1): 113–32. doi:10.1257/000282803321455188.

Abadie, Alberto, Susan Athey, Guido Imbens, and Jeffrey Wooldridge. 2017. “When Should You Adjust Standard Errors for Clustering?” arXiv:1710.02926 [Econ, Math, Stat], October. http://arxiv.org/abs/1710.02926.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105: 493–505. doi:10.1198/jasa.2009.ap08746.

———. 2011. “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies.” Journal of Statistical Software 42 (1): 1–17. doi:10.18637/jss.v042.i13.

———. 2015. “Comparative Politics and the Synthetic Control Method.” American Journal of Political Science 59 (2): 495–510. doi:10.1111/ajps.12116.

Ai, Chunrong, and Edward C. Norton. 2003. “Interaction Terms in Logit and Probit Models.” Economics Letters 80 (1): 123–29. doi:10.1016/S0165-1765(03)00032-6.

Altman, Douglas G., and J. Martin Bland. 1995. “Statistics Notes: Absence of Evidence Is Not Evidence of Absence.” BMJ 311 (7003): 485. doi:10.1136/bmj.311.7003.485.

Angrist, J. D. 2001. “Estimation of Limited Dependent Variable Models with Dummy Endogenous Regressors: Simple Strategies for Empirical Practice.” Journal of Business and Economic Statistics 18: 2–28. doi:10.1198/07350010152472571.

Angrist, J. D., and J.-S. Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton, NJ: Princeton University Press. http://www.mostlyharmlesseconometrics.com/.

Angrist, J., and J.-S. Pischke. 2010. “The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con Out of Econometrics.” 15794. Cambridge, MA: National Bureau of Economic Research. http://www.nber.org/papers/w15794.

Athey, Susan, and Guido Imbens. 2006. “Identification and Inference in Nonlinear Difference-in-Differences Models.” Econometrica 74 (2): 431–97. doi:10.1111/j.1468-0262.2006.00668.x.

———. 2016. “The State of Applied Econometrics - Causality and Policy Evaluation.” arXiv:1607.00699 [Econ, Stat], July. http://arxiv.org/abs/1607.00699.

———. 2018. “Design-Based Analysis in Difference-in-Differences Settings with Staggered Adoption.” arXiv:1808.05293 [Cs, Econ, Math, Stat], August. http://arxiv.org/abs/1808.05293.

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2017. “Matrix Completion Methods for Causal Panel Data Models.” arXiv:1710.10251 [Econ, Math, Stat], October. http://arxiv.org/abs/1710.10251.

Bai, Jushan. 2009. “Panel Data Models with Interactive Fixed Effects.” Econometrica 77 (4): 1229–79. doi:10.3982/ECTA6135.

Basu, Sanjay, Ankita Meghani, and Arjumand Siddiqi. 2017. “Evaluating the Health Impact of Large-Scale Public Policy Changes: Classical and Novel Approaches.” Annual Review of Public Health 38 (1): 351–70. doi:10.1146/annurev-publhealth-031816-044208.

Bauhoff, Sebastian. 2014. “The Effect of School District Nutrition Policies on Dietary Intake and Overweight: A Synthetic Control Approach.” Economics & Human Biology 12 (January): 45–55. doi:10.1016/j.ehb.2013.06.001.

Bertrand, M., E. Duflo, and S. Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” Quarterly Journal of Economics 119: 249–75. doi:10.1162/003355304772839588.

Bilinski, Alyssa, and Laura A. Hatfield. 2018. “Seeking Evidence of Absence: Reconsidering Tests of Model Assumptions.” arXiv:1805.03273 [Stat], May. http://arxiv.org/abs/1805.03273.

Blundell, Richard, and Monica Costa Dias. 2009. “Alternative Approaches to Evaluation in Empirical Microeconomics.” Journal of Human Resources 44 (3): 565–640. doi:10.3368/jhr.44.3.565.

Bonhomme, Stéphane, and Ulrich Sauder. 2011. “Recovering Distributions in Difference-in-Differences Models: A Comparison of Selective and Comprehensive Schooling.” The Review of Economics and Statistics 93 (May): 479–94. doi:10.1162/REST_a_00164.

Brown, Timothy Tyler, and Juan Pablo Atal. 2018. “How Robust Are Reference Pricing Studies on Outpatient Medical Procedures? Three Different Preprocessing Techniques Applied to Difference-in Differences.” Health Economics, November. doi:10.1002/hec.3841.

Cameron, A. Colin, and Douglas L. Miller. 2015. “A Practitioner’s Guide to Cluster-Robust Inference.” Journal of Human Resources 50 (2): 317–72. doi:10.3368/jhr.50.2.317.

Chabé-Ferret, Sylvain. 2015. “Analysis of the Bias of Matching and Difference-in-Difference Under Alternative Earnings and Selection Processes.” Journal of Econometrics 185 (1): 110–23. doi:10.1016/j.jeconom.2014.09.013.

———. 2017. “Should We Combine Difference in Differences with Conditioning on Pre-Treatment Outcomes?” 17-824. Toulouse School of Economics. https://www.tse-fr.eu/publications/should-we-combine-difference-differences-conditioning-pre-treatment-outcomes.

Chernozhukov, Victor, Kaspar Wüthrich, and Yinchu Zhu. 2017. “An Exact and Robust Conformal Inference Method for Counterfactual and Synthetic Controls.” arXiv:1712.09089 [Econ, Stat], December. http://arxiv.org/abs/1712.09089.

Conley, Timothy G., and Christopher R. Taber. 2011. “Inference with ‘Difference in Differences’ with a Small Number of Policy Changes.” The Review of Economics and Statistics 93 (February): 113–25. doi:10.1162/REST_a_00049.

Daw, Jamie R., and Laura A. Hatfield. 2018a. “Matching and Regression-to-the-Mean in Difference-in-Differences Analysis.” Health Services Research 53 (6): 4138–56. doi:10.1111/1475-6773.12993.

———. 2018b. “Matching in Difference-in-Differences: Between a Rock and a Hard Place.” Health Services Research 53 (6): 4111–17. doi:10.1111/1475-6773.13017.

Dimick, J. B., and Andrew M. Ryan. 2014. “Methods for Evaluating Changes in Health Care Policy: The Difference-in-Differences Approach.” JAMA 312 (December): 2401–2. doi:10.1001/jama.2014.16153.

Donald, Stephen G., and Kevin Lang. 2007. “Inference with Difference-in-Differences and Other Panel Data.” The Review of Economics and Statistics 89 (May): 221–33. doi:10.1162/rest.89.2.221.

Doudchenko, N., and G. W. Imbens. 2016. “Balancing, Regression, Difference-in-Differences and Synthetic Control Methods: A Synthesis.” 22791. Cambridge, MA: National Bureau of Economic Research. http://www.nber.org/papers/w22791.

Dube, Arindrajit, and Ben Zipperer. 2015. “Pooling Multiple Case Studies Using Synthetic Controls: An Application to Minimum Wage Policies.” IZA DP 8944. Bonn, Germany: Institute for the Study of Labor. https://www.iza.org/publications/dp/8944.

Ferman, Bruno, and Cristine Pinto. 2016. “Revisiting the Synthetic Control Estimator.” 86495. Munich: MPRA. https://mpra.ub.uni-muenchen.de/86495/.

Ferman, Bruno, Cristine Pinto, and Vitor Possebom. 2017. “Cherry Picking with Synthetic Controls.” 78213. https://mpra.ub.uni-muenchen.de/78213/.

Fretheim, Atle, Fang Zhang, Dennis Ross-Degnan, Andrew D. Oxman, Helen Cheyne, Robbie Foy, Steve Goodacre, et al. 2015. “A Reanalysis of Cluster Randomized Trials Showed Interrupted Time-Series Studies Were Valuable in Health System Evaluation.” Journal of Clinical Epidemiology 68 (3): 324–33. doi:10.1016/j.jclinepi.2014.10.003.

Freyaldenhoven, Simon, Christian Hansen, and Jesse M. Shapiro. 2018. “Pre-Event Trends in the Panel Event-Study Design.” 24565. Cambridge, MA: National Bureau of Economic Research. http://www.nber.org/papers/w24565.

Gaibulloev, Khusrav, Todd Sandler, and Donggyu Sul. 2014. “Dynamic Panel Analysis Under Cross-Sectional Dependence.” Political Analysis 22 (2): 258–73. doi:10.1093/pan/mpt029.

Glymour, M. Maria, Jennifer Weuve, Lisa F. Berkman, Ichiro Kawachi, and James M. Robins. 2005. “When Is Baseline Adjustment Useful in Analyses of Change? An Example with Education and Cognitive Change.” American Journal of Epidemiology 162 (3): 267–78. doi:10.1093/aje/kwi187.

Gobillon, Laurent, and Thierry Magnac. 2015. “Regional Policy Evaluation: Interactive Fixed Effects and Synthetic Controls.” The Review of Economics and Statistics 98 (3): 535–51. doi:10.1162/REST_a_00537.

Goodman-Bacon, Andrew. 2018. “Difference-in-Differences with Variation in Treatment Timing.” 25018. National Bureau of Economic Research. https://www.nber.org/papers/w25018.

Greenaway-McGrevy, Ryan, Chirok Han, and Donggyu Sul. 2012. “Asymptotic Distribution of Factor Augmented Estimators for Panel Regression.” Journal of Econometrics, Recent Advances in Panel Data, Nonlinear and Nonparametric Models: A Festschrift in Honor of Peter C.B. Phillips, 169 (1): 48–53. doi:10.1016/j.jeconom.2012.01.003.

Greene, William. 2004. “The Behaviour of the Maximum Likelihood Estimator of Limited Dependent Variable Models in the Presence of Fixed Effects.” The Econometrics Journal 7 (1): 98–119. doi:10.1111/j.1368-423X.2004.00123.x.

———. 2010. “Testing Hypotheses About Interaction Terms in Nonlinear Models.” Economics Letters 107 (2): 291–96. doi:10.1016/j.econlet.2010.02.014.

Hahn, Jinyong, and Ruoyao Shi. 2017. “Synthetic Control and Inference.” Econometrics 5 (4): 52. doi:10.3390/econometrics5040052.

Han, B., H. Yu, and M. W. Friedberg. 2017. “Evaluating the Impact of Parent-Reported Medical Home Status on Children’s Health Care Utilization, Expenditures, and Quality: A Difference-in-Differences Analysis with Causal Inference Methods.” Health Services Research 52 (April): 786–806. doi:10.1111/1475-6773.12512.

Hartman, Erin, and F. Daniel Hidalgo. 2018. “An Equivalence Approach to Balance and Placebo Tests.” American Journal of Political Science. doi:10.1111/ajps.12387.

Imai, Kosuke, and In Song Kim. In Press. “When Should We Use Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science. http://web.mit.edu/insong/www/pdf/FEmatch.pdf.

Imbens, G. W., and J. D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62: 467–75. doi:10.2307/2951620.

Kahn-Lang, Ariella, and Kevin Lang. 2018. “The Promise and Pitfalls of Differences-in-Differences: Reflections on ‘16 and Pregnant’ and Other Applications.” 24857. Cambridge, MA: National Bureau of Economic Research. doi:10.3386/w24857.

Karaca-Mandic, Pinar, Edward C. Norton, and Bryan Dowd. 2012. “Interaction Terms in Nonlinear Models.” Health Services Research 47 (1pt1): 255–74. doi:10.1111/j.1475-6773.2011.01314.x.

Kaul, Ashok, Stefan Kloßner, Gregor Pfeifer, and Manuel Schieler. 2015. “Synthetic Control Methods: Never Use All Pre-Intervention Outcomes Together with Covariates.” 83790. https://mpra.ub.uni-muenchen.de/id/eprint/83790.

King, Gary, and Langche Zeng. 2006. “The Dangers of Extreme Counterfactuals.” Political Analysis 14 (2): 131–59. doi:10.1093/pan/mpj004.

Kinn, Daniel. 2018. “Synthetic Control Methods and Big Data.” arXiv:1803.00096 [Econ], February. http://arxiv.org/abs/1803.00096.

Kreif, N., R. Grieve, D. Hangartner, A. J. Turner, S. Nikolova, and M. Sutton. 2016. “Examination of the Synthetic Control Method for Evaluating Health Policies with Multiple Treated Units.” Health Economics 25 (December): 1514–28. doi:10.1002/hec.3258.

Kropko, Jonathan, and Robert Kubinec. 2018. “Why the Two-Way Fixed Effects Model Is Difficult to Interpret, and What to Do About It.” https://ssrn.com/abstract=3062619.

Lindner, Stephan, and K. John McConnell. 2018. “Difference-in-Differences and Matching on Outcomes: A Tale of Two Unobservables.” Health Services and Outcomes Research Methodology, October. doi:10.1007/s10742-018-0189-0.

Lipsitch, Marc, Eric Tchetgen Tchetgen, and Ted Cohen. 2010. “Negative Controls: A Tool for Detecting Confounding and Bias in Observational Studies.” Epidemiology 21 (3): 383–88. doi:10.1097/EDE.0b013e3181d61eeb.

Lopez Bernal, J., S. Soumerai, and A. Gasparrini. 2018. “A Methodological Framework for Model Selection in Interrupted Time Series Studies.” Journal of Clinical Epidemiology, June. doi:10.1016/j.jclinepi.2018.05.026.

MacKinnon, James G., and Matthew D. Webb. 2018. “Randomization Inference for Difference-in-Differences with Few Treated Clusters.” 1355. Kingston, Ontario: Queen’s University. http://qed.econ.queensu.ca/working_papers/papers/qed_wp_1355.pdf.

McKenzie, David. 2012. “Beyond Baseline and Follow-up: The Case for More T in Experiments.” Journal of Development Economics 99 (2): 210–21. doi:10.1016/j.jdeveco.2012.01.002.

Meyer, B. D. 1995. “Natural and Quasi-Experiments in Economics.” Journal of Business & Economic Statistics 13 (2): 151–61. doi:10.2307/1392369.

Moon, Hyungsik Roger, and Martin Weidner. 2015. “Linear Regression for Panel with Unknown Number of Factors as Interactive Fixed Effects.” Econometrica 83 (4): 1543–79. doi:10.3982/ECTA9382.

Mora, R., and I. Reggio. 2012. “Treatment Effect Identification Using Alternative Parallel Assumptions.” Working Paper 12-33. Madrid: Universidad Carlos III de Madrid. http://hdl.handle.net/10016/16065.

Mummolo, Jonathan, and Erik Peterson. 2018. “Improving the Interpretation of Fixed Effects Regression Results.” Political Science Research and Methods, January, 1–7. doi:10.1017/psrm.2017.44.

O’Neill, S., N. Kreif, R. Grieve, M. Sutton, and J. S. Sekhon. 2016. “Estimating Causal Effects: Considering Three Alternatives to Difference-in-Differences Estimation.” Health Services and Outcomes Research Methodology 16: 1–21. doi:10.1007/s10742-016-0146-8.

Pesaran, M. Hashem. 2006. “Estimation and Inference in Large Heterogeneous Panels with a Multifactor Error Structure.” Econometrica 74 (4): 967–1012. http://www.jstor.org/stable/3805914.

Powell, David. 2018. “Imperfect Synthetic Controls: Did the Massachusetts Health Care Reform Save Lives?” WR-1246. Santa Monica, CA: RAND Labor & Population. www.rand.org/pubs/working_papers/WR1246.html.

Puhani, Patrick A. 2012. “The Treatment Effect, the Cross Difference, and the Interaction Term in Nonlinear ‘Difference-in-Differences’ Models.” Economics Letters 115 (1): 85–87. doi:10.1016/j.econlet.2011.11.025.

Pustejovsky, James E., and Elizabeth Tipton. 2018. “Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.” Journal of Business & Economic Statistics 36 (4): 672–83. doi:10.1080/07350015.2016.1247004.

Reese, Simon, and Joakim Westerlund. 2018. “Estimation of Factor-Augmented Panel Regressions with Weakly Influential Factors.” Econometric Reviews 37 (5): 401–65. doi:10.1080/07474938.2015.1106758.

Robbins, Michael W., Jessica Saunders, and Beau Kilmer. 2017. “A Framework for Synthetic Control Methods with High-Dimensional, Micro-Level Data: Evaluating a Neighborhood-Specific Crime Intervention.” Journal of the American Statistical Association 112 (517): 109–26. doi:10.1080/01621459.2016.1213634.

Rokicki, S., J. Cohen, G. Fink, J. A. Salomon, and M. B. Landrum. 2018. “Inference with Difference-in-Differences with a Small Number of Groups: A Review, Simulation Study, and Empirical Application Using SHARE Data.” Medical Care 56 (January): 97–105. doi:10.1097/MLR.0000000000000830.

Roth, Jonathan. 2018. “Should We Adjust for the Test for Pre-Trends in Difference-in-Difference Designs?” arXiv:1804.01208 [Econ, Math, Stat], April. http://arxiv.org/abs/1804.01208.

Ryan, Andrew M. 2018. “Well-Balanced or Too Matchy-Matchy? The Controversy over Matching in Difference-in-Differences.” Health Services Research 53 (6): 4106–10. doi:10.1111/1475-6773.13015.

Ryan, Andrew M., J. F. Burgess, and J. B. Dimick. 2015. “Why We Should Not Be Indifferent to Specification Choices for Difference-in-Differences.” Health Services Research, December. doi:10.1111/1475-6773.12270.

Samartsidis, Pantelis, Shaun R. Seaman, Anne M. Presanis, Matthew Hickman, and Daniela De Angelis. 2018. “Review of Methods for Assessing the Causal Effect of Binary Interventions from Aggregate Time-Series Observational Data.” arXiv:1804.07683v1 [Stat.AP], April. https://arxiv.org/abs/1804.07683.

Sofer, Tamar, David B. Richardson, Elena Colicino, Joel Schwartz, and Eric J. Tchetgen Tchetgen. 2016. “On Negative Outcome Control of Unobserved Confounding as a Generalization of Difference-in-Differences.” Statistical Science 31 (3): 348–61. doi:10.1214/16-STS558.

Stuart, Elizabeth A., Haiden A. Huskamp, Kenneth Duckworth, Jeffrey Simmons, Zirui Song, Michael E. Chernew, and Colleen L. Barry. 2014. “Using Propensity Scores in Difference-in-Differences Models to Estimate the Effects of a Policy Change.” Health Services and Outcomes Research Methodology 14 (4): 166–82. doi:10.1007/s10742-014-0123-z.

VanderWeele, Tyler J., and Ilya Shpitser. 2013. “On the Definition of a Confounder.” The Annals of Statistics 41 (1): 196–220. doi:10.1214/12-AOS1058.

Wing, Coady, Kosali Simon, and Ricardo A. Bello-Gomez. 2018. “Designing Difference in Difference Studies: Best Practices for Public Health Policy Research.” Annual Review of Public Health 39 (1): 453–69. doi:10.1146/annurev-publhealth-040617-013507.

Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis 25 (1): 57–76. doi:10.1017/pan.2016.2.

© 2019 Bret Zeldow and Laura Hatfield