# Introduction

After a new law or policy is enacted, we often want to determine whether or not it was effective with respect to its goals. Difference-in-differences (diff-in-diff) is one study design used to answer such questions. To use diff-in-diff, we need observed outcomes of people who were exposed to the intervention (treated) and people not exposed to the intervention (control), both before and after the intervention. For example, suppose California (treated) enacts a new health care law designed to lower health care spending, but neighboring Nevada (control) does not. We can estimate the effect of the new law by comparing how the health care spending in these two states changes before and after its implementation.

Thanks to its apparent simplicity, diff-in-diff can be mistaken for a “quick and easy” way to answer causal questions. However, as we peer under the hood of diff-in-diff and illuminate its inner workings, we can appreciate that the method is more complex than it seems.

# Notation

Before we begin, please refer to the table below which provides a guide to the notation used throughout.

Symbol | Meaning |
---|---|
\(Y(t)\) | Observed outcome at time \(t\) |
\(A=0\) | Control |
\(A=1\) | Treated |
\(t=1,\ldots,T_0\) | Pre-treatment times |
\(t=T_0+1,\ldots,T\) | Post-treatment times |
\(Y^a(t)\) | Potential outcome with treatment \(A = a\) at time \(t\) |
\(X\) | Observed covariates |
\(U\) | Unobserved covariates |

# Target estimand

At the outset of any analysis, we first define a study question, such as “Did the new California law *actually* reduce health care spending?” This particular question is aimed at determining causality. That is, we want to know whether the new law *caused* spending to go down, not whether spending went down for other reasons.

Next, we transform our question into a statistical quantity called a *target estimand*. The target estimand, or target parameter, is a statistical representation of our policy question. For example, the target estimand might be “average health care spending in California after the new law minus average health care spending in California if the law had not been passed.” This target estimand is written in terms of potential outcomes. In our toy scenario, California has two potential outcomes: health care spending under the new law and health care spending without the new law. Only one of these is observable (spending with the new law); the other is unobservable because it didn’t happen (spending without the new law).

Third, we choose an *estimator*, which is an algorithm that uses data to help us learn about the target estimand. Here, we focus on the diff-in-diff estimator, which relies on some strong assumptions, including that health care spending in Nevada can help us understand what would have happened in California without the new law. That’s how we can use observed data to learn about a target estimand that is written in terms of unobservable outcomes. More on this later.

With all these elements in place, we can now compute our *estimate*, the value obtained for the estimand by applying the estimator to the observed data.

To recap,

- The quantity we care about is called the *estimand*. We choose a target estimand that corresponds to our policy question and express it in terms of potential outcomes.
- The algorithm that takes data as input and produces a value of the estimand is called the *estimator*.
- The estimator’s output, given data input, is called the *estimate*. This value represents our best guess at the estimand, given the data we have.

As noted above, we define the target estimand in terms of potential outcomes. In the California example, we used the average effect of treatment on the treated (ATT). This compares the potential outcomes with treatment to the potential outcomes with no treatment, *in the treated group*. For a diff-in-diff, the ATT is the effect of treatment on the treated group *in the post-treatment period*. Written mathematically, the ATT is

**Average effect of treatment on the treated (ATT)**
\[\begin{equation*}
ATT \equiv \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1\right]
\end{equation*}\]

Recall that \(Y^a(t)\) is the potential outcome given treatment \(a\) at time \(t\). Here, \(t = 2\) represents the post-treatment period, \(a = 1\) represents treatment and \(a = 0\) represents no treatment. Translated literally, the equation is \[\begin{equation*} \mbox{Expected}\left[\mbox{Spending in CA with the new law} - \mbox{Spending in CA without the new law}\right] \end{equation*}\]

If we could observe the potential outcomes both with treatment and with no treatment, estimating the ATT would be easy. We would simply calculate the difference in these two potential outcomes for each treated unit, and take the average. However, we can never observe both potential outcomes at the same time. In the treated group, the potential outcomes with treatment are *factual* (we can observe them), but the potential outcomes with no treatment are *counterfactual* (we cannot observe them).

So how do we estimate the ATT when some of the potential outcomes are unobservable? In diff-in-diff, we use data from the control group to impute untreated outcomes in the treated group. This is the “secret sauce” of diff-in-diff. Using the control group helps us learn something about the unobservable counterfactual outcomes of the treated group. However, it requires us to make some strong assumptions. Next, we discuss assumptions required for diff-in-diff.

# Assumptions

### Consistency

For diff-in-diff, the treatment status of a unit can vary over time. However, we only permit two treatment histories: never treated (the control group) and treated in the post-intervention period only (the treated group). Thus, we will use \(A=0\) and \(A=1\) to represent the control and treated groups, with the understanding that the treated group only receives treatment whenever \(t > T_0\) (see notation).

Every unit has two potential outcomes, but we only observe one — the one corresponding to their actual treatment status. The consistency assumption links the potential outcomes \(Y^a(t)\) at time \(t\) with treatment \(a\) to the observed outcomes \(Y(t)\).

**Consistency Assumption**

\[
Y(t) = (1 - A) \cdot Y^0(t) + A \cdot Y^1(t)
\]

If a unit is treated \((A=1)\), then the observed outcome is the potential outcome with treatment \(Y(t) = Y^1(t)\) and the potential outcome with no treatment \(Y^0(t)\) is unobserved. If a unit is not treated \((A=0)\), then \(Y(t) = Y^0(t)\) and \(Y^1(t)\) is unobserved.

We also assume that future treatment does not affect past outcomes. Thus, in the pre-intervention period, the potential outcome with (future) treatment and the potential outcome with no (future) treatment are the same. We write this assumption mathematically as

**Arrow of time**

\[
Y(t) = Y^0(t) = Y^1(t),\; \mbox{for}\ t \leq T_0
\]
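
To make the two conditions concrete, here is a minimal R sketch with simulated potential outcomes. Everything in it (the sample size, the effect of -3, the variable names) is made up for illustration; the point is only that the observed outcomes are assembled from the potential outcomes exactly as the Consistency Assumption and the arrow-of-time condition prescribe.

```r
# Simulated potential outcomes for a two-period design (t = 1 pre, t = 2 post).
# All numbers are hypothetical.
set.seed(1)
n <- 1000
a <- rbinom(n, 1, 0.5)            # group indicator: 1 = treated, 0 = control

y0_1 <- 10 + 2 * a + rnorm(n)     # Y^0(1): untreated outcome, pre-period
y0_2 <- y0_1 + 1 + rnorm(n)       # Y^0(2): untreated outcome, post-period
y1_1 <- y0_1                      # arrow of time: Y^1(1) = Y^0(1)
y1_2 <- y0_2 - 3                  # Y^1(2): treatment lowers the outcome by 3

# Consistency: the observed outcome is the potential outcome under the
# treatment history the unit actually experienced.
y_1 <- (1 - a) * y0_1 + a * y1_1
y_2 <- (1 - a) * y0_2 + a * y1_2

# The ATT can be computed directly here only because the simulation reveals
# Y^0(2) for treated units; in real data that quantity is never observed.
att_true <- mean(y1_2[a == 1] - y0_2[a == 1])
att_true
```

In real data we would hold only `a`, `y_1`, and `y_2`; the remaining assumptions are what let us recover the ATT from those three columns.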

### Counterfactual assumption (Parallel Trends)

A second key assumption we make is that the change in outcomes from pre- to post-intervention in the control group is a good proxy for the *counterfactual* change in untreated potential outcomes in the treated group. When we observe the treated and control units only once before treatment \((t=1)\) and once after treatment \((t=2)\), we write this as:

**Counterfactual Assumption (1)**
\[\begin{align*}
\mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] = \\
\nonumber \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right]
\end{align*}\]

This is an *assumption* — not something we can test — because it involves unobserved counterfactual outcomes, namely \(Y^0(2)\) for \(A = 1\).

In the Shiny app embedded below, we can see what the counterfactual assumption does and how we calculate the ATT under our assumptions. The solid black lines represent the observed data. When we click the “Impute Counterfactual from Control to Treated” button, the slope of the line of the control group is imputed to the treated group (dashed line). Finally, clicking the “Show diff-in-diff effect” button reveals how we calculate the average effect of treatment on the treated (ATT).

Traditionally, this assumption is called the parallel trends assumption, but as we will soon see, that term can be ambiguous.

### Positivity Assumption

Lastly, we make a positivity assumption. With the positivity assumption, we assume that treatment is not deterministic for any value of \(X\). Thus, for any \(X = x\), the probability of being treated (or untreated) lies strictly between 0 and 1.

**Positivity Assumption**
\[\begin{equation*}
0 < P(A = 1 | X) < 1 \; \text{ for all } X.
\end{equation*}\]

We will invoke the positivity assumption explicitly when we discuss semiparametric and nonparametric estimators.

## Identification

Using the assumptions above, we can re-write the target estimand (which involved unobserved counterfactuals) in a form that depends only on observed outcomes. This process is called “identification”.

For diff-in-diff, identification begins with the ATT and applies the Counterfactual Assumption (1) and the Consistency Assumption. The result is the familiar diff-in-diff estimator

\[\begin{align*} ATT &\equiv \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1\right] \\ &= \lbrace \mathbb{E}\left[Y(2) \mid A = 1\right] - \mathbb{E}\left[Y(1) \mid A = 1\right] \rbrace - \\ & \ \ \ \ \ \ \lbrace \mathbb{E}\left[Y(2) \mid A = 0\right] - \mathbb{E}\left[Y(1) \mid A = 0\right] \rbrace \end{align*}\]
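
The intermediate steps can be sketched as follows. This is our own reconstruction from the assumptions stated above rather than a derivation quoted from a particular reference. The second line adds and subtracts \(\mathbb{E}\left[Y^0(1) \mid A = 1\right]\), the third line applies Counterfactual Assumption (1), and the last line applies the Consistency Assumption together with the arrow-of-time condition, which gives \(Y(1) = Y^0(1)\) in both groups:

\[\begin{align*}
ATT &= \mathbb{E}\left[Y^1(2) \mid A = 1\right] - \mathbb{E}\left[Y^0(2) \mid A = 1\right] \\
&= \lbrace \mathbb{E}\left[Y^1(2) \mid A = 1\right] - \mathbb{E}\left[Y^0(1) \mid A = 1\right] \rbrace - \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] \\
&= \lbrace \mathbb{E}\left[Y^1(2) \mid A = 1\right] - \mathbb{E}\left[Y^0(1) \mid A = 1\right] \rbrace - \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] \\
&= \lbrace \mathbb{E}\left[Y(2) \mid A = 1\right] - \mathbb{E}\left[Y(1) \mid A = 1\right] \rbrace - \lbrace \mathbb{E}\left[Y(2) \mid A = 0\right] - \mathbb{E}\left[Y(1) \mid A = 0\right] \rbrace
\end{align*}\]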

For a straightforward estimate of the ATT, we could simply plug in the sample averages for the four expectations on the right-hand side:

- The post-intervention average of the treated group for \(\mathbb{E}\left[Y(2) \mid A = 1\right]\);
- The pre-intervention average of the treated group for \(\mathbb{E}\left[Y(1) \mid A = 1\right]\);
- The post-intervention average of the control group for \(\mathbb{E}\left[Y(2) \mid A = 0\right]\);
- The pre-intervention average of the control group for \(\mathbb{E}\left[Y(1) \mid A = 0\right]\).

Finding the standard error for this estimator is a little more complex, but we could estimate it by bootstrapping, for example.
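
As an illustration, the plug-in estimator and a bootstrap standard error might be computed along these lines in R. The data are simulated, and the column names (`a`, `y1`, `y2`), the helper `did_means()`, and the choice of 2000 bootstrap replicates are all assumptions made for this sketch. We resample whole units so that each unit's pre- and post-period outcomes stay together.

```r
# Hypothetical two-period data, one row per unit:
#   a  = group (1 treated, 0 control), y1 = pre outcome, y2 = post outcome.
set.seed(2)
n <- 500
a <- rbinom(n, 1, 0.4)
y1 <- 5 + 1.5 * a + rnorm(n)
y2 <- y1 + 0.5 - 2 * a + rnorm(n)   # the simulated ATT is -2
dat <- data.frame(a = a, y1 = y1, y2 = y2)

# Plug-in estimator: combine the four sample means.
did_means <- function(d) {
  (mean(d$y2[d$a == 1]) - mean(d$y1[d$a == 1])) -
    (mean(d$y2[d$a == 0]) - mean(d$y1[d$a == 0]))
}
att_hat <- did_means(dat)

# Nonparametric bootstrap over units (rows), with replacement.
boot <- replicate(2000, did_means(dat[sample(nrow(dat), replace = TRUE), ]))
se_hat <- sd(boot)

c(estimate = att_hat, se = se_hat)
```

With clustered data (for example, people within states), the resampling unit would need to be the cluster rather than the row; this sketch assumes independent units.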

Sometimes the counterfactual assumption may hold only after conditioning on some observed covariates, and the identification becomes more complex. More on this in the Confounding section.

# Multiple time periods

When we observe the treated and control units multiple times before and after treatment, we must adapt the target estimand and identifying assumptions accordingly. Let’s start by looking at possible target estimands.

## Target Estimands

We can calculate the ATT at *any* of the post-treatment time points

**Time-varying ATT**

*Individual time points*

For some \(t > T_0\),
\[\begin{equation*}
ATT(t) \equiv \mathbb{E}\left[Y^1(t) - Y^0(t) \mid A = 1\right]
\end{equation*}\]

or we can compute the *average* ATT across the post-treatment time points

**Time-varying ATT**

*Averaged over time points*

\[\begin{equation*}
ATT \equiv \mathbb{E}\left[\overline{Y^1}_{\{t>T_0\}} - \overline{Y^0}_{\{t>T_0\}} \mid A = 1\right]
\end{equation*}\]

Here, the overbar \(\overline{{\color{white} Y}}\) indicates averaging and the subscript \(_{\{t>T_0\}}\) refers to the time points over which the outcome is averaged.

The above estimands make sense when the treatment is administered at the same time for all treated groups. When treatment timing differences occur, Athey and Imbens (2018) and Goodman-Bacon (2018) discuss the weighted estimands that arise. We discuss diff-in-diff when there is variation in treatment timing briefly in the estimation section.

## Assumptions for Multiple Time Points

What kind of assumptions do we need to estimate the ATTs above? We consider several counterfactual assumptions that may require:

- parallel *average* outcomes in pre- to post-intervention periods,
- parallel outcome trends across *certain* time points, or
- parallel outcome trends across *all* time points.

First, consider an assumption that averages over the pre- and post-intervention time points, effectively collapsing back to the simple two-period case.

**Counterfactual Assumption (2a)**

*Avg pre, avg post*

\[\begin{align*}
\mathbb{E} \left[\overline{Y^0}_{\{t > T_0\}} - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 0\right] = \\
\mathbb{E} \left[\overline{Y^0}_{\{t > T_0\}} - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 1\right]
\end{align*}\]

Here, we assume that the difference between the *average* of the pre-intervention outcomes and the *average* of the untreated post-intervention outcomes is the same for both treated and control groups.
To identify the time-averaged ATT using this assumption, we use the same identification process as in the simple case with only one observation in each of the pre- and post-intervention periods.
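
A minimal sketch of this collapsing step in R, under Counterfactual Assumption (2a): we average each unit's outcomes within the pre- and post-intervention periods and then apply the two-period estimator to the averages. The panel is simulated, and its dimensions, variable names, and constant effect of 1.5 are assumptions for illustration only.

```r
# Hypothetical long-format panel: T0 = 3 pre-periods out of 6 periods.
set.seed(3)
n_units <- 200
T0 <- 3
n_times <- 6
panel <- expand.grid(id = 1:n_units, time = 1:n_times)
panel$a <- as.integer(panel$id <= 80)          # first 80 units are treated
panel$post <- as.integer(panel$time > T0)
panel$y <- 1 + 0.3 * panel$time + panel$a +
  1.5 * panel$a * panel$post + rnorm(nrow(panel))

# Collapse each unit to its pre-period and post-period averages.
is_pre <- panel$post == 0
pre_avg  <- tapply(panel$y[is_pre],  panel$id[is_pre],  mean)
post_avg <- tapply(panel$y[!is_pre], panel$id[!is_pre], mean)
grp <- tapply(panel$a, panel$id, max)          # one group label per unit

# Two-period diff-in-diff on the collapsed averages.
change <- post_avg - pre_avg
att_avg <- mean(change[grp == 1]) - mean(change[grp == 0])
att_avg
```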

**Counterfactual Assumption (2b)**

*One pre, one post*

For some \(t^* > T_0\), there exists a \(t' \leq T_0\) such that
\[\begin{align*}
\mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 1\right] = \\
\mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 0\right]
\end{align*}\]

Counterfactual Assumption (2b) is a restriction on the data at two time points, one before and one after treatment. In a sense, time points other than these two are not relevant. Or at least, the other time points need not satisfy the “parallel trends” assumption. While this assumption is perfectly valid if true, using such an assumption requires justification. For instance, why do we believe this assumption is satisfied for two time points but not the rest? To identify the ATT using this assumption, we again use the same identification process as in the simple case, since we are back to considering only one time point pre-intervention and one time point post-intervention.

**Counterfactual Assumption (2c)**

*Avg pre, one post*

For some post-treatment time point \(t^* > T_0\),
\[\begin{align*}
\mathbb{E}\left[Y^0(t^*) - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 0\right] = \\
\mathbb{E}\left[Y^0(t^*) - \overline{Y^0}_{\{t \leq T_0\}} \mid A = 1\right]
\end{align*}\]

In this version we assume that there are “parallel trends” between *one* post-intervention time point and the *average* of the pre-intervention outcomes.

**Counterfactual Assumption (2d)**

*All pre, one post*

For some \(t^* > T_0\) and each \(t' \leq T_0\):
\[\begin{align*}
\mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 1\right] = \\
\mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 0\right]
\end{align*}\]

Counterfactual Assumption (2d) is a stricter version of (2c), where parallel trends holds at post-intervention time \(t^*\) and every possible pre-intervention time point. Note that if Counterfactual Assumption (2d) holds, then Counterfactual Assumption (2c) also must hold, but the reverse is not necessarily true.

Finally, we get to the assumption we’ve been waiting for, in which the untreated potential outcomes evolve in parallel in the treatment and control groups at *every* pre- and post-intervention time point. This is the strictest version of parallel trends and is what researchers often mean by “parallel trends”.

**Counterfactual Assumption (2e)**

*All pre, all post*

For each \(t^* > T_0\) and each \(t' \leq T_0\):
\[\begin{align*}
\mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 1\right] = \\
\mathbb{E}\left[Y^0(t^*) - Y^0(t') \mid A = 0\right]
\end{align*}\]

This is the most restrictive because it requires parallel evolution of the untreated outcomes at *all* pre- and post-intervention time points.

# Parallel trends

Many papers that use diff-in-diff methodology have a line or two stating that they assume “parallel trends” without much further elaboration. As the above assumptions illustrate, the counterfactual assumptions are more diverse and more specific than this general statement suggests.

Sometimes authors explicitly impose parallel trends in the pre-treatment period only. This “parallel trends” assumption must be paired with a second assumption called “common shocks” (see Dimick and Ryan 2014; Ryan, Burgess, and Dimick 2015):

**Parallel pre-trends**

In the pre-intervention period, time trends in the outcome are the same in treated and control units.

**Common shocks**

In the post-intervention period, exogenous forces affect treated and control groups equally.

Stating the assumptions this way can be misleading for two reasons. First, not all identifying assumptions require strict parallel pre-intervention trends. For example, Counterfactual Assumptions (2a), (2b), and (2c) place no restriction on trends within the pre-intervention period; Counterfactual Assumption (2d) requires parallel trends in the pre-intervention period, but only Counterfactual Assumption (2e) demands parallel trends *throughout* the study.

Second, parallel pre-intervention trends is not an assumption at all! It is a testable empirical fact about the pre-intervention outcomes, involving no counterfactuals. By contrast, common shocks is an untestable assumption involving exogenous forces that are likely unknown to the researcher. See below for more discussion of parallel trends testing.

We prefer the counterfactual assumptions above because they are *explicitly* stated in terms of counterfactual outcomes, directly identify the diff-in-diff estimator, and avoid any false sense of security from tests of parallel trends.

Which assumptions are reasonable in the data you see? Use the app below to explore potential outcomes that satisfy each of the above assumptions. The app randomly generates outcomes for the control group then randomly generates untreated outcomes (counterfactuals in the post-intervention period) for a treated group that satisfy each assumption above. What do you have in mind when you say that you assume “parallel trends”? Does this match what you see in the app?

### Testing for Parallel Trends in the Pre-Treatment Period

Provided there are enough time points, researchers often test whether trends are parallel in the pre-intervention period. But the test of parallel trends is neither necessary nor sufficient to establish validity of diff-in-diff (Kahn-Lang and Lang 2018). Moreover, conditioning a test of the diff-in-diff effect on “passing” a test for parallel pre-period trends changes the performance of the whole procedure (Roth 2018).

Recognizing that editors, reviewers, and readers may still want to see tests of parallel trends despite our recommendation against it, one possible work-around is to reformulate the test using a different null hypothesis, namely one designed to show “equivalence” of the pre-period trends (Hartman and Hidalgo 2018). Other possibilities include procedures that re-formulate the model to allow for non-parallel pre-period trends and focus on how this impacts treatment effect estimates (Bilinski and Hatfield 2018; Rambachan and Roth 2019). In addition, robustness checks such as pre-period placebo intervention tests can assess the sensitivity of the conclusions to pre-period trend differences. For an excellent review of parallel trends testing in diff-in-diff, see McKenzie’s World Bank blog post.

### Equivalence tests

Our primary concern with (the usual) hypothesis tests of parallel trends (one in which the null hypothesis asserts parallel trends) is that we can never actually prove what we set out to prove. The only conclusions that emerge from a conventional hypothesis test are “fail to reject the null” or “reject the null.” The decision to “fail to reject” is decidedly different than accepting the null. And in tests for parallel trends, the null is typically that the trends are parallel. So we can never actually say that our trends are parallel using the default infrastructure. Maybe this is a problem for some and perhaps not for others. However, there is another problem with hypotheses for testing assumptions. Let’s delve briefly into a thought experiment where the “parallelness” of trends is captured by a single parameter \(\theta\) (where \(\theta = 0\) denotes two lines that are perfectly parallel). Deviations from zero (either negative or positive) denote departures from “parallelness” at varying magnitudes. The hypotheses for testing parallel trends look something like:

**\(H_0:\)** \(\theta = 0\)

**\(H_1:\)** \(\theta \neq 0\).

If we have a big enough sample size we can reject the null if the true value of \(\theta\) is 5 or 3 or 1 or 0.01. But do we really care about deviations of magnitude 0.01 compared to deviations of 5? It would be better if we could insert expert knowledge into this test and incorporate some margin for deviation in our test. Equivalence tests do just this, while at the same time reversing the order of the hypotheses. Let \(\tau\) denote an acceptable margin for deviations from parallel trends so that if \(|\theta| \leq \tau\), we feel OK saying that the trends are parallel (or close enough). The hypotheses for an equivalence test could be something like:

**\(H_0:\)** \(|\theta| > \tau\)

**\(H_1:\)** \(|\theta| \leq \tau\).

Equivalence tests are nothing new. They are sometimes used in clinical trials to determine if a new drug is no worse than a standard-of-care drug, for example. They also happen to provide an intuitive approach to testing for parallel trends in the pre-treatment periods. Unfortunately, this setup won’t solve all our (diff-in-diff) problems. Sample size considerations can be a hindrance in assumption testing, for one. However, this sort of issue arises no matter how we construct our testing framework, so we might as well set up our tests in a way that is more intuitive.
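
One common way to carry out such a test is the two one-sided tests (TOST) procedure. The R sketch below applies it to the difference in linear pre-period slopes. We do not claim this is the exact procedure of Hartman and Hidalgo (2018); the simulated data, the linear-trend model, and the margin `tau = 0.25` are all assumptions chosen for illustration, and in practice the margin needs subject-matter justification.

```r
# Hypothetical pre-period panel with 4 pre-treatment time points.
set.seed(4)
n_units <- 300
T0 <- 4
pre <- expand.grid(id = 1:n_units, time = 1:T0)
pre$a <- as.integer(pre$id <= 120)
pre$y <- 2 + 0.5 * pre$time + 1.0 * pre$a +
  0.02 * pre$a * pre$time + rnorm(nrow(pre))

# theta = difference in linear pre-period slopes (coefficient on a:time).
# Note: plain lm() standard errors ignore within-unit correlation, which is
# harmless here only because the simulated errors are independent.
fit <- lm(y ~ a * time, data = pre)
theta_hat <- unname(coef(fit)["a:time"])
se_theta <- sqrt(vcov(fit)["a:time", "a:time"])
df_resid <- fit$df.residual

tau <- 0.25   # equivalence margin (assumed)

# Two one-sided tests:
#   reject "theta <= -tau" when p_lower is small,
#   reject "theta >=  tau" when p_upper is small.
p_lower <- pt((theta_hat + tau) / se_theta, df_resid, lower.tail = FALSE)
p_upper <- pt((theta_hat - tau) / se_theta, df_resid, lower.tail = TRUE)
p_tost <- max(p_lower, p_upper)

c(theta_hat = theta_hat, p_tost = p_tost)
```

Rejecting at the 5% level here is the same as finding that the 90% confidence interval for \(\theta\) lies entirely inside \((-\tau, \tau)\).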

# Confounding

**Unconditionally unconfounded**

\[
Y^a \perp A
\]

**Conditionally unconfounded**

\[
Y^a \perp A \mid X
\]

In both of these versions, the treatment \(A\) is independent of the potential outcomes \(Y^a\), either unconditionally or conditional on \(X\). In practice, these relations are only satisfied in randomized trials; otherwise, there is no guarantee that \(X\) is sufficient to make \(A\) and \(Y^a\) conditionally independent. Even if we continue collecting covariates, it is likely that some unmeasured covariates \(U\) are still a common cause of \(A\) and \(Y^a\).

In diff-in-diff studies, the notion of confounding is fundamentally different. As alluded to in the previous section, confounding in diff-in-diff violates the counterfactual assumption when (1) the covariate is associated with treatment and (2) there is a time-varying relationship between the covariate and outcomes **or** there is differential time evolution in covariate distributions between the treatment and control populations (in which case the covariate must also have an effect on the outcome).

To see more in-depth discussions of confounding for diff-in-diff, we recommend Wing, Simon, and Bello-Gomez (2018) or Zeldow and Hatfield (2019).

## Time-Invariant versus Time-Varying Confounding

In an upcoming section, we will explicitly show the effect that a confounder has on the parallel trends assumption. Nevertheless, we begin our discussion of confounding in diff-in-diff by highlighting an important distinction: time-invariant and time-varying. When we have a covariate that satisfies certain properties (associated with treatment group and with outcome *trends*), parallel trends will not hold. As the name suggests, a time-invariant confounder is unaffected by time. It is typically measured prior to administering treatment and remains unaffected by treatment and other external factors.

Another, more pernicious, type of confounder is the time-varying confounder. Time-varying covariates freely change throughout the study. Examples of time-varying covariates seen in observational studies are concomitant medication use and occupational status. Time-varying covariates are particularly troublesome when they predict treatment status and then are subsequently affected by treatment, which in turn affects treatment status at the next time point. In effect, time-varying confounders act as both a confounder and a mediator. However, recall that treatment status in diff-in-diff is monotonic: the comparison group is always untreated, and the treated group only switches once, from untreated to treated.

With these treatment patterns in mind, let’s talk a bit about time-varying confounders. We need to assess whether the time-varying covariates are affected by treatment or not. In most cases, we cannot know for certain. For example, in a study assessing the effect of Medicaid expansion on hospital reimbursements, we can be fairly certain that the expansion affected insurance coverage in the population. On the other hand, factors such as the average age of the population might change from pre- to post-expansion. How much would Medicaid expansion have affected that change? If the validity of our diff-in-diff model relied on adjusting for these factors, we would have to account for these covariates in some way. Our next section will talk about estimation in diff-in-diff studies, including how to deal with confounding.

# Estimation

With the key identifying assumptions for diff-in-diff freshly in mind, we now turn our attention to estimating causal effects. Recall the simple estimator we identified above:

\[\begin{align*}
ATT &\equiv \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1\right] \\
&= \lbrace \mathbb{E}\left[Y(2) \mid A = 1\right] -
\mathbb{E}\left[Y(1) \mid A = 1\right] \rbrace - \\
& \ \ \ \ \ \ \lbrace \mathbb{E}\left[Y(2) \mid A = 0\right] -
\mathbb{E}\left[Y(1) \mid A = 0\right] \rbrace .
\end{align*}\]
Using sample means to estimate the ATT works well when there are two time periods and few covariates. In more challenging applications with many time points and many confounders, we will often specify a *model* that can readily be extended to more complex settings. Our discussion herein will motivate the use of regression as one way to estimate diff-in-diff parameters.

A typical linear model for the untreated outcomes \(Y^0_{it}\) (Athey and Imbens (2006) or Angrist and Pischke (2008) p. 228, for example) is written \[\begin{equation*} Y^0_{it} = \alpha + \delta_t + \gamma I(a_i = 1) + \epsilon_{it}\;. \end{equation*}\] The counterfactual untreated outcomes are presented as a sum of an intercept \(\alpha\), main effects for time \(\delta_t\), a main effect for the treated group with coefficient \(\gamma\), and a normally distributed error term \(\epsilon_{it}\). We first present a model for the untreated outcomes assuming no effect of covariates on the outcome. We then transition to the more realistic case that covariates are present and have real effects on the outcome.

Now we can simply connect the untreated outcomes to the observed outcomes \(Y_{it}\) using the relation \[\begin{equation*} Y_{it} = Y^0_{it} + \beta D_{it}\;, \end{equation*}\] where \(D_{it}\) is an indicator of the treatment status of the \(i^{th}\) unit at time \(t\), and \(\beta\) is the traditional diff-in-diff parameter. Note that \(D_{it}\) often will be equivalent to an interaction between indicators for the treatment group and the post-treatment period, \(D_{it} = a_i \cdot I(t > T_0)\). This will be the case when all treated units receive the intervention at the same time. When there is variation in treatment timing, \(D_{it}\) cannot be interpreted as an interaction because pre- and post-treatment periods are not well defined for the control group!

These models impose a constant diff-in-diff effect across units. For more about this strict assumption, please see our discussion of Athey and Imbens (2006).

Let’s return to the simple scenario of two groups and two time periods \(\left(t \in \{1,2\}\right)\). The model for \(Y^0_{it}\) reduces to \[\begin{equation*} Y^0_{it} = \alpha + \delta I(t = 2) + \gamma I(a_i = 1) + \epsilon_{it}\;. \end{equation*}\] If this model is correctly specified, Counterfactual Assumption (1) holds since

\[\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] &= (\alpha + \delta + \gamma) - (\alpha + \gamma) \\ &= \delta \end{align*}\]

and

\[\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] &= (\alpha + \delta ) - \alpha \\ &= \delta\;. \end{align*}\]

Now, let’s introduce the effect of a covariate and see how it affects our counterfactual assumption. For example, write our model for \(Y^0\) including an additive effect of a covariate \(X\), \[\begin{equation*} Y^0_{it} = \alpha + \delta_t + \gamma_a + \lambda_t x_i + \epsilon_{it}\;. \end{equation*}\] Here, the effect of \(X\) on \(Y^0\) may vary across time, so \(\lambda\) is indexed by \(t\).

Initially, we assume a constant effect of \(X\) on \(Y^0\) at \(t = 1\) and \(t = 2\), so \(\lambda_t = \lambda\). In this case, Counterfactual Assumption (1) is still satisfied even if the distribution of \(X\) differs by treatment group because these group-specific means cancel out:

\[\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] &= (\alpha + \delta + \gamma + \lambda \mathbb{E}\left\{X \mid A = 1\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \gamma + \lambda \mathbb{E}\left\{X \mid A = 1\right\}) \\ &= \delta \end{align*}\]

and

\[\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] &= (\alpha + \delta + \lambda \mathbb{E}\left\{X \mid A = 0\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \lambda \mathbb{E}\left\{X \mid A = 0\right\}) \\ &= \delta\;. \end{align*}\]

Lastly, we let the effect of \(X\) on \(Y^0\) vary across time (\(\lambda\) indexed by \(t\)), after which we have a different story:

\[\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1\right] &= (\alpha + \delta + \gamma + \lambda_2 \mathbb{E}\left\{X \mid A = 1\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \gamma + \lambda_1 \mathbb{E}\left\{X \mid A = 1\right\}) \\ &= \delta + \lambda_2 \mathbb{E}\left\{X \mid A = 1\right\} - \\ & \ \ \ \ \ \ \lambda_1 \mathbb{E}\left\{X \mid A = 1\right\} \end{align*}\]

and

\[\begin{align*} \mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0\right] &= (\alpha + \delta + \lambda_2 \mathbb{E}\left\{X \mid A = 0\right\} ) - \\ & \ \ \ \ \ \ (\alpha + \lambda_1 \mathbb{E}\left\{X \mid A = 0\right\}) \\ &= \delta + \lambda_2 \mathbb{E}\left\{X \mid A = 0\right\} - \\ & \ \ \ \ \ \ \lambda_1 \mathbb{E}\left\{X \mid A = 0\right\} \end{align*}\]

are not necessarily equal. They are only equal if the effect of \(X\) on \(Y^0\) is constant over time (i.e., \(\lambda_1 = \lambda_2\)) or the mean of the covariate in the two groups is the same (i.e., \(\mathbb{E}\left\{X \mid A = 1\right\} = \mathbb{E}\left\{X \mid A = 0\right\}\)). This illustrates an important connection between the counterfactual assumption and the regression model and introduces the notion of confounding in diff-in-diff.
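
The two conditions can be checked numerically. The R sketch below evaluates the two expected changes for a binary covariate and returns their difference, which works out to \((\lambda_2 - \lambda_1)\left(\mathbb{E}\left\{X \mid A = 1\right\} - \mathbb{E}\left\{X \mid A = 0\right\}\right)\). The helper names `mean_y0()` and `diff_trend()` and every parameter value are hypothetical.

```r
# Mean of Y^0 at time t in group a under the two-period model
#   Y^0 = alpha + delta * I(t = 2) + gamma * I(a = 1) + lambda_t * X + error,
# where p_x is the group's mean of a binary covariate X.
mean_y0 <- function(t, a, p_x, lambda, alpha = 1, delta = 0.5, gamma = 2) {
  alpha + delta * (t == 2) + gamma * (a == 1) + lambda[t] * p_x
}

# Differential change in mean Y^0 (treated minus control). A value of zero
# means Counterfactual Assumption (1) holds for these parameters.
diff_trend <- function(p_x1, p_x0, lambda) {
  (mean_y0(2, 1, p_x1, lambda) - mean_y0(1, 1, p_x1, lambda)) -
    (mean_y0(2, 0, p_x0, lambda) - mean_y0(1, 0, p_x0, lambda))
}

# Constant covariate effect, different covariate means: about 0.
diff_trend(p_x1 = 0.7, p_x0 = 0.3, lambda = c(1, 1))
# Time-varying covariate effect, equal covariate means: about 0.
diff_trend(p_x1 = 0.5, p_x0 = 0.5, lambda = c(1, 3))
# Time-varying effect and different means: about (3 - 1) * (0.7 - 0.3) = 0.8.
diff_trend(p_x1 = 0.7, p_x0 = 0.3, lambda = c(1, 3))
```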

To better visualize this, use the app below to explore time-varying confounding in simulated data. The y-axis is the mean of the untreated potential outcomes (\(Y^0\)) and the x-axis is time.

**Remember: for Counterfactual Assumption (1) to hold, the lines connecting \(Y^0\) values in the treated and control groups must be parallel.**

Whenever the lines are not parallel (i.e., the differential change over time is not 0), Counterfactual Assumption (1) is violated.

- What happens when the covariate distributions are different in the treated and control groups? (hint: change the values of \(Pr(X=1|A=0)\) and \(Pr(X=1|A=1)\))
- What happens when the covariate effect varies over time? (hint: change the effects of \(X\) on \(Y^0\) at \(t = 1\) and \(t = 2\))

As you may have discovered in the app, \(X\) is a confounder if two conditions hold:

- \(X\) is associated with treatment (\(A\)) and
- the effect of \(X\) on \(Y\) varies across time.

For the remaining parts of the “Estimation” section, we will give general overviews of several of the more common ways to estimate diff-in-diff parameters. We start with linear regression, then discuss matching frameworks, and conclude with semiparametric and nonparametric estimators.

# Regression

Probably the most commonly used estimator in diff-in-diff is a linear regression model. At the very least, the regression model will contain a treatment indicator, an indicator that equals one whenever we are in a post-treatment period, and their interaction. This interaction is typically taken to be the parameter of interest and if the usual diff-in-diff assumptions are true, will equal the ATT. When using R to perform analysis, our code will look something like:

`lm(y ~ a*post)`

Here, \(a\) is a treatment indicator and \(post\) is an indicator for the post-treatment period. The notation `a*post` gives main effects for \(a\) and \(post\) and their interaction.
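Because this model is saturated (four coefficients for four group-by-period cells), its least-squares coefficients can be written directly in terms of the four cell means. The Python sketch below computes them without fitting a regression; the function name and the toy numbers are ours and purely illustrative.

```python
from statistics import mean


def saturated_did(y, a, post):
    """Coefficients of the saturated model y ~ a * post, from the four cell means."""
    cell = {
        (g, p): mean(yi for yi, ai, pi in zip(y, a, post) if ai == g and pi == p)
        for g in (0, 1)
        for p in (0, 1)
    }
    return {
        "intercept": cell[0, 0],                  # control, pre-period mean
        "a": cell[1, 0] - cell[0, 0],             # baseline gap between groups
        "post": cell[0, 1] - cell[0, 0],          # change over time in controls
        "a:post": (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0]),
    }


# Hypothetical toy data: two observations in each group-by-period cell.
y    = [1, 3, 2, 4, 5, 7, 9, 13]
a    = [0, 0, 0, 0, 1, 1, 1, 1]
post = [0, 0, 1, 1, 0, 0, 1, 1]
coefs = saturated_did(y, a, post)
print(coefs["a:post"])  # (11 - 6) - (3 - 2) = 4
```

The `a:post` entry is the familiar difference of differences, which is why the interaction coefficient is read as the diff-in-diff estimate.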

In reality, most regression models will not be that sparse (only two indicators and an outcome). For example, we frequently encounter regression models such as this:

This example, taken from McWilliams et al. (2014), is much more typical of the kinds of regression models we see in applied settings. Without going into unnecessary detail on this paper’s background, the covariate called “ACO_indicators” is the treatment variable. The coefficient \(\beta_{3k}\) represents the causal effect of interest. However, there are many other terms in the model, including time fixed effects, other fixed effects (“HRR_indicators”), and covariates. We talk about the inclusion of fixed effects below, followed by a discussion on adjusting for covariates using regression models.

### Fixed effects in diff-in-diff

Let’s talk about fixed effects briefly (see Mummolo and Peterson (2018) for a more in-depth discussion of fixed effects models and their interpretation). Fixed effects, particularly unit-level fixed effects, are used in causal inference to adjust for unmeasured time-invariant confounders. Of course, there are trade-offs. The discussion from Imai and Kim (n.d.) explains that using unit fixed effects comes at the cost of capturing the dynamic relationship between the treatment and the outcome.

Kropko and Kubinec (2018) discuss the common two-way fixed effects model, which includes unit and time fixed effects. Their main point is that estimates coming from two-way fixed effect models are difficult to interpret when we have many time periods. When we have the canonical (two-period, binary treatment) diff-in-diff setup, the \(\beta\) coefficient from the two-way fixed effect model \(\left(y_{it} = \alpha_i + \delta_t + \beta D_{it} + \epsilon_{it}\right)\) equals the usual estimate. As more time periods are added within the fixed-effects framework, we implicitly add supplementary assumptions. In particular, the diff-in-diff effect is assumed homogeneous across time and cases. Homogeneity across time is a stringent assumption that says the diff-in-diff effect is the same no matter how close or far apart the time periods are. We say this not to discourage use of two-way fixed effect models, but to discourage *automatic* use of them. True, they work well for some cases (when we need to adjust for unmeasured time-invariant confounders), but we really need to examine our research goals on an application-by-application basis, consider the assumptions implicit in the models we’re thinking of using, and adjust our tools accordingly.
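To see why the two-period case reduces to the usual estimate, here is a short sketch. It assumes that nobody is treated at \(t = 1\) and that the treated group (\(A_i = 1\)) is treated at \(t = 2\), so that \(D_{i2} - D_{i1} = A_i\). Differencing the model over time removes the unit effects \(\alpha_i\):

\[\begin{align*}
y_{i2} - y_{i1} &= (\delta_2 - \delta_1) + \beta (D_{i2} - D_{i1}) + (\epsilon_{i2} - \epsilon_{i1}) \\
&= (\delta_2 - \delta_1) + \beta A_i + (\epsilon_{i2} - \epsilon_{i1}).
\end{align*}\]

This is a regression of the change in outcome on a binary indicator, so least squares gives

\[
\hat{\beta} = \frac{1}{n_1}\sum_{i: A_i = 1} (y_{i2} - y_{i1}) - \frac{1}{n_0}\sum_{i: A_i = 0} (y_{i2} - y_{i1}),
\]

where \(n_1\) and \(n_0\) (our notation) are the numbers of treated and control units. This is the difference between the average change in the treated group and the average change in the control group.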

What if treated units are treated at different times? Or what if we don’t have a control group, only variation in treatment timing? Goodman-Bacon (2018) examines the two-way fixed effect regression model \((Y_i(t) = \alpha_i + \delta_t + \beta D_{it} + \epsilon_{it})\) as a diff-in-diff estimator when treatment timing varies. It turns out that the diff-in-diff parameter \(\beta\) is a weighted combination of all possible \(2 \times 2\) diff-in-diff estimators found in the data. So each treatment group can be compared to the untreated group (if one exists), but each treatment group also serves as a control for every other treatment group. The global diff-in-diff estimate is a weighted average of all possible \(2 \times 2\) estimates. The weights are determined by sample sizes in each group and the variance in the treatment variable.

Bai (2009) describes an interactive fixed effects model that incorporates time-varying dynamics. Each unit is assumed to have an \(r\)-vector of factor loadings, \(\mathbf{\lambda}_i\), which multiplies an \(r\)-vector of common factors at each time point, \(\mathbf{F}_{t}\). That is, for outcome \(Y_{it}\) of unit \(i\) at time \(t\), the data-generating model is \[ Y_{it} = X_{it}'\beta + \lambda_{i1}F_{1t} + \ldots + \lambda_{ir}F_{rt} + \epsilon_{it}\;, \] where \(X_{it}\) are observed covariates. Note that the two-way fixed effects model is a special case of this where \(F_{1t} = 1\), \(F_{2t} = \delta_t\) and \(\lambda_{i1}= \alpha_{i}\), \(\lambda_{i2} = 1\). The author presents least-squares estimators for large \(N\) and large \(T\).

Marginal structural models can capture dynamics such as past outcomes affecting future treatments, but cannot account for time-invariant unmeasured confounders. Thus, we can *either* adjust for time-invariant unmeasured confounders and assume no dynamic relationship between treatment and outcome *or* we can assume that there are no unmeasured confounders and allow for more complicated relationships between treatment and outcome.

### Confounding in linear settings

If we know how confounding arises, we can address it. For example, if the truth is a linear data-generating model, we can use a linear regression model to address confounding. The flowchart below outlines six linear data-generating models and the appropriate linear regression adjustment for each.

Of these six scenarios, two require no adjustment at all. Of the four that require adjustment, only one requires the type of regression adjustment nearly always found in the literature, i.e., adjusting for a time-varying covariate without any interaction with time. In the other three scenarios with confounding bias, the issue is due, in whole or in part, to time-varying covariate effects. For these cases, including an interaction of covariates with time is crucial to addressing confounding bias.

See directed acyclic graphs (DAGs) (together with a brief discussion) for these scenarios by selecting an option below:

In this scenario, the covariate \(X\) does not vary over time. The arrow from \(X\) to \(A\) indicates that \(X\) is a cause of \(A\), satisfying the first requirement of a confounder. Additionally, there are arrows from \(X\) to \(Y(1)\) and to \(Y(2)\) as well as an arrow from \(A\) to \(Y(2)\). [Note: there is no arrow from \(A\) to \(Y(1)\) because treatment is administered after \(Y(1)\).] \(\alpha\) is the effect of \(X\) on \(Y(1)\), and \(\beta\) is the effect of \(X\) on \(Y(2)\). When \(\alpha = \beta\), the effect of \(X\) is time-invariant and we do not require covariate adjustment. When \(\alpha \neq \beta\), we must adjust for the interaction of \(X\) with time.

In this scenario, the time-varying covariate \(X\) in periods 1 and 2 is denoted \(X(1)\) and \(X(2)\). There is no arrow connecting \(A\) to \(X(2)\), indicating that treatment does not affect the evolution of \(X(1)\) to \(X(2)\). When \(\alpha = \beta\), the *effect* of \(X\) is time-invariant and we do not need to adjust for the covariate. When \(\alpha \neq \beta\), we must adjust for the interaction of \(X\) with time.

In this scenario, the time-varying covariate \(X\) evolves differentially by treatment group. However, most diff-in-diff analyses implicitly or explicitly assume that \(X\) does not evolve based on treatment group. See our nonparametric section below. One diff-in-diff estimator that directly accounts for this phenomenon is Stuart et al. (2014), which we discuss in more detail below. When \(\alpha = \beta\), the effect of \(X_t\) on \(Y^0\) is time-invariant and it suffices to adjust only for \(X_t\). When \(\alpha \neq \beta\), we must adjust for the interaction of \(X_t\) with time.

# Matching

Matching estimators adjust for confounding by balancing the treatment groups on measured covariates. Rather than using the entire sample population to estimate the diff-in-diff effect, units in the control group are selected based on their “closeness” to units in the treated group. We introduce this section with a series of tweets (enhanced with GIFs!) by Laura Hatfield about a recent Daw and Hatfield (2018a) paper on matching and regression to the mean.

Do you use diff-in-diff? Then this thread is for you.

— Laura Hatfield (@laura_tastic) July 27, 2018

You’re no dummy. You already know diverging trends in the pre-period can bias your results.

But I’m here to tell you about a TOTALLY DIFFERENT, SUPER SNEAKY kind of bias.

Friends, let’s talk regression to the mean. (1/N) pic.twitter.com/M2tEEsBiyH

The argument focuses on estimators that match on outcomes in the pre-treatment period. Matching on pre-treatment outcomes is attractive in diff-in-diff because it improves comparability of the groups and possibly of their outcome trends. The crux of the argument in Daw and Hatfield (2018a) is that matching estimators can be dangerous in diff-in-diff settings due to regression to the mean. Regression to the mean is a notorious phenomenon in which extreme values tend to revert to the group mean on subsequent measurements. For example, if we select the ten students who score highest on an exam, at a subsequent exam, the average score for these ten students would drop towards the class mean.

For diff-in-diff, the effect is similar. By constraining the pre-treatment outcomes to be similar, we are more likely to select units of the group that are higher or lower than their respective group means. Once the matching constraint is dropped (in the post-treatment period), these units’ means can revert back to their respective group’s mean and possibly yield a spurious diff-in-diff effect. So in some cases, matching can actually *introduce* bias.

So how can we know whether matching is useful or harmful in our diff-in-diff study? Unfortunately, sometimes we can’t know. Take the paper by Ryan, Burgess, and Dimick (2015), which presents a simulation study using matching estimators and shows that matching can reduce confounding bias. In their paper, they sampled the treated and control groups from the same population, but the probability of being part of the treated group increased for high pre-treatment outcomes. In contrast, Daw and Hatfield (2018a) set up similar simulations but with treated and control groups coming from *different*, but overlapping, populations.

For the following thought experiment, assume no true diff-in-diff effect is present. If the groups are drawn as in Ryan, Burgess, and Dimick (2015), the treated and control groups have different pre-treatment means even though they come from the same population. In a diff-in-diff study without matching, these units will regress to the mean in the post-treatment period (but they will regress to the same value since they are drawn from the same population!). This yields a non-zero diff-in-diff estimate. Matching actually fixes this issue. If the populations are drawn as in Daw and Hatfield (2018a), the opposite is true. Without matching, the populations are different in the pre-intervention period and remain that way in the post-intervention period (since they are representative of their true populations, there is no regression to the mean). With matching, the groups are constrained to be the same in the pre-treatment period, and once the constraint is released in the post period, the two groups regress back to their group means. So in the Ryan, Burgess, and Dimick (2015) setup, matching is the solution to regression to the mean bias; in the Daw and Hatfield (2018a) setup, matching is the cause of the regression to the mean bias.
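The second half of this thought experiment can be sketched in a few lines of simulation. This is a deliberately simplified illustration, loosely in the spirit of the "different populations" setup and not a reproduction of either paper's simulation: we assume two populations with stable means (1 for treated, 0 for control), independent standard normal noise in each period, no treatment effect, and nearest-neighbor matching with replacement on the pre-period outcome. All names and numbers are ours.

```python
import random
from bisect import bisect_left
from statistics import mean


def simulate_group(n, group_mean, rng):
    """(pre, post) outcomes: a stable group mean plus independent noise each period."""
    return [(group_mean + rng.gauss(0, 1), group_mean + rng.gauss(0, 1)) for _ in range(n)]


def did(treated, control):
    """Difference between the groups' mean changes from pre to post."""
    return mean(post - pre for pre, post in treated) - mean(post - pre for pre, post in control)


def match_on_pre(treated, control):
    """Nearest-neighbor match (with replacement) on the pre-period outcome."""
    pool = sorted(control)                      # sorted by pre-period outcome
    pre_values = [pre for pre, _ in pool]
    matched = []
    for pre, _ in treated:
        i = bisect_left(pre_values, pre)
        candidates = pool[max(i - 1, 0): i + 1]  # neighbors on either side
        matched.append(min(candidates, key=lambda unit: abs(unit[0] - pre)))
    return matched


rng = random.Random(2018)
treated = simulate_group(2000, 1.0, rng)   # population with mean 1
control = simulate_group(2000, 0.0, rng)   # different population with mean 0

unmatched = did(treated, control)                       # close to the true effect, 0
matched = did(treated, match_on_pre(treated, control))  # pulled away from 0
print(round(unmatched, 2), round(matched, 2))
```

In this setup the matched controls are selected for unusually high pre-period noise, so they fall back toward their own population mean in the post period, and the matched estimate is far from zero even though there is no treatment effect.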

With real-life data, there is no way to check empirically whether our groups come from the same population or from different populations. Determining this must come from expert knowledge of how the treatment assignment mechanisms work. To quote Daw and Hatfield (2018b) in the follow-up to their own paper:

(R)esearchers must carefully think through the possible treatment assignment mechanisms that may be operating in the real-life situation they are investigating. For example, if researchers are aware that a pay-for-performance incentive was assigned to physicians within a state based on average per-patient spending in the past year, one may be comfortable assuming that treatment assignment mechanism is operating at the unit level (i.e., the potential treatment and control units are from the same population). In contrast, if the same incentive was assigned to all physicians within a state and a researcher chooses a control state based on geographic proximity, it may be more reasonable to assume that treatment assignment is operating at the population level (i.e., the potential treatment and control units are from separate populations).

Other researchers have also noted the lurking bias of some matching diff-in-diff estimators. Lindner and McConnell (2018), for example, found in simulations that the bias of the estimator was correlated with the standard deviation of the error term: as the standard deviation increased, so did the bias.

In a pair of papers, Chabé-Ferret (2015, 2017) similarly concluded that matching on pre-treatment outcomes can be problematic and is dominated by symmetric diff-in-diff.

# Semi- and Non-parametric

Up to this point, we can think of a diff-in-diff analysis as a four-step process:

- make assumptions about how our data were generated

- suggest a sensible model for the untreated outcomes

- connect the untreated outcomes to the observed outcomes

- estimate the diff-in-diff parameter (via regression or matching or both)

While this process is simple, our estimates and inference can crumble if we’re wrong at any step along the way. We’ve discussed the importance of counterfactual assumptions and inferential procedures. We now turn our attention to the modeling aspect of diff-in-diff. So far, we have discussed only *parametric* models. Below, we present some semiparametric and nonparametric estimators for diff-in-diff. These give us more flexibility when we don’t believe in linearity in the regression model or fixed unit effects.

### Semi-parametric estimation with baseline covariates

Abadie (2005) addresses diff-in-diff when a pre-treatment covariate differs by treatment status and also affects the dynamics of the outcome variable. In our confounding section above, this is the “Time-invariant X with time-varying effect” scenario.

Let’s return to the two-period setting. When a covariate \(X\) is associated with both treatment \(A\) and changes in the outcome \(Y\), Counterfactual Assumption (1) no longer holds.

Thus, Abadie (2005) specifies an identifying assumption that conditions on \(X\). That is,

**Conditional Counterfactual Assumption**
\[
\mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 1, X\right] =
\mathbb{E}\left[Y^0(2) - Y^0(1) \mid A = 0, X\right].
\]

This assumption does *not* identify the ATT, but it *can* identify the CATT, that is, the conditional ATT:

**Conditional average effect of treatment on the treated (CATT)**
\[
CATT \equiv \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1, X\right].
\]

The CATT itself may be of interest, or we may want to average the CATT over the distribution of \(X\) to get back the ATT. To identify the CATT, repeat the identification steps above with expectations conditional on \(X\). As expected, it turns out that

\[\begin{align*} \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1, X\right] &= \lbrace \mathbb{E}\left[Y(2) \mid A = 1, X\right] - \mathbb{E}\left[Y(1) \mid A = 1, X \right] \rbrace - \\ & \ \ \ \ \ \lbrace \mathbb{E}\left[Y(2) \mid A = 0, X\right] - \mathbb{E}\left[Y(1) \mid A = 0, X \right] \rbrace. \end{align*}\]

Nonparametric estimators for these quantities are easy when \(X\) is a single categorical variable. They are simply sample averages for groups defined by combinations of \(X\) and \(A\):

- The post-treatment average of the treated group with \(X=x\) for \(\mathbb{E}\left[Y(2) \mid A = 1, X=x\right]\)

- The pre-treatment average of the treated group with \(X=x\) for \(\mathbb{E}\left[Y(1) \mid A = 1, X=x\right]\)

- The post-treatment average of the control group with \(X=x\) for \(\mathbb{E}\left[Y(2) \mid A = 0, X=x\right]\)

- The pre-treatment average of the control group with \(X=x\) for \(\mathbb{E}\left[Y(1) \mid A = 0, X=x\right]\)
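With a single categorical \(X\), these four averages can be computed stratum by stratum. The sketch below assumes panel data (each unit observed at both times, so the difference of averages equals the average change) and uses hypothetical records of the form `(x, a, y1, y2)`; the function names are ours. It also averages the stratum-specific estimates over the distribution of \(X\) among treated units to recover an ATT estimate.

```python
from collections import defaultdict
from statistics import mean


def catt_by_stratum(records):
    """Stratified diff-in-diff: one estimate per level of the categorical covariate x.

    Raises statistics.StatisticsError if a stratum lacks treated or control units.
    """
    changes = defaultdict(list)                 # (x, a) -> outcome changes y2 - y1
    for x, a, y1, y2 in records:
        changes[(x, a)].append(y2 - y1)
    strata = sorted({x for x, _ in changes})
    return {x: mean(changes[(x, 1)]) - mean(changes[(x, 0)]) for x in strata}


def att_from_catt(records):
    """Average the stratum-specific estimates over x among the treated units."""
    catt = catt_by_stratum(records)
    treated_x = [x for x, a, _, _ in records if a == 1]
    return sum(catt[x] for x in treated_x) / len(treated_x)


# Hypothetical toy data: (x, a, y1, y2)
records = [
    ("low", 1, 1, 3), ("low", 0, 1, 2),
    ("high", 1, 5, 9), ("high", 1, 5, 9), ("high", 0, 4, 6),
]
print(catt_by_stratum(records))  # {'high': 2, 'low': 1}
print(att_from_catt(records))    # (1 + 2 + 2) / 3
```

A stratum with no treated or no control units makes the corresponding average undefined, which is the sample analogue of a positivity violation.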

However, if \(X\) is high-dimensional or contains continuous covariates, these get tricky. Abadie proposes a semiparametric solution using propensity scores. Recall that a propensity score is the estimated probability of treatment given pre-treatment covariate \(X\), \(P(A = 1 \mid X)\). For this approach to work, we need to use the positivity assumption, which was introduced in the assumption section. That is,

**Positivity Assumption**
\[\begin{equation*}
0 < P(A = 1 | X) < 1 \; \text{ for all } X.
\end{equation*}\]

This assumption ensures the estimand is defined at all \(X\). If some values of \(X\) lead to *guaranteed* treatment or control (i.e., propensity scores of \(0\) or \(1\)) we should reconsider the study population.

With positivity in hand, consider the weighted estimator of Abadie (2005):

\[\begin{equation} \mathbb{E}\left[Y^1(2) - Y^0(2) \mid A = 1\right] = \mathbb{E}\left[\frac{Y(2) - Y(1)}{P(A = 1)} \cdot \frac{A - P(A = 1 \mid X)}{1 - P(A = 1 \mid X)}\right]. \end{equation}\]

To estimate these quantities, we need fitted values of the propensity scores for each unit, i.e., \(\hat{P}(A=1 | X= x_i)\), and then we use sample averages in the treated and control groups.
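A brief sketch of the algebra, writing \(\Delta Y = Y(2) - Y(1)\) and \(p(X) = P(A = 1 \mid X)\) (shorthand introduced here for readability): the factor \(\frac{A - p(X)}{1 - p(X)}\) equals \(1\) for treated units and \(-\frac{p(X)}{1 - p(X)}\) for control units, so

\[\begin{equation*}
\mathbb{E}\left[\frac{\Delta Y}{P(A = 1)} \cdot \frac{A - p(X)}{1 - p(X)}\right] = \frac{1}{P(A = 1)} \left\lbrace \mathbb{E}\left[A \, \Delta Y\right] - \mathbb{E}\left[(1 - A) \, \frac{p(X)}{1 - p(X)} \, \Delta Y\right] \right\rbrace.
\end{equation*}\]

Replacing expectations with sample averages over \(n\) units, \(P(A = 1)\) with the treated share \(n_1 / n\), and \(p(x_i)\) with the fitted value \(\hat{p}(x_i) = \hat{P}(A=1 | X=x_i)\) gives

\[\begin{equation*}
\frac{1}{n_1} \sum_{i: A_i = 1} \Delta Y_i - \frac{1}{n_1} \sum_{i: A_i = 0} \frac{\hat{p}(x_i)}{1 - \hat{p}(x_i)} \, \Delta Y_i .
\end{equation*}\]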

We only need the average change in outcomes among the treated units and the *weighted* average change in outcomes among the control units. The weights are \(\frac{\hat{P}(A=1 | X=x_i)}{1 - \hat{P}(A=1 | X=x_i)}\). Why are these weights sensible? Well, outcome changes among control units with \(x_i\) that resemble treated units (i.e., with large \(\hat{P}(A=1 | X=x_i)\)) will get more weight. Outcome changes among control units with \(x_i\) that resemble control units will get less weight.

To model the propensity scores, we could use a parametric model like logistic regression or something more flexible like machine learning. As usual, extending the model to multiple time points in the pre- and post-treatment periods is more complicated.
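As a rough illustration of the weighted estimator, the sketch below takes the outcome changes, treatment indicators, and already-fitted propensity scores as plain lists and evaluates the sample analogue of the expectation in Abadie's formula. How the propensity scores are fitted is left open; the function name and toy numbers are ours.

```python
def abadie_ipw_att(delta_y, a, pscore):
    """Sample analogue of E[ (dY / P(A=1)) * (A - p(X)) / (1 - p(X)) ].

    delta_y : outcome changes Y(2) - Y(1), one per unit
    a       : treatment indicators (0 or 1)
    pscore  : fitted propensity scores, each strictly between 0 and 1
    """
    n = len(delta_y)
    p_treated = sum(a) / n                      # estimate of P(A = 1)
    terms = [
        (dy / p_treated) * (ai - ps) / (1 - ps)
        for dy, ai, ps in zip(delta_y, a, pscore)
    ]
    return sum(terms) / n


# Hypothetical toy data: two treated and two control units.
delta_y = [3, 5, 1, 2]
a = [1, 1, 0, 0]

# With a constant propensity score equal to P(A = 1), control weights are all 1
# and the estimator reduces to the simple difference in mean changes.
simple = abadie_ipw_att(delta_y, a, [0.5, 0.5, 0.5, 0.5])

# With varying scores, controls that look more like treated units count more.
weighted = abadie_ipw_att(delta_y, a, [0.6, 0.7, 0.2, 0.5])
print(simple, weighted)
```

In the second call, the control unit with score 0.5 gets weight 1 while the one with score 0.2 gets weight 0.25, matching the weights described above.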

### Nonparametric estimation with empirical distributions

Athey and Imbens (2006) developed a generalization of diff-in-diff called “changes-in-changes” (of which diff-in-diff is a special case). This method drops many of the parametric assumptions of diff-in-diff and allows both time and treatment effects to vary across individuals. Again we are in a two-period, two-group setting. The Athey and Imbens (2006) model is much less restrictive than the usual parametric model. It only assumes two things:

- \(Y_i^0 = h(u_i, t)\) for unobservable characteristics \(u_i\) and an unknown function \(h\) (increasing in \(u\))
- Within groups, the distribution of \(u_i\) does not change over time

Note that the distribution of \(u_i\) can differ between treatment groups so long as this difference remains constant. Below, we discuss a method for estimating diff-in-diff without this assumption. To estimate the target parameter, Athey and Imbens (2006) estimate empirical outcome distributions in a familiar list of samples:

- (1) The post-treatment distribution of \(Y\) in the treated group,

- (2) The pre-treatment distribution of \(Y\) in the treated group,

- (3) The post-treatment distribution of \(Y\) in the control group, and

- (4) The pre-treatment distribution of \(Y\) in the control group.

What we are missing is

- (5) The post-treatment distribution of \(Y^0\) in the treated group (i.e., *counterfactual, untreated* outcomes).

Since we cannot observe this distribution, Athey and Imbens (2006) estimate it through a combination of the empirical distributions for (2), (3), and (4). After estimating (5), the effect of treatment is the difference between observed (1) and estimated (5). Let \(F_{Y^0_{12}}\) be the counterfactual distribution for the untreated outcomes for the treated group at \(t = 2\). We estimate this quantity through the relation:

\[ F_{Y^0_{12}}(y) = F_{Y_{11}}(F^{-1}_{Y_{01}}(F_{Y_{02}}(y)))\;, \] where \(F_{Y_{11}}\) is the distribution function for the (observed) outcomes for the treated group at \(t = 1\); \(F_{Y_{01}}\) is the distribution function for the (observed) outcomes for the untreated group at \(t = 1\); and \(F_{Y_{02}}\) is the distribution function for the (observed) outcomes for the untreated group at \(t = 2\).

The other distribution of note, \(F_{Y^1_{12}}(y)\), is actually observed since these are the treated outcomes for the treated group at \(t = 2\). Since we have estimates for both \(F_{Y^1_{12}}\) and \(F_{Y^0_{12}}\), we can also estimate the diff-in-diff effect. MATLAB code for this estimator is available on the author’s website.
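For intuition, here is a small Python sketch of the idea using empirical distribution functions. It is not the authors' MATLAB code: it covers only the mean effect for continuous outcomes, and the function names and toy numbers are ours. Each treated unit's pre-period outcome is located in the control group's pre-period distribution and then mapped to the same quantile of the control group's post-period distribution. The mapped values are draws from the estimated counterfactual distribution \(F_{Y^0_{12}}\).

```python
import math
from bisect import bisect_right
from statistics import mean


def ecdf(sample, y):
    """Empirical distribution function of `sample` evaluated at y."""
    ordered = sorted(sample)
    return bisect_right(ordered, y) / len(ordered)


def quantile(sample, q):
    """Generalized inverse of the ECDF: smallest sample value with ECDF >= q."""
    ordered = sorted(sample)
    k = max(math.ceil(q * len(ordered) - 1e-9) - 1, 0)  # small tolerance for float error
    return ordered[k]


def cic_effect(treated_pre, treated_post, control_pre, control_post):
    """Mean treated outcome minus mean estimated counterfactual outcome at t = 2."""
    counterfactual = [
        quantile(control_post, ecdf(control_pre, y))  # F02^{-1}(F01(y))
        for y in treated_pre
    ]
    return mean(treated_post) - mean(counterfactual)


# Hypothetical toy data: control outcomes double between periods.
effect = cic_effect(
    treated_pre=[2, 3], treated_post=[5, 7],
    control_pre=[1, 2, 3, 4], control_post=[2, 4, 6, 8],
)
print(effect)  # counterfactuals are 4 and 6, so 6 - 5 = 1
```

Treated pre-period values outside the range of the control pre-period values are mapped to the most extreme control post-period value, so in practice the supports of the two groups need to overlap.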

Bonhomme and Sauder (2011) extend this idea to allow the shape of the outcome distributions to differ in the pre- and post-intervention periods. The cost of this additional flexibility is that they must assume additivity.

### Semiparametric estimation with time-varying covariates

One of the key identifying assumptions from Athey and Imbens (2006) is that the distribution of \(u\) (all non-treatment and non-time factors) is invariant across time within the treated and control groups. That is, the distributions do not change with time. Stuart et al. (2014) circumvent this restriction by considering four distinct groups (control/pre-treatment, control/post-treatment, treated/pre-treatment, treated/post-treatment) rather than just two groups (control and treated) observed in two time periods. With the four groups, the distribution of \(u\) can change over time. A consequence of this setup is that it no longer makes sense to talk about the diff-in-diff parameter as the effect of treatment on the treated; instead, the estimand is defined as the effect of treatment on the treated in the pre-treatment period.

The estimator uses propensity scores for predicting the probabilities of each observation being in each of the four groups. We can use some kind of multinomial regression. The treatment effect is then calculated as a weighted average of the observed outcomes (see section 2.4 of Stuart et al. (2014)).

# Double Robustness

Whereas standard parametric techniques rely on estimation of the outcome regression and methods such as those in Abadie (2005) leverage information in the propensity score function, some doubly robust methods use both of these components. Broadly, doubly robust methods will yield consistent estimators of the parameter of interest if at least one of these models is estimated consistently, and they will be efficient estimators if both are estimated consistently.

Sant’Anna and Zhao (2018) propose a doubly robust estimator that allows for linear and nonlinear specifications of the outcome regression and propensity score function for both panel and repeated cross-section data. Inference yielding simultaneous confidence intervals involves a bootstrapping procedure, which accommodates clusters (although the number of groups must be large).

Elsewhere, Han, Yu, and Friedberg (2017) implement a doubly robust weighting approach based on Lunceford and Davidian (2004) to study the impact of medical home status on children’s healthcare outcomes. General texts on doubly robust estimators are available (see Van der Laan and Robins (2003) or Van der Laan and Rose (2011)). For another type of doubly robust method that depends on the outcome regression and unit and time weights, see Arkhangelsky et al. (2019) in the Synthetic Control section.

# Inference

The topic of inference is inherently linked to that of estimation. Once we estimate the causal estimand, we want to accurately quantify the uncertainty surrounding our estimate (confidence intervals) and calculate the probability of observing an estimate at least as extreme as ours under the null hypothesis (p-values). In this section, we highlight some common challenges and give proposed solutions that have been recommended in the literature.

Whether the data arise from repeated measures or from repeated cross-sections, data used in diff-in-diff studies are complex and cannot be assumed to be *iid* (i.e., independently and identically distributed). For example, we may have hierarchical data, in which individual observations are nested within larger units (e.g., individuals in a US state) or longitudinal data, in which repeated measures are obtained for units. In both of these cases, assuming *iid* data will result in standard errors that are too small.

Bertrand, Duflo, and Mullainathan (2004) and Rokicki et al. (2018) discuss diff-in-diff inference in the presence of serial correlation. The authors consider methods to accommodate this issue such as collapsing the data, modeling the covariance structures, and permutation inference.

### Collapsing the data

Collapsing or aggregating the data returns us to the simple two-period setting, obviating the need to consider longitudinal correlation in the data. When treatment is administered at the same time point for all treated units, we can perform ordinary least squares on the aggregated data. On the other hand, when treatment laws are staggered (e.g., states pass the same health care law in different years), Bertrand, Duflo, and Mullainathan (2004) suggest aggregating the residuals from a regression model and then analyzing those. See Goodman-Bacon (2018) and Athey and Imbens (2018) for more about varying treatment start times.
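For the simple case in which all treated units start treatment at the same time, collapsing can be sketched as follows. The data layout (a list of `(unit, t, y)` tuples), the function names, and the numbers are hypothetical. Each unit's outcomes are averaged within the pre- and post-treatment periods, after which the analysis is an ordinary two-period diff-in-diff.

```python
from collections import defaultdict
from statistics import mean


def collapse(panel, last_pre_period):
    """Average each unit's outcomes within the pre- and post-treatment periods.

    panel: iterable of (unit, t, y); periods t <= last_pre_period are pre-treatment.
    Returns {unit: (pre_mean, post_mean)}.
    """
    pre, post = defaultdict(list), defaultdict(list)
    for unit, t, y in panel:
        (pre if t <= last_pre_period else post)[unit].append(y)
    return {unit: (mean(pre[unit]), mean(post[unit])) for unit in pre}


def did_on_collapsed(collapsed, treated_units):
    """Two-period diff-in-diff on the collapsed data."""
    change = {unit: post_mean - pre_mean for unit, (pre_mean, post_mean) in collapsed.items()}
    treated = [c for unit, c in change.items() if unit in treated_units]
    control = [c for unit, c in change.items() if unit not in treated_units]
    return mean(treated) - mean(control)


# Hypothetical panel: three states, four periods, treatment starts at t = 3.
panel = [
    ("CA", 1, 10), ("CA", 2, 12), ("CA", 3, 15), ("CA", 4, 17),
    ("NV", 1, 8), ("NV", 2, 10), ("NV", 3, 10), ("NV", 4, 12),
    ("AZ", 1, 5), ("AZ", 2, 5), ("AZ", 3, 6), ("AZ", 4, 6),
]
collapsed = collapse(panel, last_pre_period=2)
estimate = did_on_collapsed(collapsed, treated_units={"CA"})
print(collapsed, estimate)
```

After collapsing there is one change score per unit, so serial correlation within a unit no longer enters the standard error calculation.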

In simulation studies, Bertrand, Duflo, and Mullainathan (2004) and Rokicki et al. (2018) find that aggregation has good Type I error and coverage, but it *does* lose some information (and thus power).

### Clustered standard errors

The most popular way to account for serial correlation in diff-in-diff is clustered standard errors (Cameron and Miller 2015; Abadie et al. 2017). In practice, this is typically done in `Stata` using the `cluster` option from the `regress` command. Similar adjustment is available in any common statistical software. We declare which variable or variables constitute our clusters, and the software adjusts the usual standard errors to account for within-cluster correlation.

This type of adjustment fails with only one treated unit (Conley and Taber 2011), for example, when a single state implements a policy of interest. There is no hard and fast rule on the number of treated units needed for clustered standard errors to be appropriate. Generally, it is better to have a balanced treated-control ratio than a lopsided one. As we mentioned in the preface to this section, Rokicki et al. (2018) examined DID inference when there were few groups. In particular, see Figures 2 (panel A) and 4 in that paper. Figure 2 examines 95% confidence interval coverage for the DID parameter under various inference techniques. In panel A, we can see that when the number of groups is small, clustering standard errors results in undercoverage, and this undercoverage is worse when the treated-to-control ratio is unbalanced. In Figure 4, coverage is presented as a function of the proportion of treated units to control units. Whenever there is an unbalanced proportion of treated to control units, coverage suffers.

Fortunately, other choices exist if clustered standard errors are untenable, and they may even be preferred in many situations. Donald and Lang (2007) developed a two-part procedure for estimation and inference in simple models that works well even when the number of groups is small.

Mixed models with random effects at the cluster level can account for serial correlation. This is what we used in our demonstration of confounding in a previous section. Generalized estimating equations (GEE) take into account covariance structure and use a robust sandwich estimator for the standard errors (see Figure 2, panel D and Figure 4 in Rokicki et al. (2018)). Both of these methods are widely available in statistical software. In particular, GEE is powerful because it is robust to misspecification of the correlation structure: if we guess incorrectly, our estimate will not be biased as long as the underlying regression model is correct. Specifying the correct covariance will increase the efficiency of our estimate. However, note that Rokicki et al. (2018) also found undercoverage of the GEE confidence intervals when the ratio of treated to control units was lopsided.

### Arbitrary covariance structures

Throughout the diff-in-diff literature, we find simulations and inference techniques based on an autoregressive covariance AR(1) structure for the residuals within a cluster. The AR(1) covariance structure is

\[ \text{Cov}(Y_i) = \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix} \]

with an unknown variance parameter \(\sigma^2\) and an unknown autoregressive parameter \(0 \leq \rho \leq 1\). When \(\rho\) is larger, observations within a cluster are more highly correlated; when \(\rho = 0\), observations are independent. This structure assumes that correlation is positive (or zero) across all observations, that observations closer to each other are the most strongly correlated, and that observations that are more distant are more weakly correlated.
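The matrix above is simple to construct: entry \((i, j)\) is \(\sigma^2 \rho^{|i - j|}\). A minimal Python sketch (the function name is ours):

```python
def ar1_covariance(n, rho, sigma2):
    """n x n AR(1) covariance matrix with entries sigma2 * rho ** |i - j|."""
    return [[sigma2 * rho ** abs(i - j) for j in range(n)] for i in range(n)]


# Correlation halves with each additional period of separation.
for row in ar1_covariance(3, rho=0.5, sigma2=2.0):
    print(row)
# [2.0, 1.0, 0.5]
# [1.0, 2.0, 1.0]
# [0.5, 1.0, 2.0]
```

Setting `rho=0.0` gives a diagonal matrix, the independent case described above.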

The AR(1) correlation structure is pervasive in diff-in-diff simulation studies and inference techniques. Bertrand, Duflo, and Mullainathan (2004) considered this correlation structure in simulations and found that “this technique [assuming AR(1)] does little to solve the serial correlation problem” due to the difficulty in estimating \(\rho\). Rokicki et al. (2018) used an AR(1) in their simulations.

McKenzie (2012) also discusses autocorrelation, emphasizing how statistical power relates to the number of time points and to autocorrelation, ultimately concluding that ANCOVA is more powerful than diff-in-diff.

The correlation structure in diff-in-diff applications may follow many different structures. For example, after de-meaning and de-trending, outcomes may have a weak positive correlation at adjacent time points but a negative correlation at time points that are far apart. In the shiny app below, we present correlation structures for simulated data and real data. The real datasets are from (a) the Dartmouth Health Atlas, (b) MarketScan claims data, and (c) Medicare claims, which are described in more detail within the app. Play around with the settings to simulate data that look like your applications. Does the correlation structure look the way you expect?

### Permutation tests

Permutation tests are a resampling method that can be used to test statistical hypotheses. In the diff-in-diff setting, permutation tests comprise the following steps:

1. Compute the test statistic of interest on the original data. For example, calculate the interaction term between time and treatment from a regression model. Call this \(\hat{\delta}\).

2. For \(K\) a large positive integer, randomly permute the treatment assignment in the original data, so that the data are the same save for a new treatment assignment. Do this \(K\) times.

3. For each of the \(K\) new datasets, compute the same test statistic. In our example, we compute a new interaction term from a regression model. Call these \(\hat{\delta}^{(k)}\) for permutation \(k \in \{1, \dots, K\}\).

4. Compare the test statistic \(\hat{\delta}\) found in the first step to the test statistics \(\hat{\delta}^{(1)}, \dots, \hat{\delta}^{(K)}\) found in the third step.

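The fourth step is where we get a nonparametric p-value for the parameter of interest: it is the proportion of the permuted statistics \(\hat{\delta}^{(k)}\) that are at least as extreme as \(\hat{\delta}\). If, for instance, \(\hat{\delta}\) is more extreme than 95% of the \(\hat{\delta}^{(k)}\), then the permutation test p-value is 0.05.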
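The four steps can be sketched in Python for the two-period case, where the interaction coefficient equals the difference in mean outcome changes between treated and control units. Treatment labels are permuted at the unit level. The function names and toy data are ours, and counting the observed statistic as one of the permutations (the `1 +` in the p-value) is a common convention that we adopt here as an assumption, not something prescribed above.

```python
import random


def did_statistic(changes, treated):
    """Difference in mean outcome change between treated and control units."""
    t = [c for c, a in zip(changes, treated) if a == 1]
    c0 = [c for c, a in zip(changes, treated) if a == 0]
    return sum(t) / len(t) - sum(c0) / len(c0)


def permutation_test(changes, treated, K=999, seed=0):
    """Two-sided permutation p-value for the diff-in-diff statistic."""
    rng = random.Random(seed)
    observed = did_statistic(changes, treated)          # step 1
    labels = list(treated)
    extreme = 0
    for _ in range(K):
        rng.shuffle(labels)                             # step 2: reassign treatment
        permuted = did_statistic(changes, labels)       # step 3
        if abs(permuted) >= abs(observed) - 1e-12:      # step 4: compare
            extreme += 1
    return observed, (1 + extreme) / (K + 1)


# Hypothetical unit-level changes (post minus pre): 5 treated, 7 control units.
changes = [5, 6, 5, 6, 5, 0, 1, 0, 1, 0, 1, 0]
treated = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
delta_hat, p_value = permutation_test(changes, treated)
print(delta_hat, p_value)
```

With clustered or longitudinal data, the labels should be permuted at the level at which treatment was assigned (for example, states rather than individuals).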

For more on randomization inference for difference-in-differences, see Conley and Taber (2011) and MacKinnon and Webb (2018).

# Nonlinear Models

Much of diff-in-diff theory and its applications focus on continuous outcomes, but nonlinear outcomes are common too. Nonlinear outcomes include binary outcomes such as death status or count outcomes such as the number of hospitalizations. If we have a binary outcome, we can model the probability directly with a linear probability model; the downside to this approach is that predicted probabilities can fall outside of the \([0, 1]\) range. We can restrict predicted probabilities within \([0, 1]\) using an appropriate transformation; logit and probit transformations are perhaps the most common. However, in doing so we lose a lot of nice properties that come with the linear model. Ai and Norton (2003) first pointed out this vexing occurrence with respect to diff-in-diff. In particular, they showed that the cross-partial effect can be nonzero even when the treatment/post-period interaction term is 0.

Puhani (2012) noted that while the point from Ai and Norton (2003) is true, the diff-in-diff treatment effect is still taken directly from the interaction term in the model. He shows that the treatment effect is actually a difference of two cross differences (that of the observed outcome minus that of the untreated potential outcome), so the interaction term always has the same sign as the diff-in-diff effect (not necessarily the case for the cross-partial in Ai and Norton (2003)). Thus, inference on the treatment effect can be conducted through the usual test of the interaction parameter.

The Karaca-Mandic, Norton, and Dowd (2012) paper ties the previous two papers together. They show in Figures 3 and 4 how the diff-in-diff effect (on the probability scale) can change as the value of the linear predictor \(X\beta\) changes, even when the model does not include an interaction term. The authors then go through an interactive example using `Stata`, which might be useful to researchers intending to do a diff-in-diff analysis with a nonlinear model.
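A small numerical sketch may make the point about the probability scale concrete. The coefficients below are hypothetical and the function `cross_difference` is ours; it computes the double difference of predicted probabilities from a logit model. With the interaction coefficient set to 0, the double difference is still nonzero, and it changes sign as the rest of the linear predictor (`xb`) shifts.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_difference(b0, b_treat, b_post, b_inter, xb=0.0):
    """Double difference of predicted probabilities from a logit model.

    (treated post - treated pre) - (control post - control pre),
    evaluated at a given value xb of the remaining linear predictor.
    """
    def p(a, t):
        return expit(b0 + b_treat * a + b_post * t + b_inter * a * t + xb)
    return (p(1, 1) - p(1, 0)) - (p(0, 1) - p(0, 0))

# Hypothetical coefficients; the interaction coefficient is exactly 0.
at_low = cross_difference(-1.0, 1.0, 0.5, 0.0, xb=0.0)
at_high = cross_difference(-1.0, 1.0, 0.5, 0.0, xb=2.0)
print(at_low, at_high)  # positive at xb=0, negative at xb=2
```

On the linear (identity-link) scale the same double difference would equal the interaction coefficient, here 0, at every value of `xb`.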

# Synthetic controls

Synthetic control was introduced by Abadie and Gardeazabal (2003), who studied the effects of terrorism on economic growth in Spain. Beginning in the 1960s, the Basque Country experienced a rash of terrorism, which was interrupted by a 1998 cease-fire. The other regions of Spain were weighted to form a synthetic Basque Country, similar to the real Basque Country in demographics. The results showed that per capita GDP increased in the Basque Country after the truce (relative to the synthetic Basque Country with no cease-fire). After the cease-fire ended, GDP decreased. Further methodological refinement followed in Abadie, Diamond, and Hainmueller (2010), and in the following ten years many additional methods papers appeared, some of which we summarize below.

Synthetic control methods (SCM) can be thought of as a data-driven approach to choosing a comparison group. In selecting a comparison group of units not impacted by the intervention, there are many options. For example, to evaluate the impact of the Massachusetts health insurance expansion, a researcher could use all other states as controls. She could use nearby states as controls. She could even choose counties from around the country that look similar to Massachusetts. When conducting DID, researchers typically use subject-matter knowledge in order to choose comparison groups and then evaluate whether trends are parallel post-hoc.

Synthetic controls are appealing because they provide a data-driven way to select comparison groups. However, the assumption that a weighted set of comparison units can exactly represent the treatment group is strong: with non-negative weights that sum to 1, it requires the treated unit to lie inside the convex hull of the comparison units. If this assumption is not met, bias can result. Modifications to add flexibility to SCM include allowing a level difference between the treatment and comparison groups or allowing weights that may be negative and need not sum to 1 (Doudchenko and Imbens 2016). Another way to address this convex hull issue is to develop synthetic controls for comparison units rather than treatment units and include only those that are well-estimated in the effect estimate (Powell 2018).
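To make the weighting idea concrete, here is a minimal sketch of choosing non-negative weights that sum to 1 so that the weighted comparison units track the treated unit's pre-period outcomes. Several things here are assumptions for illustration: the data are hypothetical and built so that exact weights exist, the function name `scm_weights` is ours, matching is on pre-period outcomes only (no covariates), and the solver is a simple exponentiated gradient descent with a step size tuned to this toy example, not the optimizer used by any SCM software.

```python
import math

def scm_weights(controls, treated, step=0.002, n_iter=20000):
    """Simplex-constrained least squares via exponentiated gradient descent.

    controls: list of pre-period outcome series, one per comparison unit
    treated:  pre-period outcome series of the treated unit
    Returns weights that are non-negative and sum to 1.
    """
    n = len(controls)
    w = [1.0 / n] * n  # start from equal weights
    for _ in range(n_iter):
        synth = [sum(w[i] * controls[i][t] for i in range(n))
                 for t in range(len(treated))]
        resid = [s - y for s, y in zip(synth, treated)]
        # Gradient of the sum of squared pre-period differences.
        grad = [2.0 * sum(r * c for r, c in zip(resid, controls[i]))
                for i in range(n)]
        # Multiplicative update keeps weights positive ...
        w = [w_i * math.exp(-step * g_i) for w_i, g_i in zip(w, grad)]
        # ... and renormalizing keeps them summing to 1.
        total = sum(w)
        w = [w_i / total for w_i in w]
    return w

# Hypothetical pre-period outcomes for three comparison units.
controls = [
    [1, 2, 3, 4, 5, 6],
    [2, 2, 3, 3, 4, 4],
    [5, 3, 6, 2, 7, 1],
]
true_w = [0.5, 0.3, 0.2]
# Treated unit constructed to lie inside the convex hull of the controls.
treated_pre = [sum(w * c[t] for w, c in zip(true_w, controls)) for t in range(6)]

weights = scm_weights(controls, treated_pre)
print([round(w, 3) for w in weights])
```

Because the treated series was built as a convex combination of the comparison series, the weights can reproduce it exactly. If the treated unit fell outside the convex hull, the same procedure would still return weights, but the pre-period fit would be imperfect, which is the situation the modifications above try to address.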

Another potential issue with SCM is overfitting. Several authors have observed that under a fixed penalty, SCM is a form of ridge regression. They have, therefore, proposed choosing the SCM penalty term to minimize mean-squared error to avoid overfitting (Doudchenko and Imbens 2016; Kinn 2018). In an alternative approach, Powell (2018) modeled unit-specific trends to reduce noise.

A third concern that has been raised is the so-called ‘curse of dimensionality’ (Ferman and Pinto 2016). Traditional SCM estimators are consistent as the number of time periods goes to infinity. However, an increasing number of time periods decreases the likelihood that appropriate weights exist; with more time periods, the treatment unit may not fall inside the convex hull of the comparison units. (**???**) propose adjusting the SCM estimate by the outcome regression-estimated average difference between treatment and comparison units (including lagged outcomes and possibly covariates).

Ding and Li (2019) address the relationship between diff-in-diff and methods that assume treatment assignment is ignorable conditional on past outcomes (which includes synthetic controls and lagged dependent variables regression).


### Synthetic Difference-in-Differences

Arkhangelsky et al. (2019) propose an extension to SCM called Synthetic Difference-in-Differences (SDID), which combines elements of both DID and SCM. Most DID models incorporate both unit and time fixed effects and weight each comparison unit and each time period equally. SCM incorporates unit-level weights and time fixed effects. SDID adds additional flexibility by incorporating all of these elements: 1) unit and time fixed effects and 2) unit- and time-level weights.

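In the simple case where only unit \(N\) is treated and only in the final period \(T\), the unit weights, time weights, and SDID estimator are

\[ \hat{\omega}^{sc} = \underset{\omega \in W}{\text{arg min}} \sum^{T-1}_{t = 1} \left(\sum^{N-1}_{i=1} \omega_iY_i(t) - Y_N(t)\right)^2 \]

\[ \hat{\lambda}^{sc} = \underset{\lambda \in L}{\text{arg min}} \sum^{N-1}_{i = 1} \left(\sum^{T-1}_{t=1} \lambda_tY_i(t) - Y_i(T)\right)^2 \]

\[ \left(\hat{\theta}^{sdid}, \hat{\tau}^{sdid}\right) = \underset{\theta, \tau}{\text{arg min}} \sum^N_{i=1}\sum^T_{t=1}\left( Y_i(t) - g(\theta)_{it} - \tau W_{it}\right)^2\hat{\omega}_i\hat{\lambda}_t \]

where \(W_{it}\) is the treatment indicator and \(g(\theta)_{it}\) is the outcome model (for example, unit and time fixed effects).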

The authors demonstrate that this approach is doubly robust: as \(N\) and \(T\) go to infinity, the SDID estimator is consistent if either 1) the unit and time weights or 2) the outcome model is correctly specified. (This roughly corresponds to either SCM with additional weight flexibility or DID being consistently estimated.)

### Augmented Synthetic Control

Augmented SCM (ASCM) is another extension of synthetic controls. Ben-Michael, Feller, and Rothstein (2018) select weights to minimize the difference in pre-intervention level then subtract an estimate of the remaining level difference from the post-intervention difference. This addresses the “curse of dimensionality”: while SCM is unbiased as the number of pre-intervention time periods grows, researchers are less likely to identify a good fit as the number of time periods grows, even when one exists, unless they have a very large number of control units. Both ASCM and SDID methods can weight up recent periods relative to distant periods when estimating this level difference. Further, both ASCM and SDID involve a correction to the SCM estimator that is 0 if there is exact pre-intervention balance. These methods only apply when there is imperfect pre-intervention balance and matter most when there is substantial pre-intervention imbalance.

### Generalized Synthetic Control

Xu (2017) proposed the generalized synthetic control method, which combines an interactive fixed effects model with the framework of synthetic control. The basic idea is to use a parametric factor model for the outcomes to predict the “missing” untreated potential outcomes for the treated group. The factor model assumes that a small set of time-varying factors interact with unit-specific “factor loadings” to generate the outcome. Note that the popular two-way fixed effects model (with unit and time fixed effects) is a special case of a factor model.

Within the generalized synthetic control framework, Xu (2017) writes the outcome for unit \(i\) at time \(t\) as

\[ Y_i(t) = \delta_{it}D_{it} + x_{it}\beta + \lambda_i f_t + \epsilon_{it}, \] where \(D_{it}\) is a treatment indicator which equals 1 whenever the unit \(i\) receives treatment at time \(t\), \(\delta_{it}\) are heterogeneous effects, \(x_{it}\) and \(\beta\) are covariates and their coefficients, and \(\lambda_i\) and \(f_t\) are factor loadings and factors, respectively.

To use the generalized synthetic control method, we first estimate the latent factors, \(f_t\), and the coefficients on the covariates, \(\beta\), using only the data from the control units (with least squares and some additional restrictions, such as orthogonality of the factors). Then we estimate the factor loadings of each treated unit, \(\lambda_i\), by minimizing a least squares equation for the treated units’ outcomes in the pre-treatment period, conditional on the factors and coefficients estimated in the first step. Finally, we use the estimated coefficients, factors, and factor loadings in the parametric outcome model to predict counterfactual untreated outcomes in the post-treatment period for the treated units:

\[ \hat{Y}^0_i(t) = x_{it}\hat{\beta} + \hat{\lambda}_i\hat{f}_t. \]

Following this step, the ATT is simply the average (within the treated group only) of the observed post-treatment outcomes minus the predicted untreated outcomes:

\[ ATT(t) = \frac{1}{N_t} \sum_i \left( Y_i(t) - \hat{Y}^0_i(t) \right), \] where the summation iterates over all treated units and \(N_t\) denotes the number of treated units.

The strengths of the method are that the factor structure can address time-varying confounding (unlike diff-in-diff); it incorporates heterogeneous treatment effects (unlike interactive fixed effects); and it can accommodate multiple treated units, observed covariates, and treated units outside the convex hull of the controls (unlike the synthetic control formulation of Abadie, Diamond, and Hainmueller (2010)). The limitations of this method include the requirement for a reasonably long pre-treatment period, the reliance on a parametric model for the outcome, and the lack of obvious safeguards against inappropriate controls (i.e., those that lack common support with the treated units).

# Comparative interrupted time series

A related technique (sometimes described as equivalent to diff-in-diff with multiple time points) is comparative interrupted time series (CITS). The causal assumptions of the two methods are different, however. In CITS, the counterfactual is constructed by 1) fitting linear models to the comparison group’s outcomes in the pre- and post-intervention periods, 2) computing the pre- to post-period changes in the intercepts and slopes, 3) fitting a linear model to the treated group’s outcomes in the pre-intervention period, and 4) assuming the comparison group’s intercept and slope changes computed in step 2) would have held in the treated group in the absence of intervention.
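The four steps can be sketched as follows. This is an illustration under stated assumptions rather than a full CITS analysis: the group-average outcomes are hypothetical and exactly linear, the helper names `fit_line` and `cits_effects` are ours, and each line is fit by ordinary least squares on calendar time.

```python
def fit_line(times, values):
    """Ordinary least squares fit of values on times; returns (intercept, slope)."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
             / sum((t - t_bar) ** 2 for t in times))
    return y_bar - slope * t_bar, slope

def cits_effects(times, control, treated, T0):
    """Observed minus counterfactual treated outcomes in each post period."""
    pre = [i for i, t in enumerate(times) if t <= T0]
    post = [i for i, t in enumerate(times) if t > T0]
    pre_t = [times[i] for i in pre]
    post_t = [times[i] for i in post]

    # 1) Linear fits to the comparison group, pre and post.
    c_int_pre, c_slope_pre = fit_line(pre_t, [control[i] for i in pre])
    c_int_post, c_slope_post = fit_line(post_t, [control[i] for i in post])
    # 2) Pre-to-post changes in intercept and slope.
    d_int = c_int_post - c_int_pre
    d_slope = c_slope_post - c_slope_pre
    # 3) Linear fit to the treated group, pre only.
    t_int_pre, t_slope_pre = fit_line(pre_t, [treated[i] for i in pre])
    # 4) Counterfactual: treated pre-period line shifted by the comparison changes.
    counterfactual = [(t_int_pre + d_int) + (t_slope_pre + d_slope) * t
                      for t in post_t]
    return [treated[i] - c for i, c in zip(post, counterfactual)]

# Hypothetical group means; intervention after T0 = 4, true effect of -3.
times = [1, 2, 3, 4, 5, 6, 7, 8]
control = [11, 12, 13, 14, 19.5, 21, 22.5, 24]
treated = [7, 9, 11, 13, 16.5, 19, 21.5, 24]

effects = cits_effects(times, control, treated, T0=4)
print(effects)
```

In this toy example the pre-period slopes differ between groups (1 for the comparison group, 2 for the treated group), yet the construction still recovers the effect because the comparison group's intercept and slope changes carry over to the treated group by design.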

We highlight some important differences between DID and CITS:

- CITS *does not* require parallel outcome evolution in the treated and comparison groups in the pre-intervention period
- CITS *does* require a linear model to capture the pre- to post-intervention change in the outcome process of the comparison group (which is then assumed to also hold for the treated group’s counterfactual untreated outcomes)

Notice that these are *not* merely differences in modeling. They are differences in the construction of the counterfactual:

- DID assumes the pre-to-post change in the average outcomes of the comparison group would also have been observed in the treated group, absent the intervention.
- DID with a pre-period slope difference assumes the pre-to-post change in the average outcomes of the comparison group *plus* the linearly growing difference observed in the pre period would combine to produce the treated group’s counterfactual outcomes, absent the intervention.
- CITS assumes the pre-to-post change in the intercept and slope of the comparison group would have been observed in the treated group, absent the intervention.

# Parallel paths

Up to this point, identifying the diff-in-diff estimator required parallel trends or some type of counterfactual assumption. Mora and Reggio (2012) and Mora and Reggio (2019) developed a diff-in-diff estimator using an alternative assumption called the parallel growth assumption. The parallel growth assumption essentially requires that the derivatives of the paths are parallel. For example, imagine we have two linear functions — \(f(x) = 2x\) and \(g(x) = 3x\). These functions are not parallel, but their derivatives, \(f'(x) = 2\) and \(g'(x) = 3\), are.

Imagine the following scenario with five time points in which treatment is administered after time 3. The trajectory of the untreated outcomes for the control group (all observed by the consistency assumption) are shown with the orange dotted-dashed line. The trajectory of the untreated outcomes for the treated (observed up to the 3rd time point) is shown in the blue solid line.

Clearly, this scenario violates the parallel trends assumption. In fact, under parallel trends, the imputed counterfactual untreated outcomes for the treated would deviate from their true trajectory, shown by the dashed line below.

In this example, any inference using diff-in-diff methods based on parallel trends will be biased. However, in this particular case, Mora and Reggio (2012) showed that the treatment effect is identified under an alternative assumption, the parallel growth assumption. This assumption is similar to parallel trends except that we require the derivatives of the trajectories to be parallel. In the above graph, the two trajectories are not parallel, but their derivatives are! Both are straight lines with constant derivatives. Since the derivatives are parallel, the authors show that the treatment effect is identified. While interesting in theory, it is difficult enough to justify parallel trends using our original data (i.e., not derivatives or lagged differences), and it’s hard to imagine when we could be confident in parallel trends on the derivatives but not on the original scale. However, this paper is useful in understanding the role of the underlying diff-in-diff assumptions and which other assumptions may make it possible to obtain similar quantities, especially in the case where parallel trends fails.
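A toy calculation along the lines of the scenario above may help. The numbers are hypothetical (untreated paths \(2t\) for the comparison group and \(3t\) for the treated group, a constant effect of 5 after time 3), and the two functions are our own simplified reading of the assumptions, not the estimator as implemented by Mora and Reggio. Under parallel growth, the treated group's last pre-period growth is carried forward and adjusted by the change in the comparison group's growth.

```python
def y(series, t):
    """Outcome at time t, with times starting at 1."""
    return series[t - 1]

def parallel_trends_estimate(treated, control, t, T0):
    """Standard diff-in-diff using T0 as the baseline period."""
    return (y(treated, t) - y(treated, T0)) - (y(control, t) - y(control, T0))

def parallel_growth_estimate(treated, control, t, T0):
    """Impute the treated group's untreated outcome by carrying forward its
    last pre-period growth plus the change in the comparison group's growth."""
    y0 = y(treated, T0)
    treated_growth = y(treated, T0) - y(treated, T0 - 1)
    control_base_growth = y(control, T0) - y(control, T0 - 1)
    for s in range(T0 + 1, t + 1):
        control_growth = y(control, s) - y(control, s - 1)
        y0 += treated_growth + (control_growth - control_base_growth)
    return y(treated, t) - y0

T0 = 3
effect = 5
control = [2 * t for t in range(1, 6)]             # untreated path 2t
treated_untreated = [3 * t for t in range(1, 6)]   # untreated path 3t
treated = [v + (effect if t > T0 else 0)
           for t, v in enumerate(treated_untreated, start=1)]

trends = [parallel_trends_estimate(treated, control, t, T0) for t in (4, 5)]
growth = [parallel_growth_estimate(treated, control, t, T0) for t in (4, 5)]
print(trends)  # [6, 7]: biased upward, and the bias grows over time
print(growth)  # [5, 5]: recovers the effect in this toy example
```

The parallel growth calculation works here only because both untreated paths are exactly linear; it inherits the concern raised above about justifying an assumption on the derivatives.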

# Acknowledgments

The bulk of this website was written and edited by Bret Zeldow and Laura Hatfield. Funding for this website was provided by the Laura and John Arnold Foundation. Additional content contributors are Alyssa Bilinski, Carrie Fry, and Sherri Rose. Many thanks to Savannah Bergquist, Austin Denteh, Alex McDowell, Arman Oganisian, Toyya Pujol-Mitchell, and Kathy Swartz for their helpful comments. This website is built using the R package `blogdown` and hosted on Netlify. The design is based on the Kraiklyn Hugo theme.

# References

Abadie, A. (2005). Semiparametric difference-in-differences estimators. *Review of Economic Studies*, *72*, 1–19. https://doi.org/10.1111/0034-6527.00321

Abadie, A., Athey, S., Imbens, G., & Wooldridge, J. (2017). When should you adjust standard errors for clustering? *arXiv:1710.02926 [Econ, Math, Stat]*. Retrieved from http://arxiv.org/abs/1710.02926

Abadie, A., & Cattaneo, M. D. (2018). Econometric methods for program evaluation. *Annual Review of Economics*, *10*(1), 465–503. https://doi.org/10.1146/annurev-economics-080217-053402

Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California’s tobacco control program. *Journal of the American Statistical Association*, *105*, 493–505. https://doi.org/10.1198/jasa.2009.ap08746

Abadie, A., Diamond, A., & Hainmueller, J. (2011). Synth: An R package for synthetic control methods in comparative case studies. *Journal of Statistical Software*, *42*(1), 1–17. https://doi.org/10.18637/jss.v042.i13

Abadie, A., Diamond, A., & Hainmueller, J. (2015). Comparative politics and the synthetic control method. *American Journal of Political Science*, *59*(2), 495–510. https://doi.org/10.1111/ajps.12116

Abadie, A., & Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque country. *American Economic Review*, *93*(1), 113–132. https://doi.org/10.1257/000282803321455188

Ai, C., & Norton, E. C. (2003). Interaction terms in logit and probit models. *Economics Letters*, *80*(1), 123–129. https://doi.org/10.1016/S0165-1765(03)00032-6

Altman, D. G., & Bland, J. M. (1995). Statistics notes: Absence of evidence is not evidence of absence. *BMJ*, *311*(7003), 485. https://doi.org/10.1136/bmj.311.7003.485

Angrist, J. D. (2001). Estimation of limited dependent variable models with dummy endogenous regressors: Simple strategies for empirical practice. *Journal of Business and Economic Statistics*, *18*, 2–28. https://doi.org/10.1198/07350010152472571

Angrist, J. D., & Pischke, J.-S. (2008). *Mostly Harmless Econometrics: An Empiricist’s Companion*. Princeton, NJ: Princeton University Press. Retrieved from http://www.mostlyharmlesseconometrics.com/

Angrist, J., & Pischke, J.-S. (2010). *The credibility revolution in empirical economics: How better research design is taking the con out of econometrics* (No. 15794). Cambridge, MA: National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w15794

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2019). *Synthetic difference in differences*. National Bureau of Economic Research.

Athey, S., Bayati, M., Doudchenko, N., Imbens, G., & Khosravi, K. (2017). Matrix completion methods for causal panel data models. *arXiv:1710.10251 [Econ, Math, Stat]*. Retrieved from http://arxiv.org/abs/1710.10251

Athey, S., & Imbens, G. (2006). Identification and inference in nonlinear difference-in-differences models. *Econometrica*, *74*(2), 431–497. https://doi.org/10.1111/j.1468-0262.2006.00668.x

Athey, S., & Imbens, G. (2016). The state of applied econometrics - causality and policy evaluation. *arXiv:1607.00699 [Econ, Stat]*. Retrieved from http://arxiv.org/abs/1607.00699

Athey, S., & Imbens, G. (2018). Design-based analysis in difference-in-differences settings with staggered adoption. *arXiv:1808.05293 [Cs, Econ, Math, Stat]*. Retrieved from http://arxiv.org/abs/1808.05293

Bai, J. (2009). Panel data models with interactive fixed effects. *Econometrica*, *77*(4), 1229–1279. https://doi.org/10.3982/ECTA6135

Basu, S., Meghani, A., & Siddiqi, A. (2017). Evaluating the health impact of large-scale public policy changes: Classical and novel approaches. *Annual Review of Public Health*, *38*(1), 351–370. https://doi.org/10.1146/annurev-publhealth-031816-044208

Bauhoff, S. (2014). The effect of school district nutrition policies on dietary intake and overweight: A synthetic control approach. *Economics & Human Biology*, *12*, 45–55. https://doi.org/10.1016/j.ehb.2013.06.001

Ben-Michael, E., Feller, A., & Rothstein, J. (2018). The augmented synthetic control method. *arXiv Preprint arXiv:1811.04170*.

Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? *Quarterly Journal of Economics*, *119*, 249–275. https://doi.org/10.1162/003355304772839588

Bilinski, A., & Hatfield, L. A. (2018). Seeking evidence of absence: Reconsidering tests of model assumptions. *arXiv:1805.03273 [Stat]*. Retrieved from http://arxiv.org/abs/1805.03273

Blundell, R., & Dias, M. C. (2009). Alternative approaches to evaluation in empirical microeconomics. *Journal of Human Resources*, *44*(3), 565–640. https://doi.org/10.3368/jhr.44.3.565

Bonhomme, S., & Sauder, U. (2011). Recovering distributions in difference-in-differences models: A comparison of selective and comprehensive schooling. *The Review of Economics and Statistics*, *93*, 479–494. https://doi.org/10.1162/REST_a_00164

Brown, T. T., & Atal, J. P. (2018). How robust are reference pricing studies on outpatient medical procedures? Three different preprocessing techniques applied to difference-in differences. *Health Economics*. https://doi.org/10.1002/hec.3841

Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. *Journal of Human Resources*, *50*(2), 317–372. https://doi.org/10.3368/jhr.50.2.317

Chabé-Ferret, S. (2015). Analysis of the bias of matching and difference-in-difference under alternative earnings and selection processes. *Journal of Econometrics*, *185*(1), 110–123. https://doi.org/10.1016/j.jeconom.2014.09.013

Chabé-Ferret, S. (2017). *Should we combine difference in differences with conditioning on pre-treatment outcomes?* (No. 17-824). Toulouse School of Economics. Retrieved from https://www.tse-fr.eu/publications/should-we-combine-difference-differences-conditioning-pre-treatment-outcomes

Chernozhukov, V., Wuthrich, K., & Zhu, Y. (2017). An exact and robust conformal inference method for counterfactual and synthetic controls. *arXiv:1712.09089 [Econ, Stat]*. Retrieved from http://arxiv.org/abs/1712.09089

Conley, T. G., & Taber, C. R. (2011). Inference with “difference in differences” with a small number of policy changes. *The Review of Economics and Statistics*, *93*, 113–125. https://doi.org/10.1162/REST_a_00049

Daw, J. R., & Hatfield, L. A. (2018a). Matching and regression-to-the-mean in difference-in-differences analysis. *Health Services Research*, *53*(6), 4138–4156. https://doi.org/10.1111/1475-6773.12993

Daw, J. R., & Hatfield, L. A. (2018b). Matching in difference-in-differences: Between a rock and a hard place. *Health Services Research*, *53*(6), 4111–4117. https://doi.org/10.1111/1475-6773.13017

Dimick, J. B., & Ryan, A. M. (2014). Methods for evaluating changes in health care policy: The difference-in-differences approach. *JAMA*, *312*, 2401–2402. https://doi.org/10.1001/jama.2014.16153

Ding, P., & Li, F. (2019). A bracketing relationship between difference-in-differences and lagged-dependent-variable adjustment. *Political Analysis*, *27*(4), 605–615. https://doi.org/10.1017/pan.2019.25

Donald, S. G., & Lang, K. (2007). Inference with difference-in-differences and other panel data. *The Review of Economics and Statistics*, *89*, 221–233. https://doi.org/10.1162/rest.89.2.221

Doudchenko, N., & Imbens, G. W. (2016). *Balancing, regression, difference-in-differences and synthetic control methods: A synthesis* (No. 22791). Cambridge, MA: National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w22791

Dube, A., & Zipperer, B. (2015). *Pooling multiple case studies using synthetic controls: An application to minimum wage policies* (No. IZA DP 8944) (p. 60). Bonn, Germany: Institute for the Study of Labor. Retrieved from https://www.iza.org/publications/dp/8944

Ferman, B., & Pinto, C. (2016a). *Revisiting the synthetic control estimator* (No. 86495) (p. 56). Munich: MPRA. Retrieved from https://mpra.ub.uni-muenchen.de/86495/

Ferman, B., & Pinto, C. (2016b). Synthetic controls with imperfect pre-treatment fit.

Ferman, B., Pinto, C., & Possebom, V. (2017). *Cherry picking with synthetic controls* (No. 78213) (p. 56). Retrieved from https://mpra.ub.uni-muenchen.de/78213/

Fretheim, A., Zhang, F., Ross-Degnan, D., Oxman, A. D., Cheyne, H., Foy, R., … Soumerai, S. B. (2015). A reanalysis of cluster randomized trials showed interrupted time-series studies were valuable in health system evaluation. *Journal of Clinical Epidemiology*, *68*(3), 324–333. https://doi.org/10.1016/j.jclinepi.2014.10.003

Freyaldenhoven, S., Hansen, C., & Shapiro, J. M. (2018). *Pre-event trends in the panel event-study design* (No. 24565). Cambridge, MA: National Bureau of Economic Research. Retrieved from http://www.nber.org/papers/w24565

Gaibulloev, K., Sandler, T., & Sul, D. (n.d.). Dynamic panel analysis under cross-sectional dependence. *Political Analysis*, *22*(2), 258–273. https://doi.org/10.1093/pan/mpt029

Glymour, M. M., Weuve, J., Berkman, L. F., Kawachi, I., & Robins, J. M. (2005). When is baseline adjustment useful in analyses of change? An example with education and cognitive change. *American Journal of Epidemiology*, *162*(3), 267–278. https://doi.org/10.1093/aje/kwi187

Gobillon, L., & Magnac, T. (2015). Regional policy evaluation: Interactive fixed effects and synthetic controls. *The Review of Economics and Statistics*, *98*(3), 535–551. https://doi.org/10.1162/REST_a_00537

Goodman-Bacon, A. (2018). *Difference-in-differences with variation in treatment timing* (No. 25018). National Bureau of Economic Research. Retrieved from https://www.nber.org/papers/w25018

Greenaway-McGrevy, R., Han, C., & Sul, D. (2012). Asymptotic distribution of factor augmented estimators for panel regression. *Journal of Econometrics*, *169*(1), 48–53. https://doi.org/10.1016/j.jeconom.2012.01.003

Greene, W. (2004). The behaviour of the maximum likelihood estimator of limited dependent variable models in the presence of fixed effects. *The Econometrics Journal*, *7*(1), 98–119. https://doi.org/10.1111/j.1368-423X.2004.00123.x

Greene, W. (2010). Testing hypotheses about interaction terms in nonlinear models. *Economics Letters*, *107*(2), 291–296. https://doi.org/10.1016/j.econlet.2010.02.014

Hahn, J., & Shi, R. (2017). Synthetic control and inference. *Econometrics*, *5*(4), 52. https://doi.org/10.3390/econometrics5040052

Han, B., Yu, H., & Friedberg, M. W. (2017). Evaluating the impact of parent-reported medical home status on children’s health care utilization, expenditures, and quality: A difference-in-differences analysis with causal inference methods. *Health Services Research*, *52*, 786–806. https://doi.org/10.1111/1475-6773.12512

Hartman, E., & Hidalgo, F. D. (2018). An equivalence approach to balance and placebo tests. *American Journal of Political Science*. https://doi.org/10.1111/ajps.12387

Imai, K., & Kim, I. S. (n.d.). When should we use fixed effects regression models for causal inference with longitudinal data? *American Journal of Political Science*. Retrieved from http://web.mit.edu/insong/www/pdf/FEmatch.pdf

Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. *Econometrica*, *62*, 467–475. https://doi.org/10.2307/2951620

Kahn-Lang, A., & Lang, K. (2018). *The promise and pitfalls of differences-in-differences: Reflections on “16 and Pregnant” and other applications* (No. 24857). Cambridge, MA: National Bureau of Economic Research. https://doi.org/10.3386/w24857

Karaca-Mandic, P., Norton, E. C., & Dowd, B. (2012). Interaction terms in nonlinear models. *Health Services Research*, *47*(1pt1), 255–274. https://doi.org/10.1111/j.1475-6773.2011.01314.x

Kaul, A., Kloßner, S., Pfeifer, G., & Schieler, M. (2015). *Synthetic control methods: Never use all pre-intervention outcomes together with covariates* (No. 83790) (p. 24). Retrieved from https://mpra.ub.uni-muenchen.de/id/eprint/83790

King, G., & Zeng, L. (2006). The dangers of extreme counterfactuals. *Political Analysis*, *14*(2), 131–159. https://doi.org/10.1093/pan/mpj004

Kinn, D. (2018). Synthetic control methods and big data. *arXiv:1803.00096 [Econ]*. Retrieved from http://arxiv.org/abs/1803.00096

Kreif, N., Grieve, R., Hangartner, D., Turner, A. J., Nikolova, S., & Sutton, M. (2016). Examination of the synthetic control method for evaluating health policies with multiple treated units. *Health Economics*, *25*, 1514–1528. https://doi.org/10.1002/hec.3258

Kropko, J., & Kubinec, R. (2018). *Why the two-way fixed effects model is difficult to interpret, and what to do about it*. Retrieved from https://ssrn.com/abstract=3062619

Lechner, M., & others. (2011). The estimation of causal effects by difference-in-difference methods. *Foundations and Trends in Econometrics*, *4*(3), 165–224.

Li, F., & Li, F. (2019). Double-robust estimation in difference-in-differences with an application to traffic safety evaluation. *Observational Studies*, *5*, 1–20.

Lindner, S., & McConnell, K. J. (2018). Difference-in-differences and matching on outcomes: A tale of two unobservables. *Health Services and Outcomes Research Methodology*. https://doi.org/10.1007/s10742-018-0189-0

Lipsitch, M., Tchetgen, E. T., & Cohen, T. (2010). Negative controls: A tool for detecting confounding and bias in observational studies. *Epidemiology*, *21*(3), 383–388. https://doi.org/10.1097/EDE.0b013e3181d61eeb

Lopez Bernal, J., Soumerai, S., & Gasparrini, A. (2018). A methodological framework for model selection in interrupted time series studies. *Journal of Clinical Epidemiology*. https://doi.org/10.1016/j.jclinepi.2018.05.026

Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. *Statistics in Medicine*, *23*(19), 2937–2960.

MacKinnon, J. G., & Webb, M. D. (2018). *Randomization inference for difference-in-differences with few treated clusters* (No. 1355) (p. 39). Kingston, Ontario: Queen’s University. Retrieved from http://qed.econ.queensu.ca/working_papers/papers/qed_wp_1355.pdf

McKenzie, D. (2012). Beyond baseline and follow-up: The case for more T in experiments. *Journal of Development Economics*, *99*(2), 210–221. https://doi.org/10.1016/j.jdeveco.2012.01.002

McWilliams, J. M., Landon, B. E., Chernew, M. E., & Zaslavsky, A. M. (2014). Changes in patients’ experiences in Medicare accountable care organizations. *The New England Journal of Medicine*, *371*, 1715–1724. https://doi.org/10.1056/NEJMsa1406552

Meyer, B. D. (1995). Natural and quasi-experiments in economics. *Journal of Business & Economic Statistics*, *13*(2), 151–161. https://doi.org/10.2307/1392369

Mood, C. (2010). Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It. *European Sociological Review*, *26*(1), 67–82. https://doi.org/10.1093/esr/jcp006

Moon, H. R., & Weidner, M. (2015). Linear regression for panel with unknown number of factors as interactive fixed effects. *Econometrica*, *83*(4), 1543–1579. https://doi.org/10.3982/ECTA9382

Mora, R., & Reggio, I. (2012). *Treatment effect identification using alternative parallel assumptions* (No. Working Paper 12-33). Madrid: Universidad Carlos III de Madrid. Retrieved from http://hdl.handle.net/10016/16065

Mora, R., & Reggio, I. (2019). Alternative diff-in-diffs estimators with several pretreatment periods. *Econometric Reviews*, *38*(5), 465–486.

Mummolo, J., & Peterson, E. (2018). Improving the interpretation of fixed effects regression results. *Political Science Research and Methods*, 1–7. https://doi.org/10.1017/psrm.2017.44

O’Neill, S., Kreif, N., Grieve, R., Sutton, M., & Sekhon, J. S. (2016). Estimating causal effects: Considering three alternatives to difference-in-differences estimation. *Health Services and Outcomes Research Methodology*, *16*, 1–21. https://doi.org/10.1007/s10742-016-0146-8

Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. *Econometrica*, *74*(4), 967–1012. Retrieved from http://www.jstor.org/stable/3805914

Powell, D. (2018). *Imperfect synthetic controls: Did the Massachusetts health care reform save lives?* (No. WR-1246) (p. 44). Santa Monica, CA: RAND Labor & Population. Retrieved from www.rand.org/pubs/working_papers/WR1246.html

Puhani, P. A. (2012). The treatment effect, the cross difference, and the interaction term in nonlinear “difference-in-differences” models. *Economics Letters*, *115*(1), 85–87. https://doi.org/10.1016/j.econlet.2011.11.025

Pustejovsky, J. E., & Tipton, E. (2018). Small-sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. *Journal of Business & Economic Statistics*, *36*(4), 672–683. https://doi.org/10.1080/07350015.2016.1247004

Rambachan, A., & Roth, J. (2019). An honest approach to parallel trends. *(Working Paper)*.

Reese, S., & Westerlund, J. (2018). Estimation of factor-augmented panel regressions with weakly influential factors. *Econometric Reviews*, *37*(5), 401–465. https://doi.org/10.1080/07474938.2015.1106758

Robbins, M. W., Saunders, J., & Kilmer, B. (2017). A framework for synthetic control methods with high-dimensional, micro-level data: Evaluating a neighborhood-specific crime intervention. *Journal of the American Statistical Association*, *112*(517), 109–126. https://doi.org/10.1080/01621459.2016.1213634

Rokicki, S., Cohen, J., Fink, G., Salomon, J. A., & Landrum, M. B. (2018). Inference with difference-in-differences with a small number of groups: A review, simulation study, and empirical application using SHARE data. *Medical Care*, *56*, 97–105. https://doi.org/10.1097/MLR.0000000000000830

Roth, J. (2018). Should we adjust for the test for pre-trends in difference-in-difference designs? *arXiv:1804.01208 [Econ, Math, Stat]*. Retrieved from http://arxiv.org/abs/1804.01208

Ryan, A. M. (2018). Well-balanced or too matchy-matchy? The controversy over matching in difference-in-differences. *Health Services Research*, *53*(6), 4106–4110. https://doi.org/10.1111/1475-6773.13015

Ryan, A. M., Burgess, J. F., & Dimick, J. B. (2015). Why we should not be indifferent to specification choices for difference-in-differences. *Health Services Research*. https://doi.org/10.1111/1475-6773.12270

Samartsidis, P., Seaman, S. R., Presanis, A. M., Hickman, M., & De Angelis, D. (2018). Review of methods for assessing the causal effect of binary interventions from aggregate time-series observational data. *arXiv:1804.07683v1 [stat.AP]*. Retrieved from https://arxiv.org/abs/1804.07683

Sant’Anna, P. H. C., & Zhao, J. B. (2018). *Doubly robust difference-in-differences estimators* (SSRN Scholarly Paper No. ID 3293315). Rochester, NY: Social Science Research Network. Retrieved from https://papers.ssrn.com/abstract=3293315

Sofer, T., Richardson, D. B., Colicino, E., Schwartz, J., & Tchetgen Tchetgen, E. J. (2016). On negative outcome control of unobserved confounding as a generalization of difference-in-differences. *Statistical Science*, *31*(3), 348–361. https://doi.org/10.1214/16-STS558

Stuart, E. A., Huskamp, H. A., Duckworth, K., Simmons, J., Song, Z., Chernew, M. E., & Barry, C. L. (2014). Using propensity scores in difference-in-differences models to estimate the effects of a policy change. *Health Services and Outcomes Research Methodology*, *14*(4), 166–182. https://doi.org/10.1007/s10742-014-0123-z

Van der Laan, M. J., & Robins, J. M. (2003). *Unified methods for censored longitudinal data and causality*. Springer Science & Business Media.

Van der Laan, M. J., & Rose, S. (2011). *Targeted learning: Causal inference for observational and experimental data*. Springer Science & Business Media.

VanderWeele, T. J., & Shpitser, I. (2013). On the definition of a confounder. *The Annals of Statistics*, *41*(1), 196–220. https://doi.org/10.1214/12-AOS1058

Wing, C., Simon, K., & Bello-Gomez, R. A. (2018). Designing difference in difference studies: Best practices for public health policy research. *Annual Review of Public Health*, *39*(1), 453–469. https://doi.org/10.1146/annurev-publhealth-040617-013507

Xu, Y. (2017). Generalized synthetic control method: Causal inference with interactive fixed effects models. *Political Analysis*, *25*(1), 57–76. https://doi.org/10.1017/pan.2016.2

Zeldow, B., & Hatfield, L. A. (2019). Confounding and regression adjustment in difference-in-differences. *arXiv Preprint arXiv:1911.12185*.

© 2019 Bret Zeldow and Laura Hatfield