What’s Trending in Difference-in-Differences?
A Synthesis of the Recent Econometrics Literature
Jonathan Roth (Brown University, jonathanroth@brown.edu)
Pedro H. C. Sant'Anna (Microsoft and Vanderbilt University, pedro.h.santanna@vanderbilt.edu)
Alyssa Bilinski (Brown University, alyssa_bilinski@brown.edu)
John Poe (University of Michigan, john@johndavidpoe.com)
January 9, 2023
We thank Brant Callaway, Bruno Ferman, Andreas Hagemann, Kevin Lang, David McKenzie, and David Schönholzer for helpful comments, and Scott Barkowski for suggesting the title.
Abstract
This paper synthesizes recent advances in the econometrics of difference-in-differences
(DiD) and provides concrete recommendations for practitioners. We begin by articu-
lating a simple set of “canonical” assumptions under which the econometrics of DiD are
well-understood. We then argue that recent advances in DiD methods can be broadly
classified as relaxing some components of the canonical DiD setup, with a focus on
(i) multiple periods and variation in treatment timing, (ii) potential violations of par-
allel trends, or (iii) alternative frameworks for inference. Our discussion highlights
the different ways that the DiD literature has advanced beyond the canonical model,
and helps to clarify when each of the papers will be relevant for empirical work. We
conclude by discussing some promising areas for future research.
1 Introduction
Difference-in-differences (DiD) is one of the most popular methods in the social sciences for
estimating causal effects in non-experimental settings. The last few years have seen a dizzy-
ing array of new methodological papers on DiD and related designs, making it challenging
for practitioners to keep up with rapidly evolving best practices. Furthermore, the recent
literature has addressed a variety of different components of DiD analyses, which has made
it difficult even for experts in the field to understand how all of the new developments fit
together. In this paper, we attempt to synthesize some of the recent advances on DiD and
related designs and to provide concrete recommendations for practitioners.
Our starting point in Section 2 is the “canonical” difference-in-differences model, where
two time periods are available, there is a treated population of units that receives a treatment
of interest beginning in the second period, and a comparison population that does not receive
the treatment in either period. The key identifying assumption is that the average outcome
among the treated and comparison populations would have followed “parallel trends” in
the absence of treatment. We also assume that the treatment has no causal effect before
its implementation (no anticipation). Together, these assumptions allow us to identify the
average treatment effect on the treated (ATT). If we observe a large number of independent
clusters from the treated and comparison populations, the ATT can be consistently estimated
using a two-way fixed effects (TWFE) regression specification, and clustered standard errors
provide asymptotically valid inference.
In practice, DiD applications typically do not meet all of the requirements of the canonical
DiD setup. The recent wave of DiD papers have each typically focused on relaxing one or
two of the key assumptions in the canonical framework while preserving the others. We
taxonomize the recent DiD literature by characterizing which of the key assumptions in
the canonical model are relaxed. We focus on recent advances that (i) allow for multiple
periods and variation in treatment timing (Section 3); (ii) consider potential violations of
parallel trends (Section 4); or (iii) depart from the assumption of observing a sample of
many independent clusters sampled from a super-population (Section 5). Section 6 briefly
summarizes some other areas of innovation. In the remainder of the Introduction, we briefly
describe each of these strands of literature.
Multiple periods and variation in treatment timing: One strand of the DiD
literature has focused on settings where there are more than two time periods and units are
treated at different points in time. Multiple authors have noted that the coefficients from
standard TWFE models may not represent a straightforward weighted average of unit-level
treatment effects when treatment effects are allowed to be heterogeneous. In short, TWFE
regressions make both “clean” comparisons between treated and not-yet-treated units and
“forbidden” comparisons between units that are both already treated. When treatment
effects are heterogeneous, these “forbidden” comparisons potentially lead to severe drawbacks
such as TWFE coefficients having the opposite sign of all individual-level treatment effects
due to “negative weighting” problems. Even if all of the weights are positive, the weights
“chosen” by TWFE regressions may not correspond with the most policy-relevant parameter.
We discuss a variety of straightforward-to-implement strategies that have been proposed
to bypass the limitations associated with TWFE regressions and estimate causal parameters
of interest under rich sources of treatment effect heterogeneity. These procedures rely on
generalizations of the parallel trends assumption to the multi-period setting. A common
theme is that these new estimators isolate “clean” comparisons between treated and not-yet-
treated groups, and then aggregate them using user-specified weights to estimate a target
parameter of economic interest. We discuss differences between some of the recent proposals,
such as the exact comparison group used and the generalization of the parallel trends
assumption needed for validity, and provide concrete recommendations for practitioners.
We also briefly discuss extensions to more complicated settings such as when treatments
turn on-and-off over time or are non-binary.
Non-parallel trends: A second strand of the DiD literature focuses on the possibility
that the parallel trends assumption may be violated. One set of papers considers the set-
ting where parallel trends holds only conditional on observed covariates, and proposes new
estimators that are valid under a conditional parallel trends assumption. However, even if
one conditions on observable covariates, there are often concerns that the necessary parallel
trends assumption may be violated due to time-varying unobserved confounding factors. It
is therefore common practice to test for pre-treatment differences in trends (“pre-trends”)
as a test of the plausibility of the (conditional) parallel trends assumption. While intuitive,
researchers have identified at least three issues with this pre-testing approach. First, the
absence of a significant pre-trend does not necessarily imply that parallel trends holds; in
fact, these tests often have low power. Second, conditioning the analysis on the result of
a pre-test can introduce additional statistical distortions from a selection effect known as
pre-test bias. Third, if a significant difference in trends is detected, we may still wish to
learn something about the treatment effect of interest.
Several recent papers have therefore suggested alternative methods for settings where
there is concern that parallel trends may be violated. One class of solutions involves modifi-
cations to the common practice of pre-trends testing to ensure that the power of pre-tests is
high against relevant violations of parallel trends. A second class of solutions has proposed
methods that remain valid under certain types of violations of parallel trends, such as when
the post-treatment violation of parallel trends is assumed to be no larger than the maximal
pre-treatment violation of parallel trends, or when there are non-treated groups that are
known to be more or less affected by the confounding factors than the treated group. These approaches
allow for a variety of robustness and sensitivity analyses which are useful in a wide range of
empirical settings, and we discuss them in detail.
Alternative sampling assumptions: A third strand of the DiD literature discusses
alternatives to the classical “sampling-based” approach to inference with a large number of
clusters. One topic of interest is inference procedures in settings with a small number of
treated clusters. Standard cluster-robust methods assume that there is a large number of
both treated and untreated clusters, and thus can perform poorly in this case. A variety of
alternatives with better properties have been proposed for this case, including permutation
and bootstrap procedures. These methods typically either model the dependence of errors
across clusters, or alternatively place restrictions on the treatment assignment mechanism.
We briefly highlight these approaches and discuss the different assumptions needed for them
to perform well.
Another direction that has been explored relates to conducting “design-based” inference
for DiD. Canonical approaches to inference suppose that we have access to a sample of
independently-drawn clusters from an infinite super-population. However, it is not always
clear how to define the super-population, or to determine the appropriate level of clustering.
Design-based approaches address these issues by instead treating the source of randomness
in the data as coming from the stochastic assignment of treatment, rather than sampling
from an infinite super-population. Although design-based approaches have typically been
employed in the case of randomized experiments, recent work has extended this to the case of
“quasi-experimental” strategies like DiD. Luckily, the message of this literature is positive, in
the sense that methods that are valid from the canonical sampling-based view are typically
also valid from the design-based view. The design-based approach also yields the
clear recommendation that it is appropriate to cluster standard errors at the level at which
treatment is independently assigned.
Other topics: We conclude by briefly touching on some other areas of focus within the
DiD literature, as well as highlighting some areas for future research. Examples include using
DiD to estimate distributional treatment effects; settings with quasi-random treatment tim-
ing; spillover effects; estimating heterogeneous treatment effects; and connections between
DiD and other panel data methods.
Overall, the growing DiD econometrics literature emphasizes the importance of clarity
and precision in a researcher’s discussion of his or her assumptions, comparison group and
time frame selection, causal estimands, estimation methods, and robustness checks. When
used in combination with context-specific information, these new methods can both improve
the validity and interpretability of DiD results and more clearly delineate their limitations.
Given the vast literature on DiD, our goal is not to be comprehensive, but rather to give a
clean presentation of some of the most important directions the literature has gone. Wherever
possible, we try to give clear practical guidance for applied researchers, concluding each
section with concrete recommendations. For reference, we include
Table 1, which contains a checklist for a practitioner implementing a DiD analysis, and Table
2, which lists R and Stata packages for implementing many of the methods described in this
paper.
2 The Basic Model
This section describes a simple two-period setting in which the econometrics of DiD are well-
understood. Although this “canonical” setting is arguably too simple for most applications,
clearly articulating the assumptions in this setup serves as a useful baseline for understanding
recent innovations in the DiD literature.
2.1 Treatment Assignment and Timing
Consider a model in which there are two time periods, $t = 1, 2$. Units indexed by $i$ are
drawn from one of two populations. Units from the treated population ($D_i = 1$) receive a
treatment of interest between period $t = 1$ and $t = 2$, whereas units from the untreated
(a.k.a. comparison or control) population ($D_i = 0$) remain untreated in both time periods.
The econometrician observes an outcome $Y_{i,t}$ and treatment status $D_i$ for a panel of units,
$i = 1, \ldots, N$ and $t = 1, 2$. For example, $Y_{i,t}$ could be the fraction of people with insurance
coverage in state $i$ in year $t$, while $D_i$ could be an indicator for whether the state expanded
Medicaid in year 2. Although DiD methods also accommodate the case where only repeated
cross-sectional data is available, or where the panel is unbalanced, we focus on the simpler
setup with balanced panel data for ease of exposition.
2.2 Potential Outcomes and Target Parameter
We adopt a potential outcomes framework for the observed outcome, as in, e.g., Rubin
(1974) and Robins (1986). Let $Y_{i,t}(0,0)$ denote unit $i$'s potential outcome in period $t$ if $i$
remains untreated in both periods. Likewise, let $Y_{i,t}(0,1)$ denote unit $i$'s potential outcome
in period $t$ if $i$ is untreated in the first period but exposed to treatment by the second period.
To simplify notation we will write $Y_{i,t}(0) = Y_{i,t}(0,0)$ and $Y_{i,t}(1) = Y_{i,t}(0,1)$, but it will be
useful for our later discussion to make clear that these potential outcomes in fact correspond
with a path of treatments. As is usually the case, due to the fundamental problem of causal
inference (Holland, 1986), we only observe one of the two potential outcomes for each unit
$i$. That is, the observed outcome is given by $Y_{i,t} = D_i Y_{i,t}(1) + (1 - D_i) Y_{i,t}(0)$. This potential
outcomes framework implicitly encodes the stable unit treatment value assumption (SUTVA)
that unit $i$'s outcomes do not depend on the treatment status of unit $j \neq i$, which rules out
spillover and general equilibrium effects.
The causal estimand of primary interest in the canonical DiD setup is the average treat-
ment effect on the treated (ATT) in period $t = 2$,

  $\tau_2 = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1]$.

It simply measures the average causal effect on treated units in the period that they are
treated ($t = 2$). In our motivating example, $\tau_2$ would be the average effect of Medicaid
expansion on insurance coverage in period 2 for the states who expanded Medicaid.
2.3 The Parallel Trends Assumption and Identification
The challenge in identifying $\tau_2$ is that the untreated potential outcomes, $Y_{i,2}(0)$, are never
observed for the treated group ($D_i = 1$). Difference-in-differences methods overcome this
identification challenge via assumptions that allow us to impute the mean counterfactual
untreated outcomes for the treated group by using (a) the change in outcomes for the un-
treated group and (b) the baseline outcomes for the treated group. The key assumption for
identifying $\tau_2$ is the parallel trends assumption, which intuitively states that the average out-
come for the treated and untreated populations would have evolved in parallel if treatment
had not occurred.
Assumption 1 (Parallel Trends).

  $E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 1] = E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0]$.  (1)
In our motivating example, the parallel trends assumption says that the average change in
insurance coverage for expansion and non-expansion states would have been the same in the
absence of the Medicaid expansion.
The parallel trends assumption can be rationalized by imposing a particular generative
model for the untreated potential outcomes. If $Y_{i,t}(0) = \alpha_i + \phi_t + \epsilon_{i,t}$, where $\epsilon_{i,t}$ is mean-
independent of $D_i$, then Assumption 1 holds. Note that this model allows treatment to be
assigned non-randomly based on characteristics that affect the level of the outcome ($\alpha_i$), but
requires the treatment assignment to be mean-independent of variables that affect the trend
in the outcome ($\epsilon_{i,t}$). In other words, parallel trends allows for the presence of selection bias,
but the bias from selecting into treatment must be the same in period $t = 1$ as it is in period
$t = 2$.
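To see concretely why this generative model implies Assumption 1, note that the unit effect $\alpha_i$ differences out and the time effect $\phi_t$ is common to both groups (this one-line verification is ours, spelling out the claim above):

  $E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = d] = (\phi_2 - \phi_1) + E[\epsilon_{i,2} - \epsilon_{i,1} \mid D_i = d] = (\phi_2 - \phi_1) + E[\epsilon_{i,2} - \epsilon_{i,1}]$,

where the last equality uses the mean-independence of $\epsilon_{i,t}$ from $D_i$. The right-hand side does not depend on $d$, so the two sides of equation (1) coincide.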
Another important and often hidden assumption required for identification of $\tau_2$ is the
no-anticipation assumption, which states that the treatment has no causal effect prior to its
implementation. This is important for identification of $\tau_2$, since otherwise the changes in the
outcome for the treated group between period 1 and 2 could reflect not just the causal effect
in period $t = 2$ but also the anticipatory effect in period $t = 1$ (Abbring and van den Berg,
2003; Malani and Reif, 2015).

Assumption 2 (No anticipatory effects). $Y_{i,1}(0) = Y_{i,1}(1)$ for all $i$ with $D_i = 1$.

In our ongoing example, this implies that in years prior to Medicaid expansion, insurance
coverage in states that expanded Medicaid was not affected by the upcoming Medicaid
expansion.
Under the parallel trends and no-anticipation assumptions, the ATT in period 2 ($\tau_2$) is
identified. To see why this is the case, observe that by re-arranging terms in the parallel
trends assumption (see equation (1)), we obtain

  $E[Y_{i,2}(0) \mid D_i = 1] = E[Y_{i,1}(0) \mid D_i = 1] + E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0]$.
Further, by the no-anticipation assumption, $E[Y_{i,1}(0) \mid D_i = 1] = E[Y_{i,1}(1) \mid D_i = 1]$.[1] It
follows that
  $E[Y_{i,2}(0) \mid D_i = 1] = E[Y_{i,1}(1) \mid D_i = 1] + E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0]$
                           $= E[Y_{i,1} \mid D_i = 1] + E[Y_{i,2} - Y_{i,1} \mid D_i = 0]$,
where the second equality uses the fact that we observe $Y(1)$ for treated units and $Y(0)$ for
untreated units. The previous display shows that we can infer the counterfactual average
outcome for the treated group by taking its observed pre-treatment mean and adding the
change in mean for the untreated group. Since we observe $Y(1)$ for the treated group directly,
it follows that $\tau_2 = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1]$ is identified as
  $\tau_2 = \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 1]}_{\text{Change for } D_i = 1} - \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 0]}_{\text{Change for } D_i = 0}$,  (2)

i.e. the “difference-in-differences” of population means!
[1] For the identification argument, it suffices to impose only that $E[Y_{i,1}(0) \mid D_i = 1] = E[Y_{i,1}(1) \mid D_i = 1]$ directly, i.e. that there is no anticipation on average, which is slightly weaker than Assumption 2. We focus on Assumption 2 for ease of exposition (especially when we extend it to the staggered case below).
2.4 Estimation and Inference
Equation (2) gives an expression for $\tau_2$ in terms of a “difference-in-differences” of population
expectations. Therefore, a natural way to estimate $\tau_2$ is to replace expectations with their
sample analogs,

  $\hat{\tau}_2 = (\bar{Y}_{t=2,D=1} - \bar{Y}_{t=1,D=1}) - (\bar{Y}_{t=2,D=0} - \bar{Y}_{t=1,D=0})$,

where $\bar{Y}_{t=t',D=d}$ is the sample mean of $Y$ for treatment group $d$ in period $t'$.
Although these sample means could be computed “by hand”, an analogous way of com-
puting $\hat{\tau}_2$, which facilitates the computation of standard errors, is to use the two-way fixed
effects (TWFE) regression specification

  $Y_{i,t} = \alpha_i + \phi_t + (1[t = 2] \cdot D_i)\beta + \epsilon_{i,t}$,  (3)

which regresses the outcome $Y_{i,t}$ on an individual fixed effect, a time fixed effect, and an
interaction of a post-treatment indicator with treatment status.[2] In this canonical DiD
setup, it is straightforward to show that the ordinary least squares (OLS) coefficient $\hat{\beta}$ is
equivalent to $\hat{\tau}_2$.
OLS estimates of $\hat{\beta}$ from (3) provide consistent estimates and asymptotically valid con-
fidence intervals for $\tau_2$ when Assumptions 1 and 2 are combined with the assumption of
independent sampling.

Assumption 3. Let $W_i = (Y_{i,2}, Y_{i,1}, D_i)'$ denote the vector of outcomes and treatment status
for unit $i$. We observe a sample of $N$ i.i.d. draws $W_i \sim F$ for some distribution $F$ satisfying
parallel trends.
Under Assumptions 1-3 and mild regularity conditions,

  $\sqrt{N}(\hat{\beta} - \tau_2) \to_d N(0, \sigma^2)$

in the asymptotics where $N \to \infty$ and $T$ is fixed. The variance $\sigma^2$ is consistently estimable
using standard clustering methods that allow for arbitrary serial correlation at the unit level
(Liang and Zeger, 1986; Arellano, 1987; Wooldridge, 2003; Bertrand, Duflo and Mullainathan,
2004). The same logic easily extends to cases where the observations are individual units
who are members of independently-sampled clusters (e.g. states), and the standard errors
are clustered at the appropriate level, provided that the number of treated and untreated
clusters both grow large. Constructing consistent point estimates and asymptotically valid
confidence intervals is thus straightforward via OLS.

[2] With a balanced panel, the OLS coefficient $\hat{\beta}$ is also numerically identical to the coefficient from a regression that replaces the fixed effects with a constant, a treatment indicator, a second-period indicator, and the treatment × second-period interaction,

  $Y_{i,t} = \alpha + D_i\theta + 1[t = 2]\zeta + (1[t = 2] \cdot D_i)\beta + \varepsilon_{i,t}$.

The latter regression generalizes to repeated cross-sectional data.
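To make these steps concrete, the following is a minimal simulation sketch in Python (our own illustration, not from the paper; the data-generating process and all variable names are hypothetical). It computes $\hat{\tau}_2$ “by hand” as a difference-in-differences of the four group-period sample means, and then via the TWFE regression (3) with standard errors clustered at the unit level:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a balanced two-period panel satisfying parallel trends (true ATT = 1.0)
rng = np.random.default_rng(0)
N = 500
d = rng.binomial(1, 0.5, N)                       # population indicator D_i
df = pd.DataFrame({"unit": np.repeat(np.arange(N), 2),
                   "time": np.tile([1, 2], N),
                   "d": np.repeat(d, 2)})
df["post"] = (df["time"] == 2).astype(int)
df["y"] = (0.5 * df["d"] + 0.3 * df["post"] + 1.0 * df["d"] * df["post"]
           + rng.normal(0, 1, 2 * N))

# (a) "By hand": difference-in-differences of the four group-period sample means
m = df.groupby(["d", "time"])["y"].mean()
tau_hat = (m[1, 2] - m[1, 1]) - (m[0, 2] - m[0, 1])

# (b) Equivalent TWFE regression; cluster standard errors at the unit level
fit = smf.ols("y ~ C(unit) + C(time) + d:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(tau_hat, fit.params["d:post"], fit.bse["d:post"])
```

With a balanced panel, the two point estimates agree exactly, illustrating the numerical equivalence noted above; the regression route additionally delivers the clustered standard error.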
Having introduced all of the components of the “canonical” DiD model, we now discuss
the ways in which different strands of the recent DiD literature have relaxed each of these components.
3 Relaxing assumptions on treatment assignment and
timing
Several recent papers have focused primarily on relaxing the baseline assumptions about
treatment assignment and timing discussed in Section 2. A topic of considerable attention
has been settings where there are more than two periods, and units adopt a treatment
of interest at different points in time. For example, in practice different states expanded
Medicaid in different years. We provide an overview of some of the key developments in the
literature, and refer the reader to the review by de Chaisemartin and D'Haultfœuille (2022)
for additional details.
3.1 Generalized model with staggered treatment adoption
Several recent papers have focused on relaxing the timing assumptions discussed in Section
2, while preserving the remaining structure of the stylized model (i.e., parallel trends, no
anticipation, and independent sampling). Since most of the recent literature considers a
setup in which treatment is an absorbing state, we start with that framework; in Section
3.4, we discuss extensions to the case where treatment can turn on and off or is non-binary.
We introduce the following assumptions and notation, which capture the primary setting
studied in this literature.
Treatment timing. There are $T$ periods indexed by $t = 1, \ldots, T$, and units can receive a
binary treatment of interest in any of the periods $t > 1$. We denote by $D_{i,t}$ an indicator for
whether unit $i$ receives treatment in period $t$, and by $G_i = \min\{t : D_{i,t} = 1\}$ the earliest
period at which unit $i$ has received treatment. If $i$ is never treated during the sample, then
$G_i = \infty$. Treatment is an absorbing state, so that $D_{i,t} = 1$ for all $t \geq G_i$; that is, once a
unit is treated it remains treated for the remainder of the panel. Thus, for example, a state
that first expanded Medicaid in 2014 would have $G_i = 2014$, a state that first expanded
Medicaid in 2015 would have $G_i = 2015$, and a state that has not expanded Medicaid by
time $t = T$ would have $G_i = \infty$.
Potential outcomes. We extend the potential outcomes framework introduced above to
the multi-period setting. Let $0_s$ and $1_s$ denote $s$-dimensional vectors of zeros and ones,
respectively. We denote unit $i$'s potential outcome in period $t$ if they were first treated at
time $g$ by $Y_{i,t}(0_{g-1}, 1_{T-g+1})$, and denote by $Y_{i,t}(0_T)$ their “never-treated” potential outcome.
This notation again makes explicit that potential outcomes can depend on the entire path of
treatment assignments. Since we have assumed that treatment “stays on” once it is turned
on, the entire path of potential outcomes is summarized by the first treatment date ($g$),
and so to simplify notation we can index potential outcomes by treatment starting time:
$Y_{i,t}(g) = Y_{i,t}(0_{g-1}, 1_{T-g+1})$ and $Y_{i,t}(\infty) = Y_{i,t}(0_T)$.[3] Thus, for example, $Y_{i,2016}(2014)$ would
represent the insurance coverage in state $i$ in 2016 if they had first expanded Medicaid in
2014.

[3] If one were to map this staggered potential outcome notation to the one used in canonical DiD setups, we would write $Y_{i,t}(2)$ and $Y_{i,t}(\infty)$ instead of $Y_{i,t}(1)$ and $Y_{i,t}(0)$ as defined in Section 2.2, respectively. We use the $Y_{i,t}(0)$, $Y_{i,t}(1)$ notation in Section 2.2 because it is likely more familiar to the reader, and widely used in the literature on the canonical model.
Parallel trends. There are several ways to extend the canonical parallel trends assumption
to the staggered setting. The simplest extension of the parallel trends assumption to the
staggered case requires that the two-group, two-period version of parallel trends holds for
all combinations of periods and all combinations of “groups” treated at different times.

Assumption 4 (Parallel trends for staggered setting). For all $t \neq t'$ and $g \neq g'$,

  $E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g] = E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g']$.  (4)

This assumption imposes that in the counterfactual where treatment had not occurred, the
average outcomes for all adoption groups would have evolved in parallel. Thus, for example,
Assumption 4 would imply that if there had been no Medicaid expansions, insurance
rates would have evolved in parallel on average for all groups of states that adopted Medicaid
expansion in different years, including those who never expanded Medicaid.
Several variants of Assumption 4 have been considered in the literature. For example,
Callaway and Sant'Anna (2021) consider a relaxation of Assumption 4 that imposes (4) only
for years after some units are treated:
Assumption 4.a (Parallel trends for staggered setting - post-treatment only).

  $E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g] = E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g']$

for all $t, t' \geq g_{\min} - 1$, where $g_{\min} = \min G$ is the first period in which any unit is treated.

This would require, for example, that groups of states that expanded Medicaid at different
times have parallel trends in $Y_{i,t}(\infty)$ after Medicaid expansion began, but does not necessar-
ily impose parallel trends in the pre-treatment period. Likewise, several papers, including
Callaway and Sant'Anna (2021) and Sun and Abraham (2021), consider versions that impose
(4) only for groups that are eventually treated, and not for the never-treated group (i.e.
excluding $g = \infty$). This would impose, for example, parallel trends among states that even-
tually expanded Medicaid, but not between eventually-adopting and never-adopting states.
There are tradeoffs between the different forms of Assumption 4: imposing parallel trends
for all groups and all periods is a stronger assumption and thus may be less plausible; on the
other hand, it may allow one to obtain more precise estimates.[4] We return to these tradeoffs
in our discussion of different estimators for the staggered case below.[5]

[4] If all units are eventually treated, then imposing parallel trends only among treated units also limits the number of periods for which the ATT is identified.

[5] In this paper, we specify the parallel trends assumption based on groups defined by the treatment starting date. It is also possible to adopt alternative definitions of parallel trends using groups that are more disaggregated than our $G$. For instance, one could impose parallel trends for all pairs of states, rather than for groups of states with the same treatment start date. Using a more disaggregated definition of a group strengthens the parallel trends assumption, but could potentially enable more efficient estimators. We focus on the group-level version of parallel trends for simplicity.
No anticipation. The no-anticipation assumption from the canonical model also extends
in a straightforward way to the staggered setting. Intuitively, it imposes that if a unit is
untreated in period $t$, their outcome does not depend on the time period in which they will
be treated in the future; that is, units do not act on knowledge of their future treatment
date before treatment starts.

Assumption 5 (Staggered no-anticipation assumption). $Y_{i,t}(g) = Y_{i,t}(\infty)$ for all $i$ and $t < g$.
3.2 Interpreting the estimand of two-way fixed effects models
Recall that in the simple two-period model, the estimand (population coefficient) of the two-
way fixed effects specification (3) corresponds with the ATT under the parallel trends and
no anticipation assumptions. A substantial focus of the recent literature has been whether
the estimand of commonly-used generalizations of this TWFE model to the multi-period,
staggered timing case have a similar, intuitive causal interpretation. In short, the literature
has shown that the estimand of TWFE specifications in the staggered setting often does
not correspond with an intuitive causal parameter even under the natural extensions of the
parallel trends and no-anticipation assumptions described above.
Static TWFE. We begin with a discussion of the “static” TWFE specification, which
regresses the outcome on individual and period fixed effects and an indicator for whether
unit $i$ is treated in period $t$,

  $Y_{i,t} = \alpha_i + \phi_t + D_{i,t}\beta_{post} + \epsilon_{i,t}$.  (5)
The static specification yields a sensible estimand when there is no heterogeneity in
treatment effects across either time or units. Formally, let $\tau_{i,t}(g) = Y_{i,t}(g) - Y_{i,t}(\infty)$. Suppose
that for all units $i$, $\tau_{i,t}(g) = \tau$ whenever $t \geq g$. This imposes that (1) all units have the
same treatment effect, and (2) the treatment has the same effect regardless of how long it
has been since treatment started. In our ongoing example, this would impose that the effect
of Medicaid expansion on insurance coverage is the same both across states and across time.
Then, under a suitable generalization of the parallel trends assumption (e.g. Assumption 4)
and the no-anticipation assumption (Assumption 5), the population regression coefficient $\beta_{post}$
from (5) is equal to $\tau$.
Issues arise with the static specification, however, when there is heterogeneity of treat-
ment effects over time, as shown in Borusyak and Jaravel (2018), de Chaisemartin and
D'Haultfœuille (2020), and Goodman-Bacon (2021), among others. Suppose first that there
is heterogeneity in time since treatment only. That is, $\tau_{i,t}(g) = \sum_{s \geq 0} \tau_s 1[t - g = s]$, so all
units have treatment effect $\tau_s$ in the $s$-th period after they receive treatment. In this case,
$\beta_{post}$ corresponds with a potentially non-convex weighted average of the parameters $\tau_s$, i.e.
$\beta_{post} = \sum_s \omega_s \tau_s$, where the weights $\omega_s$ sum to 1 but may be negative. The possibility of
negative weights is concerning because, for instance, all of the $\tau_s$ could be positive and yet
the coefficient $\beta_{post}$ may be negative! In particular, longer-run treatment effects will often
receive negative weights. Thus, for example, it is possible that the effect of Medicaid expan-
sion on insurance coverage is positive and grows over time since the expansion, and yet $\beta_{post}$
in (5) will be negative. More generally, if treatment effects vary across both time and units,
then $\tau_{i,t}(g)$ may get negative weight in the TWFE estimand for some combinations of $t$ and
$g$.[6]

[6] We focus in this section on decompositions of the static TWFE model in a standard, sampling-based framework. Athey and Imbens (2022) study the static specification in a finite-sample randomization-based framework.
Goodman-Bacon (2021) provides some helpful intuition to understand this phenomenon.
He shows that $\hat{\beta}_{post}$ can be written as a convex weighted average of difference-in-differences
comparisons between pairs of units and time periods in which one unit changed its treat-
ment status and the other did not. Counterintuitively, however, this decomposition includes
difference-in-differences that use as a “control” group units who were treated in earlier peri-
ods. For example, in 2016, a state that first expanded Medicaid in 2014 might be used as the
“control group” for a state that first adopted Medicaid in 2016. Hence, an early-treated unit
can get negative weights if it appears as a “control” for many later-treated units. This decom-
position further highlights that $\beta_{post}$ may not be a sensible estimand when treatment effects
differ across either units or time, because of its inclusion of these “forbidden comparisons”.[7]

[7] To the best of our knowledge, the phrase “forbidden comparisons” was introduced in Borusyak and Jaravel (2018).
We now give some more mathematical intuition for why weighting issues arise in the static
specification with heterogeneity. From the Frisch-Waugh-Lovell theorem, the coefficient $\beta_{post}$
from (5) is equivalent to the coefficient from a univariate regression of $Y_{i,t}$ on $D_{i,t} - \hat{D}_{i,t}$, where
$\hat{D}_{i,t}$ is the predicted value from a regression of $D_{i,t}$ on the other right-hand side variables in
(5), $D_{i,t} = \tilde{\alpha}_i + \tilde{\phi}_t + u_{i,t}$. However, a well-known issue with OLS with binary outcomes is that
its predictions may fall outside the unit interval. If the predicted value $\hat{D}_{i,t}$ is greater than
1, then $D_{i,t} - \hat{D}_{i,t}$ will be negative even when a unit is treated, and thus that unit's outcome
will get negative weight in $\hat{\beta}_{post}$. To see this more formally, we can apply the formula for
univariate OLS coefficients to obtain that

  $\hat{\beta}_{post} = \dfrac{\sum_{i,t} (D_{i,t} - \hat{D}_{i,t}) Y_{i,t}}{\sum_{i,t} (D_{i,t} - \hat{D}_{i,t})^2}$.  (6)
The denominator is positive, and so the weight that $\hat{\beta}_{post}$ places on $Y_{i,t}$ is proportional to
$D_{i,t} - \hat{D}_{i,t}$. Thus, if $D_{i,t} = 1$ and $D_{i,t} - \hat{D}_{i,t} < 0$, then $\hat{\beta}_{post}$ will be decreasing in $Y_{i,t}$ even
though unit $i$ is treated at period $t$. But because $Y_{i,t} = Y_{i,t}(\infty) + \tau_{i,t}(g)$, it follows that $\tau_{i,t}(g)$
gets negative weight in $\hat{\beta}_{post}$.
These negative weights will tend to arise for early-treated units in periods late in the
sample. To see why this is the case, we note that some algebra shows that $\hat{D}_{i,t} = \bar{D}_i + \bar{D}_t - \bar{D}$,
where $\bar{D}_i = T^{-1} \sum_t D_{i,t}$ is the time average of $D$ for unit $i$, $\bar{D}_t = N^{-1} \sum_i D_{i,t}$ is the cross-
sectional average of $D$ for period $t$, and $\bar{D} = (NT)^{-1} \sum_{i,t} D_{i,t}$ is the average of $D$ across
both periods and units. It follows that if we have a unit that has been treated for almost all
periods ($\bar{D}_i \approx 1$) and a period in which almost all units have been treated ($\bar{D}_t \approx 1$), then
$\hat{D}_{i,t} \approx 2 - \bar{D}$, which will be strictly greater than 1 whenever some unit-periods remain
untreated ($\bar{D} < 1$). We thus see that $\hat{\beta}_{post}$ will tend to put negative
weight on $\tau_{i,t}$ for early adopters in late periods within the sample. This decomposition makes
clear that the static OLS coefficient $\hat{\beta}_{post}$ is not aggregating natural comparisons of units,
and thus will not produce a sensible estimand when there is arbitrary heterogeneity. When
treatment effects are homogeneous (i.e. $\tau_{i,t}(g) = \tau$), the negative weights on $\tau$ for some
units cancel out the positive weights on other units, and thus $\beta_{post}$ recovers the causal effect
under a suitable generalization of parallel trends.
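The following minimal simulation (our own illustration; the cohort dates and effect sizes are chosen purely for exposition) makes the negative-weighting problem concrete: every unit's treatment effect is strictly positive and grows with time since treatment, yet the static TWFE coefficient comes out negative. It also computes the FWL residuals $D_{i,t} - \hat{D}_{i,t}$ to show that early adopters in late periods receive negative weight:

```python
import numpy as np

N, T = 40, 20
units = np.repeat(np.arange(N), T)
time = np.tile(np.arange(1, T + 1), N)
# Two adoption cohorts: half the units adopt at t=2, half at t=11 (no never-treated group)
g = np.where(np.arange(N) < N // 2, 2, 11)[units]
d = (time >= g).astype(float)
tau = 0.5 * (time - g + 1) * d          # strictly positive effects, growing in event time
y = 0.1 * units + 0.05 * time + tau     # unit effects + common trend + treatment effect

# Static TWFE: regress y on unit dummies, time dummies (t=1 omitted), and D_{i,t}
X = np.column_stack([(units[:, None] == np.arange(N)).astype(float),
                     (time[:, None] == np.arange(2, T + 1)).astype(float),
                     d])
beta_post = np.linalg.lstsq(X, y, rcond=None)[0][-1]
print(beta_post)                        # about -1.59, despite all effects being positive

# FWL weights: Dhat_{i,t} = Dbar_i + Dbar_t - Dbar; treated cells with D - Dhat < 0
# get negative weight. Early adopters in late periods are exactly such cells.
D_bar_i = d.reshape(N, T).mean(axis=1)[units]
D_bar_t = d.reshape(N, T).mean(axis=0)[time - 1]
resid = d - (D_bar_i + D_bar_t - d.mean())
print(resid[(g == 2) & (time >= 11)].min())   # negative, for treated early-adopter cells
```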
Dynamic TWFE. Next, we turn our attention to the “dynamic” specification that re-
gresses the outcome on individual and period fixed effects, as well as dummies for time
relative to treatment,

  $Y_{i,t} = \alpha_i + \phi_t + \sum_{r \neq 0} 1[R_{i,t} = r]\beta_r + \epsilon_{i,t}$,  (7)

where $R_{i,t} = t - G_i + 1$ is the time relative to treatment (e.g. $R_{i,t} = 1$ in the first treated
period for unit $i$), and the summation runs over all possible values of $R_{i,t}$ except for 0.
Unlike the static specification, the dynamic specification yields a sensible causal es-
timand when there is heterogeneity only in time since treatment. In particular, the re-
sults in Borusyak and Jaravel (2018) and Sun and Abraham (2021) imply that if $\tau_{i,t}(g) =
\sum_{s \geq 0} \tau_s 1[t - g = s]$, so all units have treatment effect $\tau_s$ in the $s$-th period after treatment,
then $\beta_s = \tau_s$ under suitable generalizations of the parallel trends and no-anticipation assump-
tions, such as Assumptions 4 and 5.[8] Thus, specification (7) will yield sensible estimates for
the dynamic effect of Medicaid expansion if the effect $r$ years after Medicaid expansion is
the same (on average) regardless of what year the state initially expanded coverage (for each
$r = 1, 2, \ldots$).

[8] We note that the homogeneity assumption can be relaxed so that all adoption cohorts have the same expected treatment effect, i.e. $E[\tau_{i,g+s}(g) \mid G = g] = \tau_s$ for all $s$ and $g$. Additionally, these results assume that all possible relative time indicators are included. As discussed in Sun and Abraham (2021), Baker, Larcker and Wang (2022), and Schmidheiny and Siegloch (2020), among others, problems may arise if one “bins” endpoints (e.g. includes a dummy for 5+ years since treatment).
Sun and Abraham (2021) show, however, that when there are heterogeneous dynamic
treatment effects across adoption cohorts, the coefficients from specification (7) become
difficult to interpret. Thus, for example, problems may arise if the average treatment effect
in the first year after adoption is different for states that adopted Medicaid in 2014 than
it is for states that adopted in 2015.[9]

[9] That is, if the effect in 2015 for the 2014 adoption cohort is different from the effect in 2016 for the 2015 adoption cohort.

There are two issues. First, as with the “static”
regression specification above, the coefficient $\beta_r$ may put negative weight on the treatment
effect $r$ periods after treatment for some units. Thus, for example, the treatment effect
for some states two years after Medicaid expansion may enter $\beta_2$ negatively. Second, the
coefficient $\beta_r$ can put non-zero weight on treatment effects at lags $r' \neq r$, so there is cross-lag
“contamination”. Thus, for example, the coefficient $\beta_2$ may be influenced by the treatment
effect for some states three periods after Medicaid expansion.

Like the static specification, the dynamic specification thus fails to yield sensible estimates
of dynamic causal effects under heterogeneity across cohorts. The derivation of this result
is mathematically more complex, and so we do not pursue it here. The intuition is that, as
in the static case, the dynamic OLS specification does not aggregate natural comparisons of
units and includes “forbidden comparisons” between sets of units both of which have already
been treated. An important implication of the results derived by Sun and Abraham (2021)
is that if treatment effects are heterogeneous, the “treatment lead” coefficients from (7) are
not guaranteed to be zero even if parallel trends is satisfied in all periods (and vice versa),
and thus evaluation of pre-trends based on these coefficients can be very misleading.
3.2.1 Diagnostic approaches
Several recent papers introduce diagnostic approaches for understanding the extent of the
aggregation issues under staggered treatment timing, with a focus on the static specification
(5). de Chaisemartin and D'Haultfœuille (2020) propose reporting the number/fraction of
group-time ATTs that receive negative weights, as well as the degree of heterogeneity in
treatment effects that would be necessary for the estimated treatment effect to have the
“wrong sign”. Goodman-Bacon (2021) proposes reporting the weights that $\hat{\beta}_{post}$ places on
the different 2-group, 2-period difference-in-differences, which allows one to evaluate how
much weight is being placed on “forbidden” comparisons of already-treated units and how
removing those comparisons would change the estimate. Jakiela (2021) proposes evaluating
both whether TWFE places negative weights on some treated units and whether the data
rejects the constant treatment effects assumption.
3.3 New estimators for staggered timing
Several recent papers have proposed alternative estimators that more sensibly aggregate
heterogeneous treatment effects in settings with staggered treatment timing. The derivation
of each of these estimators follows a similar logic to the derivation of the DiD estimator in
the motivating example in Section 2. We begin by specifying a causal parameter of interest
(analogous to the ATT $\tau_2$). With the help of the (generalized) parallel trends and no-
anticipation assumptions, we can infer the counterfactual outcomes for treated units using
trends in outcomes for an appropriately chosen “clean” control group of untreated units.
This allows us to express the target parameter in terms of identified expectations, analogous
to equation (2). Finally, we replace population expectations with sample averages to form
an estimator of the target parameter.
The Callaway and Sant'Anna estimator. We first describe in detail the approach taken
by Callaway and Sant'Anna (2021), and then discuss the connections to other approaches.
They consider as a building block the group-time average treatment effect on the treated,
$ATT(g,t) = E[Y_{i,t}(g) - Y_{i,t}(\infty) \mid G_i = g]$, which gives the average treatment effect at time $t$
for the cohort first treated in time $g$. For example, $ATT(2014, 2016)$ would be the average
treatment effect in 2016 for states who first expanded Medicaid in 2014. They then consider
identification and estimation under generalizations of the parallel trends assumption to the
staggered setting.[10] Intuitively, under the staggered versions of the parallel trends and no-
anticipation assumptions, we can identify $ATT(g,t)$ by comparing the expected change in
outcome for cohort $g$ between periods $g-1$ and $t$ to that for a control group not yet treated
at period $t$. Formally, under Assumption 4.a,

  $ATT(g,t) = E[Y_{i,t} - Y_{i,g-1} \mid G_i = g] - E[Y_{i,t} - Y_{i,g-1} \mid G_i = g']$, for any $g' > t$,

which can be viewed as the multi-period analog of the identification result in equation (2).
Since this holds for any comparison group $g' > t$, it also holds if we average over some set of
comparison groups $\mathcal{G}_{comp}$ such that $g' > t$ for all $g' \in \mathcal{G}_{comp}$,

  $ATT(g,t) = E[Y_{i,t} - Y_{i,g-1} \mid G_i = g] - E[Y_{i,t} - Y_{i,g-1} \mid G_i \in \mathcal{G}_{comp}]$.

[10] Callaway and Sant'Anna (2021) also consider generalizations where the parallel trends assumption holds only conditional on covariates. We discuss this extension in Section 4.2 below, but focus for now on the case without covariates.
We can then estimate $ATT(g,t)$ by replacing expectations with their sample analogs,

  $\widehat{ATT}(g,t) = \dfrac{1}{N_g} \sum_{i : G_i = g} [Y_{i,t} - Y_{i,g-1}] - \dfrac{1}{N_{\mathcal{G}_{comp}}} \sum_{i : G_i \in \mathcal{G}_{comp}} [Y_{i,t} - Y_{i,g-1}]$.  (8)

Specifically, Callaway and Sant'Anna (2021) consider two options for $\mathcal{G}_{comp}$. The first uses
only never-treated units ($\mathcal{G}_{comp} = \{\infty\}$) and the second uses all not-yet-treated units
($\mathcal{G}_{comp} = \{g' : g' > t\}$).[11] When there are a relatively small number of periods and treatment
cohorts, reporting $\widehat{ATT}(g,t)$ for all relevant $(g,t)$ may be reasonable.

[11] We note that if the never-treated units are not included in the comparison group (i.e. $\infty \notin \mathcal{G}_{comp}$), then one can rely on a weaker version of Assumption 4.a that excludes the never-treated group.
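As an illustration, here is a minimal Python sketch of the sample-analog estimator (8) (our own code, not the authors'; it assumes a balanced long-format pandas DataFrame with hypothetical columns unit, time, y, and G, where G holds the first treatment period and np.inf marks never-treated units; in practice one would use the packages in Table 2, which also provide standard errors):

```python
import numpy as np
import pandas as pd

def att_gt(df, g, t, comparison="not_yet_treated"):
    """Sample analog of ATT(g, t) in equation (8): the change in mean outcomes
    from period g-1 to period t for cohort g, minus the same change for a
    'clean' comparison group (never-treated, or not yet treated at t)."""
    wide = df.pivot(index="unit", columns="time", values="y")
    G = df.groupby("unit")["G"].first()
    change = wide[t] - wide[g - 1]          # Y_{i,t} - Y_{i,g-1}, one entry per unit
    if comparison == "never_treated":
        comp = np.isinf(G)                  # G_comp = {infinity}
    else:
        comp = G > t                        # G_comp = {g' : g' > t}
    return change[G == g].mean() - change[comp].mean()
```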
When there are many treated periods and/or cohorts, however, reporting all of the $\widehat{ATT}(g,t)$
may be cumbersome, and each one may be imprecisely estimated. Thankfully, the method
described above extends easily to estimating any weighted average of the $ATT(g,t)$. For
instance, we may be interested in an “event-study” parameter that gives the (weighted)
average of the treatment effect $l$ periods after adoption across different adoption cohorts,

  $ATT_l^w = \sum_g w_g \, ATT(g, g + l)$.  (9)

The weights $w_g$ could be chosen to weight different cohorts equally, or in terms of their relative
frequencies in the treated population. It is straightforward to form estimates of $ATT_l^w$
by averaging the estimates $\widehat{ATT}(g,t)$ discussed above. We refer the reader to Callaway
and Sant'Anna (2021) for a discussion of a variety of other weighted averages that may
be economically relevant. Inference is straightforward using either the delta method or a
bootstrap, as described in Callaway and Sant'Anna (2021).
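Continuing the sketch above, the aggregation (9) with cohort-size weights (one natural choice among the several the paper mentions; again our own illustrative code) simply averages the $\widehat{ATT}(g, g+l)$:

```python
def att_event_study(df, l):
    """Estimate ATT_l^w in equation (9), weighting cohorts by their relative
    frequencies among treated units that are observed at event time l.
    Assumes each treated cohort has g >= 2 so the baseline period g-1 exists."""
    G = df.groupby("unit")["G"].first()
    t_max = df["time"].max()
    cohorts = [g for g in sorted(G.unique())
               if np.isfinite(g) and g + l <= t_max]      # cohorts observed at lag l
    sizes = np.array([(G == g).sum() for g in cohorts], dtype=float)
    ests = np.array([att_gt(df, g, g + l) for g in cohorts])
    return ests @ (sizes / sizes.sum())
```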
This alternative approach to estimation has two primary advantages over standard static
or dynamic TWFE regressions. The first is that it provides sensible estimands even under
arbitrary heterogeneity of treatment effects. By sensible we mean both that the approach
avoids negative weighting and that the weighting of effects across cohorts is specified
by the researcher (e.g. proportional to cohort size) rather than determined by OLS (i.e.
proportional to the variance of the treatment indicator). The second advantage is that
it makes transparent exactly which units are being used as a control group to infer the
unobserved potential outcomes. This contrasts with standard TWFE models, which we
have seen make unintuitive comparisons under staggered timing.
Imputation estimators. Borusyak, Jaravel and Spiess (2021) introduce a related ap-
proach which they refer to as an imputation estimator (see also Gardner (2021), Liu, Wang
and Xu (2022), and Wooldridge (2021) for similar proposals). Specifically, they fit a TWFE
regression, $Y_{i,t}(\infty) = \alpha_i + \lambda_t + \epsilon_{i,t}$, using observations only for units and time periods that
are not yet treated. They then infer the never-treated potential outcome for each treated
unit using the predicted value from this regression, $\hat{Y}_{i,t}(\infty)$. This provides an estimate of the
treatment effect for each treated unit, $Y_{i,t} - \hat{Y}_{i,t}(\infty)$, and these individual-level estimates can
be aggregated to form estimates of summary parameters like the $ATT(g,t)$ described above.
These approaches yield valid estimates when parallel trends holds for all groups and time
periods and there is no anticipation (Assumptions 4 and 5).
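A minimal sketch of the imputation idea, using the same hypothetical DataFrame layout as above (our own illustration; the authors' packages implement the full estimator with valid inference):

```python
import statsmodels.formula.api as smf

def imputation_att(df):
    """Fit Y = alpha_i + lambda_t by OLS on not-yet-treated observations only,
    impute Y(infinity) for treated cells, and average the implied effects.
    Assumes every unit has at least one untreated period and that never-treated
    units exist, so all unit and time effects are estimable from clean cells."""
    clean = df[df["time"] < df["G"]]                    # not-yet-treated cells only
    fit = smf.ols("y ~ C(unit) + C(time)", data=clean).fit()
    treated = df[df["time"] >= df["G"]].copy()
    treated["tau_hat"] = treated["y"] - fit.predict(treated)   # Y - Yhat(infinity)
    return treated["tau_hat"].mean()    # overall average; can also group by (g, t)
```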
Comparison of CS and BJS approaches. How does the approach proposed by Callaway
and Sant'Anna (2021, CS) compare to that proposed by Borusyak et al. (2021, BJS)? For
simplicity, it is instructive to consider a simple non-staggered setting where there are three
periods ($t = 1, 2, 3$) and units are either treated in period 3 or never treated ($G \in \{3, \infty\}$).
In this case, the CS estimator for the treated group in period 3 (i.e. $ATT(3,3)$) is simply a
DiD comparing the treated/untreated units between periods 2 and 3,

  $\widehat{ATT}(3,3) = \underbrace{(\bar{Y}_{3,3} - \bar{Y}_{3,\infty})}_{\text{Diff at } t=3} - \underbrace{(\bar{Y}_{2,3} - \bar{Y}_{2,\infty})}_{\text{Diff at } t=2}$,

where $\bar{Y}_{t,g}$ is the average outcome in period $t$ for units with $G_i = g$. By contrast, the BJS
estimator runs a similar DiD, except instead of using period 2 as the baseline, the BJS
estimator uses the average outcome prior to treatment (across periods 1 and 2),
  $\widehat{ATT}_{BJS}(3,3) = \underbrace{(\bar{Y}_{3,3} - \bar{Y}_{3,\infty})}_{\text{Diff at } t=3} - \underbrace{(\bar{Y}_{pre,3} - \bar{Y}_{pre,\infty})}_{\text{Avg diff in pre-periods}}$,

where $\bar{Y}_{pre,g} = \frac{1}{2}(\bar{Y}_{1,g} + \bar{Y}_{2,g})$ is the average outcome for cohort $g$ across the two pre-treatment
periods. Thus, we see that the key difference between the CS and BJS estimators is that
CS makes all comparisons relative to the last pre-treatment period, whereas BJS makes com-
parisons relative to the average of the pre-treatment periods. This primary difference in how
the two approaches use pre-treatment periods extends beyond this simple case to settings
with staggered timing, although the math becomes substantially more complicated in the
staggered case (and thus we do not pursue it).
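To make the contrast concrete, here are the two estimators of the three-period example in code (our own illustration; ybar is a hypothetical dictionary of group-period mean outcomes keyed by (period, cohort), with np.inf marking the never-treated cohort):

```python
import numpy as np

def att33_cs(ybar):
    """CS: baseline is the last pre-treatment period (t = 2)."""
    return (ybar[3, 3] - ybar[3, np.inf]) - (ybar[2, 3] - ybar[2, np.inf])

def att33_bjs(ybar):
    """BJS: baseline is the average of the pre-treatment periods (t = 1, 2)."""
    pre_treated = (ybar[1, 3] + ybar[2, 3]) / 2
    pre_never = (ybar[1, np.inf] + ybar[2, np.inf]) / 2
    return (ybar[3, 3] - ybar[3, np.inf]) - (pre_treated - pre_never)
```

Under parallel trends across all three periods, both target the same parameter; they differ in precision and in how sensitive they are to violations of parallel trends in the earlier pre-period, which is exactly the tradeoff discussed next.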
What are the pros and cons of using the last pre-treatment period versus the average
of the pre-treatment periods? In general, there will be tradeoffs between efficiency and
the strength of the identifying assumption. On the one hand, averaging over multiple pre-
treatment periods can increase precision. Indeed, BJS prove that when Assumption 4 holds,
their estimator is efficient under homoskedasticity and serially uncorrelated errors; see also
Wooldridge (2021). Although these ideal conditions are unlikely to be satisfied exactly, this
does suggest that their estimator will tend to be more efficient than CS when the outcome
is not too heteroskedastic or serially correlated.[12] On the other hand, the two approaches
require different identifying assumptions: in the simple example above, CS only relies on
parallel trends between periods 2 and 3, whereas BJS relies on parallel trends for all three
periods.[13] More generally, the BJS approach imposes parallel trends for all groups and time
periods (Assumption 4), whereas the CS approach only relies on post-treatment parallel
trends (Assumption 4.a). Relying on parallel trends over a longer time horizon may lead to
larger biases if the parallel trends assumption holds only approximately: for example, if the
average untreated potential outcome is increasing faster among treated units than untreated
units over time, then the violation of parallel trends is larger when we compare periods farther
apart, and thus the BJS estimator using periods 1 and 2 as the comparison will have larger
bias than the CS estimator using only period 2; see Roth (2018) and de Chaisemartin and
D'Haultfœuille (2022) for additional discussion. Thus, the BJS estimator may be preferable
in settings where the outcome is not too serially correlated and the researcher is confident
in parallel trends across all periods, whereas the CS estimator may be preferred in settings
where serial correlation is high or the researcher is concerned about the validity of parallel
trends over longer time horizons.[14]

[12] By contrast, note that if $Y_{i,t}(0)$ follows a random walk, then $Y_{i,3}(0)$ is independent of $Y_{i,1}(0)$ conditional on $Y_{i,2}(0)$, and thus it is efficient to ignore the earlier pre-treatment periods, as CS does.

[13] Or more precisely, between the average outcome in periods 1 and 2, and period 3. See also Marcus and Sant'Anna (2021) for a discussion about different parallel trends assumptions.

[14] We also note that the BJS and CS estimators incorporate covariates differently: BJS adjust linearly for covariates, whereas CS consider more general adjustments, as described in Section 4.2.
Other related approaches. Several other recent papers propose estimation strategies
similar to those described above, although with some subtle differences in how they construct
the control group and the weights they place on different cohorts/time periods. de Chaise-
martin and D'Haultfœuille (2020) propose an estimator that can be applied when treatment
turns on and off (see Section 3.4 below), but in the context of the staggered setting here it
corresponds with the Callaway and Sant'Anna estimator for $ATT_0^w$ and a particular choice
of weights. Sun and Abraham (2021) propose an estimator that takes the form (8) but uses
either the never-treated units (if they exist) or the last-to-be-treated units as the compar-
ison ($\mathcal{G}_{comp} = \{\max_i G_i\}$), rather than the not-yet-treated units. Marcus and Sant'Anna (2021)
propose a recursive estimator that more efficiently exploits the identifying assumptions in
Callaway and Sant'Anna (2021). See also Imai and Kim (2021) and Strezhnev (2018) for
closely related ideas. Another related approach is to run a stacked regression where each
treated unit is matched to “clean” (i.e. not-yet-treated) controls and there are separate fixed
effects for each set of treated units and its controls, as in Cengiz, Dube, Lindner and Zipperer
(2019), among others. Gardner (2021) shows that this approach estimates a convex weighted
average of the $ATT(g,t)$ under parallel trends and no anticipation, although the weights are
determined by the number of treated units and the variance of treatment within each stacked
event, rather than by economic considerations.
3.4 Further extensions to treatment timing/assignment
Our discussion so far has focused on the case where there is a binary treatment that is adopted
at a particular date and remains on afterwards. Several recent papers have studied settings
with more complicated forms of treatment assignment. We briefly highlight a few of the
recent contributions, and refer the reader to the review in de Chaisemartin and D'Haultfœuille
(2022) for more details.
de Chaisemartin and D'Haultfœuille (2020) and Imai and Kim (2021) consider settings
where units are treated at different times, but do not necessarily require that treatment is an
absorbing state. Their estimators intuitively compare changes in outcomes for units whose
treatment status changed to other units whose treatment status remained constant over
the same periods. This approach yields an interpretable causal effect under generalizations
of the parallel trends assumption and an additional “no carryover” assumption, which
imposes that the potential outcomes depend only on current treatment status and not on
the full treatment history. We note that, as described in Bojinov, Rambachan and Shephard
(2021), the no-carryover assumption may be restrictive in many settings. For example, if
the treatment is a raise in the minimum wage and the outcome is employment, then the
no-carryover assumption requires that employment in period $t$ depends only on whether
the minimum wage was raised in period $t$ and not on the history of minimum wage changes.
Recent work has begun to relax the no-carryover assumption: one example is de Chaisemartin
and D'Haultfœuille (2022), who allow potential outcomes to depend on the full path of
treatments, and instead impose a stronger parallel trends assumption that requires parallel
trends in untreated potential outcomes regardless of a unit's path of treatment.

Other work has considered DiD settings with non-binary treatments. de Chaisemartin
and D'Haultfœuille (2018) study “fuzzy” DiD settings in which all groups are treated in both
time periods, but the proportion of units exposed to treatment increases in one group but
not in the other. Finally, de Chaisemartin and D'Haultfœuille (2021) and Callaway, Goodman-
Bacon and Sant'Anna (2021) study settings with multi-valued or continuous treatments.
3.5 Recommendations
The results discussed above show that while conventional TWFE specifications make sensible
comparisons of treated and untreated units in the canonical two-period DiD setting, in the
staggered case they typically make “forbidden comparisons” between already-treated units.
As a result, treatment effects for some units and time periods receive negative weights in the
TWFE estimand. In extreme cases, this can lead the TWFE estimand to have the “wrong
sign”: e.g., the estimand may be negative even if all of the treatment effects are positive.
Even if the weights are not so extreme as to create sign reversals, it may nevertheless be
difficult to interpret which comparisons the TWFE estimator is making, as the “control
group” is not transparent, and the weights it chooses are unlikely to be those most relevant
for economic policy.
In our view, if the researcher is not willing to impose assumptions on treatment effect
heterogeneity, the most direct remedy for this problem is to use the methods discussed
in Section 3.3 that explicitly specify the comparisons to be made between treatment and
control groups, as well as the desired weights in the target parameter. These methods allow
one to estimate a well-defined causal parameter (under parallel trends), with transparent
weights and transparent comparison groups (e.g. not-yet-treated or never-treated units).
This approach, in our view, provides a more complete solution to the problem than the
diagnostic approaches discussed in Section 3.2.1. Although it is certainly valuable to have a
sense of the extent to which conventional TWFE specifications are making bad comparisons,
eliminating these undesirable comparisons seems to us a better approach than diagnosing the
extent of the issue. Using a TWFE specification may be justified for efficiency reasons if one
is confident that treatment effects are homogeneous, but researchers will often be unwilling
to restrict treatment effect heterogeneity.
The question of which of the many heterogeneity-robust DiD methods discussed in Sec-
tion 3.3 to use is trickier. As described above, the estimators differ in who they use as the
comparison group (e.g. not-yet-treated versus never-treated) as well as the pre-treatment
time periods used in the comparisons (e.g. the whole pre-treatment period versus the final
untreated period). This leads to some tradeoffs between efficiency and the strength of the
parallel trends assumption needed for identification, as highlighted in the comparison of BJS
and CS above. The best estimator to use will therefore depend on the context partic-
ularly, on which group is the most sensible comparison, and how confident the researcher
is in parallel trends for all periods. Nevertheless, it is our practical experience that the
various heterogeneity-robust DiD estimators typically (although not always) produce similar
answers. The first-order consideration is therefore to use an approach that makes clear what
the target parameter is and which groups are being compared for identification. Thankfully,
there are now statistical packages that make implementing (and comparing) the results from
these estimators straightforward in practice (see Table 2).
We acknowledge that these new methods may initially appear complicated to researchers
accustomed to analyzing seemingly simple regression specifications such as (5) or (7). How-
ever, while traditional TWFE regressions are easy to specify, as discussed above they are
actually quite difficult to interpret, since they make complicated and unintuitive comparisons
across groups. By contrast, the methods that we recommend have a simple interpretation
using a coherent comparison group. And while more complex to express in regression format,
they can be viewed as simple aggregations of comparisons of group means. We suspect that
once researchers gain experience using the newer heterogeneity-robust DiD methods, they
will not seem so scary after all!
4 Relaxing or allowing the parallel trends assumption to be violated
A second strand of the literature has focused on the possibility that the canonical parallel
trends assumption may not hold exactly. Approaches to this problem include relaxing the
parallel trends assumption to hold only conditional on covariates, testing for pre-treatment
violations of the parallel trends assumption, and various tools for robust inference and sen-
sitivity analysis that explore the possibility that parallel trends may be violated in certain
ways.
4.1 Why might parallel trends be violated?
The canonical parallel trends assumption requires that the mean outcome for the treated
group would have evolved in parallel with the mean outcome for the untreated group if the
treatment had not occurred. As discussed in Section 2, this allows for confounding factors
that affect treatment status, but these must have a constant additive effect on the mean
outcome.
In practice, however, we will often be unsure of the validity of the parallel trends assump-
tion for several reasons. First, there will often be concern about time-varying confounding
factors. For example, Democratic-leaning states were more likely to adopt Medicaid expan-
sions but also might be subject to different time-varying macro-economic shocks. A second
concern relates to the potential sensitivity of the parallel trends assumption to the chosen
functional form of the outcome. If parallel trends holds using the outcome measured in levels, $Y_{i,t}(0)$, then it will generally not be the case that it holds for the outcome measured in logs, $\log(Y_{i,t}(0))$ (or vice versa). Indeed, Roth and Sant’Anna (2022) show that parallel trends can hold for all monotonic transformations of the outcome $g(Y_{i,t}(0))$ essentially only if the population can be divided into two groups, where the first group is as good as randomly assigned between treatment and control, and the second group has the same potential outcome distribution in both periods. Although there are some cases where these conditions may be (approximately) met, the most prominent of which is random assignment of treatment, they are likely not to hold in most settings where DiD is used, and thus parallel trends will be sensitive to functional form. It will often not be obvious that parallel trends should hold for the particular functional form chosen for our analysis (e.g., should we use insurance rates, or log insurance rates?), and thus we may be skeptical of its validity.
4.2 Parallel trends conditional on covariates
One way to increase the credibility of the parallel trends assumption is to require that it holds only conditional on covariates. Indeed, if we condition on a rich enough set of covariates $X_i$, we may be willing to believe that treatment is nearly randomly assigned conditional on $X_i$. Imposing only parallel trends conditional on $X_i$ gives us an extra degree of robustness, since conditional random assignment can fail so long as the remaining unobservables have a time-invariant additive effect on the outcome. In the Medicaid expansion example, for instance, we may want to condition on a state’s partisan lean.
In the canonical model discussed in Section 2, the parallel trends assumption can be
naturally extended to incorporate covariates as follows.
Assumption 6 (Conditional Parallel Trends).
$$E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 1, X_i] = E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0, X_i] \quad \text{(almost surely)}, \tag{10}$$
for $X_i$ a pre-treatment vector of observable covariates.
For simplicity, we will focus first on the conditional parallel trends assumption in the canon-
ical two-period model, although several papers have also extended this idea to the case of
staggered treatment timing, as we will discuss towards the end of this subsection. We fur-
thermore focus our discussion on covariates that are measured prior to treatment and that
are time-invariant (although they may have a time-varying impact on the outcome); relevant
extensions to this are also discussed below.
In addition to the conditional parallel trends assumption, we will also impose an overlap condition (a.k.a. positivity condition), which guarantees that for each treated unit with covariates $X_i$, there are at least some untreated units in the population with the same value of $X_i$. This overlap assumption is particularly important for using standard inference procedures (Khan and Tamer, 2010).

Assumption 7 (Strong overlap). The conditional probability of belonging to the treatment group, given observed characteristics, is uniformly bounded away from one, and the proportion of treated units is bounded away from zero. That is, for some $\epsilon > 0$, $P(D_i = 1 \mid X_i) < 1 - \epsilon$ almost surely, and $E[D_i] > 0$.
Given the conditional parallel trends assumption, no anticipation assumption, and overlap condition, the ATT conditional on $X_i = x$,
$$\tau_2(x) = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1, X_i = x],$$
is identified for all $x$ with $P(D_i = 1 \mid X_i = x) > 0$. In particular,
$$\tau_2(x) = \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 1, X_i = x]}_{\text{Change for } D_i = 1,\, X_i = x} - \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i = x]}_{\text{Change for } D_i = 0,\, X_i = x}. \tag{11}$$
Note that equation (11) is analogous to (2) in the canonical model, except it conditions on $X_i = x$. Intuitively, among the sub-population with $X_i = x$, we have parallel trends, and so we can take the same steps as in Section 2 to infer the conditional ATT for that sub-population. The unconditional ATT can then be identified by averaging $\tau_2(x)$ over the distribution of $X_i$ in the treated population. Using the law of iterated expectations, we have
$$\tau_2 = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1] = E\Big[\underbrace{E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1, X_i]}_{\tau_2(X_i)} \,\Big|\, D_i = 1\Big].
$$
When $X_i$ is discrete and takes a small number of values (for example, if $X_i$ is an indicator for whether someone has a college degree), then estimation is straightforward. We can just run an unconditional DiD for each value of $X_i$, and then aggregate the estimates to form an estimate for the overall ATT, using the delta method or bootstrap for the standard errors; a minimal sketch of this aggregation is given below. When $X_i$ is either continuously distributed or discrete with a very large number of support points, estimation becomes more complicated, because we will typically not have a large enough sample to do an unconditional DiD within each possible value of $X_i$. Thankfully, there are several available econometric approaches to semi-/non-parametrically estimate the ATT even with continuous covariates. We first discuss the limitations of using TWFE regressions in this setting, and then discuss several alternative approaches.
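To make the discrete-covariate case concrete, the following minimal sketch (in Python; the column names `dy`, `d`, and `x` are hypothetical) runs an unconditional DiD within each covariate cell and aggregates the cell-level estimates using the distribution of the covariate among treated units; standard errors are omitted for brevity.

```python
import pandas as pd

def att_discrete_x(df: pd.DataFrame) -> float:
    """Aggregate cell-level DiD estimates under conditional parallel trends.

    Expects columns: 'dy' (the change Y_2 - Y_1), 'd' (1 = treated),
    and a discrete covariate 'x'. Cells without both treated and control
    units are skipped, which a real analysis should flag as an overlap failure.
    """
    cells = []
    for _, cell in df.groupby("x"):
        treated = cell.loc[cell["d"] == 1, "dy"]
        control = cell.loc[cell["d"] == 0, "dy"]
        if len(treated) == 0 or len(control) == 0:
            continue
        cells.append({"tau_x": treated.mean() - control.mean(),
                      "n_treated": len(treated)})
    cells = pd.DataFrame(cells)
    # weight each cell-level DiD by the share of treated units in that cell
    weights = cells["n_treated"] / cells["n_treated"].sum()
    return float((weights * cells["tau_x"]).sum())
```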
Standard linear regression. Given that the TWFE specification (3) yielded consistent estimates of the ATT under Assumptions 1-3 in the canonical DiD model, it may be tempting to augment this specification with controls for a time-by-covariate interaction,
$$Y_{i,t} = \alpha_i + \phi_t + (1[t = 2] \cdot D_i)\beta + (X_i \cdot 1[t = 2])\gamma + \varepsilon_{i,t}, \tag{12}$$
for estimation under conditional parallel trends. Unfortunately, this augmented specification need not yield consistent estimates of the ATT under conditional parallel trends without additional homogeneity assumptions. The intuition is that equation (12) implicitly models the conditional expectation function (CEF) of $Y_{i,2} - Y_{i,1}$ as depending on $X_i$ with a constant slope of $\gamma$, regardless of $i$'s treatment status. If there are heterogeneous treatment effects that depend on $X_i$ (e.g., the ATT varies by age of participants), the derivative of the CEF with respect to $X_i$ may depend on treatment status $D_i$ as well. In these practically relevant setups, estimates of $\beta$ can be biased for the ATT; see Meyer (1995) and Abadie (2005) for additional discussion. Fortunately, there are several semi-/non-parametric methods available that allow for consistent estimation of the ATT under conditional parallel trends under weaker homogeneity assumptions.
Regression adjustment. An alternative approach to allow for covariate-specific trends in DiD settings is the regression adjustment procedure proposed by Heckman, Ichimura and Todd (1997) and Heckman, Ichimura, Smith and Todd (1998). Their main idea exploits the fact that under conditional parallel trends, strong overlap, and no anticipation, we can write the ATT as
$$\tau_2 = E\big[E[Y_{i,2} - Y_{i,1} \mid D_i = 1, X_i] - E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i] \,\big|\, D_i = 1\big]$$
$$= E[Y_{i,2} - Y_{i,1} \mid D_i = 1] - E\big[E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i] \,\big|\, D_i = 1\big],$$
where the second equality follows from the law of iterated expectations. Thus, to estimate the ATT under conditional parallel trends, one simply needs to estimate the conditional expectation of the outcome among untreated units, and then average these “predictions” using the empirical distribution of $X_i$ among treated units. That is, we estimate $\tau_2$ with
$$\hat{\tau}_2 = \frac{1}{N_1} \sum_{i: D_i = 1} \Big( (Y_{i,2} - Y_{i,1}) - \hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i] \Big), \tag{13}$$
where $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$ is the estimated conditional expectation function fitted on the control units (but evaluated at $X_i$ for a treated unit). We note that if one uses a linear model for $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$, then this would be similar to a modification of (12) that interacts $X_i$ with both treatment group and time dummies, although the two are not quite identical because the outcome regression approach re-weights using the distribution of $X_i$ among units with $D_i = 1$. The researcher need not restrict themselves to linear models for the CEF, however, and can use more flexible semi-/non-parametric methods instead.
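As an illustration, here is a minimal sketch of the regression-adjustment estimator (13) using a linear model for the CEF fitted on control units; the function name and interface are our own, and any flexible learner could replace the least-squares step.

```python
import numpy as np

def att_regression_adjustment(dy, d, X):
    """Outcome-regression estimate of the ATT, as in equation (13).

    dy : outcome changes Y_2 - Y_1, shape (n,)
    d  : treatment indicators, shape (n,)
    X  : covariate matrix, shape (n, k)
    """
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    Xc = np.column_stack([np.ones(len(dy)), X])          # add an intercept
    # fit E[Y_2 - Y_1 | D = 0, X] by least squares on the control units
    coef, *_ = np.linalg.lstsq(Xc[d == 0], dy[d == 0], rcond=None)
    # average actual minus predicted changes over the treated units
    return float(np.mean(dy[d == 1] - Xc[d == 1] @ coef))
```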
One popular approach in empirical practice is to match each treated unit to a “nearest neighbor” untreated unit with similar (or identical) covariate values, and then estimate $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$ using $Y_{l(i),2} - Y_{l(i),1}$, where $l(i)$ is the untreated unit matched to $i$, in which case $\hat{\tau}_2$ reduces to the simple DiD estimator between treated units and the matched comparison group.
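A bare-bones version of this matching estimator (one match per treated unit, with replacement, Euclidean distance; all names ours) might look as follows. As the next paragraph notes, inference after matching requires special care.

```python
import numpy as np

def att_nn_match(dy, d, X):
    """Nearest-neighbor matching DiD: each treated unit's counterfactual
    change is the outcome change of its closest control unit in X."""
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    treated = np.where(d == 1)[0]
    controls = np.where(d == 0)[0]
    effects = []
    for i in treated:
        dists = np.linalg.norm(X[controls] - X[i], axis=1)
        match = controls[np.argmin(dists)]               # match with replacement
        effects.append(dy[i] - dy[match])
    return float(np.mean(effects))
```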
The outcome regression approach will generally be consistent for the ATT when the outcome model used to estimate $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$ is correctly specified. Inference can be done using the delta-method for parametric models, and there are also several methods available for semi-/non-parametric models (under some additional regularity conditions), including the bootstrap, as described in Heckman et al. (1998). Inference is more complicated, however, when one models the outcome evolution of untreated units using a nearest-neighbor approach with a fixed number of matches: the resulting estimator is no longer asymptotically linear and thus standard bootstrap procedures are not asymptotically valid (e.g., Abadie and Imbens, 2006, 2008, 2011, 2012). Ignoring the matching step can also cause problems, and one therefore needs to use inference procedures that accommodate matching as described in the aforementioned papers.^{15}

^{15} Although these nearest-neighbor procedures have been formally justified for cross-sectional data, they are easily adjustable to the canonical 2x2 DiD setup with balanced panel data. We are not aware of formal extensions that allow for unbalanced panel data, repeated cross-sectional data, or more general DiD designs. Abadie and Spiess (2022) show that, in some cases, clustering at the match level is sufficient when matching is done without replacement.
Inverse probability weighting. An alternative to modeling the conditional expectation function is to instead model the propensity score, i.e. the conditional probability of belonging to the treated group given covariates, $p(X_i) = P(D_i = 1 \mid X_i)$. Indeed, as shown by Abadie (2005), under Assumptions 2, 6 and 7, the ATT is identified using the following inverse probability weighting (IPW) formula:
$$\tau_2 = \frac{E\left[\left(D_i - \dfrac{(1 - D_i)\, p(X_i)}{1 - p(X_i)}\right)(Y_{i,2} - Y_{i,1})\right]}{E[D_i]}. \tag{14}$$
As in the regression adjustment approach, researchers can use the “plug-in principle” to estimate the ATT by plugging in an estimate of the propensity score to the equation above. The propensity score model can be estimated using parametric or semi-/non-parametric models (under suitable regularity conditions). The IPW approach will generally be consistent if the model for the propensity score is correctly specified. Inference can be conducted using standard tools; see, e.g., Abadie (2005).
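A minimal sketch of the IPW estimator (14), assuming a logistic model for the propensity score (one of many possible choices; the interface is ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def att_ipw(dy, d, X):
    """Abadie (2005)-style IPW estimate of the ATT, as in equation (14).
    Strong overlap is assumed: fitted propensity scores must stay below 1."""
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    pscore = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]
    weights = d - (1 - d) * pscore / (1 - pscore)        # re-weights the controls
    return float(np.mean(weights * dy) / np.mean(d))
```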
Doubly-robust estimators. The outcome regression and IPW approaches described above can also be combined to form “doubly-robust” (DR) methods that are valid if either the outcome model or the propensity score model is correctly specified. Specifically, Sant’Anna and Zhao (2020) show that under Assumptions 2, 6 and 7, the ATT is identified as:
$$\tau_2 = E\left[\left(\frac{D_i}{E[D_i]} - \frac{\dfrac{(1 - D_i)\, p(X_i)}{1 - p(X_i)}}{E\left[\dfrac{(1 - D_i)\, p(X_i)}{1 - p(X_i)}\right]}\right)\big(Y_{i,2} - Y_{i,1} - E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]\big)\right]. \tag{15}$$
As before, one can then estimate the ATT by plugging in estimates of both the propen-
sity score and the CEF. The outcome equation and the propensity score can be modeled
with either parametric or semi-/non-parametric methods, and DR methods will generally be
consistent if either of these models is correctly specified. In addition, Chang (2020) shows
that data-adaptive/machine-learning methods can also be used with DR methods. Standard
inference tools can be used as well; see, e.g., Sant’Anna and Zhao (2020). Finally, under
some regularity conditions, the DR estimator achieves the semi-parametric efficiency bound
when both the outcome and propensity score models are correctly specified (Sant’Anna and
Zhao, 2020).
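Combining the two previous sketches gives a bare-bones version of the DR moment in (15), again with a logistic propensity score and a linear outcome model standing in for whatever parametric or semi-/non-parametric choices the researcher prefers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def att_doubly_robust(dy, d, X):
    """Doubly-robust ATT estimate in the spirit of equation (15)."""
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    pscore = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]
    Xc = np.column_stack([np.ones(len(dy)), X])
    coef, *_ = np.linalg.lstsq(Xc[d == 0], dy[d == 0], rcond=None)
    dy_hat = Xc @ coef                                   # outcome model, evaluated for everyone
    ipw_c = (1 - d) * pscore / (1 - pscore)              # re-weighted control weights
    w1, w0 = d / np.mean(d), ipw_c / np.mean(ipw_c)
    return float(np.mean((w1 - w0) * (dy - dy_hat)))
```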
Extensions to staggered treatment timing. Although the discussion above focused on DiD setups with two groups and two periods, these different procedures have been extended to staggered DiD setups when treatments are binary and non-reversible. More precisely, Callaway and Sant’Anna (2021) extend the regression adjustment, IPW and DR procedures above to estimate the family of $ATT(g, t)$'s discussed in Section 3.3. They then aggregate these estimators to form different treatment effect summary measures. Wooldridge (2021) proposes an alternative regression adjustment procedure that is suitable for staggered setups. His proposed estimator differs from the Callaway and Sant’Anna (2021) regression adjustment estimator in that he exploits additional information from pre-treatment periods, which, in turn, can lead to improvements in precision. On the other hand, if these additional assumptions are violated, Wooldridge (2021)’s estimator may be more biased than Callaway and Sant’Anna (2021)’s. de Chaisemartin and D’Haultfœuille (2020, 2022) and Borusyak et al. (2021) consider estimators which include covariates in a linear manner.
Caveats. Throughout, we assume that the covariates $X_i$ were measured prior to the introduction of the intervention and are, therefore, unaffected by it. If $X_i$ can in fact be affected by treatment, then conditioning on it induces a “bad control” problem that can induce bias; see Zeldow and Hatfield (2021) for additional discussion. Similar issues arise if one conditions on time-varying covariates $X_{i,t}$ that can be affected by the treatment. If one is willing to assume that a certain time-varying covariate $W_{i,t}$ is not affected by the treatment, then in principle the entire time-path of the covariate $W_i = (W_{i,1}, \ldots, W_{i,T})'$ can be included in the conditioning variable $X_i$, and thus exogenous time-varying covariates can be incorporated similarly to any pre-treatment covariate. See Caetano, Callaway, Payne and Rodrigues (2022) for additional discussion of time-varying covariates.
Another important question relates to whether researchers should condition on pre-treatment outcomes. Proponents of including pre-treatment outcomes argue that controlling for lagged outcomes can reduce bias from unobserved confounders (Ryan, 2018). It is worth noting that when lagged outcomes are included in $X_i$, the conditional parallel trends assumption actually reduces to a conditional mean independence assumption for the untreated potential outcome, since the $Y_{i,1}(0)$ terms on both sides of (10) cancel out, and thus we are left with
$$E[Y_{i,2}(0) \mid D_i = 1, X_i] = E[Y_{i,2}(0) \mid D_i = 0, X_i] \quad \text{(almost surely)}.$$
Including the lagged outcome in the conditioning variable thus makes sense if one is confident in the conditional unconfoundedness assumption: i.e., if treatment is as good as randomly assigned conditional on the lagged outcome and other elements of $X_i$. This may be sensible in settings where treatment takeup decisions are made on the basis of lagged outcomes. However, it is also important to note that conditioning on lagged outcomes need not necessarily reduce bias, and can in fact exacerbate it in certain contexts. For example, Daw and Hatfield (2018) show that when the treated and comparison groups have different outcome distributions but the same trends, matching the treated and control groups on lagged outcomes selects control units with a particularly large “shock” in the pre-treatment period. This can then induce bias owing to a mean-reversion effect, when in fact not conditioning on lagged outcomes would have produced parallel trends. Thus, whether one should include lagged outcomes or not depends on whether the researcher prefers the non-nested assumptions of conditional unconfoundedness (given the lagged outcome) versus parallel trends. See also Chabé-Ferret (2015), Angrist and Pischke (2009, Chapter 5.4), and Ding and Li (2019) for related discussion.
4.3 Testing for pre-existing trends
Although conditioning on pre-existing covariates can help increase the plausibility of the
parallel trends assumption, researchers typically still worry that there remain unobserved
time-varying confounders. An appealing feature of the DiD design is that it allows for a
natural plausibility check on the identifying assumptions: did outcomes for the treated and
comparison groups (possibly conditional on covariates) move in parallel prior to the time
of treatment? It has therefore become common practice to check, both visually and using
statistical tests, whether there exist pre-existing differences in trends (“pre-trends”) as a test
of the plausibility of the parallel trends assumption.
To fix ideas, consider a simple extension of the canonical non-staggered DiD model in Section 2 in which we observe outcomes for an additional period $t = 0$ during which no units were treated. (These ideas will extend to the case of staggered treatment or conditional parallel trends.) By the no-anticipation assumption, $Y_{i,t} = Y_{i,t}(0)$ for all units in periods $t = 0$ and $t = 1$. We can thus check whether the analog to the parallel trends assumption held between periods 0 and 1; that is, is
$$\underbrace{E[Y_{i,1} - Y_{i,0} \mid D_i = 1]}_{\text{Pre-treatment change for } D_i = 1} - \underbrace{E[Y_{i,1} - Y_{i,0} \mid D_i = 0]}_{\text{Pre-treatment change for } D_i = 0} = 0?$$
For example, did average insurance rates evolve in parallel for expansion and non-expansion states before either of them expanded Medicaid? In the non-staggered setting, this hypothesis can be conveniently tested using a TWFE specification that includes leads and lags of treatment,
$$Y_{i,t} = \alpha_i + \phi_t + \sum_{r \neq 0} 1[R_{i,t} = r]\,\beta_r + \epsilon_{i,t}, \tag{16}$$
where the coefficient on the lead of treatment $\hat{\beta}_{-1}$ is given by
$$\hat{\beta}_{-1} = \frac{1}{N_1} \sum_{i: D_i = 1} (Y_{i,0} - Y_{i,1}) - \frac{1}{N_0} \sum_{i: D_i = 0} (Y_{i,0} - Y_{i,1}).$$
Testing for pre-treatment trends thus is equivalent to testing the null hypothesis that $\beta_{-1} = 0$.
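In the two-group case, the test can be carried out directly as a difference in mean pre-period changes, exactly as in the display above; a minimal sketch (our own notation):

```python
import numpy as np
from scipy import stats

def pretrend_test(y0, y1, d):
    """Compute beta_hat_{-1} as in the display above and a z-test of the
    null beta_{-1} = 0.  y0, y1: outcomes in periods t = 0 and t = 1."""
    y0, y1, d = (np.asarray(a, float) for a in (y0, y1, d))
    x = y0 - y1                                    # note the sign convention above
    b = x[d == 1].mean() - x[d == 0].mean()
    se = np.sqrt(x[d == 1].var(ddof=1) / (d == 1).sum()
                 + x[d == 0].var(ddof=1) / (d == 0).sum())
    pval = 2 * stats.norm.sf(abs(b / se))
    return b, se, pval
```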
This approach is convenient to implement and extends easily to the case with additional pre-treatment periods and non-staggered treatment adoption. When there are multiple pre-treatment periods, it is common to plot the $\hat{\beta}_r$ in what is called an “event-study” plot. If all of the pre-treatment coefficients (i.e., $\hat{\beta}_r$ for $r < 0$) are insignificant, this is usually interpreted as a sign in favor of the validity of the design, since we cannot reject the null that parallel trends was satisfied in the pre-treatment period.
This pre-testing approach extends easily to settings with staggered adoption and/or conditional parallel trends assumptions. For example, the Callaway and Sant’Anna (2021) estimator can be used to construct “placebo” estimates of $ATT_l^w$ for $l < 0$, i.e. the ATT $l$ periods before treatment. The estimates $\widehat{ATT}_l^w$ can be plotted for different values of $l$ (corresponding to different lengths of time before/after treatment) to form an event-study plot analogous to that for the non-staggered case. This illustrates that the idea of testing for pre-trends extends easily to settings with staggered treatment adoption or conditional parallel trends, since the Callaway and Sant’Anna (2021) estimator can be applied in both of these settings. These results are by no means specific to the Callaway and Sant’Anna (2021) estimator, though, and event-study plots can be created in a similar fashion using other estimators for either staggered or conditional DiD settings. We caution, however, against using dynamic TWFE specifications like (16) in settings with staggered adoption, since as noted by Sun and Abraham (2021), the coefficients $\beta_r$ may be contaminated by treatment effects at relative time $r' > 0$, so with heterogeneous treatment effects the pre-trends test may reject even if parallel trends holds in the pre-treatment period (or vice versa).
4.4 Issues with testing for pre-trends
Although tests of pre-existing trends are a natural and intuitive plausibility check of the
parallel trends assumption, recent research has highlighted that they also have several lim-
itations. First, even if pre-trends are exactly parallel, this need not guarantee that the
post-treatment parallel trends assumption is satisfied. Kahn-Lang and Lang (2020) give an
intuitive example: the average height of boys and girls evolves in parallel until about age 13
and then diverges, but we should not conclude from this that there is a causal effect of bar
mitzvahs (which occur for boys at age 13) on children’s height!
A second issue is that even if there are pre-existing differences in trends, the tests described above may fail to reject owing to low power (Bilinski and Hatfield, 2018; Freyaldenhoven, Hansen and Shapiro, 2019; Kahn-Lang and Lang, 2020; Roth, 2022). That is, even if there is a pre-existing trend, it may not be significant in the data if our pre-treatment estimates are imprecise.
To develop some intuition for why power may be low, suppose that there is no true treatment effect but there is a pre-existing linear difference in trends between the treatment and comparison groups. Then in the simple example from above, the pre-treatment and post-treatment event-study coefficients will have the same magnitude, $|\beta_{-1}| = |\beta_1|$. If the two estimated coefficients $\hat{\beta}_{-1}$ and $\hat{\beta}_1$ also have the same sampling variance, then by symmetry the probability that the pre-treatment coefficient $\hat{\beta}_{-1}$ is significant will be the same as the probability that the post-treatment coefficient $\hat{\beta}_1$ is significant. But this means that a linear violation of parallel trends that would be detected only half the time by a pre-trends test will also lead us to spuriously find a significant treatment effect half the time,^{16} that is, 10 times more often than we expect to find a spurious effect using a nominal 95% confidence interval! Another intuition for this phenomenon, given by Bilinski and Hatfield (2018), is that pre-trends tests reverse the traditional roles of type I and type II error: they set the assumption of parallel trends (or no placebo pre-intervention effect) as the null hypothesis and only “reject” the assumption if there is strong evidence against it. This controls the probability of finding a violation when parallel trends holds at 5% (or another chosen $\alpha$-level), but the probability of failing to identify a violation can be much higher, corresponding to the type II error of the test.

^{16} This is the unconditional probability that $\hat{\beta}_1$ is significant (not conditioning on the result of the pre-test). However, if $\hat{\beta}_1$ and $\hat{\beta}_{-1}$ are independent, then this is also the probability of finding a significant effect conditional on passing the pre-test.
In addition to being concerning from a theoretical point of view, the possibility of low
power appears to be relevant in practice: in simulations calibrated to papers published in
three leading economics journals, Roth (2022) found that linear violations of parallel trends
that conventional tests would detect only 50% of the time often produce biases as large as
(or larger than) the estimated treatment effect.
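A small simulation in the spirit of this argument (our own stylized DGP, not a replication of Roth (2022)'s calibrations) makes the symmetry concrete: a linear differential trend sized so that the pre-test detects it about half the time also produces a spuriously significant "effect" about half the time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n, trend, sigma = 5000, 100, 0.39, 1.0  # trend chosen for roughly 50% pre-test power here
d = np.repeat([1.0, 0.0], n)                    # n treated and n control units

def diff_in_means(x, d):
    """Difference in mean changes between groups and its standard error."""
    diff = x[d == 1].mean() - x[d == 0].mean()
    se = np.sqrt(x[d == 1].var(ddof=1) / n + x[d == 0].var(ddof=1) / n)
    return diff, se

reject_pre = sig_post = 0
for _ in range(n_reps):
    # three periods t = 0, 1, 2: linear differential trend, zero true treatment effect
    y = [t * trend * d + rng.normal(0, sigma, 2 * n) for t in range(3)]
    b_pre, se_pre = diff_in_means(y[1] - y[0], d)    # pre-trend coefficient
    b_post, se_post = diff_in_means(y[2] - y[1], d)  # post "effect" coefficient
    reject_pre += abs(b_pre / se_pre) > 1.96
    sig_post += abs(b_post / se_post) > 1.96

print(f"pre-test detects trend: {reject_pre / n_reps:.2f}; "
      f"spurious significant effect: {sig_post / n_reps:.2f}")  # both approx 0.5
```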
A third issue with pre-trends testing is that conditioning the analysis on “passing” a pre-
trends test induces a selection bias known as pre-test bias (Roth, 2022). Intuitively, if there
is a pre-existing difference in trends in population, the draws from the DGP in which we fail
to detect a significant pre-trend are a selected sample from the true DGP. Roth (2022) shows
that in many cases, this additional selection bias can exacerbate the bias from a violation of
parallel trends.
A final issue with the current practice of pre-trends testing relates to what happens if
we do detect a significant pre-trend. In this case, the pre-trends test suggests that parallel
trends is likely not to hold exactly, but researchers may still wish to learn something about the
treatment effect of interest. Indeed, it seems likely that with enough precision, we will nearly
always reject that the parallel trends assumption holds exactly in the pre-treatment period.
Nevertheless, we may still wish to learn something about the treatment effect, especially if
the violation of parallel trends is “small” in magnitude. However, the conventional approach
does not make clear how to proceed in this case.
4.4.1 Improved diagnostic tools
A few papers have proposed alternative tools for detecting pre-treatment violations of parallel
trends that take into account some of the limitations discussed above. Roth (2022) developed
tools to conduct power analyses and calculate the likely distortions from pre-testing under
researcher-hypothesized violations of parallel trends. These tools allow the researcher to
assess whether the limitations described above are likely to be severe for potential violations
of parallel trends deemed economically relevant.
Bilinski and Hatfield (2018) and Dette and Schumann (2020) propose “non-inferiority” approaches to pre-testing that help address the issue of low power by reversing the roles of the null and alternative hypotheses. That is, rather than test the null that pre-treatment trends are zero, they test the null that the pre-treatment trend is large, and reject only if the data provides strong evidence that the pre-treatment trend is small. For example, Dette and Schumann (2020) consider null hypotheses of the form $H_0: \max_{r < 0} |\beta_r| \geq c$, where the $\beta_r$ are the (population) pre-treatment coefficients from regression (16). This ensures that the test “detects” a pre-trend with probability at least $1 - \alpha$ when in fact the pre-trend is large (i.e. has magnitude at least $c$).
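One simple way to operationalize this reversal, sketched below, is an intersection-union construction: conclude that pre-trends are small only if every pre-period coefficient is significantly below the threshold $c$ in magnitude. This is a simplified sketch of the general idea, not the exact procedure of either paper.

```python
import numpy as np
from scipy import stats

def noninferiority_pretest(beta_pre, se_pre, c, alpha=0.05):
    """Reject H0: max_r |beta_r| >= c only if every pre-period coefficient is
    significantly below c in magnitude (an intersection-union construction)."""
    z = stats.norm.ppf(1 - alpha)
    beta_pre = np.asarray(beta_pre, float)
    se_pre = np.asarray(se_pre, float)
    # each one-sided bound must lie below c for the union null to be rejected
    return bool(np.all(np.abs(beta_pre) + z * se_pre < c))
```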
These non-inferiority approaches are an improvement over standard pre-testing methods,
since they guarantee by design that the pre-test is powered against large pre-treatment
violations of parallel trends. However, using these approaches does not provide any formal
guarantees that ensure the validity of confidence intervals for the treatment effect, the main
object of interest. They also do not avoid statistical issues related to pre-testing (Roth,
2022), and do not provide clear guidance on what to do when the test fails to reject the null
of a large pre-trend. This has motivated more formal robust inference and sensitivity analysis
approaches that consider inference on the ATT when parallel trends may be violated, which
we discuss next.
4.5 Robust inference and sensitivity analysis
Bounds using pre-trends. Rambachan and Roth (2022b) propose an approach for robust
inference and sensitivity analysis when parallel trends may be violated, building on earlier
work by Manski and Pepper (2018). Their approach attempts to formalize the intuition
motivating pre-trends testing: that the counterfactual post-treatment trends cannot be “too
different” from the pre-trends. To fix ideas, consider the non-staggered treatment adoption
setting described in Section 4.4. Denote by $\delta_1$ the violation of parallel trends in the first post-treatment period:
$$\delta_1 = E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 1] - E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0].$$
This, for example, could be the counterfactual difference in trends in insurance coverage between Medicaid expansion and non-expansion states if the expansions had not occurred. The bias $\delta_1$ is unfortunately not directly identified, since we do not observe the untreated potential outcomes, $Y_{i,2}(0)$, for the treated group. However, by the no anticipation assumption, we can identify the pre-treatment analog to $\delta_1$,
$$\delta_{-1} = E[Y_{i,0}(0) - Y_{i,1}(0) \mid D_i = 1] - E[Y_{i,0}(0) - Y_{i,1}(0) \mid D_i = 0],$$
which looks at pre-treatment differences in trends between the groups, with $\delta_{-1} = E[\hat{\beta}_{-1}]$ from the event study regression (16). For example, $\delta_{-1}$ corresponds to the pre-treatment difference in trends between expansion and non-expansion states.
Rambachan and Roth (2022b) then consider robust inference under assumptions that restrict the possible values of $\delta_1$ given the value of $\delta_{-1}$, or more generally, given $\delta_{-1}, \ldots, \delta_{-K}$ if there are $K$ pre-treatment coefficients. For example, one type of restriction they consider states that the magnitude of the post-treatment violation of parallel trends can be no larger than a constant $\bar{M}$ times the maximal pre-treatment violation, i.e. $|\delta_1| \leq \bar{M} \max_{r < 0} |\delta_r|$. If $\bar{M} = 1$, for example, then this would impose that post-treatment violations of parallel trends are no larger than the largest pre-treatment violation. They also consider restrictions that bound the extent that $\delta_1$ can deviate from a linear extrapolation of the pre-treatment differences in trends. Rambachan and Roth (2022b) use tools from the partial identification and sensitivity analysis literature (Armstrong and Kolesár, 2018; Andrews, Roth and Pakes, 2022) to construct confidence sets for the ATT that are uniformly valid under the imposed restrictions. These confidence sets take into account the fact that we do not observe the true pre-treatment difference in trends $\delta_{-1}$, only an estimate $\hat{\beta}_{-1}$. In contrast to conventional pre-trends tests, the Rambachan and Roth (2022b) confidence sets thus tend to be larger when there is more uncertainty about the pre-treatment difference in trends (i.e. when the standard error on $\hat{\delta}_{-1}$ is large).
This approach enables a natural form of sensitivity analysis. For example, a researcher might report that the conclusion of a positive treatment effect is robust up to the value $\bar{M} = 2$. This indicates that to invalidate the conclusion of a positive effect, we would need to allow for a post-treatment violation of parallel trends two times larger than the maximal pre-treatment violation. For example, we could potentially say that Medicaid expansion has a significant effect on insurance rates unless we’re willing to allow for post-expansion differences in trends that were up to twice as large as the largest difference in trends prior to the expansion. Doing so makes precise exactly what needs to be assumed about possible violations of parallel trends to reach a particular conclusion.
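Ignoring sampling uncertainty, the arithmetic of this sensitivity analysis is simple, and the following sketch conveys the logic (Rambachan and Roth (2022b)'s actual confidence sets additionally account for estimation error in the event-study coefficients; all names here are ours):

```python
import numpy as np

def bias_only_bounds(beta_post, beta_pre, M_bar):
    """Bounds on the ATT under |delta_1| <= M_bar * max_r |delta_r|, treating
    the event-study coefficients as if they were the true population values."""
    A = M_bar * np.max(np.abs(np.asarray(beta_pre, float)))
    return beta_post - A, beta_post + A

def breakdown_M(beta_post, beta_pre):
    """Smallest M_bar at which the bias-only bounds include zero."""
    max_pre = np.max(np.abs(np.asarray(beta_pre, float)))
    return abs(beta_post) / max_pre if max_pre > 0 else np.inf
```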
It is worth highlighting that although we’ve described these tools in the context of non-staggered treatment timing and an unconditional parallel trends assumption, they extend easily to the case of staggered treatment timing and conditional parallel trends as well. Indeed, under mild regularity conditions, these tools can be used anytime the researcher has a treatment effect estimate $\hat{\beta}_{post}$ and a placebo estimate $\hat{\beta}_{pre}$, so long as she is willing to bound the possible bias of $\hat{\beta}_{post}$ given the expected value of $\hat{\beta}_{pre}$. For example, in the staggered setting, $\hat{\beta}_{post}$ could be an estimate of $ATT_l^w$ for $l > 0$ using one of the estimators described in Section 3.3, and $\hat{\beta}_{pre}$ could be placebo estimates of $ATT_{-1}^w, \ldots, ATT_{-k}^w$. See https://github.com/pedrohcgs/CS_RR for examples of how these sensitivity analyses can be combined with the Callaway and Sant’Anna (2021) estimator in R.
Bounds using bracketing. Ye, Keele, Hasegawa and Small (2021) consider an alternative partial identification approach where there are two control groups whose trends are assumed to “bracket” that of the treatment group. Consider the canonical model from Section 2, and suppose the untreated units can be divided into two control groups, denoted $C_i = a$ and $C_i = b$. For ease of notation, let $C_i = trt$ denote the treated group, i.e. units with $D_i = 1$. Let $\Delta(c) = E[Y_{i,2}(0) - Y_{i,1}(0) \mid C_i = c]$. Instead of the parallel trends assumption, Ye et al. (2021) impose that
$$\min\{\Delta(a), \Delta(b)\} \leq \Delta(trt) \leq \max\{\Delta(a), \Delta(b)\}, \tag{17}$$
so that the trend in $Y(0)$ for the treated group is bounded above and below (“bracketed”) by the minimum and maximum trend in groups $a$ and $b$. An intuitive example where we may have such bracketing is if each of the groups corresponds with a set of industries, and one of the control groups (say group $a$) is more cyclical than the treated group while the other (say group $b$) is less cyclical. If the economy was improving between periods $t = 1$ and $t = 2$, then we would expect group $a$ to have the largest change in the outcome and group $b$ to have the smallest change; whereas if the economy was getting worse, we would expect the opposite. Under equation (17) and the no anticipation assumption, the ATT is bounded,
$$E[Y_{i,2} - Y_{i,1} \mid D_i = 1] - \max\{\Delta(a), \Delta(b)\} \leq \tau_2 \leq E[Y_{i,2} - Y_{i,1} \mid D_i = 1] - \min\{\Delta(a), \Delta(b)\}.$$
This reflects that if we knew the true counterfactual trend for the treated group we could learn the ATT exactly, and therefore that bounding this trend means we can obtain bounds on the ATT. Ye et al. (2021) further show how one can construct confidence intervals for the ATT, and extend this logic to settings with multiple periods (but non-staggered treatment timing). See also Hasegawa, Webster and Small (2019) for a related, earlier approach.
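The bias-only version of these bounds is again straightforward; a minimal sketch (ignoring sampling uncertainty, which Ye et al. (2021)'s confidence intervals handle; the interface is ours):

```python
import numpy as np

def att_bracketing_bounds(dy, group):
    """Bias-only bounds on the ATT under the bracketing condition (17).

    dy    : outcome changes Y_2 - Y_1
    group : labels in {"trt", "a", "b"} for the treated and two control groups
    """
    dy, group = np.asarray(dy, float), np.asarray(group)
    mean = {g: dy[group == g].mean() for g in ("trt", "a", "b")}
    # the treated group's counterfactual trend lies between the two control trends
    return (mean["trt"] - max(mean["a"], mean["b"]),
            mean["trt"] - min(mean["a"], mean["b"]))
```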
4.5.1 Other approaches
Keele, Small, Hsu and Fogarty (2019) propose a sensitivity analysis in the canonical two-period DiD model that summarizes the strength of confounding factors that would be needed to induce a particular bias. Freyaldenhoven, Hansen, Pérez and Shapiro (2021) propose a visual sensitivity analysis in which one plots the “smoothest” trend through an event-study plot that could rationalize the data under the null of no effect. Finally, Freyaldenhoven et al. (2019) propose a GMM-based estimation strategy that allows for parallel trends to be violated when there exists a covariate assumed to be affected by the same confounds as the outcome but not by the treatment itself.
4.6 Recommendations
We suspect that in most practical applications of DiD, researchers will not be confident
ex ante that the parallel trends assumption holds exactly, owing to concerns about time-
varying confounds and sensitivity to functional form. The methods discussed in this section
for relaxing the parallel trends assumption and/or assessing sensitivity to violations of the
parallel trends assumption will therefore be highly relevant in most contexts where DiD is
applied.
A natural starting point for these robustness checks is to consider whether the results
change meaningfully when imposing parallel trends only conditional on covariates. Among
the different estimation procedures we discussed, we view doubly-robust procedures as a
natural default, since they are valid if either the outcome model or propensity score is
well-specified and have desirable efficiency properties. A potential exception to this recom-
mendation arises in settings with limited overlap, i.e., when the estimated propensity score
is close to 0 or 1, in which case regression adjustment estimators may be preferred.
Whether one includes covariates into the DiD analysis or not, we encourage researchers to continue to plot “event-study plots” that allow for a visual evaluation of pre-existing trends. These plots convey useful information for the reader to assess whether there appears to have been a break in the outcome for the treatment group around the time of treatment. In contexts with a common treatment date, such plots can be created using TWFE specifications like (16); in contexts with staggered timing, we recommend plotting estimates of $ATT_l^w$ for different values of $l$ using one of the estimators for the staggered setting described in Section 3.3 to avoid negative weighting issues with TWFE. See Section 4.3 for additional discussion. We also refer the reader to Freyaldenhoven et al. (2021) regarding best practices for creating such plots, such as displaying simultaneous (rather than pointwise) confidence bands for the path of the event-study coefficients (Olea and Plagborg-Møller, 2019; Callaway and Sant’Anna, 2021).
While event-study plots play an important role in evaluating the plausibility of the parallel trends assumption, we think it is important to appreciate that tests of pre-trends may be underpowered to detect relevant violations of parallel trends, as discussed in Section 4.4. The lack of a significant pre-trend does not necessarily imply the validity of the parallel trends assumption. At minimum, we recommend that researchers assess the power of pre-trends tests against economically relevant violations of parallel trends, as described in Section 4.4.1. We also think it should become standard practice for researchers to formally assess the extent to which their conclusions are sensitive to violations of parallel trends. A natural statistic to report in many contexts is the “breakdown” value of $\bar{M}$ using the sensitivity analysis in Rambachan and Roth (2022b): i.e., how big would the post-treatment violation of parallel trends have to be relative to the largest pre-treatment violation to invalidate a particular conclusion? We encourage researchers to routinely report the results of the sensitivity analyses described in Section 4.5 alongside their event-study plots.
We also encourage researchers to accompany the formal sensitivity tools with a discussion
of possible violations of parallel trends informed by context-specific knowledge. The parallel
trends assumption is much more plausible in settings where we expect the trends for the two
groups to be similar ex-ante (before seeing the pre-trends). Whenever possible, researchers
should therefore provide a justification for why we might expect the two groups to have
similar trends. It is also useful to provide context-specific knowledge about the types of
confounds that might potentially lead to violations of the parallel trends assumption
what time-varying factors may have differentially affected the outcome for the treated group?
Such discussion can often be very useful for interpreting the results of the formal sensitivity
analyses described in Section 4.5. For example, suppose that a particular conclusion is
robust to allowing for violations of parallel trends twice as large the maximum in the pre-
treatment period. In contexts where other factors were quite stable around the time of the
treatment, this might be interpreted as a very robust finding; on the other hand, if the
treatment occurred at the beginning of a recession much larger than anything seen in the
pre-treatment period, then a violation of parallel trends of that magnitude may indeed be
plausible, so that the results are less robust than we might like. Thus, economic knowledge
will be very important in understanding the robustness of a particular result. In our view,
the most scientific approach to dealing with possible violations of parallel trends therefore
involves a combination of state-of-the-art econometric tools and context-specific knowledge
about the types of plausible confounding factors.
5 Relaxing sampling assumptions
We now discuss a third strand of the DiD literature, which considers inference under devi-
ations from the canonical assumption that we have sampled a large number of independent
clusters from an infinite super-population.
5.1 Inference procedures with few clusters
As described in Section 2, standard DiD inference procedures rely on researchers having
access to data on a large number of treated and untreated clusters. Confidence intervals are
then based on the central limit theorem, which states that with independently-sampled clus-
ters, the DiD estimator has an asymptotically normal distribution as the number of treated
and untreated clusters grows large. In many practical DiD settings, however, the number of
independent clusters (and, in particular, treated clusters) may be small, so that the central
limit theorem based on a growing number of clusters may provide a poor approximation.
For example, many DiD applications using state-level policy changes may only have a hand-
ful of treated states. The central limit theorem may provide a poor approximation with
few clusters, even if the number of units within each cluster is large. This is because the
standard sampling-based view of clustering allows for arbitrary correlations of the outcome
within each cluster, and thus there may be common components at the cluster level (a.k.a.
cluster-level “shocks”) that do not wash out when averaging over many units within the same
cluster. Since we only observe a few observations of the cluster-specific shocks, the average
of these shocks will generally not be approximately normally distributed.
Model-based approaches. Several papers have made progress on the difficult problem
of conducting inference with a small number of clusters by modeling the dependence within
clusters. These papers typically place some restrictions on the common cluster-level shocks,
although the exact restrictions differ across papers. The starting point for these papers is
typically a structural equation of the form
$$Y_{ijt} = \alpha_j + \phi_t + D_{jt}\beta + (\nu_{jt} + \epsilon_{ijt}), \tag{18}$$
where $Y_{ijt}$ is the (realized) outcome of unit $i$ in cluster $j$ at time $t$, $\alpha_j$ and $\phi_t$ are cluster and time fixed effects, $D_{jt}$ is an indicator for whether cluster $j$ is treated in period $t$, $\nu_{jt}$ is a common cluster-by-time error term, and $\epsilon_{ijt}$ is an idiosyncratic unit-level error term. Here, the “cluster-level” error term, $\nu_{jt}$, induces correlation among units within the same cluster. It is often assumed that the $\epsilon_{ijt}$ are iid mean-zero across $i$ and $j$ (and sometimes $t$); see, e.g., Donald and Lang (2007), Conley and Taber (2011), and Ferman and Pinto (2019). Letting $\bar{Y}_{jt} = n_j^{-1} \sum_{i: j(i) = j} Y_{ijt}$ be the average outcome among units in cluster $j$, where $n_j$ is the number of units in cluster $j$, we can take averages to obtain
$$\bar{Y}_{jt} = \alpha_j + \phi_t + D_{jt}\beta + \bar{\eta}_{jt}, \tag{19}$$
where $\bar{\eta}_{jt} = \nu_{jt} + n_j^{-1} \sum_{i=1}^{n_j} \epsilon_{ijt}$. Assuming the canonical set-up with two periods where no clusters are treated in period $t = 1$ and some clusters are treated in period $t = 2$, the canonical DiD estimator at the cluster level is equivalent to the estimated OLS coefficient $\hat{\beta}$ from (19), and is given by
$$\hat{\beta} = \beta + \frac{1}{J_1} \sum_{j: D_j = 1} \Delta\bar{\eta}_j - \frac{1}{J_0} \sum_{j: D_j = 0} \Delta\bar{\eta}_j = \beta + \frac{1}{J_1} \sum_{j: D_j = 1} \Big( \Delta\nu_j + n_j^{-1} \sum_{i=1}^{n_j} \Delta\epsilon_{ij} \Big) - \frac{1}{J_0} \sum_{j: D_j = 0} \Big( \Delta\nu_j + n_j^{-1} \sum_{i=1}^{n_j} \Delta\epsilon_{ij} \Big), \tag{20}$$
where $J_d$ corresponds with the number of clusters with treatment $d$, and $\Delta\bar{\eta}_j = \bar{\eta}_{j2} - \bar{\eta}_{j1}$ (and likewise for the other variables). The equation in the previous display highlights the challenge in this setup: with few clusters, the averages of the cluster-level shocks $\Delta\nu_j$ among treated and untreated clusters will tend not to be approximately normally distributed, and their variance may be difficult to estimate.
It is worth highlighting that the model described above starts from the structural equation (18) rather than a model where the primitives are potential outcomes as in Section 2. We think that connecting the assumptions on the errors in the structural model (18) to restrictions on the potential outcomes is an interesting open topic for future work. Although a general treatment is beyond the scope of this paper, in Appendix A we show that the errors in the structural model (18) map to primitives based on potential outcomes in the canonical model from Section 2. For the remainder of the sub-section, however, we focus primarily on the restrictions placed on $\nu_{jt}$ and $\epsilon_{ijt}$ directly, rather than the implications of these assumptions for the potential outcomes, since this simplifies exposition and matches how these assumptions are stated in the literature.
Donald and Lang (2007) made an important early contribution to the literature on inference with few clusters. Their approach assumes that the cluster-level shocks $\nu_{jt}$ are mean-zero Gaussian, homoskedastic with respect to cluster and treatment status, and independent of other unit-and-time specific shocks. They also assume the number of units per cluster is large ($n_j \to \infty$ for all $j$). They then show that one can obtain valid inference by using critical values from a t-distribution with $J - 2$ degrees of freedom, where $J$ is the total number of clusters. A nice feature of this approach is that it allows for valid inference when both the number of treated and untreated clusters is small. The disadvantage is the strong parametric assumption of homoskedastic Gaussian errors, which will often be hard to justify in practice.
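In the two-period case, this amounts to a two-sample comparison of cluster-level average changes with pooled-variance standard errors and $t(J - 2)$ critical values; a minimal sketch in that spirit (our own interface, not Donald and Lang's general procedure):

```python
import numpy as np
from scipy import stats

def cluster_did_tconf(dybar, d_cluster, alpha=0.05):
    """Two-period DiD on cluster averages with t(J-2) critical values.

    dybar     : cluster-level average outcome changes, shape (J,)
    d_cluster : cluster treatment indicators, shape (J,)
    """
    dybar, d_cluster = np.asarray(dybar, float), np.asarray(d_cluster, int)
    J1, J0 = (d_cluster == 1).sum(), (d_cluster == 0).sum()
    beta = dybar[d_cluster == 1].mean() - dybar[d_cluster == 0].mean()
    # pooled variance of the cluster-level changes (homoskedastic across clusters)
    s2 = (((dybar[d_cluster == 1] - dybar[d_cluster == 1].mean()) ** 2).sum()
          + ((dybar[d_cluster == 0] - dybar[d_cluster == 0].mean()) ** 2).sum()) / (J1 + J0 - 2)
    se = np.sqrt(s2 * (1 / J1 + 1 / J0))
    tcrit = stats.t.ppf(1 - alpha / 2, J1 + J0 - 2)
    return beta, (beta - tcrit * se, beta + tcrit * se)
```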
Conley and Taber (2011) introduce an alternative approach for inference that is able to relax the strong assumption of Gaussian errors in settings where there are many control clusters ($J_0$ large) but few treated clusters ($J_1$ small). This may be reasonable if the author has data from, say, 3 treated states and 47 untreated states. The key idea in Conley and Taber (2011) is that if we assume that the errors in treated states come from the same distribution as in control states, then we can learn the distribution of errors from the large number of control states and use that to construct standard errors. A key advantage of this approach is that the distribution of errors is not assumed to be Gaussian, but rather is learned from the data. Nevertheless, the assumption that all treated groups have the same distribution of errors is still strong, and will often be violated if either there is heterogeneity in treatment effects or in cluster sizes. Ferman and Pinto (2019) extend the approach of Conley and Taber (2011) to allow for heteroskedasticity caused by heterogeneity in group sizes or other observable characteristics, but must still restrict heterogeneity based on unobserved characteristics (e.g. unobserved treatment effect heterogeneity).
Hagemann (2020) provides an alternative permutation-based approach that avoids the
need to directly estimate the heteroskedasticity. The key insight of Hagemann (2020) is
that if we place a bound on the maximal relative heterogeneity across clusters, then we
can bound the probability of type I error from a permutation approach. He also shows
how one can use this measure of relative heterogeneity to do sensitivity analysis. Like the
other proposals above, though, Hagemann (2020)’s approach must also place some strong
restrictions on certain types of heterogeneity. In particular, his approach essentially requires
that, as cluster size grows large, any single untreated cluster could be used to infer the
counterfactual trend for the treated group, and thus his approach rules out cluster-specific
heterogeneity in trends in untreated potential outcomes.
Another popular solution with few clusters is the cluster wild bootstrap. In an influential paper, Cameron, Gelbach and Miller (2008) presented simulation evidence that the cluster wild bootstrap procedure can work well in settings with as few as five clusters. More recently, however, Canay, Santos and Shaikh (2021) provided a formal analysis of the conditions under which the cluster wild bootstrap is asymptotically valid in settings with a few large clusters. Importantly, Canay et al. (2021) show that the reliability of these bootstrap procedures depends on imposing certain homogeneity conditions on treatment effects, as well as on the type of bootstrap weights one uses and the estimation method adopted (e.g., restricted vs. unrestricted OLS). These restrictions are commonly violated when one uses TWFE regressions with cluster-specific and time fixed effects like (19) or when treatment effects are allowed to be heterogeneous across clusters; see Examples 2 and 3 in Canay et al. (2021). Simulations have likewise shown that the cluster wild bootstrap may perform poorly in DiD settings with a small number of treated clusters (MacKinnon and Webb, 2018). Thus, while the wild bootstrap may perform well in certain scenarios with a small number of clusters, it too requires strong homogeneity assumptions.
Finally, in settings with a large number of time periods, it may be feasible to con-
duct reliable inference with less stringent homogeneity assumptions about treatment effects.
For instance, Canay, Romano and Shaikh (2017), Ibragimov and Müller (2016), Hagemann
(2021), and Chernozhukov, Wüthrich and Zhu (2021) respectively propose permutation-
based, t-test based, adjusted permutation-based, and conformal inference-based procedures
that allow one to relax distributional assumptions about common shocks and accommodate
richer forms of heterogeneity. The key restriction is that one is comfortable limiting the
time-series dependence of the cluster-specific-shocks, and strengthening the parallel trends
assumption to hold in many pre- and post-treatment time periods. These methods have
been shown to be valid under asymptotics where the number of periods grows large. When in fact the number of time periods is small, as frequently occurs in DiD applications, one can still use some of these methods, but the underlying assumptions are stronger; see, e.g., Remark 4.5 and Section 4.2 of Canay et al. (2017).
Alternative approaches. We now briefly discuss two alternative approaches in settings with a small number of clusters. First, while all of the “model-based” papers above treat $\nu_{jt}$ as random, an alternative perspective would be to condition on the values of $\nu_{jt}$ and view the remaining uncertainty as coming only from the sampling of the individual units within clusters, constructing standard errors by clustering only at the unit level. This will generally produce a violation of parallel trends, but the violation may be relatively small if the cluster-specific shocks are small relative to the idiosyncratic variation. The violation of parallel trends could then be accounted for using the methods described in Section 4.
To make this concrete, consider the setting of Card and Krueger (1994) that compares
employment in NJ and PA after NJ raised its minimum wage. The aforementioned papers
would consider NJ and PA as drawn from a super-population of treated and untreated states,
where the state-level shocks are mean-zero, whereas the alternative approach would treat
the two states as fixed and view any state-level shocks between NJ and PA as a violation of
the parallel trends assumption. One could then explore the sensitivity of one’s conclusions
to the magnitude of this violation, potentially benchmarking it relative to the magnitude of
the pre-treatment violations as discussed in Section 4.5.
A second possibility is Fisher Randomization Tests (FRTs), otherwise known as permu-
tation tests. The basic idea is to calculate some statistic of the data (e.g. the t-statistic
of the DiD estimator), and then recompute this statistic under many permutations of the
treatment assignment (at the cluster level). We then reject the null hypothesis of no effect if
the test statistic using the original data is larger than 95% of the draws of the test statistics
under the permuted treatment assignment. Such tests have a long history in statistics, dat-
ing to Fisher (1935). If treatment is randomly assigned, then FRTs have exact finite-sample
validity under the sharp null of no treatment effects for all units. The advantage of these
tests is that they place no restrictions on the potential outcomes, and thus allow arbitrary
heterogeneity in potential outcomes across clusters. On the other hand, the assumption of
random treatment assignment may often be questionable in DiD settings. Moreover, the
“sharp” null of no effects for all units may not be as economically interesting as the “weak”
null of no average effects. Nevertheless, permutation tests may be a useful benchmark: if
one cannot reject the null of no treatment effects even if treatment had been randomly as-
signed, this suggests that there is not strong evidence of an effect in the data without other
strong assumptions. In settings with staggered treatment timing, it may be more plausible
to assume that the timing of when a unit gets treated is as good as random; see Roth and
Sant’Anna (2021) for efficient estimators and FRTs for this setting.
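A minimal sketch of such a cluster-level FRT for the simple two-period design (our own interface; with few clusters one could instead enumerate all possible assignments):

```python
import numpy as np

def frt_pvalue(dybar, d_cluster, n_perm=9999, seed=0):
    """Fisher randomization test of the sharp null of no effect: permute the
    cluster-level treatment assignment and compare the observed DiD statistic
    to its permutation distribution."""
    rng = np.random.default_rng(seed)
    dybar, d_cluster = np.asarray(dybar, float), np.asarray(d_cluster, int)

    def stat(d):
        return abs(dybar[d == 1].mean() - dybar[d == 0].mean())

    observed = stat(d_cluster)
    draws = np.array([stat(rng.permutation(d_cluster)) for _ in range(n_perm)])
    # including the observed assignment preserves exact finite-sample validity
    return (1 + (draws >= observed).sum()) / (n_perm + 1)
```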
Recommendations. In sum, recent research has made progress on the problem of con-
ducting inference with relatively few clusters, but all of the available approaches require the
researcher to impose some potentially strong additional assumptions. Most of the litera-
ture has focused on model-based approaches, which require the researcher to impose some
homogeneity assumptions across clusters. Different homogeneity assumptions may be more
reasonable in different contexts, and so we encourage researchers using these approaches to
choose a method relying on a dimension of homogeneity that is most likely to hold (ap-
proximately) in their context. We also note that allowing for more heterogeneity may often
come at the expense of obtaining tests with lower power. When none of these homogeneity
assumptions is palatable, conditioning the inference on the cluster-level shocks and treating
them as violations of parallel trends, accompanied by appropriate sensitivity analyses, may
be an attractive alternative. Permutation-based methods also offer an intriguing alterna-
tive which requires no assumptions about homogeneity in potential outcomes, but requires
stronger assumptions on the assignment of treatment and tests a potentially less interesting
null hypothesis when the number of clusters is small.
5.2 Design-based inference and the appropriate level of clustering
The canonical approach to inference in DiD assumes that we have a large number of independently-
drawn clusters sampled from an infinite super-population. In practice, however, there are
two related conceptual difficulties with this framework. First, in many settings, it is unclear what the super-population of clusters is: if the clusters in my sample are the 50 US states, should I view these as having been drawn from a super-population of possible states? Second, in many settings it is hard to determine what the appropriate level of clustering is: if my data is on individuals who live in counties, which are themselves subsets of states, which is the appropriate level of clustering?
To address these difficulties, it is often easier to consider a design-based framework that
views the units in the data as fixed (not necessarily sampled from a super-population) and the
treatment assignment as stochastic. This helps to address the difficulties described above,
since we do not need to conceptualize the super-population, and the appropriate level of
clustering is determined by the way that treatment is assigned. Design-based frameworks
have a long history in statistics dating to Neyman (1923), and have received recent attention
in econometrics (e.g. Abadie, Athey, Imbens and Wooldridge, 2020, 2023). However, until recently, most of the results in the design-based literature have focused on settings where treatment probabilities are known or depend only on observable characteristics, and thus were not directly applicable to DiD.
Recent work by Rambachan and Roth (2022a) has extended this design-based view to set-
tings like DiD, where treatment probabilities may differ in unknown ways across units. Ram-
bachan and Roth (2022a) consider a setting similar to the canonical two-period model in Sec-
tion 2. However, following the design-based paradigm, they treat the units in the population
(and their potential outcomes) as fixed rather than drawn from an infinite super-population.
In this set-up, they show that the usual DiD estimator is unbiased for a finite-population
analog to the ATT under a finite-population analog to the parallel trends assumption. In
particular, let $\pi_i$ denote the probability that $D_i = 1$, and suppose that
$$\sum_{i=1}^{N} \left( \pi_i - \frac{N_1}{N} \right) \left( Y_{i,2}(0) - Y_{i,1}(0) \right) = 0,$$
so that treatment probabilities are uncorrelated with trends in $Y(0)$ (a finite-population version of parallel trends). Then $E_D[\hat{\tau}_2] = \tau_2^{Fin}$, where $\tau_2^{Fin} = E_D\left[ \frac{1}{N_1} \sum_{i: D_i = 1} \left( Y_{i,2}(1) - Y_{i,2}(0) \right) \right]$ is a finite-population analog to the ATT, i.e. the expected average treatment effect on the treated, where the expectation is taken over the stochastic distribution of which units
are treated.
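To illustrate this result concretely, the following small simulation (our own sketch, not code from Rambachan and Roth (2022a)) holds the potential outcomes of $N$ units fixed and redraws only the treatment assignment. With a constant assignment probability, the finite-population parallel trends condition above holds exactly, and the DiD estimates average out to the unit-level effect of 1.

```r
# Design-based simulation sketch: potential outcomes are fixed; only D_i is stochastic.
set.seed(42)
N <- 200; p <- 0.4
y1_0 <- rnorm(N)          # fixed Y_{i,1}(0)
y2_0 <- y1_0 + rnorm(N)   # fixed Y_{i,2}(0)
y2_1 <- y2_0 + 1          # fixed Y_{i,2}(1): treatment effect of 1 for every unit
# With constant pi_i = p, treatment probabilities are trivially uncorrelated with trends.
tau_hat <- replicate(5000, {
  d  <- rbinom(N, 1, p)                      # stochastic treatment assignment
  y2 <- ifelse(d == 1, y2_1, y2_0)           # observed period-2 outcome
  (mean(y2[d == 1]) - mean(y1_0[d == 1])) -
    (mean(y2[d == 0]) - mean(y1_0[d == 0]))  # canonical DiD estimate
})
mean(tau_hat)  # approximately 1: unbiased over the distribution of assignments
```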
Rambachan and Roth (2022a) show that from the design-based perspective, cluster-
robust standard errors are valid (but potentially conservative) if the clustering is done at the
level at which treatment is independently determined. Thus, for example, if the treatment
is assigned independently at the unit level (formally, if units are assigned independently before we condition on the number of treated units, $N_1$), then we should cluster at the unit level; by contrast, if treatment is determined independently across states, then we should cluster at
the state level. This clear recommendation on the appropriate level of clustering contrasts
with the more traditional model-based view that clustering should be done at the level at
which the errors are correlated, which often makes it challenging to choose the appropriate
level (MacKinnon, Nielsen and Webb, 2022). These results also suggest that it may not
actually be a problem if it is difficult to conceptualize a super-population from which our
clusters are drawn; rather, the “usual” approach remains valid if there is no super-population
and the uncertainty comes from stochastic assignment of treatment. In some settings, of course, the uncertainty may arise both from sampling and the stochastic assignment of treatment. Abadie et al. (2023) study a model in which both treatment is stochastic and units are sampled from a larger population, and suggest that one should cluster among units if either their treatment assignments are correlated or the event that they are included in the sample is correlated. Although the Abadie et al. (2023) results do not directly apply to DiD, we suspect that a similar heuristic would apply in DiD as well in light of the results in Rambachan and Roth (2022a) for the case where only treatment is viewed as stochastic. Formalizing this intuition strikes us as an interesting area for future research.
Recommendations. If it is difficult to conceptualize a super-population, fear not! Your
DiD analysis can likely still be sensible from a finite-population perspective where we think of
the treatment assignment as stochastic. Furthermore, if you are unsure about the appropriate
level of clustering, a good rule of thumb (at least from the design-based perspective) is to
cluster at the level at which treatment is independently assigned.
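In practice, this rule of thumb is straightforward to operationalize. The sketch below uses the fixest package (listed in Table 2); the data frame and variable names are hypothetical placeholders for individual-level data in which the policy is assigned at the state level.

```r
# Hypothetical illustration: individual-level panel data, policy assigned at the
# state level, so we cluster at the state level rather than by county or individual.
library(fixest)
est <- feols(y ~ treat | state + year,  # TWFE specification (placeholder variables)
             data = df,
             cluster = ~state)          # cluster at the level of treatment assignment
summary(est)
```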
6 Other topics and areas for future research
In this section, we briefly touch on some other areas of interest in the DiD literature, and
highlight some open areas for future research.
Distributional treatment effects. The DiD literature typically focuses on estimation of
the ATT, but researchers may often be interested in the effect of a treatment on the entire
distribution of an outcome. Athey and Imbens (2006) propose the Changes-in-Changes
model, which allows one to infer the full counterfactual distribution of Y p0q for the treated
group in DiD setups. The key assumption is that the mapping between quantiles of Y p0q
for the treated and comparison groups remains stable over time e.g., if the 30th percentile
of the outcome for the treated group was the 70th percentile for the comparison group prior
to treatment, this relationship would have been preserved in the second period if treatment
had not occurred. Bonhomme and Sauder (2011) propose an alternative distributional DiD
model based on a parallel trends assumption for the (log of the) characteristic function, which
is motivated by a model of test scores. Callaway and Li (2019) propose a distributional DiD
model based on a copula stability assumption. Finally, Roth and Sant’Anna (2022) show that parallel trends holds for all functional forms under a “parallel trends”-type assumption for the cumulative distribution of $Y(0)$, and this assumption also allows one to infer the full counterfactual distribution for the treated group.
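To make the Changes-in-Changes mapping concrete, the sketch below (our own illustration of the quantile mapping described above, not code from Athey and Imbens (2006)) constructs counterfactual second-period outcomes for the treated group by pushing their first-period outcomes through the comparison group's period-1-to-period-2 quantile transformation.

```r
# Sketch of the Changes-in-Changes counterfactual: map each treated period-1 outcome
# to its rank in the comparison group's period-1 distribution, then read off that
# same quantile of the comparison group's period-2 distribution.
cic_counterfactual <- function(y_treat_1, y_comp_1, y_comp_2) {
  u <- ecdf(y_comp_1)(y_treat_1)  # ranks within comparison group, period 1
  quantile(y_comp_2, probs = u, type = 1, names = FALSE)  # same ranks, period 2
}
# Hypothetical usage, with y_treat_2 the observed treated outcomes in period 2:
# mean(y_treat_2) - mean(cic_counterfactual(y_treat_1, y_comp_1, y_comp_2))
```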
Quasi-random treatment timing. In settings with staggered treatment timing, the gen-
eralized parallel trends assumption is often justified by arguing that the timing of treatment
is random or quasi-random. Roth and Sant’Anna (2021) show that if one is willing to as-
sume treatment timing is as good as random, one can obtain more efficient estimates than
using the staggered DiD methods discussed in Section 3.3. This builds on earlier work by
McKenzie (2012), who highlighted that DiD is typically inefficient in an RCT where lagged
outcomes are observed, as well as a large literature in statistics on efficient covariate ad-
justment in randomized experiments (e.g., Lin, 2013). Shaikh and Toulis (2021) propose a
method for observational settings where treatment timing is random conditional on fixed ob-
servable characteristics. We think that developing methods for observational settings where
treatment timing is approximately random, possibly conditional on covariates and lagged
outcomes, is an interesting area for further study in the years ahead.
Sequential ignorability. As discussed in Section 3.4, an exciting new literature in DiD
has begun to focus on settings where treatment can turn on and off and potential outcomes
depend on the full path of treatments. A similar setting has been studied extensively in
biostatistics, beginning with the pioneering work of Robins (1986). The key difference is
that the biostatistics literature has focused on sequential random ignorability assumptions
that impose that treatment in each period is random conditional on the path of covariates
and realized outcomes, rather than parallel trends. We suspect that there are economic settings where sequential ignorability may be preferable to parallel trends, e.g. when there is feedback between lagged outcomes and future treatment choices. Integrating these two literatures (e.g., understanding in which economic settings parallel trends is preferable to sequential ignorability and vice versa) is an interesting area for future research. An
interesting step towards incorporating sequential ignorability in economic analyses is Viviano
and Bradic (2021).
Spillover effects. The vast majority of the DiD literature imposes the SUTVA assump-
tion, which rules out spillover effects. However, spillover effects may be important in many
economic applications, such as when policy in one area affects neighboring areas, or when
individuals are connected in a network. Butts (2021) provides some initial work in this
direction by extending the framework of Callaway and Sant’Anna (2021) to allow for local
spatial spillovers. Huber and Steinmayr (2021) also consider extensions to allow for spillover
effects. We suspect that in the coming years, we will see more work on DiD with spillovers.
Conditional treatment effects. The DiD literature has placed a lot of emphasis on learning about the ATTs of different groups. However, in many situations, it may also be desirable to better understand how these ATTs vary across subpopulations defined by covariate values. For instance, how does the average treatment effect of a training program on
earnings vary according to the age of its participants? Abadie (2005) provides re-weighting
methods to tackle these types of questions using linear approximations. However, recent
research has shown that data-adaptive/machine-learning procedures can be used to more
flexibly estimate treatment effect heterogeneity in the context of RCTs or cross-sectional
observational studies with unconfoundedness (e.g., Lee, Okui and Whang, 2017; Wager and
Athey, 2018; Chernozhukov, Demirer, Duflo and Fernández-Val, 2020). Whether such tools
can be adapted to estimate treatment effect heterogeneity in DiD setups is a promising area
for future research.
Triple differences. A common variant on DiD is triple-differences (DDD), which compares
the DiD estimate for a demographic group expected to be affected by the treatment to a DiD
for a second demographic group expected not to be affected (or affected less). For example,
Gruber (1994) studies the impacts of mandated maternity leave policies using a DDD design
that compares the evolution of wages between treated/untreated states, before/after the law
passed, and between married women age 20-40 (who are expected to be affected) and other
workers. DDD has received much less attention in the recent literature than standard DiD.
We note, however, that DDD can often be cast as a DiD with a transformed outcome. For
example, if we defined the state-level outcome $\tilde{Y}$ as the difference in wages between women age 20-40 and other workers, then Gruber (1994)'s DDD analysis would be equivalent to a DiD analysis using $\tilde{Y}$ as the outcome instead of wages. Nevertheless, we think that providing
a more formal analysis of DDD along with practical recommendations for applied researchers
would be a useful direction for future research.
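To make the transformed-outcome representation concrete, a minimal sketch (with hypothetical data frame and variable names) would first collapse the data to a state-by-year gap between the affected and unaffected groups, and then run a standard DiD on that gap:

```r
# Hypothetical sketch: DDD as a DiD on a transformed outcome. df has one row per
# worker, with columns: state, year, wage, affected (in the demographic group expected
# to be affected), treated (state passed the law), post (year is after passage).
library(dplyr)

df_tilde <- df %>%
  group_by(state, year, treated, post) %>%
  summarise(y_tilde = mean(wage[affected]) - mean(wage[!affected]), .groups = "drop")

# A standard DiD regression on the transformed outcome y_tilde:
lm(y_tilde ~ treated * post, data = df_tilde)
```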
Connections to other panel data methods. DiD is of course one of many possible
panel data methods. One of the most prominent alternatives is the synthetic control (SC)
method, pioneered by Abadie, Diamond and Hainmueller (2010). Much of the DiD and SC
literatures have evolved separately, using different data-generating processes as the baseline (Abadie, 2021). Recent work has begun to try to combine insights from the two literatures (e.g., Arkhangelsky, Athey, Hirshberg, Imbens and Wager, 2021; Ben-Michael, Feller and Rothstein, 2021, 2022; Doudchenko and Imbens, 2016). We think that exploring further connections between the literatures (in particular, providing clear guidance for practitioners on when we should expect one method to perform better than the other, or whether one should consider a hybrid of the two) is an interesting direction for future research.
7 Conclusion
This paper synthesizes the recent literature on DiD. Some key themes are that researchers
should be clear about the comparison group used for identification, match the estimation
and inference methods to the identifying assumptions, and explore robustness to possible
violations of those assumptions. We emphasize that context-specific knowledge will often
be needed to choose the right identifying assumptions and accompanying methods. We are
hopeful that these recent developments will help to make DiD analyses more transparent
and credible in the years to come.
Table 1: A Checklist for DiD Practitioners

Is everyone treated at the same time?
• If yes, and the panel is balanced, estimation with TWFE specifications such as (5) or (7) yields easily interpretable estimates.
• If no, consider using a “heterogeneity-robust” estimator for staggered treatment timing as described in Section 3. The appropriate estimator will depend on whether treatment turns on/off and which parallel trends assumption you’re willing to impose. Use TWFE only if you’re confident in treatment effect homogeneity.

Are you sure about the validity of the parallel trends assumption?
• If yes, explain why, including a justification for your choice of functional form. If the justification is (quasi-)random treatment timing, consider using a more efficient estimator as discussed in Section 6.
• If no, consider the following steps:
  1. If parallel trends would be more plausible conditional on covariates, consider a method that conditions on covariates, as described in Section 4.2.
  2. Assess the plausibility of the parallel trends assumption by constructing an event-study plot. If there is a common treatment date and you’re using an unconditional parallel trends assumption, plot the coefficients from a specification like (16). If not, then see Section 4.3 for recommendations on event-plot construction.
  3. Accompany the event-study plot with diagnostics of the power of the pre-test against relevant alternatives and/or non-inferiority tests, as described in Section 4.4.1.
  4. Report formal sensitivity analyses that describe the robustness of the conclusions to potential violations of parallel trends, as described in Section 4.5.

Do you have a large number of treated and untreated clusters sampled from a super-population?
• If yes, then use cluster-robust methods at the cluster level. A good rule of thumb is to cluster at the level at which treatment is independently assigned (e.g. at the state level when policy is determined at the state level); see Section 5.2.
• If you have a small number of treated clusters, consider using one of the alternative inference methods described in Section 5.1.
• If you can’t imagine the super-population, consider a design-based justification for inference instead, as discussed in Section 5.2.
Table 2: Statistical Packages for Recent DiD Methods

Heterogeneity Robust Estimators for Staggered Treatment Timing
Package | Software | Description
did, csdid | R, Stata | Implements Callaway and Sant’Anna (2021)
did2s | R, Stata | Implements Gardner (2021), Borusyak et al. (2021), Sun and Abraham (2021), Callaway and Sant’Anna (2021), Roth and Sant’Anna (2021)
didimputation, did_imputation | R, Stata | Implements Borusyak et al. (2021)
DIDmultiplegt, did_multiplegt | R, Stata | Implements de Chaisemartin and D’Haultfoeuille (2020)
eventstudyinteract | Stata | Implements Sun and Abraham (2021)
flexpaneldid | Stata | Implements Dettmann (2020), based on Heckman et al. (1998)
fixest | R | Implements Sun and Abraham (2021)
stackedev | Stata | Implements stacking approach in Cengiz et al. (2019)
staggered | R | Implements Roth and Sant’Anna (2021), Callaway and Sant’Anna (2021), and Sun and Abraham (2021)
xtevent | Stata | Implements Freyaldenhoven et al. (2019)

DiD with Covariates
Package | Software | Description
DRDID, drdid | R, Stata | Implements Sant’Anna and Zhao (2020)

Diagnostics for TWFE with Staggered Timing
Package | Software | Description
bacondecomp, ddtiming | R, Stata | Diagnostics from Goodman-Bacon (2021)
TwoWayFEWeights | R, Stata | Diagnostics from de Chaisemartin and D’Haultfoeuille (2020)

Diagnostic / Sensitivity for Violations of Parallel Trends
Package | Software | Description
honestDiD | R, Stata | Implements Rambachan and Roth (2022b)
pretrends | R | Diagnostics from Roth (2022)

Note: This table lists R and Stata packages for recent DiD methods, and is based on Asjad Naqvi’s repository at https://asjadnaqvi.github.io/DiD/. Several of the packages listed under “Heterogeneity Robust Estimators” also accommodate covariates.
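To illustrate how these packages are typically invoked, the snippet below sketches a call to the did package (Callaway and Sant’Anna, 2021). The data frame and column names are hypothetical placeholders; `first.treat` records the period in which a unit is first treated, with 0 for never-treated units.

```r
# Hypothetical sketch using the did package (Callaway and Sant'Anna, 2021).
library(did)
out <- att_gt(yname  = "y",           # outcome column
              tname  = "year",        # time period column
              idname = "id",          # unit identifier
              gname  = "first.treat", # period of first treatment (0 if never treated)
              data   = df)
summary(aggte(out, type = "dynamic")) # aggregate group-time ATTs to an event study
```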
References
Abadie, Alberto, “Semiparametric Difference-in-Differences Estimators,” The Review of
Economic Studies, 2005, 72 (1), 1–19.
, “Using Synthetic Controls: Feasibility, Data Requirements, and Methodological As-
pects,” Journal of Economic Literature, June 2021, 59 (2), 391–425.
, Alexis Diamond, and Jens Hainmueller, “Synthetic Control Methods for Com-
parative Case Studies: Estimating the Effect of Californias Tobacco Control Program,”
Journal of the American Statistical Association, June 2010, 105 (490), 493–505.
and Guido W. Imbens, “Large sample properties of matching estimators for average
treatment effects,” Econometrica, 2006, 74 (1), 235–267.
and , “On the Failure of the Bootstrap for Matching Estimators,” Econometrica, 2008,
76 (6), 1537–1557.
and , “Bias-Corrected Matching Estimators for Average Treatment Effects,” Journal
of Business & Economic Statistics, 2011, 29 (1), 1–11.
and , “A Martingale Representation for Matching Estimators,” Journal of the American
Statistical Association, 2012, 107 (498), 833–843.
and Jann Spiess, “Robust Post-Matching Inference,” Journal of the American Statis-
tical Association, April 2022, 117 (538), 983–995.
, Susan Athey, Guido Imbens, and Jeffrey Wooldridge, “When Should You Adjust
Standard Errors for Clustering?,” The Quarterly Journal of Economics, 2023, 138 (1),
1–35.
, , Guido W. Imbens, and Jeffrey M. Wooldridge, “Sampling-Based versus
Design-Based Uncertainty in Regression Analysis,” Econometrica, 2020, 88 (1), 265–296.
Abbring, Jaap H. and Gerard J. van den Berg, “The nonparametric identification of
treatment effects in duration models,” Econometrica, 2003, 71 (5), 1491–1517.
Andrews, Isaiah, Jonathan Roth, and Ariel Pakes, “Inference for Linear Conditional
Moment Inequalities,” Review of Economic Studies, 2022, Forthcoming.
Angrist, Joshua D. and Jorn-Steffen Pischke, Mostly Harmless Econometrics: An
Empiricist’s Companion, Princeton: Princeton University Press, 2009.
Arellano, M., “Practitioners Corner: Computing Robust Standard Errors for Within-
groups Estimators,” Oxford Bulletin of Economics and Statistics, 1987, 49 (4), 431–434.
Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens,
and Stefan Wager, “Synthetic Difference-in-Differences,” American Economic Review,
2021, 111 (12), 4088–4118.
Armstrong, Timothy B. and Michal Kolesár, “Optimal Inference in a Class of Regres-
sion Models,” Econometrica, 2018, 86 (2), 655–683.
Athey, Susan and Guido Imbens, “Design-based Analysis in Difference-In-Differences
Settings with Staggered Adoption,” Journal of Econometrics, 2022, 226 (1), 62–79.
and Guido W. Imbens, “Identification and Inference in Nonlinear Difference-in-
Differences Models,” Econometrica, 2006, 74 (2), 431–497.
Baker, Andrew, David F. Larcker, and Charles C. Y. Wang, “How much should
we trust staggered difference-in-differences estimates?,” Journal of Financial Economics,
2022, 144 (2), 370–395.
Ben-Michael, Eli, Avi Feller, and Jesse Rothstein, “The Augmented Synthetic Control
Method,” Journal of the American Statistical Association, 2021, 116 (536), 1789–1803.
, , and , “Synthetic controls with staggered adoption,” Journal of the Royal Statistical
Society: Series B, 2022, 84 (2), 351–381.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan, “How Much Should
We Trust Differences-In-Differences Estimates?,” The Quarterly Journal of Economics,
2004, 119 (1), 249–275.
Bilinski, Alyssa and Laura A. Hatfield, “Seeking evidence of absence: Reconsidering
tests of model assumptions,” arXiv:1805.03273 [stat], May 2018.
Bojinov, Iavor, Ashesh Rambachan, and Neil Shephard, “Panel experiments and
dynamic causal effects: A finite population perspective,” Quantitative Economics, 2021,
12 (4), 1171–1196.
Bonhomme, Stéphane and Ulrich Sauder, “Recovering distributions in difference-in-
differences models: A comparison of selective and comprehensive schooling,” Review of
Economics and Statistics, 2011, 93 (2), 479–494.
Borusyak, Kirill and Xavier Jaravel, “Revisiting Event Study Designs,” SSRN Scholarly
Paper ID 2826228, Social Science Research Network, Rochester, NY 2018.
, , and Jann Spiess, “Revisiting Event Study Designs: Robust and Efficient Estima-
tion,” arXiv:2108.12419 [econ], 2021.
Butts, Kyle, “Difference-in-Differences Estimation with Spatial Spillovers,”
arXiv:2105.03737 [econ], 2021.
Caetano, Carolina, Brantly Callaway, Stroud Payne, and Hugo Sant’Anna
Rodrigues, “Difference in Differences with Time-Varying Covariates,” February 2022.
arXiv:2202.02903 [econ].
Callaway, Brantly and Pedro H. C. Sant’Anna, “Difference-in-Differences with multiple
time periods,” Journal of Econometrics, 2021, 225 (2), 200–230.
and Tong Li, “Quantile treatment effects in difference in differences models with panel
data,” Quantitative Economics, 2019, 10 (4), 1579–1618.
, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna, “Difference-in-Differences
with a Continuous Treatment,” arXiv:2107.02637 [econ], 2021.
Cameron, A Colin, Jonah B Gelbach, and Douglas L Miller, “Bootstrap-Based
Improvements for Inference With Clustered Errors,” Review of Economics and Statistics,
2008, 90 (3), 414–427.
Canay, Ivan A., Andres Santos, and Azeem M. Shaikh, “The wild bootstrap with
a small number of large clusters,” Review of Economics and Statistics, 2021, 103 (2),
346–363.
, Joseph P. Romano, and Azeem M. Shaikh, “Randomization Tests Under an Ap-
proximate Symmetry Assumption,” Econometrica, 2017, 85 (3), 1013–1030.
Card, David and Alan B Krueger, “Minimum Wages and Employment: A Case Study
of the Fast-Food Industry in New Jersey and Pennsylvania,” American Economic Review,
1994, 84 (4), 772–793.
Cengiz, Doruk, Arindrajit Dube, Attila Lindner, and Ben Zipperer, “The Effect
of Minimum Wages on Low-Wage Jobs,” The Quarterly Journal of Economics, 2019, 134
(3), 1405–1454.
Chabé-Ferret, Sylvain, “Analysis of the bias of Matching and Difference-in-Difference
under alternative earnings and selection processes,” Journal of Econometrics, 2015, 185
(1), 110–123.
Chang, Neng-Chieh, “Double/debiased machine learning for difference-in-differences,”
Econometrics Journal, 2020, 23, 177–191.
Chernozhukov, Victor, Kaspar Wüthrich, and Yinchu Zhu, “An Exact and Robust
Conformal Inference Method for Counterfactual and Synthetic Controls,” Journal of the
American Statistical Association, 2021, 116 (536), 1849–1864.
, Mert Demirer, Esther Duflo, and Iván Fernández-Val, “Generic Machine Learn-
ing Inference on Heterogeneous Treatment Effects in Randomized Experiments,” arXiv:
1712.04802, 2020, pp. 1–52.
Conley, Timothy G. and Christopher R. Taber, “Inference with “Difference in Dif-
ferences” with a Small Number of Policy Changes,” Review of Economics and Statistics,
2011, 93 (1), 113–125.
Daw, Jamie R. and Laura A. Hatfield, “Matching and Regression to the Mean in
Difference-in-Differences Analysis,” Health Services Research, 2018.
de Chaisemartin, Clément and Xavier D’Haultfoeuille, “Difference-in-Differences
Estimators of Intertemporal Treatment Effects,” March 2022. arXiv:2007.04267 [econ].
and Xavier D’Haultfoeuille, “Fuzzy Differences-in-Differences,” The Review of Economic
Studies, 2018, 85 (2), 999–1028.
and , “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects,”
American Economic Review, 2020, 110 (9), 2964–2996.
and , “Two-way Fixed Effects Regressions with Several Treatments,” SSRN Scholarly
Paper ID 3751060, Social Science Research Network, Rochester, NY 2021.
and , “Two-way fixed effects and differences-in-differences with heterogeneous treat-
ment effects: a survey,” Econometrics Journal, 2022, Forthcoming.
Dette, Holger and Martin Schumann, “Difference-in-Differences Estimation Under Non-
Parallel Trends,” Working Paper, 2020.
Dettmann, Eva, “Flexpaneldid: A Stata Toolbox for Causal Analysis with Varying Treat-
ment Time and Duration,” SSRN Scholarly Paper ID 3692458, Social Science Research
Network, Rochester, NY 2020.
Ding, Peng and Fan Li, “A Bracketing Relationship between Difference-in-Differences
and Lagged-Dependent-Variable Adjustment,” Political Analysis, 2019, 27, 605–615.
Donald, Stephen G. and Kevin Lang, “Inference with Difference-in-Differences and
Other Panel Data,” The Review of Economics and Statistics, 2007, 89 (2), 221–233.
Doudchenko, Nikolay and Guido W. Imbens, “Balancing, Regression, Difference-In-
Differences and Synthetic Control Methods: A Synthesis,” Working Paper 22791, National
Bureau of Economic Research 2016.
Ferman, Bruno and Cristine Pinto, “Inference in Differences-in-Differences with Few
Treated Groups and Heteroskedasticity,” The Review of Economics and Statistics, 2019,
101 (3), 452–467.
Fisher, R. A., The Design of Experiments, Oxford, England: Oliver & Boyd, 1935.
Freyaldenhoven, Simon, Christian Hansen, and Jesse M. Shapiro, “Pre-event
Trends in the Panel Event-Study Design,” American Economic Review, 2019, 109 (9),
3307–3338.
, , Jorge Pérez Pérez, and Jesse M. Shapiro, “Visualization, identification, and
estimation in the linear panel event-study design.,” Advances in Economics and Econo-
metrics: Twelfth World Congress, 2021, Forthcoming.
Gardner, John, “Two-stage differences in differences,” Working Paper, 2021.
Goodman-Bacon, Andrew, “Difference-in-differences with variation in treatment timing,”
Journal of Econometrics, 2021, 225 (2), 254–277.
Gruber, Jonathan, “The Incidence of Mandated Maternity Benefits,” The American Eco-
nomic Review, 1994, 84 (3), 622–641.
Hagemann, Andreas, “Inference with a single treated cluster,” arXiv:2010.04076
[econ.EM], 2020, pp. 1–23.
, “Permutation inference with a finite number of heterogeneous clusters,” arXiv:1907.01049
[econ.EM], 2021.
Hasegawa, Raiden B., Daniel W. Webster, and Dylan S. Small, “Evaluating Missouri’s Handgun Purchaser Law: A Bracketing Method for Addressing Concerns About
History Interacting with Group,” Epidemiology, 2019, 30 (3), 371–379.
Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, “Character-
izing Selection Bias Using Experimental Data,” Econometrica, 1998, 66 (5), 1017–1098.
Heckman, James J., Hidehiko Ichimura, and Petra Todd, “Matching as an econo-
metric evaluation estimator: Evidence from evaluating a job training programme,” The
Review of Economic Studies, 1997, 64 (4), 605–654.
Holland, Paul W., “Statistics and Causal Inference,” Journal of the American Statistical
Association, 1986, 81 (396), 945–960.
Huber, Martin and Andreas Steinmayr, “A Framework for Separating Individual-Level
Treatment Effects From Spillover Effects,” Journal of Business & Economic Statistics,
2021, 39 (2), 422–436.
Ibragimov, Rustam and Ulrich K. Müller, “Inference with few heterogeneous clusters,”
Review of Economics and Statistics, 2016, 98 (1), 83–96.
Imai, Kosuke and In Song Kim, “On the Use of Two-way Fixed Effects Regression
Models for Causal Inference with Panel Data,” Political Analysis, 2021, 29 (3), 405–415.
Jakiela, Pamela, “Simple Diagnostics for Two-Way Fixed Effects,” arXiv:2103.13229 [econ,
q-fin], March 2021.
Kahn-Lang, Ariella and Kevin Lang, “The Promise and Pitfalls of Differences-in-
Differences: Reflections on 16 and Pregnant and Other Applications,” Journal of Business
& Economic Statistics, 2020, 38 (3), 613–620.
Keele, Luke J., Dylan S. Small, Jesse Y. Hsu, and Colin B. Fogarty, “Patterns
of Effects and Sensitivity Analysis for Differences-in-Differences,” arXiv:1901.01869 [stat],
February 2019. arXiv: 1901.01869.
Khan, Shakeeb and Elie Tamer, “Irregular Identification, Support Conditions, and In-
verse Weight Estimation,” Econometrica, 2010, 78 (6), 2021–2042.
Lee, Sokbae, Ryo Okui, and Yoon-Jae Jae Whang, “Doubly robust uniform con-
fidence band for the conditional average treatment effect function,” Journal of Applied
Econometrics, 2017, 32 (7), 1207–1225.
Liang, Kung-Yee and Scott L. Zeger, “Longitudinal data analysis using generalized
linear models,” Biometrika, 1986, 73 (1), 13–22.
Lin, Winston, “Agnostic notes on regression adjustments to experimental data: Reexam-
ining Freedman’s critique,” Annals of Applied Statistics, 2013, 7 (1), 295–318.
Liu, Licheng, Ye Wang, and Yiqing Xu, “A Practical Guide to Counterfactual Esti-
mators for Causal Inference with Time-Series Cross-Sectional Data,” American Journal of
Political Science, 2022, Forthcoming.
MacKinnon, James G. and Matthew D. Webb, “The wild bootstrap for few (treated)
clusters,” The Econometrics Journal, 2018, 21 (2), 114–135.
, Morten Ørregaard Nielsen, and Matthew D. Webb, “Cluster-robust inference: A
guide to empirical practice,” Journal of Econometrics, 2022, Forthcoming.
Malani, Anup and Julian Reif, “Interpreting pre-trends as anticipation: Impact on
estimated treatment effects from tort reform,” Journal of Public Economics, 2015, 124,
1–17.
Manski, Charles F. and John V. Pepper, “How Do Right-to-Carry Laws Affect Crime
Rates? Coping with Ambiguity Using Bounded-Variation Assumptions,” The Review of
Economics and Statistics, 2018, 100 (2), 232–244.
Marcus, Michelle and Pedro H. C. Sant’Anna, “The role of parallel trends in event
study settings : An application to environmental economics,” Journal of the Association
of Environmental and Resource Economists, 2021, 8 (2), 235–275.
McKenzie, David, “Beyond baseline and follow-up: The case for more T in experiments,”
Journal of Development Economics, 2012, 99 (2), 210–221.
Meyer, Bruce D., “Natural and Quasi-Experiments in Economics,” Journal of Business
& Economic Statistics, 1995, 13 (2), 151–161.
Neyman, Jerzy, “On the Application of Probability Theory to Agricultural Experiments.
Essay on Principles. Section 9.,” Statistical Science, 1923, 5 (4), 465–472.
Olea, José Luis Montiel and Mikkel Plagborg-Møller, “Simultaneous confidence bands:
Theory, implementation, and an application to SVARs,” Journal of Applied Econometrics,
2019, 34 (1), 1–17.
Rambachan, Ashesh and Jonathan Roth, “Design-Based Uncertainty for Quasi-
Experiments,” November 2022. arXiv:2008.00602 [econ, stat].
and , “A More Credible Approach to Parallel Trends,” Review of Economic Studies,
2022, Forthcoming.
Robins, J. M., “A New Approach To Causal Inference in Mortality Studies With a Sus-
tained Exposure Period - Application To Control of the Healthy Worker Survivor Effect,”
Mathematical Modelling, 1986, 7, 1393–1512.
Roth, Jonathan, “Should We Adjust for the Test for Pre-trends in Difference-in-Difference
Designs?,” arXiv:1804.01208 [econ, math, stat], 2018.
, “Pre-test with Caution: Event-study Estimates After Testing for Parallel Trends,” Amer-
ican Economic Review: Insights, 2022, 4 (3), 305–322.
and Pedro H. C. Sant’Anna, “Efficient Estimation for Staggered Rollout Designs,”
arXiv:2102.01291 [econ, math, stat], 2021.
and , “When Is Parallel Trends Sensitive to Functional Form?,” Econometrica, 2022,
Forthcoming.
Rubin, Donald B., “Estimating causal effects of treatments in randomized and nonran-
domized studies,” Journal of Educational Psychology, 1974, 66 (5), 688–701.
Ryan, Andrew M., “Well-Balanced or too Matchy-Matchy? The Controversy over Match-
ing in Difference-in-Differences,” Health Services Research, 2018, 53 (6), 4106–4110.
Sant’Anna, Pedro H. C. and Jun Zhao, “Doubly robust difference-in-differences esti-
mators,” Journal of Econometrics, 2020, 219 (1), 101–122.
Schmidheiny, Kurt and Sebastian Siegloch, “On Event Studies and Distributed-Lags
in Two-Way Fixed Effects Models: Identification, Equivalence, and Generalization,” SSRN
Scholarly Paper 3571164, Social Science Research Network, Rochester, NY 2020.
Shaikh, Azeem and Panos Toulis, “Randomization Tests in Observational Studies With
Staggered Adoption of Treatment,” Journal of the American Statistical Association, 2021,
116 (536), 1835–1848.
Strezhnev, Anton, “Semiparametric weighting estimators for multi-period difference-in-
differences designs,” Working Paper, 2018.
Sun, Liyang and Sarah Abraham, “Estimating dynamic treatment effects in event stud-
ies with heterogeneous treatment effects,” Journal of Econometrics, 2021, 225 (2), 175–199.
Viviano, Davide and Jelena Bradic, “Dynamic covariate balancing: estimating treat-
ment effects over time,” June 2021. arXiv:2103.01280 [econ, math, stat].
Wager, Stefan and Susan Athey, “Estimation and Inference of Heterogeneous Treatment
Effects using Random Forests,” Journal of the American Statistical Association, 2018, 113
(523), 1228–1242.
Wooldridge, Jeffrey M, “Cluster-Sample Methods in Applied Econometrics,” American
Economic Review P&P , 2003, 93 (2), 133–138.
, “Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-
Differences Estimators,” Working Paper, 2021, pp. 1–89.
Ye, Ting, Luke Keele, Raiden Hasegawa, and Dylan S. Small, “A Negative Corre-
lation Strategy for Bracketing in Difference-in-Differences,” arXiv:2006.02423 [econ, stat],
2021.
Zeldow, Bret and Laura A. Hatfield, “Confounding and regression adjustment in
difference-in-differences studies,” Health Services Research, 2021, 56 (5), 932–941.
A Connecting model-based assumptions to potential
outcomes
This section formalizes connections between the model-based assumptions in Section 5.1 and the potential outcomes framework. We derive how the errors in the structural model (18) map to primitives based on potential outcomes in the canonical model from Section 2. Specifically, we show that under the set-up of Section 2, Assumptions 1 and 2 imply that the canonical DiD estimator takes the form given in (20), where $\beta = \tau_2$ is the ATT at the cluster level, $\nu_{jt} = \nu_{jt,0} + D_j \nu_{jt,1}$, and $\epsilon_{ijt} = \epsilon_{ijt,0} + D_j \epsilon_{ijt,1}$, where (writing $E[Y_{ijt}(0) \mid D_j = d]$ for the expectation where one first draws $j$ from the population with $D_j = d$ and then draws $Y_{ijt}(0)$ from that cluster):

$$\begin{aligned}
\epsilon_{ijt,0} &= Y_{ijt}(0) - E[Y_{ijt}(0) \mid j(i) = j] \\
\epsilon_{ijt,1} &= Y_{ijt}(1) - Y_{ijt}(0) - E[Y_{ijt}(1) - Y_{ijt}(0) \mid j(i) = j] \\
\nu_{jt,0} &= E[Y_{ijt}(0) \mid j(i) = j] - E[Y_{ijt}(0) \mid D_j] \\
\nu_{jt,1} &= E[Y_{ijt}(1) - Y_{ijt}(0) \mid j(i) = j] - \tau_t.
\end{aligned}$$

Thus, in the canonical set-up, restrictions on $\nu_{jt}$ and $\epsilon_{ijt}$ can be viewed as restrictions on primitives that are functions of the potential outcomes.
Adopt the notation and set-up in Section 2, except now each unit $i$ belongs to a cluster $j$ and treatment is assigned at the cluster level $D_j$. We assume clusters are drawn iid from a super-population of clusters and then units are drawn iid within the sampled cluster. We write $J_d$ to denote the number of clusters with treatment $d$, and $n_j$ the number of units in cluster $j$. As in the main text, let $\bar{Y}_{jt} = n_j^{-1} \sum_{i: j(i) = j} Y_{ijt}$ be the sample mean within cluster $j$. The canonical DiD estimator at the cluster level can then be written as:
$$\begin{aligned}
\hat{\tau} &= \frac{1}{J_1} \sum_{j: D_j = 1} \left( \bar{Y}_{j2} - \bar{Y}_{j1} \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \left( \bar{Y}_{j2} - \bar{Y}_{j1} \right) \\
&= \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2} - Y_{ij1} \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2} - Y_{ij1} \right).
\end{aligned}$$
Since the observed outcome is $Y(1)$ for treated units and $Y(0)$ for control units, under the no anticipation assumption it follows that
$$\hat{\tau} = \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(1) - Y_{ij1}(0) \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) \right),$$
or equivalently,
$$\begin{aligned}
\hat{\tau} = {} & \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(1) - Y_{ij2}(0) \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) \right).
\end{aligned}$$
Adding and subtracting terms of the form $E[Y_{ijt} \mid j(i) = j]$, we obtain
$$\begin{aligned}
\hat{\tau} = \tau_2 & + \frac{1}{J_1} \sum_{j: D_j = 1} \left( E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] - \tau_2 \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(1) - Y_{ij2}(0) - E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right) \\
& - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] - \frac{1}{J_0} \sum_{j: D_j = 0} E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j],
\end{aligned}$$

where $\tau_2 = E\left[ J_1^{-1} \sum_{j: D_j = 1} E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] \right] = E[Y_{ij2}(1) - Y_{ij2}(0) \mid D_j = 1]$ is the ATT among treated clusters (weighting all clusters equally).
Now, we assume parallel trends at the cluster level, so that

$$E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j = 1] - E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j = 0] = 0,$$

which implies that
$$\begin{aligned}
\hat{\tau} = \tau_2 & + \frac{1}{J_1} \sum_{j: D_j = 1} \underbrace{\left( E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] - \tau_2 \right)}_{\nu_{j,1}} \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \underbrace{\left( Y_{ij2}(1) - Y_{ij2}(0) - E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] \right)}_{\epsilon_{ij,1}} \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \underbrace{\left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right)}_{\epsilon_{ij,0}} \\
& - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \underbrace{\left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right)}_{\epsilon_{ij,0}} \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \underbrace{\left( E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] - E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j] \right)}_{\nu_{j,0}} \\
& - \frac{1}{J_0} \sum_{j: D_j = 0} \underbrace{\left( E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] - E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j] \right)}_{\nu_{j,0}}.
\end{aligned}$$

Letting $\nu_j = \nu_{j,0} + D_j \nu_{j,1}$ and $\epsilon_{ij} = \epsilon_{ij,0} + D_j \epsilon_{ij,1}$, it follows that $\hat{\tau}$ takes the form (20) with $\beta = \tau_2$.