What’s Trending in Difference-in-Differences?
A Synthesis of the Recent Econometrics Literature
Jonathan Roth (Brown University, jonathanroth@brown.edu)
Pedro H. C. Sant'Anna (Microsoft and Vanderbilt University, pedro.h.santanna@vanderbilt.edu)
Alyssa Bilinski (Brown University, alyssa_bilinski@brown.edu)
John Poe (University of Michigan, john@johndavidpoe.com)
January 9, 2023
We thank Brant Callaway, Bruno Ferman, Andreas Hagemann, Kevin Lang, David McKenzie, and David Schönholzer for helpful comments, and Scott Barkowski for suggesting the title.
Abstract
This paper synthesizes recent advances in the econometrics of difference-in-differences
(DiD) and provides concrete recommendations for practitioners. We begin by articu-
lating a simple set of “canonical” assumptions under which the econometrics of DiD are
well-understood. We then argue that recent advances in DiD methods can be broadly
classified as relaxing some components of the canonical DiD setup, with a focus on
(i) multiple periods and variation in treatment timing, (ii) potential violations of par-
allel trends, or (iii) alternative frameworks for inference. Our discussion highlights
the different ways that the DiD literature has advanced beyond the canonical model,
and helps to clarify when each of the papers will be relevant for empirical work. We
conclude by discussing some promising areas for future research.
1 Introduction
Difference-in-differences (DiD) is one of the most popular methods in the social sciences for
estimating causal effects in non-experimental settings. The last few years have seen a dizzy-
ing array of new methodological papers on DiD and related designs, making it challenging
for practitioners to keep up with rapidly evolving best practices. Furthermore, the recent
literature has addressed a variety of different components of DiD analyses, which has made
it difficult even for experts in the field to understand how all of the new developments fit
together. In this paper, we attempt to synthesize some of the recent advances on DiD and
related designs and to provide concrete recommendations for practitioners.
Our starting point in Section 2 is the “canonical” difference-in-differences model, where
two time periods are available, there is a treated population of units that receives a treatment
of interest beginning in the second period, and a comparison population that does not receive
the treatment in either period. The key identifying assumption is that the average outcome
among the treated and comparison populations would have followed “parallel trends” in
the absence of treatment. We also assume that the treatment has no causal effect before
its implementation (no anticipation). Together, these assumptions allow us to identify the
average treatment effect on the treated (ATT). If we observe a large number of independent
clusters from the treated and comparison populations, the ATT can be consistently estimated
using a two-way fixed effects (TWFE) regression specification, and clustered standard errors
provide asymptotically valid inference.
In practice, DiD applications typically do not meet all of the requirements of the canonical
DiD setup. The recent wave of DiD papers have each typically focused on relaxing one or
two of the key assumptions in the canonical framework while preserving the others. We
taxonomize the recent DiD literature by characterizing which of the key assumptions in
the canonical model are relaxed. We focus on recent advances that (i) allow for multiple
periods and variation in treatment timing (Section 3); (ii) consider potential violations of
parallel trends (Section 4); or (iii) depart from the assumption of observing a sample of
many independent clusters sampled from a super-population (Section 5). Section 6 briefly
summarizes some other areas of innovation. In the remainder of the Introduction, we briefly
describe each of these strands of literature.
Multiple periods and variation in treatment timing: One strand of the DiD
literature has focused on settings where there are more than two time periods and units are
treated at different points in time. Multiple authors have noted that the coefficients from
standard TWFE models may not represent a straightforward weighted average of unit-level
treatment effects when treatment effects are allowed to be heterogeneous. In short, TWFE
regressions make both “clean” comparisons between treated and not-yet-treated units and
“forbidden” comparisons between units that are both already treated. When treatment
effects are heterogeneous, these “forbidden” comparisons potentially lead to severe drawbacks
such as TWFE coefficients having the opposite sign of all individual-level treatment effects
due to “negative weighting” problems. Even if all of the weights are positive, the weights
“chosen” by TWFE regressions may not correspond with the most policy-relevant parameter.
We discuss a variety of straightforward-to-implement strategies that have been proposed
to bypass the limitations associated with TWFE regressions and estimate causal parameters
of interest under rich sources of treatment effect heterogeneity. These procedures rely on
generalizations of the parallel trends assumption to the multi-period setting. A common
theme is that these new estimators isolate “clean” comparisons between treated and not-yet-
treated groups, and then aggregate them using user-specified weights to estimate a target
parameter of economic interest. We discuss differences between some of the recent proposals,
such as the exact comparison group used and the generalization of the parallel trends
assumption needed for validity, and provide concrete recommendations for practitioners.
We also briefly discuss extensions to more complicated settings such as when treatments
turn on-and-off over time or are non-binary.
Non-parallel trends: A second strand of the DiD literature focuses on the possibility
that the parallel trends assumption may be violated. One set of papers considers the set-
ting where parallel trends holds only conditional on observed covariates, and proposes new
estimators that are valid under a conditional parallel trends assumption. However, even if
one conditions on observable covariates, there are often concerns that the necessary parallel
trends assumption may be violated due to time-varying unobserved confounding factors. It
is therefore common practice to test for pre-treatment differences in trends (“pre-trends”)
as a test of the plausibility of the (conditional) parallel trends assumption. While intuitive,
researchers have identified at least three issues with this pre-testing approach. First, the
absence of a significant pre-trend does not necessarily imply that parallel trends holds; in
fact, these tests often have low power. Second, conditioning the analysis on the result of
a pre-test can introduce additional statistical distortions from a selection effect known as
pre-test bias. Third, if a significant difference in trends is detected, we may still wish to
learn something about the treatment effect of interest.
Several recent papers have therefore suggested alternative methods for settings where
there is concern that parallel trends may be violated. One class of solutions involves modifi-
cations to the common practice of pre-trends testing to ensure that the power of pre-tests is
high against relevant violations of parallel trends. A second class of solutions has proposed
methods that remain valid under certain types of violations of parallel trends, such as when
the post-treatment violation of parallel trends is assumed to be no larger than the maximal
pre-treatment violation of parallel trends, or when there are non-treated groups that are
known to be more or less affected by the confounding factors than the treated group. These approaches
allow for a variety of robustness and sensitivity analyses which are useful in a wide range of
empirical settings, and we discuss them in detail.
Alternative sampling assumptions: A third strand of the DiD literature discusses
alternatives to the classical “sampling-based” approach to inference with a large number of
clusters. One topic of interest is inference procedures in settings with a small number of
treated clusters. Standard cluster-robust methods assume that there is a large number of
both treated and untreated clusters, and thus can perform poorly in this case. A variety of
alternatives with better properties have been proposed for this case, including permutation
and bootstrap procedures. These methods typically either model the dependence of errors
across clusters, or alternatively place restrictions on the treatment assignment mechanism.
We briefly highlight these approaches and discuss the different assumptions needed for them
to perform well.
Another direction that has been explored relates to conducting “design-based” inference
for DiD. Canonical approaches to inference suppose that we have access to a sample of
independently-drawn clusters from an infinite super-population. However, it is not always
clear how to define the super-population, or to determine the appropriate level of clustering.
Design-based approaches address these issues by instead treating the source of randomness
in the data as coming from the stochastic assignment of treatment, rather than sampling
from an infinite super-population. Although design-based approaches have typically been
employed in the case of randomized experiments, recent work has extended this to the case of
“quasi-experimental” strategies like DiD. Luckily, the message of this literature is positive, in
the sense that methods that are valid from the canonical sampling-based view are typically
also valid from the design-based view. The design-based approach also yields the
clear recommendation that it is appropriate to cluster standard errors at the level at which
treatment is independently assigned.
Other topics: We conclude by briefly touching on some other areas of focus within the
DiD literature, as well as highlighting some areas for future research. Examples include using
DiD to estimate distributional treatment effects; settings with quasi-random treatment tim-
ing; spillover effects; estimating heterogeneous treatment effects; and connections between
DiD and other panel data methods.
Overall, the growing DiD econometrics literature emphasizes the importance of clarity
and precision in a researcher’s discussion of his or her assumptions, comparison group and
time frame selection, causal estimands, estimation methods, and robustness checks. When
used in combination with context-specific information, these new methods can both improve
the validity and interpretability of DiD results and more clearly delineate their limitations.
Given the vast literature on DiD, our goal is not to be comprehensive, but rather to give a
clean presentation of some of the most important directions the literature has gone. Wherever
possible, we try to give clear practical guidance for applied researchers, concluding each
section with concrete recommendations. For reference, we include
Table 1, which contains a checklist for a practitioner implementing a DiD analysis, and Table
2, which lists R and Stata packages for implementing many of the methods described in this
paper.
2 The Basic Model
This section describes a simple two-period setting in which the econometrics of DiD are well-
understood. Although this “canonical” setting is arguably too simple for most applications,
clearly articulating the assumptions in this setup serves as a useful baseline for understanding
recent innovations in the DiD literature.
2.1 Treatment Assignment and Timing
Consider a model in which there are two time periods, $t = 1, 2$. Units indexed by $i$ are
drawn from one of two populations. Units from the treated population ($D_i = 1$) receive a
treatment of interest between period $t = 1$ and $t = 2$, whereas units from the untreated
(a.k.a. comparison or control) population ($D_i = 0$) remain untreated in both time periods.
The econometrician observes an outcome $Y_{i,t}$ and treatment status $D_i$ for a panel of units,
$i = 1, \ldots, N$ and $t = 1, 2$. For example, $Y_{i,t}$ could be the fraction of people with insurance
coverage in state $i$ in year $t$, while $D_i$ could be an indicator for whether the state expanded
Medicaid in year 2. Although DiD methods also accommodate the case where only repeated
cross-sectional data is available, or where the panel is unbalanced, we focus on the simpler
setup with balanced panel data for ease of exposition.
2.2 Potential Outcomes and Target Parameter
We adopt a potential outcomes framework for the observed outcome, as in, e.g., Rubin
(1974) and Robins (1986). Let $Y_{i,t}(0,0)$ denote unit $i$'s potential outcome in period $t$ if $i$
remains untreated in both periods. Likewise, let $Y_{i,t}(0,1)$ denote unit $i$'s potential outcome
in period $t$ if $i$ is untreated in the first period but exposed to treatment by the second period.
To simplify notation we will write $Y_{i,t}(0) = Y_{i,t}(0,0)$ and $Y_{i,t}(1) = Y_{i,t}(0,1)$, but it will be
useful for our later discussion to make clear that these potential outcomes in fact correspond
with a path of treatments. As is usually the case, due to the fundamental problem of causal
inference (Holland, 1986), we only observe one of the two potential outcomes for each unit
$i$. That is, the observed outcome is given by $Y_{i,t} = D_i Y_{i,t}(1) + (1 - D_i) Y_{i,t}(0)$. This potential
outcomes framework implicitly encodes the stable unit treatment value assumption (SUTVA)
that unit $i$'s outcomes do not depend on the treatment status of unit $j \neq i$, which rules out
spillover and general equilibrium effects.
The causal estimand of primary interest in the canonical DiD setup is the average treat-
ment effect on the treated (ATT) in period $t = 2$,

  $\tau_2 = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1]$.

It simply measures the average causal effect on treated units in the period that they are
treated ($t = 2$). In our motivating example, $\tau_2$ would be the average effect of Medicaid
expansion on insurance coverage in period 2 for the states who expanded Medicaid.
2.3 The Parallel Trends Assumption and Identification
The challenge in identifying $\tau_2$ is that the untreated potential outcomes, $Y_{i,2}(0)$, are never
observed for the treated group ($D_i = 1$). Difference-in-differences methods overcome this
identification challenge via assumptions that allow us to impute the mean counterfactual
untreated outcomes for the treated group by using (a) the change in outcomes for the un-
treated group and (b) the baseline outcomes for the treated group. The key assumption for
identifying $\tau_2$ is the parallel trends assumption, which intuitively states that the average out-
come for the treated and untreated populations would have evolved in parallel if treatment
had not occurred.
Assumption 1 (Parallel Trends).

  $E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 1] = E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0]$.  (1)
In our motivating example, the parallel trends assumption says that the average change in
insurance coverage for expansion and non-expansion states would have been the same in the
absence of the Medicaid expansion.
The parallel trends assumption can be rationalized by imposing a particular generative
model for the untreated potential outcomes. If $Y_{i,t}(0) = \alpha_i + \phi_t + \epsilon_{i,t}$, where $\epsilon_{i,t}$ is mean-
independent of $D_i$, then Assumption 1 holds. Note that this model allows treatment to be
assigned non-randomly based on characteristics that affect the level of the outcome ($\alpha_i$), but
requires the treatment assignment to be mean-independent of variables that affect the trend
in the outcome ($\epsilon_{i,t}$). In other words, parallel trends allows for the presence of selection bias,
but the bias from selecting into treatment must be the same in period $t = 1$ as it is in period
$t = 2$.
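To see concretely why this generative model implies Assumption 1, note that the unit effect $\alpha_i$ differences out and the time effect $\phi_t$ is common to both groups (this one-line verification is ours, spelling out the claim above):

  $E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = d] = (\phi_2 - \phi_1) + E[\epsilon_{i,2} - \epsilon_{i,1} \mid D_i = d] = (\phi_2 - \phi_1) + E[\epsilon_{i,2} - \epsilon_{i,1}]$,

where the last equality uses the mean-independence of $\epsilon_{i,t}$ from $D_i$. The right-hand side does not depend on $d$, so the two sides of equation (1) coincide.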
Another important and often hidden assumption required for identification of $\tau_2$ is the
no-anticipation assumption, which states that the treatment has no causal effect prior to its
implementation. This is important for identification of $\tau_2$, since otherwise the changes in the
outcome for the treated group between period 1 and 2 could reflect not just the causal effect
in period $t = 2$ but also the anticipatory effect in period $t = 1$ (Abbring and van den Berg,
2003; Malani and Reif, 2015).

Assumption 2 (No anticipatory effects). $Y_{i,1}(0) = Y_{i,1}(1)$ for all $i$ with $D_i = 1$.

In our ongoing example, this implies that in years prior to Medicaid expansion, insurance
coverage in states that expanded Medicaid was not affected by the upcoming Medicaid
expansion.
Under the parallel trends and no-anticipation assumptions, the ATT in period 2 ($\tau_2$) is
identified. To see why this is the case, observe that by re-arranging terms in the parallel
trends assumption (see equation (1)), we obtain

  $E[Y_{i,2}(0) \mid D_i = 1] = E[Y_{i,1}(0) \mid D_i = 1] + E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0]$.
Further, by the no-anticipation assumption, $E[Y_{i,1}(0) \mid D_i = 1] = E[Y_{i,1}(1) \mid D_i = 1]$.[1] It
follows that
  $E[Y_{i,2}(0) \mid D_i = 1] = E[Y_{i,1}(1) \mid D_i = 1] + E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0]$
                           $= E[Y_{i,1} \mid D_i = 1] + E[Y_{i,2} - Y_{i,1} \mid D_i = 0]$,
where the second equality uses the fact that we observe $Y(1)$ for treated units and $Y(0)$ for
untreated units. The previous display shows that we can infer the counterfactual average
outcome for the treated group by taking its observed pre-treatment mean and adding the
change in mean for the untreated group. Since we observe $Y(1)$ for the treated group directly,
it follows that $\tau_2 = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1]$ is identified as
  $\tau_2 = \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 1]}_{\text{Change for } D_i = 1} - \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 0]}_{\text{Change for } D_i = 0}$,  (2)

i.e. the “difference-in-differences” of population means!
[1] For the identification argument, it suffices to impose only that $E[Y_{i,1}(0) \mid D_i = 1] = E[Y_{i,1}(1) \mid D_i = 1]$ directly, i.e. that there is no anticipation on average, which is slightly weaker than Assumption 2. We focus on Assumption 2 for ease of exposition (especially when we extend it to the staggered case below).
2.4 Estimation and Inference
Equation (2) gives an expression for $\tau_2$ in terms of a “difference-in-differences” of population
expectations. Therefore, a natural way to estimate $\tau_2$ is to replace expectations with their
sample analogs,

  $\hat{\tau}_2 = (\bar{Y}_{t=2,D=1} - \bar{Y}_{t=1,D=1}) - (\bar{Y}_{t=2,D=0} - \bar{Y}_{t=1,D=0})$,

where $\bar{Y}_{t=t',D=d}$ is the sample mean of $Y$ for treatment group $d$ in period $t'$.
Although these sample means could be computed “by hand”, an analogous way of com-
puting $\hat{\tau}_2$, which facilitates the computation of standard errors, is to use the two-way fixed
effects (TWFE) regression specification

  $Y_{i,t} = \alpha_i + \phi_t + (1[t = 2] \cdot D_i)\beta + \epsilon_{i,t}$,  (3)

which regresses the outcome $Y_{i,t}$ on an individual fixed effect, a time fixed effect, and an
interaction of a post-treatment indicator with treatment status.[2] In this canonical DiD
setup, it is straightforward to show that the ordinary least squares (OLS) coefficient $\hat{\beta}$ is
equivalent to $\hat{\tau}_2$.
OLS estimates of $\hat{\beta}$ from (3) provide consistent estimates and asymptotically valid con-
fidence intervals for $\tau_2$ when Assumptions 1 and 2 are combined with the assumption of
independent sampling.

Assumption 3. Let $W_i = (Y_{i,2}, Y_{i,1}, D_i)'$ denote the vector of outcomes and treatment status
for unit $i$. We observe a sample of $N$ i.i.d. draws $W_i \sim F$ for some distribution $F$ satisfying
parallel trends.
Under Assumptions 1-3 and mild regularity conditions,

  $\sqrt{N}(\hat{\beta} - \tau_2) \to_d N(0, \sigma^2)$

in the asymptotics where $N \to \infty$ and $T$ is fixed. The variance $\sigma^2$ is consistently estimable
using standard clustering methods that allow for arbitrary serial correlation at the unit level
(Liang and Zeger, 1986; Arellano, 1987; Wooldridge, 2003; Bertrand, Duflo and Mullainathan,
2004). The same logic easily extends to cases where the observations are individual units
who are members of independently-sampled clusters (e.g. states), and the standard errors
are clustered at the appropriate level, provided that the number of treated and untreated
clusters both grow large. Constructing consistent point estimates and asymptotically valid
confidence intervals is thus straightforward via OLS.

[2] With a balanced panel, the OLS coefficient $\hat{\beta}$ is also numerically identical to the coefficient from a regression that replaces the fixed effects with a constant, a treatment indicator, a second-period indicator, and the treatment × second-period interaction,

  $Y_{i,t} = \alpha + D_i\theta + 1[t = 2]\zeta + (1[t = 2] \cdot D_i)\beta + \varepsilon_{i,t}$.

The latter regression generalizes to repeated cross-sectional data.
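To make these steps concrete, the following is a minimal simulation sketch in Python (our own illustration, not from the paper; the data-generating process and all variable names are hypothetical). It computes $\hat{\tau}_2$ “by hand” as a difference-in-differences of the four group-period sample means, and then via the TWFE regression (3) with standard errors clustered at the unit level:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a balanced two-period panel satisfying parallel trends (true ATT = 1.0)
rng = np.random.default_rng(0)
N = 500
d = rng.binomial(1, 0.5, N)                       # population indicator D_i
df = pd.DataFrame({"unit": np.repeat(np.arange(N), 2),
                   "time": np.tile([1, 2], N),
                   "d": np.repeat(d, 2)})
df["post"] = (df["time"] == 2).astype(int)
df["y"] = (0.5 * df["d"] + 0.3 * df["post"] + 1.0 * df["d"] * df["post"]
           + rng.normal(0, 1, 2 * N))

# (a) "By hand": difference-in-differences of the four group-period sample means
m = df.groupby(["d", "time"])["y"].mean()
tau_hat = (m[1, 2] - m[1, 1]) - (m[0, 2] - m[0, 1])

# (b) Equivalent TWFE regression; cluster standard errors at the unit level
fit = smf.ols("y ~ C(unit) + C(time) + d:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(tau_hat, fit.params["d:post"], fit.bse["d:post"])
```

With a balanced panel, the two point estimates agree exactly, illustrating the numerical equivalence noted above; the regression route additionally delivers the clustered standard error.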
Having introduced all of the components of the “canonical” DiD model, we now discuss
the ways in which different strands of the recent DiD literature have relaxed each of these components.
3 Relaxing assumptions on treatment assignment and
timing
Several recent papers have focused primarily on relaxing the baseline assumptions about
treatment assignment and timing discussed in Section 2. A topic of considerable attention
has been settings where there are more than two periods, and units adopt a treatment
of interest at different points in time. For example, in practice different states expanded
Medicaid in different years. We provide an overview of some of the key developments in the
literature, and refer the reader to the review by de Chaisemartin and D'Haultfœuille (2022)
for additional details.
3.1 Generalized model with staggered treatment adoption
Several recent papers have focused on relaxing the timing assumptions discussed in Section
2, while preserving the remaining structure of the stylized model (i.e., parallel trends, no
anticipation, and independent sampling). Since most of the recent literature considers a
setup in which treatment is an absorbing state, we start with that framework; in Section
3.4, we discuss extensions to the case where treatment can turn on and off or is non-binary.
We introduce the following assumptions and notation, which capture the primary setting
studied in this literature.
Treatment timing. There are $T$ periods indexed by $t = 1, \ldots, T$, and units can receive a
binary treatment of interest in any of the periods $t > 1$. We denote by $D_{i,t}$ an indicator for
whether unit $i$ receives treatment in period $t$, and by $G_i = \min\{t : D_{i,t} = 1\}$ the earliest
period at which unit $i$ has received treatment. If $i$ is never treated during the sample, then
$G_i = \infty$. Treatment is an absorbing state, so that $D_{i,t} = 1$ for all $t \geq G_i$; that is, once a
unit is treated it remains treated for the remainder of the panel. Thus, for example, a state
that first expanded Medicaid in 2014 would have $G_i = 2014$, a state that first expanded
Medicaid in 2015 would have $G_i = 2015$, and a state that has not expanded Medicaid by
time $t = T$ would have $G_i = \infty$.
Potential outcomes. We extend the potential outcomes framework introduced above to
the multi-period setting. Let $0_s$ and $1_s$ denote $s$-dimensional vectors of zeros and ones,
respectively. We denote unit $i$'s potential outcome in period $t$ if they were first treated at
time $g$ by $Y_{i,t}(0_{g-1}, 1_{T-g+1})$, and denote by $Y_{i,t}(0_T)$ their “never-treated” potential outcome.
This notation again makes explicit that potential outcomes can depend on the entire path of
treatment assignments. Since we have assumed that treatment “stays on” once it is turned
on, the entire path of potential outcomes is summarized by the first treatment date ($g$),
and so to simplify notation we can index potential outcomes by treatment starting time:
$Y_{i,t}(g) = Y_{i,t}(0_{g-1}, 1_{T-g+1})$ and $Y_{i,t}(\infty) = Y_{i,t}(0_T)$.[3] Thus, for example, $Y_{i,2016}(2014)$ would
represent the insurance coverage in state $i$ in 2016 if they had first expanded Medicaid in
2014.

[3] If one were to map this staggered potential outcome notation to the one used in canonical DiD setups, we would write $Y_{i,t}(2)$ and $Y_{i,t}(\infty)$ instead of $Y_{i,t}(1)$ and $Y_{i,t}(0)$ as defined in Section 2.2, respectively. We use the $Y_{i,t}(0)$, $Y_{i,t}(1)$ notation in Section 2.2 because it is likely more familiar to the reader, and widely used in the literature on the canonical model.
Parallel trends. There are several ways to extend the canonical parallel trends assumption
to the staggered setting. The simplest extension of the parallel trends assumption to the
staggered case requires that the two-group, two-period version of parallel trends holds for
all combinations of periods and all combinations of “groups” treated at different times.

Assumption 4 (Parallel trends for staggered setting). For all $t \neq t'$ and $g \neq g'$,

  $E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g] = E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g']$.  (4)

This assumption imposes that in the counterfactual where treatment had not occurred, the
average outcomes for all adoption groups would have evolved in parallel. Thus, for example,
Assumption 4 would imply that if there had been no Medicaid expansions, insurance
rates would have evolved in parallel on average for all groups of states that adopted Medicaid
expansion in different years, including those who never expanded Medicaid.
Several variants of Assumption 4 have been considered in the literature. For example,
Callaway and Sant'Anna (2021) consider a relaxation of Assumption 4 that imposes (4) only
for years after some units are treated:
Assumption 4.a (Parallel trends for staggered setting - post-treatment only).

  $E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g] = E[Y_{i,t}(\infty) - Y_{i,t'}(\infty) \mid G_i = g']$

for all $t, t' \geq g_{\min} - 1$, where $g_{\min} = \min G$ is the first period in which any unit is treated.

This would require, for example, that groups of states that expanded Medicaid at different
times have parallel trends in $Y_{i,t}(\infty)$ after Medicaid expansion began, but does not necessar-
ily impose parallel trends in the pre-treatment period. Likewise, several papers, including
Callaway and Sant'Anna (2021) and Sun and Abraham (2021), consider versions that impose
(4) only for groups that are eventually treated, and not for the never-treated group (i.e.
excluding $g = \infty$). This would impose, for example, parallel trends among states that even-
tually expanded Medicaid, but not between eventually-adopting and never-adopting states.
There are tradeoffs between the different forms of Assumption 4: imposing parallel trends
for all groups and all periods is a stronger assumption and thus may be less plausible; on the
other hand, it may allow one to obtain more precise estimates.[4] We return to these tradeoffs
in our discussion of different estimators for the staggered case below.[5]

[4] If all units are eventually treated, then imposing parallel trends only among treated units also limits the number of periods for which the ATT is identified.

[5] In this paper, we specify the parallel trends assumption based on groups defined by the treatment starting date. It is also possible to adopt alternative definitions of parallel trends using groups that are more disaggregated than our $G$. For instance, one could impose parallel trends for all pairs of states, rather than for groups of states with the same treatment start date. Using a more disaggregated definition of a group strengthens the parallel trends assumption, but could potentially enable more efficient estimators. We focus on the group-level version of parallel trends for simplicity.
No anticipation. The no-anticipation assumption from the canonical model also extends
in a straightforward way to the staggered setting. Intuitively, it imposes that if a unit is
untreated in period $t$, their outcome does not depend on the time period in which they will
be treated in the future; that is, units do not act on knowledge of their future treatment
date before treatment starts.

Assumption 5 (Staggered no-anticipation assumption). $Y_{i,t}(g) = Y_{i,t}(\infty)$ for all $i$ and $t < g$.
3.2 Interpreting the estimand of two-way fixed effects models
Recall that in the simple two-period model, the estimand (population coefficient) of the two-
way fixed effects specification (3) corresponds with the ATT under the parallel trends and
no anticipation assumptions. A substantial focus of the recent literature has been whether
the estimand of commonly-used generalizations of this TWFE model to the multi-period,
staggered timing case have a similar, intuitive causal interpretation. In short, the literature
has shown that the estimand of TWFE specifications in the staggered setting often does
not correspond with an intuitive causal parameter even under the natural extensions of the
parallel trends and no-anticipation assumptions described above.
Static TWFE. We begin with a discussion of the “static” TWFE specification, which
regresses the outcome on individual and period fixed effects and an indicator for whether
unit $i$ is treated in period $t$,

  $Y_{i,t} = \alpha_i + \phi_t + D_{i,t}\beta_{post} + \epsilon_{i,t}$.  (5)
The static specification yields a sensible estimand when there is no heterogeneity in
treatment effects across either time or units. Formally, let $\tau_{i,t}(g) = Y_{i,t}(g) - Y_{i,t}(\infty)$. Suppose
that for all units $i$, $\tau_{i,t}(g) = \tau$ whenever $t \geq g$. This imposes that (1) all units have the
same treatment effect, and (2) the treatment has the same effect regardless of how long it
has been since treatment started. In our ongoing example, this would impose that the effect
of Medicaid expansion on insurance coverage is the same both across states and across time.
Then, under a suitable generalization of the parallel trends assumption (e.g. Assumption 4)
and the no-anticipation assumption (Assumption 5), the population regression coefficient $\beta_{post}$
from (5) is equal to $\tau$.
Issues arise with the static specification, however, when there is heterogeneity of treat-
ment effects over time, as shown in Borusyak and Jaravel (2018), de Chaisemartin and
D'Haultfœuille (2020), and Goodman-Bacon (2021), among others. Suppose first that there
is heterogeneity in time since treatment only. That is, $\tau_{i,t}(g) = \sum_{s \geq 0} \tau_s 1[t - g = s]$, so all
units have treatment effect $\tau_s$ in the $s$-th period after they receive treatment. In this case,
$\beta_{post}$ corresponds with a potentially non-convex weighted average of the parameters $\tau_s$, i.e.
$\beta_{post} = \sum_s \omega_s \tau_s$, where the weights $\omega_s$ sum to 1 but may be negative. The possibility of
negative weights is concerning because, for instance, all of the $\tau_s$ could be positive and yet
the coefficient $\beta_{post}$ may be negative! In particular, longer-run treatment effects will often
receive negative weights. Thus, for example, it is possible that the effect of Medicaid expan-
sion on insurance coverage is positive and grows over time since the expansion, and yet $\beta_{post}$
in (5) will be negative. More generally, if treatment effects vary across both time and units,
then $\tau_{i,t}(g)$ may get negative weight in the TWFE estimand for some combinations of $t$ and
$g$.[6]

[6] We focus in this section on decompositions of the static TWFE model in a standard, sampling-based framework. Athey and Imbens (2022) study the static specification in a finite-sample randomization-based framework.
Goodman-Bacon (2021) provides some helpful intuition to understand this phenomenon.
He shows that $\hat{\beta}_{post}$ can be written as a convex weighted average of difference-in-differences
comparisons between pairs of units and time periods in which one unit changed its treat-
ment status and the other did not. Counterintuitively, however, this decomposition includes
difference-in-differences that use as a “control” group units who were treated in earlier peri-
ods. For example, in 2016, a state that first expanded Medicaid in 2014 might be used as the
“control group” for a state that first adopted Medicaid in 2016. Hence, an early-treated unit
can get negative weights if it appears as a “control” for many later-treated units. This decom-
position further highlights that $\beta_{post}$ may not be a sensible estimand when treatment effects
differ across either units or time, because of its inclusion of these “forbidden comparisons”.[7]

[7] To the best of our knowledge, the phrase “forbidden comparisons” was introduced in Borusyak and Jaravel (2018).
We now give some more mathematical intuition for why weighting issues arise in the static
specification with heterogeneity. From the Frisch-Waugh-Lovell theorem, the coefficient $\beta_{post}$
from (5) is equivalent to the coefficient from a univariate regression of $Y_{i,t}$ on $D_{i,t} - \hat{D}_{i,t}$, where
$\hat{D}_{i,t}$ is the predicted value from a regression of $D_{i,t}$ on the other right-hand side variables in
(5), $D_{i,t} = \tilde{\alpha}_i + \tilde{\phi}_t + u_{i,t}$. However, a well-known issue with OLS with binary outcomes is that
its predictions may fall outside the unit interval. If the predicted value $\hat{D}_{i,t}$ is greater than
1, then $D_{i,t} - \hat{D}_{i,t}$ will be negative even when a unit is treated, and thus that unit's outcome
will get negative weight in $\hat{\beta}_{post}$. To see this more formally, we can apply the formula for
univariate OLS coefficients to obtain that

  $\hat{\beta}_{post} = \dfrac{\sum_{i,t} (D_{i,t} - \hat{D}_{i,t}) Y_{i,t}}{\sum_{i,t} (D_{i,t} - \hat{D}_{i,t})^2}$.  (6)
The denominator is positive, and so the weight that $\hat{\beta}_{post}$ places on $Y_{i,t}$ is proportional to
$D_{i,t} - \hat{D}_{i,t}$. Thus, if $D_{i,t} = 1$ and $D_{i,t} - \hat{D}_{i,t} < 0$, then $\hat{\beta}_{post}$ will be decreasing in $Y_{i,t}$ even
though unit $i$ is treated at period $t$. But because $Y_{i,t} = Y_{i,t}(\infty) + \tau_{i,t}(g)$, it follows that $\tau_{i,t}(g)$
gets negative weight in $\hat{\beta}_{post}$.
These negative weights will tend to arise for early-treated units in periods late in the
sample. To see why this is the case, we note that some algebra shows that $\hat{D}_{i,t} = \bar{D}_i + \bar{D}_t - \bar{D}$,
where $\bar{D}_i = T^{-1} \sum_t D_{i,t}$ is the time average of $D$ for unit $i$, $\bar{D}_t = N^{-1} \sum_i D_{i,t}$ is the cross-
sectional average of $D$ for period $t$, and $\bar{D} = (NT)^{-1} \sum_{i,t} D_{i,t}$ is the average of $D$ across
both periods and units. It follows that if we have a unit that has been treated for almost all
periods ($\bar{D}_i \approx 1$) and a period in which almost all units have been treated ($\bar{D}_t \approx 1$), then
$\hat{D}_{i,t} \approx 2 - \bar{D}$, which will be strictly greater than 1 whenever some unit-periods remain
untreated ($\bar{D} < 1$). We thus see that $\hat{\beta}_{post}$ will tend to put negative
weight on $\tau_{i,t}$ for early adopters in late periods within the sample. This decomposition makes
clear that the static OLS coefficient $\hat{\beta}_{post}$ is not aggregating natural comparisons of units,
and thus will not produce a sensible estimand when there is arbitrary heterogeneity. When
treatment effects are homogeneous (i.e. $\tau_{i,t}(g) = \tau$), the negative weights on $\tau$ for some
units cancel out the positive weights on other units, and thus $\beta_{post}$ recovers the causal effect
under a suitable generalization of parallel trends.
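The following minimal simulation (our own illustration; the cohort dates and effect sizes are chosen purely for exposition) makes the negative-weighting problem concrete: every unit's treatment effect is strictly positive and grows with time since treatment, yet the static TWFE coefficient comes out negative. It also computes the FWL residuals $D_{i,t} - \hat{D}_{i,t}$ to show that early adopters in late periods receive negative weight:

```python
import numpy as np

N, T = 40, 20
units = np.repeat(np.arange(N), T)
time = np.tile(np.arange(1, T + 1), N)
# Two adoption cohorts: half the units adopt at t=2, half at t=11 (no never-treated group)
g = np.where(np.arange(N) < N // 2, 2, 11)[units]
d = (time >= g).astype(float)
tau = 0.5 * (time - g + 1) * d          # strictly positive effects, growing in event time
y = 0.1 * units + 0.05 * time + tau     # unit effects + common trend + treatment effect

# Static TWFE: regress y on unit dummies, time dummies (t=1 omitted), and D_{i,t}
X = np.column_stack([(units[:, None] == np.arange(N)).astype(float),
                     (time[:, None] == np.arange(2, T + 1)).astype(float),
                     d])
beta_post = np.linalg.lstsq(X, y, rcond=None)[0][-1]
print(beta_post)                        # about -1.59, despite all effects being positive

# FWL weights: Dhat_{i,t} = Dbar_i + Dbar_t - Dbar; treated cells with D - Dhat < 0
# get negative weight. Early adopters in late periods are exactly such cells.
D_bar_i = d.reshape(N, T).mean(axis=1)[units]
D_bar_t = d.reshape(N, T).mean(axis=0)[time - 1]
resid = d - (D_bar_i + D_bar_t - d.mean())
print(resid[(g == 2) & (time >= 11)].min())   # negative, for treated early-adopter cells
```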
Dynamic TWFE. Next, we turn our attention to the “dynamic” specification that re-
gresses the outcome on individual and period fixed effects, as well as dummies for time
relative to treatment,

  $Y_{i,t} = \alpha_i + \phi_t + \sum_{r \neq 0} 1[R_{i,t} = r]\beta_r + \epsilon_{i,t}$,  (7)

where $R_{i,t} = t - G_i + 1$ is the time relative to treatment (e.g. $R_{i,t} = 1$ in the first treated
period for unit $i$), and the summation runs over all possible values of $R_{i,t}$ except for 0.
Unlike the static specification, the dynamic specification yields a sensible causal es-
timand when there is heterogeneity only in time since treatment. In particular, the re-
sults in Borusyak and Jaravel (2018) and Sun and Abraham (2021) imply that if $\tau_{i,t}(g) =
\sum_{s \geq 0} \tau_s 1[t - g = s]$, so all units have treatment effect $\tau_s$ in the $s$-th period after treatment,
then $\beta_s = \tau_s$ under suitable generalizations of the parallel trends and no-anticipation assump-
tions, such as Assumptions 4 and 5.[8] Thus, specification (7) will yield sensible estimates for
the dynamic effect of Medicaid expansion if the effect $r$ years after Medicaid expansion is
the same (on average) regardless of what year the state initially expanded coverage (for each
$r = 1, 2, \ldots$).

[8] We note that the homogeneity assumption can be relaxed so that all adoption cohorts have the same expected treatment effect, i.e. $E[\tau_{i,g+s}(g) \mid G = g] = \tau_s$ for all $s$ and $g$. Additionally, these results assume that all possible relative time indicators are included. As discussed in Sun and Abraham (2021), Baker, Larcker and Wang (2022), and Schmidheiny and Siegloch (2020), among others, problems may arise if one “bins” endpoints (e.g. includes a dummy for 5+ years since treatment).
Sun and Abraham (2021) show, however, that when there are heterogeneous dynamic
treatment effects across adoption cohorts, the coefficients from specification (7) become
difficult to interpret. Thus, for example, problems may arise if the average treatment effect
in the first year after adoption is different for states that adopted Medicaid in 2014 than
it is for states that adopted in 2015.[9]

[9] That is, if the effect in 2015 for the 2014 adoption cohort is different from the effect in 2016 for the 2015 adoption cohort.

There are two issues. First, as with the “static”
regression specification above, the coefficient $\beta_r$ may put negative weight on the treatment
effect $r$ periods after treatment for some units. Thus, for example, the treatment effect
for some states two years after Medicaid expansion may enter $\beta_2$ negatively. Second, the
coefficient $\beta_r$ can put non-zero weight on treatment effects at lags $r' \neq r$, so there is cross-lag
“contamination”. Thus, for example, the coefficient $\beta_2$ may be influenced by the treatment
effect for some states three periods after Medicaid expansion.

Like the static specification, the dynamic specification thus fails to yield sensible estimates
of dynamic causal effects under heterogeneity across cohorts. The derivation of this result
is mathematically more complex, and so we do not pursue it here. The intuition is that, as
in the static case, the dynamic OLS specification does not aggregate natural comparisons of
units and includes “forbidden comparisons” between sets of units both of which have already
been treated. An important implication of the results derived by Sun and Abraham (2021)
is that if treatment effects are heterogeneous, the “treatment lead” coefficients from (7) are
not guaranteed to be zero even if parallel trends is satisfied in all periods (and vice versa),
and thus evaluation of pre-trends based on these coefficients can be very misleading.
3.2.1 Diagnostic approaches
Several recent papers introduce diagnostic approaches for understanding the extent of the
aggregation issues under staggered treatment timing, with a focus on the static specification
(5). de Chaisemartin and D'Haultfœuille (2020) propose reporting the number/fraction of
group-time ATTs that receive negative weights, as well as the degree of heterogeneity in
treatment effects that would be necessary for the estimated treatment effect to have the
“wrong sign”. Goodman-Bacon (2021) proposes reporting the weights that $\hat{\beta}_{post}$ places on
the different 2-group, 2-period difference-in-differences, which allows one to evaluate how
much weight is being placed on “forbidden” comparisons of already-treated units and how
removing those comparisons would change the estimate. Jakiela (2021) proposes evaluating
both whether TWFE places negative weights on some treated units and whether the data
rejects the constant treatment effects assumption.
3.3 New estimators for staggered timing
Several recent papers have proposed alternative estimators that more sensibly aggregate
heterogeneous treatment effects in settings with staggered treatment timing. The derivation
of each of these estimators follows a similar logic to the derivation of the DiD estimator in
the motivating example in Section 2. We begin by specifying a causal parameter of interest
(analogous to the ATT $\tau_2$). With the help of the (generalized) parallel trends and no-
anticipation assumptions, we can infer the counterfactual outcomes for treated units using
trends in outcomes for an appropriately chosen “clean” control group of untreated units.
This allows us to express the target parameter in terms of identified expectations, analogous
to equation (2). Finally, we replace population expectations with sample averages to form
an estimator of the target parameter.
The Callaway and Sant'Anna estimator. We first describe in detail the approach taken
by Callaway and Sant'Anna (2021), and then discuss the connections to other approaches.
They consider as a building block the group-time average treatment effect on the treated,
$ATT(g,t) = E[Y_{i,t}(g) - Y_{i,t}(\infty) \mid G_i = g]$, which gives the average treatment effect at time $t$
for the cohort first treated in time $g$. For example, $ATT(2014, 2016)$ would be the average
treatment effect in 2016 for states who first expanded Medicaid in 2014. They then consider
identification and estimation under generalizations of the parallel trends assumption to the
staggered setting.[10] Intuitively, under the staggered versions of the parallel trends and no-
anticipation assumptions, we can identify $ATT(g,t)$ by comparing the expected change in
outcome for cohort $g$ between periods $g-1$ and $t$ to that for a control group not yet treated
at period $t$. Formally, under Assumption 4.a,

  $ATT(g,t) = E[Y_{i,t} - Y_{i,g-1} \mid G_i = g] - E[Y_{i,t} - Y_{i,g-1} \mid G_i = g']$, for any $g' > t$,

which can be viewed as the multi-period analog of the identification result in equation (2).
Since this holds for any comparison group $g' > t$, it also holds if we average over some set of
comparison groups $\mathcal{G}_{comp}$ such that $g' > t$ for all $g' \in \mathcal{G}_{comp}$,

  $ATT(g,t) = E[Y_{i,t} - Y_{i,g-1} \mid G_i = g] - E[Y_{i,t} - Y_{i,g-1} \mid G_i \in \mathcal{G}_{comp}]$.

[10] Callaway and Sant'Anna (2021) also consider generalizations where the parallel trends assumption holds only conditional on covariates. We discuss this extension in Section 4.2 below, but focus for now on the case without covariates.
We can then estimate $ATT(g,t)$ by replacing expectations with their sample analogs,

  $\widehat{ATT}(g,t) = \dfrac{1}{N_g} \sum_{i : G_i = g} [Y_{i,t} - Y_{i,g-1}] - \dfrac{1}{N_{\mathcal{G}_{comp}}} \sum_{i : G_i \in \mathcal{G}_{comp}} [Y_{i,t} - Y_{i,g-1}]$.  (8)

Specifically, Callaway and Sant'Anna (2021) consider two options for $\mathcal{G}_{comp}$. The first uses
only never-treated units ($\mathcal{G}_{comp} = \{\infty\}$) and the second uses all not-yet-treated units
($\mathcal{G}_{comp} = \{g' : g' > t\}$).[11] When there are a relatively small number of periods and treatment
cohorts, reporting $\widehat{ATT}(g,t)$ for all relevant $(g,t)$ may be reasonable.

[11] We note that if the never-treated units are not included in the comparison group (i.e. $\infty \notin \mathcal{G}_{comp}$), then one can rely on a weaker version of Assumption 4.a that excludes the never-treated group.
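As an illustration, here is a minimal Python sketch of the sample-analog estimator (8) (our own code, not the authors'; it assumes a balanced long-format pandas DataFrame with hypothetical columns unit, time, y, and G, where G holds the first treatment period and np.inf marks never-treated units; in practice one would use the packages in Table 2, which also provide standard errors):

```python
import numpy as np
import pandas as pd

def att_gt(df, g, t, comparison="not_yet_treated"):
    """Sample analog of ATT(g, t) in equation (8): the change in mean outcomes
    from period g-1 to period t for cohort g, minus the same change for a
    'clean' comparison group (never-treated, or not yet treated at t)."""
    wide = df.pivot(index="unit", columns="time", values="y")
    G = df.groupby("unit")["G"].first()
    change = wide[t] - wide[g - 1]          # Y_{i,t} - Y_{i,g-1}, one entry per unit
    if comparison == "never_treated":
        comp = np.isinf(G)                  # G_comp = {infinity}
    else:
        comp = G > t                        # G_comp = {g' : g' > t}
    return change[G == g].mean() - change[comp].mean()
```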
When there are many treated periods and/or cohorts, however, reporting all of the $\widehat{ATT}(g,t)$
may be cumbersome, and each one may be imprecisely estimated. Thankfully, the method
described above extends easily to estimating any weighted average of the $ATT(g,t)$. For
instance, we may be interested in an “event-study” parameter that gives the (weighted)
average of the treatment effect $l$ periods after adoption across different adoption cohorts,

  $ATT_l^w = \sum_g w_g \, ATT(g, g + l)$.  (9)

The weights $w_g$ could be chosen to weight different cohorts equally, or in terms of their relative
frequencies in the treated population. It is straightforward to form estimates of $ATT_l^w$
by averaging the estimates $\widehat{ATT}(g,t)$ discussed above. We refer the reader to Callaway
and Sant'Anna (2021) for a discussion of a variety of other weighted averages that may
be economically relevant. Inference is straightforward using either the delta method or a
bootstrap, as described in Callaway and Sant'Anna (2021).
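Continuing the sketch above, the aggregation (9) with cohort-size weights (one natural choice among the several the paper mentions; again our own illustrative code) simply averages the $\widehat{ATT}(g, g+l)$:

```python
def att_event_study(df, l):
    """Estimate ATT_l^w in equation (9), weighting cohorts by their relative
    frequencies among treated units that are observed at event time l.
    Assumes each treated cohort has g >= 2 so the baseline period g-1 exists."""
    G = df.groupby("unit")["G"].first()
    t_max = df["time"].max()
    cohorts = [g for g in sorted(G.unique())
               if np.isfinite(g) and g + l <= t_max]      # cohorts observed at lag l
    sizes = np.array([(G == g).sum() for g in cohorts], dtype=float)
    ests = np.array([att_gt(df, g, g + l) for g in cohorts])
    return ests @ (sizes / sizes.sum())
```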
This alternative approach to estimation has two primary advantages over standard static
or dynamic TWFE regressions. The first is that it provides sensible estimands even under
arbitrary heterogeneity of treatment effects. By sensible we mean both that the approach
avoids negative weighting and that the weighting of effects across cohorts is specified
by the researcher (e.g. proportional to cohort size) rather than determined by OLS (i.e.
proportional to the variance of the treatment indicator). The second advantage is that
it makes transparent exactly which units are being used as a control group to infer the
unobserved potential outcomes. This contrasts with standard TWFE models, which we
have seen make unintuitive comparisons under staggered timing.
Imputation estimators. Borusyak, Jaravel and Spiess (2021) introduce a related ap-
proach which they refer to as an imputation estimator (see also Gardner (2021), Liu, Wang
and Xu (2022), and Wooldridge (2021) for similar proposals). Specifically, they fit a TWFE
regression, $Y_{i,t}(\infty) = \alpha_i + \lambda_t + \epsilon_{i,t}$, using observations only for units and time periods that
are not yet treated. They then infer the never-treated potential outcome for each treated
unit using the predicted value from this regression, $\hat{Y}_{i,t}(\infty)$. This provides an estimate of the
treatment effect for each treated unit, $Y_{i,t} - \hat{Y}_{i,t}(\infty)$, and these individual-level estimates can
be aggregated to form estimates of summary parameters like the $ATT(g,t)$ described above.
These approaches yield valid estimates when parallel trends holds for all groups and time
periods and there is no anticipation (Assumptions 4 and 5).
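A minimal sketch of the imputation idea, using the same hypothetical DataFrame layout as above (our own illustration; the authors' packages implement the full estimator with valid inference):

```python
import statsmodels.formula.api as smf

def imputation_att(df):
    """Fit Y = alpha_i + lambda_t by OLS on not-yet-treated observations only,
    impute Y(infinity) for treated cells, and average the implied effects.
    Assumes every unit has at least one untreated period and that never-treated
    units exist, so all unit and time effects are estimable from clean cells."""
    clean = df[df["time"] < df["G"]]                    # not-yet-treated cells only
    fit = smf.ols("y ~ C(unit) + C(time)", data=clean).fit()
    treated = df[df["time"] >= df["G"]].copy()
    treated["tau_hat"] = treated["y"] - fit.predict(treated)   # Y - Yhat(infinity)
    return treated["tau_hat"].mean()    # overall average; can also group by (g, t)
```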
Comparison of CS and BJS approaches. How does the approach proposed by Callaway
and Sant'Anna (2021, CS) compare to that proposed by Borusyak et al. (2021, BJS)? For
simplicity, it is instructive to consider a simple non-staggered setting where there are three
periods ($t = 1, 2, 3$) and units are either treated in period 3 or never treated ($G \in \{3, \infty\}$).
In this case, the CS estimator for the treated group in period 3 (i.e. $ATT(3,3)$) is simply a
DiD comparing the treated/untreated units between periods 2 and 3,

  $\widehat{ATT}(3,3) = \underbrace{(\bar{Y}_{3,3} - \bar{Y}_{3,\infty})}_{\text{Diff at } t=3} - \underbrace{(\bar{Y}_{2,3} - \bar{Y}_{2,\infty})}_{\text{Diff at } t=2}$,

where $\bar{Y}_{t,g}$ is the average outcome in period $t$ for units with $G_i = g$. By contrast, the BJS
estimator runs a similar DiD, except instead of using period 2 as the baseline, the BJS
estimator uses the average outcome prior to treatment (across periods 1 and 2),
  $\widehat{ATT}_{BJS}(3,3) = \underbrace{(\bar{Y}_{3,3} - \bar{Y}_{3,\infty})}_{\text{Diff at } t=3} - \underbrace{(\bar{Y}_{pre,3} - \bar{Y}_{pre,\infty})}_{\text{Avg diff in pre-periods}}$,

where $\bar{Y}_{pre,g} = \frac{1}{2}(\bar{Y}_{1,g} + \bar{Y}_{2,g})$ is the average outcome for cohort $g$ across the two pre-treatment
periods. Thus, we see that the key difference between the CS and BJS estimators is that
CS makes all comparisons relative to the last pre-treatment period, whereas BJS makes com-
parisons relative to the average of the pre-treatment periods. This primary difference in how
the two approaches use pre-treatment periods extends beyond this simple case to settings
with staggered timing, although the math becomes substantially more complicated in the
staggered case (and thus we do not pursue it).
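To make the contrast concrete, here are the two estimators of the three-period example in code (our own illustration; ybar is a hypothetical dictionary of group-period mean outcomes keyed by (period, cohort), with np.inf marking the never-treated cohort):

```python
import numpy as np

def att33_cs(ybar):
    """CS: baseline is the last pre-treatment period (t = 2)."""
    return (ybar[3, 3] - ybar[3, np.inf]) - (ybar[2, 3] - ybar[2, np.inf])

def att33_bjs(ybar):
    """BJS: baseline is the average of the pre-treatment periods (t = 1, 2)."""
    pre_treated = (ybar[1, 3] + ybar[2, 3]) / 2
    pre_never = (ybar[1, np.inf] + ybar[2, np.inf]) / 2
    return (ybar[3, 3] - ybar[3, np.inf]) - (pre_treated - pre_never)
```

Under parallel trends across all three periods, both target the same parameter; they differ in precision and in how sensitive they are to violations of parallel trends in the earlier pre-period, which is exactly the tradeoff discussed next.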
What are the pros and cons of using the last pre-treatment period versus the average
of the pre-treatment periods? In general, there will be tradeoffs between efficiency and
the strength of the identifying assumption. On the one hand, averaging over multiple pre-
treatment periods can increase precision. Indeed, BJS prove that when Assumption 4 holds,
their estimator is efficient under homoskedasticity and serially uncorrelated errors; see also
Wooldridge (2021). Although these ideal conditions are unlikely to be satisfied exactly, this
does suggest that their estimator will tend to be more efficient than CS when the outcome
is not too heteroskedastic or serially correlated.[12] On the other hand, the two approaches
require different identifying assumptions: in the simple example above, CS only relies on
parallel trends between periods 2 and 3, whereas BJS relies on parallel trends for all three
periods.[13] More generally, the BJS approach imposes parallel trends for all groups and time
periods (Assumption 4), whereas the CS approach only relies on post-treatment parallel
trends (Assumption 4.a). Relying on parallel trends over a longer time horizon may lead to
larger biases if the parallel trends assumption holds only approximately: for example, if the
average untreated potential outcome is increasing faster among treated units than untreated
units over time, then the violation of parallel trends is larger when we compare periods farther
apart, and thus the BJS estimator using periods 1 and 2 as the comparison will have larger
bias than the CS estimator using only period 2; see Roth (2018) and de Chaisemartin and
D'Haultfœuille (2022) for additional discussion. Thus, the BJS estimator may be preferable
in settings where the outcome is not too serially correlated and the researcher is confident
in parallel trends across all periods, whereas the CS estimator may be preferred in settings
where serial correlation is high or the researcher is concerned about the validity of parallel
trends over longer time horizons.[14]

[12] By contrast, note that if $Y_{i,t}(0)$ follows a random walk, then $Y_{i,3}(0)$ is independent of $Y_{i,1}(0)$ conditional on $Y_{i,2}(0)$, and thus it is efficient to ignore the earlier pre-treatment periods, as CS does.

[13] Or more precisely, between the average outcome in periods 1 and 2, and period 3. See also Marcus and Sant'Anna (2021) for a discussion about different parallel trends assumptions.

[14] We also note that the BJS and CS estimators incorporate covariates differently: BJS adjust linearly for covariates, whereas CS consider more general adjustments, as described in Section 4.2.
Other related approaches. Several other recent papers propose estimation strategies
similar to those described above, although with some subtle differences in how they construct
the control group and the weights they place on different cohorts/time periods. de Chaise-
martin and D'Haultfœuille (2020) propose an estimator that can be applied when treatment
turns on and off (see Section 3.4 below), but in the context of the staggered setting here it
corresponds with the Callaway and Sant'Anna estimator for $ATT_0^w$ and a particular choice
of weights. Sun and Abraham (2021) propose an estimator that takes the form (8) but uses
either the never-treated units (if they exist) or the last-to-be-treated units as the compar-
ison ($\mathcal{G}_{comp} = \{\max_i G_i\}$), rather than the not-yet-treated units. Marcus and Sant'Anna (2021)
propose a recursive estimator that more efficiently exploits the identifying assumptions in
Callaway and Sant'Anna (2021). See also Imai and Kim (2021) and Strezhnev (2018) for
closely related ideas. Another related approach is to run a stacked regression where each
treated unit is matched to “clean” (i.e. not-yet-treated) controls and there are separate fixed
effects for each set of treated units and its controls, as in Cengiz, Dube, Lindner and Zipperer
(2019), among others. Gardner (2021) shows that this approach estimates a convex weighted
average of the $ATT(g,t)$ under parallel trends and no anticipation, although the weights are
determined by the number of treated units and the variance of treatment within each stacked
event, rather than by economic considerations.
3.4 Further extensions to treatment timing/assignment
Our discussion so far has focused on the case where there is a binary treatment that is adopted
at a particular date and remains on afterwards. Several recent papers have studied settings
with more complicated forms of treatment assignment. We briefly highlight a few of the
recent contributions, and refer the reader to the review in de Chaisemartin and D'Haultfœuille
(2022) for more details.
de Chaisemartin and D'Haultfœuille (2020) and Imai and Kim (2021) consider settings
where units are treated at different times, but do not necessarily require that treatment is an
absorbing state. Their estimators intuitively compare changes in outcomes for units whose
treatment status changed to other units whose treatment status remained constant over
the same periods. This approach yields an interpretable causal effect under generalizations
of the parallel trends assumption and an additional “no carryover” assumption, which
imposes that the potential outcomes depend only on current treatment status and not on
the full treatment history. We note that, as described in Bojinov, Rambachan and Shephard
(2021), the no-carryover assumption may be restrictive in many settings. For example, if
the treatment is a raise in the minimum wage and the outcome is employment, then the
no-carryover assumption requires that employment in period $t$ depends only on whether
the minimum wage was raised in period $t$ and not on the history of minimum wage changes.
Recent work has begun to relax the no-carryover assumption: one example is de Chaisemartin
and D'Haultfœuille (2022), who allow potential outcomes to depend on the full path of
treatments, and instead impose a stronger parallel trends assumption that requires parallel
trends in untreated potential outcomes regardless of a unit's path of treatment.

Other work has considered DiD settings with non-binary treatments. de Chaisemartin
and D'Haultfœuille (2018) study “fuzzy” DiD settings in which all groups are treated in both
time periods, but the proportion of units exposed to treatment increases in one group but
not in the other. Finally, de Chaisemartin and D'Haultfœuille (2021) and Callaway, Goodman-
Bacon and Sant'Anna (2021) study settings with multi-valued or continuous treatments.
3.5 Recommendations
The results discussed above show that while conventional TWFE specifications make sensible
comparisons of treated and untreated units in the canonical two-period DiD setting, in the
staggered case they typically make “forbidden comparisons” between already-treated units.
As a result, treatment effects for some units and time periods receive negative weights in the
TWFE estimand. In extreme cases, this can lead the TWFE estimand to have the “wrong
sign”: e.g., the estimand may be negative even if all of the treatment effects are positive.
Even if the weights are not so extreme as to create sign reversals, it may nevertheless be
difficult to interpret which comparisons the TWFE estimator is making, as the “control
group” is not transparent, and the weights it chooses are unlikely to be those most relevant
for economic policy.
In our view, if the researcher is not willing to impose assumptions on treatment effect
heterogeneity, the most direct remedy for this problem is to use the methods discussed
in Section 3.3 that explicitly specify the comparisons to be made between treatment and
control groups, as well as the desired weights in the target parameter. These methods allow
one to estimate a well-defined causal parameter (under parallel trends), with transparent
weights and transparent comparison groups (e.g. not-yet-treated or never-treated units).
This approach, in our view, provides a more complete solution to the problem than the
diagnostic approaches discussed in Section 3.2.1. Although it is certainly valuable to have a
sense of the extent to which conventional TWFE specifications are making bad comparisons,
eliminating these undesirable comparisons seems to us a better approach than diagnosing the
extent of the issue. Using a TWFE specification may be justified for efficiency reasons if one
is confident that treatment effects are homogeneous, but researchers will often be unwilling
to restrict treatment effect heterogeneity.
The question of which of the many heterogeneity-robust DiD methods discussed in Sec-
tion 3.3 to use is trickier. As described above, the estimators differ in who they use as the
comparison group (e.g. not-yet-treated versus never-treated) as well as the pre-treatment
time periods used in the comparisons (e.g. the whole pre-treatment period versus the final
untreated period). This leads to some tradeoffs between efficiency and the strength of the
parallel trends assumption needed for identification, as highlighted in the comparison of BJS
and CS above. The best estimator to use will therefore depend on the context partic-
ularly, on which group is the most sensible comparison, and how confident the researcher
is in parallel trends for all periods. Nevertheless, it is our practical experience that the
various heterogeneity-robust DiD estimators typically (although not always) produce similar
answers. The first-order consideration is therefore to use an approach that makes clear what
the target parameter is and which groups are being compared for identification. Thankfully,
there are now statistical packages that make implementing (and comparing) the results from
these estimators straightforward in practice (see Table 2).
We acknowledge that these new methods may initially appear complicated to researchers
accustomed to analyzing seemingly simple regression specifications such as (5) or (7). How-
ever, while traditional TWFE regressions are easy to specify, as discussed above they are
actually quite difficult to interpret, since they make complicated and unintuitive comparisons
across groups. By contrast, the methods that we recommend have a simple interpretation
using a coherent comparison group. And while more complex to express in regression format,
they can be viewed as simple aggregations of comparisons of group means. We suspect that
once researchers gain experience using the newer heterogeneity-robust DiD methods, they
will not seem so scary after all!
4 Relaxing or allowing the parallel trends assumption to be violated
A second strand of the literature has focused on the possibility that the canonical parallel
trends assumption may not hold exactly. Approaches to this problem include relaxing the
parallel trends assumption to hold only conditional on covariates, testing for pre-treatment
violations of the parallel trends assumption, and various tools for robust inference and sen-
sitivity analysis that explore the possibility that parallel trends may be violated in certain
ways.
4.1 Why might parallel trends be violated?
The canonical parallel trends assumption requires that the mean outcome for the treated
group would have evolved in parallel with the mean outcome for the untreated group if the
treatment had not occurred. As discussed in Section 2, this allows for confounding factors
that affect treatment status, but these must have a constant additive effect on the mean
outcome.
In practice, however, we will often be unsure of the validity of the parallel trends assump-
tion for several reasons. First, there will often be concern about time-varying confounding
factors. For example, Democratic-leaning states were more likely to adopt Medicaid expan-
sions but also might be subject to different time-varying macro-economic shocks. A second
concern relates to the potential sensitivity of the parallel trends assumption to the chosen
functional form of the outcome. If parallel trends holds using the outcome measured in levels, $Y_{i,t}(0)$, then it will generally not be the case that it holds for the outcome measured in logs, $\log(Y_{i,t}(0))$ (or vice versa). Indeed, Roth and Sant’Anna (2022) show that parallel trends can hold for all monotonic transformations of the outcome $g(Y_{i,t}(0))$ essentially only if the population can be divided into two groups, where the first group is as good as randomly assigned between treatment and control, and the second group has the same potential outcome distribution in both periods. Although there are some cases where these conditions may be (approximately) met, the most prominent of which is random assignment of treatment, they are likely not to hold in most settings where DiD is used, and thus parallel trends will be sensitive to functional form. It will often not be obvious that parallel trends should hold for the particular functional form chosen for our analysis (e.g., should we use insurance rates, or log insurance rates?), and thus we may be skeptical of its validity.
4.2 Parallel trends conditional on covariates
One way to increase the credibility of the parallel trends assumption is to require that it holds only conditional on covariates. Indeed, if we condition on a rich enough set of covariates $X_i$, we may be willing to believe that treatment is nearly randomly assigned conditional on $X_i$. Imposing only parallel trends conditional on $X_i$ gives us an extra degree of robustness, since conditional random assignment can fail so long as the remaining unobservables have a time-invariant additive effect on the outcome. In the Medicaid expansion example, for instance, we may want to condition on a state’s partisan lean.
In the canonical model discussed in Section 2, the parallel trends assumption can be
naturally extended to incorporate covariates as follows.
Assumption 6 (Conditional Parallel Trends).
$$E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 1, X_i] = E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0, X_i] \quad \text{(almost surely)}, \tag{10}$$
for $X_i$ a pre-treatment vector of observable covariates.
For simplicity, we will focus first on the conditional parallel trends assumption in the canon-
ical two-period model, although several papers have also extended this idea to the case of
staggered treatment timing, as we will discuss towards the end of this subsection. We fur-
thermore focus our discussion on covariates that are measured prior to treatment and that
are time-invariant (although they may have a time-varying impact on the outcome); relevant
extensions to this are also discussed below.
In addition to the conditional parallel trends assumption, we will also impose an overlap condition (a.k.a. positivity condition), which guarantees that for each treated unit with covariates $X_i$, there are at least some untreated units in the population with the same value of $X_i$. This overlap assumption is particularly important for using standard inference procedures (Khan and Tamer, 2010).

Assumption 7 (Strong overlap). The conditional probability of belonging to the treatment group, given observed characteristics, is uniformly bounded away from one, and the proportion of treated units is bounded away from zero. That is, for some $\epsilon > 0$, $P(D_i = 1 \mid X_i) < 1 - \epsilon$ almost surely, and $E[D_i] > 0$.
Given the conditional parallel trends assumption, no anticipation assumption, and overlap condition, the ATT conditional on $X_i = x$,
$$\tau_2(x) = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1, X_i = x],$$
is identified for all $x$ with $P(D_i = 1 \mid X_i = x) > 0$. In particular,
$$\tau_2(x) = \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 1, X_i = x]}_{\text{Change for } D_i = 1,\, X_i = x} - \underbrace{E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i = x]}_{\text{Change for } D_i = 0,\, X_i = x}. \tag{11}$$
Note that equation (11) is analogous to (2) in the canonical model, except it conditions on $X_i = x$. Intuitively, among the sub-population with $X_i = x$, we have parallel trends, and so we can take the same steps as in Section 2 to infer the conditional ATT for that sub-population. The unconditional ATT can then be identified by averaging $\tau_2(x)$ over the distribution of $X_i$ in the treated population. Using the law of iterated expectations, we have
$$\tau_2 = E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1] = E\Big[\underbrace{E[Y_{i,2}(1) - Y_{i,2}(0) \mid D_i = 1, X_i]}_{\tau_2(X_i)} \,\Big|\, D_i = 1\Big].
$$
When $X_i$ is discrete and takes a small number of values (for example, if $X_i$ is an indicator for whether someone has a college degree), then estimation is straightforward. We can just run an unconditional DiD for each value of $X_i$, and then aggregate the estimates to form an estimate for the overall ATT, using the delta method or bootstrap for the standard errors; a minimal sketch of this aggregation is given below. When $X_i$ is either continuously distributed or discrete with a very large number of support points, estimation becomes more complicated, because we will typically not have a large enough sample to do an unconditional DiD within each possible value of $X_i$. Thankfully, there are several available econometric approaches to semi-/non-parametrically estimate the ATT even with continuous covariates. We first discuss the limitations of using TWFE regressions in this setting, and then discuss several alternative approaches.
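To make the discrete-covariate case concrete, the following minimal sketch (in Python; the column names `dy`, `d`, and `x` are hypothetical) runs an unconditional DiD within each covariate cell and aggregates the cell-level estimates using the distribution of the covariate among treated units; standard errors are omitted for brevity.

```python
import pandas as pd

def att_discrete_x(df: pd.DataFrame) -> float:
    """Aggregate cell-level DiD estimates under conditional parallel trends.

    Expects columns: 'dy' (the change Y_2 - Y_1), 'd' (1 = treated),
    and a discrete covariate 'x'. Cells without both treated and control
    units are skipped, which a real analysis should flag as an overlap failure.
    """
    cells = []
    for _, cell in df.groupby("x"):
        treated = cell.loc[cell["d"] == 1, "dy"]
        control = cell.loc[cell["d"] == 0, "dy"]
        if len(treated) == 0 or len(control) == 0:
            continue
        cells.append({"tau_x": treated.mean() - control.mean(),
                      "n_treated": len(treated)})
    cells = pd.DataFrame(cells)
    # weight each cell-level DiD by the share of treated units in that cell
    weights = cells["n_treated"] / cells["n_treated"].sum()
    return float((weights * cells["tau_x"]).sum())
```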
Standard linear regression. Given that the TWFE specification (3) yielded consistent estimates of the ATT under Assumptions 1-3 in the canonical DiD model, it may be tempting to augment this specification with controls for a time-by-covariate interaction,
$$Y_{i,t} = \alpha_i + \phi_t + (1[t = 2] \cdot D_i)\beta + (X_i \cdot 1[t = 2])\gamma + \varepsilon_{i,t}, \tag{12}$$
for estimation under conditional parallel trends. Unfortunately, this augmented specification need not yield consistent estimates of the ATT under conditional parallel trends without additional homogeneity assumptions. The intuition is that equation (12) implicitly models the conditional expectation function (CEF) of $Y_{i,2} - Y_{i,1}$ as depending on $X_i$ with a constant slope of $\gamma$, regardless of $i$'s treatment status. If there are heterogeneous treatment effects that depend on $X_i$ (e.g., the ATT varies by age of participants), the derivative of the CEF with respect to $X_i$ may depend on treatment status $D_i$ as well. In these practically relevant setups, estimates of $\beta$ can be biased for the ATT; see Meyer (1995) and Abadie (2005) for additional discussion. Fortunately, there are several semi-/non-parametric methods available that allow for consistent estimation of the ATT under conditional parallel trends under weaker homogeneity assumptions.
Regression adjustment. An alternative approach to allow for covariate-specific trends in DiD settings is the regression adjustment procedure proposed by Heckman, Ichimura and Todd (1997) and Heckman, Ichimura, Smith and Todd (1998). Their main idea exploits the fact that under conditional parallel trends, strong overlap, and no anticipation, we can write the ATT as
$$\tau_2 = E\big[E[Y_{i,2} - Y_{i,1} \mid D_i = 1, X_i] - E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i] \,\big|\, D_i = 1\big]$$
$$= E[Y_{i,2} - Y_{i,1} \mid D_i = 1] - E\big[E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i] \,\big|\, D_i = 1\big],$$
where the second equality follows from the law of iterated expectations. Thus, to estimate the ATT under conditional parallel trends, one simply needs to estimate the conditional expectation of the outcome among untreated units, and then average these “predictions” using the empirical distribution of $X_i$ among treated units. That is, we estimate $\tau_2$ with
$$\hat{\tau}_2 = \frac{1}{N_1} \sum_{i: D_i = 1} \Big( (Y_{i,2} - Y_{i,1}) - \hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i] \Big), \tag{13}$$
where $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$ is the estimated conditional expectation function fitted on the control units (but evaluated at $X_i$ for a treated unit). We note that if one uses a linear model for $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$, then this would be similar to a modification of (12) that interacts $X_i$ with both treatment group and time dummies, although the two are not quite identical because the outcome regression approach re-weights using the distribution of $X_i$ among units with $D_i = 1$. The researcher need not restrict themselves to linear models for the CEF, however, and can use more flexible semi-/non-parametric methods instead.
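As an illustration, here is a minimal sketch of the regression-adjustment estimator (13) using a linear model for the CEF fitted on control units; the function name and interface are our own, and any flexible learner could replace the least-squares step.

```python
import numpy as np

def att_regression_adjustment(dy, d, X):
    """Outcome-regression estimate of the ATT, as in equation (13).

    dy : outcome changes Y_2 - Y_1, shape (n,)
    d  : treatment indicators, shape (n,)
    X  : covariate matrix, shape (n, k)
    """
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    Xc = np.column_stack([np.ones(len(dy)), X])          # add an intercept
    # fit E[Y_2 - Y_1 | D = 0, X] by least squares on the control units
    coef, *_ = np.linalg.lstsq(Xc[d == 0], dy[d == 0], rcond=None)
    # average actual minus predicted changes over the treated units
    return float(np.mean(dy[d == 1] - Xc[d == 1] @ coef))
```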
One popular approach in empirical practice is to match each treated unit to a “nearest neighbor” untreated unit with similar (or identical) covariate values, and then estimate $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$ using $Y_{l(i),2} - Y_{l(i),1}$, where $l(i)$ is the untreated unit matched to $i$, in which case $\hat{\tau}_2$ reduces to the simple DiD estimator between treated units and the matched comparison group.
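A bare-bones version of this matching estimator (one match per treated unit, with replacement, Euclidean distance; all names ours) might look as follows. As the next paragraph notes, inference after matching requires special care.

```python
import numpy as np

def att_nn_match(dy, d, X):
    """Nearest-neighbor matching DiD: each treated unit's counterfactual
    change is the outcome change of its closest control unit in X."""
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    treated = np.where(d == 1)[0]
    controls = np.where(d == 0)[0]
    effects = []
    for i in treated:
        dists = np.linalg.norm(X[controls] - X[i], axis=1)
        match = controls[np.argmin(dists)]               # match with replacement
        effects.append(dy[i] - dy[match])
    return float(np.mean(effects))
```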
The outcome regression approach will generally be consistent for the ATT when the outcome model used to estimate $\hat{E}[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]$ is correctly specified. Inference can be done using the delta-method for parametric models, and there are also several methods available for semi-/non-parametric models (under some additional regularity conditions), including the bootstrap, as described in Heckman et al. (1998). Inference is more complicated, however, when one models the outcome evolution of untreated units using a nearest-neighbor approach with a fixed number of matches: the resulting estimator is no longer asymptotically linear and thus standard bootstrap procedures are not asymptotically valid (e.g., Abadie and Imbens, 2006, 2008, 2011, 2012). Ignoring the matching step can also cause problems, and one therefore needs to use inference procedures that accommodate matching as described in the aforementioned papers.^{15}

^{15} Although these nearest-neighbor procedures have been formally justified for cross-sectional data, they are easily adjustable to the canonical 2x2 DiD setup with balanced panel data. We are not aware of formal extensions that allow for unbalanced panel data, repeated cross-sectional data, or more general DiD designs. Abadie and Spiess (2022) show that, in some cases, clustering at the match level is sufficient when matching is done without replacement.
Inverse probability weighting. An alternative to modeling the conditional expectation function is to instead model the propensity score, i.e. the conditional probability of belonging to the treated group given covariates, $p(X_i) = P(D_i = 1 \mid X_i)$. Indeed, as shown by Abadie (2005), under Assumptions 2, 6 and 7, the ATT is identified using the following inverse probability weighting (IPW) formula:
$$\tau_2 = \frac{E\left[\left(D_i - \dfrac{(1 - D_i)\, p(X_i)}{1 - p(X_i)}\right)(Y_{i,2} - Y_{i,1})\right]}{E[D_i]}. \tag{14}$$
As in the regression adjustment approach, researchers can use the “plug-in principle” to estimate the ATT by plugging in an estimate of the propensity score to the equation above. The propensity score model can be estimated using parametric or semi-/non-parametric models (under suitable regularity conditions). The IPW approach will generally be consistent if the model for the propensity score is correctly specified. Inference can be conducted using standard tools; see, e.g., Abadie (2005).
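A minimal sketch of the IPW estimator (14), assuming a logistic model for the propensity score (one of many possible choices; the interface is ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def att_ipw(dy, d, X):
    """Abadie (2005)-style IPW estimate of the ATT, as in equation (14).
    Strong overlap is assumed: fitted propensity scores must stay below 1."""
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    pscore = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]
    weights = d - (1 - d) * pscore / (1 - pscore)        # re-weights the controls
    return float(np.mean(weights * dy) / np.mean(d))
```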
Doubly-robust estimators. The outcome regression and IPW approaches described above can also be combined to form “doubly-robust” (DR) methods that are valid if either the outcome model or the propensity score model is correctly specified. Specifically, Sant’Anna and Zhao (2020) show that under Assumptions 2, 6 and 7, the ATT is identified as:
$$\tau_2 = E\left[\left(\frac{D_i}{E[D_i]} - \frac{\dfrac{(1 - D_i)\, p(X_i)}{1 - p(X_i)}}{E\left[\dfrac{(1 - D_i)\, p(X_i)}{1 - p(X_i)}\right]}\right)\big(Y_{i,2} - Y_{i,1} - E[Y_{i,2} - Y_{i,1} \mid D_i = 0, X_i]\big)\right]. \tag{15}$$
As before, one can then estimate the ATT by plugging in estimates of both the propen-
sity score and the CEF. The outcome equation and the propensity score can be modeled
with either parametric or semi-/non-parametric methods, and DR methods will generally be
consistent if either of these models is correctly specified. In addition, Chang (2020) shows
that data-adaptive/machine-learning methods can also be used with DR methods. Standard
inference tools can be used as well; see, e.g., Sant’Anna and Zhao (2020). Finally, under
some regularity conditions, the DR estimator achieves the semi-parametric efficiency bound
when both the outcome and propensity score models are correctly specified (Sant’Anna and
Zhao, 2020).
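Combining the two previous sketches gives a bare-bones version of the DR moment in (15), again with a logistic propensity score and a linear outcome model standing in for whatever parametric or semi-/non-parametric choices the researcher prefers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def att_doubly_robust(dy, d, X):
    """Doubly-robust ATT estimate in the spirit of equation (15)."""
    dy, d, X = np.asarray(dy, float), np.asarray(d, int), np.asarray(X, float)
    pscore = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]
    Xc = np.column_stack([np.ones(len(dy)), X])
    coef, *_ = np.linalg.lstsq(Xc[d == 0], dy[d == 0], rcond=None)
    dy_hat = Xc @ coef                                   # outcome model, evaluated for everyone
    ipw_c = (1 - d) * pscore / (1 - pscore)              # re-weighted control weights
    w1, w0 = d / np.mean(d), ipw_c / np.mean(ipw_c)
    return float(np.mean((w1 - w0) * (dy - dy_hat)))
```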
Extensions to staggered treatment timing. Although the discussion above focused on DiD setups with two groups and two periods, these different procedures have been extended to staggered DiD setups when treatments are binary and non-reversible. More precisely, Callaway and Sant’Anna (2021) extend the regression adjustment, IPW and DR procedures above to estimate the family of $ATT(g, t)$'s discussed in Section 3.3. They then aggregate these estimators to form different treatment effect summary measures. Wooldridge (2021) proposes an alternative regression adjustment procedure that is suitable for staggered setups. His proposed estimator differs from the Callaway and Sant’Anna (2021) regression adjustment estimator in that he exploits additional information from pre-treatment periods, which, in turn, can lead to improvements in precision. On the other hand, if these additional assumptions are violated, Wooldridge (2021)’s estimator may be more biased than Callaway and Sant’Anna (2021)’s. de Chaisemartin and D’Haultfœuille (2020, 2022) and Borusyak et al. (2021) consider estimators which include covariates in a linear manner.
Caveats. Throughout, we assume that the covariates $X_i$ were measured prior to the introduction of the intervention and are, therefore, unaffected by it. If $X_i$ can in fact be affected by treatment, then conditioning on it induces a “bad control” problem that can induce bias; see Zeldow and Hatfield (2021) for additional discussion. Similar issues arise if one conditions on time-varying covariates $X_{i,t}$ that can be affected by the treatment. If one is willing to assume that a certain time-varying covariate $W_{i,t}$ is not affected by the treatment, then in principle the entire time-path of the covariate $W_i = (W_{i,1}, \ldots, W_{i,T})'$ can be included in the conditioning variable $X_i$, and thus exogenous time-varying covariates can be incorporated similarly to any pre-treatment covariate. See Caetano, Callaway, Payne and Rodrigues (2022) for additional discussion of time-varying covariates.
Another important question relates to whether researchers should condition on pre-treatment outcomes. Proponents of including pre-treatment outcomes argue that controlling for lagged outcomes can reduce bias from unobserved confounders (Ryan, 2018). It is worth noting that when lagged outcomes are included in $X_i$, the conditional parallel trends assumption actually reduces to a conditional mean independence assumption for the untreated potential outcome, since the $Y_{i,1}(0)$ terms on both sides of (10) cancel out, and thus we are left with
$$E[Y_{i,2}(0) \mid D_i = 1, X_i] = E[Y_{i,2}(0) \mid D_i = 0, X_i] \quad \text{(almost surely)}.$$
Including the lagged outcome in the conditioning variable thus makes sense if one is confident in the conditional unconfoundedness assumption: i.e., if treatment is as good as randomly assigned conditional on the lagged outcome and other elements of $X_i$. This may be sensible in settings where treatment takeup decisions are made on the basis of lagged outcomes. However, it is also important to note that conditioning on lagged outcomes need not necessarily reduce bias, and can in fact exacerbate it in certain contexts. For example, Daw and Hatfield (2018) show that when the treated and comparison groups have different outcome distributions but the same trends, matching the treated and control groups on lagged outcomes selects control units with a particularly large “shock” in the pre-treatment period. This can then induce bias owing to a mean-reversion effect, when in fact not conditioning on lagged outcomes would have produced parallel trends. Thus, whether one should include lagged outcomes or not depends on whether the researcher prefers the non-nested assumptions of conditional unconfoundedness (given the lagged outcome) versus parallel trends. See also Chabé-Ferret (2015), Angrist and Pischke (2009, Chapter 5.4), and Ding and Li (2019) for related discussion.
4.3 Testing for pre-existing trends
Although conditioning on pre-existing covariates can help increase the plausibility of the
parallel trends assumption, researchers typically still worry that there remain unobserved
time-varying confounders. An appealing feature of the DiD design is that it allows for a
natural plausibility check on the identifying assumptions: did outcomes for the treated and
comparison groups (possibly conditional on covariates) move in parallel prior to the time
of treatment? It has therefore become common practice to check, both visually and using
statistical tests, whether there exist pre-existing differences in trends (“pre-trends”) as a test
of the plausibility of the parallel trends assumption.
To fix ideas, consider a simple extension of the canonical non-staggered DiD model in Section 2 in which we observe outcomes for an additional period $t = 0$ during which no units were treated. (These ideas will extend to the case of staggered treatment or conditional parallel trends.) By the no-anticipation assumption, $Y_{i,t} = Y_{i,t}(0)$ for all units in periods $t = 0$ and $t = 1$. We can thus check whether the analog to the parallel trends assumption held between periods 0 and 1; that is, is
$$\underbrace{E[Y_{i,1} - Y_{i,0} \mid D_i = 1]}_{\text{Pre-treatment change for } D_i = 1} - \underbrace{E[Y_{i,1} - Y_{i,0} \mid D_i = 0]}_{\text{Pre-treatment change for } D_i = 0} = 0?$$
For example, did average insurance rates evolve in parallel for expansion and non-expansion states before either of them expanded Medicaid? In the non-staggered setting, this hypothesis can be conveniently tested using a TWFE specification that includes leads and lags of treatment,
$$Y_{i,t} = \alpha_i + \phi_t + \sum_{r \neq 0} 1[R_{i,t} = r]\,\beta_r + \epsilon_{i,t}, \tag{16}$$
where the coefficient on the lead of treatment $\hat{\beta}_{-1}$ is given by
$$\hat{\beta}_{-1} = \frac{1}{N_1} \sum_{i: D_i = 1} (Y_{i,0} - Y_{i,1}) - \frac{1}{N_0} \sum_{i: D_i = 0} (Y_{i,0} - Y_{i,1}).$$
Testing for pre-treatment trends thus is equivalent to testing the null hypothesis that $\beta_{-1} = 0$.
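In the two-group case, the test can be carried out directly as a difference in mean pre-period changes, exactly as in the display above; a minimal sketch (our own notation):

```python
import numpy as np
from scipy import stats

def pretrend_test(y0, y1, d):
    """Compute beta_hat_{-1} as in the display above and a z-test of the
    null beta_{-1} = 0.  y0, y1: outcomes in periods t = 0 and t = 1."""
    y0, y1, d = (np.asarray(a, float) for a in (y0, y1, d))
    x = y0 - y1                                    # note the sign convention above
    b = x[d == 1].mean() - x[d == 0].mean()
    se = np.sqrt(x[d == 1].var(ddof=1) / (d == 1).sum()
                 + x[d == 0].var(ddof=1) / (d == 0).sum())
    pval = 2 * stats.norm.sf(abs(b / se))
    return b, se, pval
```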
This approach is convenient to implement and extends easily to the case with additional pre-treatment periods and non-staggered treatment adoption. When there are multiple pre-treatment periods, it is common to plot the $\hat{\beta}_r$ in what is called an “event-study” plot. If all of the pre-treatment coefficients (i.e., $\hat{\beta}_r$ for $r < 0$) are insignificant, this is usually interpreted as a sign in favor of the validity of the design, since we cannot reject the null that parallel trends was satisfied in the pre-treatment period.
This pre-testing approach extends easily to settings with staggered adoption and/or conditional parallel trends assumptions. For example, the Callaway and Sant’Anna (2021) estimator can be used to construct “placebo” estimates of $ATT_l^w$ for $l < 0$, i.e. the ATT $l$ periods before treatment. The estimates $\widehat{ATT}_l^w$ can be plotted for different values of $l$ (corresponding to different lengths of time before/after treatment) to form an event-study plot analogous to that for the non-staggered case. This illustrates that the idea of testing for pre-trends extends easily to settings with staggered treatment adoption or conditional parallel trends, since the Callaway and Sant’Anna (2021) estimator can be applied in both of these settings. These results are by no means specific to the Callaway and Sant’Anna (2021) estimator, though, and event-study plots can be created in a similar fashion using other estimators for either staggered or conditional DiD settings. We caution, however, against using dynamic TWFE specifications like (16) in settings with staggered adoption, since as noted by Sun and Abraham (2021), the coefficients $\beta_r$ may be contaminated by treatment effects at relative time $r' > 0$, so with heterogeneous treatment effects the pre-trends test may reject even if parallel trends holds in the pre-treatment period (or vice versa).
4.4 Issues with testing for pre-trends
Although tests of pre-existing trends are a natural and intuitive plausibility check of the
parallel trends assumption, recent research has highlighted that they also have several lim-
itations. First, even if pre-trends are exactly parallel, this need not guarantee that the
post-treatment parallel trends assumption is satisfied. Kahn-Lang and Lang (2020) give an
intuitive example: the average height of boys and girls evolves in parallel until about age 13
and then diverges, but we should not conclude from this that there is a causal effect of bar
mitzvahs (which occur for boys at age 13) on children’s height!
A second issue is that even if there are pre-existing differences in trends, the tests described above may fail to reject owing to low power (Bilinski and Hatfield, 2018; Freyaldenhoven, Hansen and Shapiro, 2019; Kahn-Lang and Lang, 2020; Roth, 2022). That is, even if there is a pre-existing trend, it may not be significant in the data if our pre-treatment estimates are imprecise.
To develop some intuition for why power may be low, suppose that there is no true treatment effect but there is a pre-existing linear difference in trends between the treatment and comparison groups. Then in the simple example from above, the pre-treatment and post-treatment event-study coefficients will have the same magnitude, $|\beta_{-1}| = |\beta_1|$. If the two estimated coefficients $\hat{\beta}_{-1}$ and $\hat{\beta}_1$ also have the same sampling variance, then by symmetry the probability that the pre-treatment coefficient $\hat{\beta}_{-1}$ is significant will be the same as the probability that the post-treatment coefficient $\hat{\beta}_1$ is significant. But this means that a linear violation of parallel trends that would be detected only half the time by a pre-trends test will also lead us to spuriously find a significant treatment effect half the time,^{16} that is, 10 times more often than we expect to find a spurious effect using a nominal 95% confidence interval! Another intuition for this phenomenon, given by Bilinski and Hatfield (2018), is that pre-trends tests reverse the traditional roles of type I and type II error: they set the assumption of parallel trends (or no placebo pre-intervention effect) as the null hypothesis and only “reject” the assumption if there is strong evidence against it. This controls the probability of finding a violation when parallel trends holds at 5% (or another chosen $\alpha$-level), but the probability of failing to identify a violation can be much higher, corresponding to the type II error of the test.

^{16} This is the unconditional probability that $\hat{\beta}_1$ is significant (not conditioning on the result of the pre-test). However, if $\hat{\beta}_1$ and $\hat{\beta}_{-1}$ are independent, then this is also the probability of finding a significant effect conditional on passing the pre-test.
In addition to being concerning from a theoretical point of view, the possibility of low
power appears to be relevant in practice: in simulations calibrated to papers published in
three leading economics journals, Roth (2022) found that linear violations of parallel trends
that conventional tests would detect only 50% of the time often produce biases as large as
(or larger than) the estimated treatment effect.
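A small simulation in the spirit of this argument (our own stylized DGP, not a replication of Roth (2022)'s calibrations) makes the symmetry concrete: a linear differential trend sized so that the pre-test detects it about half the time also produces a spuriously significant "effect" about half the time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n, trend, sigma = 5000, 100, 0.39, 1.0  # trend chosen for roughly 50% pre-test power here
d = np.repeat([1.0, 0.0], n)                    # n treated and n control units

def diff_in_means(x, d):
    """Difference in mean changes between groups and its standard error."""
    diff = x[d == 1].mean() - x[d == 0].mean()
    se = np.sqrt(x[d == 1].var(ddof=1) / n + x[d == 0].var(ddof=1) / n)
    return diff, se

reject_pre = sig_post = 0
for _ in range(n_reps):
    # three periods t = 0, 1, 2: linear differential trend, zero true treatment effect
    y = [t * trend * d + rng.normal(0, sigma, 2 * n) for t in range(3)]
    b_pre, se_pre = diff_in_means(y[1] - y[0], d)    # pre-trend coefficient
    b_post, se_post = diff_in_means(y[2] - y[1], d)  # post "effect" coefficient
    reject_pre += abs(b_pre / se_pre) > 1.96
    sig_post += abs(b_post / se_post) > 1.96

print(f"pre-test detects trend: {reject_pre / n_reps:.2f}; "
      f"spurious significant effect: {sig_post / n_reps:.2f}")  # both approx 0.5
```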
A third issue with pre-trends testing is that conditioning the analysis on “passing” a pre-
trends test induces a selection bias known as pre-test bias (Roth, 2022). Intuitively, if there
is a pre-existing difference in trends in population, the draws from the DGP in which we fail
to detect a significant pre-trend are a selected sample from the true DGP. Roth (2022) shows
that in many cases, this additional selection bias can exacerbate the bias from a violation of
parallel trends.
A final issue with the current practice of pre-trends testing relates to what happens if
we do detect a significant pre-trend. In this case, the pre-trends test suggests that parallel
trends is likely not to hold exactly, but researchers may still wish to learn something about the
treatment effect of interest. Indeed, it seems likely that with enough precision, we will nearly
always reject that the parallel trends assumption holds exactly in the pre-treatment period.
Nevertheless, we may still wish to learn something about the treatment effect, especially if
the violation of parallel trends is “small” in magnitude. However, the conventional approach
does not make clear how to proceed in this case.
4.4.1 Improved diagnostic tools
A few papers have proposed alternative tools for detecting pre-treatment violations of parallel
trends that take into account some of the limitations discussed above. Roth (2022) developed
tools to conduct power analyses and calculate the likely distortions from pre-testing under
researcher-hypothesized violations of parallel trends. These tools allow the researcher to
assess whether the limitations described above are likely to be severe for potential violations
of parallel trends deemed economically relevant.
Bilinski and Hatfield (2018) and Dette and Schumann (2020) propose “non-inferiority” approaches to pre-testing that help address the issue of low power by reversing the roles of the null and alternative hypotheses. That is, rather than test the null that pre-treatment trends are zero, they test the null that the pre-treatment trend is large, and reject only if the data provides strong evidence that the pre-treatment trend is small. For example, Dette and Schumann (2020) consider null hypotheses of the form $H_0: \max_{r < 0} |\beta_r| \geq c$, where the $\beta_r$ are the (population) pre-treatment coefficients from regression (16). This ensures that the test “detects” a pre-trend with probability at least $1 - \alpha$ when in fact the pre-trend is large (i.e. has magnitude at least $c$).
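One simple way to operationalize this reversal, sketched below, is an intersection-union construction: conclude that pre-trends are small only if every pre-period coefficient is significantly below the threshold $c$ in magnitude. This is a simplified sketch of the general idea, not the exact procedure of either paper.

```python
import numpy as np
from scipy import stats

def noninferiority_pretest(beta_pre, se_pre, c, alpha=0.05):
    """Reject H0: max_r |beta_r| >= c only if every pre-period coefficient is
    significantly below c in magnitude (an intersection-union construction)."""
    z = stats.norm.ppf(1 - alpha)
    beta_pre = np.asarray(beta_pre, float)
    se_pre = np.asarray(se_pre, float)
    # each one-sided bound must lie below c for the union null to be rejected
    return bool(np.all(np.abs(beta_pre) + z * se_pre < c))
```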
These non-inferiority approaches are an improvement over standard pre-testing methods,
since they guarantee by design that the pre-test is powered against large pre-treatment
violations of parallel trends. However, using these approaches does not provide any formal
guarantees that ensure the validity of confidence intervals for the treatment effect, the main
object of interest. They also do not avoid statistical issues related to pre-testing (Roth,
2022), and do not provide clear guidance on what to do when the test fails to reject the null
of a large pre-trend. This has motivated more formal robust inference and sensitivity analysis
approaches that consider inference on the ATT when parallel trends may be violated, which
we discuss next.
4.5 Robust inference and sensitivity analysis
Bounds using pre-trends. Rambachan and Roth (2022b) propose an approach for robust
inference and sensitivity analysis when parallel trends may be violated, building on earlier
work by Manski and Pepper (2018). Their approach attempts to formalize the intuition
motivating pre-trends testing: that the counterfactual post-treatment trends cannot be “too
different” from the pre-trends. To fix ideas, consider the non-staggered treatment adoption
setting described in Section 4.4. Denote by $\delta_1$ the violation of parallel trends in the first post-treatment period:
$$\delta_1 = E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 1] - E[Y_{i,2}(0) - Y_{i,1}(0) \mid D_i = 0].$$
This, for example, could be the counterfactual difference in trends in insurance coverage between Medicaid expansion and non-expansion states if the expansions had not occurred. The bias $\delta_1$ is unfortunately not directly identified, since we do not observe the untreated potential outcomes, $Y_{i,2}(0)$, for the treated group. However, by the no anticipation assumption, we can identify the pre-treatment analog to $\delta_1$,
$$\delta_{-1} = E[Y_{i,0}(0) - Y_{i,1}(0) \mid D_i = 1] - E[Y_{i,0}(0) - Y_{i,1}(0) \mid D_i = 0],$$
which looks at pre-treatment differences in trends between the groups, with $\delta_{-1} = E[\hat{\beta}_{-1}]$ from the event study regression (16). For example, $\delta_{-1}$ corresponds to the pre-treatment difference in trends between expansion and non-expansion states.
Rambachan and Roth (2022b) then consider robust inference under assumptions that restrict the possible values of $\delta_1$ given the value of $\delta_{-1}$, or more generally, given $\delta_{-1}, \ldots, \delta_{-K}$ if there are $K$ pre-treatment coefficients. For example, one type of restriction they consider states that the magnitude of the post-treatment violation of parallel trends can be no larger than a constant $\bar{M}$ times the maximal pre-treatment violation, i.e. $|\delta_1| \leq \bar{M} \max_{r < 0} |\delta_r|$. If $\bar{M} = 1$, for example, then this would impose that post-treatment violations of parallel trends are no larger than the largest pre-treatment violation. They also consider restrictions that bound the extent that $\delta_1$ can deviate from a linear extrapolation of the pre-treatment differences in trends. Rambachan and Roth (2022b) use tools from the partial identification and sensitivity analysis literature (Armstrong and Kolesár, 2018; Andrews, Roth and Pakes, 2022) to construct confidence sets for the ATT that are uniformly valid under the imposed restrictions. These confidence sets take into account the fact that we do not observe the true pre-treatment difference in trends $\delta_{-1}$, only an estimate $\hat{\beta}_{-1}$. In contrast to conventional pre-trends tests, the Rambachan and Roth (2022b) confidence sets thus tend to be larger when there is more uncertainty about the pre-treatment difference in trends (i.e. when the standard error on $\hat{\delta}_{-1}$ is large).
This approach enables a natural form of sensitivity analysis. For example, a researcher might report that the conclusion of a positive treatment effect is robust up to the value $\bar{M} = 2$. This indicates that to invalidate the conclusion of a positive effect, we would need to allow for a post-treatment violation of parallel trends two times larger than the maximal pre-treatment violation. For example, we could potentially say that Medicaid expansion has a significant effect on insurance rates unless we’re willing to allow for post-expansion differences in trends that were up to twice as large as the largest difference in trends prior to the expansion. Doing so makes precise exactly what needs to be assumed about possible violations of parallel trends to reach a particular conclusion.
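Ignoring sampling uncertainty, the arithmetic of this sensitivity analysis is simple, and the following sketch conveys the logic (Rambachan and Roth (2022b)'s actual confidence sets additionally account for estimation error in the event-study coefficients; all names here are ours):

```python
import numpy as np

def bias_only_bounds(beta_post, beta_pre, M_bar):
    """Bounds on the ATT under |delta_1| <= M_bar * max_r |delta_r|, treating
    the event-study coefficients as if they were the true population values."""
    A = M_bar * np.max(np.abs(np.asarray(beta_pre, float)))
    return beta_post - A, beta_post + A

def breakdown_M(beta_post, beta_pre):
    """Smallest M_bar at which the bias-only bounds include zero."""
    max_pre = np.max(np.abs(np.asarray(beta_pre, float)))
    return abs(beta_post) / max_pre if max_pre > 0 else np.inf
```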
It is worth highlighting that although we’ve described these tools in the context of non-staggered treatment timing and an unconditional parallel trends assumption, they extend easily to the case of staggered treatment timing and conditional parallel trends as well. Indeed, under mild regularity conditions, these tools can be used anytime the researcher has a treatment effect estimate $\hat{\beta}_{post}$ and a placebo estimate $\hat{\beta}_{pre}$, so long as she is willing to bound the possible bias of $\hat{\beta}_{post}$ given the expected value of $\hat{\beta}_{pre}$. For example, in the staggered setting, $\hat{\beta}_{post}$ could be an estimate of $ATT_l^w$ for $l > 0$ using one of the estimators described in Section 3.3, and $\hat{\beta}_{pre}$ could be placebo estimates of $ATT_{-1}^w, \ldots, ATT_{-k}^w$. See https://github.com/pedrohcgs/CS_RR for examples of how these sensitivity analyses can be combined with the Callaway and Sant’Anna (2021) estimator in R.
Bounds using bracketing. Ye, Keele, Hasegawa and Small (2021) consider an alternative partial identification approach where there are two control groups whose trends are assumed to “bracket” that of the treatment group. Consider the canonical model from Section 2, and suppose the untreated units can be divided into two control groups, denoted $C_i = a$ and $C_i = b$. For ease of notation, let $C_i = trt$ denote the treated group, i.e. units with $D_i = 1$. Let $\Delta(c) = E[Y_{i,2}(0) - Y_{i,1}(0) \mid C_i = c]$. Instead of the parallel trends assumption, Ye et al. (2021) impose that
$$\min\{\Delta(a), \Delta(b)\} \leq \Delta(trt) \leq \max\{\Delta(a), \Delta(b)\}, \tag{17}$$
so that the trend in $Y(0)$ for the treated group is bounded above and below (“bracketed”) by the minimum and maximum trend in groups $a$ and $b$. An intuitive example where we may have such bracketing is if each of the groups corresponds with a set of industries, and one of the control groups (say group $a$) is more cyclical than the treated group while the other (say group $b$) is less cyclical. If the economy was improving between periods $t = 1$ and $t = 2$, then we would expect group $a$ to have the largest change in the outcome and group $b$ to have the smallest change; whereas if the economy was getting worse, we would expect the opposite. Under equation (17) and the no anticipation assumption, the ATT is bounded,
$$E[Y_{i,2} - Y_{i,1} \mid D_i = 1] - \max\{\Delta(a), \Delta(b)\} \leq \tau_2 \leq E[Y_{i,2} - Y_{i,1} \mid D_i = 1] - \min\{\Delta(a), \Delta(b)\}.$$
This reflects that if we knew the true counterfactual trend for the treated group we could learn the ATT exactly, and therefore that bounding this trend means we can obtain bounds on the ATT. Ye et al. (2021) further show how one can construct confidence intervals for the ATT, and extend this logic to settings with multiple periods (but non-staggered treatment timing). See also Hasegawa, Webster and Small (2019) for a related, earlier approach.
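The bias-only version of these bounds is again straightforward; a minimal sketch (ignoring sampling uncertainty, which Ye et al. (2021)'s confidence intervals handle; the interface is ours):

```python
import numpy as np

def att_bracketing_bounds(dy, group):
    """Bias-only bounds on the ATT under the bracketing condition (17).

    dy    : outcome changes Y_2 - Y_1
    group : labels in {"trt", "a", "b"} for the treated and two control groups
    """
    dy, group = np.asarray(dy, float), np.asarray(group)
    mean = {g: dy[group == g].mean() for g in ("trt", "a", "b")}
    # the treated group's counterfactual trend lies between the two control trends
    return (mean["trt"] - max(mean["a"], mean["b"]),
            mean["trt"] - min(mean["a"], mean["b"]))
```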
4.5.1 Other approaches
Keele, Small, Hsu and Fogarty (2019) propose a sensitivity analysis in the canonical two-period DiD model that summarizes the strength of confounding factors that would be needed to induce a particular bias. Freyaldenhoven, Hansen, Pérez and Shapiro (2021) propose a visual sensitivity analysis in which one plots the “smoothest” trend through an event-study plot that could rationalize the data under the null of no effect. Finally, Freyaldenhoven et al. (2019) propose a GMM-based estimation strategy that allows for parallel trends to be violated when there exists a covariate assumed to be affected by the same confounds as the outcome but not by the treatment itself.
4.6 Recommendations
We suspect that in most practical applications of DiD, researchers will not be confident
ex ante that the parallel trends assumption holds exactly, owing to concerns about time-
varying confounds and sensitivity to functional form. The methods discussed in this section
for relaxing the parallel trends assumption and/or assessing sensitivity to violations of the
parallel trends assumption will therefore be highly relevant in most contexts where DiD is
applied.
A natural starting point for these robustness checks is to consider whether the results
change meaningfully when imposing parallel trends only conditional on covariates. Among
the different estimation procedures we discussed, we view doubly-robust procedures as a
natural default, since they are valid if either the outcome model or propensity score is
well-specified and have desirable efficiency properties. A potential exception to this recom-
mendation arises in settings with limited overlap, i.e., when the estimated propensity score
is close to 0 or 1, in which case regression adjustment estimators may be preferred.
Whether one includes covariates into the DiD analysis or not, we encourage researchers to continue to plot “event-study plots” that allow for a visual evaluation of pre-existing trends. These plots convey useful information for the reader to assess whether there appears to have been a break in the outcome for the treatment group around the time of treatment. In contexts with a common treatment date, such plots can be created using TWFE specifications like (16); in contexts with staggered timing, we recommend plotting estimates of $ATT_l^w$ for different values of $l$ using one of the estimators for the staggered setting described in Section 3.3 to avoid negative weighting issues with TWFE. See Section 4.3 for additional discussion. We also refer the reader to Freyaldenhoven et al. (2021) regarding best practices for creating such plots, such as displaying simultaneous (rather than pointwise) confidence bands for the path of the event-study coefficients (Olea and Plagborg-Møller, 2019; Callaway and Sant’Anna, 2021).
While event-study plots play an important role in evaluating the plausibility of the parallel trends assumption, we think it is important to appreciate that tests of pre-trends may be underpowered to detect relevant violations of parallel trends, as discussed in Section 4.4. The lack of a significant pre-trend does not necessarily imply the validity of the parallel trends assumption. At minimum, we recommend that researchers assess the power of pre-trends tests against economically relevant violations of parallel trends, as described in Section 4.4.1. We also think it should become standard practice for researchers to formally assess the extent to which their conclusions are sensitive to violations of parallel trends. A natural statistic to report in many contexts is the “breakdown” value of $\bar{M}$ using the sensitivity analysis in Rambachan and Roth (2022b): i.e., how big would the post-treatment violation of parallel trends have to be relative to the largest pre-treatment violation to invalidate a particular conclusion? We encourage researchers to routinely report the results of the sensitivity analyses described in Section 4.5 alongside their event-study plots.
We also encourage researchers to accompany the formal sensitivity tools with a discussion
of possible violations of parallel trends informed by context-specific knowledge. The parallel
trends assumption is much more plausible in settings where we expect the trends for the two
groups to be similar ex-ante (before seeing the pre-trends). Whenever possible, researchers
should therefore provide a justification for why we might expect the two groups to have
similar trends. It is also useful to provide context-specific knowledge about the types of
confounds that might potentially lead to violations of the parallel trends assumption
what time-varying factors may have differentially affected the outcome for the treated group?
Such discussion can often be very useful for interpreting the results of the formal sensitivity
analyses described in Section 4.5. For example, suppose that a particular conclusion is
robust to allowing for violations of parallel trends twice as large the maximum in the pre-
treatment period. In contexts where other factors were quite stable around the time of the
treatment, this might be interpreted as a very robust finding; on the other hand, if the
treatment occurred at the beginning of a recession much larger than anything seen in the
pre-treatment period, then a violation of parallel trends of that magnitude may indeed be
plausible, so that the results are less robust than we might like. Thus, economic knowledge
will be very important in understanding the robustness of a particular result. In our view,
the most scientific approach to dealing with possible violations of parallel trends therefore
involves a combination of state-of-the-art econometric tools and context-specific knowledge
about the types of plausible confounding factors.
5 Relaxing sampling assumptions
We now discuss a third strand of the DiD literature, which considers inference under devi-
ations from the canonical assumption that we have sampled a large number of independent
clusters from an infinite super-population.
5.1 Inference procedures with few clusters
As described in Section 2, standard DiD inference procedures rely on researchers having
access to data on a large number of treated and untreated clusters. Confidence intervals are
then based on the central limit theorem, which states that with independently-sampled clus-
ters, the DiD estimator has an asymptotically normal distribution as the number of treated
and untreated clusters grows large. In many practical DiD settings, however, the number of
independent clusters (and, in particular, treated clusters) may be small, so that the central
limit theorem based on a growing number of clusters may provide a poor approximation.
For example, many DiD applications using state-level policy changes may only have a hand-
ful of treated states. The central limit theorem may provide a poor approximation with
few clusters, even if the number of units within each cluster is large. This is because the
standard sampling-based view of clustering allows for arbitrary correlations of the outcome
within each cluster, and thus there may be common components at the cluster level (a.k.a.
cluster-level “shocks”) that do not wash out when averaging over many units within the same
cluster. Since we only observe a few observations of the cluster-specific shocks, the average
of these shocks will generally not be approximately normally distributed.
Model-based approaches. Several papers have made progress on the difficult problem
of conducting inference with a small number of clusters by modeling the dependence within
clusters. These papers typically place some restrictions on the common cluster-level shocks,
although the exact restrictions differ across papers. The starting point for these papers is
typically a structural equation of the form
$$Y_{ijt} = \alpha_j + \phi_t + D_{jt}\beta + (\nu_{jt} + \epsilon_{ijt}), \tag{18}$$
where $Y_{ijt}$ is the (realized) outcome of unit $i$ in cluster $j$ at time $t$, $\alpha_j$ and $\phi_t$ are cluster and time fixed effects, $D_{jt}$ is an indicator for whether cluster $j$ is treated in period $t$, $\nu_{jt}$ is a common cluster-by-time error term, and $\epsilon_{ijt}$ is an idiosyncratic unit-level error term. Here, the “cluster-level” error term, $\nu_{jt}$, induces correlation among units within the same cluster. It is often assumed that the $\epsilon_{ijt}$ are iid mean-zero across $i$ and $j$ (and sometimes $t$); see, e.g., Donald and Lang (2007), Conley and Taber (2011), and Ferman and Pinto (2019). Letting $\bar{Y}_{jt} = n_j^{-1} \sum_{i: j(i) = j} Y_{ijt}$ be the average outcome among units in cluster $j$, where $n_j$ is the number of units in cluster $j$, we can take averages to obtain
$$\bar{Y}_{jt} = \alpha_j + \phi_t + D_{jt}\beta + \bar{\eta}_{jt}, \tag{19}$$
where $\bar{\eta}_{jt} = \nu_{jt} + n_j^{-1} \sum_{i=1}^{n_j} \epsilon_{ijt}$. Assuming the canonical set-up with two periods where no clusters are treated in period $t = 1$ and some clusters are treated in period $t = 2$, the canonical DiD estimator at the cluster level is equivalent to the estimated OLS coefficient $\hat{\beta}$ from (19), and is given by
$$\hat{\beta} = \beta + \frac{1}{J_1} \sum_{j: D_j = 1} \Delta\bar{\eta}_j - \frac{1}{J_0} \sum_{j: D_j = 0} \Delta\bar{\eta}_j = \beta + \frac{1}{J_1} \sum_{j: D_j = 1} \Big( \Delta\nu_j + n_j^{-1} \sum_{i=1}^{n_j} \Delta\epsilon_{ij} \Big) - \frac{1}{J_0} \sum_{j: D_j = 0} \Big( \Delta\nu_j + n_j^{-1} \sum_{i=1}^{n_j} \Delta\epsilon_{ij} \Big), \tag{20}$$
where $J_d$ corresponds with the number of clusters with treatment $d$, and $\Delta\bar{\eta}_j = \bar{\eta}_{j2} - \bar{\eta}_{j1}$ (and likewise for the other variables). The equation in the previous display highlights the challenge in this setup: with few clusters, the averages of the cluster-level shocks $\Delta\nu_j$ among treated and untreated clusters will tend not to be approximately normally distributed, and their variance may be difficult to estimate.
It is worth highlighting that the model described above starts from the structural equation (18) rather than a model where the primitives are potential outcomes as in Section 2. We think that connecting the assumptions on the errors in the structural model (18) to restrictions on the potential outcomes is an interesting open topic for future work. Although a general treatment is beyond the scope of this paper, in Appendix A we show that the errors in the structural model (18) map to primitives based on potential outcomes in the canonical model from Section 2. For the remainder of the sub-section, however, we focus primarily on the restrictions placed on $\nu_{jt}$ and $\epsilon_{ijt}$ directly, rather than the implications of these assumptions for the potential outcomes, since this simplifies exposition and matches how these assumptions are stated in the literature.
Donald and Lang (2007) made an important early contribution to the literature on inference with few clusters. Their approach assumes that the cluster-level shocks $\nu_{jt}$ are mean-zero Gaussian, homoskedastic with respect to cluster and treatment status, and independent of other unit-and-time specific shocks. They also assume the number of units per cluster is large ($n_j \to \infty$ for all $j$). They then show that one can obtain valid inference by using critical values from a t-distribution with $J - 2$ degrees of freedom, where $J$ is the total number of clusters. A nice feature of this approach is that it allows for valid inference when both the number of treated and untreated clusters is small. The disadvantage is the strong parametric assumption of homoskedastic Gaussian errors, which will often be hard to justify in practice.
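In the two-period case, this amounts to a two-sample comparison of cluster-level average changes with pooled-variance standard errors and $t(J - 2)$ critical values; a minimal sketch in that spirit (our own interface, not Donald and Lang's general procedure):

```python
import numpy as np
from scipy import stats

def cluster_did_tconf(dybar, d_cluster, alpha=0.05):
    """Two-period DiD on cluster averages with t(J-2) critical values.

    dybar     : cluster-level average outcome changes, shape (J,)
    d_cluster : cluster treatment indicators, shape (J,)
    """
    dybar, d_cluster = np.asarray(dybar, float), np.asarray(d_cluster, int)
    J1, J0 = (d_cluster == 1).sum(), (d_cluster == 0).sum()
    beta = dybar[d_cluster == 1].mean() - dybar[d_cluster == 0].mean()
    # pooled variance of the cluster-level changes (homoskedastic across clusters)
    s2 = (((dybar[d_cluster == 1] - dybar[d_cluster == 1].mean()) ** 2).sum()
          + ((dybar[d_cluster == 0] - dybar[d_cluster == 0].mean()) ** 2).sum()) / (J1 + J0 - 2)
    se = np.sqrt(s2 * (1 / J1 + 1 / J0))
    tcrit = stats.t.ppf(1 - alpha / 2, J1 + J0 - 2)
    return beta, (beta - tcrit * se, beta + tcrit * se)
```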
Conley and Taber (2011) introduce an alternative approach for inference that is able to relax the strong assumption of Gaussian errors in settings where there are many control clusters ($J_0$ large) but few treated clusters ($J_1$ small). This may be reasonable if the author has data from, say, 3 treated states and 47 untreated states. The key idea in Conley and Taber (2011) is that if we assume that the errors in treated states come from the same distribution as in control states, then we can learn the distribution of errors from the large number of control states and use that to construct standard errors. A key advantage of this approach is that the distribution of errors is not assumed to be Gaussian, but rather is learned from the data. Nevertheless, the assumption that all treated groups have the same distribution of errors is still strong, and will often be violated if either there is heterogeneity in treatment effects or in cluster sizes. Ferman and Pinto (2019) extend the approach of Conley and Taber (2011) to allow for heteroskedasticity caused by heterogeneity in group sizes or other observable characteristics, but must still restrict heterogeneity based on unobserved characteristics (e.g. unobserved treatment effect heterogeneity).
Hagemann (2020) provides an alternative permutation-based approach that avoids the
need to directly estimate the heteroskedasticity. The key insight of Hagemann (2020) is
that if we place a bound on the maximal relative heterogeneity across clusters, then we
can bound the probability of type I error from a permutation approach. He also shows
how one can use this measure of relative heterogeneity to do sensitivity analysis. Like the
other proposals above, though, Hagemann (2020)’s approach must also place some strong
restrictions on certain types of heterogeneity. In particular, his approach essentially requires
that, as cluster size grows large, any single untreated cluster could be used to infer the
counterfactual trend for the treated group, and thus his approach rules out cluster-specific
heterogeneity in trends in untreated potential outcomes.
Another popular solution with few clusters is the cluster wild bootstrap. In an influential paper, Cameron, Gelbach and Miller (2008) presented simulation evidence that the cluster wild bootstrap procedure can work well in settings with as few as five clusters. More recently, however, Canay, Santos and Shaikh (2021) provided a formal analysis of the conditions under which the cluster wild bootstrap is asymptotically valid in settings with a few large clusters. Importantly, Canay et al. (2021) show that the reliability of these bootstrap procedures depends on imposing certain homogeneity conditions on treatment effects, as well as on the type of bootstrap weights one uses and the estimation method adopted (e.g., restricted vs. unrestricted OLS). These restrictions are commonly violated when one uses TWFE regressions with cluster-specific and time fixed effects like (19) or when treatment effects are allowed to be heterogeneous across clusters; see Examples 2 and 3 in Canay et al. (2021). Simulations have likewise shown that the cluster wild bootstrap may perform poorly in DiD settings with a small number of treated clusters (MacKinnon and Webb, 2018). Thus, while the wild bootstrap may perform well in certain scenarios with a small number of clusters, it too requires strong homogeneity assumptions.
Finally, in settings with a large number of time periods, it may be feasible to con-
duct reliable inference with less stringent homogeneity assumptions about treatment effects.
For instance, Canay, Romano and Shaikh (2017), Ibragimov and Müller (2016), Hagemann
(2021), and Chernozhukov, Wüthrich and Zhu (2021) respectively propose permutation-
based, t-test based, adjusted permutation-based, and conformal inference-based procedures
that allow one to relax distributional assumptions about common shocks and accommodate
richer forms of heterogeneity. The key restriction is that one is comfortable limiting the
time-series dependence of the cluster-specific-shocks, and strengthening the parallel trends
assumption to hold in many pre- and post-treatment time periods. These methods have
been shown to be valid under asymptotics where the number of periods grows large. When in fact the number of time periods is small, as frequently occurs in DiD applications, one can still use some of these methods, but the underlying assumptions are stronger; see, e.g., Remark 4.5 and Section 4.2 of Canay et al. (2017).
Alternative approaches. We now briefly discuss two alternative approaches in settings with a small number of clusters. First, while all of the “model-based” papers above treat $\nu_{jt}$ as random, an alternative perspective would be to condition on the values of $\nu_{jt}$ and view the remaining uncertainty as coming only from the sampling of the individual units within clusters, constructing standard errors by clustering only at the unit level. This will generally produce a violation of parallel trends, but the violation may be relatively small if the cluster-specific shocks are small relative to the idiosyncratic variation. The violation of parallel trends could then be accounted for using the methods described in Section 4.
To make this concrete, consider the setting of Card and Krueger (1994) that compares
employment in NJ and PA after NJ raised its minimum wage. The aforementioned papers
would consider NJ and PA as drawn from a super-population of treated and untreated states,
where the state-level shocks are mean-zero, whereas the alternative approach would treat
the two states as fixed and view any state-level shocks between NJ and PA as a violation of
the parallel trends assumption. One could then explore the sensitivity of one’s conclusions
to the magnitude of this violation, potentially benchmarking it relative to the magnitude of
the pre-treatment violations as discussed in Section 4.5.
A second possibility is Fisher Randomization Tests (FRTs), otherwise known as permu-
tation tests. The basic idea is to calculate some statistic of the data (e.g. the t-statistic
of the DiD estimator), and then recompute this statistic under many permutations of the
treatment assignment (at the cluster level). We then reject the null hypothesis of no effect if
the test statistic using the original data is larger than 95% of the draws of the test statistics
under the permuted treatment assignment. Such tests have a long history in statistics, dat-
ing to Fisher (1935). If treatment is randomly assigned, then FRTs have exact finite-sample
validity under the sharp null of no treatment effects for all units. The advantage of these
tests is that they place no restrictions on the potential outcomes, and thus allow arbitrary
heterogeneity in potential outcomes across clusters. On the other hand, the assumption of
random treatment assignment may often be questionable in DiD settings. Moreover, the
“sharp” null of no effects for all units may not be as economically interesting as the “weak”
null of no average effects. Nevertheless, permutation tests may be a useful benchmark: if
one cannot reject the null of no treatment effects even if treatment had been randomly as-
signed, this suggests that there is not strong evidence of an effect in the data without other
strong assumptions. In settings with staggered treatment timing, it may be more plausible
to assume that the timing of when a unit gets treated is as good as random; see Roth and
Sant’Anna (2021) for efficient estimators and FRTs for this setting.
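A minimal sketch of such a cluster-level FRT for the simple two-period design (our own interface; with few clusters one could instead enumerate all possible assignments):

```python
import numpy as np

def frt_pvalue(dybar, d_cluster, n_perm=9999, seed=0):
    """Fisher randomization test of the sharp null of no effect: permute the
    cluster-level treatment assignment and compare the observed DiD statistic
    to its permutation distribution."""
    rng = np.random.default_rng(seed)
    dybar, d_cluster = np.asarray(dybar, float), np.asarray(d_cluster, int)

    def stat(d):
        return abs(dybar[d == 1].mean() - dybar[d == 0].mean())

    observed = stat(d_cluster)
    draws = np.array([stat(rng.permutation(d_cluster)) for _ in range(n_perm)])
    # including the observed assignment preserves exact finite-sample validity
    return (1 + (draws >= observed).sum()) / (n_perm + 1)
```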
Recommendations. In sum, recent research has made progress on the problem of con-
ducting inference with relatively few clusters, but all of the available approaches require the
researcher to impose some potentially strong additional assumptions. Most of the litera-
ture has focused on model-based approaches, which require the researcher to impose some
homogeneity assumptions across clusters. Different homogeneity assumptions may be more
reasonable in different contexts, and so we encourage researchers using these approaches to
choose a method relying on a dimension of homogeneity that is most likely to hold (ap-
proximately) in their context. We also note that allowing for more heterogeneity may often
come at the expense of obtaining tests with lower power. When none of these homogeneity
assumptions is palatable, conditioning the inference on the cluster-level shocks and treating
them as violations of parallel trends, accompanied by appropriate sensitivity analyses, may
be an attractive alternative. Permutation-based methods also offer an intriguing alterna-
tive which requires no assumptions about homogeneity in potential outcomes, but requires
stronger assumptions on the assignment of treatment and tests a potentially less interesting
null hypothesis when the number of clusters is small.
5.2 Design-based inference and the appropriate level of clustering
The canonical approach to inference in DiD assumes that we have a large number of independently-
drawn clusters sampled from an infinite super-population. In practice, however, there are
two related conceptual difficulties with this framework. First, in many settings, it is unclear what the super-population of clusters is: if the clusters in my sample are the 50 US states, should I view these as having been drawn from a super-population of possible states? Second, in many settings it is hard to determine what the appropriate level of clustering is: if my data is on individuals who live in counties, which are themselves subsets of states, which is the appropriate level of clustering?
To address these difficulties, it is often easier to consider a design-based framework that
views the units in the data as fixed (not necessarily sampled from a super-population) and the
treatment assignment as stochastic. This helps to address the difficulties described above,
since we do not need to conceptualize the super-population, and the appropriate level of
clustering is determined by the way that treatment is assigned. Design-based frameworks
have a long history in statistics dating to Neyman (1923), and have received recent attention
in econometrics (e.g. Abadie, Athey, Imbens and Wooldridge, 2020, 2023). However, until recently, most of the results in the design-based literature have focused on settings where treatment probabilities are known or depend only on observable characteristics, and thus were not directly applicable to DiD.
Recent work by Rambachan and Roth (2022a) has extended this design-based view to set-
tings like DiD, where treatment probabilities may differ in unknown ways across units. Ram-
bachan and Roth (2022a) consider a setting similar to the canonical two-period model in Sec-
tion 2. However, following the design-based paradigm, they treat the units in the population
(and their potential outcomes) as fixed rather than drawn from an infinite super-population.
In this set-up, they show that the usual DiD estimator is unbiased for a finite-population
analog to the ATT under a finite-population analog to the parallel trends assumption. In
particular, let $\pi_i$ denote the probability that $D_i = 1$, and suppose that
$$\sum_{i=1}^{N} \left( \pi_i - \frac{N_1}{N} \right) \left( Y_{i,2}(0) - Y_{i,1}(0) \right) = 0,$$
so that treatment probabilities are uncorrelated with trends in $Y(0)$ (a finite-population version of parallel trends). Then $E_D[\hat{\tau}_2] = \tau_2^{Fin}$, where $\tau_2^{Fin} = E_D\left[ \frac{1}{N_1} \sum_{i: D_i = 1} \left( Y_{i,2}(1) - Y_{i,2}(0) \right) \right]$ is a finite-population analog to the ATT, i.e. the expected average treatment effect on the treated, where the expectation is taken over the stochastic distribution of which units
are treated.
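To illustrate this result concretely, the following small simulation (our own sketch, not code from Rambachan and Roth (2022a)) holds the potential outcomes of $N$ units fixed and redraws only the treatment assignment. With a constant assignment probability, the finite-population parallel trends condition above holds exactly, and the DiD estimates average out to the unit-level effect of 1.

```r
# Design-based simulation sketch: potential outcomes are fixed; only D_i is stochastic.
set.seed(42)
N <- 200; p <- 0.4
y1_0 <- rnorm(N)          # fixed Y_{i,1}(0)
y2_0 <- y1_0 + rnorm(N)   # fixed Y_{i,2}(0)
y2_1 <- y2_0 + 1          # fixed Y_{i,2}(1): treatment effect of 1 for every unit
# With constant pi_i = p, treatment probabilities are trivially uncorrelated with trends.
tau_hat <- replicate(5000, {
  d  <- rbinom(N, 1, p)                      # stochastic treatment assignment
  y2 <- ifelse(d == 1, y2_1, y2_0)           # observed period-2 outcome
  (mean(y2[d == 1]) - mean(y1_0[d == 1])) -
    (mean(y2[d == 0]) - mean(y1_0[d == 0]))  # canonical DiD estimate
})
mean(tau_hat)  # approximately 1: unbiased over the distribution of assignments
```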
Rambachan and Roth (2022a) show that from the design-based perspective, cluster-
robust standard errors are valid (but potentially conservative) if the clustering is done at the
level at which treatment is independently determined. Thus, for example, if the treatment
is assigned independently at the unit level (formally, if units are assigned independently before we condition on the number of treated units, $N_1$), then we should cluster at the unit level; by contrast, if treatment is determined independently across states, then we should cluster at
the state level. This clear recommendation on the appropriate level of clustering contrasts
with the more traditional model-based view that clustering should be done at the level at
which the errors are correlated, which often makes it challenging to choose the appropriate
level (MacKinnon, Nielsen and Webb, 2022). These results also suggest that it may not
actually be a problem if it is difficult to conceptualize a super-population from which our
clusters are drawn; rather, the “usual” approach remains valid if there is no super-population
and the uncertainty comes from stochastic assignment of treatment. In some settings, of course, the uncertainty may arise both from sampling and the stochastic assignment of treatment. Abadie et al. (2023) study a model in which both treatment is stochastic and units are sampled from a larger population, and suggest that one should cluster among units if either their treatment assignments are correlated or the event that they are included in the sample is correlated. Although the Abadie et al. (2023) results do not directly apply to DiD, we suspect that a similar heuristic would apply in DiD as well in light of the results in Rambachan and Roth (2022a) for the case where only treatment is viewed as stochastic. Formalizing this intuition strikes us as an interesting area for future research.
Recommendations. If it is difficult to conceptualize a super-population, fear not! Your
DiD analysis can likely still be sensible from a finite-population perspective where we think of
the treatment assignment as stochastic. Furthermore, if you are unsure about the appropriate
level of clustering, a good rule of thumb (at least from the design-based perspective) is to
cluster at the level at which treatment is independently assigned.
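In practice, this rule of thumb is straightforward to operationalize. The sketch below uses the fixest package (listed in Table 2); the data frame and variable names are hypothetical placeholders for individual-level data in which the policy is assigned at the state level.

```r
# Hypothetical illustration: individual-level panel data, policy assigned at the
# state level, so we cluster at the state level rather than by county or individual.
library(fixest)
est <- feols(y ~ treat | state + year,  # TWFE specification (placeholder variables)
             data = df,
             cluster = ~state)          # cluster at the level of treatment assignment
summary(est)
```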
6 Other topics and areas for future research
In this section, we briefly touch on some other areas of interest in the DiD literature, and
highlight some open areas for future research.
Distributional treatment effects. The DiD literature typically focuses on estimation of
the ATT, but researchers may often be interested in the effect of a treatment on the entire
distribution of an outcome. Athey and Imbens (2006) propose the Changes-in-Changes
model, which allows one to infer the full counterfactual distribution of Y p0q for the treated
group in DiD setups. The key assumption is that the mapping between quantiles of Y p0q
for the treated and comparison groups remains stable over time e.g., if the 30th percentile
of the outcome for the treated group was the 70th percentile for the comparison group prior
to treatment, this relationship would have been preserved in the second period if treatment
had not occurred. Bonhomme and Sauder (2011) propose an alternative distributional DiD
model based on a parallel trends assumption for the (log of the) characteristic function, which
is motivated by a model of test scores. Callaway and Li (2019) propose a distributional DiD
model based on a copula stability assumption. Finally, Roth and Sant’Anna (2022) show that parallel trends holds for all functional forms under a “parallel trends”-type assumption for the cumulative distribution of $Y(0)$, and this assumption also allows one to infer the full counterfactual distribution for the treated group.
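To make the Changes-in-Changes mapping concrete, the sketch below (our own illustration of the quantile mapping described above, not code from Athey and Imbens (2006)) constructs counterfactual second-period outcomes for the treated group by pushing their first-period outcomes through the comparison group's period-1-to-period-2 quantile transformation.

```r
# Sketch of the Changes-in-Changes counterfactual: map each treated period-1 outcome
# to its rank in the comparison group's period-1 distribution, then read off that
# same quantile of the comparison group's period-2 distribution.
cic_counterfactual <- function(y_treat_1, y_comp_1, y_comp_2) {
  u <- ecdf(y_comp_1)(y_treat_1)  # ranks within comparison group, period 1
  quantile(y_comp_2, probs = u, type = 1, names = FALSE)  # same ranks, period 2
}
# Hypothetical usage, with y_treat_2 the observed treated outcomes in period 2:
# mean(y_treat_2) - mean(cic_counterfactual(y_treat_1, y_comp_1, y_comp_2))
```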
Quasi-random treatment timing. In settings with staggered treatment timing, the gen-
eralized parallel trends assumption is often justified by arguing that the timing of treatment
is random or quasi-random. Roth and Sant’Anna (2021) show that if one is willing to as-
sume treatment timing is as good as random, one can obtain more efficient estimates than
using the staggered DiD methods discussed in Section 3.3. This builds on earlier work by
McKenzie (2012), who highlighted that DiD is typically inefficient in an RCT where lagged
outcomes are observed, as well as a large literature in statistics on efficient covariate ad-
justment in randomized experiments (e.g., Lin, 2013). Shaikh and Toulis (2021) propose a
method for observational settings where treatment timing is random conditional on fixed ob-
servable characteristics. We think that developing methods for observational settings where
treatment timing is approximately random, possibly conditional on covariates and lagged
outcomes, is an interesting area for further study in the years ahead.
Sequential ignorability. As discussed in Section 3.4, an exciting new literature in DiD
has begun to focus on settings where treatment can turn on and off and potential outcomes
depend on the full path of treatments. A similar setting has been studied extensively in
biostatistics, beginning with the pioneering work of Robins (1986). The key difference is
that the biostatistics literature has focused on sequential random ignorability assumptions
that impose that treatment in each period is random conditional on the path of covariates
and realized outcomes, rather than parallel trends. We suspect that there are economic settings where sequential ignorability may be preferable to parallel trends, e.g. when there is feedback between lagged outcomes and future treatment choices. Integrating these two literatures (e.g., understanding in which economic settings parallel trends is preferable to sequential ignorability and vice versa) is an interesting area for future research. An
interesting step towards incorporating sequential ignorability in economic analyses is Viviano
and Bradic (2021).
Spillover effects. The vast majority of the DiD literature imposes the SUTVA assump-
tion, which rules out spillover effects. However, spillover effects may be important in many
economic applications, such as when policy in one area affects neighboring areas, or when
individuals are connected in a network. Butts (2021) provides some initial work in this
direction by extending the framework of Callaway and Sant’Anna (2021) to allow for local
spatial spillovers. Huber and Steinmayr (2021) also consider extensions to allow for spillover
effects. We suspect that in the coming years, we will see more work on DiD with spillovers.
Conditional treatment effects. The DiD literature has placed a lot of emphasis on learning about the ATTs of different groups. However, in many situations, it may also be desirable to better understand how these ATTs vary across subpopulations defined by covariate values. For instance, how does the average treatment effect of a training program on
earnings vary according to the age of its participants? Abadie (2005) provides re-weighting
methods to tackle these types of questions using linear approximations. However, recent
research has shown that data-adaptive/machine-learning procedures can be used to more
flexibly estimate treatment effect heterogeneity in the context of RCTs or cross-sectional
observational studies with unconfoundedness (e.g., Lee, Okui and Whang, 2017; Wager and
Athey, 2018; Chernozhukov, Demirer, Duflo and Fernández-Val, 2020). Whether such tools
can be adapted to estimate treatment effect heterogeneity in DiD setups is a promising area
for future research.
Triple differences. A common variant on DiD is triple-differences (DDD), which compares
the DiD estimate for a demographic group expected to be affected by the treatment to a DiD
for a second demographic group expected not to be affected (or affected less). For example,
Gruber (1994) studies the impacts of mandated maternity leave policies using a DDD design
that compares the evolution of wages between treated/untreated states, before/after the law
passed, and between married women age 20-40 (who are expected to be affected) and other
workers. DDD has received much less attention in the recent literature than standard DiD.
We note, however, that DDD can often be cast as a DiD with a transformed outcome. For
example, if we defined the state-level outcome $\tilde{Y}$ as the difference in wages between women age 20-40 and other workers, then Gruber (1994)'s DDD analysis would be equivalent to a DiD analysis using $\tilde{Y}$ as the outcome instead of wages. Nevertheless, we think that providing
a more formal analysis of DDD along with practical recommendations for applied researchers
would be a useful direction for future research.
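To make the transformed-outcome representation concrete, a minimal sketch (with hypothetical data frame and variable names) would first collapse the data to a state-by-year gap between the affected and unaffected groups, and then run a standard DiD on that gap:

```r
# Hypothetical sketch: DDD as a DiD on a transformed outcome. df has one row per
# worker, with columns: state, year, wage, affected (in the demographic group expected
# to be affected), treated (state passed the law), post (year is after passage).
library(dplyr)

df_tilde <- df %>%
  group_by(state, year, treated, post) %>%
  summarise(y_tilde = mean(wage[affected]) - mean(wage[!affected]), .groups = "drop")

# A standard DiD regression on the transformed outcome y_tilde:
lm(y_tilde ~ treated * post, data = df_tilde)
```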
Connections to other panel data methods. DiD is of course one of many possible
panel data methods. One of the most prominent alternatives is the synthetic control (SC)
method, pioneered by Abadie, Diamond and Hainmueller (2010). Much of the DiD and SC
literatures have evolved separately, using different data-generating processes as the baseline (Abadie, 2021). Recent work has begun to try to combine insights from the two literatures (e.g., Arkhangelsky, Athey, Hirshberg, Imbens and Wager, 2021; Ben-Michael, Feller and Rothstein, 2021, 2022; Doudchenko and Imbens, 2016). We think that exploring further connections between the literatures (in particular, providing clear guidance for practitioners on when we should expect one method to perform better than the other, or whether one should consider a hybrid of the two) is an interesting direction for future research.
7 Conclusion
This paper synthesizes the recent literature on DiD. Some key themes are that researchers
should be clear about the comparison group used for identification, match the estimation
and inference methods to the identifying assumptions, and explore robustness to possible
violations of those assumptions. We emphasize that context-specific knowledge will often
be needed to choose the right identifying assumptions and accompanying methods. We are
hopeful that these recent developments will help to make DiD analyses more transparent
and credible in the years to come.
Table 1: A Checklist for DiD Practitioners

Is everyone treated at the same time?
• If yes, and the panel is balanced, estimation with TWFE specifications such as (5) or (7) yields easily interpretable estimates.
• If no, consider using a “heterogeneity-robust” estimator for staggered treatment timing as described in Section 3. The appropriate estimator will depend on whether treatment turns on/off and which parallel trends assumption you’re willing to impose. Use TWFE only if you’re confident in treatment effect homogeneity.

Are you sure about the validity of the parallel trends assumption?
• If yes, explain why, including a justification for your choice of functional form. If the justification is (quasi-)random treatment timing, consider using a more efficient estimator as discussed in Section 6.
• If no, consider the following steps:
  1. If parallel trends would be more plausible conditional on covariates, consider a method that conditions on covariates, as described in Section 4.2.
  2. Assess the plausibility of the parallel trends assumption by constructing an event-study plot. If there is a common treatment date and you’re using an unconditional parallel trends assumption, plot the coefficients from a specification like (16). If not, then see Section 4.3 for recommendations on event-plot construction.
  3. Accompany the event-study plot with diagnostics of the power of the pre-test against relevant alternatives and/or non-inferiority tests, as described in Section 4.4.1.
  4. Report formal sensitivity analyses that describe the robustness of the conclusions to potential violations of parallel trends, as described in Section 4.5.

Do you have a large number of treated and untreated clusters sampled from a super-population?
• If yes, then use cluster-robust methods at the cluster level. A good rule of thumb is to cluster at the level at which treatment is independently assigned (e.g. at the state level when policy is determined at the state level); see Section 5.2.
• If you have a small number of treated clusters, consider using one of the alternative inference methods described in Section 5.1.
• If you can’t imagine the super-population, consider a design-based justification for inference instead, as discussed in Section 5.2.
Table 2: Statistical Packages for Recent DiD Methods

Heterogeneity Robust Estimators for Staggered Treatment Timing
Package | Software | Description
did, csdid | R, Stata | Implements Callaway and Sant’Anna (2021)
did2s | R, Stata | Implements Gardner (2021), Borusyak et al. (2021), Sun and Abraham (2021), Callaway and Sant’Anna (2021), Roth and Sant’Anna (2021)
didimputation, did_imputation | R, Stata | Implements Borusyak et al. (2021)
DIDmultiplegt, did_multiplegt | R, Stata | Implements de Chaisemartin and D’Haultfoeuille (2020)
eventstudyinteract | Stata | Implements Sun and Abraham (2021)
flexpaneldid | Stata | Implements Dettmann (2020), based on Heckman et al. (1998)
fixest | R | Implements Sun and Abraham (2021)
stackedev | Stata | Implements stacking approach in Cengiz et al. (2019)
staggered | R | Implements Roth and Sant’Anna (2021), Callaway and Sant’Anna (2021), and Sun and Abraham (2021)
xtevent | Stata | Implements Freyaldenhoven et al. (2019)

DiD with Covariates
Package | Software | Description
DRDID, drdid | R, Stata | Implements Sant’Anna and Zhao (2020)

Diagnostics for TWFE with Staggered Timing
Package | Software | Description
bacondecomp, ddtiming | R, Stata | Diagnostics from Goodman-Bacon (2021)
TwoWayFEWeights | R, Stata | Diagnostics from de Chaisemartin and D’Haultfoeuille (2020)

Diagnostic / Sensitivity for Violations of Parallel Trends
Package | Software | Description
honestDiD | R, Stata | Implements Rambachan and Roth (2022b)
pretrends | R | Diagnostics from Roth (2022)

Note: This table lists R and Stata packages for recent DiD methods, and is based on Asjad Naqvi’s repository at https://asjadnaqvi.github.io/DiD/. Several of the packages listed under “Heterogeneity Robust Estimators” also accommodate covariates.
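To illustrate how these packages are typically invoked, the snippet below sketches a call to the did package (Callaway and Sant’Anna, 2021). The data frame and column names are hypothetical placeholders; `first.treat` records the period in which a unit is first treated, with 0 for never-treated units.

```r
# Hypothetical sketch using the did package (Callaway and Sant'Anna, 2021).
library(did)
out <- att_gt(yname  = "y",           # outcome column
              tname  = "year",        # time period column
              idname = "id",          # unit identifier
              gname  = "first.treat", # period of first treatment (0 if never treated)
              data   = df)
summary(aggte(out, type = "dynamic")) # aggregate group-time ATTs to an event study
```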
References
Abadie, Alberto, “Semiparametric Difference-in-Differences Estimators,” The Review of
Economic Studies, 2005, 72 (1), 1–19.
, “Using Synthetic Controls: Feasibility, Data Requirements, and Methodological As-
pects,” Journal of Economic Literature, June 2021, 59 (2), 391–425.
, Alexis Diamond, and Jens Hainmueller, “Synthetic Control Methods for Com-
parative Case Studies: Estimating the Effect of Californias Tobacco Control Program,”
Journal of the American Statistical Association, June 2010, 105 (490), 493–505.
and Guido W. Imbens, “Large sample properties of matching estimators for average
treatment effects,” Econometrica, 2006, 74 (1), 235–267.
and , “On the Failure of the Bootstrap for Matching Estimators,” Econometrica, 2008,
76 (6), 1537–1557.
and , “Bias-Corrected Matching Estimators for Average Treatment Effects,” Journal
of Business & Economic Statistics, 2011, 29 (1), 1–11.
and , “A Martingale Representation for Matching Estimators,” Journal of the American
Statistical Association, 2012, 107 (498), 833–843.
and Jann Spiess, “Robust Post-Matching Inference,” Journal of the American Statis-
tical Association, April 2022, 117 (538), 983–995.
, Susan Athey, Guido Imbens, and Jeffrey Wooldridge, “When Should You Adjust
Standard Errors for Clustering?,” The Quarterly Journal of Economics, 2023, 138 (1),
1–35.
, , Guido W. Imbens, and Jeffrey M. Wooldridge, “Sampling-Based versus
Design-Based Uncertainty in Regression Analysis,” Econometrica, 2020, 88 (1), 265–296.
Abbring, Jaap H. and Gerard J. van den Berg, “The nonparametric identification of
treatment effects in duration models,” Econometrica, 2003, 71 (5), 1491–1517.
Andrews, Isaiah, Jonathan Roth, and Ariel Pakes, “Inference for Linear Conditional
Moment Inequalities,” Review of Economic Studies, 2022, Forthcoming.
Angrist, Joshua D. and Jorn-Steffen Pischke, Mostly Harmless Econometrics: An
Empiricist’s Companion, Princeton: Princeton University Press, 2009.
Arellano, M., “Practitioners Corner: Computing Robust Standard Errors for Within-
groups Estimators,” Oxford Bulletin of Economics and Statistics, 1987, 49 (4), 431–434.
Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens,
and Stefan Wager, “Synthetic Difference-in-Differences,” American Economic Review,
2021, 111 (12), 4088–4118.
Armstrong, Timothy B. and Michal Kolesár, “Optimal Inference in a Class of Regres-
sion Models,” Econometrica, 2018, 86 (2), 655–683.
Athey, Susan and Guido Imbens, “Design-based Analysis in Difference-In-Differences
Settings with Staggered Adoption,” Journal of Econometrics, 2022, 226 (1), 62–79.
and Guido W. Imbens, “Identification and Inference in Nonlinear Difference-in-
Differences Models,” Econometrica, 2006, 74 (2), 431–497.
Baker, Andrew, David F. Larcker, and Charles C. Y. Wang, “How much should
we trust staggered difference-in-differences estimates?,” Journal of Financial Economics,
2022, 144 (2), 370–395.
Ben-Michael, Eli, Avi Feller, and Jesse Rothstein, “The Augmented Synthetic Control
Method,” Journal of the American Statistical Association, 2021, 116 (536), 1789–1803.
, , and , “Synthetic controls with staggered adoption,” Journal of the Royal Statistical
Society: Series B, 2022, 84 (2), 351–381.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan, “How Much Should
We Trust Differences-In-Differences Estimates?,” The Quarterly Journal of Economics,
2004, 119 (1), 249–275.
Bilinski, Alyssa and Laura A. Hatfield, “Seeking evidence of absence: Reconsidering
tests of model assumptions,” arXiv:1805.03273 [stat], May 2018.
Bojinov, Iavor, Ashesh Rambachan, and Neil Shephard, “Panel experiments and
dynamic causal effects: A finite population perspective,” Quantitative Economics, 2021,
12 (4), 1171–1196.
Bonhomme, Stéphane and Ulrich Sauder, “Recovering distributions in difference-in-
differences models: A comparison of selective and comprehensive schooling,” Review of
Economics and Statistics, 2011, 93 (2), 479–494.
Borusyak, Kirill and Xavier Jaravel, “Revisiting Event Study Designs,” SSRN Scholarly
Paper ID 2826228, Social Science Research Network, Rochester, NY 2018.
, , and Jann Spiess, “Revisiting Event Study Designs: Robust and Efficient Estima-
tion,” arXiv:2108.12419 [econ], 2021.
Butts, Kyle, “Difference-in-Differences Estimation with Spatial Spillovers,”
arXiv:2105.03737 [econ], 2021.
Caetano, Carolina, Brantly Callaway, Stroud Payne, and Hugo Sant’Anna
Rodrigues, “Difference in Differences with Time-Varying Covariates,” February 2022.
arXiv:2202.02903 [econ].
Callaway, Brantly and Pedro H. C. Sant’Anna, “Difference-in-Differences with multiple
time periods,” Journal of Econometrics, 2021, 225 (2), 200–230.
and Tong Li, “Quantile treatment effects in difference in differences models with panel
data,” Quantitative Economics, 2019, 10 (4), 1579–1618.
, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna, “Difference-in-Differences
with a Continuous Treatment,” arXiv:2107.02637 [econ], 2021.
Cameron, A Colin, Jonah B Gelbach, and Douglas L Miller, “Bootstrap-Based
Improvements for Inference With Clustered Errors,” Review of Economics and Statistics,
2008, 90 (3), 414–427.
Canay, Ivan A., Andres Santos, and Azeem M. Shaikh, “The wild bootstrap with
a small number of large clusters,” Review of Economics and Statistics, 2021, 103 (2),
346–363.
, Joseph P. Romano, and Azeem M. Shaikh, “Randomization Tests Under an Ap-
proximate Symmetry Assumption,” Econometrica, 2017, 85 (3), 1013–1030.
Card, David and Alan B Krueger, “Minimum Wages and Employment: A Case Study
of the Fast-Food Industry in New Jersey and Pennsylvania,” American Economic Review,
1994, 84 (4), 772–793.
Cengiz, Doruk, Arindrajit Dube, Attila Lindner, and Ben Zipperer, “The Effect
of Minimum Wages on Low-Wage Jobs,” The Quarterly Journal of Economics, 2019, 134
(3), 1405–1454.
Chabé-Ferret, Sylvain, “Analysis of the bias of Matching and Difference-in-Difference
under alternative earnings and selection processes,” Journal of Econometrics, 2015, 185
(1), 110–123.
Chang, Neng-Chieh, “Double/debiased machine learning for difference-in-differences,”
Econometrics Journal, 2020, 23, 177–191.
Chernozhukov, Victor, Kaspar Wüthrich, and Yinchu Zhu, “An Exact and Robust
Conformal Inference Method for Counterfactual and Synthetic Controls,” Journal of the
American Statistical Association, 2021, 116 (536), 1849–1864.
, Mert Demirer, Esther Duflo, and Iván Fernández-Val, “Generic Machine Learn-
ing Inference on Heterogeneous Treatment Effects in Randomized Experiments,” arXiv:
1712.04802, 2020, pp. 1–52.
Conley, Timothy G. and Christopher R. Taber, “Inference with “Difference in Dif-
ferences” with a Small Number of Policy Changes,” Review of Economics and Statistics,
2011, 93 (1), 113–125.
Daw, Jamie R. and Laura A. Hatfield, “Matching and Regression to the Mean in
Difference-in-Differences Analysis,” Health Services Research, 2018.
de Chaisemartin, Clément and Xavier D’Haultfoeuille, “Difference-in-Differences
Estimators of Intertemporal Treatment Effects,” March 2022. arXiv:2007.04267 [econ].
and Xavier D’Haultfoeuille, “Fuzzy Differences-in-Differences,” The Review of Economic
Studies, 2018, 85 (2), 999–1028.
and , “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects,”
American Economic Review, 2020, 110 (9), 2964–2996.
and , “Two-way Fixed Effects Regressions with Several Treatments,” SSRN Scholarly
Paper ID 3751060, Social Science Research Network, Rochester, NY 2021.
and , “Two-way fixed effects and differences-in-differences with heterogeneous treat-
ment effects: a survey,” Econometrics Journal, 2022, Forthcoming.
Dette, Holger and Martin Schumann, “Difference-in-Differences Estimation Under Non-
Parallel Trends,” Working Paper, 2020.
Dettmann, Eva, “Flexpaneldid: A Stata Toolbox for Causal Analysis with Varying Treat-
ment Time and Duration,” SSRN Scholarly Paper ID 3692458, Social Science Research
Network, Rochester, NY 2020.
Ding, Peng and Fan Li, “A Bracketing Relationship between Difference-in-Differences
and Lagged-Dependent-Variable Adjustment,” Political Analysis, 2019, 27, 605–615.
Donald, Stephen G. and Kevin Lang, “Inference with Difference-in-Differences and
Other Panel Data,” The Review of Economics and Statistics, 2007, 89 (2), 221–233.
Doudchenko, Nikolay and Guido W. Imbens, “Balancing, Regression, Difference-In-
Differences and Synthetic Control Methods: A Synthesis,” Working Paper 22791, National
Bureau of Economic Research 2016.
Ferman, Bruno and Cristine Pinto, “Inference in Differences-in-Differences with Few
Treated Groups and Heteroskedasticity,” The Review of Economics and Statistics, 2019,
101 (3), 452–467.
Fisher, R. A., The Design of Experiments, Oxford, England: Oliver & Boyd, 1935.
Freyaldenhoven, Simon, Christian Hansen, and Jesse M. Shapiro, “Pre-event
Trends in the Panel Event-Study Design,” American Economic Review, 2019, 109 (9),
3307–3338.
, , Jorge Pérez Pérez, and Jesse M. Shapiro, “Visualization, identification, and
estimation in the linear panel event-study design.,” Advances in Economics and Econo-
metrics: Twelfth World Congress, 2021, Forthcoming.
Gardner, John, “Two-stage differences in differences,” Working Paper, 2021.
Goodman-Bacon, Andrew, “Difference-in-differences with variation in treatment timing,”
Journal of Econometrics, 2021, 225 (2), 254–277.
Gruber, Jonathan, “The Incidence of Mandated Maternity Benefits,” The American Eco-
nomic Review, 1994, 84 (3), 622–641.
Hagemann, Andreas, “Inference with a single treated cluster,” arXiv:2010.04076
[econ.EM], 2020, pp. 1–23.
, “Permutation inference with a finite number of heterogeneous clusters,” arXiv:1907.01049
[econ.EM], 2021.
Hasegawa, Raiden B., Daniel W. Webster, and Dylan S. Small, “Evaluating Missouri’s Handgun Purchaser Law: A Bracketing Method for Addressing Concerns About
History Interacting with Group,” Epidemiology, 2019, 30 (3), 371–379.
Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, “Character-
izing Selection Bias Using Experimental Data,” Econometrica, 1998, 66 (5), 1017–1098.
Heckman, James J., Hidehiko Ichimura, and Petra Todd, “Matching as an econo-
metric evaluation estimator: Evidence from evaluating a job training programme,” The
Review of Economic Studies, 1997, 64 (4), 605–654.
Holland, Paul W., “Statistics and Causal Inference,” Journal of the American Statistical
Association, 1986, 81 (396), 945–960.
Huber, Martin and Andreas Steinmayr, “A Framework for Separating Individual-Level
Treatment Effects From Spillover Effects,” Journal of Business & Economic Statistics,
2021, 39 (2), 422–436.
Ibragimov, Rustam and Ulrich K. Müller, “Inference with few heterogeneous clusters,”
Review of Economics and Statistics, 2016, 98 (1), 83–96.
Imai, Kosuke and In Song Kim, “On the Use of Two-way Fixed Effects Regression
Models for Causal Inference with Panel Data,” Political Analysis, 2021, 29 (3), 405–415.
Jakiela, Pamela, “Simple Diagnostics for Two-Way Fixed Effects,” arXiv:2103.13229 [econ,
q-fin], March 2021.
Kahn-Lang, Ariella and Kevin Lang, “The Promise and Pitfalls of Differences-in-
Differences: Reflections on 16 and Pregnant and Other Applications,” Journal of Business
& Economic Statistics, 2020, 38 (3), 613–620.
Keele, Luke J., Dylan S. Small, Jesse Y. Hsu, and Colin B. Fogarty, “Patterns
of Effects and Sensitivity Analysis for Differences-in-Differences,” arXiv:1901.01869 [stat],
February 2019. arXiv: 1901.01869.
Khan, Shakeeb and Elie Tamer, “Irregular Identification, Support Conditions, and In-
verse Weight Estimation,” Econometrica, 2010, 78 (6), 2021–2042.
Lee, Sokbae, Ryo Okui, and Yoon-Jae Jae Whang, “Doubly robust uniform con-
fidence band for the conditional average treatment effect function,” Journal of Applied
Econometrics, 2017, 32 (7), 1207–1225.
Liang, Kung-Yee and Scott L. Zeger, “Longitudinal data analysis using generalized
linear models,” Biometrika, 1986, 73 (1), 13–22.
Lin, Winston, “Agnostic notes on regression adjustments to experimental data: Reexam-
ining Freedman’s critique,” Annals of Applied Statistics, 2013, 7 (1), 295–318.
Liu, Licheng, Ye Wang, and Yiqing Xu, “A Practical Guide to Counterfactual Esti-
mators for Causal Inference with Time-Series Cross-Sectional Data,” American Journal of
Political Science, 2022, Forthcoming.
MacKinnon, James G. and Matthew D. Webb, “The wild bootstrap for few (treated)
clusters,” The Econometrics Journal, 2018, 21 (2), 114–135.
, Morten Ørregaard Nielsen, and Matthew D. Webb, “Cluster-robust inference: A
guide to empirical practice,” Journal of Econometrics, 2022, Forthcoming.
Malani, Anup and Julian Reif, “Interpreting pre-trends as anticipation: Impact on
estimated treatment effects from tort reform,” Journal of Public Economics, 2015, 124,
1–17.
Manski, Charles F. and John V. Pepper, “How Do Right-to-Carry Laws Affect Crime
Rates? Coping with Ambiguity Using Bounded-Variation Assumptions,” The Review of
Economics and Statistics, 2018, 100 (2), 232–244.
Marcus, Michelle and Pedro H. C. Sant’Anna, “The role of parallel trends in event
study settings : An application to environmental economics,” Journal of the Association
of Environmental and Resource Economists, 2021, 8 (2), 235–275.
McKenzie, David, “Beyond baseline and follow-up: The case for more T in experiments,”
Journal of Development Economics, 2012, 99 (2), 210–221.
Meyer, Bruce D., “Natural and Quasi-Experiments in Economics,” Journal of Business
& Economic Statistics, 1995, 13 (2), 151–161.
Neyman, Jerzy, “On the Application of Probability Theory to Agricultural Experiments.
Essay on Principles. Section 9.,” Statistical Science, 1923, 5 (4), 465–472.
Olea, José Luis Montiel and Mikkel Plagborg-Møller, “Simultaneous confidence bands:
Theory, implementation, and an application to SVARs,” Journal of Applied Econometrics,
2019, 34 (1), 1–17.
Rambachan, Ashesh and Jonathan Roth, “Design-Based Uncertainty for Quasi-
Experiments,” November 2022. arXiv:2008.00602 [econ, stat].
and , “A More Credible Approach to Parallel Trends,” Review of Economic Studies,
2022, Forthcoming.
Robins, J. M., “A New Approach To Causal Inference in Mortality Studies With a Sus-
tained Exposure Period - Application To Control of the Healthy Worker Survivor Effect,”
Mathematical Modelling, 1986, 7, 1393–1512.
Roth, Jonathan, “Should We Adjust for the Test for Pre-trends in Difference-in-Difference
Designs?,” arXiv:1804.01208 [econ, math, stat], 2018.
, “Pre-test with Caution: Event-study Estimates After Testing for Parallel Trends,” Amer-
ican Economic Review: Insights, 2022, 4 (3), 305–322.
and Pedro H. C. Sant’Anna, “Efficient Estimation for Staggered Rollout Designs,”
arXiv:2102.01291 [econ, math, stat], 2021.
and , “When Is Parallel Trends Sensitive to Functional Form?,” Econometrica, 2022,
Forthcoming.
Rubin, Donald B., “Estimating causal effects of treatments in randomized and nonran-
domized studies,” Journal of Educational Psychology, 1974, 66 (5), 688–701.
Ryan, Andrew M., “Well-Balanced or too Matchy-Matchy? The Controversy over Match-
ing in Difference-in-Differences,” Health Services Research, 2018, 53 (6), 4106–4110.
Sant’Anna, Pedro H. C. and Jun Zhao, “Doubly robust difference-in-differences esti-
mators,” Journal of Econometrics, 2020, 219 (1), 101–122.
Schmidheiny, Kurt and Sebastian Siegloch, “On Event Studies and Distributed-Lags
in Two-Way Fixed Effects Models: Identification, Equivalence, and Generalization,” SSRN
Scholarly Paper 3571164, Social Science Research Network, Rochester, NY 2020.
Shaikh, Azeem and Panos Toulis, “Randomization Tests in Observational Studies With
Staggered Adoption of Treatment,” Journal of the American Statistical Association, 2021,
116 (536), 1835–1848.
Strezhnev, Anton, “Semiparametric weighting estimators for multi-period difference-in-
differences designs,” Working Paper, 2018.
Sun, Liyang and Sarah Abraham, “Estimating dynamic treatment effects in event stud-
ies with heterogeneous treatment effects,” Journal of Econometrics, 2021, 225 (2), 175–199.
Viviano, Davide and Jelena Bradic, “Dynamic covariate balancing: estimating treat-
ment effects over time,” June 2021. arXiv:2103.01280 [econ, math, stat].
Wager, Stefan and Susan Athey, “Estimation and Inference of Heterogeneous Treatment
Effects using Random Forests,” Journal of the American Statistical Association, 2018, 113
(523), 1228–1242.
Wooldridge, Jeffrey M, “Cluster-Sample Methods in Applied Econometrics,” American
Economic Review P&P , 2003, 93 (2), 133–138.
, “Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-
Differences Estimators,” Working Paper, 2021, pp. 1–89.
Ye, Ting, Luke Keele, Raiden Hasegawa, and Dylan S. Small, “A Negative Corre-
lation Strategy for Bracketing in Difference-in-Differences,” arXiv:2006.02423 [econ, stat],
2021.
Zeldow, Bret and Laura A. Hatfield, “Confounding and regression adjustment in
difference-in-differences studies,” Health Services Research, 2021, 56 (5), 932–941.
A Connecting model-based assumptions to potential
outcomes
This section formalizes connections between the model-based assumptions in Section 5.1 and the potential outcomes framework. We derive how the errors in the structural model (18) map to primitives based on potential outcomes in the canonical model from Section 2. Specifically, we show that under the set-up of Section 2, Assumptions 1 and 2 imply that the canonical DiD estimator takes the form given in (20), where $\beta = \tau_2$ is the ATT at the cluster level, $\nu_{jt} = \nu_{jt,0} + D_j \nu_{jt,1}$, and $\epsilon_{ijt} = \epsilon_{ijt,0} + D_j \epsilon_{ijt,1}$, where (writing $E[Y_{ijt}(0) \mid D_j = d]$ for the expectation where one first draws $j$ from the population with $D_j = d$ and then draws $Y_{ijt}(0)$ from that cluster):

$$\begin{aligned}
\epsilon_{ijt,0} &= Y_{ijt}(0) - E[Y_{ijt}(0) \mid j(i) = j] \\
\epsilon_{ijt,1} &= Y_{ijt}(1) - Y_{ijt}(0) - E[Y_{ijt}(1) - Y_{ijt}(0) \mid j(i) = j] \\
\nu_{jt,0} &= E[Y_{ijt}(0) \mid j(i) = j] - E[Y_{ijt}(0) \mid D_j] \\
\nu_{jt,1} &= E[Y_{ijt}(1) - Y_{ijt}(0) \mid j(i) = j] - \tau_t.
\end{aligned}$$

Thus, in the canonical set-up, restrictions on $\nu_{jt}$ and $\epsilon_{ijt}$ can be viewed as restrictions on primitives that are functions of the potential outcomes.
Adopt the notation and set-up in Section 2, except now each unit $i$ belongs to a cluster $j$ and treatment is assigned at the cluster level $D_j$. We assume clusters are drawn iid from a super-population of clusters and then units are drawn iid within the sampled cluster. We write $J_d$ to denote the number of clusters with treatment $d$, and $n_j$ the number of units in cluster $j$. As in the main text, let $\bar{Y}_{jt} = n_j^{-1} \sum_{i: j(i) = j} Y_{ijt}$ be the sample mean within cluster $j$. The canonical DiD estimator at the cluster level can then be written as:
$$\begin{aligned}
\hat{\tau} &= \frac{1}{J_1} \sum_{j: D_j = 1} \left( \bar{Y}_{j2} - \bar{Y}_{j1} \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \left( \bar{Y}_{j2} - \bar{Y}_{j1} \right) \\
&= \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2} - Y_{ij1} \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2} - Y_{ij1} \right).
\end{aligned}$$
Since the observed outcome is $Y(1)$ for treated units and $Y(0)$ for control units, under the no anticipation assumption it follows that
$$\hat{\tau} = \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(1) - Y_{ij1}(0) \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) \right),$$
or equivalently,
$$\begin{aligned}
\hat{\tau} = {} & \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(1) - Y_{ij2}(0) \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) \right) - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) \right).
\end{aligned}$$
Adding and subtracting terms of the form $E[Y_{ijt} \mid j(i) = j]$, we obtain
$$\begin{aligned}
\hat{\tau} = \tau_2 & + \frac{1}{J_1} \sum_{j: D_j = 1} \left( E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] - \tau_2 \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(1) - Y_{ij2}(0) - E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right) \\
& - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right) \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] - \frac{1}{J_0} \sum_{j: D_j = 0} E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j],
\end{aligned}$$

where $\tau_2 = E\left[ J_1^{-1} \sum_{j: D_j = 1} E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] \right] = E[Y_{ij2}(1) - Y_{ij2}(0) \mid D_j = 1]$ is the ATT among treated clusters (weighting all clusters equally).
Now, we assume parallel trends at the cluster level, so that

$$E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j = 1] - E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j = 0] = 0,$$

which implies that
$$\begin{aligned}
\hat{\tau} = \tau_2 & + \frac{1}{J_1} \sum_{j: D_j = 1} \underbrace{\left( E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] - \tau_2 \right)}_{\nu_{j,1}} \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \underbrace{\left( Y_{ij2}(1) - Y_{ij2}(0) - E[Y_{ij2}(1) - Y_{ij2}(0) \mid j(i) = j] \right)}_{\epsilon_{ij,1}} \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \frac{1}{n_j} \sum_{i: j(i) = j} \underbrace{\left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right)}_{\epsilon_{ij,0}} \\
& - \frac{1}{J_0} \sum_{j: D_j = 0} \frac{1}{n_j} \sum_{i: j(i) = j} \underbrace{\left( Y_{ij2}(0) - Y_{ij1}(0) - E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] \right)}_{\epsilon_{ij,0}} \\
& + \frac{1}{J_1} \sum_{j: D_j = 1} \underbrace{\left( E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] - E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j] \right)}_{\nu_{j,0}} \\
& - \frac{1}{J_0} \sum_{j: D_j = 0} \underbrace{\left( E[Y_{ij2}(0) - Y_{ij1}(0) \mid j(i) = j] - E[Y_{ij2}(0) - Y_{ij1}(0) \mid D_j] \right)}_{\nu_{j,0}}.
\end{aligned}$$

Letting $\nu_j = \nu_{j,0} + D_j \nu_{j,1}$ and $\epsilon_{ij} = \epsilon_{ij,0} + D_j \epsilon_{ij,1}$, it follows that $\hat{\tau}$ takes the form (20) with $\beta = \tau_2$.