Economics makes causal claims — minimum wages affect employment, education raises earnings, institutions determine growth. Testing these claims requires data and a method for distinguishing causation from correlation. Econometrics is that method.
This chapter is not a statistics course. We assume familiarity with basic probability and regression. Instead, we focus on the central problem of empirical economics: identification — finding credible sources of exogenous variation that allow us to estimate causal effects. Every tool in this chapter — OLS, instrumental variables, difference-in-differences, regression discontinuity — is a strategy for solving the identification problem.
Prerequisites: Chapters 2 and 5 (economic context for examples). Mathematical prerequisites: linear algebra, probability and statistics.
Consider the question: does an additional year of education increase earnings? We observe that more-educated people earn more. But is this because:

1. Education causally raises earnings (say, by building productive skills), or
2. People who would earn more anyway — higher ability, stronger motivation, better-connected families — also choose more education?
Both are consistent with the observed correlation. The identification problem is that we cannot directly compare the same person with and without education — the counterfactual is unobserved.
The fundamental equation:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
where $Y_i$ is the outcome (earnings), $X_i$ is the treatment (years of education), $\beta$ is the causal parameter of interest, and $\varepsilon_i$ captures everything else affecting $Y_i$ — ability, family background, motivation, luck, health, and thousands of other factors.
The identification problem arises when $X_i$ is correlated with $\varepsilon_i$ — when the "treatment" is not randomly assigned. In statistics, this is called endogeneity. In economics, it is the norm, not the exception: people choose their education (and the choice is correlated with ability), countries choose their policies (and the choice is correlated with their economic conditions), firms choose their prices (and the choice is correlated with demand conditions).
In a randomized experiment, the treatment $X_i$ is assigned by a coin flip — it is independent of $\varepsilon_i$ by construction. But economists rarely have the luxury of randomization for the big questions. The methods in this chapter — OLS, IV, DiD, RD — are strategies for finding "natural experiments" that approximate randomization in observational data.
For the multivariate model $Y = X\beta + \varepsilon$ (matrix notation), the OLS estimator is:

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$$
Under the Gauss-Markov assumptions, OLS has desirable properties:

1. Linearity: the model is linear in parameters.
2. Random sampling: the observations are drawn independently from the population.
3. No perfect collinearity: no regressor is an exact linear combination of the others.
4. Exogeneity: $E[\varepsilon|X] = 0$.
5. Homoskedasticity: $Var(\varepsilon|X) = \sigma^2$.
Under these assumptions, OLS is BLUE — the Best Linear Unbiased Estimator. "Best" means lowest variance among all linear unbiased estimators. "Unbiased" means $E[\hat{\beta}] = \beta$.
The critical assumption is #4: $E[\varepsilon|X] = 0$. When this fails — due to omitted variables, simultaneity, or measurement error in $X$ — OLS is biased. The estimate $\hat{\beta}$ no longer converges to the true $\beta$ even with infinite data. This is not a small-sample problem — it is a fundamental design flaw that more data cannot fix.
A scatter plot with a fitted OLS regression line. Drag the slider to add an outlier at different vertical positions and watch the regression line tilt. Observe how a single high-leverage point can dramatically change the slope, $R^2$, and coefficients.
Figure 9.1. OLS regression with an adjustable outlier. The outlier is placed at $X=14$ (high leverage). Drag the slider above "No outlier" to introduce it and watch the line tilt. Hover for values.
Suppose the true model is $Y = \beta_0 + \beta_1 X + \beta_2 Z + u$, but we omit $Z$ and run $Y = \alpha_0 + \alpha_1 X + e$. Then:

$$E[\hat{\alpha}_1] = \beta_1 + \beta_2 \cdot \frac{Cov(X, Z)}{Var(X)}$$
The bias equals the effect of the omitted variable ($\beta_2$) times the association between the omitted variable and the included regressor.
Sign of bias:
| | $Cov(X, Z) > 0$ | $Cov(X, Z) < 0$ |
|---|---|---|
| $\beta_2 > 0$ | Upward bias (overestimate $\beta_1$) | Downward bias |
| $\beta_2 < 0$ | Downward bias | Upward bias |
Suppose ability ($Z$) is positively correlated with both education ($X$) and earnings ($Y$). Then $\beta_2 > 0$ (ability raises earnings) and $Cov(X,Z) > 0$ (more able people get more education). The OLS estimate of the return to education is biased upward — it attributes some of the ability effect to education.
Two panels show the same data. Left: the true relationship with the confounder (ability) shown as point color. Right: the naive OLS regression that omits ability. Drag the slider to change confounding strength and watch the bias grow.
Left: True model with confounder (ability) shown as color. Darker = higher ability.
Right: Naive OLS ignoring ability. The biased line (red dashed) is steeper than the true causal effect (blue).
When OLS is biased because $X$ is endogenous ($Cov(X, \varepsilon) \neq 0$), an instrumental variable — a variable $Z$ that is correlated with $X$ but affects $Y$ only through $X$ — can rescue the estimation.
Two-Stage Least Squares (2SLS):
First stage: Regress $X$ on $Z$ (and any control variables):

$$X_i = \pi_0 + \pi_1 Z_i + v_i$$
This isolates the part of $X$ driven by the instrument — the exogenous part. The fitted values $\hat{X}_i$ represent the "clean" variation in $X$.
Second stage: Regress $Y$ on $\hat{X}$. In matrix form:

$$\hat{\beta}_{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'Y$$
In the simple case with one instrument and one endogenous regressor:

$$\hat{\beta}_{IV} = \frac{Cov(Z, Y)}{Cov(Z, X)}$$
The IV estimate is the ratio of the reduced form (effect of $Z$ on $Y$) to the first stage (effect of $Z$ on $X$). The intuition: $Z$ affects $Y$ only through $X$ (exclusion restriction), so dividing out the first stage isolates the causal effect of $X$ on $Y$.
What IV estimates. With heterogeneous treatment effects, IV identifies the Local Average Treatment Effect (LATE) — the causal effect for the subpopulation whose behavior is changed by the instrument (the "compliers").
If $Z$ is weakly correlated with $X$, the first stage is weak, and the IV estimate is unreliable (biased toward OLS, wide confidence intervals). Rule of thumb: first-stage F-statistic > 10.
Quarter of birth was used as an instrument for years of schooling. Compulsory schooling laws mean students born earlier in the year can drop out with slightly less education. Quarter of birth is plausibly: (a) correlated with schooling (relevance), and (b) not directly related to earnings (exclusion). The IV estimate of the return to schooling was approximately 7–8% per year.
This directed acyclic graph shows the causal structure of an IV design. Toggle between views to see how an instrument Z breaks the confounding path.
Figure 9.2. DAG for the instrumental variables design. Z is the instrument, X is the endogenous regressor, Y is the outcome, and U is the unobserved confounder. The IV strategy uses only the variation in X that is driven by Z, bypassing the confounding path through U.
$$\hat{\tau}_{DiD} = (\bar{Y}_{treat,post} - \bar{Y}_{treat,pre}) - (\bar{Y}_{control,post} - \bar{Y}_{control,pre})$$

The first difference removes time-invariant group characteristics. The second difference removes common time trends.
Key assumption: Parallel trends. In the absence of treatment, the treatment and control groups would have followed the same trend. This is untestable for the post-treatment period but assessable for the pre-treatment period.
New Jersey raised its minimum wage from \$4.25 to \$5.05 in April 1992; Pennsylvania did not. The DiD estimate of the employment effect was positive (+2.7 FTE workers), contradicting the simple competitive model prediction. This study (Card and Krueger, 1994) spurred a revolution in empirical labor economics.
Regression formulation:

$$Y_{it} = \alpha + \beta_1 Treat_i + \beta_2 Post_t + \tau (Treat_i \times Post_t) + \varepsilon_{it}$$
Two time series show a treatment group and a control group. The treatment occurs at $t = 5$. Drag the slider to change the treatment effect size and see how the DiD estimate updates. Pre-treatment parallel trends are visible.
Figure 9.3. Difference-in-differences design. The dashed line shows the counterfactual — what would have happened to the treatment group without treatment (parallel to control). The gap between the actual and counterfactual outcomes at the end is the treatment effect.
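The cancellation logic can be sketched in a few lines (plain Python, invented group means and effect size). Group levels and the common trend both drop out of the double difference, leaving only the treatment effect:

```python
import random
import statistics

random.seed(3)
tau = 3.0   # true treatment effect (illustrative)
n = 5000    # units per (group, period) cell

def cell(mean):
    return [random.gauss(mean, 2) for _ in range(n)]

# Group level difference (52 vs 55) plus a common time trend (+4);
# the treatment effect hits only the (treat, post) cell.
pre_control  = cell(52)
post_control = cell(52 + 4)
pre_treat    = cell(55)
post_treat   = cell(55 + 4 + tau)

did = ((statistics.mean(post_treat) - statistics.mean(pre_treat))
       - (statistics.mean(post_control) - statistics.mean(pre_control)))
print(did)  # close to 3.0
```

In this 2×2 case the double difference equals the OLS coefficient $\tau$ on $Treat \times Post$ in the regression formulation.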
In a regression discontinuity (RD) design, treatment is assigned by whether a running variable $X$ crosses a cutoff $c$. The treatment effect is the jump in the outcome at the cutoff:

$$\hat{\tau}_{RD} = \lim_{x \downarrow c} E[Y|X=x] - \lim_{x \uparrow c} E[Y|X=x]$$

Key assumption: Continuity. All factors affecting $Y$ (other than treatment) vary continuously at the cutoff — no sorting or manipulation around the threshold.
A scholarship is awarded to students scoring above 80 on an exam. Students scoring 79 and 81 are similar in ability but one gets the scholarship and the other does not. The discontinuity in outcomes (e.g., college completion rates) at the 80-point threshold estimates the causal effect of the scholarship.
A scatter plot with a running variable (test score). Students above the cutoff receive treatment (scholarship). Polynomial fits on each side reveal the jump at the cutoff. Adjust the cutoff position and the bandwidth to see how the estimated treatment effect changes.
Figure 9.4. Regression discontinuity. The vertical dashed line marks the cutoff. Points left of the cutoff are untreated (gray); right are treated (green). The jump at the cutoff is the treatment effect estimate. Adjust the bandwidth to focus on observations near the cutoff.
RCTs are the "gold standard" for internal validity because randomization guarantees $E[\varepsilon|X] = 0$ by construction. Banerjee, Duflo, and Kremer received the 2019 Nobel Prize for their experimental approach to alleviating global poverty.
A job training program randomly assigns 500 individuals to treatment and 500 to control. Only 60% of those assigned to treatment actually attend the program (compliance rate = 0.6).
Results: Average earnings: treatment group = \$15,000, control group = \$13,000.
ITT: $\hat{\tau}_{ITT} = 15{,}000 - 13{,}000 = \\$2{,}000$. This is the effect of being offered the program.
TOT: $\hat{\tau}_{TOT} = 2{,}000 / 0.6 \approx \\$3{,}333$. This estimates the effect of actually attending the program (for compliers). The TOT is larger because the ITT is diluted by non-compliers.
Power check: With $n = 500$ per group, an outcome standard deviation of $\sigma = \\$11{,}000$, and a true effect of $\\$2{,}000$, power $\approx 0.80$. The study is adequately powered to detect the ITT.
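The ITT/TOT arithmetic from the example, as a tiny helper (plain Python; the Wald scaling assumes no one in the control arm attends):

```python
def itt_and_tot(mean_treat_arm, mean_control_arm, compliance):
    """ITT is the arm difference; TOT scales it by the compliance rate."""
    itt = mean_treat_arm - mean_control_arm
    return itt, itt / compliance

itt, tot = itt_and_tot(15_000, 13_000, 0.6)
print(itt, round(tot))  # 2000 3333
```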
Statistical power is the probability of detecting a true treatment effect. Use the sliders to explore how effect size, sample size, and variance affect power. The power curve updates in real time, and the minimum detectable effect (MDE) at 80% power is highlighted.
Figure 9.5. Power curve: probability of detecting the effect as a function of effect size. The red dashed line marks 80% power. The green diamond marks the current parameter combination. The MDE is the smallest effect detectable at 80% power given sample size and variance.
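The power curve in the figure follows a closed form for a two-sided 5% test of a difference in means. A sketch (plain Python, standard-normal approximation; the numbers plugged in are illustrative):

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(effect, sigma, n_per_arm):
    """Power against a true effect, two arms of equal size, 5% two-sided test."""
    se = sigma * math.sqrt(2 / n_per_arm)   # SE of the difference in means
    return normal_cdf(effect / se - 1.96)

def mde(sigma, n_per_arm):
    """Minimum detectable effect at 80% power: (1.96 + 0.84) * SE."""
    return 2.8 * sigma * math.sqrt(2 / n_per_arm)

print(power(2000, 11_000, 500))  # ~0.82
print(mde(11_000, 500))          # ~1948
```

The MDE formula shows why small samples doom studies of modest effects: halving the MDE requires quadrupling the sample size.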
A point estimate without a measure of uncertainty is nearly useless.
Standard errors (SE) are the square roots of the diagonal elements of the estimated variance matrix $\widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}$. A 95% confidence interval is approximately $\hat{\beta} \pm 1.96 \cdot SE(\hat{\beta})$.
Statistical significance: We reject $H_0: \beta = 0$ at the 5% level if $|t| = |\hat{\beta}/SE(\hat{\beta})| > 1.96$.
Economic significance vs statistical significance: A coefficient can be statistically significant but economically trivial. Conversely, an imprecise estimate can be economically large but statistically insignificant. Good empirical work discusses both.
A practical rule: In modern applied economics, always use robust or clustered standard errors.
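To see why the rule matters, this sketch (plain Python, simulated heteroskedastic data) computes both the classical SE and the heteroskedasticity-robust (HC0) SE for a bivariate slope; with error variance growing in $X$, the classical formula understates the uncertainty:

```python
import math
import random

random.seed(5)
n = 10_000
x = [random.uniform(0, 10) for _ in range(n)]
# Heteroskedastic errors: the error spread grows with x.
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5 * xi) for xi in x]

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
a = sum(y) / n - b * xbar
resid = [yi - a - b * xi for xi, yi in zip(x, y)]

# Classical SE assumes a constant error variance; HC0 does not.
sigma2 = sum(e ** 2 for e in resid) / (n - 2)
se_classical = math.sqrt(sigma2 / sxx)
se_robust = math.sqrt(sum(((xi - xbar) * e) ** 2
                          for xi, e in zip(x, resid)) / sxx ** 2)
print(se_classical, se_robust)  # robust SE is larger here
```

Clustered standard errors extend the same idea to correlated errors within groups (states, villages, firms).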
Every empirical strategy has assumptions that can fail:
| Strategy | Key Assumption | Threat | Diagnostic |
|---|---|---|---|
| OLS | No omitted variables ($E[\varepsilon|X]=0$) | Confounding | Theory + sensitivity analysis |
| IV | Exclusion restriction | Direct effect of $Z$ on $Y$ | Cannot test directly; argue theoretically |
| IV | Relevance | Weak instruments | First-stage F > 10 |
| DiD | Parallel trends | Differential pre-trends | Plot pre-treatment trends |
| RD | No manipulation at cutoff | Sorting around threshold | McCrary density test |
| RCT | No attrition, no spillovers | Differential dropout; contamination | Balance checks, attrition analysis |
An economist wants to estimate the effect of Kaelani's new education policy (free textbooks for grades 1–6) on test scores. The policy was implemented in the eastern provinces in 2024 but not the western provinces.
Design: Difference-in-differences.
| | Pre-policy (2023) | Post-policy (2025) | Change |
|---|---|---|---|
| Eastern (treatment) | 55 | 63 | +8 |
| Western (control) | 52 | 56 | +4 |
| DiD estimate | | | +4 |
The DiD estimate is 4 points. Free textbooks raised test scores by 4 points, after controlling for the common upward trend.
Threats: (1) Parallel trends: Were eastern provinces already improving faster? (2) Spillovers: Did families near the border send children to eastern schools? (3) Composition changes: Did free textbooks change enrollment?
A complementary approach: regression discontinuity at the provincial border, comparing villages just on either side.
| Label | Equation | Description |
|---|---|---|
| Eq. 9.1 | $Y_i = \alpha + \beta X_i + \varepsilon_i$ | Structural equation |
| Eq. 9.2 | $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$ | OLS estimator |
| Eq. 9.3 | $E[\hat{\alpha}_1] = \beta_1 + \beta_2 \cdot Cov(X,Z)/Var(X)$ | Omitted variable bias formula |
| Eq. 9.5 | $\hat{\beta}_{IV} = Cov(Z,Y)/Cov(Z,X)$ | IV estimator (simple) |
| Eq. 9.6 | $\hat{\tau}_{DiD}$ = (treat change) − (control change) | DiD estimator |
| Eq. 9.7 | $Y_{it} = \alpha + \beta_1 Treat + \beta_2 Post + \tau(Treat \times Post) + \varepsilon$ | DiD regression |
| Eq. 9.8 | $\hat{\tau}_{RD} = \lim_{x \downarrow c} E[Y|X=x] - \lim_{x \uparrow c} E[Y|X=x]$ | RD estimator |
| Eq. 9.9 | $\hat{\tau}_{RCT} = \bar{Y}_{treat} - \bar{Y}_{control}$ | RCT estimator |
| Eq. 9.10 | $Var(\hat{\beta}) = \sigma^2(X'X)^{-1}$ | OLS variance |