Workflow·5 steps

Matching for causal inference (MatchIt)

Summary by StatsOtter

Preprocesses observational data by matching treated and control units on covariates, so downstream models depend less on modeling assumptions.

1

Input · what goes in

A data frame with a binary treatment indicator and the covariates to balance on.

treat age educ married
1 37 11 1
0 22 9 0
1 30 12 1
0 45 14 0
2

Pipeline · the recipe

1
Data prep

Load MatchIt and the lalonde data

Data preparation — shapes the raw inputs into what the estimator expects.

What happens here

Bring in the package and the canonical Lalonde job-training observational dataset.

Reads from the input data · Feeds into #2
Key code
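# Install once if needed: install.packages("MatchIt")
library("MatchIt")
# Load the Lalonde data that ships with MatchIt
data("lalonde", package = "MatchIt")
# Peek at the first rows to confirm the treatment and covariate columns
head(lalonde)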

2
Estimation

Run 1:1 nearest-neighbor propensity-score (PS) matching

Estimates the propensity score and forms the matched sample; no outcome data is used at this stage.

What happens here

Fit a logistic-regression propensity score model and match each treated unit to its nearest control without replacement, so each control is used at most once and unmatched controls are dropped.

Formula
e(X_i) = \Pr(T_i = 1 \mid X_i)
Reads from #1 · Feeds into #3
Key code
m.out1 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  distance = "glm")
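To make the two pieces of this step concrete, here is a minimal base-R sketch on simulated data: a logistic regression supplies the propensity score e(X_i), and a greedy loop pairs each treated unit with the closest unused control. The toy data frame, its coefficients, and the loop itself are illustrative assumptions. matchit() has its own implementation with more options (calipers, ratios, ordering), so treat this as a picture of the idea rather than of the package internals.

```r
set.seed(1)
n <- 200
toy <- data.frame(age = rnorm(n, 30, 8), educ = rnorm(n, 11, 2))
# Simulated treatment assignment (hypothetical coefficients)
toy$treat <- rbinom(n, 1, plogis(-1.5 - 0.05 * (toy$age - 30) + 0.2 * (toy$educ - 11)))

# Propensity score: fitted probabilities from a logistic regression
ps_model <- glm(treat ~ age + educ, data = toy, family = binomial())
toy$ps <- fitted(ps_model)

# Greedy 1:1 nearest-neighbor matching without replacement,
# processing treated units from the largest propensity score down
treated  <- which(toy$treat == 1)
treated  <- treated[order(toy$ps[treated], decreasing = TRUE)]
controls <- which(toy$treat == 0)
matched_control <- rep(NA_integer_, length(treated))
for (k in seq_along(treated)) {
  if (length(controls) == 0) break          # no controls left to match
  gap  <- abs(toy$ps[controls] - toy$ps[treated[k]])
  pick <- which.min(gap)                    # closest remaining control
  matched_control[k] <- controls[pick]
  controls <- controls[-pick]               # remove it: no replacement
}
matched_pairs <- data.frame(treated = treated, control = matched_control)
head(matched_pairs)
```

Because matching is greedy, the result can depend on the order in which treated units are processed; that is one reason to check balance afterwards rather than trust the pairs blindly.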

3
Diagnostic / pre-tests

Assess covariate balance

A pre-flight check — run this before trusting any estimate downstream.

What happens here

Inspect standardized mean differences before and after matching to check that matching reduced imbalance.

Formula
\text{SMD} = \frac{\bar{X}_t - \bar{X}_c}{s_t}
Reads from #2 · Feeds into #4
Key code
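# Balance tables before and after matching (SMDs, variance ratios, eCDF statistics)
summary(m.out1)
# Covariate density plots, treated vs control, for all data and matched data
plot(m.out1, type = "density", interactive = FALSE)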
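As a rough check on the formula, the age row of the example output further down can be reproduced by hand. The sketch below assumes, as the MatchIt documentation describes for the ATT, that s_t is the treated-group standard deviation from the unmatched sample and is reused after matching. The helper name smd is made up for this illustration.

```r
# Standardized mean difference from group means and the treated-group SD
smd <- function(mean_t, mean_c, sd_t) (mean_t - mean_c) / sd_t

# Back out s_t for age from the "All Data" row:
# means 25.8162 (treated) vs 28.0303 (control), reported SMD -0.3094
sd_t_age <- (25.8162 - 28.0303) / -0.3094

# Reuse the same s_t with the matched means: 25.8162 vs 25.3027
smd_age_matched <- smd(25.8162, 25.3027, sd_t_age)
round(smd_age_matched, 4)  # 0.0718, the value in the "Matched Data" row
```

Keeping the denominator fixed means a change in the SMD reflects a change in the mean difference, not a change in the spread of the matched sample.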

4
Data prep

Extract the matched dataset

Data preparation: shapes the matched sample into what the outcome model expects.

What happens here

Pull out the matched sample with matching weights and subclass identifiers for the outcome model.

Reads from #3 · Feeds into #5
Key code
m.data <- match.data(m.out1)
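A quick way to see what match.data() returns is to look at the columns it adds. The sketch below repeats the matching call so that it runs on its own, and it skips everything if MatchIt is not installed. The column names distance, weights and subclass are the ones MatchIt appends by default.

```r
if (requireNamespace("MatchIt", quietly = TRUE)) {
  library("MatchIt")
  data("lalonde", package = "MatchIt")
  m.out1 <- matchit(treat ~ age + educ + race + married +
                      nodegree + re74 + re75,
                    data = lalonde, method = "nearest", distance = "glm")
  m.data <- match.data(m.out1)

  # Matched units only: treated and control counts
  print(table(m.data$treat))
  # Each subclass is one matched pair under 1:1 matching
  print(all(table(m.data$subclass) == 2))
  # Columns added by match.data()
  print(head(m.data[, c("treat", "distance", "weights", "subclass")]))
}
```

With 1:1 matching without replacement the weights are all 1, but carrying the weights column through the outcome model keeps the same code valid for methods that produce unequal weights.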

5
Estimation

Estimate the ATT

The core estimate — where the causal quantity itself is computed.

What happens here

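Fit a weighted outcome regression with treatment-by-covariate interactions, then use avg_comparisons() from marginaleffects to average the predicted treated-versus-control contrast over the treated units. This gives the average treatment effect on the treated (ATT), with standard errors clustered on the matched pair (subclass).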

Formula
\tau_{\mathrm{ATT}} = E[Y(1) - Y(0) \mid T = 1]
Reads from #4 · Feeds into the final output
Key code
library("marginaleffects")
fit <- lm(re78 ~ treat * (age + educ + race + married +
                           nodegree + re74 + re75),
          data = m.data, weights = weights)
avg_comparisons(fit, variables = "treat",
                vcov = ~subclass,
                newdata = subset(m.data, treat == 1))
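The averaging that avg_comparisons() performs here can be pictured in base R: predict each treated unit's outcome with treat set to 1 and to 0, then average the difference. The sketch uses simulated data with a single covariate x (both are assumptions for illustration) and leaves out the cluster-robust standard errors, which marginaleffects handles in the real call.

```r
set.seed(42)
n <- 300
d <- data.frame(x = rnorm(n), treat = rbinom(n, 1, 0.4))
# Simulated outcome with a treatment effect that varies with x
d$y <- 1 + 2 * d$treat + 0.5 * d$x + 0.3 * d$treat * d$x + rnorm(n)

# Outcome model with a treatment-by-covariate interaction
fit <- lm(y ~ treat * x, data = d)

# Predict for the treated units under both treatment values
treated_units <- subset(d, treat == 1)
p1 <- predict(fit, newdata = transform(treated_units, treat = 1))
p0 <- predict(fit, newdata = transform(treated_units, treat = 0))

# ATT estimate: average predicted contrast among the treated
att_hat <- mean(p1 - p0)
att_hat
```

Averaging over the treated units only is what makes this an ATT rather than an average effect for the whole sample; with interactions in the model the two generally differ.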

3

Output · what you get · 4 figures

Fig 1. Love plot: standardized mean differences drop below the balance threshold after matching, far from the unmatched points.
Fig 2. Jitter plot of propensity scores: no treated units are dropped while many low-propensity controls are pruned.
Fig 3. Empirical-CDF plots for educ, married and re75 comparing treated vs control in the matched sample.
Fig 4. Mirrored propensity-score histogram for the treated and control groups (drawn with cobalt).
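For reference, plots of these four kinds can be drawn roughly as follows. This is a sketch that assumes MatchIt and cobalt are installed (it is skipped otherwise) and repeats the matching call so it runs on its own. The exact arguments behind the published figures are not given on this page, so the choices below are guesses.

```r
plots_drawn <- FALSE
if (requireNamespace("MatchIt", quietly = TRUE) &&
    requireNamespace("cobalt", quietly = TRUE)) {
  library("MatchIt")
  data("lalonde", package = "MatchIt")
  m.out1 <- matchit(treat ~ age + educ + race + married +
                      nodegree + re74 + re75,
                    data = lalonde, method = "nearest", distance = "glm")

  # Love plot of standardized mean differences, before vs after matching
  plot(summary(m.out1))
  # Jitter plot of propensity scores for matched and unmatched units
  plot(m.out1, type = "jitter", interactive = FALSE)
  # Empirical-CDF plots for selected covariates
  plot(m.out1, type = "ecdf", which.xs = c("educ", "married", "re75"))
  # Mirrored propensity-score histogram drawn with cobalt
  print(cobalt::bal.plot(m.out1, var.name = "distance", which = "both",
                         type = "histogram", mirror = TRUE))
  plots_drawn <- TRUE
}
```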

Figures reproduced from the package's official documentation — unofficial community showcase; all credit to the original authors.

Result · the numbers

\hat\tau_{\mathrm{ATT}}=\frac{1}{n_1}\sum_{i:\,T_i=1}\Big(Y_i-\sum_{j:\,T_j=0} w_{ij}\,Y_j\Big)
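With 1:1 matching, each treated unit has exactly one matched control with weight 1, so the estimator reduces to the mean within-pair difference in outcomes. A tiny worked example with made-up outcomes:

```r
# Hypothetical outcomes for three matched pairs
y_treated <- c(10, 12, 9)
y_control <- c(8, 11, 10)   # matched control for each treated unit, in order

# Weight matrix: row i = treated unit, column j = control,
# with a 1 on each matched pair and 0 elsewhere
w <- diag(3)

# Mean over treated units of (own outcome - weighted control outcome)
att_hat <- mean(y_treated - as.vector(w %*% y_control))
att_hat  # (2 + 1 - 1) / 3
```

Other matching methods change only the weights: for example, matching each treated unit to two controls would put 1/2 in two columns of each row.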

⚠️ Unofficial community showcase of MatchIt (docs). Not affiliated with the authors — all credit to Gary King & coauthors; this summarizes public documentation.

What it does: MatchIt selects matched subsamples of treated and control units with similar covariate distributions, so that a subsequent parametric model (e.g. a regression) is less sensitive to its specification.

How it works: It supports many methods, including nearest-neighbor and optimal propensity-score matching, exact and coarsened exact matching, genetic matching, full matching, and subclassification. It then reports covariate balance before and after matching (standardized mean differences, eCDF statistics, Love plots). Effects are estimated on the matched data, typically with matching weights and robust or cluster-robust standard errors.

Assumptions: Causal interpretation requires unconfoundedness (selection on observables) and overlap (common support) between groups. Matching addresses only observed covariates, not unmeasured confounding.

MatchIt implements the "matching as nonparametric preprocessing" recommendations of Ho, Imai, King & Stuart.

What you get — A matched dataset (with weights/subclasses) plus balance diagnostics for estimating treatment effects.

Example output

Call:
matchit(formula = treat ~ age + educ + race + married + nodegree +
    re74 + re75, data = lalonde, method = "nearest", distance = "glm")

Summary of Balance for All Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance          0.5774        0.1822          1.7941     0.9211    0.3774   0.6444
age              25.8162       28.0303         -0.3094     0.4400    0.0813   0.1577
educ             10.3459       10.2354          0.0550     0.4959    0.0347   0.1114
raceblack         0.8432        0.2028          1.7615          .    0.6404   0.6404
married           0.1892        0.5128         -0.8263          .    0.3236   0.3236
re74           2095.5737     5619.2365         -0.7211     0.5181    0.2248   0.4470

Summary of Balance for Matched Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance          0.5774        0.3629          0.9739     0.7566    0.1321   0.4216
age              25.8162       25.3027          0.0718     0.4568    0.0847   0.2541
married           0.1892        0.2108         -0.0552          .    0.0216   0.0216
re74           2095.5737     2342.1076         -0.0505     1.3289    0.0469   0.2757

Sample Sizes:
          Control Treated
All           429     185
Matched       185     185
Unmatched     244       0

Links: package · paper

Discussion (2)

  • 3

    Matching as the design stage — outcome-free — is the discipline people skip. MatchIt makes it the path of least resistance.

    2

    And match.data() → any outcome model. Pairs perfectly with cobalt for the balance plots.

  • 6

    Nearest, optimal, full, genetic — all behind one matchit() call. Great teaching tool.