StatsOtter Causal inference workflows
Workflow·5 steps

Coarsened exact matching (CEM)

Summary by StatsOtter

Temporarily coarsens each covariate into bins, exact-matches treated and controls within bins, then estimates effects on the matched data.

1

Input · what goes in

A data frame with a treatment indicator and covariates, plus optional coarsening (cutpoints/groupings) per covariate.

treated age sex income
1 34 F 52000
0 31 F 48000
1 60 M 75000
0 58 M 71000
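
To make the coarsening idea concrete, here is a minimal base-R sketch on the four example rows above. The cutpoints (`age_cuts`, `income_cuts`) and the `stratum` column are assumptions chosen for illustration; they are not the bins cem() would pick automatically.

```r
# Toy input mirroring the example rows
df <- data.frame(
  treated = c(1, 0, 1, 0),
  age     = c(34, 31, 60, 58),
  sex     = c("F", "F", "M", "M"),
  income  = c(52000, 48000, 75000, 71000)
)

# Hypothetical cutpoints (assumed for illustration only)
age_cuts    <- c(0, 30, 40, 50, 60, Inf)
income_cuts <- c(0, 40000, 60000, 80000, Inf)

# Temporarily coarsen the continuous covariates; sex is already categorical
coarse <- data.frame(
  age    = cut(df$age,    breaks = age_cuts),
  sex    = df$sex,
  income = cut(df$income, breaks = income_cuts)
)

# A stratum is the combination of all coarsened values
df$stratum <- interaction(coarse, drop = TRUE)

# Cross-tabulate strata against treatment to see which strata hold both groups
print(table(df$stratum, df$treated))
```

With these assumed bins, the two women in their thirties share one stratum and the two men in their late fifties share another, so each stratum contains one treated and one control unit.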
2

Pipeline · the recipe


1
Data prep

Load cem and the LaLonde data

Data preparation — shapes the raw inputs into what the estimator expects.

What happens here

Load the package and the LL (LaLonde) dataset, and define the variables to exclude from the imbalance check: the treatment indicator and the outcome re78.

Reads from the input data Feeds into #2
Key code
# Install:  install.packages("cem")
library(cem)
data(LL)
todrop <- c("treated", "re78")

2
Diagnostic / pre-tests

Measure imbalance before matching

A pre-flight check — run this before trusting any estimate downstream.

What happens here

Compute the multivariate L1 imbalance and per-variable differences on the raw data.

Formula
L_1 = \frac{1}{2} \sum_{\ell} |f_{\ell} - g_{\ell}|
Reads from #1 Feeds into #3
Key code
imbalance(group = LL$treated, data = LL, drop = todrop)
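
As a rough illustration of the L1 formula in this step, the sketch below computes it from stratum labels in base R. The helper `l1_imbalance()` and the toy vectors are hypothetical; imbalance() in cem chooses its own binning, which this sketch does not attempt to reproduce.

```r
# Hypothetical helper: L1 from stratum labels and a 0/1 treatment vector
l1_imbalance <- function(stratum, treated) {
  tab <- table(stratum, factor(treated, levels = c(0, 1)))
  g <- tab[, "0"] / sum(tab[, "0"])  # control relative frequency per stratum
  f <- tab[, "1"] / sum(tab[, "1"])  # treated relative frequency per stratum
  0.5 * sum(abs(f - g))
}

# Toy data: stratum "c" holds a control unit but no treated unit
stratum <- c("a", "a", "b", "b", "c")
treated <- c(1, 0, 1, 0, 0)

l1 <- l1_imbalance(stratum, treated)
print(l1)
```

The statistic is 0 when the two groups have identical distributions over strata and 1 when they share no stratum at all.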

3
Estimation

Coarsen and match with cem()

The core estimate — where the causal quantity itself is computed.

What happens here

Run coarsened exact matching with automatic binning, excluding the outcome re78 from the matching variables.

Formula
\hat\tau=\sum_{\ell\in\mathcal L}\frac{n_\ell^{T}}{n^{T}}\big(\bar Y_\ell^{T}-\bar Y_\ell^{C}\big)
Reads from #2 Feeds into #4
Key code
mat <- cem(treatment = "treated", data = LL, drop = "re78")
mat
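
The matching step itself can be sketched in base R, assuming stratum labels are already available. The helper `cem_weights()` is hypothetical and is not the package's internal code; it follows the weighting rule described in the cem documentation, where treated units get weight 1 and matched controls are rescaled so that each stratum's controls stand in for its treated units.

```r
# Hypothetical helper: keep strata with both groups and compute CEM-style weights
cem_weights <- function(stratum, treated) {
  n_t <- ave(treated, stratum, FUN = sum)      # treated count in each unit's stratum
  n_c <- ave(1 - treated, stratum, FUN = sum)  # control count in each unit's stratum
  matched <- n_t > 0 & n_c > 0                 # stratum contains both groups

  m_t <- sum(treated[matched])                 # matched treated units overall
  m_c <- sum(1 - treated[matched])             # matched control units overall

  w <- numeric(length(treated))                # unmatched units keep weight 0
  w[matched & treated == 1] <- 1
  ctrl <- matched & treated == 0
  w[ctrl] <- (m_c / m_t) * (n_t[ctrl] / n_c[ctrl])
  w
}

# Toy data: strata "c" and "d" each contain only one group and are pruned
stratum <- c("a", "a", "a", "b", "b", "c", "d")
treated <- c(1, 0, 0, 1, 0, 1, 0)

w <- cem_weights(stratum, treated)
print(w)
```

In this toy case the control weights sum to the number of matched controls, and the pruned units in strata "c" and "d" receive weight 0.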

4
Diagnostic / pre-tests

Check imbalance after matching

A pre-flight check — run this before trusting any estimate downstream.

What happens here

Re-evaluate the L1 statistic on the matched, weighted sample to confirm balance improved.

Reads from #3 Feeds into #5
Key code
# May be NULL, depending on the package version, unless cem() was run with eval.imbalance = TRUE
mat$imbalance

5
Inference

Estimate the SATT

Uncertainty quantification — standard errors, intervals, and aggregation.

What happens here

Use att() to estimate the sample average treatment effect on the treated (SATT) by regressing the outcome on treatment in the matched sample, using the CEM weights.

Formula
\mathrm{SATT} = \sum_{\ell\in\mathcal L} \frac{n_{\ell}^{T}}{n^{T}} \left(\bar{Y}_{\ell}^{T} - \bar{Y}_{\ell}^{C}\right)
Reads from #4 Feeds into the final output
Key code
est <- att(mat, re78 ~ treated, data = LL)
est
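
To connect the SATT formula to the regression that att() reports, the sketch below computes the stratum-weighted difference in means on toy data and compares it with the treatment coefficient from a weighted lm(). The data frame `d` and all values in it are made up for illustration, and the standard errors att() reports are not reproduced here.

```r
# Toy matched sample: two strata, each with treated and control units
d <- data.frame(
  stratum = c("a", "a", "a", "b", "b"),
  treated = c(1, 0, 0, 1, 0),
  y       = c(10, 6, 8, 20, 15)
)

# Stratum-by-stratum version: weight each difference by the stratum's treated share
by_s  <- split(d, d$stratum)
diffs <- sapply(by_s, function(s) mean(s$y[s$treated == 1]) - mean(s$y[s$treated == 0]))
n_t   <- sapply(by_s, function(s) sum(s$treated))
satt  <- sum(n_t / sum(n_t) * diffs)

# Weighted-regression version: treated weight 1, controls rescaled per stratum
n_ts <- ave(d$treated, d$stratum, FUN = sum)
n_cs <- ave(1 - d$treated, d$stratum, FUN = sum)
d$w  <- ifelse(d$treated == 1, 1,
               (sum(1 - d$treated) / sum(d$treated)) * n_ts / n_cs)
fit  <- lm(y ~ treated, data = d, weights = w)

print(satt)
print(coef(fit)[["treated"]])
```

Both routes give the same point estimate on this toy sample, which is why a weighted regression of the outcome on treatment can serve as the SATT estimator once the CEM weights are in place.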

3

Output · what you get

Fig 1. SATT on the CEM-matched sample with its 95% confidence interval.

Result figure rendered by StatsOtter from the package's documented example — unofficial community showcase; all credit to the original authors.

Result · the numbers

\hat\tau=\sum_{\ell\in\mathcal L}\frac{n_\ell^{T}}{n^{T}}\big(\bar Y_\ell^{T}-\bar Y_\ell^{C}\big)

⚠️ Unofficial community showcase of cem (docs). Not affiliated with the authors — all credit to Gary King & coauthors; this summarizes public documentation.

What it does: CEM implements Coarsened Exact Matching, a monotonic-imbalance-bounding (MIB) matching method that improves covariate balance between treated and control groups in observational studies.

How it works: Each covariate is temporarily coarsened into substantively meaningful bins (e.g. age into decades), units are sorted into strata defined by all coarsened covariates, and only strata containing both treated and control units are retained. The original (uncoarsened) values are then used for analysis, with weights correcting for differing stratum sizes. Tightening the imbalance bound on one variable leaves the bounds on the other variables unchanged (the MIB property), and the user directly controls the balance/sample-size tradeoff via the coarsening.

Assumptions: Unconfoundedness given observed covariates, and common support. Pruning unmatched units shrinks the sample but improves balance; when treated units are pruned, the estimate applies only to the treated units that remain matched.

Output: matched strata, CEM weights, an imbalance measure, and effect estimates (commonly the ATT) on the matched sample.

What you get — Matched strata with CEM weights and an imbalance statistic, used to estimate treatment effects (e.g. ATT).

Example output

G0      G1
429     185
Matched Data
G0      G1
222     163

Linear regression model on CEM matched data:

SATT point estimate: 550.625110 (p.value=0.347096)
95% conf. interval: [-606.207019, 1707.457238]

Multivariate L1 distance after matching: 0.59

Number of strata: 162
Number of matched strata: 67

           G0  G1
All       429 185
Matched   222 163
Unmatched 207  22

