
Multiple imputation of missing data (Amelia)

Summary by StatsOtter

Fills in missing values via fast bootstrap-EM multiple imputation, producing several complete datasets you analyze and combine.

1

Input · what goes in

A data frame with missing values (NA), optionally with declared time-series/cross-section ID variables.

Data format & example
year  country  gdp  trade
1990  A        5.2  NA
1991  A        NA   41.0
1990  B        3.1  28.4
1991  B        3.4  NA
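
As a minimal sketch, the example table above can be built as an R data frame to check how much is missing before imputing. The object names (`toy`, `frac_missing`, `n_complete`) are illustrative and not part of Amelia; only base R is used.

```r
# Toy panel mirroring the example table; NA marks a missing cell.
toy <- data.frame(
  year    = c(1990, 1991, 1990, 1991),
  country = c("A", "A", "B", "B"),
  gdp     = c(5.2, NA, 3.1, 3.4),
  trade   = c(NA, 41.0, 28.4, NA)
)

# Share of missing values in each column.
frac_missing <- colMeans(is.na(toy))
print(frac_missing)

# Number of rows that listwise deletion would keep (no NA in any column).
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Here only one of the four rows is fully observed, which is the situation multiple imputation is meant to address: listwise deletion would discard most of the data.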
2

Pipeline · the recipe


1
Data prep

Load Amelia and the freetrade data

Data preparation — shapes the raw inputs into what the estimator expects.

What happens here

Load the package and the freetrade panel of trade policy in Asian countries, which has missing values in tariff and other columns.

Reads from: the input data · Feeds into: #2
Key code
# Install:  install.packages("Amelia")
library(Amelia)
data(freetrade)
summary(freetrade)

2
Estimation

Run multiple imputation with m = 5

The imputation step, where the completed datasets used by every later step are generated.

What happens here

Impute five completed datasets, declaring the panel/time structure with cs and ts.

Formula (pooling rule, applied to the results in step 5)
\bar Q=\frac1m\sum_{k=1}^{m}\hat Q_k,\qquad T=\bar U+\Big(1+\tfrac1m\Big)B
Reads from: #1 · Feeds into: #3
Key code
a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country")
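
For panel data, the time structure can also be modeled explicitly. The sketch below uses options documented for `amelia()` (`polytime`, `intercs`, `p2s`). The seed value and the object name `a.out.time` are arbitrary choices for illustration, and whether a country-specific linear trend suits a given dataset is an assumption to check with the diagnostics.

```r
# Runs only if Amelia is installed; see install.packages("Amelia").
if (requireNamespace("Amelia", quietly = TRUE)) {
  library(Amelia)
  data(freetrade)

  set.seed(1234)  # arbitrary seed so the imputations are reproducible

  # polytime = 1: linear time trend; intercs = TRUE: the trend varies by country.
  # p2s = 0 suppresses progress output.
  a.out.time <- amelia(freetrade, m = 5, ts = "year", cs = "country",
                       polytime = 1, intercs = TRUE, p2s = 0)

  # Each element of $imputations is one completed data frame.
  print(length(a.out.time$imputations))

  # Count of NA values left in tariff in the first completed dataset.
  print(sum(is.na(a.out.time$imputations[[1]]$tariff)))
}
```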

3
Diagnostic / pre-tests

Check imputation diagnostics

A pre-flight check — run this before trusting any estimate downstream.

What happens here

Compare the observed and imputed densities and run an overimputation check to evaluate imputation quality.

Reads from: #2 · Feeds into: #4
Key code
compare.density(a.out, var = "tariff")
overimpute(a.out, var = "tariff")
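
As a rough numeric companion to the density plot, one can compare the mean of the observed tariff values with the mean of the imputed values in each completed dataset. This is a sketch rather than a formal test: the names `was_missing`, `observed_mean`, and `imputed_means` and the seed are illustrative, and under MAR the two means are not required to match.

```r
# Runs only if Amelia is installed; see install.packages("Amelia").
if (requireNamespace("Amelia", quietly = TRUE)) {
  library(Amelia)
  data(freetrade)

  set.seed(1234)  # arbitrary seed for reproducibility
  a.out <- amelia(freetrade, m = 5, ts = "year", cs = "country", p2s = 0)

  # Cells of tariff that were missing in the original data.
  was_missing <- is.na(freetrade$tariff)

  # Mean of the observed values only.
  observed_mean <- mean(freetrade$tariff, na.rm = TRUE)

  # Mean of the imputed values, one number per completed dataset.
  imputed_means <- sapply(a.out$imputations,
                          function(d) mean(d$tariff[was_missing]))

  print(observed_mean)
  print(imputed_means)
}
```

Large gaps between the two are a prompt to look at the plots more closely, not evidence of failure on their own.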

4
Estimation

Fit the model on each imputed dataset

The analysis model is fit here, once per completed dataset.

What happens here

Run the same regression on all five completed datasets and collect coefficients and standard errors.

Reads from: #3 · Feeds into: #5
Key code
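# Collect one row of coefficients and one row of standard errors per imputation.
b.out <- NULL; se.out <- NULL
for (i in seq_len(a.out$m)) {
  # Same model specification on every completed dataset.
  ols <- lm(tariff ~ polity + pop + gdp.pc + year + country,
            data = a.out$imputations[[i]])
  b.out <- rbind(b.out, coef(ols))
  se.out <- rbind(se.out, coef(summary(ols))[, 2])  # column 2 = Std. Error
}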

5
Inference

Combine estimates with Rubin's rules

Uncertainty quantification — standard errors, intervals, and aggregation.

What happens here

Use mi.meld() to pool the per-imputation estimates into a single point estimate and standard error.

Formula
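\bar Q=\frac1m\sum_{k=1}^{m}\hat Q_k,\qquad T=\bar U+\Big(1+\tfrac1m\Big)B
\bar U=\frac1m\sum_{k=1}^{m}\hat U_k,\qquad B=\frac{1}{m-1}\sum_{k=1}^{m}\big(\hat Q_k-\bar Q\big)^2
Here \hat Q_k and \hat U_k are the estimate and its squared standard error from imputation k; the pooled standard error is \sqrt{T}.
Reads from: #4 · Feeds into: the final output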
Key code
combined <- mi.meld(q = b.out, se = se.out)
combined$q.mi
combined$se.mi
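
To make the pooling arithmetic concrete, the sketch below applies Rubin's rules by hand in base R to a single coefficient. The three estimates and standard errors are made-up numbers for illustration only, not results from freetrade, and the variable names are hypothetical.

```r
# Made-up per-imputation results for ONE coefficient (m = 3); illustration only.
q  <- c(1.0, 1.2, 0.8)   # point estimates, one per imputation
se <- c(0.5, 0.5, 0.5)   # their standard errors
m  <- length(q)

q_bar <- mean(q)                    # pooled point estimate
u_bar <- mean(se^2)                 # average within-imputation variance
b     <- var(q)                     # between-imputation variance (divides by m - 1)
t_var <- u_bar + (1 + 1 / m) * b    # total variance
se_pooled <- sqrt(t_var)            # pooled standard error

print(c(estimate = q_bar, se = se_pooled))
```

The pooled standard error is larger than any single imputation's standard error of 0.5 because the between-imputation spread is added on top of the within-imputation variance. With Amelia loaded, passing the same numbers to `mi.meld()` as one-column matrices should give matching values.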

3

Output · what you get

Fig 1. Observed vs imputed distribution of tariff across the five imputations.

Result figure rendered by StatsOtter from the package's documented example — unofficial community showcase; all credit to the original authors.

Result · the numbers


⚠️ Unofficial community showcase of Amelia. Not affiliated with the authors; all credit to James Honaker, Gary King, and Matthew Blackwell. This summarizes public documentation.

What it does: Amelia (Amelia II) performs multiple imputation of missing data for cross-sectional, time-series, and time-series-cross-section datasets, generating m completed datasets whose differences reflect imputation uncertainty.

How it works: It assumes the data are jointly multivariate normal and missing at random, then uses an EMB (expectation-maximization with bootstrapping) algorithm: it draws bootstrap resamples of the data and runs EM on each one. The package authors describe this as faster and more reliable than MCMC-based approaches while giving comparable answers. It supports priors, transformations for skewed or bounded variables, and time and cross-section structure via polynomials of time and lags/leads.

Assumptions: Multivariate normality (after transformation) and missing at random (MAR).

Output: m completed datasets plus diagnostics (overimputation, density comparisons) to assess imputation quality. Analysts fit their model on each imputed dataset and combine results with Rubin's rules.

What you get — m completed datasets (imputations) plus diagnostics; analyze each and pool with Rubin's rules.

Example output

Amelia output with 5 imputed datasets.
Return code:  1
Message:  Normal EM convergence.

Chain Lengths:
--------------
Imputation 1:  17
Imputation 2:  20
Imputation 3:  16
Imputation 4:  18
Imputation 5:  19

Rows after Listwise Deletion:  96
Rows after Imputation:  171
Patterns of missingness in the data:  8

Fraction Missing for original variables:
-----------------------------------------

           Fraction Missing
tariff            0.34502924
polity            0.01169591
intresmi          0.07602339
signed            0.01754386
fiveop            0.10526316
