StatsOtter Causal inference workflows
Workflow·3 steps

Predicting race/ethnicity from name and geography (wru)

Summary by StatsOtter

wru ("Who Are You?") predicts an individual's probable race/ethnicity from surname, optional first/middle name, and geolocation by applying Bayes' rule (Bayesian Improved Surname Geocoding, BISG).

1

Input · what goes in

A data frame of individuals with surname (and optionally first/middle name) plus geographic identifiers: a two-letter state abbreviation and, depending on the level used, county, tract, or block FIPS codes.

 surname state county  tract
   Smith    NJ    021 000100
  Garcia    CA    037 207103
  Nguyen    TX    201 412900
     Lee    NY    061 010300
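
A minimal sketch of building an input like this in base R, assuming the column names shown in the example rows above. The object name `people` is hypothetical. The FIPS codes are stored as character strings so that leading zeros are not lost.

```r
# Hypothetical input frame mirroring the example rows above.
# FIPS codes are kept as character so leading zeros are preserved.
people <- data.frame(
  surname = c("Smith", "Garcia", "Nguyen", "Lee"),
  state   = c("NJ", "CA", "TX", "NY"),
  county  = c("021", "037", "201", "061"),
  tract   = c("000100", "207103", "412900", "010300"),
  stringsAsFactors = FALSE
)

# Basic shape checks: 3-character county and 6-character tract codes.
stopifnot(all(nchar(people$county) == 3), all(nchar(people$tract) == 6))
str(people)
```

Reading such a file from CSV with default settings would typically turn "021" into the number 21, so specifying character column types at import is the safer choice.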
2

Pipeline · the recipe


1
Data prep

Load wru and the voter file

Data preparation — shapes the raw inputs into what the estimator expects.

What happens here

Load the package and the bundled example voter file containing surnames and geographic identifiers.

Reads from the input data · Feeds into #2
Key code
# Install:  install.packages("wru")
library(wru)
data(voters)

Reference / docs ↗

2
Estimation

Predict race with BISG

The core estimate: where the posterior race probabilities themselves are computed.

What happens here

Call predict_race() to combine the surname likelihood with Census tract-level racial composition (Bayesian Improved Surname Geocoding).

Formula
P(R \mid S, G) = \frac{P(R \mid S)\, P(G \mid R)}{\sum_{r} P(r \mid S)\, P(G \mid r)}
Reads from #1 · Feeds into #3
Key code
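# Tract-level BISG: the Census API key is used to fetch tract racial composition.
# party = "PID" names the voter-file column that holds party ID.
predict_race(voter.file = voters, census.geo = "tract",
             census.key = Sys.getenv("CENSUS_API_KEY"), party = "PID")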
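
The call above reads the key from the `CENSUS_API_KEY` environment variable, so an unset variable silently passes an empty string. A small guard in base R might look like the sketch below; the helper name `require_census_key` is hypothetical and not part of wru.

```r
# Hypothetical helper (not part of wru): fail early if the key is missing.
require_census_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var)
  if (!nzchar(key)) {
    stop("Set ", var, " (for example in ~/.Renviron) before calling predict_race().")
  }
  invisible(key)
}
```

Calling `require_census_key()` before the prediction step turns a confusing downstream failure into an immediate, readable error.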

Reference / docs ↗

3
Reporting

Inspect posterior probabilities

Reporting — turn the numbers into a figure or table a reader can act on.

What happens here

The returned data.frame appends posterior race probabilities (pred.whi/bla/his/asi/oth) that sum to one per voter.

Reads from #2 · Feeds into the final output
Key code
# Surname-only variant: uses no geography, so no Census API key is needed.
head(predict_race(voter.file = voters, surname.only = TRUE))

Reference / docs ↗

3

Output · what you get

Fig 1. Predicted race/ethnicity probabilities (BISG) for the example voters.

Result figure rendered by StatsOtter from the package's documented example — unofficial community showcase; all credit to the original authors.

Result · the numbers

\Pr(R\mid S,G)=\frac{\Pr(S\mid R)\,\Pr(R\mid G)}{\sum_{r}\Pr(S\mid r)\,\Pr(r\mid G)}
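
This expression and the formula shown in step 2 are the same posterior written two ways. A short derivation, assuming surname S and geography G are conditionally independent given race R, and treating each line as proportional in R:

```latex
\begin{aligned}
\Pr(R \mid S, G) &\propto \Pr(S, G \mid R)\,\Pr(R)
  = \Pr(S \mid R)\,\Pr(G \mid R)\,\Pr(R) \\[4pt]
\Pr(S \mid R)\,\bigl[\Pr(G \mid R)\,\Pr(R)\bigr]
  &= \Pr(S \mid R)\,\Pr(R \mid G)\,\Pr(G)
  \;\propto\; \Pr(S \mid R)\,\Pr(R \mid G) \\[4pt]
\bigl[\Pr(S \mid R)\,\Pr(R)\bigr]\,\Pr(G \mid R)
  &= \Pr(R \mid S)\,\Pr(S)\,\Pr(G \mid R)
  \;\propto\; \Pr(R \mid S)\,\Pr(G \mid R)
\end{aligned}
```

Because Pr(G) and Pr(S) do not depend on R, they cancel when each form is normalized over r, which gives the two displayed formulas.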

⚠️ Unofficial community showcase of wru (docs). Not affiliated with the authors — all credit to Kosuke Imai & coauthors; this summarizes public documentation.

What it does: wru (Who Are You) produces probabilistic predictions of an individual's racial/ethnic category when race is unobserved, as is common in voter files, administrative records, and audits of disparities.

How it works: it applies Bayesian Improved Surname Geocoding (BISG), combining Census surname (and optionally first/middle-name) race distributions with the racial composition of the person's geographic unit (state, county, tract, or block) via Bayes' Rule. Newer versions add fully Bayesian name-and-geography models and embedding-based features. The core call predict_race() returns, per person, posterior probabilities of being White, Black, Hispanic, Asian, or Other.

Assumptions: accuracy depends on the conditional independence of surname and geography given race, on correct and current Census reference tables, and on representative geocoding; predictions are population-level probabilities, not certainties, and can be biased for groups or regions where the reference data fit poorly.
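
The Bayes' rule update described above can be sketched in a few lines of base R. All numbers here are made up for illustration and are not Census values; the variable names are hypothetical and this is not how wru is implemented internally.

```r
# Illustrative numbers only (not Census values).
races <- c("whi", "bla", "his", "asi", "oth")

# P(R | S): race distribution for a hypothetical surname.
p_race_given_surname <- c(whi = 0.05, bla = 0.01, his = 0.02, asi = 0.90, oth = 0.02)

# P(G | R): share of each racial group living in a hypothetical tract.
p_geo_given_race <- c(whi = 0.0004, bla = 0.0001, his = 0.0002, asi = 0.0010, oth = 0.0003)

# Bayes' rule: multiply the two terms, then normalize over races.
unnormalized <- p_race_given_surname[races] * p_geo_given_race[races]
posterior <- unnormalized / sum(unnormalized)

round(posterior, 4)
```

The normalization step is what makes the five probabilities sum to one for each person, regardless of how small the tract-level terms are.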

What you get — Per-individual posterior probabilities for each racial/ethnic category (pred.whi, pred.bla, pred.his, pred.asi, pred.oth).

Example output

  surname state county  tract age sex pred.whi pred.bla pred.his pred.asi pred.oth
   Khanna    NJ    021 004000  29   0   0.0676   0.0043   0.0082   0.8668   0.0531
     Imai    NJ    021 004501  40   0   0.0812   0.0024   0.0689   0.7375   0.1100
  Velasco    NY    061 004800  33   0   0.0594   0.0026   0.8227   0.1051   0.0102
  Fifield    NJ    021 004501  27   0   0.9356   0.0022   0.0285   0.0078   0.0259
     Zhou    NJ    021 004501  28   1   0.0098   0.0018   0.0007   0.9820   0.0058
 Ratkovic    NJ    021 004000  35   0   0.9187   0.0108   0.0108   0.0108   0.0488
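
As a sanity check on output of this shape, the sketch below re-enters the probability columns from the table above, confirms that each row sums to one up to rounding, and labels each person with the most probable category. The names `preds`, `prob_cols`, and `top` are hypothetical, and the label is a convenience summary rather than an observed race.

```r
# Values copied from the example output table above.
preds <- data.frame(
  surname  = c("Khanna", "Imai", "Velasco", "Fifield", "Zhou", "Ratkovic"),
  pred.whi = c(0.0676, 0.0812, 0.0594, 0.9356, 0.0098, 0.9187),
  pred.bla = c(0.0043, 0.0024, 0.0026, 0.0022, 0.0018, 0.0108),
  pred.his = c(0.0082, 0.0689, 0.8227, 0.0285, 0.0007, 0.0108),
  pred.asi = c(0.8668, 0.7375, 0.1051, 0.0078, 0.9820, 0.0108),
  pred.oth = c(0.0531, 0.1100, 0.0102, 0.0259, 0.0058, 0.0488),
  stringsAsFactors = FALSE
)

prob_cols <- grep("^pred\\.", names(preds), value = TRUE)

# Each row should sum to one, up to rounding in the printed table.
stopifnot(all(abs(rowSums(preds[prob_cols]) - 1) < 1e-3))

# Most probable category per person.
top_idx <- max.col(as.matrix(preds[prob_cols]), ties.method = "first")
preds$top <- sub("^pred\\.", "", prob_cols[top_idx])
preds[c("surname", "top")]
```

For aggregate analyses, working with the full probability vectors is usually preferable to collapsing each person to a single label, since the label discards the uncertainty the model reports.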

Links: package · paper
