We assign institutions at the paper-author level: for a given paper, author A can be attached to Berkeley while author B is attached to Stanford if the underlying authorship data records those affiliations separately. The file is not an all-paper-institutions-to-all-authors expansion.
How well does it work?
We validate the author-institution file against faculty rosters for Stanford, UC Berkeley, UC Davis, and UCLA. The main check asks: when the roster data say a matched person is at one of these schools in a stable year, does the paper-author affiliation include that school?
Conservative validation rows
Stable person-years in the four-school faculty panel.
The current imputed affiliation file uses a window-based modal vote that can override individual raw rows when an author's surrounding publication history is strongly concentrated at a different school.
Multi-school coauthor stress test
Papers with matched authors from different validation schools.
This is deliberately hard: it directly tests whether a Stanford-Berkeley paper assigns each school to the correct author.
Where authors are concentrated in 1980
These charts use the institution sample that clears the publication threshold used in the codebase. We count distinct authors attached to each institution in 1980, keeping only authors with at least 10 OpenAlex publications. If an author is attached to two sample institutions in 1980, they count once for each institution. The institution threshold is computed from sample papers in 1960-1969.
Authors per institution
Linear scale, sorted by number of 1980 authors.
Cumulative share by number of universities
Share of sample author placements covered by the top N universities.
Largest institutions in 1980
How the institution file is built
We want one row per (paper, author) pair giving the
university we believe that author was at when the paper came out.
The starting point is the raw OpenAlex authorship table: each
coauthorship row says “author a appeared on paper
p with affiliation tag i”. A paper-author
pair can have zero such rows (OpenAlex never
tagged an affiliation), one, or
several (joint appointments, or OpenAlex split a
single department into a university and a medical school). We
keep only rows whose institution OpenAlex classifies as
education — we throw out hospitals, government
labs, and companies, because we want a university affiliation
file, not a full employer history.
Step 1: build the author's own publication history per year
For each author we count, for every year, how many of their
papers in that year were raw-tagged with each university. So if
Jane Doe wrote 3 papers in 1980 and all three were tagged
Stanford, then for her, (year=1980,
institution=Stanford, count=3).
Step 2: the ±3-year window vote
Now take any one of her papers, say from 1981. We pool her institution counts across 1978–1984 — a seven-year window centered on the paper. Suppose those counts add up to 8 Stanford and 1 MIT. The window total is 9, the window-modal university (the one that wins the vote) is Stanford with 8 papers, and its window share is 8/9 ≈ 89%.
Step 3: rules, applied in order — first match wins
-
Strong-window rule. If the window-modal
university wins by ≥ 75% of the window
and shows up on at least 4 of the author's papers in
that window, we assign it. This is strong enough that we use it
even if the raw row on this paper says something
different. That different raw row is usually a coauthor
leak: OpenAlex sometimes copies one author's affiliation onto
every author on the paper, so a Stanford visitor on a Berkeley
paper turns the whole author list into “Stanford”.
Less than 1% of slots in the file actually have their
raw row overridden this way (we tag those as
imputed = 3). -
Use the raw row. If the strong-window rule
didn't fire, but the slot has at least one raw OpenAlex
affiliation row, we use it. If there are several raw rows on
the same slot (joint appointment), we pick the one that has
the most support in the author's window. Tagged
imputed = 0. -
Weak-window fill. If the slot has
no raw row at all but the window vote still has a
winner with ≥ 50% share over at least 2 papers,
use that. Tagged
imputed = 1. -
Long interpolation. Some slots have no raw
row and not enough nearby publications for a window vote. For
those we look at the author's nearest publication year
before the paper and nearest publication year
after, and ask which university wins each of those
years on its own. If those two years agree on the same
university, we assign it. If they disagree, we leave the slot
blank. Tagged
imputed = 2. - Otherwise we leave the slot blank.
Roughly a third of the file is from raw rows, a third is window-filled, a small share is long-distance interpolated, and less than 1% is the strong-window override.
Show the underlying code (DuckDB SQL)
Excerpts from
build/upstream/03_impute_affiliations.py. The
thresholds --s-strong=0.75,
--p-strong=4, --s-weak=0.50,
--p-weak=2, and --window=3 are the
production defaults.
Step 1 — author-year-institution counts (the raw vote tally each author carries with them):
-- one row per (author, year, institution): how many of the author's
-- papers in that year were raw-tagged with that institution.
CREATE TABLE ayi AS
SELECT
AuthorID,
year AS obs_year,
InstitutionID,
COUNT(DISTINCT PaperID) AS n_papers
FROM raw_obs -- restricted to type='education'
GROUP BY AuthorID, year, InstitutionID;
Step 2 — for each paper-author slot, the ±3-year window vote:
-- pool the author's per-year counts across [year-3, year+3]
CREATE TABLE window_evidence AS
SELECT s.PaperID, s.AuthorID, s.year,
ay.InstitutionID,
SUM(ay.n_papers) AS n_papers_w
FROM slots s
INNER JOIN ayi ay
ON s.AuthorID = ay.AuthorID
AND ay.obs_year BETWEEN s.year - 3 AND s.year + 3
GROUP BY s.PaperID, s.AuthorID, s.year, ay.InstitutionID;
-- pick the winning institution per slot (window-modal) and compute its share
CREATE TABLE slot_window AS
WITH totals AS (
SELECT PaperID, AuthorID, year, SUM(n_papers_w) AS total_papers_w
FROM window_evidence GROUP BY PaperID, AuthorID, year
),
ranked AS (
SELECT we.*,
ROW_NUMBER() OVER (
PARTITION BY PaperID, AuthorID, year
ORDER BY n_papers_w DESC, InstitutionID
) AS rn
FROM window_evidence we
)
SELECT r.PaperID, r.AuthorID, r.year,
r.InstitutionID AS modal_inst,
r.n_papers_w AS modal_papers,
r.n_papers_w * 1.0 / NULLIF(t.total_papers_w, 0) AS modal_share
FROM ranked r JOIN totals t USING (PaperID, AuthorID, year)
WHERE r.rn = 1;
Step 3 — the priority cascade (strong-window → raw → weak-window → long-interp):
SELECT s.PaperID, s.AuthorID, s.year,
CASE
-- (a) strong-window rule: >= 75% share AND >= 4 papers
WHEN sw.modal_share >= 0.75 AND sw.modal_papers >= 4
THEN sw.modal_inst
-- (b) raw paper-author-institution row from OpenAlex
WHEN srp.raw_pick_inst IS NOT NULL
THEN srp.raw_pick_inst
-- (c) weak-window fill: no raw row, >= 50% AND >= 2 papers
WHEN sw.modal_share >= 0.50 AND sw.modal_papers >= 2
THEN sw.modal_inst
-- (d) long interpolation: nearest pre and post year agree
WHEN li.interp_inst IS NOT NULL
THEN li.interp_inst
ELSE NULL
END AS InstitutionID,
CASE
WHEN sw.modal_share >= 0.75 AND sw.modal_papers >= 4 THEN
CASE
WHEN srp.raw_pick_inst = sw.modal_inst THEN 0 -- agreed with raw
WHEN srp.raw_pick_inst IS NULL THEN 1 -- filled missing
ELSE 3 -- overrode raw
END
WHEN srp.raw_pick_inst IS NOT NULL THEN 0
WHEN sw.modal_share >= 0.50 AND sw.modal_papers >= 2 THEN 1
WHEN li.interp_inst IS NOT NULL THEN 2
ELSE NULL
END AS imputed
FROM slots s
LEFT JOIN slot_window sw USING (PaperID, AuthorID, year)
LEFT JOIN slot_raw_picked srp USING (PaperID, AuthorID, year)
LEFT JOIN long_interp li USING (PaperID, AuthorID, year);
The long_interp helper is also a SQL block: it
finds, per slot, the largest observation year strictly before
the paper and the smallest strictly after, then keeps the slot
only if both years' modal university match.
-- nearest pre/post observation years
CREATE TABLE slot_brackets AS
SELECT s.PaperID, s.AuthorID, s.year,
MAX(CASE WHEN ay.obs_year < s.year THEN ay.obs_year END) AS before_year,
MIN(CASE WHEN ay.obs_year > s.year THEN ay.obs_year END) AS after_year
FROM slots s
INNER JOIN ayi ay ON s.AuthorID = ay.AuthorID
GROUP BY s.PaperID, s.AuthorID, s.year;
-- assign only when both sides exist AND agree
CREATE TABLE long_interp AS
SELECT sb.PaperID, sb.AuthorID, sb.year, bm.modal_inst AS interp_inst
FROM slot_brackets sb
INNER JOIN ayi_modal bm ON sb.AuthorID = bm.AuthorID AND bm.obs_year = sb.before_year
INNER JOIN ayi_modal am ON sb.AuthorID = am.AuthorID AND am.obs_year = sb.after_year
WHERE sb.before_year IS NOT NULL
AND sb.after_year IS NOT NULL
AND bm.modal_inst = am.modal_inst;
How to read it
The file is useful for asking where authors appear to be publishing from in a given paper-year. It is not a personnel roster. It can miss campus labels when OpenAlex records a parent system, medical campus, or older institution name, and it can be noisy for authors with common names or genuine multi-institution careers.
The institution candidate browser shows the OpenAlex authors attached to a chosen university, year, and field without using faculty roster data.