Institution affiliation imputation

We assign institutions at the paper-author level: for a given paper, author A can be attached to Berkeley while author B is attached to Stanford if the underlying authorship data records those affiliations separately. The file is not an all-paper-institutions-to-all-authors expansion.

How well does it work?

We validate the author-institution file against faculty rosters for Stanford, UC Berkeley, UC Davis, and UCLA. The main check asks: when the roster data say a matched person is at one of these schools in a stable year, does the paper-author affiliation include that school?

Conservative validation rows

Stable person-years in the four-school faculty panel.

Correct school 91%

Other school 0.4%

The current imputed affiliation file uses a window-based modal vote that can override individual raw rows when an author's surrounding publication history is strongly concentrated at a different school.

Multi-school coauthor stress test

Papers with matched authors from different validation schools.

Correct school 75%

Coauthor school 13%

This is deliberately hard: it directly tests whether a Stanford-Berkeley paper assigns each school to the correct author.

Where authors are concentrated in 1980

These charts use the institution sample that clears the publication threshold used in the codebase. We count distinct authors attached to each institution in 1980, keeping only authors with at least 10 OpenAlex publications. If an author is attached to two sample institutions in 1980, they count once for each institution. The institution threshold is computed from sample papers in 1960-1969.

Authors per institution

Linear scale, sorted by number of 1980 authors.

Loading 1980 concentration data...

Cumulative share by number of universities

Share of sample author placements covered by the top N universities.

Loading 1980 concentration data...

Largest institutions in 1980

How the institution file is built

We want one row per (paper, author) pair giving the university we believe that author was at when the paper came out. The starting point is the raw OpenAlex authorship table: each coauthorship row says “author a appeared on paper p with affiliation tag i”. A paper-author pair can have zero such rows (OpenAlex never tagged an affiliation), one, or several (joint appointments, or OpenAlex split a single department into a university and a medical school). We keep only rows whose institution OpenAlex classifies as education — we throw out hospitals, government labs, and companies, because we want a university affiliation file, not a full employer history.

Step 1: build the author's own publication history per year

For each author we count, for every year, how many of their papers in that year were raw-tagged with each university. So if Jane Doe wrote 3 papers in 1980 and all three were tagged Stanford, then for her, (year=1980, institution=Stanford, count=3).

Step 2: the ±3-year window vote

Now take any one of her papers, say from 1981. We pool her institution counts across 1978–1984 — a seven-year window centered on the paper. Suppose those counts add up to 8 Stanford and 1 MIT. The window total is 9, the window-modal university (the one that wins the vote) is Stanford with 8 papers, and its window share is 8/9 ≈ 89%.

Step 3: rules, applied in order — first match wins

Strong-window rule. If the window-modal university wins by ≥ 75% of the window and shows up on at least 4 of the author's papers in that window, we assign it. This is strong enough that we use it even if the raw row on this paper says something different. That different raw row is usually a coauthor leak: OpenAlex sometimes copies one author's affiliation onto every author on the paper, so a Stanford visitor on a Berkeley paper turns the whole author list into “Stanford”. Less than 1% of slots in the file actually have their raw row overridden this way (we tag those as imputed = 3).
Use the raw row. If the strong-window rule didn't fire, but the slot has at least one raw OpenAlex affiliation row, we use it. If there are several raw rows on the same slot (joint appointment), we pick the one that has the most support in the author's window. Tagged imputed = 0.
Weak-window fill. If the slot has no raw row at all but the window vote still has a winner with ≥ 50% share over at least 2 papers, use that. Tagged imputed = 1.
Long interpolation. Some slots have no raw row and not enough nearby publications for a window vote. For those we look at the author's nearest publication year before the paper and nearest publication year after, and ask which university wins each of those years on its own. If those two years agree on the same university, we assign it. If they disagree, we leave the slot blank. Tagged imputed = 2.
Otherwise we leave the slot blank.

Roughly a third of the file is from raw rows, a third is window-filled, a small share is long-distance interpolated, and less than 1% is the strong-window override.

Show the underlying code (DuckDB SQL)

Excerpts from build/upstream/03_impute_affiliations.py. The thresholds --s-strong=0.75, --p-strong=4, --s-weak=0.50, --p-weak=2, and --window=3 are the production defaults.

Step 1 — author-year-institution counts (the raw vote tally each author carries with them):

-- one row per (author, year, institution): how many of the author's
-- papers in that year were raw-tagged with that institution.
CREATE TABLE ayi AS
SELECT
    AuthorID,
    year AS obs_year,
    InstitutionID,
    COUNT(DISTINCT PaperID) AS n_papers
FROM raw_obs                        -- restricted to type='education'
GROUP BY AuthorID, year, InstitutionID;

Step 2 — for each paper-author slot, the ±3-year window vote:

-- pool the author's per-year counts across [year-3, year+3]
CREATE TABLE window_evidence AS
SELECT s.PaperID, s.AuthorID, s.year,
       ay.InstitutionID,
       SUM(ay.n_papers) AS n_papers_w
FROM slots s
INNER JOIN ayi ay
  ON s.AuthorID = ay.AuthorID
 AND ay.obs_year BETWEEN s.year - 3 AND s.year + 3
GROUP BY s.PaperID, s.AuthorID, s.year, ay.InstitutionID;

-- pick the winning institution per slot (window-modal) and compute its share
CREATE TABLE slot_window AS
WITH totals AS (
    SELECT PaperID, AuthorID, year, SUM(n_papers_w) AS total_papers_w
    FROM window_evidence GROUP BY PaperID, AuthorID, year
),
ranked AS (
    SELECT we.*,
           ROW_NUMBER() OVER (
               PARTITION BY PaperID, AuthorID, year
               ORDER BY n_papers_w DESC, InstitutionID
           ) AS rn
    FROM window_evidence we
)
SELECT r.PaperID, r.AuthorID, r.year,
       r.InstitutionID AS modal_inst,
       r.n_papers_w     AS modal_papers,
       r.n_papers_w * 1.0 / NULLIF(t.total_papers_w, 0) AS modal_share
FROM ranked r JOIN totals t USING (PaperID, AuthorID, year)
WHERE r.rn = 1;

Step 3 — the priority cascade (strong-window → raw → weak-window → long-interp):

SELECT s.PaperID, s.AuthorID, s.year,
    CASE
        -- (a) strong-window rule: >= 75% share AND >= 4 papers
        WHEN sw.modal_share >= 0.75 AND sw.modal_papers >= 4
            THEN sw.modal_inst
        -- (b) raw paper-author-institution row from OpenAlex
        WHEN srp.raw_pick_inst IS NOT NULL
            THEN srp.raw_pick_inst
        -- (c) weak-window fill: no raw row, >= 50% AND >= 2 papers
        WHEN sw.modal_share >= 0.50 AND sw.modal_papers >= 2
            THEN sw.modal_inst
        -- (d) long interpolation: nearest pre and post year agree
        WHEN li.interp_inst IS NOT NULL
            THEN li.interp_inst
        ELSE NULL
    END AS InstitutionID,
    CASE
        WHEN sw.modal_share >= 0.75 AND sw.modal_papers >= 4 THEN
            CASE
                WHEN srp.raw_pick_inst = sw.modal_inst THEN 0   -- agreed with raw
                WHEN srp.raw_pick_inst IS NULL         THEN 1   -- filled missing
                ELSE 3                                          -- overrode raw
            END
        WHEN srp.raw_pick_inst IS NOT NULL                 THEN 0
        WHEN sw.modal_share >= 0.50 AND sw.modal_papers >= 2 THEN 1
        WHEN li.interp_inst IS NOT NULL                    THEN 2
        ELSE NULL
    END AS imputed
FROM slots s
LEFT JOIN slot_window      sw  USING (PaperID, AuthorID, year)
LEFT JOIN slot_raw_picked  srp USING (PaperID, AuthorID, year)
LEFT JOIN long_interp      li  USING (PaperID, AuthorID, year);

The long_interp helper is also a SQL block: it finds, per slot, the largest observation year strictly before the paper and the smallest strictly after, then keeps the slot only if both years' modal university match.

-- nearest pre/post observation years
CREATE TABLE slot_brackets AS
SELECT s.PaperID, s.AuthorID, s.year,
       MAX(CASE WHEN ay.obs_year < s.year THEN ay.obs_year END) AS before_year,
       MIN(CASE WHEN ay.obs_year > s.year THEN ay.obs_year END) AS after_year
FROM slots s
INNER JOIN ayi ay ON s.AuthorID = ay.AuthorID
GROUP BY s.PaperID, s.AuthorID, s.year;

-- assign only when both sides exist AND agree
CREATE TABLE long_interp AS
SELECT sb.PaperID, sb.AuthorID, sb.year, bm.modal_inst AS interp_inst
FROM slot_brackets sb
INNER JOIN ayi_modal bm ON sb.AuthorID = bm.AuthorID AND bm.obs_year = sb.before_year
INNER JOIN ayi_modal am ON sb.AuthorID = am.AuthorID AND am.obs_year = sb.after_year
WHERE sb.before_year IS NOT NULL
  AND sb.after_year  IS NOT NULL
  AND bm.modal_inst = am.modal_inst;

How to read it

The file is useful for asking where authors appear to be publishing from in a given paper-year. It is not a personnel roster. It can miss campus labels when OpenAlex records a parent system, medical campus, or older institution name, and it can be noisy for authors with common names or genuine multi-institution careers.

The institution candidate browser shows the OpenAlex authors attached to a chosen university, year, and field without using faculty roster data.