Work deduplication

How candidate pairs are generated

A duplicate pair must first be a candidate. We use nine blocking signals (T0–T7 and T9); the same pair may fire under several of them. All blocking is restricted to block_size ≤ 200 to keep memory tractable in recent years, where stub titles like "powered by nict" (34,815 papers in 2017 alone) would otherwise make the pair count blow up.

T0 same DOI — same normalized DOI in the same year. Strongest signal; near-perfect precision.
T1 same PMID, T2 same PMCID, T3 same MAG ID — three identifier-equality signals in the same year. T1 / T2 / T3 are tiny in practice; OpenAlex already merges them at ingest, but the block catches the residual slipthrough.
T4 same title-key — same Unicode-normalised, whitespace-collapsed, lowercase title with token count ≥ 3 and length > 5, in the same year. Title-key blocks of size ≤ 3 become T4a (high precision); 4 – 10 become T4b (allowed iff neither title is in the 84-key frontmatter blocklist).
T5 shared author + fuzzy title — within the same year, two papers share at least one OpenAlex AuthorID and have token_set_ratio ≥ 95 on titles, plus length-ratio and token-count guards.
T6 DOI suffix variant — DOIs are equal except for a trailing single-letter or punctuation+letter suffix (e.g. 10.1000/abc vs 10.1000/abc-a); common Crossref publishing pattern.
T7 cross-year same DOI — same normalized DOI within year_diff ≤ 2; catches reissue / preprint year-drift.
T9 cross-year shared author + fuzzy title — shared author + token_set_ratio ≥ 95 at year_diff = 1 or token_set_ratio ≥ 98 at year_diff ∈ {2, 3}; window ±3 years (widened from ±1 on 2026-05-09 after a 50-pair API spot-check showed 98% precision at the wider window). Catches the same paper indexed once as preprint, once as published version, with different paperids and a 1–3 year drift.
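
As an illustration of the T4 blocking rule described above, here is a minimal sketch in Python. It is not the code in 02_candidates_t0_t4.py: the function names are hypothetical, the choice of NFKC as the Unicode normalization form is an assumption (the text only says "Unicode-normalised"), and the frontmatter blocklist check that gates T4b is left out.

```python
import re
import unicodedata
from collections import defaultdict
from itertools import combinations


def title_key(title):
    """Return a normalized title key, or None if the title is too short to block on."""
    # NFKC is an assumption; the source does not name the normalization form.
    text = unicodedata.normalize("NFKC", title)
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Guards from the text: length > 5 and token count >= 3.
    if len(text) <= 5 or len(text.split(" ")) < 3:
        return None
    return text


def t4_candidate_pairs(papers, max_block=10):
    """papers: iterable of (paperid, year, title). Yields (id_a, id_b, tier)."""
    blocks = defaultdict(list)
    for paperid, year, title in papers:
        key = title_key(title)
        if key is not None:
            blocks[(year, key)].append(paperid)
    for members in blocks.values():
        # Blocks above the cap are skipped entirely rather than expanded.
        if len(members) < 2 or len(members) > max_block:
            continue
        tier = "T4a" if len(members) <= 3 else "T4b"
        for a, b in combinations(sorted(members), 2):
            yield a, b, tier


papers = [
    (1, 2017, "Deep  Learning for Graphs"),
    (2, 2017, "deep learning for graphs"),
    (3, 2017, "Index"),
    (4, 2018, "Deep learning for graphs"),
]
pairs = list(t4_candidate_pairs(papers))
```

Skipping oversized blocks before pair expansion is what keeps the pair count bounded: a block of n papers expands to n(n-1)/2 pairs, so a single stub-title block with tens of thousands of members would dominate the run.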

Author IDs are pre-deduplicated. T5 and T9 do shared-author matching on the cleaned AuthorID space coming out of the author-deduplication pipeline (see Author dup. overview), so two papers whose authors live under different OpenAlex AuthorIDs but the same canonical author still register as a shared-author match.
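
The shared-author rules for T5 and T9 reduce to a threshold that depends on the year gap. The sketch below shows that logic only, under stated assumptions: the function names are hypothetical, the title score is passed in as a precomputed token_set_ratio, author IDs are assumed to be already canonical, and the length-ratio and token-count guards are omitted.

```python
def required_tsr(year_diff):
    """Minimum token_set_ratio for a shared-author pair, or None outside the window."""
    if year_diff in (0, 1):
        return 95  # T5 (same year) and T9 at year_diff = 1
    if year_diff in (2, 3):
        return 98  # stricter T9 threshold for wider gaps
    return None  # outside the +/-3 year window


def shared_author_match(authors_a, authors_b, year_a, year_b, tsr):
    """True if the pair passes the shared-author + fuzzy-title rule."""
    threshold = required_tsr(abs(year_a - year_b))
    if threshold is None:
        return False
    # At least one canonical AuthorID must appear on both papers.
    if not set(authors_a) & set(authors_b):
        return False
    return tsr >= threshold
```

For example, a pair two years apart with one shared author and a title score of 96 is rejected, while the same pair one year apart is accepted.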

The acceptance tiers

Pairs are evaluated in priority order; the first tier that fires becomes the pair’s accepted_tier. A lower priority number means stronger evidence. A cluster’s best tier is the tier with the lowest priority number (the strongest evidence) among its pairs.

Priority	Tier	Rule (key conditions)
0	`T0_same_doi`	Same normalized DOI, same year — the strongest evidence we have.
1	`T1_same_pmid`	Same PubMed ID, same year. Tiny in practice (OpenAlex already merges these at ingest); tier exists to catch residual slipthrough.
2	`T2_same_pmcid`	Same PubMed Central ID, same year. Same caveat as T1.
3	`T3_same_mag`	Same Microsoft Academic Graph ID, same year. Almost never fires (MAG IDs are mostly null in OpenAlex).
4	`T7_cross_year_same_doi`	Same normalized DOI across `year_diff ≤ 2`. Catches reissue and preprint-vs-published year drift.
5	`T6_doi_suffix_variant`	DOIs equal except for trailing single-letter / punct suffix; Crossref publishing pattern.
6	`T4_same_title`	Same Unicode-normalised title key, same year, length `> 5`, token count `≥ 3`, not in 84-key frontmatter blocklist. Step 07 internally splits this on per-pair block size: `T4a_same_title_small_block` (block size `≤ 3`) vs `T4b_same_title_medium_block` (block size `4–10`). The cluster explorer collapses both into the one `T4` class because per-pair block_size is not preserved in the global crosswalk.
7	`T5_shared_author_fuzzy_title`	Same year + `≥ 1` shared OpenAlex AuthorID + `token_set_ratio ≥ 95` on titles + token / length guards.
8	`T9_cross_year_shared_author`	Same as T5, but spanning years (window `±3`; widened from `±1` on 2026-05-09 after spot-check validated 98% precision). Title threshold is `tsr ≥ 95` at `year_diff = 1`, stricter `tsr ≥ 98` at `year_diff ∈ {2, 3}` to keep cross-year FPs from creeping in.
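
The priority order in the table can be expressed as an ordered list. The following is a minimal sketch, not the logic in 07_tier_rules.py; the helper names are hypothetical, and the per-tier rule evaluation is assumed to have already produced the set of tiers that fired for a pair.

```python
# Tier names in the priority order given in the table (index = priority).
TIER_PRIORITY = [
    "T0_same_doi",
    "T1_same_pmid",
    "T2_same_pmcid",
    "T3_same_mag",
    "T7_cross_year_same_doi",
    "T6_doi_suffix_variant",
    "T4_same_title",
    "T5_shared_author_fuzzy_title",
    "T9_cross_year_shared_author",
]
PRIORITY = {tier: rank for rank, tier in enumerate(TIER_PRIORITY)}


def accepted_tier(fired):
    """Return the first tier, in priority order, among those that fired for a pair."""
    for tier in TIER_PRIORITY:
        if tier in fired:
            return tier
    return None


def cluster_best_tier(pair_tiers):
    """Return the tier with the lowest priority number among a cluster's pairs."""
    return min(pair_tiers, key=PRIORITY.__getitem__)
```

Note that T7 and T6 outrank T4 even though their labels carry larger numbers, so the label number is not the priority.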

Connected components on the accepted-pair graph form clusters (DuckDB label-propagation, ~10 min on 16.7M nodes / 16.8M edges). Each cluster’s canonical PaperID is the member with the richest metadata (non-null title > non-null DOI > doctype article|journal-article > highest cite count > most references > lowest paperid).
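
To make the clustering and canonical-selection rules concrete, here is a small in-memory sketch. It swaps in a union-find structure for the DuckDB label-propagation used by the pipeline (both yield connected components), and the record field names (title, doi, doctype, cite_count, n_refs) are hypothetical.

```python
def connected_components(edges):
    """Group node ids into clusters with union-find; edges is an iterable of (a, b)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return [sorted(members) for members in clusters.values()]


ARTICLE_TYPES = {"article", "journal-article"}


def canonical_sort_key(rec):
    """Smaller key = richer metadata, following the ordering stated in the text."""
    return (
        rec["title"] is None,                  # non-null title first
        rec["doi"] is None,                    # then non-null DOI
        rec["doctype"] not in ARTICLE_TYPES,   # then article doctypes
        -rec["cite_count"],                    # then highest cite count
        -rec["n_refs"],                        # then most references
        rec["paperid"],                        # then lowest paperid
    )


def pick_canonical(members):
    """Return the paperid of the cluster member with the richest metadata."""
    return min(members, key=canonical_sort_key)["paperid"]
```

Encoding the preference order as a tuple keeps the tie-breaking explicit: a member with a DOI beats one without, regardless of citation counts.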

Pipeline structure

Implementation lives at build/work_cleaning/; each step is one numbered Python script.

01_attach_titles.py — per-year scan of the 114 GB papers.parquet + paper_text.parquet → year_{Y}/papers_with_titles.parquet (paper-level metadata + normalized title key).
02_candidates_t0_t4.py — T0 / T1 / T2 / T3 / T4 same-year blocking. --block-size-cap 200 prevents stub-title runaway (an early version segfaulted at 710M same-title pairs in year 2017 from 7 stub-title blocks alone).
03_features.py — per-pair feature engineering: shared authors, shared institutions, title similarity, DOI prefix match, year gap, doctype agreement, cite-count overlap.
04_candidates_t5_t6.py — T5 (shared author + fuzzy title same year) and T6 (DOI suffix variant) candidates, deduplicated against the T0/T4 pool.
05_candidates_t7.py — T7 cross-year same-DOI pairs. Reads adjacent-year title files when present.
06_candidates_t9.py — T9 cross-year shared-author + fuzzy-title pairs. Anchored on the focal year; skips with an empty-schema parquet when the year is outside the 1940-2000 imputed-edges coverage.
07_tier_rules.py — applies tier acceptance rules + 84-key frontmatter blocklist; runs DuckDB label-propagation per year; picks canonical paperid; writes year_{Y}/crosswalk.parquet and canonical_papers.parquet.
08_stitch_years.py — concatenates the 125 per-year crosswalks, runs label-propagation a second time on the union graph (paperids that appear in adjacent-year slices because of the T7/T9 cross-year window), re-picks canonical against unified metadata, writes global/crosswalk.parquet and global/canonical_papers.parquet.

The whole pipeline runs as a SLURM array on Sherlock: one task per year (steps 02–07), then one final stitcher job. Total wall time on 1900–2024: ~3 hours of array compute + ~10 min stitch.

What kinds of clusters each tier catches

Below are real example clusters from the 1900–2024 run, one block per tier in priority order. The work-dup explorer has many more.


Stats by tier

How accepted pairs and clusters break down by tier. A cluster’s tier is that of its strongest edge (the tier with the lowest priority number among its pairs).


Per-tier precision: 50-pair API spot-check

To get an honest read on each tier's false-positive rate, we drew 50 accepted pairs per tier (30 for T1, 0 available for T3) stratified across years 1940–2024 in 5-year steps, fetched both works from the live OpenAlex API, and assigned a verdict (true_dup, false_positive, or ambiguous) via build/work_cleaning/spot_check_api.py:assess().

Translation-aware verdict. The naive verdict flagged every pair whose API titles diverged (token_set_ratio < 95) as a false positive. That badly underestimated T1/T2 precision, because PubMed routinely indexes the English and Spanish (or other-language) versions of a paper under one PMID/PMCID. The refined rule promotes a diverged-title pair to true_dup when the merge tier is identifier-decisive (T0/T1/T2/T3/T7) and the API shows ≥ 1 shared author, which is the canonical fingerprint of a translation duplicate.
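
The difference between the naive and the refined rule can be sketched as follows. This covers only the diverged-title branch described above; it is not the body of spot_check_api.py:assess(), the function names are hypothetical, and the ambiguous verdict is not modelled.

```python
IDENTIFIER_DECISIVE = {"T0", "T1", "T2", "T3", "T7"}


def titles_diverged(title_tsr):
    """A pair's API titles count as diverged when token_set_ratio < 95."""
    return title_tsr < 95


def diverged_title_verdict(tier, n_shared_authors, translation_aware=True):
    """Verdict for a pair whose API titles diverged.

    With translation_aware=False this reproduces the naive rule, which
    marks every diverged-title pair as a false positive.
    """
    if (
        translation_aware
        and tier in IDENTIFIER_DECISIVE
        and n_shared_authors >= 1
    ):
        return "true_dup"
    return "false_positive"
```

A T1 pair with diverged titles and two shared authors is a false positive under the naive rule and a true duplicate under the refined one; a T4 pair with the same evidence stays a false positive under both.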

Tier	n sampled	strict precision	loose precision	accepted edges (% of total)
`T0_same_doi`	50	96.0%	96.0%	27,010 (0.22%)
`T1_same_pmid`	30	76.7%	76.7%	12 (0.00%)
`T2_same_pmcid`	50	58.0%	58.0%	41 (0.00%)
`T3_same_mag`	0	n/a	n/a	0
`T4_same_title`	50	38.0%	54.0%	11,321,952 (92.8%)
`T5_shared_author_fuzzy_title`	50	66.0%	66.0%	535,107 (4.4%)
`T6_doi_suffix_variant`	50	30.0%	44.0%	4,593 (0.04%)
`T7_cross_year_same_doi`	50	94.0%	94.0%	3,956 (0.03%)
`T9_cross_year_shared_author`	50	98.0%	98.0%	307,930 (2.5%)

Read. T0/T7/T9 (96/94/98%) are production-ready. T5 (66%) is the borderline mid-tier signal that contributes 4.4% of edges. T1/T2 look low at 27%/24% naive but clear 77%/58% after the translation-aware fix; their combined contribution (53 edges out of 12.2 M, 0.0004%) is negligible regardless. T6 (44% loose) is borderline but only 0.04% of edges.

The dominant tier, T4, sits at 38% strict and 54% loose precision. T4 contributes 92.8% of all merges, so its precision is the headline number for the whole pipeline. The current frontmatter blocklist (84 keys), reply-like regex, and block-size cap clear obvious junk; the remaining FPs are pairs with a generic medical / scientific title appearing twice in the same year by different authors (e.g. "SURGERY IN DIABETES MELLITUS", "Mental Health Services"). Tightening this further would require an additional shared-author or same-DOI-prefix gate, which would push T4 toward T5's 66% precision but also drop a substantial share of accepted edges. That is a trade-off a referee should weigh in on. We report T4's strict precision as-is in the paper.

Total accepted edges across the run: 12.2 M, collapsing 16.8 M OpenAlex paperids into 7.6 M canonical clusters (modal cluster size 2 covers 89% of clusters; max single cluster size 2,190). The "edges" column above sums to 12,200,601 — matching work_duplicates_stats.json.
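
The edge bookkeeping can be re-derived from the per-tier counts in the precision table. The short check below uses only numbers reported on this page; the variable names are illustrative.

```python
# Accepted-edge counts per tier, copied from the spot-check table.
ACCEPTED_EDGES = {
    "T0_same_doi": 27_010,
    "T1_same_pmid": 12,
    "T2_same_pmcid": 41,
    "T3_same_mag": 0,
    "T4_same_title": 11_321_952,
    "T5_shared_author_fuzzy_title": 535_107,
    "T6_doi_suffix_variant": 4_593,
    "T7_cross_year_same_doi": 3_956,
    "T9_cross_year_shared_author": 307_930,
}

total = sum(ACCEPTED_EDGES.values())

# Share of all accepted edges contributed by each tier, in percent.
shares = {tier: 100 * n / total for tier, n in ACCEPTED_EDGES.items()}

for tier, share in shares.items():
    print(f"{tier:32s} {ACCEPTED_EDGES[tier]:>12,d}  {share:6.2f}%")
print(f"{'total':32s} {total:>12,d}")
```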

Erratum / correction / retraction handling

OpenAlex assigns a separate PaperID to most erratum, correction, retraction, and reply notices. These should NOT be merged with the original paper — they are distinct publications with their own citation graphs.

Step 07 (07_tier_rules.py) rejects any candidate pair from the title-based tiers (T4, T5, T6, T9) where either side's title matches a REPLYLIKE_REGEXES pattern (^erratum, ^errata, ^correction(s| to|:), ^corrigendum, ^retract(ion|ed|ing), ^withdrawn, ^reply (to|on), ^response (to|on), ^comment (on|to), ^addendum, ^rejoinder, etc. — 33 patterns total). The identifier-decisive tiers (T0/T1/T2/T3/T7) bypass this filter since same-DOI / PMID / PMCID is a publisher-side merge signal we want to preserve.
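
A sketch of the reply-like filter follows, using only the eleven patterns quoted above (the full list of 33 is not reproduced here). Case-insensitive matching and the short tier labels are assumptions; the helper names are hypothetical.

```python
import re

# Subset of REPLYLIKE_REGEXES quoted in the text; the real list has 33 patterns.
REPLYLIKE_PATTERNS = [
    r"^erratum",
    r"^errata",
    r"^correction(s| to|:)",
    r"^corrigendum",
    r"^retract(ion|ed|ing)",
    r"^withdrawn",
    r"^reply (to|on)",
    r"^response (to|on)",
    r"^comment (on|to)",
    r"^addendum",
    r"^rejoinder",
]
# Case-insensitive matching is an assumption made for this sketch.
REPLYLIKE = [re.compile(p, re.IGNORECASE) for p in REPLYLIKE_PATTERNS]

# Tiers the text lists as subject to the filter.
TITLE_BASED = {"T4", "T5", "T6", "T9"}


def is_replylike(title):
    """True if the title starts with one of the reply-like patterns."""
    stripped = title.strip()
    return any(rx.search(stripped) for rx in REPLYLIKE)


def keep_pair(tier, title_a, title_b):
    """Identifier-decisive tiers bypass the filter; title-based tiers do not."""
    if tier not in TITLE_BASED:
        return True
    return not (is_replylike(title_a) or is_replylike(title_b))
```

With these patterns, "Correction to: ..." is filtered while "Correction of the Cornrow Hair Transplant" is not, because `^correction(s| to|:)` requires one of the three listed continuations.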

Audit (post-2026-05-09 rebuild): audit_current_output.py finds 132 canonical clusters out of 7.6 M (0.0017%) whose canonical titles start with erratum / correction / retraction / reply patterns. Inspection of the largest of these (sizes 8, 6, 5, 5, 5, 4, 4, 4...) shows they are genuinely-titled papers using "correction" in its medical / practical sense — not erratum notices — e.g. "Correction of the Cornrow Hair Transplant", "Correction of the Paralytic Claw-Thumb...", "Correction of an Inborn Error of Metabolism...". The dedup pipeline correctly keeps standalone erratum publications separate while merging amended re-publications (T0 / T7 same DOI) and translations.

Citation aggregation: union, not max

Two members of a cluster can share citers (a citing paper that lists both versions in its references) or have disjoint citers (e.g. preprint cited in arXiv-citing literature, journal version cited in journal-citing literature). For each canonical cluster we now compute the union of distinct citing paperids across all members (union_cited_by) — implemented in step 09_citation_union.py as a single DuckDB pass over the citation edge table joined to the global crosswalk.

The naive alternative (max_cited_by across cluster members) under-counts citations whenever the citers are disjoint. In the 1900–2024 run, union_cited_by recovers 33% more citers than max_cited_by on average across multi-member clusters. Both columns are written to canonical_papers.parquet; union_cited_by is the default for downstream panels.
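
The gap between the two aggregates is easy to see on a toy cluster. The sketch below is plain Python rather than the DuckDB pass in 09_citation_union.py, and the citer sets are made-up illustration data, not pipeline output.

```python
def citation_counts(citers_by_member):
    """citers_by_member: dict mapping member paperid -> set of citing paperids."""
    all_citers = set().union(*citers_by_member.values())
    max_count = max((len(c) for c in citers_by_member.values()), default=0)
    return {"union_cited_by": len(all_citers), "max_cited_by": max_count}


# Toy cluster: a preprint and its journal version share exactly one citer (3).
toy_cluster = {
    "preprint": {1, 2, 3},
    "journal": {3, 4, 5, 6},
}
counts = citation_counts(toy_cluster)
```

Here the union counts six distinct citers while the maximum counts four; the two agree only when one member's citers contain every other member's.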

Downstream wiring: canonical-view registry

Rather than rebuild the 114 GB papers.parquet with canonical IDs baked in, every panel builder (build_authors, build_pairs, build_institutions, precompute_aux, build_exposure) now opens its DuckDB connection through bitnet.work_dedup.register_canonical_views(), which exposes views like pa_canon and sci_canon that apply the work crosswalk and the author crosswalk on the fly:

CREATE OR REPLACE VIEW pa_canon AS
  SELECT
    COALESCE(work_xw.canonical_paperid, raw.paperid)  AS paperid,
    COALESCE(auth_xw.canonical_authorid, raw.authorid) AS authorid,
    raw.year, raw.field, raw.affiliation_id, raw....
  FROM read_parquet('paper_author_imputed.parquet') raw
  LEFT JOIN read_parquet('crosswalk.parquet')        work_xw
    ON raw.paperid = work_xw.paperid
  LEFT JOIN read_parquet('author_crosswalk.parquet') auth_xw
    ON raw.authorid = auth_xw.authorid;

This means downstream panels see the deduplicated paper-author edges without any pre-materialization step, and the same crosswalks are applied to the citation graph (paper_references_canon) and to SciSciNet metadata (sci_canon). The toggle is governed by the environment variable BITNET_WORK_DEDUP=1 (set in every Sherlock SLURM script that rebuilds panels).
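
The COALESCE-over-LEFT-JOIN pattern in the view amounts to a lookup with fallback to the raw ID. A minimal Python equivalent, with hypothetical names and the crosswalks held as plain dicts, looks like this:

```python
def canonicalize_edges(edges, work_xw, author_xw):
    """Map (paperid, authorid) edges through both crosswalks.

    IDs missing from a crosswalk fall back to themselves, mirroring
    COALESCE(xw.canonical_id, raw.id) over a LEFT JOIN.
    """
    return [
        (work_xw.get(paperid, paperid), author_xw.get(authorid, authorid))
        for paperid, authorid in edges
    ]


# Toy crosswalks: paper 11 merges into 10, author "b" merges into "a".
raw_edges = [(10, "a"), (11, "b"), (12, "c")]
canon_edges = canonicalize_edges(raw_edges, {11: 10}, {"b": "a"})
```

Like the view as written, this sketch does not collapse rows that become identical after mapping; whether a consumer needs a DISTINCT on top depends on the panel being built.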

Canonical clean datasets (SciNET fields, default-on)

As of 2026-05-10 the panels also read paper- and author-level field labels from the SciNET classifier rather than the legacy OpenAlex.primary_topic_id → topic_mapping.csv join. The single source of truth lives at ${OPENALEX}/data/clean/ and is built once by build/upstream/build_canonical_clean_datasets.py (a single DuckDB pass, ~3 min). Eight files land there, seven parquets and one CSV lookup:

paper_field.parquet — canonical_paperid × top-3 SciNET fields with renormalized probabilities (14.2 M papers).
paper_subfield.parquet — canonical_paperid × top-5 SciNET subfields.
author_year_institution.parquet — canonical_authorid × year × institution_id with shares + modal flag.
author_field.parquet — career-level top-3 SciNET fields per canonical author, plus an is_confident boolean.
author_subfield.parquet — career-level top-5 subfields.
field_codes_to_names.csv — SciNET 4-letter code ↔ OpenAlex long name lookup.
paper_crosswalk.parquet, author_crosswalk.parquet — copies of the upstream dedup crosswalks for at-rest provenance.

The author confidence flag is a two-criterion threshold: is_confident = (field_p1 ≥ 0.7) AND (n_papers ≥ 10). Thresholds live in bitnet/clean.py as FIELD_CONFIDENCE_THRESHOLD and MIN_PAPERS; changing the constant and re-running the canonical builder re-derives the boolean everywhere. With (0.7, 10), 408 K of the 4.18 M canonical authors (9.8%) are flagged confident — heavily weighted toward authors with 10+ SciNET papers.
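
The two-criterion flag can be written directly from the formula above. The constant names follow the text; the function itself is a sketch, not the code in bitnet/clean.py.

```python
FIELD_CONFIDENCE_THRESHOLD = 0.7
MIN_PAPERS = 10


def is_confident(field_p1, n_papers,
                 p_threshold=FIELD_CONFIDENCE_THRESHOLD,
                 min_papers=MIN_PAPERS):
    """True when the top-field probability and the paper count both clear their thresholds."""
    return field_p1 >= p_threshold and n_papers >= min_papers


# Share of confident authors reported in the text: 408 K of 4.18 M.
confident_share_pct = 100 * 408_000 / 4_180_000
```

Both conditions must hold, so an author with a very clean field profile but only nine papers is not flagged, and neither is a prolific author whose top field carries less than 0.7 probability.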

SciNET ↔ OpenAlex field-name mapping

SciNET classifies papers into 30 four-letter field codes; OpenAlex uses 26 long-name fields. Most STEM fields map 1:1 (COMP → Computer Science, PHYS → Physics and Astronomy, MATH → Mathematics, ENGG → Engineering, CHEM → Chemistry, MATR → Materials Science). The collapses happen on the non-STEM side (six SciNET arts/humanities codes → OpenAlex "Arts and Humanities"; six SciNET social codes → "Social Sciences").

Known edge case: OpenAlex has a separate Chemical Engineering field (~13 topics in topic_mapping.csv); SciNET does not. ChemE papers land under CHEM or ENGG based on title content. The body's STEM_FIELDS bundle includes Chemical Engineering for BitNet/Usenet/NSFNET, so flipping BITNET_CLEAN_FIELDS=1 shifts those papers into chemistry / engineering. The before/after numerics in the next section quantify the resulting movement on fig20-23.

Toggles

Env var	Default	Effect
`BITNET_CLEAN_FIELDS`	on if parquets exist	Switches all six panel builders to SciNET field labels.
`BITNET_AUTHOR_CONFIDENT_ONLY`	`0`	Restricts the author panel to `is_confident = TRUE`.
`BITNET_CLEAN_DIR`	auto-detect	Override the location of the clean parquets.

Setting BITNET_CLEAN_FIELDS=0 falls back to the legacy topic_mapping.csv join byte-for-byte, so referee questions about the field-source can be answered with a direct A/B comparison.
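
One way to read the "on if parquets exist" default is sketched below. This is an assumption about how the toggle could be resolved, not the project's implementation: the function name is hypothetical, the existence check looks at paper_field.parquet only, and any explicit value other than "1" is treated as off.

```python
from pathlib import Path


def clean_fields_enabled(env, clean_dir):
    """Resolve the BITNET_CLEAN_FIELDS toggle from an environment mapping.

    An explicit value wins; otherwise the toggle defaults to on only when
    the clean parquets are present (checked here via paper_field.parquet,
    which is an assumption of this sketch).
    """
    value = env.get("BITNET_CLEAN_FIELDS")
    if value is not None:
        return value == "1"
    return (Path(clean_dir) / "paper_field.parquet").exists()
```

In a real process the mapping would be `os.environ`; passing it in as an argument keeps the function easy to exercise for both arms of an A/B comparison.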

SciNET vs legacy field-source: fig20-23 before/after

Pre-rebuild snapshot at data/intermediate/panels_pre_scinet/; post-rebuild panels at data/intermediate/panels/. Both fits use identical panel construction and estimator; the only thing that differs is the field label source. Per-event-time tables at results/scinet_before_after/fig{20,21,22,23}_pre_post_estimates.md.

fig	network	mean |Δcoef|	max |Δcoef|	year-5 legacy	year-5 SciNET
fig20 (binary)	arpanet	0.030	0.062	-0.039	0.023
fig20 (binary)	usenet	0.007	0.022	-0.007	0.015
fig20 (binary)	bitnet	0.020	0.050	0.106	0.129
fig20 (binary)	nsfnet	0.027	0.064	-0.002	0.052
fig21 (E_join)	arpanet	0.086	0.174	-0.114	0.060
fig21 (E_join)	usenet	0.024	0.079	-0.037	0.043
fig21 (E_join)	bitnet	0.032	0.062	0.256	0.226
fig21 (E_join)	nsfnet	0.245	0.670	0.130	0.535
fig22 (random pair)	arpanet	0.011	0.029	0.020	0.049
fig22 (random pair)	usenet	0.003	0.007	0.011	0.015
fig22 (random pair)	bitnet	0.004	0.016	0.022	0.038
fig22 (random pair)	nsfnet	0.020	0.111	-0.068	0.043
fig23 (bilateral)	arpanet	0.027	0.056	0.040	0.043
fig23 (bilateral)	usenet	0.004	0.010	0.003	0.012
fig23 (bilateral)	bitnet	0.004	0.016	0.011	0.027
fig23 (bilateral)	nsfnet	0.013	0.035	-0.029	-0.016
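
For readers who want to recompute the summary columns from the per-event-time tables, the sketch below shows one plausible definition: the mean and maximum of the absolute coefficient differences over matching event times. That definition is an assumption, the function name is hypothetical, and the coefficients in the example are toy values, not estimates from the panels.

```python
def coef_shift_summary(before, after):
    """before, after: dicts mapping rel_time -> coefficient over the same event times."""
    if before.keys() != after.keys():
        raise ValueError("both fits must cover the same event times")
    deltas = [abs(after[t] - before[t]) for t in sorted(before)]
    return {
        "mean_abs_delta": sum(deltas) / len(deltas),
        "max_abs_delta": max(deltas),
    }


# Toy coefficients for three event times (illustration only).
toy_before = {-1: 0.00, 0: 0.10, 1: 0.20}
toy_after = {-1: 0.00, 0: 0.15, 1: 0.10}
summary = coef_shift_summary(toy_before, toy_after)
```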

BitNet (the headline network) shifts modestly: $\beta_5$ goes from $0.106$ to $0.129$ on fig20 and from $0.256$ to $0.226$ on fig21. Pre-trends remain flat. At year 5 the BitNet point estimate keeps its sign in all four figures; for the other networks, several year-5 point estimates in the table above change sign between label sources (ARPANET, Usenet, and NSFNET on fig20; ARPANET and Usenet on fig21; NSFNET on fig22; none on fig23). The NSFNET fig21 swing is dominated by small-sample noise (federal-tail universe, late-cohort exclusions); fig20/22/23 move much less on the same network. ARPANET fig21 also wiggles (small CS-only frame). Usenet moves least across all four figures.

Effect on published estimates (R1, primary fields)

We re-fit the two headline event studies on both pre-dedup and post-dedup panels for all four networks: fig20 (binary treatment) and fig21 (continuous E_join dose). Same R1 frame, same primary-field bundle, same residualization on institution-size×year FE. Computed by analysis/dedup_numeric_diff.py; side-by-side PNGs in results/dedup_before_after/<net>/.

fig20 (binary)

Network	mean |Δcoef|	max |Δcoef|	year-5 (pre)	year-5 (post)
ARPANET	0.027	0.042	-0.014	-0.039
USENET	0.003	0.008	-0.004	-0.007
BITNET	0.016	0.045	0.151	0.106
NSFNET	0.032	0.088	0.086	-0.002

fig21 (continuous E_join)

Network	mean |Δcoef|	max |Δcoef|	year-5 (pre)	year-5 (post)
ARPANET	0.079	0.131	-0.013	-0.114
USENET	0.013	0.032	-0.009	-0.037
BITNET	0.042	0.175	0.432	0.256
NSFNET	0.443	1.056	1.186	0.130

No sign flips on any pre-trend. USENET is essentially unchanged (the R1 STEM USENET sample was already duplicate-light). BITNET shifts modestly in both specifications: binary year-5 drops 0.151 → 0.106 (~30%); continuous year-5 drops 0.43 → 0.26 (~40%). The slow-ramp shape and significance pattern are intact in both. ARPANET shows a uniform slight downward shift consistent with a small denominator effect. NSFNET attenuates the most: the previously positive (but already noisy) NSFNET year-5 coefficients drop to near zero. NSFNET is the network we already flagged as the federally-subsidized tail (rather than the full American research internet; see docs/treatment_implementation.md §3.8); the continuous fig21 NSFNET SE ≈ 0.65 means neither pre nor post coefficients are statistically distinguishable from zero.
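
The BITNET percentages quoted above follow from the year-5 columns of the two tables. A quick check, with an illustrative helper name:

```python
def relative_drop(pre, post):
    """Fractional decrease from pre to post, relative to pre."""
    return (pre - post) / pre


# BITNET year-5 coefficients from the fig20 and fig21 tables.
binary_drop = relative_drop(0.151, 0.106)
continuous_drop = relative_drop(0.432, 0.256)

print(f"fig20 binary:     {100 * binary_drop:.1f}%")
print(f"fig21 continuous: {100 * continuous_drop:.1f}%")
```

Using the three-decimal table values, the drops come out at roughly 30% and 41%, in line with the rounded figures in the text.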

fig22 (random 1 pair / paper) and fig23 (bilateral only)

Spec	Network	mean |Δcoef|	max |Δcoef|	year-5 (pre)	year-5 (post)
fig22 (random)	ARPANET	0.011	0.029	0.036	0.020
fig22 (random)	USENET	0.002	0.006	0.005	0.011
fig22 (random)	BITNET	0.002	0.007	0.026	0.022
fig22 (random)	NSFNET	0.002	0.006	-0.067	-0.068
fig23 (bilateral)	ARPANET	0.021	0.044	0.042	0.040
fig23 (bilateral)	USENET	0.003	0.005	0.007	0.003
fig23 (bilateral)	BITNET	0.001	0.005	0.011	0.012
fig23 (bilateral)	NSFNET	0.008	0.021	-0.044	-0.029

Both robustness specifications are dedup-robust: max coefficient shift ≤ 0.044 across 88 (rel_time × network) cells, with most cells shifting < 0.005. Random-1-pair-per-paper already implicitly controls duplicate-paper inflation (each paper contributes to one pair regardless of how many duplicate paperids it has), and bilateral-only papers tend to be cleaner records to begin with — both keep the same conclusions before and after dedup. The body's reliance on these specs as the reviewer-facing robustness panels is therefore conservative.

Browse the clusters

The work-dup explorer shows ~1500 clusters stratified by best tier so you can see the kind of merges each rule produces. For each cluster you get every member PaperID, its title, DOI, doctype, year, cite count, and which paperid was picked as canonical.

Open work-dup explorer

Deduplicating OpenAlex paper IDs

Headline numbers