OpenAlex assigns each paper a single PaperID, but the same real publication often appears under multiple PaperIDs — the journal version vs. an indexed preprint, two near-identical crossref records, MAG/PubMed double-coverage, or a year-typo on a reissue. We built a multi-tier duplicate-detection pipeline that collapses … conflicting OpenAlex PaperIDs into … deduplicated clusters across the full 1900–2024 publication record, removing … excess IDs.
Headline numbers
The pipeline runs as 125 independent per-year sub-pipelines plus a single global stitcher. Per-year clusters that share a paperid (because adjacent-year title slices were unioned in to catch year-drift duplicates) are merged at the stitch step.
How candidate pairs are generated
A duplicate pair must first be a candidate. We use seven
blocking signals; the same pair may fire under multiple. All
blocking is restricted to ``block_size ≤ 200`` to keep memory
tractable on modern years where stub titles like
"powered by nict" (34,815 papers in 2017 alone)
otherwise blow up.
- T0 same DOI — same normalized DOI in the same year. Strongest signal; near-perfect precision.
- T1 same PMID, T2 same PMCID, T3 same MAG ID — three identifier-equality signals in the same year. T1 / T2 / T3 are tiny in practice; OpenAlex already merges them at ingest, but the block catches the residual slipthrough.
- T4 same title-key — same Unicode-normalised, whitespace-collapsed, lowercase title with token count ≥ 3 and length > 5, in the same year. Title-key blocks of size ≤ 3 become T4a (high precision); blocks of size 4–10 become T4b (allowed iff neither title is in the 84-key frontmatter blocklist).
- T5 shared author + fuzzy title — within the same year, two papers share at least one OpenAlex AuthorID and have token_set_ratio ≥ 95 on titles, plus length-ratio and token-count guards.
- T6 DOI suffix variant — DOIs are equal except for a trailing single-letter or punctuation+letter suffix (e.g. 10.1000/abc vs 10.1000/abc-a); common Crossref publishing pattern.
- T7 cross-year same DOI — same normalized DOI within year_diff ≤ 2; catches reissue / preprint year-drift.
- T9 cross-year shared author + fuzzy title — shared author + token_set_ratio ≥ 95 at year_diff = 1 or token_set_ratio ≥ 98 at year_diff ∈ {2, 3}; window ±3 years (widened from ±1 on 2026-05-09 after a 50-pair API spot-check showed 98% precision at the wider window). Catches the same paper indexed once as preprint, once as published version, with different paperids and a 1–3 year drift.
Author IDs are pre-deduplicated. T5 and T9 do shared-author matching on the cleaned AuthorID space coming out of the author-deduplication pipeline (see Author dup. overview), so two papers whose authors live under different OpenAlex AuthorIDs but the same canonical author still register as a shared-author match.
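The T4 title-key block described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the function names are hypothetical, NFKC is an assumed normalisation form (the text only says "Unicode-normalised"), and the frontmatter blocklist check for T4b is left out.

```python
import re
import unicodedata
from collections import defaultdict
from itertools import combinations

BLOCK_SIZE_CAP = 200  # from the text: larger blocks are skipped entirely


def title_key(title):
    """Return a normalised title key, or None if the title is too short.

    NFKC is an assumption; the text does not name the normalisation form.
    """
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"\s+", " ", text).strip()
    # Guards from the text: token count >= 3 and length > 5.
    if len(text.split(" ")) < 3 or len(text) <= 5:
        return None
    return text


def t4_candidates(papers):
    """Yield (paperid_a, paperid_b, sub_tier) for same-year, same-key pairs.

    `papers` is an iterable of (paperid, year, title) tuples.
    """
    blocks = defaultdict(list)
    for paperid, year, title in papers:
        key = title_key(title)
        if key is not None:
            blocks[(year, key)].append(paperid)

    for members in blocks.values():
        size = len(members)
        if size < 2 or size > BLOCK_SIZE_CAP:
            continue
        if size <= 3:
            sub_tier = "T4a"
        elif size <= 10:
            sub_tier = "T4b"  # would still need the frontmatter blocklist check
        else:
            sub_tier = None  # candidate only; no T4 sub-tier covers this size
        for a, b in combinations(sorted(members), 2):
            yield a, b, sub_tier
```

Blocking on the exact key keeps pair generation linear in the number of papers, with quadratic growth only inside each block, which is why the size cap matters for stub titles.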
The acceptance tiers
Pairs are evaluated in priority order; the first tier that fires
becomes the pair’s accepted_tier. A lower priority number
means stronger evidence. A cluster’s best tier is the tier with
the lowest priority number among its pairs.
| Priority | Tier | Rule (key conditions) |
|---|---|---|
| 0 | T0_same_doi | Same normalized DOI, same year — the strongest evidence we have. |
| 1 | T1_same_pmid | Same PubMed ID, same year. Tiny in practice (OpenAlex already merges these at ingest); tier exists to catch residual slipthrough. |
| 2 | T2_same_pmcid | Same PubMed Central ID, same year. Same caveat as T1. |
| 3 | T3_same_mag | Same Microsoft Academic Graph ID, same year. Almost never fires (MAG IDs are mostly null in OpenAlex). |
| 4 | T7_cross_year_same_doi | Same normalized DOI across year_diff ≤ 2. Catches reissue and preprint-vs-published year drift. |
| 5 | T6_doi_suffix_variant | DOIs equal except for trailing single-letter / punct suffix; Crossref publishing pattern. |
| 6 | T4_same_title | Same Unicode-normalised title key, same year, length > 5, token count ≥ 3, not in 84-key frontmatter blocklist. Step 07 internally splits this on per-pair block size: T4a_same_title_small_block (block size ≤ 3) vs T4b_same_title_medium_block (block size 4–10). The cluster explorer collapses both into the one T4 class because per-pair block_size is not preserved in the global crosswalk. |
| 7 | T5_shared_author_fuzzy_title | Same year + ≥ 1 shared OpenAlex AuthorID + token_set_ratio ≥ 95 on titles + token / length guards. |
| 8 | T9_cross_year_shared_author | Same as T5, but spanning years (window ±3; widened from ±1 on 2026-05-09 after spot-check validated 98% precision). Title threshold is tsr ≥ 95 at year_diff = 1, stricter tsr ≥ 98 at year_diff ∈ {2, 3} to keep cross-year FPs from creeping in. |
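The priority logic in the table can be sketched as below. The tier names and their order come from the table; the function names are hypothetical, the title similarity score is taken as a precomputed input, and the token / length guards are omitted.

```python
# Priority order copied from the table above; lower index = stronger evidence.
TIER_PRIORITY = [
    "T0_same_doi",
    "T1_same_pmid",
    "T2_same_pmcid",
    "T3_same_mag",
    "T7_cross_year_same_doi",
    "T6_doi_suffix_variant",
    "T4_same_title",
    "T5_shared_author_fuzzy_title",
    "T9_cross_year_shared_author",
]
RANK = {tier: i for i, tier in enumerate(TIER_PRIORITY)}


def fuzzy_title_ok(tsr, year_diff):
    """T5/T9 title threshold: tsr >= 95 up to one year apart, >= 98 at 2-3 years."""
    year_diff = abs(year_diff)
    if year_diff <= 1:
        return tsr >= 95
    if year_diff <= 3:
        return tsr >= 98
    return False  # outside the +/-3 year window


def accepted_tier(fired_tiers):
    """Return the strongest tier that fired for a pair, or None if none did."""
    if not fired_tiers:
        return None
    return min(fired_tiers, key=RANK.__getitem__)


def cluster_best_tier(pair_tiers):
    """A cluster's best tier is the strongest accepted tier among its pairs."""
    return min(pair_tiers, key=RANK.__getitem__)
```

Note that T7 and T6 rank ahead of T4 even though their labels carry larger numbers, so the ordering has to come from an explicit list rather than from the tier name.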
Connected components on the accepted-pair graph form clusters
(DuckDB label-propagation, ~10 min on 16.7M nodes / 16.8M edges).
Each cluster’s canonical PaperID is the member with the
richest metadata (non-null title > non-null DOI > doctype
article|journal-article > highest cite count >
most references > lowest paperid).
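A small in-memory sketch of the two steps above, assuming the label propagation is the usual minimum-label variant (the text does not say which variant the DuckDB job uses). The dictionary field names are hypothetical.

```python
def label_propagation(edges):
    """Min-label propagation: every node ends up with the smallest id in its component."""
    edges = list(edges)
    labels = {}
    for a, b in edges:
        labels.setdefault(a, a)
        labels.setdefault(b, b)
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            if labels[a] != low:
                labels[a] = low
                changed = True
            if labels[b] != low:
                labels[b] = low
                changed = True
    return labels


def canonical_key(paper):
    """Sort key for the preference order in the text; the smallest key wins."""
    return (
        paper["title"] is None,                                   # non-null title first
        paper["doi"] is None,                                     # then non-null DOI
        paper["doctype"] not in ("article", "journal-article"),   # then article doctypes
        -paper["cite_count"],                                     # then highest cite count
        -paper["n_refs"],                                         # then most references
        paper["paperid"],                                         # then lowest paperid
    )


def pick_canonical(members):
    """Return the paperid of the member with the richest metadata."""
    return min(members, key=canonical_key)["paperid"]
```

Encoding the preference order as a tuple makes the tie-breaking explicit: each later criterion only matters when all earlier ones are equal.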
Pipeline structure
Implementation lives at build/work_cleaning/; each
step is one numbered Python script.
- 01_attach_titles.py — per-year scan of the 114 GB papers.parquet + paper_text.parquet → year_{Y}/papers_with_titles.parquet (paper-level metadata + normalized title key).
- 02_candidates_t0_t4.py — T0 / T1 / T2 / T3 / T4 same-year blocking. --block-size-cap 200 prevents stub-title runaway (an early version segfaulted at 710M same-title pairs in year 2017 from 7 stub-title blocks alone).
- 03_features.py — per-pair feature engineering: shared authors, shared institutions, title similarity, DOI prefix match, year gap, doctype agreement, cite-count overlap.
- 04_candidates_t5_t6.py — T5 (shared author + fuzzy title same year) and T6 (DOI suffix variant) candidates, deduplicated against the T0/T4 pool.
- 05_candidates_t7.py — T7 cross-year same-DOI pairs. Reads adjacent-year title files when present.
- 06_candidates_t9.py — T9 cross-year shared-author + fuzzy-title pairs. Anchored on the focal year; skips with an empty-schema parquet when the year is outside the 1940-2000 imputed-edges coverage.
- 07_tier_rules.py — applies tier acceptance rules + 84-key frontmatter blocklist; runs DuckDB label-propagation per year; picks canonical paperid; writes year_{Y}/crosswalk.parquet and canonical_papers.parquet.
- 08_stitch_years.py — concatenates the 125 per-year crosswalks, runs label-propagation a second time on the union graph (paperids that appear in adjacent-year slices because of the T7/T9 cross-year window), re-picks canonical against unified metadata, writes global/crosswalk.parquet and global/canonical_papers.parquet.
The whole pipeline runs as a SLURM array on Sherlock: one task per year (steps 02–07), then one final stitcher job. Total wall time on 1900–2024: ~3 hours of array compute + ~10 min stitch.
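The stitch step can be pictured with a union-find structure instead of label propagation; this is a swapped-in technique chosen for brevity, not what 08_stitch_years.py is stated to do. The sketch assumes each per-year crosswalk is a list of (paperid, canonical_paperid) rows and skips the re-pick of the canonical member against unified metadata, labelling each global cluster by its smallest paperid instead.

```python
def stitch(crosswalks):
    """Merge per-year clusters that share a paperid.

    `crosswalks` is an iterable of per-year lists of (paperid, canonical_paperid).
    Returns {paperid: global_label}, where the label is the smallest paperid
    in the merged cluster (a placeholder for the real canonical re-pick).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low  # keep the smaller id as the root

    for rows in crosswalks:
        for paperid, canonical in rows:
            union(paperid, canonical)

    return {paperid: find(paperid) for paperid in list(parent)}
```

Because every row links a member to its per-year canonical, two per-year clusters end up in the same global cluster as soon as one paperid appears in both.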
What kinds of clusters each tier catches
Below are real example clusters from the 1900–2024 run, one block per tier in priority order. The work-dup explorer has many more.
Stats by tier
How accepted pairs and clusters break down by tier. A cluster’s tier is that of its strongest edge (the tier with the lowest priority number among its pairs).
Per-tier precision: 50-pair API spot-check
To get an honest read on each tier's false-positive rate, we
drew 50 accepted pairs per tier (30 for T1, 0 available for T3)
stratified across years 1940–2024 in 5-year steps, fetched both
works from the live OpenAlex API, and assigned a verdict
(true_dup, false_positive, or ambiguous)
via build/work_cleaning/spot_check_api.py:assess().
Translation-aware verdict. The naive verdict
flagged every pair whose API titles diverged
(token_set_ratio < 95) as a false positive.
That badly underestimated T1/T2 precision because PubMed
routinely indexes the English + Spanish (etc.) versions of a
paper under one PMID/PMCID. The refined rule promotes a
diverged-title pair to true_dup when the merge tier is
identifier-decisive (T0/T1/T2/T3/T7) and the API shows
≥ 1 shared author — the canonical fingerprint of
a translation duplicate.
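The difference between the naive and the refined rule can be sketched as below. This is an illustration only: the function names are hypothetical and do not reproduce spot_check_api.py:assess(), and the ambiguous verdict is left out because the text does not state when it is assigned.

```python
# Tiers the text calls identifier-decisive.
IDENTIFIER_DECISIVE = {
    "T0_same_doi",
    "T1_same_pmid",
    "T2_same_pmcid",
    "T3_same_mag",
    "T7_cross_year_same_doi",
}


def naive_verdict(tsr):
    """Naive rule: diverged API titles (token_set_ratio < 95) count as a false positive."""
    return "true_dup" if tsr >= 95 else "false_positive"


def refined_verdict(tier, tsr, n_shared_authors):
    """Translation-aware rule: promote diverged-title pairs merged on an identifier."""
    if tsr >= 95:
        return "true_dup"
    if tier in IDENTIFIER_DECISIVE and n_shared_authors >= 1:
        return "true_dup"  # likely the same paper indexed in two languages
    return "false_positive"
```

A pair merged under T1 with titles in two languages and one shared author would be a false positive under the naive rule and a true duplicate under the refined one; the same pair under T4 stays a false positive.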
| Tier | n sampled | strict | loose | accepted edges (% of total) |
|---|---|---|---|---|
| T0_same_doi | 50 | 96.0% | 96.0% | 27,010 (0.22%) |
| T1_same_pmid | 30 | 76.7% | 76.7% | 12 (0.00%) |
| T2_same_pmcid | 50 | 58.0% | 58.0% | 41 (0.00%) |
| T3_same_mag | 0 | n/a | n/a | 0 |
| T4_same_title | 50 | 38.0% | 54.0% | 11,321,952 (92.8%) |
| T5_shared_author_fuzzy_title | 50 | 66.0% | 66.0% | 535,107 (4.4%) |
| T6_doi_suffix_variant | 50 | 30.0% | 44.0% | 4,593 (0.04%) |
| T7_cross_year_same_doi | 50 | 94.0% | 94.0% | 3,956 (0.03%) |
| T9_cross_year_shared_author | 50 | 98.0% | 98.0% | 307,930 (2.5%) |
Read. T0/T7/T9 (96/94/98%) are production-ready. T5 (66%) is the borderline mid-tier signal that contributes 4.4% of edges. T1/T2 look low at 27%/24% naive but clear 77%/58% after the translation-aware fix; their combined contribution (53 edges out of 12.2 M, 0.0004%) is negligible regardless. T6 (44% loose) is borderline but only 0.04% of edges.
The dominant tier, T4, sits at 38% strict, 54% loose precision. T4 contributes 92.8% of all merges, so its precision is the headline number for the whole pipeline. The current frontmatter blocklist (84 keys), reply-like regex, and block-size cap clear obvious junk; remaining FPs are pairs with a generic medical / scientific title appearing twice in the same year by different authors (e.g. "SURGERY IN DIABETES MELLITUS", "Mental Health Services"). Tightening this further would require an additional shared-author or same-DOI-prefix gate, which would push T4 toward T5's 66% precision but also drop a substantial share of accepted edges. That is the kind of trade-off a referee should weigh in on. We report T4's strict precision as-is in the paper.
Total accepted edges across the run: 12.2 M, collapsing 16.8 M
OpenAlex paperids into 7.6 M canonical clusters (modal cluster
size 2 covers 89% of clusters; max single cluster size 2,190).
The "edges" column above sums to 12,200,601 — matching
work_duplicates_stats.json.
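The edge total and the percentage shares quoted in the table can be re-derived from the per-tier counts with a few lines of arithmetic. The counts are copied from the table; the variable names are illustrative.

```python
# Accepted edges per tier, copied from the spot-check table.
ACCEPTED_EDGES = {
    "T0_same_doi": 27_010,
    "T1_same_pmid": 12,
    "T2_same_pmcid": 41,
    "T3_same_mag": 0,
    "T4_same_title": 11_321_952,
    "T5_shared_author_fuzzy_title": 535_107,
    "T6_doi_suffix_variant": 4_593,
    "T7_cross_year_same_doi": 3_956,
    "T9_cross_year_shared_author": 307_930,
}

total_edges = sum(ACCEPTED_EDGES.values())

# Share of all accepted edges contributed by each tier, in percent.
edge_share_pct = {
    tier: 100.0 * count / total_edges for tier, count in ACCEPTED_EDGES.items()
}

for tier, share in sorted(edge_share_pct.items(), key=lambda kv: -kv[1]):
    print(f"{tier:32s} {ACCEPTED_EDGES[tier]:>12,d} {share:8.4f}%")
print(f"{'total':32s} {total_edges:>12,d}")
```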
Erratum / correction / retraction handling
OpenAlex assigns a separate PaperID to most erratum, correction, retraction, and reply notices. These should NOT be merged with the original paper — they are distinct publications with their own citation graphs.
Step 07 (07_tier_rules.py) rejects any candidate
pair from the title-based tiers (T4, T5, T6, T9) where either
side's title matches a REPLYLIKE_REGEXES pattern
(^erratum, ^errata,
^correction(s| to|:),
^corrigendum,
^retract(ion|ed|ing), ^withdrawn,
^reply (to|on), ^response (to|on),
^comment (on|to), ^addendum,
^rejoinder, etc. — 33 patterns total). The
identifier-decisive tiers (T0/T1/T2/T3/T7) bypass this filter
since same-DOI / PMID / PMCID is a publisher-side merge signal
we want to preserve.
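A minimal sketch of this filter, using only the eleven patterns quoted above (the full list has 33). Case-insensitive matching on the stripped title is an assumption, and reject_pair is a hypothetical name.

```python
import re

# Subset of the reply-like patterns quoted in the text.
REPLYLIKE_PATTERNS = [
    r"^erratum",
    r"^errata",
    r"^correction(s| to|:)",
    r"^corrigendum",
    r"^retract(ion|ed|ing)",
    r"^withdrawn",
    r"^reply (to|on)",
    r"^response (to|on)",
    r"^comment (on|to)",
    r"^addendum",
    r"^rejoinder",
]
# Assumption: patterns are matched case-insensitively.
REPLYLIKE_REGEXES = [re.compile(p, re.IGNORECASE) for p in REPLYLIKE_PATTERNS]

# Tiers the text lists as subject to the filter.
TITLE_BASED_TIERS = {
    "T4_same_title",
    "T5_shared_author_fuzzy_title",
    "T6_doi_suffix_variant",
    "T9_cross_year_shared_author",
}


def is_replylike(title):
    """True if the title starts with an erratum / correction / reply pattern."""
    stripped = title.strip()
    return any(rx.search(stripped) for rx in REPLYLIKE_REGEXES)


def reject_pair(tier, title_a, title_b):
    """Reject filtered-tier pairs where either side looks like a notice."""
    if tier not in TITLE_BASED_TIERS:
        return False  # identifier-decisive tiers bypass the filter
    return is_replylike(title_a) or is_replylike(title_b)
```

The `^correction(s| to|:)` pattern is deliberately narrow: "Correction to: ..." matches, while a title such as "Correction of the Cornrow Hair Transplant" does not.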
Audit (post-2026-05-09 rebuild):
audit_current_output.py finds 132
canonical clusters out of 7.6 M (0.0017%) whose
canonical titles start with erratum / correction / retraction
/ reply patterns. Inspection of the largest of these
(sizes 8, 6, 5, 5, 5, 4, 4, 4...) shows they are
genuinely-titled papers using "correction" in its medical /
practical sense — not erratum notices — e.g. "Correction
of the Cornrow Hair Transplant", "Correction of the
Paralytic Claw-Thumb...", "Correction of an Inborn
Error of Metabolism...". The dedup pipeline correctly
keeps standalone erratum publications separate while merging
amended re-publications (T0 / T7 same DOI) and translations.
Citation aggregation: union, not max
Two members of a cluster can share citers (a citing paper that
lists both versions in its references) or have disjoint citers
(e.g. preprint cited in arXiv-citing literature, journal version
cited in journal-citing literature). For each canonical cluster
we now compute the union of distinct citing
paperids across all members
(union_cited_by) — implemented in step
09_citation_union.py as a single DuckDB pass over
the citation edge table joined to the global crosswalk.
The naive alternative (max_cited_by across cluster
members) under-counts citations whenever the citers are disjoint.
In the 1900–2024 run, union_cited_by recovers 33%
more citers than max_cited_by on average across
multi-member clusters. Both columns are written to
canonical_papers.parquet; union_cited_by
is the default for downstream panels.
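The gap between the two aggregates is easy to see on a toy example. The sketch below works on in-memory Python structures rather than DuckDB, uses hypothetical names, and leaves citing paperids as raw ids (whether 09_citation_union.py also canonicalises the citing side is not stated).

```python
from collections import defaultdict


def aggregate_citations(crosswalk, citations):
    """Compute union_cited_by and max_cited_by per canonical cluster.

    crosswalk: {paperid: canonical_paperid}
    citations: iterable of (citing_paperid, cited_paperid) edges
    """
    citers_by_member = defaultdict(set)
    for citing, cited in citations:
        citers_by_member[cited].add(citing)

    union_citers = defaultdict(set)
    max_count = defaultdict(int)
    for member, canonical in crosswalk.items():
        citers = citers_by_member.get(member, set())
        union_citers[canonical] |= citers
        max_count[canonical] = max(max_count[canonical], len(citers))

    return {
        canonical: {
            "union_cited_by": len(union_citers[canonical]),
            "max_cited_by": max_count[canonical],
        }
        for canonical in union_citers
    }


# Toy data: papers 1 and 2 are duplicates; citer 101 cites both versions.
toy_crosswalk = {1: 1, 2: 1, 3: 3}
toy_citations = [(100, 1), (101, 1), (101, 2), (102, 2), (103, 3)]
toy_result = aggregate_citations(toy_crosswalk, toy_citations)
```

In the toy data the merged cluster has three distinct citers, while the best single member has two; the shared citer 101 is counted once by the union.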
Downstream wiring: canonical-view registry
Rather than rebuild the 114 GB papers.parquet with
canonical IDs baked in, every panel builder
(build_authors, build_pairs,
build_institutions, precompute_aux,
build_exposure) now opens its DuckDB connection
through bitnet.work_dedup.register_canonical_views(),
which exposes views like pa_canon and
sci_canon that apply the work crosswalk and the
author crosswalk on the fly:
```sql
CREATE OR REPLACE VIEW pa_canon AS
SELECT
    COALESCE(work_xw.canonical_paperid, raw.paperid) AS paperid,
    COALESCE(auth_xw.canonical_authorid, raw.authorid) AS authorid,
    raw.year, raw.field, raw.affiliation_id, raw....
FROM read_parquet('paper_author_imputed.parquet') raw
LEFT JOIN read_parquet('crosswalk.parquet') work_xw
    ON raw.paperid = work_xw.paperid
LEFT JOIN read_parquet('author_crosswalk.parquet') auth_xw
    ON raw.authorid = auth_xw.authorid;
```
This means downstream panels see the deduplicated paper-author
edges without any pre-materialization step, and the same
crosswalks are applied to the citation graph
(paper_references_canon) and to SciSciNet metadata
(sci_canon). The toggle is governed by the
environment variable BITNET_WORK_DEDUP=1 (set in
every Sherlock SLURM script that rebuilds panels).
Canonical clean datasets (SciNET fields, default-on)
As of 2026-05-10 the panels also read paper- and author-level
field labels from the SciNET classifier rather
than the legacy
OpenAlex.primary_topic_id → topic_mapping.csv
join. The single source of truth lives at
${OPENALEX}/data/clean/ and is built once by
build/upstream/build_canonical_clean_datasets.py
(a single DuckDB pass, ~3 min). Eight files land there (seven
parquets and one CSV lookup), including copies of the two crosswalks:
- paper_field.parquet — canonical_paperid × top-3 SciNET fields with renormalized probabilities (14.2 M papers).
- paper_subfield.parquet — canonical_paperid × top-5 SciNET subfields.
- author_year_institution.parquet — canonical_authorid × year × institution_id with shares + modal flag.
- author_field.parquet — career-level top-3 SciNET fields per canonical author, plus an is_confident boolean.
- author_subfield.parquet — career-level top-5 subfields.
- field_codes_to_names.csv — SciNET 4-letter code ↔ OpenAlex long name lookup.
- paper_crosswalk.parquet, author_crosswalk.parquet — copies of the upstream dedup crosswalks for at-rest provenance.
The author confidence flag is a two-criterion threshold:
is_confident = (field_p1 ≥ 0.7) AND (n_papers ≥ 10).
Thresholds live in bitnet/clean.py as
FIELD_CONFIDENCE_THRESHOLD and MIN_PAPERS;
changing the constant and re-running the canonical builder
re-derives the boolean everywhere. With (0.7, 10), 408 K of the
4.18 M canonical authors (9.8%) are flagged confident —
heavily weighted toward authors with 10+ SciNET papers.
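The confidence rule, together with the top-k renormalisation mentioned for the field parquets, might look like the sketch below. The two constants and their values come from the text; the function names are hypothetical, and dividing by the sum of the retained probabilities is an assumed reading of "renormalized".

```python
FIELD_CONFIDENCE_THRESHOLD = 0.7  # value from the text
MIN_PAPERS = 10                   # value from the text


def top_k_renormalized(probs, k=3):
    """Keep the k most probable field codes and rescale them to sum to 1.

    Assumption: renormalisation divides by the sum of the retained probabilities.
    Ties are broken alphabetically by code so the result is deterministic.
    """
    top = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    total = sum(p for _, p in top)
    if total <= 0:
        return []
    return [(code, p / total) for code, p in top]


def is_confident(field_p1, n_papers):
    """Two-criterion flag: top-field share and paper count must both clear the bar."""
    return field_p1 >= FIELD_CONFIDENCE_THRESHOLD and n_papers >= MIN_PAPERS
```

Because both criteria must hold, an author with a very clean field profile but only a handful of papers is still not flagged.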
SciNET ↔ OpenAlex field-name mapping
SciNET classifies papers into 30 four-letter field codes; OpenAlex
uses 26 long-name fields. Most STEM fields map 1:1
(COMP → Computer Science,
PHYS → Physics and Astronomy,
MATH → Mathematics,
ENGG → Engineering,
CHEM → Chemistry,
MATR → Materials Science). The collapses
happen on the non-STEM side (six SciNET arts/humanities codes
→ OpenAlex "Arts and Humanities"; six SciNET social codes
→ "Social Sciences").
Known edge case: OpenAlex has a separate
Chemical Engineering field (~13 topics in
topic_mapping.csv); SciNET does not. ChemE papers
land under CHEM or ENGG based on title
content. The body's STEM_FIELDS bundle includes
Chemical Engineering for BitNet/Usenet/NSFNET, so flipping
BITNET_CLEAN_FIELDS=1 shifts those papers into
chemistry / engineering. The before/after numerics in the next
section quantify the resulting movement on fig20-23.
Toggles
| Env var | Default | Effect |
|---|---|---|
| BITNET_CLEAN_FIELDS | on if parquets exist | Switches all six panel builders to SciNET field labels. |
| BITNET_AUTHOR_CONFIDENT_ONLY | 0 | Restricts the author panel to is_confident = TRUE. |
| BITNET_CLEAN_DIR | auto-detect | Override the location of the clean parquets. |
Setting BITNET_CLEAN_FIELDS=0 falls back to the
legacy topic_mapping.csv join byte-for-byte, so
referee questions about the field-source can be answered with
a direct A/B comparison.
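One way to read the defaults in the table is sketched below. The environment variable names come from the table; the function names and the exact parsing ("1" means on, "0" means off, an explicit value overrides the default) are assumptions rather than the behaviour of bitnet/clean.py.

```python
def clean_fields_enabled(env, parquets_exist):
    """Resolve BITNET_CLEAN_FIELDS: explicit value wins, else on if parquets exist."""
    raw = env.get("BITNET_CLEAN_FIELDS")
    if raw is None:
        return parquets_exist
    return raw == "1"


def confident_only(env):
    """Resolve BITNET_AUTHOR_CONFIDENT_ONLY, which defaults to 0 (off)."""
    return env.get("BITNET_AUTHOR_CONFIDENT_ONLY", "0") == "1"
```

In a script these would typically be called with `os.environ` as `env`; passing the mapping explicitly keeps the logic easy to test.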
SciNET vs legacy field-source: fig20-23 before/after
Pre-rebuild snapshot at
data/intermediate/panels_pre_scinet/;
post-rebuild panels at
data/intermediate/panels/. Both fits use identical
panel construction and estimator; the only thing that differs
is the field label source. Per-event-time tables at
results/scinet_before_after/fig{20,21,22,23}_pre_post_estimates.md.
| fig | network | mean \|Δcoef\| | max \|Δcoef\| | year-5 legacy | year-5 SciNET |
|---|---|---|---|---|---|
| fig20 (binary) | arpanet | 0.030 | 0.062 | -0.039 | 0.023 |
| fig20 (binary) | usenet | 0.007 | 0.022 | -0.007 | 0.015 |
| fig20 (binary) | bitnet | 0.020 | 0.050 | 0.106 | 0.129 |
| fig20 (binary) | nsfnet | 0.027 | 0.064 | -0.002 | 0.052 |
| fig21 (Ejoin) | arpanet | 0.086 | 0.174 | -0.114 | 0.060 |
| fig21 (Ejoin) | usenet | 0.024 | 0.079 | -0.037 | 0.043 |
| fig21 (Ejoin) | bitnet | 0.032 | 0.062 | 0.256 | 0.226 |
| fig21 (Ejoin) | nsfnet | 0.245 | 0.670 | 0.130 | 0.535 |
| fig22 (random pair) | arpanet | 0.011 | 0.029 | 0.020 | 0.049 |
| fig22 (random pair) | usenet | 0.003 | 0.007 | 0.011 | 0.015 |
| fig22 (random pair) | bitnet | 0.004 | 0.016 | 0.022 | 0.038 |
| fig22 (random pair) | nsfnet | 0.020 | 0.111 | -0.068 | 0.043 |
| fig23 (bilateral) | arpanet | 0.027 | 0.056 | 0.040 | 0.043 |
| fig23 (bilateral) | usenet | 0.004 | 0.010 | 0.003 | 0.012 |
| fig23 (bilateral) | bitnet | 0.004 | 0.016 | 0.011 | 0.027 |
| fig23 (bilateral) | nsfnet | 0.013 | 0.035 | -0.029 | -0.016 |
BitNet (the headline network) shifts modestly: $\beta_5$ goes from $0.106$ to $0.129$ on fig20 and from $0.256$ to $0.226$ on fig21. Pre-trends remain flat in fig20, fig22, and fig23 across all four networks. At year 5, the table shows the point estimate changing sign in six cells where the legacy coefficient was negative: ARPANET, Usenet, and NSFNET on fig20, ARPANET and Usenet on fig21, and NSFNET on fig22; fig23 has none. The NSFNET fig21 swing is dominated by small-sample noise (federal-tail universe, late-cohort exclusions); fig20/22/23 move much less on the same network. ARPANET fig21 also wiggles (small CS-only frame). Usenet moves least across all four figures.
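The mean and max |Δcoef| columns are summary statistics over event-time coefficients. A sketch of how such a summary could be computed is below; the function name is hypothetical, and in the example only the year-5 values (0.106 and 0.129, BitNet fig20) come from the table, while the other event times are made-up placeholders.

```python
def coef_shift(before, after):
    """Summarise the absolute change in coefficients over shared event times.

    before, after: {rel_time: coefficient}
    """
    common = sorted(set(before) & set(after))
    if not common:
        raise ValueError("no shared event times to compare")
    diffs = [abs(after[t] - before[t]) for t in common]
    return {
        "mean_abs": sum(diffs) / len(diffs),
        "max_abs": max(diffs),
        "n_cells": len(diffs),
    }


# Illustrative input: year 5 is from the table, the rest is invented.
example_before = {-1: 0.000, 0: 0.100, 5: 0.106}
example_after = {-1: 0.010, 0: 0.080, 5: 0.129}
example_shift = coef_shift(example_before, example_after)
```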
Effect on published estimates (R1, primary fields)
We re-fit the two headline event studies on both pre-dedup and
post-dedup panels for all four networks: fig20
(binary treatment) and fig21 (continuous
E_join dose). Same R1 frame, same primary-field
bundle, same residualization on institution-size×year FE.
Computed by analysis/dedup_numeric_diff.py;
side-by-side PNGs in
results/dedup_before_after/<net>/.
fig20 (binary)
| Network | mean \|Δcoef\| | max \|Δcoef\| | year-5 (pre) | year-5 (post) |
|---|---|---|---|---|
| ARPANET | 0.027 | 0.042 | -0.014 | -0.039 |
| USENET | 0.003 | 0.008 | -0.004 | -0.007 |
| BITNET | 0.016 | 0.045 | 0.151 | 0.106 |
| NSFNET | 0.032 | 0.088 | 0.086 | -0.002 |
fig21 (continuous E_join)
| Network | mean \|Δcoef\| | max \|Δcoef\| | year-5 (pre) | year-5 (post) |
|---|---|---|---|---|
| ARPANET | 0.079 | 0.131 | -0.013 | -0.114 |
| USENET | 0.013 | 0.032 | -0.009 | -0.037 |
| BITNET | 0.042 | 0.175 | 0.432 | 0.256 |
| NSFNET | 0.443 | 1.056 | 1.186 | 0.130 |
No sign flips on any pre-trend. USENET is
essentially unchanged (the R1 STEM USENET sample was already
duplicate-light). BITNET shifts modestly in both
specifications: binary year-5 drops 0.151 → 0.106 (~30%);
continuous year-5 drops 0.43 → 0.26 (~40%). The slow-ramp
shape and significance pattern are intact in both. ARPANET
shows a uniform slight downward shift consistent with a small
denominator effect.
NSFNET attenuates the most: the previously-
positive (but already noisy) NSFNET year-5 coefficients drop
to near zero. NSFNET is the network we already flagged as the
federally-subsidized tail (rather than the full American
research internet — see
docs/treatment_implementation.md §3.8); the
continuous fig21 NSFNET SE ≈ 0.65 means neither pre nor post
coefficients are statistically distinguishable from zero.
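As a quick check on the magnitudes quoted above: the BITNET year-5 relative drops follow directly from the table values,

$$
\frac{0.151 - 0.106}{0.151} \approx 0.30, \qquad \frac{0.432 - 0.256}{0.432} \approx 0.41 .
$$

For NSFNET fig21, assuming the SE of about 0.65 applies to both the pre- and post-dedup year-5 coefficients and taking the conventional two-sided 5% normal critical value of 1.96 (an assumption; the text does not name the test), the ratios of coefficient to standard error are

$$
\frac{1.186}{0.65} \approx 1.82 < 1.96, \qquad \frac{0.130}{0.65} = 0.20 < 1.96 ,
$$

so neither coefficient clears the threshold under these assumptions.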
fig22 (random 1 pair / paper) and fig23 (bilateral only)
| Spec | Network | mean \|Δcoef\| | max \|Δcoef\| | year-5 (pre) | year-5 (post) |
|---|---|---|---|---|---|
| fig22 (random) | ARPANET | 0.011 | 0.029 | 0.036 | 0.020 |
| fig22 (random) | USENET | 0.002 | 0.006 | 0.005 | 0.011 |
| fig22 (random) | BITNET | 0.002 | 0.007 | 0.026 | 0.022 |
| fig22 (random) | NSFNET | 0.002 | 0.006 | -0.067 | -0.068 |
| fig23 (bilateral) | ARPANET | 0.021 | 0.044 | 0.042 | 0.040 |
| fig23 (bilateral) | USENET | 0.003 | 0.005 | 0.007 | 0.003 |
| fig23 (bilateral) | BITNET | 0.001 | 0.005 | 0.011 | 0.012 |
| fig23 (bilateral) | NSFNET | 0.008 | 0.021 | -0.044 | -0.029 |
Both robustness specifications are dedup-robust: max coefficient shift ≤ 0.044 across 88 (rel_time × network) cells, with most cells shifting < 0.005. Random-1-pair-per-paper already implicitly controls duplicate-paper inflation (each paper contributes to one pair regardless of how many duplicate paperids it has), and bilateral-only papers tend to be cleaner records to begin with — both keep the same conclusions before and after dedup. The body's reliance on these specs as the reviewer-facing robustness panels is therefore conservative.
Browse the clusters
The work-dup explorer shows ~1500 clusters stratified by best tier so you can see the kind of merges each rule produces. For each cluster you get every member PaperID, its title, DOI, doctype, year, cite count, and which paperid was picked as canonical.