OpenAlex assigns each author a single AuthorID, but the same real person often appears under multiple AuthorIDs — especially when their published name varies (initials vs. full name, capitalisation, accents) or when a recent stub paper gets a fresh identifier. We built a duplicate-detection classifier that reduced … conflicting author IDs into … deduplicated clusters, collapsing … excess IDs.
How well does it work?
We hand-labelled 51 same-person duplicate pairs by sampling random OpenAlex authors with at least 10 works (1970s + 1980s) and reading OpenAlex's name-search results. The classifier catches 46 of those 51 positives (90.2%).
Pair-level recall
On the 51-pair manual baseline.
The previous version restricted candidate generation to authors active in a single focal year; 0 of 51 manual-baseline positives had both authors in that slice. We broadened the universe to every OpenAlex author whose Unicode-normalised name matches a name active in 1940-2000.
Pair-level precision
Stratified 508-pair OpenAlex API spot-check.
Confirmed false-positive rate is ~3.7% in the per-pair sample, ~6.4% volume-weighted (T6 dominates volume and has the lowest precision). Most residual FPs are different-era same-name pairs where neither author shared a paper, coauthor, or institution.
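The gap between the per-pair and the volume-weighted figure comes from how the strata are combined. The sketch below illustrates the arithmetic only: the function name, the stratum labels and all numbers are made up for illustration and are not the run's actual counts.

```python
def false_positive_rates(strata):
    """Return (per_pair_rate, volume_weighted_rate).

    strata maps a tier label to a tuple
    (pairs_sampled, false_positives_found, accepted_pairs_in_tier).
    """
    sampled = sum(s for s, _, _ in strata.values())
    false_pos = sum(fp for _, fp, _ in strata.values())
    # Per-pair rate: every sampled pair counts equally.
    per_pair = false_pos / sampled

    total_volume = sum(n for _, _, n in strata.values())
    # Volume-weighted rate: each tier's sample rate is weighted by the
    # share of all accepted pairs that the tier contributes.
    weighted = sum((fp / s) * (n / total_volume) for s, fp, n in strata.values())
    return per_pair, weighted


# Illustrative numbers only: a clean small tier and a noisier large tier.
toy = {
    "tier_a": (100, 0, 1_000),
    "tier_b": (100, 10, 9_000),
}
per_pair, weighted = false_positive_rates(toy)
print(f"per-pair: {per_pair:.3f}, volume-weighted: {weighted:.3f}")
```

When the tier with the lowest precision also contributes the most pairs, the weighted rate ends up above the per-pair rate, which is the pattern reported above.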
How candidate pairs are generated
A duplicate pair must first be a candidate. Candidate generation finds pairs of OpenAlex AuthorIDs that share the same Unicode-normalised display name (lowercase, accents stripped, punctuation collapsed to spaces; Cyrillic and CJK preserved). Two profiles for "Toshio Tomimura" and "Toshio TOMIMURA" land in the same name bucket; "Smith J." and "John Smith" do not.
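A minimal sketch of this kind of normalisation, using only Python's standard library. The function name `normalise_name` and the exact rules (accents are stripped from Latin letters only, every non-alphanumeric character becomes a space) are assumptions for illustration; the pipeline's actual implementation may differ.

```python
import unicodedata


def normalise_name(display_name: str) -> str:
    """Lowercase, strip Latin accents, collapse punctuation to single spaces."""
    kept = []
    prev_is_latin = False
    # NFD splits accented letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", display_name):
        if unicodedata.combining(ch):
            # Drop marks on Latin letters; keep them on other scripts
            # (assumption) so Cyrillic and kana letters stay intact.
            if not prev_is_latin:
                kept.append(ch)
            continue
        prev_is_latin = "LATIN" in unicodedata.name(ch, "")
        kept.append(ch)
    # Recompose the marks we kept, then lowercase.
    text = unicodedata.normalize("NFC", "".join(kept)).lower()
    # Anything that is not a letter, digit or kept mark becomes a space.
    cleaned = "".join(
        c if (c.isalnum() or unicodedata.combining(c)) else " " for c in text
    )
    return " ".join(cleaned.split())


print(normalise_name("Toshio TOMIMURA") == normalise_name("Toshio Tomimura"))
print(normalise_name("Smith J."), "|", normalise_name("John Smith"))
```

Under these rules the two "Toshio Tomimura" spellings share a bucket key, while "Smith J." and "John Smith" normalise to different strings and are never compared.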
The universe is every OpenAlex author with a non-empty display name whose normalised name also appears in our 1940-2000 imputed-author file. That covers about 14.9 million authors; pair enumeration among same-name buckets up to size 30 yields 31.2 million candidate pairs.
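The bucketing and pair enumeration can be sketched as follows. This assumes the input has already been restricted to the universe described above and that buckets larger than the cap are skipped entirely; `candidate_pairs` and its argument layout are hypothetical names, not the pipeline's own.

```python
from collections import defaultdict
from itertools import combinations

MAX_BUCKET_SIZE = 30  # cap taken from the text above


def candidate_pairs(authors, max_bucket_size=MAX_BUCKET_SIZE):
    """Yield (author_a, author_b, name, bucket_size) for same-name authors.

    authors is an iterable of (author_id, normalised_name) tuples.
    """
    buckets = defaultdict(list)
    for author_id, name in authors:
        if name:  # skip empty names
            buckets[name].append(author_id)

    for name, ids in buckets.items():
        # Assumption: oversized buckets are skipped rather than truncated.
        if 2 <= len(ids) <= max_bucket_size:
            for a, b in combinations(sorted(ids), 2):
                yield a, b, name, len(ids)


authors = [
    ("A1", "toshio tomimura"),
    ("A2", "toshio tomimura"),
    ("A3", "john smith"),
]
print(list(candidate_pairs(authors)))
```

A bucket of size n contributes n(n-1)/2 pairs, so the cap of 30 bounds each bucket at 435 pairs and keeps very common names from dominating the candidate set.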
Per-pair features
For each candidate pair we attach features that distinguish real duplicates from mere homonyms:
- shared_papers — number of OpenAlex PaperIDs that both AuthorIDs appear on. (Two homonyms almost never co-author a real paper; OpenAlex sometimes splits one person across two AuthorIDs on the same paper, which is exactly what we want to merge.)
- shared_coauthors — number of distinct OpenAlex coauthor IDs in both authors' coauthor sets.
- shared_institutions — number of institutions that both authors have been affiliated with at any point.
- same / conflicting ORCID — decisive when present (rare in 1940-2000).
- middle_conflict — true when both middle-token strings exist, are non-equal, and aren't initial-vs-full compatible.
- career year ranges — both from the imputed file (1940-2000 only) and from the full-OpenAlex paper-author×year join (8.28M authors covered).
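The set-based features above can be sketched with plain set intersections. The `AuthorProfile` container, its field names and the `orcid_status` labels are hypothetical, and the middle-name rule is one plausible reading of "initial-vs-full compatible" rather than the pipeline's exact logic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AuthorProfile:
    author_id: str
    paper_ids: set = field(default_factory=set)
    coauthor_ids: set = field(default_factory=set)
    institution_ids: set = field(default_factory=set)
    orcid: Optional[str] = None
    middle: str = ""  # middle-name tokens, e.g. "J" or "John Robert"


def _tokens_compatible(a: str, b: str) -> bool:
    """Equal tokens, or one is a single-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def middle_conflict(mid_a: str, mid_b: str) -> bool:
    tokens_a = mid_a.lower().replace(".", " ").split()
    tokens_b = mid_b.lower().replace(".", " ").split()
    if not tokens_a or not tokens_b:
        return False  # a conflict needs both middle strings to exist
    # Assumption: only overlapping token positions are compared.
    return not all(_tokens_compatible(x, y) for x, y in zip(tokens_a, tokens_b))


def pair_features(a: AuthorProfile, b: AuthorProfile) -> dict:
    if a.orcid and b.orcid:
        orcid_status = "same" if a.orcid == b.orcid else "conflict"
    else:
        orcid_status = "unknown"
    return {
        "shared_papers": len(a.paper_ids & b.paper_ids),
        "shared_coauthors": len(a.coauthor_ids & b.coauthor_ids),
        "shared_institutions": len(a.institution_ids & b.institution_ids),
        "orcid_status": orcid_status,
        "middle_conflict": middle_conflict(a.middle, b.middle),
    }


a = AuthorProfile("A1", {"W1", "W2"}, {"C1", "C2"}, {"I1"}, None, "J")
b = AuthorProfile("A2", {"W2"}, {"C2", "C3"}, {"I1", "I2"}, None, "John")
print(pair_features(a, b))
```

Career year ranges are left out of the sketch because they come from separate joins rather than from per-author sets.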
The acceptance tiers
Pairs are evaluated against tiered rules in priority order; the first match becomes the pair's merge_tier. An ORCID conflict overrides everything (1.31M rejected). Each tier corresponds to a different precision/recall trade-off: T0–T3 are evidence-based with near-perfect precision; T4–T7 fall back to name shape with safeguards.
| Tier | Rule (key conditions) | Precision (sample) |
|---|---|---|
| T0_same_orcid | Same non-empty ORCID. | ~100% |
| T1_shared_paper | ≥1 paper they both appear on. | ~100% |
| T2_shared_coauthor | ≥1 shared coauthor + ≥2 name tokens + non-common surname. | ~100% |
| T2b | ≥2 shared coauthors (common surnames OK). | ~100% |
| T3_two_shared_institutions | ≥2 shared institutions + ≥2 tokens + non-common surname. | ~100% |
| T3_shared_institution_small_bucket | 1 shared institution + ≥2 tokens + bucket ≤7. | ~98% |
| T4_non_latin_small_bucket | Both display_names non-Latin (Cyrillic / CJK) + bucket ≤5. | ~88% |
| T5_singleton_pair_rare_name | bucket = 2 + ≥2 tokens + non-common surname + minw ≤50. | ~90% |
| T6_tiny_stub_absorption | min_works ≤2 + ≥2 tokens + non-common surname + bucket ≤15. | ~85% |
| T7a_three_token_small_bucket | name_token_count ≥3 + bucket ≤7. | ~94% |
| T7b_two_token_tiny_bucket | name_token_count = 2 + bucket ≤4. | ~90% |
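The first-match-wins evaluation can be sketched as an ordered rule list. This covers only the key conditions listed in the table, so safeguards that are not spelled out there (such as middle_conflict) are omitted. The feature-dict keys, the `REJECT_orcid_conflict` label and the reading of "1 shared institution" as "at least one" are all assumptions.

```python
def assign_tier(f: dict):
    """Return the first matching tier name, a reject label, or None."""
    # An ORCID conflict overrides every other signal.
    if f["orcid_status"] == "conflict":
        return "REJECT_orcid_conflict"  # hypothetical label

    multi_token = f["name_token_count"] >= 2
    rare_surname = not f["common_surname"]
    bucket = f["bucket_size"]

    rules = [  # priority order: the first true condition wins
        ("T0_same_orcid", f["orcid_status"] == "same"),
        ("T1_shared_paper", f["shared_papers"] >= 1),
        ("T2_shared_coauthor",
         f["shared_coauthors"] >= 1 and multi_token and rare_surname),
        ("T2b", f["shared_coauthors"] >= 2),
        ("T3_two_shared_institutions",
         f["shared_institutions"] >= 2 and multi_token and rare_surname),
        # Assumption: "1 shared institution" is read as at least one.
        ("T3_shared_institution_small_bucket",
         f["shared_institutions"] >= 1 and multi_token and bucket <= 7),
        ("T4_non_latin_small_bucket", f["both_non_latin"] and bucket <= 5),
        ("T5_singleton_pair_rare_name",
         bucket == 2 and multi_token and rare_surname and f["min_works"] <= 50),
        ("T6_tiny_stub_absorption",
         f["min_works"] <= 2 and multi_token and rare_surname and bucket <= 15),
        ("T7a_three_token_small_bucket",
         f["name_token_count"] >= 3 and bucket <= 7),
        ("T7b_two_token_tiny_bucket",
         f["name_token_count"] == 2 and bucket <= 4),
    ]
    for tier, matched in rules:
        if matched:
            return tier
    return None  # no rule fired: the pair is not merged


pair = {
    "orcid_status": "unknown", "shared_papers": 0, "shared_coauthors": 0,
    "shared_institutions": 0, "name_token_count": 2, "common_surname": False,
    "bucket_size": 2, "both_non_latin": False, "min_works": 10,
}
print(assign_tier(pair))
```

Keeping the rules in one ordered list makes the priority explicit and lets a tier be added or reordered without touching the evaluation loop.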
Connected components on the accepted-pair graph form clusters. Each cluster's canonical AuthorID is the member with the most papers. The duplicate explorer lets you browse clusters stratified by tier and size.
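One common way to compute such components is a union-find (disjoint-set) structure, sketched below. Whether the pipeline uses union-find or a graph library is not stated, and breaking ties on works count by the smallest AuthorID is an assumption; `build_clusters` is a hypothetical name.

```python
from collections import defaultdict


def build_clusters(accepted_pairs, works_count):
    """Group AuthorIDs into connected components.

    accepted_pairs: iterable of (author_a, author_b) accepted merges.
    works_count: dict mapping AuthorID to number of papers.
    Returns {canonical_author_id: sorted list of member AuthorIDs}.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in accepted_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)

    clusters = {}
    for members in groups.values():
        members.sort()
        # Canonical ID: the member with the most papers. max() returns the
        # first maximum, so ties go to the smallest ID (assumption).
        canonical = max(members, key=lambda m: works_count.get(m, 0))
        clusters[canonical] = members
    return clusters


pairs = [("A1", "A2"), ("A2", "A3"), ("A4", "A5")]
works = {"A1": 3, "A2": 10, "A3": 1, "A4": 2, "A5": 2}
print(build_clusters(pairs, works))
```

Because components are transitive, A1 and A3 land in the same cluster even though no rule accepted that pair directly, so a single false-positive edge can join two otherwise separate people.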
What kinds of pairs each tier catches
Below are real example clusters from the run, one row per tier. (The explorer has many more.)
Stats by tier
How accepted pairs and clusters break down by the tier that triggered the merge (priority order — a cluster's tier is its strongest edge):
Browse the predictions
The duplicate explorer shows ~1500 clusters stratified by tier so you can see the kind of merges each rule makes. For each cluster you get every member AuthorID, their display name, works count, year range from imputed and from full OpenAlex, and last-known institution.