Deduplicating OpenAlex author IDs

OpenAlex aims to give each author a single AuthorID, but the same real person often appears under several AuthorIDs, especially when their published name varies (initials vs. full name, capitalisation, accents) or when a recent stub paper is assigned a fresh identifier. We built a duplicate-detection classifier that groups AuthorIDs referring to the same person into deduplicated clusters, so that the excess IDs in each cluster collapse into one canonical AuthorID.

How well does it work?

We hand-labelled 51 same-person duplicate pairs by sampling random OpenAlex authors with at least 10 works (from the 1970s and 1980s) and reading OpenAlex's name-search results for each. The classifier catches 46 of those 51 positives (90.2%).

Pair-level recall

On the 51-pair manual baseline.

Old "safe" rule (one focal year) 0%
Global classifier (this work) 90%

The previous version restricted candidate generation to authors active in a single focal year, and 0 of the 51 manual-baseline positives had both authors in that slice. We therefore broadened the universe to every OpenAlex author whose Unicode-normalised name matches a name active in 1940–2000.
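Pair-level recall here is simply the share of hand-labelled positive pairs that the classifier also accepts. The sketch below shows that arithmetic on synthetic stand-in IDs; `pair_recall` and the `A…`/`B…` identifiers are illustrative names, not part of the actual pipeline.

```python
def pair_recall(labelled_positives, predicted_pairs):
    """Fraction of hand-labelled duplicate pairs that the classifier also accepts."""
    # Store pairs as frozensets so (a, b) and (b, a) count as the same pair.
    truth = {frozenset(p) for p in labelled_positives}
    predicted = {frozenset(p) for p in predicted_pairs}
    return len(truth & predicted) / len(truth)


# Synthetic stand-in for the baseline: 51 labelled pairs, 46 of them recovered.
labelled = [(f"A{i}", f"B{i}") for i in range(51)]
predicted = [(f"B{i}", f"A{i}") for i in range(46)]  # reversed order on purpose

recall = pair_recall(labelled, predicted)
print(f"{recall:.1%}")  # 90.2%
```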

Pair-level precision

Stratified 508-pair OpenAlex API spot-check.

Strong tiers (T0–T4) ~99%
Name-shape tiers (T5–T7) ~90%
Volume-weighted overall ~94%

The confirmed false-positive rate is ~3.7% in the per-pair sample and ~6.4% volume-weighted (T6 dominates volume and has the lowest precision). Most residual false positives are same-name pairs from different eras in which the two authors share no paper, coauthor, or institution.
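The gap between the per-pair and volume-weighted figures comes from weighting each stratum's sampled precision by the number of accepted pairs it contributes. A minimal sketch of that weighting follows; the two strata and their volumes are invented purely to show the arithmetic and are not the real tier volumes.

```python
def volume_weighted_precision(strata):
    """strata maps a label to (accepted_pair_volume, sampled_precision)."""
    total_volume = sum(volume for volume, _ in strata.values())
    weighted_hits = sum(volume * precision for volume, precision in strata.values())
    return weighted_hits / total_volume


# Hypothetical volumes: the lower-precision stratum carries more pairs,
# so it pulls the overall figure toward its own precision.
toy_strata = {
    "strong": (400, 0.99),
    "name_shape": (600, 0.90),
}

overall = volume_weighted_precision(toy_strata)
false_positive_rate = 1.0 - overall
print(f"precision={overall:.3f} fp_rate={false_positive_rate:.3f}")
```

When a low-precision tier dominates volume, the weighted false-positive rate ends up higher than the rate measured on an evenly stratified sample.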

How candidate pairs are generated

A duplicate pair must first be a candidate. Candidate generation finds pairs of OpenAlex AuthorIDs that share the same Unicode-normalised display name (lowercase, accents stripped, punctuation collapsed to spaces; Cyrillic and CJK preserved). Two profiles for "Toshio Tomimura" and "Toshio TOMIMURA" land in the same name bucket; "Smith J." and "John Smith" do not.
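One plausible reading of that normalisation, sketched with Python's standard `unicodedata` module. The function name `normalise_name` and the choice to strip combining marks only from Latin base letters (so that Cyrillic and CJK characters survive untouched) are assumptions; the actual pipeline may differ in detail.

```python
import re
import unicodedata


def _is_latin(ch):
    # Unicode character names for Latin letters contain the word "LATIN".
    return "LATIN" in unicodedata.name(ch, "")


def normalise_name(display_name):
    """Lowercase, strip accents from Latin letters, collapse punctuation to spaces."""
    decomposed = unicodedata.normalize("NFD", display_name)
    kept = []
    base_is_latin = False
    for ch in decomposed:
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # drop the accent on a Latin base letter
            kept.append(ch)  # keep marks on non-Latin bases (assumption)
        else:
            base_is_latin = _is_latin(ch)
            kept.append(ch)
    text = unicodedata.normalize("NFC", "".join(kept)).lower()
    # Any run of non-word characters (punctuation, whitespace, underscore)
    # becomes a single space.
    return re.sub(r"[\W_]+", " ", text).strip()


print(normalise_name("Toshio TOMIMURA") == normalise_name("Toshio Tomimura"))
print(normalise_name("Smith J.") == normalise_name("John Smith"))
```

The first comparison is true and the second is false, matching the bucket behaviour described above: capitalisation differences collapse, while an initial and a full given name remain distinct strings.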

The universe is every OpenAlex author with a non-empty display name whose normalised name also appears in our 1940-2000 imputed-author file. That covers about 14.9 million authors; pair enumeration among same-name buckets up to size 30 yields 31.2 million candidate pairs.
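Pair enumeration within name buckets can be sketched as follows. A bucket of n profiles yields n(n-1)/2 pairs, so the size cap keeps very common names from dominating the pair count. The function name, the input shape, and the reading of "up to size 30" as "skip larger buckets entirely" are assumptions.

```python
from collections import defaultdict
from itertools import combinations

MAX_BUCKET_SIZE = 30  # cap quoted in the text


def candidate_pairs(authors, max_bucket_size=MAX_BUCKET_SIZE):
    """Yield (name, id_a, id_b) for every pair of IDs sharing a normalised name.

    `authors` is an iterable of (author_id, normalised_name) tuples.
    """
    buckets = defaultdict(list)
    for author_id, name in authors:
        if name:  # ignore empty display names
            buckets[name].append(author_id)
    for name, ids in buckets.items():
        # Assumption: buckets above the cap are skipped, not truncated.
        if 2 <= len(ids) <= max_bucket_size:
            for id_a, id_b in combinations(sorted(ids), 2):
                yield name, id_a, id_b


authors = [
    ("A1", "toshio tomimura"),
    ("A2", "toshio tomimura"),
    ("A3", "john smith"),
    ("A4", "toshio tomimura"),
]
pairs = list(candidate_pairs(authors))
print(len(pairs))  # 3
```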

Per-pair features

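For each candidate pair we attach features that distinguish real duplicates from mere homonyms. The tier rules below draw on ORCIDs, shared papers, shared coauthors, shared institutions, the number of name tokens, whether the surname is common, whether both display names are non-Latin, the size of the name bucket, and min_works.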
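As a rough illustration, several of the features that the tier rules rely on reduce to set intersections over two author profiles. Everything in this sketch is hypothetical: the `AuthorProfile` fields, the `pair_features` function, and the reading of `min_works` as the smaller of the two works counts are assumptions, and the common-surname and non-Latin flags are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AuthorProfile:
    author_id: str
    normalised_name: str
    orcid: Optional[str]
    works_count: int
    works: FrozenSet[str] = frozenset()
    coauthors: FrozenSet[str] = frozenset()
    institutions: FrozenSet[str] = frozenset()


def pair_features(a, b, bucket_size):
    """Compute evidence features for one candidate pair (illustrative only)."""
    return {
        "same_orcid": bool(a.orcid) and a.orcid == b.orcid,
        "orcid_conflict": bool(a.orcid) and bool(b.orcid) and a.orcid != b.orcid,
        "shared_papers": len(a.works & b.works),
        # Exclude the pair itself from the coauthor overlap.
        "shared_coauthors": len((a.coauthors & b.coauthors) - {a.author_id, b.author_id}),
        "shared_institutions": len(a.institutions & b.institutions),
        "name_token_count": len(a.normalised_name.split()),
        "min_works": min(a.works_count, b.works_count),
        "bucket_size": bucket_size,
    }


a = AuthorProfile("A1", "toshio tomimura", None, 42,
                  frozenset({"W1", "W2"}), frozenset({"A7", "A8"}), frozenset({"I1"}))
b = AuthorProfile("A2", "toshio tomimura", None, 1,
                  frozenset({"W2"}), frozenset({"A8"}), frozenset({"I1", "I2"}))
features = pair_features(a, b, bucket_size=2)
```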

The acceptance tiers

Pairs are evaluated against tiered rules in priority order; the first match becomes the pair's merge_tier. An ORCID conflict overrides every tier, and such pairs are rejected outright (1.31M rejected). Each tier corresponds to a different precision/recall trade-off: T0–T4 are evidence-based with near-perfect precision, while T5–T7 fall back to name shape with safeguards.

Tier | Rule (key conditions) | Precision (sample)
T0_same_orcid | Same non-empty ORCID. | ~100%
T1_shared_paper | ≥1 paper they both appear on. | ~100%
T2_shared_coauthor | ≥1 shared coauthor + ≥2 name tokens + non-common surname. | ~100%
T2b | ≥2 shared coauthors (common surnames OK). | ~100%
T3_two_shared_institutions | ≥2 shared institutions + ≥2 tokens + non-common surname. | ~100%
T3_shared_institution_small_bucket | 1 shared institution + ≥2 tokens + bucket ≤7. | ~98%
T4_non_latin_small_bucket | Both display_names non-Latin (Cyrillic / CJK) + bucket ≤5. | ~88%
T5_singleton_pair_rare_name | bucket = 2 + ≥2 tokens + non-common surname + min_works ≤50. | ~90%
T6_tiny_stub_absorption | min_works ≤2 + ≥2 tokens + non-common surname + bucket ≤15. | ~85%
T7a_three_token_small_bucket | name_token_count ≥3 + bucket ≤7. | ~94%
T7b_two_token_tiny_bucket | name_token_count = 2 + bucket ≤4. | ~90%
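The first-match logic of the table can be sketched as an ordered list of (tier, condition) checks with the ORCID-conflict veto applied first. The `PairFeatures` field names are hypothetical, and two readings are assumptions: "1 shared institution" is treated as "at least one", and "minw" is taken to mean min_works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PairFeatures:
    name_token_count: int
    bucket_size: int
    min_works: int
    same_orcid: bool = False
    orcid_conflict: bool = False
    shared_papers: int = 0
    shared_coauthors: int = 0
    shared_institutions: int = 0
    common_surname: bool = False
    both_non_latin: bool = False


def merge_tier(f) -> Optional[str]:
    """Return the first matching tier name, or None if the pair is rejected."""
    if f.orcid_conflict:
        return None  # ORCID conflict overrides every tier
    multi_token = f.name_token_count >= 2
    rare = multi_token and not f.common_surname
    rules = [
        ("T0_same_orcid", f.same_orcid),
        ("T1_shared_paper", f.shared_papers >= 1),
        ("T2_shared_coauthor", f.shared_coauthors >= 1 and rare),
        ("T2b", f.shared_coauthors >= 2),
        ("T3_two_shared_institutions", f.shared_institutions >= 2 and rare),
        # Assumption: "1 shared institution" read as "at least one".
        ("T3_shared_institution_small_bucket",
         f.shared_institutions >= 1 and multi_token and f.bucket_size <= 7),
        ("T4_non_latin_small_bucket", f.both_non_latin and f.bucket_size <= 5),
        ("T5_singleton_pair_rare_name",
         f.bucket_size == 2 and rare and f.min_works <= 50),
        ("T6_tiny_stub_absorption",
         f.min_works <= 2 and rare and f.bucket_size <= 15),
        ("T7a_three_token_small_bucket",
         f.name_token_count >= 3 and f.bucket_size <= 7),
        ("T7b_two_token_tiny_bucket",
         f.name_token_count == 2 and f.bucket_size <= 4),
    ]
    for tier, matched in rules:
        if matched:
            return tier
    return None


print(merge_tier(PairFeatures(name_token_count=2, bucket_size=2, min_works=10)))
```

Keeping the rules in one ordered list makes the priority explicit: a pair that satisfies both a strong and a weak rule is always labelled with the stronger one.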

Connected components on the accepted-pair graph form clusters. Each cluster's canonical AuthorID is the member with the most papers. The duplicate explorer lets you browse clusters stratified by tier and size.
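Clustering by connected components can be sketched with a small union-find structure, followed by picking the member with the most works as the canonical ID. The function name, the input shapes, and the tie-break (smallest ID when works counts are equal) are assumptions for illustration.

```python
def cluster_authors(accepted_pairs, works_count):
    """Group AuthorIDs into connected components of the accepted-pair graph.

    Returns {canonical_id: set_of_member_ids}.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in accepted_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    components = {}
    for author_id in list(parent):
        components.setdefault(find(author_id), set()).add(author_id)

    clusters = {}
    for members in components.values():
        # Canonical ID: most works; ties go to the smallest ID (assumption).
        canonical = max(sorted(members), key=lambda m: works_count.get(m, 0))
        clusters[canonical] = members
    return clusters


accepted = [("A1", "A2"), ("A2", "A3"), ("A4", "A5")]
works = {"A1": 3, "A2": 40, "A3": 1, "A4": 7, "A5": 2}
clusters = cluster_authors(accepted, works)
```

Note that A1 and A3 end up in the same cluster even though they were never accepted as a pair directly; transitivity through A2 is what connected components add on top of pairwise decisions.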

What kinds of pairs each tier catches

Below are real example clusters from the run, one row per tier. (The explorer has many more.)

[Interactive table: example clusters, one row per tier]

Stats by tier

The breakdown of accepted pairs and clusters by the tier that triggered the merge, in priority order (a cluster's tier is that of its strongest edge):

[Interactive table: accepted pairs and clusters by tier]
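A per-tier breakdown of this kind could be computed as sketched below, taking "strongest edge" to mean the edge whose tier comes earliest in the priority order. The input shape (each cluster mapped to the merge_tier values of its accepted edges) and the function names are assumptions.

```python
from collections import Counter

# Tier names in priority order, strongest first, as listed in the tier table.
TIER_PRIORITY = [
    "T0_same_orcid",
    "T1_shared_paper",
    "T2_shared_coauthor",
    "T2b",
    "T3_two_shared_institutions",
    "T3_shared_institution_small_bucket",
    "T4_non_latin_small_bucket",
    "T5_singleton_pair_rare_name",
    "T6_tiny_stub_absorption",
    "T7a_three_token_small_bucket",
    "T7b_two_token_tiny_bucket",
]
RANK = {tier: position for position, tier in enumerate(TIER_PRIORITY)}


def cluster_tier(edge_tiers):
    """A cluster's tier is the highest-priority tier among its edges."""
    return min(edge_tiers, key=RANK.__getitem__)


def tier_breakdown(cluster_edges):
    """cluster_edges maps a cluster ID to the merge_tier of each accepted edge."""
    pair_counts = Counter(t for tiers in cluster_edges.values() for t in tiers)
    cluster_counts = Counter(cluster_tier(tiers) for tiers in cluster_edges.values())
    return pair_counts, cluster_counts


toy_edges = {
    "c1": ["T6_tiny_stub_absorption", "T1_shared_paper"],
    "c2": ["T7b_two_token_tiny_bucket"],
}
pair_counts, cluster_counts = tier_breakdown(toy_edges)
```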

Browse the predictions

The duplicate explorer shows ~1500 clusters stratified by tier so you can see the kind of merges each rule makes. For each cluster you get every member AuthorID with its display name, works count, year range (from the imputed-author file and from full OpenAlex), and last-known institution.

Open duplicate explorer