We trained a classifier that places any scientific paper into one of 30 fields (e.g. Computer Science, Economics) and 304 subfields (e.g. Artificial Intelligence & Machine Learning, Labor Economics) from the SciNET taxonomy. We ran it on 14.6 million papers to replace the field labels OpenAlex ships, which were trained on modern papers and degrade on the historical record we care about (1900–2000).
How well does it work?
We evaluate on a held-out set of papers labelled by Anthropic's Claude Sonnet 4.5 (with Claude Opus 4.7 stepping in for the hardest 10%). We compare to OpenAlex's existing field labels.
Field accuracy
Did the model put the paper in the right field?
The model returns its top-3 fields for every paper. Our top-2 number counts a prediction as correct when the right field appears among the model's top two. OpenAlex returns only one field per paper.
Subfield accuracy
A harder problem — 10× more classes.
At the subfield level, "the right answer" is itself fuzzy: the LLM frequently lists several plausible subfields. We score the model generously: a top-2 prediction counts as correct if it includes any of the LLM's plausible subfields. The strict top-1 number is shown for reference.
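To make the two scoring rules concrete, here is a minimal sketch of how they could be computed; the arrays and labels in the toy example are hypothetical, not our evaluation data.

```python
def top2_field_accuracy(pred_top3, true_field):
    """Correct if the true field is among the model's top-2 predictions."""
    hits = sum(true in preds[:2] for preds, true in zip(pred_top3, true_field))
    return hits / len(true_field)

def generous_subfield_accuracy(pred_top2, plausible_sets):
    """Correct if the top-2 prediction overlaps the LLM's plausible subfields."""
    hits = sum(bool(set(preds[:2]) & set(plausible))
               for preds, plausible in zip(pred_top2, plausible_sets))
    return hits / len(plausible_sets)

# Toy example with made-up labels:
print(top2_field_accuracy(
    pred_top3=[["Economics", "Sociology", "History"]],
    true_field=["Sociology"]))                                   # 1.0
print(generous_subfield_accuracy(
    pred_top2=[["Labor Economics", "Public Economics"]],
    plausible_sets=[["Labor Economics", "Economic History"]]))   # 1.0
```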
How we got the labels
We sampled 44,000 papers stratified across every research field that OpenAlex covers and asked Anthropic's Claude Sonnet 4.5 to read each one's title, journal, and (where available) abstract, and pick the best-fitting field and subfield from the SciNET taxonomy. The prompt asks the model to return its top guess plus up to three alternatives and a confidence score from 0–100 — the LLM admits when it's unsure between, say, Materials Science and Chemistry, and that ambiguity is preserved end-to-end in the labels and in our evaluation.
Papers where Sonnet returned a confidence below 60 (about 1 in 10) were re-prompted under Claude Opus 4.7, the more capable but slower model, and the higher-confidence response was kept. Total cost of labelling: about $245 in API credits.
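Sketched as code, the escalation step looks roughly like this; ask_llm is a hypothetical helper wrapping the API call, and the model identifiers are purely illustrative.

```python
CONFIDENCE_THRESHOLD = 60  # answers below this go to the larger model

def label_paper(paper):
    """Label one paper, escalating low-confidence answers to the bigger model.

    ask_llm() is a hypothetical helper: it prompts the model with the paper's
    title, journal, and abstract (when available) and returns a dict with the
    chosen field and subfield, up to three alternatives, and a 0-100 confidence.
    """
    answer = ask_llm(model="sonnet-4.5", paper=paper)    # illustrative model id
    if answer["confidence"] < CONFIDENCE_THRESHOLD:
        retry = ask_llm(model="opus-4.7", paper=paper)   # illustrative model id
        # Keep whichever of the two responses is more confident.
        if retry["confidence"] > answer["confidence"]:
            answer = retry
    return answer
```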
How we built the model
Each model is a single logistic regression, the simplest off-the-shelf classifier. We tried averaging in a gradient-boosted tree and a small neural net, but on our held-out slice they bought less than one percentage point and roughly tripled the inference cost on 14.6 million papers, so we shipped the simple model.
The features that go into the regression are what matter. We use three blocks:
- Text. A TF-IDF bag-of-words over the title, abstract (when present), and journal name; plus an e5-base-v2 sentence embedding (768 numbers per paper) for the 14.6M frame.
- Metadata. For each paper, what fraction of its authors' past papers, its references, and its journal-mates are in each of the 30 fields? This is the strongest signal we have for short historical papers with no abstract.
- Field signal (subfield model only). The subfield regression takes the field regression's 30-way softmax as an input. Subfield predictions are also reranked at the very end by multiplying through the field probabilities, so the system never tells you "Computer Science → Marine Biology".
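Putting these blocks together, a stripped-down version of the field and subfield models might look like the scikit-learn sketch below. It assumes the text strings, e5 embeddings, metadata shares, labels, and the SUBFIELD_TO_FIELD parent lookup already exist; those names are illustrative, not our production pipeline.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Text block: TF-IDF over title + abstract + journal name (one string per paper).
tfidf = TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))
X_text = tfidf.fit_transform(train_texts)

# Dense blocks: e5-base-v2 sentence embeddings (n_papers x 768) and
# metadata field shares from authors, references, and journal (n_papers x 90).
X_dense = csr_matrix(np.hstack([train_embeddings, train_meta_shares]))

X = hstack([X_text, X_dense]).tocsr()

# Field model: one multinomial logistic regression over the 30 fields.
field_model = LogisticRegression(max_iter=1000)
field_model.fit(X, train_field_labels)
field_probs = field_model.predict_proba(X)          # 30-way softmax per paper

# Subfield model: same features plus the field softmax as an extra block.
X_sub = hstack([X, csr_matrix(field_probs)]).tocsr()
subfield_model = LogisticRegression(max_iter=1000)
subfield_model.fit(X_sub, train_subfield_labels)
subfield_probs = subfield_model.predict_proba(X_sub)

# Final rerank: multiply each subfield's probability by its parent field's
# probability, so cross-field combinations sink to the bottom of the list.
# SUBFIELD_TO_FIELD is a hypothetical lookup from subfield label to the
# column of its parent field in field_model.classes_.
parent_field = np.array([SUBFIELD_TO_FIELD[s] for s in subfield_model.classes_])
reranked = subfield_probs * field_probs[:, parent_field]
top5 = np.argsort(-reranked, axis=1)[:, :5]
```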
The metadata signal is recursive: a paper's neighbours' field shares depend on which fields the neighbours are in, which the model itself is predicting. We therefore train in two passes: a first model uses OpenAlex's existing primary topic to define neighbour shares; a second model uses the first model's own predictions on the corpus instead. A third pass adds nothing, so we stop. The same trick is applied at the subfield grain on the 14.6M-paper deployment, and that "subfield-meta" feature is the single largest contributor to subfield accuracy.
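In code, the two-pass training reads roughly as follows. The neighbour_field_shares function is the core idea; openalex_primary_field_index, reference_lists, fit_field_model, and relabel_corpus are hypothetical stand-ins for the corpus data and the training step sketched above.

```python
import numpy as np

N_FIELDS = 30

def neighbour_field_shares(neighbour_ids, labels, n_fields=N_FIELDS):
    """Fraction of a paper's neighbours that currently sit in each field."""
    shares = np.zeros(n_fields)
    for j in neighbour_ids:
        shares[labels[j]] += 1
    return shares / max(len(neighbour_ids), 1)

def metadata_block(reference_lists, labels):
    """One row of neighbour field shares per paper (references only, for brevity)."""
    return np.vstack([neighbour_field_shares(refs, labels) for refs in reference_lists])

# Pass 1: shares come from OpenAlex's existing primary topic (hypothetical array).
labels = openalex_primary_field_index
model_1 = fit_field_model(metadata_block(reference_lists, labels))  # hypothetical trainer

# Pass 2: relabel the whole corpus with model 1, recompute shares, retrain.
labels = relabel_corpus(model_1)                                    # hypothetical helper
model_2 = fit_field_model(metadata_block(reference_lists, labels))
# A third pass changed nothing measurable, so model_2 is the one we ship.
```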
For the 94.7M-paper pre-2000 corpus we use only the text and metadata blocks (no embeddings) — most of those papers are title-and-journal-only records, so an embedding wouldn't have anything to summarise. For the 14.6M-paper modern frame we add the e5 sentence embedding on top.
How we classify authors
Author fields are career-level aggregates over the same 14.6 million paper predictions. For each OpenAlex AuthorID with at least one paper in the frame, we sum the paper-level top-3 field probabilities across that author's papers, renormalise the resulting 30-way vector, and take the largest entries as the author's primary and secondary fields. We also test variants that down-weight uncertain papers and very large teams.
Subfields use the same idea with the paper-level top-5 subfield probabilities. We do not force an author's top subfield to sit under the author's top field; interdisciplinary profiles are preserved and flagged when the subfield's parent field differs from the field aggregate.
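As a concrete illustration of the field-level aggregation, here is a minimal sketch; the probability format and the toy numbers are made up for the example.

```python
import numpy as np

N_FIELDS = 30

def author_field_profile(paper_top3_probs):
    """Aggregate paper-level top-3 field probabilities into an author profile.

    paper_top3_probs: per paper, a list of (field_index, probability) pairs,
    e.g. [[(4, 0.7), (11, 0.2), (2, 0.1)], ...]  (illustrative format)
    """
    totals = np.zeros(N_FIELDS)
    for paper in paper_top3_probs:
        for field_idx, prob in paper:
            totals[field_idx] += prob
    profile = totals / totals.sum()               # renormalise to a 30-way vector
    primary, secondary = np.argsort(-profile)[:2]
    return profile, primary, secondary

# Example: an author with two papers, both leaning towards field 4.
profile, primary, secondary = author_field_profile([
    [(4, 0.80), (7, 0.15), (12, 0.05)],
    [(4, 0.50), (7, 0.40), (20, 0.10)],
])
```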
Browse the predictions
The paper explorer shows a sample of about 11,000 papers split across three sets:
- In training — papers Sonnet/Opus labelled directly. The explorer shows the model's top-3 / top-5 predictions side by side with the LLM's chosen labels, with green / yellow / orange colour coding for agreement.
- Hard cases (active) — corpus papers our model was least confident about; we re-asked Sonnet 4.5 to label these for the explorer (not used in training). This is where you'll see the most disagreement and where the colour coding is most useful.
- Corpus only — random papers from the deployed 14.6M-paper run, no LLM judge attached.
You can filter by field, subfield, year, and agreement state, sort by confidence, and search inside titles and abstracts.
Open paper explorer · Read author overview · Open author explorer