Fine-grained Classification of A Million Life Trajectories from Wikipedia

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv source.

Life trajectories of notable people convey essential messages for human dynamics research. These trajectories consist of (*person, time, location, activity type*) tuples recording when and where a person was born, went to school, started a job, or fought in a war. However, current studies only cover limited activity types such as births and deaths, lacking large-scale fine-grained trajectories. Using a tool that extracts (*person, time, location*) triples from Wikipedia, we formulate the problem of classifying these triples into 24 carefully defined types using textual context as complementary information. The challenge is that triple entities are often scattered in noisy contexts. We use syntactic graphs to bring triple entities and relevant information closer, fusing them with text embeddings to classify life trajectory activities. Since Wikipedia text quality varies, we use LLMs to refine the text for more standardized syntactic graphs. Our framework achieves 84.5% accuracy, surpassing baselines. We construct the largest fine-grained life trajectory dataset with 3.8 million labeled activities for 589,193 individuals spanning three centuries. Finally, we showcase how these trajectories can support grand narratives of human dynamics across time and space. Code and data are publicly available.


💡 Research Summary

The paper addresses the problem of fine‑grained classification of life‑trajectory activities extracted from Wikipedia biographies. While prior work has largely focused on a narrow set of events such as births and deaths, the authors propose a new multi‑class classification task that assigns one of 24 carefully defined activity types (grouped into nine broader categories) to each (person, time, location) triple together with its surrounding sentence.

To obtain the triples, the authors rely on the COSMOS extraction system, which provides high‑quality (person, time, location) triples along with the source sentence. From a random sample of 2,000 biography pages they extract 2.8k triples and manually annotate them, creating a gold‑standard training/validation/test set. This annotated set is used to train and evaluate the proposed model.

The core methodological contribution is the SAM4LTC framework (Syntax‑Aware Masked model for Life Trajectory Classification). The pipeline works as follows:

1. Each sentence is first refined by a large language model (e.g., GPT‑4), which rewrites it into a more standardized syntactic form while preserving the original meaning and keeping the three triple entities unchanged.
2. A dependency parse (via spaCy) of the refined sentence yields a syntactic graph.
3. The shortest paths connecting the three entities are identified, and all tokens lying on these paths are marked in a binary "MASK" vector.
4. A pre‑trained transformer (e.g., BERT) processes the original sentence, but its attention mechanism is guided by the MASK so that it focuses primarily on the masked tokens, effectively fusing syntactic proximity with semantic context.
5. The resulting text embedding is concatenated with a graph‑based embedding derived from the subgraph, and the combined representation is fed to a classifier.
6. A supervised contrastive loss encourages intra‑class compactness and inter‑class separation.
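The shortest-path masking step can be sketched on a toy dependency graph. The token indices and head→dependent edges below are hypothetical stand-ins for a real parser's output (e.g., spaCy), and the helper names are illustrative, not from the paper:

```python
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path over an undirected dependency graph."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return []

def entity_mask(n_tokens, edges, entity_idxs):
    """Mark every token on a shortest path between any two entity tokens."""
    adj = {i: set() for i in range(n_tokens)}
    for head, dep in edges:
        adj[head].add(dep)
        adj[dep].add(head)  # treat the parse tree as undirected
    keep = set()
    for i in range(len(entity_idxs)):
        for j in range(i + 1, len(entity_idxs)):
            keep.update(shortest_path(adj, entity_idxs[i], entity_idxs[j]))
    return [1 if i in keep else 0 for i in range(n_tokens)]

# "In 1921 Einstein visited New York ."  — toy (head, dependent) edges
tokens = ["In", "1921", "Einstein", "visited", "New", "York", "."]
edges = [(3, 0), (0, 1), (3, 2), (3, 5), (5, 4), (3, 6)]  # "visited" is the root
mask = entity_mask(len(tokens), edges, entity_idxs=[1, 2, 5])  # 1921 / Einstein / York
# tokens off the connecting paths ("New", ".") are masked out
```

The binary mask would then bias the transformer's attention toward the path tokens, as the paper describes.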

The authors argue that the MASK‑guided attention overcomes the problem that the three entities are often scattered across many irrelevant words in the raw sentence, while the dependency graph brings them closer together. The LLM‑based sentence refinement further reduces variability caused by the heterogeneous writing styles of Wikipedia editors, leading to more uniform graphs and less noise.

Experimental results show that SAM4LTC achieves 84.5% accuracy on the 2.8k annotated test set, outperforming strong baselines such as vanilla BERT (≈77%), TextGCN, and GatedGCN by 7–10 percentage points. Ablation studies confirm that both the MASK mechanism and the LLM‑based refinement contribute substantially to the gain. Error analysis reveals that rare activity types (e.g., "Purchase and Sell", "Exhibition") still suffer from data imbalance, suggesting future work on re‑sampling or data augmentation.
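One standard mitigation for the class imbalance flagged in the error analysis (not something the paper itself reports using) is inverse-frequency class weighting for the classification loss. A minimal sketch:

```python
import numpy as np

def inverse_freq_weights(labels, n_classes, smoothing=1.0):
    """Per-class loss weights ~ 1 / (count + smoothing), normalized to mean 1."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    w = 1.0 / (counts + smoothing)
    return w * n_classes / w.sum()

# toy skewed distribution: 90 / 9 / 1 samples across three activity types
labels = np.array([0] * 90 + [1] * 9 + [2] * 1)
weights = inverse_freq_weights(labels, n_classes=3)
# rare classes receive larger weights than frequent ones
```

The resulting vector can be passed as per-class weights to a weighted cross-entropy loss so that rare types such as "Exhibition" contribute more per example.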

Beyond the model, the paper delivers a massive dataset: using COSMOS, the authors extract 3.8 million (person, time, location, activity‑type) tuples for 589,193 notable individuals spanning three centuries. They release the annotated 2.8k sample, the trained model, and (upon acceptance) the full 3.8M dataset to the community.

In summary, the contributions are threefold: (1) definition of a new fine‑grained life‑trajectory labeling task and a 24‑type taxonomy; (2) a novel architecture that tightly integrates syntactic graph information with transformer embeddings via a MASK‑guided attention scheme and contrastive learning; (3) the creation and public release of the largest fine‑grained life‑trajectory dataset to date. The work opens up new possibilities for quantitative studies of human dynamics, such as migration patterns, the evolution of cultural hubs, and career mobility, and provides a reusable framework for other entity‑centric classification problems.

