Accepted to Findings of EACL; preprint available on arXiv.
We use compute- and data-efficient methods to pretrain a battery of historically-specific models on a relatively limited token budget. We show that this approach leaks far less temporally-inappropriate information than finetuning an existing LLM, while retaining adequate performance for lexical change detection.
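As a rough illustration of how period-specific models can be used for lexical change detection, the sketch below scores a word by comparing its contextual embeddings across two checkpoints. This is a minimal example, not the paper's exact pipeline: the model names, organization prefix, and sentences are placeholders, and it assumes the checkpoints are standard Transformer encoders loadable with the HuggingFace `transformers` library.

```python
# Minimal sketch: compare a target word's contextual embedding across two
# period-specific pretrained models. Checkpoint names below are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer


def word_embedding(model_name: str, sentence: str, target: str) -> torch.Tensor:
    """Mean-pool the hidden states of the sub-tokens covering `target`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()  # needs a fast tokenizer
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    # Select the sub-tokens whose character spans overlap the target word.
    start = sentence.index(target)
    end = start + len(target)
    idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start and e > s]
    return hidden[idx].mean(dim=0)


# Hypothetical checkpoints for two historical slices of the corpus.
early = word_embedding("org/model-1850-1875", "The awful majesty of the storm.", "awful")
late = word_embedding("org/model-1950-1975", "The awful traffic made us late.", "awful")

# A higher cosine distance suggests greater semantic change between periods.
change = 1 - torch.nn.functional.cosine_similarity(early, late, dim=0).item()
print(f"change score: {change:.3f}")
```

In practice one would aggregate such distances over many usage examples per period rather than a single sentence pair.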
Our approach can be used for any corpus with clear delineations between collections of works. Code is available on GitHub, and models and data are on HuggingFace.