Presented at NLP4DH 2024 @ EMNLP
Using the BERT and CANINE families of embedding models for literary investigation led us to a series of experiments testing their ability to embed orthographic information that is important to the study of works employing “non-standard” varieties of written English.
We collected and tagged a corpus of such works, and evaluated the models' embeddings to determine whether their representations preserve this information in a form that could enable unsupervised clustering of variants. We found that both model families have strengths and weaknesses relative to this task, but that enough information is present to support clustering of some meaningfully related variant systems.
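To make the pipeline concrete, the following is a minimal sketch of the embed-then-cluster setup the abstract describes, assuming public HuggingFace checkpoints ("bert-base-uncased" and "google/canine-c"), mean pooling over hidden states, and k-means clustering; none of these specifics come from the paper itself, and the toy inputs are hypothetical stand-ins for tagged corpus passages.

```python
# A minimal sketch of the embed-then-cluster pipeline. The checkpoints,
# mean pooling, and k-means are illustrative assumptions, not the
# paper's reported configuration.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer


def embed(texts, checkpoint):
    """Mean-pool the last hidden state into one vector per text."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    model.eval()
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # zero out padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()


# Hypothetical toy inputs, not drawn from the paper's corpus.
texts = [
    "wot hev ye dun wi' it?",       # non-standard orthography
    "what have you done with it?",  # standard orthography
]

for checkpoint in ("bert-base-uncased", "google/canine-c"):
    vectors = embed(texts, checkpoint)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
    print(checkpoint, labels)
```

One reason to compare the two families on this task: CANINE operates directly on characters, so spelling differences reach the model unmediated, whereas BERT's subword vocabulary can split or normalize unfamiliar orthographic forms before they are embedded.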