Simulated Humanist Mind


De-duplicating Word Data with Yule’s K

This post continues my work on the premade HTRC dataset produced by Ted Underwood and his collaborators.1 While the last post used the dataset to investigate some recent claims about G. Udny Yule’s measure K, this post attempts to put K to some use in deduplicating the premade dataset. Yule’s K captures author similarity very weakly at best. Instead, as I argued in a previous post, it seems to do a much better job capturing genre-level information, or, more specifically, information about how a particular text instantiates the lexical norms of a given genre. While this has obvious implications for genre classification of texts, K’s length invariance led me to believe it might be useful when deduplicating a text dataset. While deduplication is sometimes an unnecessary step, the project I have in mind for the HTRC data could benefit greatly from ignoring repeat texts, making the added preprocessing time worthwhile.

Metadata associated with a given text is the obvious and attractive source of deduplication information. In principle, similar texts should have similar, if not always exactly identical, metadata. Titles and authors may have spelling variations, information given for one volume may not be given for another, and page counts can change and shift, but variations should be minor. These expectations don’t always hold in practice, though thankfully they mostly do for the HTRC data. Where metadata comparison becomes especially tricky is in dealing with reprinted material. Reprints may have wildly divergent metadata, making locating repeat texts quite the challenge.

To leap backwards into more abstract terms, a given text (say, Invisible Man) can be considered a work type. Each physical instantiation of this work is a token of that type, one that may vary in largely inconsequential ways. For example, it is mostly scholars who concern themselves with the differences […]
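Since K is (roughly) invariant to text length, one cheap way to put it to work is as a first-pass filter before any metadata comparison: flag volume pairs whose K values nearly coincide, and hand only those pairs on to a closer check. A minimal sketch of that idea in Python — the function name, the tolerance value, and the dict-of-scores input are my illustrative assumptions, not the actual pipeline described in this post:

```python
from itertools import combinations

def candidate_duplicates(k_by_volume, tol=0.5):
    """Flag volume pairs whose Yule's K values lie within `tol` of each
    other. Matches are duplicate *candidates* only: unrelated texts can
    happen to share a K value, so each flagged pair still needs a
    metadata or full-text comparison.
    """
    return [
        (a, b)
        for (a, k_a), (b, k_b) in combinations(sorted(k_by_volume.items()), 2)
        if abs(k_a - k_b) <= tol
    ]

# Hypothetical K scores for three volumes: the first two are near-identical.
pairs = candidate_duplicates({"v1": 120.0, "v2": 120.3, "v3": 300.0})
print(pairs)  # → [('v1', 'v2')]
```

The point of a filter like this is only to shrink the comparison space: all-pairs metadata matching is quadratic in the number of volumes, while a K collision is a single float comparison per pair.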

Posted by Craig on January 29th, 2016


Dabbling with HTRC word-level data – Yule’s K, Invariance and Genre

It has now been an official “while” since Ted Underwood and his collaborators released some pre-generated data from their genre learning project through the HTRC portal.1 The dataset itself consists of volumes from the HathiTrust digital library, OCR’d, word-tokenized, and split by genre (fiction/poetry/drama). As Underwood points out in his blog post announcing the release of the dataset, this sort of collection seems especially valuable for tracking diachronic change, or perhaps for understanding aggregate traits of texts over given periods, and it offers a whole new vista for scholars interested in this historical period. While there are many clever and sophisticated ways to approach such a dataset, I decided to start simple because A) I had no real project in mind and B) I wanted to familiarize myself with this dataset, as it seems like a resource I’ll likely draw upon again. I had in mind something like the power-law/Zipf-based approach I had previously utilized on a smaller scale. However, that approach had many problems, the largest being the effect of differing text lengths upon the measure. With nothing immediately at hand, I put the idea to the side.

After a few months of leaving the HTRC portal site open in my unabashedly cluttered collection of Firefox tabs, I stumbled across a paper that gave me a direction to take. That paper, Computational Constancy Measures of Texts — Yule’s K and Rényi’s Entropy by Kumiko Tanaka-Ishii and Shunsuke Aihara, tests a number of potential “textual constants,” relatively simple calculated measures that, in the words of the authors, “[converge] to a value for a certain amount of text and [remain] invariant for any larger size.”2 To summarize, Tanaka-Ishii and Aihara discover that, out of the measures tested, only Yule’s K measure (which they also show is functionally equivalent to […]
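For readers who want to try the measure on the HTRC word counts themselves: Yule’s K is straightforward to compute from a text’s frequency spectrum. With S1 the total token count and S2 the sum of squared type frequencies, K = 10⁴ · (S2 − S1) / S1². A minimal Python sketch (the function name and the tokenized-list input convention are mine, not part of the dataset):

```python
from collections import Counter

def yules_k(tokens):
    """Compute Yule's K for a list of word tokens.

    K = 10^4 * (S2 - S1) / S1^2, where S1 is the total token count and
    S2 is the sum of squared type frequencies (equivalently, the sum
    over frequencies f of f^2 times the number of types occurring f times).
    """
    if not tokens:
        raise ValueError("cannot compute K for an empty token list")
    freqs = Counter(tokens)                  # word type -> frequency
    s1 = sum(freqs.values())                 # total tokens
    s2 = sum(f * f for f in freqs.values())  # sum of squared frequencies
    return 10_000 * (s2 - s1) / (s1 * s1)
```

The 10⁴ factor is just a conventional scaling for readability; the interesting property, per Tanaka-Ishii and Aihara, is that the value stabilizes once a text passes a certain length, which is exactly what a length-sensitive Zipf-style measure lacks.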

Posted by Craig on January 4th, 2016