Posted by Craig on January 29th, 2016
This post continues my work on the premade HTRC dataset produced by Ted Underwood and his collaborators. 1 While the last post used the dataset to investigate some recent claims about G. Udny Yule’s measure K, this post attempts to put K to some use in deduplicating the premade dataset.
Yule’s K captures author similarity very weakly at best. Instead, as I argued in a previous post, it seems to do a much better job of capturing genre-level information, or, more specifically, information about how a particular text instantiates the lexical norms of a given genre. While this has obvious implications for genre classification, K’s length invariance led me to believe it might also be useful when deduplicating a text dataset. Deduplication is sometimes an unnecessary step, but the project I have in mind for the HTRC data could benefit greatly from ignoring repeat texts, making the added preprocessing time worthwhile.
Metadata associated with a given text is the obvious and attractive source of deduplication information. In principle, similar texts should have similar, if not exactly identical, metadata. Titles and authors may have spelling variations, information given for one volume may be missing from another, and page counts can shift, but variations should be minor. These expectations don’t always hold in practice, though thankfully they mostly do for the HTRC data. Where metadata comparison becomes especially tricky is with reprinted material. Reprints may have wildly divergent metadata, making repeat texts quite a challenge to locate. To leap backwards into more abstract terms, a given text (say, Invisible Man) can be considered a work type. Each physical instantiation of this work is a token of that type, which may vary from its siblings in largely inconsequential ways. For example, it is mostly scholars who concern themselves with the differences between a copy of Invisible Man printed in 1960 and one printed in 2010. While there are certainly important reasons to take scholarly interest in differences between text tokens, the abstract view will not impact our analysis here: intuitively, the actual words making up the tokens of a given text type should remain mostly, if not entirely, the same.
In my mind, this leaves the words themselves, correlated with some amount of title and author information, as the last viable vector for deduplication. K is an excellent aggregate measure of vocabulary and should be robust across versions of a given text. Furthermore, while the HTRC dataset comes with a rich array of metadata, this will not always be the case; datasets with missing or inaccurate metadata need a text-level corrective such as K.
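Since the argument leans on K throughout, a quick sketch of the measure may help. The standard formula is K = 10⁴ · (Σᵢ i²Vᵢ − N) / N², where N is the total token count and Vᵢ is the number of word types occurring exactly i times. The function name and the assumption of a pre-tokenized word list below are my own illustrative choices, not the HTRC pipeline’s:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's characteristic K: 10^4 * (sum(i^2 * V_i) - N) / N^2,
    where V_i is the number of word types occurring exactly i times
    and N is the total number of tokens."""
    n = len(tokens)
    type_counts = Counter(tokens)                  # word type -> frequency
    freq_of_freqs = Counter(type_counts.values())  # i -> V_i
    s2 = sum(i * i * v_i for i, v_i in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)
```

Because both Σ i²Vᵢ and N² grow with text length, the ratio stays roughly stable as a text grows, which is the length invariance relied on above.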
I investigated the HTRC fiction metadata after appending each volume’s K value to its row in the document and sorting the entire table by K value. Doing so tended to group what I would intuitively deem tokens of a given text type into clusters. See the truncated sample metadata entries below:
|nnc1.cu58359559||Stuart, Ruth McEnery,||New York;Harper & brothers;1901.||1901||nyu||A golden wedding||367||400||1901||60.358036096|
|mdp.39015030714631||Stuart, Ruth McEnery,||New York;Harper;1893.||1893||nyu||A golden wedding, and other tales||368||400||1893||60.4986580369|
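The K-append-and-sort step looks roughly like this. The file paths, the assumption that the HTID sits in the first column, and the get_k callback are all hypothetical stand-ins; the real metadata table has its own schema:

```python
import csv

def append_and_sort_by_k(in_path, out_path, get_k):
    """Append each volume's K to its metadata row, then write the
    rows back out sorted by K so near-duplicates land together.
    get_k stands in for whatever routine computes K for an HTID."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    for row in rows:
        row.append(str(get_k(row[0])))  # row[0] assumed to be the HTID
    rows.sort(key=lambda row: float(row[-1]))
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```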
Despite their different titles and imprints, I would call both of these entries tokens of the same text type. While sorting by author or title would likely clump these texts in this case, this may not always be so: simple spelling differences or missing metadata can easily defeat such an approach. Using a fuzzy comparison library like fuzzywuzzy could also work, but comparing each title or author field one-to-one with every other title/author field in the dataset would be incredibly slow. To overcome these limitations, I combined the fuzzy-comparison approach with a K sort. Because sorting the metadata entries by K clumps similar entries but does not guarantee their adjacency, we need to decide on an a priori K threshold that will define a neighborhood of entries. For my threshold I chose 0.7, meaning that the metadata entries are chunked into groups whose K values fall within 0.7 of each other (as a side note, 0.7 is actually a rather generous threshold; 0.5 or even smaller would likely be adequate and would decrease the running time). The program then iterates over these chunks, clumping the texts in each neighborhood by comparing their title and author metadata, so long as they are the same volume of a given text (or volume information is not provided). Each clump should contain all of the tokens of a given text type found within the K neighborhood. The clumps are then written to an output CSV file, with each clump separated by an empty row. Taking only one text from each clump will thus deduplicate the dataset. Here is a representative clump from the results file with additional metadata appended:
|mdp.39015070463321||Works||Whyte-Melville, G. J.||87.6786058497||London;W. Thacker;1898-1902||v23||416|
|uc1.b3039210||The works of G.J. Whyte-Melville||Whyte-Melville, G. J.||87.8631562201||London;W. Thacker;Calcutta;Thacker, Spink;1898-1902.||v23||420|
|uc1.b3339593||Works||Whyte-Melville, G. J.||87.8579866199||London;W. Thacker;1898-1902.||v23||418|
While these three texts, all tokens of volume 23 of Whyte-Melville’s works, have differences that could make a sort on any single metadata field fruitless, the combined approach described above correctly clumps them together.
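A sketch of that clumping pass might look like the following. Note that I substitute the standard library’s difflib for fuzzywuzzy, omit the volume check for brevity, and the tuple layout and similarity threshold are illustrative rather than the exact values used:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.6):
    """Crude stand-in for a fuzzywuzzy ratio: true when the two
    strings share enough of their characters, in order."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def clump(entries, k_window=0.7):
    """entries: (htid, title, author, k) tuples already sorted by k.
    Carve the sorted list into K neighborhoods, then clump each
    neighborhood by fuzzy title and author agreement."""
    clumps = []
    i = 0
    while i < len(entries):
        # extend the neighborhood while K stays within the window
        j = i
        while j < len(entries) and entries[j][3] - entries[i][3] <= k_window:
            j += 1
        neighborhood = list(entries[i:j])
        # greedily pull matching entries into a clump around each seed
        while neighborhood:
            seed = neighborhood.pop(0)
            group = [seed]
            rest = []
            for entry in neighborhood:
                if similar(seed[1], entry[1]) and similar(seed[2], entry[2]):
                    group.append(entry)
                else:
                    rest.append(entry)
            neighborhood = rest
            clumps.append(group)
        i = j
    return clumps
```

Each inner list that comes back is one clump; keeping only its first member deduplicates the set.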
I’ve uploaded the deduplication code, as well as the resulting clumped fiction HTIDs, to my GitHub here.