Simulated Humanist Mind | Dabbling with HTRC word-level data – Yule’s K, Invariance and Genre

# Dabbling with HTRC word-level data – Yule’s K, Invariance and Genre

Posted by Craig on January 4th, 2016

It has now been an official “while” since Ted Underwood and his collaborators released some pre-generated data from their genre learning project through the HTRC portal.[1] The dataset consists of volumes from the HathiTrust digital library that have been OCR’d, word-tokenized, and split by genre (fiction/poetry/drama). As Underwood points out in his blog post announcing the release, this sort of collection seems especially valuable for tracking diachronic change or for understanding aggregate traits of texts over given periods, and it offers a whole new vista for scholars interested in the period it covers (1700-1922).

While there are many clever and sophisticated ways to approach such a dataset, I decided to start simple because A) I had no real project in mind and B) I wanted to familiarize myself with the dataset, as it seems like a resource I’ll likely draw upon again. I had in mind something like the power law/Zipf-based approach I had previously used on a smaller scale. However, that approach had many problems, the largest being the effect of differing text lengths on the measure. With nothing immediately at hand, I put the idea to the side.

After a few months of leaving the HTRC portal site open in my unabashedly cluttered collection of Firefox tabs, I stumbled across a paper that gave me a direction to take. That paper, “Computational Constancy Measures of Texts — Yule’s K and Renyi’s Entropy” by Kumiko Tanaka-Ishii and Shunsuke Aihara, tests a number of potential “textual constants,” relatively simple calculated measures that, in the words of the authors, “[converge] to a value for a certain amount of text and remains invariant for any larger size.”[2]

To summarize, Tanaka-Ishii and Aihara find that, of the measures tested, only Yule’s K (which they also show is functionally equivalent to Renyi’s entropy with the parameter α = 2) reliably converges to a constant value when using word-level features. They also find that this convergence can occur after as few as 1,000-10,000 words. Their final point, and the one of greatest importance to this post, is that K is unreliable for authorship identification. Instead, Tanaka-Ishii and Aihara conjecture that “H2 [Renyi’s entropy with α = 2, so functionally similar to K] has the potential to distinguish genre or maybe writing style.”[3] Finally, a use too obvious for even me to miss…

## K As Potential Genre Indicator (Or, the short version)

In order to test genre effects on K/H2 using the HTRC data I started by:

1. Obtaining the HTRC Word Frequency datasets for poetry and fiction. I used the pre-generated corpora provided on the HTRC site. Underwood et al.’s excellent work also allows the individual scholar to tweak the confidence levels etc. of the model to fit their own needs, but again, I wanted to keep this initial foray simple.
2. Writing a Python script to load the data and apply Underwood’s contextual OCR corrections and strip out non-word tokens
3. Using a Python implementation of the K measure to get the K values of the poetry and fiction corpora in aggregate (following the methodology of Tanaka-Ishii and Aihara, who used larger combined corpora alongside single texts in their tests)
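
The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the script used for this post: it assumes each volume arrives as tab-separated `token<TAB>count` lines (the real HTRC file layout may differ), it uses a tiny made-up correction table as a stand-in for Underwood’s contextual OCR rules, and the names `read_volume`, `yule_k` and `corpus_k` are hypothetical.

```python
import io
import re
from collections import Counter

# Hypothetical stand-in for an OCR correction table; the real rules
# distributed with the HTRC data are far more extensive.
OCR_CORRECTIONS = {"tbe": "the", "aud": "and"}

# Assumption: a "word" is a run of letters, optionally joined by
# internal apostrophes or hyphens. Everything else is dropped.
WORD_RE = re.compile(r"^[a-z]+(?:['-][a-z]+)*$")


def read_volume(handle):
    """Read 'token<TAB>count' lines into a Counter of cleaned word counts."""
    counts = Counter()
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip malformed rows
        token = parts[0].lower()
        token = OCR_CORRECTIONS.get(token, token)
        if WORD_RE.match(token):
            counts[token] += int(parts[1])
    return counts


def yule_k(frequencies, scale=10000):
    """Yule's K from an iterable of per-word frequency counts."""
    frequencies = list(frequencies)
    n = float(sum(frequencies))
    if n == 0:
        return 0.0
    s2 = sum(f * f for f in frequencies)
    return scale * (s2 - n) / (n * n)


def corpus_k(volumes):
    """Pool the word counts of several volumes and return K for the whole."""
    total = Counter()
    for volume in volumes:
        total.update(volume)
    return yule_k(total.values())


# Two toy "volumes"; the punctuation and the number are filtered out.
vol_a = read_volume(io.StringIO("the\t3\ntbe\t1\ncat\t2\n,\t5\n"))
vol_b = read_volume(io.StringIO("the\t2\ndog\t2\n1816\t1\n"))
print(corpus_k([vol_a, vol_b]))
```

Pooling the counts before computing K mirrors the aggregate measurement described in step 3; computing K per volume would simply mean calling `yule_k(volume.values())` on each `Counter` instead.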

With that basic test complete and the code framework in place, I tested a few other permutations. The pre-generated HTRC dataset does not prune reprints of earlier texts printed in later eras. To test the effect of these reprints on the K value, I implemented a crude filtering system using the Python fuzzy string comparison library fuzzywuzzy. Using fuzzywuzzy, I compared the title of each work to a list of previously seen titles, and only included the work in the dataset if fuzzywuzzy determined the title was significantly different from the other titles in the collection. There are many issues with this approach. One is speed: comparing every title against every previously seen title is very slow, so I restricted this permutation to works from 1700-1869. The other issue is conceptual. Taking this approach can certainly lead to undesired exclusions/inclusions (for example, think about condensed or simplified novels, and how they might be titled).

Regardless, in each case the K value for the aggregated fiction and poetry corpora differed. For the values, see the table below:
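
The title filter described above might look something like the sketch below. To keep it runnable without extra dependencies it swaps in `difflib.SequenceMatcher` from the standard library in place of fuzzywuzzy, so the scores will not match what fuzzywuzzy would produce. The threshold of 90, the volume identifiers and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def similarity(a, b):
    """Return a 0-100 similarity score between two titles."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def prune_reprints(works, threshold=90.0):
    """Keep only the first work in each group of near-identical titles.

    `works` is an iterable of (identifier, title) pairs, assumed to be
    sorted by date so that the earliest printing is the one retained.
    """
    kept = []
    for identifier, title in works:
        # Compare against every title kept so far: quadratic, hence slow.
        if all(similarity(title, seen) < threshold for _, seen in kept):
            kept.append((identifier, title))
    return kept


# Made-up identifiers; the second title is a near-duplicate of the first.
sample = [
    ("vol-1", "A Year in a coal-mine"),
    ("vol-2", "A year in a coal mine"),
    ("vol-3", "Evangeline"),
]
print(prune_reprints(sample))
```

Because each new title is checked against everything already kept, the cost grows roughly with the square of the number of titles, which is consistent with the slowness noted above. The conceptual problem is visible here too: an abridged edition with a near-identical title would be silently dropped.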

| | Poetry K | Fiction K |
|---|---|---|
| Full corpus | 75.24 | 87.37 |
| Pruned | 71.38 | 89.29 |

I also generated K values for each of the individual prose and poetry texts in the dataset. As reported in the Tanaka-Ishii/Aihara paper, K converges to a stable value quite quickly (1,000-10,000 tokens), so even the shorter texts should prove sufficiently stable. Graphing the two samples using kernel density estimation (after removing some suspicious outliers by requiring 0 < K < 500) produced this graph:

[Figure: kernel density estimates of K for the poetry and fiction texts, full corpus]

Almost looks normal! However, looking at a Q-Q plot, even after transforming both sets with log10 etc., convinced me otherwise. Since I could not assume normality, I ended up comparing the two samples with the two-sample K-S test, which produced a p-value < .05, meaning that we can reject the null hypothesis that the two samples come from the same distribution. Strictly speaking, the K-S test compares whole distributions rather than means, but the result is consistent with treating the gap between the two corpora as real (a mean of 95.6 for fiction and 89.2 for poetry, and a median of 93.1 for fiction and 85.1 for poetry). Here’s the same process over the pruned dataset:

[Figure: kernel density estimates of K for the poetry and fiction texts, pruned dataset]

The pruned dataset also had a K-S test p-value < .05, with a mean of 82.1 for poetry and 97.0 for fiction, and a median of 79.2 and 94.6, respectively.
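
For readers unfamiliar with the two-sample K-S test used above, here is a minimal pure-Python sketch of what it computes. It is not the routine used for the numbers in this post (in practice a library function such as SciPy’s `ks_2samp` would be the sensible choice). The p-value uses the standard asymptotic Kolmogorov series, which is only an approximation for small samples, and the sample values below are made up.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n_a, n_b, terms=100):
    """Asymptotic two-sided p-value for a K-S statistic d."""
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    if lam < 0.2:
        # The series converges poorly here, and the true value is ~1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Made-up K values standing in for two genre samples.
poetry_like = [70.0 + i for i in range(50)]
fiction_like = [95.0 + i for i in range(50)]
d = ks_statistic(poetry_like, fiction_like)
p = ks_pvalue(d, len(poetry_like), len(fiction_like))
print(d, p)
```

A small p-value says only that the two empirical distributions differ somewhere; it does not by itself say in which direction, or by how much the means differ.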

From this work I think I can tentatively conclude that Yule’s K is likely affected by genre. Because K is also, per Tanaka-Ishii/Aihara, invariant past a certain token threshold, it can serve as a potentially useful feature when determining text genre.

Here is a link to my GitHub, where you can find the source for processing the HTRC dataset, as well as CSVs containing the HTID and K value for the texts in the poetry and prose corpora. For more detail concerning K and some further conclusions, please do keep reading.

## Yule’s Measure K

K can be described mathematically in a few different ways, though some are more accurate than others.[4] Yule’s canonical formulation goes like this:

$K=C\cdot\frac{S_2-S_1}{S_1^2}$

Where $C$ is just a constant used to scale the resulting value (traditionally $10^4$), $S_1$ is $N$, the number of word tokens in the text, and $S_2$ is:

$S_2=\sum_{m}m^2V(m,N)$

Where $V(m,N)$ is the number of distinct words (types) that appear exactly $m$ times in the text under consideration.[5] The larger a Yule score, the less diverse a text’s vocabulary, though I am less interested in using K to make a judgment on vocabulary usage and more interested in making K significant through correlation with other categories. The measure is not too bad to grasp intuitively; Tanaka-Ishii and Aihara offer a brief explanation on p. 483 of their paper that I won’t rehearse here.
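
As a small worked example of the formula, the sketch below builds the frequency spectrum $V(m,N)$ for a toy sentence and plugs it into the definition. The function name `yule_k_from_tokens` and the toy sentences are made up for illustration.

```python
from collections import Counter


def yule_k_from_tokens(tokens, scale=10000):
    """Compute K from a list of word tokens via the spectrum V(m, N)."""
    type_counts = Counter(tokens)             # word type -> occurrences
    spectrum = Counter(type_counts.values())  # m -> V(m, N)
    s1 = sum(m * v for m, v in spectrum.items())      # S1, equal to N
    s2 = sum(m * m * v for m, v in spectrum.items())  # S2
    if s1 == 0:
        return 0.0  # no tokens, so K is undefined; report 0
    return scale * (s2 - s1) / (s1 * s1)


tokens = "the cat saw the dog and the dog saw the bird".split()
# 11 tokens: V(1) = 3 (cat, and, bird), V(2) = 2 (saw, dog), V(4) = 1 (the)
# S1 = 3*1 + 2*2 + 1*4 = 11 and S2 = 3*1 + 2*4 + 1*16 = 27
print(yule_k_from_tokens(tokens))            # 10000 * 16 / 121

# Extremes: all-distinct words give K = 0, pure repetition gives a high K.
print(yule_k_from_tokens("a b c d".split()))
print(yule_k_from_tokens("a a a a".split()))
```

The two extreme cases show the direction of the measure: a text that never repeats a word scores 0, while heavy repetition pushes K up.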

Despite this relative simplicity it still took me some effort to convert the expressions to Python.  Translating math to executable code is really one of my weak points.  My crude but workable version looked like this:

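    def yule_k_orig(input_words):
        """Yule's K from a list of per-word frequency counts."""
        n = float(sum(input_words))  # N, the total number of tokens
        mmax = max(input_words)      # highest frequency of any single word
        right = 0
        # Sum V(m, N) * (m / N)^2 over every frequency m that could occur
        for num in range(1, mmax + 1):
            right += input_words.count(num) * (num / n) ** 2
        k = 10000 * ((-1 / n) + right)
        return k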


A bit ugly. I later came across a version by Magnus Nissel that I modified to suit my needs:

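    def calculate_k(input_words):
        """Yule's K from a list of per-word frequency counts."""
        n = float(sum(input_words))                # N, the total number of tokens
        n2 = sum(num ** 2 for num in input_words)  # S2, the sum of squared counts
        try:
            k = 10000 * ((n2 - n) / (n ** 2))
            return k
        except ZeroDivisionError:
            # An empty input has no tokens; report 0 rather than failing
            return 0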


This implementation was used to calculate the above K values. As an interesting side note to all of this, Yule’s original inspiration for the measure came from methods for calculating accident risk used by the insurance industry.[6] This may not be a big surprise to Stephen Stigler/Ian Hacking fans, but I got a kick out of reading Yule’s own account of this strange little trading zone.
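
Returning briefly to the two functions above: they should agree because $\sum_m m^2 V(m,N)$ is just the sum of squared per-word counts, so $C(S_2-N)/N^2$ and $C(-1/N+\sum_m V(m,N)(m/N)^2)$ are the same quantity written two ways. The sketch below checks this numerically on a toy list of counts. It also checks the link to Renyi’s entropy, assuming the usual definition $H_2=-\log\sum_i p_i^2$ with $p_i$ the relative frequency of each word, under which $K=C(e^{-H_2}-1/N)$. The function names are made up, and this is my own working rather than anything taken from the paper’s code.

```python
import math


def k_spectrum_form(freqs, scale=10000):
    """K as scale * (-1/N + sum over m of V(m, N) * (m/N)^2)."""
    n = float(sum(freqs))
    total = 0.0
    for m in set(freqs):
        total += freqs.count(m) * (m / n) ** 2  # freqs.count(m) is V(m, N)
    return scale * (-1.0 / n + total)


def k_direct_form(freqs, scale=10000):
    """K as scale * (S2 - N) / N^2, with S2 the sum of squared counts."""
    n = float(sum(freqs))
    return scale * (sum(f * f for f in freqs) - n) / (n * n)


def renyi_h2(freqs):
    """Order-2 Renyi entropy (natural log) of the empirical distribution."""
    n = float(sum(freqs))
    return -math.log(sum((f / n) ** 2 for f in freqs))


freqs = [4, 2, 2, 1, 1, 1]  # toy per-word counts, N = 11
n = sum(freqs)
print(k_spectrum_form(freqs), k_direct_form(freqs))
print(10000 * (math.exp(-renyi_h2(freqs)) - 1.0 / n))
```

For large N the $1/N$ term becomes negligible, which is one way to see why K and $H_2$ carry essentially the same information.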

## K and Classification in the HTRC Corpus

As mentioned above, I also ran the K measure on the individual texts in the two corpora (available, again, on my GitHub). After sorting the pre-generated fiction corpus by K value, I noticed that a few groups of misclassified texts had gathered at high values of K. Keep in mind that much of this could likely be rectified by working with the model directly instead of the ready-made corpora. Here’s one small sub-sample of this clustering:

| HTID | Title | K |
|---|---|---|
| mdp.39015020796721 | The English cyclopaedia | 190.3641406879 |
| wu.89089183404 | Geography | 190.5215018866 |
| wu.89055266753 | A Year in a coal-mine | 190.9439613031 |
| mdp.39015063949401 | A Year in a coal-mine | 193.3023140397 |
| mdp.39015035787616 | Evangeline | 194.1801753054 |
| yale.39002043347278 | Three essays on Oriental painting | 196.575580242 |

Here we find a cluster of nonfiction works (as well as Longfellow’s poem Evangeline, which, given the contentions of this post, “should” actually have a lower K). While false positives will inevitably occur here and there in such a dataset, there appears to be a notable concentration at higher K. Just eyeballing it, the higher concentration of false positives appears to begin somewhere around a K of 190-200. Again, this seems to reinforce K’s ability to indicate text genre. Higher values of K, relative to this dataset, tend to be nonfiction prose, just as lower values appear more likely to be poetry.

To me, this again illustrates K’s promise as one potential measure for identifying genre. K values alone wouldn’t “cleanly” separate this dataset, but these clumps of high-K nonfiction texts, together with the poetry/fiction results explored above, seem to indicate that K has some power of discrimination and could fruitfully be used in concert with other genre-indicating statistics/features.
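
One simple way to pull out candidates like these for hand inspection is to filter a CSV of K values by a threshold. The sketch below uses the rows from the sub-sample above; the column names `htid`, `title` and `k` and the function `flag_high_k` are assumptions, and the cutoff of 190 is just the eyeballed value, not a tested boundary.

```python
import csv
import io

# Rows copied from the sub-sample in this post; the header row is an
# assumption about how such a CSV might be laid out.
SAMPLE_CSV = """htid,title,k
mdp.39015020796721,The English cyclopaedia,190.3641406879
wu.89089183404,Geography,190.5215018866
wu.89055266753,A Year in a coal-mine,190.9439613031
mdp.39015063949401,A Year in a coal-mine,193.3023140397
mdp.39015035787616,Evangeline,194.1801753054
yale.39002043347278,Three essays on Oriental painting,196.575580242
"""


def flag_high_k(csv_file, threshold=190.0):
    """Return (htid, title, k) rows with K at or above the threshold."""
    flagged = []
    for row in csv.DictReader(csv_file):
        k = float(row["k"])
        if k >= threshold:
            flagged.append((row["htid"], row["title"], k))
    return sorted(flagged, key=lambda item: item[2])  # ascending K


for htid, title, k in flag_high_k(io.StringIO(SAMPLE_CSV), threshold=194.0):
    print(htid, title, round(k, 2))
```

A filter like this only produces a reading list. Whether a flagged volume really is nonfiction still has to be decided by looking at it.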

## Remaining Issues/Ways Forward

At this point there are still many open questions. For example, what effect did the preprocessing of the HTRC fiction corpus have on which “leftover” nonfiction texts remained? The corpus does not lump whole texts into a genre on an all-or-nothing basis; instead, only the pages of a work deemed “fictional” were included in the fiction dataset. Since K is theoretically invariant, only extremely small text sizes should affect the calculation of the measure. However, it would be safest to check the page count/token count for all the samples used, since the original classification may well have correctly identified a small section of an otherwise nonfiction book as fiction and included only that subset.

Similarly, the nonfiction texts above could very well be caught by modifying the genre model’s confidence thresholds, focusing on the page level, etc. This is a very powerful model, and getting to know its internals better would likely solve misclassification as well as anything could. However, for the purposes of this post I am more interested in using the pre-generated corpus to test K than in using K to test the pre-generated corpus.

It would also be hasty to conclude that nonfiction texts in general take on higher K. The potentially misclassified nonfiction in this dataset tends towards higher values, but this could simply be because nonfiction with lower K was already classified out. Finally, would the pages of these texts be generally classified as nonfiction anyway? What sort of human consensus could be reached on their genre? Underwood provides an excellent discussion of this very issue in the performance report here.

I’m not sure how much time I will have to follow this up but I would like to:

1. Use precise methods to determine how well K predicts potential irregularities in the dataset. This would require some hand classification of said false positives, including looking up and reading potential nonfiction candidates
2. Generate some expectations for the value of K for non-fiction prose
3. See if K can be fruitfully employed as a part of a genre detection scheme on unclassified texts

It was also a pleasure to work with this data set.  Even if I can’t revisit this very issue, I’m certainly looking to use it in the future.  With luck, as a part of a dissertation project…

(In case it was missed above, find source and CSVs on Github here)

1. Ted Underwood, Boris Capitanu, Peter Organisciak, Sayan Bhattacharyya, Loretta Auvil, Colleen Fallaw, J. Stephen Downie (2015). Word Frequencies in English-Language Literature, 1700-1922 (0.2). HathiTrust Research Center. doi:10.13012/J8JW8BSJ
2. Kumiko Tanaka-Ishii and Shunsuke Aihara. “Computational Constancy Measures of Texts — Yule’s K and Renyi’s Entropy.” Computational Linguistics (2015)  41:3, 481-502. doi:10.1162/COLI_a_00228. 481.
3. ibid. 498
4. A. Miranda-García and J. Calle-Martín. “Yule’s Characteristic K Revisited.” Language Resources and Evaluation (2005) 39:287–294. doi:10.1007/s10579-005-8622-8
5. See Tanaka-Ishii and Aihara, as well as G. Udny Yule. The Statistical Study of Literary Vocabulary. New York: Cambridge University Press. 2014.
6. See Yule, 4