Posted by Craig on January 23rd, 2015
(This is part 2 of my series on Zipf’s law and literary stylistics. See part one here)
The Origins of Zipfian Regularity
In a short article for Discover magazine entitled “The Untidy Desk and the Larger Order of Things,” Hugh Kenner examines Zipf’s Law in the context of literary works like T.S. Eliot’s The Waste Land and Henry James’s The Ambassadors.1
Kenner offers a succinct non-technical definition of the Zipf phenomenon by focusing on the 80-20 effect of such systems — “the greater part of any activity [80%] draws on but a small fraction of resources [20%].”2 In linguistics, this time-saving phenomenon manifests itself as a correlation between the rank of a word (from most frequently used to least) and its frequency in the text, expressed as a percentage or fraction of all the words used. Mathematician and occasional linguist Yuri Manin offers a technical definition of this phenomenon, noting that Zipf’s law “states that if words of a language are ranked in the order of decreasing frequency in texts, the frequency is inversely proportional to the rank (sequence number in the list).”3 As Kenner more prosaically offers, if the largest city in the U.S. is New York, the “second largest has ½ the population of New York. Number three, ⅓ the population of New York” and so on and so forth until the last Montana holdout is counted.4
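As a concrete illustration of the rank–frequency relationship Manin describes (a minimal sketch of my own, not drawn from Kenner or Manin), a few lines of Python can pair each word’s rank with its relative frequency; under Zipf’s law, rank times frequency should hover near a constant:

```python
from collections import Counter

def rank_frequency(text):
    """Rank words by frequency and pair each rank (1 = most frequent)
    with that word's share of all words in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())
    return [(rank, count / total)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# If the text is Zipfian, freq(rank) ≈ C / rank, so the product
# rank * freq(rank) stays roughly constant across the ranked list.
```

Running this over a word index like the one Zipf used for Ulysses would reproduce the table he inspected by hand.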
Regardless of definition, this regularity appears in a staggering number of texts. Aside from the James and Eliot he surveys, Kenner notes that Zipf himself (as noted in my last post) discovered the regularity in a copy of the Word Index to James Joyce’s Ulysses. While Kenner points out that Zipf approached such a data point in a workmanlike fashion — “he was not surprised to find the same pattern exactly” — here the literary critic must pause.5 To find the same regularity both in an idiosyncratic text that, at least in a phenomenological sense, reads so strangely, and in a corpus (also examined by Zipf) of Sunday paper articles from Buffalo must necessarily trigger some eyebrow raising. If this metric flattens the difference between two wildly disparate literary productions, what can it tell the critic at all? Some might even contend that the phenomenological experience of literary style cannot supervene (to a greater or lesser degree) on this linguistic feature at all. Here is where we might choose to follow in Bulson’s footsteps. By pairing his own numeral-centered musings on Ulysses with a genetic approach to the composition of the text, Bulson extracts real value from an (admittedly idiosyncratic) word-count structure. We too may be able to find such correlates in the wider social sphere to give our results contextual value.
Power Laws and the Contextual Generation of Text
But first, in order to make claims about the literary properties of a work based on its statistical properties, we need a reliable way to track deviations from the power-law properties of ordinary natural language. Casually examining frequency tables or plots no longer suffices. Thankfully, new research and techniques have emerged to make the process simple and accurate. A power law like Zipf’s law takes the form of a probability distribution:

p(x) = C x^(−a)

where C is a normalizing constant.
The most important part of this equation for this paper is the exponential term represented by a. This constant, sometimes called the scaling constant, takes values between two and three for most systems.6
For samples of natural language produced by one author, the value of a generally remains close to two.7 As reported by researcher R. Ferrer i Cancho, the value of the scaling constant can vary above or below those bounds given the psychological state of a writer or the purpose of their writing — numerical proof of a context feeding back into the communicative act.8 Cancho’s examples include values of a significantly greater than 2 in the writings of patients with fragmented discourse schizophrenia, values of about 1.6 in the writing of young children, and, perhaps tellingly, values of about 1.7 in military combat texts.9 Cancho presents a formal model of cognition and language that attributes variations in a to differing attitudes in each of these writing subjects towards the cost and value of their communication. Cancho points out that previous writers on Zipf distributions have systematically neglected the fact that words have meanings corresponding to mental and social states, and that high values of a might correspond to “a higher weight on the communicative efficiency” while low values would indicate the opposite.10 Regardless of how well Cancho’s (somewhat idiosyncratic, information-theory-based) model of language and cognition corresponds with the empirical facts of human conversation, a correspondence between values of a, whether above or below 2, and the mental state or structural concerns of the communicator certainly seems to exist.
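Since the argument turns on where a falls relative to 2, it helps to be able to estimate it from data. A minimal sketch using the continuous maximum-likelihood estimator popularized by Clauset, Shalizi, and Newman (the choice of estimator is mine, not Cancho’s):

```python
import math

def estimate_exponent(values, x_min=1.0):
    """Maximum-likelihood estimate of the power-law exponent a for
    p(x) proportional to x^(-a), continuous approximation:
    a = 1 + n / sum(ln(x_i / x_min)) over values >= x_min."""
    xs = [x for x in values if x >= x_min]
    if not xs:
        raise ValueError("no values at or above x_min")
    # Note: degenerate if every value equals x_min (sum of logs is zero).
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)
```

Feeding in a text’s word-frequency counts (with a suitable x_min cutoff) yields an estimate of a that can be compared against the benchmarks Cancho reports.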
While Cancho seeks an explanation for this correspondence in information theory, as a literary critic I will be drawing on a more externally focused social semantics that brings historical factors into the account. I contend that the historical context of writing, which determines the meaning of individual words and thoughts by offering frameworks in which they can be interpreted, heavily shapes a text and can reveal authorial concerns hitherto unseen. Different frameworks of concern will compel authors to produce different types of texts, and their variance on the a scale can reveal salient features of those frameworks. While some forms of stylistic experimentalism might seem, on the surface level, roughly commensurate (triggering responses of “oh, that text is so X-ian or Y-ian”), analysis through power laws can pierce through to hidden cultural differences embedded in the style of the text.
After a brief post detailing the software authored for this investigation, my next main post will make an argument for this point through a series of example texts drawn from diverse sources including the Puritan divines and Gertrude Stein.