Zipf’s Law, Style and the Literary – Introduction

Posted by Craig on January 23rd, 2015

As a scholar of literature, I often feel embarrassed by linguistic regularities.

While the consistencies of language are the very thing that allows critical thought on literature to exist (in that they assure the critic that her readers will share at least some of the phenomenal experience of reading a given text), academic scholars are more likely to focus on the idiosyncratic moments of a work than on its ‘low level’ entropy-thwarting communicative structure. These moments of aberration more readily stick in the reader’s craw, demanding an explanation for their presence among the more pedestrian-seeming connectives and conjunctions that make up the ‘glue’ of written text. However, with the rise of techniques like topic modelling, the regularities of language are becoming a major focus of critical attention. This focus is not entirely novel. Texts from the critical canon of narratology, for instance, often focus on the regular as a matter of course.


The paradigmatic example must be Roland Barthes’ classic study S/Z. In his minute analysis of the structure of Balzac’s short story Sarrasine, Barthes does not hesitate to direct his gaze to Balzac’s use of what Barthes calls the ‘action’ code — a voice that “[implies] a logic in human behavior.” 1  These actions structure the text in a way that offers a baseline regularity for the reader to grasp.  Yet even this attempt to grasp the quotidian aspects of a text remains too high level.  Barthes’ focus remains solidly on the level of narrative — a coarser grain than the verbs and connectives that produce the action seme itself. But where will we find any finer grain?

As he often does, eminent Modernist scholar and avid Heathkit computer enthusiast Hugh Kenner provides us with a way forward. Kenner had a magpie mind, and was not afraid to seek new critical paths among other disciplines. One topic he hit upon was the linguistic regularity commonly known as Zipf’s Law. Though the exact formulation of Zipf’s Law will be covered in later posts, it can be colloquially described as one of the simplest word-level linguistic regularities. Essentially, it states that a word’s share of all the word occurrences in a given sample of natural language is correlated with the rank that word holds when the unique words in the sample are ordered by frequency. While I’m leaving the exact nature of this correlation for a later post, hopefully this description communicates the simplicity of the regularity. Following in the footsteps of Kenner’s (tentative) experimentation with Zipf’s Law, I will use this series of posts to explore what can be found using this metric.
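To make the two quantities in that description concrete, here is a minimal Python sketch of the counting involved. It is an illustration only, not the software used in this study: the function name, the crude tokenizer and the toy sample sentence are all assumptions of mine, and the sketch stops short of the law itself, which is left for the later post.

```python
import re
from collections import Counter


def rank_frequency_table(text):
    """Return (rank, word, count, share) rows, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Sort by descending count; break ties alphabetically so the ranking is stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # "share" is the word's fraction of all word occurrences in the sample.
    return [
        (rank, word, count, count / total)
        for rank, (word, count) in enumerate(ordered, start=1)
    ]


# Toy sample, invented purely for illustration.
sample = "the cat and the dog and the bird saw the cat"
for rank, word, count, share in rank_frequency_table(sample):
    print(f"{rank:>2}  {word:<5} {count:>2}  {share:.1%}")
```

Each row pairs a word’s rank with its share of the sample. A sample this small says nothing about the regularity; the point is only to show what is being counted.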

Discussion over what such techniques can add to humanistic inquiry is at an all-time high. A recent article by Eric Bulson entitled “Ulysses by Numbers,” which appeared in Representations, has engendered numerous responses over what just counting can get the literary critic. (Intriguingly, Bulson dances around Kenner’s work on Zipf, despite the fact that Kenner, and Zipf himself, were both inspired by the same Miles Hanley Word Index to James Joyce’s Ulysses that Bulson takes as his starting point. More on this later.) Ultimately, Bulson argues that the experience of composing Ulysses as a serial work conditioned Joyce into realizing how he could manipulate the disjunction between the narrative’s stated time and the time of the reading of the work. 2  By pairing counting with a genetic approach to understanding the composition of Ulysses, Bulson makes a strong argument for the use of quantitative analysis in the humanities. But where else can number point? What influences (of social structures, means of composition, form) can it reveal, using only extremely simple metrics that look at the granular consistency of a work? The power of machine learning and topic modelling is well known, but it comes with its attendant complexities, and doesn’t necessarily lead to any more trenchant critical payoff.

I aim to answer these questions in this upcoming series of posts on literary analysis using Zipf’s Law. The series will be broken down as follows:

1) This post, the introduction

2) A post on the history and technical aspects of Zipf’s Law, along with 2A (notes on the software I wrote for this study)

3) Some case studies: Puritans and experimental writing.

4) General conclusions regarding literature, quantitative methods, style and the reading mind

Given that I aim to have most of these posts pre-written, they should come out on a tight schedule, perhaps once a week.

(As an aside, I’ve adapted the posts in this series from the work I did over the summer of 2014 on a graduate research grant from UCLA. My thanks to Prof. Brian Kim Stefans for supervising this work.)

  1. Roland Barthes, S/Z, trans. Richard Miller (New York: Hill and Wang, 1973), 18.
  2. See Eric Bulson, “Ulysses by Numbers,” Representations, Vol. 127, No. 1, 19.

