Posted by Craig on February 3rd, 2015
(This is part 3 of my series on Zipf’s law and literary stylistics. See part two here.)
Software and Statistics
Today’s post continues the theme of my last, focusing on some technical aspects. I will first quickly take you through the software I used in this study. I will then end with a bit of a disclaimer that I feel is necessary whenever someone not formally educated in statistics delves into this subject.
In order to facilitate my exploration of the case studies in my next post(s), I have relied heavily on the Python module powerlaw, authored by Jeff Alstott, Ed Bullmore and Dietmar Plenz. The functions provided by powerlaw allow data to be fit to a power law distribution like Zipf’s, and the value of the exponent to be inferred. Powerlaw also allows this fit to be compared to other similar distributions (lognormal, for example) to make sure that the phenomenon in question is indeed governed by a power law. However, since natural language is generally considered to follow a Zipf distribution, I have deemed this step unnecessary in this particular case. In addition, I programmed a graphical user interface in Python that wraps powerlaw in order to make it easy to use for those unfamiliar with programming or command line tasks (see a screenshot below).
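To give a sense of what the fitting step involves, here is a minimal, self-contained sketch of the core calculation: counting word frequencies and estimating the power-law exponent by continuous maximum likelihood. This is an illustration only, not the powerlaw module itself — `word_frequencies` and `zipf_alpha_mle` are names I have made up for this sketch, and the estimator shown is the textbook MLE, which is akin to (but simpler than) what powerlaw computes.

```python
import math
from collections import Counter

def word_frequencies(text):
    """Crude tokenization: lowercase, split on whitespace,
    and return the frequency of each distinct word."""
    return list(Counter(text.lower().split()).values())

def zipf_alpha_mle(values, xmin=1.0):
    """Continuous maximum-likelihood estimate of the power-law
    exponent alpha, using only values >= xmin:
        alpha = 1 + n / sum(ln(x / xmin))
    (A simplified stand-in for what a library like powerlaw fits.)"""
    tail = [x for x in values if x >= xmin]
    return 1 + len(tail) / sum(math.log(x / xmin) for x in tail)
```

In practice one would hand the frequency list to powerlaw’s fitting routine rather than roll one’s own estimator, since the library also handles choosing a cutoff and comparing candidate distributions; the sketch above is just the inference idea in miniature.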
Results going forward will be generated by these tools. To set a baseline for this study as it moves ahead, I have ‘checked’ Kenner’s work by processing both Joyce’s Ulysses and Finnegans Wake in Zipf Explorer. The results are reproduced below:
True to Kenner’s assertion of Joyce’s surprising regularity, both Ulysses and Finnegans Wake fall just about within the normal range of a ≈ 2 for natural language, as per Cancho and others. Though experimentalism may well contribute to a text’s variation from this norm, it is clear that an experimental attitude towards language alone is not a sufficient explanation. Only certain types of prose, produced with strong reference to outside systems, cause such a deviation.
Here is where I wish to make a disclaimer. My training in statistics hasn’t been formal, but I know well enough that my results in the next post(s) have not been tested using the standard measures of statistical significance. Without this, I lose some ability to claim a causal relationship between literary context and the exponent represented by a. While this isn’t ideal, I want to offer a few reasons why I haven’t done this. On one hand, I am fearful of my own ignorance. Using the wrong test, or applying the right one incorrectly, might actually hurt the impact of my results. This is one of the prime reasons for my posting this work in a blog series rather than (at least initially) attempting publication in a peer-reviewed journal — I need feedback on these methods! I am also unsure if it is strictly necessary in this case. Cancho already shows that deviations equal to the ones I study in the next post(s) are significant, and the burden of proof in literary studies tends to focus on elements besides formal testing. In any case, please keep in mind that going ahead my methodology isn’t strictly kosher.
Next up — Puritans and Zipf’s law!