Following is a draft of an article intended for publication in TEXT Technology. The author would appreciate any comments or suggestions for revision. They may be sent to JohnsonE@dsuvax.dsu.edu

Computing the Kinds of Words Used in Novels

Eric Johnson

It is the underlying premise of this paper that novels are necessarily made only of words, and that analysis of what kinds of words are used may reveal something about the nature of novels. If novels are built from words in somewhat the same way that plants and animals are built from cells, and if a biological scientist can gather information about plants and animals by determining the proportion of various kinds of cells, then perhaps the literary researcher may be able to draw conclusions about novels by determining the percentage of various kinds of words.

Lists and FINDLIST

In order to study the percentage of a few of the kinds of words found in novels, seven lists of words were assembled. A computer program that I wrote called FINDLIST (see Note 1) was used to calculate the percentage of the running words in each of 32 novels that were found on those lists. (The 32 novels were picked almost at random from among those available in electronic form; most researchers would focus on a specific period, author, or subject, and thus would be more selective.) Output from FINDLIST is given in figure 1. The novels (and accompanying statistics) are listed in roughly chronological order with blank lines between novels by different authors.

Lists of Words for Food and Drink

The first four lists contain words associated with food, eating, drink, and drinking. List 1 contains general, rather abstract words connected with food such as "breakfast," "dine," "eat," and "meal." List 2 contains more specific words, usually the names of food such as "bacon," "bread," "potato," and "veal." List 3 contains general words such as "drink" and "sip," and list 4 includes more specific words such as "coffee," "tea," and "wine."
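
The kind of calculation described above can be sketched in Python. This is only a minimal illustration, not the SPITBOL program itself: the names WORD_LISTS, tokenize, and list_percentages are hypothetical, the four lists hold only the sample words quoted above, and the sketch assumes that a running word counts only when it matches a list entry exactly (how FINDLIST treats inflected forms or punctuation is not stated here).

```python
import re
from collections import Counter

# Hypothetical miniature lists: only the sample words quoted in the article.
WORD_LISTS = {
    "food_general": {"breakfast", "dine", "eat", "meal"},
    "food_specific": {"bacon", "bread", "potato", "veal"},
    "drink_general": {"drink", "sip"},
    "drink_specific": {"coffee", "tea", "wine"},
}


def tokenize(text):
    """Split text into lowercase running words (letters, with internal apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def list_percentages(text, word_lists):
    """Return, for each list, the percentage of running words found on that list."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter(tokens)
    result = {}
    for name, words in word_lists.items():
        hits = sum(counts[word] for word in words)  # exact matches only (assumption)
        result[name] = 100.0 * hits / total if total else 0.0
    return result


sample = "They eat bread and drink tea at breakfast; then they drink wine."
percentages = list_percentages(sample, WORD_LISTS)
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

With a real novel, the text of the file would be passed in place of the sample sentence, and the full lists would replace the miniature ones.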
As figure 1 shows, some novels have a relatively high percentage of food and drink words from all four lists: for example, four novels by Charles Dickens (Great Expectations, The Chimes, A Christmas Carol, and A Tale of Two Cities). Other novels have a relatively low percentage of food and drink words: for example, Nathaniel Hawthorne's Scarlet Letter and Henry James's Confidence.

A comparison of lists 1 and 2 with lists 3 and 4 indicates that in these 32 novels, there is substantially more food and eating than drink and drinking. Dickens's short novel The Chimes has the highest percentage of food words: a total of 0.44% (lists 1 and 2 combined). The highest percentage of drink words is less than half of that: 0.18% (lists 3 and 4 combined) for Dickens's Tale of Two Cities and for James Janke's Jeremiah Bacon.

In describing food and eating, an author may elect to be rather general or fairly specific. In Jane Austen's Pride and Prejudice, a character comments simply that she and her friends had "the nicest cold luncheon in the world" (page 222). Compare that general observation with the following from James Janke's Jeremiah Bacon: "He finished the sandwich and reached for a hard boiled egg. Then he ate a second egg and another large piece of cheese along with another meat sandwich" (page 52).

A writer may consistently use general descriptions -- perhaps to allow readers' imaginations to fill in the specifics; in that case, on the subject of food, a higher percentage of words from list 1 than from list 2 would be used. That is exactly what is shown in figure 1 for the first eleven text files: the first six are novels by Jane Austen, and the next five are novels by Anthony Trollope. By contrast, the next four novels (by Dickens) show a higher percentage for list 2 -- or, at least, about the same as list 1. That is what might be expected by readers of Dickens who are familiar with his substantial descriptions.
So, too, White Fang, Call of the Wild, and The Sea Wolf by Jack London have high percentages of both general and specific words for food.

A similar contrast between writers who prefer descriptions that are general and those who prefer more specific descriptions can be seen in the percentages from lists 3 and 4 that deal with drink -- although the contrast is less sharp because the numbers are smaller. Trollope's Rachel Ray appears to be an anomaly with its relatively high percentage of specific words for drink (0.13%). The novel's setting and plot deal with a brewery, and there is much discussion about the production of "good beer" and "bad beer." (It is always possible for unique features of a novel to skew percentages such as those in figure 1: obviously, the frequent mention of the title character of Jeremiah Bacon inflates the percentage for list 2, which includes the word "bacon," and makes it inaccurate as a calculation of food in that novel.)

Of course the plot of a novel may obviate using some kinds of words. Joseph Conrad's short novel Typhoon may be the greatest storm novel ever published. The crew of a ship caught in a typhoon has little time to eat or drink, and, accordingly, the percentages of words on the four food and drink lists are very low. A similar point can be made for the soldiers in Stephen Crane's Red Badge of Courage, and its food and drink percentages are also low. Considering its plot, H. G. Wells's Invisible Man has surprisingly high percentages of words for food and eating.

Number Words

In order to determine whether there are significant differences in numeric specificity among novels, I compiled list 5: it contains words that are used for numbers such as "one," "two," "ten," "hundred," and "million." Janke's Blood on the Wind River Mountains has the highest percentage of number words (1.16%), followed by Jack London's Call of the Wild (1.10%). The lowest percentage of number words (0.46%) is found in Hawthorne's Scarlet Letter and in James Fenimore Cooper's Last of the Mohicans.
The percentages of number words in the six novels by Janke are all in the highest third; the percentages for the three London novels are in the top half. There is obviously a tendency in Janke and London to be numerically specific. If Cooper's Mohicans did not have the lowest percentage, a researcher might conclude that novels about adventure and the outdoors always use a high percentage of number words.

There do seem to be tendencies. Novels of adventure and the outdoors (such as those mentioned above) tend to be numerically specific. Novels of manners and those with intellectual themes tend not to be numerically specific: the novels of Austen and Trollope are in the lower half of the calculations of numbers, as is Henry James's Confidence.

Letters and Correspondence

There was a brief discussion on an Internet list devoted to Jane Austen about whether there was a good deal of mention of letters and correspondence in the six Austen novels. List 6 contains words such as "letter," "epistle," and "missive." The six Austen novels do contain frequent references to letters: they are in the upper half of the 32 novels tested for percentage of letter words; of Austen's six novels, Emma has the highest percentage.

Of the 32 novels in this study, Wilkie Collins's Woman in White has the highest percentage of letter words (0.20%). Three novels by Janke have the lowest percentage: two of his novels (Blood on the Wind River Mountains and Winter Kill) have only one one-hundredth of a percent of letter words, and A Tinstar for Braddock has zero.

Many word forms are homographs with several meanings, and thus adjustments in percentages may have to be made. For example, the word "letter" may mean an alphabetic character as well as an epistle. In Hawthorne's Scarlet Letter, the word "letter" occurs 135 times, and every single time it means an alphabetic letter (usually the scarlet A) -- never a piece of correspondence.
Therefore, the relatively high frequency (0.17%) of letter words in The Scarlet Letter gives no indication of sending or receiving any epistles.

More often the adjustment required due to homographs is slight. In Austen's Emma, four adults casually spell out words using a children's alphabet game, and they briefly discuss the word "blunder" that was spelled by one of them (pages 347-350). In this context of alphabetic characters, the words "letter" and "letters" are used eight times (of the total of 147 occurrences of these two words in the novel). The adjusted percentage of letter words for Emma is thus a little over 0.12 (rather than the 0.13 listed in figure 1).

Common Words

Over the years in working with indexing and concordance programs, I have collected a list of common words. This list (list 7 in figure 1) contains 125 words such as articles, prepositions, and forms of "to be" and "to have" that are normally excluded from computer analysis in order to focus on what are thought to be more interesting words.

Approximately fifty percent of the running words of each of the 32 novels tested are common words on list 7, but there are interesting patterns. There is about a ten-percent difference between the novels with the highest percentage of common words and those with the lowest. The percentages of common words at the top, bottom, and middle tend to cluster by author. The six novels by Janke are the six lowest shown in figure 1. The five novels by Trollope are the five highest. Each of the six novels of Austen has a percentage similar to the others (52.45 to 53.45), and the two novels by Hawthorne have extremely similar percentages (49.26 and 49.46). Apparently novelists may differ as much as ten percent one from another in their use of common words, but any one author tends to use almost exactly the same percentage of common words in each novel.
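
The clustering by author can be rechecked directly from the list 7 column of figure 1. The following Python sketch is only an illustration: the grouping of percentages under each author follows the attributions given in this article, and the names COMMON_BY_AUTHOR and spread are hypothetical.

```python
# Common-word percentages (list 7) copied from figure 1, grouped by author
# according to the attributions given in the article.
COMMON_BY_AUTHOR = {
    "Austen": [53.29, 53.45, 53.35, 52.94, 52.68, 52.45],
    "Trollope": [56.62, 57.74, 56.28, 54.80, 58.00],
    "Hawthorne": [49.26, 49.46],
    "Janke": [44.82, 45.93, 46.19, 47.00, 47.25, 44.24],
}


def spread(values):
    """Difference between the highest and lowest percentage in a group."""
    return max(values) - min(values)


for author, values in COMMON_BY_AUTHOR.items():
    print(
        f"{author}: {min(values):.2f} to {max(values):.2f} "
        f"(spread {spread(values):.2f})"
    )
```

Comparing each author's spread with the gap between authors is one simple way to judge how tight the clustering is; a more careful study would need a statistical test and more novels per author.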
It would seem reasonable to assume that those authors whose works have a lower percentage of common words would thus have a larger vocabulary and be better authors. Conversely, those with a higher percentage of common words should have a smaller vocabulary and be lesser authors. The six novels of Janke have a full ten percent lower percentage of common words than five novels of Trollope. Yet, with due respect to my colleague and friend Jim Janke, he is not a greater author than Anthony Trollope.

Conclusion

If novels are necessarily made only of words, analysis of what kinds of words are used may reveal something about the nature of specific novels. Since particular authors tend to use a similar percentage of some kinds of words (such as common words) in all their works, calculation of word percentages may be a test useful in identifying authorship.

This paper has not attempted to reach any firm conclusions about specific novels or authors -- too few words, lists, and novels have been analyzed. However, the method of using computers to calculate the percentage of types of words in novels can, no doubt, be profitably used much more extensively by others.

_________________________________________________________________

Note

1 FINDLIST was created using a Catspaw SPITBOL-386 compiler. The program was run under DOS and Windows on an ALR 486. Although the program was not named, FINDLIST produced the calculations in my "How Jane Austen's Characters Talk," TEXT Technology, 4.4 (Winter 1994), 263-267.

_________________________________________________________________

References

Austen, Jane. Pride and Prejudice. London: Oxford University Press, 1932.
Austen, Jane. Emma. London: Oxford University Press, 1933.
Janke, James. Jeremiah Bacon. Madison, SD: American Polygon Publishing Group, Inc.,
1992.

_________________________________________________________________

Eric Johnson is the Editor of TEXT Technology: The Journal of Computer Text Processing. He can be contacted via his Web page: http://www.dsu.edu/~johnsone/eric.html or directly by email: johnsone@dsuvax.dsu.edu.

_________________________________________________________________

                List 1  List 2  List 3  List 4  List 5  List 6  List 7
Text File        Food1   Food2  Drink1  Drink2  Number  Letter  Common
                 Words   Words   Words   Words   Words   Words   Words

SENSE.AND        0.08%   0.01%   0.01%   0.02%   0.68%   0.10%  53.29%
PRIDE.AND        0.11%   0.01%   0.01%   0.02%   0.57%   0.13%  53.45%
MANSFIEL.D       0.08%   0.02%   0.01%   0.02%   0.59%   0.09%  53.35%
EM.MA            0.09%   0.06%   0.01%   0.02%   0.58%   0.13%  52.94%
NORTHANG.ER      0.08%   0.02%   0.02%   0.02%   0.75%   0.07%  52.68%
PERSUASI.ON      0.08%   0.01%   0.00%   0.00%   0.66%   0.08%  52.45%

AYALAS.ANG       0.12%   0.05%   0.02%   0.03%   0.57%   0.16%  56.62%
LADY.ANN         0.07%   0.01%   0.01%   0.01%   0.51%   0.11%  57.74%
FORGIVE.HER      0.13%   0.03%   0.02%   0.04%   0.66%   0.12%  56.28%
RACHEL.RAY       0.13%   0.05%   0.03%   0.13%   0.50%   0.11%  54.80%
WORTLE.DR        0.08%   0.04%   0.03%   0.03%   0.65%   0.16%  58.00%

GREAT.EXP        0.13%   0.12%   0.04%   0.06%   0.63%   0.06%  54.51%
CHIMES.THE       0.14%   0.30%   0.02%   0.06%   0.53%   0.11%  50.58%
CAROL.XMA        0.11%   0.17%   0.02%   0.04%   0.80%   0.02%  51.23%
TALEOF.TWO       0.07%   0.05%   0.06%   0.12%   0.85%   0.07%  52.59%

WOMAN.IN         0.06%   0.01%   0.01%   0.02%   0.63%   0.20%  52.89%

INVISIBL.E       0.14%   0.06%   0.03%   0.03%   0.79%   0.05%  49.80%

TYPH.OON         0.02%   0.06%   0.01%   0.01%   0.56%   0.09%  48.24%

SCARLET.LET      0.04%   0.03%   0.00%   0.00%   0.46%   0.17%  49.26%
HOUSEOF.7        0.08%   0.11%   0.02%   0.04%   0.64%   0.03%  49.46%

MOHICANS.THE     0.04%   0.02%   0.01%   0.00%   0.46%   0.04%  49.57%

REDBADGE.THE     0.03%   0.03%   0.02%   0.02%   0.48%   0.16%  48.42%

CONFIDEN.CE      0.05%   0.01%   0.00%   0.00%   0.63%   0.08%  54.27%

WHITEFAN.G       0.18%   0.22%   0.01%   0.02%   0.70%   0.03%  50.42%
CALLWILD.THE     0.17%   0.10%   0.02%   0.00%   1.10%   0.02%  49.25%
SEAWOLF.THE      0.11%   0.07%   0.03%   0.05%   0.73%   0.07%  52.19%

JEREMIAH.BAC     0.08%   0.27%   0.08%   0.10%   0.91%   0.06%  44.82%
MCHENRY.LST      0.07%   0.04%   0.02%   0.00%   0.88%   0.02%  45.93%
TINSTAR.FOR      0.07%   0.03%   0.02%   0.04%   0.75%   0.00%  46.19%
BLOOD.ON         0.03%   0.04%   0.00%   0.02%   1.16%   0.01%  47.00%
WINTER.KIL       0.07%   0.03%   0.00%   0.01%   0.81%   0.01%  47.25%
LASTBOAT.TO      0.05%   0.02%   0.02%   0.04%   0.80%   0.07%  44.24%

Figure 1. Output from FINDLIST showing the percentage of words in text files that are found on seven lists.
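
The combined food and drink totals quoted in the text (lists 1 and 2, and lists 3 and 4) can be recomputed from the rows of figure 1. The Python sketch below is a hedged illustration, not part of FINDLIST: the three rows are copied from the figure, while FIGURE_ROWS, COLUMNS, and parse_rows are hypothetical names, and the parser assumes one file name followed by seven percentages per row.

```python
# Three rows copied from figure 1: file name, then lists 1-7 as percentages.
FIGURE_ROWS = """\
CHIMES.THE 0.14% 0.30% 0.02% 0.06% 0.53% 0.11% 50.58%
TALEOF.TWO 0.07% 0.05% 0.06% 0.12% 0.85% 0.07% 52.59%
JEREMIAH.BAC 0.08% 0.27% 0.08% 0.10% 0.91% 0.06% 44.82%
"""

# Hypothetical column keys for lists 1 through 7.
COLUMNS = ["food1", "food2", "drink1", "drink2", "number", "letter", "common"]


def parse_rows(text):
    """Parse whitespace-separated rows into {file name: {column: percentage}}."""
    rows = {}
    for line in text.splitlines():
        name, *values = line.split()
        numbers = (float(value.rstrip("%")) for value in values)
        rows[name] = dict(zip(COLUMNS, numbers))
    return rows


rows = parse_rows(FIGURE_ROWS)
for name, row in rows.items():
    food = row["food1"] + row["food2"]      # lists 1 and 2 combined
    drink = row["drink1"] + row["drink2"]   # lists 3 and 4 combined
    print(f"{name}: food {food:.2f}%, drink {drink:.2f}%")
```

The same parser could be applied to all 32 rows to rank the novels on any list.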