Today’s Wall Street Journal reports on a database of language usage that is now available to researchers. It is derived from the digital library of the world’s books that Google has been assembling in recent years. Thus far, two billion words are available from over five million titles published over the past two hundred years. Some interesting patterns of English usage are evident, including the distributions of the terms man and woman. The WSJ article is a good read.
You can explore data sets at Google Labs.
The viewer provided by Google Labs lets you choose different sets of source material and phrases, and explore a number of different languages (right now several varieties of English, as well as Russian, French, German, Spanish and simplified Chinese).
Having so much language data available so readily is dazzling. Consider the following statistics: at present Google has digitized 15 million titles, yielding more than 2 trillion words in 400 languages! By one estimate, about 129 million titles have been published since the invention of the printing press, so what has been digitized so far represents just under 12 percent of the entire body that could be digitized. Of course, this body will keep growing as new titles are written and published.
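As a quick check on the arithmetic, here is a minimal Python sketch that uses only the figures quoted above. The variable names are my own, and the words-per-title figure is simply a derived average, not a number reported by Google.

```python
# Figures quoted in the paragraph above.
digitized_titles = 15_000_000
digitized_words = 2_000_000_000_000   # "more than 2 trillion", so a lower bound
estimated_titles = 129_000_000        # one estimate of all titles since the printing press

# Share of all estimated titles that has been digitized so far.
share = digitized_titles / estimated_titles
print(f"{share:.1%} of estimated titles digitized")

# Rough average length of a digitized title, in words.
words_per_title = digitized_words / digitized_titles
print(f"about {words_per_title:,.0f} words per title on average")
```

The share works out to roughly 11.6 percent, which matches the "just under 12 percent" figure.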
What kinds of linguistic phenomena can be examined through explorations of such a repository? Here are a few that come to mind.
- The rise (or demise) of borrowings from one language into another. The expression modus vivendi (way of living) is a Latin phrase borrowed into English. The expression is almost nonexistent prior to around 1880, then it has a fairly steady climb until just before 1970, at which point it declines fairly sharply to present times. This trend is clearest for American English; the phrase has a bit more staying power in the British corpus. Taco is interesting. In American English there is a tiny bump around 1810, then near obscurity until about 1970, at which point the word’s usage rises steeply and continuously to the present. In British usage, this term from Spanish simmered along at a notably higher rate of usage than was true for American English, rose sharply about the same time as for American English, but has apparently declined after 2000. The English word buzz rides a rocket after about 2000 in terms of appearance in French and German texts. Is this evidence of the borrowing of the social media sense of this word as ‘interest, atmosphere of excitement’? Buzz shows a sharp rise in English as well, after 2000.
- The appearance or disappearance of native forms within a language. The noun text appears throughout the English corpus, but the verb form texting (no surprise here) shoots almost vertically after 2000, after creeping along in the writing from about 1940 to 1980, when it begins to rise slowly until the explosion post-2000. Coworker and codependent are both very recent bloomers. The usage of telegraph probably mirrors its importance as a technology; there’s an almost bell-curved distribution of it between 1840 and 1980. The term nosegay (small bouquet of flowers) is a thriving word in the 19th century and experiences a continuous decline through the 20th, with a little bounce again in the 21st. Is Martha Stewart responsible for this recent uptick? 🙂
- The use (or disuse) of hyphens to convey the connectedness of ideas. Recent examples of this phenomenon include e-mail/email and e-commerce/ecommerce. Unfortunately, hyphenated words don’t appear to be included (or have their spellings been normalized? Highly doubtful). It would be interesting to trace patterns of tell-tale versus telltale, for example, or to what extent tele- as a prefix was hyphenated with following forms such as -port, -graph, -kinesis, etc.
- The replacement of one term with another in similar contexts. Google provides a nice example with burnt/burned, alternate spellings of the same word, with burned on the rise as burnt subsides. Woman has a higher usage throughout the corpus than lady, but the two terms diverge more after around 1960, with woman on the rise and lady occurring more rarely. Is that proof that one term is replacing the other? It’s suggestive because the terms do share contexts, but they also have different senses in which they wouldn’t be natural substitutions for each other. In British English, however, where lady has meanings not common in American English, the same divergence appears, with lady on the decline. Iniquity and injustice show a crossover pattern for American English, with injustice gaining ground during the mid-19th century. Both terms appear to be falling off in the 20th century, with iniquity now quite rare.
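For the hyphenation question raised in the list above, one practical step is to generate every spelling variant of a form so the variants can be queried side by side. The following is only a sketch: `spelling_variants` is a hypothetical helper of my own, not part of any Google tool, and it assumes the corpus being searched preserves hyphens at all, which is exactly what seems to be in doubt.

```python
def spelling_variants(hyphenated: str) -> list[str]:
    """Return the hyphenated, closed, and open spellings of a hyphenated form.

    Hypothetical helper for illustration only.
    """
    parts = hyphenated.split("-")
    return [hyphenated, "".join(parts), " ".join(parts)]


# Forms mentioned in the list above, written with their hyphens.
for form in ["tell-tale", "e-mail", "e-commerce", "tele-graph"]:
    print(spelling_variants(form))
```

Each resulting list (for example tell-tale, telltale, tell tale) could then be entered as a set of comparison phrases, so that a missing hyphenated variant shows up as an obvious gap.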
Some questions are worth raising in the context of all this lovely data. Words and phrases can and do convey different meanings in different contexts; mad and angry share contexts, but mad includes others. Investigating possible trends in the usage of similar senses more accurately requires looking at contexts that restrict the words to those senses, e.g., be mad at and be angry at.
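To make the point about restricting contexts concrete, here is a small Python sketch. The sample text is invented purely for illustration and is not drawn from the Google data; `count_phrase` and the pattern for forms of be are my own assumptions about how one might approximate "be mad at" and "be angry at".

```python
import re


def count_phrase(text: str, pattern: str) -> int:
    """Count case-insensitive matches of a regular expression in the text."""
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Invented sample text, for illustration only.
sample = (
    "She was mad at him. He is angry at the delay. "
    "They are mad about jazz. The mad scientist laughed. I am mad at myself."
)

# A form of "be" followed by whitespace (a rough approximation).
BE = r"\b(?:am|is|are|was|were|be|been|being)\s+"

mad_any = count_phrase(sample, r"\bmad\b")            # every sense of "mad"
mad_at = count_phrase(sample, BE + r"mad\s+at\b")      # restricted to "be mad at"
angry_at = count_phrase(sample, BE + r"angry\s+at\b")  # restricted to "be angry at"

print(mad_any, mad_at, angry_at)
```

In this toy sample, mad occurs four times but only twice in the "be mad at" frame, so a raw count of mad would overstate how often it competes with angry.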
Many patterns that one finds in this repository will not be statistically significant. It is also important to know what editing policies applied to these works, both in the originals and by the database designers. I was surprised not to find any instances of the hyphenated expressions I queried; I have numerous examples of such forms in well-known American literature. Linguists who intend to use this data for research into historical variation and change will care deeply about the orthographic policies applied to the tokens they are looking at.
Many interesting and valuable interfaces for exploring this data can and should be implemented. Google has put out a nice starter with the Books Ngram Viewer.
In looking at language usage across the centuries, it’s important to remember that most of the data we have, including this absolutely amazing database, is written language, not spoken language. There are important differences between speech and writing; any linguist who has systematically studied tape-recorded speech or taken field notes on spoken language knows this. Even dialogue written by talented writers is not an infallible window onto the spoken language of a given time and place, although it can certainly be useful. Our digital age with its unedited, nearly instantaneous delivery of non-spoken communication may be blurring the distinctions that have often been seen between spoken and written idioms. But for most of history until now, writing was the result of rumination and editing. Writing has often lagged behind speech in terms of style and innovation.
I encourage everyone to take a look at this fascinating database!