The Technium

Culturomics


The library of the future will contain a unified text composed of all books, magazines and newspapers (and blogs), completely hyperlinked and co-located. This aggregation has already begun as Google, Amazon and others digitize the books of our libraries and keep them machine-readable. What if you could read all the books at once and deduce the patterns among their billions of words?

Some call that Culturomics because it would provide a quantitative analysis of culture, but I think of it as reading the universal library.

The Google/University consortium has digitized about 15 million books so far. Researchers at Harvard took the full text of the most reliable 5 million books, combined them into one file, and treated their more than 500 billion words as a single text. As they reported in the journal Science in December 2010, they then analyzed the patterns of word usage in this aggregated text.

The tools of their analysis were made public as Google’s Ngram Viewer, which anyone can use. You give it a word or phrase and it graphs that phrase’s usage in books over time. You can even enter two words (ideas) and compare their adoption patterns over time.
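To make the idea concrete, here is a minimal Python sketch of that kind of lookup. It is not Google’s implementation: the tiny corpus, the function names, and the choice to normalize each year’s count by the total number of n-grams of the same length in that year are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def usage_by_year(corpus, phrase):
    """Map each year to the share of that year's n-grams that match phrase."""
    target = tuple(tokenize(phrase))
    n = len(target)
    usage = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            # Count n-grams per text so phrases never span two books.
            counts.update(ngrams(tokenize(text), n))
        total = sum(counts.values())
        usage[year] = counts[target] / total if total else 0.0
    return usage


# Invented toy data: year -> texts published that year.
TOY_CORPUS = {
    1900: ["the horse pulled the cart", "a horse and a carriage"],
    1950: ["the car passed the horse", "a car and a truck"],
    2000: ["the car is parked", "my car needs a new battery"],
}

if __name__ == "__main__":
    for word in ("horse", "car"):
        print(word, usage_by_year(TOY_CORPUS, word))
```

Plotting the two resulting series against the year gives the kind of comparison described above: one word fading as the other spreads.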

[Image: N-gram graph]

In their paper “Quantitative Analysis of Culture Using Millions of Digitized Books,” the authors report what they found when they examined not just two trends but hundreds of vectors at once:

“We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology.”

All the words on the World Wide Web also form a large text, and one can apply a similar n-gram analysis to it, though not over the long historical span that books cover. Nonetheless, interesting surprises result. Microsoft is running an n-gram project on Bing, and they found that the 10,000 most commonly used words in English (on the web) change by 10% over about a year. That rapid turnover was unexpected, and it suggests our language is changing very fast.
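One way to picture that turnover figure is as the share of last year’s most common words that no longer make this year’s list. The Python sketch below measures it on invented data; it is an assumption about how such a number could be computed, not a description of Microsoft’s method, and the function names and word lists are hypothetical.

```python
from collections import Counter


def top_words(tokens, k):
    """Return the set of the k most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def turnover(old_tokens, new_tokens, k):
    """Fraction of the old top-k words that are missing from the new top-k."""
    old_top = top_words(old_tokens, k)
    new_top = top_words(new_tokens, k)
    if not old_top:
        return 0.0
    return len(old_top - new_top) / len(old_top)


# Invented snapshots of web text taken a year apart (counts are all distinct).
LAST_YEAR = ["web"] * 5 + ["blog"] * 4 + ["modem"] * 3 + ["fax"] * 2 + ["pager"]
THIS_YEAR = ["web"] * 5 + ["blog"] * 4 + ["tweet"] * 3 + ["fax"] * 2 + ["modem"]

if __name__ == "__main__":
    print(turnover(LAST_YEAR, THIS_YEAR, k=4))
```

With k set to 10,000 and real crawl data in place of the toy lists, a result of 0.10 would correspond to the 10% change described above.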

There’s a TEDx Boston talk that summarizes the group’s initial findings.

I agree that being able to measure and quantify our words and pictures in real time and in the past will give us an x-ray into our culture. We’ll be able to track the diffusion of ideas, and their retreat, and to examine and study how culture learns and forgets. Culturomics will be a key instrument in understanding and managing the global social infrastructure.



