Google Books Makes a Word Cloud of Human History

Posted by Soulskill on Friday December 17, 2010 @07:18AM from the trends-for-books dept.

An anonymous reader writes "From Ed Yong at the Not Exactly Rocket Science blog: 'Just as petrified fossils tell us about the evolution of life on earth, the words written in books narrate the history of humanity. The words tell a story, not just through the sentences they form, but in how often they occur. Uncovering those tales isn't easy — you'd need to convert books into a digital format so that their text can be analyzed and compared. And you'd need to do that for millions of books. Fortunately, that's exactly what Google have been doing since 2004.' Yong goes on to explain that the astounding record of human culture found in Google Books offers new research paths to social scientists, linguists, and humanities scholars. Some of the early findings (abstract), based on an analysis of 5 million books containing 500 billion words: English is still adding words at a breathtaking pace; grammar is evolving and often becoming more regular; we're forgetting our history more quickly; and celebrities are younger than they used to be. You can also play with the Google Books search tool yourself. For example, here's a neat comparison of how often the words Britannica and Wikipedia have appeared."

21 of 127 comments

OCR errors by SputnikPanic · 2010-12-17 07:25 · Score: 5, Interesting

AFAIK, Google Books doesn't do the sort of methodical OCR clean-up that Project Gutenberg does, so a lot of Google's digitized books have a fair number of errors. It'd be funny to see what kind of blips this might create in our extracted cultural history!
1. Re:OCR errors by SputnikPanic · 2010-12-17 07:36 · Score: 2
  
  From Google's "about" page for their Books Ngram Viewer lab: "Why does the word 'Internet' occur before 1950?"
2. Re:OCR errors by migla · 2010-12-17 07:41 · Score: 2
  
  A Simpsons quote where Lenny as a kid talks about the netting in his shorts, the internet, and later says "I think I just logged onto the internet" comes to mind...
  
  --
  Some of my favourite people are from the US; Vonnegut, Chomsky, Bill Hicks.
3. Re:OCR errors by meloneg · 2010-12-17 08:50 · Score: 2
  
  If you follow links on that ngram (and play with the date ranges a bit), you find this query that seems to be showing a lot of those references to Abe were in the meta-data.
  A little more digging finds this little gem. Which appears to just be mis-dated. I suspect it was written in 1890 from looking very carefully at the copyright page.
  It's also very possible that some of those references are to other people with the same name. Like this one and this one.
4. Re:OCR errors by raddan · 2010-12-17 09:38 · Score: 3, Funny
  
  Maybe so, maybe so. All I know is that 1720 was a really bad year.
5. Re:OCR errors by Motard · 2010-12-17 09:48 · Score: 2
  
  Yes, here's an amazingly prescient book from 1920: 101 Successful Businesses You Can Start on the Internet
Case sensitive? by IWannaBeAnAC · 2010-12-17 07:25 · Score: 4, Informative

Interesting that it is case sensitive. Searching for "britannica,wikipedia" in lowercase produces, for the present day, close to zero for britannica and 0.00005% for wikipedia, which is not far off the result for Wikipedia (capitalized).
Putting these together, the case-insensitive comparison already has wikipedia well ahead: around 0.00010% for britannica vs 0.00013% for wikipedia.
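
The "putting these together" step can be sketched as a case fold over per-spelling frequencies. This is only an illustration: `fold_case` is a made-up helper, and the per-spelling percentages are assumed values picked to roughly match the totals quoted above, not numbers read off the viewer. Adding the percentages is reasonable here only because every spelling is measured against the same yearly word total.

```python
from collections import defaultdict


def fold_case(freqs):
    """Sum frequencies of spellings that differ only in letter case."""
    totals = defaultdict(float)
    for word, pct in freqs.items():
        totals[word.lower()] += pct
    return dict(totals)


# Illustrative percentages only (assumed, not read off the viewer).
observed = {
    "Britannica": 0.00010,
    "britannica": 0.00000,
    "Wikipedia": 0.00008,
    "wikipedia": 0.00005,
}

combined = fold_case(observed)
for word, pct in sorted(combined.items()):
    print(f"{word}: {pct:.5f}%")
```

With these assumed inputs the folded total for wikipedia comes out ahead of britannica, which is the comparison described above.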
1. Re:Case sensitive? by Daniel Dvorkin · 2010-12-17 09:39 · Score: 2
  
  Freedom has always been popular, but since the early 19th c. it's gone much better with "democracy" than with "republic".
  
  --
  The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
2. Re:Case sensitive? by jc42 · 2010-12-17 09:41 · Score: 2
  
  We might also note that big peak in the incidence of "Britannica" in the early 1800s. But back then, it was still expected that educated people (at least in Europe) would study Latin, and "Britannica" is merely a Latin adjectival form of "Britannia", or "Britain", and the British Empire was rather active around the world at that time. So most of the uses of "Britannica" around then probably had nothing to do with the encyclopedia.
  I'd guess that you'd also find a fair number of occurrences of "Britannica" before 1768, the year that the encyclopedia was first published. But most of those would probably be lower case.
  
  --
  Those who do study history are doomed to stand helplessly by while everyone else repeats it.
Slashdot circa 1885 by Anonymous Coward · 2010-12-17 07:27 · Score: 5, Funny

http://ngrams.googlelabs.com/graph?content=slashdot&year_start=1800&year_end=2008&corpus=0&smoothing=3
Sometime around 1885, the very first Anonymouse Cowarde briefly tried writing about Slashdot, but apparently died off before his comments could be modded up.
Smoothing creates bias by AAWood · 2010-12-17 07:30 · Score: 2

Note that in the linked Britannica / Wikipedia chart, Britannica appears higher due to smoothing being set as it is. Set it to a lower value, which gives a less pretty, more accurate chart, and Wikipedia is much higher by the present day.
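
To see why smoothing holds back a term that only took off recently, here is a rough sketch. It assumes the smoothing value n means a centered moving average over each year plus up to n years on either side, with fewer values averaged at the ends of the range; the `smooth` function and the yearly series are invented for illustration and are not the viewer's actual code or data.

```python
def smooth(values, smoothing):
    """Centered moving average: each point is averaged with up to
    `smoothing` neighbours on each side (fewer at the edges)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# Made-up yearly series for a term that appears only in the last few years.
raw = [0.0, 0.0, 0.0, 0.0, 1.0, 4.0, 9.0]
smoothed = smooth(raw, 3)

print("raw final year:     ", raw[-1])
print("smoothed final year:", smoothed[-1])
```

The final year of the smoothed series is pulled well below the raw value because it is averaged with the earlier, near-zero years. A long-established term with a flat curve barely changes under the same treatment.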
Re:Academic conceit by KublaiKhan · 2010-12-17 07:31 · Score: 3, Insightful

Tell me, how are you proposing to measure the words and thoughts of those who did not take the time to put them down in a form that later generations could refer to?

Because if you have a time machine, I've got some business plans that could make us both filthy rich...

--
In Xanadu did Kubla Khan
A stately pleasure dome decree
A bit sparse of an article by alcourt · 2010-12-17 07:37 · Score: 4, Interesting

I wish the article had gone into more depth about grammar changes, rather than just word forms. For example, sentence ordering, comma usage, and various other grammar items would be more intriguing. I found burnt/burned the most interesting comparison because it showed an example of two competing versions of a word.
Interesting idea, but as was stated in the article, there are definite limits to what this technique can study, and many are unconvinced of its value for more than highly limited problems.

--
"I may disagree with what you say, but I will defend unto the death your right to say it." -- Voltaire
Re:Academic conceit by Ephemeriis · 2010-12-17 07:44 · Score: 2

Oh yeah, the only thing that ever matters is when a self-selected sample of writers puts words on paper. Nothing else matters.

I don't know that anyone besides yourself actually made that claim...

What is the percentage of humans who have lived? And what percentage of those humans got book deals
If we're talking about human history here, not many published authors actually had to get book deals. Those are a fairly recent occurrence.

and successfully negotiated the minefield to get not only published, but indexed by a 15-year-old company?
Google is indexing everything they can get their hands on. It isn't like you have to pay an entrance fee or anything.

Surely this is the sum of all human knowledge! How could it be otherwise? Oh, no, my anti-intellectualism is showing! How dare I question my betters?
The fact of the matter is that the important stuff is usually what gets written down.
Genealogies, religious texts, laws, business records, etc.
And even if it's fiction, it's generally a good indicator of what people care to read about. Lots of sex and scandal and whatnot.
Regardless of your opinion on the value of what gets written down... It isn't like we have a whole lot else to go by. We can't very well go back 1,000 years and just ask somebody what they think. We have to work with the records we have - be it written text, or the remains of a city, or statues, or whatever.

--
"Work is the curse of the drinking classes." -Oscar Wilde
Re:Fuck's Great Comeback by jfengel · 2010-12-17 07:48 · Score: 3, Informative

Most of the actual hits there appear to be OCR-os for the words "suck" and "such", often due to the use of the medial "s" that resembles an "f". The word "such" appeared on a page which was badly speckled.
Given that the word "suck" was often used in the expression "to give suck", many of those pages are quite hilarious ("she would not suffer the strange lamb to fuck"). I didn't see any actual "fucks" in the first few pages of hits.
I know that the word was known. Shakespeare made a sly reference to it in Merry Wives of Windsor. But I suspect it wasn't often set down on paper, at least not in the kinds of books that got preserved.
Re:Academic conceit by AJWM · 2010-12-17 07:52 · Score: 2

History isn't what really happened, it's what got written down. Everything else is evanescent (well, except for what archaeologists can dig up and reconstruct, which isn't much and not necessarily accurate -- and it only counts if they write it down). Mind, I'd be more impressed if Google were also tracking the content of every hieroglyph and cuneiform tablet ever found.
It will ever be thus, unless someone invents a time machine (or at least a time viewer).

--
-- Alastair
tl;dr by PatPending · 2010-12-17 08:15 · Score: 2

tl;dr

--
What one fool can do, another can. (Ancient Simian Proverb)
Where's Buffy! by martin-boundary · 2010-12-17 08:24 · Score: 2

Oh oh, according to this graph, we're being overrun by vampires, and the slayers are dropping like flies :(
Rickrolled easter egg by daboochmeister · 2010-12-17 08:46 · Score: 4, Funny

http://ngrams.googlelabs.com/graph?content=never+gonna+give+you+up&year_start=&year_end=&corpus=5&smoothing=0

--
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh ... never mind." Dave Bucci
Google Books vs. real corpora by CorpusProf · 2010-12-17 09:50 · Score: 4, Informative

http://corpus.byu.edu/coha
Corpus of Historical American English.

-- 400 million words, 1810s-2000s.
-- Allows for many types of searches that Google Books can't:
* accurate frequency of words and phrases by decade and year
* changes in word forms (via wildcard searches)
* grammatical changes (because corpus is "tagged" for part of speech)
* changes in meaning (via collocates; "nearby words")
* show all words that are more common in one set of decades than others
* integrate synonyms and customized word lists into queries
* etc etc etc
-- Funded by the National Endowment for the Humanities (NEH), 2009-2011.

Take a look at the "Compare to Google/Archives" link off the first page.
Re:Fuck's Great Comeback by jfengel · 2010-12-18 05:51 · Score: 2

Which means, incidentally, that the trailing off of "fuck" at the beginning of the 19th century IS very interesting, for a different reason. It's watching the tail end of the use of the medial "s".
That's the kind of data that would have been really hard to gather any other way, unless the OCR were to distinguish between medial "s" and regular "s" in its results. There IS a Unicode character for medial S (the long s, U+017F), but most OCR doesn't go there.
So, we have a proxy for it: "suck" scanned as "fuck", which wouldn't otherwise appear very often. I should write a paper on it. Wouldn't "Use of 'Fuck' as a proxy for medial S" look great on my CV?
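
For what it's worth, if an OCR pipeline did emit the long s as its own character, counting it directly would be easy. The sketch below is hypothetical: the sample string and the helper names `count_long_s` and `modernize` are made up, and nothing here reflects what Google's OCR actually outputs.

```python
import unicodedata

LONG_S = "\u017f"  # the medial or "long" s


def count_long_s(text):
    """Count long-s characters in OCR output that preserved them."""
    return text.count(LONG_S)


def modernize(text):
    """Replace every long s with an ordinary 's'."""
    return text.replace(LONG_S, "s")


# Hypothetical OCR output that kept the long s as its own character.
sample = "to give \u017fuck"

print(unicodedata.name(LONG_S))
print(count_long_s(sample))
print(modernize(sample))
# Unicode compatibility normalization (NFKC) performs the same fold.
print(unicodedata.normalize("NFKC", sample))
```

With output like that, the decline of the medial s could be measured directly rather than through the "fuck" proxy.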