Google Books Makes a Word Cloud of Human History
An anonymous reader writes
"From Ed Yong at the Not Exactly Rocket Science blog: 'Just as petrified fossils tell us about the evolution of life on earth, the words written in books narrate the history of humanity. The words tell a story, not just through the sentences they form, but in how often they occur. Uncovering those tales isn't easy — you'd need to convert books into a digital format so that their text can be analyzed and compared. And you'd need to do that for millions of books. Fortunately, that's exactly what Google have been doing since 2004.' Yong goes on to explain that the astounding record of human culture found in Google Books offers new research paths to social scientists, linguists, and humanities scholars. Some of the early findings (abstract), based on an analysis of 5 million books containing 500 billion words: English is still adding words at a breathtaking pace; grammar is evolving and often becoming more regular; we're forgetting our history more quickly; and celebrities are younger than they used to be. You can also play with the Google Books search tool yourself. For example, here's a neat comparison of how often the words Britannica and Wikipedia have appeared."
AFAIK, Google Books doesn't do the sort of methodical OCR clean-up that Project Gutenberg does, so a lot of Google's digitized books have a fair number of errors. It'd be funny to see what kind of blips this might create in our extracted cultural history!
Interesting that it is case sensitive. Searching for "britannica,wikipedia" in lowercase produces, for the present day, close to zero for britannica and 0.00005% for wikipedia, which is not far off the result for Wikipedia (with a capital).
Putting these together, the case-insensitive comparison has wikipedia already well ahead of britannica: around 0.00010% for britannica vs 0.00013% for wikipedia.
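For anyone who wants to do the "putting these together" step without squinting at two charts, here is a minimal Python sketch. It assumes you have already read the per-spelling percentages off the graph by hand; the function name and the sample numbers are made up for illustration and are not real chart values.

```python
from collections import defaultdict

def fold_case(freqs):
    """Sum the frequencies of spellings that differ only by case."""
    folded = defaultdict(float)
    for word, pct in freqs.items():
        folded[word.lower()] += pct
    return dict(folded)

# Hypothetical percentages, invented for this example only.
sample = {
    "Wikipedia": 0.25,
    "wikipedia": 0.5,
    "Britannica": 1.0,
    "britannica": 0.125,
}

combined = fold_case(sample)
print(combined)
```

Adding the percentages is only meaningful if each one is measured against the same total for the same year, which is what the chart appears to do.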
Sometime around 1885, the very first Anonymouse Cowarde briefly tried writing about Slashdot, but apparently died off before his comments could be modded up.
So a word cloud of human history probably has WAR in the center at 900 point font.
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
Note that in the linked Britannica / Wikipedia chart, Britannica appears higher because of the smoothing setting. Set it to a lower value, which gives a less pretty but more accurate chart, and Wikipedia is much higher by the present day.
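For what it's worth, my understanding is that smoothing=n plots each year as the average of that year and the n years on either side. Here is a minimal Python sketch of that idea. The function name, the edge handling (the window just shrinks at the ends), and the numbers are my own assumptions, not Google's code.

```python
def smooth(values, k):
    """Centered moving average: each point is averaged with up to k
    neighbours on each side (fewer at the ends of the series)."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - k)
        hi = min(len(values), i + k + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out

# Hypothetical series: a word that only shows up in the final year.
raw = [0.0, 0.0, 0.0, 0.0, 8.0]

print(smooth(raw, 0))  # k=0 leaves the series untouched
print(smooth(raw, 3))  # the final-year spike gets averaged with empty years
```

This is why a word that only took off in the last few years of the data looks much smaller on a smoothed chart than on the raw one.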
Up until the 1820s, Fuck was apparently very much in vogue. Not until the 1960s was this great word brought back into the lexicon of the common man.
Tell me, how are you proposing to measure the words and thoughts of those who did not take the time to put them down in a form that later generations could refer to?
Because if you have a time machine, I've got some business plans that could make us both filthy rich...
In Xanadu did Kubla Khan
A stately pleasure dome decree
now I spent almost an hour fooling around with this today
I can't believe I'm almost 30 years old and the first thing I did was graph sex and f*ck. I guess some things never change...
Hmmm... So Britannica still on top?
But this link (with smoothing=0) gives a different result:
http://ngrams.googlelabs.com/graph?content=Britannica%2CWikipedia&year_start=1800&year_end=2008&corpus=0&smoothing=0
Not that I know whether smoothing=0 is better or worse than smoothing=3
Kind regards,
Roel
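The chart links in this thread are just a query string, so you can generate them instead of editing them by hand. Here is a minimal Python sketch. The parameter names (content, year_start, year_end, corpus, smoothing) are copied from the links posted here; the function name and defaults are hypothetical, and I'm only building the URL, not fetching anything.

```python
from urllib.parse import urlencode

# Host and path as they appear in the links posted in this thread.
BASE = "http://ngrams.googlelabs.com/graph"

def ngram_url(terms, year_start, year_end, corpus=0, smoothing=3):
    """Build a chart URL for a list of search terms."""
    query = urlencode({
        "content": ",".join(terms),  # urlencode turns "," into %2C
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    })
    return f"{BASE}?{query}"

url = ngram_url(["Britannica", "Wikipedia"], 1800, 2008, corpus=0, smoothing=0)
print(url)
```

Changing only the smoothing argument gives you the same chart at different smoothing levels, which makes it easy to compare them side by side.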
I wish the article had gone into more depth about grammar changes, rather than just word forms. For example, sentence ordering, comma usage, and various other grammar items would be more intriguing. I found burnt/burned the most interesting comparison because it showed two competing versions of a word.
Interesting idea, but as was stated in the article, there are definite limits to what this technique can study, and many are unconvinced of its value for more than highly limited problems.
"I may disagree with what you say, but I will defend unto the death your right to say it." -- Voltaire
Awesome tool! These two are quite cool:
http://ngrams.googlelabs.com/graph?content=doctor%2Clawyer%2Carchitect%2Csoldier%2Cpoliceman%2Cdog+walker&year_start=1500&year_end=2000&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=peace%2Cwar%2Cmoney&year_start=1500&year_end=2000&corpus=5&smoothing=3
The richest data mine in the whole world... and probably bottomless..
For justice, we must go to Don Corleone
Oh yeah, the only thing that ever matters is when a self-selected sample of writers puts words on paper. Nothing else matters.
I don't know that anyone besides yourself actually made that claim...
What is the percentage of humans who have lived? And what percentage of those humans got book deals
If we're talking about human history here, not many published authors actually had to get book deals. Those are a fairly recent occurrence.
and successfully negotiated the minefield to get not only published, but indexed by a 15-year-old company?
Google is indexing everything they can get their hands on. It isn't like you have to pay an entrance fee or anything.
Surely this is the sum of all human knowledge! How could it be otherwise? Oh, no, my anti-intellectualism is showing! How dare I question my betters?
The fact of the matter is that the important stuff is usually what gets written down.
Genealogies, religious texts, laws, business records, etc.
And even if it's fiction, it's generally a good indicator of what people care to read about. Lots of sex and scandal and whatnot.
Regardless of your opinion on the value of what gets written down... It isn't like we have a whole lot else to go by. We can't very well go back 1,000 years and just ask somebody what they think. We have to work with the records we have - be it written text, or the remains of a city, or statues, or whatever.
"Work is the curse of the drinking classes." -Oscar Wilde
I mean, how good can they be if they don't even get THIS right?!
On the other hand, they seem to have pegged this one!
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
Not until the 1960s was this great word brought back into the lexicon of the common man.
Oddly enough, email was a pretty popular word up until the 1960s, peaking in popularity in the 1860s, but has made a comeback since the mid 1990s!
Taking guns away from the 99% gives the 1% 100% of the power.
Rather than expose the full texts to the public (and themselves to copyright infringement)
But wait, I thought you were breaking the law just by scanning the books and creating unauthorized copies. Or is there a different law for corporations like Google?
Seven puppies were harmed during the making of this post.
History isn't what really happened, it's what got written down. Everything else is evanescent (well, except for what archaeologists can dig up and reconstruct, which isn't much and not necessarily accurate -- and it only counts if they write it down). Mind, I'd be more impressed if Google were also tracking the content of every hieroglyph and cuneiform tablet ever found.
It will ever be thus, unless someone invents a time machine (or at least a time viewer).
-- Alastair
This should be a great way to test euphemism treadmills. For instance, try 'lunatic asylum' and 'psychiatric hospital'. Lunatic asylum makes a comeback in the early 2000s, I'm guessing because of history books.
thought this was more interesting than the summary's example:
http://ngrams.googlelabs.com/graph?content=peace%2C+love%2C+understanding&year_start=1800&year_end=2008&corpus=0&smoothing=3
Beware of the Leopard.
http://ngrams.googlelabs.com/graph?content=LSD,Alcohol,Marijuana&year_start=1800&year_end=2008&corpus=0&smoothing=3
Yahoo had an early lead and blew it, but has made a comeback!
Google Vs Yahoo
"the" vs "of" is also exciting......I will be following this contest for the rest of my life.
The vs Of
Is another word more common than "the"?
tl;dr
What one fool can do, another can. (Ancient Simian Proverb)
http://ngrams.googlelabs.com/graph?content=Pepsi%2C+Coca+Cola&year_start=1900&year_end=2008&corpus=0&smoothing=0 At last, we can declare a winner. It's Pepsi, by far!
There is no sig.
Oh oh, according to this graph, we're being overrun by vampires, and the slayers are dropping like flies :(
References to forms of national leadership are interesting. A nice peak for the reign of the Virgin Queen, the appearance and growth of President in line with the upstart of those bloody colonies in North America, President finally tops King just about the time of the Great War, but King reasserts until the Second World War finally pushes President on top. Interestingly enough, King comes back and surpasses President just about the turn of the Millennium. http://ngrams.googlelabs.com/graph?content=King,President,Queen&year_start=1500&year_end=2008&corpus=0&smoothing=3
My impression is that a search for "man" would not match "woman". (I.e., word boundaries are assumed.) True?
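I can't say how Google tokenizes the books, but here is a minimal Python sketch of what "word boundaries are assumed" would mean in practice, compared with naive substring counting. The function name and sample sentence are made up for illustration.

```python
import re

def count_word(text, word):
    """Count occurrences of `word` as a whole word (case-sensitive)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return len(pattern.findall(text))

sample = "A man and a woman met another man; mankind watched."

whole_words = count_word(sample, "man")  # skips "woman" and "mankind"
substrings = sample.count("man")         # counts every "man" substring

print(whole_words, substrings)
```

If the tool counted substrings, "man" would swallow "woman", "mankind", "human" and so on, and the chart for "man" would be useless.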
Bukowski said it. I believe it. That settles it.
Apparently someone at google labs is a fan of the whole "pirates prevent global warming" joke: http://ngrams.googlelabs.com/graph?content=pirates%2Cninjas&year_start=1800&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=fuck,ass&year_start=1600&year_end=2008&corpus=0&smoothing=10
What happened in the 1700s!
Seeing the graphs of word popularity over time reminds me of that old Saturday Night Live skit with Phil Hartman giving word investing tips.
Support Right To Repair Legislation.
http://ngrams.googlelabs.com/graph?content=never+gonna+give+you+up&year_start=&year_end=&corpus=5&smoothing=0
"Ahh! I see you're in that indeterminate Schrodinger state where - oh, uh
...really seems to permeate time:
http://ngrams.googlelabs.com/graph?content=iphone&year_start=1800&year_end=2008&corpus=0&smoothing=3
Mind, I'd be more impressed if Google were also tracking the content of every hieroglyph and cuneiform tablet ever found.
It will ever be thus, unless someone invents a time machine (or at least a time viewer).
I suspect they plan to...
History isn't what really happened, it's what got written down.
"The writing of history is largely a process of diversion. Most historical accounts distract attention from their secret influences behind great events. The few histories that escape this restrictive process vanish into obscurity through obvious processes. Destruction of as many copies as possible, burying the too revealing accounts in ridicule, ignoring them in the centers of education, insuring that they are not quoted elsewhere" - Frank Herbert
Bing, more popular than Google for 200 years:
http://ngrams.googlelabs.com/graph?content=google%2Cbing&year_start=1800&year_end=2000&corpus=0&smoothing=3
This sentence no verb.
I guess people have been lulzing for quite some time
It's said that liberals have issues and conservatives have principles. Plug "issue,principle" into it and see a really good picture of Western political change.
I'm a Programmer. That's one level above Software Engineer and one level below Engineer.
And the winner... ...man... also throw woman in there and look to simplified Chinese.
sense of security, like pockets jingling...
suddenly you'll see that no one has ever written about wikileaks! Check it out here
It appears that we have dodged a bullet: http://ngrams.googlelabs.com/graph?content=global+temperature&year_start=1800&year_end=2008&corpus=0&smoothing=3 We're on the other side of the hockey stick.
http://corpus.byu.edu/coha
Corpus of Historical American English.
-- 400 million words, 1810s-2000s.
-- Allows for many types of searches that Google Books can't:
* accurate frequency of words and phrases by decade and year
* changes in word forms (via wildcard searches)
* grammatical changes (because corpus is "tagged" for part of speech)
* changes in meaning (via collocates; "nearby words")
* show all words that are more common in one set of decades than others
* integrate synonyms and customized word lists into queries
* etc etc etc
-- Funded by the National Endowment for the Humanities (NEH), 2009-2011.
Take a look at the "Compare to Google/Archives" link off the first page.
I can see the word BFD ticking up right about now.
Data goes back to 1500 CE. Really reminds me how important the 'open source' concept is (literacy and printing methods in this case...)
http://ngrams.googlelabs.com/graph?content=God
The word "Britannica" doesn't just refer to Encyclopaedia Britannica. It means 'of Britain' (Latin scholars can help me with the exact meaning but this is its general sense). So you'll get hits from before when the encyclopaedia existed, back to at least 1500 according to the search tool. And some hits from after the books started won't refer to them. It's a poor choice of comparison for a search.
http://ngrams.googlelabs.com/graph?content=communism%2Cterrorism&year_start=1920&year_end=2010&corpus=0&smoothing=3
Search for Slashdotte: 415 results. Go to page 9 of the results: Now there's only 89 results.
Mmmm.. Donuts
I think the cloud is halfway to capacity with this "summary" alone.
So what's with the downturn in the last 10-20 years here?
http://ngrams.googlelabs.com/graph?content=physics%2Cchemistry%2Cbiology%2Cmathematics&year_start=1700&year_end=2008&corpus=0&smoothing=6
I'm going to start appending that to the end of my posts in a futile, silly attempt to defend ridiculous, unfounded assertions I make. Oh, no, my anti-intellectualism is showing! How dare I question my betters?
http://ngrams.googlelabs.com/graph?content=catholics%2Cpornography&year_start=1800&year_end=2008&corpus=0&smoothing=3
The important stuff is very recent.
air conditioning,electric power,telephone,vacuum tube,transistor,airplane
I wonder why the (ever so slight) drop of "euphemism" near the present bothers me...
Now here's an interesting one:
http://ngrams.googlelabs.com/graph?content=North,South,East,West&year_start=1700&year_end=2008&corpus=0&smoothing=3
The directions "North" and "South" were more than an order of magnitude more popular than "East" and "West" until ~1800, when they quickly caught up over the course of a decade or so. Perhaps this is due to the American revolution, but I noticed that lower-case versions of all four words didn't become popular until about the same time, as well.
Interesting...
>> Standing on head makes smile of frown, but rest of face also upside down.
This site (http://share.seadragon.com/demos/ChronoZoom/firstgeneration.html) provides a graphical view and timeline of the history of the universe. To get a sense of the place human history has in the greater scheme of things, click on the 'Human History' link near the top of the page to zoom in. Before you flame me, this requires Silverlight but it's worth it. Besides, Silverlight is a quick install on Macs and PCs and you can always uninstall afterwards.
Put in "pirates" and "ninjas" and see what google automagically adds.
http://ngrams.googlelabs.com/graph?content=Slashdot&year_start=1885&year_end=1915&corpus=0&smoothing=0
What user id does this guy have?
No surprise to anyone, but run a comparison on "spin, secrets" you'll be surprised what the outcome is, and when it started... hmm look at that...
I don't know how this was missed earlier, but:
http://ngrams.googlelabs.com/graph?content=sharks,lasers&year_start=1770&year_end=2008&corpus=0&smoothing=3
This comparison may (or may not) give interesting insights into the rise of influence of the US (or at least American vs British English).
You might compare English to German for example and have a look at what it looks like around the world wars.
Careful with translations though. Few words have direct translations that mean exactly the same thing.
Most arguments are based on people having different meanings assigned to words in their head and not realising actually.
At the moment I have to do this by making the PNG transparent and overlaying. I'd love to know how to do it automatically. It's fascinating to see different languages reacting differently to world events, like the 60s and wars, for example.
A blog I run for the wealth
Well, I guess that solves that.
"Britannica" can refer to other things than said encyclopaedia. This gives a different picture.
What a depressingly stupid machine.
I don't know how to make sense of this:
blue,red,green,yellow
They seem to share a distinct pattern. What does that pattern reflect?
In this case, and many others, I think it would be useful to have one search term as a base line.
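Here is a minimal Python sketch of the baseline idea: divide one term's series by another's, year by year, so that whatever pattern they share cancels out. The function name and the numbers are hypothetical, and it assumes you have the per-year values in hand already.

```python
def relative_to(series, baseline):
    """Divide each year's value by the baseline's value for that year.
    Years missing from the baseline, or where it is zero, are skipped."""
    return {
        year: series[year] / baseline[year]
        for year in series
        if year in baseline and baseline[year] != 0
    }

# Hypothetical per-year frequencies, invented for this example only.
blue = {1900: 2.0, 1950: 3.0, 2000: 6.0}
red = {1900: 1.0, 1950: 1.5, 2000: 3.0}

ratio = relative_to(blue, red)
print(ratio)
```

In this made-up case both colours rise together, so the ratio is flat. A flat ratio would suggest the shared bumps come from something common to both terms (or to the corpus as a whole) rather than from either word on its own.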
Is this really TROLL?! Cause I was thinking the same thing. *ouch*
The word cloud of published materials idea is neat, but trying to make it represent 'human history', instead of a subset of human history (like 'published by...'), does seem a tad arrogant.