Analyzing Culture With Google Books

Posted by Soulskill on Monday August 8, 2011 @09:08AM from the unavoidable-selection-bias dept.

Harperdog writes with this excerpt from Miller-McCune: "I would not call myself a Luddite -- I use digital resources all the time, in my research and my teaching. I have hundreds of PDFs of books I have downloaded from a variety of online sources -- Early English Books Online, Eighteenth Century Collections Online, Gallica (the digital service of the French National Library), and yes, Google Books -- that I use in my research. But when I read the Science article (abstract), I was immediately struck by what seems to me to be a fundamental flaw in its methodology: its reliance on Google Books for its sample. Google Books has focused on digitizing academic libraries. I would argue that books found in academic libraries are not necessarily representative of cultural trends across society. As any historian knows, every scholarly library is different and every library has its biases."

20 comments

handwringing over multiculturalism by vlm · 2011-08-08 09:29 · Score: 1, Interesting

Google Books has focused on digitizing academic libraries. I would argue that books found in academic libraries are not necessarily representative of cultural trends across society.
This is just public posturing handwringing over being multicultural "enough". You wanna publicly wring your hands to get "diversity street cred", OK go wring your hands, but you don't need to actually engage the rest of us, you just need to strike the pose.
Come on, g.books has "fanny hill", which is not exactly the pinnacle of dry academic prose (to save some /.ers, it's pretty good pr0n, sorta nsfw, search for it at home, to give you a cultural reference it's like a very long format penthouse letters set in the 1800s). It's also got "punch" and some old amsci and just plain ole "books".
Note that we have very similar tastes in reading (err, not specifically commenting on "fanny hill" above, I mean in general), for example my ipad is stuffed full of project gutenberg goodies. In fact I'm about 50% of the way thru "a friend of caesar" by WSD. Which rocks, just like everything WSD wrote rocks, pretty much. And Xenophon rocks. And Herodotus rocks. And Thucydides rocks. etc.

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
1. Re:handwringing over multiculturalism by hansraj · 2011-08-08 09:53 · Score: 1
  
  Academic libraries frequently already have a big portion of their catalogue in digital form: mostly theses and journals. So my guess would be that someone starting out with a goal of digitizing all books would naturally start with the academic libraries.
2. Re:handwringing over multiculturalism by mcmonkey · 2011-08-08 09:59 · Score: 2
  
  I gave TFA a quick read, and you seem to be projecting your own issues on to the author. There's no talk of being "multicultural," no hand wringing over diversity.
  It's a legitimate question that needs to be addressed in any research based on Google Books. I've heard the figure quoted in the article before, that Google Books represents 4% of all books ever published. That 4% is a large enough sample to "allow the kind of statistically significant analysis common to many sciences" doesn't mean the particular 4% represented on Google Books is such a sample.
  I think your example is telling. Yes, Google Books includes Fanny Hill, so it's not all academic texts and scholarly volumes. But old pr0n is not the same as new pr0n, particularly in representing popular culture. You've got Fanny Hill, but not Penthouse Forum.
  And given what we know about Google, are Google Books ngrams influenced by my Gmail account or previous searches on Google?
3. Re:handwringing over multiculturalism by Rogue Haggis Landing · 2011-08-08 09:59 · Score: 5, Insightful
  
  Google Books has focused on digitizing academic libraries. I would argue that books found in academic libraries are not necessarily representative of cultural trends across society.
  This is just public posturing handwringing over being multicultural "enough". You wanna publicly wring your hands to get "diversity street cred", OK go wring your hands, but you don't need to actually engage the rest of us, you just need to strike the pose.
  Speaking as someone who's been working in academic libraries for 18 years -- the original quote isn't handwringing over multiculturalism, it's an accurate description. Academic libraries purchase books that will be of use to academics. There are huge areas that they generally don't collect in. A contemporary academic library will purchase relatively few cookbooks, popular genre novels (romance, mystery, sci fi, etc.), YA books, self-help books, and so on, simply because they don't fit the library's mission. OTOH, an academic library is far more likely than a public library or a brick and mortar bookstore to have books written in foreign languages, books written by or about marginalized groups, and books written by minor or otherwise marginalized authors. An academic library's collection is likely to be more multicultural than that of any other book repository. But it won't have any Harlequins, any recent celebrity biographies, or Personal Finance for Dummies, so it's really hard to say that it represents the broad swath of society's reading practices.
4. Re:handwringing over multiculturalism by Americano · 2011-08-08 10:56 · Score: 2
  
  It doesn't seem to be handwringing over cultural trends at all. Just seems to be saying that there's a very real chance of selection bias inherent to the data set.
  Say you did an analysis of computer programming based on O'Reilly's Safari service. It'd probably suggest to you that algorithms and system design were pretty much irrelevant to programming, since there are only a couple of Knuth books which could be overlooked, but things like "Ruby on Rails," "Perl," and "iOS Programming in 21 Days" were incredibly important tomes, simply based on the relative volume of output related to the two subjects.
  The content of those digitized books could easily skew the results of any quantitative analysis, when you are using source material that is curated & biased towards certain topics. The fact that you cite one "kinda porn" book as an example of the diversity simply highlights this: the inclusion of only a handful of "sexy" fare might lead people to conclude that, as a society, modern people are actually a very asexual bunch, much more inclined to think about academic topics, than sex. In short - a very mistaken conclusion, as anybody who's bothered to turn on a television in the past 20 years can attest.
5. Re:handwringing over multiculturalism by Anonymous Coward · 2011-08-09 01:54 · Score: 0
  
  Multiculturalism is the problem? Don't you have some Norwegian kids to go shoot up?
First post by Anonymous Coward · 2011-08-08 09:29 · Score: 0

Pursuit of fame sponsored by Google Troll
uh... by versiondub · 2011-08-08 09:29 · Score: 2

Isn't this article the same one that came out to accompany google's "ngrams" (http://ngrams.googlelabs.com/) lab? I don't think these guys are trying to make generalizations about culture in general; they are only raising the possibility that, even with a small (4% of the total published) sample, interesting queries and surveys of human (although in this case Anglophone) culture can be made.
I've encountered similar problems with Icelandic by Rei · 2011-08-08 09:29 · Score: 3, Interesting

I have a sort of backburner project in which I break down the Icelandic vocabulary by morphology patterns and frequency of use, with the frequency of use arrived at by polling Google (the search engine). I figured, hey, with Google as a source, you'll get mostly people talking, plus news, plus ads, plus books, and in general a nice cross-section, right? Well, just ignoring some of my search methodology problems involving homonyms and declension forms (I have some ideas on how to counter those), I found that there were some serious biases by using Google as a search methodology which should have been obvious in retrospect. For example, "síða" (which can mean, among other things, webpage) was listed as one of the most common nouns. :)
Whatever corpus you choose, it's going to have its own biases.

--
Hey, guys, I'm just pleased as punch to report that it's a fleet of a hundred Vogon Battle Destroyers!
Are they nuts? by dev434 · 2011-08-08 09:35 · Score: 0

I see that updated spec adds extension for DirectX coexistence, or in other words allows you to intermix DirectX and OpenGL in same app.
Why?
1. Re:Are they nuts? by Plombo · 2011-08-08 11:25 · Score: 1
  
  Wrong story...
Books and data quality by joe 155 · 2011-08-08 09:36 · Score: 1

To be fair, the corpus is much more rounded for the 1800-2000 English cloud, which is what they use in the Science article.
Now, I'm not saying that all the data is perfect, I've found some issues - but if you actually look at the additional materials for the Science article they talk a fair bit about how they made sure the data were good. And they spent a lot of time looking into it. I personality it. Data doesn't have to be perfect to be useful. Don't let the perfect be the enemy of the good.

--
*''I can't believe it's not a hyperlink.''
1. Re:Books and data quality by DingerX · 2011-08-08 19:44 · Score: 1
  
  "Scientific" quantitative analysis is the Gay Cowboy Movie of historiography. One turns up every decade or so, it's inevitably hailed as a revolutionary breakthrough, then promptly forgotten.
  The problem with historical data is that it's so far from random that the more sophisticated the analysis you subject them to, the more you end up analyzing artifacts of the selection criteria. This has been shown with every generation of quantitative data to be subjected to historical analysis. For example, let's say Google is teaming up with Harvard to digitize their library. Their library would have been built for the University's mission, which changed over the years. A Protestant Seminary will have less use for a collection of nineteenth-century Catholic publications than somewhere else, and even less for those from the Jewish community. So a study of nineteenth-century book terms will reflect the world as skewed towards 19th-century WASPs. The democratization of universities after the 1950s changed considerably which books were bought and when. So, for example, the "finding" that 19th-century new tech terms took a century to enter common usage, while 20th-c tech came in 50, could also be explained by a selection bias: 19th-century universities (and then their libraries) were very different communities, with a much more conservative selection, than 20th-century ones. Or the problem could be the advent of cheap print; or even that the texts where such terms are commonly used were undated.
  
  For that matter, selecting on a date (as they do for '1951') is following the selection bias: only dated texts are included, and, for copyrighted works, Google always includes the front matter.
  
  It's a technique that has its utility, but the more you want to use the data to say something meaningful, the more the selection problems creep in, and the more useless it becomes.
pathetic by mapkinase · 2011-08-08 09:45 · Score: 1

browsed through the figures in the Science paper and my impression is that the choice of indicators is pathetic.
Another pseudoscientific study.

--
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Quantity not enough? by Jerry · 2011-08-08 09:47 · Score: 2, Interesting

Google has digitized 5 million books from primarily academic libraries.
Microsoft began their digitization project in 2005 and abandoned it in 2008, throwing users onto the tender mercies of book publishers and public libraries for content. Public libraries cannot afford to digitally scan books, even if the publishers would allow it.
Book publishers are the most vocal critics of Google's book scanning project, and to hear them wail you'd think Google was burning books, not scanning them. What the book publishers are wailing about is their perceived loss of profits because digitized books open the barn door, making moot the hope some have of renewing copyrights on material LONG resident in the public domain. In a word, greed.

--
Running with Linux for over 20 years!
It is comparative by fermion · 2011-08-08 15:13 · Score: 1
I read this article when it came out 8 months ago. My impression was the article investigated change in language over time, and interest in topics over time in formal writing. Academic sources are used because they tend to include a range of texts from long ago, where more popular sources are going to cull the resources much more drastically. On reflection, the only thing I might have included were newspapers and pamphlets to more accurately measure the rise and fall of new terminology.
For those who don't have access to the article, here are some conclusions based on the analysis of the, admittedly possibly skewed, data.
- by 1900 "burned" was in more use than "burnt" in the US, though "burnt" was still more widely used in the UK
- Items invented at the turn of the 20th century were in wide use in the literature within 50 years, but it took 100 years for things invented at the turn of the 19th century to be in equal use
These types of observations are valid. Note that they are comparative within the domain, not global statements of fact.
--
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
Bias in "-omics" unavoidable by staalmannen · 2011-08-08 16:05 · Score: 1

Sample selection will always cause a bias no matter how extensive it is. The important thing is that the sources are well described so that the bias(es) can be properly accounted for (I think Google Books does fit that requirement). The interesting thing with this paper is that it applies the "-omics" approach to something that previously had been a purely scholarly subject (where the insight of the individual scholar is naturally limited by their ability to absorb material manually, at a much lower throughput and a much smaller sample size, but arguably at a higher quality). I think this looks a lot like my own field (molecular biology), where we wet-lab people struggle to connect two dots and variations thereof, whereas the genomics/transcriptomics/proteomics people get the massive data samples out there. The whole point, however, is that it is not an either-or situation. In my own field, the small and focused wet-lab projects are still vital to find the new (unexpected) mechanisms, but the ideas are often pulled from results of massive data collection from the "-omics" guys.
Go for quality not for quantity by k4f · 2011-08-09 16:55 · Score: 0

Seems to me this entire argument goes *poof* if one replaces "culture" with "academic culture". Is there value in statistical analysis of that culture using rigorous mathematical techniques? Obviously yes. What I see is another article that uses disdain to disguise the fear that arises as it becomes clear the digitization of the world turns all fields of study into informatics. Don't worry, the "hard" sciences are having just the same problem.
can you imagine by k4f · 2011-08-10 18:24 · Score: 0

Seems to me this entire argument goes *poof* if one replaces "culture" with "academic culture". Is there value in statistical analysis of that culture using rigorous mathematical techniques? Obviously yes. What I see is another article that uses disdain to disguise the fear that arises as it becomes clear the digitization of the world turns all fields of study into informatics. Don't worry, the "hard" sciences are having just the same problem.
Calling this technique by k4f · 2011-08-15 23:41 · Score: 0

Calling this technique "culturomics" makes it sound like it is some new field, and that analysis of this type has never been undertaken before. It is actually called bibliometrics, and it has been around for decades. Digital resources such as Google Books now allow much larger amounts of data to be analyzed than was possible in the past, but this isn't some new field just invented by these researchers.