Analyzing Culture With Google Books
Harperdog writes with this excerpt from Miller-McCune:
"I would not call myself a Luddite — I use digital resources all the time, in my research and my teaching. I have hundreds of PDFs of books I have downloaded from a variety of online sources — Early English Books Online, Eighteenth Century Collections Online, Gallica (the digital service of the French National Library), and yes, Google Books — that I use in my research. But when I read the Science article (abstract), I was immediately struck by what seems to me to be a fundamental flaw in its methodology: its reliance on Google Books for its sample. Google Books has focused on digitizing academic libraries. I would argue that books found in academic libraries are not necessarily representative of cultural trends across society. As any historian knows, every scholarly library is different and every library has its biases."
Google Books has focused on digitizing academic libraries. I would argue that books found in academic libraries are not necessarily representative of cultural trends across society.
This is just public posturing, handwringing over being multicultural "enough". You wanna publicly wring your hands to get "diversity street cred", OK, go wring your hands, but you don't need to actually engage the rest of us, you just need to strike the pose.
Come on, g.books has "fanny hill", which is not exactly the pinnacle of dry academic prose (to save some /.ers: it's pretty good pr0n, sorta nsfw, search for it at home; to give you a cultural reference, it's like a very long format penthouse letters set in the 1800s). It's also got "punch" and some old amsci and just plain ole "books".
Note that we have very similar tastes in reading (err, not specifically commenting on "fanny hill" above, I mean in general), for example my ipad is stuffed full of project gutenberg goodies. In fact I'm about 50% of the way thru "a friend of caesar" by WSD. Which rocks, just like everything WSD wrote rocks, pretty much. And Xenophon rocks. And Herodotus rocks. And Thucydides rocks. etc.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
I have a sort of backburner project in which I break down the Icelandic vocabulary by morphology patterns and frequency of use, with the frequency of use arrived at by polling Google (the search engine). I figured, hey, with Google as a source, you'll get mostly people talking, plus news, plus ads, plus books, and in general a nice cross-section, right? Well, even ignoring some of my search methodology problems involving homonyms and declension forms (I have some ideas on how to counter those), I found that using Google as a source introduced some serious biases, which should have been obvious in retrospect. For example, "síða" (which can mean, among other things, webpage) was listed as one of the most common nouns. :)
Whatever corpus you choose, it's going to have its own biases.
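To make the "síða" effect concrete, here is a rough Python sketch. The two token lists are made-up toy samples, not real Icelandic usage data, and `relative_freq` is just a hypothetical helper name. It compares each word's share in a web-flavored sample against its share in a speech-flavored one and reports the word the web sample inflates most.

```python
from collections import Counter


def relative_freq(tokens):
    """Return each word's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Made-up toy samples, purely illustrative (assumption: not real usage data).
web_tokens = "síða síða síða hús maður síða hús".split()
speech_tokens = "hús maður maður hús síða maður hús".split()

web = relative_freq(web_tokens)
speech = relative_freq(speech_tokens)

# Ratio > 1 means the word is over-represented in the web sample
# relative to the speech sample; only words present in both are compared.
ratios = {w: web[w] / speech[w] for w in web if w in speech}
most_skewed = max(ratios, key=ratios.get)

print(most_skewed, ratios[most_skewed])
```

With a real corpus you would still have to deal with the homonym and declension problems mentioned above before the ratios mean much.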
Hey, guys, I'm just pleased as punch to report that it's a fleet of a hundred Vogon Battle Destroyers!
Google has digitized 5 million books from primarily academic libraries.
Microsoft began their digitization project in 2005 and abandoned it in 2008, throwing users onto the tender mercies of book publishers and public libraries for content. Public libraries cannot afford to digitally scan books, even if the publishers would allow it.
Book publishers are the most vocal critics of Google's book scanning project, and to hear them wail you'd think Google was burning books, not scanning them. What the book publishers are wailing about is their perceived loss of profits because digitized books open the barn door, making moot the hope some have of renewing copyrights on material LONG resident in the public domain. In a word, greed.
Running with Linux for over 20 years!