How Do You Visualize 100 GB of Google Text Data?
An anonymous reader writes "There is an amazing series of charts that visualizes trigrams and bigrams, portions of sentences that have been extracted from Google's web data set. The graphs highlight word associations and the frequency with which we use them on web pages. Chris Harrison from Carnegie Mellon University found, for example, that the word 'he' is often tied to 'argues,' while 'she' is found often with 'loves.' There are also word-relation charts that highlight words used in combination with their opposites, such as good and bad, peace and war, and PC and Mac."
There are a lot of these things, and they're really interesting to browse through.
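For anyone wondering what counting bigrams and trigrams actually involves, here is a minimal Python sketch. It is not Harrison's method or Google's pipeline: the function names (`ngrams`, `followers`) and the sample sentence are made up for illustration, and a real 100 GB data set would need streaming and proper tokenization rather than an in-memory string.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def followers(text, word, n=2):
    """Count what follows `word` in the n-grams that start with it."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokens = text.lower().split()
    counts = Counter()
    for gram in ngrams(tokens, n):
        if gram[0] == word:
            counts[gram[1:]] += 1
    return counts


# Hypothetical sample text, chosen only to echo the he/she example.
sample = "he argues that she loves it and he argues again while she loves him"

print(followers(sample, "he").most_common(1))
print(followers(sample, "she").most_common(1))
```

Passing `n=3` counts the two words that follow `word` instead of one, which is roughly the kind of association the trigram charts draw.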
His files are hosted as *.pdf files. I tried looking at them on a Windows 7 machine and an Ubuntu machine; both show the text with unreadable lines through it. Why would you host graphics as PDF?
With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.
Yes it will be biased and partial and rough, but it's a good start.
More formal reasoning and association techniques, such as Bayesian stuff, logic, etc., will also be needed for general AI, but having the knowledge base grounded in human concerns and human perceptions is a key to an AI we can relate to and which can relate to us.
I imagine this kind of semantic network will be usable for Google 2.0 "pre-emptive search" or "my virtual social planner and concierge".
Where are we going and why are we in a handbasket?
Go to the http://ngrams.googlelabs.com/ site and compare word frequency between 'pirates' and 'ninjas'. Please.
Coral Cache (just add .nyud.net to any URL's hostname)
"Was this 'anonymous reader' the guy who owns the blog?"
"His files are hosted as *.pdf files. I tried looking at them on a Windows 7 machine and an Ubuntu machine; both show the text with unreadable lines through it. Why would you host graphics as PDF?" - Mine don't.
I am slowly recovering from flu. What's the justification for all you miserable bastards out there? This is genuinely interesting stuff presented in an accessible way, and is the sort of thing /. should be about (checks karma and mod points - yup, probably allowed to say that.)
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
You're not missing anything - the images are unreadable even at 200% or more.
Anyway, I don't get what they're illustrating. Word relations? So what.
This is a "Digg" sort of submission ... back over to Fark for me.
We'll start off by imagining 1 GB of data. Now multiply that by 100!
... progress.
Corpus linguistics
http://en.wikipedia.org/wiki/Quantitative_linguistics
Interestingly enough, most relevant authors (e.g. Kaeding) were not cared for.
CC.
TaijiQuan (Huang, 5 loosenings)