How Do You Visualize 100 GB of Google Text Data?
An anonymous reader writes "There is an amazing series of charts that visualizes trigrams and bigrams, portions of sentences that have been extracted from Google's web data set. The graphs highlight word associations and the frequency with which we use them on web pages. Chris Harrison from Carnegie Mellon University found, for example, that the word 'he' is often tied to 'argues,' while 'she' is found often with 'loves.' There are also word-relation charts that highlight words used in combination with their opposites, such as good and bad, peace and war, and PC and Mac."
There are a lot of these things, and they're really interesting to browse through.
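For anyone wondering what a "he argues" / "she loves" association means in practice, here is a minimal sketch of counting which word follows a given word. It is not the method used for the charts; the function name `following_words` and the sample sentence are made up for illustration, and the real data set is far too large to process this way in memory.

```python
from collections import Counter, defaultdict
import re

def following_words(text, targets):
    """Count the words that immediately follow each target word (word pairs)."""
    # Lowercase and keep only runs of letters/apostrophes as "words".
    words = re.findall(r"[a-z']+", text.lower())
    counts = defaultdict(Counter)
    # Walk over adjacent word pairs: (w0, w1), (w1, w2), ...
    for first, second in zip(words, words[1:]):
        if first in targets:
            counts[first][second] += 1
    return counts

# Made-up sample text, not taken from the Google data.
sample = "He argues that she loves chess. She loves tea, and he argues again."
counts = following_words(sample, {"he", "she"})
print(counts["he"].most_common(1))
print(counts["she"].most_common(1))
```

On a real corpus you would stream the text and keep only the top few followers per word, but the counting idea is the same.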
Was this "anonymous reader" the guy who owns the blog?
Trolling is a art,
due to the server being slashdotted. Anyone have a mirror or alternate link?
Using the world's smallest font of course
his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?
With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.
Yes it will be biased and partial and rough, but it's a good start.
More formal reasoning and association techniques, such as Bayesian stuff, logic, etc. will also be needed for general AI, but for the
knowledge base to be grounded in human concerns and human perceptions; that's a key to an AI we can relate to and which can
relate to us.
I imagine this kind of semantic network will be usable for google 2.0 "pre-emptive search" or "my virtual social planner and concierge".
Where are we going and why are we in a handbasket?
I'd like to see what the correlation is between the two words "microsoft" and "sucks".
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Go to the http://ngrams.googlelabs.com/ site and compare word frequency between 'pirates' and 'ninjas'. Please.
Just use grep, or vi with a heavy object on the down-arrow key. What did I win?
This is the NSA, we're gonna geet U h@x0r5! Also, what is a h@x0r5?
"Was this "anonymous reader" the guy who owns the blog?"
"his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?" - mine don't.
I am slowly recovering from flu. What's the justification for all you miserable bastards out there? This is genuinely interesting stuff presented in an accessible way, and is the sort of thing /. should be about (checks karma and mod points - yup, probably allowed to say that.)
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
cold ... ... ... ... ... ... ...
winter steel case turkey
blood
weather
spring
air
water
springs spots
products new spot
hot
with the word lists getting smaller as you go to the right, of course (the ... lists words I can't make out in his image). No need for the "peacock" arrangement that reduces readability and requires it being stored as an image.
We'll start off by imagining 1 GB of data. Now multiply that by 100!
Looking at the Cat vs. Dog picture, all I can say is, "What's wrong with dog people?"
That one is kind of disturbing.
Anon comments don't show up - fix it morons.
Slashdot sucks.
Dog-Cat chart NSFW
"There is more worth loving than we have strength to love." - Brian Jay Stanley
How Do You Visualize 100 GB of Google Text Data?
Easy:
$$$$$$$$$$
Can't get the PDFs to display properly. I tried Evince, Foxit Reader and even Adobe Acrobat. Black lines everywhere, can't read a single word of it.
Visualization = Dark Background + Light Words + Pretty Lines
How does that give me any sort of understanding of the content?
... progress.
Corpus linguistics
http://en.wikipedia.org/wiki/Quantitative_linguistics
Interestingly enough, most relevant authors (e.g. Kaeding) were not cared for.
CC.
TaijiQuan (Huang, 5 loosenings)
to check something like how often righties use violent turns of phrase vs. how often lefties use them. There have been a lot of claims recently about this kind of usage, righties saying it's the same and lefties saying righties use them much more often; Google could let us empirically test the competing hypotheses.
Also, I'll read stories on Fark that I'll see a couple weeks later here. It's getting to the point where I don't need to come here for news anymore. Check the 'Geek' tab on Fark, or check my feeds from Techdirt, New Scientist, Wired, CNet or SciAm, and I've got all the news a good week before /. Once in a while, there is a rare gem on the feed here, but it's sad, as I came here a lot a year or two ago... now, I just come here to check what the iFanboys like to say, and to hear what Linux and Microsoft fanboys like to stir up. I love coming in to hear the "did you hear OpenBSD sucks now" crowd. I like it just fine...
It's all damned lies and statistics!! I mean 47% of all people use statistics to back up their arguments.
I can guess some of the words... but it required blowing up the pics to 2400% and I was using Adobe PDF.
My only question is what to do with it. If you are trying to add keywords that will make your site more search worthy, I can understand, or to show a line of thinking how people associate terms. 'Hot and cold' gets you to "environment" "water" "pool"... Might be fun for word association tests.
It's easy to visualize 100GB of data. Just view it as a percentage of the Library of Congress -- e.g. a door, or small closet.
They are going to have a field day (more likely, a lot of field days.)
I ask simply because I have viewed them today on the latest Chrome on Ubuntu 10.10 and Windows 7, and I cannot reproduce the problem, even on a crappy 4 year old laptop.
How Do You Visualize 100 GB of Google Text Data?
"I picture a man... then I take away reason and accountability..." -- As Good As it Gets
http://ngrams.googlelabs.com/graph?content=blue%2Cred%2Cgreen%2Cyellow&year_start=1880&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=Britannica%2CWikipedia&year_start=1800&year_end=2010&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=1881%2C1891%2C1901%2C1911%2C1921%2C1931%2C1941%2C1951%2C1961%2C1971%2C1981%2C1991&year_start=1880&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=poker%2Cchess&year_start=1880&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=Galileo%2CDarwin%2CEinstein%2CFreud&year_start=1880&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=Warren+Harding%2CCalvin+Coolidge%2CHerbert+Hoover%2CFranklin+Roosevelt%2CHarry+Truman%2CDwight+Eisenhower%2CJohn+Kennedy%2CLyndon+Johnson%2CRichard+Nixon%2CGerald+Ford%2CJimmy+Carter&year_start=1910&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=fax%2CXerox&year_start=1960&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=steak%2Csausage%2Cice+cream%2Chamburger%2Cpizza%2Cpasta%2Csushi&year_start=1880&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=Google%2CMicrosoft%2CMacintosh%2CiPad%2CiPhone%2CWindows&year_start=1984&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=Google%2CiPhone%2CMacintosh&year_start=2000&year_end=2008&corpus=0&smoothing=3
http://ngrams.googlelabs.com/graph?content=3.14%2C3.1416%2C3.14159&year_start=1880&year_end=2008&corpus=0&smoothing=3
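If you want to make your own comparisons, the links above all follow the same pattern, so something like this sketch should build them. The parameter names are simply read off those links, not taken from any documented API, the helper name `ngram_url` is made up, and the defaults just mirror what most of the links use.

```python
from urllib.parse import urlencode

# Address as it appears in the links above; it may have moved since.
BASE = "http://ngrams.googlelabs.com/graph"

def ngram_url(terms, year_start=1880, year_end=2008, corpus=0, smoothing=3):
    """Build a comparison link in the same shape as the ones posted above."""
    query = urlencode({
        "content": ",".join(terms),  # comma becomes %2C, space becomes +
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    })
    return f"{BASE}?{query}"

print(ngram_url(["poker", "chess"]))
print(ngram_url(["steak", "ice cream"], year_start=1900))
```

Joining the terms with a comma before encoding also avoids the easy mistake of dropping a separator between two names when pasting a long list by hand.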
nah, just use cat and read really fast
RYRYRYRYRYRYRYRYRY...
This is an obscene abuse of a perfectly innocent program meant to concatenate files.
I'll have you know I've called the Unix Police and they will be picking you up shortly.
And you don't have to read fast. All you need is a 45.5 baud teletype machine and filename > /dev/tty
Personally I prefer to read the punchtape directly though ... with a torch.
www.tribalnetworks.org - helping tribal people around the world to own their own means of high-tech communications
How Do You Visualize 100 GB of Google Text Data?
With a really really small font.
Not meant as a troll...
Why is 'Visual Modeling' useful? (I never have figured out what it is supposed to make easier.)
If you find it useful, what is it that it helps you with?
try visualising the US debt, then this becomes trivial.
I wish people would stop using the words "bigram" and "trigram" incorrectly. The "-gram" suffix comes from a Greek word for "a written character", the same root is in the word "grapheme". Hence bigram == a two-character substring, and trigram == a three-character substring. And these words are actually being used in the correct sense as well. Two-word and three-word substrings should IMHO be called "bilexes" and "trilexes", or something similar. But a good first step is to stop calling them bigrams and trigrams.
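To make the two senses concrete, here is a small sketch that produces both character-level and word-level n-grams from the same kind of input. The function names are made up for illustration, and splitting on whitespace is a simplifying assumption (real tokenization is messier).

```python
def char_ngrams(text, n):
    """All n-character substrings of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All runs of n consecutive whitespace-separated words, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("peace", 2))          # ['pe', 'ea', 'ac', 'ce']
print(word_ngrams("peace and war", 2))  # ['peace and', 'and war']
```

Whatever you call them, the sliding window is the same; only the unit (character or word) changes.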
File -> Print
With your eyes. Your eyes.
UTF-8: There and Back Again