Slashdot Mirror


How Do You Visualize 100 GB of Google Text Data?

An anonymous reader writes "There is an amazing series of charts that visualizes trigrams and bigrams, portions of sentences that have been extracted from Google's web data set. The graphs highlight word associations and the frequency with which we use them on web pages. Chris Harrison from Carnegie Mellon University found, for example, that the word 'he' is often tied to 'argues,' while 'she' is found often with 'loves.' There are also word-relation charts that highlight words used in combination with their opposites, such as good and bad, peace and war, and PC and Mac." There are a lot of these things, and they're really interesting to browse through.

117 comments

  1. /.ed by grub · · Score: 1


    Was this "anonymous reader" the guy who owns the blog?

    --
    Trolling is a art,
    1. Re:/.ed by Anonymous Coward · · Score: 2, Insightful

      You're not missing anything - the images are unreadable even at 200% or more.

      Anyway, I don't get what they're illustrating. Word relations? So what.

      This is a "Digg" sort of submission ... back over to Fark for me.

    2. Re:/.ed by icebike · · Score: 1

      Quite possibly.

      That last bit about "really interesting to browse through" was a pretty big clue, since I don't find any of this all that interesting or unexpected.

      Word association games have been played for centuries.

      Picking sets of following words given any first word is child's play, and doing it by computer is pretty meaningless until you add other characteristics, such as regional differences or time differences (50 years ago vs. today), or anything else that actually reveals something useful.

      More interesting would be listing the most common previous words, given any particular word.

      Everyone knows that squat almost always follows diddly, but what most often precedes squat?
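
A minimal sketch of the "what precedes squat?" lookup asked for above, assuming bigram counts are already available as (first word, second word) pairs. The counts below are made up for illustration and do not come from the Google data set; `build_indexes`, `most_common_before` and `most_common_after` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Toy bigram counts, invented for illustration only.
BIGRAM_COUNTS = {
    ("diddly", "squat"): 40,
    ("jack", "squat"): 25,
    ("to", "squat"): 10,
    ("diddly", "bop"): 3,
}


def build_indexes(bigram_counts):
    """Index bigram counts in both directions: word -> followers, word -> predecessors."""
    followers = defaultdict(Counter)
    predecessors = defaultdict(Counter)
    for (first, second), count in bigram_counts.items():
        followers[first][second] += count
        predecessors[second][first] += count
    return followers, predecessors


FOLLOWERS, PREDECESSORS = build_indexes(BIGRAM_COUNTS)


def most_common_after(word, n=3):
    """Words that most often follow `word`, with their counts."""
    return FOLLOWERS[word].most_common(n)


def most_common_before(word, n=3):
    """Words that most often precede `word`, with their counts."""
    return PREDECESSORS[word].most_common(n)
```

Building both indexes in one pass means the "previous word" question costs no more than the "next word" one; the same bigram table answers both.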

      --
      Sig Battery depleted. Reverting to safe mode.
    3. Re:/.ed by Desler · · Score: 1

      That last bit about "really interesting to browse through" was a pretty big clue, since I don't find any of this all that interesting or unexpected.

      That wasn't part of the original submission. That was added by Taco.

    4. Re:/.ed by Anonymous Coward · · Score: 0

      This is a "Digg" sort of submission ... back over to Fark for me.

      Yea, it sucks having to put up with crappy stories like this, much better to stick to Fark, where currently you can read such illuminating stories as:

      Hamburglar's wife demands burgers, Robble robble
      Today's "pranksters turn river fluorescent green" story brought to you by Prestone
      Firefighter can't control his hose, even after putting a diaper over the nozzle
      That perfect intersection of pompous affluence, shock art and political incorrectness: artist produces diamond-studded baby skull

    5. Re:/.ed by Anonymous Coward · · Score: 0

      I've never seen rhetorical used as an insult before. Or anyone get so upset over the idea of a rhetorical question. Or someone sign up for four hundred fucking accounts on a site they claim to hate. Maybe it's time to accept the fact that you're a goddamn moron.

    6. Re:/.ed by Anonymous Coward · · Score: 0

      ..about

  2. Having trouble visualizing by epdp14 · · Score: 1

    due to the server being slashdotted. Anyone have a mirror or alternate link?

    1. Re:Having trouble visualizing by shadowknot · · Score: 1
    2. Re:Having trouble visualizing by Anonymous Coward · · Score: 1

      Here is a copy of the PDFs, if you want to just view the results.

      http://www.mediafire.com/?ua4dhfxmry2nnhn

      Posting anon for non-karma whoring reasons.

    3. Re:Having trouble visualizing by noidentity · · Score: 2

      Coral Cache (just add .nyud.net to any URL's hostname)
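
A rough sketch of the hostname rewrite described above, assuming the only change needed is appending `.nyud.net` to the host and leaving the rest of the URL alone. `coralize` is a hypothetical helper name; the sketch drops any user:password part of the URL and says nothing about whether the cache itself is reachable.

```python
from urllib.parse import urlsplit, urlunsplit


def coralize(url, suffix=".nyud.net"):
    """Return `url` with `suffix` appended to its hostname."""
    parts = urlsplit(url)
    host = parts.hostname  # lower-cased host without port or credentials
    if host is None:
        raise ValueError(f"no hostname in URL: {url!r}")
    if host.endswith(suffix):
        return url  # already rewritten; leave it alone
    netloc = host + suffix
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `coralize("http://www.example.com/a/b.pdf")` gives `http://www.example.com.nyud.net/a/b.pdf`.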

    4. Re:Having trouble visualizing by ae1294 · · Score: 1
  3. Simple by Anonymous Coward · · Score: 0

    Using the world's smallest font of course

  4. pdf by jcombel · · Score: 2

    his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?

    1. Re:pdf by Anonymous Coward · · Score: 1

      by jcombel (1557059) writes: Alter Relationship on 01-11-11 11:11 (#34837900)

      Sweet, 511!

    2. Re:pdf by icebike · · Score: 1

      his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?

      Scalability is my guess. I found that using Chrome I could zoom such that the smallest text is visible (within the browser). Same with Foxit PDF reader.

      No unreadable lines seen here.

      --
      Sig Battery depleted. Reverting to safe mode.
    3. Re:pdf by jandrese · · Score: 1

      Yeah, they're totally unreadable (missing blocks everywhere) with Acrobat reader.

      --

      I read the internet for the articles.
    4. Re:pdf by zach_the_lizard · · Score: 1

      If he wanted scalability, he should have saved them in SVG format. As it stands now, I can't read them; Okular is rendering them pretty weirdly.

      --
      SSC
    5. Re:pdf by ILuvRamen · · Score: 1

      I'm on XP and with Adobe Reader X 10.0.0, had the same black line overlay problem at all zoom levels. Dunno why.

      --
      Google's Super Secret Search Algorithm: SELECT @search_results FROM internet WHERE @search_results = 'good'
    6. Re:pdf by Tobenisstinky · · Score: 1

      Works fine in Safari (on Mac) at maximum zoom, the smallest text appears like a 36pt font, with no jaggies...

      --
      wha'? where am i?
    7. Re:pdf by icebike · · Score: 1

      I have trouble with Okular and Adobe Reader on Linux as well.
      I suspect some form of embedded font was used that works well on Windows but not elsewhere.

      Oddly enough, Google Chrome's internal sandboxed PDF rendering engine has no problem on Windows or Linux, and since that is my normal browser I didn't even notice problems on Linux.

      --
      Sig Battery depleted. Reverting to safe mode.
    8. Re:pdf by morgan_greywolf · · Score: 1

      Agreed. They're illegible even if I view them in the latest version of Adobe Reader on either Linux or Windows. They're not images, though, they're text rotated using PostScript/PDF commands. Any reports from the iPeople? It may be a font issue.

    9. Re:pdf by morgan_greywolf · · Score: 1

      Like I thought. It's a font issue. What font is it?

      Linux/Windows people: If you want to view these things, you need to get some Mac fonts.

    10. Re:pdf by Colonel+Korn · · Score: 1

      his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?

      Same problem here with XP and Adobe Reader 10.

      --
      "I zero-index my hamsters" - Willtor (147206)
    11. Re:pdf by icebike · · Score: 1

      Linux/Windows people: If you want to view these things, you need to get some Mac fonts.

      Not true.

      --
      Sig Battery depleted. Reverting to safe mode.
    12. Re:pdf by ultranova · · Score: 1

      I suspect some form of embedded font was used that works well on Windows but not elsewhere.

      Doesn't work on Windows either. And why would embedded fonts be platform-dependent anyway? Don't PDF renderers do document rendering internally?

      I suspect that the PDF files are simply faulty.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    13. Re:pdf by icebike · · Score: 1

      I suspect that the PDF files are simply faulty.

      Then how do you explain they work just fine for me on Win 7 and also in Google Chrome regardless of platform?

      --
      Sig Battery depleted. Reverting to safe mode.
    14. Re:pdf by fyndor · · Score: 1

      The answer to why is that it is not a graphic/image. It is text shaped in a "half circle". I use Chrome, and as others say, that works. He probably didn't notice the problem because he likely uses Chrome (and so should you? After all, it is freaking fast as hell; I use it for my day to day). SVG seems like a bad idea as well, because it is not supported by IE except for the v9 beta (which, btw, renders this incorrectly as well). I am not even sure what he should have used, since it's not a good idea to publish stuff either in a format that only one browser can view (what he did) or in a format the most popular browser can't view (SVG on IE). And image files don't seem like the right idea, because they would have to be huge for the text to be readable. Using font/text rather than an image seems like the correct choice; it's too bad the browser ecosystem still hasn't really gotten there yet.

    15. Re:pdf by Anonymous Coward · · Score: 0

      maybe they are more tolerant

    16. Re:pdf by Confusador · · Score: 1

      Both platforms have excessive fault tolerance not found elsewhere?

  5. This can be used to preload a "human-like" ai by presidenteloco · · Score: 4, Interesting

    With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.

    Yes it will be biased and partial and rough, but it's a good start.

    More formal reasoning and association techniques, such as Bayesian stuff, logic, etc., will also be needed for general AI, but grounding the knowledge base in human concerns and human perceptions is a key to an AI we can relate to and which can relate to us.

    I imagine this kind of semantic network will be usable for google 2.0 "pre-emptive search" or "my virtual social planner and concierge".

    --

    Where are we going and why are we in a handbasket?
    1. Re:This can be used to preload a "human-like" ai by Kilrah_il · · Score: 1

      Yes it will be biased and partial and rough...

      Just like most humans.

      More formal reasoning and association techniques, such as Bayesian stuff, logic, etc., will also be needed for general AI...

      Because we all know that most people use reasoning and bayesian logic everyday.

      --
      Whenever in an argument, remember this.
    2. Re:This can be used to preload a "human-like" ai by korgitser · · Score: 1

      Semantics is all fluffy and stuff, but you are nowhere near AI until the computer can actually comprehend meaning. Semantics is just yet another buzzword for 'dead data, somewhat organized, but still dead', which we hope will make AI. Building larger or better organized datasets will get us nowhere if we cannot put the initial 'cogito, ergo sum' into the machine. (And yes, I know the 'cogito' is not the ultimate first thought of any mind.) The defining characteristic of life is the fact that data has meaning to it. What will it take to spark that in a computer?

      --
      FCKGW 09F9 42
    3. Re:This can be used to preload a "human-like" ai by icebike · · Score: 1

      I doubt you can derive human like artificial intelligence from simple word order frequency charts.

      People, or at least intelligent people, start saying something with a destination in mind, not simply to mimic some statistical summary.

      Word order charts made today will be different in 6 months, as new phrases enter common usage, but does that mean human relationships or topics change that much over 6 months?

      This reminds me more of the Bing TV ads than anything else.

      --
      Sig Battery depleted. Reverting to safe mode.
    4. Re:This can be used to preload a "human-like" ai by tb()ne · · Score: 0

      The defining characteristic of life is the fact that data has meaning to it.

      I'm guessing most biologists would disagree.

    5. Re:This can be used to preload a "human-like" ai by korgitser · · Score: 1

      I'm guessing most biologists would disagree.

      Of course they would. But they also disagree on the characteristics of life. Biosemiotics on the other hand has no question about it.

      --
      FCKGW 09F9 42
    6. Re:This can be used to preload a "human-like" ai by ultranova · · Score: 2

      With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.

      Wouldn't it make more sense to simply point it to Wikipedia?

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    7. Re:This can be used to preload a "human-like" ai by ultranova · · Score: 1

      Semantics is all fluffy and stuff, but you are nowhere near AI until the computer can actually comprehend meaning.

      It already is. A hard drive controller comprehends the meaning of alternating magnetic patterns on the disk: a sequence of ones and zeroes. A processor comprehends a higher-level meaning: a stream of assembly instructions. An operating system comprehends the yet higher level of meaning: a page of code belonging to firefox.exe that was just swapped in and began executing.

      This phenomenon should be quite familiar to humans, too. Most communications have multiple levels of meaning, sometimes unrealized even by the originator of the message himself (for example, when you don't realize what something you're telling someone else implies). Computers are simply not smart enough yet to reach what we usually consider the "core" of the message.

      Semantics is just yet another buzzword for 'dead data, somewhat organized, but still dead', which we hope will make AI.

      "Dead" data, as opposed to what?

      Building larger or better organized datasets will get us nowhere if we can not put the initial 'cogito, ergo sum' into the machine.

      Meaning what, exactly speaking? What is this "cogito" you're talking about and how does it differ from "mere" data processing?

      The defining characteristic of life is the fact that data has meaning to it. What will it take to spark that in a computer?

      Any automatic system relies on finding "meaning" from data. And we already have quite complex automated systems. So, I'd say it's simply a matter of overall complexity whether we'd call something alive or not.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    8. Re:This can be used to preload a "human-like" ai by ultranova · · Score: 1

      I doubt you can derive human like artificial intelligence from simple word order frequency charts.

      It's been done already, and the resulting AI was good enough to get three papers submitted to a computer science conference.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    9. Re:This can be used to preload a "human-like" ai by glwtta · · Score: 2

      Meaning what, exactly speaking? What is this "cogito" you're talking about and how does it differ from "mere" data processing?

      We don't know. We don't have even the faintest beginnings of a "theory of intelligence".

      Which doesn't mean that you can just ignore it, start throwing data at simplistic machines and expect (strong) AI to just happen.

      --
      sic transit gloria mundi
    10. Re:This can be used to preload a "human-like" ai by korgitser · · Score: 2

      The _engineer_ behind the hard drive controller comprehends the meaning of alternating magnetic patterns on a disk. The hard drive controller or any automated system comprehends it no more than a clock comprehends time. Computers are not smart in any way, they are just clockwork; it's only people who have become smarter in programming. And making an honest face while selling hot air.

      --
      FCKGW 09F9 42
    11. Re:This can be used to preload a "human-like" ai by Anonymous Coward · · Score: 0

      "So it only appears to think... just like everyone else"
      -Archchancellor Mustrum Ridcully regarding the "HEX" thinking engine.

    12. Re:This can be used to preload a "human-like" ai by tophermeyer · · Score: 1

      I'm guessing most biologists would disagree.

      Of course they would. But they also disagree on the characteristics of life. Biosemiotics on the other hand has no question about it.

      ...right. Because biosemiotics is a field dedicated to studying how living organisms process and interpret data. Your statement is tautological. Biosemioticists have no question because their field is predicated on it.

      Making the claim that anything is the 'defining' characteristic of life is a little rash, because the definition of life is still kind of up in the air. Clearly, there is some disagreement as to what constitutes life.

    13. Re:This can be used to preload a "human-like" ai by ultranova · · Score: 1

      Computers are not smart in any way, they are just clockwork; it's only people who have become smarter in programming.

      Your brain is a clockwork mechanism, yet it somehow manages to be "smart", or at least appears that way to you.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    14. Re:This can be used to preload a "human-like" ai by ultranova · · Score: 1

      Meaning what, exactly speaking? What is this "cogito" you're talking about and how does it differ from "mere" data processing?

      We don't know. We don't have even the faintest beginnings of a "theory of intelligence".

      Yes we do. We have a whole branch of science concerning the matter. Which is precisely why I asked: the grandparent post sounds suspiciously like semi-mystical pseudophilosophy that gets thrown around because people don't actually want to know how their minds work and prefer to think of them as magical. Which is all fine and good, but gets in the way when talking about related fields, such as AI.

      Which doesn't mean that you can just ignore it, start throwing data at simplistic machines and expect (strong) AI to just happen.

      No, but you can throw data at a sufficiently complex machine and expect it to learn. That's how humans acquire their initial working knowledge of the world.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    15. Re:This can be used to preload a "human-like" ai by glwtta · · Score: 1

      Yes we do. We have a whole branch of science concerning the matter.

      Sure we do, it's just that so far they have not come up with anything concrete. Oh, they've done lots of work poking around the edges, but the main question is still pretty much the same - what is intelligence? Perhaps "not the faintest beginnings" was a little strong, I'll rephrase as "have not made consistent progress towards" understanding intelligence.

      Which is precisely why I asked: the grandparent post sounds suspiciously like semi-mystical pseudophilosophy that gets thrown around because people don't actually want to know how their minds work and prefer to think of them as magical.

      Which is fair, and a lot of people do like to think that way, but there is a difference between not wanting to and admitting that we don't (yet).

      No, but you can throw data at a sufficiently complex machine and expect it to learn. That's how humans acquire their initial working knowledge of the world.

      Sure, but it's the "sufficiently complex" part that's the problem, so far we're working with trivially simple ones.

      --
      sic transit gloria mundi
    16. Re:This can be used to preload a "human-like" ai by tgv · · Score: 1

      No, this just leads to symptom modeling. There is no relation between "he" and "argues" or "she" and "loves" other than that they occur together more frequently in the texts that comprise the corpus. I've done corpus studies, and if you look at word frequencies from a certain corpus, i.e. unigrams, they look OK until you compare them to another one. One corpus had 3rd person personal pronouns high but the rest low, while in another the 1st person singular (I) was the most frequent word. The difference? The former was a newspaper corpus, the latter e-mail.

      So what did that reflect? The style and topics in the texts from which the corpus was formed, nothing else. Did you notice that "666" is associated with heaven and not hell?

      So, while it is a valuable source of information for NLP, it doesn't mean anything wrt real semantics or AI. If you base your knowledge on this kind of corpus, remember the adage "garbage in, garbage out".
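
A small sketch of the cross-corpus comparison described above, assuming each corpus is just a string of text. The two sample texts are invented stand-ins for a newspaper corpus and an e-mail corpus, and `relative_frequencies` is a hypothetical helper; real corpus work would need proper tokenization.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all word tokens in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# Invented sample texts, standing in for two very different corpora.
NEWS = "He said the minister argues that she will resign. He denied it."
EMAIL = "I think I will be late. Can you tell them I am sorry?"

NEWS_FREQ = relative_frequencies(NEWS)
EMAIL_FREQ = relative_frequencies(EMAIL)


def compare(word):
    """Relative frequency of `word` in (news, e-mail); 0.0 if absent."""
    return NEWS_FREQ.get(word, 0.0), EMAIL_FREQ.get(word, 0.0)
```

Here `compare("he")` is non-zero only for the news text and `compare("i")` only for the e-mail text, which is the style effect the comment describes: the counts reflect the source, not the language as a whole.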

    17. Re:This can be used to preload a "human-like" ai by tehcyder · · Score: 1

      So, I'd say it's simply a matter of overall complexity whether we'd call something alive or not.

      I'm sure the whole internet is more complex than an amoeba, but that doesn't mean it's alive.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
    18. Re:This can be used to preload a "human-like" ai by tehcyder · · Score: 1

      Your brain is a clockwork mechanism

      In which case, why don't you just build one and prove it?

      Oh, that's right, you can't.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
    19. Re:This can be used to preload a "human-like" ai by badkarmadayaccount · · Score: 1

      Give me time.

      --
      I know tobacco is bad for you, so I smoke weed with crack.
  6. Can I do my own searches? by Locke2005 · · Score: 1

    I'd like to see what the correlation is between the two words "microsoft" and "sucks".

    --
    I've abandoned my search for truth; now I'm just looking for some useful delusions.
    1. Re:Can I do my own searches? by benjamindees · · Score: 1

      The results sort of surprised me.
      microsoft sucks vs. microsoft doesn't suck

      Hmm, let's see what's going on here.
      microsoft doesn't suck vs. microsoft doesn't suck that much

      That makes more sense.

      --
      "I assumed blithely that there were no elves out there in the darkness"
    2. Re:Can I do my own searches? by Anonymous Coward · · Score: 0

      Not much of one except for people with a chip on their shoulder and no life of their own. Most of us have moved on. Why not try it?

    3. Re:Can I do my own searches? by debrain · · Score: 1
    4. Re:Can I do my own searches? by marpot · · Score: 1
  7. ngrams by cangrande · · Score: 2

    Go to the http://ngrams.googlelabs.com/ site and compare word frequency between 'pirates' and 'ninjas'. Please.

    1. Re:ngrams by noidentity · · Score: 1

      This guy also did that on his earlier visualizations (not in this current "peacock" style though).

  8. Easy! by countSudoku() · · Score: 1

    Just use grep, or vi with a heavy object on the down-arrow key. What did I win?

    --
    This is the NSA, we're gonna geet U h@x0r5! Also, what is a h@x0r5?
    1. Re:Easy! by JamesP · · Score: 1

      nah, just use cat and read really fast

      --
      how long until /. fixes commenting on Chrome?
  9. Kudos to Chris Harrison, though by Kupfernigk · · Score: 3, Insightful
    He does these really interesting data visualisations and publishes them for free - and what do people do?

    "Was this "anonymous reader" the guy who owns the blog?"

    "his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?" - mine don't.

    I am slowly recovering from flu. What's the justification for all you miserable bastards out there? This is genuinely interesting stuff presented in an accessible way, and is the sort of thing /. should be about (checks karma and mod points - yup, probably allowed to say that.)

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
    1. Re:Kudos to Chris Harrison, though by Anonymous Coward · · Score: 1

      I'm sorry, but there is no rationale to call this ( http://gyazo.com/57fe0a7de30d5bfbbeb4998b74730fc3.png ) GOOD. Who failed here? Sure, Adobe has their part after .pdf "being demonstrated" [sic!] as a very "robust" format at the 27c3 (you can put all kinds of shit into an uncompiled pdf - it will compile and execute on launch without asking).

      But I have done comparably complex graphics in PDF and those did not fail - so what's the problem? I use win7x64.

    2. Re:Kudos to Chris Harrison, though by FrankDrebin · · Score: 3, Funny

      'he' is often tied to 'argues,'

      I don't agree.

      --
      Anybody want a peanut?
    3. Re:Kudos to Chris Harrison, though by Anonymous Coward · · Score: 1

      Windows XP (at work), and I've got the same problem.

    4. Re:Kudos to Chris Harrison, though by Anonymous Coward · · Score: 0

      The PDFs are broken in every viewer I've tried them in, so for me at least it was a completely worthless link. I suspect that it's the same for others. Wouldn't you complain too in that case?
      Especially since creating simple (at least from what it appears in the useless preview images) PDFs isn't rocket science.

    5. Re:Kudos to Chris Harrison, though by Anonymous Coward · · Score: 0

      "his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?" - mine don't.

      Mine don't, either, and they are likely in .pdf because it swallows the super-high resolution needed for his fonts without the browser bitching the way it might for a 20 megapixel .png.

      I do question why the black backgrounds? would print better on white.

    6. Re:Kudos to Chris Harrison, though by Anonymous Coward · · Score: 0

      Debian + Chrome 8... No problems...

    7. Re:Kudos to Chris Harrison, though by Anonymous Coward · · Score: 0

      What do you mean? If he wants to publish something, he most likely has to leave his experiments out for the public. I have all the code from my dissertation available so people can reproduce my experiments. It's basic to the scientific community.

    8. Re:Kudos to Chris Harrison, though by Anonymous Coward · · Score: 0

      That's not an argument. That's just contradiction.

  10. Poor way of presenting by noidentity · · Score: 1
    Wouldn't it be better to just present it as a list of words, so that it could be rendered in HTML? For example

    cold
    winter steel case turkey
    blood ...
    weather ...
    spring ...
    air ...
    water ...
    springs spots ...
    products new spot ...
    hot

    with the word lists getting smaller as you go to the right, of course (the ... lists words I can't make out in his image). No need for the "peacock" arrangement that reduces readability and requires it being stored as an image.
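
One way the plain-HTML list above could be generated, as a sketch only: it assumes each row is an ordered list of words and shrinks the font size from left to right. `render_word_rows` and its size parameters are hypothetical, and the rows below use only words legible in the list above (the elided "..." words are left out).

```python
from html import escape


def render_word_rows(head, rows, tail, start_pct=100, step_pct=10, min_pct=50):
    """Render word rows as an HTML list, with words shrinking toward the right."""
    lines = [f"<h2>{escape(head)}</h2>", "<ul>"]
    for words in rows:
        spans = []
        for position, word in enumerate(words):
            size = max(min_pct, start_pct - position * step_pct)
            spans.append(f'<span style="font-size:{size}%">{escape(word)}</span>')
        lines.append("  <li>" + " ".join(spans) + "</li>")
    lines.append("</ul>")
    lines.append(f"<h2>{escape(tail)}</h2>")
    return "\n".join(lines)


# Only the words that are readable in the list above.
ROWS = [
    ["winter", "steel", "case", "turkey"],
    ["blood"],
    ["weather"],
]

PAGE = render_word_rows("cold", ROWS, "hot")
```

Because the output is ordinary text in ordinary markup, it stays searchable and zoomable in any browser, which is the point being argued.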

    1. Re:Poor way of presenting by Colonel+Korn · · Score: 1

      Wouldn't it be better to just present it as a list of words, so that it could be rendered in HTML? For example

      cold

      winter steel case turkey

      blood ...

      weather ...

      spring ...

      air ...

      water ...

      springs spots ...

      products new spot ...

      hot

      with the word lists getting smaller as you go to the right, of course (the ... lists words I can't make out in his image). No need for the "peacock" arrangement that reduces readability and requires it being stored as an image.

      I think that Tufte would agree with you.

      --
      "I zero-index my hamsters" - Willtor (147206)
    2. Re:Poor way of presenting by Anonymous Coward · · Score: 0

      How'd you do in Art class?

    3. Re:Poor way of presenting by noidentity · · Score: 1

      That's the point; making it colorful and interesting caused a significant reduction in utility. The plain HTML list approach would have communicated the same information, yet been easily viewable and searchable in any browser, with no need for a PDF. It could still have been with the color gradient as well, and black background.

  11. Easy. by Beelzebud · · Score: 2

    We'll start off by imagining 1 GB of data. Now multiply that by 100!

    1. Re:Easy. by tehcyder · · Score: 1

      Basically, all the worst parts of the bible.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
  12. Cats v. Dogs by drunkenkatori · · Score: 1

    Looking at the Cat vs. Dog picture, all I can say is, "What's wrong with dog people?"

    1. Re:Cats v. Dogs by Anonymous Coward · · Score: 0

      Pretty sure there isn't such a thing as 'kitty-style'.

    2. Re:Cats v. Dogs by Anonymous Coward · · Score: 0

      So true - and the Cats seem to be influenced by machine speak with all their number associations.

    3. Re:Cats v. Dogs by jittles · · Score: 1

      Well you don't see people talking about having sex "kitty style" now do you? So some of the hits on dog may be due to that and not just people who like to feed their dog peanut butter...

    4. Re:Cats v. Dogs by tehcyder · · Score: 1

      Pretty sure there isn't such a thing as 'kitty-style'.

      Amateur.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
  13. Women and Men by dragonxtc · · Score: 1

    That one is kind of disturbing.

    1. Re:Women and Men by Anonymous Coward · · Score: 0

      i bet the high numbers for "she loves" comes from all the billions of copies of the lyrics to that one beatles song posted on the web

  14. slashdot broken by Anonymous Coward · · Score: 0

    Anon comments don't show up - fix it morons.

    Slashdot sucks.

    1. Re:slashdot broken by Anonymous Coward · · Score: 0

      We like it that way.

  15. Warning - unfiltered by SuperKendall · · Score: 1

    Dog-Cat chart NSFW

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  16. How GOOG does it: by shitetaco · · Score: 1

    How Do You Visualize 100 GB of Google Text Data?

    Easy:

    $$$$$$$$$$

  17. Evince on Windows 7 : Unreadable by Anonymous Coward · · Score: 0

    Can't get the PDF's to display properly. I tried Evince, Foxit Reader and even Adobe Acrobat. Black lines everywhere, can't read a single word of it.
     

    1. Re:Evince on Windows 7 : Unreadable by sexconker · · Score: 1

      XP, Adobe Reader 9 or some shit with all updates, black lines all over the place.

      Works fine if I open it up in Adobe Acrobat 8.

      Couldn't care less about figuring out why, or the content of the PDFs. Take your fucking word graphs, tag clouds, and other useless shit back to 1999, where you'll still be recognized as completely useless.

  18. Visualization? by schlameel · · Score: 1

    Visualization = Dark Background + Light Words + Pretty Lines

    How does that give me any sort of understanding of the content?

    1. Re:Visualization? by DeadDecoy · · Score: 1

      I agree somewhat. The problem is twofold: graphing libraries all do the same things, and there is not much meaning to be had in the raw data. For the former, many visualization libraries are designed to display graph/network data somewhat gracefully. Consequently, many visualizations center around "how do we put this thing in graph form?" rather than "what interface naturally explains this data best?" The second problem is that this huge morass of data just has frequency counts and n-grams. So we sorta know how things are connected and their magnitude, but very little of the semantic context that goes with that data. There are other datasets, such as the Penn Treebank (which applies part-of-speech tags) and Medline articles (which have Medical Subject Headings), that provide semantic meaning to the content within. Other sources such as the semantic web try to provide structure to the underlying meaning of data: a is_a b, b contains c, etc. Getting this data is hard because it is a largely manual process, as meaning comes from us fleshy humans.

      One could sorta use automatic techniques to discover structure, but depending on the model's assumptions, one could apply a flawed structure to the data. Due to these issues, we tend to have pretty visualizations that show the connectivity and magnitude data rather than their meaning. This isn't entirely a bad thing as it will probably lead to groups bootstrapping semantic structure into this data.
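
      For anyone wondering what "just frequency counts and n-grams" looks like in practice, here is a minimal Python sketch. The toy sentence and the names `ngrams`, `bigram_counts` and `followers` are made up for illustration; this is not how the Google data set or the charts were actually built.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return successive n-word tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

# Toy corpus, invented for illustration only.
text = "he argues that she loves that he argues"
tokens = text.split()

# Magnitude: how often each two-word sequence occurs.
bigram_counts = Counter(ngrams(tokens, 2))

# Connectivity: for each word, which words follow it and how often.
followers = defaultdict(Counter)
for (first, second), count in bigram_counts.items():
    followers[first][second] += count

print(bigram_counts.most_common(1))
print(dict(followers["that"]))
```

      Note that nothing in `followers` says what kind of relation links two words; the counts carry connectivity and magnitude only, which is the limitation described above.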

  19. Astonishing ... by foobsr · · Score: 2

    ... progress.

    Corpus linguistics

    http://en.wikipedia.org/wiki/Quantitative_linguistics

    Interestingly enough, most relevant authors (e.g. Kaeding) were not cared for.

    CC.

    --
    TaijiQuan (Huang, 5 loosenings)
  20. It would be interesting by Anonymous Coward · · Score: 0

    to check something like how often righties use violent turns of phrase vs. how often lefties use them. There have been a lot of claims recently about this kind of usage, righties saying it's the same and lefties saying righties use them much more often; Google could let us empirically test the competing hypotheses.

    1. Re:It would be interesting by Anonymous Coward · · Score: 0

      Or you could just look at the etymology of the words "dexter" and "sinister". Dexter refers to the right side, and gives root to words such as dexterous (a generally positive attribute to give someone's abilities); while sinister refers to the left side, and today generally means "evil" or "underhanded". So for the last few thousand years, right-handed people have been honored, while left-handed people were reviled.

      You're probably not going to learn much of value about a millennia-old stigma from a pop psychology theory.

    2. Re:It would be interesting by tehcyder · · Score: 1

      Er, I think GP meant left and right as in left wing and right wing politics, although admittedly I've never seen lefties and righties used in that way before.

      --
      To have a right to do a thing is not at all the same as to be right in doing it
  21. OT: old, busted news ( was Re:/.ed ) by sleepy_weasel · · Score: 1

    Also, I'll read stories on Fark that I'll see a couple weeks later here. It's getting to the point where I don't need to come here for news anymore. Check the 'Geek' tab on Fark, or check my feeds from Techdirt, New Scientist, Wired, CNet or SciAm, and I've got all the news a good week before /. does. Once in a while, there is a rare gem on the feed here, but it's sad, as I came here a lot a year or two ago... now, I just come here to check what the iFanboys like to say, and to hear what Linux and Microsoft fanboys like to stir up. I love coming in to hear the "did you hear OpenBSD sucks now" crowd. I like it just fine...

    --
    It's all damned lies and statistics!! I mean 47% of all people use statistics to back up their arguments.
    1. Re:OT: old, busted news ( was Re:/.ed ) by Anonymous Coward · · Score: 0

      Why do you insist on posting here? Sorry you had a bad experience - time to break up I guess...

  22. guess the word by sleepy_weasel · · Score: 1

    I can guess some of the words... but it required blowing up the pics to 2400% and I was using Adobe PDF.

    My only question is what to do with it. If you are trying to add keywords that will make your site more search-worthy, I can understand, or to show a line of thinking about how people associate terms. 'Hot and cold' gets you to "environment", "water", "pool"... Might be fun for word association tests.

    --
    It's all damned lies and statistics!! I mean 47% of all people use statistics to back up their arguments.
  23. as a fraction by owlnation · · Score: 1

    It's easy to visualize 100GB of data. Just view it as a percentage of the Library of Congress -- e.g. a door, or small closet.

  24. Psychologists. by Kupfernigk · · Score: 1

    They are going to have a field day (more likely, a lot of field days.)

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
  25. Responding to myself (I don't respond to ACs) by Kupfernigk · · Score: 1

    Why are all the posters bitching about the PDFs ACs?

    I ask simply because I have viewed them today on the latest Chrome on Ubuntu 10.10 and Windows 7, and I cannot reproduce the problem, even on a crappy 4 year old laptop.

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
    1. Re:Responding to myself (I don't respond to ACs) by Anonymous Coward · · Score: 0

      Why are all the posters bitching about the PDFs ACs?

      I ask simply because I have viewed them today on the latest Chrome on Ubuntu 10.10 and Windows 7, and I cannot reproduce the problem, even on a crappy 4 year old laptop.

      Good for you!

      What version of Acrobat and plug-ins do you have? I have 9.4.0.195 on XP sp3 - the PDFs are unreadable.

      I'm an AC because I refuse to create an account. :-P

  26. How Do You Visualize 100 GB of Google Text Data? by Anonymous Coward · · Score: 0

    How Do You Visualize 100 GB of Google Text Data?

    "I picture a man... then I take away reason and accountability..." -- As Good As it Gets

  27. Some more Google N-Gram finds by bgspence · · Score: 1

    http://ngrams.googlelabs.com/graph?content=blue%2Cred%2Cgreen%2Cyellow&year_start=1880&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=Britannica%2CWikipedia&year_start=1800&year_end=2010&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=1881%2C1891%2C1901%2C1911%2C1921%2C1931%2C1941%2C1951%2C1961%2C1971%2C1981%2C1991&year_start=1880&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=poker%2Cchess&year_start=1880&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=Galileo%2CDarwin%2CEinstein%2CFreud&year_start=1880&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=Warren+Harding%2CCalvin+Coolidge%2CHerbert+Hoover%2CFranklin+Roosevelt%2CHarry+Truman%2CDwight+EisenhowerJohn+Kennedy%2CLyndon+Johnson%2CRichard+Nixon%2CGerald+Ford%2CJimmy+Carter&year_start=1910&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=fax%2CXerox&year_start=1960&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=steak%2Csausage%2Cice+cream%2Chamburger%2Cpizza%2Cpasta%2Csushi&year_start=1880&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=Google%2CMicrosoft%2CMacintosh%2CiPad%2CiPhone%2CWindows&year_start=1984&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=Google%2CiPhone%2CMacintosh&year_start=2000&year_end=2008&corpus=0&smoothing=3
    http://ngrams.googlelabs.com/graph?content=3.14%2C3.1416%2C3.14159&year_start=1880&year_end=2008&corpus=0&smoothing=3

  28. Cat Abuse by bananaendian · · Score: 1

    nah, just use cat and read really fast

    RYRYRYRYRYRYRYRYRY...

    This is an obscene abuse of a perfectly innocent program meant to concatenate files.

    I'll have you know I've called the Unix Police and they will be picking you up shortly.

    And you don't have to read fast. All you need is a 45.5 baud teletype machine and filename > /dev/tty

    Personally I prefer to read the punchtape directly though ... with a torch.

    --
    www.tribalnetworks.org - helping tribal people around the world to own their own means of high-tech communications
  29. Re:How Do You Visualize 100 GB of Google Text Data by Anonymous Coward · · Score: 0

    How Do You Visualize 100 GB of Google Text Data?

    With a really really small font.

  30. What for? by Anonymous Coward · · Score: 0

    Not meant as a troll...

    Why is 'Visual Modeling' useful? (I never have figured out what it is supposed to make easier.)

    If you find it useful, what is it that it helps you with?

  31. Easy ... by Anonymous Coward · · Score: 0

    try visualising the US debt, then this becomes trivial.

    1. Re:Easy ... by Anonymous Coward · · Score: 0

      Debt is a distraction. The country's had a record debt almost every year since its very founding when Alexander Hamilton's doctrine of assumption assumed the states' war debts. We produce huge surpluses of food and shelter and other goods in this country. The debt is merely a number in a ledger book. Innovation is the real goal; the advance of knowledge is what enhances our survival fitness by allowing us to predict and adapt to sudden catastrophic change. As long as we spend printed money to encourage creativity among individuals the currency will remain strong and we can print as much debt-free money as we like. Lincoln when he printed over $400 million greenbacks realized this...

  32. bigram means two characters by misof · · Score: 1

    I wish people would stop using the words "bigram" and "trigram" incorrectly. The "-gram" suffix comes from a Greek word for "a written character"; the same root is in the word "grapheme". Hence bigram == a two-character substring, and trigram == a three-character substring. And these words are actually being used in that correct sense as well. Two-word and three-word substrings should IMHO be called "bilexes" and "trilexes", or something similar. But a good first step is to stop calling them bigrams and trigrams.
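
    To make the two senses concrete, here is a small Python sketch that extracts both kinds of substring. The helper names `char_ngrams` and `word_ngrams` are invented for this example, and splitting on whitespace is a simplifying assumption rather than real tokenization.

```python
def char_ngrams(text, n):
    """Return every n-character substring of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Two-character substrings (the sense argued for above).
char_bigrams = char_ngrams("peace", 2)

# Two-word substrings (the sense used in the article).
word_bigrams = word_ngrams("peace and war", 2)

print(char_bigrams)
print(word_bigrams)
```

    Both functions slide the same fixed-size window over a sequence; the only difference is whether the units are characters or words.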

  33. How? by Arador Aristata · · Score: 1

    File -> Print

  34. Same way you visualize anything else by halcyon1234 · · Score: 1

    With your eyes. Your eyes.