Slashdot Mirror


How Do You Visualize 100 GB of Google Text Data?

An anonymous reader writes "There is an amazing series of charts that visualizes trigrams and bigrams, portions of sentences that have been extracted from Google's web data set. The graphs highlight word associations and the frequency with which we use them on web pages. Chris Harrison from Carnegie Mellon University found, for example, that the word 'he' is often tied to 'argues,' while 'she' is found often with 'loves.' There are also word-relation charts that highlight words used in combination with their opposites, such as good and bad, peace and war, and PC and Mac." There are a lot of these things, and they're really interesting to browse through.
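For readers new to the terms: a bigram is a run of two consecutive words and a trigram a run of three. Below is a minimal sketch of how such counts might be gathered, assuming plain whitespace tokenization and a made-up sample sentence. The helper names `ngrams` and `followers` are hypothetical; this is not Harrison's code or Google's pipeline.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def followers(word, bigram_counts):
    """Count which words directly follow `word`, given bigram counts."""
    return Counter(
        {pair[1]: count for pair, count in bigram_counts.items() if pair[0] == word}
    )


# Toy text invented for illustration; not taken from Google's data set.
text = "he argues that she loves him and he argues that he loves her"
tokens = text.lower().split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Words seen right after "he" and "she", most frequent first.
print(followers("he", bigram_counts).most_common())
print(followers("she", bigram_counts).most_common())
```

On a real web-scale corpus the same idea applies, but the counts would likely have to be streamed from disk rather than held in a single in-memory `Counter`.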

12 of 117 comments

  1. pdf by jcombel · · Score: 2

    his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?

  2. This can be used to preload a "human-like" ai by presidenteloco · · Score: 4, Interesting

    With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.

    Yes it will be biased and partial and rough, but it's a good start.

    More formal reasoning and association techniques, such as Bayesian stuff, logic, etc will also be needed for general AI, but for the
    knowledge base to be grounded in human concerns and human perceptions; that's a key to an ai we can relate to and which can
    relate to us.

    I imagine this kind of semantic network will be usable for google 2.0 "pre-emptive search" or "my virtual social planner and concierge".

    --

    Where are we going and why are we in a handbasket?
    1. Re:This can be used to preload a "human-like" ai by ultranova · · Score: 2

      With a semantic network which reflects how humans relate various concepts together, and what topics and relationships humans care about.

      Wouldn't it make more sense to simply point it to Wikipedia?

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    2. Re:This can be used to preload a "human-like" ai by glwtta · · Score: 2

      Meaning what, exactly speaking? What is this "cogito" you're talking about and how does it differ from "mere" data processing?

      We don't know. We don't have even the faintest beginnings of a "theory of intelligence".

      Which doesn't mean that you can just ignore it, start throwing data at simplistic machines and expect (strong) AI to just happen.

      --
      sic transit gloria mundi
    3. Re:This can be used to preload a "human-like" ai by korgitser · · Score: 2

      The _engineer_ behind the hard drive controller comprehends the meaning of alternating magnetic patterns on a disk. The hard drive controller or any automated system comprehends it no more than a clock comprehends time. Computers are not smart in any way, they are just clockwork; it's only people who have become smarter in programming. And making an honest face while selling hot air.

      --
      FCKGW 09F9 42
  3. ngrams by cangrande · · Score: 2

    Go to the http://ngrams.googlelabs.com/ site and compare word frequency between 'pirates' and 'ninjas'. Please.

  4. Re:Having trouble visualizing by noidentity · · Score: 2

    Coral Cache (just add .nyud.net to any URL's hostname)
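The hostname trick above could be scripted roughly like this. It is a sketch only: the function name `coralize` and the example URL are hypothetical, and it assumes the `.nyud.net` suffix is all the cache needs, as the comment states.

```python
from urllib.parse import urlsplit, urlunsplit


def coralize(url, suffix=".nyud.net"):
    """Append the Coral Cache suffix to the hostname of `url`.

    Any user:password part of the URL is dropped in this sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Leave the URL alone if it has no host or is already rewritten.
    if not host or host.endswith(suffix):
        return url
    netloc = host + suffix
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# Hypothetical example URL, not one from the story.
print(coralize("http://www.example.com/page.html"))
```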

  5. Kudos to Chris Harrison, though by Kupfernigk · · Score: 3, Insightful

    He does these really interesting data visualisations and publishes them for free - and what do people do?

    "Was this "anonymous reader" the guy who owns the blog?"

    "his files are hosted in *.pdf files. tried looking at them in a windows 7 and an ubuntu machine, both have the text with unreadable lines through them. why would you host graphics as pdf?" - mine don't.

    I am slowly recovering from flu. What's the justification for all you miserable bastards out there? This is genuinely interesting stuff presented in an accessible way, and is the sort of thing /. should be about (checks karma and mod points - yup, probably allowed to say that.)

    --
    From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
    1. Re:Kudos to Chris Harrison, though by FrankDrebin · · Score: 3, Funny

      'he' is often tied to 'argues,'

      I don't agree.

      --
      Anybody want a peanut?
  6. Re:/.ed by Anonymous Coward · · Score: 2, Insightful

    You're not missing anything - the images are unreadable even at 200% or more.

    Anyway, I don't get what they're illustrating. Word relations? So what.

    This is a "Digg" sort of submission ... back over to Fark for me.

  7. Easy. by Beelzebud · · Score: 2

    We'll start off by imagining 1 GB of data. Now multiply that by 100!

  8. Astonishing ... by foobsr · · Score: 2

    ... progress.

    Corpus linguistics

    http://en.wikipedia.org/wiki/Quantitative_linguistics

    Interestingly enough, most relevant authors (e.g. Kaeding) were not cared for.

    CC.

    --
    TaijiQuan (Huang, 5 loosenings)