Slashdot Mirror


IBM vs. Content Chaos

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

216 comments

  1. It can be feed? by BCole · · Score: 0, Offtopic

    How about "it can be fed"

    1. Re:It can be feed? by RootMoose · · Score: 1

      Offtopic as hell, I know! But in reply to your sig: what are the chances that that actually happened? That Major Brian Reed was the one who found Saddam Hussein in the dead of the night, overtired as many soldiers are, and that he was quick enough to make that smart-ass remark to Saddam? Not to mention the wildly unreported matter that it was purportedly the SAS or some other Brit corps who actually found Saddam. Cynically, I think this is a bit of pro-Bush propaganda that's getting way more lip service than it deserves.

    2. Re:It can be feed? by BCole · · Score: 1

      You are certainly entitled to your opinion.

  2. I think a better question... by bc90021 · · Score: 5, Funny

    ...doesn't concern whether "Pink" is a colour or a singer, but whether "Paris Hilton" is a hotel in France or an oft downloaded video... ;)

    1. Re:I think a better question... by jpr1nd · · Score: 0, Flamebait

      wow, that is such a clever and original joke. how do you come up with stuff so funny?

    2. Re:I think a better question... by Dave2+Wickham · · Score: 2, Funny

      "from the help-me-find-directions-to-p4r1s-h1l70n dept."

    3. Re:I think a better question... by robslimo · · Score: 1

      If the reference was to the band Pink Floyd, that was the name of the group, originally "The Pink Floyd Sound" and did not reference anyone in the group.

      However, I guess it _was_ from a person's name since the band was named for American blues artists Pink Anderson and Floyd Council.

    4. Re:I think a better question... by bc90021 · · Score: 1

      The really funny part is that I didn't even see that until you pointed it out...

    5. Re:I think a better question... by ePhil_One · · Score: 2, Funny

      Oh, by the way, which one's Pink?

      --
      You are in a maze of twisted little posts, all alike.
    6. Re:I think a better question... by emok · · Score: 1

      Come in here, dear boy, have a cigar.
      You're gonna go far, fly high,
      You're never gonna die,
      You're gonna make it if you try;
      They're gonna love you.
      Well I've always had a deep respect,
      And I mean that most sincerely.
      The band is just fantastic,
      that is really what I think.
      Oh by the way, which one's Pink?
      And did we tell you the name of the game, boy,
      We call it Riding the Gravy Train.

    7. Re:I think a better question... by Anonymous Coward · · Score: 0

      I got slapped down two points for just mentioning part of that. Go figure!

    8. Re:I think a better question... by You're+All+Wrong · · Score: 1

      Even more originally called the "Screaming Abdabs" while they were in their formative years at architectural college in London?
      Several of my relatives delight in telling me about concerts by such bands when they were younger. Bastards.

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
  3. pr0nfountain by 3lb4rt0 · · Score: 1, Funny

    The spinoff that will be used by joe sixpack net user.

  4. All we need... by TJ_Phazerhacki · · Score: 3, Interesting

    There is already altogether too much "Stuff out there" for anyone to put any major effort into categorizing it. We should soon reach the point of info overload, and then what? What is the point of cataloging overflow data? Do we really need something like this? Or should we just ship a bunch of programmers wasting their time over to something else, like better spam filters and OS's without gaping security holes?

    --
    Physics is nothing like religion. If it was, we'd have an easier time trying to raise money!
    1. Re:All we need... by Frymaster · · Score: 0, Flamebait
      the first commercial use will be to track public opinion for companies.

      here's one to start with:

      microsoft (msft) of redmond washington: you suck!

      now, go log that.

    2. Re:All we need... by geoffspear · · Score: 1, Insightful
      Oh yes, because there's such an enormous shortage of programmers right now. IBM should lay off all of these programmers so Microsoft will have a pool of available programmers who know nothing about OS security to work on security.

      And once all the game producers, who make a product we definitely don't "need" get rid of all of their programmers, there will be plenty of free people to work on anti-spam technology. Whee!

      --
      Don't blame me; I'm never given mod points.
    3. Re:All we need... by millahtime · · Score: 5, Insightful

      There are many organizations that need better ways to analyze their info. There are databases that are terabytes in size and have to do detailed searches. With SQL databases that can take a long time and any faster way can save a lot of time and money. There is a big need for this technology across many industries.

    4. Re:All we need... by xyzzy · · Score: 5, Insightful

      That's really funny that you mention "spam filters", since that is exactly the content categorization task that you are talking about.

      Automatic categorization of overflowing data is exactly what you need to do when you have too much to think about -- it allows you to triage your attention span, which is the most limited resource you have.

    5. Re:All we need... by redragon · · Score: 2, Interesting

      I think the inverse is the case.

      The more chaotic (overloaded in your terms) that data tends to be, then the greater the information contained in that data (think compression). So what they're going after is not "categorizing" the internet, they're going after making some sense out of all of that data. Information overload begins to necessitate an intermediary to help filter out the data that you're interested in.

      The interesting thing becomes what sort of biases are built into a system like this? That is what I'm curious about. Right now when we search on Google (which of course has its own biases), we decide which links end up mattering (if we have the will to root through it). If a computer system is doing this, it will inevitably alter the way in which we come to understand the data we're looking through.

      I think you're saying (or am I (mis)reading you?) that, "it doesn't matter," isn't the right direction of thinking here. Sure spam and security are issues too, spam actually being a related problem, but it seems unfair to delegate this to the "bad idea" stack already.

      --
      - Sighuh?
    6. Re:All we need... by Anonymous Coward · · Score: 0

      it allows you to triage your attention span, which is the most limited resource you have.

      Besides your penis.

      Good point but you lost me at 'triage'

      sincerely,
      Joe Sixpack

    7. Re:All we need... by Anonymous Coward · · Score: 0
      Doh! Using heuristics to find structure in "unstructured" data can well be the first step in analysing if it contains enough relevant information, and then filtering out content that does not.

      What you are whining about is one good reason to research subject; not a counter-argument that would prove it pointless.

  5. Send link to Google by Urkki · · Score: 4, Insightful

    They could certainly use these kinds of techniques to improve their results...

    Then again, in a way they already use something like this, except they're only really concerned about links, not actual contents of pages...

    1. Re:Send link to Google by AndroidCat · · Score: 1

      Or send it to Slashdot. :^)

      --
      One line blog. I hear that they're called Twitters now.
  6. structure... by Rhubarb+Crumble · · Score: 5, Funny
    a huge system to turn all the unstructured info on the web into structured data

    In order to do this, they will use a scheme by which each document is referred to by a string including the transfer protocol, the host name, and a file path.

    oh, wait...
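
    The three-part scheme the joke describes (transfer protocol, host name, file path) is what a URL already encodes. A minimal sketch using Python's standard library, with a made-up example.com address as the assumed input:

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only for illustration.
url = "http://www.example.com/research/webfountain.html"

parts = urlsplit(url)

print(parts.scheme)  # transfer protocol
print(parts.netloc)  # host name
print(parts.path)    # file path
```

    The address locates a document but says nothing about what the document means, which is the gap the article is concerned with.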

  7. First customer by Anonymous Coward · · Score: 3, Funny

    IEEE reports that the first commercial use will be to track public opinion for companies.

    Word has it the first test case will be SCO. WebFountain: "Outlook not so good"

    1. Re:First customer by Anonymous Coward · · Score: 0

      The e-mail client? You needed IBM to tell you that?

      Blogzine

    2. Re:First customer by Anonymous Coward · · Score: 0

      Hmmm... ask it about Exchange Server next? ;)

  8. SITE ALREADY SLASHDOTTED, HERES A MIRROR! by ThisIsAnExampleAccou · · Score: 2, Funny
    1. Re:SITE ALREADY SLASHDOTTED, HERES A MIRROR! by jpr1nd · · Score: 0

      um, that's not a mirror.

      and apparently "To ride your bicycle safely and efficiently, it is important to have equipment operating smoothly and properly."

    2. Re:SITE ALREADY SLASHDOTTED, HERES A MIRROR! by jonathan_ingram · · Score: 1
  9. Get this setup by millahtime · · Score: 3, Interesting

    I wonder how long until IBM sells this setup. If it works well, Logistics Organizations would love to get their hands on it.

    1. Re:Get this setup by millahtime · · Score: 1

      I mean by this that most Logistics Organizations will have proprietary info that they won't let IBM house.

    2. Re:Get this setup by orac2 · · Score: 4, Informative

      Although the article didn't have room to go into this point (and I should know, I'm the author), IBM can completely compartmentalize competitors' data, even if hosted in house (IBM already does this in other parts of its business). If companies are still wary, they can host the data themselves and let WebFountain troll it on a need to know basis.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    3. Re:Get this setup by The+Limp+Devil · · Score: 2, Interesting

      let WebFountain troll it

      I sincerely hope you meant trawl it. The last thing we need is for IBM to build and sell an automated system for trolling the entire internet!

  10. Expensive by starvingcodeartist · · Score: 4, Interesting

    In the article it says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.

    1. Re:Expensive by orac2 · · Score: 4, Interesting

      The point is that it's not intended for use as a search engine, but a platform for doing computation intensive data mining and analysis. A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM.
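
      The distinction between counting mentions and gauging opinion can be sketched in a few lines. This is a toy illustration only, not how WebFountain works: the word lists, the sample sentences, and the function names `count_mentions` and `sentiment_score` are all assumptions made up for the example.

```python
import re

# Tiny hand-made word lists; purely illustrative assumptions.
POSITIVE = {"outstanding", "great", "love", "good"}
NEGATIVE = {"spineless", "suck", "bad", "hate"}


def count_mentions(docs, term):
    """What a keyword search can report: how many documents mention the term."""
    return sum(1 for doc in docs if term.lower() in doc.lower())


def sentiment_score(doc):
    """Crude opinion measure: positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", doc.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


# Hypothetical snippets standing in for crawled text.
docs = [
    "ibm is outstanding",
    "ibm is spineless",
    "ibm is shipping new processors",
]

mentions = count_mentions(docs, "ibm")
scores = [sentiment_score(doc) for doc in docs]
print(mentions)
print(scores)
```

      All three snippets count as mentions, but only the per-document scores separate praise from criticism from neutral news. A real system would need far more than a word list (negation, sarcasm, and context all defeat this sketch).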

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    2. Re:Expensive by starvingcodeartist · · Score: 1

      That's what they say, but the article gave me the impression that it basically just organizes data into usable categories. The benefit being that you can get "exactly" the data you are looking for, instead of wasting your time wading through scores of unrelated pages.

    3. Re:Expensive by millahtime · · Score: 1

      For the kinds of data mining that would be done, the benefit will easily pay for the cost, if it works as advertised. There is a huge speed problem in doing data mining currently. If this can solve that, then there are a lot of companies that will jump on it.

    4. Re:Expensive by orac2 · · Score: 1

      The point is that the "you" in "you can get exactly the data you're looking for" is not a person, but a data mining program.

      Disclaimer: I'm the author of the article!

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    5. Re:Expensive by Speare · · Score: 1


      "A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM."

      I give you googlism.com: http://www.googlism.com/index.htm?ism=ibm&type=2

      Googlism for: ibm

      ibm is even "officially" spineless
      ibm is still the 'king'
      ibm is shipping 2 new powerpc processors
      ibm is bullish on asps and hosted services in
      ibm is offering internship that supports grid
      ibm is my choice
      ibm is outstanding
      ibm is giving peace
      ibm is planning to ship new
      ibm is willing to help
      ibm is announcing sanfrancisco v2
      ibm is embracing linux and microsoft windows 2000 and creating
      ibm is a couple of generations ahead of the competition ...

      --
      [ .sig file not found ]
    6. Re:Expensive by Anonymous Coward · · Score: 0

      Oh Waaahh!!!!

      Go to India you wanker.

    7. Re:Expensive by Anonymous Coward · · Score: 0

      I fully grasp the distinction, but as others rightly say I also think this technology is destined for common everyday use in search engines.

      The shallow and brittle system of using weighted keywords and links has been fine for guys like Google until now; when the internet was relatively small, it worked. What is desperately needed now is efficient semantic space searching to get relevance back to searching such a huge set.

      A single-ended (post-analysis) system, like I assume IBM's WebFountain is, will struggle to do much good in anything like the near-real-time window web users desire. But searching pre-structured records (semantically/categorically parsed), perhaps using something like Kohonen's self-organising map, is the way forward.

      I'm sure such natural language 'NLUIs' will be the next big thing and the next layer we see put onto sites like Google, enabling searches like 'Is the film Matrix 3 any good?' to return only results which are opinions on the film, and so on.

      Kinda how we always imagined search engines would be :)

    8. Re:Expensive by benton · · Score: 1

      that makes sense, but just out of curiosity, do you think they would build an app on top of it to serve as a search engine and compete with Google? They certainly seem to have a pretty good start on it and search has proven pretty profitable.

    9. Re:Expensive by TurboProp · · Score: 1

      If you really want to know how people feel about something, you have to put it on Slashdot

      --
      ~ You may speak freely, If you have enough cash ~
    10. Re:Expensive by Anonymous Coward · · Score: 0

      That's one of the problems with our beloved splatdot. There's an increasing divide between the research scientists, the college kids, the sysadmin/console monkeys, and the enterprise architects. "Back in the day", our community was small enough that we all pretty much had a common goal, world domination. Now that we are close to achieving our goal, our community has grown so large that many of us are having very different experiences.

      Some people are just getting started (and there will always be newcomers), but as time goes on, they will have much more ground to cover before they catch up to where the community is today.

      I'm an IBM'er and my customer pays $1,500 a day for my services. Also, February will mark 2 years that I have been on the same project, so it's not just a 3-week deal. Comments like yours show that you have a long way to go before you realize how much money is out there in the business community. $150,000 a year is literally nothing when a copy of their DB2 database costs $30,000 alone. With your extremely high id, it looks like you may still be in college. Good luck when you get out here. In five years, you'll understand this post.

  11. corporate meddling by commo1 · · Score: 3, Insightful

    One of my main concerns with search databases is the inherent ability for corporations to increase their visibility on the web by manipulating data to their benefit to bring their corporate page up first on the list. I wonder if there is a way for the database to have a scoring system based on the validity of the data: is the information there, or are there just highly developed metatags doing the work? If you do a search for a specific part number for an HP product, what are the chances of getting a) the HP home page, where a further search would be necessary to find any relevant info, or b) the big chains like Staples and Circuit City, who just want to sell you cartridges and have the time and resources to steer you in the right direction? How would the system be regulated (kinda like Slashdot mods :P)? Who watches the watchers, and can information validity be electronically implemented? What kind of AI would be necessary?

    1. Re:corporate meddling by orac2 · · Score: 1

      WebFountain isn't intended as a general purpose search engine, but to provide a platform for data mining and analysis.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  12. Information... by enrico_suave · · Score: 1, Funny

    Information wants to be... Fuchsia!

    *shrug*

    e.

    --
    Build Your Own PVR/HTPC news, reviews, &
    1. Re:Information... by __past__ · · Score: 1

      Really? But I heard mauve has the most RAM!

  13. What about Existing Data? by ParadoxicalPostulate · · Score: 4, Interesting

    Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

    You would need an enormous workforce to do that.

    And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like much of a waste to me!

    Damn, but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.

    1. Re:What about Existing Data? by Ronald+Dumsfeld · · Score: 5, Funny
      Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

      No, they're writing software to put in the XML tags.

      What will be more interesting to see is if it's possible to pollute the database by putting in your own XML. Instead of Google-Bombing we'll have people pissing in the WebFountain.
      --
      Where's the Kaboom?
      There's supposed to be an Earth-shattering Kaboom.
    2. Re:What about Existing Data? by azzy · · Score: 1

      If they are prepared to pay me enough, I'll do it!

    3. Re:What about Existing Data? by AndroidCat · · Score: 2, Informative

      According to the article, Web Fountain is supposed to sift through information which isn't XML tagged.

      --
      One line blog. I hear that they're called Twitters now.
    4. Re:What about Existing Data? by ParadoxicalPostulate · · Score: 1

      Ah, you are correct. I mistakenly took the "annotators" for humans (damn personification...) But, when I think about it in oversimplified terms, it sounds pretty funny: So they are writing software to categorize software so that it can be recognized by other software?

    5. Re:What about Existing Data? by GT_Alias · · Score: 2, Informative
      Erm...did you read the article? WebFountain has created multiple "annotators" to sift through the data fed to it and apply XML tags.

      You would need an enormous workforce to do that.

      C'mon, give these guys some credit.
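
      As a rough idea of what an "annotator" pass might look like, here is a toy sketch that wraps known names in XML tags. It is not IBM's implementation: the lookup table `ENTITY_TAGS`, the tag names, and the function `annotate` are all hypothetical, and a plain dictionary lookup cannot tell Pink the singer from pink the color, which is the hard part the article describes.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical lookup table mapping surface strings to tag names.
ENTITY_TAGS = {"Pink": "popularmusic", "IBM": "company"}

# Match any known entity as a whole word (case-sensitive).
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in ENTITY_TAGS) + r")\b"
)


def annotate(text):
    """Escape XML special characters, then wrap known entities in tags."""

    def wrap(match):
        name = match.group(1)
        tag = ENTITY_TAGS[name]
        return f"<{tag}>{name}</{tag}>"

    return _PATTERN.sub(wrap, escape(text))


annotated = annotate("Pink played a show sponsored by IBM")
print(annotated)
```

      Because matching is case-sensitive, a sentence about a "pink dress" is left untagged here, but a sentence that begins with the color would still be mislabeled. Disambiguating by context is what separates a real annotator from this sketch.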

    6. Re:What about Existing Data? by cookie_cutter · · Score: 1
      Instead of Google-Bombing we'll have people pissing in the WebFountain.

      And so a new piece of slang is born.

    7. Re:What about Existing Data? by corbettw · · Score: 1

      If they are prepared to pay me enough, I'll do it!

      Well, they're probably prepared to pay $1.50 an hour. So unless you live in India or the Philippines, I wouldn't be dusting off the ol' resume if I were you.

      --
      God invented whiskey so the Irish would not rule the world.
    8. Re:What about Existing Data? by K-Man · · Score: 1
      Better yet, a mental image. Can't wait for the IBM brochure.

      --
      ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
  14. Entirely unsuited by happyfrogcow · · Score: 3, Insightful

    From the article, "But many online information sources are entirely unsuited to the XML model--for example, personal Web pages, e-mails, postings to newsgroups, and conversations in chat rooms."

    entirely unsuited? chrissake. email, unsuited. newsgroups, unsuited. chat rooms, unsuited. If personal home pages are unsuited, then so are corporate home pages, as there is nothing inherently different about the two. All this from an IEEE article... I would have thought them to be more accurate and less misleading. I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.

    HTML is based on the XML model. HTML is used to create personal web pages. How on earth then, could personal web pages be "entirely unsuited to the XML model"?

    1. Re:Entirely unsuited by Anonymous Coward · · Score: 0

      Um... No, XML is based on the HTML model.

    2. Re:Entirely unsuited by Anonymous Coward · · Score: 0

      No, you are both wrong. HTML is an SGML application. XML is a simplification of SGML. XHTML is an XML application.

      Oh, and "pink" is the colour - P!nk is the singer :)

    3. Re:Entirely unsuited by happyfrogcow · · Score: 1

      Um... No, XML is based on the HTML model.

      no, XML is based on the SGML model. HTML too, with exceptions to some SGML features. more info: http://www.w3.org/TR/html401/intro/sgmltut.html

    4. Re:Entirely unsuited by Anonymous Coward · · Score: 0
      HTML is not based on "the XML model". HTML is a cut-down version of SGML, as is XML. There is a variant of HTML - XHTML - designed to parse as an XML document, but XHTML doesn't necessarily include any semantic information.

      It is the semantic content that is important. Personal web pages will never be marked up with meaningful semantics, because doing that is a lot of work for little benefit to the writer. Corporate webpages are different: corporations employ IT people who should be capable of understanding the need for semantic markup, plus there could be benefits to a corporation of paying for the time and skills necessary to convert at least some of their web presence to a semantically-marked-up form.
      I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.
      Now explain to every single person with a Geocities page how to and why they need to do that.
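
      The HTML versus XHTML point can be checked with an XML parser. A minimal sketch, assuming two made-up fragments: the XHTML-style one closes every element and parses as XML, while the HTML-style one, with its unclosed <br>, does not. The helper name `is_well_formed_xml` is an assumption for this example.

```python
import xml.etree.ElementTree as ET

# Two illustrative fragments; the only difference is how <br> is written.
xhtml_fragment = "<p>Line one<br/>Line two</p>"
html_fragment = "<p>Line one<br>Line two</p>"


def is_well_formed_xml(markup):
    """Return True if the markup parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


xhtml_ok = is_well_formed_xml(xhtml_fragment)
html_ok = is_well_formed_xml(html_fragment)
print(xhtml_ok)
print(html_ok)
```

      Well-formedness is only about syntax. Both fragments carry the same (minimal) semantic information, which is the commenter's point about XHTML not necessarily adding meaning.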
    5. Re:Entirely unsuited by orac2 · · Score: 4, Insightful

      Disclaimer: I'm the author of the article.

      Most people don't and won't tag as they go. (Except for those of us used to writing HTML-enabled comments on /. of course). Also, in order to be able to write <popularmusic>Pink</popularmusic>, and have it make sense, you'd have to be following a DTD.

      As anyone who's been involved in DTD formulation can attest, even for internal documentation, it can be a royal pain in the butt. I don't think the vast majority of on-line rapid content generators (all those bloggers, emailers, chatters) will ever use XML to routinely tag their content manually. The article isn't talking about machine generated or commercial content, like Amazon's, but the day to day stuff that gets put up in the time it takes to write it and click submit, and which is of most interest to market researchers.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    6. Re:Entirely unsuited by xyzzy · · Score: 1

      More to the point, HTML tags for RENDERING, not semantics. To a first order, ALL HTML pages look alike.

    7. Re:Entirely unsuited by happyfrogcow · · Score: 1

      Is it unreasonable to imagine a web community that advocates the use of some relevant DTD? On the nerdly end of things, if slashdot had their own DTD or used some other DTD, I might use it. It could add value to the site from a usability perspective as well as economic value for the owners.

      I think that if it was sufficiently easy for a person to know what tag to put around "Pink", and know that it would add something to the usability and understandability (am i making up words?) they might do it.

    8. Re:Entirely unsuited by Anonymous Coward · · Score: 0

      It's unreasonable to expect people to tag because the use of tags presupposes what those tags will be used for which is impossible to predict in general. For example, suppose I want to search for musicians whose names are also colors, should I expect everybody who is writing about Pink the musician to use the tags <color><popularmusic>Pink</color></popularmusic> just to support my search? It's the same with any other use of semantic information embedded within the markup of a document.

    9. Re:Entirely unsuited by orac2 · · Score: 1

      On the nerdly end of things, if slashdot had their own DTD or used some other DTD

      Even back when the web was just composed and read by nerds, people still didn't follow the "rules" -- look at how HTML drifted from its original use of marking up content to being a poor man's page layout language.

      they might do it.

      Sorry, I just can't believe it. Most contributors to the web (i.e. non computer nerds) are hard pressed to remember even a handful of HTML tags, let alone maintain a familiarity with a DTD, however easy it was to lookup.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    10. Re:Entirely unsuited by You're+All+Wrong · · Score: 1

      "HTML tags for RENDERING, not semantics"

      In theory, or in practice?

      <dt>, and <dd> are certainly highly semantic, <a> even more so.

      Sure, 99% of webpages use the tags for nothing but presentation, but that doesn't mean that their only, or even intended, use is presentation. <table> being the perfect example of something that started off as a semantic structure, but got almost entirely hijacked to do layout instead.

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
    11. Re:Entirely unsuited by You're+All+Wrong · · Score: 1

      If I can't decide what format to put structured data into I now just fall back onto what would be XML if only I were to codify the DTD I have in mind.

      People who don't care about structuring their data, 99% of the moronic web-loggers, are just not going to be interested in XML at all.

      However, some do care, typically the ones that do more with their web-presence than just being another blogger on some handle-turning point-and-click blog site. However, whether this minority would adopt such a formal approach is still far from certain.

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
    12. Re:Entirely unsuited by kfg · · Score: 1

      If personal home pages are unsuited, then so are corporate home pages, as there is nothing inherently different about the two.

      Aha! I think we're on to something here. :)

      XML, for those people who when asked "What part of 'Hello World!' don't you understand?," say:

      "Ummmmmm, that one?

      "Ok, is this better?":

      Hello World

      "Oh, well geez. Why didn't you say that the first time?

      KFG

    13. Re:Entirely unsuited by xyzzy · · Score: 1

      Right, but on the great unwashed web, practice is all ya got. And the semantics of the tags (and their origins in document generation) is pretty darned impoverished. really doesn't tell you *anything* unless you are looking for tables -- what people REALLY need to find information is and and things like that.

    14. Re:Entirely unsuited by xyzzy · · Score: 1

      What people really need is to preview before submitting :-)! The last comment should have said "...to find information is <Year> and <GasPrices> and things like that".

  15. Re:Content Chaos by Anonymous Coward · · Score: 0

    Meaning 1:

    crazy, ridiculous

    Your mom flushed your stash? that's whack!

    Meaning 2:

    stupid, dumb; gay

    Dude, that's whack.

    Source: Urban Dictionary(.com)

  16. Too easy, think complicated by korpiq · · Score: 1


    Some information at different paths might require cross-referencing. Thus, the scheme you propose should be extended so that there would be a way for text documents to contain links to each other.

    However, if you just take a big enough storage system and download all the documents from teh intterweb, you can have a flat directory containing all the documents. Woohoo, progress!

    --

    I think, therefore thoughts exist. Ego is just an impression.
  17. IBM needs this... by G.+Waters · · Score: 1

    IBM should try their own website. Passport-Advantage is about the most hideous labyrinth I've ever spelunked (sp). IBM is not alone, but through sheer scale the site just screams "bueromaze".

    1. Re:IBM needs this... by null+etc. · · Score: 1

      You think that site is bad, try mining through Symantec's site. Their online store, combined with their poor product differentiation between product models and product lines (i.e. Norton products vs. Symantec products) make it impossible for anyone to be a Symantec guru. My friend is a product manager there, and I always give him flack about it.

  18. Impact on Google IPO by G4from128k · · Score: 2, Interesting

    This is the type of technology that could either ensure or derail Google's future (I'm not saying that it will, only that it could). Semantic analysis and clustering of web pages could improve search. I hope Google gets to use/create this type of tech.

    --
    Two wrongs don't make a right, but three lefts do.
  19. Echelon? by SexyKellyOsbourne · · Score: 2, Interesting

    This project sounds quite interesting -- it could really help out projects like Echelon to help win the war on terrorism, if it's capable of understanding other languages of course, and could possibly build a whole database of information that's intercepted from other places. All that chatter, with the codewords they use, could possibly be understood by a football field full of Linux rackmounts, and might foil something.

    Of course, such power could also be horribly misused if it came into the wrong hands. What if they wanted to enumerate every member or affiliate of the "terrorist" Green Party in the case of a "national emergency?" Feed WebFountain some data from the internet, and from ECHELON, and they would have a quick blacklist.

    Or corporations, for that matter, as that's who it's designed for, could quickly blacklist people from employment who were considered "dangerous" such as whistleblowers, heavily involved union members, spies, watchdogs, and so forth.

    1. Re:Echelon? by pantycrickets · · Score: 1

      All that chatter, with the codewords they use, could possibly be understood by a football field full of Linux rackmounts, and might foil something.

      You don't need much technology to predict future terrorist attacks. I just used google.. and look what I found!

    2. Re:Echelon? by orac2 · · Score: 3, Insightful

      Disclaimer: I'm the author of the article.

      I know from talking to the WebFountain team that they're very sensitive to privacy concerns. WebFountain obeys robots.txt and doesn't archive material which has vanished from the publicly visible web (if only for reasons of storage capacity!).

      The point is that all the information that feeds into IBM is already publicly available. If someone wanted to go after Green Party members and if the Green Party posted its membership roll on a webserver, I think they'd be able to get it, WebFountain or no.

      Of course, I suppose WebFountain could be used to construct a membership list by scanning people's home pages to find out if they say that they're a member, but again this is publicly declared information.

      Bottom line, as always: if you don't want it generally accessible to all, don't put it on a public web server.
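
      As a rough illustration of what "obeys robots.txt" means for a crawler, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the bot name and the URLs are made up for the example; none of this is WebFountain's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/members.html"):
        print(url, allowed("ExampleBot", url))
```

      A site that keeps a page under a disallowed path stays out of any crawler that honours the file, which is the opt-out being described here.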

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    3. Re:Echelon? by Nevyn · · Score: 1
      The point is that all the information that feeds into IBM is already publicly available. ... Of course, I suppose WebFountain could be used to construct a membership list by scanning people's home pages to find out if they say that they're a member, but again this is publicly declared information.

      But that's it, you can't just say "all I did was collect public data" so it can't have privacy concerns. It's obviously still got them (unless your collector is useless).

      For instance, I might say on /. that I think Fox "News" are extreme right wing liars and/or that a woman's right to choose is a good thing ... and I might have links from my /. page to my website ... and I might be interested in photography and publish photos on my website of places near where I live.

      But I'm pretty sure if someone said they had a list of names and addresses of "unborn killer sympathizers", I wouldn't want to be on it.

      And I'm not saying that it's obviously bad, there are obviously a lot of good uses of information aggregation ... you just don't get to not have the discussion because "hey, all I did was republish".

      --
      ustr: Managed string API with ave. 44% overhead over strdup(), for 0-20B
    4. Re:Echelon? by cfuse · · Score: 1
      Of course, such power could also be horribly misused if it came into the wrong hands.

      Too late.

    5. Re:Echelon? by Anonymous Coward · · Score: 0

      For instance, I might say on /. that I think Fox "News" are extreme right wing liars and/or that a woman's right to choose is a good thing ... and I might have links from my /. page to my website ... and I might be interested in photography and publish photos on my website of places near where I live.

      But I'm pretty sure if someone said they had a list of names and addresses of "unborn killer sympathizers", I wouldn't want to be on it.


      Seems like a bad example, but I'm game.

      If I was an extreme right winger, I'd say that you were an "unborn killer sympathizer" and I'd be able to track you down. All using info that you have already put out there.

      (I'm not an extreme right winger, just so you feel better. :) )

      Aggregation of the info is irrelevant. The fact that some system makes it easy to collate and/or find the information doesn't change the fact that the information is *already* out there. How can it have any privacy concerns beyond its public existence in the first place?

      The fact that you don't want to be on someone's list is also irrelevant. You don't have control over other people's impressions of you. If someone were to read your opinions and decided that you belonged on their shit list, then they could add you to said list and there's not a damn thing you can do about it. This is simply reality.

      So anyway, I fail to see how indexing or doing any form of collating publicly available information can possibly have privacy concerns beyond the existence of the information to begin with. When the info is already out there, it's just a matter of finding it. Making it easier to find and bring together this information isn't a privacy concern, because the information was already there to begin with. It wasn't invented out of thin air. And you don't have to put that information out there in the first place.

    6. Re:Echelon? by Nevyn · · Score: 1
      Aggregation of the info is irrelevant. The fact that some system makes it easy to collate and/or find the information doesn't change the fact that the information is *already* out there. How can it have any privacy concerns beyond its public existence in the first place?

      Making it easier is a big thing though. For instance it's possible for someone to find out my social security or credit card numbers by just stealing information from the right place(s). This is not particularly well kept information, I'd imagine most people on /. could do it. However I'm not likely to post them on my website, as that makes it much easier.

      The same goes for the example I gave in the previous post. Matching up all the information (who I am, where I live based on the photos, etc.) is non-trivial for someone to do manually. However, if you can just say "give me a list of people who match profile X", then that is a significant difference.

      --
      ustr: Managed string API with ave. 44% overhead over strdup(), for 0-20B
    7. Re:Echelon? by Anonymous Coward · · Score: 0

      That's exactly where it came from: NSA + IBM research project!

  20. One Net to Rule Them All by null+etc. · · Score: 5, Insightful
    It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.

    Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic, than finding actual content regarding that topic. This lack of ability to separate queries for knowledge, versus queries for product sales literature, is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.

    Even worse is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.

    Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.

    1. Re:One Net to Rule Them All by ThomasXSteel · · Score: 1
      It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge.

      Don't see why you need another whole network to do this. See wikipedia. It may not have the uber xml web service driven aspect oriented paradigm shifting 300 grand per annum buzzword love, but it works fine for me. Besides, we can just graft that shite on later if it turns out to be useful.

    2. Re:One Net to Rule Them All by Tom · · Score: 2, Informative

      Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic, than finding actual content regarding that topic.

      (topic) -checkout -buy

      Other things that work well sometimes:
      (topic) site:.org
      (topic) -amazon
      (topic) -site:amazon.com -site:amazon.co.uk

      and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage.

      Does it? I always thought that's exactly what google is filtering out behind the "12345 more results were omitted because they were similar" thingy.

      --
      Assorted stuff I do sometimes: Lemuria.org
    3. Re:One Net to Rule Them All by starvingcodeartist · · Score: 1

      Wikipedia is cool, but it's not really what the previous post is talking about. If you wanted to know how to make a Javascript menu, for instance, you couldn't find that on Wikipedia. They could tell you what a programming language is though. If you searched google for Javascript menu, you'd get a billion results for companies that sell DHTML/Javascript menus...but you wouldn't (easily) be able to find an informative article on how to make one yourself. I agree that the internet search engines make it hard to get useful data. You can find it, but you usually have to wade through a bunch of crap first.

    4. Re:One Net to Rule Them All by gilroy · · Score: 1
      Blockquoth the poster:

      If you searched google for Javascript menu, you'd get a billion results for companies that sell DHTML/Javascript menus...but you wouldn't (easily) be able to find an informative article on how to make one yourself.

      A lot of people's complaints about searching the Net come from a very narrow idea of search terms. Although sometimes I get swamped with commercial sites, I am generally able to find 6-8 useful pages on the first page of Google's results. For example, try

      javascript menu tutorial

      and you get 10 solid results on page 1.

      On New Year's Eve I decided to test the searchability of the Net by tracking down a song I've been looking for over the past decade. The issue was, I didn't know the artist, album, or title, and it was instrumental so I couldn't search on lyrics. What I did know was that a clip of the song had been used in an Amtrak commercial campaign around 1994.

      Armed just with that, I was able to correctly find the song using only Google and maybe ten minutes. So I don't exactly buy laments about the unsearchability of the Net.
    5. Re:One Net to Rule Them All by null+etc. · · Score: 1

      Does it? I always thought that's exactly what google is filtering out behind the "12345 more results were omitted because they were similar" thingy.

      Google does that to an extent, but to see an example, enter the following query and examine the resulting links:

      http://www.google.com/search?q=%22PlumbingSupply+Group%22+%22American+Standard%22
      "PlumbingSupply Group" "American Standard"

      Each result has identical content, under different store (and subsequently, domain) names.

      This example is a bit contrived, but all last week I couldn't stop running into duplicate results when trying to locate certain products on the web.

    6. Re:One Net to Rule Them All by K-Man · · Score: 1

      According to one talk I went to, Google uses approximate hashing to find duplicates (easy), and near-duplicates (hard). They may not be using the best methods, but even if they were, I suspect it would be difficult to find all the duplicate pages.

      Maybe if they looked for duplicate contexts on each search it would cover a lot of the problem.
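
      What Google actually runs isn't described here, but the general idea of near-duplicate detection can be sketched with word shingles and Jaccard similarity. This is a simpler stand-in for approximate hashing (schemes such as MinHash estimate the same quantity without comparing full sets). The function names, the sample pages and the 0.4 threshold are assumptions made for the example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate(x: str, y: str, threshold: float = 0.4, k: int = 3) -> bool:
    """Treat two pages as near-duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(x, k), shingles(y, k)) >= threshold


if __name__ == "__main__":
    page_a = "buy my plumbing supplies at the best plumbing store online today"
    page_b = "buy my plumbing supplies at the best plumbing shop online today"
    print(jaccard(shingles(page_a), shingles(page_b)))
    print(near_duplicate(page_a, page_b))
```

      Exact duplicates score 1.0 and are the easy case; the hard part is picking a threshold that catches lightly reworded marketing pages without merging pages that are merely on the same topic.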

      --
      ---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
    7. Re:One Net to Rule Them All by s00p41337h4x0r · · Score: 1
      It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.

      Yeah, it would be nice. It would also be nice if there weren't spammers or PageRank hijackers or any of the myriad problems with the current Web. The difficulty isn't that it's undesirable to have a clean complete net, it's that there's no way to enforce clarity or completeness. People do try to build these understandable information networks, but you'll always be hampered by the limited number of people who are willing to spend the time and effort to mark up their pages.

  21. HTML is based on the XML model. by wiredog · · Score: 1

    Ummm. No. HTML predates XML.

    1. Re:HTML is based on the XML model. by happyfrogcow · · Score: 1

      details details...

      HTML (1992?) does predate XML (1996?). My point is that they are both SGML based, and a strict HTML 4.01 document is a valid XML document, unless I have something wrong in my understanding of all of this.

      Further, my point was not to debate what HTML is or isn't derived from or a subset of, but that personal web pages are not inherently different from other web pages. To say a company can do something with their data that an individual cannot do is misleading.

    2. Re:HTML is based on the XML model. by WindowlessView · · Score: 1

      HTML does pre-date XML but time is not the relevant factor here. Both XML and HTML are subsets of SGML. So horse pulled wagons may have pre-dated cars but they are still both subsets of the general category of wheel based transportation.

      Really the hierarchy is SGML->XML->HTML since HTML is really just a subset of XML, one of many languages that can be written in XML. See XHTML for a more apparent example.

      --
      Leave the gun, take the cannolis.
    3. Re:HTML is based on the XML model. by MCZapf · · Score: 1
      HTML is not a complete subset of XML. There are a few disjoint parts. That's why they had to come up with XHTML - to correct these small differences. Some examples I can think of:
      1. In terms of tag names, HTML is not case sensitive, but XML is.
      2. XML requires all tags opened to have a closing tag. HTML has some cases where this is not required, such as with the <p> tag. <img> never has a closing tag.
      3. I think HTML allowed overlapping tags, which XML also forbids. For example, <b>bold<i>BOTH</b>italic</i>, but maybe it was just sloppy browser implementations that allowed it.

      AFAIK, all valid XHTML 1.0 is also valid HTML. But newer, not-widely-used revisions of XHTML are starting to make incompatible changes.
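
      The three differences listed above are easy to see with a strict XML parser. A minimal sketch using Python's standard xml.etree.ElementTree follows; the sample snippets are made up, and it only checks XML well-formedness, not validity against any HTML or XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if markup parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Sample snippets (made up), one per difference listed above.
SAMPLES = {
    "mixed-case tags": "<P>text</p>",
    "unclosed <br>": "<p>one<br>two</p>",
    "self-closed <br/>": "<p>one<br/>two</p>",
    "overlapping tags": "<p><b>bold<i>BOTH</b>italic</i></p>",
    "properly nested": "<p><b>bold<i>BOTH</i></b><i>italic</i></p>",
}

if __name__ == "__main__":
    for label, markup in SAMPLES.items():
        print(f"{label}: {is_well_formed_xml(markup)}")
```

      Only the self-closed and properly nested snippets parse, which is roughly the cleanup XHTML asks of an HTML author.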

    4. Re:HTML is based on the XML model. by WindowlessView · · Score: 1

      I think we are pretty much in agreement. My read on it is that HTML is disjointed not so much that it couldn't (or shouldn't) be a subset of XML but that the real world realities of early web evolution bent it out of shape a bit because some of the browsers were a bit too relaxed and that laxity quickly became standard.

      Oh well. Either way I am pretty sure the sun is going to rise tomorrow in the east.

      --
      Leave the gun, take the cannolis.
  22. In other news... by jetkust · · Score: 1

    Researchers in Alabama are working on a system which converts all music on the internet into a single Menudo mp3 file. EIEIO reports the first public use will be to create a single mp3 file that results in trillions of dollars in royalties to the RIAA when traded illegally.

    1. Re:In other news... by Anonymous Coward · · Score: 0

      Good one.

      Retard.

  23. Pink is... by LJPeixoto · · Score: 0, Offtopic

    Brain's coworker :-)

    Watch cartoons all day and see your mind melt down :-)

  24. i.e. nameprotect by joeldg · · Score: 3, Interesting

    nameprotect does something similar, except they are looking for people violating copyrights.
    in addition I think they might be one of the most banned bots online.

    anyway, their users are all corporate entities who pay a lot of money to be able to auto-cease and desist copyright infringers..

    These same companies will pay IBM to tell them that since their cease and desist spree everyone hates them.

  25. almaden webspider by MrSpiff · · Score: 1

    is the Almaden webspider (http://www.almaden.ibm.com/cs/crawler/) that's been scavenging in the dark a part of this?

  26. URL of the project page by DerOle · · Score: 2, Informative
  27. Like NorthernLight? by dpbsmith · · Score: 4, Informative

    This sounds very similar to NorthernLight.

    NorthernLight was (it still exists, but apparently is not available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, automatically generated on the basis of the search. (For some reason, they called these categories "custom search folders.")

    Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.

    I actually would have used NorthernLight as one of my routine search engines--it worked quite well--had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services and had only limited utility to those who did not wish to buy them.

    1. Re:Like NorthernLight? by Wiktor+Kochanowski · · Score: 2, Informative
      Vivisimo does this kind of sorted search.

      Try it out, works quite often for me - beats Google for many queries, not in actual number of pages found, but in the time it takes me to find out whatever I'm looking for.

    2. Re:Like NorthernLight? by rcastro0 · · Score: 1
      "Pink" on vivisimo results in 150 hits:

      pink (150)
      Pictures (20)
      Pink Floyd (14)
      Art, Artist (10)
      Features, Yahoo (3)
      Updates (2)
      News, Bio (2)
      Other Topics (3)
      Book (7)
      Music (9)
      Color (9)
      Lyrics (6)
      CD (6)
      Fan Sites (5)
      Dot (4)
      Exactly what IBM wants to achieve, it seems.
      --
      Quem a paca cara compra, paca cara pagará.
    3. Re:Like NorthernLight? by orac2 · · Score: 1

      Exactly what IBM wants to achieve, it seems.

      Except IBM isn't trying to build a general purpose search engine for humans, but a platform for data mining programs.

      Also, WebFountain is trying to analyse not 150 hits but the millions of hits found across the web, rather than just the handful of top-ranked hits that vivisimo pulls from other search engines (look at the details section of the vivisimo result page, where it lists the engines searched). It's apples and oranges really.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
  28. Gaming Webfountain by G4from128k · · Score: 3, Interesting

    I wonder how long it will take sleazy e-commerce sites and p0rn sites to game WebFountain and turn it into SpamFountain?

    I suspect that this tool (and any like it) must make a core assumption -- that each webpage is about one semantic thing and that the creators are trying to communicate that one thought. In contrast, people who try to boost their page rank have no compunction about misleading people (or algorithms). Clever tagging and misleading verbiage should be able to fool IBM's analyzer into clustering a site where it does not belong (but where the site owner wants it). The result is a page that looks like it is about one thing (some popular search term) while being about something else (selling their junk or porn).

    Next will come high-priced consultants that tell you how to make your site place highly on WebFountain (like the ones that currently game Google).

    --
    Two wrongs don't make a right, but three lefts do.
  29. if you read the article by Anonymous Coward · · Score: 0

    if you read the article you would have seen that that statement is about the fact that people are not going to spend time xml tagging their irc chat and every blog entry and email.

  30. What is PINK? by BigBlockMopar · · Score: 2, Funny

    (Is "pink" the singer or the color?)

    I didn't get the joke.

    These are, after all, engineers. Pink is neither a color nor a singer (talented or otherwise).

    To an engineer, PINK can only be an acronym.

    --
    Fire and Meat. Yummy.
    1. Re:What is PINK? by Phreakiture · · Score: 1

      It could also be a song (by Aerosmith).

      --
      www.wavefront-av.com
  31. Maybe we should just outsource by CompWerks · · Score: 1

    this project to india.

    --
    If you can read this sig - the bitch fell off.
    1. Re:Maybe we should just outsource by Anonymous Coward · · Score: 0

      Maybe you should go to India. And then outsource your projects to the USA ? Huh ?

      Maybe you should go to India and stop trying to be funny on slashdot you ignorant fuck.

  32. IBM's Pink by th77 · · Score: 2, Funny

    IBM should know that Pink was the predecessor to Taligent which was the predecessor to absolutely nothing.

    --
    Your favorite sig sucks
  33. Intel-based? by trACE666 · · Score: 1

    Why does IBM use PC hardware?
    Wouldn't it make more (marketing) sense to use one of their own platforms, I guess the z-Series should be the most suited for that amount of data...

  34. A good idea for search engines follow? by dollar70 · · Score: 0

    Don't get me wrong, I like getting a little web-traffic (I said a little so no /.ing please!), but when I look through my logs and see searches where Google is referring people to my site inappropriately, I almost want to scream at the mindlessness they use to categorize my web pages. On the one hand, I'm flattered, but on the other hand it's disturbingly out of context. I even put the <meta NAME="robots" CONTENT="noindex,noarchive"> line in the headers that were giving me headaches, but people still end up at my site looking for that damned lemonparty.jpg just because I mentioned it in my blog once.

    1. Re:A good idea for search engines follow? by rcastro0 · · Score: 1
      Shame on me for being curious.
      lemon party
      a group of 3 or more old men in a circle sucking each other off
      Bill and Carl joined their grand fathers at the lemon party
      I don't want to think about why this term ever arose and was able to drive traffic through google.
      --
      Quem a paca cara compra, paca cara pagará.
    2. Re:A good idea for search engines follow? by Anonymous Coward · · Score: 0

      Note to self:
      Look up definition before searching google for that filename.

      Excuse me while I go shoot myself

  35. ObSCO ref by gosand · · Score: 1
    IEEE reports that the first commercial use will be to track public opinion for companies.


    Can't wait to see what the entry for SCO looks like...

    --

    My beliefs do not require that you agree with them.

    1. Re:ObSCO ref by Tackhead · · Score: 1
      > > IEEE reports that the first commercial use will be to track public opinion for companies.
      >
      > Can't wait to see what the entry for SCO looks like...

      You mean you haven't seen goatse.cx yet?! pfft. n00b!

  36. so-called tags by jamesl · · Score: 1

    "Things such as price or product identification numbers are identified by bracketing them with so-called tags, as in Deluxe Toaster , $19.95 ."

    They're "tags", not "so-called tags".

    Tags! Like those little things they hang on stuff at the store to tell you how much it costs. Tags.

    Of course, he may have been referring to their use in a "software program".

  37. How long before people start gaming the system? by dpbsmith · · Score: 4, Interesting

    As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.

    As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

    And the stakes are much higher for gaming WebFountain than for gaming Google.

    For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.

    WebFountain will work well only until it is actually introduced.

    1. Re:How long before people start gaming the system? by orac2 · · Score: 2, Informative

      Disclaimer: I'm the author of the article

      As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

      This could be tricky -- WebFountain uses a kitchen sink approach, with a varying palette of content discriminators and disambiguators. The developers are also savvy enough to downweight link farm type approaches. Of course, one could, say, conduct a campaign among bloggers to mention a term and make it appear well-known to WebFountain, but the inevitable consequence is that it would then actually be well-known!

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    2. Re:How long before people start gaming the system? by FreshFunk510 · · Score: 1

      My initial response would be that the power of data mining is supposed to make this sort of "gaming" negligible since data mining is about analyzing an incredible amount of data sources instead of a random few.

      However, there is one thing I should note. I'm not sure how sensitive Google is to gaming. I know they have Google bombing but isn't this only possible with the cooperation of thousands of bloggers? If not, then there's definitely something wrong with the system. Theoretically, though, a data mining system should analyze an extremely large number of sources making it difficult to game the system (difficult but not impossible I suppose).

      --


      "Injustice anywhere is a threat to justice everywhere." - Martin Luther King, Jr.
    3. Re:How long before people start gaming the system? by Jerf · · Score: 2, Informative
      It's important not to underestimate people's ability to game systems, regardless of the thought put into them. The simple algorithm
      • Reconstruct algorithm.
      • Simulate algorithm and play with the inputs until the outputs match what you want.
      • Bring those inputs about.
      is extremely powerful, and note that as a "meta-algorithm" there's absolutely no way to completely shut it down.

      You have only four basic defenses against this:
      1. Keep changing the algorithm (expensive and large changes may not be possible if stability is desirable, which for search results it generally is),
      2. make the input-gaming process more expensive than the value of the output to the attacker (as you become more valuable you're a more enticing target),
      3. make the outputs desired by the attacker impossible (generally not possible in the general case, but in certain limited ways it is; it is probably not possible to be the #1 google result for all possible search terms, for instance, despite the desirability of such a result),
      4. or have a human monitoring attacks and shut them down manually (only possible if you can out-staff the attackers)
      There are some other possibilities but a lot of them don't apply in the real world, like "make it impossible to reverse the inputs necessary for some output" (like MD5); this is not applicable to a real-world application like a search engine because there has to be some obvious human-sensible logic to the placement or the search engine is just returning random results, which is not even a "search engine", let alone a useful one. Not even all four can be brought to bear in a given situation; #2 probably doesn't apply in this case since the benefits could be in the millions of dollars in theory.

      I'm not saying that WebFountain is hosed; Google has trouble but it is handlable. But it is worth talking about; certain basic algorithms will have certain effects as people try to game them, and it may be the case that some clever, useful algorithm is so easily gamed and so difficult to create countermeasures for that it will never be possible in the real world in the general case.

      (I doubt this is the case, but there's only one way to find out, and that's to try it and see what happens.)
    4. Re:How long before people start gaming the system? by bogie · · Score: 1

      You mean kinda like how Google is getting ruined by scumbags who set up thousands of fake sites that just refer everything you've ever searched for directly to Amazon? Google has become almost worthless for product research. Sure, it's still "better" than anything going, but the spammers and marketers have filled it with way too much garbage.

      --
      If you wanna get rich, you know that payback is a bitch
    5. Re:How long before people start gaming the system? by bobbuck · · Score: 1

      Isn't that the basic rule of quantum mechanics and internet? You can't observe something without changing it? (Or in the case of Slashdot, destroying it?)

    6. Re:How long before people start gaming the system? by orac2 · · Score: 1

      The thing is, that it's hard to do the second step of your general algorithm: Simulate algorithm and play with the inputs until the outputs match what you want.

      Determining the outputs and closing the feedback loop is hard -- getting WebFountain output is pretty pricey, compared to search engine results, where you can have a very low-cost feedback loop. This makes reconstructing the algorithms hard, if not impossible. Also remember that the exact set of algorithms varies depending on the problem: because the gamers can't get direct access to WebFountain, they can't determine what algorithms are running. Finally, to simulate some of these results you'd need databases so large that you'd in effect have to be running your own WebFountain.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    7. Re:How long before people start gaming the system? by kindofblue · · Score: 1
      I second this sentiment, that gaming of any system is likely, not merely possible.

      This is because humans can be "gamed" in the real world. That is, one can fabricate a "buzz" about things, not simply by overt measures like commercials, but with plants in social situations. Sony or some other consumer electronics company planted people in Times Square and other highly visible situations to pretend to use some cool new gadget. Then people see it and tell their friends and then eventually, they hope, there will be an underground sort of "buzz" generated about the cool product.

      If people can't tell that they are being manipulated on the street, I see no hope for machines to be able to figure this out either. Furthermore, when applied to the web, it is not difficult to imagine writing programs that could post good or bad reviews of any product or company to blogs, newsgroups, or whatever.

      The posted reviews would have to be varied and still capture the same meaning. That does not seem that hard. Even a simple randomized (CFG) grammar could provide enough variability to make the same statements but with different phrases. It doesn't have to fool a human, just a machine that is looking for grammatical sentences with key adjectives. Perhaps something like...

      START := PRODUCT "is" COOL_PHRASE. "I installed one" TIME_PHRASE HOME_PLACE.
      COOL_PHRASE := "wicked" | "phat" | "fly".
      TIME_PHRASE := "during Christmas" | "after Kwanzaa" | "last weekend".
      HOME_PLACE := "at my crib" | "in my apartment" | "outside my mansion" | "in my old jetstream" | "in my mom's basement".
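
To make the grammar above concrete, here is a minimal Python sketch of a randomized generator for it. The grammar leaves PRODUCT undefined, so the two product names are hypothetical placeholders, and the dictionary layout is just one assumed way to encode the rules.

```python
import random

# Grammar transcribed from the rules above. PRODUCT is not defined there,
# so the product names are hypothetical placeholders.
GRAMMAR = {
    "START": [["PRODUCT", "is", "COOL_PHRASE", ".",
               "I installed one", "TIME_PHRASE", "HOME_PLACE", "."]],
    "PRODUCT": [["The WidgetPhone"], ["The GizmoPad"]],
    "COOL_PHRASE": [["wicked"], ["phat"], ["fly"]],
    "TIME_PHRASE": [["during Christmas"], ["after Kwanzaa"], ["last weekend"]],
    "HOME_PLACE": [["at my crib"], ["in my apartment"], ["outside my mansion"],
                   ["in my old jetstream"], ["in my mom's basement"]],
}


def expand(symbol, rng):
    """Expand a symbol: terminals are returned as-is, non-terminals pick a random alternative."""
    if symbol not in GRAMMAR:
        return symbol
    alternative = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(part, rng) for part in alternative)


def generate(seed=None):
    """Produce one fake review; a fixed seed gives a repeatable result."""
    rng = random.Random(seed)
    # Joining with spaces leaves a gap before each period, so tidy that up.
    return expand("START", rng).replace(" .", ".")


if __name__ == "__main__":
    for _ in range(3):
        print(generate())
```

Even this toy version yields 2 x 3 x 3 x 5 = 90 distinct reviews, which is the point: variety is cheap for the spammer.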

      Basically, link farms combined with natural language generation techniques could make this type of semantic inference about as hard as deciphering spam.

      I imagine that a first pass of a WebFountain approach would try to cluster pages or comments based on the inferred sentiment. Then a human would have to read all the individual comments to filter out the semantic spam. Then the remainder would have to be reranked.

    8. Re:How long before people start gaming the system? by Jerf · · Score: 1

      First, the "hardness" needs to be measured against the value of the benefit obtained from gaming. If it's large, more effort will be thrown at it.

      Second, you seem to have missed the implications of my carefully-chosen word simulate. You don't need to replicate the algorithm, just create something that mostly works in most of the situations that you care about. (Both "mosts" are important.) This is a significantly lower bar than "complete replication", and is one of the reasons it's so hard to combat this; while you need to get to 99.9% accuracy to be a "good search engine", the engine gamers can quite often get by on a simulation that is hardly correct at all... I'd guess the Google spammers understand at most 1% of the Google algorithm... but that's plenty to game the system, often because they can make up the difference with "overwhelming force". (The Google spammers aren't going to sweat the difference between a ranking of 5.554 and 5.556 when they can just create another 20 sites that link to the desired site; "overwhelming force".) The search engines are fighting on a very unbalanced-in-the-gamer's-favor battlefield.

      Again, don't overread my claim. I'm not saying this is inevitable; Google largely wins overall. But the possibility should not be waved away; it's a big mistake to do so. (Especially for the implementers; if you think your algorithm is perfect you're likely to learn how wrong you are in a big way as you bet too much on it, whereas I doubt the people at Google ever thought they'd ever completely "win", so they make much more reasonable choices.)

    9. Re:How long before people start gaming the system? by You're+All+Wrong · · Score: 1

      "isn't this only possible with the cooperations of thousands of bloggers?"

      hundreds, probably, but nevertheless - there _are_ hundreds, thousands even, of bloggers, and therefore the system can be beaten.

      Coordinated bloggers are like a swarm of locusts, or of termites.

      Look at slashdot - could you imagine what would happen if 1000 people decided tomorrow to post random nonsense to every thread?
      Slashdot would be rendered absolutely useless immediately.
      Never underestimate the power of numbers.

      YAW.

      --
      Your head of state is a corrupt weasel, I hope you're happy.
  38. "Is this web site selling something"? by Animats · · Score: 3, Insightful
    Search engine spiders need to understand more about sites. Things like this:
    • The site is selling something.
    • The page is composed of multiple unrelated articles or ads, each one of which should be viewed as a separate entity for search purposes.
    • The page is part of a blog.
    • Content on this site duplicates that found on other sites.
    • The site is owned by an organization with a known Dun and Bradstreet number. (If a site is selling something, and its Whois info doesn't match the DNB corporation database, it should be downgraded in search position. This would encourage honest Whois info.)
    1. Re:"Is this web site selling something"? by FreshFunk510 · · Score: 1

      This reminds me of tech magazines (in printed form) today. You can easily pick up a tech business or tech computer magazine and you'll see these "articles" with a tiny label along the top saying "special advertising section". They are basically advertisements made to look like articles. Anyway, this is the first thing that comes to mind when I read your first comment that spiders need to distinguish when a site is selling something and when it isn't. The point is that even for humans that's hard to tell.

      On a broader outlook, look at media. Media is always selling something even if it's CNN News.

      I suppose, though, you could distinguish between an "article" and an ecommerce page.

      --


      "Injustice anywhere is a threat to justice everywhere." - Martin Luther King, Jr.
    2. Re:"Is this web site selling something"? by Animats · · Score: 1
      A good way to find out if a site is selling something is by looking for links that lead to forms that take credit cards.

      That's a good spam-filtering algorithm, too. As I keep telling people who fight spam, "follow the money". Quit worrying about where the spam is coming from. Follow where the money goes.
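
As a rough illustration of the "forms that take credit cards" check, here is a minimal Python sketch using only the standard library. It inspects one already-fetched page rather than following links, and the field-name pattern is an assumed heuristic, not a tested rule; it would miss sites that hand payment off to a third party and could misfire on names like "postcard".

```python
import re
from html.parser import HTMLParser

# Assumed heuristic: field names or ids that hint at card details.
CARD_HINT = re.compile(r"card|cc[-_]?num|cvv|cvc", re.IGNORECASE)


class PaymentFormDetector(HTMLParser):
    """Flags a page if a <form> contains an input that looks like a card field."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag in ("input", "select") and self.in_form:
            attributes = dict(attrs)
            label = " ".join(
                filter(None, (attributes.get("name"), attributes.get("id")))
            )
            if CARD_HINT.search(label):
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def takes_credit_cards(html):
    """Return True if the HTML contains a form with a card-like field."""
    detector = PaymentFormDetector()
    detector.feed(html)
    return detector.found
```

A spider would call `takes_credit_cards` on each page reachable from the site's links and treat any hit as evidence that the site is selling something.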

  39. SCO by Zork+the+Almighty · · Score: 4, Funny

    IEEE reports that the first commercial use will be to track public opinion for companies.

    Searching "SCO"
    Found "Slashdot"
    ERROR arithmetic underflow.

    --

    In Soviet America the banks rob you!
  40. CrapFountain by s4m7 · · Score: 4, Funny

    Here's how it works:

    Executive Bob, who's paid IBM $150,000 for his enterprise license of webfountain, enters into his webfountain search box: "Pink the musician, not the color"

    IBM's powerful software parses this command into "pink music -color" and passes it to google, retrieves the results, removes Google's paid ads and replaces them with IBM's paid ads. The content is then served to Executive Bob, who shouts: "EUREKA" since within the top ten search results he finds "NUDE PICTURES OF RAPPER PINK!"

    IBM then lands a lucrative support contract with Executive Bob to remove all the viruses and spyware from his desktop PC. Rinse and Repeat.

    --
    This comment is fully compliant with RFC 527.
  41. Half a football field? by AndroidCat · · Score: 4, Interesting
    (Imperial or metric football fields?)
    IBM's breakthrough is called WebFountain--half a football field's worth of rack-mounted processors, routers, and disk drives running a huge menagerie of programs.
    Later:
    It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.

    To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4-5 terabytes of disk storage.

    That's a lot of stuff, but half a football field? Possibly they're including cubicles for the staff or did they just inherit some old Big Iron space that was that large?
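
For scale, the main-cluster figures quoted above work out as follows. This is a back-of-the-envelope sketch that assumes the "4-5 terabytes" figure is per rack, as the quoted sentence reads, and it leaves out the separate crawler machines.

```python
# Figures taken from the quoted article text.
RACKS = 32
COMPUTERS_PER_RACK = 8
PROCESSORS_PER_COMPUTER = 2                 # dual-processor Xeons
TB_PER_RACK_LOW, TB_PER_RACK_HIGH = 4, 5    # assumed to be per rack

computers = RACKS * COMPUTERS_PER_RACK
processors = computers * PROCESSORS_PER_COMPUTER
storage_tb = (RACKS * TB_PER_RACK_LOW, RACKS * TB_PER_RACK_HIGH)

print(f"{computers} computers, {processors} processors, "
      f"{storage_tb[0]}-{storage_tb[1]} TB of disk")
```

So the main cluster is 256 machines in 32 racks, which does seem small for half a football field.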
    --
    One line blog. I hear that they're called Twitters now.
  42. Prior art :o) by Mr_Silver · · Score: 3, Funny
    IEEE reports that the first commercial use will be to track public opinion for companies

    You can do that already with Google:

    A search for "Microsoft is evil" gets you 600,000 pages.

    A search for "Microsoft is good" gets you 3,590,000 pages.

    Therefore Microsoft is more good than evil.

    Err ... that wasn't quite the answer I was expecting.

    (cue sounds of joke falling apart...)

    --
    Avantslash - View Slashdot cleanly on your mobile phone.
    1. Re:Prior art :o) by CaptnMArk · · Score: 1

      The funny thing is: if you search for the above with usenet google will suggest an interesting list of newsgroups.

    2. Re:Prior art :o) by sharkey · · Score: 1
      If it'll make you feel any better:

      Microsoft sucks brings back over 8000 results.
      Microsoft does not suck returns about 16.

      Therefore, Microsoft sucks more than it doesn't.

      --
      "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
    3. Re:Prior art :o) by imr · · Score: 1

      Nope,
      -searches for
      microsoft is evil
      and
      microsoft is good
      produce such results.
      BUT
      -searches for
      "microsoft is evil"
      and
      "microsoft is good"
      produce a different result:
      2070 and 1020 respectively, showing that:
      1/ microsoft IS evil.
      2/ good prevails over evil on the internet.

    4. Re:Prior art :o) by Anonymous Coward · · Score: 0
      Therefore Microsoft is more good than evil.

      I wouldn't worry about this.

      Dark Helmet: So, Lone Star, now you see that evil will always triumph because good is dumb.

    5. Re:Prior art :o) by radarman1968 · · Score: 1

      If you change your search to "Microsoft is bad" you will get a much more accurate 3,350,000 results. I guess this means that Microsoft is actually borderline.

  43. BSD or GPL equivalent? by cowbrain_jimbo_ox · · Score: 0, Troll

    Sounds good. There ought to be something similar under BSD or GPL.

    Political dissidents would definitely benefit from this kind of super search system, and so do normal users like kids doing searches for their homework.

    We need our own "commie" version.

    I wish I was fluent in computer languages or else I'd be the first one to start this up under BSD licence.

    Any suggestions as to what language I need to learn to develop this kind of search engine?
    It's gotta have a capability like freenet to distribute load on the network and the system while keeping users anonymous, since private users won't have the resources to come up with 1000s of servers. I'm thinking on the lines of XML.

  44. Obligatory SCO poke. by i_r_sensitive · · Score: 1
    Damnit, too busy reading stupid poll posts, damnit dmanit dmanit.

    You've won this round, Lonestar...

    --
    "Talk minus action equals nothing" - Joey Shithead, D.O.A.
    "Talk minus action equals /." -
  45. Potential money saver: Differential buzz by benja · · Score: 2, Insightful
    The head of a research and development department could feed WebFountain all the e-mails, reports, PowerPoint presentations, and so on that her employees produced in the last six months. From this, WebFountain could give her a list of technologies that the department was paying attention to. She could then compare this list to the technologies in her sector that were creating a buzz online. Discrepancies between the two lists would be worth asking her managers about, allowing her to know whether or not the department was ahead of the market or falling dangerously behind.

    This is a potentially very useful money-saver. Currently companies employ hordes of middle-management people who do little else than detect discrepancies between the technologies that their department is focusing on and those that are currently all the buzz. Now we can create an automatic boss that sends out e-mails like, "What's this IP-over-XML thing and why don't we use it and how soon can you have all our critical systems migrated to it?"

  46. Bad thing. by irokitt · · Score: 0, Flamebait

    This sounds like just another tool for the RIAA to use against us. This time, anyone with an apache server account and some mp3s is vulnerable, not just the P2P guys.

    --
    If my answers frighten you, stop asking scary questions.
  47. It already exists by claudebbg · · Score: 3, Interesting

    I've already seen/heard of such systems, basically in the Business Intelligence field.
    In England, a system like Autonomy (used by the police at the beginning) can crawl a mass of information with dedicated spiders (not only for the web, but also commercial databases, files...). Then, it structures all the content into themes with links and proximity.
    I personally tested it some years ago, feeding it with information websites and asking for articles "close to" another one. The efficiency was amazing because it was able to tell the difference between similar terms that have really different meanings depending on the context. Usually, search engines are wrong because they can't use the context.
    I also set up some "agents" for recurrent searches (an agent is basically a search plus some training, letting Autonomy know which found documents are close and which are not) and it was able to propose a really good press review every day with nearly no wrong documents.
    As a complement to Autonomy, I know a BI team that uses some other tools like Pericles to feed the searches with "relevant" content, basically themes that are "appearing" in the group of documents and are close to some interests.
    Such BI tools can already provide the kind of information cited, like an opinion movement against a company detected in the newsgroups or on some websites. And IBM is certainly on track to improve such tools with the techniques of their labs.
    I hope these tools won't be limited to PR articles on the web and/or private use by big corporations, because otherwise it could just be another Echelon with all its bad consequences:
    - bad use of public information
    - paranoia fed by unfounded scares
    - public/corp. power against the citizens
    If tools like Echelon could be used by everybody, they would have to leave much more privacy to citizens, and the public leaders would have to explain the investments.

  48. Scanalyzer by Anonymous Coward · · Score: 1, Insightful

    Reminds me of the Scanalyzer service in John Brunner's book "Stand On Zanzibar." The supercomputer Shalmaneser analyzed millions of inputs and tried to make sense of them.

    1. Re:Scanalyzer by Anonymous Coward · · Score: 0

      "Christ what an imagination I've got!"

  49. Sounds like CYC by Sanity · · Score: 2, Interesting
    CYC have been trying to collect all human knowledge for the last few decades and feed it into a knowledge base. They have even open sourced part of their database.

    Despite the apparent promise of the project, it is difficult to find actual examples of it doing really cool stuff.

    1. Re:Sounds like CYC by Anonymous Coward · · Score: 0

      I actually attended a demo of it. They have some interesting stuff going on. During the demo, the presenter gave it queries about current news (plain English, IIRC), and the program answered the queries. Still needs a little more work to be done on it, but it looks nice.

  50. semantic web by jonasmit · · Score: 2, Informative

    XML simply isn't enough. Structure != Meaning. Meaning must be inserted somewhere by someone. Trying to interpret HTML/natural language to form structured documents is a daunting task. If you want real meaning then the data needs to be described or translated into a meaningful form like RDF (yes, represented by XML) when it is created so that intelligent agents such as this can *understand* the data. RDF uses triples (think graphs) to describe relationships making use of URIs: Subject--Predicate--Object ...etc. Now think about how to merge all this information - with well-formed rules RDF documents merge great: with traditional structured XML the merged docs would not be well-formed. Now they can be and XML can be generated for standard XML rendering. Take a look at the Semantic Web.
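
The merge property is easy to see if each document is treated as a plain set of Subject--Predicate--Object triples. The sketch below uses bare Python tuples rather than a real RDF library, and every URI in it is a hypothetical example.org placeholder.

```python
# Hypothetical URIs, for illustration only.
ALICE = "http://example.org/people/alice"
TYPE = "http://example.org/terms/type"
PERSON = "http://example.org/terms/Person"
NAME = "http://example.org/terms/name"
WORKS_FOR = "http://example.org/terms/worksFor"
ACME = "http://example.org/orgs/acme"

# Two independently written documents describing the same subject.
doc_a = {
    (ALICE, TYPE, PERSON),
    (ALICE, NAME, "Alice"),
}
doc_b = {
    (ALICE, TYPE, PERSON),      # same statement as in doc_a
    (ALICE, WORKS_FOR, ACME),
}

# Merging is set union: shared statements collapse, new ones are added,
# and the result is still a valid set of triples.
merged = doc_a | doc_b


def objects(graph, subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


print(len(merged))                        # 3
print(objects(merged, ALICE, WORKS_FOR))  # {'http://example.org/orgs/acme'}
```

Because both documents use the same URI for Person, the merged graph keeps a single, shared notion of what a Person is, with no schema reconciliation step.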

    1. Re:semantic web by jonasmit · · Score: 1

      I lost some meaning somehow merging docs: merge two different docs and you preserve the meaning of what a Person is.

  51. Total Information Awareness by Anonymous Coward · · Score: 0

    When I heard about TIA I figured they would do it this way.

    Can a site have a copyright saying "reselling my data prohibited"? Then IBM can't give it to customers.

    Also, I look forward to the system being manipulated for fun and profit.

  52. Pink!? What a stupid search term... by Anonymous Coward · · Score: 0

    Well, if you're daft enough to only enter 'Pink' into a search engine and expect it to know you mean the singer, not the colour, you are daft.

    Search engines need to be used properly in order to get the best results. In my quotes above, the only thing that might mean the singer over the colour to a search engine is the capital letter at the beginning, but seriously people, who the hell uses generic words in search engines these days and expects to get great results?

  53. social trends analysis by WebTurtle · · Score: 1

    This technology should be made available to social scientists, anthropologists, cultural critics, etc. so that current social trends can be analyzed. Perhaps IBM would be kind enough to provide free access to this system to Universities?

    It is a pity that the WebFountain system is geared toward corporate users. Of course, there must be some ROI... but, still it makes me sad that every new technology seems to be driven by corporate desire for good PR and world domination.

    Interestingly, this article comes out right after Slashdot's coverage of the O'Reilly GeekCamp, in which the CNN article mentions the following relevant projects:

    Jeremy Zawodny teamed up with David Sifry, the founder and CEO of Technorati, a popular search engine for blogs, and others to propose a new way to organize the thousands of newsfeeds available from media outlets around the world. The new standard they hacked up, FDML, may well be adopted by major corporations and news outlets soon after this column hits newsstands.

    Simon Cozens, an author and programmer from England, presented Twingle, a program that helps you find things in your e-mail archives (who doesn't need that?).

    Also receiving good geek buzz was an application called Dashboard, which automatically scans and indexes your hard drive, then displays documents related to whatever you're working on.

    So, perhaps the Open Source community will be able to create some similar technology that is freely available for researchers, writers, scientists, etc. to use.

    --
    ------- "One of the joys of travel is visiting new towns and meeting new people." -- G. KHAN
  54. Encourage Human Markup Discourage Machine MU by leoaugust · · Score: 2, Informative
    Analytic tools can ferret out patterns in, say, a sales receipt database, so that a retail store might see that people tend to buy certain products together and that offering a package deal would help sales. ...
    This urban-legend example of people buying beers and diapers at the same time (hence the sections for beer and diapers should be close by, at least on Saturdays) has been beaten to death and beyond.
    A sentence that originally read "We visited Mount Fuji and took some photos" would become something like "We visited Mount Fuji and took some photos."
    I am not sure what the tags around "Mount Fuji" have added in this example. Only thing I can think of is that these are similar to the "smart-tags" of MS office that pre-populate straight forward relational data like a contact's email or address. Personally I would do a search for the latitude/longitude when I need this info in Google as "mount fuji latitude" and the first result I get is the one that gives me the latitude and longitude of Mount Fuji. What is the point of pre-feeding this info during the "markup"? And it bears repeating here that rather than complaining about results that you get with one or two keywords, think about adding keywords to narrow and specialize the search. Paris Hilton video is better than just Paris Hilton which might unnecessarily show you stuff about hotels.
    By the time the annotators have finished annotating a document, it can be up to 10 times longer than the original.
    So, a person was probably talking about a molehill, and the machine markup has changed that into a mountain. How much of the extra tags (even accounting for the verbosity of XML) have really added "meaning" to the document. How much of the "meaning" was intended and how much has been force-fed by the machine ?
    These heavily annotated pages are not intended for human eyes; rather, they provide material that the analytic tools can get their teeth into.
    This is where I think that they are using XML but going away from the XML concept. It was supposed to be human readable. If the IBM research group started focusing on how to help people make sense of the 1x material and 10x markup, they will be introducing the person at the right time in the analysis process - introducing a person at the last stage, esp in deriving "meaning" may not be the best strategy. The markups are just "filters" thru which when the material is viewed a lot of context becomes apparent. What we need to do is to let people start with the filters and then look for the material (top-down) or start with the material and look for filters (bottom-up) - sort of a more iterative procedure involving both these approaches.

    Google lets you do a keyword search (bottom-up) or via the directories - DMOZ (top-down). Vivisimo and Grokker were recently discussed on slashdot where they were creating dynamic categorizations, i.e. bottom-up. I think it would be better to let people analyze the markup (directory/top-down approach) or analyze the material (keyword/bottom-up) rather than mixing up the two and presenting the "results" to the person.

    E-mails or instant messages can't be labeled in this way without destroying the ease of use that is the hallmark of these ad hoc communications; who would bother to add XML labels to a quick e-mail to a colleague?
    This is the second place where energies should be focused. Where the document is created may mean a lot. It could be that a new file inherits the path (hence context) of the directory in which I create it, or it could be as simple as this: on the top-right of the screen I create personal files, on the bottom right I create files about sports, on the left-bottom-middle I create files about java .. etc. I think this beats any day the bot-annotators that come after me and add 10 times more markup than the whole of the quick email that I sent to a colleague.
    --
    To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
    1. Re:Encourage Human Markup Discourage Machine MU by orac2 · · Score: 1

      I think you are missing the point. The tags are not for people, but for data analysis software. Comparing a search engine to a general analysis platform (which is what WebFountain is) is like comparing apples to oranges. The entire apparatus (WebFountain plus data mining software) is designed to produce high level reports that talk about data in the aggregate.

      --
      "Just once, I'd like to meet an alien menace that wasn't immune to bullets." -- The Brigadier, Dr. Who
    2. Re:Encourage Human Markup Discourage Machine MU by leoaugust · · Score: 1
      The entire apparatus (WebFountain plus data mining software) is designed to produce high level reports that talk about data in the aggregate.

      The tags are not for people, but for data analysis software

      My perspective is from the point of view of a business man trying to use the "data." This data must have some correlation to reality of the business, and most preferably illustrate some correlation or cause-effect that I could use to predict the future a little more accurately. This is where the theory and action meet. This is how I am going to make money off your data mining.

      Also note that people rarely make cerebral decisions based purely on the elegance of the logic. It has to jibe with their instincts, and in that sense they are behaving more like Kasparov than Deep Blue when they are playing the chess of business. Unlike the almost infinite possibilities that confronted Deep Blue before each move, Kasparov just began with a few possibilities. Hence, instead of just telling the almost-infinite "properties" of the data in the aggregate, someone will have to translate what this data means into a few options based on which I can make choices...

      And what the data means can be separated into two parts. One is the data and the second is the markup on it. You and the business man can agree upon the data, but if they can't correlate your markup to their reality-based parameters, they will never be able to understand all the analytical models or the analysis done with the markup to come to the conclusions that you did.

      So why not just begin with the markup that they can relate to. So that when you get down to explaining what the data can do for them, you already agree upon the data, and if you have the same terminology of the markup - half the battle is won.

      On the other hand if you start up with a markup that is not human understandable, you are handicapping yourself.

      All I am saying is that it is nice to peep out of the lab and see what is going to happen to the baby out in the business world. It is going to meet reality and the reality is defined by the people not machines.

      --
      To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
  55. Re:Expensive .. 4..Profit! by AndroidCat · · Score: 1

    The really nice part is that they can use their 0.5 FBFs of stuff to data-mine the Internet once, and then sell the work over and over again. (There's a little work to sort/package the data for each client, but trivial compared to crunching and tagging the Internet in the first place.)

    --
    One line blog. I hear that they're called Twitters now.
  56. Hey Execs.. Use Googlism by xenolaeus · · Score: 1

    http://www.googlism.com/

    Where's my $300,000?

  57. Yeah by wiredog · · Score: 1

    Good point.

  58. IBM fixes Slashdot? by Dark$ide · · Score: 1

    Does this mean that the folks at IBM Almaden can fix slashdot so we don't get all that unstructured crap from the first posters when a new topic arrives?

    --

    Sigs. We don't need no steenking sigs.

  59. Colour, singer OR band... by WebCowboy · · Score: 2, Funny

    Wonder if this "web fountain" will be smart enough to determine the context to THAT level.

    A painter thinks "colour" when he sees the word.

    A slashdot reader (and many other grown-ups) thinks of the band "Pink Floyd".

    If you are (or are the parent of) a teen-aged girl you think of neither...you think of the anti-Britney pop-star princess of angst Pink

    1. Re:Colour, singer OR band... by Anonymous Coward · · Score: 0

      Or the Pink Panther...

      Or Pinky & the Brain...

      Or damn commie pinkos...

      Or my little pinky.

  60. Actually... by Kjella · · Score: 1

    ...they were used in calibration tests... you know, find the highs and lows of the system.

    Kjella

    --
    Live today, because you never know what tomorrow brings
  61. Riding the Gravy Train... by Pac · · Score: 1

    It's a reference to Pink Floyd's Have a Cigar lyrics, "And by the way, which one is Pink?"

    1. Re:Riding the Gravy Train... by Thing+1 · · Score: 1

      Contract that verb, you insensitive clod!

      --
      I feel fantastic, and I'm still alive.
  62. Another pink by Pink+Eater · · Score: 1

    Pink can also refer to female genitalia, hence my name. Mmmm mmmm good.

    1. Re:Another pink by Anonymous Coward · · Score: 0

      Also a song by Aerosmith on that very subject.

    2. Re:Another pink by Anonymous Coward · · Score: 0

      ..if you like the taste of rotten fish, that is.

  63. Slashdot headline 2005 by xmedar · · Score: 1

    Pr0nfountain leads to sticky keyboards

    --
    Any sufficiently advanced man is indistinguishable from God
  64. pink is a color by Anonymous Coward · · Score: 0

    If you've ever heard her sing you'd know that pink is a color.

  65. Re:Content Chaos by Anonymous Coward · · Score: 0

    So, by extension, Baby Bush speaks black streetgang loserspeak! Probably wrong but about his intelligence level.

  66. 30% by Anonymous Coward · · Score: 0

    30% of the net is Porn.
    30% of the net is dupes.

    How much of the net is porn dupes?

  67. the ultimate irony by Anonymous Coward · · Score: 0

    from IBM? thats rich. their website is so bad I have to use google to locate stuff on it, even if i know it exists.

  68. Re:A good idea for search engines to follow? by Anonymous Coward · · Score: 0
    Well I'm sorry. I thought lemonparty.org (type it yourself if you're sick enough) was as common knowledge as the goatse guy. But how do you think I feel? As of right now, I'm the number one reference Google gives for the damn thing. Back before October 2003, I'd never even heard of it myself. Then I made a stupid comment about it, and today I'm unintentionally, undeservingly, and ungratefully holding the #1 position for anyone searching for lemonparty.jpg on Google. Short of deleting my own stuff, how do I make them stop?!

    Look, I know I run a lame-ass excuse for a website, (yeah, quote me on that) and it's not even meant to be viewed by children or prudish adults, but I'll be happier when you don't see my website under this link. In the meantime, I hope IBM's "Web Fountain" doesn't troll over my site and determine that it's about Lycoris Screen Resolution either.

    Tell me why Google is my friend again?

  69. Is "pink" the singer or the color..... by sleepophile · · Score: 1

    ............or the song?

  70. Close to home? by BasculeTheFule · · Score: 1

    Can't think why they're putting money and effort into this project. BTW, anyone tried to find something on the IBM site recently?

  71. Please by _Sexy_Pants_ · · Score: 1

    Slashdot should be a refuge from mentions of artists of such caliber as Pink.

    --
    Look it's a joke about my sig IN MY SIG! LOL!
  72. on the desktop by goon · · Score: 1

    utilising your own system is a start. on the desktop there's nat *Ximian* friedmans Dashboard

    --
    peterrenshaw ~ Another Scrappy Startup
  73. A rack of servers can't beat good old META data by prototype · · Score: 1
    Trying to intelligently search for information in the universe is an age-old problem. How can my system be smart enough to tell the difference between Pink the singer and pink the color (or colour if you prefer)? Basically, it can't.

    Nothing is smart enough to tell the difference because the content is contextual (hence the name). In a corporation like the one I'm at now (a class A railway) we have hundreds of terabytes of information flowing through our systems on a regular basis. Trying to track it, categorize it, and make sense of what's there is next to impossible. Yet we still keep trying.

    I've been trying to architect the information gathering myself in a manual way using a distributed model. Rather than having one system (or hundreds of systems depending on how you look at it) go out and farm the information, have each system submit themselves (automated if such a way exists) to a central repository so that it makes sense. Like I said, any entity is the best thing to know about itself and how it should be classified.

    The Trove system from SourceForge is such a beast. Any project submits itself to the trove for categorization. If you abstract that concept up a level, you get a general classification system that lets you not only search based on it, but also filter the information and allow something to be categorized in multiple dimensions. It's not just about one listing anymore, because Pink the singer could be listed under Rock, Pop and Female. You can't choose just one. The trove system as it is isn't the most scalable in the world, but with a little work it could be, and it could be generic enough to classify documents, objects, people, whatever. just a thought...
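
A self-submitted, multi-dimensional classification along these lines can be sketched in a few lines of Python. This is not SourceForge's actual Trove code; the `Catalog` class, its methods, and the dimension names are all hypothetical.

```python
from collections import defaultdict


class Catalog:
    """Toy central repository: items submit their own (dimension, value) labels."""

    def __init__(self):
        self._by_category = defaultdict(set)

    def submit(self, item, categories):
        """Register an item under every (dimension, value) pair it declares."""
        for dimension, value in categories:
            self._by_category[(dimension, value)].add(item)

    def search(self, *filters):
        """Return the items that match ALL of the given (dimension, value) filters."""
        matches = [self._by_category.get(f, set()) for f in filters]
        return set.intersection(*matches) if matches else set()


catalog = Catalog()
catalog.submit("Pink (singer)", [("genre", "Rock"), ("genre", "Pop"),
                                 ("artist", "Female")])
catalog.submit("pink (color)", [("topic", "Color")])
catalog.submit("Pink Floyd", [("genre", "Rock"), ("artist", "Band")])

print(catalog.search(("genre", "Rock")))                        # two results
print(catalog.search(("genre", "Rock"), ("artist", "Female")))  # only the singer
```

Each extra filter narrows the result set, which is how the same item can live under Rock, Pop and Female at once without anyone having to pick a single listing.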

  74. Google fight by Kidbro · · Score: 1

    That sounds a whole lot like Google fight :)

    This wasn't the answer I was hoping for either ;)

    1. Re:Google fight by Anonymous Coward · · Score: 0

      vi is a word in more languages than emacs is.
      However, to go on with the joke, check this out..

  75. ... and you can call it ... by Anonymous Coward · · Score: 0

    GOPHER! Quick, patent that idea before someone else thinks of it first, oh, wait .....

  76. Photo of the guys behind it all by fingerfucker · · Score: 1


    http://www.research.ibm.com/resources/news/images/20030918_andrew_bob.jpg

    On the left:

    Andrew Tomkins, WebFountain Chief Scientist

    On the right:

    Bob Carlson, VP of WebFountain at the Almaden Research Center

  77. Re:Not necessarily by Anonymous Coward · · Score: 0

    Spamdexing tools used to push page rankings basically spout babble. Even the most naive semantic parser will choke on it and spit that stuff out as rubbish.

    Of course I don't doubt you are right about the motivation, and one would expect to see them come up with nasty little tricks like taking public domain documents and replacing keywords with their own (so as to get semantically well-formed data that is actually just a rankings magnet).

    But, in the final analysis 'content costs' and even robots can be coded to reject nonsense.

  78. Provably, the guys behind this are doofuses by You're+All+Wrong · · Score: 1

    "the first commercial use will be to track public opinion for companies."

    Have they learnt _nothing_ from the google-bombs?
    As soon as people find out what algorithm they use, there'll be someone coordinating abuse thereof.

    YAW.

    --
    Your head of state is a corrupt weasel, I hope you're happy.
  79. This is hysterically funny! by Anonymous Coward · · Score: 0

    Let me see if I understand this correctly. The makers of Lotus Notes, WebSphere, and what could very well be the WORST website for developers EVER think they can somehow tackle content chaos? Hahahaha. It hurts to laugh this hard. "Trust you? How can I trust a man who can't even trust his own pants?" - Henry Fonda, Once Upon a Time in the West

  80. Dun and Bradstreet number by rark · · Score: 1

    > The site is owned by an organization with a
    > known Dun and Bradstreet number. (If a site is
    > selling something, and its Whois info doesn't
    > match the DNB corporation database, it should
    > be downgraded in search position. This would
    > encourage honest Whois info.)

    This may be a question born of serious ignorance. If so, I'd really appreciate some enlightenment.

    This is also not so theoretical for me, as I am currently privately developing a product that I will eventually be selling online.

    However, until your post, I had not heard of Dun and Bradstreet. I have gone to their website, and they apparently provide a number of services, which can be broadly categorized as marketing advice/consulting and credit advice/consulting. For my particular small business, neither of these services is useful. I'm funding out of pocket (so my business's credit is not of interest), and I don't plan on extending credit to my customers (there's no need for it). The market for my product is very small and specialized, and I'm fairly confident in stating that I know as much about it as Dun and Bradstreet or pretty much anyone else, modulo a few other people within the same community.

    So my analysis, as a prospective small business owner, is that a Dun and Bradstreet number would be useless. Why would I want to get one?

    Now, this would be utterly offtopic, except that you suggest that a Dun and Bradstreet number would be a reliable way of confirming whois info. When I get a domain name for this venture (not there yet) I do intend to provide accurate whois info and I generally agree that accurate whois info is a good thing and fairly important. However, for me, a DNB number would be a useless expense (actually, it appears it may be free? I distrust anyone who requires me to give out personal info before giving me prices, and apparently they want my email address first. However, if it's free, then one considers questioning the reliability of their information) and, while I'm not the majority case, I'm also not entirely unique.

    If one were to implement this, this would mean that many businesses who were quite legitimate and had accurate whois information would be classed as if they did not have accurate whois information. This strikes me as a serious weakness.
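
    To make the weakness concrete, here is a toy version of the quoted rule. Everything in it is hypothetical: the `ranking_multiplier` function, the field names, the 0.5 penalty, and the lookup table standing in for the DNB database are all assumptions, not anything a real search engine or Dun and Bradstreet provides.

```python
def ranking_multiplier(site, dnb_records):
    """Apply the quoted rule: downgrade sellers whose Whois info has no DNB match.

    `dnb_records` is a made-up mapping of organization name -> registered address.
    """
    if not site["sells_something"]:
        return 1.0  # the rule only targets sites that sell things
    registered_address = dnb_records.get(site["whois_org"])
    if registered_address != site["whois_address"]:
        return 0.5  # hypothetical penalty: missing or mismatched DNB entry
    return 1.0


dnb_records = {"Big Corp": "1 Main St"}

big_corp = {"whois_org": "Big Corp", "whois_address": "1 Main St", "sells_something": True}
# A small business with perfectly accurate Whois info but no DNB number.
small_shop = {"whois_org": "Tiny Shop", "whois_address": "9 Side Rd", "sells_something": True}

print(ranking_multiplier(big_corp, dnb_records))
print(ranking_multiplier(small_shop, dnb_records))
```

    The rule cannot tell "honest but not in the database" apart from "dishonest", so the small shop takes the same penalty a spammer would.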

  81. ontologies by carldot67 · · Score: 1
    My personal belief for some time has been that
    the answer lies in the use of linked ontological
    domains coupled with Bayesian stats overlying
    graph-based storage. Graph theory stuff.


    Lots of technical issues to get it implemented
    but I think this is the way to go. Of anyone
    IBM have the resources to make it go.


    Anyone interested might want to look at
    DAML + OIL www.daml.org.

    --
    I wish it was Friday, but I don't want to wish my life away. So I wish it was last Friday.
  82. Re:Content Chaos by Anonymous Coward · · Score: 0

    Ever considered just how ridiculous you urban rap hiphop black streetgang wannabes really sound? Or look. Stan Laurel wearing that phat fuck Oliver Hardy's pants! Dum nigurz!

  83. Re:A good idea for search engines to follow? by Anonymous Coward · · Score: 0
    Wow... Happy Day! My wish was granted! My site is no longer listed under those searches. I don't know who or what was responsible, but thank you! =-)

    PS- I'm abandoning my account. I think the moderation on this site is rigged. How did I get modded overrated when I was never modded up to begin with?