Slashdot Mirror


Learning About Full-text Search

An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."

10 of 140 comments

  1. Re:poor guy by martingunnarsson · · Score: 4, Insightful

    If Google can cache pages and put them online, so should Slashdot. People say copyright issues would be a problem, but in that case, why is Google's online cache any better?

    --
    Martin
  2. Re:Anti-XML by anomalous+cohort · · Score: 4, Insightful

    From the google cache...

    searching for words isn't really what you want to do. You'd like to search for ideas, for concepts, for solutions, for answers.

    That is why he is going in an XML direction. The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized because, well, humans are highly non-normalized.

    I think that he should look at some work by a different Tim: Tim Berners-Lee's Semantic Web.

  3. Re:poor guy by davew2040 · · Score: 4, Insightful

    And they considered incorrectly.

  4. Re:re-inventing the wheel by Anonymous Coward · · Score: 4, Insightful

    Try reading the articles/essays. Knuth's vol 3 is about comparison search, not full-text search.
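
    To make the distinction concrete, here is a minimal Python sketch. It is not taken from the essays or from Knuth. Comparison search locates a whole key in a sorted collection, while full-text search finds the documents that contain a word, commonly through an inverted index. The sample documents and the names `comparison_search`, `build_inverted_index`, and `full_text_search` are invented for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def comparison_search(sorted_keys, key):
    """Binary search: locate a whole key by comparing it against sorted keys."""
    i = bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else -1


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def full_text_search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, for illustration only.
docs = {
    1: "searching for words in text",
    2: "sorting and searching keys",
    3: "markup for text",
}
index = build_inverted_index(docs)
```

    The first function only answers "is this exact key present, and where?"; the second structure answers "which documents mention these words?", which is the problem the essays are about.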

  5. Re:re-inventing the wheel by getarun_vr · · Score: 2, Insightful

    Maybe search technology has changed a lot since Knuth's day. If one cursorily glances through the last couple of journals on Information Search and Retrieval, one cannot help noticing the heavy influence of PageRank (Google's own technology). Thankfully the algorithm is well known. On the flip side, critics have often asked whether such algorithms should be published at all: bloggers have demonstrated that even Google rankings can be rigged... Personally, I would choose the open architecture philosophy, due to parallels with Bruce Schneier's ideas on cryptography. A peer-reviewed system is always better than a closed proprietary system.
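
    For readers who have not seen it, the published idea behind PageRank can be sketched as a power iteration over a link graph. This is a simplified textbook formulation, not Google's production system. The damping factor of 0.85 is the commonly cited default, and the link graph and the `pagerank` function below are hypothetical, invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: several blogs all link to one target page.
links = {
    "blog_a": ["target"],
    "blog_b": ["target"],
    "blog_c": ["target", "blog_a"],
    "target": ["blog_a"],
}
ranks = pagerank(links)
```

    In this toy graph, `target` ends up ranked above the blogs that nobody links to, which is the mechanism the rigging complaint relies on: enough coordinated inbound links move a page up.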

  6. Re:poor guy by ihummel · · Score: 2, Insightful

    Google is Google and Slashdot is Slashdot.

    But even if the issue of liability were taken off the table, they would still have to get off of their metaphorical butts and set up a caching system. I don't know if there is any usable open-source system currently in existence, but if not, they would either have to code it themselves or adapt something already out there that doesn't serve their needs. Disk space isn't really an issue, as the commenting system takes a lot more space than the cache would (assuming they didn't mirror ISOs or anything silly like that).

  7. Re:Anti-XML by gorilla · · Score: 2, Insightful

    Call me stupid if you like, but I don't see how the representation of the data helps to search for ideas, concepts, etc. Regardless of how the text is stored, unless you have a human do a lot of markup on the text, you're going to have a problem extracting the ideas from the text. And by markup I don't mean Heading, I mean someone entering what the ideas, concepts, etc. are for each part of the text - which can be done equally easily in a traditional database as in an XML document.
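
    The point can be illustrated with a small Python sketch that stores the same hand-entered concept labels once in a relational table and once as XML attributes, then runs the same concept query against each. The passages, the labels, and the table and element names are all hypothetical; only the standard-library `sqlite3` and `xml.etree.ElementTree` modules are used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical hand-entered annotations: (passage id, text, concept).
annotations = [
    (1, "Searching for words is not what you want.", "search"),
    (2, "Humans are highly non-normalized.", "semantics"),
    (3, "An inverted index maps words to documents.", "search"),
]

# Relational form: one row per annotated passage.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE passage (id INTEGER PRIMARY KEY, body TEXT, concept TEXT)")
db.executemany("INSERT INTO passage VALUES (?, ?, ?)", annotations)
sql_hits = [row[0] for row in db.execute(
    "SELECT id FROM passage WHERE concept = ? ORDER BY id", ("search",))]

# XML form: the same annotation carried as an attribute on each element.
root = ET.Element("doc")
for pid, body, concept in annotations:
    p = ET.SubElement(root, "p", id=str(pid), concept=concept)
    p.text = body
xml_hits = [int(p.get("id")) for p in root.findall(".//p[@concept='search']")]
```

    Both queries return the same passages. In either representation the hard part is the human effort of assigning the labels, not the storage format.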

  8. it's geared for public consumption by Anonymous Coward · · Score: 1, Insightful

    it's geared for public consumption,
    such is the nature of websites,
    so as long as you don't pretend you wrote it,
    it's abundantly clear where the original came from,
    go ahead and mirror (by mirror i mean take a snapshot).

    only if a copyright holder says don't do that should you remove it.

  9. Re:re-inventing the wheel by Anonymous Coward · · Score: 1, Insightful

    You have that backwards. PageRank was heavily influenced by other systems, like Harvest. And full-text search has changed very little since Knuth. For instance, the basic exact string matching algorithms haven't advanced at all.
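
    The comment does not name a particular algorithm. As one classic example of exact string matching, here is a minimal sketch of Knuth-Morris-Pratt in Python. The function names `build_failure_table` and `kmp_search` are chosen for illustration.

```python
def build_failure_table(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start indices of every occurrence of pattern in text."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern without re-reading text.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches
```

    The text is scanned once, left to right, and the failure table lets the matcher reuse what it already matched instead of backing up in the text.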

  10. Re:This technology still exists? by smittyoneeach · · Score: 2, Insightful

    It will thrive until the Next Big Thing(tm) arrives, to "save us from the sad shortcomings of XML".

    XML's only real fault is that it's been oversold, not unlike Object Oriented Programming and Java before it.

    --
    Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear