Slashdot Mirror


Learning About Full-text Search

An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."

44 of 140 comments

  1. Salute by grub · · Score: 2, Funny


    ..and has been /.'ed once or twice..

    You mean two or three times now.

    --
    Trolling is a art,
    1. Re:Salute by antarctican · · Score: 4, Interesting

      ..and has been /.'ed once or twice..

      You mean two or three times now.


      And it's my poor server that has to bear the burden.... but so far it's held up fairly well each time. Pretty good for a Celeron 1.7GHz w/ 256M. :)

      However this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.

      Funny thing is I didn't even know he was /.'ed until he emailed me. I went to check my email (via pine) and the console was as responsive as usual.

      For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content. :)

  2. web page irony by Savatte · · Score: 3, Funny

    He writes about searching technology, but you can't easily search through his writings.

    1. Re:web page irony by Dreadlord · · Score: 5, Funny

      too bad his pages are valid XHTML documents, it would have made an excellent +5 funny comment :(

      --
      The IT section color scheme sucks.
    2. Re:web page irony by Anonymous Coward · · Score: 2, Informative

      they don't, but the parent post is about finding some conflict between the author's pages and articles.
      He's got an article about searching and his pages aren't searchable, and he's got articles about XML, so having non-valid XHTML pages would definitely have been ironic...

    3. Re:web page irony by arrogance · · Score: 3, Interesting

      Well, especially when it's been slashdotted. Here's a google cache hit to part of his writings.

      I agree that it doesn't look to be easy to search around, at least when all you have is a URL to go on (http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC) and Google to find reachable material. I'm also not too sure about using dates as folder names but that's just a personal thing: I think Tim Berners-Lee recommended it at one point in an article, "Cool URIs don't change". He does recommend using "Latest" or some such instead of the creation date in a URI, though, if "there is no reason for the persistence of the URI to outlast the magazine." It might make things easier to search for though, at least if you know when it was created: if the URIs aren't changing then you won't have tons of broken links.

    4. Re:web page irony by Schwarzchild · · Score: 4, Informative
      He writes about searching technology, but you can't easily search through his writings.

      Really? How about search site:tbray.org?

      --

      "sweet dreams are made of this..."

    5. Re:web page irony by Anonymous Coward · · Score: 2, Funny

      Tee hee, I get it now. It reminds me of this time that something didn't happen, but if it had happened, it would have been funny. Ha ha, that still cracks me up. Yes, most amusing.

  3. Hold on there by arvindn · · Score: 5, Funny
    ...has been writing this endless series of essays on search technology since summer. He says he's finished now...

    Finished an endless series?

    1. Re:Hold on there by MooCows · · Score: 4, Funny

      The maximum number of results have been returned.

      --
      The path I walk alone is endlessly long.
      30 minutes by bike, 15 by bus.
  4. poor guy by understyled · · Score: 5, Informative

    i cringe at the bandwidth demands a slashdotting can bring with it. here's google's cache of the page.

    --
    Sig (appended to the end of comments you post, 120 chars)
    1. Re:poor guy by martingunnarsson · · Score: 4, Insightful

      If Google can cache pages and put them online, so should Slashdot. People say copyright issues would be a problem, but in that case, why is Google's online cache any better?

      --
      Martin
    2. Re:poor guy by Arslan+ibn+Da'ud · · Score: 4, Informative
      --

      Practice Kind Randomness and Beautiful Acts of Nonsense.

    3. Re:poor guy by davew2040 · · Score: 4, Insightful

      And they considered incorrectly.

    4. Re:poor guy by martingunnarsson · · Score: 2, Offtopic

      Google isn't asking for permission. Again, Slashdot could obey the rules in robots.txt.
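
      A minimal sketch of what "obeying robots.txt" could look like before a page is cached, using Python's standard urllib.robotparser. The robots.txt text, the bot name and the example.org URLs are made up for illustration; this is not how Slashdot or Google actually does it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real mirror would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def may_cache(url, user_agent="hypothetical-mirror-bot"):
    """Return True if the site's robots.txt rules allow this bot to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

for url in ("http://example.org/ongoing/essay", "http://example.org/private/draft"):
    print(url, "cacheable" if may_cache(url) else "off limits")
```

      A mirror would only store a copy when may_cache() says yes, so a site owner could opt out with one Disallow line.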

      --
      Martin
    5. Re:poor guy by ihummel · · Score: 2, Insightful

      Google is Google and Slashdot is Slashdot.

      But even if the issue of liability were taken off the table, they would still have to get off of their metaphorical butts and set up a caching system. I don't know if there is any usable open-source system currently in existence, but if not, they would either have to code it themselves or adapt something already out there that doesn't serve their needs. Disk space isn't really an issue, as the commenting system takes a lot more space than the cache would (assuming they didn't mirror isos or anything silly like that).

    6. Re:poor guy by johnteslade · · Score: 5, Informative
      The site is still slashdotted. Each of his papers is on a separate page, so here are the Google caches of the individual papers:

      I have to write some crap at the end here so i can get past the "Your comment has too few characters per line" error message.

    7. Re:poor guy by spectre_240sx · · Score: 3, Informative

      I don't know about that. There seem to be too many problems associated with caching. One that comes to my mind is the extra bandwidth that they would have to worry about. An article about the design of the site mentions that just changing over to CSS made a grand savings of 3-14 GB a day, equalling something like $3,600.00 in the end. Now that's just by cutting 2-9KB off every page request. Now, think about them serving (possibly) huge pages from other sites that may not optimize their code... That's a lot of money that Slashdot would have to spend.
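
      A back-of-the-envelope check of those numbers, as a sketch: dividing the quoted daily savings by the quoted per-page savings gives the implied number of page requests per day. Decimal units (1 KB = 1000 bytes) and the 50 KB cached-page size are assumptions for illustration, not figures from the article.

```python
KB = 1_000            # assumption: decimal kilobytes
GB = 1_000_000_000    # assumption: decimal gigabytes

def implied_requests_per_day(saved_per_day, saved_per_page):
    """Page requests per day implied by a daily saving and a per-page saving (bytes)."""
    return saved_per_day / saved_per_page

# Low and high ends of the quoted ranges: 3-14 GB a day, 2-9 KB a page.
low = implied_requests_per_day(3 * GB, 2 * KB)
high = implied_requests_per_day(14 * GB, 9 * KB)

# Hypothetical worst case: every one of those requests serves a 50 KB cached copy.
CACHED_PAGE = 50 * KB
extra_gb_per_day = low * CACHED_PAGE / GB

print(f"implied requests/day: {low:,.0f} to {high:,.0f}")
print(f"extra traffic at 50 KB/page: {extra_gb_per_day:.0f} GB/day")
```

      Both ends of the range work out to roughly 1.5 million requests a day, so the quoted figures are at least self-consistent. Under the 50 KB assumption, serving cached copies at that rate would add about 75 GB a day.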

    8. Re:poor guy by davew2040 · · Score: 2, Funny

      Well then, I guess slashdot would learn firsthand about the slashdot effect!

  5. Interesting stuff! by clifgriffin · · Score: 3, Funny

    Though, I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet.

  6. Anti-XML by MattRog · · Score: 4, Interesting
    Whether there's going to be a lot of XML around in repositories to search. XML these days is more used in interchange rather than archival applications.


    Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

    Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries on them with minimal effort.
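
    A rough sketch of the "chop up HTML pages and stick them in the DBMS" idea. It does not use any vendor's built-in full-text feature; it assumes SQLite (from Python's standard library) and a hand-rolled postings table, and the page URLs and contents are invented.

```python
import re
import sqlite3
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and discards the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_words(html):
    """Strip the markup and split the remaining text into lowercase words."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return re.findall(r"[a-z0-9]+", " ".join(extractor.chunks).lower())

def build_index(pages):
    """Store one (term, url) row per distinct word on each page."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (term TEXT, url TEXT, PRIMARY KEY (term, url))")
    for url, html in pages.items():
        rows = {(word, url) for word in html_to_words(html)}
        db.executemany("INSERT OR IGNORE INTO postings VALUES (?, ?)", rows)
    return db

def search(db, *terms):
    """Return the URLs of pages that contain every one of the given terms."""
    terms = [t.lower() for t in terms]
    placeholders = ",".join("?" * len(terms))
    sql = (f"SELECT url FROM postings WHERE term IN ({placeholders}) "
           "GROUP BY url HAVING COUNT(DISTINCT term) = ? ORDER BY url")
    return [row[0] for row in db.execute(sql, (*terms, len(set(terms))))]

# Invented sample pages.
pages = {
    "/a.html": "<h1>Full-text search</h1><p>Inverted index</p>",
    "/b.html": "<p>XML search</p>",
}
db = build_index(pages)
print(search(db, "search"))
print(search(db, "search", "xml"))
```

    The built-in indexers in the commercial products would replace the postings table and add ranking, stemming and so on; the sketch only shows the shape of the approach.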
    --

    Thanks,
    --
    Matt
    1. Re:Anti-XML by phurley · · Score: 5, Informative

      I agree to a point, but if we are talking about a mixed environment where you are using Oracle, I am using DB2, our friend Bob has his data in a legacy ISAM setup, and a customer wants to integrate a search system across the three systems, they are going to have to write a lot of custom glue.

      If an XML aspect of the data is available (you can still keep it all in Oracle - just provide a "view" of it in XML) from each of us - common search tools and methods can be utilized.
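
      A small sketch of such an XML "view": the data stays in the database and is only rendered as XML on the way out. SQLite stands in for Oracle or DB2 here, and the table, column and element names are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_as_xml(db, table, columns):
    """Render the chosen columns of a table as an XML document (a string).

    table and columns are interpolated into the SQL, so they must come from
    trusted code, never from user input.
    """
    root = ET.Element("records", source=table)
    cursor = db.execute(f"SELECT {', '.join(columns)} FROM {table}")
    for row in cursor:
        record = ET.SubElement(root, "record")
        for column, value in zip(columns, row):
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical data standing in for one participant's database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE essays (title TEXT, year INTEGER)")
db.execute("INSERT INTO essays VALUES ('On Search: Basics', 2003)")

xml_text = rows_as_xml(db, "essays", ["title", "year"])
print(xml_text)
```

      If each system exposed its rows in an agreed shape like this, one search tool could index all three without knowing which engine sits behind each feed.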

      --
      Home Automation & Linux -- now I know I'm a geek
    2. Re:Anti-XML by arrogance · · Score: 4, Interesting

      He even goes so far as to mention that Index Server will search your website: but fails to mention that it does full text searching on your entire file system.

      Unfortunately his site (http://www.tbray.org/ongoing/) seems to be sufficiently disorganized that I have trouble finding out what his real points are, or whether he's addressed all of the issues: for example, I saw no mention of the Semantic Web if his concern is searchability on web documents.

      As a side note, MS SQL is going more and more toward XML, as is the whole .NET framework. This results in richer (read: fatter) data but it does mean that you can store whatever metadata you want along with it.

    3. Re:Anti-XML by anomalous+cohort · · Score: 4, Insightful

      From the google cache...

      searching for words isn't really what you want to do. You'd like to search for ideas, for concepts, for solutions, for answers.

      That is why he is going in an XML direction. The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized because, well, humans are highly non-normalized.

      I think that he should look at some work by a different Tim, the Semantic Web.

    4. Re:Anti-XML by anomalous+cohort · · Score: 3, Funny

      Hmmm, perhaps a visit to a dictionary is in order. Once you read the definitions for rectilinear and normalized, I think you'll find the sense of the post.

      This is a sound strategy any time you run into a message that makes no sense. Simply look up the definitions of the words that you don't know.

    5. Re:Anti-XML by I8TheWorm · · Score: 2, Interesting

      I tend to get on an XML soap (no pun intended) box when I see articles about it, so here goes...

      XML is great for sharing data between non-congruous systems. It's horrible, however, for storing data in any large quantity, and even more horrible for treating as a searchable text file. It's inherently large and full of ascii/ansi/utf characters that are completely unnecessary when performing byte by byte text searches. For large amounts of data, you're right... RDBMS is the current way to go... maybe OODBMS will be in the near future, but I still haven't tinkered with it myself and don't have any opinions developed yet.

      XML is not the data end-all... same as __insert_your_own_programming_language_here__ is not the end-all of programming. It's a nice tool, but tends to be overused because it's still a buzz-word.

      --
      Saying Android is a family of phones is akin to saying Linux is a family of PCs.
    6. Re:Anti-XML by DrVomact · · Score: 3, Interesting

      The reason why XML is widely used today for a multitude of purposes (e.g., data interchange between otherwise incompatible systems, configuration files, technical documents, command protocols that communicate with servers, etc. etc.) and why it will be used for even more stuff in the future is that it is centered on a very simple and powerful idea: self-documenting data. That is, the data is structured by internal markers that give information about the type of information contained in each logical element of the data stream (or file). Naturally, the XML geeks are doing everything they can to complicate this simple idea, but I digress...

      Because XML files are structured, self-documenting text files that correspond to a formal definition (I know DTDs aren't technically required for "well-formed" XML, but you really don't want to do that), you can rely on your data being usable without making assumptions about the type of systems that will use it (OS is irrelevant, applications can have front and back ends that understand XML). Moreover, this compatibility isn't going to go away: it's just pure text--we will always be able to read it.

      I have no idea whether the databases of the future will store their data in XML form or not. I'm not a database expert, but I suspect there are more efficient ways of storing and searching information than in huge chunks of tagged text. However, while a database that stores its data in a rigid table format may be quicker and more compact, it cannot preserve the richness of meta-information contained in an XML tagging system. If you put your XML into a traditional database, you won't be able to take advantage of being able to make searches based on information in the tags.

      Be that as it may, the fact remains that you will at least be able to feed your database XML, and get XML out of it. That means that the XML front end will parse the XML input data, and will be able to figure out how to organize your data in its innards based on the information provided by the XML tags in the data you give it. When you make a query (probably formatted as XML by the query software), the data will be returned as XML using the tag scheme specified in your DTD.

      Another consideration is that you can store much more information about your data with an XML tagging scheme than you can with any database format--and you can communicate that information when you send your data to someone else, because the metadata is part of the data. I work with huge texts (technical manuals, actually) and I heartily welcome the flexibility and usefulness of being able to identify parts of that text based on any criteria that are meaningful to me or the consumer of the data.
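
      To make "searches based on information in the tags" concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The manual fragment, its element names and the audience attribute are invented for the example and do not come from any real DTD.

```python
import xml.etree.ElementTree as ET

# Invented fragment of a technical manual.
MANUAL = """\
<manual>
  <chapter title="Installation">
    <para audience="admin">Run the installer as root.</para>
    <warning audience="admin">Back up the root volume first.</warning>
  </chapter>
  <chapter title="Usage">
    <para audience="user">Do not run the client as root.</para>
  </chapter>
</manual>
"""

def find_in_tag(xml_text, tag, word, **attrs):
    """Return the text of every <tag> element that contains word and whose
    attributes match the given values."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter(tag):
        text = "".join(elem.itertext())
        has_word = word.lower() in text.lower()
        attrs_match = all(elem.get(key) == value for key, value in attrs.items())
        if has_word and attrs_match:
            hits.append(text)
    return hits

print(find_in_tag(MANUAL, "warning", "root"))
print(find_in_tag(MANUAL, "para", "root", audience="user"))
```

      A plain word search for "root" would hit all three elements; restricting by tag or attribute narrows it to the warnings, or to the text written for end users.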

      --
      Great men are almost always bad men--Lord Acton's Corollary
    7. Re:Anti-XML by gorilla · · Score: 2, Insightful

      Call me stupid if you like, but I don't see how the representation of the data helps to search for ideas, concepts, etc. Regardless of how the text is stored, unless you have a human do a lot of markup on the text, you're going to have a problem extracting the ideas from the text. And by markup I don't mean Heading, I mean someone entering what the ideas, concepts, etc. are for each part of the text, which can be done equally easily in a traditional database as in an XML document.

  7. Bray's theorem by KoolDude · · Score: 3, Funny


    The essay series converges to a textbook as time tends to infinity. Proof is left as an exercise to the reader.

    --
    getSexySig(); /* returns sexy signature */
  8. This technology still exists? by Pathetic+Coward · · Score: 2, Funny

    Search technology. Hmmm. Wasn't that outsourced to India last month? Or was that last year? I just can't keep up with IT today.

    1. Re:This technology still exists? by smittyoneeach · · Score: 2, Insightful

      It will thrive until the Next Big Thing(tm) arrives, to "save us from the sad shortcomings of XML".

      XML's only real fault is that it's been oversold, not unlike Object Oriented Programming and Java before it.

      --
      Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
  9. Why isn't "someone" Tim Bray by leoaugust · · Score: 5, Interesting

    I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

    "Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well known names in XML, has experience under his belt with another Search Engine Project Antarctica .....

    I just mean it in the sense that if he is having trouble getting his own ideas off the ground himself, what a challenge it will be for someone else to do so.

    Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.

    Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...

    --
    To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
    1. Re:Why isn't "someone" Tim Bray by wizarddc · · Score: 2, Informative
      Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...


      I thought that was just a myth?
      --
      Th
    2. Re:Why isn't "someone" Tim Bray by mbrinkm · · Score: 3, Informative

      "This is the last in my series of On Search essays. I've written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I'd like to change this part of the world. In short, I'd like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I'll write something on how it might get built.

      Naming the Baby An important piece of software needs to have a name, but that takes time and creativity and can wait; for now I'll just call this thing the Basic Resource Finder (BRF).

      Requirements Then a couple of non-requirements and a conclusion.

      BRF is Open-Source My heartfelt apologies to anyone still trying to make a go of it in server-side search; but that business is just so over. It always was a lousy business, nobody has ever made real money there on a sustained basis, and yet it's something that every Web deployment needs. For a substantial site you can easily drop six figures for a search engine, and all the bells and whistles that buys you are mostly not cost-effective.

      So BRF is going to be open-source. That doesn't mean that you can't make money with search software; it just means you have to do it in services. There are always going to be search deployments loaded with tricky implementation and deployment work: figuring out where the data is, aggregating it, cleaning it up, building the workflows so these things keep happening, maintaining some application-specific synonyms, the list goes on and on, and none of these things are free. And they are much better things to spend money on than software licenses."

      RTFA!

      The original submission is about his last essay and that essay starts with the above quote.

      And whoever moderated you up needs to RTFA also!

      --
      "Don't worry about people stealing an idea. If it's original, you will have to ram it down their throats." --Howard Aike
    3. Re:Why isn't "someone" Tim Bray by gwhulbert · · Score: 2, Informative

      Tim Bray was one of the founders of Open Text Corporation ... they INVENTED the search engine.
      Digital (with whom they were working) "stole" the idea and opened Altavista 3 months before their IPO.
      I worked for Open Text for a year, but only after Tim left (just about the time the 1.0 draft of the XML spec appeared).

  10. ObHutz by sharkey · · Score: 3, Funny

    Mr. Simpson, this is the most blatant case of fraudulent advertising since my suit against the film, ``The Never-Ending Story''.

    --
    "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
  11. Slashdot search question by Glass+of+Water · · Score: 2, Interesting
    But there is some good stuff out there; for example Slashdot's search engine seems to run smooth, clean, and fast, but some poking around failed to reveal what it is: I wouldn't be surprised if it's just the Mysql search facility.
    Anybody know the answer to this one?
    --
    There are no trolls. There are no trees out here.
  12. Re:re-inventing the wheel by Anonymous Coward · · Score: 4, Insightful

    Try reading the articles/essays. Knuth's vol 3 is about comparison search, not full-text search.

  13. Or instead, talk to a librarian (the Register) by JPMH · · Score: 2, Interesting
    An interesting counterpoint to this story in the Register today:

    "A Quantum Theory of Internet Value" by Andrew Orlowski
    -- why librarians are better at finding the book you want than Google.

  14. Re:re-inventing the wheel by getarun_vr · · Score: 2, Insightful

    Maybe search technology has changed a lot since Knuth's day. If one glances cursorily through the last couple of journals on information search and retrieval, one cannot help noticing the heavy influence of PageRank (Google's own technology). Thankfully the algorithm is well known. On the flip side, critics have often asked whether such algorithms should be published. The bloggers have demonstrated that even Google rankings can be rigged... Personally, I would choose the open architecture philosophy, due to parallels with the ideas of Bruce on cryptography. A peer-reviewed system is always better than a closed proprietary system.
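
    For anyone who hasn't seen it, a simplified textbook version of the PageRank iteration, not Google's production ranking. The tiny link graphs, the 0.85 damping factor and the iteration count are illustrative assumptions. The second graph adds two made-up "farm" pages to show how link-based ranking can be nudged.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: a and b both link to c, c links back to a.
base = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})

# Same web plus two invented pages that exist only to link to b.
rigged = pagerank({"a": ["c"], "b": ["c"], "c": ["a"], "f1": ["b"], "f2": ["b"]})

for page in sorted(base):
    print(page, round(base[page], 3), "->", round(rigged[page], 3))
```

    In the first graph c ranks highest because two pages point at it. In the second, b's score rises purely because of the extra inbound links, which is the kind of rigging the bloggers demonstrated.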

  15. Yeah, I know... Preview.... by stoborrobots · · Score: 3, Informative

    I don't know... see for yourself, then come and tell us... The comment on this page suggests that you are right...

  16. Mirror by Door-opening+Fascist · · Score: 4, Informative
    Since the site looks bogged down from the /.'ing, I've made a few mirrors:

    Mirror #1

    Mirror #2

    Mirror #3

  17. "long departed Open Text index?" Not by Anonymous Coward · · Score: 2, Informative

    It just has a new name, and it's being developed by librarians.
    http://www.dlxs.org/products/xpat.html

  18. searching using php perl and mysql by chrisranjana.com · · Score: 2, Interesting

    More search-related functions should be available in PHP and Perl, and built in to them... even MySQL too...

    --
    Chris ,
    Php Programmers.