Slashdot Mirror


Learning About Full-text Search

An anonymous reader writes "Tim Bray who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."

140 comments

  1. Salute by grub · · Score: 2, Funny


    ..and has been /.'ed once or twice..

    You mean two or three times now.

    --
    Trolling is a art,
    1. Re:Salute by antarctican · · Score: 4, Interesting

      ..and has been /.'ed once or twice..

      You mean two or three times now.


      And it's my poor server that has to bear the burden.... but so far it's held up fairly well each time. Pretty good for a Celeron 1.7GHz w/ 256M. :)

      However this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.

      Funny thing is I didn't even know he was /.'ed until he emailed me. I went to check my email (via pine) and the console was as responsive as usual.

      For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content. :)

  2. web page irony by Savatte · · Score: 3, Funny

    He writes about searching technology, but you can't easily search through his writings.

    1. Re:web page irony by Anonymous Coward · · Score: 0

      How do I do a keyword search of the new textbook on searches?

    2. Re:web page irony by Dreadlord · · Score: 5, Funny

      too bad his pages are valid XHTML documents, it would have made an excellent +5 funny comment :(

      --
      The IT section color scheme sucks.
    3. Re:web page irony by Anonymous Coward · · Score: 2, Informative

      they don't, but the parent post is about finding some conflict between the author's pages and articles.
      He's got an article about searching and his pages aren't searchable, and he's got articles about XML, so having non-valid XHTML pages would definitely have been ironic...

    4. Re:web page irony by arrogance · · Score: 3, Interesting

      Well, especially when it's been slashdotted. Here's a google cache hit to part of his writings.

      I agree that it doesn't look to be easy to search around, at least when all you have is a URL to go on (http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC) and Google to find reachable material. I'm also not too sure about using dates as folder names but that's just a personal thing: I think Tim Berners-Lee recommended it at one point in an article, "Cool URIs don't change". He does recommend using "Latest" or some such instead of the creation date in a URI, though, if "there is no reason for the persistence of the URI to outlast the magazine." It might make things easier to search for though, at least if you know when it was created: if the URIs aren't changing then you won't have tons of broken links.

    5. Re:web page irony by Schwarzchild · · Score: 4, Informative
      He writes about searching technology, but you can't easily search through his writings.

      Really? How about search site:tbray.org?

      --

      "sweet dreams are made of this..."

    6. Re:web page irony by Anonymous Coward · · Score: 2, Funny

      Tee hee, I get it now. It reminds me of this time that something didn't happen, but if it had happened, it would have been funny. Ha ha, that still cracks me up. Yes, most amusing.

  3. Hold on there by arvindn · · Score: 5, Funny
    ...has been writing this endless series of essays on search technology since summer. He says he's finished now...

    Finished an endless series?

    1. Re:Hold on there by MooCows · · Score: 4, Funny

      The maximum number of results have been returned.

      --
      The path I walk alone is endlessly long.
      30 minutes by bike, 15 by bus.
    2. Re:Hold on there by Dreadlord · · Score: 1

      Maybe we have just hit +Infinity of time?
      Time flies when we're sitting in front of our comps, reading /.

      --
      The IT section color scheme sucks.
    3. Re:Hold on there by Lozzer · · Score: 1

      In the first three months he wrote a page, in the next month and a half he wrote another page, in the next (scratching of head) three quarters of a month he wrote another page, and so on. Now after six months he has written an endless amount of stuff, simple (yet amazing) really.

      --
      Special Relativity: The person in the other queue thinks yours is moving faster.
    4. Re:Hold on there by haystor · · Score: 1

      He wrote them in a circle.

      --
      t
  4. poor guy by understyled · · Score: 5, Informative

    i cringe at the bandwidth demands a slashdotting can bring with it. here's google's cache of the page.

    --
    Sig (appended to the end of comments you post, 120 chars)
    1. Re:poor guy by martingunnarsson · · Score: 4, Insightful

      If Google can cache pages and put them online, so should Slashdot. People say copyright issues would be a problem, but in that case, why is Google's online cache any better?

      --
      Martin
    2. Re:poor guy by Anonymous Coward · · Score: 0

      The idea being people would visit the cache before the actual site. What if the site added an addendum? Google Cache is a pain in the ass JUST ENOUGH that you will try the real link first.

    3. Re:poor guy by ihummel · · Score: 1

      Google is a search engine used and respected by virtually everyone. Slashdot is, well, Slashdot.

      Also, I believe that Google respects instructions in the robots.txt not to cache their page.

    4. Re:poor guy by martingunnarsson · · Score: 1

      So could Slashdot. Hehe, it's actually kind of funny! The webmaster's choices would be:
      1) Allow Slashdot to cache the site
      2) Get the site slashdotted back to the Stone Age

      Nothing wrong with some mafia methods every now and then!

      --
      Martin
    5. Re:poor guy by Arslan+ibn+Da'ud · · Score: 4, Informative
      --

      Practice Kind Randomness and Beautiful Acts of Nonsense.

    6. Re:poor guy by ihummel · · Score: 1

      I think the main problem is that the guys who run slashdot would probably need to get permission beforehand to cache the linked page, and it would take too much time out of their day to email back and forth to every linked site. Sure, J Random Hacker wouldn't mind being cached, but CNN, News.com.com.com.com.com, and the New York Times just might. And they would have enough bandwidth to handle the Slashdotting.

    7. Re:poor guy by davew2040 · · Score: 4, Insightful

      And they considered incorrectly.

    8. Re:poor guy by davew2040 · · Score: 1

      Too much out of their day? Out of the 15 sites they link every day, they can't be bothered with asking because of *time constraints*?!

      Apparently it's more acceptable to them to knowingly blow sites out of the water (they even joked about it in this post) than to spend the time to fire off an email. The fact is, they don't even want to try.

    9. Re:poor guy by martingunnarsson · · Score: 2, Offtopic

      Google isn't asking for permission. Again, Slashdot could obey the rules in robots.txt.

      --
      Martin
    10. Re:poor guy by ihummel · · Score: 2, Insightful

      Google is Google and Slashdot is Slashdot.

      But even if the issue of liability were taken off the table, they would still have to get off of their metaphorical butts and set up a caching system. I don't know if there is any usable open-source system currently in existence, but if not, they would either have to code it themselves or adapt something already out there that doesn't serve their needs. Disk space isn't really an issue, as the commenting system takes a lot more space than the cache would (assuming they didn't mirror isos or anything silly like that).

    11. Re:poor guy by johnteslade · · Score: 5, Informative
      The site is still slashdotted. Each of his papers is on a separate page, so here are the google caches of the individual papers:

      I have to write some crap at the end here so i can get past the "Your comment has too few characters per line" error message.

    12. Re:poor guy by spectre_240sx · · Score: 3, Informative

      I don't know about that. There seem to be too many problems associated with caching. One that comes to my mind is the extra bandwidth that they would have to worry about. An article about the design of the site mentions that just changing over to CSS made a grand savings of 3-14 GB a day equalling something like $3,600.00 in the end. Now that's just by cutting 2-9KB off every page request. Now, think about them serving (possibly) huge pages from other sites that may not optimize their code... That's a lot of money that slashdot would have to spend.

    13. Re:poor guy by davew2040 · · Score: 2, Funny

      Well then, I guess slashdot would learn firsthand about the slashdot effect!

    14. Re:poor guy by spectre_240sx · · Score: 1

      Umm, I think they already do when you consider the number of people who come here. Remember how often people say RTFA? If only those who view the articles cause "the slashdot effect", imagine how much traffic slashdot already gets.

    15. Re:poor guy by The+Unabageler · · Score: 1

      I always called it a recursive wget.

      --
      perl -e '$_="\007/4`\cp%2,".chr(127);s/./"\"\\c$&\""/gees; print'
    16. Re:poor guy by Nucleon500 · · Score: 1

      Their concern is that commercial sites will feel cheated out of ad revenue. But this problem is trivial to avoid: Don't cache pages initially, but have a system for caching them quickly if the webmaster asks. The stories wouldn't be delayed, but when they are accepted, a notification would be sent and a copy made. When the webmaster asks to be relieved, the links in the story would be changed to the cache.

    17. Re:poor guy by gangien · · Score: 1

      So the quick answer is: "Sure, caching would be neat." It would make things a lot easier when servers go down, but it's a complicated issue that would need to be thought through in great detail before being implemented.

      this is incorrect?? thinking about the implications of copying something and putting it on your server is incorrect?

    18. Re:poor guy by 1iar_parad0x · · Score: 1

      How much code does it take to mirror a simple site? I think I could copy the site and set up a link in Perl pretty easy. The hardware seems to be the tricky part.

      On a related note... How does Slashdot avoid the Slashdot effect?

      --
      What do you mean my sig is repetitive? What do you mean my sig is repetitive? What do you mean....
    19. Re:poor guy by Anonymous Coward · · Score: 0
      How does Slashdot avoid the Slashdot effect?
      The same way the ocean avoids getting flooded by the Nile. Distribution and lots of room.
  5. re-inventing the wheel by peter303 · · Score: 1, Interesting

    Try Knuth Vol 3.

    1. Re:re-inventing the wheel by Anonymous Coward · · Score: 0

      > Try Knuth Vol 3.

      Can't afford it. This is freely downloadable. Knuth loses.

    2. Re:re-inventing the wheel by Anonymous Coward · · Score: 0

      Really? How much is there about Google in Knuth?

    3. Re:re-inventing the wheel by Anonymous Coward · · Score: 0
      "Knuth is the most over rated, ..."


      A common complaint from people who don't understand his books...

    4. Re:re-inventing the wheel by Anonymous Coward · · Score: 4, Insightful

      Try reading the articles/essays. Knuth's vol 3 is about comparison search, not full-text search.

    5. Re:re-inventing the wheel by getarun_vr · · Score: 2, Insightful

      Maybe search technology has changed a lot since Knuth's days. If one glances through the last couple of journals on information search and retrieval, one cannot help noticing the heavy influence of PageRank (Google's own technology). Thankfully the algorithm is well known. On the flip side, critics have often asked whether such algorithms should be published. Bloggers have demonstrated that even Google rankings can be rigged... Personally, I would choose the open-architecture philosophy, due to parallels with the ideas of Bruce on cryptography. A peer-reviewed system is always better than a closed proprietary system.

    6. Re:re-inventing the wheel by Anonymous Coward · · Score: 0

      Managing Gigabytes by Ian Witten, Alistair Moffat, and Timothy C. Bell is far more relevant to full-text searching.

    7. Re:re-inventing the wheel by Anonymous Coward · · Score: 1, Insightful

      You have that backwards. PageRank was heavily influenced by other systems, like Harvest. And full-text search has changed very little since Knuth. For instance, the basic exact string matching algorithms haven't advanced at all.

    8. Re:re-inventing the wheel by Anonymous Coward · · Score: 0

      Thanks for the recommendation, that book's bound for my Amazon wishlist.

    9. Re:re-inventing the wheel by Anonymous Coward · · Score: 0

      For a solid grounding in the subject, I would highly recommend (in addition to Knuth):
      Pattern Matching Algorithms
      by Alberto Apostolico and Zvi Galil
      ISBN: 0195113675

    10. Re:re-inventing the wheel by Anonymous Coward · · Score: 0

      Really? How much is there about Google in Knuth?
      Very much. I seriously doubt that Google would consider hiring anyone who isn't at least familiar with Knuth either directly through his Art of Computer Programming, or indirectly through Sedgewick's works, Aho's works, Rivest's works, and perhaps every book on data structures, algorithms, and the running time of programs out there, because, eventually, they most all lead back to Knuth. Not to mention the fact that most mathematics texts as well as most technical texts in publication today use Knuth's TeX program or one of its derivatives in some form or another for their creation.
      So, it's probably safe to say that Google probably would not exist if it weren't for Knuth.

  6. Interesting stuff! by clifgriffin · · Score: 3, Funny

    Though, I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet.

    1. Re:Interesting stuff! by KoolDude · · Score: 1, Funny


      I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet

      But what if your stack grows big and you need to search through the stack ?

      --
      getSexySig(); /* returns sexy signature */
    2. Re:Interesting stuff! by Anonymous Coward · · Score: 0

      If you're searching a stack, the best you can do is pop it until the thing you're looking for is on top.

    3. Re:Interesting stuff! by stoborrobots · · Score: 1, Offtopic

      Actually no... one of the interesting things is that it is far more efficient to "scan" through a stack than to pop it if you're looking for something... (Assuming you have an in-memory stack, which is easily manipulated by memory operations as well as stack ops.)

      It breaks the abstraction, but the improvement may actually be worth it sometimes...
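
      A minimal sketch of the difference, assuming the stack is backed by a Python list whose storage can be read directly. The function names and the sample items are hypothetical, chosen only for illustration.

```python
def find_by_popping(stack, target):
    """Pop until target is on top, then restore; return whether it was found."""
    popped = []
    while stack and stack[-1] != target:
        popped.append(stack.pop())
    found = bool(stack)
    # Push everything back so the search leaves the stack unchanged.
    while popped:
        stack.append(popped.pop())
    return found


def find_by_scanning(stack, target):
    """Read the underlying storage directly, top of stack first."""
    for depth, item in enumerate(reversed(stack)):
        if item == target:
            return depth  # 0 means the item is already on top
    return None


pile = ["The Simpsons", "The Internet", "On Search"]
print(find_by_popping(pile, "The Simpsons"))
print(find_by_scanning(pile, "The Simpsons"))
```

      The popping version moves every item above the target twice (off and back on), while the scan only reads memory. That is the abstraction break described above.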

  7. Anti-XML by MattRog · · Score: 4, Interesting
    Whether there's going to be a lot of XML around in repositories to search. XML these days is more used in interchange rather than archival applications.


    Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

    Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries on them with minimal effort.
    --

    Thanks,
    --
    Matt
    1. Re:Anti-XML by SillySnake · · Score: 1

      I thought Longhorn was going to use some sort of XML file system? Or at least there were thoughts about it?

    2. Re:Anti-XML by Havokmon · · Score: 1
      namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

      I'm with ya there buddy.. If it wasn't for a corporate buyout, my OS/2 box with REXX scripts would still be ftp'ing files (I was really hoping for 10 years - but I've been gone for 3 now).

      Now they'll do it in some xxx.Net, because it's all new and cool. Whatever, at least my stuff was readable with 'edit'.

      --
      "I can't give you a brain, so I'll give you a diploma" - The Great Oz (blatently stolen sig)
    3. Re:Anti-XML by phurley · · Score: 5, Informative

      I agree to a point, but if we are talking about a mixed environment where you are using Oracle, I am using DB2, our friend Bob has his data in a legacy ISAM setup, and a customer wants to integrate a search system across the three systems, they are going to have to write a lot of custom glue.

      If an XML aspect of the data is available (you can still keep it all in Oracle - just provide a "view" of it in XML) from each of us - common search tools and methods can be utilized.

      --
      Home Automation & Linux -- now I know I'm a geek
    4. Re:Anti-XML by arrogance · · Score: 4, Interesting

      He even goes so far as to mention that Index Server will search your website: but fails to mention that it does full text searching on your entire file system.

      Unfortunately his site (http://www.tbray.org/ongoing/) seems to be sufficiently disorganized that I have trouble finding out what his real points are, or whether he's addressed all of the issues: for example, I saw no mention of the Semantic Web if his concern is searchability on web documents.

      As a side note, MS SQL is going more and more toward XML, as is the whole .NET framework. This results in richer (read: fatter) data but it does mean that you can store whatever metadata you want along with it.

    5. Re:Anti-XML by Hayzeus · · Score: 1
      I don't really get the advantages of XML Data storage either, but when it comes to emitting data in a generic, interoperable, self-describing format, XML works quite nicely, even if it is a tad verbose.

      Which (slightly OT) reminds me: has anyone here used an XML compression tool, that they'd like to share opinions on? I've looked at XMLPPM briefly but not worked with it yet. Any others?

    6. Re:Anti-XML by MattRog · · Score: 1

      I don't think writing your own DBMS engine (with query, data management, concurrency, etc.) support is going to be 'less' work than simply either ensuring that your SQL works with different vendors or writing small data pieces to talk to a number of DBMS products.

      You could, of course, bundle an existing DBMS product into the application which would remove the limitation of being forced to use the customer's DBMS product.

      --

      Thanks,
      --
      Matt
    7. Re:Anti-XML by anomalous+cohort · · Score: 4, Insightful

      From the google cache...

      searching for words isn't really what you want to do. You'd like to search for ideas, for concepts, for solutions, for answers.

      That is why he is going in an XML direction. The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized because, well, humans are highly non-normalized.

      I think that he should look at some work by a different Tim, the Semantic Web.

    8. Re:Anti-XML by MattRog · · Score: 1
      The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized ...

      That makes absolutely no sense.
      --

      Thanks,
      --
      Matt
    9. Re:Anti-XML by anomalous+cohort · · Score: 3, Funny

      Hmmm, perhaps a visit to a dictionary is in order. Once you read the definitions for rectilinear and normalized, I think you'll find the sense of the post.

      This is a sound strategy any time you run into a message that makes no sense. Simply look up the definitions of the words that you don't know.

    10. Re:Anti-XML by MattRog · · Score: 1

      It doesn't make any sense because it's meaningless. Try and provide reasoning why you think this sort of information can't be modeled relationally.

      --

      Thanks,
      --
      Matt
    11. Re:Anti-XML by cthrall · · Score: 1

      SQL dbs might come with full-text indexing, but the power of information retrieval really comes into play when you can start clustering, using stemmers to find people/places, etc. Db full-text indexing feels more like a feature checkbox than a real information retrieval system.

      XML can be useful because you can take data from disparate sources (an Exchange server, SQL db, etc.) and normalize the meta data (the document author, date the document was created, etc.).

      I agree there's an overwhelming "silver-bullet" feel about XML sometimes, but it can definitely help in this case.

    12. Re:Anti-XML by I8TheWorm · · Score: 2, Interesting

      I tend to get on an XML soap (no pun intended) box when I see articles about it, so here goes...

      XML is great for sharing data between non-congruous systems. It's horrible, however, for storing data in any large quantity, and even more horrible for treating as a searchable text file. It's inherently large and full of ascii/ansi/utf characters that are completely unnecessary when performing byte by byte text searches. For large amounts of data, you're right... RDBMS is the current way to go... maybe OODBMS will be in the near future, but I still haven't tinkered with it myself and don't have any opinions developed yet.

      XML is not the data end-all... same as __insert_your_own_programming_language_here__ is not the end-all of programming. It's a nice tool, but tends to be overused because it's still a buzz-word.

      --
      Saying Android is a family of phones is akin to saying Linux is a family of PCs.
    13. Re:Anti-XML by DrVomact · · Score: 3, Interesting

      The reason why XML is widely used today for a multitude of purposes (e.g., data interchange between otherwise incompatible systems, configuration files, technical documents, command protocols that communicate with servers, etc. etc.) and why it will be used for even more stuff in the future is that it is centered on a very simple and powerful idea: self-documenting data. That is, the data is structured by internal markers that give information about the type of information contained in each logical element of the data stream (or file). Naturally, the XML geeks are doing everything they can to complicate this simple idea, but I digress...

      Because XML files are structured, self-documenting text files that correspond to a formal definition (I know DTDs aren't technically required for "well-formed" XML, but you really don't want to do that), you can rely on your data being usable without making assumptions about the type of systems that will use it (OS is irrelevant, applications can have front and back ends that understand XML). Moreover, this compatibility isn't going to go away: it's just pure text--we will always be able to read it.

      I have no idea whether the databases of the future will store their data in XML form or not. I'm not a database expert, but I suspect there are more efficient ways of storing and searching information than in huge chunks of tagged text. However, while a database that stores its data in a rigid table format may be quicker and more compact, it cannot preserve the richness of meta-information contained in an XML tagging system. If you put your XML into a traditional database, you won't be able to take advantage of being able to make searches based on information in the tags.

      Be that as it may, the fact remains that you will at least be able to feed your database XML, and get XML out of it. That means that the XML front end will parse the XML input data, and will be able to figure out how to organize your data in its innards based on the information provided by the XML tags in the data you give it. When you make a query (probably formatted as XML by the query software), the data will be returned as XML using the tag scheme specified in your DTD.

      Another consideration is that you can store much more information about your data with an XML tagging scheme than you can with any database format--and you can communicate that information when you send your data to someone else, because the metadata is part of the data. I work with huge texts (technical manuals, actually) and I heartily welcome the flexibility and usefulness of being able to identify parts of that text based on any criteria that are meaningful to me or the consumer of the data.

      --
      Great men are almost always bad men--Lord Acton's Corollary
    14. Re:Anti-XML by platypus · · Score: 1

      Because you'd (for example) have to provide a relational model for the semantics of the English language. And even that wouldn't meet the criterion of "generalized", because, ehm, it's specialized for the English language.

    15. Re:Anti-XML by gorilla · · Score: 2, Insightful

      Call me stupid if you like, but I don't see how the representation of the data helps to search for ideas, concepts etc. Regardless of how the text is stored, unless you have a human do a lot of markup on the text, you're going to have a problem in extracting the ideas from the text. And by markup I don't mean Heading, I mean someone entering what the ideas, concepts etc are for each part of the text - which can be done equally easily in a traditional database as in an XML document.

    16. Re:Anti-XML by platypus · · Score: 1

      Err, missed some context in this thread, clearly, XML as opposed to relational wouldn't help here either.

    17. Re:Anti-XML by radio4fan · · Score: 1

      for example, I saw no mention of the Semantic Web


      Try here on the page about metadata.
    18. Re:Anti-XML by cthrall · · Score: 1

      The problem isn't that the information can't be modeled in a relational manner, you could easily use a relational database for your data store.

      The problem is retrieving information to index. You pull information from existing data sources that have never heard of your data model and don't care. XML provides a simple way to map your existing content to some standard design that you come up with. That's the "normalization" step, and one of the harder parts of indexing.

    19. Re:Anti-XML by cthrall · · Score: 1

      You can use stemmers, term frequencies and relative location in a document to provide some general gist of what a document is about. The whole point of creating advanced information retrieval tools is to make information processing a more automated task.
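
      A rough sketch of that idea, under loud assumptions: the suffix list is a toy stand-in for a real stemmer, the length cutoff is a crude substitute for a stop-word list, and the position weighting is an arbitrary illustrative choice. All names here are hypothetical.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")  # toy list, not a real stemmer


def crude_stem(word):
    """Strip one common English suffix, keeping at least four letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def gist(text, top_n=3):
    """Rank stems by frequency, with extra weight for early occurrences."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = Counter()
    for position, word in enumerate(words):
        if len(word) < 4:  # skip short words as a cheap stop-word filter
            continue
        # Each occurrence counts once, plus a bonus that shrinks with position.
        scores[crude_stem(word)] += 1.0 + 1.0 / (1 + position)
    return [stem for stem, _ in scores.most_common(top_n)]


sample = "Searching text: search engines index text, and searchers search indexes."
print(gist(sample))
```

      Because "searching", "searchers" and "search" collapse to one stem, their counts add up, which is what lets a frequency score hint at what the document is about.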

    20. Re:Anti-XML by gorilla · · Score: 1

      Yep, but what difference does it make if the text is stored in XML or in a database?

    21. Re:Anti-XML by Anonymous Coward · · Score: 0

      Sigh. Yeah, you're an idiot. I do (and have done for about 5 years) many IE/IR systems. "Generalized semantic searching" as you want to call it is actually a highly structured exercise. Most systems don't use databases to do the work because it is the opposite of that, so simply done that there is no reason for a database. It could be represented in a relational format, just like any large search engine, and tagged appropriately (word order information and such), but most systems work on small chunks of text.

      Sorry, please try again.

    22. Re:Anti-XML by Anonymous Coward · · Score: 0

      I'd just like to point out that Tim specifically says that he is "going to use XML syntax, because it's reasonably readable... This is not to suggest that you'd actually store a search engine's document table in XML. You might (I wouldn't), but the point is the XML is being used here as an expository device.".

      So he isn't suggesting "reinventing the wheel" at all.

    23. Re:Anti-XML by cthrall · · Score: 1

      The XML part comes in when you are extracting content from an existing data store. You can use a relational database for a backend store, but when you're going through the step of mapping existing content to the info your indexing engine wants (the normalization step), XML is very handy.

    24. Re:Anti-XML by kirkjobsluder · · Score: 1

      True, the problem is that HTML became such a beast mixing semantic markup with visual markup that it is really hard to find well-marked up documents.

      Still, while it is possible to convert any form of data into a relational database, does that mean that the relational database is the best fit for all types of data? One of the things that XML does well but relational databases don't do well (without a lot of violent shuffling around) is arbitrary parent-child relationships. So for example, a typical paper might have:

      [frontmatter]
      -[title]
      -[author]
      -[contact info]
      -[abstract]
      [section]
      -[paragraph]
      -[paragraph]
      -[blockquote]
      -[paragraph]
      -[image]
      -- [caption]

      etc. etc..

      Why go through the problem of chopping this up and normalizing the data if you don't have to? Especially if you give the people who are producing the content a template and schema that produces well-formed semantic XML or can process existing texts with some assumptions?
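
      To make the "chopping up" concrete, here is a small sketch that flattens such a document into parent/child rows of the kind a relational table would hold. The element names follow the outline above; the sample content, the row layout, and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PAPER = """
<paper>
  <frontmatter>
    <title>On Search</title>
    <author>A. Writer</author>
  </frontmatter>
  <section>
    <paragraph>First paragraph.</paragraph>
    <image><caption>A figure.</caption></image>
  </section>
</paper>
"""


def flatten(root):
    """Turn a tree into (id, parent_id, position, tag, text) rows."""
    rows = []

    def visit(node, parent_id, position):
        node_id = len(rows) + 1  # ids are assigned in document order
        text = (node.text or "").strip()
        rows.append((node_id, parent_id, position, node.tag, text))
        for index, child in enumerate(node):
            visit(child, node_id, index)

    visit(root, None, 0)
    return rows


rows = flatten(ET.fromstring(PAPER))
for row in rows:
    print(row)
```

      Every element becomes a row that needs a parent id and a sibling position, and rebuilding the document means walking those links again. The nesting that XML gives for free has to be carried explicitly.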

    25. Re:Anti-XML by kirkjobsluder · · Score: 1

      I didn't get that impression from the article that he was considering XML as data storage. I saw the point as being that we don't know how much XML a search system will have to process. If your data consists of a large number of OpenOffice, DocBook, XHTML or Framemaker documents, then it might just be easier to keep things in XML rather than to split the data apart into a bunch of atomic chunks.

      I love using RDBMS but for some applications, creating a normalized database is a pain in the rear. Bibliographic information for example requires three tables for author-article relationships. (And this is not getting into the fact that journal articles usually have a volume number while articles do not, inheritance of publisher's information from anthologies to individual book chapters, and the tricky difference between newspaper and academic citations.) I don't see RDBMS systems as a natural choice for structured documents that may have an arbitrary structure.
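
      The three-table layout for author-article relationships might look like the sketch below, using an in-memory SQLite database. The table names, column names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Link table: one row per (article, author) pair, with author order.
CREATE TABLE authorship (
    article_id INTEGER NOT NULL REFERENCES article(id),
    author_id  INTEGER NOT NULL REFERENCES author(id),
    position   INTEGER NOT NULL,
    PRIMARY KEY (article_id, author_id)
);
INSERT INTO author  VALUES (1, 'First Author'), (2, 'Second Author');
INSERT INTO article VALUES (1, 'An Example Article');
INSERT INTO authorship VALUES (1, 1, 0), (1, 2, 1);
""")

# Listing the authors of one article already needs a join across tables.
authors = [
    name
    for (name,) in conn.execute(
        """
        SELECT author.name
        FROM authorship
        JOIN author ON author.id = authorship.author_id
        WHERE authorship.article_id = ?
        ORDER BY authorship.position
        """,
        (1,),
    )
]
print(authors)
```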

    26. Re:Anti-XML by Pseudonym · · Score: 1

      XML is almost ideal for storing structured text in large quantities. Storing non-textual data, not so much. (This is one reason why XML gets a bad reputation for data representation; people are using it for tasks which are not textual markup-related.) For byte-by-byte searching... true enough, it sucks for that. But surely if you have text in large quantities, you're hardly going to search it using "grep". That would be insane whether it's stored in XML or plain text.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    27. Re:Anti-XML by Pseudonym · · Score: 1

      Relational databases and full-text indexing are a poor fit once you have a lot of text to store. Yes, I know. Most SQL DBMS come with full-text indexing. That's not enough. Read on for the reason why.

      Think about how a relational DBMS works. Internally, the major data structure is the "stream of tuples". A tuple is a virtual record which is made up of a number of fields, each of which has data in it.

      When you search, you get back a stream of tuples, which is usually some projection of the record store. That is, you apply a function to each physical record which returns a tuple. This may be some subset of the physical fields, or may involve transforming them in some way.

      Similarly, when you retrieve data back to the client, you get a stream of tuples. Everything conceptually works on these tuples.

      Now think about how you work with a text database, such as Google. Google still has a record store. (See the "Google cache" for an example of what it might look like internally.) However, you don't see this. What you do is enter a search query and what is returned is the first ten results in summarised form. You hit "next" and you get the next ten. Typical searchers only look on one or two pages before they either go on to one of the results or refine their query.

      What makes this feasible is that Google avoids hitting physical records unless the actual records are needed. To do this, its main internal data structure is a stream of record numbers, not a stream of tuples which contain data. (Z39.50, the successor to WAIS, calls this a "result set".) When you finally retrieve data, you can then go to the record store and retrieve the records associated with those record numbers.

      This is why it's hard for relational databases to achieve the performance of native text databases: Text databases go to a lot of trouble to optimise the kinds of operations that text users want to do, which are very different from the kinds of operations that relational data users want to do. With text databases, searching, sorting, ranking and presenting data to the client are all conceptually different things, each with their own performance characteristics and access patterns. With relational databases, they are all intermingled, because relational database users tend not to need that kind of granularity.
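
      To make the "stream of record numbers" idea concrete, here is a minimal sketch. It is not how Google or any particular engine is implemented; the record store, the function names and the page size are all assumptions made up for illustration. Searching returns only record numbers, and the record store is touched only for the page actually being displayed.

```python
from collections import defaultdict

# Hypothetical record store: record number -> full text.
RECORDS = {
    1: "full text search with inverted indexes",
    2: "relational databases return streams of tuples",
    3: "text search engines return record numbers first",
}

def build_index(records):
    """Map each word to the sorted list of record numbers containing it."""
    postings = defaultdict(set)
    for rec_no, text in records.items():
        for word in text.lower().split():
            postings[word].add(rec_no)
    return {word: sorted(nums) for word, nums in postings.items()}

def search(index, *words):
    """Return a result set: record numbers only, no record data."""
    sets = [set(index.get(w.lower(), ())) for w in words]
    return sorted(set.intersection(*sets)) if sets else []

def fetch_page(records, result_set, page, page_size=10):
    """Touch the record store only for the page being displayed."""
    start = page * page_size
    return [records[n] for n in result_set[start:start + page_size]]

INDEX = build_index(RECORDS)
hits = search(INDEX, "text", "search")          # record numbers only
first_page = fetch_page(RECORDS, hits, page=0)  # records fetched on demand
```

      Sorting and ranking can then operate on the list of record numbers without ever loading the records themselves.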

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    28. Re:Anti-XML by anomalous+cohort · · Score: 1

      Well, data structures and algorithms are related in that the choice of one affects the choice of the other. The classic example here is that linked lists are better for inserting than arrays but worse for searching.

      IMHO, XML handles hierarchy better than relational databases but it is possible to use either. Of course, real human understanding is not strictly hierarchical so I guess that it's a wash. Neither XML nor RDB is all that good at capturing meaning.

      You brought up a second point of who does the natural language processing, the human or the search algorithm? In Semantic Web, the publisher is doing the natural language processing and capturing meaning using XML.

    29. Re:Anti-XML by Anonymous Coward · · Score: 0

      __insert_your_own_programming_language_here__? C'mon, everyone knows you're talking about Python.

    30. Re:Anti-XML by gillbates · · Score: 1
      I have no idea whether the databases of the future will store their data in XML form or not

      Not likely. XML is designed to solve the data identification problem, not the data storage problem.

      Due to the hierarchical nature of XML, a validating parser must read the entire document before returning any results. Given the way that most parsers are designed, the entire document will be read into memory and first parsed, then validated. Which, of course, limits the size of your database to the machine's memory.

      The relational databases aren't going to go away. XML is more like a text-mode IMS for the PC. For those who have never heard of IMS, it was a hierarchical database built for IBM mainframes back in the 60's. When DB2 came out with its relational model, IMS was largely abandoned. XML is no different, except it incurs a significant performance penalty - it isn't indexed, so binary searches are automatically precluded. Instead, searching a large XML "database" would require using a linear search after loading the document into memory.

      But I think I've heard it best put by a manager, "XML forces your database to follow an arbitrary structure. It forces your database to be built around the presentation of the data, rather than the efficiency of retrieval." Until then, I hadn't thought about it in those terms, but I realized that he was exactly right. With a relational database, it's easy to represent many different views and relationships of the same data. But with XML, you're stuck with what the DTD designer believes necessary - the presentation in effect dictates the database structure.

      I like XML as an exchange format. But it's the COBOL of the new millennium. The inherent inefficiencies involved (the entire document must be read to extract a single node?), and the inflexibility of relationships (it will always be hierarchical), make it a poor candidate for a database format.

      Yes, I've used XML as a database format, with horrible results. A year or so ago, I used it for a financial application. I had written the frontend, and later decided to write an add-on piece which would compile statistics about the transaction data. Rather than being able to read a single record and do things like:

      RunningTotal += CurrentRecord.AmountPayable;

      I had to use syntax like:

      RunningTotal = RunningTotal + Float.parseFloat(DocumentRoot().NextRecord().Item("CurrentTransactions").Item("Transaction").Item("AmountPayable"));

      Now tell me that's an improvement!? And should I change my DTD, this code would be useless. Yes, I could add fields, but I can't otherwise change the structure of the document without invalidating all of my code.
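
      The contrast can be sketched in a few lines. This is only an illustration: the element names come from the snippet above, while the sample document, the `Records`/`Record` wrappers and the `TransactionRecord` class are assumptions invented for the example. Note how the XML version hard-wires the nesting path, which is the coupling being complained about.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical document; only the inner element names come from the comment.
XML = """
<Records>
  <Record><CurrentTransactions><Transaction>
    <AmountPayable>10.50</AmountPayable>
  </Transaction></CurrentTransactions></Record>
  <Record><CurrentTransactions><Transaction>
    <AmountPayable>4.25</AmountPayable>
  </Transaction></CurrentTransactions></Record>
</Records>
""".strip()

@dataclass
class TransactionRecord:
    amount_payable: float

def total_from_xml(xml_text):
    """Walk the tree; the path is hard-wired to the document structure."""
    root = ET.fromstring(xml_text)
    path = "Record/CurrentTransactions/Transaction/AmountPayable"
    return sum(float(node.text) for node in root.findall(path))

def total_from_records(records):
    """Typed records: field access does not depend on any nesting."""
    return sum(r.amount_payable for r in records)

xml_total = total_from_xml(XML)
record_total = total_from_records(
    [TransactionRecord(10.50), TransactionRecord(4.25)]
)
```

      If the nesting in the document changes, only `path` breaks in the XML version, but it breaks silently: `findall` simply returns nothing.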

      So XML isn't really as "universal" as the pundits would like to claim. In order to parse an XML document, you still need the DTD, and should the DTD change, your code will be broken. Here's a brief comparison of database formats:

      • XML:
        • Memory constrained
        • No update in place (requires entire file rewrite)
        • No fast search (1)
        • Not architecture independent (2)
        • Not self-documenting (3)
      • Comma delimited:
        • No update in place (requires entire file rewrite)
        • No fast search (1)
        • Not architecture independent (2)
        • Not self-documenting (3)
      • Fixed width:
        • Update in place
        • Fast search (on indexed/sorted fields)
        • Not architecture independent
        • Not self-documenting

      Okay, flame away. Basically, XML doesn't offer any advantages over the old fixed-width mainframe formats, except possibly that it can be edited with a text editor. But it's not like any self-respecting DBA would let a programmer edit a production database with a text editor, so it's a moot point anyway.
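
      The "fast search" and "update in place" entries for fixed-width files can be sketched as follows. This is a minimal illustration, not any real mainframe format: the 16-byte record layout, the key format and the function names are all assumptions. Because every record has the same size, record i starts at byte i * size, so a sorted file can be binary-searched by seeking, and one record can be overwritten without touching the rest.

```python
import io
import struct

# Hypothetical layout: 8-byte ASCII key + 8-byte IEEE double, no padding.
RECORD = struct.Struct("<8sd")

def write_records(rows):
    """Pack rows, sorted by key, into a fixed-width byte stream."""
    buf = io.BytesIO()
    for key, amount in sorted(rows):
        buf.write(RECORD.pack(key.encode("ascii"), amount))
    return buf

def find(buf, key, count):
    """Binary search by seeking to offset = index * record size."""
    target = key.encode("ascii")
    lo, hi = 0, count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        buf.seek(mid * RECORD.size)
        rec_key, amount = RECORD.unpack(buf.read(RECORD.size))
        if rec_key == target:
            return mid, amount
        if rec_key < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

def update_in_place(buf, index, key, amount):
    """Overwrite one record without rewriting the rest of the file."""
    buf.seek(index * RECORD.size)
    buf.write(RECORD.pack(key.encode("ascii"), amount))

# Keys must be exactly 8 characters in this sketch.
data = write_records([("00000003", 7.0), ("00000001", 1.5), ("00000002", 2.25)])
```

      The same code works on a real file opened in "r+b" mode, since it only relies on seek, read and write.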

      Notes:

      1. Since the record and field boundaries can be predicted in advance, a fixed width file doesn't need to pa

      --
      The society for a thought-free internet welcomes you.
    31. Re:Anti-XML by 2short · · Score: 1

      "Which (slightly OT) reminds me: has anyone here used an XML compression tool"

      I've looked at a few, but frankly, haven't seen the point. Several generic compression types (e.g. zip) are based on finding sequences in the data (e.g. "<SomeTagName") that are repeated, and hence they do very well with XML. I had some really big XML doc that whatever zip compression lib I was using for other stuff, with default options got down to ~15%, while some XML-specific compressor, after a bit of configuration bought me another 1 or 2%. Didn't seem worth it. Naturally, YMMV.
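
      A quick way to check this on your own data is sketched below. The generated document is a made-up stand-in (the tag names are assumptions), and the ratio you get will depend entirely on your real XML.

```python
import zlib

# Hypothetical XML with heavily repeated tag names, as in most
# record-style documents.
rows = "".join(
    f"<Transaction><Id>{i}</Id>"
    f"<AmountPayable>{i * 1.25:.2f}</AmountPayable></Transaction>"
    for i in range(1000)
)
document = f"<Transactions>{rows}</Transactions>".encode("utf-8")

# zlib implements DEFLATE, the algorithm used by zip; default options.
compressed = zlib.compress(document)
ratio = len(compressed) / len(document)
print(f"{len(document)} bytes -> {len(compressed)} bytes ({ratio:.0%})")
```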

      Anyway, I love XML. When I need to have my code send data to someone, I can say "It's in XML, here's a sample", and our format discussion is (almost) over. If it's really a lot of data, I'd rather say, "It's XML, zip compressed." than explain whatever less widely adopted XML specific compressor I'm using.

      As far as XML for Data Storage as opposed to interchange: There's a variety of arguments for or against XML for storage, but the big pro for me is this: For the interchange side, I've written code to send XML to somewhere or receive XML from somewhere. Now I can do something different for storage, or I can just define "somewhere" to be a file.

    32. Re:Anti-XML by Hayzeus · · Score: 1

      Actually, the tricky bit here is that the resulting data has to be text. The best solution I've seen so far involves generic compression and then base64 encoding, but I'm just wondering what other options might exist.
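
      For reference, the compress-then-base64 approach looks roughly like this in Python. It is only a sketch with made-up function names and sample data. Keep in mind that base64 turns every 3 bytes into 4 characters, so it gives back about a third of whatever the compression saved.

```python
import base64
import zlib

def pack_as_text(xml_text):
    """Compress, then base64-encode so the result is plain ASCII text."""
    raw = zlib.compress(xml_text.encode("utf-8"))
    return base64.b64encode(raw).decode("ascii")

def unpack_from_text(packed):
    """Reverse of pack_as_text."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")

# Hypothetical sample document.
sample = "<doc>" + "<item>value</item>" * 200 + "</doc>"
packed = pack_as_text(sample)
```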

    33. Re:Anti-XML by Anonymous Coward · · Score: 0

      EBCDIC is text. So is ASCII. So is Unicode.

  8. mirrors ? anyone ? by psycho_tinman · · Score: 1

    Everything beyond the TOC (which I loaded in my browser) is slashdotted. The problem with the links to the different articles is that they're not part of a tree hierarchy: I can't just say "wget all pages beyond point X", nor can I make a guess and do a regex download of all URLs with "search" in them, because some articles do not conform to that pattern.

    A tarball for offline browsing would be nice? Didn't see it on the page, though. Save you a part of a slashdotting, Tim.. how about it? :)

    1. Re:mirrors ? anyone ? by Anonymous Coward · · Score: 0


      Why doesn't slashdot just do a snapshot of a page if it's just text?
      Same thing google does right?
      After a link to the page, just have a little "[cached]" that you can see if it gets slashdotted...

    2. Re:mirrors ? anyone ? by Anonymous Coward · · Score: 1, Informative
  9. Re:Like...wow. by no+reason+to+be+here · · Score: 1

    Actually, this is one of the few times that someone used "like" correctly. The linked documents are not a textbook on searching; however, they are similar to a textbook on searching. It is, therefore, appropriate to use the preposition "like," since the linked essays are, in fact, like a textbook on searching.

  10. Bray's theorem by KoolDude · · Score: 3, Funny


    The essay series converges to text book when time tends to infinity. Proof is left as an exercise to the reader.

    --
    getSexySig(); /* returns sexy signature */
  11. This technology still exists? by Pathetic+Coward · · Score: 2, Funny

    Search technology. Hmmm. Wasn't that outsourced to India last month? Or was that last year? I just can't keep up with IT today.

    1. Re:This technology still exists? by smittyoneeach · · Score: 2, Insightful

      It will thrive until the Next Big Thing(tm) arrives, to "save us from the sad shortcomings of XML".

      XML's only real fault is that's it's been oversold, not unlike Object Oriented Programming and Java before it.

      --
      Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
  12. Why isn't "someone" Tim Bray by leoaugust · · Score: 5, Interesting

    I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

    "Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well-known names in XML, and has experience under his belt with another search engine project, Antarctica .....

    I just mean it in the sense that if he is having trouble getting his own ideas himself off the ground, what a challenge it will be for someone else to do so.

    Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.

    Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...

    --
    To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
    1. Re:Why isn't "someone" Tim Bray by wizarddc · · Score: 2, Informative
      Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...


      I thought that was just a myth?
      --
      Th
    2. Re:Why isn't "someone" Tim Bray by veecee_veecee · · Score: 1
      From the article (On Search: Backgrounder), on using Open Source tools:

      Each of the ones I've looked at has a problem (lightly/poorly maintained, scalability problems, lack of internationalization, awkward API).

      Good luck convincing him to go Open Source!

    3. Re:Why isn't "someone" Tim Bray by mbrinkm · · Score: 3, Informative

      "This is the last in my series of On Search essays. I've written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I'd like to change this part of the world. In short, I'd like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I'll write something on how it might get built.

      Naming the Baby An important piece of software needs to have a name, but that takes time and creativity and can wait; for now I'll just call this thing the Basic Resource Finder (BRF).

      Requirements Then a couple of non-requirements and a conclusion.

      BRF is Open-Source My heartfelt apologies to anyone still trying to make a go of it in server-side search; but that business is just so over. It always was a lousy business, nobody has ever made real money there on a sustained basis, and yet it's something that every Web deployment needs. For a substantial site you can easily drop six figures for a search engine, and all the bells and whistles that buys you are mostly not cost-effective.

      So BRF is going to be open-source. That doesn't mean that you can't make money with search software; it just means you have to do it in services. There are always going to be search deployments loaded with tricky implementation and deployment work: figuring out where the data is, aggregating it, cleaning it up, building the workflows so these things keep happening, maintaining some application-specific synonyms, the list goes on and on, and none of these things are free. And they are much better things to spend money on than software licenses."

      RTFA!

      The original submission is about his last essay and that essay starts with the above quote.

      And whoever moderated you up needs to RTFA also!

      --
      "Don't worry about people stealing an idea. If it's original, you will have to ram it down their throats." --Howard Aike
    4. Re:Why isn't "someone" Tim Bray by cutting · · Score: 1
      Go Open Source Tim ... and get the ball rolling.
      The ball is already rolling. Check out Lucene or Nutch. Either of these could be enhanced to support Tim's ideas. Volunteers? (I'm already working on it.)
    5. Re:Why isn't "someone" Tim Bray by gwhulbert · · Score: 2, Informative

      Tim Bray was one of the founders of Open Text Corporation ... they INVENTED the search engine.
      Digital (with whom they were working) "stole" the idea and opened Altavista 3 months before their IPO.
      I worked for Open Text for a year but after Tim left (just about the time the 1.0 draft of the XML spec appeared).

  13. ObHutz by sharkey · · Score: 3, Funny

    Mr. Simpson, this is the most blatant case of fraudulent advertising since my suit against the film, ``The Never-Ending Story''.

    --
    "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
  14. Re:Tim Bray? by Anonymous Coward · · Score: 0
    >>Wasn't he the geek on Riptide?

    Yes, Thom was his stage name IIRC

  15. Re:Like...wow. by khamar · · Score: 1
    From what I can see through the war-haze of /.'ing, these articles are more like a blog. Are we confusing "essay" and "like a textbook" with some random ideas?
    I really like this guy's comments, but would not confuse them with a textbook.

    Favorite idea: 'Turn on Search' built-in to Apache. This should be a standard feature.

    Of course, others have already started working on a flash version before this blog was written.

    --
    The first dog barks. All other dogs bark at the first dog.
  16. google cache by aarku · · Score: 1
  17. Slashdot search question by Glass+of+Water · · Score: 2, Interesting
    But there is some good stuff out there; for example Slashdot's search engine seems to run smooth, clean, and fast, but some poking around failed to reveal what it is: I wouldn't be surprised if it's just the Mysql search facility.
    Anybody know the answer to this one?
    --
    There are no trolls. There are no trees out here.
    1. Re:Slashdot search question by stoborrobots · · Score: 1

      I don't know... see for yourself, then come and tell us...

      The <a href="http://ask.slashcode.com/article.pl?sid=02/02/09/183217&mode=thread&tid=4" >comment on this page</a> suggests that you are right...

    2. Re:Slashdot search question by utopyr · · Score: 1

      There are more suggestions here that you might be right.

      Is this like Frequently-Asked-Magic-8-Ball?

    3. Re:Slashdot search question by ddilling · · Score: 1

      Smooth, clean, fast... and kinda stupid.

      Enter the precise title of this very article ("Learning about full text search" -- it will strip the hyphen anyway), and order by date: Your top hit will be "A.I. Helicopters" with this article hit #2.

      Even better, order by score: your top hit will be "C++ Answers From Bjarne Stroustrup" -- this article doesn't even appear on the first page of 30 hits.

      Okay, you say... maybe it's not searching the titles, but the article bodies only. Let's try "Tim bray XML search"... ordered by date, we get the same results (#2 to A.I. Helicopter's #1), ordered by score, this article again does not appear in the first 30 hits. By limiting the search to only the 'Developers' section, I finally got it to appear in the first page of results (at #5).

      A rather hit-or-miss tool at best, for trying to find something older than the last couple days, in here.

      --
      Mahnamahna!
  18. Or instead, talk to a librarian (the Register) by JPMH · · Score: 2, Interesting
    An interesting counterpoint to this story in the Register today:

    "A Quantum Theory of Internet Value" by Andrew Orlowski
    -- why librarians are better at finding the book you want than Google.

  19. Yeah, I know... Preview.... by stoborrobots · · Score: 3, Informative

    I don't know... see for yourself, then come and tell us... The comment on this page suggests that you are right...

  20. Mirror by Door-opening+Fascist · · Score: 4, Informative
    Since the site looks bogged down from the /.'ing, I've made a few mirrors:

    Mirror #1

    Mirror #2

    Mirror #3

    1. Re:Mirror by quartertone · · Score: 1

      Excellent work. Not sure how you were able to circumvent the /. takedown.
      One problem, however: It's just the front page. The meat of the information is still hiding on his server.

    2. Re:Mirror by Door-opening+Fascist · · Score: 1
      One problem, however: It's just the front page. The meat of the information is still hiding on his server.
      It was originally just the front page. I decided to get that up fast to get the load off the original server. I've just updated the mirror with the important links, but those took a little longer to fetch.
  21. page rank algorithm by bcrowell · · Score: 1
    Also, Google claims that links from pages that themselves have a lot of incoming links count for more, but I'm not actually sure they'd need to do that to get the results they do.
    Is Google's page rank algorithm really that mysterious? I know they fiddle with it in secret ways now and then to discourage abuse, but I heard the fundamental algorithm was basically pretty simple: something like finding the eigenvectors and eigenvalues of the matrix of links. (Not sure exactly what they do with these -- associate each page with the eigenvector of which it's the biggest component?) Is this wrong? Wouldn't it be pretty easy to reverse engineer the algorithm anyway?

    And when he says but I'm not actually sure they'd need to do that to get the results they do, can he be serious? If that wasn't the case, then it would really be encouraging link farms. If the eigenvalue/eigenvector approach is really how they do it, then it certainly does have this recursive property.
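
    For anyone who wants to see the recursive property in action, here is a minimal sketch of the published PageRank idea as plain power iteration. This is an illustration only, certainly not Google's production algorithm: the four-page web and the function name are made up, and the 0.85 damping factor is the value from the original PageRank paper. The resulting ranks are the principal eigenvector of the damped link matrix.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web. "a" and "b" each have exactly one incoming
# link, but a's comes from the heavily linked "hub" and b's from "d".
web = {"a": ["hub"], "b": ["hub"], "d": ["hub", "b"], "hub": ["a"]}
ranks = pagerank(web)
```

    In this toy web "a" ends up ranked well above "b" even though each has a single incoming link, because a's link comes from a page that is itself highly ranked. Counting raw incoming links would score them the same, which is what would make link farms pay off.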

    1. Re:page rank algorithm by bcrowell · · Score: 1

      (Replying to myself): Here is a site that claims to explain Google page rank completely. I found it by doing a Google search on 'google "page rank"', and I assume it's pretty authoritative, because it had the highest page rank :-)

  22. "long departed Open Text index?" Not by Anonymous Coward · · Score: 2, Informative

    It just has a new name, and it's being developed by librarians.
    http://www.dlxs.org/products/xpat.html

  23. Searching and Sorting by andy666 · · Score: 0

    It seems that both searching and sorting are very important topics. Someone should write a good thorough book on them.

    1. Re:Searching and Sorting by alw53 · · Score: 1



      And maybe discuss the actual algorithms
      instead of the UI.

  24. So, where can I find it? by mod_parent_down · · Score: 1

    I've been looking all over...

  25. It's the FAKELSTEIN TROLL by Anonymous Coward · · Score: 0

    Trolling has been implemented.

  26. RBTFL Re:Why isn't "someone" Tim Bray by leoaugust · · Score: 1

    Sure we did RTFA. Can you Read Between The F* Lines RBTFL ?

    Here is what Tim says:

    This essay is about what that software should look like. Early next year I'll write something on how it MIGHT get built.

    So BRF is going to be open-source.

    I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

    And if the following is not Consultant-Speak I don't know what is - Consultants are great at telling you why you should not be doing what you are doing. They might even tell you what you should be doing - but do they ever do anything except collect big fees. I have been on both sides and I know people who talk and people who walk. Stop talking and start walking.

    always going to be search deployments loaded with tricky implementation and deployment work

    figuring out where the data is,

    aggregating it,

    cleaning it up,

    building the workflows so these things keep happening,

    maintaining some application-specific synonyms,

    the list goes on and on,

    --
    To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
  27. it's geared for public consumption by Anonymous Coward · · Score: 1, Insightful

    it's geared for public consumption,
    such is the nature of websites,
    so as long as you don't pretend you wrote it,
    it's abundantly clear where the original came from,
    go ahead and mirror (by mirror i mean take a snapshot).

    only if a copyright holder says don't do that should you remove it.

  28. searching using php perl and mysql by chrisranjana.com · · Score: 2, Interesting

    More search related functions should be available to php and perl and built in to them .. Even Mysql too...

    --
    Chris ,
    Php Programmers.
    1. Re:searching using php perl and mysql by Pseudonym · · Score: 1

      Both Perl and PHP already have Z39.50 support to connect to full-text search engines.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
  29. another pagerank discussion by goon · · Score: 1

    google broken? (www.google-watch.org)

    "... unique ID for each page stored as ansi c, 4 bytes on Linux system (~4yo) gives theoretical limit of 4.2 billion pages. ..."

    discusses the move to 5 bytes and suggests how this move may be the cause of weird search results on google searches this year - of course the other reason may be google foiling search queue jumpers.
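
    The arithmetic behind the quoted limit is easy to check. This only computes how many distinct values an unsigned ID of a given width can hold; it says nothing about how Google actually stores page IDs, and the function name is made up.

```python
def id_capacity(num_bytes):
    """Number of distinct values an unsigned ID of this width can hold."""
    return 2 ** (8 * num_bytes)

four_byte_ids = id_capacity(4)   # 4,294,967,296 (the "4.2 billion" limit)
five_byte_ids = id_capacity(5)   # 1,099,511,627,776 (about 1.1 trillion)
```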

    --
    peterrenshaw ~ Another Scrappy Startup
  30. UI you say - check out www.geninterface.com by wheatking · · Score: 1

    in the post-google world, UIs like the General Interface will appear. check out their demo at Integrated Web Services and no i dont work there. i just like the direction they are going in.

  31. General Interface? by sean.peters · · Score: 1

    If they're so general, how come I get this when I try to view the sample apps?

    Sample Applications

    General Interface Objects currently supports Internet Explorer 5.5 and later browsers running on Windows. For access to the sample applications please use another browser.

    I guess "general" means "IE only".

    Sean

  32. Searchlores by Anonymous Coward · · Score: 0
    Fravia was once the biggest name in reverse engineering. His webpage was a reverse-engineering blog as far back as 1995, and he was instrumental in getting good reverse-engineers to talk together and teach each other their tricks.

    The one remaining active mirror of his site is at http://www.woodmann.com/fravia. The messageboard at http://www.woodmann.com/upload is still the best place to go for reverse-engineering windows code; no crack requests, serial requests, or target-specific code are allowed, but you can address particular copy protections by name.

    Fravia has since moved on to reverse-engineering search engines. If you want to find the stuff that doesn't turn up at the top of a google search, start here.

  33. I think his searching technique needs some work by cbreaker · · Score: 1

    From the site: "It has fifteen instalments not including this table of contents."

    Last I searched the dictionary, it was "installments."

    I guess alphabetical searching is best after all.

    --
    - It's not the Macs I hate. It's Digg users. -
    1. Re:I think his searching technique needs some work by tepples · · Score: 1

      At least American Heritage Dictionary lists "instalment" as a recognized variant.

  34. surviving a slashdotting by Habbie · · Score: 1

    Have you considered 'KeepAlive Off'? One busy site I admin for has been up 100% ever since I did that, used to go down (gigs of swap in use) every night..