Learning About Full-text Search

Posted by michael on Thursday December 18, 2003 @02:15AM from the looks-easy-but-isn't dept.

An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."

re-inventing the wheel by peter303 · 2003-12-18 02:34 · Score: 1, Interesting

Try Knuth Vol 3.
Anti-XML by MattRog · 2003-12-18 02:38 · Score: 4, Interesting

Whether there's going to be a lot of XML around in repositories to search. XML these days is more used in interchange rather than archival applications.

Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries on them with minimal effort.
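
A minimal sketch of the "chop up HTML pages and stick them in the DBMS" idea, under loud assumptions: SQLite (via Python's standard library) stands in for the DBMS products named above, and a hand-built `postings` table stands in for a vendor's built-in full-text index, which would normally do this work for you. The sample pages and the names `html_to_text`, `build_index`, and `search` are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page, dropping the tags.

    Note: text inside <script> or <style> would be collected too;
    a real indexer would filter those out.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def build_index(pages):
    """Store stripped pages plus a word -> page postings table in SQLite."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")
    db.execute(
        "CREATE TABLE postings (word TEXT, url TEXT, PRIMARY KEY (word, url))"
    )
    for url, html in pages.items():
        text = html_to_text(html)
        db.execute("INSERT INTO pages VALUES (?, ?)", (url, text))
        # Naive tokenizer (assumption): split on whitespace, trim punctuation.
        words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
        db.executemany(
            "INSERT OR IGNORE INTO postings VALUES (?, ?)",
            [(w, url) for w in words if w],
        )
    db.commit()
    return db


def search(db, word):
    """Return the URLs of pages containing the word, in sorted order."""
    rows = db.execute(
        "SELECT url FROM postings WHERE word = ? ORDER BY url", (word.lower(),)
    )
    return [url for (url,) in rows]


# Hypothetical sample pages, for illustration only.
PAGES = {
    "/a.html": "<html><body><h1>Search</h1><p>Full-text indexing is hard.</p></body></html>",
    "/b.html": "<html><body><p>XML is used for <b>interchange</b>.</p></body></html>",
}

db = build_index(PAGES)
print(search(db, "indexing"))
print(search(db, "xml"))
```

With a DBMS that ships its own full-text indexing, the `postings` table and tokenizer would presumably be replaced by that product's index and query syntax; only the tag-stripping step would stay.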

--

Thanks,
--
Matt
1. Re:Anti-XML by arrogance · 2003-12-18 03:02 · Score: 4, Interesting
  
  He even goes so far as to mention that Index Server will search your website, but fails to mention that it does full text searching on your entire file system.
  
  Unfortunately his site (http://www.tbray.org/ongoing/) seems to be sufficiently disorganized that I have trouble finding out what his real points are, or whether he's addressed all of the issues: for example, I saw no mention of the Semantic Web if his concern is searchability on web documents.
  
  As a side note, MS SQL is going more and more toward XML, as is the whole .NET framework. This results in richer (read: fatter) data but it does mean that you can store whatever metadata you want along with it.
2. Re:Anti-XML by I8TheWorm · 2003-12-18 04:17 · Score: 2, Interesting
  
  I tend to get on an XML soap (no pun intended) box when I see articles about it, so here goes...
  
  XML is great for sharing data between non-congruous systems. It's horrible, however, for storing data in any large quantity, and even more horrible for treating as a searchable text file. It's inherently large and full of ASCII/ANSI/UTF characters that are completely unnecessary when performing byte-by-byte text searches. For large amounts of data, you're right... RDBMS is the current way to go... maybe OODBMS will be in the near future, but I still haven't tinkered with it myself and don't have any opinions developed yet.
  
  XML is not the data end-all... same as __insert_your_own_programming_language_here__ is not the end-all of programming. It's a nice tool, but tends to be overused because it's still a buzz-word.
  
  --
  Saying Android is a family of phones is akin to saying Linux is a family of PCs.
3. Re:Anti-XML by DrVomact · 2003-12-18 04:34 · Score: 3, Interesting
  
  The reason why XML is widely used today for a multitude of purposes (e.g., data interchange between otherwise incompatible systems, configuration files, technical documents, command protocols that communicate with servers, etc. etc.) and why it will be used for even more stuff in the future is that it is centered on a very simple and powerful idea: self-documenting data. That is, the data is structured by internal markers that give information about the type of information contained in each logical element of the data stream (or file). Naturally, the XML geeks are doing everything they can to complicate this simple idea, but I digress...
  
  Because XML files are structured, self-documenting text files that correspond to a formal definition (I know DTDs aren't technically required for "well-formed" XML, but you really don't want to do that), you can rely on your data being usable without making assumptions about the type of systems that will use it (OS is irrelevant, applications can have front and back ends that understand XML). Moreover, this compatibility isn't going to go away: it's just pure text--we will always be able to read it.
  
  I have no idea whether the databases of the future will store their data in XML form or not. I'm not a database expert, but I suspect there are more efficient ways of storing and searching information than in huge chunks of tagged text. However, while a database that stores its data in a rigid table format may be quicker and more compact, it cannot preserve the richness of meta-information contained in an XML tagging system. If you put your XML into a traditional database, you won't be able to take advantage of being able to make searches based on information in the tags.
  
  Be that as it may, the fact remains that you will at least be able to feed your database XML, and get XML out of it. That means that the XML front end will parse the XML input data, and will be able to figure out how to organize your data in its innards based on the information provided by the XML tags in the data you give it. When you make a query (probably formatted as XML by the query software), the data will be returned as XML using the tag scheme specified in your DTD.
  
  Another consideration is that you can store much more information about your data with an XML tagging scheme than you can with any database format--and you can communicate that information when you send your data to someone else, because the metadata is part of the data. I work with huge texts (technical manuals, actually) and I heartily welcome the flexibility and usefulness of being able to identify parts of that text based on any criteria that are meaningful to me or the consumer of the data.
  
  --
  Great men are almost always bad men--Lord Acton's Corollary
Re:web page irony by arrogance · 2003-12-18 02:45 · Score: 3, Interesting

Well, especially when it's been slashdotted. Here's a Google cache hit to part of his writings.

I agree that it doesn't look easy to search around, at least when all you have is a URL to go on (http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC) and Google to find reachable material. I'm also not too sure about using dates as folder names, but that's just a personal thing: I think Tim Berners-Lee recommended it at one point in an article, "Cool URIs don't change". He does recommend using "Latest" or some such instead of the creation date in a URI, though, if "there is no reason for the persistence of the URI to outlast the magazine." It might make things easier to search for though, at least if you know when it was created: if the URIs aren't changing then you won't have tons of broken links.
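
To illustrate the point about dates in URIs: a rough sketch, assuming the `/YYYY/MM/DD/` path layout seen in the URI quoted above, of recovering a page's creation date from the URI alone. The function name `creation_date` and the regular expression are hypothetical, not anything the site itself provides.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Assumed layout: a /YYYY/MM/DD segment somewhere in the path.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})(?:/|$)")


def creation_date(uri):
    """Return the date embedded in the URI path, or None if there is none."""
    match = DATE_IN_PATH.search(urlparse(uri).path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits were present but do not form a real calendar date.
        return None


uri = "http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC"
print(creation_date(uri))
```

Because the date is part of a stable URI, a crawler or a reader could filter or sort pages by creation date without fetching them, which is the "easier to search for" benefit suggested above.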
Why isn't "someone" Tim Bray by leoaugust · 2003-12-18 02:46 · Score: 5, Interesting

I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

"Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well-known names in XML, and has experience under his belt with another search engine project, Antarctica .....
I just mean it in the sense that if he is having trouble getting his own ideas off the ground himself, what a challenge it will be for someone else to do so.
Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.
Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...

--
To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
Slashdot search question by Glass+of+Water · 2003-12-18 03:32 · Score: 2, Interesting

But there is some good stuff out there; for example Slashdot's search engine seems to run smooth, clean, and fast, but some poking around failed to reveal what it is: I wouldn't be surprised if it's just the MySQL search facility.
Anybody know the answer to this one?

--
There are no trolls. There are no trees out here.
Or instead, talk to a librarian (the Register) by JPMH · 2003-12-18 03:48 · Score: 2, Interesting

An interesting counterpoint to this story in the Register today:
"A Quantum Theory of Internet Value" by Andrew Orlowski
-- why librarians are better at finding the book you want than Google.
Re:Salute by antarctican · 2003-12-18 05:16 · Score: 4, Interesting

..and has been /.'ed once or twice..

You mean two or three times now.

And it's my poor server that has to bear the burden.... but so far it's held up fairly well each time. Pretty good for a Celeron 1.7GHz w/ 256M. :)

However this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.

Funny thing is I didn't even know he was /.'ed until he emailed me. I went to check my email (via pine) and the console was as responsive as usual.

For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content. :)
searching using php perl and mysql by chrisranjana.com · 2003-12-18 06:10 · Score: 2, Interesting

More search-related functions should be available to PHP and Perl, and built into them... and into MySQL too.

--
Chris ,
Php Programmers.