Learning About Full-text Search

Posted by michael on Thursday December 18, 2003 @02:15AM from the looks-easy-but-isn't dept.

An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."

Hold on there by arvindn · 2003-12-18 02:19 · Score: 5, Funny

...has been writing this endless series of essays on search technology since summer. He says he's finished now...
Finished an endless series?
1. Re:Hold on there by MooCows · 2003-12-18 02:22 · Score: 4, Funny
  
  The maximum number of results have been returned.
  
  --
  The path I walk alone is endlessly long.
  30 minutes by bike, 15 by bus.
Re:web page irony by Dreadlord · 2003-12-18 02:23 · Score: 5, Funny

too bad his pages are valid XHTML documents, it would have made an excellent +5 funny comment :(

--
The IT section color scheme sucks.
poor guy by understyled · 2003-12-18 02:27 · Score: 5, Informative

i cringe at the bandwidth demands a slashdotting can bring with it. here's google's cache of the page.

--
Sig (appended to the end of comments you post, 120 chars)
1. Re:poor guy by martingunnarsson · 2003-12-18 02:52 · Score: 4, Insightful
  
  If Google can cache pages and put them online, so should Slashdot. People say copyright issues would be a problem, but in that case, why is Google's online cache any better?
  
  --
  Martin
2. Re:poor guy by Arslan+ibn+Da'ud · 2003-12-18 03:33 · Score: 4, Informative
  
  Slashdot has already considered this. RTFFAQ
  
  --
  Practice Kind Randomness and Beautiful Acts of Nonsense.
3. Re:poor guy by davew2040 · 2003-12-18 03:43 · Score: 4, Insightful
  
  And they considered incorrectly.
4. Re:poor guy by johnteslade · 2003-12-18 04:33 · Score: 5, Informative
  The site is still slashdotted. Each of his papers is on a separate page, so here are the Google caches of the individual papers:
  
  Table of Contents
  
  Chapter 1: Backgrounder
  
  Chapter 2: The Users
  
  Chapter 3: Basic Basics
  
  Chapter 4: Precision and Recall
  
  Chapter 5: Intelligence
  
  Chapter 6: Squirmy Words
  
  Chapter 7: UI Archeology (No Cache)
  
  Chapter 8: Stopwords
  
  Chapter 9: Metadata
  
  Chapter 10: I18n
  
  Chapter 11: Result Ranking
  
  Chapter 12: Interfaces
  
  Chapter 13: XML
  
  Chapter 14: Robots
  
  I have to write some crap at the end here so i can get past the "Your comment has too few characters per line" error message.
Anti-XML by MattRog · 2003-12-18 02:38 · Score: 4, Interesting

Whether there's going to be a lot of XML around in repositories to search. XML these days is more used in interchange rather than archival applications.

Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.

Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries on them with minimal effort.

--

Thanks,
--
Matt
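The "chop up HTML pages and stick them in the DBMS" idea from the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: SQLite stands in for the DBMS, a hand-built `postings` table stands in for the built-in full-text index (real products expose that through their own syntax, which is not shown here), and the helper names and page URLs are hypothetical.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """pages: dict mapping a (hypothetical) URL to its HTML source."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE postings (term TEXT, url TEXT, PRIMARY KEY (term, url))"
    )
    for url, html in pages.items():
        terms = set(tokenize(html_to_text(html)))
        db.executemany(
            "INSERT INTO postings (term, url) VALUES (?, ?)",
            [(term, url) for term in terms],
        )
    return db


def search(db, query):
    """Return the URLs that contain every word of the query (AND semantics)."""
    terms = sorted(set(tokenize(query)))
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    rows = db.execute(
        f"SELECT url FROM postings WHERE term IN ({placeholders}) "
        "GROUP BY url HAVING COUNT(*) = ? ORDER BY url",
        terms + [len(terms)],
    )
    return [url for (url,) in rows]
```

A call such as `search(build_index(pages), "full text search")` returns only the pages containing all three words. There is no ranking, stemming, or stopword handling here.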
1. Re:Anti-XML by phurley · 2003-12-18 02:55 · Score: 5, Informative
  
  I agree to a point, but if we are talking about a mixed environment where you are using Oracle, I am using DB2, and our friend Bob has his data in a legacy ISAM setup, and a customer wants to integrate a search system across the three systems, they are going to have to write a lot of custom glue.
  
  If an XML aspect of the data is available (you can still keep it all in Oracle - just provide a "view" of it in XML) from each of us - common search tools and methods can be utilized.
  
  --
  Home Automation & Linux -- now I know I'm a geek
2. Re:Anti-XML by arrogance · 2003-12-18 03:02 · Score: 4, Interesting
  
  He even goes so far as to mention that Index Server will search your website, but fails to mention that it does full text searching on your entire file system.
  
  Unfortunately his site (http://www.tbray.org/ongoing/) seems to be sufficiently disorganized that I have trouble finding out what his real points are, or whether he's addressed all of the issues: for example, I saw no mention of the Semantic Web if his concern is searchability on web documents.
  
  As a side note, MS SQL is going more and more toward XML, as is the whole .NET framework. This results in richer (read: fatter) data but it does mean that you can store whatever metadata you want along with it.
3. Re:Anti-XML by anomalous+cohort · 2003-12-18 03:17 · Score: 4, Insightful
  
  From the google cache...
  searching for words isn't really what you want to do. You'd like to search for ideas, for concepts, for solutions, for answers.
  That is why he is going in an XML direction. The relational approach is rectilinear and requires that the information be framed in a highly normalized fashion. Generalized semantic searching is highly non-normalized because, well, humans are highly non-normalized.
  
  I think that he should look at some work by a different Tim, the Semantic Web.
Re:web page irony by Schwarzchild · 2003-12-18 02:45 · Score: 4, Informative

He writes about searching technology, but you can't easily search through his writings.
Really? How about search site:tbray.org?

--
"sweet dreams are made of this..."
Why isn't "someone" Tim Bray by leoaugust · 2003-12-18 02:46 · Score: 5, Interesting

I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.

"Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well known names in XML, has experience under his belt with another Search Engine Project Antarctica .....
I just mean it in the sense that if he is having trouble getting his own ideas off the ground himself, what a challenge it will be for someone else to do so.
Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.
Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...

--
To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
Re:re-inventing the wheel by Anonymous Coward · 2003-12-18 03:44 · Score: 4, Insightful

Try reading the articles/essays. Knuth's vol 3 is about comparison search, not full-text search.
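To make the distinction concrete, here is a minimal sketch, not taken from Knuth or from the essays, with illustrative function names. Comparison search locates a whole key in an ordered collection by comparing keys. Full-text search finds the documents that contain a word, typically through an inverted index.

```python
import bisect
from collections import defaultdict


def comparison_search(sorted_keys, key):
    """Binary search: position of an exact key in a sorted list, or -1."""
    i = bisect.bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else -1


def build_inverted_index(docs):
    """Map each word to the set of document numbers it appears in."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def full_text_search(index, word):
    """Return the sorted document numbers whose text contains the word."""
    return sorted(index.get(word.lower(), set()))
```

The first function answers "is this exact key present, and where?"; the second answers "which documents mention this word?", which is the problem the essays are about.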
Mirror by Door-opening+Fascist · 2003-12-18 04:11 · Score: 4, Informative

Since the site looks bogged down from the /.'ing, I've made a few mirrors:
Mirror #1
Mirror #2
Mirror #3
Re:Salute by antarctican · 2003-12-18 05:16 · Score: 4, Interesting

..and has been /.'ed once or twice..

You mean two or three times now.

And it's my poor server that has to bear the burden.... but so far it's held up fairly well each time. Pretty good for a Celeron 1.7GHz w/ 256M. :)

However this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.

Funny thing is I didn't even know he was /.'ed until he emailed me. I went to check my email (via pine) and the console was as responsive as usual.

For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content. :)