Learning About Full-text Search
An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."
Try Knuth Vol 3.
Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.
Also, most SQL DBMSs (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take is to chop up HTML pages and stick them in the DBMS. Then you can perform rich-text queries on them with minimal effort.
Thanks,
--
Matt
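For anyone wondering what "chop up HTML pages and index them" boils down to, here is a minimal sketch in plain Python. It is not what Oracle, Sybase or MS SQL actually do internally; it swaps the DBMS full-text index for a toy in-memory inverted index, and the page names, page contents and function names (html_to_text, build_index, search) are all made up for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """'Chop up' a page: keep only the text between the tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, html in pages.items():
        for word in tokenize(html_to_text(html)):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return ids of pages that contain every word in the query (AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, purely for illustration.
PAGES = {
    "a.html": "<html><body><h1>Full-text search</h1>"
              "<p>Inverted indexes map words to documents.</p></body></html>",
    "b.html": "<p>XML is a <b>markup</b> language, not a search index.</p>",
}

if __name__ == "__main__":
    index = build_index(PAGES)
    print(search(index, "full text"))
    print(search(index, "search index"))
```

A real DBMS full-text index adds stemming, ranking, phrase queries and on-disk storage on top of this, which is roughly the "minimal effort" being bought by using one.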
Well, especially when it's been slashdotted. Here's a Google cache hit to part of his writings.
I agree that it doesn't look to be easy to search around, at least when all you have is a URL to go on (http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC) and Google to find reachable material. I'm also not too sure about using dates as folder names, but that's just a personal thing: I think Tim Berners-Lee recommended it at one point in an article, "Cool URIs Don't Change". He does recommend using "Latest" or some such instead of the creation date in a URI, though, if "there is no reason for the persistence of the URI to outlast the magazine." It might make things easier to search for though, at least if you know when it was created: if the URIs aren't changing then you won't have tons of broken links.
"Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well-known names in XML, and has experience under his belt with another search engine project, Antarctica .....
I just mean it in the sense that if he is having trouble getting his own ideas off the ground himself, what a challenge it will be for someone else to do so.
Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.
Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...
To see a world in a grain of sand, and then to step back and see the beach where the sand lies
There are no trolls. There are no trees out here.
"A Quantum Theory of Internet Value" by Andrew Orlowski
-- why librarians are better at finding the book you want than Google.
..and has been /.'ed once or twice..
:)
You mean two or three times now.
And it's my poor server that has to bear the burden... but so far it's held up fairly well each time. Pretty good for a Celeron 1.7GHz w/ 256M.
However, this time was particularly bad because of it being a series of essays. I just increased the number of instances of Apache by 66% and doubled the number of requests before a child dies. That seems to have brought some responsiveness back.
Funny thing is I didn't even know he was /.'ed until he emailed me. I went to check my email (via pine) and the console was as responsive as usual.
:)
For the geeks who enjoy technical detail... it's running on an Inspire cube PC, one of those little cubes with a mini-ATX in it. Shows you don't need a lot of horsepower to serve static content.
More search-related functions should be available to PHP and Perl and built into them... even MySQL too...
Chris ,
Php Programmers.