Learning About Full-text Search
An anonymous reader writes "Tim Bray who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."
You mean two or three times now.
Trolling is a art,
He writes about searching technology, but you can't easily search through his writings.
Finished an endless series?
i cringe at the bandwidth demands a slashdotting can bring with it. here's google's cache of the page.
Try Knuth Vol 3.
Though, I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet.
clifgriffin > blog
Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.
Also, most SQL DBMS products (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take is to chop up the HTML pages and stick them in the DBMS. Then you can run full-text queries on them with minimal effort.
Thanks,
--
Matt
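For anyone wondering what the "chop up HTML pages and index them" step amounts to, here is a minimal sketch of an inverted index in plain Python. It is not how any of the DBMS products named above implement full-text indexing; the names `TextExtractor`, `html_to_text`, `build_index` and `search` are made up for illustration, and the tokenizer (lowercase ASCII letters and digits only) is an assumption.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags; tag names and attributes are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip markup from an HTML string (script/style bodies are NOT filtered)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Assumed tokenizer: runs of lowercase ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, html in pages.items():
        for word in tokenize(html_to_text(html)):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A query is then a set intersection over the posting sets, e.g. `search(build_index(pages), "xml search")`. Ranking, stemming and phrase queries are left out on purpose.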
Everything beyond the TOC (which I loaded in my browser) is slashdotted. The problem with the links to the different articles is that they're not part of a tree hierarchy. I can't just say "wget all pages beyond point X", nor can I make a guess and do a regex download of all URLs with "search" in them, because some articles do not conform to that pattern.
A tarball for offline browsing would be nice? Didn't see it on the page, though. Save you a part of a slashdotting, Tim.. how about it? :)
Actually, this is one of the few times that someone used "like" correctly. The linked documents are not a textbook on searching; however, they are similar to a textbook on searching. It is, therefore, appropriate to use the preposition "like," since the linked essays are, in fact, like a textbook on searching.
my pet machine
The essay series converges to a textbook as time tends to infinity. Proof is left as an exercise to the reader.
getSexySig();
Search technology. Hmmm. Wasn't that outsourced to India last month? Or was that last year? I just can't keep up with IT today.
"Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well known names in XML, and has experience under his belt with another Search Engine Project, Antarctica .....
I just mean it in the sense that if he is having trouble getting his own ideas off the ground himself, what a challenge it will be for someone else to do so.
Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.
Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...
To see a world in a grain of sand, and then to step back and see the beach where the sand lies
Mr. Simpson, this is the most blatant case of fraudulent advertising since my suit against the film, ``The Never-Ending Story''.
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
Yes, Thom was his stage name IIRC
I really like this guy's comments, but would not confuse them with a textbook.
Favorite idea: 'Turn on Search' built-in to Apache. This should be a standard feature.
Of course, others have already started working on a flash version before this blog was written.
The first dog barks. All other dogs bark at the first dog.
here
There are no trolls. There are no trees out here.
"A Quantum Theory of Internet Value" by Andrew Orlowski
-- why librarians are better at finding the book you want than Google.
I don't know... see for yourself, then come and tell us... The comment on this page suggests that you are right...
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Mirror #1
Mirror #2
Mirror #3
Is Google's PageRank algorithm really that mysterious? I know they fiddle with it in secret ways now and then to discourage abuse, but I heard the fundamental algorithm was basically pretty simple: something like finding the eigenvectors and eigenvalues of the matrix of links. (Not sure exactly what they do with these -- associate each page with the eigenvector of which it's the biggest component?) Is this wrong? Wouldn't it be pretty easy to reverse engineer the algorithm anyway?
And when he says "but I'm not actually sure they'd need to do that to get the results they do," can he be serious? If that weren't the case, then it would really be encouraging link farms. If the eigenvalue/eigenvector approach is really how they do it, then it certainly does have this recursive property.
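On the eigenvector question: in the published PageRank formulation, a page's rank is its component in the principal eigenvector of the damped link matrix, and that vector can be approximated by power iteration. The sketch below is only that textbook version, not whatever Google runs in production (which isn't public). The function name `pagerank`, the toy link graph in the usage note, and the damping value 0.85 (the figure usually quoted from the original paper) are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100, tol=1e-12):
    """Approximate PageRank by power iteration.

    links: dict mapping a page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # The recursive part: a page passes its current rank to the pages
        # it links to, split evenly across its outgoing links.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)

        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break

    return rank
```

With a toy graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page "c" ends up on top because it collects rank from three pages, while "d", which nobody links to, is left with only the baseline share. That is the recursive property in miniature: a link is worth as much as the page it comes from, so a farm of pages with no inbound rank has little to pass on.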
Find free books.
It just has a new name, and it's being developed by librarians.
http://www.dlxs.org/products/xpat.html
It seems that both searching and sorting are very important topics. Someone should write a good thorough book on them.
I've been looking all over...
Trolling has been implemented.
Sure we did RTFA. Can you Read Between The F* Lines RBTFL ?
Here is what Tim says:
This essay is about what that software should look like. Early next year I'll write something on how it MIGHT get built.
So BRF is going to be open-source.
I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.
And if the following is not Consultant-Speak I don't know what is - Consultants are great at telling you why you should not be doing what you are doing. They might even tell you what you should be doing - but do they ever do anything except collect big fees? I have been on both sides and I know people who talk and people who walk. Stop talking and start walking.
always going to be search deployments loaded with tricky implementation and deployment work
figuring out where the data is,
aggregating it,
cleaning it up,
building the workflows so these things keep happening,
maintaining some application-specific synonyms,
the list goes on and on,
To see a world in a grain of sand, and then to step back and see the beach where the sand lies
it's geared for public consumption,
such is the nature of websites,
so as long as you don't pretend you wrote it,
it's abundantly clear where the original came from,
go ahead and mirror (by mirror i mean take a snapshot).
only if a copyright holder says don't do that should you remove it.
More search-related functions should be available to PHP and Perl and built in to them .. even MySQL too...
Chris ,
Php Programmers.
google broken? (www.google-watch.org)
"... unique ID for each page stored as ansi c, 4 bytes on Linux system (~4yo) gives theoretical limit of 4.2 billion pages. ..."
It discusses the move to 5 bytes and suggests that this move may be the cause of weird Google search results this year - of course, the other reason may be Google foiling search queue jumpers.
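The arithmetic behind the ID limit is easy to check. This only shows how many distinct values fit in a fixed-width unsigned ID; whether Google's page IDs were actually stored this way is the linked article's claim, not something verified here. The helper name `max_ids` is made up for illustration.

```python
def max_ids(num_bytes):
    """Number of distinct values an unsigned ID of num_bytes bytes can hold."""
    return 2 ** (8 * num_bytes)


print(max_ids(4))  # 4294967296, roughly 4.29 billion
print(max_ids(5))  # 1099511627776, roughly 1.1 trillion
```

So going from 4 to 5 bytes multiplies the ceiling by 256.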
peterrenshaw ~ Another Scrappy Startup
in the post-google world, UIs like the General Interface will appear. check out their demo at Integrated Web Services, and no, i don't work there. i just like the direction they are going in.
If they're so general, how come I get this when I try to view the sample apps?
Sample Applications
General Interface Objects currently supports Internet Explorer 5.5 and later browsers running on Windows. For access to the sample applications please use another browser.
I guess "general" means "IE only".
Sean
The one remaining active mirror of his site is at http://www.woodmann.com/fravia. The messageboard at http://www.woodmann.com/upload is still the best place to go for reverse-engineering Windows code; no crack requests, serial requests, or target-specific code are allowed, but you can address particular copy protections by name.
Fravia has since moved on to reverse-engineering search engines. If you want to find the stuff that doesn't turn up at the top of a google search, start here.
From the site: "It has fifteen instalments not including this table of contents."
Last I searched the dictionary, it was "installments."
I guess alphabetical searching is best after all.
- It's not the Macs I hate. It's Digg users. -
Have you considered 'KeepAlive Off'? One busy site I admin has been up 100% ever since I did that. It used to go down (gigs of swap in use) every night.