Learning About Full-text Search
An anonymous reader writes "Tim Bray, who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."
fp linux nerd
You mean two or three times now.
Trolling is a art,
He writes about searching technology, but you can't easily search through his writings.
toaster,toaster toaser, do you have toast in you yet i think
so!!!!!!!!!!!!!!!!!!!Im not a toaster!!!!!!!!!!And one more
thing........YOUR A TOASER!!!!!!!!!!!!!! AND A COOKIE WITH MILK SOAGE
MILK!!!!!!!!!!AND A BUTT WITH POOP IN IT!!!!!!!!!!!!!!!!
Finished an endless series?
how gay of a book to write
Wasn't he the geek on Riptide?
OSAKA -- A man who spat on a high school student was arrested after being caught-red-handed at a station in Settsu, Osaka Prefecture, police said Tuesday.
The man, Kazuhiko Ukita, 31, was arrested for spitting on the back of the 17-year-old student at JR Senrioka Station in Settsu at about 7:25 a.m. on Tuesday.
Investigators said the student was using JR to get to Kyoto Station while she was on her way to school. She had noticed what appeared to be spit on her uniform about 10 times in the past, and had discussed the problem with the Kyoto Prefectural Police railway police unit.
A member of the unit accompanied the student as she was on her way to school on Tuesday, and caught Ukita, a resident of Suita, Osaka Prefecture, spitting at her.
Police said they suspected Ukita was involved in other similar crimes and they were continuing to question him.
Although it looked like a decent set of articles, I'm sure it would only be useful to me as a sleep agent...
I cringe at the bandwidth demands a slashdotting can bring with it. Here's Google's cache of the page.
Anderson in the Burger World drive thru: Did you get the order? Hello?
Butthead through the loudspeaker: We're like closed or something?
Try Knuth Vol 3.
Though, I'm unaware of how to apply this to my life. I think I'll take it and put it in the "Unaware of How to Apply This to My Life" Stack with The Simpsons and The Internet.
clifgriffin > blog
Why the fascination with XML? Well, I certainly know the reason why *Tim* is fascinated, but I want to know why he's seriously contemplating reinventing the wheel - namely using XML as data storage when we already have gobs and gobs of systems (think SQL DBMS products) that do this in a much faster, more compact, safer, better way.
Also, most SQL DBMS (Oracle, Sybase ASE, MS SQL, etc.) come with full-text indexing built in, so all it would take would be to chop up HTML pages and stick them in the DBMS, then you can perform rich-text queries on them with minimal effort.
Thanks,
--
Matt
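For what it's worth, here is a minimal sketch of the "chop up HTML pages and stick them in the DBMS" idea from the comment above. It uses SQLite's FTS5 extension purely as a stand-in for the full-text indexing in the products named (Oracle, Sybase ASE, MS SQL), and it assumes the local Python build ships SQLite with FTS5 enabled. The names `TextExtractor`, `html_to_text`, `build_index`, `search`, and the sample pages are all made up for illustration.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_index(pages):
    """pages maps a URL to raw HTML; returns an in-memory database.

    Assumption: this SQLite build has the FTS5 extension compiled in.
    """
    db = sqlite3.connect(":memory:")
    # url is stored but not indexed; body is the full-text indexed column.
    db.execute("CREATE VIRTUAL TABLE pages USING fts5(url UNINDEXED, body)")
    db.executemany(
        "INSERT INTO pages (url, body) VALUES (?, ?)",
        [(url, html_to_text(html)) for url, html in pages.items()],
    )
    return db


def search(db, query):
    """Return the URLs of matching pages, best match first."""
    rows = db.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank", (query,)
    )
    return [url for (url,) in rows]


# Hypothetical sample pages, not taken from the essay series.
SAMPLE_PAGES = {
    "/precision.html": "<h1>Precision and recall</h1><p>How good are the results?</p>",
    "/stopwords.html": "<h1>Stopwords</h1><p>Words the index may skip.</p>",
}

db = build_index(SAMPLE_PAGES)
print(search(db, "recall"))
```

Whether this beats a purpose-built search engine at scale is exactly what the thread is arguing about; the sketch only shows that the basic plumbing is short.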
Everything beyond the TOC (which I loaded in my browser) is slashdotted. The problem with the links to the different articles is that they're not part of a tree hierarchy: I can't just say "wget all pages beyond point X", nor can I make a guess and do a regex download of all URLs with "search" in them, because some articles do not conform to that pattern.
A tarball for offline browsing would be nice; I didn't see one on the page, though. It would save you part of a slashdotting, Tim. How about it? :)
The essay series converges to text book when time tends to infinity. Proof is left as an exercise to the reader.
getSexySig();
Search technology. Hmmm. Wasn't that outsourced to India last month? Or was that last year? I just can't keep up with IT today.
"Someone" ought to start building ... I wonder why this someone isn't Tim Bray. He is one of the most well-known names in XML and has experience under his belt with another search engine project, Antarctica .....
I just mean it in the sense that if he is having trouble getting his own ideas off the ground himself, what a challenge it will be for someone else to do so.
Mr. Bray should get the thing going like Linus did, and call in help from the Open Source Community. If he is waiting for someone with moneybags to catch the bait, and call him on the project as a highly paid consultant, maybe the approach needs to be modified.
Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...
To see a world in a grain of sand, and then to step back and see the beach where the sand lies
Mr. Simpson, this is the most blatant case of fraudulent advertising since my suit against the film, ``The Never-Ending Story''.
--
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
The non-metamodded "overrated" modifier is really killing /.: someone has just modded down 5 highly modded comments as overrated here.
Ideas?
here
There are no trolls. There are no trees out here.
While you struggle with the use of opposable thumbs, I have implemented full-text search on my award-winning blog.
I'm sorry that it's taken you so long.
Sincerely,
Seth Finklestein
Michael Sims' worst nightmare
I'm not Seth Finkelstein. I still speak the truth.
"A Quantum Theory of Internet Value" by Andrew Orlowski
-- why librarians are better at finding the book you want than Google.
I don't know... see for yourself, then come and tell us... The comment on this page suggests that you are right...
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Mirror #1
Mirror #2
Mirror #3
Is Google's page rank algorithm really that mysterious? I know they fiddle with it in secret ways now and then to discourage abuse, but I heard the fundamental algorithm was basically pretty simple: something like finding the eigenvectors and eigenvalues of the matrix of links. (Not sure exactly what they do with these -- associate each page with the eigenvector of which it's the biggest component?) Is this wrong? Wouldn't it be pretty easy to reverse engineer the algorithm anyway?
And when he says "but I'm not actually sure they'd need to do that to get the results they do", can he be serious? If that weren't the case, then it would really be encouraging link farms. If the eigenvalue/eigenvector approach is really how they do it, then it certainly does have this recursive property.
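To make the eigenvector point concrete: in the published PageRank formulation, the scores are the principal eigenvector of the damped link matrix, and power iteration is one standard way to approximate it. The sketch below is only that textbook version, not whatever Google actually runs. The function name `pagerank`, the toy graph, and the damping value of 0.85 are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets the same small baseline (the "random jump" share).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a -> b -> c -> a, plus d -> a.
TOY_LINKS = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = pagerank(TOY_LINKS)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

The recursive property is visible in the inner loop: a page's score depends on the scores of the pages linking to it, so a link from a page nobody links to (like `d` here) carries very little weight.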
Find free books.
It just has a new name, and it's being developed by librarians.
http://www.dlxs.org/products/xpat.html
It seems that both searching and sorting are very important topics. Someone should write a good thorough book on them.
I've been looking all over...
Trolling has been implemented.
Sure we did RTFA. Can you Read Between The F* Lines RBTFL ?
Here is what Tim says:
This essay is about what that software should look like. Early next year I'll write something on how it MIGHT get built.
So BRF is going to be open-source.
I plan to conclude with a description of the next search engine, which doesn't exist yet but someone ought to start building.
And if the following is not Consultant-Speak, I don't know what is. Consultants are great at telling you why you should not be doing what you are doing. They might even tell you what you should be doing, but do they ever do anything except collect big fees? I have been on both sides and I know people who talk and people who walk. Stop talking and start walking.
always going to be search deployments loaded with tricky implementation and deployment work
figuring out where the data is,
aggregating it,
cleaning it up,
building the workflows so these things keep happening,
maintaining some application-specific synonyms,
the list goes on and on,
To see a world in a grain of sand, and then to step back and see the beach where the sand lies
it's geared for public consumption,
such is the nature of websites,
so as long as you don't pretend you wrote it,
it's abundantly clear where the original came from,
go ahead and mirror (by mirror i mean take a snapshot).
only if a copyright holder says don't do that should you remove it.
More search-related functions should be available in PHP and Perl, built in to them. Even MySQL too...
Chris ,
Php Programmers.
google broken? (www.google-watch.org)
"... unique ID for each page stored as ansi c, 4 bytes on Linux system (~4yo) gives theoretical limit of 4.2 billion pages. ..."
It discusses the move to 5 bytes and suggests that this move may be the cause of the weird results on Google searches this year. Of course, the other reason may be Google foiling search queue jumpers.
peterrenshaw ~ Another Scrappy Startup
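The arithmetic behind the quoted limit is easy to check: an unsigned ID of k bytes can take 2^(8k) distinct values. A quick sketch (the helper name `max_ids` is made up here, and this says nothing about how Google actually stores page IDs):

```python
def max_ids(num_bytes):
    """Number of distinct values an unsigned ID of num_bytes bytes can hold."""
    return 2 ** (8 * num_bytes)


four_byte_limit = max_ids(4)  # the "4.2 billion pages" figure
five_byte_limit = max_ids(5)  # the limit after a move to 5 bytes

print(f"4 bytes: {four_byte_limit:,}")
print(f"5 bytes: {five_byte_limit:,}")
print(f"growth factor: {five_byte_limit // four_byte_limit}")
```

So one extra byte multiplies the ceiling by 256, from roughly 4.3 billion to roughly 1.1 trillion IDs.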
In the post-Google world, UIs like the General Interface will appear. Check out their demo at Integrated Web Services. And no, I don't work there; I just like the direction they are going in.
If they're so general, how come I get this when I try to view the sample apps?
Sample Applications
General Interface Objects currently supports Internet Explorer 5.5 and later browsers running on Windows. For access to the sample applications please use another browser.
I guess "general" means "IE only".
Sean
The one remaining active mirror of his site is at http://www.woodmann.com/fravia. The messageboard at http://www.woodmann.com/upload is still the best place to go for reverse-engineering Windows code; no crack requests, serial requests, or target-specific code are allowed, but you can address particular copy protections by name.
Fravia has since moved on to reverse-engineering search engines. If you want to find the stuff that doesn't turn up at the top of a Google search, start here.
From the site: "It has fifteen instalments not including this table of contents."
Last I searched the dictionary, it was "installments."
I guess alphabetical searching is best after all.
- It's not the Macs I hate. It's Digg users. -
Have you considered 'KeepAlive Off'? One busy site I admin has been up 100% of the time ever since I did that; it used to go down (gigs of swap in use) every night.