Learning About Full-text Search

Posted by michael on Thursday December 18, 2003 @02:15AM from the looks-easy-but-isn't dept.

An anonymous reader writes "Tim Bray who's known for XML and has been /.'ed once or twice for that kind of stuff, actually seems to be a search geek and has been writing this endless series of essays on search technology since summer. He says he's finished now - it's like a textbook on searching."

poor guy by understyled · 2003-12-18 02:27 · Score: 5, Informative

i cringe at the bandwidth demands a slashdotting can bring with it. here's google's cache of the page.

--
Sig (appended to the end of comments you post, 120 chars)
1. Re:poor guy by Arslan ibn Da'ud · 2003-12-18 03:33 · Score: 4, Informative
  
  Slashdot has already considered this. RTFFAQ
  
  --
  Practice Kind Randomness and Beautiful Acts of Nonsense.
2. Re:poor guy by johnteslade · 2003-12-18 04:33 · Score: 5, Informative
  The site is still slashdotted. Each of his papers is on a separate page, so here are the Google caches of the individual papers:
  
  Table of Contents
  
  Chapter 1: Backgrounder
  
  Chapter 2: The Users
  
  Chapter 3: Basic Basics
  
  Chapter 4: Precision and Recall
  
  Chapter 5: Intelligence
  
  Chapter 6: Squirmy Words
  
  Chapter 7: UI Archeology (No Cache)
  
  Chapter 8: Stopwords
  
  Chapter 9: Metadata
  
  Chapter 10: I18n
  
  Chapter 11: Result Ranking
  
  Chapter 12: Interfaces
  
  Chapter 13: XML
  
  Chapter 14: Robots
  
  I have to write some crap at the end here so i can get past the "Your comment has too few characters per line" error message.
3. Re:poor guy by spectre_240sx · 2003-12-18 07:25 · Score: 3, Informative
  
  I don't know about that. There seem to be too many problems associated with caching. One that comes to my mind is the extra bandwidth that they would have to worry about. An article about the design of the site mentions that just changing over to CSS made a grand savings of 3-14 GB a day, equalling something like $3,600.00 in the end. Now that's just by cutting 2-9 KB off every page request. Now, think about them serving (possibly) huge pages from other sites that may not optimize their code... That's a lot of money that Slashdot would have to spend.
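As a rough check on the figures quoted above, the daily saving divided by the per-request saving gives the number of page requests per day those figures imply. This is only a back-of-the-envelope sketch: the decimal unit sizes (1 KB = 1,000 bytes, 1 GB = 1,000,000,000 bytes) and the function name are assumptions, and the pairing of the low and high ends of each range is a guess at what the comment means.

```python
# Figures quoted in the comment: 3-14 GB saved per day, 2-9 KB saved per request.
# Decimal unit sizes are an assumption made for this sketch.
KB = 1_000
GB = 1_000_000_000


def implied_requests_per_day(saved_gb_per_day, saved_kb_per_request):
    """Page requests per day implied by a daily saving and a per-request saving."""
    return saved_gb_per_day * GB / (saved_kb_per_request * KB)


# Pair the low ends and the high ends of the two quoted ranges.
low = implied_requests_per_day(3, 2)
high = implied_requests_per_day(14, 9)

print(f"low end:  {low:,.0f} requests/day")
print(f"high end: {high:,.0f} requests/day")
```

Both ends come out at roughly 1.5 million requests a day, so the two quoted ranges are at least consistent with each other.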
Re:web page irony by Anonymous Coward · 2003-12-18 02:29 · Score: 2, Informative

they don't, but the parent post is about finding some conflict between the author's pages and articles.
He's got an article about searching and his pages aren't searchable, and he's got articles about XML, so having non-valid XHTML pages would definitely have been ironic...
Re:web page irony by Schwarzchild · 2003-12-18 02:45 · Score: 4, Informative

He writes about searching technology, but you can't easily search through his writings.
Really? How about search site:tbray.org?
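
A minimal sketch of turning that suggestion into a query URL. Only the `site:tbray.org` operator comes from the comment; the Google search endpoint, the helper name, and the example search terms are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint for this sketch; the comment names no particular search engine.
SEARCH_ENDPOINT = "https://www.google.com/search"


def site_search_url(site, terms):
    """Build a search URL restricted to one site using the site: operator."""
    query = f"site:{site} {terms}"
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query})}"


# Hypothetical search terms, chosen to match the topic of the essays.
url = site_search_url("tbray.org", "full-text search")
print(url)
```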

--
"sweet dreams are made of this..."
Re:Anti-XML by phurley · 2003-12-18 02:55 · Score: 5, Informative

I agree to a point, but if we are talking about a mixed environment where you are using Oracle, I am using DB2, our friend Bob has his data in a legacy ISAM setup, and a customer wants to integrate a search system across the three systems, they are going to have to write a lot of custom glue.

If an XML aspect of the data is available (you can still keep it all in Oracle - just provide a "view" of it in XML) from each of us - common search tools and methods can be utilized.
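
A rough sketch of that idea, assuming each system can export its rows in some common shape. The row data, the element names (`records`, `record`, `title`, `body`), and both helper functions are hypothetical; nothing here talks to Oracle, DB2, or an ISAM file, it only shows one search routine working over three XML views.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows standing in for data held in three different systems.
oracle_rows = [{"id": 1, "title": "Quarterly report", "body": "Revenue grew in Q3"}]
db2_rows = [{"id": 7, "title": "Support ticket", "body": "Search index is stale"}]
isam_rows = [{"id": 42, "title": "Legacy invoice", "body": "Payment overdue"}]


def to_xml_view(source, rows):
    """Expose one system's rows as a <records> element (the XML 'view')."""
    root = ET.Element("records", source=source)
    for row in rows:
        rec = ET.SubElement(root, "record", id=str(row["id"]))
        for field in ("title", "body"):
            ET.SubElement(rec, field).text = row[field]
    return root


def search(views, term):
    """Return (source, id) for every record whose text contains the term."""
    term = term.lower()
    hits = []
    for view in views:
        for rec in view.iter("record"):
            text = " ".join(rec.itertext()).lower()
            if term in text:
                hits.append((view.get("source"), rec.get("id")))
    return hits


views = [
    to_xml_view("oracle", oracle_rows),
    to_xml_view("db2", db2_rows),
    to_xml_view("isam", isam_rows),
]
print(search(views, "search"))
```

The search code never needs to know which backend a record came from; the per-system glue shrinks to whatever produces the XML view.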

--
Home Automation & Linux -- now I know I'm a geek
Re:Why isn't "someone" Tim Bray by wizarddc · 2003-12-18 02:58 · Score: 2, Informative

Go Open Source Tim ... and get the ball rolling. You will be surprised how much help you will get ...

I thought that was just a myth?

--
Th
Re:mirrors ? anyone ? by Anonymous Coward · 2003-12-18 03:19 · Score: 1, Informative

http://developers.slashdot.org/faq/suggestions.shtml#su900

Thank you, drive through.
Yeah, I know... Preview.... by stoborrobots · 2003-12-18 04:09 · Score: 3, Informative

I don't know... see for yourself, then come and tell us... The comment on this page suggests that you are right...

--
"Go to CNN [for a] spell-checked, fact-checked summary" -- CmdrTaco
Mirror by Door-opening Fascist · 2003-12-18 04:11 · Score: 4, Informative

Since the site looks bogged down from the /.'ing, I've made a few mirrors:
Mirror #1
Mirror #2
Mirror #3
"long departed Open Text index?" Not by Anonymous Coward · 2003-12-18 04:29 · Score: 2, Informative

It just has a new name, and it's being developed by librarians.
http://www.dlxs.org/products/xpat.html
Re:Why isn't "someone" Tim Bray by mbrinkm · 2003-12-18 04:59 · Score: 3, Informative

"This is the last in my series of On Search essays. I've written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I'd like to change this part of the world. In short, I'd like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I'll write something on how it might get built.

Naming the Baby: An important piece of software needs to have a name, but that takes time and creativity and can wait; for now I'll just call this thing the Basic Resource Finder (BRF).

Requirements Then a couple of non-requirements and a conclusion.

BRF is Open-Source: My heartfelt apologies to anyone still trying to make a go of it in server-side search; but that business is just so over. It always was a lousy business, nobody has ever made real money there on a sustained basis, and yet it's something that every Web deployment needs. For a substantial site you can easily drop six figures for a search engine, and all the bells and whistles that buys you are mostly not cost-effective.

So BRF is going to be open-source. That doesn't mean that you can't make money with search software; it just means you have to do it in services. There are always going to be search deployments loaded with tricky implementation and deployment work: figuring out where the data is, aggregating it, cleaning it up, building the workflows so these things keep happening, maintaining some application-specific synonyms, the list goes on and on, and none of these things are free. And they are much better things to spend money on than software licenses."

RTFA!

The original submission is about his last essay and that essay starts with the above quote.

And whoever moderated you up needs to RTFA also!

--
"Don't worry about people stealing an idea. If it's original, you will have to ram it down their throats." --Howard Aike
Re:Why isn't "someone" Tim Bray by gwhulbert · 2003-12-18 06:07 · Score: 2, Informative

Tim Bray was one of the founders of Open Text Corporation ... they INVENTED the search engine.
Digital (with whom they were working) "stole" the idea and opened Altavista 3 months before their IPO.
I worked for Open Text for a year but after Tim left (just about the time the 1.0 draft of the XML spec appeared).