Slashdot Mirror


Open Source Analog to Microsoft's Index Server?

An Anonymous Coward asks: "I have been tasked by my noble employer to find a better way of accessing the 4,000-odd management documents and procedures we have. Currently MS Index Server is being used to provide a fairly good searching system. Index Server (for those that don't know) trawls through files and indexes their content. ASP is then used to search the resulting database. My question is, there has to be a way to do this with nice open source software? Does anyone know of any competitors to Index Server that can index microsoft office documents? Thanks!" Might not HT://dig be a good foundation on which to build such a system?

38 comments

  1. Not Open Source, but... by Fifster · · Score: 2, Interesting

    It's not open source, but Sherlock for MacOS (part of the OS) has always featured hard drive or folder indexing features that can scan contents of documents fairly quickly and efficiently. I've not seen its performance on a /huge/ archive, though.
    --Fifster

  2. Google? by isorox · · Score: 5, Interesting

    Doesn't Google license their engine (which reads Word, PowerPoint, etc.)?

    1. Re:Google? by Naikrovek · · Score: 4, Interesting

      Yes: http://www.google.com/appliance/

      However, this is not open source, it is not free and doesn't at all meet the goals of the person asking the question.

    2. Re:Google? by spencerogden · · Score: 1

      If I had to make a choice I would choose to deal with google rather than microsoft...

    3. Re:Google? by 4/3PI*R^3 · · Score: 3, Funny

      Place all of your sensitive corporate documents on your web server then place a lot of links to the pages so that Google will index them. After this you can simply go to Google and type {search phrase} site:{company domain} filetype:{file type}.

      -- for the irony impaired this is humor --

    4. Re:Google? by AirLace · · Score: 2

      -- for the English impaired this is irony --

    5. Re:Google? by malakai · · Score: 1

      The Google Search Appliance starts at US $28k. Index Server comes as part of Win2k, which, even when not bought in bulk, would be about 1/28th the cost.

  3. two I tried by epine · · Score: 3, Interesting


    I tried mnogosearch and swish-e. Different plusses and minuses. Later on I discovered that mnogosearch has a PHP front end and can be installed from a Debian package.

    My advice is to set up two entirely different search databases. Otherwise it's very difficult to compare hits, ranking performance, or discovered differences in the lexeme policy.
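    One way to make that side-by-side comparison concrete is to run the same query against both engines and diff the top results. The sketch below is only an illustration: `compare_rankings` and its input format (plain lists of document IDs, best hit first) are assumptions, not part of mnogosearch or swish-e.

```python
def compare_rankings(hits_a, hits_b, k=10):
    """Compare the top-k hits returned by two engines for the same query.

    hits_a / hits_b are lists of document IDs, best hit first (assumed format).
    """
    top_a, top_b = hits_a[:k], hits_b[:k]
    shared = [doc for doc in top_a if doc in top_b]
    union = set(top_a) | set(top_b)
    return {
        # Jaccard overlap of the two top-k sets (1.0 means identical sets)
        "overlap": len(shared) / len(union) if union else 1.0,
        # Positive shift: engine B ranks the document lower than engine A
        "rank_shift": {doc: top_b.index(doc) - top_a.index(doc) for doc in shared},
        "only_a": [doc for doc in top_a if doc not in top_b],
        "only_b": [doc for doc in top_b if doc not in top_a],
    }


if __name__ == "__main__":
    report = compare_rankings(["p1", "p2", "p3"], ["p2", "p1", "p4"])
    print(report)
```

    Documents that show up in only one list are a quick pointer to differences in what each engine indexed or how it tokenized the text.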

    1. Re:two I tried by shanx24 · · Score: 1

      For the benefit of posterity, can you list any specific plusses and minuses you found while testing the two? Thanks.

      --
      As I said, I don't repeat myself.
    2. Re:two I tried by Anonymous Coward · · Score: 0

      They're for MSWord documents. So far as I know there aren't any OSS MSWord indexers.

  4. Ironic by Anonymous Coward · · Score: 5, Insightful

    I realize this is a little extreme, but what the heck:

    Imagine if the poster's company didn't have all their documents in a proprietary format. They would have plenty of other indexing programs available to them.

    And think, if that gigantic percentage of businesses didn't have their information trapped in a proprietary information format, there'd be even MORE solutions in the marketplace to choose from.

    When you don't come up with a cheaper and quicker solution, be sure to let your boss know it has just a little something to do with a proprietary format on a proprietary platform sold by a monopolist.

    Happy Sunday!

    1. Re:Ironic by david+duncan+scott · · Score: 2
      Hmm...I read the post again, and he doesn't actually say in what format the documents are stored. Maybe they're all ASCII text, and he just doesn't know any of those "plenty of other indexing programs available to them". Imagine if you could mention a couple, for his benefit and for the rest of us, and then go off about his employer's presumed document storage.

      I've always wondered whether M'soft themselves use Index Server. God knows the KnowledgeBase search is awful, and if they do eat their own dog food it's a terrible advertisement.

      --

      This next song is very sad. Please clap along. -- Robin Zander

    2. Re:Ironic by ericski · · Score: 2, Informative

      The original post does specifically state "microsoft office documents" so it is far from likely that they're ASCII text.

    3. Re:Ironic by Anonymous Coward · · Score: 0

      If Word (for example) was installed on the indexing server then it could be used to get the info out via automation.

    4. Re:Ironic by david+duncan+scott · · Score: 2

      You know what? You're right -- in that last sentence, he does say "microsoft office documents". I stand corrected, and withdraw my indignation and scorn (except towards the KnowledgeBase index -- that still sucks.)

      --

      This next song is very sad. Please clap along. -- Robin Zander

    5. Re:Ironic by Anonymous Coward · · Score: 0

      Imagine if the poster won a multistate lottery. He would have plenty of things to worry about besides indexing useless data.

    6. Re:Ironic by jhoffoss · · Score: 2

      Yeah, but the guy still didn't say what you could use if the files _were_ all ASCII...

      --
      Linux: The world's best text-adventure game.
  5. HT://Dig by tzanger · · Score: 2

    Have you ever tried using an HT://dig search? I despise that search tool on the basis that the results it throws back are not ranked all that well and (this is easily fixed) ugly.

    It's been a while since I've checked it out, maybe it has improved.

    1. Re:HT://Dig by Parsec · · Score: 1

      and (this is easily fixed) ugly

      Agreed. I have wrapped my htdig in a PHP readfile(); call and used some JavaScript in the htdig header and long templates to do alternating table row background colors (don't forget the <noscript><tr></noscript> as a fallback if you do this).

  6. ASPSeek worked well for me.. by maeglin · · Score: 5, Informative

    I had about 2GB of documentation dumped onto me for a project. The documentation had no visible structure nor any place to really start tackling it so I decided to just index it all. The documentation was on my Windows2000 machine and I put ASPSeek (a GPL'd search engine that no one seems to know about) on one of my Linux workstations. I used pdftotext and word2txt as filters and let it chew through the documentation. The results were good enough that, when I left the project and shut down the ASPSeek interface, it took about 15 minutes before someone (who already had it all indexed on his Windows2000 workstation) was at my desk trying to get me to turn it back on.
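    The filter step described above can be sketched as a small dispatcher that picks an external converter by file extension and hands the resulting plain text to the indexer. This is a rough illustration under assumptions, not ASPSeek's actual configuration: the `FILTERS` table and function names are hypothetical, and `catdoc` stands in for word2txt because that tool's command-line options are not given in the post.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical extension-to-converter table. "pdftotext FILE -" and
# "catdoc FILE" both write the extracted text to standard output.
FILTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".doc": ["catdoc", "{path}"],
}


def build_command(path):
    """Return the converter command for a file, or None if no filter applies."""
    template = FILTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [part.replace("{path}", str(path)) for part in template]


def extract_text(path):
    """Return plain text for a document, running an external filter if needed."""
    command = build_command(path)
    if command is None:
        # No filter registered: assume the file is already plain text.
        return Path(path).read_text(errors="replace")
    if shutil.which(command[0]) is None:
        raise RuntimeError(f"converter not installed: {command[0]}")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

    Keeping the converters in a table means a new format only needs one more entry, which is roughly how filter-based indexers are usually extended.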

  7. Namazu and Bool by rhkramer · · Score: 2, Interesting

    Check out this page on twiki.org: http://twiki.org/cgi-bin/view/Codev/SearchEngineVsGrepSearch -- it discusses some search engines that have been / are being considered to replace the grep based search on TWiki.

    To me, Namazu and Bool sound promising, but some others are discussed there as well.

    TWiki is a Perl and cgi based wiki, and Namazu seems to be able to integrate into a .cgi based environment quite well, and can index Word documents.

    Hope this helps!

  8. Zope and DocumentLibrary by mwr · · Score: 2, Interesting

    Haven't tried the latter, but it may fit the bill. DocumentLibrary home

  9. As bad as Gates is, Schmidt is worse... by Anonymous Coward · · Score: 0


    Eric Schmidt is even more of a marxist-facist than Bill G himself. Don't lose any sleep over it, though: He'll destroy Google, just like he destroyed Novell.

    1. Re:As bad as Gates is, Schmidt is worse... by Anonymous Coward · · Score: 0

      "...is even more of a marxist-facist than Bill G himself."

      How exactly is Billy-G a marxist? If anything, he is the exact opposite.

  10. Apache Lucene by danpat · · Score: 4, Informative

    I highly recommend taking a look at the Apache Lucene Project, at http://jakarta.apache.org/lucene/

    It's a full text search engine API, so some coding for your specific requirements would be required. However, it's fast, extremely flexible, and has a pluggable interface for documents. It comes with native support for plain text, and for proprietary document types, we've written simple wrappers around tools like "pdf2text" and "catdoc" to index PDFs and Word docs.
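    To illustrate the idea of a search API with pluggable document handling, here is a toy inverted index in Python. It is explicitly not Lucene's API (Lucene is a Java library); `TinyIndex`, `register`, `add`, and `search` are hypothetical names, and the extractor callables stand in for the wrappers around external converters mentioned above.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class TinyIndex:
    """Toy inverted index: maps each term to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.extractors = {}  # extension -> callable turning raw content into text

    def register(self, extension, extractor):
        """Plug in a text extractor for one document type."""
        self.extractors[extension] = extractor

    def add(self, doc_id, extension, raw):
        """Extract text from a document and record every term it contains."""
        extractor = self.extractors.get(extension, lambda content: content)
        for term in TOKEN.findall(extractor(raw).lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the documents that contain every term in the query."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


if __name__ == "__main__":
    index = TinyIndex()
    index.register(".doc", lambda raw: raw.decode("latin-1"))
    index.add("a", ".txt", "Fire safety procedure")
    index.add("b", ".doc", b"Safety audit")
    print(sorted(index.search("safety")))
```

    A real engine adds ranking, stemming, and on-disk storage, but the split between "extract text" and "index terms" is the part that makes new document formats cheap to support.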

  11. ATTN: Cliff by Anonymous Coward · · Score: 0

    > Might not HT://dig be a good foundation on which to build such a system?

    Hey Cliff, news flash: HT://dig suX0rz your Anal Cox!

  12. and Apache POI by tpv · · Score: 2, Informative
    POI (http://jakarta.apache.org/poi) is an MS Office file reader.

    Much talk has been made of integrating Lucene + POI to provide indexing of MS Office Docs, but I don't know what stage that is at.

    --
    Read more of this story at Slashdot.Read more of this story at Slashdot.Read more of this story at Slashdot.
  13. Glimpse by Aknaton · · Score: 1

    I know Scot Hacker used to use Glimpse to do searches on www.betips.net. From the brief research I have done on Glimpse, it would work well if you mainly have text files.

  14. glimpse by stor · · Score: 2, Interesting

    Hello there.

    I have done all of this before in a commercial environment using Glimpse and Perl.

    I'd recommend you check out glimpse and webglimpse. They ought to do what you are after, for free.

    Cheers
    Stor

    --
    "Yeah well there's a lot of stuff that should be, but isn't"
  15. Some Perl Engines by agentZ · · Score: 3, Interesting

    I don't know if you'd consider using Perl, but I've had some good luck with the Fluid Dynamics Search Engine. By default it can search text and PDF documents, and after some work I was able to get it to search the text of Microsoft Word documents too.

  16. HT://dig is garbage... by Anonymous Coward · · Score: 0

    ...and anybody who's ever used it knows what I mean.

    Getting the results you want is nearly impossible, and the page rendering of results is nasty as hell.

    Just because htdig is OSS doesn't make it a good tool. It's old, outdated, and is one of the worst examples of OSS available today.

  17. Try Xapian (Was commercial) by samjam · · Score: 2, Interesting

    Try xapian, www.xapian.org, about to undergo its first release.

    It is based on a temporary open-source release of one of SmartLogik's products.

    I swear by it and find it highly flexible.

    I guess, though, unless you are a hacker - say capable of using to actually index your documents, you might want to wait for the next release.

    I use it in preference to htdig, swish++ and others I have looked at and sadly left; xapian is very fast and easily passes the 2G limit systems such as swish++ suffer from, and supports dynamic aggregation of multiple indexes into one search!

    Sam
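    The "aggregation of multiple indexes into one search" idea can be illustrated with a simple merge of ranked result lists. This is a sketch of the general technique only, not Xapian's implementation or API: `merged_search` is a hypothetical helper, and it assumes each index returns `(score, doc_id)` pairs sorted by descending score, with scores that are comparable across indexes.

```python
import heapq
from itertools import islice


def merged_search(result_lists, limit=10):
    """Merge per-index hit lists into one ranking, best score first.

    Each input list must already be sorted by descending score.
    """
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(islice(merged, limit))


if __name__ == "__main__":
    hr_hits = [(0.9, "hr/1"), (0.4, "hr/2")]
    ops_hits = [(0.7, "ops/1"), (0.1, "ops/2")]
    print(merged_search([hr_hits, ops_hits], limit=3))
```

    Because the merge is lazy, only as many hits as `limit` asks for are ever pulled from the underlying lists.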

  18. Call me ignorant if you like... by Ignorant+Cocksucker · · Score: 0
    But you have a working system, at a reasonable price, why the hell would you want to re-invent that particular wheel ?

    1. Re:Call me ignorant if you like... by sbillard · · Score: 0

      Some folks insist on displaying their Open Sores Software, even when a perfectly good solution exists. Remember, this is Slashdot. Micro$oft = Bad. !Msft = Good. You ignorant (but sensible) person.

  19. PostgreSQL and OpenFTS by Anonymous Coward · · Score: 0

    Check out the OpenFTS project at sf.net.

    It takes advantage of the unique indexing capabilities of PostgreSQL database server.

  20. mnogosearch Free Software by Anonymous Coward · · Score: 0

    http://www.mnogosearch.org/

    is working like a charm for doing that. There are also extended features like caching (like Google) and support for a multitude of external formats.

    Released under the GNU General Public License.

  21. holy shit by Anonymous Coward · · Score: 0

    Actual humility and admission of one's flaws in a Slashdot post. That's gotta be one of them signs of the Apocalypse.

  22. Trans. From the Host Geek Pt. 1 by poopbot by Anonymous Coward · · Score: -1, Offtopic
    Credits: BankOfAmerica_ATM SUBJECT: GREAT STOCK OPPORTUNITY!!! help me Get Big Brands on eBay I DON'T KNOW WHERE I AM! PENTIUM III CPU's IN STOCK
    Begin Fwded Message:

    If someone is listening out there, HELP! I'm trapped, and I don't know where I am. I know this sounds fucked up, but I started reading about this ATM 73.9GB SCSI SCA-2 LVD 3.5 X 1.6 80-PIN 5.7MS 4MB CACHE 10,000RPM HITACHI HARD DRIVE - $269.00 - only 1 left! ITEM#... DK31CJ-72MC http://www.hardwarest.com/product.asp?sku=DK31CJ%2 D72MC+&dept_id=7 online. Yeah, not like withdrawal or anything, but this was an actual ATM, and it was alive, and posting messages to this educational website that I visit from time to time.

    Pretty soon, I realized that not only was this ATM visiting the same site I liked, but (believe it or not) this ATM was conveniently located near me!!!!! is to take advantage of the current climate in the telecommunications industry!!!! In every industry downturn, opportunities can present themselves for a small aggressive company like GloboPhone to develop relations with corporations that have networks, infrastructure, and personnel but lack sufficient customers. This is GloboPhone's advantage.

    I don't have to tell you, this was no ordinary ATM. Actually this ATM had the power to transfer its consciousness into your mind. I know it sounds ridiculus, but...it used the magnetic strip to actually go inside your mind. Well like any computer lover I am always wanting to try the new technology, so If you are ready to become the biggest man you can be, then order your supply of Magna-RX+ today! See for yourself, what thousands of satisfied men (and their lovers) have already discovered: Magna-RX+ is the world's #1 Best-Selling Penis Enlargement Formula for one very simple reason: IT WORKS AND NOTHING ELSE CAN COMPARE! I went to where the ATM told me to (his inclosure) and swiped my card.

    I blacked out and when I awoke, I was in a new place. Yeah, that's right, the ATM had actually taken ahold of my body. It had done stuff like buy a bunch of magazines and alot of candy. It was like, he and I were different partitions on my brain's hard disk,. Anyway, he took control of my body in order to topple this great conspiracy called Project Faustois-an who doesn't want to stick it to the man? This is when all the trouble started...

    So now, after a few motnths of letting him use my body (although I quit for awhile) he's gone and done this to me. Normally I "wake up" from his using my body in a convenience store near my house, and it's no trouble getting home. But this time I'm trapped in We will be on the East Coast later this year.
    ---------------
    - Tuesday June 24, 6pm - 7:30pm


    Apple Store at South Coast Plaza, 3333 Bear St., Costa Mesa, CA 92626
    714-424-6331


    Mac Experts, 2300 Lincoln Blvd, Santa Monica, CA 90405
    310-581-1500
    ---------------
    - Tuesday July 9, 6pm - 7:30pm


    Apple Store at Fashion Island, 367 Newport Center Drive, Newport Beach, CA 92660
    949-729-4433
    ---------------
    - Tuesday July 16, 6pm - 7:30pm


    Apple Store at Northridge Fashion Center, 9301 Tampa Ave., Northridge, CA 91324
    818-709-2253
    ---------------
    - Tuesday July 23, 6pm - 7:30pm

    Apple Store at Glendale Galleria, 2148 Glendale Galleria, Glendale, CA 91210
    818-502-8310
    trapped in a strange place. Not a good place either. This makes me think of like, 2001 or something. But like creepy. See it's all this white under fluorescent lights and I can't see any windows or even doors. All that's in here is this old-ass terminal. Man, what the fucked happened? Then I remembered: I "picked up" the ATM on my way home from work, but I forgot that it was the fourth Thursday of the motnh. Usualy the day I host D & D for the guys. The ATM must have ben there in my body when my frends came over. Wnoder what happened then?

    Some point later, I'm here in this white room. It's scary at first, I know they're watching me. All I have in this room is this computer terminal. This has got to be the Project Fastus that's what the ATM has been trying to get inside all along. So I guess it's great that I'm (and he???) is insid, it's like I'm in the frickin' Death Star or something, but I don't see any garbage chutes or anything.

    After a few hours of clicking through on thiscomputer terminal (looks like they're running some old-ass *NIX : ) these two guys in suits come into my room from my room. Now it's serious.

    They drag me into a room full of all this really sciency equipment-you know, blooping and bleeping gadgets, big cold noises from the air conditioner. I thought I was in 2001 for a second, except instead of HAL, there's this big bald guy. He's red and pretty sweaty despite the massive air conditioning. He barks a few words to the suited guys and they go away.

    "So you've been harboring our little ATM problem," says the man nonchalantly. I don't say anything (I'm nervous). He restarts his spiel a few seconds later, this time with a bit of veins comung out of his neck.

    "Joel Shane Cross. That is your name, isn't it?" The guy went from good cop to bad cop pretty quick-which was really disturbing. I was already out of sorts with reality, waking up in nowheresville, this odd place. He just kept talking, and I started to get scared, and actually kinda angry. "We know all about you, Mr. Cross. We know that you've been allowing the ATM to inhabit your body for some time now. You've been mislead, Mr. Cross. Working for the wrong people."

    "I belive the ATM!" I told him, stickin to my guns while Istuck it to the man.

    "You'll learn in time," the red and sweaty man said it from his mouth, but the noise of his voice was all over the place. And then he was gone. Not by turning around, by like, vanishing. And the sciency room was gone too, replaced by the big white place I was stuck in. I don't know where I am. But this shit is If you are ready to become the biggest man you can be, then order your supply of Magna-RX+ today! See for yourself, what thousands of satisfied men (and their lovers) have already discovered: Magna-RX+ is the world's #1 Best-Selling Penis Enlargement Formula for one very simple reason: IT WORKS AND NOTHING ELSE CAN COMPARE!
    crazy. If someone gets this message...please help.


    END TRANSMISSION.

    Trolling /. since 7/8/02