apache.org · Domains · Slashdot Mirror

Perl is a powerful weapon by Z00L00K · 2008-01-25 07:49 · Score: 1 · on You Used Perl to Write WHAT?!

But as it is a scripting language it isn't good at everything.

Over time, new tools have emerged that make the use of Perl more limited. One of the drawbacks is that it is possible to be very obscure when writing in Perl. But it may at the same time be very efficient.

To write web applications I have stuck with Java and build the web pages using ECS. Unfortunately the use of ECS really brings out the bad side of Java's inability to do explicit object deletion. It may be that ECS also could have been written in a better way - so anyway maybe I'm just whining.

The advantage is that I get really good HTML which will pass the W3C validator without too much fuss. The disadvantage is that it's not that easy to introduce the ordinary HTML hacker into the world of ECS. (but why should the world be easy?)

And ultimately - there is a difference between tools and tools. If you have a tool like Eclipse you may use it to edit more than just Java, and somebody else may go in afterwards with Emacs, vi or (horrible thought) Visual Studio to continue the work, since the code isn't really aware of which tool I use. On the other hand - a programming language is a tool too. If somebody comes in and says that I need DIBOL for a certain task even though everything else is written in COBOL, then you may want to think twice about the mind of that person...

Re:The register's older writeup on this ... by Anonymous Coward · 2008-01-24 09:28 · Score: 0 · on Mystery Malware Affecting Linux/Apache Web Servers

The rumour goes that the trojan includes Rbot, so I googled "apache rbot".

The fourth hit (it's been displaced by a bunch of news hits now) listed a number of commits adding rbot to the Apache Maven project.

From the Maven site: "Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information."

I'm not entirely sure why Maven should have rbot built in, but I went to the IRC server and the channel mentioned was not in use. There were a few interesting hosts in #maven, but no sign of intelligent life beyond befuddled users looking for help.

Weirdly coincidentally an earlier article from The Register starts: "Security maven Mary Landesman is in the midst of piecing together a who-done-it involving the infection of hundreds of websites that are generating an enormous amount of traffic. Or maybe it's a how-done-it. Either way, she's mostly drawing blanks."

Wikipedia says "Maven is a Yiddish word meaning 'accumulator of knowledge'."

Total Information Awareness anyone ?

Re:So, here's your answer: by rcw-home · 2008-01-22 17:17 · Score: 1 · on How Would You Make a Distributed Office System?

If only someone would point that out to Microsoft.. the most obvious exception to your relationship.

No kidding. If it wasn't for Microsoft, I could have used the word "quite" instead of "often". It's not enough to have millions of beta testers (err, I mean customers) - you have to provide a way to listen to them. Collecting $99 or $249 to open a PSS ticket (and then spout worthless advice such as "do an in-place Windows reinstall" instead of providing a fix) doesn't cut it.

At least free software gets this right.

Re:Rewrite in Java by zrq · 2008-01-16 07:13 · Score: 1 · on Sun Buys MySQL

Sun don't own Derby. It used to be owned by IBM, and they gave it to the Apache foundation to look after.

From the Apache Derby website :

The initial code base from which to create this project is from the commercial product called IBM Cloudscape. The history of this product is that it was developed at Cloudscape Inc. starting in 1996. The Cloudscape product was purchased along with the Cloudscape company by Informix Software in 1999. In 2001, IBM purchased the database assets of Informix Software, including the Cloudscape product.

IBM plans to contribute the Derby code base, test cases, build files, and documentation to the ASF under the terms specified in the ASF Corporate Contributor License. Once at Apache, the project will be licensed under the ASF license.

Re:Rewrite in Java by LarsWestergren · 2008-01-16 02:53 · Score: 5, Informative · on Sun Buys MySQL

Damn it! Now they will rewrite it in Java. It will no longer be the fastest database engine, after the rewrite, it will certainly be the slowest.

Sun already has an embeddable db engine written in Java called Derby. It has pretty impressive features and performance.

Use, bugtrack, test and contribute to OOO and POI by Anonymous Coward · 2008-01-15 08:01 · Score: 0 · on Public Request For Microsoft To Release Deprecated File Formats

OLE2 filez:
OOO - http://www.openoffice.org/ - OSS Office
POI - http://poi.apache.org/ - Java API To Access Microsoft Format Files

MS Access:
Jackcess - http://jackcess.sourceforge.net/changes-report.html#a1.1.11 - Jackcess is a pure Java library for reading from and writing to MS Access databases.
http://www.kexi-project.org/

Re:FAST vs Lucene by zarr · 2008-01-09 08:51 · Score: 1 · on Microsoft Buys Search Engine, Going After Google?

Lucene is a library, FAST is a huge beast. FAST is more comparable to Nutch, except a few hundred man-years more advanced.

FAST vs Lucene by CodeBuster · 2008-01-09 05:06 · Score: 1 · on Microsoft Buys Search Engine, Going After Google?

I wonder how Lucene compares with FAST in terms of generic unstructured text searching, perhaps someone who knows more about FAST or Lucene can answer?

Bad news by ianare · 2008-01-07 11:26 · Score: 1 · on Wikia Search Launches Alpha, Not Ready Yet

They did the opposite, sorta. The new Wikia search runs Nutch, which is a sub-project of the Apache foundation's Lucene project.
Guess what search engine powers Wikipedia? Yup, it's Lucene!

Re:Please enter your credentials here: by Basje · 2008-01-04 03:23 · Score: 2, Informative · on Firefox Spoofing Bug Puts Passwords At Risk

Because the realm is the identifying element of authentication. The username/password combo is automatically resent if the realm matches.

So if you first log on to PayPal and afterwards go to another page in the same realm, you don't need to retype the username/password.

If another site mimics the exact realm, the username/password is sent to that site as well.

Details here: http://httpd.apache.org/docs/1.3/howto/auth.html#basicworks
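To make the point above concrete, here is a minimal sketch in Python. The `CredentialCache` class and the host names are hypothetical, and real browsers store credentials in their own ways. It only shows that a cache keyed on the realm string alone hands the saved credentials to any site presenting the same realm, while a cache keyed on origin plus realm does not.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


class CredentialCache:
    """Toy credential store (hypothetical, for illustration only)."""

    def __init__(self, key_on_origin: bool) -> None:
        self.key_on_origin = key_on_origin
        self._store = {}

    def _key(self, origin: str, realm: str):
        # The naive variant ignores which site is asking.
        return (origin, realm) if self.key_on_origin else realm

    def remember(self, origin: str, realm: str, username: str, password: str) -> None:
        self._store[self._key(origin, realm)] = basic_auth_header(username, password)

    def lookup(self, origin: str, realm: str):
        return self._store.get(self._key(origin, realm))


# Keyed on realm only: a look-alike realm on another site gets the header.
naive = CredentialCache(key_on_origin=False)
naive.remember("https://pay.example", "Account Login", "alice", "s3cret")
leaked = naive.lookup("https://evil.example", "Account Login")

# Keyed on (origin, realm): the other site gets nothing.
strict = CredentialCache(key_on_origin=True)
strict.remember("https://pay.example", "Account Login", "alice", "s3cret")
not_leaked = strict.lookup("https://evil.example", "Account Login")
```

Here `leaked` holds alice's Authorization header while `not_leaked` is `None`, which is the difference the spoofing bug turns on.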

Re:Google by jimicus · 2008-01-03 21:40 · Score: 1 · on MS Drops Licensing Restrictions from Web Server 2008

But what the survey is really telling you is which web server is being used to serve unique content on the web. Whether one server serves a million pages or a million servers serve one page apiece is irrelevant.

Technically correct, but I think it could benefit from further clarification.

Netcraft's numbers tell you which piece of software is being used to provide web service on a unique hostname.

But with modules like Dynamic Mass Virtual Hosting (and whatever the equivalent is on IIS), it is trivially easy for a web company to have as many websites as they like without buying another license for the web server software, without buying another physical piece of hardware and without configuring anything beyond an entry in DNS.

It doesn't tell you a damn thing about "how many companies have gone out and bought the Microsoft solution vs. used a LAMP stack". Systems like blogger and myspace (which present each user to the world with their own unique hostname, but obviously don't install a new server every time a user signs up) can easily distort the numbers.

The point I'm making is you're quite right, it's pointless to discuss these statistics without understanding what they really represent. But in understanding what they represent, it becomes clear that they don't actually represent anything at all.
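A small sketch of why per-hostname counts can mislead. The rows below are made-up data, and using the IP address as a stand-in for "one machine" is an assumption made only for this illustration.

```python
# Hypothetical survey rows: (hostname, IP address, server software).
observations = [
    ("alice.blog.example", "192.0.2.10", "Apache"),
    ("bob.blog.example", "192.0.2.10", "Apache"),
    ("carol.blog.example", "192.0.2.10", "Apache"),
    ("www.corp.example", "198.51.100.7", "IIS"),
]


def share_by_hostname(rows):
    """Count one vote per hostname, as described for the survey above."""
    counts = {}
    for _host, _ip, software in rows:
        counts[software] = counts.get(software, 0) + 1
    return counts


def share_by_machine(rows):
    """Count one vote per distinct (IP, software) pair instead."""
    seen = set()
    counts = {}
    for _host, ip, software in rows:
        if (ip, software) not in seen:
            seen.add((ip, software))
            counts[software] = counts.get(software, 0) + 1
    return counts


by_hostname = share_by_hostname(observations)
by_machine = share_by_machine(observations)
```

With this data the hostname count favours Apache three to one, while the per-machine count is one each, so a single mass-virtual-hosting box shifts the headline number.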

Links by jcaldwel · 2008-01-02 06:15 · Score: 0, Flamebait · on Scammers Continue to Wreak Havoc in MMO's

It's irritating to have links on words if the href has nothing to do with the anchor text.

Apache and MediaWiki by AxelBoldt · 2007-12-30 13:26 · Score: 0, Troll · on Long Live Closed-Source Software?

The core claim,

Even though the open-source movement has a stinging countercultural rhetoric, it has in practice been a conservative force.

is crap: see Apache and MediaWiki. The closed-source model has never produced anything nearly as radical, important and innovative as either of those two projects. The iPhone's interface is laughable in comparison.

Re:Sure, right, yeah... by palegray.net · 2007-12-30 10:33 · Score: 1 · on Long Live Closed-Source Software?

If that's so, then why are so few FOSS applications widely adopted?

You're kidding, right?

OpenOffice.org
Mozilla Firefox
Clam Antivirus
BitTorrent
Apache Web Server
MySQL Database
PostgreSQL Database

I could go on, but my fingers are getting tired...

Re:Wikipedia, eh... by Anonymous Coward · 2007-12-17 04:25 · Score: 0 · on Yahoo Becomes Apache Platinum Sponsor

Do you even know what Lucene is for?

Last I knew, it is built to be the base of other search technology, so it does a fine job of text indexing and searching, but not guessing you meant "Paris Hilton" when you search for Pares Hiltone.

It is not trying to compete with google, but rather make a base for others, such as Nutch, to compete.
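For a rough idea of the "did you mean" layer that the comment says sits outside plain text indexing, here is a sketch using Python's standard `difflib`. The list of known queries is invented for the example, and this is not how Lucene, Nutch or Google do it; it is just the simplest fuzzy match that fits in a few lines.

```python
import difflib

# Hypothetical list of queries the search service has seen before.
KNOWN_QUERIES = ["paris hilton", "paris france", "hilton hotels"]


def did_you_mean(query, known=KNOWN_QUERIES, cutoff=0.8):
    """Return the closest known query, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(query.lower(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = did_you_mean("Pares Hiltone")
```

An application built on top of an indexing library would run something like this before (or after) the actual index lookup.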

Google is also an Apache Sponsor by jaaron · 2007-12-16 16:04 · Score: 4, Informative · on Yahoo Becomes Apache Platinum Sponsor

Google is also an Apache platinum sponsor. We're happy to have both of them involved!

Yes, Apache is a legal US charity (Re:Tax Break?) by jaaron · 2007-12-16 15:59 · Score: 5, Informative · on Yahoo Becomes Apache Platinum Sponsor

Yes. Apache is a US charity under Section 501(c)(3) of the U.S. Internal Revenue Code. See the donation FAQ.

Furthermore, Apache is still almost completely a volunteer organization. The board members, officers and members do not take a salary from the donations. The only paid staff the ASF now has include a PR person, a system administrator, and a part-time secretary.

Disclaimer: I'm an Apache board member.

Google donates too by Dashcolon · 2007-12-16 15:58 · Score: 4, Informative · on Yahoo Becomes Apache Platinum Sponsor

All you gents lauding Yahoo for being a platinum donor in comparison to Google should take a look at Apache's donation thanks page, where Google is also listed as a platinum donor.

Official policy to the rescue [?] by dcavanaugh · 2007-12-13 05:41 · Score: 1 · on The Setup Behind Microsoft.com

Large scale log processing isn't hard if you have the right tools. :)

Let's hope their corporate policy allows something a little more robust than "Event Viewer".

Re:Firewall Schmirewall by allenw · 2007-12-13 04:33 · Score: 3, Informative · on The Setup Behind Microsoft.com

Large scale log processing isn't hard if you have the right tools. :)

Re:Okay, I know... by Anonymous Coward · 2007-12-11 18:32 · Score: 0 · on ISP Inserting Content Into Users' Webpages

Yes.

From http://httpd.apache.org/docs/2.0/ssl/ssl_faq.html#vhosts

Why can't I use SSL with name-based/non-IP-based virtual hosts?

The reason is very technical, and a somewhat "chicken and egg" problem. The SSL protocol layer stays below the HTTP protocol layer and encapsulates HTTP. When an SSL connection (HTTPS) is established Apache/mod_ssl has to negotiate the SSL protocol parameters with the client. For this, mod_ssl has to consult the configuration of the virtual server (for instance it has to look for the cipher suite, the server certificate, etc.). But in order to go to the correct virtual server Apache has to know the Host HTTP header field. To do this, the HTTP request header has to be read. This cannot be done before the SSL handshake is finished, but the information is needed in order to complete the SSL handshake phase. Bingo!
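The ordering problem in the FAQ answer can be sketched in Python. The host table and file names are hypothetical. The function below picks a certificate from the Host header, which works only on a plaintext request; with HTTPS those bytes arrive after the handshake, and the handshake already needed the certificate.

```python
# Hypothetical name-based virtual host table: Host header -> certificate file.
VHOST_CERTS = {
    "www.example.com": "example.com.pem",
    "shop.example.org": "example.org.pem",
}


def host_from_request(raw_request: bytes):
    """Return the Host header value from a plaintext HTTP request, or None.

    Simplification: a port suffix such as ":8443" is not stripped.
    """
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip().lower()
    return None


# Over plain HTTP the server can read the header and pick the virtual host.
request = b"GET / HTTP/1.1\r\nHost: shop.example.org\r\n\r\n"
chosen_cert = VHOST_CERTS[host_from_request(request)]

# Over HTTPS the same bytes are encrypted until the handshake completes,
# so at certificate-selection time there is no Host header to read yet.
```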

Re:Most open source will come from India??? by ArikTheRed · 2007-12-05 09:37 · Score: 1 · on Sun Offers Reward Program to Boost Open Source Effort

In a nutshell, yes. Through my stint as a professional OSS developer, I can safely say that most OSS interest is outside the US. Where we would have to beg and push US and Canadian companies to consider adopting open source software, overseas companies were busting down our door to get support - they were adopting it in droves. As we researched the viability of making money on OSS this feeling was confirmed - over half of US companies still didn't have any policy whatsoever concerning open source, whereas something around 80% of EU companies did.

Re:ISO? by AKAImBatman · 2007-12-04 16:06 · Score: 5, Informative · on PDF Is Now ISO 32000

While I realize this is supposed to be an amusing turn of phrase, there are actually quite a few tools out there. A few that I like are:

PDFBox - OSS Library for modifying PDFs on the fly.
FOP - Use XSL-FO to design printable page layouts in XML, then use FOP to transform them to PDF documents.
Foxit Tools - Alternative to the overpriced Adobe products.
OpenOffice - The built-in support for PDFs is absolutely wonderful. I rarely give out DOC files anymore.
FPDF - PHP PDF generation tools.
iText - A great library for your own custom PDF generation.

Those are just a few. The PDF format itself is actually not too bad. (When Adobe isn't breaking it with needless revisions, that is.) Its biggest strength is that the pseudo-text nature of the format allows one to diagnose the internals of a file pretty easily. Its greatest weakness is that things like text fields are needlessly convoluted. At the end of the day, though, it's a pretty good format.

Re:Any suggestions to slashdotproof it? by Just+Some+Guy · 2007-12-04 07:58 · Score: 1 · on OOXML's 662 Resolutions

mod_proxy can be your friend. There's probably no need to regenerate a whole page every single time it's requested.
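The idea of not regenerating a page on every request can be sketched as a small time-limited cache in Python. This is not Apache's proxy or cache code; `PageCache` and `render_page` are hypothetical names, and the 60-second lifetime is an arbitrary choice.

```python
import time


class PageCache:
    """Serve a stored copy of a page until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # path -> (time stored, body)

    def get(self, path, render):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the expensive render
        body = render(path)
        self._entries[path] = (now, body)
        return body


render_calls = []


def render_page(path):
    """Stand-in for the expensive dynamic page generation."""
    render_calls.append(path)
    return f"<html><body>Generated {path}</body></html>"


cache = PageCache(ttl_seconds=60)
first = cache.get("/story/1", render_page)
second = cache.get("/story/1", render_page)  # served from the cache
```

During a traffic spike most requests hit the stored copy, so the backend renders each page roughly once per lifetime instead of once per visitor.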

Re:NearlyFreeSpeech.net by dch24 · 2007-10-30 14:32 · Score: 1 · on Amazon and Hardware As a Service

For instance, they offer foo.nfshost.net sites, that could be covered with a *.nfshost.com [sic] SSL certificate.

Obligatory: I am a satisfied NearlyFreeSpeech.net customer.

They could offer that, and probably will in the near future (within a year, is my best guess). But e-commerce websites will want the brand recognition of https://www.mysite.com. Such businesses aren't satisfied with sending all their customers to mysite.nfshost.com, in my experience. They want their customers to feel secure, and to be able to see the website's "real name" in the address bar when entering credit card info.

There is a way for NearlyFreeSpeech.net to serve up all those SSL certificates: use RFC 3546. In brief, the SSL library needs to support "Server Name Indication," in which the client sends the requested server name to the server during the handshake. That allows the server to do "virtual hosts," which is a big part of what allows NearlyFreeSpeech.net to sell "at cost."

Anyway, keep an eye on this bug in Apache for when support will be added to Apache's mod_ssl.
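To see Server Name Indication from the client side, here is a short Python sketch using the standard `ssl` module with in-memory buffers, so no network connection is made. The host name is the example one from the comment above. It starts a handshake and inspects the first message the client would send; the requested name appears in it unencrypted, which is what lets a server choose a certificate per virtual host.

```python
import ssl

# In-memory BIOs stand in for the network socket.
incoming = ssl.MemoryBIO()
outgoing = ssl.MemoryBIO()

context = ssl.create_default_context()
tls = context.wrap_bio(incoming, outgoing, server_hostname="www.mysite.com")

try:
    tls.do_handshake()
except ssl.SSLWantReadError:
    # Expected: the client has written its ClientHello and now waits
    # for a server reply that never comes in this sketch.
    pass

client_hello = outgoing.read()
sni_visible = b"www.mysite.com" in client_hello
```

On the server side, Python exposes the same information through `SSLContext.sni_callback`, where a different context (and certificate) can be selected for each name.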

Re:Code defensively? What's that? by cerberusss · 2007-10-26 18:35 · Score: 1 · on Slashdot's Setup, Part 2- Software

OK, the app is used in-house... Interesting situation. Well, there doesn't seem to be a lot you can do. Maybe Apache's mod_deflate, or else you could put some work into streamlining the output of the web app, i.e. add pagination instead of laaarge pages. Remove images. Put stylesheets and JavaScript outside of the pages so the browser caches them.
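For a feel of what compressing responses buys, here is a rough Python sketch with the standard `gzip` module. The page content is invented, and repetitive table markup like this compresses unusually well, so treat the ratio as an illustration rather than a benchmark.

```python
import gzip

# Hypothetical large, repetitive HTML page of the kind a web app might emit.
rows = "<tr><td>row</td><td>value</td></tr>\n" * 2000
page = ("<html><body><table>\n" + rows + "</table></body></html>").encode("utf-8")

compressed = gzip.compress(page)
ratio = len(compressed) / len(page)

print(f"original: {len(page)} bytes, gzipped: {len(compressed)} bytes")
```

The browser does the decompression transparently when the response carries `Content-Encoding: gzip`, so the application output itself does not change.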

Re:Redefining .shtml? by pudge · 2007-10-26 08:52 · Score: 4, Informative · on Slashdot's Setup, Part 2- Software

Last time I checked (which was like 10 years ago), .shtml stood for Server Side Includes (SSI) HTML, which are definitely not static.

Wouldn't it have been better to choose an extension/term not already used, such as .htmls?

They are static files on the filesystem. "Static" doesn't preclude being parsed on the way out by the server (mostly just to slap in the header and footer). This is as opposed to dynamic pages which are generated entirely on-the-fly.
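A sketch of what "parsed on the way out" can look like, written in Python rather than as server configuration. The directive follows the `<!--#include virtual="..." -->` form used by server-side includes; the fragment names and contents are made up, and this is not Slashdot's actual code.

```python
import re

# Matches an include directive of the form <!--#include virtual="/path" -->
INCLUDE = re.compile(r'<!--#include virtual="([^"]+)" -->')

# Hypothetical shared fragments, normally separate files on disk.
FRAGMENTS = {
    "/header.inc": "<div id='header'>Site header</div>",
    "/footer.inc": "<div id='footer'>Site footer</div>",
}


def expand_includes(page, fragments=FRAGMENTS):
    """Replace each include directive with its fragment (empty if unknown)."""
    return INCLUDE.sub(lambda m: fragments.get(m.group(1), ""), page)


static_page = (
    '<!--#include virtual="/header.inc" -->\n'
    "<p>Article body stored as a static file.</p>\n"
    '<!--#include virtual="/footer.inc" -->'
)
served = expand_includes(static_page)
```

The article body never changes between requests; only the wrapper fragments are dropped in as the file is sent.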

Redefining .shtml? by VGPowerlord · 2007-10-26 08:32 · Score: 1, Interesting · on Slashdot's Setup, Part 2- Software

Last time I checked (which was like 10 years ago), .shtml stood for Server Side Includes (SSI) HTML, which are definitely not static.

Wouldn't it have been better to choose an extension/term not already used, such as .htmls?

Re:Just one thing to keep in mind... by neoform · 2007-10-23 03:17 · Score: 2, Interesting · on Amazon Patents Including a String at End of a URL

http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html

This module was invented and originally written in April 1996
and gifted exclusively to the The Apache Group in July 1997 by

Has amazon been using this technique for more than a decade?

Re:That's mod_rewrite! by HaydnH · 2007-10-23 02:20 · Score: 1 · on Amazon Patents Including a String at End of a URL

Mod rewrite was created in 1996 (and given to Apache in '97). (Source here: Just above TOC)

That's mod_rewrite! by Sandb · 2007-10-23 01:58 · Score: 5, Informative · on Amazon Patents Including a String at End of a URL

Did they just patent mod_rewrite??? Tue Aug 24 06:55:44 1999 UTC (8 years, 2 months ago) baby! http://svn.apache.org/viewvc/httpd/httpd/trunk/modules/mappers/mod_rewrite.c?revision=83751&view=markup&pathrev=573831
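For readers who haven't used URL rewriting, here is the general idea as a Python sketch. It is an analogue and not mod_rewrite syntax; the URL layout and the `product.php` target are hypothetical. A descriptive string on the end of the path is accepted and thrown away, and only the numeric id reaches the application.

```python
import re

# (pattern, replacement) pairs, tried in order.
RULES = [
    (re.compile(r"^/product/(\d+)/[^/]*$"), r"/product.php?id=\1"),
]


def rewrite(path):
    """Return the internal URL for a public path, or the path unchanged."""
    for pattern, target in RULES:
        if pattern.match(path):
            return pattern.sub(target, path)
    return path


internal = rewrite("/product/42/blue-widget")
untouched = rewrite("/about")
```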

Re:Not the first time by ozmanjusri · 2007-10-11 13:44 · Score: 5, Funny · on The Russian Mafia Doesn't Like Spam Either

and who says geeks are mostly liberal.

We are.

Some of us just take Spam Assassin a little too literally.

Re:Deck chairs on the Titanic by VGPowerlord · 2007-10-10 16:01 · Score: 1 · on Get Speed-Booting with an Open BIOS

Same deal with Apache choking on virtual domains ... one at a time ... if the name server isn't answering. All those "wait X seconds for Y to happen" things can really add up.

That's why the Apache group tells you how to avoid that. If you specify an IP on the VirtualHost line and include a ServerName (and optionally ServerAlias) line inside the VirtualHost block, Apache doesn't need to do forward or reverse DNS lookups when it starts.
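As an illustration of the advice above, this Python sketch emits a VirtualHost stanza and refuses anything that is not a literal IP address. The helper name, the addresses and the document root are hypothetical; only the directive names (VirtualHost, ServerName, ServerAlias, DocumentRoot) come from Apache.

```python
import ipaddress


def virtual_host_block(ip, port, server_name, aliases=(), document_root="/var/www/html"):
    """Render a VirtualHost stanza keyed on a literal IP address."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is a hostname
    lines = [
        f"<VirtualHost {ip}:{port}>",
        f"    ServerName {server_name}",
    ]
    if aliases:
        lines.append("    ServerAlias " + " ".join(aliases))
    lines.append(f"    DocumentRoot {document_root}")
    lines.append("</VirtualHost>")
    return "\n".join(lines)


config = virtual_host_block("192.0.2.10", 80, "www.example.com", aliases=["example.com"])
print(config)
```

Because the address and the names are both spelled out, nothing in the stanza has to be resolved through DNS when the server starts.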

Apache for Windows [was:free as in beer?] by Lumenary7204 · 2007-10-10 03:50 · Score: 1 · on Microsoft Releases IIS FastCGI Module

... but when someone has the option of just turning on IIS on an underutilized box, or finding/buying a box to install linux and Apache on, the idea of price is a non-issue.

Umm... What about Apache for Windows?

The Apache.org download page.

-- and --

The Win32 binary download mirror.

Just as "free" as Apache for Linux/UNIX...

Re:FastCGI vs Proxy by Fweeky · 2007-10-10 03:23 · Score: 1 · on Microsoft Releases IIS FastCGI Module

One day, in the dim and distant future, I hope to see FastCGI supported by mod_proxy[_balancer]. Sadly, the module in Apache trunk seems to have pretty much died. *sulk*.

Actually, scratch that. What I want to see in the dim and distant future is a PHP HTTP SAPI module, so it can run its own webserver and I can proxy or not as needed.

It's not GPL, but it is an Apache project by alt-j · 2007-10-08 13:48 · Score: 0 · on What is the Best Way to Start a Paid GPL Project?

I haven't used the POS portion, but OFBiz is a very flexible open-source project that does include POS.
I've used other portions of the project and once past the learning curve, it's great!

Re:syslog, not ssh+tail -f by allenw · 2007-10-07 07:16 · Score: 1 · on Logfiles Made Interesting with glTail

That is a very good point. I'm used to dealing with scales beyond a single node ;) where you have access to such things.

In any case, I'm considering borrowing the idea and using it to 'watch' blocks on HDFS. I think it would be interesting to have a visual of blocks/files getting read/written/replicated. It might show patterns that we're otherwise not seeing.

Re:I've seen a few of these by deftcoder · 2007-10-05 03:34 · Score: 1 · on Cracked Linux Boxes Used to Wield Windows Botnets

Run apache in a chroot (OpenBSD does this by default), and use SuExec ( http://httpd.apache.org/docs/2.0/suexec.html ) if you insist on running PHP.

Re:Lucene by daemous · 2007-10-02 02:30 · Score: 1 · on Best Way to Build a Searchable Document Index?

Solr is the de facto search server implementation of the Lucene library. http://lucene.apache.org/solr/ There is also a Ruby client system called "Solr Flare", made by Erik Hatcher (who co-authored "Lucene In Action").

kinosearch, swish-e, zebra, ht:/dig, etc. by ericleasemorgan · 2007-10-02 00:46 · Score: 1 · on Best Way to Build a Searchable Document Index?

There are many ways to skin this cat. I believe most of them have been mentioned, but I will outline my experiences anyway.

swish-e is a grand-daddy of an indexer. It can act as a robot, crawl your local file system, or get its input from STDIN. If indexing HTML, swish-e will index the document's metatags and provide field searching against them. Swish-e comes with a C, Perl, and PHP API. I don't think swish-e supports anything but ASCII very well.

kinosearch is my new favorite. Written in C but with a Perl API, this indexer works a lot like Lucene. Its resulting indexes (files) may be readable by Lucene. Kinosearch works by initializing a "document" with attributes, filling each attribute with values, and saving the document. Searching is fast and easy. It does not support wildcard searching, but uses extensive stemming instead. Kinosearch does not index files from your file system; you must parse your data and feed it to Kinosearch.

ht://Dig is nice, but the last time I looked, it had no API. I found this to be too limiting. It indexes documents.

The Google Appliance is cool (and kewl) but also very expensive. This black box (well, it is really gold or blue) does a lot of the work for you. Configuring its output is dependent on your ability to do XSLT. You can feed the Google Appliance database dumps and other streams of data. Nice. I still think the price is steep.

There's Plucene, a Perl port of Lucene. Too slow, and seemingly unsupported.

Lucene and its kin seem to be the Gold Standard these days. I appreciate that, but alas, I don't have any Java experience. Increasingly people swear against SOLR, a Web Services-based interface to Lucene.

Zebra is an unsung hero. It has been around for more than ten years, actively supported and used extensively in Library Land. (I'm a librarian.) This thing can index just about any kind of document. It supports every type of searching feature (stemming, wild card, fielded, Boolean logic, relevance ranked, etc.). It can read files or be fed things from STDIN. Fast!

As an added bonus, I advocate readers explore abstracting their search interfaces with something like OpenSearch or Search/Retrieve via URL (SRU). These abstract layers allow you to create user interfaces to your underlying indexers without worrying what those indexers are. In other words, these abstract layers define the syntax for queries, the transport mechanism to the index, and the structure of the returned result. Given such a framework, you can write an OpenSearch or SRU interface to your index, but if you decide that Lucene is not what you want to use anymore but Kinosearch is, then you can change your indexer without the need to change your user interface. Very nice. OpenSearch is simpler to implement but is weak when it comes to expressive searches and search results. SRU is more robust but also more complicated.
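The abstraction argument in the last paragraph can be sketched in Python. This is not OpenSearch or SRU themselves; the `Indexer` protocol, the two toy backends and the response shape are all invented for the example. What it shows is a front end that depends only on an agreed query and result format, so the indexer behind it can be swapped.

```python
import re
from typing import Dict, List, Protocol


class Indexer(Protocol):
    """What the front end expects from any search backend."""

    def search(self, query: str) -> List[str]: ...


class SubstringIndexer:
    """Toy backend: case-insensitive substring match."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self._docs = docs

    def search(self, query: str) -> List[str]:
        q = query.lower()
        return sorted(d for d, text in self._docs.items() if q in text.lower())


class WholeWordIndexer:
    """Toy backend: the query must appear as a whole word."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self._docs = docs

    def search(self, query: str) -> List[str]:
        pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
        return sorted(d for d, text in self._docs.items() if pattern.search(text))


def search_frontend(indexer: Indexer, query: str) -> dict:
    """Return results in one fixed shape, whatever the backend is."""
    hits = indexer.search(query)
    return {"query": query, "total": len(hits), "results": hits}


DOCS = {
    "a": "Zebra indexes library records",
    "b": "Lucene indexes text",
    "c": "Swish-e crawls the file system",
}
response_one = search_frontend(SubstringIndexer(DOCS), "indexes")
response_two = search_frontend(WholeWordIndexer(DOCS), "indexes")
```

Both calls return the same structure, so a user interface written against `search_frontend` would not need to change when the backend does.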

Lucene Subprojects by esme · 2007-10-02 00:17 · Score: 1 · on Best Way to Build a Searchable Document Index?

I see a lot of people have already recommended Lucene, and I heartily agree.

But I suggest you look at the various Lucene sub-projects to see if one of them meets your needs. For example, Nutch includes a crawler and parsers for Word/PowerPoint/PDF/HTML/etc. so you wouldn't have to write that part yourself. Solr is a webapp that wraps a Lucene index in a simple web service and comes preconfigured to run inside its own servlet container on a separate port, so that's pretty easy to set up and use.

-Esme

Search software by j.leidner · 2007-10-01 23:34 · Score: 2, Informative · on Best Way to Build a Searchable Document Index?

Lucene - LINK

Terrier - LINK

Indri/Lemur - LINK / LINK

MG - LINK

I do this in several programming languages by MarkWatson · 2007-10-01 11:12 · Score: 2, Informative · on Best Way to Build a Searchable Document Index?

There are 2 problems: getting plain text out of documents, then indexing the plain text.

A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.

The Apache POI project (Java) can read and write several Microsoft Office formats.

For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).

I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote on using the Java Lucene library from JRuby:

http://markwatson.com/blog/2007/06/using-lucene-with-jruby.html

Here is another short snippet for reading OpenOffice.org documents in Ruby:

http://markwatson.com/blog/2007/05/why-odf-is-better-than-microsofts.html

---

You might just want to use the entire Nutch stack:

http://lucene.apache.org/nutch/

It is a stack that collects documents, spiders the web, has plugins for many document types, etc. Good stuff!
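The second of the two problems, indexing plain text once it has been extracted, can be sketched as a tiny inverted index in Python. This is not how Lucene, Ferret or Montezuma are implemented; the file names and text are invented, and the extraction step (antiword, POI and so on) is assumed to have already produced the strings.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical plain text already extracted from office documents.
extracted = {
    "report.doc": "Quarterly sales report",
    "notes.odt": "Meeting notes about sales",
    "readme.txt": "Install instructions",
}
index = build_index(extracted)
hits = search(index, "sales report")
```

Real indexers add stemming, ranking and on-disk storage on top of this basic term-to-documents map.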

Comments · 2,937