Domain: swish-e.org
Stories and comments across the archive that link to swish-e.org.
Comments · 19
-
Re:Swish-E
I've used Swish in its variants since it was an alternative to WAIS. http://www.swish-e.org/
"Swish, the first openly gay search interface! It not only returns hits faster than the other search interfaces, it does it with more pizazz."
-
Swish-E
I've used Swish in its variants since it was an alternative to WAIS.
-
swish-e
I implemented swish-e, http://swish-e.org/, for a client with html and .pdf indexing (nightly) in 11 hours from a standing start (never used swish-e before).
-
kinosearch, swish-e, zebra, ht://Dig, etc.
There are many ways to skin this cat. I believe most of them have been mentioned, but I will outline my experiences anyway.
swish-e is a grand-daddy of an indexer. It can act as a robot, crawl your local file system, or get its input from STDIN. If indexing HTML, swish-e will index the document's metatags and provide field searching against them. Swish-e comes with a C, Perl, and PHP API. I don't think swish-e supports anything but ASCII very well.
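To make the "input from STDIN" option concrete, here is a minimal sketch of a feeder program. It assumes the header-based record format that, as far as I recall, swish-e's documentation describes for its external-program input method (a `Path-Name` and a `Content-Length` header, a blank line, then the document body). The function names and the sample documents are made up for illustration, so check the swish-e documentation before relying on any of it.

```python
import sys


def format_record(path, text, doc_type=None):
    """Build one document record: headers, a blank line, then the body.

    The header names follow my recollection of swish-e's external-program
    input format; treat them as an assumption and verify against the docs.
    """
    body = text.encode("utf-8")
    headers = [f"Path-Name: {path}", f"Content-Length: {len(body)}"]
    if doc_type is not None:
        # Optional parser hint; the accepted values are an assumption here.
        headers.append(f"Document-Type: {doc_type}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + body


def emit(docs, out=None):
    """Write a record for every (path, text) pair to a binary stream."""
    out = sys.stdout.buffer if out is None else out
    for path, text in docs:
        out.write(format_record(path, text))


if __name__ == "__main__":
    # Hypothetical documents; a real feeder would pull these from a
    # database, a crawler, or a format converter.
    emit([("notes/first.txt", "hello world"), ("notes/second.txt", "swish")])
```

Content-Length is counted in bytes, not characters, which is why the body is encoded before it is measured.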
kinosearch is my new favorite. Written in C but with a Perl API, this indexer works a lot like Lucene. Its resulting indexes (files) may be readable by Lucene. Kinosearch works by initializing a "document" with attributes, filling each attribute with values, and saving the document. Searching is fast and easy. It does not support wildcard searching, but uses extensive stemming instead. Kinosearch does not index files from your file system; you must parse your data and feed it to Kinosearch.
ht://Dig is nice, but the last time I looked, it had no API. I found this to be too limiting. It indexes documents.
The Google Appliance is cool (and kewl) but also very expensive. This black box (well, it is really gold or blue) does a lot of the work for you. Configuring its output is dependent on your ability to do XSLT. You can feed the Google Appliance database dumps and other streams of data. Nice. I still think the price is steep.
There's Plucene, a Perl port of Lucene. Too slow, and seemingly unsupported.
Lucene and its kin seem to be the Gold Standard these days. I appreciate that, but alas, I don't have any Java experience. Increasingly people swear by SOLR, a Web Services-based interface to Lucene.
Zebra is an unsung hero. It has been around for more than ten years, actively supported and used extensively in Library Land. (I'm a librarian.) This thing can index just about any kind of document. It supports every type of searching feature (stemming, wild card, fielded, Boolean logic, relevance ranked, etc.). It can read files or be fed things from STDIN. Fast!
As an added bonus, I advocate readers explore abstracting their search interfaces with something like OpenSearch or Search/Retrieve via URL (SRU). These abstract layers allow you to create user interfaces to your underlying indexers without worrying what those indexers are. In other words, these abstract layers define the syntax for queries, the transport mechanism to the index, and the structure of the returned result. Given such a framework, you can write an OpenSearch or SRU interface to your index, but if you decide that Lucene is not what you want to use anymore but Kinosearch is, then you can change your indexer without the need to change your user interface. Very nice. OpenSearch is simpler to implement but is weak when it comes to expressive searches and search results. SRU is more robust but also more complicated.
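The decoupling described above can be sketched in a few lines. This is not OpenSearch or SRU themselves, only the underlying idea: the user interface talks to an abstract search interface, and the indexer behind it can be swapped. Every name here (`SearchBackend`, `InMemoryBackend`, `render_results`) is hypothetical, and the in-memory backend is a toy stand-in for a real indexer such as Lucene or Kinosearch.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Hit:
    path: str
    score: float


class SearchBackend(Protocol):
    """What the user interface is allowed to assume about any indexer."""

    def search(self, query: str, limit: int) -> list[Hit]: ...


class InMemoryBackend:
    """Toy backend: scores a document by counting query-term occurrences."""

    def __init__(self, docs: dict[str, str]):
        self.docs = docs

    def search(self, query: str, limit: int) -> list[Hit]:
        terms = query.lower().split()
        hits = []
        for path, text in self.docs.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score:
                hits.append(Hit(path, float(score)))
        # Best score first; ties broken by path so output is stable.
        hits.sort(key=lambda hit: (-hit.score, hit.path))
        return hits[:limit]


def render_results(backend: SearchBackend, query: str, limit: int = 10) -> list[str]:
    """User-interface code: depends only on the SearchBackend interface."""
    return [f"{hit.score:g} {hit.path}" for hit in backend.search(query, limit)]
```

Replacing `InMemoryBackend` with a wrapper around another indexer would leave `render_results` untouched, which is the point the paragraph makes about changing indexers without changing the user interface.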
-
Re:Swish-E
Swish-E's configuration is pretty flexible even when it comes to relevancy ranking, though it is also quite non-intuitive for lots of different aspects of the configuration.
And yes, it does not support UTF-8/Unicode/anything-non-ASCII-8.
But the developer list is quite active and responses are usually accurate (though they also can be terse and sometimes overly-authoritative).
-
Swish-E
-
Re:slocate?
Solution: Swish-e.
Yes, it is open source.
-
You should be limiting .DOC email exchange anyway
Even ignoring viruses/worms altogether, it's not a good idea for users to be exchanging .DOC, .XLS, and .PPT files through email. People do this for two reasons:
- Exchanging finished documents for reading. PDF is better:
  - It can reproduce the results exactly.
  - It doesn't include Word's "change tracking" information which can cause embarrassing leaks.
  - It's a standard with many interoperable implementations.
- Exchanging in-progress documents for revision. At least for stuff limited to your company, a version control server (like Subversion with friendly TortoiseSVN clients) is better:
  - Doesn't cause email storage to grow enormously. Instead, a server actually meant for this kind of thing stores only deltas. And only one copy of each document - on most mailservers, the disk space consumed by an attachment is proportional to the number of recipients.
  - Lets you easily find the latest version of a document. ("Did he send me another copy after this? I'm not sure.")
  - Lets you easily retrieve any previous version, see changes/authors/checkin comments. (I don't trust Word's built-in change tracking, and you shouldn't either. Its security model is flawed, and I don't think it's reliable to begin with.)
  - Supports locking/unlocking documents to prevent conflicting changes.
  - With some setup, supports diffing and merging office documents. You can maintain branches!
  - Supports searching - where I work, we've plugged in swish-e for full-text searching over our documentation repository.
[...] .DOC and .XLS files sent from one employee to another. It'd force them to use the documentation repository and save us all a tremendous amount of pain trying to dig through email for the right version of some Product Requirements Document. It'd also stop the whining from people complaining about hitting their email storage limits all the time.
-
swish-e as a Google Mini alternative
I seriously considered getting a Google Mini for my law office. The desktop search stuff wasn't really doing it for us, and we have boatloads of work that we reuse on a regular basis -- pleadings/contracts/settlement agreements, etc. are sort of like code in that respect -- we always want to reuse our knowledge rather than reinventing the wheel. My concern was that the regular Google appliance was too expensive. The mini seemed reasonable, but I still was resisting the idea of paying that much for search.
In any case, I had searched high and low for a decent search function when I happened upon swish-e. I am exceptionally pleased with it. It can be found at swish-e.org.
I am not an uber geek, but I was capable of spending an afternoon monkeying with it to install it, set up regular indexing as a cron job, get it to properly read and index OpenOffice documents, and to launch them from the browser. This involved some frightening security settings, but I have a small enough office (three people) that I'm not too torqued about this. The wide open settings I used were not swish-e's fault, as near as I could tell. Rather, they resulted from my laziness -- "It works well enough now, and the likelihood of malicious use is pretty low, so fuck it".
Obviously, it could be set up a bit more cleanly on my end, but I am really, really happy with it apart from that. Currently, it runs on a used SCSI-RAIDed IBM Netfinity box that I picked up for a little under $500.
The time and money I spent on the hardware plus getting it running has paid immense dividends. I have benefitted in two primary ways:
First: my office minions use the network for storage and do not store anything locally. This means that everything is indexed (and can be found!) and because they like the search so much, they also (unwittingly, perhaps) give me the peace of mind knowing that our data also gets the other benefits of being on the network (everything is backed up automatically/regularly, etc.).
They like being able to find stuff, so the search has really encouraged saving stuff on the network. I could mandate this in other ways, but I'd rather have them drinking my Kool Aid than simply imposing the idea.
Second: My minions and I have saved tons of time using the search feature. Any good search does that. The additional bonus is that I no longer have to worry about the next version of Google Desktop or Copernic or installing it on various machines, blah, blah, blah. It's all centrally saved and configured. Administration is essentially zero since I am getting good search results on all the document types that I need - some old MS Office leftovers, Open Office, and PDF.
I don't see needing to change this in any significant way for at least as long as I keep the hardware. I think that the next time I'll need to touch it will be when the index outgrows the box serving the searches.
The box I'm running has dual 1.something gig pentiums with a gig of RAM. The drives are the weak link, with only 9.1 GB of space available for storage of OS, index, etc. The box also has redundant power supplies, redundant power supplies, redundant ethernet connections (100MB), and redundant ethernet connections (100MB).
The front end to the search is just a standard, "came with it" CGI script (swish.cgi). It works just fine. It gets called up as a webpage locally, and it spits out results.
On a final note, we are pretty aggressive in enforcing standardized file naming conventions. The naming conventions typically include the client name, the matter, a date, the type of document, and the subject of the document. Swish-e has document path, title, title and body searches off the interface we use, and you'll usually find exactly what you're looking for if you're reasonably specific.
On a final note, swish-e has been unsuccessful when I have used the following search terms "nubile blonde woman" and "willing to get with me". In that respect, swish-e has been an outright failure, though it is conceivable that the fault lies with operator error.
GF.
-
Re:Advantages?
Try Swish-e for indexing and searching word docs, pdfs, or anything else that can be converted to text.
-
beagle, lucene or swishe for the rest
-
Re:Linux anyone?
Locate isn't bad, but for some applications you really need to have a content-based search that can't be accommodated by variations on grep. The grep family is great when you are dealing with text based files, but tends to run into problems with content like pdf and OpenOffice.org files.
So for a practical example, I have about 120 collected pdf files of academic articles under filenames with the primary author and year. (I could put the title in there, but filenames between 16-25 characters seem to be reasonable.)
If I'm doing reading on a particular topic, I might want, for example, all of the articles related to Barry Wellman's work on social networks on the internet. The obvious way to get that is to list all of the articles that cite Wellman. This is probably not information that I want to put in the filename.
So, to try a naive example (which according to others here should work.)
% time grep -il wellman *.pdf
grep -il wellman *.pdf 0.65s user 1.27s system 99% cpu 1.939 total
So in this case, grep spends about two seconds returning no results.
Now I could write a shell script that runs pdftotext on every file in my library, then grep the output. But pdftotext is expensive for one file, much less a directory of 120 files:
% time pdftotext postgresql_tutorial.pdf - > /dev/null
pdftotext postgresql_tutorial.pdf - > /dev/null 1.84s user 0.16s system 99% cpu 2.019 total
Thankfully, I have a document indexing application that does the work for me. A while back I set up swish-e to index almost everything in my home directory. So...
% time swish-e -f ~/.swish-e/Web_index -w wellman | grep library
1000 /home/kirk/www/library/garton_1997.html "STUDYING ONLINE SOCIAL NETWORKS, by Laura Garton, Caroline Haythornthwaite, and Barry Wellman" 103238
927 /home/kirk/www/library/koku_2003.doc "koku_2003.doc" 306176
375 /home/kirk/www/library/Cassell_2005.pdf "Cassell_2005.pdf" 615126
323 /home/kirk/www/library/Qualifying_Exams/onlinecomm.pdf "onlinecomm.pdf" 63894
255 /home/kirk/www/library/Koehly_1998.pdf "Koehly_1998.pdf" 1410176
255 /home/kirk/www/library/Qualifying_Exams/methods.pdf "methods.pdf" 72688
255 /home/kirk/www/library/cho_2003.pdf "cho_2003.pdf" 118267
161 /home/kirk/www/library/SearchDBDT/INDEX_K.IX "INDEX_K.IX" 294912
161 /home/kirk/www/library/ICLS_doctoral_consortium_proposal.pdf "ICLS_doctoral_consortium_proposal.pdf" 44923
161 /home/kirk/www/library/barab_ilf_2002.pdf "barab_ilf_2002.pdf" 280560
161 /home/kirk/www/library/barab_dvc.pdf "barab_dvc.pdf" 683011
swish-e -f ~/.swish-e/Web_index -w wellman 0.05s user 0.03s system 95% cpu 0.090 total
grep library 0.00s user 0.01s system 9% cpu 0.087 total
The full-text index gives me 11 hits, in 1/20th of the time as a naive grep, sorted by score. (It missed one, primarily because xpdf respects copy protection while Copernic seems to be able to index through copy protection.)
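The difference between converting every file at query time and looking a word up in a prebuilt index can be sketched as follows. This is a toy inverted index, not how swish-e works internally; it has no scoring or ranking. The `pdftotext FILE -` invocation is the one used in the timings above, while the function names are hypothetical and the extractor is injectable so the idea can be tried without pdftotext installed.

```python
import re
import subprocess
from collections import defaultdict


def pdf_to_text(path):
    """Run `pdftotext FILE -` and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", str(path), "-"],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return result.stdout


def build_index(paths, extract=pdf_to_text):
    """Pay the conversion cost once: map each lowercased word to its files."""
    index = defaultdict(set)
    for path in paths:
        for word in re.findall(r"\w+", extract(path).lower()):
            index[word].add(str(path))
    return index


def lookup(index, word):
    """Query time is a dictionary lookup, with no conversion or scanning."""
    return sorted(index.get(word.lower(), ()))
```

With an index built nightly, `lookup(index, "wellman")` stays cheap no matter how slow the conversion of each PDF is.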
Sometimes fulltext searching is useful, and egrep just does not work.
-
[LU]N[UI]X in need of search
Personally, I still like 'find / > index' in a cron script, then just grep 'index'....
That's almost a flamebait in the original post, because it's so utterly unprincipled, ineffective and inefficient.
- Ineffective: Most importantly, it doesn't actually search a term index; you can only search for file names (so you have to know already what you're searching for). There is no good desktop search tool for UNIX that I'm aware of (although I've used SWISH-E to index plain text document collections, that's still different from a tool intended to index whole directory trees for full-text search).
- Inefficient: The find/locate commands don't use an index. People below have proposed updatedb, but I doubt that uses incremental index updating, which can become essential if you run it once per night on a large machine. Full-text indexing is much more resource intensive than just indexing file names, so you want to be even more sure that when tomorrow's cron job starts, today's will have finished.
- Unprincipled: You could actually find a pipeline of UNIX system commands that implement full-text indexing and search, but that's not a good way to do it. I am aware of the power and versatility of the pipe paradigm, but search is such a fundamental (pervasive, important) problem that it licenses a dedicated development.
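The incremental updating mentioned in the list above can be sketched like this: remember each file's modification time from the previous run and re-index only files that are new or have changed. This illustrates the idea only, says nothing about what updatedb or any particular indexer does, and the function names are hypothetical.

```python
import os


def changed_files(root, seen):
    """Yield files under root that are new or modified since the last run.

    `seen` maps path -> mtime from the previous run and is updated in place,
    so persisting it between runs (for example as JSON) makes a nightly job
    touch only what changed.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if seen.get(path) != mtime:
                seen[path] = mtime
                yield path


def forget_deleted(seen):
    """Drop entries for files that no longer exist and return their paths."""
    gone = [path for path in seen if not os.path.exists(path)]
    for path in gone:
        del seen[path]
    return gone
```

Only the paths yielded by `changed_files` need the expensive full-text pass, which is what keeps tonight's run from overlapping with tomorrow's.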
Ideally, there'd be a search engine which is part of the operating system, and Microsoft has recognised this and has been working on it for quite some time now. It will be a major selling point of Longhorn, and I predict it will dramatically enhance Windows usability compared to Linux.
Unfortunately, the open source community has not recognised the problem as a whole, but I'm aware the people on the ReiserFS file system have ambitious future plans to include features in that direction (but that might come too late), and I wouldn't count on the likes of Yahoo/Google to deliver the ultimate UNIX/Linux search solution.
--
Try Nuggets, the first SMS search engine -- text your questions, get your answers from the Web.
-
swish-e for unix and OSX
You have to read the documentation to set it up, but swish-e is an indexing and search system that I've found to be quite effective. It can handle MSWord (with catdoc), pdf (with xpdf) and mp3 meta tags. It's also not very hard to write a script to extract OpenOffice.org documents to stdout as well. It comes with C and Perl bindings and there is a Python interface as well.
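For the OpenOffice.org extraction, a minimal sketch might look like the following. It relies on an OpenDocument file being a zip archive whose text lives in content.xml; the function name is hypothetical, and the handling is deliberately simple (paragraphs and headings only, with no special treatment of notes, tracked changes, or text boxes).

```python
import sys
import xml.etree.ElementTree as ET
import zipfile

# Namespace of OpenDocument text elements such as <text:p> and <text:h>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
WANTED = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def odt_to_text(path_or_file):
    """Return the paragraph and heading text of an OpenDocument text file."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    lines = []
    for elem in root.iter():
        if elem.tag in WANTED:
            text = "".join(elem.itertext()).strip()
            if text:
                lines.append(text)
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the text of each file named on the command line to stdout.
    for name in sys.argv[1:]:
        print(odt_to_text(name))
```

How such a script gets wired into swish-e as a filter is a configuration matter not shown here.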
-
Re:standard filesystems are NOT databases
therefore, doing searches on a relational database filesystem (find me all music files with dates between last week and last month: SELECT * from files WHERE files.type = "music" and files.date NOW() - 7days
you _can't_ do that sort of thing on a traditional filesystem.
Ahem:
find . -name '*.ogg' -mtime -7
That's just for the bad SQL you posted (I imagine there must be a missing operator in the where clause between "files.date" and "NOW()".)
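For comparison, the same metadata query can be written against an ordinary filesystem in a few lines of Python. This is a rough equivalent of the find one-liner, assuming Ogg files are what was meant; unlike `find -name`, it only returns regular files. The function name and defaults are hypothetical.

```python
import time
from pathlib import Path


def recent_files(root, pattern="*.ogg", days=7, now=None):
    """List files under root matching pattern, modified within `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    return sorted(
        str(path)
        for path in Path(root).rglob(pattern)
        if path.is_file() and path.stat().st_mtime > cutoff
    )
```

Calling `recent_files(".")` walks the tree on every query, so like find it needs no index but does pay for a full traversal each time.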
A better example of where indexing would be useful is a case where you want to index the content of files. For example "I want a list of all papers authored by me where I've referenced Doe, Doe, and Doe's seminal paper 'Mental Masturbation About the Need for Relational Database Filesystems on Slashdot'". But even then, this is a type of indexing that can be easily accommodated within a hierarchical database. (For two examples, glimpse and swish-e come immediately to mind.)
To answer your two questions:
some people mentioned here that they already organise their files. great. fantastic.
HOW LONG DID IT TAKE YOU?
A pretty trivial amount of time. Each new project starts with:
mkdir newproject
cd newproject
A maximum of 1 minute, even with my gimpy hands. This is about the same amount of time, and perhaps much less, that I would have to spend adding the keyword "newproject" to every file that gets added to the project.
and how long would it take to reorganise?
Well, it depends on the reorganization. But for example, I reorganized one of my key project directories (version controlled in subversion) with:
svn mv proposal-may2004 proposal1
svn mv proposal-aug2004 proposal2
svn cp proposal2 proposal
svn commit
(Making two archive "tags" and starting a new branch for the "HEAD" tag.)
A nice thing about hierarchical file systems is that you can reorganize on multiple levels including collections of files. ("proposal" in my case includes about a half-dozen files.) I would argue that it would take me just as much time to recode those files using database keywords as renaming the directories they reside in.
A large part of the argument for relational database filesystems seems to be that the same types of people who are unwilling to do the work necessary to create good hierarchical file trees, will be willing to do the work to attach the metadata needed to replace hierarchical file trees. Switching from a hierarchical to relational model is not going to change the GIGO rule, and the solutions around the GIGO problem (such as full-text searching, and journaling) don't depend on either model.
A hierarchical model also provides some nice facilities for dealing with related collections of files. It is unclear to me how they would be implemented in a relational model, or how a relational model would deal with standard filenames like "README", "CHANGELOG" and "index.html".
-
Re:Penetrator
Also worth looking at is SWISH-Enhanced. It can deal with Word .docs (with a bit of configuration), PDF, HTML and anything you can filter to text.
-
Re:Altavista did it 6 years ago
-
Re:Shameless plug for SWISH++
Some folks (me included) think swish-e is even more impressive and easier to set up and maintain.
-
benchmarks
I benchmarked the MATCH/AGAINST functionality: for a query against a fulltext index covering 3 varchar(250) fields populated with an average of 100 chars each, with a search string of three words (perl, apache and mysql), on a table with 100,000 records, I get an average response of 0.97 seconds. The first query comes in slow at around 5 seconds, but after it has cached (I assume) the index, it's blazingly fast. This on a P550 (intel) with 400 Megs of Ram and a vanilla single IDE disk. I'm curious about how this fulltext searching compares with SwishE. Anyone got any benchmarks?