Juggernaut GPLd Search Engine
real bio pointed us to Juggernautsearch, which actually looks interesting. It's GPL'd. It can index 800 million pages every 3 months and deliver 10 million pages a day on a Pentium II. So I guess if you want to run your own Altavista, you can.
I'm sure I'm not the only one who believes that the top-level stories on here should have moderation on them too.
I mean, really! This search engine hardly works at all, only the search part is free (and that's the no-brainer part of any search engine), it certainly doesn't index 800 million pages (I rarely got any results on any queries) and yet they still appear on here like some news item.
Did they pay slashdot? Are they a major stockholder now? What's the deal? Or was a story once again posted without being checked first?
Give me seven million dollars, I'll double check my stories...
Did you even read the page? It's a demo version; you're searching a minimal subset of their database.
Clearly not obvious to the casual observer, and the entire page just doesn't reflect the claimed quality of the engine itself.
It's a botched launch, and right after GPLTrans too.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
Yea, great it's Op So and all, but it's tough to beat Google. Case in point, I run a small site out of my apartment (for about a month now). I have yet to do any search engine placement or promotion, other than meta tagging. A search on Google with my full site name returns a page full of my pages. Also, if you search Google for "Wah" my page just makes it in there at the bottom.
Google rocks!!
+&x
bumppo
http://www.dizz.net/
Basically, we need to get down exactly what to do and how to do it. More developers would be nice too...
Here's part of one of my messages on the list:
You can get on the list at http://www.egroups.com/group/dizz-net.
Have they come out with a search engine yet that, before giving you the results for your keyword search, TAKES OUT the 404 errors? That would be something nice. Do your keyword search, then have the search engine check each and every link to see if there's a 404 or whatever, and if there is, take it out of the results before handing them over, and save the checked results for next time in case someone else does the same search.
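A rough sketch of the idea above, in case it helps: check each result's status once, remember it, and drop anything that comes back as an error. Everything here is made up for illustration: `filter_live_links`, `fake_fetch` and the `example.com` URLs are hypothetical, and the fake fetcher stands in for a real HTTP HEAD request so the snippet runs without a network.

```python
from typing import Callable, Dict, Iterable, List


def filter_live_links(
    urls: Iterable[str],
    fetch_status: Callable[[str], int],
    status_cache: Dict[str, int],
) -> List[str]:
    """Return only the URLs whose last known HTTP status is not an error."""
    live = []
    for url in urls:
        if url not in status_cache:
            # Only check URLs we have not seen before; later searches reuse the result.
            status_cache[url] = fetch_status(url)
        if status_cache[url] < 400:
            live.append(url)
    return live


# Stand-in for the web: these statuses are invented for the example.
fake_web = {"http://example.com/a": 200, "http://example.com/gone": 404}
calls = []


def fake_fetch(url: str) -> int:
    calls.append(url)  # record every "network" check so we can count them
    return fake_web.get(url, 404)


cache: Dict[str, int] = {}
results = ["http://example.com/a", "http://example.com/gone"]
first = filter_live_links(results, fake_fetch, cache)
second = filter_live_links(results, fake_fetch, cache)  # answered from the cache
```

The catch is latency: the first person to run a query waits for every link check, and a cached "alive" status goes stale, so a real engine would presumably want an expiry time on each cache entry.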
Sorry. I pay 19c/mb recvd. No way in hell I'm gonna participate. :-)
Open Source. Closed Minds. We are Slashdot.
Anyone tried it yet? Having a strong, open-sourced search engine would be a tremendous boon to institutions on a tight budget. We have a reasonably large webspace here and we're always watching for effective ways to make the whole thing searchable.
--
A little song, a little dance, a little seltzer down your pants.
My office has been taken over by iPod people.
For the love of god, LAUNCH RIGHT!
Don't say you can index 800 million pages in three months when your database gives fewer results than Lycos circa 1996.
Hyperbole is rife in the computer world in general, and it's one of the genuine strengths of the Open Source community that we're very results oriented--Apache gets *results*. Samba *works*, and actually *does* knock NT out of the park in terms of flexibility and feature sets. And so on.
There are exceptions, granted, but we don't stretch our credibility to the breaking point nearly as much as stock-price-manipu^H^H^H^H^H^Hmaximizing corporations practically have to.
My problem with Juggernaut is that, while their technology might be awesome, their online index *isn't*. When you don't even get enough hits back to judge whether the hits are delivered in an optimum order, you know there's a problem. That, combined with the fact that the site looks decidedly 1996'ish (sorry, I know there's a webmaster out there who doesn't like me right now), tarnishes the otherwise excellent announcement that we now ostensibly (pending testing) have an extremely high quantity and quality search engine system, not to mention the birth of a new business model--the internal search engine of external content.
Honestly, I must admit there's something to be said about companies purchasing internal versions of large search engines, just so no outside source can watch the unencrypted stream of queries coming from a given company to deduce what projects they're working on.
The Juggernaut guys may be on to something, but I'm still a Google addict.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
From their demo, I am not too impressed with the search at all. It seems to be lacking many advanced options. Also, what is up with this??
. shtml)
>>
first fully automated crawler that can reindex all 800 million World Wide Web pages every three months fully available to the public for a nominal two year subscription fee.
Does that mean that they give away the search engine but you have to purchase the database???
I think that there are better options out there right now. One GPL'd search engine out there that I have liked a lot is HTDIG (http://www.htdig.org). It does not have the horsepower that the juggernautsearch "claims", but it is great for intranet/corporate/university website search.
If you are looking for a good search engine, you may also want to read the ask slashdot thread from last year on this topic. (http://slashdot.org/askslashdot/98/10/24/1756224
I typed in 'jj thompson' to see whether it would find my page about the legendary physicist (it's indexed by most engines). It didn't bother returning any matches, or even a 'no matches' message. And it's the most horrible page I've seen in a while. Green on pink text? Yeuch.
From what I read here and here: the "Juggernaut search Engine" and the "Juggernaut Search Engine Crawler" are two separate pieces of software. The former is GPLed. The latter is not for sale but you can purchase the database it creates (or get a demo/sampler subset of the database for free)
-
<SIG>
"I am not trying to prove that I am right... I am only trying to find out whether." -Bertolt Brecht
<sig>Guvf vf abg n frperg zrffntr
I have been wondering for a while now: couldn't building the index for such a search engine be distributed (like SETI@HOME or RC5)? The server would do the actual page serving, querying etc., but the spidering would be done by the clients. They'd each receive a batch of URLs from the server and start indexing them, collecting lists of newly found URLs and sending those back to the server. The server weeds out the duplicates and assigns the remaining URLs to the clients again. The more people participate, the bigger the index grows, as the available bandwidth increases too.
Hmmm... maybe I should patent this...
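For what it's worth, the server side of that scheme can be sketched in a few lines. This is only a toy under stated assumptions: `CrawlCoordinator`, `client_crawl` and the four-page `LINKS` graph are hypothetical names invented here, and the "clients" run in the same process instead of over a network.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Set


class CrawlCoordinator:
    """Server side: hands out URL batches and weeds out duplicates."""

    def __init__(self, seeds: Iterable[str], batch_size: int = 2) -> None:
        self.seen: Set[str] = set()
        self.pending: Deque[str] = deque()
        self.batch_size = batch_size
        self.submit(seeds)

    def submit(self, urls: Iterable[str]) -> None:
        """Accept URLs reported by a client, dropping any already known."""
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.pending.append(url)

    def next_batch(self) -> List[str]:
        """Give a client up to batch_size URLs that nobody has crawled yet."""
        batch: List[str] = []
        while self.pending and len(batch) < self.batch_size:
            batch.append(self.pending.popleft())
        return batch


# Toy link graph standing in for the web (page -> pages it links to).
LINKS: Dict[str, List[str]] = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}


def client_crawl(batch: List[str]) -> List[str]:
    """Client side: 'fetch' each page and report every link found on it."""
    found: List[str] = []
    for url in batch:
        found.extend(LINKS.get(url, []))
    return found


coordinator = CrawlCoordinator(["a"])
crawled: List[str] = []
while True:
    batch = coordinator.next_batch()
    if not batch:
        break
    crawled.extend(batch)
    coordinator.submit(client_crawl(batch))
```

Each page ends up crawled exactly once even though "a" and "c" are reported more than once. What the sketch leaves out is the hard part: clients that never report back, and clients that report garbage.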
superblog.org: all your favourite blogs on o
You can run your own altavista. . . and as the open source 'canon' grows, folks will also be able to have an amazon.com, a slashdot, and whatever else you want to do on the Web.
But why just the Web? With enough open-source game engines, applications, and other code to build on . . .
Well, just imagine what happens when the first Open Source 'killer App' is released. (Not that sendmail, apache, and others aren't already -- I'm talking userland, here.) What if the Next Big Computer Game was Open Source? How many zillions would install Linux to play it?
What if Open Source was suddenly the dominant software paradigm?
Can I just say, 'Oh, YEAH!'?
-Omar
I checked out the search engine. I would think that if they are selling a robot that claims to be able to index the entire web every three months they would have an online database to prove it.
:)
Try searching on slashdot. You get one link, which is at least 2 years old.
Dazzle them with bullshit.
Fish! LipHo
The green-on-pink text in the search box reads "Try One or Two Keywords in ANY International Language". So I, being a Typical American, tried English. Single English words (or any single search term, for that matter) work fine. However, using two keywords (be they my name, "Microsoft Windows," "carrot cake," etc., etc., etc.) just returns the home page all over again.
So you can only search on one keyword at a time, it has a butt-ugly page, it doesn't return relevant links, and it has a horrible domain name to boot. What a waste.
Oh wait, it's GPL'ed! Hooray! Down with the software monopolies! We'll take over the world!
Groan...
For more information, click here.
Has anyone mirrored the FTP site yet? I'm downloading at 4.2 k/second... and this is at work, where I'm more used to 150+ k/second at this time of day. I'm very eager to check out the database format, but it seems I need to first download a 50 meg file...
How can it index the entire "800 million webpages" out there and only find 19 hits for "xml"? And the Specification wasn't one of them.
Not one single hit for "wide open beavers"! And the colors are just awful.
But why is it that when I search for "ugly webpage" I get the Juggernaut Technical Support page?
:-)
Oh, I get it, I got EXACTLY what I searched for!
Examination of their ftp distribution site reveals this is an early work in progress...most docs are "under construction," and even their helpers.txt (supposedly giving credit to others) is basically empty.
I'll post more if/when their src tarball ever finishes downloading (54M - whew!...and the site is getting /.'ed right now). My guess is they drew heavily from ht://dig, WAIS, SMART and other public-source search engines and spiders.
For those who can't get through to the site: they hope to sell subscriptions to their database, so that you can run their search engine internally. It's not clear whether they intend to license the spider/crawler or just the database.
Meanwhile, to those who have complained that easy searches turn up with nil results: read the page, dudes! It says clearly that you're searching a minimal test collection, but can search the whole thing (on your local system, seems like) for a subscription fee.
Credibility break: I'm an information science professor and design/evaluate alternate information retrieval systems.
The Juggernautsearch Engine crawler is the first fully automated crawler that can reindex all 800 million World Wide Web pages every three months fully available to the public for a nominal two year subscription fee.
With the search engine being GPLed, it still relies on a subscription service in order to function. The page mentions nothing about the crawler needed to create the database, yet it also says that you are free to create your own database. Is it just me, or is this a contradiction?
For the smallest subscription it gives 1.6 million urls at $100 a year. This price goes up to $500 for 10 Million urls.
For such a useful program, it is limiting itself to its own database which costs money to use.
Just my .02
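To put the two quoted prices on the same footing, here is the per-URL arithmetic. The figures are the ones quoted in the comment above; the `TIERS` table and `dollars_per_thousand_urls` are just names chosen for this sketch.

```python
# Subscription tiers as quoted in the comment: (number of URLs, dollars per year).
TIERS = {
    "smallest": (1_600_000, 100),
    "10 million": (10_000_000, 500),
}


def dollars_per_thousand_urls(url_count: int, dollars_per_year: float) -> float:
    """Yearly price of a tier expressed per 1,000 URLs."""
    return dollars_per_year / url_count * 1000


small_rate = dollars_per_thousand_urls(*TIERS["smallest"])
large_rate = dollars_per_thousand_urls(*TIERS["10 million"])
```

That works out to about 6.25 cents per thousand URLs per year on the small tier and 5 cents on the larger one, so the bulk discount is modest.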
That's for Managing Gigabytes, and there is also a great book (note that there's a second edition out now) with the same name on the topic from Witten, Moffat and Bell. Very well written. Go to http://www.mds.rmit.edu.au/mg/intro/about_mg.html to learn about the software, including links. It also has a Freshmeat appindex: http://freshmeat.net/appindex/1999/09/09/936885957.html
BTW, I'm not associated with the university, the book or whatever. I just enjoyed reading it.
Making this a distributed effort would only be useful in a clustering environment à la Beowulf, where tight synchronization would be needed to prevent machines from revisiting the same websites. Other than that, distributed processing for web crawlers is... dubious.
However, it really does not work when you would like it to find pages that no one points to. Those unique pages are well hidden from crawlers, even though you can e-mail all of your friends about them. Until one of your friends puts a link on his start page, you're immune.
For an organization, it's the wrong avenue of approach. Organizations tend to keep their internet files on a small set of machines, in very specific directory structures. The best search engine for those machines should have permission to look at the directory structures and go through every file in them when it updates its database. This ensures that every file in that organization is collected and that no links going outside the organization are followed.
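A minimal sketch of that approach, assuming the indexer can read the site's files directly: walk the directory tree and index every file, linked or not. `index_tree` is a hypothetical helper, and the two sample files are created in a temporary directory purely for the demonstration.

```python
import os
import re
import tempfile
from collections import defaultdict
from typing import DefaultDict, Set


def index_tree(root: str) -> DefaultDict[str, Set[str]]:
    """Map each word to the set of files (relative to root) that contain it."""
    index: DefaultDict[str, Set[str]] = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                text = handle.read()
            # Links inside the file are treated as plain text and never followed.
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add(os.path.relpath(path, root))
    return index


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "docs"))
    with open(os.path.join(root, "index.html"), "w", encoding="utf-8") as f:
        f.write("<a href='http://elsewhere.example/'>carrot cake</a>")
    # No page links to this file, so a link-following crawler would miss it.
    with open(os.path.join(root, "docs", "orphan.html"), "w", encoding="utf-8") as f:
        f.write("nobody links to this carrot page")
    index = index_tree(root)
    carrot_hits = sorted(index["carrot"])
```

Both files turn up for "carrot", including the orphan that nothing links to, and the outside link in index.html is never fetched.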
Ken Boucher
No Zen is good zen
Well, alright, so it can do 10 mil a day on a PII with what? A T3? How much bandwidth do you need before the processor becomes the limiting factor with this engine? I certainly don't think my 26.4 connection at home can handle 10 mil pages a day. They should make some mention of that on the page.
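A back-of-envelope version of that question. The big assumption, not from their page, is an average page size of 10,000 bytes; change `AVG_PAGE_BYTES` and the answer scales with it. The function names are made up for this sketch.

```python
PAGES_PER_DAY = 10_000_000  # the figure claimed for a Pentium II
AVG_PAGE_BYTES = 10_000     # ASSUMPTION: average page size, not stated anywhere
SECONDS_PER_DAY = 86_400
MODEM_BITS_PER_SECOND = 26_400  # the "26.4 connection" mentioned above


def required_mbps(pages_per_day: int, avg_page_bytes: int) -> float:
    """Sustained megabits per second needed to fetch that many pages a day."""
    return pages_per_day * avg_page_bytes * 8 / SECONDS_PER_DAY / 1_000_000


def pages_per_day_at(bits_per_second: int, avg_page_bytes: int) -> int:
    """Whole pages a link of the given speed can fetch in 24 hours."""
    return bits_per_second * SECONDS_PER_DAY // (8 * avg_page_bytes)


needed = required_mbps(PAGES_PER_DAY, AVG_PAGE_BYTES)
modem_pages = pages_per_day_at(MODEM_BITS_PER_SECOND, AVG_PAGE_BYTES)
```

Under that assumption the crawl needs roughly 9.3 Mbit/s sustained, which fits inside a T3's nominal 44.736 Mbit/s, while the modem tops out at 28,512 pages a day even with the line saturated around the clock.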
On a side note, I was very disappointed when a search for "deez nuts" came up dry... oh well.
//Phizzy
"Most European technology just isn't worth our stealing," -- Former CIA chief James Woolsey, referring to Echelon
Unlike Juggernaut, it's a complete search engine system (crawler, database & front-end), it was developed over a long time, and has capabilities that even most modern search engines don't (such as relaxed spelling).
IMHO, it would be better for the Open Source community, as a whole, if someone picked up Harvest, modernised it and maintained it. At present, it's the best "openish" source Search Engine out there, and it's going to waste.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)