grub.org · Domains · Slashdot Mirror

Distributed search engines failed by cpghost · 2010-10-30 06:45 · Score: 2, Interesting · on Is Google Polluting the Internet?

We've tried this before with GRUB, but it didn't really take off for a multitude of reasons.

Woah by the_kanzure · 2008-01-07 15:43 · Score: 1 · on Wikia Search Launches Alpha, Not Ready Yet

Wikia Search is open source; it's based on Grub (which we have already talked about before). Here's the source code to the grub Windows client, and there's a dev site too. The current scoring algorithm is over here. If you want to talk with Jimbo and the developers, hop on to the mailing list and let's talk.

Anyway, it looks like there's an opportunity here to *improve* this search engine -- programmers, I know you are reading, so at least check out the code. There's been talk about running some competitions for improving the search results (the scoring algorithms). How many of us would like to form a team? Maybe I'll do one. Who's with me?

(Btw, these guys need help. I just found all of this after the recent news articles.) Screw my mod points.

Re:Challenging Google? by ThreeGigs · 2008-01-01 15:44 · Score: 3, Interesting · on Wikia Search Engine to be Launched on January 7th

It looks like you've entered some sort of partnership with Grub http://www.grub.org/.
If so, kudos... Grub's been languishing in not-ready-for-primetime land for far too long, and the ability to crawl your own site to keep results current is a bonus, too.

Translated Press Release by martyb · 2007-07-30 06:55 · Score: 1 · on Wikia Acquires Grub, Releases it Under Open Source

FTFA:

Wikia has acquired the Grub source code from LookSmart. We will be posting the complete, current codebase as soon as possible, here on Grub.org. In the meantime, sign up and stay tuned to developments regarding getting Grub going again.

Translation: Development had slowed to a SNAIL's pace(*), but now casting off its SHELL, we bring you a new and improved (TM) GRUB!

(*) From: Member Statistics (as of 20070730 at 14:47 EDT)

Members Overview
Total members: 1,049
Oldest member: 14,016 days
Active this month: 2

Let's see here:

dc --expression="4k 14016 365.25 / f"

38.3737 YEARS!
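
For anyone without dc handy, the same conversion can be checked in a couple of lines of Python. This is only a sketch of the arithmetic quoted above; the function name is made up for illustration, and 365.25 days per year is the same averaging assumption the dc command uses.

```python
# Convert the "oldest member" age from days to years.
DAYS_PER_YEAR = 365.25  # average year length, same divisor as the dc command


def days_to_years(days: float) -> float:
    """Return the number of years represented by `days`."""
    return days / DAYS_PER_YEAR


print(f"{days_to_years(14016):.4f} years")
```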

Re:How do we know Goog isn't giving up info alread by classh_2005 · 2007-06-10 11:33 · Score: 2, Informative · on Privacy Group Gives Google Lowest Possible Grade

Just as an aside, it's high time there was a serious effort at producing a decent open source search. Personally, I think a distributed network with anonymizing services makes the most sense. I know there are projects in existence already, but more people will have to become aware of them. Some Open Source search projects are:

http://www.majestic12.co.uk/projects/dsearch//

http://www.aspseek.org/about.html//

http://sourceforge.net/projects/ebiness//

http://www.grub.org/html/documents.php//

http://lucene.apache.org/nutch/bot.html//

I really want to see one of these projects take off, I'd tap a vein at the local plasma center to donate funds :>

Re:OSS Google Killer? by Anonymous Coward · 2005-08-27 05:25 · Score: 0 · on Has Google Peaked?

What, like grub? Or any of the myriad of p2p search engines?

Re:Formula by sillybilly · 2005-07-14 04:41 · Score: 4, Informative · on Ambiguity Drives Google's Valuation

Yes, profit. If you recall, Google was completely privately held from 1998 until its recent IPO. Why? Because that's when the owners decided the value was fully generated, and it can no longer grow, or (gasp) might even fall. Time to dump and get cash while it's hot. For all the wonder that Google is - and many thanks to its precious inventors, you are forever in our hearts - its core technology is severely limited, because it's based on a centralized system, and there is something better on the horizon. The real answer is distributed computing, where you can do the indexing locally and only send up the index, but this means giving up control, thus giving up share value. I wonder how long it will be possible for this next wonder-genie to be kept tight in a bottle. It could be quite some time until the cork is pulled - a few thousand years? - but sooner or later it happens.

Big name != "real" by droleary · 2005-06-09 11:59 · Score: 4, Informative · on Who Isn't Paying Attention to ROBOTS.TXT?

I see that there appear to be real, legitimate, search engines that do not follow robots.txt rules.

No, you rather see some well-known search engines that generate illegitimate traffic instead of behaving properly. I note a number of them in this highly-documented robots.txt file. I'm personally most offended by idiots running this shit, since there is no single IP block to blacklist.
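
For reference, here is roughly what "behaving properly" looks like from the crawler's side. This is a minimal sketch using Python's standard urllib.robotparser; the robots.txt content, the bot names and the example.com URLs are all made up for illustration, and a real crawler would fetch the file from the site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one specific bot is banned from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """A well-behaved crawler asks this before every request."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for agent, url in [
    ("politebot", "http://example.com/index.html"),
    ("politebot", "http://example.com/private/notes.html"),
    ("examplebot", "http://example.com/index.html"),
]:
    print(agent, url, may_fetch(parser, agent, url))
```

The check costs one cached file per site, which is why ignoring it is hard to excuse.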

Re:Let me guess... by Anonymous Coward · 2005-04-19 12:42 · Score: 0 · on Providers Ignoring DNS TTL?

RoadRunner will also complain loudly if you run the Grub crawler. (It's sort of a distributed web spider.) Same reason: too many DNS lookups. I set up a local BIND9 forwarding to Verizon's free server on 4.2.2.4, and pointed all my machines to use the local BIND server.

Verizon isn't bitching about it and I'm not even paying them =)
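
Why a local caching server helps can be shown with a toy model. The sketch below is not how BIND works internally; it is a hypothetical in-memory cache with a fixed TTL wrapped around whatever upstream lookup function you hand it (the class name, the 300-second TTL and the fake upstream are all assumptions for illustration). The point is only that repeated lookups for the same host stop reaching the upstream server.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy DNS cache: remembers answers for a fixed TTL."""

    def __init__(
        self,
        upstream: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # host -> (address, expiry)
        self.upstream_queries = 0

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        hit = self._cache.get(hostname)
        if hit is not None and hit[1] > now:
            return hit[0]  # still fresh: no upstream traffic
        self.upstream_queries += 1
        address = self._upstream(hostname)
        self._cache[hostname] = (address, now + self._ttl)
        return address


def fake_upstream(hostname: str) -> str:
    """Stand-in for the ISP's resolver; always returns a documentation address."""
    return "192.0.2.1"


resolver = CachingResolver(fake_upstream)
for _ in range(1000):
    resolver.resolve("example.com")
print("upstream queries:", resolver.upstream_queries)
```

A crawler that hits the same hosts over and over is the best case for this kind of cache.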

DIY Search by cpghost · 2004-11-12 05:05 · Score: 1 · on MSN Search Roundup

Google, Yahoo, MSN etc... why do we actually rely on a commercial entity for searches? Wasn't there a distributed search project called Grub? Can't we just set up something similar that would be totally independent of any entity that would always be susceptible to *cough* influence *cough*?

Something similar to the Linux movement, but with even more impact on the general Internet population? C'mon, we can do it, can't we? We're also using BitTorrent to get more independence from central FTP servers. Distributed search would be just the same.

There already is distributed crawling by Anonymous Coward · 2004-09-12 11:53 · Score: 3, Interesting · on P2P Web searches

It's called grub.

Re:P2P? by cgenman · 2004-04-17 17:47 · Score: 4, Interesting · on How to Build a Search Engine

The closest thing to what you're talking about is Grub, which is run by Looksmart as a dead-link checker and also feeds to WiseNut. While it doesn't allow you to crawl sites that you don't have control over, it does allow you to crawl your own site.

Personally, I've wanted a Google toolbar that indexes the sites that you surf, and adds additional positive weight to the sites that you linger on. It may not know what you liked there, but it knows that you liked it.

Completely offtopic, but does anyone know of a screensaver on Windows that displays random (or spidered) web pages? I've been looking for an equivalent to the XWindows version for years.

Re:Let's roll our own distributed search engine by BubbleNOP · 2004-03-20 12:28 · Score: 1 · on MSN Rolling Out New Search Engine In July

Grub is pretty close to what you're looking for. You can also grab its source.

Re:A few nits to pick. by Dan+Crash · 2004-03-08 12:25 · Score: 1 · on How The Web Ruined The Encyclopedia Business

If distributed search and trust proves to be better, won't Google simply adopt it as a model? It's hard to imagine a company as innovative and powerful as Google letting itself be displaced.

Are there any other major distributed search projects going on right now besides Grub?

Re:I probably would have done this differently... by Afromelonhead · 2004-01-07 15:46 · Score: 1 · on Internet Archive Opens Crawler Code Under LGPL

For the first problem, though, perhaps a SETI-ish distributed "Heritrix" could help make regularly archiving all of these sites a manageable affair. IA sends marching orders out to the distributed volunteer network, each client downloads, compares MD5 of the pages with other clients, compresses them, and sends them back to a master archive. Sounds great in theory, at least at first, to me...

There actually is a program out there called Grub that tries to follow this concept. I had contributed to the project in its infancy, but once it was bought out by LookSmart, I kinda moved away from it. A lot of people were complaining about Grub's utter lack of respect for no crawl sections of sites and robots.txt. It might have changed a little bit since then to actually support robots.txt, so it might be worth your try.
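
The scheme quoted above (several clients download the same page, compare MD5 sums, compress, send back) might look something like the sketch below. This is a guess at the idea, not Heritrix or Grub code; the function names and the two-client quorum are assumptions made for illustration.

```python
import hashlib
import zlib
from collections import Counter
from typing import Dict, Optional


def md5_hex(page: bytes) -> str:
    """MD5 digest of a downloaded page, as a hex string."""
    return hashlib.md5(page).hexdigest()


def agree_and_pack(fetches: Dict[str, bytes], quorum: int = 2) -> Optional[bytes]:
    """Given client name -> page body, return the compressed body that at
    least `quorum` clients agree on, or None if there is no agreement."""
    if not fetches:
        return None
    digests = {client: md5_hex(body) for client, body in fetches.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes < quorum:
        return None
    body = next(b for c, b in fetches.items() if digests[c] == winner)
    return zlib.compress(body)


fetches = {
    "client-a": b"<html>hello</html>",
    "client-b": b"<html>hello</html>",
    "client-c": b"<html>tampered</html>",
}
packed = agree_and_pack(fetches)
print("agreed:", packed is not None)
```

Comparing digests guards against a single broken or dishonest client, at the cost of downloading every page more than once.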

Yes There is a Distributed Search Engine... by Anonymous Coward · 2003-10-31 03:43 · Score: 0 · on Google Considering Merger With Microsoft

Bah, more like anonymous lazy person. You might want to look at Grub which is actually a distributed search engine. The spidering results still go to a single, monolithic organization though, LookSmart I think. Getting a truly decentralized search engine would be difficult, especially with the marked delays that decentralized networks suffer from, but I see no reason why not. Are you going to program it?

Starling

What About Distributed Search Engines? by serutan · 2003-07-14 08:29 · Score: 1 · on Yahoo Buys Overture for $1.63 Billion

Are there any non-commercial projects afoot to build a distributed web search engine? I found Grub, a SETI@home-like web crawler, but it seems to me that any commercial venture under sufficient financial pressure will eventually resort to paid listings.

read the grub forums by denny_d · 2003-04-20 01:54 · Score: 1 · on Building a Bigger Search Engine

The idea is cool and I imagine it won't be long before an org. without links (unverified) to M$, will do the same thing. There's at least a couple of people on the grub forum who are figuring out some of the shadier sides of this code: potential spyware? security hole? And the licensing is vague (no links).
Note the tone of their pitch as well: you are participating in a competitive group effort akin to Seti@home and Distributed.net? I don't think so... caveat emptor.

Re:Altruistic? by R0 · 2003-04-19 22:19 · Score: 5, Funny · on Building a Bigger Search Engine

Notice
======
The main executable has been renamed to "grubclient" out of respect for the GNU Grub bootloader, whose executable is named "grub". They were out first, so we decided to pick another name. If you have a catchy suggestion for a new name, please let us know.

I nominate "parasite".

Help Grub crawl the web by Anonymous Coward · 2003-04-13 16:53 · Score: 0 · on NYT On Google's Role In Internet Advertising

The Grub project is a distributed method of crawling the internet. You download the client and you help LookSmart (their search engine, WiseNut, is pretty good but not the best) crawl the web.
In my opinion it is better to contribute your spare bandwidth and CPU to help make sure more of the internet is crawled, and more frequently, instead of something more pie in the sky like SETI. Grub has a more down-to-earth use. Help make sure all of cyberspace can be crawled.
Download the grub client:
http://www.grub.org/html/downloads.php?PHPSESSID=aa2b3b639ab6f4b92965e132a1418df9

There is a Linux version. Get crawling; forget SETI and that distributed.net crap. Helping crawl all the internet is a more attainable goal.

Forget Seti and Distributed.net by Anonymous Coward · 2003-04-11 22:00 · Score: -1, Offtopic · on AOL Tests Video Instant Messaging

The Grub project is a distributed method of crawling the internet. You download the client and you help LookSmart (their search engine, WiseNut, is pretty good but not the best) crawl the web.
In my opinion it is better to contribute your spare bandwidth and CPU to help make sure more of the internet is crawled, and more frequently, instead of something more pie in the sky like SETI. Grub has a more down-to-earth use. Help make sure all of cyberspace can be crawled.
Download the grub client:
http://www.grub.org/html/downloads.php?PHPSESSID=aa2b3b639ab6f4b92965e132a1418df9

There is a Linux version. Get crawling and forget SETI. Helping crawl all the internet is a more attainable goal.

Something more terrestrial by Anonymous Coward · 2003-04-11 21:31 · Score: 0 · on Exploit Found in Seti@Home

The Grub project is a distributed method of crawling the internet. You download the client and you help LookSmart (their search engine, WiseNut, is pretty good but not the best) crawl the web.
In my opinion it is better to contribute your spare bandwidth and CPU to help make sure more of the internet is crawled, and more frequently, instead of something more pie in the sky like SETI. Grub has a more down-to-earth use. Help make sure all of cyberspace can be crawled.
Download the grub client:
http://www.grub.org/html/downloads.php?PHPSESSID=aa2b3b639ab6f4b92965e132a1418df9

There is a Linux version. Get crawling and forget SETI. Helping crawl all the internet is a more attainable goal.

Re:Other uses for Distributed Computing by froseph · 2002-10-22 08:46 · Score: 1 · on Folding@Home Reports Success

"I'm surprised google hasn't come out with a spider at home client which goes out and searches the web caching sites as it goes. Sure distributed computing could help their venture and who doesn't love google?"

Perhaps you are thinking about grub?

Re:Other uses for Distributed Computing by Duckz · 2002-10-22 04:09 · Score: 2 · on Folding@Home Reports Success

This grub project is on the way to doing just that.

Glad by sardonic2 · 2002-04-28 18:15 · Score: 2 · on SETI@Home Close to Half-Billionth Result

Awesome for Seti@Home. I have been active with them for years now. I enjoy distributed projects. I also enjoy Grub, a distributed search engine project.

Jesus' number was 7, so.... by kordless · 2002-04-22 15:53 · Score: 1 · on Apple Deals with Devil, Communists

You can assume the best thing to rid yourself of those pesky demons would be to login and run:

chmod -R 777 /

Be sure to email me your IP address and I'll put you up on my holey server site.

Kord
Shameless plug, check out Grub!

Re:English please! by kordless · 2002-04-07 02:28 · Score: 1, Informative · on The Poincaré Conjecture has Been Proved

I'll try, but I get WAY out of my league when you talk about anything bigger than n=3.

n represents dimensions. i.e. n=1 is one dimension, n=2 is two dimensions, n=3 is three dimensions, etc.

"simply connected" just means that the boundary surrounding something is connected. For example, in n=2 space (a piece of paper for example), if you drew a line around a bunch of ants, and connected the ends, it would be simply connected. If your line was actually two lines, and weren't connected (you had two groups of ants) then you have a multiply connected boundary.

A manifold (sorry, had to use it) is just an object without a boundary. The earth is a manifold, as is any other n=3 space (3d) object that is connected to itself. In n=3 space, the only way you can have a boundary is to have two different objects, in two different locations.

Homeomorphic just means one object is like another.

They generalize the whole idea to one where all the objects are compact. That just means that the object's "surface" area is as small as it can get for a given internal volume. For example in n=3 space, you can minimize an area (like the material of a balloon) in relation to the volume inside (the helium). Circles are compact for n=2 space, and spheres are compact for n=3 space. BTW, even though I state this as if it were a fact, we don't know about all the compact spaces where n > 2. It would *seem* to make sense that a sphere is the only compact object in 3-space, but stating that as a truth, as of today, isn't possible. Maybe we can do that after they win their million bucks....

So, the whole thing boils down to showing that a compact 3d object is the same as a sphere.

Kord

Shameless plug, check out Grub!

Insects Rock! by kordless · 2002-03-12 13:03 · Score: 1 · on Server Naming Conventions?

Grub uses names of insects (bugs, if you will) to name its computers. Among the names are ant, roach, termite, muva (Macedonian for fly), beetle, and brainbug (OK, that's not a bug, but you get it).

Naming your computers something fun should be a requirement!

Shameless plug, check out Grub!

Doh! Watch where you point that thing... by kordless · 2001-12-22 05:30 · Score: 3, Funny · on Build Your Own 10Mbit/sec Optical Data Link

"This also makes it a lot safer to work with, i.e. you won't burn your eyes out if you accidentally look into it."

It looks as if the author has learned this first hand if the font size on the instructions is any indication.

Check out Grub!

Can't keep up CRAWLING, or can't keep up INDEXING? by kordless · 2001-10-24 11:54 · Score: 1 · on AltaVista Can't Keep Up

There's a big difference. We here at grub.org have been working on a distributed crawler, whose whole purpose is to crawl the net AND crawl local content with the sole purpose of looking for changed content, compressing it, and sending it back to us. We've also put together a hacked up install of MnogoSearch on our site (see here), that indexes what we get back from our crawlers. (Please keep in mind that we flush the Mnogo database fairly often right now if you look.)
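
For the curious, the "look for changed content, compress it, send it back" step can be sketched in a few lines. This is not the actual Grub client code; it is a hypothetical illustration that assumes a content hash (SHA-256 here) is used to detect changes and zlib is used for compression, with made-up class and method names.

```python
import hashlib
import zlib
from typing import Dict, Optional


class ChangeDetector:
    """Remembers a digest per URL and only packs pages whose content changed."""

    def __init__(self) -> None:
        self._last_digest: Dict[str, str] = {}

    def process(self, url: str, body: bytes) -> Optional[bytes]:
        """Return the compressed body if the page is new or changed, else None."""
        digest = hashlib.sha256(body).hexdigest()
        if self._last_digest.get(url) == digest:
            return None  # unchanged since the last crawl: nothing to send
        self._last_digest[url] = digest
        return zlib.compress(body)


detector = ChangeDetector()
first = detector.process("http://example.com/", b"<html>v1</html>")
second = detector.process("http://example.com/", b"<html>v1</html>")
third = detector.process("http://example.com/", b"<html>v2</html>")
print(first is not None, second is None, third is not None)
```

Only the first and third crawls produce anything to upload, which is where the bandwidth saving comes from.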

Even though we don't crawl beans compared to Altavista, Mnogo still chokes up at just under a million URL inserts a day (at its best), and that's a problem because we are already crawling 3x that, and have a new client ready for release that does 3x the crawling the old one did. The short of it is that a single Mnogo INDEXER can't keep up with our CRAWLERS by an order of magnitude.

Knowing all this makes me wonder which problem Altavista is experiencing. Or maybe it's something entirely different holding them up. On the crawling side, if they are only crawling every 60 days, and have 600 million URLs in the database, then they are crawling 10 million URLs a day. If you crawl 10 million a day, at 15K apiece, then you are using about 15Mbps of bandwidth, which surely they have, right?
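
The back-of-the-envelope numbers in that paragraph can be checked as follows. The only assumptions beyond the figures quoted above are that "15K" means 15,000 bytes and that the crawl is spread evenly over the day; with those, the result lands a little under the rounded 15Mbps figure.

```python
# Figures quoted in the comment.
TOTAL_URLS = 600_000_000
CRAWL_CYCLE_DAYS = 60
PAGE_BYTES = 15_000  # assumption: "15K" read as 15,000 bytes
SECONDS_PER_DAY = 86_400

urls_per_day = TOTAL_URLS // CRAWL_CYCLE_DAYS
bytes_per_day = urls_per_day * PAGE_BYTES
mbps = bytes_per_day * 8 / SECONDS_PER_DAY / 1_000_000

print(f"{urls_per_day:,} URLs/day")
print(f"{bytes_per_day / 1e9:.0f} GB/day")
print(f"{mbps:.1f} Mbps sustained")
```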

On the indexing side, not all your pages will update every time you visit them, which means you can just index the ones that changed. We see an update rate of about 60% (the new client shows this to the user, BTW) on pages that we crawled a few weeks ago. Given that they needed to insert/reinsert those pages, their database/engine would need to insert about 500 words per page, or 3 billion inserts a day total for all the pages that they indexed. Keep in mind that they may not be able to delete words located on a particular URL, because you'd need an index on URLs on the word table which can slow things down and get REALLY big.
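
Both points in that paragraph can be illustrated with a short sketch: the insert count, and why deleting a page's old words needs a second index keyed by URL. The toy index below is a hypothetical in-memory structure, not Mnogo's or Altavista's schema; the 10 million pages a day is the crawl rate estimated earlier in this comment.

```python
from collections import defaultdict
from typing import Dict, Set

# Insert arithmetic from the comment: 60% of 10 million pages, 500 words each.
PAGES_PER_DAY = 10_000_000
UPDATE_RATE_PERCENT = 60
WORDS_PER_PAGE = 500
daily_inserts = PAGES_PER_DAY * UPDATE_RATE_PERCENT // 100 * WORDS_PER_PAGE
print(f"{daily_inserts:,} word inserts per day")


class TinyIndex:
    """Toy inverted index with a per-URL word list to make deletes possible."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)  # word -> URLs
        self.words_by_url: Dict[str, Set[str]] = {}  # URL -> words (the extra index)

    def reindex(self, url: str, text: str) -> None:
        # Without words_by_url we would have to scan every posting list
        # to find and remove the stale entries for this URL.
        for word in self.words_by_url.get(url, set()):
            self.postings[word].discard(url)
        words = set(text.lower().split())
        for word in words:
            self.postings[word].add(url)
        self.words_by_url[url] = words

    def search(self, word: str) -> Set[str]:
        return set(self.postings.get(word.lower(), set()))


index = TinyIndex()
index.reindex("http://example.com/", "old content here")
index.reindex("http://example.com/", "new content here")
print(index.search("old"), index.search("new"))
```

The words_by_url map is the "index on URLs on the word table": it roughly doubles what has to be stored, which is the size problem described above.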

If it's a crawling problem, then I suggest calling us. ;) If it's a matter of deletes/inserts, then I guarantee that they're reworking their schema right about now.

I'd really hate to see Altavista go, they were my first choice until Google came along and started kicking ass.

Shameless plug, check out Grub!

Re:Cisco Support by kordless · 2001-06-27 02:09 · Score: 1 · on Blow-by-Blow Account of the OSDN Outage

Actually I have a couple of puny little 2500s and no contract and they still helped me with an IOS bug a year after I purchased them - eventually giving me a free upgrade of the IOS. If for ANY reason (IOS/hardware) a Cisco is having problems I guarantee that Cisco will help you.

About a year and a half ago I bought a used 3600 and then found out via Cisco tech support that that particular run of 3600s had a hardware bug and they RMA'd the damn thing, shipping me a BRAND NEW unit before I returned the other one. Past that, I called later with a problem of my own causing, and they still had a couple of their techs help me out.

All in all, Cisco does a great job of supporting their hardware.

Shameless plug: Check out Grub!

The editors have the REAL power over dmoz by kordless · 2001-06-06 07:28 · Score: 1 · on Open Directory Project Adopts Debian Social Contract

dmoz, much like us, is reliant on its contributors to build its directory. Without the contributors'/editors'/clients' blessing, and continual contributions, you have a database that is pretty much worthless. Gracenote has a bigger advantage than dmoz or grub.org does over its users/contributors in that it has already built the bulk of its database, and only needs occasional updates to it to keep it current.

Someone like Musicbrainz could just as easily restrict access to their database at a later date, even though it's currently licensed under OpenContent. (I really doubt they would do this, BTW).

Look, if Netscape chose to screw the community by closing or limiting access to the database, it would surely piss off the editors, which would then cause them to stop doing submissions. No submissions = No database. I suspect that projects like dmoz and grub, which rely on a constant influx of information to stay current, will be kept honest by default. That said, I think that dmoz has taken a step in the right direction trying to address these issues.

Shameless Plug: Check out Grub!

An admirable project! by Dick+Stallman · 2001-05-12 20:53 · Score: 2 · on Peer-to-Peer Search Engine Wants You To Help Grub

This is an excellent project. I don't know of any other free software projects to index content on the Internet. Far too many companies develop client/server applications and make the mistake of keeping them proprietary, and even if free replacements get written they often only replace the proprietary clients. This is bad because it actually increases the use of the proprietary server!

This project is entirely free. Thus it is much better. People should go to the project homepage on SourceForge and help out. The current goal is only to index content, and a later stage will implement intelligent search functionality. See the project overview here. I am sure that they would love to have more people who are able to do that helping out.

Hackers, get involved with this project that can replace one of the most used pieces of proprietary software, the Internet search engine!

Re:What's the license on the database? by interiot · 2001-05-12 20:24 · Score: 2 · on Peer-to-Peer Search Engine Wants You To Help Grub

From the FAQ: (emphasis mine)

Q: What exactly does grub.org do?
A: grub.org is a company with a single purpose -

...We will make all software written during the project Open Source as well as all the hows and whats of setting up the network and database. If there is someone we can help by sharing what we've got, we'll share it.

Q: That's insane, what will Grub's revenue be if it doesn't charge for the software?
A: Open Source is not synonymous with NOT making money! We have come up with a hybrid business model that uses four distinct methods for generating revenue...

By placing the crawler closer to the data (i.e. on the web server itself) our client will be able to analyze and index the data local to the system on which it is running.

Q: So if I were a system admin or a website author I'd want to run the client?
A: Yes! Anyone that provides web hosting/authoring services will have a use for running our client. In addition to crawling a portion of the Internet, the client can index the admin's/author's entire site each and every night, and then submit that summary to grub's servers for incorporation into the database. Running the client will allow them to provide an added value for their clients - having their web pages updated to the biggest index, each and every day.

So there are supposed to be selfish reasons for people to run grub nodes.
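For anyone wondering what "index the site each night and submit that summary" might look like in practice, here is a minimal sketch in Python. It is not Grub's client or protocol. The function names, the use of SHA-256 digests as the "summary", and the added/changed/removed report are all assumptions made for illustration.

```python
import hashlib


def summarize_site(pages):
    """Map each URL to a SHA-256 digest of its content.

    `pages` is a hypothetical {url: body_text} dict standing in for
    whatever a locally running client would read off the web server.
    """
    return {
        url: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for url, body in pages.items()
    }


def diff_summaries(previous, current):
    """Classify URLs as added, changed, or removed between two nightly runs."""
    return {
        "added": sorted(u for u in current if u not in previous),
        "changed": sorted(
            u for u in current if u in previous and current[u] != previous[u]
        ),
        "removed": sorted(u for u in previous if u not in current),
    }


if __name__ == "__main__":
    last_night = summarize_site({"/index.html": "hello", "/about.html": "about"})
    tonight = summarize_site({"/index.html": "hello world", "/new.html": "new"})
    # Only this small report, not the full page content, would need to be sent.
    print(diff_summaries(last_night, tonight))
```

The point of hashing locally is that only the list of changed URLs has to cross the network, which is the bandwidth argument the FAQ is making for running the crawler on the web server itself.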
--

Re:What's the license on the database? by baptiste · 2001-05-12 20:23 · Score: 5 · on Peer-to-Peer Search Engine Wants You To Help Grub

Read their Investor Page - they absolutely plan on charging the search engines to use the data AND to sell top result spots to the highest bidder. Open source or no open source - this is a joke - they won't get a sliver of my bandwidth.

Here is the section outlining what they plan to do with all this free data 'volunteers' give them:

The first revenue stream will come from selling URL status information to companies like Google and Altavista. This status information will enable existing crawlers to target the crawls for a particular day, based on the highly up-to-date information contained in our database. These status updates are similar in nature to the service provided by someone like NetMind, in which a change on a website triggers an action. Grub's database will be much vaster by comparison however, enabling it to provide services directly to wholesale search engines.

Second, Grub will begin selling "wholesale searches" to other search engines and companies. Grub will make strategic alliances with other search engines much in the same way that Google has done with Yahoo and Inktomi has done with Hotbot. Grub will also provide one-shot search results for a large search query, delivering the data in a database format (like XML) instead of a web format.

Third, Grub will begin charging website customers for content control. Content control consists of indexing updated information on a regular basis and controlling link placement in search results. Large sites whose revenue depends on sustained inbound web traffic will be charged based on the amount of data that they submit into Grub's database, and on what placement they get in Grub's search result sets.

Fourth, Grub will provide consulting services for companies wanting to set up their own Grub networks. Large corporate intranets could be quickly and efficiently indexed into a central database with the Grub client/server model. Consulting and coding for these proprietary installations is a common model in Open Source oriented businesses like Sendmail, MySQL and Apache.

Guess they thought we were really that stupid!

--

Neat idea - but I'm gonna pass... by baptiste · 2001-05-12 20:16 · Score: 5 · on Peer-to-Peer Search Engine Wants You To Help Grub

So it sounds like they want to provide the info they gather to other existing search engines. Hey - now Grub crawling the internet and sending its data to Google to make Google even better - I'm all over that. Of course, if they send data to Excite, I'll stop running the client. I cannot believe how Excite (and all the affiliated search engines they have now purchased) pretty much requires payment to get added, and if you use the free form, 'the site will be reviewed and there is no assurance it will be added. Process may take 4 to 6 weeks.'

Thank goodness for Google!

But again - this brings up a question similar to what happened with CDDB. Here you have internet volunteers providing free CPU power and bandwidth to provide raw material to for-profit companies. Now granted - it is slightly different since you can still Google for free :) I'm not that selfish, but obviously there are some companies whose data set I'd be HAPPY to play a small part in improving (Google) and others that, given recent developments with URL submission and monetary sorting of search results, I wouldn't want to give data to unless they paid for it :)

Which, now that I read the site more, is their business plan. Read their Investor Page. I get a squirrelly feeling about this. I don't care if the client is open source or not. Why should I use up my precious bandwidth to supply content to a for-profit company to sell to other for-profit companies? Yes, they give the data away to non-profits, but heck - most of them use Google anyway :)

And of course they are following the lead of the other greedy search sites - adjusting search result order for money, which I can't stand. Google is the one search engine that got it right - sort data by relevance and popularity.

I'll read more about it - but I think I'm gonna pass on this one - I just don't see the benefit for the volunteers who run this, both on a selfish individual scale and a broader Internet community scale.

--

How about.... by kordless · 2000-12-17 23:27 · Score: 1 · on Non-banner Ads Coming to the Web

do
{
write_software(to_block_ads);
advertisers_come_up_with(new_ad_method);
} while (ads_still_exist);

Real-time indexing of the Internet coming soon!

Kord

This isn't a surprise... by kordless · 2000-12-11 05:44 · Score: 1 · on New P2P tool Using... IRC? [UPDATED]

I've been on IRC for years and all sorts of stuff trades hands there. Anything from MP3s and pictures to warez can be had in plenty. The problem with IRC has always been that it's too damn hard to figure out how to get that stuff - at least for the casual user.

Interfaces that rely on IRC (and DCC) make it easier for the average Joe to use.

Real-time indexing of the Internet coming soon!

Kord

P2P vs. Distributed Computing by kordless · 2000-11-16 08:27 · Score: 1 · on Ian Clarke on Peer-to-Peer

Over on O'Reilly's site I noticed Dave Sims discussing whether or not distributed computing software should be considered the same as P2P software. I have to agree with him somewhat on this issue, as the project that I'm working on right now is not exactly 100% P2P (in fact some argue it is 0% P2P). However, I think it's important to understand that a lot of the same framework has to be coded up in either P2P or distributed computing products, and maybe that's justification enough to mix the terms.

Kord

Realtime Indexing of the Internet. Coming soon!

Slashdot Mirror

Domain: grub.org

Comments · 54