Slashdot Mirror


Building a Bigger Search Engine

skreuzer writes "Wired is running a story about a distributed web crawler called Grub. People who choose to download and run the client will assist in building the Web's largest, most accurate database of URLs. This database will be used to improve existing search engines' results by increasing the frequency at which sites are crawled and indexed. Conceivably, Grub's distributed network could enable state information to be gathered on every document on the Internet, each and every day."

29 of 278 comments

  1. Biiig questions to answer by andy@petdance.com · · Score: 5, Interesting
    So Grub goes out, uses bandwidth, and then returns some results to the home base. It's really distributed bandwidth more than distributed computation.

    I bet one of the big successes in Folding and distributed.net is that many people run the clients on work boxes, knowing that there's little actual overhead incurred to their work. How different that is for a URL sucker.

    I wonder what broadband ISPs think of Grub.

    1. Re:Biiig questions to answer by friedegg · · Score: 4, Interesting

      I wonder what broadband ISPs think of Grub.

      If it becomes a problem, I imagine ISPs will declare it a commercial bandwidth usage, and order users to stop or move to a business class plan for more money.

      --
      Google doesn't index user sigs, so stop trying to "Google Bomb" with them.
  2. great news! API? by The-Perl-CD-Bookshel · · Score: 2, Interesting

    This is going to challenge Google's search, which will entice them to cut loose some of those really cool google labs concepts. Froogle, Google News, and all of the other cool things that they are working on are great services and are going to be the focus of innovation over at Google.

    Also, Looksmart needs to develop and release an API for this system. You can only use the Google API for 2,000 searches per day. If they allowed unlimited usage, it would get a lot of developer backing.

    --
    I don't keep a lid on my coffee so when I walk around I look busy -me
  3. Google Toolbar by petree · · Score: 5, Interesting

    Couldn't Google do this anyway with the Google Toolbar? With the advanced-features version, it tracks every page you visit. If they offered some incentive to install the toolbar, Google could just beat them at this game. I already use the Google Toolbar by choice every day (it makes my web searching more productive). All they have to do is get lots of people using it; wouldn't that work just as well or better?

    1. Re:Google Toolbar by Anonymous Coward · · Score: 1, Interesting

      Google Toolbar does have a distributed computing option now (you have to turn it on). I think they're using it for SETI or folding or one of those worthwhile causes. I always assumed the incentive to use the toolbar was the functionality it provides.

    2. Re:Google Toolbar by Kelerain · · Score: 5, Interesting

      This tracking is actually how a lot of important information leaks out. Security through obscurity has always been a poor man's system, and this busts it wide open. I won't post them here, but there are several interesting searches you can do that give personal results for things that REALLY have NO place on a publicly accessible page. On a more positive note, Google already uses distributed computing through their Google Toolbar: http://toolbar.google.com/dc/offerdc.html However, they donate the cycles to various worthy causes like Folding@Home (currently their only beneficiary), but it is conceivable that if they came up with some secure and useful search-related thing to do with the cycles, they could put it to use almost instantaneously. I don't think there are significant benefits (plenty of discussion elsewhere here) for them to want to use it, however.

  4. Hardly distributed crawling by Herbst · · Score: 2, Interesting

    ...rather a crawl with a distributed component.

    They use the screensaver Grub clients to check whether a web page has been modified since the last time it was crawled (by the centralized crawl done by Looksmart). They probably compute some smart MD5 checksum of each page and send it to the clients along with the URLs to be crawled. If the checksum of what the Grub client crawled doesn't match, then the centralized crawl is instructed to re-fetch that URL.

    They go this route because the HTTP If-Modified-Since request header is not supported by many web servers (and even where it is, you can't really trust it). This is especially true for dynamically generated web pages. That is, if If-Modified-Since worked reliably, it would be a simple operation to check whether a previously crawled page has changed. Since that's not the case, they are outsourcing the expensive refetching of whole pages.
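    A minimal sketch of the checksum comparison guessed at above. This is only an illustration of the idea, not Grub's actual protocol: MD5 is the commenter's guess, and the function names `page_digest` and `needs_refetch` are hypothetical.

```python
import hashlib


def page_digest(body: bytes) -> str:
    """Return the hex MD5 digest of a raw page body."""
    return hashlib.md5(body).hexdigest()


def needs_refetch(stored_digest: str, fetched_body: bytes) -> bool:
    """True when the freshly fetched body no longer matches the stored digest."""
    return page_digest(fetched_body) != stored_digest


if __name__ == "__main__":
    # Digest recorded by the (assumed) central crawl.
    old = page_digest(b"<html>version 1</html>")
    # A client re-fetches the page and compares.
    print(needs_refetch(old, b"<html>version 1</html>"))
    print(needs_refetch(old, b"<html>version 2</html>"))
```

    Note that a plain digest of the whole body flags any byte-level change, so a page that embeds a timestamp or rotating ad would look modified on every visit. Presumably that is why the checksum would need to be "smart".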

    It will be interesting to see how this pans out. I think they could run into trouble with ISPs if this really takes off, because bandwidth consumption per user would increase and make flat-rate deals less profitable for some ISPs.

  5. The Distributed Search Engine by deadfishhotmail.com · · Score: 2, Interesting

    It's kind of funny and a bit ironic that search engines are generally used to search information from a central repository, while Grub uses a distributed network to index pages. It's almost like having a distributed Google cache (that's updated more frequently). Perhaps a better idea would be to invent a crawling daemon that runs on each server, with a standard protocol that reports the relevance of search terms to a central server (hey, it's DNS for search terms!!). Too bad it would be heavily abused (mostly by Buy Now, Free Money and Pron avenues, I suppose).

    Ok now tell me that it's already been done, 'cause I'm pretty sure it has (and probably by Microsoft for ad money).

    Well it's an idea that might be more efficient and updatable than Grub anyway.

    --
    Who is this "Poster" guy and why does he own all of my comments?!?
  6. Re:Firewalls? by GigsVT · · Score: 2, Interesting

    If you knowingly run a program that openly spies on every page you go to, you get what you deserve.

    --
    I've had enough abrasive sigs. Kittens are cute and fuzzy.
  7. Re:Not news for us webmasters by Redwing · · Score: 5, Interesting

    Here is what slashdotters were saying about grub almost 2 years ago.

    --
    Raisinettes are my raison d'etre
  8. Indexor or Search Engine? by digitect · · Score: 4, Interesting

    I expected some way to search... this looks more like a project to index the web rather than make the results available for public use via web interface. Did it strike anyone else odd that there was no web form on the home page with which to search?!

    It seems like a good concept, but the information collected needs to be accessible without installing the client. I'm not game to install distributed computing apps without some freely available benefit. The "for the good of the world" motivation went out the window for me about a day after my first SETI@Home experience. (But now BitTorrent, there was appreciable benefit. I had Red Hat 9 ISOs within 8 hours of their initial release!)

    --
    There is no need to use a SlashDot sig for SEO...
  9. Re:search.msn.com is the future by shibbydude · · Score: 5, Interesting
    In particular, the company has its own team of editors that monitors the most popular searches being performed and then hand-picks sites that are believed to be the most relevant.

    You have to be kidding or working for Microsoft, or both! Have you ever searched for Linux on MSN? Try it - here.

    Notice the third result? "Learn about the Microsoft alternatives and how to move to them from open source products." I shit you not! I don't think Google would ever use this kind of dirty, underhanded trick. Great "hand-picking", mate.

    --
    We're only gonna die from our own arrogance, that's why we might as well take our time...
  10. Looksmart by Ark42 · · Score: 3, Interesting

    Isn't Looksmart/Sprinks a big pay-per-listing deal? The looksmart logo in the upper right corner was enough to make me just close that page right away without any second thought.

  11. Flood Control by SmartGamer · · Score: 2, Interesting

    According to the Grub FAQ, it respects robots.txt although not the META tags. Although it takes a week or two for it to listen to the robots.txt, it does eventually...
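    For anyone wondering what respecting robots.txt looks like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The user-agent name `grub-client` and the rules are made up for illustration; I have not checked what string the real client sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and user-agent name, for illustration only.
ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    print(may_crawl(rules, "grub-client", "http://example.com/index.html"))
    print(may_crawl(rules, "grub-client", "http://example.com/private/page.html"))
```

    A crawler that only re-reads robots.txt every week or two, as the FAQ apparently describes, would keep applying stale rules until its cached copy is refreshed.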

    The sheer volume of this project concerns me, however. The very fact that it got Slashdotted may cause it to be a bit heavier than expected!

    It sounds like a good use of spare bandwidth, but if it's going to wind up a superscanner, it's going to send a hell of a lot of requests.

    I tried it and deleted it just as quickly. It's not very good at being a bottom feeder: it redlined my system resources immediately and slowed everything down. Duration between installation and uninstallation: twenty-nine seconds.

    --
    Warning: Poster of this comment is a nerd. Just like everybody else here.
  12. Re:What about the RIAA? by SmartGamer · · Score: 2, Interesting

    Here's the catch: it's going for scare tactics.

    The Church of Scientology has already threatened Google and gotten results moved; I can, in all honesty, see the RIAA going for it.

    It would be an earthshattering case, but here's the thing: the RIAA stands a disturbingly good chance of winning.

    I hope, I pray they don't win were they to try it, and try they most certainly will, because they think they can get money out of the lawsuit and they want money. That's very likely a major motive.

    Oh, and to mods-for-a-day: mod the parent of this post up. It's thoroughly underrated at zero.

    --
    Warning: Poster of this comment is a nerd. Just like everybody else here.
  13. What _is_ a good project? by bcrowell · · Score: 3, Interesting
    I have a FreeBSD server that wastes the vast majority of its CPU cycles (and most of its bandwidth, too). So what is a good distributed computing project to donate those cycles to? I'd like to find something that
    1. makes me feel warm and fuzzy about my altruism
    2. can run in the background on a Unix box
    3. is open-source (so I don't have to run someone's closed-source app on my box and trust their security through obscurity)
    Well, #1 rules out Grub, #2 rules out Folding@Home, and #3 rules out both SETI@Home and Folding@Home.

    So what worthy causes are out there?

    1. Re:What _is_ a good project? by metlin · · Score: 2, Interesting


      How about helping with some cool math prime search?

      ars Team Prime Rib - cool prime searching stuff.

      A mix of misc science stuff.

      dc projects - some Opensource, some not.

      And all projects at distributed.net come with source too.

  14. DDoS by karlm · · Score: 3, Interesting
    So the idea is to DDoS the entire web? :-)

    If this thing gets too popular without proper throttling, they could cause real havoc.
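    One possible shape for "proper throttling" is a per-host politeness delay. The sketch below is an assumption about how that could be done, not a description of what Grub does; the class name `HostThrottle` and the delay value are hypothetical.

```python
from typing import Dict


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay: float) -> None:
        self.min_delay = min_delay
        # Earliest time (in seconds) at which each host may be hit again.
        self._next_allowed: Dict[str, float] = {}

    def wait_time(self, host: str, now: float) -> float:
        """Seconds the caller should still wait before fetching from host."""
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, host: str, now: float) -> None:
        """Note that host was fetched at time now."""
        self._next_allowed[host] = now + self.min_delay


if __name__ == "__main__":
    throttle = HostThrottle(min_delay=10.0)
    throttle.record_fetch("example.com", now=100.0)
    print(throttle.wait_time("example.com", now=104.0))
    print(throttle.wait_time("example.org", now=104.0))
```

    With many independent clients, a limit like this only helps if it is coordinated centrally. Each client throttling itself does nothing if thousands of them are handed the same host.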

    --
    Copyright Violation:"theft, piracy"::Anti-Trust Violation:"thermonuclear price terrorism"<-Overly dramatic language.
  15. Legalities? by cheshiremackat · · Score: 4, Interesting

    Alright, I have 3 major problems with this...

    1) How different is this from the Princeton kiddie's system? I don't know about you, but I don't want a 95 billion dollar bill arriving in the mail...

    2) What if your local (cache?) contains a few links to kiddie porn? Not your fault, right? The software does its own thing, which you cannot control, BUT what will the FBI think? The FBI, Scotland Yard, and the RCMP are currently heavily investigating kiddie porn cases (good work IMHO), but what if you're the unlucky sap who gets stuck with a few sketchy URLs? Or worse yet, what if this Grub keeps a cache of the website like Google does? Then what?

    3) What about material that is legal locally, but illegal somewhere else... e.g. Nazi stuff in Germany, Falun Gong in China, etc... The last thing I want is to be refused a travel visa because my PC has an illegal cache...

    Good idea in principle, but with sketchy content on the web, I don't think I will be the one keeping track of it all. If there is a way to filter out the questionable stuff then maybe, but since the purpose is to be as inclusive as possible, it seems incompatible.

    _CMK

    --
    Bad spellers of the world untie!
    1. Re:Legalities? by SmartGamer · · Score: 2, Interesting

      It does, however, download a buffer of URLs to scan. If your buffer is less than clean when your computer gets searched, oops, you're in trouble...

      Not to mention the fact that it still goes and hits all those sites, and with the government trying to smash that little thing we call "privacy," anything questionable will likely go on your permanent record- the one that doesn't exist, but they somehow have anyway.

      --
      Warning: Poster of this comment is a nerd. Just like everybody else here.
    2. Re:Legalities? by amoe · · Score: 2, Interesting
      text is still illegal...

      Text child pornography is illegal? How does that work? I thought the rationale for video child porn being illegal was that an illegal act had been committed in its creation - how do they justify making something illegal that is purely the product of an author's imagination?

      Disclaimer: I have never read a child porn story, but I have seen them around the seedier places on the net.

      --
      You look beautiful! Incidentally, my favourite artist is Picasso.
  16. Re:search.msn.com is the future by lamber45 · · Score: 2, Interesting
    I followed one of these links and looked at the MSDN article. It's full of generalizations taken from 20-year-old UNIX textbooks, although Linux and X windows are mentioned here and there. Apparently recent versions of some level of Windows have an "Interix" subsystem. I've used Cygwin32 on Win95, WinME, Win2k and WinNT, and Borland C++, and Visual C++ .NET, but I don't think I've ever used the Microsoft native POSIX layer. The article gives a lot of questions that should be asked before starting a migration like this. One possible reason to migrate is to decrease the Total Cost of Ownership; another is to increase hardware options and move away from proprietary systems!

    Another quote I like is, "Windows operating systems do not provide X Windows. For X Windows connectivity, developers need a third-party X Windows server." Of course Microsoft would never be anticompetitive by competing with third-party suppliers of implementations of an open standard, right?

  17. Re:Will Grub take off or be smashed? by dtfinch · · Score: 5, Interesting

    There are many ways to look at this. The idea is to install the client, set Opera to use the same useragent string, visit some of those sites, then blame it on Grub if the FBI comes busting through your door.

    If you're a criminal, installing the Grub client might be a great idea.

  18. Re:Great idea, but will it pan out? by Nickilo · · Score: 5, Interesting

    "The General's Dilemma" would solve this problem. The story goes something like this: The general needs to get urgent information to one of his officers; however, he suspects saboteurs are present among his messengers. In order to ensure the information gets through accurately, he sends the same message with several men. The officer on the other end collects all the messages and goes with the majority. (And, presumably, kills the others.)
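    Applied to crawling, the same URL would be handed to several clients and the server would keep whichever result most of them agree on. A minimal sketch of that majority rule, assuming each client reports a digest string for the page; the function name `majority_report` is made up.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_report(reports: Sequence[str]) -> Optional[str]:
    """Return the value reported by a strict majority of clients, else None."""
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    # Require more than half of all reports to agree.
    return value if count * 2 > len(reports) else None


if __name__ == "__main__":
    # Two honest clients and one saboteur.
    print(majority_report(["abc123", "abc123", "forged"]))
    # No strict majority: the server should not trust any of them.
    print(majority_report(["abc123", "forged"]))
```

    This only works while saboteurs are a minority of the clients assigned to a given URL, and it multiplies the bandwidth spent per page by the number of messengers.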

  19. The approach is inherently flawed by oren · · Score: 3, Interesting

    It is too easy to send corrupted information into the database. They have *no choice* but to trust the clients. Sure, they could run spot checks on the results, but those would cover only a small part of the data, and it would be easy enough to fake responses for those as well.

    So the more popular it gets, the more incentive people will have to promote their sites by feeding it fake index information. If this magically got to be very popular, within weeks search results would become meaningless and it would drop back into obscurity. The more likely result would be that it will never become popular in the first place.

    Besides, who wants to donate his CPU and bandwidth resources for a commercial company, anyway?

  20. Re:Will Grub take off or be smashed? by Jugalator · · Score: 2, Interesting

    There is not even any potential reward such as with distributed.net.

    How about improving existing search engines with more accurate databases? Commercial organizations like Google might be involved and that's another matter. There might still be a reward to the public.

    --
    Beware: In C++, your friends can see your privates!
  21. Re:They realize they aren't the REAL GRUB by Saeger · · Score: 2, Interesting
    Oh please! There are 6+ billion people on the planet now, and not enough unique namespace for everyone or every business to have that one 'cool' short name, so why don't they do what we humans have done? GET A LAST NAME.

    Grub The SearchEngine
    Grub The Bootloader
    FireBird von Browser
    FireBird von Database
    Gentoo el Distro
    Gentoo el FileManager
    Apple Computer
    Apple Records

    I'm serious. Nobody should feel entitled to an exclusive piece of namespace just because they think they had it first or are bigger & badder and more deserving than some newbie treading on their turf. (trademark `this!')

    --
    Power to the Peaceful
  22. Distributed Crawling From Browsers by txtger · · Score: 2, Interesting

    It would be interesting to see a database that is connected to browsers, so that whenever I looked at a page, the page data would be processed and sent to whatever search engine. Then, those sites that are updated frequently and get a lot of traffic would be more easily searched.

    Just a thought.

  23. hair is raising on the back of my neck by malia8888 · · Score: 2, Interesting

    Uh huh, Grub is going to "run in the background" ?
    No thanks!! It just doesn't feel right. It is sort of like lending a firearm to an untrustworthy neighbor. What is in it for the lender other than potential problems?

    Spyware "runs in the background" and slows down people's machines. What really happens to one's machine performance with Grub? And, more importantly, where is my check?

    --
    Harpo Tunnel Syndrome--my wrist feels funny.