Slashdot Mirror


Indexing the Entire Web?

cah1 writes "BBC is carrying a story about another new search engine, All The Web. The designers are planning to have the whole shooting match, all billion pages, indexed by the end of the year." You can also read press from the company. I'm skeptical: they claim to be able to catch up within the first year, and keep up thereafter. But they claim to have 200 million already, so who knows?

98 comments

  1. Very Strange... by Neuroprophet · · Score: 1

    That was weird, I clicked on the reply to the H-1B article. I then waited half an hour, entered my message and submitted it, and it was attached to an article that didn't even exist when I first hit the reply button. Strange...

  2. Re:ad infinitum et ad nausium by Tom+Christiansen · · Score: 2
    Assuming it will be a rather large amount of data, who will index their index? (and who will index that index... and that one... and that one......)
    It seems to me that this is not a new problem. Juvenal said, `Sed quis custodiet ipsos custodes?' :-)
  3. Re: Distributed Spidering by Christopher+Whitt · · Score: 1
    The idea is to have each client do the work it's best suited for, and to distribute the load more evenly. Bandwidth could be a problem, but I think a lot of the data could be "tokenized" somewhat once references have been established, and some compression would probably help.

    This is probably obvious, but it's not only computational load that would be more evenly distributed. With some knowledge of the preferred routes of various levels of the net hierarchy, the traffic of the spidering could be more contained to small areas of the network.

    Far flung links could be handled at higher levels and passed down to other spidering nodes closer to the link target (from a routing perspective). This would mean a little more computation overhead somewhere but I imagine it wouldn't be too bad. The benefits of distributed spidering seem to me quite attractive...

    On the other hand, if it really was that feasible, wouldn't one of the Big Boys take it up, or is it too much hassle to develop a business model for a search engine based on volunteer spiders?

    Christopher

  4. Search engine span, accuracy by belphegore · · Score: 1

    Two points:
    (a) Spanning more pages is only half the story. You need to combine huge page indexes with a lookup scheme like google's where the chaff is separated from the wheat. Otherwise you'll just be drowning in 5 times as many useless hits, and you'll need a search engine to search through the 100,000+ hits returned for your query to find what you're actually interested in.
    (b) Does anyone have statistics for what %age of the web is excluded in /robots.txt?

  5. Re:yeah, it's fast. but it's pretty weak by frodelu · · Score: 1

    If you used it a couple of weeks back, you used the old index. A demo version running an index of about 70 million pages (I believe) has been running for some months. The announcement yesterday is about the new index that claims to be the world's largest.

    --
    -- Best regards, Frode Lundgren
  6. this search engine... by DGregory · · Score: 1

    It's ok. I would say "mediocre". The only reason that it doesn't get a "bad" score is that when I type my name in, it brings up some pages from my website right at the very top. I don't know what kind of search it uses, but if I type in "Sun Microsystems" or "Dana Corporation" .. I kind of expect to get the company's web site right at the top. But mostly what I see are news articles with the companies' names within them. Also, when I type my name in, I get the really obscure pages on my website.

    But, if I do the same thing on Yahoo/Infoseek/Lycos/Altavista, I either get nothing that pertains to me with my name or another different, obscure page on the website. When I type in the companies' names I may or may not get the company's website at the top.

    What is more important to me than how many pages it returns is how many RELEVANT pages it returns. And yes, it is supposed to read my mind to some extent and know what I want. :)

  7. The Ideal Search Engine? Not quite, but good. by Gallowglass · · Score: 1

    The BBC story was pretty good, but in another story on this new search engine (http://www.latimes.com/HOME/BUSINESS/t000068951.html), one commentator said, "What does it mean to have another 100,000 or 200,000 links show up in a search? . . . The only thing that matters is the top 10 links you get back . . ."

    I think he misses the point. IMHO the ideal search engine (1) covers all of the Web (Yes, I *know* it's impossible! This is an *ideal*.) and (2) allows me to construct a proper Boolean search argument.

    Boolean is very important to me. It allows me to pare those results down from 1,276,349 to 280. When I pare down the number of results then the top hits are far more likely to be relevant. So far as I know (and correct me please if I'm wrong) the only search engine that allows the proper construction of Boolean arguments (AND, OR, parentheses and NEAR) is Alta Vista. Other engines such as HotBot and Google allow some ability to refine the argument, but not enough for my taste. This new engine still doesn't satisfy that desire either.

    However, it does give some tools (phrases, + and - but no parentheses or OR) so having a bigger database is a Good Thing.
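For what it's worth, the kind of paring-down described above can be sketched over a toy inverted index. This is only an illustration: the corpus, the `build_index` and `term` helpers, and the use of plain set operations for AND / OR / NOT are all made up here, and nothing in it is meant to show how Alta Vista or any real engine works (NEAR, in particular, would need word positions and is left out).

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical toy corpus, invented for this example.
DOCS = {
    1: "strawberry etymology and origin",
    2: "strawberry jam recipe",
    3: "origin of the word etymology",
    4: "bridge design manual",
}

INDEX = build_index(DOCS)

def term(word):
    """Documents containing a single term (empty set if the word is unseen)."""
    return INDEX.get(word.lower(), set())

# (strawberry AND etymology) OR bridge, excluding anything with "recipe":
# & is AND, | is OR, - is AND NOT, and parentheses group as usual.
hits = ((term("strawberry") & term("etymology")) | term("bridge")) - term("recipe")
print(sorted(hits))
```

The parentheses are the point: without grouping, a query mixing AND and OR cannot say which part the OR applies to.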

    I found it snapped back the results pretty quickly too.

    By the way, something that isn't discussed very often, but is pretty relevant in evaluating the effectiveness of search engines, is latency. At the WWW8 conference in Toronto, I heard a paper that made the observation that search engines have a bad tendency to "forget" URLs. I.e. the same argument given over time will sometimes not discover a site that an earlier search found. On occasion, a later search will then "rediscover" the page. (Sorry I don't have the reference to hand. I've really got to do some housekeeping. . . ) The moral of this is: bookmark that interestin' site when you find it or you may never see it again.

  8. Re: Distributed Spidering by davie · · Score: 2

    I've only considered this as a strictly volunteer project, directed by a university and the top level hosts and database hosted there, with some corporate sponsorship thrown in for good measure.

    I don't know if this would work if commercialized, since a lot of the folks who have the knowledge, experience and compute power to participate would probably not feel too warm or fuzzy about helping to build the next Yahoo!, especially when the IPO made the company worth millions overnight. It would certainly be tough to maintain the same level of participation after going commercial, unless some hitherto unforeseen way of rewarding participation per contribution were discovered. Perhaps corporate sponsors could offer premiums to contributors based on sites spidered? Maybe something along the lines of frequent flyer miles?

    --
    slashdot broke my sig
  9. Re:all and then some? by Silver+A · · Score: 1

    >Ok so repeating an effort to find the various purported etymology of the word "strawberry" I
    >searched with +etymology +strawberry +origin on both yahoo (my standard) and alltheweb.
    >Yahoo found 60 while alltheweb found 117, but a number of allthewebs' finds were xxx sites!?
    >How many xxx sites actually use the word etymology and if this is more do we really want more?


    I tried "CalTrans Bridge Design Manual" in Google!, Inference Find, and All The Web. Google gave me many links to CalTrans sites and some associated ones. Inference Find found the CalTrans sites and a bunch of tangentially related sites. All the Web found a bunch of CalTrans sites and related sites, but numbers 19 and 21 were porn sites, and putting CalTrans at the end of the string got me more porn sites.

    Not a terribly useful site, IMO.

  10. looking for something that isn't there. by eggshell · · Score: 1

    I've just used the search engine... got two results, one of which doesn't exist.

    god! I hate search engines.

    (but, he did it fast!)

  11. Re:It seems that... by mce · · Score: 1
    This would find too many false links. Here's one reason why: often when I edit and view an existing page, I edit a temporary copy instead and replace the real page only when I'm satisfied with the changes. Clearly, you don't want the temporary version to be indexed...

    And then there are all the privacy concerns...

    --

  12. Re:It's the custom hardware, stupid.. by oyvindmo · · Score: 1

    > "Actually"?
    > You claim to KNOW this?
    > I noticed you live in the same city as FAST headquarters..
    > But maybe you cant talk about that ;-)

    Actually, I think Frode should update the curriculum vitae on his home page to include the fact that he's a FAST employee. I claim to know this.

    :)

  13. Let's do it! by foop · · Score: 2
    Hey, I've been thinking about this very same problem for quite some time and some fellow nerds and I have been thinking about how to do it. How about we start a mailing list to further discuss this as an open source initiative?

    I just created http://www.egroups.com/group/dizz-net/ as an email discussion list. You can subscribe by sending email to dizz-net-subscribe@egroups.com. There are a lot of interesting issues, many already mentioned here:

    • quality is usually more important than quantity
    • a distributed app has the potential to be much more "fresh" than other search services
    • a network protocol needs to be designed carefully -- you don't want to be sending all the web haphazardly around the web every day. clients might be assigned to monitor nearby sites. there are some cool opportunities to use this system just to map the internet.
    • searching is a different beast from crawling. parallel searching -- like FAST and others -- requires major resources which an open source project couldn't manage.
    • full text vs topic searching: does a distributed system whose clients fetch documents index every word, or summarize? Topic searching is probably more appropriate for distributed searching, but full text is often more desirable.
    • interesting security issues come up, like how to keep clients from polluting the database.
    • etc...

    -david.

  14. Re:I sure HOPE it doesn't index the entire web by ashpool7 · · Score: 1

    Well, they say they honor ROBOTS.TXT. However, your post suggests differently. You ought to e-mail them and find out. Their robots policy is stated here: http://www.fast.no/faq/faqfastwebsearch/faqfastwebcrawler.html

  16. Who says a billion? by mach-5 · · Score: 1

    So who says that there are only a billion pages out there, maybe 2 billion exist. So how do they know when they're done?

    BTW: Who needs to sort through all that junk after doing a search. Use metacrawler, and get a pretty good compilation of the best search engines out there.

  17. Re:I sure HOPE it doesn't index the entire web by mcdurdin · · Score: 1

    Yep, I got an email from FAST telling me that they had a bug in their ROBOTS.TXT handling when they indexed my site, and they've now fixed it.

  18. Re:QUESTION .... by ronfar · · Score: 1
    Hi,

    Try view source on this page to see the way I would handle it:


    Anti-Linking Script


    Of course, there are other more sophisticated ways to deal with it, but this can work if the people aren't bound and determined to link to you.

    --
    All the creatures will die, And all the things will be broken. That's the law of samurai. (Jubai, 1605)
  19. Re:Non-scientific analysis by Sourdough · · Score: 1

    Uh, Altavista doesn't ignore home pages. I hit them all the time when they happen to have a word I'm searching for.

  20. Out of date by DeathB · · Score: 1

    Whenever I hear about a new search engine, like many other people on /., I have a set of queries that I tend to try. My results were that it seemed somewhere in the ballpark with Google and AltaVista as far as numbers. One of the things that they pride themselves on is the ability to check more often and catch broken links faster... If this is the case, why am I seeing more broken links off of their engine than any of the other major ones? They are getting the same pages, but the other engines wiped these pages out quite a while ago because they were no longer current.

    --
    Would you do it for some scoobie crack?
  21. Wow - Looks Great by Aaron+M.+Renn · · Score: 3

    I judge search engines by the most important criterion of all - how many references to me they have. Alltheweb now has vastly more than runner-up Google, making them the biggest ever. I typed in "Aaron M. Renn" and I got 1604 on AllTheWeb, ~500 on Google and only ~180 on AltaVista. Even if that number drops as I search through the pages, it's still impressive. I did look through the plain "Aaron Renn" listings too, where they also crushed the competition (though it's a much smaller number of pages since I virtually always use my middle initial). Believe it or not, there is a page out there with another "Aaron Renn" on it. Pretty weird.

    1. Re:Wow - Looks Great by Vlad_the_Inhaler · · Score: 1

      I applied a similar test - supplied some of the keywords in my web-page (samba encryption smbmount)- and BINGO. The very first entry. This really surprised me because my provider is an obscure German one. I don't know what it does to the competition but it certainly impressed me.

      --
      Mielipiteet omiani - Opinions personal, facts suspect.
    2. Re:Wow - Looks Great by cdegroot · · Score: 1

      I did the same check (but then for my name :), and
      even though AllTheWeb returns 2.5 times the hits
      that Google returns, there's a slight difference:
      Google puts my homepage firmly on spot #1, whereas
      AllTheWeb (probably by coincidence) has it at
      number eight between a mass of irrelevant
      mailing list archive links.

      I'll stick with Google - it has this uncanny
      ability of putting what you want behind the
      "I'm feeling lucky" button...

  22. Relevance before size by Stephen · · Score: 1

    It was good to see that the BBC pointed out that relevance of the search results is probably more important than the number of pages in the database -- and that Google seems somehow to have the most relevant results time after time. Alltheweb didn't do very well on my standard test word, so I'm sticking with Google.

    --
    11.001001000011111101101010100010001000010110100011000010001101001100010011
  23. Re:seems doable by Anonymous Coward · · Score: 0

    Retrieving pages to be indexed isn't a problem. Indexing them, and doing fast enough retrieval later is.

  24. Nice touch, they use OpenSource software :-) by Wisdom+Seeker · · Score: 1

    Here's what Netcraft has to say about it: www.alltheweb.com is running Apache/1.3.6 (Unix) PHP/3.0.11 on FreeBSD .

    Both Apache and FreeBSD are well-proven OpenSource software projects. I imagine this is going to be very stable ;)


    --
    .oOo. Don't underestimate the power of Linux .oOo.
  25. Re:Linked to Lycos ? by Anonymous Coward · · Score: 0
    Lycos didn't buy FTPsearch. They signed a deal to have FAST Search and Transfer, which owns the search technology, to run it under the Lycos brand.

    Also, based on earlier statements from them, I'd imagine that this will only be used as a proof of concept (they're more interested in selling and licensing their hardware and software than running search engines on their own).

  26. Who says a billion? by Anonymous Coward · · Score: 0

    So who says that there are only a billion pages out there, maybe 2 billion exist. So how do they know when they're done?

    BTW: Who needs to sort through all that junk after doing a search. Use Metacrawler, and get a pretty good compilation of the best search engines out there.

  27. Most search engines are distributed! by Sulka · · Score: 1

    Alltheweb is distributed (see http://www.fast.no/company/press/twbs02081999.html ), Hotbot is distributed and I guess most of the others are distributed too.

    I read somewhere that some of the engines even use multiple Linux machines with applications written in Perl for indexing.

    sulka

    --
    "Although it is not true that all conservatives are stupid, it is true that most stupid people are conservative."
  28. Re:Linked to Lycos ? by Helten+E · · Score: 1

    All the Web is not Lycos and has nothing to do with Lycos.

    Eyvind Bernhardsen

  29. Re:Non-scientific analysis by Anonymous Coward · · Score: 0

    Yes but.

    "bill clinton", altavista: 251138, fast: 397018
    "bill gates", altavista: 131781, fast: 247267
    "gates", altavista: 795520, fast: 916898
    "slashdot" altavista: 27030, fast: 51102
    "george washington", altavista: 178451, fast: 299085
    "tour de france", altavista: 35429, fast: 130327
    "commander keen", altavista: 2059, fast: 2841

    but

    "clinton", altavista: 2590260, fast: 1848689
    "warez", altavista: 1399830, fast: 73031 !!!! altavista finds about 20 times as many "warez" hits as fast does
    "sex", altavista: 35438560, fast: 3907581
    altavista finds about 10 times as many "sex" hits as fast does
    "senna", altavista: 102450, fast: 44019

    Argh! this breaks my head. But my partial conclusion is that altavista has much much much much much more of these spam pages that index every possible keyword and then some more. Another personal experience is that altavista has many pages that don't exist anymore.

  30. Not that great by kuro5hin · · Score: 1

    I've been using alltheweb for a couple weeks (yeah, ever since it first showed up on Slashdot). My take on it is that it is fast, as in blazing zippy fast, but mostly useless. It tends to return a lot of pages that are all the same. Dig past the first 10 results or so, and you find that after that it's just one page from the same FAQ, only they list every mirror separately. I haven't found it to be better than any other search engine, and worse than most. Stick with a combination of Google, Altavista, and Yahoo. And Deja.com, for howto questions :-)
    ----------------------
    "This moon-cheese will make me very rich! Very rich indeed!

    --
    There is no K5 cabal.
    I am not the real rusty.
  31. Re:They're running Apache/FreeBSD by Bombcar · · Score: 1

    Exactly. Linux users are much more likely to recompile/rebuild/tweak the kernel, whereas *BSD users just run it. They wait for the kernel writers to distribute and don't bother rebooting. I have a linux boxen up 236 days and it won't get rebooted

  32. Re:Non-scientific analysis by Anonymous Coward · · Score: 0

    I made two other searches for 'alltheweb' and 'jente' (Norwegian word for girl) that gave me an entirely different result

    AltaVista gave: 295 and 4744
    Fast gave: 918 and 9099

    But an analysis like that does not make FAST twice as big as AltaVista


  33. Search for letter 'a' by Anonymous Coward · · Score: 0

    139625351 documents found.

  34. Their spider was not very nice by matta · · Score: 1

    Their crawler (FAST-WebCrawler/0.3) was not very nice when it blasted through my site. The general guideline is that a crawler should grab one page a minute. Their crawler grabbed multiple pages a second even when going through a bunch of pages generated by CGI.
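A rough sketch of the per-host pacing being asked for here, for anyone curious what it might look like. Everything in it is an assumption made for illustration: the `PolitenessGate` class is invented, the 60-second default just mirrors the "one page a minute" guideline quoted above, and it says nothing about how FAST's crawler is actually built.

```python
import time
from urllib.parse import urlsplit

class PolitenessGate:
    """Tracks the last fetch time per host and reports how long to wait."""

    def __init__(self, min_delay_seconds=60.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock          # injectable so the class can be tested
        self.last_fetch = {}        # host -> time of the most recent fetch

    def wait_needed(self, url):
        """Seconds to wait before fetching url (0.0 if it can go now)."""
        host = urlsplit(url).netloc
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_fetch(self, url):
        """Note that url's host was fetched just now."""
        self.last_fetch[urlsplit(url).netloc] = self.clock()

# Usage: ask before each request, record after it.
gate = PolitenessGate(min_delay_seconds=60.0)
url = "http://www.example.com/cgi-bin/page?id=1"   # made-up URL
delay = gate.wait_needed(url)
gate.record_fetch(url)
```

Keying the delay on the host rather than the URL is what stops a crawler from hammering a single site through many different CGI pages.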

  35. Re:I sure HOPE it doesn't index the entire web by mcdurdin · · Score: 1
    Unfortunately, AllTheWeb does seem to ignore ROBOTS.TXT. It has indexed every page on my site (www.tavultesoft.com), including all those that have been disallowed by ROBOTS.TXT, where no other search engines have. I don't know if that's because it has bugs in its ROBOTS.TXT analysis or because it just ignores it. Either way, it's not good.

    Has anyone else noticed this?

  36. Yes, but it's not playing nice. by lrund · · Score: 1

    This particular search engine isn't honoring the robots.txt file (at least, not on my site). I checked to see if it knew about my pages, and it had indexed deep into my site DESPITE the "disallow" directives in my robots.txt file.

    Shame on them.
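For reference, this is roughly what honoring "disallow" looks like from the crawler's side, sketched with Python's standard `urllib.robotparser`. The robots.txt content, the site, and the `ExampleCrawler/0.1` user agent are made up for the example; it is not a claim about what FAST's crawler does or did.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

def make_checker(robots_text):
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

checker = make_checker(ROBOTS_TXT)
AGENT = "ExampleCrawler/0.1"  # made-up user agent string

# A well-behaved crawler asks before every fetch and skips anything disallowed.
print(checker.can_fetch(AGENT, "http://www.example.com/index.html"))
print(checker.can_fetch(AGENT, "http://www.example.com/private/notes.html"))
```

Note that robots.txt is purely advisory: nothing on the server enforces it, which is why a buggy or careless crawler can index "deep into" a site anyway.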

  37. It's fast, anyway. by rde · · Score: 2

    I've been using it for a few days now, and it seems impressive. It's certainly fast. Google is still my engine of choice (even though it's visited my page a ton of times, and still won't find it when I search for it).
    As for its coverage: it may be "the result of more than a decade of research into optimising search algorithms and architectures", but frankly this sounds dubious.
    If it covers 30% of the web it'll be twice as good as existing engines, but I suppose thirdoftheweb.com isn't that catchy.

    1. Re:It's fast, anyway. by zyklone · · Score: 1

      Well, it's from the same people who used to run ftpsearch.ntnu.no and that one was VERY fast until lycos bought it and messed it up.

      They have a special fast search chip or something, hardware regexp matching etc.
      They are certainly not beginners on the searching scene so they might be able to do it.

      (This is really old news, it was on /. a couple of months ago.)

      Zyklone

    2. Re:It's fast, anyway. by Helten+E · · Score: 2

      All the Web doesn't use the pattern matching chip, it's all done in software on 50 Dell servers running BSD. What's new is the 200M documents and the official announcement (it's only been up on trial until now).

      Eyvind Bernhardsen

  38. It seems that... by Gestahl · · Score: 2

    This would be a great application for a distributed computing application, lots of computers indexing the web, and after they finish that, they can revisit sites for broken, moved and changed content sites... First post?

    1. Re:It seems that... by spooky+ghost · · Score: 2

      On that topic: surely this could actually be done by the web browser rather than a distributed client. If you have a page online you're bound to check it yourself to make sure it's OK. With an appropriate browser or plugin your page could then be indexed and submitted to a search engine. And then once you start surfing any page you visit could be automatically indexed. The only problem is the millions of submissions you'd get each day.

      --

      No matter what it looks like, there isn't a .sig here.
    2. Re:It seems that... by davie · · Score: 3

      Not to harp on one of my pet ideas or anything, but I think a distributed spidering project could be pulled off. The trick would be to delegate the work based on compute power and bandwidth, with the "low-end" clients doing the grunt work of spidering, then passing the raw data up to the bigger iron with more bandwidth where the relationships between sites could be ferreted out, keywords could be indexed and context established, etc. These sites could then pass the cooked data back to the top level servers (compressed, of course) for whatever final work needs to be done and then insertion into the database. The idea is to have each client do the work it's best suited for, and to distribute the load more evenly. Bandwidth could be a problem, but I think a lot of the data could be "tokenized" somewhat once references have been established, and some compression would probably help.

      If I had the networking know-how I would put together a proposal and start taking flame-mail, er, suggestions. Since I don't, I hope someone who does and is as crazy as me will pick up on the idea.

      --
      slashdot broke my sig
    3. Re:It seems that... by Bazzargh · · Score: 1

      Why do it on the client? Indexing would be
      much faster if the index was carried at the
      server, with a hierarchy of index servers
      not doing any spidering at all, if possible.

      Sound familiar? It's Harvest's SOIF format:
      http://www.tardis.ed.ac.uk/harvest/

      http://www.tardis.ed.ac.uk/harvest/docs/old-manual/node151.html#SECTION000120000000000000000

      Just my 2c - I'd be happier if much *less* of
      the web was indexed...just the useful stuff.
      And if search engines could only recognize a
      mirror when they see one, then I wouldn't get
      so many identical replies...

      -Baz
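One simple way an engine could "recognize a mirror" is to fingerprint page content and fold identical copies into a single hit. The sketch below does that with a SHA-256 hash of whitespace-normalized text. The function names are invented for the example, and exact hashing is an assumption chosen for brevity: it only catches byte-for-byte copies (after whitespace cleanup), not mirrors with different headers or footers.

```python
import hashlib

def content_fingerprint(body):
    """SHA-256 of the page text with runs of whitespace collapsed."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def collapse_mirrors(results):
    """Keep the first URL seen for each distinct page body.

    results is a list of (url, body) pairs in ranked order.
    Returns a list of (url, [mirror_urls]) pairs in the same order.
    """
    by_fingerprint = {}
    order = []
    for url, body in results:
        fp = content_fingerprint(body)
        if fp not in by_fingerprint:
            by_fingerprint[fp] = (url, [])
            order.append(fp)
        else:
            by_fingerprint[fp][1].append(url)
    return [by_fingerprint[fp] for fp in order]
```

With something like this, a FAQ that lives on a dozen mirrors would show up once, with the other copies listed under it.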

    4. Re:It seems that... by Anonymous Coward · · Score: 0

      another idea (that may be used in conjunction) might be to create a standard like robots.txt, but for getting the list of all public files on the web servers, with modification times, file sizes, mime-types. This could speed up the search enormously, especially in the updates.

      the problem is that you may wish to "hide" stuff by not revealing the filenames. that's a _major_ problem according to me.

      matju
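To make the update speed-up concrete: if a server published a list of its public files with modification times and sizes, a crawler could diff that list against its last visit and refetch only what changed. The sketch below assumes a manifest already parsed into a dict; the `Entry` fields and both function names are hypothetical, since no such format is defined in this thread.

```python
from typing import Dict, List, NamedTuple

class Entry(NamedTuple):
    mtime: int       # last-modified time, seconds since the epoch
    size: int        # size in bytes
    mime_type: str

def paths_to_refetch(previous: Dict[str, Entry], current: Dict[str, Entry]) -> List[str]:
    """Paths that are new, or whose mtime or size changed since the last crawl."""
    changed = []
    for path, entry in current.items():
        old = previous.get(path)
        if old is None or (old.mtime, old.size) != (entry.mtime, entry.size):
            changed.append(path)
    return sorted(changed)

def paths_removed(previous: Dict[str, Entry], current: Dict[str, Entry]) -> List[str]:
    """Paths listed last time but missing from the new manifest."""
    return sorted(set(previous) - set(current))
```

The privacy worry raised above still applies: anything the owner wants hidden would simply have to be left out of the list.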

    5. Re:It seems that... by cemerson · · Score: 1

      You're forgetting the privacy issue.

      How many people would want their browser recording and sending off a list of every site they visited during the day? Even then, I doubt it would be a particularly good way of finding new sites that weren't already in your search engine.

      Chris

    6. Re:It seems that... by zyklone · · Score: 1

      Distributed.net could do it.
      It has been suggested on the mailing list once but I don't know what happened with that idea.
      The problem is that you would have to store a huge amount of data somewhere. So you would probably need a Big Company(tm) sponsoring or leading the project. The clients would probably duplicate a lot of work, but this is not a major problem.
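On the duplicated-work point, one common trick (not something proposed in this thread) is to assign each host to exactly one client by hashing the host name, so two clients never spider the same site. A minimal sketch, with the function name and the fixed client count as assumptions; it ignores the compute and bandwidth differences between clients discussed elsewhere in the thread.

```python
import hashlib
from urllib.parse import urlsplit

def client_for_url(url, num_clients):
    """Pick a client index in [0, num_clients) from the URL's host name."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_clients
```

Every URL on the same host maps to the same client, which also keeps per-site politeness limits in one place. The catch is that changing `num_clients` reshuffles almost every assignment.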

    7. Re:It seems that... by Hard_Code · · Score: 1

      This, too, was one of my ideas a while back, when there was a post that invoked talk about a distributed cryptographic filesystem.

      My opinion is that no ONE center could organize all the data on the whole net, since it is so wide spread and far flung. My idea (somewhat corresponding to distributed filesystems) was that every client held a piece of the index and had some sort of reliability rating. Low reliability nodes would have to be backed up on failover, duplicate nodes. Anyway there would be a whole distributed hierarchy of nodes based both spatially and I guess on reliability. When you asked the master node, or perhaps your regional node for something, it would forward it on to who IT thought might have the right answers. Each node would do the same, in turn, until the host itself was reached, or a terminal node was reached. The info would then be fed back to you. Yes it would be slower, but you WOULD get the correct answers. Also, if nodes were distributed spatially, then regional/local nodes could more frequently check for page expiration and 404s...one of the major problems is that all these CENTRAL search engines have LOADS of outdated crap. Sure you find a lot...but it's all invalid.

      My Seti client could sure share some CPU with a distributed indexing client...somebody set this up already!

      --

      It's 10 PM. Do you know if you're un-American?
    8. Re:It seems that... by Hard_Code · · Score: 1

      Sorta like my idea...every node is a server AND a client...

      --

      It's 10 PM. Do you know if you're un-American?
    9. Re:It seems that... by dschuetz · · Score: 1

      Isn't this the way that the old Archie system worked? Lots of different servers would index stuff, and every night they'd exchange what they'd learned (to spread the knowledge, and to ensure that none of the other servers revisited sites too soon). They claimed (IIRC) to visit every public FTP server at least once a month.

      The nice thing about this approach would be that you could have multiple front-ends, too, so the search engine "site" itself wouldn't get bogged down--automatic mirrors!

      This should be fairly simple to implement--a list of sites visited (with dates) on the one hand, and index diffs (for the content itself). The only question is: How do we keep it from getting "sold out" and losing quality? (not that selling out is bad, but someone mentioned lycos going to hell after getting sold).
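The "list of sites visited (with dates)" half of that could look something like the sketch below: each server keeps a map of site to last visit date, the nightly exchange merges the maps, and a site is only due again once nobody has visited it for a while. The function names and the 30-day default (borrowed from the "once a month" recollection above) are assumptions for illustration, not a description of how Archie actually worked.

```python
from datetime import date, timedelta

def merge_visit_logs(*logs):
    """Combine {site: last_visit_date} maps, keeping the latest date per site."""
    merged = {}
    for log in logs:
        for site, visited in log.items():
            if site not in merged or visited > merged[site]:
                merged[site] = visited
    return merged

def sites_due(merged, today, max_age_days=30):
    """Sites whose most recent visit by any server is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(site for site, visited in merged.items() if visited < cutoff)
```

The index diffs are the harder half and are not sketched here.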

  39. searchenginewatch.com by Anonymous Coward · · Score: 0

    well, according to searchenginewatch.com they aren't the biggest

    1. Re:searchenginewatch.com by Helten+E · · Score: 1

      Searchenginewatch's current size comparison is correct as of July 1, but All the Web hasn't been running with 200M documents for that long.

      Eyvind Bernhardsen

  40. Re:What is the problem ? by Anonymous Coward · · Score: 2
    It's a problem of cost, bandwidth, and enough hardware. All of which can be solved relatively easily. The software to do the indexing is hardly difficult to write - I have one I've written myself, and indexed a few million pages with it. The reason I don't put up a search engine tomorrow is that I certainly couldn't afford the hardware, and that it's a lot of work to retrieve data from the index in a way that gives good results.

    But another problem is the amount of dynamically generated content. There simply ISN'T any way for a search engine to safely index everything on the web, because it can't know which CGIs just serve up a finite selection of pages from a database, and which randomly generate content, as long as no decent clues are given.

    The amount of dynamically generated content is growing dramatically, so this will be an increasing problem.

  41. They're running Apache/FreeBSD by Squirtle · · Score: 1

    According to http://www.netcraft.com/whats

    Seems to be the platform of choice for serious stuff like this.

    1. Re:They're running Apache/FreeBSD by Jordy · · Score: 2

      No one ever said Linux was stable on every single machine in the world; it supports a whole lot of hardware which isn't all that stable itself. :)

      Linux Max Uptime: 845 days, 08:59m
      FreeBSD Max Uptime: 690 days, 23:48m

      Then again, there are about 1/10th the number of FreeBSD entrants... overall not a real big sampling group in general.

      Plus there's no information about what hardware anyone is using and why the machine was rebooted (kernel upgrades, hardware upgrades or crash).

      Overall, it's sorta pointless other than a nice figure to say my oscar meyer is bigger than yours.

      --
      The world is neither black nor white nor good nor evil, only many shades of CowboyNeal.
    2. Re:They're running Apache/FreeBSD by Anonymous Coward · · Score: 0

      From what I hear, the FAST guys are really fanatical FreeBSD guys.

      Hmm, maybe I should stop using that Compaq search engine? Compaq are too cozy with Microsoft anyways.

    3. Re:They're running Apache/FreeBSD by zmooc · · Score: 1

      But www.fast.no seems to be running Linux. Anyway...according to the Uptime List, FreeBSD has much higher uptimes than Linux. Looks like it is the choice of the folks that don't reboot. I think those are mainly to be found in commercial environments like this one. Quite funny - a search for my nick/handle only finds results on /. and [fm] :)

      --
      0x or or snor perron?!
  42. What is the problem ? by SimonK · · Score: 1

    I may be being a bit slow here, but what is the problem which prevents coverage of the entire web by search engines ?

    Surely if you just hit port 80 of every machine registered in DNS, and search recursively from the pages retrieved by that, you'll get a greater number of pages than the 10-20 percent most search engines have ?

    Or is it the case that the problem is in the indexing of the data, and searching it quickly enough, rather than retrieving it ?
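
    The recursive search described above amounts to a breadth-first traversal of the link graph. A minimal sketch, run on a made-up in-memory graph rather than the real network (the page names and the `LINKS` table are hypothetical), suggests what such a crawl can and cannot reach:

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["b.example/news"],
    "b.example/news": [],
    "c.example/private": ["a.example/"],  # no page links to this one
}

def crawl(seeds, links):
    """Breadth-first traversal from the seed pages; returns every page reached."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

reached = crawl(["a.example/"], LINKS)
unreached = set(LINKS) - reached
print(sorted(reached))
print(sorted(unreached))
```

    In this toy graph the page that nothing links to is never visited, however long the crawl runs.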

    1. Re:What is the problem ? by PigleT · · Score: 1

      The problems are that not all machines are listed in DNS, nor are all web servers running on port 80, let alone that not all pages are linked from somewhere!

      --
      ~Tim
      --
      .|` Clouds cross the black moonlight,
      Rushing on down to the circle of the turn
    2. Re:What is the problem ? by SimonK · · Score: 1

      But surely all the index pages on machines listed in DNS on port 80 + all the pages they (recursively) link to is more than 20% of the web's content ? Almost all sites are both on port 80 and in DNS.

      Someone else (who is still down at 0, because he posted it anonymously) came up with a much better answer, which is that the hardware and bandwidth required to index 100% of static content is extremely large, and anyway most content is not static. It's this last point, I think, which is most important - by definition nothing you read daily is static content.

  43. Maybe all the web, but it's not useful by itamar · · Score: 1

    I tried searching for "HotMedia". In Google, the first(!) result is the HotMedia homepage at IBM. Here, I don't see this page on the first results page.

    I'll stick to Google.
    --
    http://www.wholepop.com/
    Whole Pop Magazine Online - Pop Culture

    1. Re:Maybe all the web, but it's not useful by Yakman · · Score: 1
      This is probably because, from memory, Google works by basing the relevance of a page on how many other pages link to it. This is why it's hard to find obscure stuff on Google, but if it's something you know is quite popular, it's the best way to find the most popular sites about it.
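
      As an illustration only, ranking by inbound links can be sketched as below. This is a simplified stand-in, not Google's actual algorithm, and the link data is invented:

```python
from collections import Counter

# Hypothetical (source, target) link pairs.
LINKS = [
    ("fan-page", "popular-home"),
    ("news-site", "popular-home"),
    ("blog", "popular-home"),
    ("blog", "obscure-page"),
]

def rank_by_inlinks(links):
    """Order pages by how many distinct pages link to them, most-linked first."""
    counts = Counter(target for _, target in set(links))
    return [page for page, _ in counts.most_common()]

ranking = rank_by_inlinks(LINKS)
print(ranking)
```

      A page with a single inbound link sinks below one that many pages point to, regardless of how well its text matches the query.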

      I hope that made some sense :)

  44. Scan the WHOLE of the web? by Anonymous Coward · · Score: 0

    I hope it has fun with the 45GB of stuff I have at the end of a 64k line.

    BWAHAHAHAHAHH!

    1. Re:Scan the WHOLE of the web? by EJB · · Score: 1

      You never stopped to wonder why your Quakz Gamez were so slow?

  45. ad infinitum et ad nausium by Wubby · · Score: 1

    Assuming it will be a rather large amount of data, who will index their index? (and who will index that index... and that one... and that one......)

    --
    Sig
    Appended to the end of comments you post. 120 chars
  46. yeah, it's fast. but it's pretty weak by Anonymous Coward · · Score: 0

    Came across it a few weeks back - we were looking to integrate a search engine with our product, and they looked promising (although we didn't go with them in the end). I used it as my default search engine for a week or so to see what it could do, but it seems to have even less coverage than most of the others: pretty much *nothing* outside the US, and an awful lot of holes within it. Still, it's pretty fast and it manages to not return porn sites for searches on techie subjects, which is more than certain other engines do. :)

    Not Cowardly, just lazy.

    1. Re:yeah, it's fast. but it's pretty weak by Anonymous Coward · · Score: 0

      I found that alltheweb had pretty good coverage of Norwegian web sites, as well as some other European sites. Not too surprising though: FAST is a Norwegian company, although the actual search-engine servers seem to be located in the US.

  47. Non-scientific analysis by Snotboble_ · · Score: 2

    I wondered about those 200M pages already indexed, and I dug into Altavista, which says it has ~140M pages indexed.

    I made two searches; one for the word 'Microsoft' and the other for 'Linux'.

    Altavista gave : 12,682,370 (M$) and 4,526,430 (LX).
    FAST gave : 4,689,227 (M$) and 2,570,827 (LX).

    So.. If FAST currently is ~40% bigger than Altavista, how come they return numbers that much lower? With such large numbers it can't be pure coincidence, In My Humble Opinion.
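
    For what it's worth, the comparison can be put on a per-page footing. The sketch below uses only the figures quoted in this post; the variable and function names are just for illustration:

```python
# Figures quoted in the post above.
index_size = {"AltaVista": 140_000_000, "FAST": 200_000_000}
hits = {
    "AltaVista": {"Microsoft": 12_682_370, "Linux": 4_526_430},
    "FAST": {"Microsoft": 4_689_227, "Linux": 2_570_827},
}

def hit_share(engine, word):
    """Fraction of an engine's claimed index that matches the word."""
    return hits[engine][word] / index_size[engine]

size_ratio = index_size["FAST"] / index_size["AltaVista"]
print(f"FAST / AltaVista index size: {size_ratio:.2f}")
for engine in index_size:
    for word in ("Microsoft", "Linux"):
        print(f"{engine:9} {word:9} {hit_share(engine, word):.1%}")
```

    On these numbers, the share of FAST's claimed index matching 'Microsoft' is roughly a quarter of Altavista's, and for 'Linux' about 40% of it, which is the gap in question.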

    -Snotboble

    --
    Q: How does a Unix guru have sex? A: unzip;strip;touch;finger;mount;fsck;more;yes;umount;sleep
    1. Re:Non-scientific analysis by sugarman · · Score: 1

      As one of the other posts pointed out, they currently have little to no content from NA.

      Obviously, this would tend to skew the results somewhat. ;0

      I imagine as they get closer to their goal, the search results will become more relevant.

      --
      --sugarman--
    2. Re:Non-scientific analysis by Anonymous Coward · · Score: 0

      *shrug* It could just be another case of Alta-Vista being on crack...

    3. Re:Non-scientific analysis by jandrese · · Score: 2

      Probably because alltheweb is indexing EVERY page it comes across, even those "Hello, I'm so and so and I love cats..." pages that most search engines thankfully ignore. It even had my webpage in there, which is a first for search engines.

      --

      I read the internet for the articles.
  48. seems doable by Anonymous Coward · · Score: 0

    My rough calculations indicate that it's *very* feasible to spider a billion pages in that length of time.
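
    One way such a rough calculation might go is sketched below. The 120-day window and the 10 KB average page size are assumptions for illustration, not figures from the article:

```python
def crawl_rate(pages, days, avg_page_bytes):
    """Return (pages per second, megabits per second) needed to fetch `pages` in `days`."""
    seconds = days * 24 * 60 * 60
    pages_per_sec = pages / seconds
    mbit_per_sec = pages_per_sec * avg_page_bytes * 8 / 1_000_000
    return pages_per_sec, mbit_per_sec

# Assumed inputs: 120 days to go, 10 KB average page.
pps, mbps = crawl_rate(1_000_000_000, days=120, avg_page_bytes=10_000)
print(f"{pps:.0f} pages/s, {mbps:.1f} Mbit/s sustained")
```

    Under those assumptions the sustained fetch rate is on the order of a hundred pages per second; fetching is only part of the job, since indexing and re-visiting pages add load this sketch ignores.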

  49. It's the custom hardware, stupid.. by gaute · · Score: 2
    Sloppy journalists...

    Check out this
    http://www.fast.no/product/fastpmc.html

    gaute


    --
    -- We plunge for the slipstream the realness to find
    The incredible String Band
    1. Re:It's the custom hardware, stupid.. by gaute · · Score: 1
      "Actually"?
      You claim to KNOW this?
      I noticed you live in the same city as FAST headquarters..
      But maybe you can't talk about that ;-)

      No, seriously, there are a couple of pages at the fast site that imply rather clearly that alltheweb uses the PMC.
      Not explicitly though, you're right about that.
      I seem to remember a picture of one of those dell machines full of those cards, but of course I can't find it now...

      Anyway, just look at this quote from the PMC faq,
      and compare this with alltheweb's claim of scaling linearly.

      >Since the PMC search through data at a fixed speed (100 MB/s), the
      >response time for a query is independent of its complexity. In a
      >software solutions the response time increases more than linear with
      >increasing query complexity.

      Gaute


      --
      -- We plunge for the slipstream the realness to find
      The incredible String Band
    2. Re:It's the custom hardware, stupid.. by mistabobdobalina · · Score: 1

      also it appears dell has some type of sponsorship - at least alltheweb.com has a dell logo on it...

      --
      -- your knees hurt, don't they?
    3. Re:It's the custom hardware, stupid.. by frodelu · · Score: 1

      Actually, All The Web doesn't use the PMC. It's running on 50 standard Dell servers using the Fast Search software.

      - Frode

      --
      -- Best regards, Frode Lundgren
  50. dynamic content by Sourdough · · Score: 1

    How is it possible to index the entire web? The entire *static* web should be relatively simple, but dynamic content really throws a monkey wrench in things. And dynamic content is becoming much more commonplace. Not even going into forms, a page referenced by a URL may be different day to day, or even minute to minute (like slashdot).

  51. Two problems: by decipher_saint · · Score: 1

    1. For every site that goes up, one goes down. I don't know how they are going to keep up with dead links.


    2. Slow, if they index everything you will notice definite slowness. Even if they find some kind of uber-fast way of searching through stuff their servers will be slowed down by net-troglodytes searching for the "internet" or the letter "a".


    Imagine how many pages would pop up if you searched for the word "pictures".

    --
    crazy dynamite monkey
  52. Re:I sure HOPE it doesn't index the entire web by jflynn · · Score: 1

    Good point. I can also see the xtian right and censorware manufacturers being excited about having a comprehensive list of sites that need shutting down/blocking. If people want to remain unpublicized on search engines, they should be able to. Respecting ROBOTS.TXT is a simple solution already available, and I hope FAST will come around on this.
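
    For reference, honouring ROBOTS.TXT is not much work on the crawler side. A minimal sketch using Python's standard `urllib.robotparser` follows; the robots.txt contents, the crawler name and the URLs are made up for the example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /friends/
"""

def allowed(robots_txt, user_agent, url):
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

private_ok = allowed(ROBOTS_TXT, "ExampleCrawler", "http://example.com/friends/party.html")
public_ok = allowed(ROBOTS_TXT, "ExampleCrawler", "http://example.com/index.html")
print(private_ok, public_ok)
```

    A well-behaved crawler would run a check like this before every fetch and skip whatever the site owner has disallowed. It is a convention, not an access control, so it only helps against crawlers that choose to respect it.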

  53. My search-engine criteria... by dwlemon · · Score: 1

    Is how well it finds my own home page. Google takes me straight to my main page if I enter my e-mail addy, but this engine only shows me one of the files on my page. It has to do with music files, so I think maybe their algorithm decided it was more relevant/interesting to the average person than my main page.

    I think I'll stick to Google.

  54. Dupe! (sorta) by blue · · Score: 1
  55. I sure HOPE it doesn't index the entire web by ashpool7 · · Score: 1
    I really hope that they don't seriously mean they will be indexing the entire web. That would mean their crawler would have to completely ignore ROBOTS.TXT.

    I, for one, would like to keep some of the webpages I post on the internet un-indexed because they were meant for a couple of friends, not for a couple billion people to rummage through.

    1. Re:I sure HOPE it doesn't index the entire web by noosphere · · Score: 1

      If your pages are only intended for a couple of people, try putting .htaccess/.htpasswd access on your directory, or even just leave your page "out of the web" by making sure nothing else links to it (and make sure it doesn't show up in a dir listing). If there's a crawler that can get to your page that way, I'd be VERY surprised.

  56. more URL are not good by Anonymous Coward · · Score: 1

    If you want to impress me with a search engine, re-run the search machines and get rid of all the
    expired and bad URLs. The more URLs an engine adds, the more unusable it becomes.

  57. Linked to Lycos ? by Foddrick · · Score: 1

    Did anyone else notice that these are the same guys that have or had ftpsearch.ntnu.no before lycos got hold of it, and that there are gratuitous links to lycos all over their site? It's not just lycos in AllTheWeb.com clothing, is it ?

  58. Harvest: Distributed indexing (Re:It seems...) by RasmusKaj · · Score: 1
    There was the Harvest project a while ago; I'm not sure what came out of it (a search revealed this document, which seems sane and not much more than a year old).

    The basic idea was that the pages are indexed locally at the server, and the indexed data are gathered and can be queried at "brokers".
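
    Very roughly, the split between local indexing and brokers could look like the sketch below. This only illustrates the idea; it is not Harvest's actual protocol or data format, and the names `build_local_index`, `Broker` and the sample pages are invented:

```python
from collections import defaultdict

def build_local_index(pages):
    """Index one server's pages locally: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

class Broker:
    """Gathers the indexes produced by servers and answers queries over them."""

    def __init__(self):
        self.index = defaultdict(set)

    def gather(self, local_index):
        for word, urls in local_index.items():
            self.index[word] |= urls

    def query(self, word):
        return sorted(self.index.get(word.lower(), set()))

# Hypothetical servers and page contents.
server_a = build_local_index({"a.example/1": "FreeBSD uptime records"})
server_b = build_local_index({"b.example/1": "Linux uptime", "b.example/2": "search engines"})

broker = Broker()
broker.gather(server_a)
broker.gather(server_b)
result = broker.query("uptime")
print(result)
```

    The point of the design is that only the compact index travels over the network, not the pages themselves.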

  59. old story by Mudhiker · · Score: 1

    This was on /. several months ago
    I'm tired of old stories being new.
    That story last week about N2H2 and Bess...
    Bess is not new, as the subject thingy said; it's been around for several years. I know, I fought it at my friend's house.

    --
    "I want peace on earth and good will toward men." "We're the U.S. government. We don't do that sort of thing!!"
  60. Will they pay to get the porn sites? by Jimhotep · · Score: 1

    How much do they have in the budget
    to pay for porn sites?

    Do they have a copy of my son's
    Final Fantasy tribute pages?

    Questions Questions Questions

  61. ~broken~, all hits not shown? by Ramses0 · · Score: 1

    http://www.alltheweb.com/cgi-bin/search?type=all&query=%22Robert+Ames%22+woodlands+-golf

    Running the above query says: "12 documents found," but it only shows results 1-10, and doesn't have a link to more results.

    Now I don't know exactly how many pages that match this criteria are *actually* out there, but it seems as though you should show all the matches that you count, unless you're padding your counts ;^)= (btw, that last claim is completely unsubstantiated, I'm just feeling mean :^)=

  62. The name must mean it's true by hatless · · Score: 1

    This must be the bestest search engine ever, because the name says it is. You can't do better than "alltheweb". Everyone else might as well pack up and go home.

    1. Re:The name must mean it's true by Hard_Code · · Score: 1

      I heard 130%oftheweb was actually fabricating new content to swell its index...heh

      --

      It's 10 PM. Do you know if you're un-American?
  63. Re:Wow - I'm famous! by Hard_Code · · Score: 2

    Just for fun I decided to search for myself on Alltheweb. To my surprise I found:

    1. The plan for an old CS group project from college, where my name was referenced!

    2. 2 broken links to ZDNet talkbacks of mine.

    3. A CNet page with a dorky little media player I wrote and released as freeware.

    4. Some random Italian site hosting Win95 software including my dorky media player with full description extracted!!

    Wow...my head is swelling...

    Hmm...it didn't find my page though...heh

    Aaron

    --

    It's 10 PM. Do you know if you're un-American?
  64. Failure to display all results. by antizeus · · Score: 1
    I did three searches on this thing:

    1. antizeus: 27 hits, and it displayed only the first 20.
    2. notopia: 17 hits, and it displayed only the first 10.
    3. "evil farmer": 33 hits, and it displayed only the first 20.
    One would think that they'd get this sort of detail worked out early on in the development process. Despite that problem, I was impressed by the thoroughness. There was some stuff there that I'd never see on other search engines.

    (by the way, "Notopia" was the name of a great radio program on KCSB that disappeared several years ago, and Evil Farmer was a great band in the Santa Barbara Calif area which also disappeared several years ago. I miss both of them. Unfortunately, antizeus is still with us.)

    --
    -- $SIGNATURE
  65. I think they won't have any problems by Anonymous Coward · · Score: 0

    No problems with storage at least: Opticom (the company which develops 170-terabyte polymer-based hard drives) owns 40% of Fast Search and Transfer, so they'll probably download the whole internet onto one of these HDs.

  66. No way!!! by Anonymous Coward · · Score: 0

    Well, I did an ego search, and they didn't even *have* my page, quite apart from showing only 9 of the 14 matches they said they had. Altavista or Northern Light is much better on this test - as far as I'm concerned, they failed!!

  67. QUESTION .... by Starr · · Score: 1

    anyone know what they intend to do about the sites that don't want to be indexed? ... is it forced indexing? ... are there any laws (???) anyone can think of that may affect this?
    -
    example of a site that doesn't want to be indexed: i know of a pagan group's site that has info for the group to view quickly without waiting for snail mail ... they are not on any search engine because they don't want to be ... i seriously doubt they want to be indexed here either
    -

    --
    if knowledge is power, the internet is god - me again
  68. all and then some? by lowsix · · Score: 1

    Ok so repeating an effort to find the various purported etymology of the word "strawberry" I searched with
    +etymology +strawberry +origin
    on both yahoo (my standard) and alltheweb.

    Yahoo found 60 while alltheweb found 117, but a number of alltheweb's finds were xxx sites!?

    How many xxx sites actually use the word etymology? And if this is what "more" means, do we really want more?