Domain: webcrawler.com
Stories and comments across the archive that link to webcrawler.com.
Comments · 43
-
Re:I agree!
Ikr? Same with HotBot, and InfoSpace and Lycos and Metacrawler and WebCrawler and Dogpile and Looksmart and so on...
I get these confused ALL THE TIME with Google!
-
Re:Antitrust violation?
No different than Siri or Google Now... You can still open a browser and search for anything using any search engine.
I suggest https://www.webcrawler.com/ or Altavista or even gopher://gopher.floodgap.com/1...
-
Re:Why?
Everyone knows http://www.webcrawler.com/ is the oldest & fastest search engine!
Pffft. Lycos FTW.
(Anyone know why a search engine named after a family of spiders is apparently now using a dog as its logo?)
-
Re:Why?
Everyone knows http://www.webcrawler.com/ is the oldest & fastest search engine!
-
Re:I'm a boiled frog
-
Re:hmm
It's not what it used to be... http://www.webcrawler.com/
-
Historical search engines
I recall the early lycos search business model -- you'd get 40 or so free searches, then a subscription was 'required' (not really, but it was supposed to be required). I can specifically recall goofing off in my IT hardware support role searching and downloading DOOM *.wad files for late night fraggage. There was no /. then, sadly, there was only DOOM and Efnet.
Altavista seemed to get replaced by google, in rather short order. I can't recall a specific reason I stopped using it, unless it was related to the repeated sale/reorg of DEC -> Compaq -> HP. I remember the news spreading about altavista hacked in '97 and '01 (the pr0n).
Maybe I'll use that webcrawler search thingy to look this stuff up. Maybe I should go back to work instead. -
There HAVE to be at least TEN ALTERNATIVES...
The Top-10 Alternatives to "I googled it" (note the lower-case 'g'):
- 10 "I AltaVista'd it" (potential ad campaign: "Hasta la vista, Google!")
- 9 "I Yahoo!'d it" (Good luck with that lawsuit; it's been in the official motto of several states for decades!)
- 8 "I Asked it" (AKA "I just axed it", since they "axed" poor Jeeves...)
- 7 "I HotBot'd it" (She's not all that hot these days...)
- 6 "I WebCrawler'd it" (Crawl being the operative word; no speed records broken here!)
- 5 "I Accoona'd it" (Possibly illegal to admit in several states)
- 4 "I Lycos'd it" (Not to be confused with "I Pecos'd it" from the 1950's...)
- 3 "I Netscaped it" (That's netscaped not netscraped)
- 2 "I AOL'd it" (Roughly analogous to "I screwed it up")
and the #1 alternative to "I googled it":
- 1 "I Dogpile'd it" (Imagine Cartman in the "red rocket" scene...)
-
Re:Kinderstart
and even Webcrawler (who has horrid URLS now)
That has to be a bug. No one sane would code a URL like those on purpose. (/.'s junk character detector won't even let me quote those -/-/-...)
-
Re:Kinderstart
What is KinderStart anyway? I searched for it, and it seems that there are plenty of results completely unrelated to the plaintiff.
The real proof in the pudding is how other engines handle it. MSN, Yahoo and even Webcrawler (who has horrid URLS now) list it as the top result. They may be gaming results (since when do kids need NASDAQ?). Despite their cheery presentation, they are a for-profit company as far as I can tell. Google may have caught them doing something fishy. From what their press release page has, they have an activity gap of four years or so, so the PageRank theories people have proposed might have weight as well. I guess we'll find out eventually. -
Re:We all know why
-
Re:Why
Keep in mind that Yahoo!'s search is fundamentally different than Google's. Google spiders the web (much like WebCrawler and MetaCrawler did, and still do). Yahoo!'s search has been based on user-submissions and moderators checking each site and adding it into the directories.
This is probably why, as you noted, Yahoo!'s results are more relevant. -
Re:What is this "Google" you speak of?
I did a WebCrawler search and can't find anything about it.
It is easy. Just go to http://webcrawler.com/ and then search for the best search engine. And take a closer look at the first search result. -
Re:Single Domino Theory Revisited
Revert back to the yellow pages, nearest Borders books store, or worse www.webcrawler.com.
-
I really like the...
I like one of Webcrawler's featured searches today: Camel Spiders.
Those things may have urban legends surrounding them or whatever, but they are GODDAMN SCARY!!! -
When did they give up....
...on their own web search technology and become a metasearch engine? From the WebCrawler About Page:
WebCrawler uses innovative metasearch technology to search the Internet's top search engines, including Google, Yahoo, Ask Jeeves, About, Teoma, FindWhat, LookSmart, and many more.
With one single click, WebCrawler searches the best results from the combined pool of the world's leading search engines -- instead of results from only one single search engine.
And WebCrawler makes it easy to refine your search so you can find the most meaningful results right away. No wonder it's a leader in the search industry.
Was it 2001? The History states:
2001 InfoSpace acquires WebCrawler. Excite, now Excite@Home, went belly up. In the bankruptcy, Infospace acquired WebCrawler. Today Infospace runs WebCrawler as a meta-search engine. And they've given Spidey a new name and turned him purple!
Oh, and if it is not being otherwise used, has the code for the WebCrawler spider been open-sourced? :)
-
They used to be my google....
I remember when webcrawler was the only search engine I touched...
In 1996 it was nice and simple. Then as the time went on it got a bit too cluttered for my liking. Now looks like they're trying to googlize themselves with the current interface. -
Back in the Day
I used to use Webcrawler exclusively for all my search and homepage needs. Then I noticed Google, which was still very much in its infancy. I switched to Google as I was still using an old 14.4 modem, and Webcrawler was becoming a bit more bloated than I liked. Even though I now have broadband I am very happy that both Webcrawler and Google have maintained a function over form attitude.
-
Re:Google has the right idea
AlltheWeb, Altavista, Ask Jeeves, Teoma, WebCrawler - all of them redirect the search results through their servers (and no doubt log it), but not Google (although it did that a couple of years back for a few weeks IIRC).
In addition to using a non-redirecting search engine, one should be blocking their cookies too, to get a little bit more privacy.
And BTW, MSN search doesn't redirect too, how scary is that 8o
Something wicked is going on in MSN land... -
Re:Not the same attack at all.
What nerds like you fail to understand is that the RIAA sued Napster the Corporation, not Napster the network protocol.
You may not remember that FTP, Usenet and IRC were rife with all kinds of pirate material before Napster came along. Most were only a Webcrawler search away (and it looks like they still are). Warezing music helped lead to the popularity of MP3 in my opinion and experience. Napster was merely a new architecture and interface.
And dude, don't insult just because you disagree. It just makes your argument sound childish and dilutes your credibility.
-
The days before Google.
I had a hard time remembering, but before Google I always used:
Metacrawler is still good sometimes when Google isn't returning completely desirable results (hey, it happens), but other than that, I didn't even know any of these searches were still active. I wonder if they all use Google software now? ;-) -
Meanwhile, outside Googleland...
I have just tried Kazaa Lite on various other search engines and meta search engines, and without fail they return at least one of the forbidden 8 sites that Google removes:
Altavista
Obviously not a comprehensive effort (I have a 3yr old son to entertain right now and that's much more important!), but it leads to the conclusion that either the complainant thinks the world revolves around Google OR the other sites haven't checked their mail yet!
As others have pointed out, the genie is out of the bottle and so semi-hiding the links is going to be pointless. I loved the written up DMCA complaint--putting the list of banned sites on it is kind of like having an English test question that says: Write down the correct spelling of following word: 'incomprehensible'?
-
Altavista?
-
Re:This Just In:
webcrawler is alive and well, but simply acts as a search aggregator, returning searches from Google, Overture, FAST, About.com, etc.
-
no way
I just use Webcrawler. I always find what I am looking for.
-
progression of search engines
at one point, webcrawler was a good search engine. There was also Yahoo and excite and altavista and all of them together, dogpile (and more, of course). Popularity has skipped from one to the next (though yahoo has been more portal than search engine, with lists and reviewed sites and such, news and stocks and groups and maps...) And one search engine to rule them. that would be google, right? google seems to be a sort of ending place, which could say something about innovation on the web. Or it could just mean that what is popular is also a Very Good Thing.
-
Usability of Search Engines
I have used many different search engine sites since I began using the internet in 1993. (I know it's not as long as some, maybe most of you) Back then I started with Webcrawler then YAHOO!. After getting easily annoyed with those, I found Altavista, which back then was actually at http://altavista.digital.com. I stuck with AV for a very long time, until I found GOOGLE... Ahhh Goooooogle. What can I say, there's nothing easier, and faster. Plus when I want to do a specific search, I love the option of adding on the /linux or /apple. It makes those 'special' searches that much easier. I'm a Googler for life... -
a binding Robots Exclusion Standard?
I know a lot of people here are very anti-regulation, but I think it would be great if case law established that web robots must obey the Robots Exclusion Standard. Since it's a widely-known standard, I think it can be fairly argued that robots that choose to disregard /robots.txt are in danger of committing trespass to chattels. Using the standard also would allow bots to fulfill their helpful role, while providing a clear distinction between what is and what is not acceptable.
Sure, one might argue that people might be unaware of the standard, but that is seldom an excuse. I may be unaware of fire/electrical codes, but I'll still get in trouble if I don't adhere to them, because I'm putting others at risk and thereby imposing a cost upon society (fire trucks and insurance don't come free). Web crawlers that index data in violation of the Robots Exclusion Standard impose a cost on companies and society just as well, in the end requiring people to buy bigger pipes, faster servers, and so on (thereby using more power, dumping more old computer components into landfills and more chipmaking chemicals into the environment).
My point is that web crawler operators live in a society, just like everyone else, and they too must be held accountable for the consequences of their actions, particularly when they willfully disregard the requests of web site operators as expressed in /robots.txt -
Metadata, URI, mirrors etc.....
Sorry for self-quotation (from the TERENA Technical Report FTP Mirror Tracker):
Unfortunately, there is still no coherent architecture for mirroring and for mirror sites to register their collections with the sites which they mirror. In fact, we lack even a common (de facto) standard for recording this replication information in a machine readable format. Some progress was made on this a few years ago by the Internet Engineering Task Force's [1] working group on Internet Anonymous FTP Archives, with the creation of the so-called IAFA Templates [2]. These provided a simple machine readable format for recording per-resource or collection metadata, which could easily be created by hand or programmatically. Although support for IAFA templates was integrated into some software packages, e.g. the ALIWEB search engine [3] and the ROADS resource discovery system [4], this approach never became successful on a large scale. The World Wide Web Consortium's Resource Description Format (RDF) [5] and the Dublin Core metadata effort [6] may eventually provide a viable machine readable interchange format.
Another attempt to create a framework for such metadata was an "Open-Software-Index" that Oliver Maruhn and I tried to create almost 2 years ago. After this document some discussion started (code name "Russian Freshmeat") that shifted mostly to localisation of such metadata. Unfortunately no working code was produced.
Currently, the database underlying the freshmeat.net weblog [7] is perhaps the closest thing we have to a genuine mirror registry - though it focuses almost exclusively on software packages and operating system distributions, and only offers limited mirror information. RDF is also being used in this capacity as part of rpmfind.net [8], although mirror information is very limited in this case too. The Internet Engineering Task Force's Uniform Resource Names effort [9] is also relevant here, since it would be very useful if there were persistent and location independent names for these collections of replicated resources.
[1] http://www.ietf.org/ Internet Engineering Task Force website
[2] http://info.webcrawler.com/mak/projects/iafa/ IAFA Working Group & IAFA Templates homepage
[3] http://aliweb.emnet.co.uk/ ALIWEB website
[4] http://roads.opensource.ac.uk/ ROADS website
[5] http://www.w3.org/RDF/ World Wide Web Consortium Resource Description Format (RDF) homepage
[6] http://purl.org/dc/ Dublin Core website
[7] http://freshmeat.net/ freshmeat.net website P. Lenz & Andover Advanced Technologies, Inc.
[8] http://rpmfind.net/ rpmfind.net website
[9] RFC 1737, Functional Requirements for Uniform Resource Names, K. Sollins & L. Masinter, December 1994
And at the end, something somewhat less relevant to the topic.
This kind of metadata should be extremely valuable for the implementation of URIs and particularly for the I2C(s) (URI to URC). Quote from RFC 2483:
"Uniform Resource Characteristics are descriptions of resources. This request allows the client to obtain a description of the resource identified by a URI, as opposed to the resource itself or simply the resource's URLs. The description might be a bibliographic citation, a digital signature, or a revision history. This memo does not specify the content of any response to a URC request. That content is expected to vary from one server to another."
Hopefully we already have a mechanism for the I2L(s) (FTP Mirror Tracker). -
Re:robots.txt
I know not everyone knows how Search Engines work, and mostly you don't need to know. Everyone who has a page on the web should read this though: "A Standard for Robot Exclusion". It's been a standard since 30 June 1994 and that's not bad for an Internet standard.
I assure you that Google.com follows it to the letter. All the main SEs do.. if they didn't they might even be leaving themselves open to legal challenges. Read the old mailing lists at Webcrawler (search for "robots.txt" on google) and you'll see that people used to get quite wound up by rude SEs back in 94. A Web server's CPU time was worth something then.
As for all the lone gunmen out there cooking up theories...read this. Google has ALREADY sold the top links for some keywords. They don't hide it, read the FAQ on their site and you'll find the address to write to to buy listings. Maybe you should read the Demographics. You're the market being sold. Seems fair to me.
The actual search results (not the adverts) are genuine and not sold. Makes sense... consider the whole Google model (who links to you affects your ranking) and it's clear Yahoo, Disney etc will all rank very highly. Lots of links into them because they are quality sites.
I've done a lot of work with SEs over the years and Google is far more genuine than anyone else in the market, but they have to make ends meet.
Take a look at this also. Can we spot the paid for listings yet?
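To make the "who links to you affects your ranking" idea concrete, here is a minimal sketch of a link-based score in the PageRank style. It is a simplified illustration and not Google's actual algorithm: the page names, the damping value of 0.85, and the function name `link_rank` are all assumptions made for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by inbound links; links maps a page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "portal".
LINKS = {
    "a": ["portal"],
    "b": ["portal"],
    "c": ["portal", "a"],
    "portal": [],
}
ranks = link_rank(LINKS)
best = max(ranks, key=ranks.get)
print(best)
```

In this toy graph the page with the most inbound links ends up with the highest score, which is the effect the comment describes for heavily linked sites.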
-
Robot Exclusion Protocol
The robot exclusion protocol (http://info.webcrawler.com/mak/projects/robots/norobots.html) is a way for websites to tell robots what they shouldn't be crawling. When a robot wants to crawl http://foo.bar.com/ it will first fetch http://foo.bar.com/robots.txt. If that file does NOT exist, that is taken to mean implicit permission to crawl anything it can find on that site. If it does exist, then the patterns contained in it are used to restrict what portions of that site are crawled. Every site has its own robots.txt (or lack thereof). To look at Yahoo's robots.txt, just point your browser to http://www.yahoo.com/robots.txt.
If a site has a robots.txt that is telling the robots not to crawl, they have no business yelling at search engines when their pages don't show up.
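The decision described above can be sketched with Python's standard `urllib.robotparser`. This is only a rough illustration: the helper names `robots_url` and `may_crawl`, the bot name `ExampleBot`, and the sample rules are assumptions for the example, and the network fetch itself is left out so the sketch runs offline.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_crawl(robots_txt: Optional[str], user_agent: str, url: str) -> bool:
    """Decide whether user_agent may fetch url.

    robots_txt is the body of the site's robots.txt, or None when the
    file does not exist, which is treated as implicit permission.
    """
    if robots_txt is None:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for the example host foo.bar.com.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(robots_url("http://foo.bar.com/docs/page.html"))
print(may_crawl(None, "ExampleBot", "http://foo.bar.com/anything.html"))
print(may_crawl(EXAMPLE_RULES, "ExampleBot", "http://foo.bar.com/private/a.html"))
```

A real crawler would fetch the URL returned by `robots_url` first and pass `None` to `may_crawl` when the server answers that the file is not there.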
-
Don't know what robots.txt is?
If you don't know what robots.txt is, look at A Method for Web Robots Control Internet RFC...
-
Re:Does it work recursively?
What *about* those search engines? Based on this logic, shouldn't each of those also be liable? And what impact will that have? Will we have search engines doing censorship based, not even on ethical grounds, but on US litigation?
Since the major defining issue with the court seems to be 'intent', rather than content (copyleft but not new york times), doesn't this give license for selective prosecution of a law? I consider this highly problematic.
Searches currently function from:
...with varying levels of accuracy. Line up your lawyers now.
---
"The Constitution...is not a suicide pact." -
SCOUR = A Mixed Blessing
A little off topic, but this should be of interest to some.
Hosting a website on a Linux box running on an SDSL connection is the cheap way to go, for around $275 you get 1.1mbps (best deal in the present area) and unlimited traffic. This is critical when you run an independent site which transfers large amounts of data (75GB monthly).
In steps Scour, unfortunately they index your site and suddenly a mass of people are downloading mpegs (and only mpegs) from you. Nobody is actually visiting you, nor reading your content, and that normally smooth 1.1mbps connection is choking.
At the time there was no automated way to easily remove the site from Scour, nor could I find what the Scour robot was named. Most search engines' robots are listed on one FAQ or another, so it's easy to set your robots.txt file for them not to index your site. You end up shooting them a frantic email asking to be removed ASAP (since you're experiencing a DDOS for all purposes) and parsing through your server log to try and find that pesky robot's name.
The heart of the matter is that while Scour may be one stop shopping for everyone it's a hidden pitfall for websites - people download anything you have up, but never actually visit your page or make an impression on that ad counter.
Here is a FAQ on search engine robots for those interested. The name of Scour.net's robot is: "SCOUR"
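For anyone stuck digging a robot's name out of a server log, a small sketch like the following may help. It assumes the log is in the common "combined" format, where the User-Agent is the last quoted field; the sample lines, addresses, and paths are made up for illustration, and only the robot name "SCOUR" comes from the comment.

```python
import re
from collections import Counter

# In the combined log format the last quoted field is the User-Agent.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Count requests per User-Agent string."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2000:00:00:01 +0000] "GET /clips/a.mpg HTTP/1.0" 200 1048576 "-" "SCOUR"',
    '10.0.0.1 - - [01/Jan/2000:00:00:02 +0000] "GET /clips/b.mpg HTTP/1.0" 200 2097152 "-" "SCOUR"',
    '10.0.0.2 - - [01/Jan/2000:00:00:03 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.0"',
    '10.0.0.1 - - [01/Jan/2000:00:00:04 +0000] "GET /clips/c.mpg HTTP/1.0" 200 3145728 "-" "SCOUR"',
]

counts = count_user_agents(SAMPLE_LOG)
for agent, hits in counts.most_common():
    print(hits, agent)
```

Once the name is known, it can go into a `User-agent:` line in robots.txt, assuming the robot honors the file.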
Andrew Borntreger -
robots.txt
I would have some sympathy with Bidders Edge but they don't follow the robots.txt file on eBay.
Here is http://search.ebay.com/robots.txt:
# robots.txt for eBay
User-agent: *
Disallow: /aw/listings/
Disallow: /aw-cgi/
Disallow: /aw-secure/
Disallow: /cgi-bin/
It isn't like eBay is disallowing access to everything; crawlers are allowed to index anything on www.ebay.com (no robots.txt) and whatever is not excluded on search.ebay.com. IMO, whether the judge knows it or not, he is upholding a standard, and that is a good thing.
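As a quick check of what those rules do and do not block, the eBay file quoted in this comment can be fed to Python's standard `urllib.robotparser`. The bot name `ExampleBot` and the specific page paths below are hypothetical; only the rules themselves come from the comment.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted in the comment.
EBAY_RULES = """\
# robots.txt for eBay
User-agent: *
Disallow: /aw/listings/
Disallow: /aw-cgi/
Disallow: /aw-secure/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(EBAY_RULES.splitlines())


def allowed(path: str) -> bool:
    """True if a well-behaved robot may fetch this path on search.ebay.com."""
    return parser.can_fetch("ExampleBot", "http://search.ebay.com" + path)


# Hypothetical paths: three under excluded prefixes, one that is not.
for path in ["/aw/listings/list.html", "/aw-cgi/anything", "/cgi-bin/anything", "/help/index.html"]:
    print(path, allowed(path))
```

Only the listed prefixes are off limits; everything else on the host stays open to robots that follow the file.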
Citrix -
my answer to Jamie: "Girls cout cookies"
If you search for "girls cout cookies" (a not implausible typo) on webcrawler (an especially crappy search engine), then on the third results page you'll find a link to http://www.xxxtrem.com/index.htm, which is chock full of naked people.
We're not talking about using useful search engines like google, here. Most search engines will throw all sorts of random results at you regardless of input. -
Re: Scanners -- Digimarc Respects robots.txt
One of Digimarc's services they offer is you pay them some money and they report any use of your image they found on the web. By keeping an eye on my logs, I've noticed their crawlers perusing my server several times. Though all of the images on my site are mine (MINE MINE MINE!), I still don't like this idea.
I had the same problem. My web site has several hundred pictures (all mine) and it was common to have Digimarc transfer several hundred meg over the course of a week. Since I pay for bandwidth, this was a Bad Thing. I contacted Digimarc and was pleased with their answer. As with all good search engines, they respect the Robot Exclusion Standard. If you tell them not to index your site, they won't. Of course, that's good for me. If I was paying for the Digimarc service, however, I'd be very upset. The logical end would be that if I was stealing graphics, I'd simply ask the Digimarc robot not to visit. I'm not sure if Digimarc's customers have ever thought about that. I'm sure Digimarc has but, of course, they aren't going to say it too loudly. Even if Digimarc didn't respect robots.txt, you could always block them at the http or tcp/ip level.
InitZero
-
robots.txt
There is an informal but generally accepted standard you should take a look at called "A Standard for Robot Exclusion"
Take a look at http://info.webcrawler.com/mak/projects/robots/robots.html and http://info.webcrawler.com/mak/projects/robots/norobots.html
This does not address copyright issues, which have become even murkier with the recent revisions to the copyright law restricting fair use.
You should also take a look at the XML syndication format (aka RSS [RDF Site Summary]). It's based on RDF and is becoming supported by a lot of larger news sites, even /. Here are some links: http://www.edventure.com/release1/abstracts/syndication.html for background info, http://www.w3.org/RDF/ for the low level info, and http://my.netscape.com/publish/help/quickstart.html for the RSS implementation. -
Scooter is Altavista - What robots.txt is
which would explain why it's resolving to DEC as well. Architext is excite. I assume mozilla is for netscape's search. The bots check for instructions in robots.txt first, then look at the head tags.
check out the robots exclusion protocol.
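The "head tags" step can be sketched as well. This assumes the tags in question are `<meta name="robots" content="...">` elements, which is an interpretation rather than something the comment spells out; the class name, function name, and sample page are invented for the example, and the parser is Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_meta_directives(html: str) -> set:
    """Return the lower-cased robots directives found in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page that asks robots not to index it or follow its links.
PAGE = """<html><head>
<title>Example</title>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
</head><body>hello</body></html>"""

directives = robots_meta_directives(PAGE)
print(sorted(directives))
```

A robot following this order would consult robots.txt before fetching the page at all, and only then look at the page's own directives.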