robotstxt.org · Domains · Slashdot Mirror

Excuse me but... by hacker · 2005-07-13 01:08 · Score: 3, Insightful · on The Internet Archive Sued Over Stored Pages

First and foremost, the existence of a robots.txt does not constitute a contract between the client (a web surfer/browser agent) and the server (the site hosting the content proper). Repeat that over and over. There is nothing stating that the robots.txt on your server must be requested by my crawler or spider.

It's preferred, but not required. Even so, I am free to ignore it if I want, and parse whatever links I see fit to grab. If you make the content public and I want to read that content, I'm going to get it, whether you have robots.txt in place or not.

Secondly, has anyone taken the time to validate the robots.txt file found on the site in question? Note too that they just changed robots.txt on July 8th of this year. Did the previous version validate? Are they trying to rewrite history again? What did the old version look like?

If there is even so much as one error, robots/crawlers are free to ignore/parse/merge/break it as they see fit. It happens all the time, and even when robots.txt is perfectly valid, many robots and crawlers ignore it anyway (msnbot and Yahoo's crawlers are two of the worst offenders here).

But back to the first point, robots.txt is a guideline, not a rule, not a contract, and certainly not something that can be enforced. Does lack of a robots.txt file constitute the legal right to publicly redistribute the content? Or store it for later review and retrieval? How do you know any of your former employees from 1996 haven't stored your entire website on floppy, one page at a time? Did they adhere to robots.txt? Did ANYONE adhere to robots.txt in 1996? It seems that there was evaluation of the Robots Exclusion Standard in 1996, but was everyone using it? Not likely.

Microsoft Internet Explorer will certainly store the entire website for "reading offline" if you ask it to do so when bookmarking it. They don't parse robots.txt to exclude pages that shouldn't be stored locally.

It's too bad that people need to try to erase history to prevail in litigation. This isn't George Orwell's 1984... well, at least not yet anyway.

Re:Serious Question by ArtStone · 2005-07-01 01:26 · Score: 1 · on Perl's Chip Salzenberg Sued, Home Raided

Your statement of fact is based on what?

As just one example, PlanetLab.org runs an entire network of open proxies.

http://codeen.cs.princeton.edu/

Because I run a web server with a database that everyone wants to "scrape", I see this kind of rogue spider every day: they hide behind proxy servers and don't read robots.txt.

I now ban every known -open- proxy IP and aggressively gather lists in order to block access.

However, I think his claim that ignoring robots.txt is "illegal" is unfounded. Civil suits are not the same thing as criminal matters.

In the same paragraph in the letter he both says:

"None of HMS's harvesting source code even mentions the ROBOTS file, let alone obeys it." and

"Yet at least one of the authors of the harvesting system did know about it, since the "RequestDistribution.txt" document in the harvester source code actually contains a reference to the W3C standard for ROBOTS.TXT."

Your honor, I would submit that one or the other of those statements is not true.

And robots.txt is -not- a W3C standard. It is a voluntary agreement among web spider authors.

From:
http://www.robotstxt.org/wc/norobots.html

"It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots."

In fact, the W3C -explicitly- says on their web site that robots.txt is not a compulsory standard:

http://validator.w3.org/docs/checklink#bot

"Note that /robots.txt rules affect only user agents that honor it; it is not a generic method for access control."

If you're going to accuse your employer of illegal activity, you sure better have your facts right.

Re:What an idiot by (negative+video) · 2005-06-30 16:55 · Score: 1 · on Perl's Chip Salzenberg Sued, Home Raided

The cause of action here is not breach of contract but trespass to chattels.

By definition, the reply to an HTTP GET request encodes the site operator's policies. The 200 OK response means they want you to have some information, which follows. The 401 Unauthorized response means that you need to try again with an HTTP auth password supplied by the site operator. The 409 Conflict response indicates that for some reason the request cannot be fulfilled; it is followed by instructions for how to resolve the problem, such as an acceptable use contract form, or a credit card payment form. (AFAIK, nobody uses 409 because 200 with instructions and a form works just as well; browsers don't show the response code so nobody knows the difference.)

There is even a 503 Service Unavailable response, which lets you tell the client to piss off for any amount of time you choose. If you gave a 503 with a billion second delay, it would be illegal for a user to click the reload button on their browser.
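
As a rough illustration of the 503 case above, here is a minimal Python sketch that builds the status line and the Retry-After header a server could send. The function name is hypothetical and the delay is whatever the operator picks; this only shows the shape of the reply, not any particular server's API.

```python
from http import HTTPStatus


def unavailable_response(retry_after_seconds: int):
    """Build a status line and headers for a 503 reply (hypothetical helper).

    Retry-After tells the client how many seconds to wait before retrying.
    """
    status = HTTPStatus.SERVICE_UNAVAILABLE
    status_line = f"{status.value} {status.phrase}"
    headers = [("Retry-After", str(retry_after_seconds))]
    return status_line, headers
```

Whether a client actually waits that long is up to the client; the header is a request, much like robots.txt.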

If a web site operator places restrictions on the use of that site (either explicitly or through robots.txt files), and those are violated, then a trespass to chattels cause of action can be found.

And my point is that the restrictions are expressed by the site operator to the user by the status code of each HTTP reply. If the operator says 200 OK, the contents they want you to see follow.

ROBOTS.TXT is just a voluntary gentleman's convention amongst certain bot authors. It postdates HTTP and does not modify the meaning of that standard. It is not authoritative; there is not even any way to obtain a legally-definitive copy of it. The authority statement from the Robots Exclusion Standard comes right out and says as much:

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.

Any judge who finds that "violating" ROBOTS.TXT is a crime (pardon my French) is an asshole.

Re:Spammers killing Google by Shaper_pmp · 2005-06-16 01:58 · Score: 1 · on Google's Site Ranking Secrets

Nah, it'd still be too easy to just regexp for UserAgents that match /google/ and /bot/.
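
A minimal sketch of that kind of User-Agent regexp check, in Python. The function name and pattern are assumptions for illustration, and a check like this only catches crawlers that identify themselves honestly.

```python
import re

# Hypothetical pattern: flag any User-Agent that mentions "google" or "bot".
CRAWLER_PATTERN = re.compile(r"google|bot", re.IGNORECASE)


def looks_like_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header matches the crawler pattern."""
    return bool(CRAWLER_PATTERN.search(user_agent))
```

A crawler that sends a browser's User-Agent string passes straight through, which is the scenario the rest of this comment is about.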

It might be against the Robots Exclusion Standards to deliberately fake your UserAgent header, but that's mostly so you can contact the robot's owner if it goes wrong and accidentally DOSes your site.

I doubt severely anyone would mind if Google did an occasional, low-impact, slow, back-up crawl disguised as IE (presumably also from an IP address block not known to belong to Google), especially since GoogleBot has only ever been well-behaved (at least as far as my sites' logs indicate)...

Re:Anonymous posting reveals a lack of integrity. by Anonymous Coward · 2005-04-08 03:42 · Score: 0 · on EFF Guide To Blogging Anonymously

if you host your own blog (and I know this has been mentioned several times earlier, as well as IN-TFA), you can use a robots.txt file in your root www directory, or place a robots META tag in your site/blog pages/template. Google even has an faq on the subject.

Re:One important fact left out of the article... by next_permutation · 2005-03-08 08:12 · Score: 1 · on Is Google Breaking Their Own Rules?

Actually, spiders would be perfectly justified in ignoring that robots.txt file, since it does not follow the format in the robot exclusion standard. There's no agreement about what a search engine spider is supposed to do when it encounters an invalid robots.txt such as this.

Re:Investigative services by IO+ERROR · 2004-12-26 05:33 · Score: 1 · on What's Next For Google?

Cyveillance got themselves permanently blocked from any Web site I ever touch, for not obeying /robots.txt, making their requests way too fast, and pretending to be MSIE when it's obvious it's a robot.

Google wouldn't dare explicitly move into this area, as it would kill whatever good karma they still have after going public. If they started selling data on who was searching for what, people would stop searching with them and start blocking their robots. It just wouldn't work.

The Robot Threat by D_Lehman(at)ISPAN.or · 2004-12-21 10:51 · Score: 2, Informative · on Net Worm Uses Google to Spread

Robots aren't bad, they help people find things, and get them to your site. However, if you would rather keep them away from you, consider using your robots.txt http://www.robotstxt.org/ along with meta tags on pages. You can also set certain content to be filtered out by looking at the connecting agent. Things you should consider filtering out would be admin links/pages, version numbers (often in the footer of pages), and files that aren't related to content. There's no reason for Google to know what your login pages look like, for instance.

If I've said it once, I've said it 1000 times. When you secure the old tech first, you find fewer problems with the new tech. robots.txt, .htaccess, proper chmod/chown... these are the things that can prevent a new bug from being a really bad new bug.

Re:But...? by Coleco · 2004-11-11 08:11 · Score: 1 · on Is Microsoft Crawling Google?

True but their servers are their property and they're running a business. If they don't want everyone mirroring their results that's their right. Most people *want to* be indexed by google, and if they don't want to, they don't have to be.

Re:THE bot? by AndroidCat · 2004-11-10 01:28 · Score: 1 · on Microsoft To Launch Homegrown Search Engine

According to reports, it does obey /robots.txt and the meta tags. (It only reads /robots.txt once a day.) Are you using an exclusion tag of "msnbot"?

Re:About time by AndroidCat · 2004-11-10 01:11 · Score: 3, Informative · on Microsoft To Launch Homegrown Search Engine

It looks like it checks for meta tags too. (Useful when /robots.txt isn't convenient.) MSNBot page and other info. Also note that it only checks /robots.txt once a day, so policy changes might not take effect right away.

Re:About time by AndroidCat · 2004-11-10 00:48 · Score: 2, Insightful · on Microsoft To Launch Homegrown Search Engine

Blocked it with what? Is it playing nice with robots.txt and meta-tags, or did you have to get rough?

Re:MSN's new search will be HUGE... by tiny69 · 2004-07-10 08:00 · Score: 2, Informative · on Microsoft Employee Allegedly Hacked AltaVista

The FAQ for the MSNBot. Of particular interest:

How do I prevent MSNBot from crawling some or all of my website?
The robots.txt file is used to prevent web crawlers from accessing a web site. The format of the robots.txt file is specified in The Robot Exclusion Standard. MSNBot analyzes all instances where the User-Agent is specified as either "msnbot" or "*". Based on this, MSNBot crawls only the web pages that allow it to do so.

Re:MSN's new search will be HUGE... by novakreo · 2004-07-10 02:54 · Score: 1 · on Microsoft Employee Allegedly Hacked AltaVista

A certain site I help run has shown what many other people are seeing: MSN's search robot is absolutely going crazy lately. It purposely retrieves files of all kinds - it's done about 4.5GB of traffic on my site because it's downloading large videos! What's a search engine going to do with all these videos?

If it bothers you so, why don't you use /robots.txt to keep it out? (and if the MSN robot ignores this, there's a story for /. in itself)

It's already been invented. by herrvinny · 2004-06-07 06:12 · Score: 3, Informative · on Webmasters Pounce On Wiki Sandboxes

The Robots Exclusion Protocol (i.e. robots.txt).
Here's Google's stance on the subject (boils down to: if you don't want it indexed, put in a damn robots.txt file).
Hell, even Google News uses robots.txt.

Re:Hmm. slashdot's robots.txt by SocialWorm · 2004-05-10 12:27 · Score: 1 · on How To Get Googled, By Hook Or By Crook

"Disallow: " with nothing after doesn't disallow everything, but rather disallows nothing. I'm pretty sure it says this somewhere right on the Google site itself, but after much searching, I was only able to find http://www.searchengineworld.com/robots/robots_tutorial.htm and http://www.robotstxt.org/wc/norobots.html

That, and the adsense thing that someone already mentioned.
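
The empty "Disallow:" point can be checked against one concrete parser. This is a small sketch using Python's standard urllib.robotparser; the agent name, path, and helper function are made up for the example, and other crawlers' parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser

ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"    # empty value: nothing is disallowed
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"  # "/" disallows the whole site


def can_fetch_with(robots_txt, agent="AnyBot", path="/index.html"):
    """Parse robots.txt text and ask whether agent may fetch path (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

With this parser, can_fetch_with(ALLOW_EVERYTHING) comes back true and can_fetch_with(BLOCK_EVERYTHING) comes back false.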

Re:Searching from the server's perspective by Technonotice_Dom · 2004-04-18 07:59 · Score: 1 · on How to Build a Search Engine

robotstxt.org operate a web robots database which is fairly comprehensive - not sure how current it is though.

Re:Blocking it? by Anonymous Coward · 2004-03-20 21:51 · Score: 0 · on MSN Rolling Out New Search Engine In July

A robots.txt should be able to take care of it: http://www.robotstxt.org/wc/robots.html

Re:Google cache by Thuktun · 2004-03-12 07:55 · Score: 1 · on Making IE Standards Compliant

And either way, the choice should really be up to the web site owner. I'm sure most would prefer that people see their content versus having their server crushed, but you never know until you ask.

This sounds like something analogous to the Robot Exclusion Protocol. Perhaps a Slashdot Impact Avoidance Protocol? Allow the site maintainer to direct which resources should be cached to avoid massive bandwidth spikes by sites like Slashdot and which should not.

Re:using google's power to discredit phantom by tepples · 2004-02-20 05:37 · Score: 1 · on Infinium Labs Threatens Gaming News Site

I'm surprised they haven't sent legal letters asking Google to remove them from cache

You don't need a legal letter to do that. All you have to do is put the magic words in /robots.txt or in a <meta> element, and well-behaved robots such as Googlebot will remove the site on the next crawl.

Re:Even if it works, it might not. by Mancide · 2004-01-31 09:20 · Score: 2, Informative · on Throttle Apache Bandwidth Based on IP Address?

Also, wget will listen to robots.txt, just specify what is allowed and what isn't allowed for wget to grab. Granted, this can be circumvented, but it should help with most of the users who are not smart enough to get around it.

This link should help you out.

Re:A matter of public record by mlush · 2003-11-11 03:17 · Score: 1 · on Memory Holes and the Internet (updated)

Once you've published something on the internet, it's very hard to remove it. There are too many 'bots beavering away in the background. If I do a search for my name on google, I get info going all the way back to my post-grad days at college some 12 years ago....

However, most of these bots honor robots.txt, so it's pretty trivial to keep whole areas of one's website out of the offsite archives (like the Whitehouse does). Also, I'm only aware of one archive bot (The Wayback machine) which keeps permanent records (Google does not keep the cached pages for very long), and that honors robots.txt.

If I do a search for my name on google, I get info going all the way back to my post-grad days at college some 12 years ago....

I bet all the data is still where it was put and not permanently mirrored elsewhere (aside from the wayback machine). Take that content offline and in a year or so you'll fade from view, only preserved in the Wayback Machine, and to use that you need to know the original URL.

The only real way to get rid of something is to pull it quickly.. leave it around and you've no chance......

Cover most of the site with robots.txt and you will stay out of the public indexes. The only danger is someone who deliberately sets out to mirror a site ignoring robots.txt.

Re:Be careful for what you wish for by r_cerq · 2003-11-01 17:05 · Score: 2, Informative · on Will Google Become Another Netscape?

Assumption is the mother of all fuckups...
Read this, search for "complete access"

Re:Javascript mailto links... vulnerable? by Specialist2k · 2003-10-03 00:30 · Score: 4, Informative · on How are You Preventing Mailto-Link Harvesting?

There are e-mail harvesting bots which use the Microsoft HTML ActiveX control, so they can and will execute any JavaScript present on the page.

Wait... this provides some nice opportunities to cause them a major headache by including malicious JavaScript code on a page only seen by a bot not following the robots exclusion protocol (to prevent a "real" search engine spider from visiting the page) by linking to that page using some hidden link from your home page...

Re:Are you kidding? by SimplexO · 2003-09-19 12:44 · Score: 5, Interesting · on P2P Filesharing vs. The Web

I point everyone to NameProtect. Their NPBot hit my page a couple of times before I told it not to. Basically, it scours your website and looks for songs. It then collects the links (not the music) and tries to get a bounty from the artist (?) by showing you that someone is sharing their music. Its other business model is that it can be contracted to find your music on websites.

from robots.txt:

User-agent: NPBot
Disallow: /
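
To see what those two lines ask for, here is a short sketch that feeds them to Python's standard urllib.robotparser. The sample path and the second agent name are made up for illustration; whether NPBot itself honored the file is a separate question this cannot answer.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted above.
ROBOTS_TXT = """\
User-agent: NPBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved NPBot would be told to stay out of the whole site...
npbot_allowed = parser.can_fetch("NPBot", "/music/song.mp3")
# ...while agents not named in the file are unaffected.
other_allowed = parser.can_fetch("Googlebot", "/music/song.mp3")
```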

Re:Search on msdn.microsoft.com by Phroggy · 2003-09-19 08:28 · Score: 1 · on Microsoft Works on Search Capabilities

More info on robots.txt

Microsoft would block Google by simply adding a line to a file that requests Google not index their site. Google, being a respectable company, would honor that request (it's automated). It's a request, not a block.

Re:Those still aren't going to show up... by Carlos+Laviola · 2003-09-06 18:37 · Score: 1 · on Google Removes Kazaa Links, Keeps Sponsored Links

Well, here we go.

The grandparent post was from an AC. I fail to see the whoring.

Google respects
Slashdot's robots.txt. If you've never heard of the robots.txt standard, consult the former link.

Re:robots.txt by innate · 2003-07-31 06:18 · Score: 2, Insightful · on Googling Your Way Into Hacking

Actually, that's pretty good, since the Standard for Robot Exclusion was proposed in 1994. I'd say IBM "understood" it several years before most people did.

Re:Google's cache copy - the larger issue by elemental23 · 2003-07-14 18:37 · Score: 1 · on Web Caching: Google vs. The New York Times

On some file types, such as .txt files, there's no place to insert a "noarchive" and Google goes ahead and caches it anyway.

That's why god created robots exclusion standard (eg. robots.txt).

robots.txt by Phroggy · 2003-06-09 14:31 · Score: 1 · on Inappropriate Spam Reaching Children?

You might also want to look into putting a "ROBOTS.TXT" file on the website. Google "robots.txt" for more information on how to do that.

robotstxt.org - no Googling required. :-)

While they're fixing ROBOTS.TXT by SgtChaireBourne · 2003-04-25 03:12 · Score: 1 · on Slashback: Vaidhyanathan, Oregon, Opteron

While they're fixing Grub's problem with ROBOTS.TXT, they should also honor the robots META tag

Re:Old Glory Robot Insurance by Anonymous Coward · 2003-03-25 02:43 · Score: 0 · on Robots!

I think you are missing something:

Yes, robots.txt tells search engines how to spider your site - typically telling them which areas not to bother with (eg, if you had stock quotes, weather, etc).

The presence alone of this file does not signal to search engines to ignore the site. The search engine is supposed to look at the file and respect the 'disallow' entries. If you "Disallow: /" then, yes - you are telling search engines to piss off.

His file doesn't have any of that

Re:That's because it works by kasperd · 2003-03-20 08:06 · Score: 1 · on How Google Grows...and Grows...and Grows

Are we at the point yet where we declare Google a monopoly and start rooting for a competing search engine just because?

If you want to make a competitive alternative to google, you must violate The Robots Exclusion Protocol. Why? Because you will find a lot of robots.txt files on the net that allow googlebot to index more pages than other robots. You could of course insist that this is unfair and program your own robot in a way that will make it download anything allowed to google, but that would be a violation. The best plan if you really want to get a lot of pages in spite of webservers misbehaving and being unfair would be to make three different crawlers each running on different sets of computers:

A well behaved crawler identifying itself correctly and downloading only what it is allowed to.
A crawler that identifies itself as googlebot and downloads anything allowed to google.
A crawler that identifies itself as IE and downloads anything allowed to one of the major crawlers or by the * matching. This one shouldn't download robots.txt but just use the version downloaded by the other crawlers.

Re:Wow. That's stupid. by blowdart · 2003-02-28 03:17 · Score: 1 · on BSA Accuses OpenOffice Mirrors

When did robots.txt start applying to ftp?

Re:ou are not supposed to understand google? by epsalon · 2003-02-18 07:19 · Score: 2, Informative · on Should you Fear Google?

Did you check your robots.txt file?

Re:It raises another question. . . by more+fool+you · 2003-02-10 14:34 · Score: 1 · on Why Do Google Hit Numbers Vary?

in your robots.txt make sure that googlebot has access to the /fridge & /tv.

Banning vs. Blocking by billstewart · 2003-02-07 10:24 · Score: 3, Insightful · on Websites Complaining About Screen-Scraping

All sorts of people who don't understand the web or the Internet keep trying to get rules made or bring lawsuits or abuse the DMCA in novel ways because they don't like how their data is being used. In most cases, this is way out of line (as opposed to mildly out of line) because they can simply set their web server not to respond to requests they don't like.

A classic instance is the "deep linking" cases, where somebody doesn't want to let you see their deep pages except by coming through their front page. Rather than taking this to court, as several content providers have done, and beating up on users one at a time, it's much simpler to check the HTTP-REFERER to find out what page the request came from, and send an appropriate response page to any request that doesn't come from one of their other pages. (Whether that's a 404 or a redirect to the front page or a login screen or whatever depends on the circumstances.)
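
A minimal Python sketch of that referer check, assuming the server hands you the raw Referer header value. The hostnames and function name are hypothetical, and since the header is supplied by the client it can be forged or left out entirely.

```python
from urllib.parse import urlparse

# Hypothetical hostnames that count as "our own pages".
OWN_HOSTS = {"www.example.com", "example.com"}


def came_from_own_site(referer):
    """Return True if the Referer header names a page on one of our hosts."""
    if not referer:
        return False  # no Referer sent: treat as an outside request
    return urlparse(referer).hostname in OWN_HOSTS
```

What to send when this returns False (a 404, a redirect to the front page, a login screen) is the policy decision described above.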

Screen scrapers are an interesting case for a couple of reasons. One of them is that blind people often use them to feed text-to-speech browsers, so banning them is Extremely Politically Incorrect, as well as rude and stupid. Another is that anybody with a Print-Screen program on their PC can screen-scrape - you're only affecting whether they get ugly bitmaps or friendlier HTML objects. So you not only have to ban custom-tailored CPAN objects, you have to get Microsoft and Linus to break the screen-grabbers in their operating systems.

The related question "ok, so how *do* I detect and block http requests I don't like?" is left as an exercise to the blocker (and to the people who build workarounds to the blocks, and the people who also block those workarounds, etc...) The classic answers are things like cookies (widely supported "need the cookie to see the page" features seem to be available), ugly URLs that are either time-decaying or dependent on the requester's IP address, etc., or just checking the browser to see which lies it's telling about what kind of browser it is. There's also the robots.txt convention for politely requesting robots to stay away, and Spider traps to hand entertaining things to impolite robots or overly curious humans.

Re:A good suggestion, except... by epsalon · 2002-08-12 02:18 · Score: 2 · on A High-School Hacker's Notebook

It would be very nice if there was some standard machine readable mechanism to indicate, "yes, you may cache this to avoid slashdotting this site" that the site could serve

It's called robots.txt and that's what Google and archive.org use.

Re:Its an innocent article by osolemirnix · 2002-07-24 21:17 · Score: 3, Insightful · on NYT Discovers the Panopticon

Note, I don't think there is a way around this problem. The article almost seems to suggest Google should allow people the opportunity to remove listings from the index. I don't know if that is feasible, but it is a thought.

A thought others had and solved long ago:
For individual pages: <META NAME="ROBOTS" CONTENT="NOINDEX,NOARCHIVE">
And if WYSIWYG web authoring software doesn't make this feature easily accessible to its dumb users, is that Google's fault? I think not. The NOINDEX meta tag has been around longer than Google, it was already supported by Altavista even before Google existed.
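
For the curious, this is roughly how a robot might read that tag. It is a sketch using Python's standard html.parser; the class and function names are invented for the example, and real crawlers each have their own parsing rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the lowercased robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

Fed the tag shown above, this yields the two directives noindex and noarchive; a page with no robots meta tag yields an empty set.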

Along the same line, if the NYT webmaster is too dumb to know about the robots exclusion standard, they should probably fire him or get him educated. But in any case they should stop whining. The search engine operators certainly give them more than plenty of options to control the indexing/archiving of their content, even though they could simply consider it public and not care at all.

After all, do they have any control over their printed issue? Oh gosh, someone could actually collect all these printed newspapers and after 50 years come back with something the NYT said in a nasty article and would rather have forgotten!

Summary: if you publish you should expect people to read and remember. Why is this even news?

Re:bury an article for a year? by Phroggy · 2002-07-04 05:40 · Score: 2 · on Publishing Now Counts As Now

Too many engines ignore spider.txt.

That's because they're looking for robots.txt. No wonder you're having problems.

Re:Spider by I+am+Jack's+username · 2002-07-03 00:22 · Score: 1 · on Pet Bugs?

> We have a spider that comes crawling around our cubes every now and then. We don't kill him, figuring he helps keep the other bug populations inside down. We call him our little web developer

...and if you don't want him to crawl on your PC you just put an ASCII art picture of robots on a nearby folder?

Archiving since September 1996 by mbauser2 · 2002-06-19 10:59 · Score: 1 · on The Wayback Machine, Friend or Foe?

I'm killing 2 quotes with one fact:

"where did they get such old copies of my websites"

and

"I know for a fact that they have pages back at least as far as 1996"

ia_archiver (the bot that collects files for the Internet Archive) was unveiled in September 1996, just a few months after the Archive was founded.

Here's a copy of the original robot announcement from 5 Sep 1996.

I love it. by gripdamage · 2002-06-19 09:45 · Score: 3, Informative · on The Wayback Machine, Friend or Foe?

What's the problem?

If you do something illegal on your website, you won't be held responsible more than once just because the data persists on the Wayback machine. If you remove the offensive material from your site, that's all you can do. The Wayback machine can deal with their own lawsuit threats. And I'm sure they'll remove material if you are the site owner and ask nicely.

As far as outdated information, anyone reading pages on the wayback machine and expecting them to be current would have to be crazy. It's an archive after all.

It's easy to opt out. Google provides instructions in their webmaster FAQ which points out "There is a standard for robot exclusion at http://www.robotstxt.org/wc/norobots.html."

And yet no robots.txt by Bill+Dimm · 2002-05-01 09:19 · Score: 1 · on "Deep Linking" Controversy Renewed in Texas

They claim that they don't want anybody linking to anything but their homepage, but they don't have a robots.txt file on their website. The robots.txt standard has been around since 1994 to give website owners a simple way of denoting that certain parts of their site should not be indexed by spiders. According to the standard, the absence of a robots.txt indicates that all robots should consider themselves welcome to access all of the pages. Why is The Dallas News calling out the lawyers when they haven't made even the most basic effort to denote that they don't want search engines indexing (and hence linking to) their articles?

Re:Um YES... by AmigaAvenger · 2002-04-27 03:02 · Score: 2 · on Gamespot Goes to Subscription Model

Or maybe you would care to read it from robotstxt.org

Yes you can block robots using a robots.txt file from certain areas. MOST robots I've encountered do follow it also. (This is my job, I should know...) You will also want to specify the meta tags, but from my experience some robots don't care what you have in the meta tags due to abuse.

Re:Search for "teoma" on Teoma by J'raxis · 2002-03-31 22:25 · Score: 1 · on Teoma Aims To Kill Google

Teoma doesn't seem to even have a /robots.txt file (a standard for configuring bot exclusion), and judging by your last comment, is not honoring Google's. When searching for myself on Teoma I also noticed other search engine result pages popping up in their results. Really stupid.

Re:/. mirror - Google by gripdamage · 2002-03-17 03:41 · Score: 3, Informative · on And You Thought The Xbox Controller Was Big

Google allows a webmaster to opt out of caching. If /. also honored this system with their cache scheme I'm sure there would be little to no complaints.

From http://www.google.com/webmasters/faq.html#cached :

How do I request that Google not return cached material from my site?

Google stores many web pages in its cache to retrieve for users as a back-up in case the page's server temporarily fails. Users can access the cached version by choosing the "Cached" link on the search results page. If you do not want your content to be accessible through Google's cache, you can use the NOARCHIVE meta-tag. Place this in the <HEAD> section of your documents:

This tag will tell robots not to archive the page. Google will continue to index and follow links from the page, but will not present cached material to users. If you want to allow other robots to archive your content, but prevent Google's robots from caching, you can use the following tag:

Note that the change will occur the next time Google crawls the page containing the NOARCHIVE tag (typically at least once per month). If you want the change to take effect sooner than this, the site owner must contact us and request immediate removal of archived content. Also, the NOARCHIVE directive only controls whether the cached page is shown. To control whether the page is indexed, use the NOINDEX tag; to control whether links are followed, use the NOFOLLOW tag. See the Robots Exclusion page for more information.

Re:Maybe this isn't so bad by Tokerat · 2002-02-08 04:48 · Score: 1 · on 9th Circuit: Thumbnails Are Big Enough For Fair Use

What makes a file "below" an HTML file? I have .jpgs on my website with no html, I just send people the URL. If it has a URL, it already is top-level.

If that is the only way to access the image then it may be considered OK to do so. But if you had a page full of copyrighted images with your specific byline and (C), it's probably only legal if someone links to that.

There are dozens of ways to protect your site from unauthorized loading:

Link through CGI scripts that check the HTTP-REFERER. If the referring page isn't yours, it's a no go.
Although it may take a little "elbow grease", use the cron to run a script that changes the filenames and allows pages to reference them through SSI. No outside linking if the names keep changing!
Is your stuff only for use by users under a certain domain? Say a corporate intranet? Use .htaccess to weed out those outsiders.
Pesky Google Image Search and Metacrawler stealing all your hard work? Use a robots.txt

All in all, if I can type its URL in my "Location" bar in a browser window and get it, it's out in the open, and anyone can get at it any way they like. Unless the people doing the "illegal linking" circumvented some sort of security (including a simple HTML page with a (C) on it) to do so, I would have to agree and see no reason for legal action to be considered by any court.

Re:How should ISP's charge? by hacker · 2002-01-24 18:18 · Score: 2, Insightful · on Comcast Gunning for NAT Users

Now, many of those formerly compelling reasons have evaporated:

As the technology advances, so should the underlying reasons for applying it.

IM - is a world of divided standards, so you can only talk to AOL users if you're an AOL user, MSN if you're an MSN user, etc.

Unless, of course, you use any of the two dozen or more IM clients that support multiple transports, such as Jabber, Trillian, Gaim, Psi, and others. Each has its benefits.

email - is a world where you need to sift through 20 spam messages to find your one message. Also the monoculture of email clients created a nightmare reality of viruses.

Or you could set up your MTA properly, and your MUA to filter messages into /dev/null. ORDB is a good start to blocking SPAM. WPoison is another alternative to stopping active spam.
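
For anyone wondering what a DNS blocklist lookup of the ORDB kind involves: the connecting IP's octets are reversed, the list's zone is appended, and the resulting name is looked up in DNS. An answer generally means "listed". The sketch below shows only that convention. `dnsbl.example` is a placeholder zone, not a real list, and in practice the MTA does this lookup itself rather than a script like this.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a blocklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the blocklist zone answers for this address.

    Performs a real DNS query. Any lookup failure is treated as "not listed",
    which is a simplifying assumption for this sketch.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example")` builds the name `99.2.0.192.dnsbl.example`.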

nntp - spam is certainly a problem, as is the bulk of news services no longer carrying binaries.

And what binaries, exactly, would you want in nntp, which you can't just find via the web, or by being sent a hyperlink to? Pr0n? Warez? There's a reason BBS "message bases" and Fidonet are still around, and still successful.. no spam. Allowing people to "subscribe" to nntp servers is a good thing.

Search - pay per search, or commercially-supported search (ie - paid-for results placement).

..or you could use or write your own web robot to harvest data for you. These services aren't free, and certainly cost money. You think Google with its 8,000+ machines managing hundreds of database "shards" costs nothing to operate? Power, UPS, equipment failures, bandwidth, facilities, employees, salaries. Don't be naive.
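
If you do write your own robot, the polite thing is to consult the site's robots.txt before fetching anything. Here is a rough sketch using Python's standard `urllib.robotparser`. The robot names `harvester` and `examplebot` and the sample rules are invented for illustration, and nothing in the protocol forces a robot to run a check like this.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Invented sample: everyone is kept out of /private/, and one robot is banned outright.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""

rules = build_rules(SAMPLE)

# A generic robot may fetch public pages but not the /private/ tree.
print(rules.can_fetch("harvester", "http://www.example.com/index.html"))
print(rules.can_fetch("harvester", "http://www.example.com/private/a.html"))
# The robot that is singled out is refused everywhere.
print(rules.can_fetch("examplebot", "http://www.example.com/index.html"))
```

A real robot would download `/robots.txt` from the target host first (`RobotFileParser.set_url` plus `read` does that) and skip any URL for which `can_fetch` returns False.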

Stock Trading - find me a stock worth investing in today. It was half a function of cheap trading, but also half a function of stocks where you could actually make money.

Here's a great idea. Why not stop complaining about how badly everyone else is doing, and invent something unique and innovative, get some investors, start up a company, and make millions the old-fashioned way... earn it! You aren't "owed" a successful stock portfolio, nor do you have to own one at all.

Nobody can afford to host anymore, so people's websites are either overrun with popups or they're very small, and hosted on very slow hardware, and anyone posting material of any worth has been shut down due to copyright concerns.

Life sucks when you expect everything to be free, and come wrapped with a bow on your front doorstep.

Anything interesting or non-mainstream is either impossible to find now, or shut down.

Are you talking about P2P networks? Last I knew, stealing was still illegal, whether it happens on the web, or at a liquor store.

I recently went through my bookmarks.html list, of 500k, accumulated over the past 8 years or so - and a good 70% of the URLs were dead. Making me regret not saving the content to my local hard drive. (and I have saved a great deal anyway).

Have you had the same exact email address for 8 years? What about the same exact provider for your bandwidth? Been using the same power company for 8 years? Please be realistic. People move, servers move, services consolidate. That's what evolution is all about.

Free Music - the age of napster is finished.

Actually, no. Napster was allowing the redistribution of copyrighted content. While I fully side with Courtney Love's statements about the RIAA and raping of artists, I also side with the law, and sending music around, cutting artists out of the sale of that music, is illegal. The RIAA only manages the "Top Five" record labels. There are literally thousands of other record labels out there, both mainstream and indie. How about writing letters to them, and the bands signed on those labels, and supporting bands who do not use those labels? Make sure to sign the letter in blue ink, not black. There are ways to get what you want, and some of them require actual work. I'm not sure you can do that though.

Free Software - I'm not talking about Free Software, I'm talking about that which the BSA is making extinct. Warez. Right or wrong, it was one major compelling reason people got onto the internet.

Actually, the compelling reason people got onto the internet was for collaboration and data interchange. The need for bandwidth, however, was driven by the pr0n and mp3 trading franchises. You're still talking about theft again. Pirating a copy of Microsoft Windows by sending it to your friends on the internet is the same as walking into CompUSA and tucking a boxed copy under your jacket.

The only compelling things left I can see are: email/im - despite the fact that they're not what they used to be, they're still very useful, but there's no need for broadband here.

Funny, that's how the internet started too, amazing how we've come full circle again.

Corporate Software websites - where you can usually get up to date drivers and updates. Most of the time, broadband isn't required.

Again, full circle. How did you get those drivers for your modem back in 1985? You dialed a bbs and downloaded them.

Free Software - If you're a Linux-head - you still need broadband for downloading those isos.

Or BSD, or shareware, or any other Free Software available out there. Again, broadband is most definitely not required. Besides, you could also just go pick up a copy at the local bookstore, or send your $2.00 to Cheapbytes or to FreeLinuxCD. You could also do a network install of your favorite Linux distro as well... even over a modem. Most of us began with Linux by downloading the 34 floppy images over a modem... one.. at.. a.. time. But we did it, and no broadband was required.

Marketing - ah yes. If you're an advertiser, the internet is your friend, and a very compelling reason to get broadband, or even a T1. That is, until everyone who has signed up for the internet in the past 3 years finally realizes that there's nothing out there for them but advertising and crap, and drop the service.

Funny, without that advertising, your cab ride would cost $10.00/mile, and your ISP would charge $40.00/month for dialup. Don't be inept. These services cost money to maintain, manage, and house. Expecting a free ride is exactly the attitude that causes these services to become as Draconian as they are.

If you think you have a better solution to these problems, how about proposing it, and actually DOING something about it? Complaining here on Slashdot is not a guarantee that things will change.
