How many people know how to program a dll? How many people are running on ms servers? There is no meta tag - if there is, it won't last past version 1. There will be no way to disable the feature from your website.
In All the President's Men, Deep Throat says to Woodward: "You're missing the overall". You guys are debating SmartTags and missing the overall. This is the long-awaited Extinguish phase of E.E.E. (Embrace, Extend, Extinguish).
This phase subdivides down into four distinct phases itself:
Phase 1: SmartTags. Off by default.
Phase 2: SmartTags. Not off by default.
Phase 3: Required usage
Phase 4: Usage tied to micropayments with no way to disable.
By phase three, every keyword on every website will be for sale. In order to stop it, there won't be meta tags; there will be server dll's only.
By phase four, you will be required to pay to use general websites.
In the process they will eliminate the dynamic duo, Linux and Apache. How? In order to stop smart tags in IE, you will need to run a dll - ie, you'll have to have a MS server. Site owners will flock to IIS boxes in mass quantity. Goodbye Apache server farms - goodbye the majority of commercial Linux boxes on the web.
This is Microsoft's Final Solution for Linux and Open source.
If you have read how restrictive most ISP AUPs are these days, you'd be stunned to realize that by reading this sentence, you are in violation of most ISP AUP/TOS's in existence. Here's a classic:
--
Use of {snip} systems or networks (willfully or negligently) in a manner that encumbers disk space, processors, bandwidth, or other system resources so as to interfere with others' normal use of services on netINS or other systems and networks ("denial of service" attack).
--
You know how broad that is? Any court could interpret that to mean anything at all. All the ISP has to do is scream 'boo hoo' and you're toast.
> Why is this page currently PR0?
Before you go asking such questions, you might want to take some time to see what Google does to rank pages. Google is 100% immune to each and every one of the tactics you mentioned. Besides, "sophisticated"? lol - webpos gold has been around since 98 and banned by most se's since 99.
How can you justify caching without prior consent?
How can you justify putting a self-branding advertisement on the cached copy? (thus making indirect monetary gain off other people's work)
Archive.org does obey robots.txt. Unfortunately, it will still crawl a site even though the robots.txt ban is there. So, you have to add them to your .htaccess IP/agent ban list.
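For anyone who would rather see the idea of an ip/agent ban list as code than as server config, here is a minimal Python sketch of the same check. Everything named in it is an assumption made for illustration: the agent token and the network (a reserved documentation range) are placeholders, not verified archive.org values, and `is_banned` is a made-up helper.

```python
import ipaddress

# Hypothetical ban list. The agent token and the network below are
# placeholders for illustration, not verified crawler values.
BANNED_AGENT_TOKENS = ("ia_archiver",)
BANNED_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)


def is_banned(remote_ip, user_agent):
    """Return True if the request matches the agent or the IP ban list."""
    agent = user_agent.lower()
    # Agent ban: any listed token appearing in the User-Agent string.
    if any(token in agent for token in BANNED_AGENT_TOKENS):
        return True
    # IP ban: the remote address falls inside a listed network.
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BANNED_NETWORKS)
```

A real setup would run this check before serving anything and answer a match with a 403. The agent half is trivially dodged by a bot that lies about who it is, which is why the ip half matters.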
Additionally, this isn't just the Wayback Machine we are dealing with - remember, there is a relationship with Alexa. You remember Alexa from their days in the cross hairs of privacy problems, right?
There are so many big questions left hanging about archive.org. I can't figure it out - can you? There is something more going on here. This isn't a normal site.
Unanswered or barely answered questions:
What is/has been archive.org doing with all that text for all these years? They haven't been sharing it publicly for any time at all.
Have they been selling data (your site) to third parties other than Alexa?
Does archive.org have contractual agreements with any governments?
Who are they feeding? Hmmm, collecting data for how long? And now just putting it online.
How are they making any money? Where's the revenue stream to fund such a mass collection?
Who is funding such a massive long-term effort?
Think about how long they have been doing this. Since 96, when a good workstation would cost a couple of years' salary. This is a massive, just massive tech investment that would probably put most of the search engines on the net to shame. Where's it coming from?
Finally, with rogue bots being the #1 problem of many sites, it is time for a robots INCLUSION standard. All bots are banned unless specifically allowed. That is a whole lot different than the deprecated, unworkable joke known as the robots.txt standard (that was never endorsed by any major net organization).
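To make the inclusion idea concrete, here is a rough Python sketch of default-deny handling. No such standard exists, so the `Allow-agent:` file format, the function names, and the bot names in the usage note are all invented for illustration.

```python
def parse_inclusion_file(text):
    """Parse a hypothetical inclusion file into a set of allowed bot names.

    Assumed format (invented for this sketch): one "Allow-agent: <name>"
    entry per line, with "#" starting a comment.
    """
    allowed = set()
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        field, _, value = line.partition(":")
        if field.strip().lower() == "allow-agent" and value.strip():
            allowed.add(value.strip().lower())
    return allowed


def bot_may_crawl(bot_name, allowed):
    """Default deny: a bot may crawl only if it is explicitly listed."""
    return bot_name.strip().lower() in allowed
```

With a file containing `Allow-agent: Googlebot`, `bot_may_crawl("Googlebot", allowed)` is true and every other bot name is refused, including every bot nobody has heard of yet. That is the whole difference from robots.txt, where anything not named is allowed.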
"Welcome to ABC's Monday Night Football. This telecast is for the sole exclusive use of our viewing audience. Any retransmission...."
Why should the web be any different? Copyright is copyright whether it is TV, MP3, or text on a Slashdot story.
/tanstaafl
>why don't the search engines just ...Cloak
>themselves to look like regular users.
They do. AltaVista and Inktomi have been known to in the past, and some suspect Google does too.
You don't think all those generic browsers coming from Exodus are real users, do you? They all use Exodus and can sniff out cloaking at a whim.
The problem?
Having spidered millions of pages, it's pretty obvious that some form of cloaking is at work on a very high percentage of top sites.
- Agent cloaking for browser support (all the se's and major sites do this).
- IP-based cloaking to feed custom languages. Sites such as Google auto redirect from cloaked setups to the local language (eg: google.com becomes google.fr in French for someone from France).
Having an intelligent robot sort through that and determine whether a page is cloaked for se promotion purposes or just user purposes is next to impossible without a brain behind the keyboard.
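Here is a small Python sketch of the two benign kinds of cloaking in that list: picking a layout from the agent and a language from the ip. The network-to-language table uses reserved documentation ranges, and the agent checks and function names are made up for illustration. It is not how any particular site does it.

```python
import ipaddress

# Hypothetical network-to-language table (reserved documentation ranges).
LANGUAGE_BY_NETWORK = {
    ipaddress.ip_network("198.51.100.0/24"): "fr",
    ipaddress.ip_network("203.0.113.0/24"): "de",
}


def pick_language(remote_ip, default="en"):
    """IP-based choice: map the visitor's address to a language."""
    address = ipaddress.ip_address(remote_ip)
    for network, language in LANGUAGE_BY_NETWORK.items():
        if address in network:
            return language
    return default


def pick_layout(user_agent):
    """Agent-based choice: map the browser to a page layout."""
    agent = user_agent.lower()
    if "lynx" in agent:
        return "text"
    if "msie" in agent:
        return "ie"
    return "standard"


def pick_variant(remote_ip, user_agent):
    """Combine both choices into one page variant name, e.g. "ie-fr"."""
    return f"{pick_layout(user_agent)}-{pick_language(remote_ip)}"
```

From the outside, a robot only ever sees one variant per request. Nothing in the response says whether the switch was made for the user's benefit or for the rankings.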
> sometimes google cached copies of pages might be informative.
Look at all the trouble Google is finding itself in with the "cache" (term used loosely since Google doesn't "cache" pages - they "page jack" them and put their own self-advertisements on them). It's a ticking time bomb.
Tick, Scientologists,
Tick, German railroads,
Tick, Ok who's next?
> cloaked pages because it comes from
> a different IP than altavista'a index bots
Alta spiders from the Babelfish ip too. They switched the ip last year just to rat out cloakers.
> You can retrieve cached pages from
> Google using their new SOAP API.
Not if the webmaster has used the NOARCHIVE tag (which itself is probably cloaked if done properly).
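For reference, the tag in question is a robots meta tag with the value noarchive. Below is a minimal Python sketch of serving it cloaked, so that only crawlers see it. How a request gets classified as a search engine is left out, and `render_head` and its parameters are made up for illustration.

```python
from html import escape

NOARCHIVE_TAG = '<meta name="robots" content="noarchive">'


def render_head(title, is_search_engine):
    """Build a page <head>, adding NOARCHIVE only for crawler requests."""
    lines = ["<head>", f"<title>{escape(title)}</title>"]
    if is_search_engine:
        # Cloaked: regular visitors never see this tag in the source.
        lines.append(NOARCHIVE_TAG)
    lines.append("</head>")
    return "\n".join(lines)
```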
> Indeed responding to the capabilities of the
> client should be considered 'best practice'.
Absolutely. Language differences, display differences, and various levels of css/dom/scripting support are all quality reasons to cloak. I have a site that delivers 8 different versions of a page based on IP and agent.
There are also cloaking programs sponsored by the search engines themselves. Inktomi's index connect, and Altavista's "trusted feed" programs encourage the cloaking of pages to protect them.
Did you know every major search engine cloaks their own site? Here's one that is agent cloaked by Google themselves: http://wap.google.com . If you don't have the right secret decoder ring, all you will see is stock Google.
--
>LibWWW
lol. No cloaker worth his salt would agent cloak for se purposes today. It's all 100% IP-based detection. Unless you are parked on a search engine IP address, you won't know what you are looking at.
With se's such as Google, Teoma, and WiseNut moving to off-the-page criteria (links and contextual themes), this whole discussion is moot.
Cloaking for search engine purposes is rather rare anymore.
The heyday of cloaking was 99, when there was so much page jacking on AltaVista. If you had a top ranking page, it was sure to be ripped off by afternoon and your rankings destroyed in the next update. In that environment, cloaking skyrocketed.
Now that Alta is a dead search engine walking, Inktomi requires fees, and all that is left is Google - it just doesn't make economic sense to cloak. Even if you can cloak, it does very little good and you really, really have to know what you are doing.
CNN=Conservative Network News
Ever notice how most of the internet stories on the tube help the parent company? Gosh, in this case, that would be AOL and its AIM/ICQ alternative to IRC.
Or how about two weeks ago on NBC's The West Wing, where they made fun of net forums (I think they snipped the word "slashdot" from the transcript).
Why does EVERY discussion of copyright on the web go anal on music and the RIAA? When is someone going to stand up for TEXT as an IT property protected by copyright?
I'm going to build a website and cache Slashdot and serve it from my site. I'll put a nice little branding advertisement at the top of every page.
Legal or not?
It's already been done on 2,073,418,204 web pages - it's called the Google cache.
>I am sure the google archive is only a few 100gb
Last year the index was reported to live on one 80 gig drive per machine (that was before the last jump from 1.3b pages to 2.0b pages). It works out to 5000-7000 pages per meg of indexed (HTML-stripped) compressed data. The "cached" pages are stored separately on Google's proprietary "big file" formatted disks (a random access file system).
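Running those numbers as a quick back-of-envelope check in Python: the page count and pages-per-meg range are the figures quoted above. The decimal conversion (1000 meg per gig) and the names are assumptions of this sketch, and the result is an estimate, not a reported size.

```python
import math

PAGES = 2_000_000_000        # the "2.0b pages" figure quoted above
PAGES_PER_MB_LOW = 5000      # low end of the quoted density
PAGES_PER_MB_HIGH = 7000     # high end of the quoted density
DRIVE_GB = 80                # the quoted drive size
MB_PER_GB = 1000             # assumption: decimal units


def index_size_gb(pages, pages_per_mb):
    """Estimated size of the stripped, compressed index in gigabytes."""
    return pages / pages_per_mb / MB_PER_GB


smallest_gb = index_size_gb(PAGES, PAGES_PER_MB_HIGH)  # densest packing
largest_gb = index_size_gb(PAGES, PAGES_PER_MB_LOW)    # loosest packing
drives_low = math.ceil(smallest_gb / DRIVE_GB)
drives_high = math.ceil(largest_gb / DRIVE_GB)

print(f"{smallest_gb:.0f}-{largest_gb:.0f} GB, "
      f"{drives_low}-{drives_high} drives of {DRIVE_GB} GB for one copy")
```

Under those assumptions one copy of the index lands somewhere between roughly 286 and 400 gig, so "a few 100gb" is about right for the index alone. The cached pages are a separate store and are not counted here.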
A known Google tech says, "Sometime in the next few days, I think we're going to put a promo line on our home page. It will say something like 'Google does not show pop-up advertising.'" That just might raise the ante.