The Problem of Search Engines and "Sekrit" Data
Nos. writes: "CNet is reporting that not only Google but other search engines are finding passwords and credit card numbers while doing their indexing. An interesting quote from the article by Google: 'We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes.'" As the article outlines, this has been a problem for a long time -- and with no easy solution in sight.
How does the Google Cache avoid legal entanglements, both for stuff like cc numbers and copyright/trademark infringement?
If I want to find lyrics to a song, the site that has them will often be down, but the cache will still have them in there. Why is what Google is doing 'okay' but what the original site is doing not okay? Or do they just leave Google alone?
Brant
Argle. Bargle.
Why should Google or any other search engine do anything to save fools from their stupidity? Putting credit card numbers online where anyone can get them is just plain idiotic. Hopefully this will get a lot of publicity along with the names of companies who do stupid things like this and most people will shape up their act.
From the article :
"Webmasters should know how to protect their files before they even start writing a Web site," wrote James Reno, chief executive of Amelia, Ohio-based ByteHosting Internet Services. "Standard Apache Password Protection handles most of the search engine problems--search engines can't crack it. Pretty much all that it does is use standard HTTP/1.0 Basic Authentication and checks the username based on the password stored in a MySQL Database."
And chief executives of a hosting company should know how Basic Authentication works before hosting web sites...
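For reference, here is a rough Python sketch of what Basic Authentication actually puts on the wire: the username and password joined by a colon and base64-encoded, which is an encoding and not encryption. Where the server keeps its credentials (MySQL or anywhere else) is not part of the protocol. The function names are made up for illustration, and the credentials are the well-known example pair from the HTTP authentication RFC.

```python
import base64
from typing import Tuple


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization: Basic ...' header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value: str) -> Tuple[str, str]:
    """Recover (username, password) from a Basic header value.

    No key is needed: anyone who can see the header can do this.
    """
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


if __name__ == "__main__":
    header = basic_auth_header("Aladdin", "open sesame")
    print(header)
    print(decode_basic_auth(header))
```

It does keep crawlers out, since they have no credentials to send, but without TLS the password is readable by anyone who can sniff the request.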
Crewd
Try the following searches on google (include the quotes) and you'll be amazed at what's out there:
/admin"
/password"
/mail"
/" +passwd
/" password.txt
"Index of
"Index of
"Index of
"Index of
"Index of
From my web logs, I see that a lot of HTTP bots don't care crap about /robots.txt. Another thing which happens is that they read robots.txt only once and cache it forever in the lifetime of accessing that site, and do not use a newer robots.txt when it's available. It'd be useful to update what a bot knows of a site's /robots.txt from time to time.
HTTP bot writers should adhere to the information in /robots.txt and restrict their access accordingly. On a lot of occasions, webmasters may set up /robots.txt to actually help stop bots from feeding on junk information which they don't require, or things which change regularly and need not be recorded.
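A well-behaved bot along those lines might look roughly like the Python sketch below. It uses the standard library's `urllib.robotparser` and re-reads robots.txt once the cached copy is older than some limit. The `RobotsCache` class, the one-day limit, and the injected `fetch_lines` callable (standing in for the real HTTP fetch) are all assumptions for illustration, not how any particular crawler works.

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed refresh interval: one day


class RobotsCache:
    """Cache one site's robots.txt rules, re-fetching when they go stale."""

    def __init__(self, fetch_lines, max_age=MAX_AGE_SECONDS, clock=time.time):
        self._fetch_lines = fetch_lines  # callable returning robots.txt lines
        self._max_age = max_age
        self._clock = clock
        self._parser = None
        self._fetched_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._parser is None or now - self._fetched_at > self._max_age:
            parser = urllib.robotparser.RobotFileParser()
            parser.parse(self._fetch_lines())
            self._parser = parser
            self._fetched_at = now

    def allowed(self, user_agent, url):
        """Return True if the current rules let user_agent fetch url."""
        self._refresh_if_stale()
        return self._parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    cache = RobotsCache(lambda: rules)
    print(cache.allowed("ExampleBot", "http://example.com/index.html"))
    print(cache.allowed("ExampleBot", "http://example.com/private/cards.txt"))
```

In a real crawler `fetch_lines` would download /robots.txt from the site; it is passed in here so the refresh logic can be tried without touching the network.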
Banu
I do not know if this is still the case, but Microsoft's IE offline browsing page crawler (collects pages for you to read offline) ignored robots.txt last time I checked. I know many other crawlers do likewise.
At any rate--scary it is.
I'm a nature photographer.
this guy's just looking for free hype for his book. if that's the kind of advice he offers, he's doing more harm than good.
This is the voice of World Control. I bring you Peace.
-Legion
It's definitely true that if there were no stupid people these things would not be an issue of controversy. However, society has struggled for a very long time to resolve the question, "Should stupid people be protected from themselves?" There will always be those who (whether they're just technologically inept or for whatever other reason) will not act sensibly and will not realize they are being foolish. Do they deserve protection as well, even though they don't know how to protect themselves? That's a question which is not quite as easy to answer....
What's in a Sig?
So maybe the fix should be in making it harder to share things on the Web, rather than trying to have search bots guess whether someone really meant to post the file?
Web servers could ship configured to not AutoIndex, only allow specific file types (.jpeg, .html, .png, .txt), and disable all those things that I disabled in Apache without losing anything I needed for my site, and so on. Then, the burden is placed on the person who started sharing these other filetypes that have sensitive data on the public internet.
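To make the "locked down by default" idea concrete, here is a small Python sketch of the decision such a server would make for each request. The `may_serve` function and its rules are hypothetical, not the behavior of Apache or any real server: the allowlist holds only the four types named above, and any path ending in "/" is treated as a request for an auto-generated listing (a real server would look for an index file first).

```python
import posixpath

# Only the file types named in the post; everything else is refused by default.
ALLOWED_EXTENSIONS = {".html", ".jpeg", ".png", ".txt"}


def may_serve(request_path: str, autoindex: bool = False) -> bool:
    """Decide whether a default-deny server would serve this path."""
    if request_path.endswith("/"):
        # Directory listings only if the admin explicitly turned them on.
        return autoindex
    name = posixpath.basename(request_path)
    if name.startswith("."):
        # Dotfiles such as .htpasswd are never served.
        return False
    extension = posixpath.splitext(name)[1].lower()
    return extension in ALLOWED_EXTENSIONS


if __name__ == "__main__":
    for path in ("/index.html", "/orders/cards.xls", "/.htpasswd", "/backup/"):
        print(path, may_serve(path))
```

Anyone who wants to share a spreadsheet of card numbers then has to go and add the type to the list on purpose, which is the whole point.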
Of course, putting something in public that you don't want someone to see is just plain stupid, but apparently we need to make stupid people feel like they're allowed on the 'net.
The new issue of "2600" all but gives a kiddie script for extracting credit card numbers from the Passport database. Scary. Don't buy anything through it until they fix it.
Google's comment was:
"The primary burden falls to the people who are incorrectly exposing this information."
This is where they should have stopped. Those who find their credit card information in a search engine will learn a lesson and use services that actually take care of their customers' security and privacy. Google shouldn't have to clean up incompetent people's mess.
In the long run, these things can only lead to the ignorant (wannabe?) players in the market slowly dying because they don't know what they are doing.
I personally hope someone gets a taste of reality here, and that only the serious players survive. The MCSE crowd may finally learn that there's more to it than blind trust in their own (lacking) ability.
Clever signature text goes here.
I'll never forget the day I first saw a .pdf in Google search result. Not that long ago I saw my first .ps.gz in a search result. I mean, how dope is that!? They're ungzipping the file, and then parsing the postscript! Soon they'll start uniso-ing images, untarring files, unrpming packages, .... You'll be able to search for text and have it found inside the README in an rpm in a Red Hat ISO.
Can't wait until images.google.com starts doing OCR on the pix they index...
It's a silly mistake, and I don't have a clue as to how Google came across the link. Like with anything new, it's going to take some time before this becomes "common sense" and people stop putting this information on public servers.
- subsolar
P.S. It's possible to generate a URL that, when clicked by somebody behind a Linksys router, enables remote administration if you know the password. I've reported it to Linksys but gotten nothing but silence from them.
filetype:htpasswd htpasswd
Scary how many
-- Azaroth
> You should be writing that type of data on the backs of envelopes and leaving them scattered around your living room...
Not much worse than some "commercial-grade" encryption...
Maybe somebody should consider suing Google under the DMCA. I haven't studied the DMCA in enough detail to be sure of this (and much less studied law, for that matter), but I guess Google is easily guilty of the following "crimes" against modern society:
- linking to decryption algorithms
- linking to reverse engineering tools
- linking to passwords that could be used to circumvent somebody's copyright.
- storing and distributing all the above (with google's cache)
As I understand current legislation, Google should not even have the right to define what is public or not like they're trying to do. Even the safe-harbour provisions do not immunize them from having to remove unlawful content.
Such a lawsuit would make for an interesting debate, and with a bit of luck could get us all rid of this stupid law.
C.