The Problem of Search Engines and "Sekrit" Data
Nos. writes: "CNet is reporting that not only Google but other search engines are finding password and credit card numbers while doing its indexing. An interesting quote from the article by Google: 'We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes.'" As the article outlines, this has been a problem for a long time -- and with no easy solution in sight.
How does the Google Cache avoid legal entanglements, both for stuff like cc numbers and copyright/trademark infringement?
If I want to find lyrics to a song, the site that has them will often be down, but the cache will still have them in there.. Why is what google is doing 'okay' but what the origional site not okay? Or do they just leave google alone?
Brant
Argle. Bargle.
The quote from that article about Google not thinking about this before the put it forward is idiotic. How can Google be responsible for documents that are in the public domain, that anyone can get to by typing a URL into a browser. It isn't insecure software, just dumb people...
D.O.U.O.S.V.A.V.V.M.
Just search for your credit card number.
By the way, does google have that realtime display of what people are searching for?
http://www.thehungersite.com
I'm a web developer, and I don't know how many times I've heard people who are just getting into the scene talking about making 'hidden' pages. I'm reffering to those that are only accessible to those who click on a very tiny area of an image map, or perhaps find that 'secret' link at the bottom of the page. Visually, these elements seem 'hidden' to a user who doesn't really understand web pages and source code. However, these 'hidden' pages look like giant 'Click Here' buttons to search engines, which is what I'm presuming some of this indexing is finding.
The search engines cannot feasibly stop this from happening, each occurance is unique unto itself. The only prevention tool is knowledge and education, and bringing to the masses a general understanding of search engine spidering theory.
Just my 2 cents.
To make a pun demonstrates the highest understanding of a language
I recently joined an angel organisation to publicise my business in an attempt to raise funds. The information provided to the organisation is supposed to be secret, and only available to members of the organisation via a paper newsletter which was reproduced in the secure area of the organisations website.
/secure directory.
/secure WAS!
A couple of months down the line a couple of search engines, when asked about 'mycompanyname' were giving the newsletter entry in the top 5.
Alongside my details were those of several other companies. Essentially laying out the essence of the respective business plans.
How did this happen? The site was put together with FP2000, and the 'secure' area was simply those files in the
I had no cause to view the website prior to this. The site has been fixed on my advice. How did this come about? No one in the organisation knew what security meant. They were told that
It didn't do any damage to myself, but a few of the other companies could have suffered if their plans were found. Its not googles job to do anything about this, its the webmasters. But a word of warning - before you agree for your info to appear on a website ask about the security measures. They mey well be crap!
Brilliant, huh? ;-)
On second thought, maybe I shouldn't post this... some PHB might actually think it's a good idea.
It is a burden, but the responsibility does not lie on a crawling engine. You could check any 10 digit number (and expdate with a lune check if available) but with all the different formatting done on CC numbers (XXXX-XXXX-XXXX-XXXX, XXXXXXXXXXXXXXXX, etc) the algorithm could get ugly to maintain.
I don't see why Google or any other search engine has to even acknowledge this problem, it's simply Someone Else's Problem. If I was paying a web team/master/monkey any money at all and found out about this, heads would roll. It seems that even thinking of pointing a finger at google is the same tactic Microsoft is doing at those "irresponsible" individuals pointing out security flaws.
If anything Google is providing them a service by telling them about the problem.
Dacels Jewelers can't be trusted.
Try the following searches on google (include the quotes) and you'll be amazed at what's out there:
/admin"
/password"
/mail"
/" +passwd
/" password.txt
"Index of
"Index of
"Index of
"Index of
"Index of
People often wonder how their "secret" sites get into web indices. Here's a scenario that's not too obvious but is quite common:
i st rator
Suppose I have a secret page, like:
http://mysite.com/cgi-bin/secret?password=admin
Suppose this page has some links on it, and someone (maybe me, maybe my manager) clicks them to go to another site (http://elsewhere.com/).
Now suppose elsewhere.com runs analog on their web logs, and posts them in a publically-accessible location. Suppose elsewhere.com's analog setup also reports the contents of the "referer" header.
Now suppose the web logs are indexed (because of this same problem, or because the logs are just linked to from their web page somewhere). Google has the link to your secret information, even though you never explicitly linked to it anywhere.
One solution is to use proper HTTP access control (as crappy as it is), or to use POST instead of GET to supply credentials (POST doesn't transfer into a URL that might be passed as a referrer). You could also use robots.txt to deny indexing of your secret stuff, though others could still find it through web logs.
Of course, I don't think credit card info should *ever* be accessible via HTTP, even if it is password protected!
I could be a rich man...
(Not, of course that I'd ever do anything like that...)
Searching with regular expressions would be cool, though...
A while back there was a thread here about the weakness of the revenue model for search engines. Maybe we have found the answer, think about all the revenue that Google could generate with this data!
Anybody knows when Google is going public?
You should not be using robots.txt to keep confidential data out of caches. In fact, most semi-intelligent crackers would actually download the robots.txt with the specific intention of finding ill-hidden sensitive data.
A big part of why this is a problem is the fact that many web servers are, by default, set up to display file listings for directories if there is no "index.html" file in the directory and the user requests a URL corresponding to that directory.
.htaccess file that prevents this (on Apache-- I'm sure IIS and others have similar config options). I like to turn off the directory listing capability if possible, and certainly assign a valid default page, even if index.html is not present.
/cgi-bin" for some real fun. ;)
Personally I like to make sure that there is an
And don't forget "index of
I do not have a signature
-Legion