The Problem of Search Engines and "Sekrit" Data
Nos. writes: "CNet is reporting that not only Google but other search engines are finding password and credit card numbers while doing its indexing. An interesting quote from the article by Google: 'We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes.'" As the article outlines, this has been a problem for a long time -- and with no easy solution in sight.
I don't see what's so hard about this problem. It's very simple... don't keep data of any kind on the web server. That's what firewalled, password/encryption protected DB servers are for.
Given this premise, the only way that Google or another search engine could find a page with credit card numbers or other 'secret' data, would be if that page was linked to from another page, and so on, leading back to a 'public' area of some web site.
That is to say, the web-indexing bots used by search engines cannot find anything that an ordinary, very patient human could not find by randomly following links.
I do not deploy Linux. Ever.
"...search engines are finding password and credit card numbers while doing its indexing."
This is very serious. Could you please post the exact search engines and query strings so I can make sure my information isn't there?
Knunov
Why do users with IDs under 100,000 or over 700,000 usually have the most worthwhile comments?
How does the Google Cache avoid legal entanglements, both for stuff like cc numbers and copyright/trademark infringement?
If I want to find lyrics to a song, the site that has them will often be down, but the cache will still have them in there. Why is what Google is doing 'okay' but what the original site is doing not okay? Or do they just leave Google alone?
Brant
Argle. Bargle.
The quote from that article about Google not thinking about this before they put it forward is idiotic. How can Google be responsible for documents that are in the public domain, that anyone can get to by typing a URL into a browser? It isn't insecure software, just dumb people...
D.O.U.O.S.V.A.V.V.M.
...obey the Robot Exclusion Standard. This is not a big secret, and is linked to by all major search engines. Anyone wishing to exclude a well-behaved robot (like those of major search engines) can place a small file on their site which controls the behaviour of the robot. Don't want a robot in a particular directory? Then set your robots.txt up correctly.
P.S. Anyone keeping credit card info in a web directory that's accessible to the outside world should really think long and hard about getting out of business on the internet.
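For anyone who wants to see how a well-behaved robot reads such a file, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the /secure/ path, the example.com URLs and the "ExampleBot" agent name are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the /secure/ path is illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secure/
"""

def allowed(url, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if a robot honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/secure/newsletter.html"))
```

This only describes what a polite crawler will skip. It is not access control, and a robot that ignores the standard will fetch the file anyway.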
Credit card numbers follow a known format (mod10). It should be simple, but somewhat intensive as far as search engines go, to scan content, look for 16 digit numeric strings, and run a mod10 on them. If it comes back true, don't put it into the index.
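A rough sketch of that idea, assuming "mod10" means the usual Luhn checksum and that only bare 16-digit runs are of interest (real card numbers can also be other lengths or contain spaces and dashes, which this ignores). The function names are my own.

```python
import re

def luhn_ok(digits):
    """Luhn (mod 10) check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Exactly 16 digits, not embedded in a longer run of digits.
CANDIDATE = re.compile(r"(?<!\d)\d{16}(?!\d)")

def find_card_like_numbers(text):
    """Return the 16-digit strings in text that pass the Luhn check."""
    return [m.group() for m in CANDIDATE.finditer(text) if luhn_ok(m.group())]

# 4111111111111111 is a widely used test number, not a real card.
print(find_card_like_numbers("order 4111111111111111 ref 1234567890123456"))
```

Roughly one in ten random 16-digit strings passes mod10, so an indexer doing this would also drop some harmless pages.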
The truth about Scientology, Xenu, and you: Operation Clambake
% cd /var/www
% cat > robots.txt
User-agent: *
Disallow: /
^D
%
Once more unto the breach, dear friends, once more, Or close the wall up with our American dead!
Please change the title of this article to:
The Problem of Incompetent System Administrators
If data is 'sekrit'/sensitive/confidential - don't put it on the web. It's as simple as that. If that data is available on the web, search engines can't be blamed for finding it.
-----------------------
Moderator's essentials
I'm a web developer, and I don't know how many times I've heard people who are just getting into the scene talking about making 'hidden' pages. I'm referring to those that are only accessible to those who click on a very tiny area of an image map, or perhaps find that 'secret' link at the bottom of the page. Visually, these elements seem 'hidden' to a user who doesn't really understand web pages and source code. However, these 'hidden' pages look like giant 'Click Here' buttons to search engines, which is what I'm presuming some of this indexing is finding.
The search engines cannot feasibly stop this from happening, because each occurrence is unique unto itself. The only prevention tool is knowledge and education, and bringing to the masses a general understanding of search engine spidering theory.
Just my 2 cents.
To make a pun demonstrates the highest understanding of a language
I recently joined an angel organisation to publicise my business in an attempt to raise funds. The information provided to the organisation is supposed to be secret, and only available to members of the organisation via a paper newsletter which was reproduced in the secure area of the organisation's website.
A couple of months down the line a couple of search engines, when asked about 'mycompanyname' were giving the newsletter entry in the top 5.
Alongside my details were those of several other companies. Essentially laying out the essence of the respective business plans.
How did this happen? The site was put together with FP2000, and the 'secure' area was simply those files in the /secure directory.
I had no cause to view the website prior to this. The site has been fixed on my advice. How did this come about? No one in the organisation knew what security meant. They were told that /secure WAS!
It didn't do any damage to myself, but a few of the other companies could have suffered if their plans were found. It's not Google's job to do anything about this, it's the webmaster's. But a word of warning - before you agree for your info to appear on a website, ask about the security measures. They may well be crap!
Brilliant, huh? ;-)
On second thought, maybe I shouldn't post this... some PHB might actually think it's a good idea.
People often wonder how their "secret" sites get into web indices. Here's a scenario that's not too obvious but is quite common:
Suppose I have a secret page, like:
http://mysite.com/cgi-bin/secret?password=administrator
Suppose this page has some links on it, and someone (maybe me, maybe my manager) clicks them to go to another site (http://elsewhere.com/).
Now suppose elsewhere.com runs analog on their web logs, and posts them in a publicly accessible location. Suppose elsewhere.com's analog setup also reports the contents of the "referer" header.
Now suppose the web logs are indexed (because of this same problem, or because the logs are just linked to from their web page somewhere). Google has the link to your secret information, even though you never explicitly linked to it anywhere.
One solution is to use proper HTTP access control (as crappy as it is), or to use POST instead of GET to supply credentials (POST doesn't transfer into a URL that might be passed as a referrer). You could also use robots.txt to deny indexing of your secret stuff, though others could still find it through web logs.
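To make the leak concrete, here is a small sketch of what anyone reading a published referrer log can do with it. The parameter names in SUSPICIOUS_KEYS and the sample URL are assumptions for illustration; a real log would need its own parsing first.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed parameter names that tend to carry credentials in a query string.
SUSPICIOUS_KEYS = {"password", "passwd", "pass", "pwd"}

def leaked_credentials(referer):
    """Return credential-looking query parameters found in a Referer URL."""
    query = urlsplit(referer).query
    params = parse_qs(query, keep_blank_values=True)
    return {k: v for k, v in params.items() if k.lower() in SUSPICIOUS_KEYS}

# A GET-style URL carries the secret along wherever the Referer goes.
print(leaked_credentials("http://mysite.com/cgi-bin/secret?password=example"))
# With POST the credentials are in the request body, so the URL is clean.
print(leaked_credentials("http://mysite.com/cgi-bin/secret"))
```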
Of course, I don't think credit card info should *ever* be accessible via HTTP, even if it is password protected!
[know how Basic Authentication works before hosting web sites]
... and know that it's a wholly inadequate way of "protecting" credit card numbers!
I do not know if this is still the case, but Microsoft's IE offline browsing page crawler (collects pages for you to read offline) ignored robots.txt last time I checked. I know many other crawlers do likewise.
I could be a rich man...
(Not, of course that I'd ever do anything like that...)
Searching with regular expressions would be cool, though...
INetPub means "INetPublic" not "INetPubrobably a great place to put my credit card numbers".
Why are stupid people not to blame for anything anymore?
Let's not stir that bag of worms...
A while back there was a thread here about the weakness of the revenue model for search engines. Maybe we have found the answer: think about all the revenue that Google could generate with this data!
Anybody know when Google is going public?
"Webmasters should know how to protect their files before they even start writing a Web site"
:)
That quote sums up the exact problem. It's not Google's fault for finding out what an idiot the web merchant was. As a matter of fact I thank Google for exposing this problem. This is nothing short of gross negligence on the part of any web merchant to have any credit card numbers publicly accessible in any way. There is no reason this kind of information should not be under strong security.
To have a search engine discover this kind of information is despicable, unprofessional, and just plain idiotic. As others have mentioned, these guys need to get a firewall, use some security, and quit being such incredible fools with such valuable information. Any merchant who exposes credit card information through the stupidity of Word documents or Excel spreadsheets on their public web server, or any non-secure server of any kind, deserves to get sued into oblivion. Although people usually don't like lawyers, I'm really glad we have them in the US because they help stop this kind of stuff. Too many lazy people don't think it's in their best interest to protect the identity or financial security of others. I'm glad lawyers are here to show them the light.
JOhn
Campaign for Liberty
At any rate--scary it is.
I'm a nature photographer.
Secondly, it appears that companies are storing credit card numbers (a) in the clear and (b) in these public areas. These companies should not be allowed to trade on the internet! That is so inept when learning how to use pgp/gpg takes no time at all, and simply storing the PGP encrypted files outside the publicly accessible filesystem is just changing the line of code that writes to "payments/ordernumber.asc" to "~/payments/ordernumber.asc" (or whatever). Of course, the PGP secret key is not stored on a publicly accessible computer at all.
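A quick way to check the "outside the publicly accessible filesystem" part is to test whether the order file would land under the web server's document root. This is only a sketch: /var/www as the document root and /home/shop/payments as the safe location are hypothetical paths, and the function name is mine.

```python
import os

def is_web_reachable(path, docroot):
    """Return True if path lies inside docroot after resolving symlinks."""
    path = os.path.realpath(path)
    docroot = os.path.realpath(docroot)
    return os.path.commonpath([path, docroot]) == docroot

# Hypothetical layout: the first file is served by the web server, the second is not.
print(is_web_reachable("/var/www/payments/ordernumber.asc", "/var/www"))
print(is_web_reachable("/home/shop/payments/ordernumber.asc", "/var/www"))
```

It says nothing about aliases or CGI scripts that expose other directories, so treat a False result as necessary rather than sufficient.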
But I shouldn't be giving a basic course on how to secure website payments, etc, to you lot - you know it or could work it out (or a similar method) pretty quickly. It is those dumb administrators that don't have a clue about security that are to blame (or their PHB).
What do you mean they cut the power? How can they cut the power, man? They're animals!
this guy's just looking for free hype for his book. if that's the kind of advice he offers, he's doing more harm than good.
This is the voice of World Control. I bring you Peace.
Is like blaming the Highway department for speeders...
Thanks to file sharing, I purchase more CDs
Thanks to the RIAA, I buy them used...
Some search engines don't just check the pages linked from other pages on the server, but also look for other files in the subdirectories presented in links.
So if http://credit.com/ has a link to http://credit.com/signin/entry.html then these engines will also check http://credit.com/signin/ - which will, if directory indexes are on and there is no index.html page there, show all the files in the directory. In which case http://credit.com/signin/custlist.dat - your flatfile list including credit cards - gets indexed.
So if you're going to have directory indexing on (which there can be valid reasons for) you really need to create an empty index.html file as the very next step each time you set up a subdirectory, even if you only intend to link to files within it.
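If you want to audit an existing site for this, a sketch along these lines lists every directory that would show a listing when indexes are on. The index file names are an assumption; use whatever your server is configured to treat as the index page.

```python
import os

# Assumed index file names; match these to your server configuration.
INDEX_NAMES = ("index.html", "index.htm")

def dirs_without_index(docroot):
    """Return directories under docroot that contain no index file."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(docroot):
        if not any(name in filenames for name in INDEX_NAMES):
            missing.append(dirpath)
    return sorted(missing)
```

Calling dirs_without_index on your document root gives you the list of places where an empty index.html still needs to be created.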
"with their freedom lost all virtue lose" - Milton
Second, if the sensitive information is going to a select few people, consider PGP encrypting the data, and only putting the encrypted version online. Doing this makes many of the HTTP security issues less critical.
Assuming you still have to put something sensitive online, make sure of the following:
- Only use HTTPS, never use just plain HTTP.
- Use CGI, Java Servlets, or some other server-side program technology to password-protect the site. I will refer to the resulting program(s) as the security program.
- Never accept a password from a GET request, only accept them from POST requests.
- Never make the user list or password list visible from the internet, not even an encrypted password list.
- Never place the sensitive information in a directory the web server software knows how to access. Only the security program should know how to find the info.
- Review all documentation for your web server software and the platform used for the security program. Pay special attention to security issues, and make sure you aren't inadvertently opening up holes. Keep current, do this at minimum four times a year.
- Subscribe to any security mailing lists for your web server platform, operating system, web server software, and for the programming platform you used for the security program. If there is anything else running on this machine, subscribe to their security mailing lists too.
- Subscribe to cert-advisory and BugTraq. Read in detail all the messages that are relevant to your setup. Review your setup after each relevant message.
- Don't use IIS.
- Don't use Windows 95/98/Me. Don't use Windows XP Home Edition.
- Don't use any version of MacOS before OS X.
- Don't use website hosting services for sensitive information.
- Never connect to this webserver using telnet, ftp or FrontPage. SSH is your friend.
- Never have Front Page Extensions (or its clones or workalikes) installed on a webserver with sensitive data.
- If there is anything above that you don't understand, or if you can't afford the time for any of the above, hire a professional with security experience and recommendations from people you trust who have used his or her services. It's bad enough that amateurs are running webservers, much less running ecommerce sites and other sites with sensitive data.
The above is an incomplete list. It is primarily there to start giving people an idea of how much effort they should expect to put into a properly administered secure website with sensitive information. Do you really need to distribute this via a web browser?
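As one example from the list, the "never accept a password from a GET request" rule might look something like this inside the security program. It is a sketch only: the function name and the "password" field name are assumptions, and a real program would take these values from its CGI or servlet environment.

```python
from urllib.parse import parse_qs

def password_from_request(method, query_string, body):
    """Return the submitted password, or None unless it came via a POST body."""
    if method.upper() != "POST":
        return None
    # Refuse requests that also put a password in the URL.
    if "password" in parse_qs(query_string, keep_blank_values=True):
        return None
    values = parse_qs(body, keep_blank_values=True).get("password")
    return values[0] if values else None
```

Rejecting a POST that also carries a password in the URL is a design choice here, made so that a form pointed at the wrong method fails loudly instead of half working.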
Open mind, insert foot.
Google's comment was:
"The primary burden falls to the people who are incorrectly exposing this information."
This is where they should have stopped. Those who find their credit card information in a search engine will learn a lesson and use services that actually take care of their customers' security and privacy. Google shouldn't have to clean up incompetent people's mess.
In the long run, these things can only lead to the ignorant (wannabe?) players in the market slowly dying because they don't know what they are doing.
I personally hope someone gets a taste of reality here, and that only the serious players survive. The MCSE crowd may finally learn that there's more to it than blind trust in their own (lacking) ability.
Clever signature text goes here.
I agree with all of your assertions, except
"Don't use IIS."
This just isn't an option for a lot of people. I would change this to:
"If you use IIS, you need to make sure you check BugTraq/cert EVERY day."
I would also add:
"If you use IIS with COM components via ASP, make sure the DLLs are not in a publicly accessible directory."
This happens a lot, and makes the DLLs a lot easier to break.
Let's not stir that bag of worms...
I'll never forget the day I first saw a .pdf in a Google search result. Not that long ago I saw my first .ps.gz in a search result. I mean, how dope is that!? They're ungzipping the file, and then parsing the postscript! Soon they'll start uniso-ing images, untarring files, unrpming packages, .... You'll be able to search for text and have it found inside the README in an rpm in a Red Hat ISO.
Can't wait until images.google.com starts doing OCR on the pix they index...
filetype:htpasswd htpasswd
Scary how many
-- Azaroth
Carl G. Jung
--
"With one breath, with one flow, You will know Synchronicity" -La Policia
Because 28 days after you took your page offline it will disappear from the Google cache.
Google reindexes web pages, and if they 404 on the next visit, then good bye pork pie! You have to get them while they are hot, eg, when a site has JUST been Slashdotted.
Perhaps it would be a good idea after reading this article to examine publicfile.
It was written by a very security conscious programmer who realises that your private files can easily get out onto the web. That is why publicfile has no concept of content protection (eg, Deny from evilh4x0r.com or .htaccess) and will only serve up files that are publicly readable.
From the features page:
A good healthy dose of paranoia would do people good.
No, seriously, do it !
Print it out and hang it on the wall, then put a post-it note on top of it saying: "The best example of 'blaming the messenger' ever !!!"
echo '[q]sa[ln0=aln80~Psnlbx]16isb572CCB9AE9DB03273snlbxq' |dc