The Problem of Search Engines and "Sekrit" Data

Posted by Hemos on Monday November 26, 2001 @04:43AM from the how-to-choose-data dept.

Nos. writes: "CNet is reporting that not only Google but other search engines are finding password and credit card numbers while doing its indexing. An interesting quote from the article by Google: 'We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes.'" As the article outlines, this has been a problem for a long time -- and with no easy solution in sight.

28 of 411 comments

A symptom of poor programming... by Bonker · 2001-11-26 04:48 · Score: 4, Insightful

I don't see what's so hard about this problem. It's very simple... don't keep data of any kind on the web server. That's what firewalled, password/encryption protected DB servers are for.

--
The next Slashdot story will be ready soon, but subscribers can beat the rush and slashdot the links early!
1. Re:A symptom of poor programming... by ChazeFroy · 2001-11-26 04:58 · Score: 5, Interesting
  
  Try the following searches on google (include the quotes) and you'll be amazed at what's out there:
  
  "Index of /admin"
  "Index of /password"
  "Index of /mail"
  "Index of /" +passwd
  "Index of /" password.txt
2. Re:A symptom of poor programming... by Brainless · 2001-11-26 05:04 · Score: 4, Funny
  
  I manage a Cold Fusion web server that we allow clients to post their own websites to. Recently, their programmer accidentally made a link to the admin section. Google found that link, proceeded into the admin section, and indexed all the "delete item" links as well. I found it quite amusing when they asked to see a copy of the logs, complaining that the website had been hacked, and I discovered that GoogleBot had deleted every single database entry for them.
3. Re:A symptom of poor programming... by ichimunki · 2001-11-26 05:19 · Score: 5, Informative
  
  A big part of why this is a problem is the fact that many web servers are, by default, set up to display file listings for directories if there is no "index.html" file in the directory and the user requests a URL corresponding to that directory.
  
  Personally I like to make sure that there is an .htaccess file that prevents this (on Apache-- I'm sure IIS and others have similar config options). I like to turn off the directory listing capability if possible, and certainly assign a valid default page, even if index.html is not present.
  
  And don't forget "index of /cgi-bin" for some real fun. ;)
  
  --
  I do not have a signature
4. Re:A symptom of poor programming... by Legion303 · 2001-11-26 05:29 · Score: 5, Interesting
  
  Please give credit where credit is due. Vincent Gaillot posted this list to Bugtraq on November 16.
  -Legion
How can this happen? by Nonesuch · 2001-11-26 04:48 · Score: 4, Redundant

To the best of my knowledge, search engines all work by indexing the web, starting with the base of web sites or submitted URLs, and following the links on each page.
Given this premise, the only way that Google or another search engine could find a page with credit card numbers or other 'secret' data, would be if that page was linked to from another page, and so on, leading back to a 'public' area of some web site.
That is to say, the web-indexing bots used by search engines cannot find anything that an ordinary, very patient human could not find by randomly following links.

--

I do not deploy Linux. Ever.
Oh Yeah? by Knunov · 2001-11-26 04:49 · Score: 4, Funny

"...search engines are finding password and credit card numbers while doing its indexing."

This is very serious. Could you please post the exact search engines and query strings so I can make sure my information isn't there?

Knunov

--
Why do users with IDs under 100,000 or over 700,000 usually have the most worthwhile comments?
1. Re:Oh Yeah? by Karma+50 · 2001-11-26 04:51 · Score: 5, Funny
  
  Just search for your credit card number.
  
  By the way, does google have that realtime display of what people are searching for?
  
  --
  http://www.thehungersite.com
Tangential Google Question by banuaba · 2001-11-26 04:50 · Score: 5, Interesting

How does the Google Cache avoid legal entanglements, both for stuff like cc numbers and copyright/trademark infringement?
If I want to find lyrics to a song, the site that has them will often be down, but the cache will still have them in there. Why is what Google is doing 'okay' but what the original site is doing not okay? Or do they just leave Google alone?

--

Brant

Argle. Bargle.
Stopping Google won't stop the problem... by Kr3m3Puff · 2001-11-26 04:51 · Score: 5, Insightful

The big complaint of the article is that Google is searching for new types of files, instead of HTML. If some goofball left some link to a Word document with his passwords in it, he gets what he deserves.

The quote from that article about Google not thinking about this before they put it forward is idiotic. How can Google be responsible for documents that are in the public domain, that anyone can get to by typing a URL into a browser? It isn't insecure software, just dumb people...

--
D.O.U.O.S.V.A.V.V.M.
1. Re:Stopping Google won't stop the problem... by mobiGeek · 2001-11-26 06:37 · Score: 5, Funny
  "but Google undoubtedly uses techniques beyond that of the casual browser"
  Uhh...no.
  HTTP is an extremely basic protocol. Google's bots simply do a series of GET requests.
  It would be possible that Google's bots have a database of username/passwords for given sites, but the more likely scenario is that they have stumbled across another way to get the "protected" information:
  
  a link which contains a username and/or password
  /protected/show_article.pl?username=foo&password=bar&num=1
  
  a link to the pages which by-passes the protection scheme
  /no_one_can_find_this_cause_Im_3l33t/article1.html
  
  someone else posted the information elsewhere, and this is what is actually crawled
  
  I ran robots for nearly 2 years and was harassed by many a Webmuster who could prove that my robots had hacked their site. They'd show me protected or secret data. It typically took 3 to 5 minutes to find the problem...usually the muster was the problem themself.
  HERE'S A NOTE OF WARNING TO WEBMASTERS:
  Black text links on black backgrounds in really small fonts are NOT secure.
  Maybe I should get this posted to BugTraq...or would MS come after me??
  --
  ...Beware the IDEs of Microsoft...
2. Re:Stopping Google won't stop the problem... by Anonymous Coward · 2001-11-26 08:44 · Score: 4, Insightful
  
  Years ago cable companies cried foul that ordinary citizens were grabbing satellite communications off the air with their fancy 6' dishes and watching whatever they wanted for free. The companies raised a big stink and tried to get people to pay for the content. The FCC said "tough luck buddy. If you put it out there then people have a perfect right to grab it." Since that time most satellite traffic has been encrypted.
  
  If you run a web site on the public internet then you should be paying attention to this basic fact: If you put it out there then people have a perfect right to grab it, even if you don't specifically tell them it's there. (I know FCC rulings don't apply, but the principle is the same). You should encrypt EVERYTHING you don't want people to see.
  
  Encryption is like your pants, it keeps people from seeing your privates. Hiding your URLs and hoping is like running really, really fast with no pants on - most people won't see your stuff, but there's always some bastard with a handy-cam.
Well Behaved Crawlers by tomblackwell · 2001-11-26 04:51 · Score: 4, Insightful

...obey the Robot Exclusion Standard. This is not a big secret, and is linked to by all major search engines. Anyone wishing to exclude a well-behaved robot (like those of major search engines) can place a small file on their site which controls the behaviour of the robot. Don't want a robot in a particular directory? Then set your robots.txt up correctly.
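
For illustration, here is a minimal Python sketch of the check a well-behaved crawler makes before fetching a URL. It uses the standard library's `urllib.robotparser`; the robots.txt contents, the `ExampleBot` user-agent name, and the example.com URLs are made up for the example and do not describe any real search engine's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from any real server).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is an assumed crawler name used only for this sketch.
for candidate in (
    "http://example.com/index.html",
    "http://example.com/admin/delete?id=1",
):
    print(candidate, allowed(ROBOTS_TXT, "ExampleBot", candidate))
```

Note that this only describes what a polite robot chooses to skip. Nothing here stops a client that ignores the file.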

P.S. Anyone keeping credit card info in a web directory that's accessible to the outside world should really think long and hard about getting out of business on the internet.
1. Re:Well Behaved Crawlers by ryanvm · 2001-11-26 05:12 · Score: 5, Insightful
  
  The Robot Exclusion Standard (e.g. robots.txt) is mainly useful for making sure that search engines don't cache dynamic data on your web site. That way users don't get a 404 error when clicking on your links in the search results.
  
  You should not be using robots.txt to keep confidential data out of caches. In fact, most semi-intelligent crackers would actually download the robots.txt with the specific intention of finding ill-hidden sensitive data.
Simple but burdensome solution by camusflage · 2001-11-26 04:52 · Score: 4, Informative

Credit card numbers follow a known format (mod10). It should be simple, but somewhat intensive as far as search engines go, to scan content, look for 16 digit numeric strings, and run a mod10 on them. If it comes back true, don't put it into the index.
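
A rough Python sketch of that idea follows. It assumes card numbers appear as 16 digits, optionally split into groups of four by spaces or hyphens; the pattern and the function names are illustrative choices, not anything a search engine is known to run. The mod10 test is the Luhn checksum.

```python
import re

# Assumed format: 16 digits, optionally grouped in fours by a space or hyphen.
CARD_PATTERN = re.compile(r"(?<!\d)\d{4}(?:[ -]?\d{4}){3}(?!\d)")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn (mod 10) checksum."""
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_like_numbers(text):
    """Return the digit strings in text that look like valid card numbers."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

A crawler could call `find_card_like_numbers(page_text)` and skip indexing any page where the result is non-empty. Expect false positives, since roughly one in ten random 16-digit strings passes the checksum.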

--
The truth about Scientology, Xenu, and you: Operation Clambake
1. Re:Simple but burdensome solution by Xerithane · 2001-11-26 04:56 · Score: 5, Insightful
  
  It is a burden, but the responsibility does not lie on a crawling engine. You could check any 10 digit number (and expdate with a Luhn check if available) but with all the different formatting done on CC numbers (XXXX-XXXX-XXXX-XXXX, XXXXXXXXXXXXXXXX, etc) the algorithm could get ugly to maintain.
  
  I don't see why Google or any other search engine has to even acknowledge this problem, it's simply Someone Else's Problem. If I was paying a web team/master/monkey any money at all and found out about this, heads would roll. It seems that even thinking of pointing a finger at Google is the same tactic Microsoft uses against those "irresponsible" individuals pointing out security flaws.
  
  If anything Google is providing them a service by telling them about the problem.
  
  --
  Dacels Jewelers can't be trusted.
Google exploit patch for Apache by Anarchofascist · 2001-11-26 04:52 · Score: 4, Funny

% cd /var/www
% cat > robots.txt
User-agent: *
Disallow: /
^D
%

--
Once more unto the breach, dear friends, once more, Or close the wall up with our American dead!
The Problem of Search Engines and "Sekrit" Data by NTSwerver · 2001-11-26 04:55 · Score: 4, Funny

Please change the title of this article to:

The Problem of Incompetent System Administrators

If data is 'sekrit'/sensitive/confidential - don't put it on the web. It's as simple as that. If that data is available on the web, search engines can't be blamed for finding it.

--
-----------------------
Moderator's essentials
This is what happens when you use frontpage... by Grip3n · 2001-11-26 04:55 · Score: 5, Informative

I'm a web developer, and I don't know how many times I've heard people who are just getting into the scene talking about making 'hidden' pages. I'm referring to those that are only accessible to those who click on a very tiny area of an image map, or perhaps find that 'secret' link at the bottom of the page. Visually, these elements seem 'hidden' to a user who doesn't really understand web pages and source code. However, these 'hidden' pages look like giant 'Click Here' buttons to search engines, which is what I'm presuming some of this indexing is finding.

The search engines cannot feasibly stop this from happening; each occurrence is unique unto itself. The only prevention tool is knowledge and education, and bringing to the masses a general understanding of search engine spidering theory.
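
To make the point concrete, here is a small Python sketch of how a spider sees a page. It uses the standard library's `html.parser`; the sample page, its paths, and the `LinkCollector` name are invented for the example. The parser never looks at colours, font sizes, or image-map coordinates, so the "hidden" links come out right next to the visible one.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every href found on <a> and <area> tags, visible or not."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one normal link, one tiny black-on-black link,
# and one 2x2 pixel image-map hotspot.
PAGE = """
<body bgcolor="black">
<a href="/products.html">Products</a>
<font color="black" size="1"><a href="/secret/admin.html">.</a></font>
<map name="m"><area shape="rect" coords="0,0,2,2" href="/hidden/newsletter.html"></map>
</body>
"""


def extract_links(html):
    """Return all link targets in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(extract_links(PAGE))
```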

Just my 2 cents.

--
To make a pun demonstrates the highest understanding of a language
Example by squaretorus · 2001-11-26 04:55 · Score: 5, Informative

I recently joined an angel organisation to publicise my business in an attempt to raise funds. The information provided to the organisation is supposed to be secret, and only available to members of the organisation via a paper newsletter which was reproduced in the secure area of the organisation's website.
A couple of months down the line a couple of search engines, when asked about 'mycompanyname' were giving the newsletter entry in the top 5.

Alongside my details were those of several other companies. Essentially laying out the essence of the respective business plans.

How did this happen? The site was put together with FP2000, and the 'secure' area was simply those files in the /secure directory.

I had no cause to view the website prior to this. The site has been fixed on my advice. How did this come about? No one in the organisation knew what security meant. They were told that /secure WAS!

It didn't do any damage to me, but a few of the other companies could have suffered if their plans were found. It's not Google's job to do anything about this, it's the webmaster's. But a word of warning: before you agree for your info to appear on a website, ask about the security measures. They may well be crap!
I've got a solution! by CraigoFL · 2001-11-26 04:56 · Score: 5, Funny

Every web server should have a file in their root directory called "secret.xml" or somesuch. This file could list all the publicly-accessible URLs that have all the "secret" data such as credit card numbers, root passwords, and private keys. Search engines could parse this file and then NOT include those URLs in their search results!
Brilliant, huh? ;-)
On second thought, maybe I shouldn't post this... some PHB might actually think it's a good idea.
How this happens by Tom7 · 2001-11-26 04:59 · Score: 5, Informative

People often wonder how their "secret" sites get into web indices. Here's a scenario that's not too obvious but is quite common:

Suppose I have a secret page, like:
http://mysite.com/cgi-bin/secret?password=administrator

Suppose this page has some links on it, and someone (maybe me, maybe my manager) clicks them to go to another site (http://elsewhere.com/).

Now suppose elsewhere.com runs analog on their web logs, and posts them in a publicly accessible location. Suppose elsewhere.com's analog setup also reports the contents of the "referer" header.

Now suppose the web logs are indexed (because of this same problem, or because the logs are just linked to from their web page somewhere). Google has the link to your secret information, even though you never explicitly linked to it anywhere.

One solution is to use proper HTTP access control (as crappy as it is), or to use POST instead of GET to supply credentials (POST doesn't transfer into a URL that might be passed as a referrer). You could also use robots.txt to deny indexing of your secret stuff, though others could still find it through web logs.
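
The GET versus POST difference can be seen without touching the network. The sketch below builds both kinds of request with Python's `urllib.request.Request` (constructing a `Request` does not send anything); the helper names are made up for the example, and the URL is the one from the scenario above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://mysite.com/cgi-bin/secret"
CREDENTIALS = {"password": "administrator"}


def build_get(base_url, params):
    """Credentials go into the query string, so they are part of the URL."""
    return Request(f"{base_url}?{urlencode(params)}", method="GET")


def build_post(base_url, params):
    """Credentials go into the request body, so the URL stays clean."""
    body = urlencode(params).encode("ascii")
    return Request(base_url, data=body, method="POST")


get_request = build_get(BASE_URL, CREDENTIALS)
post_request = build_post(BASE_URL, CREDENTIALS)

# The URL is what ends up in Referer headers, bookmarks and web logs.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

This only keeps the password out of the URL. Over plain HTTP the POST body still travels unencrypted.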

Of course, I don't think credit card info should *ever* be accessible via HTTP, even if it is password protected!
Oh, for regular expression searching in Google by EnglishTim · 2001-11-26 05:03 · Score: 5, Funny

I could be a rich man...

(Not, of course that I'd ever do anything like that...)

Searching with regular expressions would be cool, though...
Business Model by Alomex · 2001-11-26 05:05 · Score: 5, Funny

A while back there was a thread here about the weakness of the revenue model for search engines. Maybe we have found the answer, think about all the revenue that Google could generate with this data!

Anybody know when Google is going public?
Bring out the legal eagles by Milican · 2001-11-26 05:06 · Score: 4, Insightful

"Webmasters should know how to protect their files before they even start writing a Web site"

That quote sums up the exact problem. It's not Google's fault for finding out what an idiot the web merchant was. As a matter of fact I thank Google for exposing this problem. This is nothing short of gross negligence on the part of any web merchant to have any credit card numbers publicly accessible in any way. There is no reason this kind of information should not be under strong security.

To have a search engine discover this kind of information is despicable, unprofessional, and just plain idiotic. As others have mentioned these guys need to get a firewall, use some security, and quit being such incredible fools with such valuable information. Any merchant who exposes credit card information through the stupidity of Word documents, or Excel spreadsheets on their public web server, or any non-secure server of any kind deserves to get sued into oblivion. Although people usually don't like lawyers, I'm really glad we have them in the US because they help stop this kind of stuff. Too many lazy people don't think it's in their best interest to protect the identity, or financial security of others. I'm glad lawyers are here to show them the light :)

JOhn

--
Campaign for Liberty
Web Sites are public by definition by hattig · 2001-11-26 05:14 · Score: 4, Insightful

It is a simple rule of the web - any directory or subdirectory thereof that is configured to be accessible via the internet (either html root directories, ftp root directories, gnutella shared directories, etc) should be assumed to be publicly accessible. Do not store anything that should be private in these areas.
Secondly, it appears that companies are storing credit card numbers (a) in the clear and (b) in these public areas. These companies should not be allowed to trade on the internet! That is so inept when learning how to use pgp/gpg takes no time at all, and simply storing the PGP encrypted files outside the publicly accessible filesystem is just changing the line of code that writes to "payments/ordernumber.asc" to "~/payments/ordernumber.asc" (or whatever). Of course, the PGP secret key is not stored on a publicly accessible computer at all.
But I shouldn't be giving a basic course on how to secure website payments, etc, to you lot - you know it or could work it out (or a similar method) pretty quickly. It is those dumb administrators that don't have a clue about security that are to blame (or their PHB).
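
As a minimal sketch of the "keep it outside the web root" rule, the Python helper below checks whether a file path sits inside a server's document root. The function name and the example paths are hypothetical, and the comparison is purely lexical: it does not resolve symlinks or ".." segments, so treat it as a sanity check, not a security control.

```python
from pathlib import PurePosixPath


def is_web_exposed(file_path, document_root):
    """Return True if file_path lies at or below document_root (lexical check)."""
    path = PurePosixPath(file_path)
    root = PurePosixPath(document_root)
    return path == root or root in path.parents


# Hypothetical layout: /var/www is served to the world, /home/shop is not.
print(is_web_exposed("/var/www/payments/ordernumber.asc", "/var/www"))
print(is_web_exposed("/home/shop/payments/ordernumber.asc", "/var/www"))
```

An order-processing script could refuse to write payment files anywhere the first call would return True.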
Disagree With Gary McGraw by devnullkac · 2001-11-26 05:16 · Score: 4, Insightful

Near the end of the article, there's a quote from Gary McGraw:
"The guys at Google thought, 'How cool that we can offer this to our users' without thinking about security. If you want to do this right, you have to think about security from the beginning and have a very solid approach to software design and software development that is based on what bad guys might possibly do to cause your program grief."
I must say I couldn't disagree more. To suggest that web site administrators can somehow entrust Google to implement the "obscurity" part of their "security through obscurity" plan is unrealistic. As an external entity, Google is really just another one of those "bad guys" and the fact that they're making your mistakes obvious without actually exploiting them is what people where I come from call a Good Thing.

--
What do you mean they cut the power? How can they cut the power, man? They're animals!
Directory searches by wytcld · 2001-11-26 06:03 · Score: 4, Insightful

Some search engines don't just check the pages linked from other pages on the server, but also look for other files in the subdirectories presented in links.

So if http://credit.com/ has a link to http://credit.com/signin/entry.html then these engines will also check http://credit.com/signin/ - which will, if directory indexes are on and there is no index.html page there, show all the files in the directory. In which case http://credit.com/signin/custlist.dat - your flatfile list including credit cards - gets indexed.

So if you're going to have directory indexing on (which there can be valid reasons for) you really need to create an empty index.html file as the very next step each time you set up a subdirectory, even if you only intend to link to files within it.
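
The directory walk described above can be sketched in a few lines of Python. This is an assumption about how such an engine might derive extra URLs to try, not a description of any particular crawler; `parent_directories` is a made-up helper name.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def parent_directories(url):
    """Return the directory URLs above a linked file, deepest first."""
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path)
    results = []
    while True:
        # Directory URLs are requested with a trailing slash.
        path = directory if directory.endswith("/") else directory + "/"
        results.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
        if directory in ("", "/"):
            break
        directory = posixpath.dirname(directory)
    return results


# Starting from the linked page, the crawler would also try each directory.
print(parent_directories("http://credit.com/signin/entry.html"))
```

If any of those directory URLs returns an auto-generated listing, every file in it becomes a candidate for indexing, which is exactly how custlist.dat gets found.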

--
"with their freedom lost all virtue lose" - Milton