The Problem of Search Engines and "Sekrit" Data

Posted by Hemos on Monday November 26, 2001 @04:43AM from the how-to-choose-data dept.

Nos. writes: "CNet is reporting that not only Google but other search engines are finding passwords and credit card numbers while doing their indexing. An interesting quote from the article by Google: 'We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes.'" As the article outlines, this has been a problem for a long time -- and with no easy solution in sight.

18 of 411 comments

Simple but burdensome solution by camusflage · 2001-11-26 04:52 · Score: 4, Informative

Credit card numbers follow a known format (mod10). It should be simple, but somewhat intensive as far as search engines go, to scan content, look for 16 digit numeric strings, and run a mod10 on them. If it comes back true, don't put it into the index.
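
For readers who have not met the mod10 check, here is a minimal Python sketch of it (usually called the Luhn algorithm). The function name `luhn_valid` is made up for illustration, and this is not code from any search engine; it only shows the arithmetic the comment refers to.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn (mod 10) check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

A crawler following this idea would call something like `luhn_valid` on every 16-digit run it finds and skip the page (or the run) when it returns True. Note that roughly one in ten random digit strings passes, so the check alone does not prove a string is a card number.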

--
The truth about Scientology, Xenu, and you: Operation Clambake
1. Re:Simple but burdensome solution by Bronster · 2001-11-26 15:38 · Score: 3, Informative
  
  \d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d.?\d
  
  should match >99% of cc numbers. And a lot of other dross, but you can just pipe it into a mod10 checker
  
  That puts the burden on me, the poor sap who wants to have my web pages indexed, to make sure that I don't accidentally put any numbers on a web site that might be misinterpreted as a credit card number (e.g. a tab- or comma-separated list of numbers would be likely to hit the above, especially if it was much longer than a CC number).
  
  Not to mention the problem of recursive lookup on a long number (the first 2000 digits of pi are 3.1415926535.......) - it would take an age to make sure there were no CC no's in that.
  
  Altogether, it would cause 'innocent' pages to not be indexed, which is distinctly suboptimal.
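
The false-positive worry above can be tried out with a rough Python sketch. It assumes a slightly stricter pattern than the one quoted (at most one non-digit between digits), uses a lookahead so that every starting offset in a long digit run is tested, and then applies a mod10 check. The names `CANDIDATE`, `luhn_valid` and `suspicious_runs` are invented for this illustration.

```python
import re

# Sixteen digits, each optionally followed by one non-digit separator.
# The lookahead lets matches overlap, so every offset in a long run is tried.
CANDIDATE = re.compile(r"(?=((?:\d\D?){15}\d))")


def luhn_valid(digits: str) -> bool:
    """Mod 10 (Luhn) check on a string that contains only digits."""
    total = 0
    for position, ch in enumerate(reversed(digits)):
        digit = int(ch)
        if position % 2 == 1:
            digit = digit * 2 - 9 if digit > 4 else digit * 2
        total += digit
    return total % 10 == 0


def suspicious_runs(text: str) -> list:
    """Return every 16-digit window in `text` that passes the mod10 check."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group(1))
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

Feeding it a harmless comma-separated list such as sixteen zeros produces a hit, which is the "innocent page" problem in miniature, and a long digit string is scanned once per offset, which is where the cost comes from.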
This is what happens when you use frontpage... by Grip3n · 2001-11-26 04:55 · Score: 5, Informative

I'm a web developer, and I don't know how many times I've heard people who are just getting into the scene talking about making 'hidden' pages. I'm referring to those that are only accessible to those who click on a very tiny area of an image map, or perhaps find that 'secret' link at the bottom of the page. Visually, these elements seem 'hidden' to a user who doesn't really understand web pages and source code. However, these 'hidden' pages look like giant 'Click Here' buttons to search engines, which is what I'm presuming some of this indexing is finding.

The search engines cannot feasibly stop this from happening; each occurrence is unique unto itself. The only prevention tool is knowledge and education, and bringing to the masses a general understanding of search engine spidering theory.

Just my 2 cents.

--
To make a pun demonstrates the highest understanding of a language
Example by squaretorus · 2001-11-26 04:55 · Score: 5, Informative

I recently joined an angel organisation to publicise my business in an attempt to raise funds. The information provided to the organisation is supposed to be secret, and only available to members of the organisation via a paper newsletter which was reproduced in the secure area of the organisation's website.
A couple of months down the line a couple of search engines, when asked about 'mycompanyname' were giving the newsletter entry in the top 5.

Alongside my details were those of several other companies. Essentially laying out the essence of the respective business plans.

How did this happen? The site was put together with FP2000, and the 'secure' area was simply those files in the /secure directory.

I had no cause to view the website prior to this. The site has been fixed on my advice. How did this come about? No one in the organisation knew what security meant. They were told that /secure WAS!

It didn't do any damage to myself, but a few of the other companies could have suffered if their plans were found. It's not Google's job to do anything about this; it's the webmaster's. But a word of warning: before you agree for your info to appear on a website, ask about the security measures. They may well be crap!
How this happens by Tom7 · 2001-11-26 04:59 · Score: 5, Informative

People often wonder how their "secret" sites get into web indices. Here's a scenario that's not too obvious but is quite common:

Suppose I have a secret page, like:
http://mysite.com/cgi-bin/secret?password=administrator

Suppose this page has some links on it, and someone (maybe me, maybe my manager) clicks them to go to another site (http://elsewhere.com/).

Now suppose elsewhere.com runs analog on their web logs, and posts them in a publicly accessible location. Suppose elsewhere.com's analog setup also reports the contents of the "referer" header.

Now suppose the web logs are indexed (because of this same problem, or because the logs are just linked to from their web page somewhere). Google has the link to your secret information, even though you never explicitly linked to it anywhere.

One solution is to use proper HTTP access control (as crappy as it is), or to use POST instead of GET to supply credentials (POST doesn't transfer into a URL that might be passed as a referrer). You could also use robots.txt to deny indexing of your secret stuff, though others could still find it through web logs.
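
As a rough illustration of the robots.txt option, the Python sketch below uses the standard library's `urllib.robotparser` to check what a well-behaved crawler would conclude. The robots.txt content and the crawler name `ExampleBot` are hypothetical; the URL is the one from the scenario above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for mysite.com that fences off the CGI directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def crawler_may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that honours ROBOTS_TXT is allowed to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Keep in mind that robots.txt is only a request: it stops polite crawlers from indexing the page, but it does nothing about the Referer leak itself and does not stop anyone from fetching the URL directly.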

Of course, I don't think credit card info should *ever* be accessible via HTTP, even if it is password protected!
1. Re:How this happens by Garfunkel · 2001-11-26 05:05 · Score: 2, Informative
  
  ah yes, analog's reports (and other web stat programs) are a big culprit as well. Even on local sites. If I have a /sekrit/ site that isn't linked to from anywhere on my site, but I have a bookmark that I visit often, that still shows up in web logs and usually gets indexed by a web log analyzer, which can "handily" create links to all those pages when it generates the report.
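
One partial mitigation on the reporting side, sketched in Python under the assumption that you post-process referrer URLs before a stats report is published: drop the query string so that credentials passed via GET are not republished. The function name `scrub_referer` is invented here and is not a feature of analog or any other log analyzer.

```python
from urllib.parse import urlsplit, urlunsplit


def scrub_referer(referer: str) -> str:
    """Drop the query string and fragment from a referrer URL before publishing it."""
    parts = urlsplit(referer)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

This only hides parameters. The path itself (a /sekrit/ directory, say) still appears in the report, so unlinked pages remain discoverable unless the report is kept private.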
  
  --
  -jay
spam by flollywebfrog · 2001-11-26 05:01 · Score: 1, Informative

The other day I was using google to explore the files of an annoying spammer's site [referralware.com]. Simply searching for a few numbers with the query site:.referralware.com brought up search results in their unprotected source.referralware.com directory that included all the credit card logs for the past week. And I am just an average computer joe user ... this is a problem if I can be a "hacker" with less knowledge than a script kiddie!

--

________________
All my sig are fjdklafjkldafjkldafdaklf
Directory listings by NineNine · 2001-11-26 05:03 · Score: 2, Informative

Most of this is coming from leaving directory listing turned on. Generally, this should only be used on HTTP front-ends to FTP boxes, and for development machines. IIS has "directory browsing" turned off by default. Maybe Apache has it turned on by default? You'd be surprised to see how many public webservers have this on, making it exceedingly likely that search engines will find files they weren't meant to find. The situation arises when there's no "default" page (usually index.html or default.html, default.asp, etc.) in a directory and only a file like content.html in a directory. If a search engine tries http://domain.com/directory/, it'll get the directory listing, which it can, in turn, continue to spider.
well golly gosh, it works! by Anonymous Coward · 2001-11-26 05:05 · Score: 2, Informative

search for: password admin filetype:doc

My first hit is:

www.nomi.navy.mil/TriMEP/TriMEPUserGuide/WordDocs/Setup_Procedures_Release_1.0e.doc

at the bottom of the html:

UserName: TURBO and PassWord: turbo, will give you unlimited user access (passwords are case sensitive).

Username: ADMIN and PassWord: admin, will give you password and system access (passwords are case sensitive).

It is recommend that the user go to Tools, System Defaults first and change the Facility UIC to Your facility UIC.

oh dear, am I now a terrorist?
Re:A symptom of poor programming... by ChazeFroy · 2001-11-26 05:15 · Score: 2, Informative

Something I forgot to mention in my other post:

The October 2001 issue of IEEE Computer has some articles on security, and the first article in the issue is titled "Search Engines as Security Threat" by Hernandez, Sierra, Ribagorda, Ramos.

Here's a link to it.
Re:A symptom of poor programming... by ichimunki · 2001-11-26 05:19 · Score: 5, Informative

A big part of why this is a problem is the fact that many web servers are, by default, set up to display file listings for directories if there is no "index.html" file in the directory and the user requests a URL corresponding to that directory.

Personally I like to make sure that there is an .htaccess file that prevents this (on Apache-- I'm sure IIS and others have similar config options). I like to turn off the directory listing capability if possible, and certainly assign a valid default page, even if index.html is not present.
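
On Apache, the usual way to switch listings off is `Options -Indexes` in the server config or an .htaccess file. As a complementary check, here is a small Python sketch that walks a document root and reports directories with no default page, i.e. the ones that would show a listing if listings are enabled. The set of default page names and the function name are assumptions for illustration; match them to your server's actual configuration.

```python
import os

# Assumed default page names; adjust to whatever your server is configured to serve.
DEFAULT_PAGES = {"index.html", "index.htm", "default.html", "default.asp"}


def dirs_without_default_page(docroot: str) -> list:
    """Return directories under `docroot` (relative paths) that lack a default page."""
    exposed = []
    for dirpath, _dirnames, filenames in os.walk(docroot):
        names = {name.lower() for name in filenames}
        if not names & DEFAULT_PAGES:
            exposed.append(os.path.relpath(dirpath, docroot))
    return sorted(exposed)
```

Running it against the web root gives a quick list of directories to either lock down or give an index page.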

And don't forget "index of /cgi-bin" for some real fun. ;)

--
I do not have a signature
Re:Tangential Google Question by Xzzy · 2001-11-26 05:44 · Score: 3, Informative

> If you only had Google pointing to it, wouldn't
> it be very low on a search list?

If it's a very specific search term, Google will still return it in the list. If it's unique enough, it's very possible that it will even be the top ranked page. If you put a unique string of characters (like a password or something) on a page, and google indexed it, typing that "password" into the search engine will give you your page.

You can also type domain names into google to retrieve the cache page for that website, which would accomplish much the same thing as long as it's not geocities or something.
Checklist for HTTP Distribution of Sensitive Data by Gleef · 2001-11-26 06:03 · Score: 3, Informative
First, determine if you really need to distribute this via HTTP. It is far easier to secure other protocols (eg scp), so if there's another way of doing this, do it.

Second, if the sensitive information is going to a select few people, consider PGP encrypting the data, and only putting the encrypted version online. Doing this makes many of the HTTP security issues less critical.

Assuming you still have to put something sensitive online, make sure of the following:
- Only use HTTPS, never use just plain HTTP.
- Use CGI, Java Servlets, or some other server-side program technology to password-protect the site. I will refer to the resulting program(s) as the security program.
- Never accept a password from a GET request, only accept them from POST requests.
- Never make the user list or password list visible from the internet, not even an encrypted password list.
- Never place the sensitive information in a directory the web server software knows how to access. Only the security program should know how to find the info.
- Review all documentation for your web server software and the platform used for the security program. Pay special attention to security issues, and make sure you aren't inadvertently opening up holes. Keep current; do this at minimum four times a year.
- Subscribe to any security mailing lists for your web server platform, operating system, and web server software, and for the programming platform you used for the security program. If there is anything else running on this machine, subscribe to their security mailing lists too.
- Subscribe to cert-advisory and BugTraq. Read in detail all the messages that are relevant to your setup. Review your setup after each relevant message.
- Don't use IIS.
- Don't use Windows 95/98/Me. Don't use Windows XP Home Edition.
- Don't use any version of MacOS before OS X.
- Don't use website hosting services for sensitive information.
- Never connect to this webserver using telnet, ftp or FrontPage. SSH is your friend.
- Never have Front Page Extensions (or its clones or workalikes) installed on a webserver with sensitive data.
- If there is anything above that you don't understand, or if you can't afford the time for any of the above, hire a professional with security experience and recommendations from people you trust who have used his or her services. It's bad enough that amateurs are running webservers, much less running ecommerce sites and other sites with sensitive data.
The above is an incomplete list. It is primarily there to start giving people an idea of how much effort they should expect to put into a properly administered secure website with sensitive information. Do you really need to distribute this via a web browser?
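
To make the "POST only, never GET" item concrete, here is a minimal Python sketch of the check a security program might perform before looking at a password. The function name and the form field name `password` are hypothetical, and a real program would also need HTTPS, proper password verification and session handling, none of which is shown.

```python
from urllib.parse import parse_qs, urlsplit


def password_from_request(method: str, url: str, body: str):
    """Return the submitted password, or None if the request should be rejected."""
    # Refuse anything that carries a password in the URL: it would end up
    # in server logs, browser history and Referer headers.
    if "password" in parse_qs(urlsplit(url).query):
        return None
    if method.upper() != "POST":
        return None
    # Form-encoded POST body, e.g. "user=alice&password=s3cret".
    values = parse_qs(body).get("password")
    return values[0] if values else None
```

Rejecting URL-borne passwords outright, rather than quietly accepting them, keeps a sloppy link or bookmark from ever working in the first place.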
--

----
Open mind, insert foot.
Re:MicroSoft Passport Credit Card # available by PaperTie · 2001-11-26 06:33 · Score: 2, Informative

Actually not. The article simply discussed how the Passport system uses cookies to store users' information and how you could possibly get the cookies from a user that still has them. It doesn't detail anything about accessing some magical database, nor does it mention credit cards.
For those who must use IIS by JMZero · 2001-11-26 06:33 · Score: 3, Informative

I agree with all of your assertions, except

"Don't use IIS."

This just isn't an option for a lot of people. I would change this to:

"If you use IIS, you need to make sure you check BugTraq/cert EVERY day."

I would also add:

"If you use IIS with COM components via ASP, make sure the DLL's are not in a publicly accessible directory."

This happens a lot, and makes DLL's lots easier to break.

--
Let's not stir that bag of worms...
McGraw responds by Anonymous Coward · 2001-11-26 06:52 · Score: 1, Informative

I dropped a note on his comments

"We have a problem, and that is that people don't design software to behave itself.. etc.."

Me(typoes and all)- You honestly believe that a crawler that finds a private page is responsible fro exposing private info?

Seriously? Cmon, Under 0 circumstances should my CC information be available to anyone visiting a website, if it is, the owner of that site should be criminally liable.

The response -

Hi Sean,

I agree. I actually made that point too, but the reporter chose to focus on other things I said.

gem
It doesn't last by kimihia · 2001-11-26 13:55 · Score: 3, Informative

Because 28 days after you took your page offline it will disappear from the Google cache.

Google reindexes web pages, and if they 404 on the next visit, then good bye pork pie! You have to get them while they are hot, eg, when a site has JUST been Slashdotted.
An advertisement for publicfile by kimihia · 2001-11-26 14:03 · Score: 3, Informative
Perhaps it would be a good idea after reading this article to examine publicfile.

It was written by a very security conscious programmer who realises that your private files can easily get out onto the web. That is why publicfile has no concept of content protection (eg, Deny from evilh4x0r.com or .htaccess) and will only serve up files that are publicly readable.

From the features page:
- publicfile doesn't let users log in. Intruders can't use publicfile to check your usernames and passwords.
- publicfile refuses to supply files that are unreadable to owner, unreadable to group, or unreadable to world.
A good healthy dose of paranoia would do people good.
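
The permission rule quoted from the features page can be sketched in a few lines of Python. This is not publicfile's own code, only an illustration of the stated rule on a POSIX system; the function name `publicly_readable` is made up.

```python
import os
import stat


def publicly_readable(path: str) -> bool:
    """Return True only if `path` is readable by owner, group and world."""
    mode = os.stat(path).st_mode
    needed = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    return (mode & needed) == needed
```

The appeal of the rule is that the filesystem permissions become the single source of truth: a file that was `chmod 600` for any reason simply never gets served.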