The Problem of Search Engines and "Sekrit" Data
Nos. writes: "CNet is reporting that not only Google but other search engines are finding passwords and credit card numbers while doing their indexing. An interesting quote from the article by Google: 'We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes.'" As the article outlines, this has been a problem for a long time -- and with no easy solution in sight.
Given this premise, the only way that Google or another search engine could find a page with credit card numbers or other 'secret' data would be if that page were linked to from another page, and so on, leading back to a 'public' area of some web site.
That is to say, the web-indexing bots used by search engines cannot find anything that an ordinary, very patient human could not find by randomly following links.
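To make the point concrete, here is a rough sketch of link-following as a breadth-first walk over a link graph. The site layout, page names, and the reachable_pages helper are all made up for illustration; a real crawler fetches pages over HTTP and honors robots.txt, which this toy version does not model.

```python
from collections import deque

def reachable_pages(links, start):
    """Return every page reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Follow each outgoing link once; unseen targets get queued.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical site: /orders.txt is "secret" but linked from /admin/,
# which is itself linked from the public front page.
site = {
    "/index.html": ["/about.html", "/admin/"],
    "/admin/": ["/orders.txt"],
    "/about.html": [],
    "/backup.sql": [],  # present on the server, but nothing links to it
}

found = reachable_pages(site, "/index.html")
```

In this toy graph the walk finds /orders.txt because a chain of links leads to it from the front page, and never finds /backup.sql because nothing points at it.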
I do not deploy Linux. Ever.
Your crawler is caching credit card numbers, you say? Simple: check the content you cache for 16-digit numbers. Any that you find, you check with a simple Luhn (mod 10) algorithm. If it passes, you replace the number with "################" or a similar masking.
There, all credit card numbers will now be filtered from your cache.
I understand the severity of the issue, and it's good to know this is happening, but the solution is simple.
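For what it's worth, the filter described above might look roughly like this. It is a minimal sketch, not anything a search engine is known to run: the function names luhn_valid and mask_card_numbers are invented here, and it assumes the number appears as exactly 16 consecutive digits.

```python
import re

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn (mod 10) check."""
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

# Exactly 16 digits, not embedded in a longer run of digits.
CARD_RE = re.compile(r"(?<!\d)\d{16}(?!\d)")

def mask_card_numbers(text: str) -> str:
    """Replace Luhn-valid 16-digit numbers with sixteen '#' characters."""
    def _mask(match):
        number = match.group(0)
        return "#" * 16 if luhn_valid(number) else number
    return CARD_RE.sub(_mask, text)

# 4111111111111111 is a widely used test number that passes Luhn.
masked = mask_card_numbers("card: 4111111111111111, order id: 4111111111111112")
```

As written, this misses numbers typed with spaces or dashes and cards that are not 16 digits long, and it will also mask any unrelated 16-digit number that happens to pass the checksum.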
If you somehow manage to post your credit card info on the web, exactly whose fault is it? The only way it *can't* be your fault is if it's a poorly-constructed e-commerce site that leaks out that kind of info.
I just don't see what the big deal here is.
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"