Slashdot Mirror


The Problem of Search Engines and "Sekrit" Data

Nos. writes: "CNet is reporting that not only Google but other search engines are finding passwords and credit card numbers while doing their indexing. An interesting quote from the article by Google: 'We define public as anything placed on the public Internet and not blocked to search engines in any way. The primary burden falls to the people who are incorrectly exposing this information. But at the same time, we're certainly aware of the problem, and our development team is exploring different solutions behind the scenes.'" As the article outlines, this has been a problem for a long time -- and with no easy solution in sight.

23 of 411 comments

  1. Tangential Google Question by banuaba · · Score: 5, Interesting

    How does the Google Cache avoid legal entanglements, both for stuff like cc numbers and copyright/trademark infringement?
    If I want to find lyrics to a song, the site that has them will often be down, but the cache will still have them in there. Why is what Google is doing 'okay' but what the original site did not okay? Or do they just leave Google alone?

    --


    Brant

    Argle. Bargle.
    1. Re:Tangential Google Question by CaseyB · · Score: 3, Interesting
      Good question.

      Given that they do have (for now) some sort of immunity, it opens a loophole for publishing illegal data. Simply set up your site with all of Metallica's lyrics / guitar scores (all 5 of them, heh). Submit it for indexing to Google, but don't otherwise attract attention to the site. When you see the spider hit, take it offline. Now the data is available to anyone who searches for it on Google, but you're not liable for anything. The process could be repeated to update the cache.

    2. Re:Tangential Google Question by Suidae · · Score: 2, Interesting

      Don't bother taking it offline, just set up your web server so it only responds to the google indexing server. Cache stays up all the time, but no one else can (easily) see that you are serving it.

    3. Re:Tangential Google Question by LinuxHam · · Score: 2, Interesting

      Don't bother taking it offline, just set up your web server so it only responds to the google indexing server. Cache stays up all the time, but no one else can (easily) see that you are serving it.

      Oooh.. that's a particularly good one.. kinda like getting high-bandwidth web service FOC, if you build your site URLs to ride along the google cache instead of your own... (gears cranking)..

      --
      Intelligent Life on Earth
    4. Re:Tangential Google Question by randomgeek · · Score: 2, Interesting

      Random thought: would it be possible to use Google as a kind of Morpheus/Napster-like distributed service? Make an HTML "page" that looks something like:

      FileName: MyFile
      Size: FileSize

      Encode in Base64, rot13 it, and then call it protected under DMCA, bonus points.

      Of course, your web server would only accept connections from the google spiders, and you'd effectively have a free file distribution service. Not saying this would actually work, but I think there's a chance it'd work.

  2. Google shouldn't lift a finger by sketerpot · · Score: 2, Interesting

    Why should Google or any other search engine do anything to save fools from their stupidity? Putting credit card numbers online where anyone can get them is just plain idiotic. Hopefully this will get a lot of publicity along with the names of companies who do stupid things like this and most people will shape up their act.

  3. Insert foot in mouth.... by Crewd · · Score: 2, Interesting

    From the article :

    "Webmasters should know how to protect their files before they even start writing a Web site," wrote James Reno, chief executive of Amelia, Ohio-based ByteHosting Internet Services. "Standard Apache Password Protection handles most of the search engine problems--search engines can't crack it. Pretty much all that it does is use standard HTTP/1.0 Basic Authentication and checks the username based on the password stored in a MySQL Database."

    And chief executives of a hosting company should know how Basic Authentication works before hosting web sites...

    Crewd
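
    For reference, here is a minimal Python sketch of what HTTP Basic Authentication actually puts on the wire: the client sends "username:password", base64-encoded, in an Authorization header, and the server decodes it and checks the password stored for that username (not the other way around). The USERS table and both function names are made up for illustration; a real server would keep salted hashes, and the quoted setup would look the password up in its MySQL database.

```python
import base64
import hmac

# Hypothetical credential store for illustration only; a real server
# would keep salted password hashes, not plaintext.
USERS = {"alice": "s3cret"}


def make_basic_header(username: str, password: str) -> str:
    """Build the Authorization header value a client sends."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def check_basic_header(header: str) -> bool:
    """Decode the header and check the password stored for that username."""
    scheme, _, token = header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return False
    username, sep, password = decoded.partition(":")
    if not sep:
        return False
    expected = USERS.get(username)
    if expected is None:
        return False
    # Constant-time comparison so timing does not reveal where the mismatch is.
    return hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8"))
```

    Note that base64 is an encoding, not encryption: anyone who can see the traffic can read the password. A search engine stays out only because it never sends credentials, not because there is anything to "crack".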

  4. Re:A symptom of poor programming... by ChazeFroy · · Score: 5, Interesting

    Try the following searches on google (include the quotes) and you'll be amazed at what's out there:

    "Index of /admin"
    "Index of /password"
    "Index of /mail"
    "Index of /" +passwd
    "Index of /" password.txt

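    If you run a site yourself, the same pattern can be turned around to audit your own pages. A rough Python sketch, assuming the listing was generated by Apache's autoindex module (which titles the page "Index of /path"); other servers use different markup, and the function name is made up for illustration:

```python
import re

# Apache's mod_autoindex titles generated listings "Index of /path".
# Other servers format listings differently, so this pattern is an assumption.
_INDEX_TITLE = re.compile(r"<title>\s*Index of\s+(/[^<]*)</title>", re.IGNORECASE)


def autoindex_path(html: str):
    """Return the listed directory if html looks like an auto-index page, else None."""
    match = _INDEX_TITLE.search(html)
    return match.group(1).strip() if match else None
```

    Feed it the HTML you get back from each directory URL on your own server; anything that returns a path instead of None is a listing a crawler can walk.
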
  5. robots.txt by mukund · · Score: 2, Interesting

    From my web logs, I see that a lot of HTTP bots don't care crap about /robots.txt. Another thing which happens is that they read robots.txt only once and cache it forever in the lifetime of accessing that site, and do not use a newer robots.txt when it's available. It'd be useful to update what a bot knows of a site's /robots.txt from time to time.

    HTTP bot writers should honor the information in /robots.txt and restrict their access accordingly. On a lot of occasions, webmasters set up /robots.txt to actually help stop bots from feeding on junk information which they don't require, or things which change regularly and need not be recorded.
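
    A rough sketch of what a well-behaved bot could do, using Python's standard urllib.robotparser: check every URL against /robots.txt, and re-read the file once the cached copy is older than some limit. The RefreshingRobots class, the one-day default and the fetch hook are assumptions for illustration, not anything a particular crawler is known to do.

```python
import time
import urllib.robotparser


class RefreshingRobots:
    """Check URLs against robots.txt, re-reading it once it is older than max_age seconds."""

    def __init__(self, robots_url, max_age=86400.0, fetch=None):
        self.robots_url = robots_url
        self.max_age = max_age
        # fetch is a hypothetical hook returning the file's lines (handy for
        # testing); by default the parser downloads robots_url itself.
        self.fetch = fetch
        self.parser = urllib.robotparser.RobotFileParser(robots_url)

    def _refresh_if_stale(self):
        last = self.parser.mtime()  # 0 until the file has been read once
        if last == 0 or time.time() - last > self.max_age:
            # Build a fresh parser so old rules are dropped, not merged.
            fresh = urllib.robotparser.RobotFileParser(self.robots_url)
            if self.fetch is None:
                fresh.read()
            else:
                fresh.parse(self.fetch())
            fresh.modified()  # record the check time even if read() got an HTTP error
            self.parser = fresh

    def allowed(self, user_agent, url):
        """Return True if user_agent may fetch url under the current robots.txt."""
        self._refresh_if_stale()
        return self.parser.can_fetch(user_agent, url)
```

    The bot then calls allowed() before every request instead of trusting whatever it read on its first visit.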

    --
    Banu
  6. Many crawlers ignore robots.txt by Ars-Fartsica · · Score: 3, Interesting

    I do not know if this is still the case, but Microsoft's IE offline browsing page crawler (collects pages for you to read offline) ignored robots.txt last time I checked. I know many other crawlers do likewise.

  7. Sure enough. by Joe+Decker · · Score: 3, Interesting
    Looked up the first 8 digits of one of my own CC numbers, and, while I didn't find my own CC # on the net, I did immediately find a large file full of them with names, expiration dates, etc. (Sent a message to the site manager, but this case is pretty clearly an accidental leak.)

    At any rate--scary it is.

  8. standing naked in front of the window by eddy+the+lip · · Score: 3, Interesting

    But other critics said Google bears its share of the blame.

    "We have a problem, and that is that people don't design software to behave itself," said Gary McGraw, chief technology officer of software risk-management company Cigital, and author of a new book on writing secure software.

    also known as ostrich security...if your s00p3r s3cr37 files are just lying around waiting for idle surfers, search engines are the least of your worries. if you don't know enough to protect your files (by, say, not linking to them, or .htaccess files, or encrypting them), it's not the search engines' fault. it's your own dumb ass.

    this guy's just looking for free hype for his book. if that's the kind of advice he offers, he's doing more harm than good.

    --

    This is the voice of World Control. I bring you Peace.

  9. Re:A symptom of poor programming... by Legion303 · · Score: 5, Interesting
    Please give credit where credit is due. Vincent Gaillot posted this list to Bugtraq on November 16.

    -Legion

  10. Re:Stopping Google won't stop the problem... by Zspdude · · Score: 3, Interesting

    It's definitely very true that if there were no stupid people these things would not be an issue of controversy. However, society has struggled for a very long time to resolve the question, "Should stupid people be protected from themselves?" There will always be those who (whether they're just technologically inept or for whatever reason) will not act sensibly and not realize they are being foolish. Do they deserve protection as well, even though they don't know how to protect themselves? That's a question which is not quite as easy to answer....

    --
    What's in a Sig?
  11. Re:Stopping Google won't stop the problem... by greed · · Score: 2, Interesting

    So maybe the fix should be in making it harder to share things on the Web, rather than trying to have search bots guess whether someone really meant to post the file?

    Web servers could ship configured to not AutoIndex, only allow specific file types (.jpeg, .html, .png, .txt), and disable all those things that I disabled in Apache without losing anything I needed for my site, and so on. Then, the burden is placed on the person who started sharing these other filetypes that have sensitive data on the public internet.
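
    To make the default-deny idea concrete, here is a small Python sketch of the check such a server might apply to each request path. It is not Apache configuration; the ALLOWED_SUFFIXES set (the four types listed above) and the is_servable name are illustrative assumptions.

```python
from pathlib import PurePosixPath

# The four types named above; anything else is refused by default.
ALLOWED_SUFFIXES = {".jpeg", ".html", ".png", ".txt"}


def is_servable(url_path: str) -> bool:
    """Allow a request only if the final path component has an allowlisted extension."""
    name = PurePosixPath(url_path).name
    # Refuse dotfiles such as .htpasswd; extensionless names (including
    # bare directory requests) fail the suffix check below.
    if not name or name.startswith("."):
        return False
    return PurePosixPath(name).suffix.lower() in ALLOWED_SUFFIXES
```

    A filter like this still serves /password.txt, since .txt is on the list. It narrows what gets shared by accident, but it cannot guess which allowed files are sensitive.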

    Of course, putting something in public that you don't want someone to see is just plain stupid, but apparently we need to make stupid people feel like they're allowed on the 'net.

  12. Microsoft Passport Credit Card # available by peter303 · · Score: 2, Interesting

    The new issue of "2600" all but gives a kiddie script for extracting credit card numbers from the Passport database. Scary. Don't buy anything through it until they fix it.

  13. Blissful ignorance backfires again. by hkmwbz · · Score: 3, Interesting
    That a search engine is able to harvest this kind of data just proves that some people don't know what they are doing. Forgive me if I seem judgmental, but these people are probably the same people who think Windows XP is the next step and that IE is the only browser in the world. But as is proven again and again, ignorance backfires. Not only are they attacked by viruses and worms and have all backdoors and security holes exploited - they are ignorant enough to leave users' data in the open, for everyone to get.

    Google's comment was:

    "The primary burden falls to the people who are incorrectly exposing this information."

    This is where they should have stopped. Those who find their credit card information in a search engine will learn a lesson and use services that actually take care of their customers' security and privacy. Google shouldn't have to clean up incompetent people's mess.

    In the long run, these things can only lead to the ignorant (wannabe?) players in the market slowly dying because they don't know what they are doing.

    I personally hope someone gets a taste of reality here, and that only the serious players survive. The MCSE crowd may finally learn that there's more to it than blind trust in their own (lacking) ability.

    --
    Clever signature text goes here.
  14. Different file types make my day by srichman · · Score: 3, Interesting
    The big complaint of the article is that Google is searching for new types of files, instead of HTML.
    The only people who complain about this are obviously the folks using crossed fingers for security. The rest of us love that Google indexes different file types.

    I'll never forget the day I first saw a .pdf in Google search result. Not that long ago I saw my first .ps.gz in a search result. I mean, how dope is that!? They're ungzipping the file, and then parsing the postscript! Soon they'll start uniso-ing images, untarring files, unrpming packages, .... You'll be able to search for text and have it found inside the README in an rpm in a Red Hat ISO.

    Can't wait until images.google.com starts doing OCR on the pix they index...

  15. Re:A symptom of poor programming... by subsolar2 · · Score: 2, Interesting
    I've seen this myself while searching for information on Linksys routers about a year ago. I found somebody with a page that listed the password for their Linksys router along with other systems and information. I e-mailed the guy, who seemed very surprised that the information was available there and thanked me for letting him know. The information was gone when I checked again.

    It's a silly mistake, and I don't have a clue as to how Google came across the link. Like with anything new, it's going to take some time before this becomes "common sense" and people do not put this information on public servers.

    - subsolar

    P.S. It's possible to generate a URL that, when clicked by somebody behind a Linksys router, enables remote administration if you know the password. I've reported it to Linksys but gotten nothing but silence from them.

  16. Password search by azaroth42 · · Score: 3, Interesting
    Or for more fun, do a search like


    filetype:htpasswd htpasswd


    Scary how many .htpasswd files come up.


    -- Azaroth

  17. DMCA by C. · · Score: 2, Interesting

    > You should be writing that type of data on the backs of envelopes and leaving them scattered around your living room...

    Not much worse than some "commercial-grade" encryption...

    Maybe somebody should consider suing Google under the DMCA. I haven't studied the DMCA in enough detail to be sure of this (and much less studied law, for that matter), but I guess Google is easily guilty of the following "crimes" against modern society:
    - linking to decryption algorithms
    - linking to reverse engineering tools
    - linking to passwords that could be used to circumvent somebody's copyright
    - storing and distributing all the above (with Google's cache)

    As I understand current legislation, Google should not even have the right to define what is public or not like they're trying to do. Even the safe-harbour provisions do not immunize them from having to remove unlawful content.

    Such a lawsuit would make for an interesting debate, and with a bit of luck could get us all rid of this stupid law.

    C.

    --
    C.
  18. Re:A symptom of poor programming... by Anonymous Coward · · Score: 1, Interesting

    I don't recommend looking at the results of these searches. Recent laws define viewing private files as terrorism, and you might end up in jail. The only question is: Whose definition of "private" is being used.

  19. Re:What did they expect? by Anonymous Coward · · Score: 1, Interesting

    However, problems like this do arise quite often, and at their source one can see that the widespread ability of people to publish documents to the web does not coexist well with existing security systems and models.

    Not really, more a problem that incompetent administrators don't know, don't care, or don't think it's their problem; the stupidest of users don't know what they shouldn't do, and unscrupulous folk take advantage of that.
    The existing systems and models work, it's just that badly run sites don't use them correctly, if at all. Most users have the basic knowledge (or concern/fear) not to post credit card numbers in a public location online or to an untrusted site, but some stupid ones will (perhaps deservedly) pay for their mistakes and others may get screwed over by badly secured 'trusted' sites or convincing but spurious sites.

    At any other time in the past few years, this would not ordinarily be a societal problem. Sure, a few peoples' passwords and credit card numbers will leak out.... But now, this is a national security problem, because we are being attacked by a foreign force who might abuse leaked passwords to access critical systems and cause chaos in this country. President Bush and his staff are very concerned about a cyberwar, because it can be waged without physically having Arabs in the States to commit the terrorism. That is very dangerous indeed.

    This is hardly now a concern related to regular people publishing on the web. It IS a concern about security, but sensitive information should be protected by a competent admin with appropriate controls. Passwords to truly critical systems are protected both through online security methods and physical requirements; you aren't going to find them published for all to read on the family page of the guy with access to the button, nor in any cache of his online activities and postings.

    There is reason for concern there, but the solution to that concern is to make sure that the appropriate procedures are followed. As I say, it's completely irrelevant to 'standards' of publishing on the web or who can put up a homepage - only (if anything) to the competence and security awareness of those running the servers.

    There are probably millions of "here's my cat, I've joined 500 webrings, I like icecream, here's some annoying MIDI music that a button put on the page for me, pleeeease sign my guestbook" pages on Geocities and its like (Homestead is a really bad example) but I wouldn't call them a threat to National Security.

    Anything that might be called such a threat should not even be stored in an unprotected computer, let alone online. If anything the main problem might be having sensitive information on a private computer, where a cable ISP has discouraged or failed to mention appropriate firewalls (as is often the case). But for seriously sensitive stuff, this situation would never be permitted.

    I'm not sure what the solution is, but a good first step is for companies to raise the barrier to entry to publishing web pages. Geocities and Angelfire should force users to demonstrate their competence before uploading their first page.

    Why? Crap page design results in something people don't want to look at. All it does is waste space on Geocities' servers, you never have to see it if you don't want to look for it. The only possible gripes are if
    a) they publish sensitive information on there - which only rebounds on them if it's CC numbers or passwords. An adult with access to truly sensitive information would be bound by employment/secrecy clauses in their contracts so are hardly likely to 'accidentally' reveal a government secret.
    or b) they clog up the search engine results with lots of crappy listings. Which is true, but good search engines take into account how popular the page is, which tends to push the crappy pages down to the end of the listings.

    Perhaps requiring an A+ certification number would help? And Microsoft should take away the parts of FrontPage that allow users to generate documents without writing in HTML. That would help ease the problem, I reckon.

    No it wouldn't, because the problem is not with page designing abilities, or with posting sensitive information on the web. If someone puts on their homepage "My password is: " they take responsibility for that. If they were to put "The Nuclear Launch Code is:" they would probably be shot at dawn, but that's not going to happen with the people trusted with such information.
    If the host has a password list that is world readable, then there's a serious security problem. If anything the A+ certification should be required for those hosting, not creating, the pages.

    In conclusion - if everybody does their part to help solve this problem and stop information leakage, we will be a safer, more secure society without giving up any more civil liberties.

    Probably, but the point is that the large-scale sensitive material you reference is secured by appropriate technology, contract agreements and competent staff; but for personally sensitive material, apart from not being at fault for doing anything incredibly stupid, the people we entrust with such information must be competent and concerned. It's not about the codes for war, it's not about the abilities of Frontpage users, the actual problem lies somewhere in the middle.

    (Now, if someone were using FrontPage to display the codes for war on the US government homepage, there might be a problem.)