Slashdot Mirror


Checking Web Content for Sensitive Data?

NetFiber asks: "I work as a security analyst for a large university. We have recently been tasked with scouring our network to find and remove sensitive information, such as credit card numbers and social security numbers, from all publicly available web servers. Our current method of analysis is to archive all the content (which often grows to over 100GB) and later parse the data with various utilities and regexes that search for patterns and other pertinent information. So far, this process has proven to be rather cumbersome and time-consuming. Does anyone have any experience collecting and sanitizing large amounts of web content? If so, what procedures/utilities do you use to accomplish this?"

5 of 44 comments

  1. Or, try a way to prevent it leaking out as well. by rdunnell · · Score: 2, Interesting

    If you can do a regex of what you are looking for, you might be able to put some infrastructure in front of your web apps that controls what goes out.

    Some commercial vendors, e.g. Citrix (Teros), Imperva, etc., offer this kind of outbound filtering in an appliance, and there has to be something similar you could do with Apache and OSS tools as well, depending on your needs. It might not catch everything, but hey, your code base is always changing, and a one-time audit might not find a problem that shows up six months after the audit is done. Some sort of preventative measure working hand-in-hand with regular audits is probably your best bet in the long run.
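
    To make the "regex on what goes out" idea concrete, here is a minimal sketch in Python. It is not how any of the appliances above work. The two patterns (SSNs written as 123-45-6789, and 13 to 16 digit card numbers with optional spaces or hyphens) and the names `redact` and `luhn_ok` are assumptions for illustration. The Luhn checksum is used only to cut down false positives on random digit runs.

```python
import re

# Assumed formats: US SSN as 123-45-6789; card numbers of 13-16 digits
# with an optional single space or hyphen between digits.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(body: str) -> str:
    """Mask SSN-like strings and Luhn-valid card numbers in outgoing text."""
    def mask_card(match):
        digits = re.sub(r"[ -]", "", match.group(0))
        return "[REDACTED CARD]" if luhn_ok(digits) else match.group(0)

    body = SSN_RE.sub("[REDACTED SSN]", body)
    return CARD_RE.sub(mask_card, body)
```

    Something like `redact(response_text)` would sit wherever responses pass through on the way out. Where that hook lives depends entirely on your stack, and as noted above it will not catch everything.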

  2. Re:Or, try a way to prevent it leaking out as well by _Sharp'r_ · · Score: 2, Interesting

    Network filtering would be useful as a proactive preventative, but that's going to cause a serious network slowdown in most large environments while at the same time not catching the root causes of the problem.

    Of course, storing the information again and then searching it is pretty silly. You don't want to know what used to be out there, you want to know what's currently out there and as a bonus, it's already taking up storage space somewhere, so why duplicate it? In order to "copy" it, you're going to take just as many resources as if you look at it in place and process it once, so what's the point?

    Just create an optimized process (since this is where all the work will be done, it's worth spending a lot of time optimizing it) to scan file shares and database tables (why use HTTP when you can bulk access the HTML via a file system?) for your "security-breach" signatures. Write some good regexps and even grep is fairly fast. Then, just set the process to start over at the first file system once it has finished scanning the last one. Make sure that you reduce the priority for the process and give it appropriate bandwidth/resource limits so that it's using "extra" resources instead of interfering with normal work, and you're all set. If you can get your scanning process to run at a low CPU priority on the actual storage hosts, that'll be even better, because the data never has to cross the network and your bandwidth usage drops even further.
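
    A rough sketch of that in-place scan, in Python rather than grep, in case it helps. The signature patterns and the names `scan_tree`, `lower_priority` and `SIGNATURES` are made up for illustration. It only covers files on a file system, not database tables, and it does nothing about bandwidth limits.

```python
import os
import re

# Hypothetical "security-breach" signatures; tune these for your own data.
SIGNATURES = {
    "ssn": re.compile(rb"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(rb"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def lower_priority():
    """Drop to the lowest CPU priority where the platform supports it."""
    if hasattr(os, "nice"):  # Unix only
        os.nice(19)


def scan_tree(root):
    """Yield (path, signature_name, line_number) for each hit under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read in place, line by line, instead of copying first.
                with open(path, "rb") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        for sig, pattern in SIGNATURES.items():
                            if pattern.search(line):
                                yield path, sig, lineno
            except OSError:
                continue  # unreadable file: skip it and keep going
```

    To get the "start over when done" behaviour, call `lower_priority()` once and then loop over your list of mount points forever, printing or logging whatever `scan_tree` yields.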

    --
    The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
  3. Look at the images too by dbIII · · Score: 3, Interesting

    One amusing situation was when the head of Australia's nuclear agency was very vocal in his criticism of Google's satellite images because a low-detail image of his facility was visible there; he actually played the "terrorism" card in his criticism. The front page of his organisation's website had a much more detailed, more up-to-date aerial photograph of the same facility.

  4. SQL Server backups by Centurix · · Score: 3, Interesting

    If you're familiar with SQL Server and its method of creating backup files, you can actually find quite a number of backup files just using Google. The file layout is documented in the Microsoft Tape Format guide, which lists the block magic numbers; these can be quite useful for recognising such files.

    Like this

    Download, restore, maybe find something useful...
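
    The same magic numbers work for auditing your own web roots. A minimal sketch, assuming an uncompressed backup starts with the four ASCII bytes "TAPE" (the Microsoft Tape Format media header block, as far as I know); check the format guide before relying on it. The names `looks_like_mtf_backup` and `find_backups` are made up for illustration.

```python
import os

# Assumption: an MTF media header block begins with the ASCII bytes "TAPE".
MTF_TAPE_MAGIC = b"TAPE"


def looks_like_mtf_backup(path):
    """Return True if the file starts with the assumed MTF magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(len(MTF_TAPE_MAGIC)) == MTF_TAPE_MAGIC


def find_backups(root):
    """Yield paths under root whose first bytes match the MTF magic."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if looks_like_mtf_backup(path):
                    yield path
            except OSError:
                continue  # unreadable file: skip it
```

    Checking content rather than the file extension means a renamed backup still gets flagged.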

    --
    Task Mangler
  5. Re:Visa PCI CISP is a good set of practices by charleste · · Score: 2, Interesting

    ...also benefit from a cohesive policy that gives specifics, such as "replace SSNs with student ID numbers."
    I completely agree. We had to do this when I was contracting for the government a number of years ago. Even in the databases at the time there was a veritable cornucopia of plain ASCII characters stored where nowadays we know that those types of data should be at least encrypted, and probably not stored in a column called (in plain text) SSN or some such thing.

    <offtopic_sidebar>Ironically, back in *my* day, my student ID at college was some number (probably the next in sequence). By the time I graduated (5 year plan), they were switching to SSN. Now, they are moving back to a student ID #</offtopic_sidebar>