Checking Web Content for Sensitive Data?
NetFiber asks: "I work as a security analyst for a large university. We have recently been tasked to scour our network in the hopes of finding and removing sensitive information such as credit card numbers, social security numbers, and such on all publicly available web servers. Our current method of analysis is to archive all the content (which often grows over 100GB) and later parse the data with various utilities and regexes that search for patterns and other pertinent information. So far, this process has proven to be rather cumbersome and time consuming. Does anyone have any experience collecting and sanitizing large amounts of web content? If so, what procedures/utilities do you use to accomplish this?"
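As a rough illustration of the regex-style pass described above, here is a minimal Python sketch. The patterns, the function names (`luhn_ok`, `plausible_ssn`, `scan_text`), and the validity rules are assumptions chosen for illustration, not the poster's actual tooling. The idea is to pair a loose pattern with a cheap validity check (the Luhn checksum for card numbers, the unassigned-range rules for SSNs) so that random digit runs produce fewer false positives.

```python
import re

# Illustrative patterns only (assumptions, not an exhaustive rule set).
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# SSNs written in the common AAA-GG-SSSS form.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def luhn_ok(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def plausible_ssn(area: str, group: str, serial: str) -> bool:
    """Reject area/group/serial values that are never assigned."""
    return (
        area not in ("000", "666")
        and area[0] != "9"
        and group != "00"
        and serial != "0000"
    )


def scan_text(text: str):
    """Return (kind, offset, matched_text) tuples for likely hits."""
    findings = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):
            findings.append(("card", m.start(), m.group()))
    for m in SSN_RE.finditer(text):
        if plausible_ssn(*m.groups()):
            findings.append(("ssn", m.start(), m.group()))
    return findings
```

Something like `scan_text(page_body)` could be run on each page as it is fetched rather than after archiving everything, which may help with the 100GB problem. It will still miss data in other formats (unformatted nine-digit SSNs, numbers inside images or binary documents), so treat the hits as leads for a human to review.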
PCI/CISP does have software process recommendations for securing credit card data, but they are largely recommendations for people processes and facility processes.
I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.
I know what he's asking for, and I answered with what it takes to make it happen for real. The answer is that the various teams storing the data need to be held accountable for storing it securely. Just grepping for and deleting a database holding SSNs isn't enough -- his university has to make sure that all the teams are educated not to ask for or store SSNs. They'll also benefit from a cohesive policy that gives specifics, such as "replace SSNs with student ID numbers."
If this is just some security manager saying "go find SSNs and wipe 'em out" then they're up the creek. For every database they clean up, someone else will have created a new one. They'll be ignored and stonewalled by teams who have neither the time nor the budget to comply. This sort of thing has to come down from the board of regents, and they have to put the responsibility on everyone, otherwise they're just pissing in the wind.
John
JIHS comes to mind.
I think the OP may be hoping for that, since they're posting on Slashdot and have disclosed the identity of the university just as cleverly as any redacted PDF would.