Checking Web Content for Sensitive Data?
NetFiber asks: "I work as a security analyst for a large university. We have recently been tasked to scour our network in the hopes of finding and removing sensitive information such as credit card numbers, social security numbers, and such on all publicly available web servers. Our current method of analysis is to archive all the content (which often grows over 100GB) and later parse the data with various utilities and regexes that search for patterns and other pertinent information. So far, this process has proven to be rather cumbersome and time consuming. Does anyone have any experience collecting and sanitizing large amounts of web content? If so, what procedures/utilities do you use to accomplish this?"
Of course, you're probably not interested specifically in protecting "Visa's track data" but in whatever data you consider sensitive. Applying the listed policies and practices would go a long way towards securing your resources, whatever it is you want to secure.
For a large corporation like ours, the penalties for failing to comply would be severe (and most likely business-damaging). If you're not handling card data, you won't have the same consequences, of course. What the penalties meant to us, though, is that top management made a decree: 'fix the problems and pass the audit -- we can't afford not to.' Having top-down pressure means that if we have sensitive data that we're passing to another team, we're both inclined to work together to solve the issues. If one team balks, a phone call up the pyramid gets things back on track. If your university is serious about this, a similar edict will go a long way towards cleanup.
Another boost in the direction of securing our data was hiring an external consultant to perform the audit. Our auditor is very knowledgeable about ways to follow the data: where does it enter the system, where does it go from there, who writes it to disc, why do they save it, and do they have a business need to save it? Can the data be eliminated? Can a token be substituted for the data? Can the data be truncated? If not, can it at least be masked on reports where the details aren't needed?
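To make the auditor's questions about masking and tokens a bit more concrete, here is a minimal Python sketch. It is only an illustration, not what our teams actually deployed: `mask_card_number` and `TokenVault` are hypothetical names, and the in-memory dictionary stands in for what would really have to be an encrypted, access-controlled store.

```python
import secrets


def mask_card_number(pan: str) -> str:
    """Mask all but the last four digits, for reports that don't need the full number."""
    digits = [c for c in pan if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


class TokenVault:
    """Hypothetical in-memory vault that swaps a sensitive value for a random token."""

    def __init__(self) -> None:
        self._by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # The token is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(8)
        self._by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only systems with a business need should ever be allowed to call this.
        return self._by_token[token]
```

Downstream systems would then store and pass around only the token or the masked form, which shrinks the number of places the real number can leak from.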
As far as specifics go, each development and maintenance director's pyramid was required to assign a manager to own the PCI process. Each team had to go through their code, identify sensitive data, and take steps to protect it. They also had to go to the data owners, and have them redact their archives.
It's huge. But given the security breaches that are almost a daily occurrence, we can't afford not to.
John
If you can do a regex of what you are looking for, you might be able to put some infrastructure in front of your web apps that controls what goes out.
Some commercial vendors, e.g., Citrix (Teros) and Imperva, offer stuff like this in an appliance, and there has to be some sort of thing you could do with Apache and OSS stuff as well, depending on your needs. It might not catch everything, but hey, your code base is always changing, and a one-time audit might not find a problem that shows up six months after the audit is done. Some sort of preventative measure working hand-in-hand with regular audits is probably your best bet in the long run.
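As a rough idea of what "controlling what goes out" can look like, here is a small Python WSGI middleware sketch. It is not how any of the commercial appliances work; `OutboundFilter` and the SSN-shaped pattern are assumptions made for illustration, and it buffers the whole response in memory, which would not suit large downloads.

```python
import re

# Assumed pattern: US SSNs written as 123-45-6789. Real rules would be broader.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


class OutboundFilter:
    """WSGI middleware that blocks responses whose body matches a sensitive pattern."""

    def __init__(self, app, patterns=(SSN_PATTERN,)):
        self.app = app
        self.patterns = patterns

    def __call__(self, environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            # Hold back the status line until the body has been inspected.
            captured["status"] = status
            captured["headers"] = headers
            return lambda data: None  # legacy write() callable, ignored in this sketch

        result = self.app(environ, capture)
        try:
            body = b"".join(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        if any(p.search(body) for p in self.patterns):
            blocked = b"Response blocked: possible sensitive data."
            start_response("500 Internal Server Error",
                           [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(blocked)))])
            return [blocked]

        headers = [(k, v) for k, v in captured["headers"]
                   if k.lower() != "content-length"]
        headers.append(("Content-Length", str(len(body))))
        start_response(captured["status"], headers)
        return [body]
```

Wrapping an application is a one-liner (`app = OutboundFilter(app)`), and logging each blocked URL gives the audit team a list of pages to clean up.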
Given enough time, some industrious hacker will find all the data for you.
Then, when you read the Slashdot article titled "[Name of Your Company] Leaks Private Data", you'll know exactly where the pertinent files are.
At that point you can take care of them. The payout for the privacy lawsuits will probably end up being less than the cost in man hours to do the job semi-manually. In the end, you'll still come out on top. (Though there is the off-chance that your company and your replacement will come out on top...)
UTF-8: There and Back Again
JIHS comes to mind.
Network filtering would be useful as a proactive preventative, but that's going to cause a serious network slowdown in most large environments while at the same time not catching the root causes of the problem.
Of course, storing the information again and then searching it is pretty silly. You don't want to know what used to be out there, you want to know what's currently out there and as a bonus, it's already taking up storage space somewhere, so why duplicate it? In order to "copy" it, you're going to take just as many resources as if you look at it in place and process it once, so what's the point?
Just create an optimized process (since this is where all the work will be done it's useful to spend a lot of time optimizing it) to scan file shares and database tables (why use http when you can bulk access the html via a file system?) for your "security-breach" signatures. Write some good regexps and even grep is fairly fast. Then, just set the process to start over at the first file system once it's completed scanning the last one. Make sure that you reduce the priority for the process and give it appropriate bandwidth/resource limits so that it's using "extra" resources instead of interfering with normal work and you're all set. If you can get your scanning process to run at a low cpu priority on the actual storage hosts, that'll be even better because it'll limit your bandwidth usage even more.
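For what it's worth, a scan-in-place process along those lines might look like the Python sketch below. The two patterns (dash-separated SSNs and 13 to 16 digit card numbers filtered through a Luhn checksum) are my assumptions about what counts as a "security-breach" signature; tune them for your own data. `os.nice` is only available on Unix-like systems, hence the guard.

```python
import os
import re
import sys

# Assumed signatures: dash-separated SSNs and 13-16 digit card-like numbers.
SSN_RE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(rb"\b\d(?:[ -]?\d){12,15}\b")


def luhn_ok(digits: bytes) -> bool:
    """Luhn checksum, used to discard digit runs that cannot be card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = ch - 48  # ASCII byte to integer digit
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_file(path):
    """Return (line number, kind) pairs for every suspicious line in one file."""
    hits = []
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh, 1):
            if SSN_RE.search(line):
                hits.append((lineno, "ssn"))
            for m in CARD_RE.finditer(line):
                if luhn_ok(re.sub(rb"\D", b"", m.group())):
                    hits.append((lineno, "card"))
    return hits


def scan_tree(root):
    """Walk the file system directly instead of crawling over HTTP."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                for lineno, kind in scan_file(path):
                    yield path, lineno, kind
            except OSError:
                continue  # unreadable file: skip it and keep going


if __name__ == "__main__" and len(sys.argv) > 1:
    if hasattr(os, "nice"):
        os.nice(19)  # lowest CPU priority, so the scan only uses spare cycles
    for path, lineno, kind in scan_tree(sys.argv[1]):
        print(f"{path}:{lineno}: possible {kind}")
```

Run it from cron or a loop on the storage host itself so the data never crosses the network; restarting from the first file system once the last one finishes gives you the continuous scan described above.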
The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
mod_security for Apache can do exactly this sort of regex matching and serve up an error page if a match is found. The logs are pretty easy to grep to find occurrences of a match and hence track the data down.
Orationem pulchram non habens, scribo ista linea in lingua Latina
One amusing situation was when the head of Australia's nuclear agency was very vocal in his criticism of Google's satellite images because a low-detail image of his facility was visible there. He actually played the "terrorism" card in his criticism. The front page of his organisation's website had a much more detailed, more up-to-date aerial photograph of the same facility.
Dude.. we know who you work for.. really.
Our Nigerian IT minister has tasked us with providing free support to the US universities.
Kindly forward us the backup tapes with your data as well as a representative list of personal data you are striving to secure (such as student SS#, birth dates, Mother Maiden Names, corporate purchase cards, etc.) and we will promptly perform the audit for you.
This is absolutely legal, and you will be allowed to keep 10% of whatever we find.
[no, no it's a joke, dammit!]
Obama likes poor people so much, he wants to make more of them.
If you're familiar with SQL Server and its method of creating backup files, you can actually find quite a number of backup files just using Google. The files are documented in the Microsoft Tape Format guide, which shows the block magic numbers; these can be quite useful.
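The same trick works defensively: you can look for stray backup files on your own web roots by their magic number rather than by file extension. The Python sketch below assumes the file starts with the four-byte `TAPE` descriptor block ID from the Microsoft Tape Format; that is my understanding of uncompressed SQL Server backups, and compressed or other formats would need their own signatures. The function names are made up for the example.

```python
import os

# Assumption: an MTF media header block begins with the 4-byte ID "TAPE".
MTF_TAPE_MAGIC = b"TAPE"


def looks_like_mtf_backup(path: str) -> bool:
    """Check only the first four bytes, so even huge files are cheap to test."""
    with open(path, "rb") as fh:
        return fh.read(4) == MTF_TAPE_MAGIC


def find_backups(root: str):
    """Yield every file under root that starts with the MTF magic number."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if looks_like_mtf_backup(path):
                    yield path
            except OSError:
                continue  # unreadable file: skip it
```

Because it ignores the file name, this also catches backups that someone renamed to something innocent-looking before dropping them in a public directory.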
Like this
Download, restore, maybe find something useful...
Task Mangler
And I never cease to be amazed by the sheer number of people sharing that belief that there's some magical amulet (uber-security program/appliance/whatever) that you can just tack onto a site and make it auto-magically secure.
Unfortunately that kind of thinking is outright counter-productive. It's dangerous. It's the kind of thinking that breeds such disasters as "we use SSL, so we're secure." (Shame that someone uploaded confidential documents on the web site anyway, so they can be downloaded by anyone. _Securely_ downloaded, to be sure;) Or "we have a Snake Oil (TM) gateway that can scan SOAP requests, so we're secure." (Shame that no one actually configured the rules for it, though. Or shame that the Web front-end there allowed users to escalate their privileges _before_ it all got packed in a SOAP request: the gateway can't detect whether it's genuinely a site admin or a regular user who escalated their privileges.) Or "we have a hardened Single Sign-On front-end in front of the servers, enforcing login and access rights, so we're secure." (Shame that, literally, one application allowed users to escalate their privileges and see any content, by just editing the URL. E.g., someone could edit the admin's password by just editing the admin's user ID in the URL for the password change page, _then_ properly log in as the admin through that hardened SSO front-end. Literally. I'm not making it up.) Etc.
But to address your actual point: content scanners aren't the answer, or rather are a bad and incomplete answer. E.g., I've seen one company deploy such a thing in front of the back-end, in their case to supposedly protect against SQL injection in the front-end. So it rejected anything that looked like an SQL keyword. Should be secure, right? But what do you do if it's not as secure or well-programmed as you think? E.g., the thing would cause a form submit to fail if you wrote something like "Visa Select" in a field, because it contained "select", but actually failed to protect against actual SQL injection using the quote sign, or XSS injection using the greater-than and less-than signs.
Worse yet, it encouraged everyone to be lax and not bother thinking about security or doing a code review, because, hey, they have the magical amulet on the backend. Even worse, it encouraged managers to not allocate time or resources for an actual security review.
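To illustrate the failure mode, here is a small Python sketch using the standard `sqlite3` module. It is my reconstruction of the general idea, not that company's actual product: the `BANNED` keyword list and the function names are invented for the example.

```python
import sqlite3

# Hypothetical keyword blacklist of the kind described above.
BANNED = ("select", "insert", "update", "delete", "drop")


def keyword_filter_allows(value: str) -> bool:
    """Naive scanner: reject any input containing an SQL keyword."""
    lowered = value.lower()
    return not any(word in lowered for word in BANNED)


def find_user_unsafe(conn, name):
    # Vulnerable on purpose: the input is pasted straight into the SQL text.
    sql = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(sql).fetchall()


def find_user_safe(conn, name):
    # Parameterized query: the driver treats the input strictly as data.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

payload = "' OR '1'='1"
print(keyword_filter_allows("Visa Select"))  # False: a harmless value is rejected
print(keyword_filter_allows(payload))        # True: the injection sails through
print(find_user_unsafe(conn, payload))       # every row in the table comes back
print(find_user_safe(conn, payload))         # [] because nobody has that name
```

The fix lives in the application code (the parameterized query), which is exactly the part that a bolt-on scanner tempts people to skip.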
Security isn't about magical amulets, it's "holistic", so to speak. The security chain is literally as weak as the weakest link. People need to be educated to actually sit and think about the whole and about every single piece and scenario, not to throw in a couple of +5 Security amulets and call it a day. Throwing in the towel and relying on some magical amulet which somehow makes it all secure just because it's there, is actually the antonym and nemesis of security.
Even if such appliances and programs are used, someone needs to sit and think about how they're used, how they affect their own program, what they prevent, and most importantly what they _don't_ prevent. What data do they keep from being stolen, and how, and what happens when (not if) someone _does_ get through? E.g., what data you shouldn't be collecting in the first place anyway, because you don't actually need it. (If it's not there at all, it can't get stolen.) And most often the right thing to do is _not_ to rely on them: they're there as a last-ditch defense that can't catch everything, but is one last chance to _maybe_ catch something that got through the other layers of defense. Not as a replacement for the other layers.
And teams and managers need to be educated that they _need_ to do just that: sit and do a proper analysis. And not just the technical implementation parts, but also, yes, the people processes involved. E.g., if a process can w
A polar bear is a cartesian bear after a coordinate transform.
McAfee/Foundstone's free SiteDigger
Have all students put their credit card numbers, SSNs and mother's maiden names in a database. Then you can grep -v your web content. Done!
My company has a vectorspace engine that can help you classify docs that are related. Given a SQL query, you should be able to find related information. We'd be happy to help you build something, or help you through the build process. It works under Windows and Linux, and we just completed eSeries, iSeries and zSeries certification through IBM's chiphopper program (we haven't updated the website yet). Click through on my website link for more info.
meh
It's expensive, complex, and will take at least a week to set up, but one of these will scrub all traffic for things like SSNs and other pattern-matchable data inside HTTP packets and other TCP traffic.
If money is no object, then perhaps the Google Search Appliance is the answer to your problem.
http://www.google.com/enterprise/
I think the problem is that "sensitive" covers a lot more than credit card numbers and SSNs. With universities, research projects, and government contracts, the mind boggles at what sensitive data might include or how it might be discovered. I was looking for a reasonable estimate of our local ground elevation for a Slashdot post about rising sea levels, and ended up seeing an abstract on a search page that said "WARNING: Document contains Sensitive Security Information ...." on a PDF at blank airport's web site. The question for me became: do I call the FBI or pretend I didn't see it?
is to stop using Social Security numbers. Another is to stop using Social Security numbers. Yet another is to stop using Social Security numbers. And yet another is to stop using Social Security numbers.
Is your university contributing to the students' Social Security accounts for some unknown reason? If not, there's no legitimate reason for the school to continue to use students' Social Security information.
Same with birth dates. In grade school, along with permanent records, we were assigned a student ID number. Thirty years later, I still remember mine. There's absolutely no issue with manually maintaining, in a notebook (remember those?), with a pencil or pen (remember them?) a two column chart that correlates Social Security numbers with an arbitrarily assigned student ID number issued by the college or university for identification purposes, if maintaining Social Security information is absolutely necessary, which it most likely isn't.
For every reason the university gives for maintaining Social Security numbers, birth dates, or other sensitive information, there's a reason and a method that shows that it isn't necessary. A couple of universities that made news in the recent past because of sensitive data breaches announced that they'll no longer use Social Security numbers for identifying students. If they can do it, so can your institution. No excuses. Stop using Social Security numbers.
Ask your institution this: if Congress enacted a law that said that a university could be held financially liable for the consequential damages of a data breach involving Social Security numbers, and the liability could extend to all of the endowments in possession of the university by all past alumni, would the university continue to use Social Security numbers as identifiers, or would they find and implement a different identification system rather than risk losing their entire endowment funds?
The simple answer is to stop using Social Security numbers. And to stop using Social Security numbers.
As for the other part of your post, wtf is the university doing storing credit card numbers on its computers?
There are, obviously, many ways to do this. I had never heard of such a product, but one came across my desk this week, and I thought I'd pass it on: Tablus (www.tablus.com). I'm sure it's pricey, but I guess it depends on your goals. Alternatively, I'm sure there are several consultants out there that could help you out, either by doing the dirty work for you or by pointing to someone who can.
$.02 deposited.
SSN's are essential for extending credit (credit reporting), which most universities do. They are also needed for accessing financial aid (VA, Federal Student Loans, etc).
You could use your in-house search engine (or a Google appliance if you're lucky) to find any existing content, or I suppose your current system of crawling, parsing and regexes would suffice.
Then I would recommend the mod_security module for Apache: http://www.modsecurity.org/ It will scan any POST requests for banned patterns. You could leverage the regexes you already wrote to scan the content in the first place.
I think mod_security does what the FS and McAfee appliances do at a much better (free as in beer) price.
If you have access to a MacOS X box, Anthracite Web Mining Desktop toolkit http://www.metafy.com/ can do this kind of work for you. It's currently being used by customers on four continents to build daily custom reports from large volumes of web based data, like the SEC Edgar filings. It's based on a visual user interface that allows non-programmers to quickly and easily create high value web data processing systems. If you need to automate running a grip of regexen against thousands of webpages daily, you should definitely check it out. It can possibly save you a lot of time, we've got one customer who quickly eliminated two days per month of this kind of labor intensive work. On FM with great vitality at http://freshmeat.net/projects/anthracite [PS - Yes, I'm definitely biased, I wrote the software ;-)]
I too work at a large university. I don't know if your experience is similar to mine. If it is, then given you're even posing this question I bet your university cannot formally define what is considered restricted or sensitive data. Some things are easy, like SSN. Some things are not. There are lots of grey areas. There are lots of kinds of data at a university, and there are potentially dozens or more formal audit requirements that might need to be met in some cases, but not others. Sometimes a given "piece" of data is itself not considered restricted, but two or more different non-restricted pieces are when put together. It gets very complex, depending on how thorough you want to be. And that's just death around a university, where people love to debate the complexities ad nauseam, and no one can or will just say, look, THIS is restricted data. THIS is where we are starting. *I* am making the call because I can, or else because someone has to. If we want to add to this list, or debate the subtleties down the road, fine. But get busy with THIS list NOW.
And so my first point: how thorough do the powers-that-be want you to be? And is there a definition clear enough to program a computer by that specifies that level of thoroughness? Or when you ask precise questions, do you find it hard to get anyone who says: I am responsible for making the call, and the call is YES|NO that is|isn't restricted data. Instead, you get a lot of long-winded talk and vague references to long-winded, say-nothing policies that don't, ultimately, answer your questions about what is or is not restricted data either? Yeah, I thought so. Sorry to hear it.
Second point: is this an interim damage control task, with the real task of getting a handle on sensitive data going forward already well underway? If not, then you are again on a fool's errand. This task is going to be time-intensive, no matter how you do it, no matter what tool you find or what set of scripts you roll yourself. Why bother, then, putting the horses back into the barn until the gate is fixed? Or probably more aptly, why bother making the horses stand where you wish there was a barn until one is built there? Unless you have, say, transcripts or something sitting on a webserver, time is far better spent on building a barn.
Third: someone already raised this. It goes hand-in-hand with the above point. If the groups around campus aren't made responsible for how they handle restricted data, it's hopeless. A university environment is generally too chaotic and out of control (I believe the euphemism is "collegial") to manage it any other way. But hey, what's the first thing that will happen when you tell groups they are responsible for handling sensitive data? Yep, you guessed it: what is considered restricted/sensitive? And I bet your university can't answer that. So, yes, I don't envy you. Good luck.