Slashdot Mirror


Checking Web Content for Sensitive Data?

NetFiber asks: "I work as a security analyst for a large university. We have recently been tasked to scour our network in the hopes of finding and removing sensitive information such as credit card numbers, social security numbers, and such on all publicly available web servers. Our current method of analysis is to archive all the content (which often grows over 100GB) and later parse the data with various utilities and regexes that search for patterns and other pertinent information. So far, this process has proven to be rather cumbersome and time consuming. Does anyone have any experience collecting and sanitizing large amounts of web content? If so, what procedures/utilities do you use to accomplish this?"

44 comments

  1. Visa PCI CISP is a good set of practices by plover · · Score: 4, Informative
    As a large merchant that handles Visa card numbers, we have to undergo an annual Visa PCI CISP audit. The questions are pretty thorough, and if you can fully pass the audit you can tell management that you've reduced your risk of exposure. The link to the pages is here: CISP.

    Of course, you're probably not interested specifically in protecting "Visa's track data" but in whatever data you consider sensitive. Applying the listed policies and practices would go a long way towards securing your resources, whatever it is you want to secure.

    As a large corporation, failure to comply would mean the penalties would be severe (and most likely business-damaging.) If you're not handling card data, you won't have the same consequences, of course. What the penalties meant to us, though, is that top management made a decree: 'fix the problems and pass the audit -- we can't afford not to.' Having top-down pressure means that if we have sensitive data that we're passing to another team, we're both inclined to work together to solve the issues. If one team balks, a phone call up the pyramid gets things back on track. If your university is serious about this, a similar edict will go a long way towards cleanup.

    Another boost in the direction of securing our data was hiring an external consultant to perform the audit. Our auditor is very knowledgeable about ways to follow the data: where does it enter the system, where does it go from there, who writes it to disc, why do they save it, and do they have a business need to save it? Can the data be eliminated? Can a token be substituted for the data? Can the data be truncated? If not, can it at least be masked on reports where the details aren't needed?
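To make the auditor's last three questions concrete, here is a minimal Python sketch of masking, truncating, and tokenizing a card number. The function names, the first-6/last-4 truncation layout, and the keyed-hash token are illustrative assumptions, not anything prescribed by the audit described above; a production token would more likely come from a dedicated vault service.

```python
import hashlib
import hmac


def _digits(pan: str) -> str:
    """Strip spaces, dashes, and any other non-digit characters."""
    return "".join(c for c in pan if c.isdigit())


def mask_pan(pan: str, keep_last: int = 4) -> str:
    """Mask every digit except the last `keep_last`, e.g. for reports."""
    d = _digits(pan)
    return "*" * max(len(d) - keep_last, 0) + d[-keep_last:]


def truncate_pan(pan: str) -> str:
    """Keep only the first six and last four digits (assumed layout)."""
    d = _digits(pan)
    return d[:6] + "*" * max(len(d) - 10, 0) + d[-4:]


def tokenize_pan(pan: str, secret: bytes) -> str:
    """Derive a stable stand-in token from the number with a keyed hash.

    This is a hypothetical stand-in for a real tokenization service.
    """
    return hmac.new(secret, _digits(pan).encode(), hashlib.sha256).hexdigest()[:16]
```

For a report line, `mask_pan("4111 1111 1111 1111")` keeps only the trailing `1111`, which is often all a human reader needs to reconcile a transaction.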

    As far as specifics go, each development and maintenance director's pyramid was required to assign a manager to own the PCI process. Each team had to go through their code, identify sensitive data, and take steps to protect it. They also had to go to the data owners, and have them redact their archives.

    It's huge. But given the security breaches that are almost a daily occurrence, we can't afford not to.

    --
    John
    1. Re:Visa PCI CISP is a good set of practices by scdeimos · · Score: 2, Insightful

      PCI/CISP does have software process recommendations for securing credit card data, but it's largely recommendations for people processes and facility processes.

      I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.

    2. Re:Visa PCI CISP is a good set of practices by plover · · Score: 4, Insightful
      I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.

      I know what he's asking for, and I answered with what it takes to make it happen for real. The answer is the various teams that are storing the data need to be held accountable for storing it securely. Just grepping for and deleting a database holding SSNs isn't enough -- his university has to make sure that all the teams are educated to not ask for nor store SSNs. They'll also benefit from a cohesive policy that gives specifics, such as "replace SSNs with student ID numbers."

      If this is just some security manager saying "go find SSNs and wipe 'em out" then they're up the creek. For every database they clean up, someone else will have created a new one. They'll be ignored and stonewalled by teams who have neither the time nor the budget to comply. This sort of thing has to come down from the board of regents, and they have to put the responsibility on everyone, otherwise they're just pissing in the wind.

      --
      John
    3. Re:Visa PCI CISP is a good set of practices by Anonymous Coward · · Score: 0

      No you didn't. You did a bunch of middle management hand waving and buzzword spewing. If they're at the point where they're doing page by page scans, they've already gone through the easy part of determining how some of this information is handled by the code written by the IT staff. He's looking for it after it's been entered manually via a text input box or where someone outside the university's IT staff has put the information on a web server. None of your hand waving will help at a university where many of the staff won't listen because they think they shouldn't have to listen to anyone that's not as smart as they think they are and a bunch of idiot students who do exactly the opposite of what you tell them.

    4. Re:Visa PCI CISP is a good set of practices by bobbozzo · · Score: 1

      ISTM that the kind of application needed would be a (customized) crawler / search engine.

      There are lots of articles online about writing crawlers and search engines using "off-the-shelf" components such as stuff found on Perl CPAN.

      Once you have a basic crawler working (should be easy), have it look for regex patterns matching SSN's, CC #'s, ..., and then log or save the offending URL's/pages.
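As a rough illustration of the pattern-matching step, here is a minimal Python sketch that scans already-fetched page text for SSN-shaped and card-shaped strings. The regexes and function names are assumptions for illustration; the Luhn checksum is added only to cut down false positives on long digit runs, and real pages will still need manual review.

```python
import re

# SSN-shaped strings: NNN-NN-NNNN, skipping ranges that are never issued.
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")
# Card-shaped strings: 13 to 19 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for suspicious strings in `text`."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    hits += [("card", m.group()) for m in CARD_RE.finditer(text)
             if luhn_ok(m.group())]
    return hits
```

A crawler would call `find_sensitive(page_text)` for each fetched page and log the URL whenever the result is non-empty.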

      --
      Nothing to see here; Move along.
    5. Re: Visa PCI CISP is a good set of practices by Anonymous Coward · · Score: 0

      !v

    6. Re:Visa PCI CISP is a good set of practices by budgenator · · Score: 1

      You're exactly right and completely wrong. His biggest problem is that he's surrounded by people smart enough to do really stupid things. One of these smart people is going to decide that they can secretly get their data from anywhere by putting it on a web server without a link; that way only they will know. Well, him and his two undergrad assistants. Well, they'll tell their girlfriends, of course. Then mysteriously the link gets posted to LeetHaxor.ru, and of course Google crawls leethaxor.ru, and then the whole world except the security consultant knows. The rest of the professors are a bunch smarter than that: they ROT-13 the data and put it on the FTP server!

      --
      Apocalypse Cancelled, Sorry, No Ticket Refunds
    7. Re:Visa PCI CISP is a good set of practices by charleste · · Score: 2, Interesting

      ...also benefit from a cohesive policy that gives specifics, such as "replace SSNs with student ID numbers."
      I completely agree. We had to do this when I was contracting for the government a number of years ago. Even in the databases at the time there was a veritable cornucopia of plain ASCII characters stored where nowadays we know that those types of data should be at least encrypted, and probably not stored in a column called (in plain text) SSN or some such thing.

      <offtopic_sidebar>Ironically, back in *my* day, my student ID at college was some number (probably the next in sequence). By the time I graduated (5 year plan), they were switching to SSN. Now, they are moving back to a student ID #</offtopic_sidebar>

    8. Re:Visa PCI CISP is a good set of practices by plover · · Score: 1
      You're exactly right and completely wrong

      Yeah, I know, and that's the problem.

      Unfortunately, the right answer is probably a big stick attached to the end of the policy. "Our policy is one of zero tolerance. If you violate these rules you will be fired, tenure notwithstanding. We have to protect our students first and our reputation second, and nothing else, including your convenience, your research, your history, your prominence in your field, your title, or your budget is justification for violation of this policy."

      As much as I personally hate 'zero-tolerance' policies like that, it's the kind of thing that gets attention. And this problem can only be solved by getting everyone's attention. The real question is "would a university have the balls to fire the Dean of Engineering for violating it?" What if that prof is world-renowned? I'm thinking "what if this meant they had to fire someone like Vinton Cerf?"

      And I don't know how to answer that.

      --
      John
    9. Re:Visa PCI CISP is a good set of practices by plover · · Score: 1
      If they're at the point where they're doing page by page scans, they've already gone through the easy part of determining how some of this information is handled by the code written by the IT staff.

      You know this? Wow, you've got pretty good between-the-lines vision, because he sure didn't say that in TFQ.

      What I read from TFQ is "help me scan for bad data" and replied with "you can scan for bad data until the cows come home, but until you have a big stick to smack future violators you will have accomplished jack."

      I just thought such simple language wasn't much help without some examples and context.

      --
      John
    10. Re:Visa PCI CISP is a good set of practices by budgenator · · Score: 1

      Fire them? Why let them off that easy :) OBTW, I got my letter from the VA a couple of days ago; now I see that the guy who had the laptop stolen actually had written permission to have the data on it!!

      --
      Apocalypse Cancelled, Sorry, No Ticket Refunds
  2. Or, try a way to prevent it leaking out as well. by rdunnell · · Score: 2, Interesting

    If you can do a regex of what you are looking for, you might be able to put some infrastructure in front of your web apps that controls what goes out.

    Some commercial vendors, e.g. Citrix (Teros), Imperva, etc., offer stuff like this in an appliance, and there has to be some sort of thing you could do with Apache and OSS stuff as well, depending on your needs. It might not catch everything but hey, your code base is always changing and a one-time audit might not find a problem that shows up six months after the audit is done. Some sort of preventative measure working hand-in-hand with regular audits is probably your best bet in the long run.

  3. The answer is simple by halcyon1234 · · Score: 5, Funny
    Do nothing.

    Given enough time, some industrious hacker will find all the data for you.

    Then, when you read the Slashdot article titled "[Name of Your Company] Leaks Private Data", you'll know exactly where the pertinent files are.

    At that point you can take care of them. The payout for the privacy lawsuits will probably end up being less than the cost in man hours to do the job semi-manually. In the end, you'll still come out on top. (Though there is the off-chance that your company and your replacement will come out on top...)

    1. Re:The answer is simple by SlowDancing · · Score: 2, Insightful
      Do nothing. Given enough time, some industrious hacker will find all the data for you.

      I think the OP may be hoping for that, since they're posting on Slashdot and have disclosed the identity of the university just as cleverly as any redacted PDF would.

    2. Re:The answer is simple by sumdumass · · Score: 1

      Maybe it is actually an elaborate honeypot designed to initiate a sting that hopes to capture most of the world's hackers in one swoop.

      So, hoping for that might be exactly what he(they) wants. The man will strike hard.

    3. Re:The answer is simple by Anonymous Coward · · Score: 0

      LOL, didn't notice that!

    4. Re:The answer is simple by thesp · · Score: 1

      [Name of your company] == University of Connecticut by any chance? or is that too much of a coincidence?

    5. Re:The answer is simple by Decker-Mage · · Score: 1

      I don't know why this is rated funny 'cause this is precisely what many (hell, most!) companies use as their policy today. Just ask any serious security professional and they will tell you the same.

      --
      "[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go
  4. johnny i hack stuff by wwest4 · · Score: 2, Insightful

    JIHS comes to mind.

    1. Re:johnny i hack stuff by aleksmanuil · · Score: 1

      Yes, I used it and am very satisfied

  5. Re:Or, try a way to prevent it leaking out as well by _Sharp'r_ · · Score: 2, Interesting

    Network filtering would be useful as a proactive preventative, but that's going to cause a serious network slowdown in most large environments while at the same time not catching the root causes of the problem.

    Of course, storing the information again and then searching it is pretty silly. You don't want to know what used to be out there, you want to know what's currently out there and as a bonus, it's already taking up storage space somewhere, so why duplicate it? In order to "copy" it, you're going to take just as many resources as if you look at it in place and process it once, so what's the point?

    Just create an optimized process (since this is where all the work will be done it's useful to spend a lot of time optimizing it) to scan file shares and database tables (why use http when you can bulk access the html via a file system?) for your "security-breach" signatures. Write some good regexps and even grep is fairly fast. Then, just set the process to start over at the first file system once it's completed scanning the last one. Make sure that you reduce the priority for the process and give it appropriate bandwidth/resource limits so that it's using "extra" resources instead of interfering with normal work and you're all set. If you can get your scanning process to run at a low cpu priority on the actual storage hosts, that'll be even better because it'll limit your bandwidth usage even more.
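A minimal Python sketch of that in-place scan might look like the following. The signature regex, the per-file size cap, and the function names are all assumptions for illustration, and `os.nice` is only available on Unix-like hosts; a real deployment would tune the patterns and add the bandwidth limits mentioned above.

```python
import os
import re
from pathlib import Path

# Hypothetical "security-breach" signature: SSN-shaped byte strings.
SIGNATURE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def lower_priority() -> None:
    """Ask the OS to schedule this process behind normal work (Unix only)."""
    if hasattr(os, "nice"):
        os.nice(19)


def scan_tree(root, signature=SIGNATURE, max_bytes=10 * 1024 * 1024):
    """Yield (path, hit_count) for each readable file under `root` with hits.

    Only the first `max_bytes` of each file are read, so huge files
    cannot monopolise memory or I/O.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                with open(path, "rb") as fh:
                    data = fh.read(max_bytes)
            except OSError:
                continue  # unreadable file: skip it and keep scanning
            hits = len(signature.findall(data))
            if hits:
                yield str(path), hits
```

An outer loop (or a scheduler) would call `lower_priority()` once, then run `scan_tree` over each file system in turn and start again from the first when the last one finishes.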

    --
    The party of stupid and the party of evil get together and do something both stupid and evil, then call it bipartisan.
  6. Re:Or, try a way to prevent it leaking out as well by glowworm · · Score: 1
    If you can do a regex of what you are looking for, you might be able to put some infrastructure in front of your web apps that controls what goes out.
    Interesting idea... but I'd do it as an adjunct as you suggest.

    mod_security for Apache can do exactly this sort of regex matching and serve up an error page if a match is found. The logs are pretty easy to grep to find occurrences of a match and hence track the data down.
    --
    Orationem pulchram non habens, scribo ista linea in lingua Latina
  7. Look at the images too by dbIII · · Score: 3, Interesting

    One amusing situation was when the head of Australia's nuclear agency was very vocal in his criticism of Google's satellite images due to a low-detail image of his facility being visible there - he actually played the "terrorism" card in his criticism. The front page of his organisation's website had a much more detailed aerial photograph of the same facility that was more up to date.

  8. Another NSA troll looking for tech help on /. by mrgodzilla · · Score: 4, Funny

    Dude.. we know who you work for.. really.

  9. Dear Sir... by megaditto · · Score: 5, Funny

    Our Nigerian IT minister has tasked us with providing free support to the US universities.

    Kindly forward us the backup tapes with your data as well as a representative list of personal data you are striving to secure (such as student SS#, birth dates, Mother Maiden Names, corporate purchase cards, etc.) and we will promptly perform the audit for you.
    This is absolutely legal, and you will be allowed to keep 10% of whatever we find.

    [no, no it's a joke, dammit!]

    --
    Obama likes poor people so much, he wants to make more of them.
    1. Re:Dear Sir... by Anonymous Coward · · Score: 0

      FAIL. Joke ruined by explaining that it was a joke.

  10. Re:Or, try a way to prevent it leaking out as well by ktwombley · · Score: 1
    Palisade Systems offers just such an appliance. Notably, it's built on top of FOSS, easy to install in many configurations, scales very well, and easy to administer (with a kickass web-based interface). It has a swiss-army set of tools you can get with it, including URL filtering, Credit Card matching, and other sensitive data matching. Full Disclosure: I work for Palisade Systems.

  11. SQL Server backups by Centurix · · Score: 3, Interesting

    If you're familiar with SQL Server and its method of creating backup files, you can actually find quite a number of backup files just using Google. The files are documented in the Microsoft Tape Format guide, which shows the block magic numbers; these can be quite useful.
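For the defender's side of this, a magic-number check is easy to sketch in Python. This assumes that a Microsoft Tape Format file opens with a media header block whose four-byte type identifier is the ASCII string `TAPE`; treat that as an assumption to verify against the format guide and a known backup before relying on it. The function names are hypothetical.

```python
# Assumed block-type identifier at the start of an MTF media header.
MTF_TAPE_MAGIC = b"TAPE"


def looks_like_mtf_backup(header: bytes) -> bool:
    """Return True if `header` starts with the assumed MTF magic bytes."""
    return header[:4] == MTF_TAPE_MAGIC


def file_looks_like_mtf_backup(path) -> bool:
    """Read the first four bytes of `path` and test them against the magic."""
    with open(path, "rb") as fh:
        return looks_like_mtf_backup(fh.read(4))
```

Checking content rather than the file extension catches backups that have been renamed before being dropped into a web root.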

    Like this

    Download, restore, maybe find something useful...

    --
    Task Mangler
  12. Which is entirely the wrong approach by Moraelin · · Score: 4, Informative

    I believe the original requestor is asking about software to help automate/speed-up monitoring and scanning of content that's being put up on web sites by staff and/or students.

    And I never cease to be amazed by the sheer number of people sharing that belief that there's some magical amulet (uber-security program/appliance/whatever) that you can just tack onto a site and make it auto-magically secure.

    Unfortunately that kind of thinking is outright counter-productive. It's dangerous. It's the kind of thinking that breeds such disasters as "we use SSL, so we're secure." (Shame that someone uploaded confidential documents on the web site anyway, so they can be downloaded by anyone. _Securely_ downloaded, to be sure;) Or "we have a Snake Oil (TM) gateway that can scan SOAP requests, so we're secure." (Shame that no one actually configured the rules for it, though. Or shame that the Web front-end there allowed users to escalate their privileges _before_ it all got packed in a SOAP request: the gateway can't detect whether it's genuinely a site admin or a regular user who escalated their privileges.) Or "we have a hardened Single Sign-On front-end in front of the servers, enforcing login and access rights, so we're secure." (Shame, that, literally, one application allowed users to escalate their privileges and see any content, by just editing the URL. E.g., someone could edit the admin's password by just editing the admin's user ID in the URL for the password change page, _then_ properly log in as the admin through that hardened SSO front-end. Literally. I'm not making it up.) Etc.

    But to address your actual point: content scanners aren't the answer, or rather are a bad and incomplete answer. E.g., I've seen one company deploy such a thing in front of the back-end, in their case to supposedly protect against SQL injection in the front-end. So it rejected anything that looked like an SQL keyword. Should be secure, right? But what do you do if it's not as secure or well-programmed as you think? E.g., the thing would cause a form submit to fail if you wrote something like "Visa Select" in a field, because it contained "select", but actually failed to protect against actual SQL injection using the quote sign, or XSS injection using the greater-than and less-than signs.
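The failure mode is easy to reproduce. The Python sketch below pairs a naive keyword blacklist (a guess at how such a filter might behave, not the actual product) with the conventional fixes: parameterized queries for SQL and output escaping for HTML. The table, column, and function names are made up for the example.

```python
import html
import sqlite3

SQL_KEYWORDS = ("select", "insert", "update", "delete", "drop")


def keyword_filter_rejects(value: str) -> bool:
    """Mimic a blacklist that rejects any input containing an SQL keyword."""
    lowered = value.lower()
    return any(keyword in lowered for keyword in SQL_KEYWORDS)


def find_holders(conn: sqlite3.Connection, brand: str) -> list:
    """Look up card holders by brand using a bound parameter.

    The driver passes `brand` as data, so quote characters in it cannot
    change the structure of the query.
    """
    cur = conn.execute("SELECT holder FROM cards WHERE brand = ?", (brand,))
    return cur.fetchall()


def render_comment(text: str) -> str:
    """Escape user text before embedding it in HTML output."""
    return "<p>" + html.escape(text) + "</p>"


# Hypothetical in-memory table for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (holder TEXT, brand TEXT)")
conn.execute("INSERT INTO cards VALUES (?, ?)", ("alice", "Visa Select"))
```

Here the blacklist rejects the legitimate value `Visa Select` yet lets `' OR '1'='1` through, while the parameterized query handles both inputs correctly without any filtering at all.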

    Worse yet, it encouraged everyone to be lax and not bother thinking about security or doing a code review, because, hey, they have the magical amulet on the backend. Even worse, it encouraged managers to not allocate time or resources for an actual security review.

    Security isn't about magical amulets, it's "holistic", so to speak. The security chain is literally as weak as the weakest link. People need to be educated to actually sit and think about the whole and about every single piece and scenario, not to throw in a couple of +5 Security amulets and call it a day. Throwing in the towel and relying on some magical amulet which somehow makes it all secure just because it's there, is actually the antonym and nemesis of security.

    Even if such appliances and programs are used, someone needs to sit and think about how they're used, how they affect their own program, what they prevent, and most importantly what they _don't_ prevent. What data and how does it prevent from being stolen, and what happens when (not if) someone _does_ get through. E.g., what data you shouldn't be collecting in the first place anyway, because you don't actually need it. (If it's not there at all, it can't get stolen.) And most often the right thing to do is _not_ to rely on them: they're there as a last ditch defense, that can't catch everything, but it's one last chance to _maybe_ catch something that got through the other layers of defense. Not as a replacement for the other layers.

    And teams and managers need to be educated that they _need_ to do just that: sit and do a proper analysis. And not just the technical implementation parts, but also, yes, the people processes involved. E.g., if a process can w

    --
    A polar bear is a cartesian bear after a coordinate transform.
    1. Re:Which is entirely the wrong approach by Decker-Mage · · Score: 1
      I get damn near all the industry publications that exist and the advertisements in them, as well as more than a few articles, encourage the belief in that magical amulet of security +5. As we both know, security is a process, or actually a collection of processes. I like to think of it as consisting of three items:

      • Security by design - security has to be engineered into the design from the very beginning, not tacked on after the fact.
      • Security by policy - policies must be put in place and enforced to ensure that security is not breached by personnel/users.
      • Security by audit - continuous audits, and here is where software tools, hardware such as intrusion detectors, etc. come in, must be conducted to ensure that the design and policies are effective.

      It's hard to get it right, especially since you face constraints such as time, budget, personnel, and executive buy-in. You will have to conduct serious risk assessments to determine what your priorities are and present them in an effective manner ("hey, id10t exec, you could go to prison over this" does work!) to the CxO's. You'll also have to put the policies in easy to understand formats for personnel, i.e. don't bother explaining the why's and wherefores of a policy, just give people the policy and what will happen (i.e. bye-bye!) if it is violated. Offer amnesty for reporting policy violations. These are only a few things that come immediately to mind.

      Scripts/software tools (compliance validators for instance) are all well and good for the third part of the above formula but it is only a small part of the picture.

      --
      "[I]t is a wise man who admits the limits of his knowledge or skill, and that pretending either causes harm." --Terry Go
  13. McAfee/Foundstone's free SiteDigger by CFrankBernard · · Score: 2, Informative
  14. ok, this is easy by Anonymous Coward · · Score: 1, Funny

    Have all students put their credit card numbers, SSNs and mother's maiden names in a database. Then you can grep -v your web content. Done!

  15. VSDB by bigattichouse · · Score: 1

    My company has a vectorspace engine that can help you classify docs that are related. given a SQL query you should be able to find related information. We'd be happy to help you build something, or help you through the build process. It works under windows, linux, and we just completed eSeries, iSeries and zSeries certification through IBM's chiphopper program (we haven't updated the website yet). Click through on my website link for more info.

    --
    meh
  16. here's a device that does just that: by imsmith · · Score: 1

    It's expensive, complex, and will take at least a week to set up, but one of these will scrub all traffic for things like SSNs and other pattern-matchable data inside HTTP packets and other TCP traffic.

  17. Google Search Appliance by poison1701 · · Score: 0

    If money is no object, then perhaps the Google Search Appliance is the answer to your problem..
    http://www.google.com/enterprise/

  18. Re:Or, try a way to prevent it leaking out as well by Anonymous Coward · · Score: 0

    I think the problem is that "sensitive" covers a lot more than CC numbers and SSNs. Universities, research projects, government contracts: the mind boggles at what "sensitive" might include or how it might be discovered. I was looking for a reasonable estimate of our local ground elevation for a Slashdot post about rising sea levels, and ended up seeing an abstract on a search page that said "WARNING: Document contains Sensitive Security Information ...." on a PDF at blank airport's web site. The question for me became, do I call the FBI or pretend I didn't see it?

  19. One tip by Anonymous Coward · · Score: 1, Insightful

    is to stop using Social Security numbers. Another is to stop using Social Security numbers. Yet another is to stop using Social Security numbers. And yet another is to stop using Social Security numbers.

    Is your university contributing to the students' Social Security accounts for some unknown reason? If not, there's no legitimate reason for the school to continue to use students' Social Security information.

    Same with birth dates. In grade school, along with permanent records, we were assigned a student ID number. Thirty years later, I still remember mine. There's absolutely no issue with manually maintaining, in a notebook (remember those?), with a pencil or pen (remember them?) a two column chart that correlates Social Security numbers with an arbitrarily assigned student ID number issued by the college or university for identification purposes, if maintaining Social Security information is absolutely necessary, which it most likely isn't.

    For every reason for maintaining the Social Security, birthdate, or other sensitive information by the university, there's a reason and a method that shows that it isn't necessary. A couple of universities who made news in the recent past because of sensitive data breaches announced that they'll no longer use Social Security numbers for identifying students. If they can do it, so can your institution. No excuses. Stop using Social Security numbers.

    Ask your institution this: if Congress enacted a law that said that a university could be held financially liable for the consequential damages of a data breach involving Social Security numbers, and the liability could extend to all of the endowments in possession of the university by all past alumni, would the university continue to use Social Security numbers as identifiers, or would they find and implement a different identification system rather than risk losing their entire endowment funds?

    The simple answer is to stop using Social Security numbers. And to stop using Social Security numbers.

    As for the other part of your post, wtf is the university doing storing credit card numbers on its computers?

    1. Re:One tip by RayMarron · · Score: 1
      stop using Social Security Numbers.


      That will work great, assuming everybody pays their own tuition. Because if anybody wants any Title IV aid (grants, student loans, etc.), pretty much the first entry that has to be submitted on every form required by, and every record transmitted to and from the government is ...wait for it... the SSN. Anybody got an Act of Congress handy? :)
      --
      ON DELETE CASCADE
    2. Re:One tip by Anonymous Coward · · Score: 0
      pretty much the first entry that has to be submitted on every form required by, and every record transmitted to and from the government is ...wait for it... the SSN. Anybody got an Act of Congress handy? :)


      I used to think that was true also. Especially on financial forms. Ten years ago it was either entirely true, or true for the most part. Then something magical happened. Congress passed an amnesty bill for illegal aliens sometime earlier. The only problem was, there was no enforcement part; it was a pure amnesty bill. This caused illegal entry into the US to skyrocket. As this started, it created a dilemma for banks and other financial institutions that wanted/want the revenue from selling mortgages, selling property, and selling other items to illegal entrants to the US. Going back 10 years and more, actually going back more than 20 years, when I first attempted to hand in my first driver's license application without filling in the Social Security number and that failed, to attempts to hand in other government forms that required a Social Security number, and my leaving the number part blank failed, to my entering "n/a" in the number part failing, I've noticed one thing in the past half dozen years: Now, when you fill out a form (including federal government forms and forms for private organizations that are submissions to government agencies for funding), leaving the number blank generally has no effect; the form is accepted without question. And when it is a form that is being filled out for admission into a federally funded organization, and someone is asking you questions and entering your responses into their computer, they don't even blink anymore when you simply state "no" in answer to their request for your Social Security number. They simply move on to the next question without batting an eye.

      Many of the federal forms you fill out may have an entry requesting a Social Security number. But thanks to the influx of illegal aliens and many agencies' attempts to accommodate them, submission of such a number is optional on many of the forms; the forms will still get processed if no number is provided.

      In addition, your response provides an excuse to maintain the status quo. But my top post states

      For every reason for maintaining the Social Security, birthdate, or other sensitive information by the university, there's a reason and a method that shows that it isn't necessary


      which is applicable for your response as well. Just because the Federal government may require a Social Security number on a federal form (like a financial form, loan would qualify as a financial form, especially if the loan isn't paid back but you leave school so that the loan counts as income for federal tax purposes), doesn't mean that the school should continue to use the Social Security number simply because it is found on a federal financial form. The college or university is under no requirement to maintain use of the Social Security number as an identifier within the college or university.

      An additional problem with your logic about the federal forms is that several universities who have run into security breaches regarding Social Security and other identity information have already stated that they will stop using Social Security numbers as student identifiers. Are you stating that these universities have no students that qualify for federal financial aid that relies on Social Security numbers on their applications? If one college or university can alter their behavior regarding student identifiers, then others can follow their lead. If not, then they should fire their boards of directors and management and hire new boards and management.

      And guess what? Once a majority of colleges and universities stop using Social Security numbers as identifiers, then you'll see that the federal government will also alter their behavior, especially after data breaches hit them hard when it is no longer worth cracking college and university computers.
  20. Commercial Product by Anonymous Coward · · Score: 0

    There are, obviously, many ways to do this. I had never heard of such a product, but one came across my desk this week, and I thought I'd pass it on: Tablus (www.tablus.com). I'm sure it's pricey, but I guess it depends on your goals. Alternatively, I'm sure there are several consultants out there that could help you out, either by doing the dirty work for you or by pointing to someone who can.

    $.02 deposited.

  21. What world do you live in? by sgent · · Score: 1

    SSNs are essential for extending credit (credit reporting), which most universities do. They are also needed for accessing financial aid (VA, Federal Student Loans, etc).

  22. mod_security by sneakerfish · · Score: 1

    You could use your in-house search engine (or a Google appliance if you're lucky) to find any existing content, or I suppose your current system of crawling, parsing and regexes would suffice.

    Then I would recommend the mod_security module for Apache: http://www.modsecurity.org/. It will scan any POST requests for banned patterns. You could leverage the regexes you already wrote to scan the content in the first place.

    I think mod_security does what the FS and McAfee appliances do at a much better (free as in beer) price.
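
    For what it's worth, here is a rough Python sketch of the kind of regex pass described above. Everything in it is an assumption for illustration: the names (SSN_RE, CARD_RE, luhn_ok, scan_text) are made up, the patterns are deliberately simple, and it is not how mod_security itself is configured. It flags SSN-shaped strings (skipping ranges that are never issued) and card-shaped digit runs that pass the Luhn checksum, which cuts down on false positives from arbitrary long numbers.

```python
import re

# Hypothetical patterns for illustration only; real content will need tuning.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
# 13 to 19 digits, each optionally followed by a single space or hyphen.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text):
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    hits = []
    for m in SSN_RE.finditer(text):
        area, group, serial = m.groups()
        # Skip values that are never issued: area 000, 666, 9xx; group 00; serial 0000.
        if (area in ("000", "666") or area.startswith("9")
                or group == "00" or serial == "0000"):
            continue
        hits.append(("ssn", m.group(0)))
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group(0))
        if luhn_ok(digits):
            hits.append(("card", m.group(0)))
    return hits


if __name__ == "__main__":
    sample = "Contact: 123-45-6789, card 4111 1111 1111 1111, phone 555-1234."
    for kind, value in scan_text(sample):
        print(kind, value)
```

    The same scan_text function could be run over each archived file in turn; anything it reports still needs a human look, since a nine-digit number in SSN format is not necessarily an SSN.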

  23. Visual Web Mining Toolkit by ejoe · · Score: 1

    If you have access to a MacOS X box, the Anthracite Web Mining Desktop toolkit (http://www.metafy.com/) can do this kind of work for you. It's currently being used by customers on four continents to build daily custom reports from large volumes of web-based data, like the SEC Edgar filings. It's based on a visual user interface that allows non-programmers to quickly and easily create high-value web data processing systems. If you need to automate running a grip of regexen against thousands of webpages daily, you should definitely check it out. It can possibly save you a lot of time; we've got one customer who quickly eliminated two days per month of this kind of labor-intensive work. On FM with great vitality at http://freshmeat.net/projects/anthracite [PS - Yes, I'm definitely biased, I wrote the software ;-)]

  24. I don't envy you by pr10101 · · Score: 1

    I too work at a large university. I don't know if your experience is similar to mine. If it is, then given you're even posing this question, I bet your university cannot formally define what is considered restricted or sensitive data. Some things are easy, like SSN. Some things are not. There are lots of grey areas. There are lots of kinds of data at a university, and there are potentially dozens or more formal audit requirements that might need to be met in some cases, but not others. Sometimes a given "piece" of data is itself not considered restricted, but two or more different non-restricted pieces together are. It gets very complex, depending on how thorough you want to be. And that's just death around a university, where people love to debate the complexities ad nauseam, and no one can or will just say: look, THIS is restricted data. THIS is where we are starting. *I* am making the call because I can, or else because someone has to. If we want to add to this list, or debate the subtleties down the road, fine. But get busy with THIS list NOW.

    And so my first point: how thorough do the powers-that-be want you to be? And is there a definition clear enough to program a computer by that specifies that level of thoroughness? Or when you ask precise questions, do you find it hard to get anyone who says: I am responsible for making the call, and the call is YES|NO, that is|isn't restricted data? Instead, do you get a lot of long-winded talk and vague references to long-winded, say-nothing policies that don't, ultimately, answer your questions about what is or is not restricted data either? Yeah, I thought so. Sorry to hear it.

    Second point: is this an interim damage control task, with the real task of getting a handle on sensitive data going forward already well underway? If not, then you are again on a fool's errand. This task is going to be time-intensive, no matter how you do it, no matter what tool you find or what set of scripts you roll yourself. Why bother, then, putting the horses back into the barn until the gate is fixed? Or probably more aptly, why bother making the horses stand where you wish there was a barn until one is built there? Unless you have, say, transcripts or something sitting on a webserver, time is far better spent on building a barn.

    Third: someone already raised this. It goes hand-in-hand with the above point. If the groups around campus aren't made responsible for how they handle restricted data, it's hopeless. A university environment is generally too chaotic and out of control (I believe the euphemism is "collegial") to manage it any other way. But hey, what's the first thing that will happen when you tell groups they are responsible for handling sensitive data? Yep, you guessed it: what is considered restricted/sensitive? And I bet your university can't answer that. So, yes, I don't envy you. Good luck.