You're missing part of the problem here. The petroleum industry has a long-standing track record of insisting that 10% ethanol-gasoline mixes are safe for use in stock internal combustion engines. This simply isn't the case; there are hordes of documented cases of these blends screwing up engine internals and fuel delivery systems. It's not a matter of having the right parts. In some jurisdictions, it's a legal requirement for filling stations to declare the maximum ethanol content of fuel, but this is by no means a universal requirement.
Like many English terms, the pronunciation differs from the obvious solution derived from a phonetic approach. It's actually pronounced "ayrcondishunin".
BTW, if you want to actually play with some code that relates to our ideas on this stuff, feel free to drop me a line. I wouldn't mind having something new to hack on with someone.
Overall, I like many of the improvements you've suggested here. However, I'm still assuming a much bigger and more unpredictable size for the initial input to the hashing function.
I'm running with the assumption that the plaintext stolen identities database contains, at minimum, the victim's full name, street address, credit card number or bank account number, expiration date, possibly a CCV, and possibly a telephone number. Some entries may have an SSN associated with the identity as well.
As you're no doubt aware, just having a credit card number/exp date and the cardholder's name is of significantly less value than having that plus the billing address, as most merchants will not process a transaction without it. Additionally, if an attacker compromises a merchant system, they're going to be in possession of as much data as possible, including a billing address in a large percentage of cases.
Thus, meaningful (i.e. useful to a thief for most purchase purposes) identity information will be extensive and varied enough to make the database of hashes an unbelievable pain in the ass to brute-force, even without a sprinkling of padding in a pattern known only to the recordkeeper.
I think I need to clarify my proposal a bit more. What I mean is to perform an operation resembling the following:
Uppercase all data and normalize character sets to a least common denominator.
For items that might have minor variations, make a list of the likely variants for each identity. Examples include abbreviations, common misspellings, etc.
Take several key data items, not just a name and credit card number, and combine them into a single string. You can optionally add some secret sauce to this at predefined locations in the string to make things interesting. Hash this.
Run the last step for all the permutations of full identity strings you have for a single identity.
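To make those steps a bit more concrete, here's a rough Python sketch. Everything in it is an assumption on my part rather than a finished design: the field names, the `SECRET_PEPPER` value, the `|` separator, and the helper names are all made up for illustration. I've also swapped the "secret sauce at predefined locations" idea for a keyed HMAC-SHA256, which is a different (and more standard) way of mixing in something only the record keeper knows.

```python
import hashlib
import hmac
import itertools
import unicodedata

# Hypothetical secret known only to the record keeper (stands in for the "secret sauce").
SECRET_PEPPER = b"example-secret-change-me"


def normalize(value: str) -> str:
    """Uppercase, reduce to plain ASCII, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.upper().split())


def identity_hashes(fields: dict[str, list[str]]) -> set[str]:
    """Hash every combination of per-field variants for a single identity."""
    keys = sorted(fields)  # fixed field order so every caller builds the same string
    variant_lists = [sorted({normalize(v) for v in fields[k]}) for k in keys]
    digests = set()
    for combo in itertools.product(*variant_lists):
        message = "|".join(combo).encode("ascii")
        digests.add(hmac.new(SECRET_PEPPER, message, hashlib.sha256).hexdigest())
    return digests
```

Calling `identity_hashes` with two name spellings, two address spellings, and one card number yields four digests, one per full identity string. The count is the product of the variant lists, so it pays to keep each list short.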
In this scheme, I don't care about having anything to use as an index. In order to verify whether I "know about" any given identity, I would run the same process on full identity data given to me by those wishing to check my datastore. To find out if I know about someone's identity, I simply look for all the computed possibilities of the submitted data, discarding most of the input data immediately after the check as it's not needed anymore.
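Here's roughly what that check could look like in Python. Treat it as a sketch under assumptions: `SECRET`, `digest`, `all_digests`, and `is_known` are names I invented, the normalization is reduced to plain uppercasing, and HMAC-SHA256 stands in for whatever hash-plus-padding the record keeper actually picks.

```python
import hashlib
import hmac
import itertools

SECRET = b"example-secret"  # hypothetical key held only by the record keeper


def digest(parts) -> str:
    """Join the normalized data items into one string and hash it."""
    message = "|".join(p.upper() for p in parts).encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def all_digests(variants_per_field) -> set:
    """One digest per combination of the per-field variants."""
    return {digest(combo) for combo in itertools.product(*variants_per_field)}


def is_known(store: set, submitted_variants) -> bool:
    """Run the same process on submitted data and look for any match."""
    candidates = all_digests(submitted_variants)
    # Only digests are compared; the submitted plaintext is not stored anywhere.
    return not store.isdisjoint(candidates)
```

The store is just a set of digests with no index back to a person. A submission matches if any one of its computed variants is already in the set.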
Within a modest infrastructure, this could be implemented to give almost instantaneous search results, even across several million identities, because the workload may be trivially parallelized. Sure, several thousand people's worth of simultaneous checks would require more hardware, but it scales horizontally.
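One way to picture the horizontal scaling is to partition the hashes by their leading bits, so each lookup only touches one partition. The sketch below is purely illustrative: `ShardedStore`, `shard_for`, and `sha256_hex` are hypothetical names, and the in-memory sets stand in for separate machines.

```python
import hashlib


def shard_for(digest_hex: str, shard_count: int) -> int:
    """Pick a shard from the leading 32 bits of a hex digest."""
    return int(digest_hex[:8], 16) % shard_count


class ShardedStore:
    """Each set stands in for one machine holding a slice of the hashes."""

    def __init__(self, shard_count: int) -> None:
        self.shards = [set() for _ in range(shard_count)]

    def add(self, digest_hex: str) -> None:
        self.shards[shard_for(digest_hex, len(self.shards))].add(digest_hex)

    def contains_any(self, candidates) -> bool:
        # Each candidate maps to exactly one shard, so checks are independent.
        return any(
            d in self.shards[shard_for(d, len(self.shards))] for d in candidates
        )


def sha256_hex(text: str) -> str:
    """Plain SHA-256 of a string, used here only to produce sample digests."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Since no lookup depends on any other, adding shards (or running checks for different submissions at the same time) needs no coordination beyond routing each digest to the right place.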
Interestingly enough, I've also worked with credit card encryption schemes, and fully understand everything you wrote regarding the means by which search space is easily reduced. There are ways to mitigate that issue when encrypting such data using standard algorithms, if you're willing to deal with much larger encrypted output than the input size (a simple credit card number).
Although I would contest the math behind your assertion that my method amounts only to salt (even assuming code compromise, I'd strongly recommend you run your numbers again; it's a problem a few zeros bigger than the millions you cited per name), I do really like your addition to the scheme.
I guess what it boils down to is the effort required to break an appreciable number of identities versus the payoff for doing so. I'd imagine this would be a losing proposition for all but an extremely determined adversary whose primary motivation isn't monetary gain.
All systems fall eventually, no matter how much effort is put into the design. Advances in hardware and attack methodologies will always win against a system that isn't continually improved upon.
It shows Microsoft's PR department is as good as ever at taking a nasty situation where they've done something jacked up and spinning it 5,000 MPH to look like something else.
Actually, there are serious laws in place that affect anyone working with privileged access to information, from Confidential through TS and up. Being a private citizen (i.e. contractor) just means we won't get prosecuted under the UCMJ in addition to the various federal statutes he's broken.
Unfortunately for the submitter, even minor digging into his background seems to indicate that he's likely still subject to the UCMJ due to prior Army service.
If I were to present to you an artificial intelligence that behaved indistinguishably from a human being, reacted to stimuli, showed varying levels of consciousness in response to changing conditions, and was actively engaged in expanding its reach into whatever domains it could flow into... what would you call that?
You hit the nail pretty much on the head. There are good reasons DoD personnel go through all those "silly" information awareness courses. Apparently, none of that sank in with this individual.
I hadn't bothered to follow the trail at all, knowing others undoubtedly already have.
You're assuming the attacker would have more knowledge than I am, but not in the way you're thinking. I'm not suggesting the data items be kept as separate entities. A single identity can be represented as a single hashed value. You compute a couple of dozen hash variations based on different combinations of variations in individual data items. You can also pad the final string pre-hash with data known only to the record keeper to make brute-force attacks even more computationally infeasible.
It would seem sensible to take common variations in the information (minor spelling differences for some data, different uppercase/lowercase combinations, abbreviations, etc.), create a database of hashes for all this data, and use one-way hashing to compare submitted information and determine whether you know about it or not.
Let me help you out. You see, his attorney actually went to law school with the judge. His son is dating the DA's daughter, and a generous contribution to his campaign for re-election has recently been made by an anonymous donor.
Given your username, I might not mind seeing that... what sort of appendages does your robot-self have, and how would you be using them on Mr. Securities Fraud?
MySQL has been dual-licensed for a long, long time. The GPL version of MySQL is the copyrighted property of the company. Reference: assigning copyright to another entity when you contribute code.
Funny, I've got an ATT wireless adapter that works just fine with Ubuntu. What are you doing wrong?
Backroom deals with Microsoft.
Posting from a Mac with my secondary display packed with terminal sessions into about a dozen Linux machines.
You failed to address the question of ethanol screwing up engines.
Epic Win
Change We Can Believe In.
Oh, wait.
I don't think this dude's headed back to Security 101, or anywhere with a whole lot of freedom of movement for that matter.
On the upside, you get to meet interesting people in prison.
Speaking as an ex-radioman on subs, yes, this is a huge no-no. He's probably already fired, and should be if he isn't.
Please, tell us how you really feel.
whoosh
That depends entirely on which product actually gets used the most post-fork.