Ask Slashdot: Best Practices For Collecting and Storing User Information?
New submitter isaaccs writes "I'm a mobile developer at a startup. My experience is in building user-facing applications, but in this case, a component of an app I'm building involves observing and collecting certain pieces of user information and then storing them in a web service. This is for purposes of analysis and ultimately functionality, not persistence. This would include some obvious items like names and e-mail addresses, and some less obvious items involving user behavior. We aim to be completely transparent and honest about what it is we're collecting by way of our privacy disclosure. I'm an experienced developer, and I'm aware of a handful of considerations (e.g., the need to hash personal identifiers stored remotely), but I've seen quite a few startups caught with their pants down on security/privacy of what they've collected — and I'd like to avoid it to the degree reasonably possible given we can't afford to hire an expert on the topic. I'm seeking input from the community on best-practices for data collection and the remote storage of personal (not social security numbers, but names and birthdays) information. How would you like information collected about you to be stored? If you could write your own privacy policy, what would it contain? To be clear, I'm not requesting stack or infrastructural recommendations."
Best practice from my perspective: do not collect the data at all.
I think your mind is on the right track in identifying your resource limits (i.e., no tip-of-the-spear experts) and the sensitivity of the data (i.e., it's not all nuclear bomb codes). That is the first step. Next, think about the exact types of data that you're collecting, and try to group like data together (for example, all text data, screen caps, keylogging, audio or webcam video if you have it), and find a way to store them in an efficient structure while everything stays linked together. Finally, if possible, associate all data collection events with time (timestamp) and location (GPS). This will allow a more complete analysis on the back end.
Just don't.
When you get the expertise to store the data securely then consider it.
Once you get into the habit of justifying everything that you store, you will be less prone to the "whoops!" of a plain-text password/username/real-name/credit-card table being found by intruders.
Work bio at MMWD
Let me have a login for the benefit of having my data saved?
If I don't log in then don't store my details.
As for the rest whatever. Hash + salt or whatever?
If no one can reach or use the data for anything, then maybe store just an e-mail address or something similar as the identifier.
I am not an experienced developer, but if I were I sure as shit would not be asking about it on slashdot, to be honest as I stand here now, if I needed some serious advice about any situation, I would not be asking on slashdot, I would be asking on a forum where people live and breathe the topic at hand like their lives depend on it, cause they are professionals, and not the peanut gallery of random trolls, tards, fanbois and neckbeards.
honestly... try not to store it.
You need to examine why you actually need the data, and if you can't think of a good reason (except it might be valuable in the future), then don't store it.
If you do need it for analysis, machine learning apps, etc., try to anonymize it as early as possible, and don't keep raw data longer than you need it (say, raw data for 3 months, then just store aggregate info).
Also, for behavior, you don't need years of information. Studies have shown people change, so make sure the things people did recently count for more and the old stuff gradually decays.
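One way to make old behavior "gradually decay" is an exponential half-life weight. This is only a sketch: the 30-day half-life and the function names are assumptions for illustration, not something the comment specifies.

```python
# Assumed half-life: an event's weight halves every 30 days. Tune for your app.
HALF_LIFE_DAYS = 30.0


def decayed_weight(age_days: float, half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Weight of a single event that happened age_days ago."""
    return 0.5 ** (age_days / half_life_days)


def decayed_score(event_ages_days) -> float:
    """Sum of decayed weights for a list of event ages (in days)."""
    return sum(decayed_weight(age) for age in event_ages_days)
```

With a score like this, only the running total (and the time it was last updated) needs to be kept, so the raw event log can be dropped once it has been folded in.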
Wikipedia (http://en.wikipedia.org/wiki/Personally_identifiable_information) is a good start.
If at all possible, stay away from personally identifiable data. If your aim is to use identity as an index, work out a way to translate an identity into an index or hash value (i.e., one way). This is not going to be perfect on names alone (there will be about a million "John Smith"s out there), but if you have a consistent pair such as name and phone number, turn that into a hash and use it as the data index.
That means you can still do correlations, but a leak will not result in exposure of personal data.
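A minimal sketch of the name-plus-phone index, with one deliberate change: it uses a keyed hash (HMAC-SHA256) rather than a bare hash, because name/phone pairs are guessable enough that an unkeyed hash could be reversed by enumeration. The helper names and the normalization rules are assumptions for illustration, and the secret key is assumed to live somewhere other than the database.

```python
import hashlib
import hmac


def normalize(name: str, phone: str) -> bytes:
    """Canonical form so the same person always maps to the same index."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    clean_name = " ".join(name.lower().split())
    return f"{clean_name}|{digits}".encode("utf-8")


def pseudonym(name: str, phone: str, secret_key: bytes) -> str:
    """One-way index derived from (name, phone) under a secret key."""
    return hmac.new(secret_key, normalize(name, phone), hashlib.sha256).hexdigest()
```

If the database leaks without the key, the indexes cannot be recomputed from a phone book. If the key leaks too, this degrades to the plain-hash case.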
However, first of all, look at what you're holding on personal data and simply assume you got hacked and it's "out there" - plan for that crisis first because there is one question you need to answer:
If you cannot afford to pay for security advice, can you afford to pay for the inevitable consequences?
I have been toying with a site idea. Your account name is your public key fingerprint. Your public nickname is whatever you use in the message. Your login is validated because everything you send is signed with the key that matches the fingerprint (and encrypted with my public key for transmission). Input to the user form is constrained and validated within those constraints (to prevent padding attacks).
I would then have a database "key x","paid through date y".
Sure, I couldn't sell any farmed data a la Facebook, but subpoena requests would be a breeze... "here's your hex dump..."
Innocent people shouldn't be forced to pay for inferior software development.
--"Code Complete" Microsoft Press
The short requirements:
1) Explain what you're collecting in real-time at the moment when you give me the option whether or not to permit you to collect it. Tell me what you will use it for, when you will delete it and the consequences if I don't give it to you. People don't read privacy disclosures. Give notice and ask permission at the moment of proposed collection. Make it opt-in, not opt-out.
2) Only request the information required to perform the service I've requested. Use the information I provide only to provide the service I've requested. Only share the information I provide with third parties to the limited extent necessary to provide the services I've requested. Obtain contractual commitments from those third parties that cause them to protect my information and delete it as soon as they've done what's required to provide the service I've requested. Keep information only as long as necessary to provide the service I've requested and delete it after you've done what's required to provide the service I've requested.
3) Protect my information. Encrypt in transit and at rest. Delete thoroughly and don't give in to the urge to collect and keep information just because it might be useful some time in the future. You can't lose what you don't have.
You say the collection "... is for purposes of analysis and ultimately functionality, not persistence." That seems inconsistent with the collection of name and email address. I can't think of too many use cases where you're collecting my name and email address and don't plan to keep it (and use it for marketing or otherwise share it in some way). If you need to contact me or I need to create a user-id that is my email address, you don't need my name.
Your privacy policy is your contract with your user. It is an operational document that must be consistent with your practices. The privacy policy should be consistent with your policies and procedures. If the information you collect, or the way you handle it changes, you must change your privacy policy.
"The plural of anecdote is not data."
Return email will be sent, if necessary, to whatever address(es) are registered in the public key database for that fingerprint, encrypted with that key.
Obviously I have no control over your passphrase and can do nothing to help you "recover your password" or whatever. Please see your GPG or PGP documentation for a better explanation.
Your account will not be "renewed" past the key expiration date.
...and let your users, investors, and you sleep easier at night. Don't store anything at all except a few prefs.
Python: 'And then suddenly you have a language which says "we're all stuck with whatever the whiniest coder wants".'
"This is for purposes of analysis and ultimately functionality, not persistence."
Store nothing.
Ask people what they want from your service.
Listen to them.
If you can't afford the expert then you can't afford to collect such data. Move away from this project to something you have the ability to do.
kartune85 : Incapable of reason, observation or learning. A kind of dim, drab, flightless parrot.
OWASP has guidance; for instance, here: https://www.owasp.org/index.php/IOS_Developer_Cheat_Sheet#Insecure_Data_Storage_.28M1.29
From https://www.owasp.org/images/5/5e/Mobile_Security_-_Android_and_iOS_-_OWASP_NY_-_Final.pdf
2. Insecure data storage
Solution
Avoid local storage inside the device for sensitive information
If local storage is “required”, encrypt data securely and then store
Use the Crypto APIs provided by Apple and Google
Avoid writing custom crypto code – prone to vulnerability
Take off every 'sig' !!
In the US, we have the National Electrical Code which explains in clear detail how house wiring is constructed.
Following the code is a legal requirement in many (most?) states, but from the point of view of an electrician it's a "book of best practices". Use this gauge wire for this current, staple the wire within 6" of the box, and so on. The code gets revised and added to over time as questions crop up, new technologies get added, and people get more experience.
There's a reason for everything. For example, the light in a bathroom should be on a separate breaker from the outlet next to the sink. It makes sense in retrospect, but this is not something that is obvious beforehand.
It's very detailed, but also very clear. Homeowners routinely understand the instructions and are able to make simple repairs and modifications to their home wiring which conform to the code.
We throw a lot of "best practices" around here as if they were simple and obvious at the outset, but maybe they're not. Hash your passwords, salt the hash, sanitize the form inputs, don't keep CC info... lots of best practices which in hindsight make sense but which aren't necessarily obvious beforehand.
Most web apps have common requirements for login, identity management, privacy, various forms of functionality, and so on.
Should we have a "book of best practices"?
Aggregate the data as quickly as possible to anonymize it.
Collect "Mary did X, Y but not Z", but aggregate it to "three people did X, two did Y and TWELVE did Z" and drop Mary from the data. You don't need to know Mary did anything.
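A small sketch of that aggregation step, assuming events arrive as (user, action) pairs; the function name and input shape are made up for illustration.

```python
from collections import Counter


def aggregate(events):
    """Count distinct users per action, then forget who the users were.

    events: iterable of (user, action) pairs.
    """
    distinct = set(events)  # a user counts once per action
    return Counter(action for _user, action in distinct)
```

Only the returned counts would be stored; the per-user events are discarded once the batch is processed. Very small counts can still point at an individual, so it may be worth suppressing buckets below some minimum size.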
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
You say you aren't interested in persistence, so I don't see any reason why the data needs to be personally identifiable. Whether your index is John Smith in Albany,NY or User #71829382 doesn't matter for usage analytics. Even demographic information can at least be stripped of things like name and phone number.
If you REALLY need to tie this information to a particular instance, then use a hardware key from the mobile device and not a user's information. A hacked phone is easier to deal with than identity theft.
As someone else mentioned, work from the assumption that anything you save will end up being hacked and used for nefarious purposes. Make the data as useless as possible to a hacker and THEN design the systems and storage to be as hackproof as you can.
This site provides summaries of the terms-of-service policies for various companies covering privacy, retention, and use of user information. You can use it to compare your plans with those of major companies and identify privacy or TOS concerns you may have overlooked.
+5 Funny
I lag
Doing this means that you will really respect the privacy of anyone using your software since they would have the source code to do as they wish.
https://en.wikipedia.org/wiki/AGPLv3
Give your users Freedom and they'll respect you.
Be very careful indeed what you really need, and collect only that. The less data you collect the less you have to worry about.
Note that the easy cop-out is to stick someone else with the trouble, like "supporting" facebook logins or something, but that's actually worse. The why is left as an exercise, but rest assured that if you do that I'm certainly never going to sign up with you, just like I won't be signing up with facebook, or google accounts, or any of the others you might be "supporting".
Cut out the middle man. Just put it straight onto an FBI laptop.
My car insurance company needs to be able to pull my DMV records, perhaps even periodically. They could retain *none* of that information and ask me to visit a web site periodically where the info gets entered so they can do the query (and then forget the information required to perform the query). Most customers wouldn't mind them holding that information; but if I'm *that* security minded and they make it clear to me that I'll have to hit their site once a month to maintain my insurance... well... There are always trade-offs, aren't there?
Just ask yourself, "what do I need in order to serve my customers?". Yes. REAL customer service. People doing good things for other people, and getting paid for it as opposed to just herding people like cattle and exploiting them. Yeah, I know. Strange concept.
That can be a very successful business model though. Zappos is said to be very customer focused, and AFAIK they are very successful. I can't say I care much for their work environment; but I believe that's a separate issue from customer service. I mean, do you really need to have conga lines and party with your co-workers after hours to be spot-on with a customer? I don't think so; but I just can't think of a good counter-example off the top of my head.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
Consider if your service is intended for only one country or several (a global service).
Regulations on user information/user data are VERY DIFFERENT in different parts of the world.
There are quite a few countries where parts of the population have first-hand experience of the severe consequences of wrongfully used user information like
-Friends
-Location
-Behaviour
-Etc
In some of these countries data collection is outright forbidden or the user data can never leave the country.
Not for any longer than necessary. Likely I would make that opt-in.
I would have a payment history (bob paid x dollars for y time) as an atomic event. Bob could check a box to say "remember this for me", or not at the time of payment.
At the time of payment I would also send Bob a receipt. That receipt would say "Bob paid for a service". The receipt would also contain a dot-splash (e.g. a QR code or a linear or 2D barcode, depending on how much info space I turn out to need) that was the "proper join record for the database" (e.g. the key tuple that proved that payment X was for service Y on date Z). That tuple would be encrypted with _my_ secret key. Bob could use this receipt by sending it back to me, but I would only have that record until the payment cleared and was essentially irreversible, or when Bob sent it back via email or phone scan etc.
The actual membership information that Key X was paid-up-and-valid until Date Y would be a separate entry.
Think double-entry accounting, but where the account holder and not the institution has the journal that collates things.
The entry would have no start date, and a person could buy any amount of time. That is necessary because the key the customer made is only valid until it expires, and that expiration date is chosen to the second by the key creator.
There is some ability to back-figure the expiration dates to the purchases and so the purchasers while both sets of data are present, so the user would have the option to "randomize the duration", e.g. for gambling a little of the funds paid they would gain or be shorted a random amount of time within a reasonable percentage of the purchase duration.
The idea is that, at every chance, you give the user the magic cookie to join the information, but you keep the results. As long as the cookie is cryptographically secured it doesn't matter that they are holding it.
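A rough sketch of such a user-held "magic cookie", under stated assumptions. The comment describes encrypting the join tuple with the site's secret key; this example swaps in an HMAC tag instead, so the tuple is readable by the user but cannot be forged or altered. The field names and both function names are hypothetical.

```python
import hashlib
import hmac
import json


def issue_receipt(key_fingerprint: str, service: str, paid_on: str,
                  site_secret: bytes) -> str:
    """Build a receipt the user keeps; the site need not store the join record."""
    payload = json.dumps(
        {"key": key_fingerprint, "service": service, "date": paid_on},
        sort_keys=True,
    )
    tag = hmac.new(site_secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return payload + "." + tag


def verify_receipt(receipt: str, site_secret: bytes):
    """Return the join tuple if the receipt is genuine, otherwise None."""
    payload, _, tag = receipt.rpartition(".")
    expected = hmac.new(site_secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        return None
    return json.loads(payload)
```

The site only needs its secret to check a returned receipt, which is what lets it discard its own copy of the payment-to-key link. Hiding the tuple's contents from third parties would need real encryption from a vetted library on top of this.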
It wouldn't be that hard to figure out who paid what and when while the user base is just starting out, but as the base and the transactions mounted, the anonymity of payments would increase.
So imagine you want to buy a year, and your public key is good for at least a year, you could buy a year as one transaction, or cut it up into several transactions (like 2 and 3 and 7 months each) to get the year, or you could buy eleven months and bet a month hoping to go long not short. Without the record that you get into your exclusive custody, there is no good way to ask the site how 12 months ended up on that key from which purchaser.
If you invalidate your key, you get no money back. If you lose your receipt you have nobody to blame but yourself. That's the risk you take for your privacy. It's basically using an information system hole to make things same-as-cash.
I haven't figured out how to deal with credit card "charge-backs" or fraudulent disputes. I'd rather take the gift-card route for payment if it came to that kind of problem.
You could, I suppose, put people who paid via revocable means (like credit cards) in "risk pools" and if someone games, you penalize the pool but let people out of the pool using their receipts as proof that they are not the scammer. As each person used their receipts to change pools, the pool would get smaller but each member would lose more, until only the scammed account and people who didn't care or lost their receipt would lose anything.
The idea started out as a social media/blog/rant site idea more than a profit-oriented thing, but I could make it work pretty easily. The "business rules" for an anonymized service seem totally workable, but the anonymous people would have to accept some of the risk for the privacy.
People who opt in to having _me_ keep the payment records are, of course, buying the surety of service for the loss of anonymity, at least in part.
And the un-paid people are much less work (e.g. none) to track.
And a spammer would lose everything for spamming, as the key and all its content would be forfeit in a single "hide/delete where" equivalent action. So rather than make fake accounts on my system they would be "paying" CPU to make keys and encrypt and sign their transmissions to me. Not impossible to script, but pretty hard to JavaScript.
There would be other little niceties.
Aggressive use of POST instead of GET messages on all forms, so that pen/trap requirements, if levied, would be largely moot, as in: user XXXXXX did POST to "/" at this site on these dates and times. [POST data is not legal to collect under pen/trap orders in the USA, as I understand the law.]
Services a site could sell? POST the URL you want as part of the encrypted blob you send to this site; we will retrieve it, scrub it and send its content back to you encrypted with your key.
Pay for encrypted, advertisement-free page delivery with/without the unpaid people's noise at your leisure. 8-)
Encrypted mailbox where the records in the mailbox are encrypted to your public key the instant we get them if the "From" matches particular criteria you specify. (This burns time off your subscription key expiration date etc., so you might not want to encrypt "from *".)
Note that this is not a bar to law enforcement if they show up with a court order to "tap" a particular key going forward. It is a barrier to having law enforcement fish into your past. I am not a lawyer so I don't know if this last bit is legal, it's just the noise floating in my head.
Of course such a site would have no way of knowing whether the "identity" information in the key, if any, was real, so just as I could make a key that said I was both Mittens and B.H.O. today, anybody would be foolish to assume that an unsigned and unverified key was anybody it claimed to represent.
In short the site design is not to confound the law, but to make the entire issue of identity "Somebody Else's Problem" since I want to be in the business of passing messages for fun and profit, not being the arbiter of who is whom.
(you should see my thoughts on replacing DNS... 8-)
Only collect what is absolutely necessary at that particular junction; afterwards move that data to a different datastore where only legally required information is stored (e.g. for tax purposes) and everything else is omitted. Encrypt it. Don't store IPs longer than 30 days. Don't set permanent cookies.
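The 30-day rule for IPs could be enforced by a periodic job along these lines. This is a sketch only: the record layout (a dict with "ip" and a timezone-aware "seen_at") and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

IP_RETENTION = timedelta(days=30)  # retention window taken from the advice above


def purge_old_ips(records, now=None):
    """Return copies of the records with the IP blanked once it is past retention."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - IP_RETENTION
    cleaned = []
    for rec in records:
        if rec["seen_at"] < cutoff:
            rec = {**rec, "ip": None}  # keep the row, drop the identifying field
        cleaned.append(rec)
    return cleaned
```

In a real datastore the same idea would be an UPDATE or DELETE keyed on the timestamp column, run on a schedule, with backups and logs covered by the same window.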
/dev/null
My ism, it's full of beliefs.
From the article,
I'm a mobile developer at a startup. My experience is in building user-facing applications, but in this case, a component of an app I'm building involves observing and collecting certain pieces of user information and then storing them in a web service. This is for purposes of analysis and ultimately functionality, not persistence. This would include some obvious items like names and e-mail addresses, and some less obvious items involving user behavior.
If the intended reason is as stated, then why store the names and email addresses at all? Analysis of user behaviour in the aggregate does not require that individually identifiable information be collected, much less stored.
It explains how to store personal information so it can be used correctly. http://wayner.org/node/46
Either you're truly concerned about what parts of the interface/product/service they use the most and how, or you're collecting sales and marketing data so you can trim the fat and rape people. If the first is the case, just put a counter and/or timer on everything so every time it is accessed, clicked, etc., it is counted and you can see the amounts of time they are spending doing what without ever collecting any personal data at all. That would be completely anonymous and give you all the data you need to build a better interface. If the latter is the case go jump in a river from a very high bridge.
On passwords, I liked Jeff Atwood's article, `You're Probably Storing Passwords Incorrectly'.
For Personally Identifiable Information (PII), I liked Brian Danger Graham's article, `What's in a name database?'.
-rozzin.
If your company goes bankrupt, or is sold to another, all its assets become the property of someone else. That someone cannot be constrained to respect anything you have promised. You may not even have the opportunity to wipe disks or change passwords.
For example, a hospital failed to pay the rent on a warehouse storing patient records. The landlord seized and sold those records as scrap. None of the hospital's patient privacy obligations transfer to the landlord, or to the scrap dealer.
Heed the advice of others who told you don't do it.
"How would you like information collected about you to be stored?" ::Must resist unhelpful comment::
Obviously not the solution for everybody. We write apps for Android tablets (for old people actually). All the data like name, email, pictures, and messages are stored on the Android tablet and kept in the Cloud only until they are downloaded. They are encrypted, even the pictures, while waiting in the Cloud database. In the registration part of the app the user does type in his email, but we do not keep it. How to contact the user? We put a record in a table which is checked periodically by an active user's Android and they can get the message. Payment is tricky since the PayPal record contains the email of the payer and the AndroidId of the user's tablet. It's just a matter of throwing away data that does not belong to me!
Although you state you're not looking for stack or infrastructure recommendations, I'd still recommend having a look at Google Mobile Analytics. They have an SDK for Android and iOS that makes it very easy to integrate in your apps.
Let us know the name of this startup, so that we may avoid it like the plague.
I think everybody would agree that the data should be encrypted, but often the problem with encryption is access to the data. If the server-side application stores the encryption key, this key could potentially be found (maybe through a vulnerability) and thus give access to the entire database.
Best practice is to encrypt each record with a unique key. This key could be generated by some unique identifiers per user like Visible User ID (maybe E-mail address) and Password and Hidden User ID (different from the visible and generated independently from it) and Android ID.
To create the database entry:
1. Collect the information to store
2. Key = Hash(Visible User ID, Password, Hidden User ID, Android ID)
3. Send Visible User ID and Key to a receive-only system with as little an Internet Surface as possible (i.e. one that is next to impossible to hack into, if done correctly) -- This information is used to retrieve the user data for analysis and such
4. Store the information in the more accessible database encrypted with the key
To retrieve the data from a user's application:
1. Collect Visible User ID, Password, Hidden User ID and Android ID
2. Key = Hash(Visible User ID, Password, Hidden User ID, Android ID)
3. Use the key to retrieve the necessary data
To retrieve the data from the inside:
1. Use the Visible User ID to Key mapping (held on the receive-only system) to decrypt the data
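The "Key = Hash(...)" step above could be sketched as follows, with two assumptions layered on: the fields are length-prefixed so that ("ab", "c") and ("a", "bc") cannot collide, and a deliberately slow key-derivation function (PBKDF2 from Python's standard library) stands in for a bare hash, since the password is the only real secret among the inputs. The function names and the iteration count are illustrative only.

```python
import hashlib


def _pack(*fields: str) -> bytes:
    """Length-prefix each field so field boundaries are unambiguous."""
    out = b""
    for field in fields:
        raw = field.encode("utf-8")
        out += len(raw).to_bytes(4, "big") + raw
    return out


def derive_record_key(visible_id: str, password: str, hidden_id: str,
                      android_id: str, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte per-user key from the four identifiers."""
    salt = _pack(visible_id, hidden_id, android_id)
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )
```

The derived key would then be handed to an authenticated cipher from a vetted library (not shown here), never to homemade encryption. Note that a password change alters the key, so stored records would need re-encrypting.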
Don't do anything special, just store it in a database.
Try to not get it compromised, but if it gets compromised, well, who cares?
Analyze data on a nightly basis. Store the results. Scrub database after results are stored. The asshole MBA that your startup hires because it isn't making enough money then has nothing to turn around and sell for a quick buck.
If you have to store *anything at all*, hire the expert. Can't hire the expert? Your startup is inadequately funded.
Never underestimate the power of stupid people in large groups.
Just store it on an FTP site somewhere with a secret IP address no one will find it, I promise.
Disclaimer: I work in the field, but do not have nearly enough information on your particular situation, jurisdiction, etc to provide detailed recommendations. What follows is basic best practice stuff based on my jurisdiction and market sector.
* First, any sensitive information you are collecting, ask if you really REALLY REALLY need it. This stuff is toxic waste. Your first and best defense is not to store it if you don't need it.
* A hash of something like an SSN, telephone number, etc. is worthless in terms of protecting you. Hashes are only useful if the search space is large enough to make a full search of the space computationally infeasible. 1 billion SHAs is not computationally infeasible. Also, hashes are typically only useful if what you want to do is compare two values, e.g. passwords. If you're trying to anonymize, hashing a PII (personally identifiable information) element doesn't anonymize the data, as it doesn't break the PII link.
* DON'T WRITE YOUR OWN ENCRYPTION. EVER. Unless you have a deep deep background in crypto and submit your alg for peer review for years before using it, just don't.
* Consult a good lawyer. There can be pits in here that you might not think of, particularly if you don't have a security dept with someone who spends their time dealing with privacy issues. A good lawyer won't say "You can't do that"; a good lawyer will outline the risks that you will be running and let you accept them, just like a good risk management dept will.
* Use the security controls in your database. If your client doesn't need to access the hashes because they're being computed by a stored procedure, then the database user your client connects as shouldn't have access to the hashes. Same goes for salts, only more so. I've seen too many apps written using one user for everything. Don't do this.
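To illustrate the point in the list above about hashing values from a small search space, here is a toy demonstration. It enumerates every fixed-width numeric string until the hash matches; the function names are made up, and a 4-digit space is the suggested demo size so it finishes instantly.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("ascii")).hexdigest()


def recover_by_enumeration(target_hash: str, digits: int):
    """Try every zero-padded number of the given width until the hash matches."""
    for n in range(10 ** digits):
        candidate = str(n).zfill(digits)
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None
```

A nine-digit identifier is the same loop with 10**9 candidates, which is a modest amount of work for commodity hardware. A per-record salt does not help here, because the attacker simply repeats the loop for each record.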
Hope some of that helps you.
Min
On the whole, I find that I prefer Slashdot posts to twitter ones because I don't get limited to 140 chars before
Just go ahead and store it in an unencrypted excel on Amazon S3, and save the bad guys the 5 minutes it's going to take to break through whatever well-meaning safeguards you put in place.
Here in Quebec, the notary actually reads the entire document to you and asks you enough questions that he is sure you've understood it.