I work for a university doing research on drug and alcohol abuse (yeah, I know, really, really exciting). One of the projects we are currently working on is finding people who have gone through substance abuse treatment programs and signed one hundred million forms allowing us to interview them repeatedly and then look for them in other databases. We are doing this to determine the "reliability" of self-report (i.e. did they tell us the truth when they said they didn't get arrested anytime in the past year?) and to determine "cost savings" to the state.

This means that I have 500 drug users who have given me their name, social security number, date of birth, gender (which seems to be the field most often incorrect), and ethnicity. With this information I have the enviable task of "finding" them in medical records, arrest records, welfare records, and death records (for those who fail to show up for their follow-up surveys). None of these databases have the same information. Medical records have either social security number (ssn) or insurance card number in the ssn field, but no names. Arrest records have names and ssns, but several (i.e. 80%) have multiple names and ssns for the same individual. Welfare refuses to give us the information, so we have to give them our people with a bunch of bogus people thrown in (this, according to theory, protects the identities of the true drug users) and let them do the matches for us. The big problem is that they only take exact matches based on ssn. If it doesn't match, we don't get them.

I link databases using multiple variables (anything that shows up in both databases) and weight the matches based on score. If all fields match, I'm done. If only some fields match, I have to start checking to see what's different. If some are "typo" matches (off-by-one errors), then I have to hand check each of those for accuracy. I do this because it's part of a research grant that is interesting to me: frustrating, but interesting.
The people doing background checks and the like do it because it's a job; they aren't going to hand check anything. If anything matches, I'm sure they take it. For example, I recently found out that, according to the credit agencies, I've become a 55-year-old woman with a bad credit history. We have the same month and day of birth, and the same first name. I got cheated somewhere: if I'm that old I should get to be a grandma by now, not the mother of a kindergartener. Sigh....
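The weighted, multi-field matching described above might look something like the rough sketch below. Everything specific in it is an assumption for illustration: the field weights, the half credit for a "typo" match, the definition of a typo as a single differing character, and the made-up records are not from the post.

```python
from typing import Dict, List, Tuple

# Hypothetical weights; the post does not say how the fields are weighted.
WEIGHTS = {"ssn": 5.0, "name": 3.0, "dob": 3.0, "gender": 1.0, "ethnicity": 1.0}


def is_typo_match(a: str, b: str) -> bool:
    """True when two equal-length values differ in exactly one character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def score_pair(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> Tuple[float, List[str]]:
    """Score two records on the fields present in both of them.

    Returns (score between 0 and 1, fields that need a hand check).
    """
    shared = [f for f in WEIGHTS if rec_a.get(f) and rec_b.get(f)]
    total = sum(WEIGHTS[f] for f in shared)
    score = 0.0
    hand_check: List[str] = []
    for field in shared:
        a = rec_a[field].strip().lower()
        b = rec_b[field].strip().lower()
        if a == b:
            score += WEIGHTS[field]        # exact match: full weight
        elif is_typo_match(a, b):
            score += WEIGHTS[field] / 2    # assumed half credit for a typo match
            hand_check.append(field)       # flag for manual review
    return (score / total if total else 0.0), hand_check


# Made-up records: the arrest file has no gender field and one wrong ssn digit.
survey = {"name": "Jane Doe", "ssn": "123456789", "dob": "1970-01-31", "gender": "F"}
arrest = {"name": "jane doe", "ssn": "123456780", "dob": "1970-01-31"}

score, hand_check = score_pair(survey, arrest)
print(round(score, 2), hand_check)
```

A score of 1.0 with nothing flagged is the "all fields match, I'm done" case; anything flagged is a pair that still gets looked at by a person. An exact-match-on-ssn-only approach would have dropped the pair above entirely.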
Sorry for the length. Happened to go on a rant.
*Lies, Damn Lies, & Statistics*
If someone is going to bring up the issue of how to abuse statistics to make a point, I thought I'd add the following. (I actually make my living as a statistician - sorry about that.) On the topic of putting keywords into a search on AltaVista: Statistics come up in 9,692,788 webpages, Facts come up in 4,352,390, Lies come up in 1,197,980, and Damn Lies come up in only 1,445. Wow, how about that. That means that there are more than 11 times as many facts and statistics on the web as there are lies and damn lies. Woowee! Now, if I wanted to be nice I'd stop there, but I'm in a particularly ornery mood, so I'll add to the pain by letting you know that I did the same searches on Yahoo, MSN (sorry, but it is out there), Webcrawler, and Infoseek. Assuming that there is overlap between the search engines (hey, they can't all have the same 10% of the web covered, can they?), I took the average values and arrived at the following: Statistics: 2,899,230; Facts: 1,809,899; Lies: 559,362; and Damn Lies: 1,081. Guess what! We still have more facts and statistics than we do lies and damn lies on the web. God bless the goodness of the web. I could be really annoying and give you the tests of means that I performed to check for statistical significance, but I think I'm pushing it as it is. And just remember, I did this for the good of my own profession.
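For anyone who wants to check the arithmetic, here is a quick sketch using only the counts quoted above; the variable and function names are made up for the example.

```python
# Hit counts as quoted in the post: AltaVista alone, then the five-engine averages.
altavista = {"statistics": 9_692_788, "facts": 4_352_390, "lies": 1_197_980, "damn lies": 1_445}
average = {"statistics": 2_899_230, "facts": 1_809_899, "lies": 559_362, "damn lies": 1_081}


def facts_to_lies_ratio(counts: dict) -> float:
    """(statistics + facts) divided by (lies + damn lies)."""
    return (counts["statistics"] + counts["facts"]) / (counts["lies"] + counts["damn lies"])


print(f"AltaVista: {facts_to_lies_ratio(altavista):.2f}")  # 11.71
print(f"Average:   {facts_to_lies_ratio(average):.2f}")    # 8.40
```

So the "more than 11 times" figure holds for AltaVista, and the averaged counts give a ratio of about 8.4.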
"There are three kinds of lies: lies, damn lies, and statistics."