US Census Bureau Offers Public API For Data Apps
Nerval's Lobster writes "For any software developers with an urge to play around with demographic or socio-economic data: the U.S. Census Bureau has launched an API for Web and mobile apps that can slice that statistical information in all sorts of nifty ways. The API draws data from two sets: the 2010 Census (statistics include population, age, sex, and race) and the 2006-2010 American Community Survey (offers information on education, income, occupation, commuting, and more). In theory, developers could use those datasets to analyze housing prices for a particular neighborhood, or gain insights into a city's employment cycles. The APIs include no information that could identify an individual."
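For anyone who wants to poke at it, here is a rough sketch of how a query URL for such an API might be assembled. The base URL, the dataset path "2010/dec/sf1", the variable codes and the helper name build_census_url are all assumptions that do not appear in the summary, so check the Bureau's developer documentation before relying on any of them. The sketch only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed base URL; it is not given in the summary above.
BASE_URL = "https://api.census.gov/data"


def build_census_url(dataset, variables, geography, api_key=None):
    """Return a query URL for one dataset (hypothetical helper)."""
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        params["key"] = api_key
    # Keep ':', '*' and ',' readable instead of percent-encoding them.
    return f"{BASE_URL}/{dataset}?{urlencode(params, safe=':*,')}"


# Assumed dataset path and variable codes, used for illustration only.
url = build_census_url("2010/dec/sf1", ["P001001", "NAME"], "state:*")
print(url)
```

Passing api_key appends a key parameter, on the assumption that the service expects one.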
That the government is watching and monitoring us. Now we shall watch them watching us!
The gov can use this data for themselves in the campaign. With demographic info you can finally manage your campaign more effectively.
That's just what I was thinking. Whether or not the data contains any personal information is the real question because if there is any then someone will figure out how to get at it.
OK, if you say so. BTW - what is the exploit for getting all the personal information out of the IRS database? Or the SSA? Or banks? Or insurance companies? Or credit card companies?
Believe it or not, many databases containing sensitive information are designed and run by professionals, using professional software and good security practices. Not all databases are run on MySQL administered by some high-school kid.
For those who don't get it,
(statistics include population, age, sex, and race)
(offers information on education, income, occupation, commuting, and more).
Treat it as a multidimensional data source. So you figure out who someone is using perhaps 6 factors, then you've got the unknown data for the other 1315 data points.
I almost got in quite a bit of trouble at a previous employer by pointing out that a publicly distributed, incredibly detailed analysis of an "anonymous" corporate employee attitude survey meant it was completely, 100% non-anonymous. So... 100% of 25 year old engineers who are white single males who drive a red car and have an Irish girlfriend and live in an apartment and commute to work between 4 and 8 miles and have a five digit /. UID responded that their boss was a 5/10 at leadership, or whatever. Sure... that's perfectly anonymous.
It wasn't quite that ridiculous but pretty darn close. As I recall they "de-anonymized" it by providing 5-year age brackets and 1-year (yikes) hiring date brackets, and job titles. It was quite sufficient to identify the exact responses of each person. The funny part was once the word got out employees would read the responses of other people... oh so Rachel in purchasing said that her boss was a complete... You get the idea.
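To make the bracket problem concrete, here is a minimal sketch that counts how many respondents share each combination of age bracket, hire year and job title. The rows are invented for illustration, and unique_profiles is a hypothetical helper, not anything the employer actually ran.

```python
from collections import Counter

# Hypothetical survey rows: (age bracket, hire year, job title, boss rating).
responses = [
    ("25-29", 2004, "engineer", 5),
    ("25-29", 2004, "buyer", 2),
    ("30-34", 2001, "engineer", 7),
    ("30-34", 2001, "engineer", 8),
    ("40-44", 1995, "manager", 9),
]


def unique_profiles(rows):
    """Return rows whose first three fields occur exactly once in the data."""
    counts = Counter(row[:3] for row in rows)
    return [row for row in rows if counts[row[:3]] == 1]


exposed = unique_profiles(responses)
print(f"{len(exposed)} of {len(responses)} responses map to one person")
```

Any row returned here belongs to someone who is the only person with that profile, so the rating on that row is effectively signed.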
Frankly I was more insulted that they thought we were stupid enough not to understand they were lying despite giving us complete evidence, than I was insulted that they lied to us by calling it anonymous. They had no shortage of suckiness.
They were even stupid enough to pretend it was anonymous and run it year after year, at least until I left. Needless to say everyone lied like a carpet after the first debacle.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
census data has been public all along
before now you had to go to washington and look it up yourself
now it's easier to get at
all of this data has always been available to those who ask for it
they have just made it easier for people to get at it
what is your complaint again?
I always lied on company-given anonymous surveys. What are they going to do? Call you on it?
Peter predicted that you would "deliberately forget" creation 2000 years ago...
census data has been public all along
before now you had to go to washington and look it up yourself
now it's easier to get at
No, it's been online for years. There just hasn't been a good, uniform way to query it and write apps against it.
That would be the "they have just made it easier for people to get at it" part
Making it easier to mush databases together to gather inappropriate levels of personal data.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
That's not how Census information is either collected or stored. First off, there are two different data sources at issue - the decennial census, which gathers a very limited set of information on (theoretically) every person in the country, and the American Community Survey, which uses sampling to get estimates on a much wider range of information. You cannot link those two datasets, since the only public factors they share are far too broad - e.g., age, race, sex, etc., and the time periods during which they are conducted are totally different.
Besides, the information is not released at person-level. The lowest level you can get sampled information at (e.g., the detailed ACS stuff) is the "block group", which on average contains 39 blocks. You can get decennial census information at the block level, and a "block" may correspond to a city block, or a much larger area for lesser-populated areas.
So, you can find some interesting information about your city street (I've looked up my own, and found the number of people living alone, owning/renting, age, sex, etc. for the 24 houses on my block), but these data are not per person, they are per block - in other words, if there is only one Native American living on my street, I cannot then find out whether they are owning/renting. I can only find out the number of renters on the entire block.
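A small sketch of why separate per-block totals do not pin down any one person: two different sets of hypothetical residents produce exactly the same totals. This assumes only one-way totals (by race, by tenure) are published for the block. The records and the helper published_totals are made up for illustration.

```python
from collections import Counter

# Two hypothetical person-level versions of the same block: (race, tenure).
block_a = [("native", "rent"), ("white", "own"), ("white", "own"), ("white", "rent")]
block_b = [("native", "own"), ("white", "own"), ("white", "rent"), ("white", "rent")]


def published_totals(people):
    """Collapse person records into separate totals by race and by tenure."""
    by_race = Counter(race for race, _ in people)
    by_tenure = Counter(tenure for _, tenure in people)
    return by_race, by_tenure


# Both blocks have one "native" resident and two renters overall,
# yet that resident rents in one version and owns in the other.
print(published_totals(block_a) == published_totals(block_b))
```

Since both versions collapse to identical totals, the totals alone cannot say which one is real.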
"Anyone who [rips a CD] is probably engaging in copyright infringement." - David O. Carson
BTW - what is the exploit for getting all the personal information out of the IRS database? Or the SSA? Or banks? Or insurance companies? Or credit card companies?
None of those have public-facing APIs, smartass.
Really? I could have sworn I could file my taxes and see my tax refund status online. And I am pretty sure I can see all my banking activity online (and download it to financial software). Same with credit cards. And those services are every bit as much of an API as the http requests that the census API uses are.
Or load up the shapefiles posted at census.gov, seeing as census data has been available online for at least a few years now. As the summary said, this made it easier, but in the past couple of years there has also been an explosion in free and open source GIS tools that translate the raw data into something more readable.
open source modern art: laser taggi
I recently pulled the census data and it's pretty much useless since any information you could use to look at results by city or region has been stripped out in the version available to the general public. Sucks.
"God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
I recently pulled the census data and it's pretty much useless since any information you could use to look at results by city or region has been stripped out in the version available to the general public. Sucks.
What else did you expect when you privatize data collected using public funds? http://corporate.ancestry.com/press/press-releases/2006/06/ancestry.com-digitizes-entire-u.s.-federal-census-collection-from-1790-1930/.
See also http://www.archives.gov/digitization/digitized-by-partners.html
Granted that this information has already been available on request, but maybe these new tools will make it easier for watchdog groups to crunch the census numbers themselves and act as a check on gerrymandering politicians, exposing redistricting plans that don't seem to square with the census data.
That paper points out that three factors - DOB, place, and gender - are often enough to uniquely identify a person. How is that relevant to Census Summary File information, or ACS information?
I think the point that many people misunderstand is that Census/ACS public information is not a database where each row represents one response, and some data items have been withheld. It's not at all like that - it's aggregate totals for geographic areas of varying sizes. That row-by-row information is not made public until 72 years after the census (remember the news about info from the 1940 census being made public?).
The real issue that paper highlights is the state-level legislative mandates regarding information collection. I'm not denying you could link a voter registration list to state-collected health data, like the author does in that paper, but that fact has nothing to do with the data made accessible by this API (not to mention these data were already online, just in a more complicated format).
"Anyone who [rips a CD] is probably engaging in copyright infringement." - David O. Carson
I don't know what specifically you tried to do, but there is a lot of data available down to the block group and block level, which are relatively small geographic units. There's even more data available by "place", which would include any major city and many smaller cities and towns. Some of the tax data is redacted for confidentiality (e.g., when there is only one employer of a certain type in a geographic area, they won't release payroll information for it), but that's pretty unusual in larger areas.
You may have been using one of the user-friendly tools, which can be limited in their reach. American FactFinder has more depth than most, but it's also kind of a PITA. If you're serious about digging into the data, you can download zipped text files that represent the full extent of the public information available, which you can then load into your favorite processing program.
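As a rough idea of what "load into your favorite processing program" can look like, here is a sketch that reads one delimited text file out of a zip archive. The member name, delimiter, encoding and column layout are assumptions, since the actual file layout is not described here, and read_zipped_rows is a hypothetical helper. The archive is built in memory so the sketch runs without a download.

```python
import csv
import io
import zipfile


def read_zipped_rows(zip_source, member, delimiter=","):
    """Yield rows from one delimited text file inside a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Assumed encoding; real files may need a different one.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.reader(text, delimiter=delimiter)


# Build a tiny in-memory archive with made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", "block,population\n1001,24\n1002,31\n")
buffer.seek(0)

rows = list(read_zipped_rows(buffer, "sample.txt"))
print(rows)
```

For a real download, pass the path of the zip file in place of the in-memory buffer.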
"Anyone who [rips a CD] is probably engaging in copyright infringement." - David O. Carson
The census bureau has requirements for the sizes and regularity of groups to avoid inference. With special tabulations, they do pre- and post-processing that manipulates the input and output to scrub inference vectors. They know what they are doing.
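A minimal sketch of the simplest kind of size rule: withhold any cell whose group is below a minimum size. The threshold and the table are hypothetical, and the Bureau's actual rules are not spelled out in this thread, so treat this as an illustration of the idea only.

```python
# Hypothetical threshold; the Bureau's actual rules are not given in the thread.
MIN_GROUP_SIZE = 3


def suppress_small_cells(table, min_size=MIN_GROUP_SIZE):
    """Replace counts below min_size with None so they are not published."""
    return {cell: (n if n >= min_size else None) for cell, n in table.items()}


# Made-up counts of employers by type in one area.
employers_by_type = {"grocery": 12, "hardware": 4, "shipyard": 1}
released = suppress_small_cells(employers_by_type)
print(released)
```

A threshold on its own is not enough when totals are also published, because a withheld cell can be recovered by subtraction, which is presumably part of what the extra pre- and post-processing is for.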
Whether or not the data contains any personal information is the real question because if there is any then someone will figure out how to get at it.
Even when personal information "has been stripped" it can be rediscovered in various ways.
For instance: At the start of WWII, US authorities used census data to round up people of Japanese descent. They didn't have the individuals' names. But they had the number on each block. So they just raided until they had accumulated that number.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way