Ask Carl Malamud About Shedding Light On Government Data
If you've ever tried to look up public records online, you may have run into byzantine sign-up procedures, proprietary formats, charges just to view what are ostensibly public documents, and generally the sense that you're in a snooty library with closed stacks. Carl Malamud of Public.Resource.Org has for years been forging a path through the grey goo of U.S. government data, helping to publicize the need for accessible digital archives — not just awkward, fee-per-page access. (Mother Jones calls him a "badass.") Malamud has (with help) been making it easier to get to the huge swathes of data in government sources like PACER, EDGAR, and the U.S. Patent Office. He's got a new initiative now to establish a "Federal Scanning Commission," the task of which would be to assess the scope and outcomes of a large-scale effort to actually digitize and make available online as much as practical of the vast holdings of the U.S. government. ("If we were able to put a man on the moon, why can't we launch the Library of Congress into cyberspace?") Ask Malamud below questions about his plans and challenges in disseminating public information. (But please, post unrelated questions separately, lest ye be modded down.)
Government Warning: Exposing the government to scrutiny can result in rape charges.
What changed under Obama? Nothing Good
(I'm guessing "yes") If you are, what do you think about the work they've done?
So how many GB/TB is a library of congress? :)
Or more seriously how big are you estimating? Are you using raw scans or some sort of compression (jpeg, png, ...etc)? What resolution are you using? Do you vary the resolution depending on the document?
What sort of metadata are you putting in?
Didn't Obama already mandate that all government agencies must digitize their records and develop plans within 4 months? http://www.simplysecurity.com/2011/12/28/obama-administration-pushes-for-digital-records-management-overhaul/
The US actually does a good job with sharing data on regulations and rulemaking on regulations.gov. You can pretty much search any of the regulatory dockets from most departments, and even access public comments and supporting material. You can even take advantage of regulatory policy updates and eRulemaking Program activities on your Twitter stream. Wouldn't this be a good model to follow to systematically publish everything online? I'm thinking publishing everything online on a government website would make for a great summer job for students, and help boost the economy and employment stats, no?
Speaking from experience, the digitized (with text available, not just scanned images) USPTO patent data comes in 4 formats. The oldest format looks like it was based on 'cards', the second format was SGML, the third was a bizarre XML format based on the SGML format, and the current format is based on alterations to international standards. When my former employer wanted to analyze this data, I needed to write parsers for each one.
Is there any chance that all patent data will be made available in a single format (other than HTML)? The structured information in the formats is very useful, but very difficult to get to with the current system (it also costs tens of thousands of dollars to get all of the data).
The right to protest the State is more sacred than the State.
Can you provide any explanation as to why it is so difficult and cost-prohibitive to obtain records from the government, especially considering the abundance of laws requiring government compliance with requests for information (AKA "Sunshine Laws")?
Is it simply a matter of government employee ineptitude, or have you found evidence of a more nefarious rationale?
An enigma, wrapped in a riddle, shrouded in bacon and cheese
What is your opinion about websites like Ancestry.com which make use of public records and charge a subscription fee for access? What is the incentive for the government to migrate old documents into digital form when services like these exist? Do you think Ancestry.com should be a 501(c)(3)?
Which government agency is the worst to get information from?
and yes, i refer to the data sets that have not been "corrected" to include the algorithms applied to "correct" the data.
helping to publicize the need for *accessible* digital archives
By "scanning", what do you mean? Are we talking about searchable records or just a bunch of images? If searchable, what quality control is going to be provided? As someone who has re-published books that are out of copyright, it takes a lot of quality control to ensure a usable product. Unless high-quality searchable records in a solid database are the end result, the project is not worth funding, in my personal opinion.
Recently in the federal register, there were two calls for comments about access to data and research from federally funded research:
http://federalregister.gov/a/2011-28623
http://federalregister.gov/a/2011-28621
I didn't hear about these until ~4 weeks after the original announcement, and with the holidays, it was too late to try to get the societies I'm involved with to prepare and vote on official statements. Are there any places where people can get/post notices of these sorts of things so that we can stay informed and try to help influence policies?
(note -- the second one on data access doesn't close 'til Jan 12th; NSF also has a similar RFC that closes Jan 18th)
Build it, and they will come^Hplain.
The amount of information you're trying to free is entirely staggering and consists, largely, of tables of numbers. These numbers are incredibly significant, but people generally can't see them.
After you free all of this information and make it available to the public (as it should be), then what? What do you expect the public to do with these numbers? Tables of information are not nearly as useful as graphs. This data needs to be seen, but, more importantly, it needs to be understood.
Do you have any ideas for how to disseminate this information? Perhaps a team-up with someone like gapminder.org's Hans Rosling might be particularly valuable for all of us.
I like losing arguments, it just means that I can take your point and make it my own.
Make sure you don't get stuck in a standardization process where the aim is to bring different formats together before any data is entered.
Some formats are incompatible today and will be forever.
The big issue is that such a process will NEVER go anywhere, cost a ridiculous amount of both money and time, with no result in sight, ever.
Yes, I have seen those processes from a closer range than I wish to remember. Big in-house, between-house, between-block, between-county fights that led to no data ever being entered.
Just do the gritty work immediately. Don't insist on OCRing everything, just scan it as plain images, as much as you can. Then, if the money is there, consider OCR.
IGNORE anything that sounds like an untested high-tech solution. Use well established technology, like high performance scanners etc. if it gets the initial job done, entering those damn documents into the computers.
Look at Google! They did almost all books in the world in just a few years. Did they bother with converting 16th century typesetting into Times New Roman or something similar? Of course not.
Scan on!
How much difficulty do you anticipate in getting and publishing records in Pacer? If there's one system that should be free, it is the decisions that our courts make, and yet you are charged by the page just to view the results. Are you concerned about a court taking an unkind view on your archiving what is in Pacer?
People always seem to assume the Library of Congress is a national library. Its role has obviously expanded beyond its original mission, but it does not exist to serve you. It serves Congress. Government documents are a mess because they involve LC and other organizations like the Government Printing Office, National Archives and Records Administration, and the individual agencies involved.
Maybe it's time we start a real national library.
"why can't we launch the Library of Congress into cyberspace?"
Pulling numbers out of thin air, but my guess is that 90% of those books are copyrighted. You'll have a hell of a fight getting permission to digitize them, and an even bigger one for giving access to the public.
Bureaucracy is one thing, wealthier than God copyright owners are quite another.
The 2011 update to data.gov actually allows whoever is submitting the data to describe it such that people can make use of it, including via visualization (maps, graphs, etc.) or via API to make custom applications.
So my question for Carl would be : What can we do to get more government agencies to actually put their data in there? And if they won't do it, should resource.org or similar groups work to put up something similar, so that people who have gotten information through FOIA can share it back out to wider audiences?
I think I have read that the law itself cannot be copyrighted and it should be possible to make it available to everyone. But as a techie who drafts standards and specifications, I was wondering about how far this goes--especially since Congress recently proposed enacting some of our standards into law. (They decided not to, but they read some parts into the committee records as they debated.) Can you still accomplish your project if a governmental body adopts (or considers adopting) a privately owned, copyrighted technical reference manual or set of safety standards as administrative law (or regulations that carry the force of law)? Or would such obstacles keep you from being able to digitize all of the government's laws (and archives of proposed laws)?
I am all for open data, and I like what they do, but Ancestry.com should not be a 501(c)(3). It's for profit. Its purpose is to make money.
If they were dealing strictly with public data then I would have no bones to pick if the U.S. government moved into their business and started to offer the information for free. (well, small bones. We are running a deficit, not sure this ranks on my top 10 list for this decade, but that’s a different debate).
What they do is combine multiple government databases together. That’s their value add, and I would not want the U.S. government to go there. Once again, I would love to see governments choose a standardized data structure so I can easily query birth records from many different countries over many different centuries – but until then –
o.k. – now that I have thought about it for 5 minutes – a non-profit that would publish and maintain the different data structures would be o.k. Then open source software to mine the various government databases? Parts of Ancestry.com would be o.k. – but it would have to be public domain.
In a city such as Nashville, things as basic as business ownership and property records are not available online. In states such as New Jersey, public records such as basic corporate filings (officers, operating address/address for service of process) are accessible only for a fee.
What concrete actions can citizens confronting such situations take to encourage accessibility and accountability?
Three closely related questions about the rare books collections at the Library of Congress:
1. I know there is some kind of effort going on to digitize the rare books collections, but can it be sped up? There are many high-quality low-cost archival book scanners out there (such as the ones developed at diybookscanner.org).
2. It gets really annoying to have to receive paper copies of books when copies are requested. Why not DVDs of high-quality images?
3. Why is there no outreach by the LoC to smaller, cheaper book scanning efforts? The Internet Archive, DIYBookscanner.org, and Decapod all come to mind.
Towards the Singularity.
Federal standards for digitizing local data would be nice. According to wiki there are 3,143 counties or "county equivalents", which I assume are things like Louisiana "parishes". That probably doesn't include cities and states. Sure enough, they are all in various stages of digitizing and making available for access. What a mess...
I work for a branch of a US state government, and would like to implement a long-term plan to digitize quite a bit of data that my particular office is required to keep for up to 20 years. Currently we pay quite a bit of money to store the physical records. Digitizing this material might provide future savings, as well as enhance public access.
My concern is about the dubious proposition that proprietary formats will provide long-term access. What level of documentation *about* the formats we use will be necessary for future generations to access the material, and are you aware of any proposed solutions to this particular problem?
I'd like to know what Malamud thinks about corporate partnerships in the process to get public data released. (I'm not sure if Google Patents existed before the USPTO released its databases...?) Do corporations that get involved in the process tend to make the process better without question, or are there tradeoffs in some areas because the corporations always want to help but then try to retain a proprietary version of the data for themselves?
Carl: would it be possible to implement a system that would allow real-time and continuous review of legislation while it's being drafted? Much has been made over the past three years about legislation being available for review before voting by the House or Senate. The final draft for review is usually a huge PDF, which makes it nearly impossible for citizens, interest groups, and the media to analyze it thoroughly in time.
"I'd never want to join a club that would have me as a member" - G. Marx
In the past 6 months, USDA has made available past agriculture censuses,
now back to 1925.
http://agcensus.mannlib.cornell.edu/AgCensus/homepage.do
However, while these are searchable PDFs,
there appears to be no quality control, so errors appear not in the image but in the underlying searchable data.
In some sense, the searchability is a mere bonus of the scanning software used;
although for such PDFs, your own OCR software could create this searchability.
Since you can't import these into statistical or spreadsheet software,
such PDFs merely amount to putting a library's paper document on your desk.
With some Perl programming, they could be made into unusual csv (comma separated) files,
though those underlying errors would remain.
At least each such csv file could be created the same way for all 50 states,
and used in statistical software the same way for all 50 states.
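For what it's worth, here is a rough sketch of the kind of conversion described above, in Python rather than Perl. It assumes the text layer has already been pulled out of the PDF into plain text (that step is not shown), and that each table row is a label followed by numeric columns separated by runs of two or more spaces. The sample rows and the function name are made up for illustration and are not taken from the actual census files. Any OCR errors in the text layer would pass straight through.

```python
import csv
import io
import re

# Assumption: columns in the extracted text are separated by 2+ spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_table_to_csv(text):
    """Convert whitespace-aligned table text into a CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        cells = COLUMN_GAP.split(line)
        # Drop thousands separators from cells made only of digits and commas.
        cells = [c.replace(",", "") if re.fullmatch(r"[\d,]+", c) else c
                 for c in cells]
        writer.writerow(cells)
    return out.getvalue()


# Hypothetical sample rows, invented for this example.
sample = """
Farms, number      1,234    1,456
Land in farms      98,765   87,654
"""
print(text_table_to_csv(sample))
```

Running the same function over every state's file would at least give identically structured csv output, which is the point made above; checking the numbers against the page images would still have to be done separately.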
Yes. Develop Plans.
Plan: Scan everything.
Cost: A lot!
Budget: Cut.
Action: None.