Ask Carl Malamud About Shedding Light On Government Data

Posted by timothy on Wednesday January 4, 2012 @07:30AM from the righteous-fight dept.

If you've ever tried to look up public records online, you may have run into byzantine sign-up procedures, proprietary formats, charges just to view what are ostensibly public documents, and generally the sense that you're in a snooty library with closed stacks. Carl Malamud of Public.Resource.Org has for years been forging a path through the grey goo of U.S. government data, helping to publicize the need for accessible digital archives — not just awkward, fee-per-page access. (Mother Jones calls him a "badass.") Malamud has (with help) been making it easier to get to the huge swathes of data in government sources like PACER, EDGAR, and the U.S. Patent Office. He's got a new initiative now to establish a "Federal Scanning Commission," the task of which would be to assess the scope and outcomes of a large-scale effort to actually digitize and make available online as much as practical of the vast holdings of the U.S. government. ("If we were able to put a man on the moon, why can't we launch the Library of Congress into cyberspace?") Ask Malamud below questions about his plans and challenges in disseminating public information. (But please, post unrelated questions separately, lest ye be modded down.)

59 comments

Be careful ... by anagama · 2012-01-04 07:38 · Score: 3, Insightful

Government Warning: Exposing the government to scrutiny can result in rape charges.

--
What changed under Obama? Nothing Good
1. Re:Be careful ... by Anonymous Coward · 2012-01-04 07:43 · Score: 2, Informative
  
  Right. Because power corrupts, and yet we keep putting people into power and expecting them to not get corrupted. Nothing will change until we open source it.
2. Re:Be careful ... by Synerg1y · 2012-01-04 07:54 · Score: 1
  
  Or revert back to our instincts...
  Tribalism
3. Re:Be careful ... by elrous0 · 2012-01-04 08:11 · Score: 2
  
  Hell, even suggesting a new world currency to replace the dominance of the dollar can get you that.
  
  --
  SJW: Someone who has run out of real oppression, and has to fake it.
4. Re:Be careful ... by PopeRatzo · 2012-01-04 08:39 · Score: 2
  
  Hell, even suggesting a new world currency to replace the dominance of the dollar [guardian.co.uk] can get you that.
  I know, right? The DNA evidence that showed DSK had sex with the hotel maid and her filing rape charges had nothing to do with him getting charged with rape. It was obvious that his suggesting a replacement for the dollar caused his semen to get inside that hotel maid.
  
  --
  You are welcome on my lawn.
5. Re:Be careful ... by Anonymous Coward · 2012-01-04 09:39 · Score: 0
  
  The two might not be mutually exclusive. Metagovernment describes itself as a tribe.
6. Re:Be careful ... by elrous0 · 2012-01-04 09:42 · Score: 1
  
  If you've never heard of a Honey Trap Operation, you would make a really shitty spy. It's one of the basic tactics of any good intelligence agency.
  
  --
  SJW: Someone who has run out of real oppression, and has to fake it.
7. Re:Be careful ... by Synerg1y · 2012-01-04 09:53 · Score: 1
  
  We already have international committees for resolving disputes between countries such as NATO. Sounds exactly the same with the best of intent by participation. While we're on the subject, we can take away a lot from "The Republic" by Plato as to what at least some aspects of a perfect government would be. Free read on google.
8. Re:Be careful ... by PopeRatzo · 2012-01-04 10:10 · Score: 1
  
  If you've never heard of a Honey Trap Operation, you would make a really shitty spy. It's one of the basic tactics of any good intelligence agency.
  Have you seen the hotel maid? If you were setting up a "Honey Trap" is she the woman you'd pick?
  
  --
  You are welcome on my lawn.
9. Re:Be careful ... by elrous0 · 2012-01-04 10:11 · Score: 1
  
  Sometimes you work with the best maid you can get. And you can't argue with success.
  
  --
  SJW: Someone who has run out of real oppression, and has to fake it.
10. Re:Be careful ... by slick7 · 2012-01-04 12:20 · Score: 1
  
  Government Warning: Exposing the government to scrutiny can result in rape charges.
  Julian Assange did this very thing and look what it got him.
  
  --
  The mind conceives, the body achieves, the spirit manifests.
11. Re:Be careful ... by korean.ian · 2012-01-04 22:25 · Score: 2
  
  You should really give this article here a good read. It's not long and it's fascinating. Does it prove Strauss-Kahn is innocent? Not conclusively. Does it show that there's a heck of a lot more going on than something as simple as a woman claiming rape? Yeah, I'd say it does.
12. Re:Be careful ... by PopeRatzo · 2012-01-05 00:22 · Score: 1
  
  You should really give this article here a good read. It's not long and it's fascinating.
  
  I will. Thank you for the link.
  
  --
  You are welcome on my lawn.
Are you aware of the GPO "fdsys" project? by Arrogant-Bastard · 2012-01-04 07:39 · Score: 2

(I'm guessing "yes") If you are, what do you think about the work they've done?
LOC by Anonymous Coward · 2012-01-04 07:41 · Score: 3, Interesting

So how many GB/TB is a library of congress? :)
Or more seriously how big are you estimating? Are you using raw scans or some sort of compression (jpeg, png, ...etc)? What resolution are you using? Do you vary the resolution depending on the document?
What sort of meta data are you putting in?
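
A rough way to frame the size question is plain arithmetic: pixels per page times bit depth, divided by whatever compression the chosen format achieves. The Python sketch below does only that. The page size, resolution, bit depth, and compression ratio are all placeholder assumptions, not figures from the Library of Congress or Public.Resource.Org.

```python
def scan_size_bytes(pages, width_in=8.5, height_in=11.0, dpi=300,
                    bits_per_pixel=8, compression_ratio=10.0):
    """Estimate storage for scanned pages.

    Every default here is a placeholder assumption: letter-size pages,
    300 dpi, 8-bit grayscale, and a guessed 10:1 compression ratio.
    """
    # Pixels on one page at the chosen resolution.
    pixels_per_page = (width_in * dpi) * (height_in * dpi)
    # Uncompressed bytes for one page.
    raw_bytes_per_page = pixels_per_page * bits_per_pixel / 8
    # Total after applying the assumed compression ratio.
    return pages * raw_bytes_per_page / compression_ratio
```

With `compression_ratio=1.0` and the other defaults, a single page comes to about 8.4 MB, so the total swings by orders of magnitude depending on the resolution and compression chosen.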
Happened Top Down Already by jimmerz28 · 2012-01-04 07:44 · Score: 4, Interesting

Didn't Obama already mandate that all government agencies must digitize their records and develop plans within 4 months? http://www.simplysecurity.com/2011/12/28/obama-administration-pushes-for-digital-records-management-overhaul/
1. Re:Happened Top Down Already by SoothingMist · 2012-01-04 07:56 · Score: 2
  
  Is Guantanamo closed? He signed that order on his first day in office three years ago. Clearly, Obama's dictates do not carry much weight.
2. Re:Happened Top Down Already by garcia · 2012-01-04 08:12 · Score: 3, Interesting
  
  I scour publicly available records for fun stuff all the time. I not only find it online but I also request it from government agencies (not Federal usually but local/county/etc).
  In Minnesota data must be, "easily accessible for convenient use." While that has specific wording related to historical records, it basically means that on recent data it must be in some sort of electronic format or otherwise easily found and presented, free of charge as long as you do it in person, to anyone who asks--even anonymously. Now. This is great in theory. Unfortunately just because it's easy for the agency to use it doesn't mean it's easy for you to use or interpret.
  Let's take for instance bus ridership data. It's not well organized for outsiders to read, and due to collection methodologies (not explained to the general person who had to pay $50 to get the data in the first place) it is basically useless.
  They have the data and after months of fighting with them for how much they claimed it cost (they wanted to charge me more than $300 IIRC) I got it down to $50 and got what you see above even though they already pulled it (and summarized it) for the mass media but wouldn't release it in a raw format.
  So. It's in a format which isn't standard. Its methodology is questionable and it's expensive. So no matter the mandates, the promises, etc, the data is not terribly useful across agencies or to the public without some intermediate steps which cost the taxpayers more than doing it right the first time around.
regulations.gov is a good model to follow by hyeprofile · 2012-01-04 07:44 · Score: 5, Interesting

The US actually does a good job with sharing data on regulations and rulemaking on regulations.gov. You can pretty much search any of the regulatory dockets from most departments, and even access public comments and supporting material. You can even take advantage of regulatory policy updates and eRulemaking Program activities on your Twitter stream. Wouldn't this be a good model to follow to systematically publish everything online? I'm thinking publishing everything online on a government website would make for a great summer job for students, and help boost the economy and employment stats, no?
1. Re:regulations.gov is a good model to follow by Anonymous Coward · 2012-01-04 07:49 · Score: 0
  
  The United States government does do a remarkable job of putting most recent stuff online. In terms of sheer volume it beats, well, virtually any other nation.
  We still need lots of older stuff put online though. And more importantly we need each individual state to get on board.
Patent Data by andymadigan · 2012-01-04 07:44 · Score: 2

Speaking from experience, the digitized (with text available, not just scanned images) USPTO patent data comes in 4 formats. The oldest format looks like it was based on 'cards', the second format was SGML, the third was a bizarre XML format based on the SGML format, and the current format is based on alterations to international standards. When my former employer wanted to analyze this data, I needed to write parsers for each one.

Is there any chance that all patent data will be made available in a single format (other than HTML)? The structured information in the formats is very useful, but very difficult to get to with the current system (it also costs tens of thousands of dollars to get all of the data).
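
One way to cope with several source formats is to give each its own small parser and have all of them emit the same record shape. The Python sketch below illustrates that idea for only two invented layouts: a tag-per-line text format and an XML format. The tag names (`NUM`, `TTL`), the element names (`number`, `title`), and all function names are hypothetical and are not taken from the actual USPTO files.

```python
import re
import xml.etree.ElementTree as ET


def parse_tagged_lines(text):
    """Hypothetical card-style layout: a short tag, whitespace, then the value."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z]{2,4})\s+(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return {"number": fields.get("NUM"), "title": fields.get("TTL")}


def parse_xml(text):
    """Hypothetical XML layout with <number> and <title> somewhere in the tree."""
    root = ET.fromstring(text)
    return {"number": root.findtext(".//number"),
            "title": root.findtext(".//title")}


def parse_patent(text):
    """Pick a parser by sniffing the first non-blank character."""
    parser = parse_xml if text.lstrip().startswith("<") else parse_tagged_lines
    return parser(text)
```

Real SGML would not pass through an XML parser unchanged, so the other formats would still need parsers of their own. The point is only that downstream analysis sees one dictionary shape regardless of the source.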

--
The right to protest the State is more sacred than the State.
Why by CanHasDIY · 2012-01-04 07:46 · Score: 3, Interesting

Can you provide any explanation as to why it is so difficult and cost-prohibitive to obtain records from the government, especially considering the abundance of laws requiring government compliance with requests for information (AKA "Sunshine Laws")?
Is it simply a matter of government employee ineptitude, or have you found evidence of a more nefarious rationale?

--
An enigma, wrapped in a riddle, shrouded in bacon and cheese
1. Re:Why by khallow · 2012-01-04 09:47 · Score: 1
  
  I can think of several overlapping reasons off the top of my head. The least charitable cause is that someone doesn't want the information released. Sometimes the information is embarrassing or risky to someone in a position of power or a bureaucracy.
  
  Sometimes the data is just in a bad format, say completely on paper or in a specialized computer system that doesn't lend itself to easy sharing.
  
  And the last of the list I can think of, sometimes the data might contain information which has a legitimate reason for not being released such as clues to spies in other countries or medical records. That has to be removed first.
2. Re:Why by Anonymous Coward · 2012-01-04 10:39 · Score: 3, Informative
  
  Having worked for the government in the recent past, I can offer a few insights...
  1 - A lot of government agencies, on receiving a request for information, will kick it over to the IT department, on the grounds that "they keep the data". Unfortunately, because of the way things are structured, while the people in IT may run the disks and servers, they don't actually deal with the data... which means they either have to fight an internal battle with the people who actually manage the data, or take the path of least resistance and offer to provide the data as some sort of raw data dump.
  2 - A few agencies regularly get requests for data, and have people whose job it is to work with the public in getting them the data they request. Most, however, don't. This means that someone gets stuck with the task for whom it isn't part of their normal job. Since they don't deal with translating data formats, exporting large chunks of data, etc. on a regular basis, they have to go find out how to do these things... and while they're doing that, they can't do their normal work.
  3 - The data the agency has may be mixed in with other data, which might be confidential. To give a real example, I was working for the Department of Environmental Protection in my state, and we were sent a request to give out a list of all our employees' names and email addresses. You'd think that'd be simple, right? However, our employees include a law enforcement division, many of whom are exempt from having their personal information disclosed (because they're currently or previously involved in undercover investigations, have held positions as prison guards, etc.). Further, regular employees can be exempt under certain circumstances (e.g., they have a restraining order against a stalker, ex-spouse, or whatever). Now, since no one had ever previously asked us for this info, naturally no one had bothered to make a list of everyone who was exempt... which meant that we had to start creating such a list immediately, and couldn't release the information until it was completed. For extra fun, we also had people in our mail system who weren't employees -- volunteers with the state parks, for example, could get an email address from us. So we had to contact Parks and find out whether all the non-employees with email addresses were correctly marked as such in the system. What in theory should have been a simple "run a script, get a list, email it" operation that could be done in ten minutes took weeks and a lot of man-hours.
  4 - Just like everybody else, records retention is a problem for the government. Storing old data costs money. Keeping the formats that data is in current costs more money. A lot of our programs did "front-end" processing for Federal EPA programs, collecting data in our state, then sending it to the EPA. Our state DEP received funding from the EPA to do this for them. We weren't, though, being paid to keep old records for them... so it'd be kept for however long the EPA required us to keep it, then deleted after that period -- generally six months or a year. Thus, if someone requested data for two years ago from us, we would have to either tell them, "we don't have that -- go ask the Feds" or hope we could find old backups with that data, and that those were still readable. And, of course, since people hate to be the bearer of bad news, and the IT department, which would get handed the request, didn't manage the data, the result would be that it would literally take a week or more for us in IT to find someone who would admit that the data wasn't there.
  5 - And to tell the truth, sometimes we just don't want to. That guy back in #3 who asked for all our email addresses? Well, the only reasons we in IT could see for someone asking for that were either (1) so they could use the list to spam our employees with something, or (2) so they could sell the list to someone who would then spam our employees. Understandably, we weren't highly motivated to get that list back to them quickly, or to keep the
3. Re:Why by WickedLilMonkies · 2012-01-04 12:20 · Score: 1
  
  I currently work in an IT capacity for a municipal agency dealing with building codes, health and safety inspections, and planning, and I receive the bulk of my agency's public records requests, so I can offer my $0.02 on this. As another poster pointed out, I don't normally deal with the data, so when I receive a request, the first thing I have to do is research/discuss with people in the know what it is the request even says. Then dig through the database and determine if it exists and how to get it out. Finally, there's a lot of back and forth to verify that I got everything the request asked for.
  
  ...for every request. Because every request wants something different, and frequently insists that it be formatted a certain way (not just an Excel data dump; sorted and grouped and normalized and polished till it shines). I'm a very competent database admin and know a good deal about our agency's processes. I can't imagine what Sally Secretary, who has no IT support, goes through when she receives one of these requests.
  
  At least from my personal experience at a local level, it's not a nefarious rationale at all; Public Records requests are simply not part of our day-to-day duties, and as such, take more time to accommodate, and when I call back and ask questions about what you're asking, it's because I genuinely want to get it right; not because I'm trying to hide something.
Ancestry.com by Anonymous Coward · 2012-01-04 07:51 · Score: 3, Interesting

What is your opinion about websites like Ancestry.com which make use of public records and charge a subscription fee for access? What is the incentive for the government to migrate old documents into digital form when services like these exist? Do you think Ancestry.com should be a 501(c)(3)?
1. Re:Ancestry.com by mrchaotica · 2012-01-04 08:49 · Score: 1
  
  Ancestry.com can do what it wants, but there's no obligation for the government to preserve its business model by failing to make the data easily accessible itself.
  
  --
  "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
2. Re:Ancestry.com by Anonymous Coward · 2012-01-04 16:10 · Score: 0
  
  If there's no obligation for the government to make the data easily accessible then must we rely on Paid Sites (Ancestry.com) and Open Sites (Public.Resource.Org) to make this data accessible? If so, what happens if Open Sites run out of funding, like what almost happened to Wikipedia?
3. Re:Ancestry.com by mrchaotica · 2012-01-04 21:39 · Score: 1
  
  You misunderstood: I said there's no obligation for the government not to make the data easily accessible in order to prop up Ancestry.com's business model. If improved data access screws over Ancestry.com, too bad for them.
  
  --
  "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Who is the worst? by TheBrez · 2012-01-04 07:51 · Score: 5, Interesting

Which government agency is the worst to get information from?
1. Re:Who is the worst? by Anonymous Coward · 2012-01-05 03:08 · Score: 0
  
  That's a tough question, on some thought. If the information they give you is false, how would you know? What is worse, getting perpetual denial of access, getting partial (redacted) data, or getting false data, or getting so much data that it simply reflects the actual realities of unfair/random human behavior (Wal-Mart racism...)?
Re:As long as its not Climate data by Anonymous Coward · 2012-01-04 07:51 · Score: 0

and yes, i refer to the data sets that have not been "corrected" to include the algorithms applied to "correct" the data.
correction by Anonymous Coward · 2012-01-04 07:52 · Score: 0

helping to publicize the need for *accessible* digital archives
Scanning ? by SoothingMist · 2012-01-04 07:53 · Score: 3, Interesting

By "scanning", what do you mean? Are we talking about searchable records or just a bunch of images? If searchable, what quality control is going to be provided? As someone who has re-published books that are out of copyright, it takes a lot of quality control to ensure a usable product. Unless high-quality searchable records in a solid database are the end result, the project is not worth funding, in my personal opinion.
How to get more attention to by oneiros27 · 2012-01-04 07:56 · Score: 3, Interesting

Recently in the federal register, there were two calls for comments about access to data and research from federally funded research:
http://federalregister.gov/a/2011-28623
http://federalregister.gov/a/2011-28621
I didn't hear about these until ~4 weeks after the original announcement, and with the holidays, it was too late to try to get the societies I'm involved with to prepare and vote on official statements. Are there any places where people can get/post notices of these sorts of things so that we can stay informed and try to help influence policies?
(note -- the second one on data access doesn't close 'til Jan 12th; NSF also has a similar RFC that closes Jan 18th)

--
Build it, and they will come^Hplain.
Idea by hardwarejunkie9 · 2012-01-04 08:06 · Score: 4, Interesting

Something has been rattling around my head in recent days on this topic and now I think it's a proper time to let it out.
The amount of information you're trying to free is entirely staggering and consists, largely, of tables of numbers. These numbers are incredibly significant, but people generally can't see them.
After you free all of this information and make it available to the public (as it should be), then what? What do you expect for the public to do with these numbers? Tables of information are not nearly as useful as graphs. This data needs to be seen, but, more importantly, it needs to be understood.
Do you have any ideas for how to disseminate this information? Perhaps a team-up with someone like gapminder.org's Hans Rosling might be particularly valuable for all of us.

--
I like losing arguments, it just means that I can take your point and make it my own.
1. Re:Idea by rastoboy29 · 2012-01-04 11:43 · Score: 1
  
  This is a great idea--you should do it!
  
  --
  expandfairuse.org
Just do the gritty work, immediately by G3ckoG33k · 2012-01-04 08:11 · Score: 2

Make sure you don't get stuck in a standardization process where the aim is to bring different formats together, before data is entered.
Some formats are incompatible today and will be forever.
The big issue is that such a process will NEVER go anywhere, cost a ridiculous amount of both money and time, with no result in sight, ever.
Yes, I have seen those processes from a closer range than I wish to remember. Big in-house, between-house, between-block, between-county fights that led to no data ever being entered.
Just do the gritty work immediately. Don't insist on OCR everything, just scan it as plain images, as much as you can. Then, if the money is there, then consider OCR.
IGNORE anything that sounds like an untested high-tech solution. Use well established technology, like high performance scanners etc. if it gets the initial job done, entering those damn documents into the computers.
Look at Google! They did almost all books in the world in just a few years. Did they bother with converting 16th century type setting into Times New Roman or something similar? Of course not.
Scan on!
Pacer Problems by onyxruby · 2012-01-04 08:16 · Score: 1

How much difficulty do you anticipate in getting and publishing records in Pacer? If there's one system that should be free it is the decisions that our courts make and yet you are charged by the page just to view the results. Are you concerned about a court taking an unkind view on your archiving what is in Pacer?
1. Re:Pacer Problems by bobaferret · 2012-01-04 09:27 · Score: 1
  
  You can't always make all of the documents that the decision was based on available to anyone. Originally the courts didn't have a good separation between private and public information. This means that you can't actually do heads down scanning. To do it right, each document must be verified to contain no private information. In theory you can just redact that information, but it takes a huge amount of time, and there are not enough people or money to do it. You also have the problem that someone has to pay for it.
  Over all there are at least four basic issues in putting court docs online:
  1) Atomic level security, must be done by hand. Time consuming and Expensive.
  See above.
  2) Bureaucracy interferes with competent design.
  Whenever the state gets involved with planning and design, politics comes into play. As higher ups get more and more pissed off things are more likely to get outsourced to a commercial vendor. Whether or not the vendor is in bed with a bureaucrat is a crap shoot, but you can guarantee that the vendor is going to charge an arm and a leg to do a custom project for the state, especially given that the larger the project, the larger the vendor, i.e. Accenture, IBM, Courtlook, Westlaw etc. These last two folks are extremely effective at protecting their investments; they lobby very well.
  3) It's just plain expensive to design, build, and maintain.
  It takes years of development to do it right. And a decent amount of money to keep it going. Yes it would be nice if there was open source software to do it, but it falls short generally and doesn't meet everyone's needs. And $deity help me, the gov can't make up its mind on data standards.
  4) Who pays.
  This is where per page fees come in. Some commercial company will come along and tell the courts that they will build the whole thing for them at little to no cost, maintenance included, if they can charge users for the service. What this means is that the courts make their constituents happy by putting everything out there, and by doing it at no cost to the taxpayer.
  I don't know what Carl's experience is, but this is roughly mine. Based on state level courts with no money. /ramble
Library of Congress != US national library by Anonymous Coward · 2012-01-04 08:17 · Score: 0

People always seem to assume the Library of Congress is a national library. Its role has obviously expanded beyond its original mission, but it does not exist to serve you. It serves Congress. Government documents are a mess because they involve LC and other organizations like the Government Printing Office, National Archives and Records Administration, and the individual agencies involved.
Maybe it's time we start a real national library.
1. Re:Library of Congress != US national library by elrous0 · 2012-01-04 08:29 · Score: 1
  
  I wish they taught that in Civics courses. The Library of Congress serves Congress, which in turn serves corporations.
  
  --
  SJW: Someone who has run out of real oppression, and has to fake it.
2. Re:Library of Congress != US national library by autophile · 2012-01-04 09:10 · Score: 2
  
  The Library of Congress does serve Congress. First. Then it serves the broader US Government. Then it serves the public.
  
  --
  Towards the Singularity.
Library of Congress? by Anonymous Coward · 2012-01-04 08:34 · Score: 0

"why can't we launch the Library of Congress into cyberspace?"
Pulling numbers out of thin air, but my guess is that 90% of those books are copyrighted. You'll have a hell of a fight getting permission to digitize them, and an even bigger one for giving access to the public.
Bureaucracy is one thing, wealthier than God copyright owners are quite another.
data.gov by oneiros27 · 2012-01-04 08:35 · Score: 2

The 2011 update to data.gov actually allows whoever is submitting the data to describe it such that people can make use of it, including via visualization (maps, graphs, etc.) or via API to make custom applications.
So my question for Carl would be : What can we do to get more government agencies to actually put their data in there? And if they won't do it, should resource.org or similar groups work to put up something similar, so that people who have gotten information through FOIA can share it back out to wider audiences?

--
Build it, and they will come^Hplain.
Privately Owned, Copyrighted Law by AdamnSelene · 2012-01-04 08:40 · Score: 2

I think I have read that the law itself cannot be copyrighted and it should be possible to make it available to everyone. But as a techie who drafts standards and specifications, I was wondering about how far this goes--especially since Congress recently proposed enacting some of our standards into law. (They decided not to, but they read some parts into the committee records as they debated.) Can you still accomplish your project if a governmental body adopts (or considers adopting) a privately owned, copyrighted technical reference manual or set of safety standards as administrative law (or regulations that carry the force of law)? Or would such obstacles keep you from being able to digitize all of the government's laws (and archives of proposed laws)?
1. Re:Privately Owned, Copyrighted Law by mrchaotica · 2012-01-04 09:06 · Score: 1
  
  The way it ought to work is that if a government body adopts your standard, then you should lose your copyright on it. The copyright was only granted at the whim of the government in the first place, after all.
  
  --
  "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Not a 501(c)(3) by alexander_686 · 2012-01-04 08:46 · Score: 2

I am all for open data, and I like what they do, but Ancestry.com should not be a 501(c)(3). It's for profit. Its purpose is to make money.
If they were dealing strictly with public data then I would have no bones to pick if the U.S. government moved into their business and started to offer the information for free. (well, small bones. We are running a deficit, not sure this ranks on my top 10 list for this decade, but that’s a different debate).
What they do is combine multiple government databases together. That’s their value add, and I would not want the U.S. government to go there. Once again, I would love to see governments choose a standardized data structure so I can easily query birth records from many different countries over many different centuries – but until then –
o.k. – now that I have thought about it for 5 minutes – a non-profit that would publish and maintain the different data structures would be o.k. Then open source software to mine the various government databases? Parts of Ancestry.com would be o.k. – but it would have to be public domain.
Encouraging Governments? by theNAM666 · 2012-01-04 08:58 · Score: 3, Interesting

In a city such as Nashville, things as basic as business ownership and property records are not available online. In states such as New Jersey, public records such as basic corporate filings (officers, operating address/address for service of process) are accessible only for a fee.
What concrete actions can citizens confronting such situations, take to encourage accessibility and accountability?
Can the rare books collections be digitized? by autophile · 2012-01-04 09:02 · Score: 4, Interesting

Three closely related questions about the rare books collections at the Library of Congress:
1. I know there is some kind of effort going on to digitize the rare books collections, but can it be sped up? There are many high-quality low-cost archival book scanners out there (such as the ones developed at diybookscanner.org).
2. It gets really annoying to have to receive paper copies of books when copies are requested. Why not DVDs of high-quality images?
3. Why is there no outreach by the LoC to smaller, cheaper book scanning efforts? The Internet Archive, DIYBookscanner.org, and Decapod all come to mind.

--
Towards the Singularity.
A federal standard for localities would be nice by Anonymous Coward · 2012-01-04 10:00 · Score: 0

Federal standards for digitizing local data would be nice. According to wiki there are 3,143 county or "county equivalents" which I assume are Louisiana "parishes". That probably doesn't include cities and states. Sure enough, they are all in various stages of digitizing and making available for access. What a mess...
gov't employee trying to digitize print mat'l by Anonymous Coward · 2012-01-04 10:24 · Score: 0

I work for a branch of a US state government, and would like to implement a long-term plan to digitize quite a bit of data that my particular office is required to keep for up to 20 years. Currently we pay quite a bit of money to store the physical records. Digitizing this material might provide future savings, as well as enhance public access.

My concern is about the dubious proposition that proprietary formats will provide long-term access. What level of documentation *about* the formats we use will be necessary for future generations to access the material, and are you aware of any proposed solutions to this particular problem?
1. Re:gov't employee trying to digitize print mat'l by chill · 2012-01-04 11:06 · Score: 2
  
  PDF/A.
  If you work for the government, you should be asking NARA, not Slashdot. If you didn't know this, your records officer should.
  
  --
  Learning HOW to think is more important than learning WHAT to think.
What do you think of corporate partnerships? by mhh5 · 2012-01-04 11:22 · Score: 2

I'd like to know what Malamud thinks about corporate partnerships in the process to get public data released. (I'm not sure if Google Patents existed before the USPTO released its databases...?) Do corporations that get involved in the process tend to make the process better without question, or are there tradeoffs in some areas because the corporations always want to help but then try to retain a proprietary version of the data for themselves?
Question for Carl: Real time legislation drafting by kerskine · 2012-01-05 01:41 · Score: 1

Carl: would it be possible to implement a system that would allow real-time and continuous review of legislation while it's being drafted? Much has been made over the past three years about legislation being available for review before voting by the House or Senate. The final draft for review usually is a huge PDF that makes it near impossible for citizens, interest groups, and the media to thoroughly analyze in time.

--
****

"I'd never want to join a club that would have me as a member" - G. Marx
Could you improve on USDA pdf's back to 1925? by Jameson+Burt · 2012-01-05 02:50 · Score: 1

In the past 6 months, USDA has made available past agriculture censuses,
now back to 1925.
http://agcensus.mannlib.cornell.edu/AgCensus/homepage.do
However, while these are searchable pdf's,
there appears to be no quality control, so errors appear not in the image but in the underlying searchable data.
In some sense, the searchability is a mere bonus of the scanning software used;
although for such pdf's, your own OCR software could create this searchability.
Since you can't import these into statistical or spreadsheet software,
such pdf's merely amount to putting a library's paper document on your desk.
With some Perl programming, they could be made into unusual csv (comma separated) files,
though those underlying errors would remain.
At least each such csv file could be created the same way for all 50 states,
and used in statistical software the same way for all 50 states.
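
The conversion step described above might look roughly like this. It is a Python stand-in for the Perl idea, and it assumes the table text has already been pulled out of the pdf by some other tool and that columns are separated by runs of two or more spaces. The function name and that column convention are assumptions for illustration, not a description of the actual census files.

```python
import csv
import io
import re


def table_text_to_csv(text):
    """Convert space-aligned table text to CSV, one output row per input line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        # Assumed convention: two or more spaces separate columns.
        cells = re.split(r"\s{2,}", line)
        # Drop thousands separators only from cells made of digits and commas.
        cells = [c.replace(",", "") if re.fullmatch(r"[\d,]+", c) else c
                 for c in cells]
        writer.writerow(cells)
    return out.getvalue()
```

Running the same function over every state's extracted text would give files with one consistent layout, though any OCR errors in the underlying text would carry straight through.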
LOL by DarthVain · 2012-01-05 03:46 · Score: 2

Yes. Develop Plans.
Plan: Scan everything.
Cost: A lot!
Budget: Cut.
Action: None.