Slashdot Mirror


Ask Carl Malamud About Shedding Light On Government Data

If you've ever tried to look up public records online, you may have run into byzantine sign-up procedures, proprietary formats, charges just to view what are ostensibly public documents, and generally the sense that you're in a snooty library with closed stacks. Carl Malamud of Public.Resource.Org has for years been forging a path through the grey goo of U.S. government data, helping to publicize the need for accessible digital archives — not just awkward, fee-per-page access. (Mother Jones calls him a "badass.") Malamud has (with help) been making it easier to get to the huge swathes of data in government sources like PACER, EDGAR, and the U.S. Patent Office. He's got a new initiative now to establish a "Federal Scanning Commission," the task of which would be to assess the scope and outcomes of a large-scale effort to actually digitize and make available online as much as practical of the vast holdings of the U.S. government. ("If we were able to put a man on the moon, why can't we launch the Library of Congress into cyberspace?") Ask Malamud below questions about his plans and challenges in disseminating public information. (But please, post unrelated questions separately, lest ye be modded down.)

29 of 59 comments

  1. Be careful ... by anagama · · Score: 3, Insightful

    Government Warning: Exposing the government to scrutiny can result in rape charges.

    --
    What changed under Obama? Nothing Good
    1. Re:Be careful ... by Anonymous Coward · · Score: 2, Informative

      Right. Because power corrupts, and yet we keep putting people into power and expecting them to not get corrupted. Nothing will change until we open source it.

    2. Re:Be careful ... by elrous0 · · Score: 2
      --
      SJW: Someone who has run out of real oppression, and has to fake it.
    3. Re:Be careful ... by PopeRatzo · · Score: 2

      Hell, even suggesting a new world currency to replace the dominance of the dollar [guardian.co.uk] can get you that.

      I know, right? The DNA evidence that showed DSK had sex with the hotel maid and her filing rape charges had nothing to do with him getting charged with rape. It was obvious that his suggesting a replacement for the dollar caused his semen to get inside that hotel maid.

      --
      You are welcome on my lawn.
    4. Re:Be careful ... by korean.ian · · Score: 2

      You should really give this article here a good read. It's not long and it's fascinating. Does it prove Strauss-Kahn is innocent? Not conclusively. Does it show that there's a heck of a lot more going on than something as simple as a woman claiming rape? Yeah, I'd say it does.

  2. Are you aware of the GPO "fdsys" project? by Arrogant-Bastard · · Score: 2

    (I'm guessing "yes") If you are, what do you think about the work they've done?

  3. LOC by Anonymous Coward · · Score: 3, Interesting

    So how many GB/TB is a library of congress? :)

    Or more seriously how big are you estimating? Are you using raw scans or some sort of compression (jpeg, png, ...etc)? What resolution are you using? Do you vary the resolution depending on the document?

    What sort of meta data are you putting in?
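
    For a rough sense of scale on the resolution question, the uncompressed size of a scanned page follows directly from its dimensions, resolution, and bit depth. The sketch below assumes a US Letter page (8.5 x 11 in) at 300 dpi, in 24-bit color and in 1-bit bitonal; these values are illustrative assumptions, not figures from any actual scanning project, and a real archive would normally store compressed images.

```python
def raw_scan_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one scanned page."""
    # Pixel grid = physical size (inches) times resolution (dots per inch).
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page: US Letter at 300 dpi (illustrative values only).
letter_color = raw_scan_bytes(8.5, 11, 300, 24)    # 24-bit color
letter_bitonal = raw_scan_bytes(8.5, 11, 300, 1)   # 1-bit black and white

print(f"{letter_color / 1e6:.1f} MB color, {letter_bitonal / 1e6:.2f} MB bitonal")
```

    Under those assumptions the color page is 24 times the size of the bitonal one, which is why varying resolution and bit depth by document type matters long before compression enters the picture.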

  4. Happened Top Down Already by jimmerz28 · · Score: 4, Interesting

    Didn't Obama already mandate that all government agencies must digitize their records and develop plans within 4 months? http://www.simplysecurity.com/2011/12/28/obama-administration-pushes-for-digital-records-management-overhaul/

    1. Re:Happened Top Down Already by SoothingMist · · Score: 2

      Is Guantanamo closed? He signed that order on his first day in office three years ago. Clearly, Obama's dictates do not carry much weight.

    2. Re:Happened Top Down Already by garcia · · Score: 3, Interesting

      I scour publicly available records for fun stuff all the time. I not only find it online but I also request it from government agencies (not Federal usually but local/county/etc).

      In Minnesota, data must be "easily accessible for convenient use." While that has specific wording related to historical records, it basically means that recent data must be in some sort of electronic format or otherwise easily found and presented, free of charge as long as you do it in person, to anyone who asks--even anonymously. Now, this is great in theory. Unfortunately, just because it's easy for the agency to use doesn't mean it's easy for you to use or interpret.

      Take, for instance, data on bus ridership. It's not well organized for outsiders to read, and due to collection methodologies (not explained to the member of the public who had to pay $50 to get the data in the first place) it is basically useless.

      They have the data, and after months of fighting with them over how much they claimed it cost (they wanted to charge me more than $300, IIRC), I got it down to $50 and got what you see above, even though they had already pulled it (and summarized it) for the mass media but wouldn't release it in a raw format.

      So. It's in a format which isn't standard. Its methodology is questionable and it's expensive. So no matter the mandates, the promises, etc., the data is not terribly useful across agencies or to the public without some intermediate steps, which cost the taxpayers more than doing it right the first time around.

  5. regulations.gov is a good model to follow by hyeprofile · · Score: 5, Interesting

    The US actually does a good job with sharing data on regulations and rulemaking on regulations.gov. You can pretty much search any of the regulatory dockets from most departments, and even access public comments and supporting material. You can even take advantage of regulatory policy updates and eRulemaking Program activities on your Twitter stream. Wouldn't this be a good model to follow to systematically publish everything online? I'm thinking publishing everything online on a government website would make for a great summer job for students, and help boost the economy and employment stats, no?

  6. Patent Data by andymadigan · · Score: 2

    Speaking from experience, the digitized (with text available, not just scanned images) USPTO patent data comes in 4 formats. The oldest format looks like it was based on 'cards', the second format was SGML, the third was a bizarre XML format based on the SGML format, and the current format is based on alterations to international standards. When my former employer wanted to analyze this data, I needed to write parsers for each one.

    Is there any chance that all patent data will be made available in a single format (other than HTML)? The structured information in the formats is very useful, but very difficult to get to with the current system (it also costs tens of thousands of dollars to get all of the data).
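
    As a rough illustration of the "one parser per format" problem described above, the sketch below registers a parser per format and normalizes everything into a single record type. The two layouts shown, the field tags (`PN`, `TI`), and the XML element names are hypothetical placeholders, not the actual USPTO schemas; only two of the four formats are mocked up.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PatentRecord:
    """Common shape every format is normalized into."""
    number: str
    title: str


def parse_card(text: str) -> PatentRecord:
    # Hypothetical line-oriented layout: a short tag, a space, then the value.
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return PatentRecord(number=fields["PN"], title=fields["TI"])


def parse_xml(text: str) -> PatentRecord:
    # Hypothetical element names; a real schema would differ.
    root = ET.fromstring(text)
    return PatentRecord(
        number=root.findtext("number") or "",
        title=root.findtext("title") or "",
    )


# One entry per source format; adding a format means adding one parser here.
PARSERS: Dict[str, Callable[[str], PatentRecord]] = {
    "card": parse_card,
    "xml": parse_xml,
}


def parse(fmt: str, text: str) -> PatentRecord:
    """Dispatch to the parser registered for the given format name."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
    return parser(text)


card_record = parse("card", "PN 1234567\nTI Widget")
xml_record = parse("xml", "<doc><number>1234567</number><title>Widget</title></doc>")
```

    Both calls yield the same `PatentRecord`, which is the point: downstream analysis only ever sees the common record, and the per-format mess stays inside the parsers.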

    --
    The right to protest the State is more sacred than the State.
  7. Why by CanHasDIY · · Score: 3, Interesting

    Can you provide any explanation as to why it is so difficult and cost-prohibitive to obtain records from the government, especially considering the abundance of laws requiring government compliance with requests for information (AKA "Sunshine Laws")?

    Is it simply a matter of government employee ineptitude, or have you found evidence of a more nefarious rationale?

    --
    An enigma, wrapped in a riddle, shrouded in bacon and cheese
    1. Re:Why by Anonymous Coward · · Score: 3, Informative

      Having worked for the government in the recent past, I can offer a few insights...

      1 - A lot of government agencies, on receiving a request for information, will kick it over to the IT department, on the grounds that "they keep the data". Unfortunately, because of the way things are structured, while the people in IT may run the disks and servers, they don't actually deal with the data... which means they either have to fight an internal battle with the people who actually manage the data, or take the path of least resistance and offer to provide the data as some sort of raw data dump.

      2 - A few agencies regularly get requests for data, and have people whose job it is to work with the public in getting them the data they request. Most, however, don't. This means that someone gets stuck with the task for whom it isn't part of their normal job. Since they don't deal with translating data formats, exporting large chunks of data, etc. on a regular basis, they have to go find out how to do these things... and while they're doing that, they can't do their normal work.

      3 - The data the agency has may be mixed in with other data, which might be confidential. To give a real example, I was working for the Department of Environmental Protection in my state, and we were sent a request to give out a list of all our employees' names and email addresses. You'd think that'd be simple, right? However, our employees include a law enforcement division, many of whom are exempt from having their personal information disclosed (because they're currently or previously involved in undercover investigations, have held positions as prison guards, etc.). Further, regular employees can be exempt under certain circumstances (e.g., they have a restraining order against a stalker, ex-spouse, or whatever). Now, since no one had ever previously asked us for this info, naturally no one had bothered to make a list of everyone who was exempt... which meant that we had to start creating such a list immediately, and couldn't release the information until it was completed. For extra fun, we also had people in our mail system who weren't employees -- volunteers with the state parks, for example, could get an email address from us. So we had to contact Parks and find out whether all the non-employees with email addresses were correctly marked as such in the system. What in theory should have been a simple "run a script, get a list, email it" operation that could be done in ten minutes took weeks and a lot of man-hours.

      4 - Just like everybody else, records retention is a problem for the government. Storing old data costs money. Keeping the formats that data is in current costs more money. A lot of our programs did "front-end" processing for Federal EPA programs, collecting data in our state, then sending it to the EPA. Our state DEP received funding from the EPA to do this for them. We weren't, though, being paid to keep old records for them... so it'd be kept for however long the EPA required us to keep it, then deleted after that period -- generally six months or a year. Thus, if someone requested data for two years ago from us, we would have to either tell them, "we don't have that -- go ask the Feds" or hope we could find old backups with that data, and that those were still readable. And, of course, since people hate to be the bearer of bad news, and the IT department, which would get handed the request, didn't manage the data, the result would be that it would literally take a week or more for us in IT to find someone who would admit that the data wasn't there.

      5 - And to tell the truth, sometimes we just don't want to. That guy back in #3 who asked for all our email addresses? Well, the only reasons we in IT could see for someone asking for that were either (1) so they could use the list to spam our employees with something, or (2) so they could sell the list to someone who would then spam our employees. Understandably, we weren't highly motivated to get that list back to them quickly, or to keep the
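
      The "run a script, get a list" step from point 3 only becomes a ten-minute job once exemption and employee status are recorded as data. A minimal sketch, assuming a hypothetical record layout with `is_employee` and `exempt` flags (the type, field names, and sample accounts are illustrative and not taken from any real system):

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class MailAccount:
    name: str
    email: str
    is_employee: bool  # False for volunteers and other non-employees
    exempt: bool       # True if disclosure of personal information is restricted


def releasable(accounts: Iterable[MailAccount]) -> List[Tuple[str, str]]:
    """Return (name, email) pairs for employees who are not exempt from disclosure."""
    return [(a.name, a.email) for a in accounts if a.is_employee and not a.exempt]


# Made-up sample data for illustration.
accounts = [
    MailAccount("A. Clerk", "aclerk@example.org", is_employee=True, exempt=False),
    MailAccount("B. Officer", "bofficer@example.org", is_employee=True, exempt=True),
    MailAccount("C. Volunteer", "cvol@example.org", is_employee=False, exempt=False),
]

public_list = releasable(accounts)
```

      The filter itself is trivial; the weeks of work described above go into making the two flags trustworthy for every account.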

  8. Ancestry.com by Anonymous Coward · · Score: 3, Interesting

    What is your opinion about websites like Ancestry.com which make use of public records and charge a subscription fee for access? What is the incentive for the government to migrate old documents into digital form when services like these exist? Do you think Ancestry.com should be a 501(c)(3)?

  9. Who is the worst? by TheBrez · · Score: 5, Interesting

    Which government agency is the worst to get information from?

  10. Scanning ? by SoothingMist · · Score: 3, Interesting

    By "scanning", what do you mean? Are we talking about searchable records or just a bunch of images? If searchable, what quality control is going to be provided? As someone who has re-published books that are out of copyright, it takes a lot of quality control to ensure a usable product. Unless high-quality searchable records in a solid database are the end result, the project is not worth funding, in my personal opinion.

  11. How to get more attention to by oneiros27 · · Score: 3, Interesting

    Recently in the federal register, there were two calls for comments about access to data and research from federally funded research:

    http://federalregister.gov/a/2011-28623
    http://federalregister.gov/a/2011-28621

    I didn't hear about these until ~4 weeks after the original announcement, and with the holidays, it was too late to try to get the societies I'm involved with to prepare and vote on official statements. Are there any places where people can get/post notices of these sorts of things so that we can stay informed and try to help influence policies?

    (note -- the second one on data access doesn't close 'til Jan 12th; NSF also has a similar RFC that closes Jan 18th)

    --
    Build it, and they will come^Hplain.
  12. Idea by hardwarejunkie9 · · Score: 4, Interesting

    Something has been rattling around my head in recent days on this topic and now I think it's a proper time to let it out.

    The amount of information you're trying to free is entirely staggering and consists, largely, of tables of numbers. These numbers are incredibly significant, but people generally can't see them.

    After you free all of this information and make it available to the public (as it should be), then what? What do you expect the public to do with these numbers? Tables of information are not nearly as useful as graphs. This data needs to be seen, but, more importantly, it needs to be understood.

    Do you have any ideas for how to disseminate this information? Perhaps a team-up with someone like gapminder.org's Hans Rosling might be particularly valuable for all of us.

    --
    I like losing arguments, it just means that I can take your point and make it my own.
  13. Just do the gritty work, immediately by G3ckoG33k · · Score: 2

    Make sure you don't get stuck in a standardization process where the aim is to bring different formats together before data is entered.

    Some formats are incompatible today and will be forever.

    The big issue is that such a process will NEVER go anywhere, cost a ridiculous amount of both money and time, with no result in sight, ever.

    Yes, I have seen those processes from a closer range than I wish to remember. Big in-house, between-house, between-block, between-county fights that led to no data ever being entered.

    Just do the gritty work immediately. Don't insist on OCRing everything; just scan it as plain images, as much as you can. Then, if the money is there, consider OCR.

    IGNORE anything that sounds like an untested high-tech solution. Use well established technology, like high performance scanners etc. if it gets the initial job done, entering those damn documents into the computers.

    Look at Google! They did almost all books in the world in just a few years. Did they bother with converting 16th century typesetting into Times New Roman or something similar? Of course not.

    Scan on!

  14. data.gov by oneiros27 · · Score: 2

    The 2011 update to data.gov actually allows whoever is submitting the data to describe it such that people can make use of it, including via visualization (maps, graphs, etc.) or via API to make custom applications.

    So my question for Carl would be: What can we do to get more government agencies to actually put their data in there? And if they won't do it, should resource.org or similar groups work to put up something similar, so that people who have gotten information through FOIA can share it back out to wider audiences?

    --
    Build it, and they will come^Hplain.
  15. Privately Owned, Copyrighted Law by AdamnSelene · · Score: 2

    I think I have read that the law itself cannot be copyrighted and it should be possible to make it available to everyone. But as a techie who drafts standards and specifications, I was wondering about how far this goes--especially since Congress recently proposed enacting some of our standards into law. (They decided not to, but they read some parts into the committee records as they debated.) Can you still accomplish your project if a governmental body adopts (or considers adopting) a privately owned, copyrighted technical reference manual or set of safety standards as administrative law (or regulations that carry the force of law)? Or would such obstacles keep you from being able to digitize all of the government's laws (and archives of proposed laws)?

  16. Not a 501(c)(3) by alexander_686 · · Score: 2

    I am all for open data, and I like what they do, but Ancestry.com should not be a 501(c)(3). It's for profit. Its purpose is to make money.

    If they were dealing strictly with public data then I would have no bones to pick if the U.S. government moved into their business and started to offer the information for free. (well, small bones. We are running a deficit, not sure this ranks on my top 10 list for this decade, but that’s a different debate).

    What they do is combine multiple government databases together. That’s their value add, and I would not want the U.S. government to go there. Once again, I would love to see governments choose a standardized data structure so I can easily query birth records from many different countries over many different centuries – but until then –

    o.k. – now that I have thought about it for 5 minutes – a non-profit that would publish and maintain the different data structures would be o.k. Then open source software to mine the various government databases? Parts of Ancestry.com would be o.k. – but it would have to be public domain.

  17. Encouraging Governments? by theNAM666 · · Score: 3, Interesting

    In a city such as Nashville, things as basic as business ownership and property records are not available online. In states such as New Jersey, public records such as basic corporate filings (officers, operating address/address for service of process) are accessible only for a fee.

    What concrete actions can citizens confronting such situations take to encourage accessibility and accountability?

  18. Can the rare books collections be digitized? by autophile · · Score: 4, Interesting

    Three closely related questions about the rare books collections at the Library of Congress:

    1. I know there is some kind of effort going on to digitize the rare books collections, but can it be sped up? There are many high-quality low-cost archival book scanners out there (such as the ones developed at diybookscanner.org).

    2. It gets really annoying to have to receive paper copies of books when copies are requested. Why not DVDs of high-quality images?

    3. Why is there no outreach by the LoC to smaller, cheaper book scanning efforts? The Internet Archive, DIYBookscanner.org, and Decapod all come to mind.

    --
    Towards the Singularity.
  19. Re:Library of Congress != US national library by autophile · · Score: 2

    The Library of Congress does serve Congress. First. Then it serves the broader US Government. Then it serves the public.

    --
    Towards the Singularity.
  20. Re:gov't employee trying to digitize print mat'l by chill · · Score: 2

    PDF/A.

    If you work for the government, you should be asking NARA, not Slashdot. If you didn't know this, your records officer should.

    --
    Learning HOW to think is more important than learning WHAT to think.
  21. What do you think of corporate partnerships? by mhh5 · · Score: 2

    I'd like to know what Malamud thinks about corporate partnerships in the process to get public data released. (I'm not sure if Google Patents existed before the USPTO released its databases...?) Do corporations that get involved in the process tend to make the process better without question, or are there tradeoffs in some areas because the corporations always want to help but then try to retain a proprietary version of the data for themselves?

  22. LOL by DarthVain · · Score: 2

    Yes. Develop Plans.

    Plan: Scan everything.
    Cost: A lot!
    Budget: Cut.
    Action: None.