Adobe Pushing For Flash and PDF In Open Government Initiative
angryrice tips news that Adobe seems to be campaigning for the inclusion of Flash and PDF in the Obama administration's efforts at increasing government transparency and openness. A post from the Sunlight Labs blog is critical of Adobe's undertaking, in part since PDF is often "non-parsable by software, unfindable by search engines, and unreliable if text is extracted." They also say government's priority should be to publish datasets and the APIs to interact with them, rather than choosing how they're displayed in fancy graphs and charts.
I have no problem with PDFs, there are a number of free and commercial applications out there that can work with them.
Flash on the other hand is absolutely an abomination that must be wiped from the net. They still haven't released a proper version for *BSD and they commonly don't bother with less popular OSes. If they want it to be used for this sort of purpose then they need to get their act together and make it available for all operating environments on an equal basis. Which I don't think they have the resources to do.
Nobody likes Flash, and they probably shouldn't use it for anything. But there's not much wrong with PDF, if it's done right. When publishing something, one could offer "source" (some sane, machine-readable format) and PDF (autogenerated from the source, and prettified for easier reading).
PDF shouldn't be used as a way to encapsulate scanned JPEGs and pretend they're a real electronic document.
I would also note that many of the complaints about PDF as a format in TFA are really complaints about Adobe's abysmal PDF reading software. For example, the concern about the visually impaired: KDE's Okular does speech synthesis and has a high-contrast mode.
# cat
Damn, my RAM is full of llamas.
PDFs are only searchable if the document contains text. Half the time PDFs contain text-as-image, which is about as useful to a search engine as a captcha image. Google doesn't run OCR on PDFs, AFAIK. Although, come to think of it, that sounds like something they'd get sued by a random company for doing for "violating copyright proprietary information".
The road to tyranny has always been paved with claims of necessity.
but specially html5+js+canvas+svg+ogg vorbis/theora for rich web content.
Who has announced authoring tools for this stack that are anywhere near as capable as even Flash 3, let alone Flash CS4? Say I want to make an animated SVG like the Flash animations I see on Newgrounds. What package should I start with?
PDF remains difficult to manage. As with MS Word documents, an incredible amount of resources is wasted on display information rather than actual text or graphical content. Unlike MS Word documents, PDFs are parseable; but unfortunately, as with MS Word, the commercial vendor-sold document creation tool (Adobe Acrobat) generates unstable and unreliable content that interacts very badly with other tools. Oddly, Ghostscript-created PDF remains very stable and legible, and tools like "PDFCreator", which use Ghostscript, create long-term viable PDF printouts of other document formats. I use it for complex MS Word documents that cannot be handled by other software, even different versions of MS Word.
Adobe can actually do better with this, and I hope that they will in the future. But it's not stable enough to be reliably indexed or viewable even 5 years in the future, much less 10 or 20 or 100 such as may be needed for legal or historical documents.
Flash, you're quite right. Unless they open up the source, it has no business as yet another document format.
The summary does not do a good job of reflecting the original blog post's point. The point was that the government should make data available in a machine-parseable and generic format. PDF is a great format for storing typeset pages, but it is a terrible format for publishing data. It's easy to generate beautiful PDFs from well-structured data but it's much harder to go the other way. Would you rather have budget figures (for example) as a CSV file in a well-defined format or as a PDF of tables and graphs? If the data is available in the former format, it's easy for you or a third party to produce the latter format. If it's only available in the PDF form then it's much harder to create the CSV.
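To make the "easy direction" concrete, here is a minimal Python sketch of turning CSV budget figures into a readable table. The agency names, column names and amounts are invented for illustration; nothing here comes from a real dataset.

```python
import csv
import io

# Hypothetical budget figures; names and amounts are made up for this example.
BUDGET_CSV = """agency,year,amount_usd
Parks,2009,1200000
Roads,2009,3400000
Libraries,2009,560000
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def render_table(rows):
    """Render rows as a fixed-width text table with a total line."""
    lines = [f"{'Agency':<12}{'Year':>6}{'Amount (USD)':>16}"]
    total = 0
    for row in rows:
        amount = int(row["amount_usd"])
        total += amount
        lines.append(f"{row['agency']:<12}{row['year']:>6}{amount:>16,}")
    lines.append(f"{'Total':<12}{'':>6}{total:>16,}")
    return "\n".join(lines)

rows = load_rows(BUDGET_CSV)
print(render_table(rows))
```

Going from the data to a presentation takes a dozen lines. Going from a rendered page back to `rows` would mean guessing column boundaries from the layout.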
I am TheRaven on Soylent News
Right...
:D
In order to read a document, what I really need to replace the heavyweight Adobe Reader is a bloated modern browser!
A document format shouldn't store text as an image. That's why it's called text.
They also say government's priority should be to publish datasets and the APIs to interact with them, rather than choosing how they're displayed in fancy graphs and charts.
I felt a great disturbance in the Force, as if millions of IT workers suddenly cried out in terror, and were suddenly silenced.
I record my sleeptalking
GP is right. Government should focus on what government is actually needed for, such as determining standards for formats that everyone can use, with input from academia and industry. For example, a human-readable, parsable format that one could embed in a web page for semantic metadata. Or funding open source software to make it easy (cross-platform) to input such data (I am thinking of information about cited papers or books). Typeset information is nice, but we are already drowning in information - how many pages of Google results do you usually look at? And we need help before generating 10 times as much.
Why PDF is bad:
- It is a portable typeset document package, not a data sharing package that could easily be pulled apart automatically with tools.
- PDF is extremely hard to parse, and using current free software does not always give good results.
- You destroy useful document structure, or in the case of ASCII text parsability and small size, when you convert to PDF. You can't just convert back to the original.
- It takes significant processing power and commercial software to display well and reliably, as far as I can see. Having just gotten the latest Mac I feel like I'm in a dauntless battleship, but I have had a lot of trouble with different Unix tools in the past.
- Scientists publish PDF too, but then also use other formats for data. For example, on arXiv, one scientist recently published animations inside a zip, but it was hard to find the link.
- It is difficult to manage bibliographic information automatically.
- It is proprietary
- It requires a huge amount of data, and arcane knowledge, just to build a parser that works most of the time (such as for Asian languages especially).
Many implementations of PDF converters merely print a document to images and then embed the images into a PDF. Those are non-searchable and no text can be extracted with the existing tools. I once created a documentation website which relied on these embedded image types of PDF documents. I had to implement an OCR solution in order to extract the text to make my clients documentation searchable. It was ugly and a real pain in the ass.
Certainly, PDF can be beautiful, but it is often not implemented that way. Personally, I'm a big fan of PDF. If not implemented properly, I try to avoid it.
"Lame" - Galaxar
That is not really a format issue though, in any format that supports images I can insert an image containing text.
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
You're missing the point. PDFs do not store text. Text is a stream of characters. PDFs store glyphs and their locations. It is more or less possible to convert glyphs into characters, although things like ligatures and the fact that spaces are not really represented make this difficult. In the metadata, some PDFs also store the text of the document, allowing it to be extracted. Given that the PDF is created automatically from the text in most cases, the text is more useful. You can create the PDF from the text easily, but creating the text from the PDF is much harder.
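A toy Python sketch of the glyph problem, assuming each glyph has already been mapped to a character (which real PDFs do not guarantee) and that all glyphs sit on one line. The `Glyph` class and the `gap_factor` threshold are invented for illustration; real extractors use more elaborate heuristics.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str     # character the glyph maps to (assumed known here)
    x: float      # left edge of the glyph on the page
    width: float  # advance width of the glyph

def glyphs_to_text(glyphs, gap_factor=0.3):
    """Rebuild one line of text, guessing a space wherever the gap between
    consecutive glyphs exceeds gap_factor times the previous glyph's width."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    out = []
    prev = None
    for g in ordered:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)

# "to be" positioned with no space glyph at all: only a wider gap after "o".
line = [
    Glyph("t", 0.0, 5.0),
    Glyph("o", 5.0, 6.0),
    Glyph("b", 14.0, 6.0),
    Glyph("e", 20.0, 6.0),
]
print(glyphs_to_text(line))                  # to be
print(glyphs_to_text(line, gap_factor=1.0))  # tobe
```

The same glyphs yield different text depending on a threshold the extractor has to guess, which is why word breaks in extracted PDF text are unreliable.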
I am OK with PDF. I would RATHER see documents in plain HTML, but there are times when formatting is important. In those cases, if it is to be read/print-only, PDF is the way to go. Otherwise, the gov should use ODF.
But Flash? Are you kidding? The last thing on earth we need is more Flash.
* Does not work on all devices
* Slow and/or consumes tons of CPU
* Consumes tons of RAM
* Consumes more bandwidth
* Makes it difficult or impossible to cut and paste
* Impossible to "search/find"
* Violates the native UI look and feel
* Fonts and font sizes are uncontrollable by the end user
* Can't scroll correctly much of the time
* Almost completely proprietary
* Rarely adjusts to screen size
* Often introduces extremely irritating animation.
* Doesn't allow text to be "seen" by the browser (or OS), making other plugins (like a screen reader) 100% useless
At least that SilverDark stuff isn't even on the radar- thank God for little favors.
What are you talking about? The PDF specification has been available as a free download from Adobe with no royalties payable by implementors since PDF was first created. More recently, the PDF/X family of specifications was approved by ISO. These define subsets of the PDF 1.4 specification for different uses (see ISO 15930). There are at least three open source PDF readers that I know of as well as several commercial viewers (Adobe Reader, FoxIt, Apple's Preview, and so on) and numerous tools can generate PDFs.
This sort of authoring is easily handled in vi - or emacs - your choice.
If libertarians are so opposed to effective government, why don't they all move to Somalia?
Printing documents created in other language versions of Acrobat. In particular, the Adobe Acrobat for German created documents that were not only unviewable in a normal Acrobat viewer, but when used to "print PDF" for MS Word documents, created documents that actually crashed Windows computers. The Acrobat for Hebrew didn't crash Windows with the printed documents, but was filled with layout errors when rendered even by Acrobat Reader, errors that didn't show up in the Adobe Acrobat tool. Much of this may have been fixed with the latest release, but I'm not spending nor suggesting that my peers overseas spend all the money needed to upgrade.
Getting our colleagues to stop using Acrobat and use _anything else_ to generate their documents, and use PDFCreator to print them as PDF, stabilized the situation enough for us to generate the documents we needed. It didn't provide PDF forms for people to fill out, which was its only flaw.
Unlike micros~1 word documents, there are freely available specifications and a reasonable number of quite reasonable third party implementations that can either display or generate PDF, or even both. That is to say, you can very well ``do PDF'' without ever using adobe software. Part of its success is that it's a dumbed-down version of PostScript, also open and arguably the right way to talk to printers. That's a whole sight better than micros~1's ooxml abomination, that once standardized turned out to have not even one conformant working implementation. Agree on the flash, but there's more.
PDF is pretty good on storing bound-for-paper documents (and when doing that, use metric paper, dammit) though for scans you're probably better off with DJVU. Flash is basically pure concentrated dancing rodents, and has very little to offer beyond gimmicks. Unless it opens, and opens soon, it will have no staying power and flash data will be rendered useless in a decade or two. That's bad for archiving.
The core goal should be content: content, interop, accessibility for the disabled, accessibility for non-wintendo machines regardless of market share, archiving, being able to re-use, and still being able to access the data centuries down the road. PDF may qualify, flash certainly does not.
If the goal is to make the data available, then even CSV would be a better option than PDF. PDF, while pretty, is a terminal format and is the digital equivalent of a mayfly. It's paper that hasn't happened yet and when it does it will exist for a few short hours before finding its way to the circular file.
Much of the government data consists of tables and tables of data. gzipped csv would be readable by anyone, so would ODF. Adobe appears to be looking for a handout at the expense of creating a useful and open data system.
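For what it's worth, reading and writing gzipped CSV needs nothing beyond a standard library. A minimal Python sketch, with made-up column names and values:

```python
import csv
import gzip
import io

def write_gzipped_csv(header, rows):
    """Serialize rows as CSV text and return the gzip-compressed bytes."""
    text = io.StringIO()
    writer = csv.writer(text)
    writer.writerow(header)
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))

def read_gzipped_csv(blob):
    """Decompress gzip bytes and parse them as CSV into a list of dicts."""
    text = gzip.decompress(blob).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical table; the regions and counts are invented for illustration.
blob = write_gzipped_csv(["region", "count"], [("North", 10), ("South", 25)])
records = read_gzipped_csv(blob)
print(records)
```

Note that CSV carries no type information, so the counts come back as strings and the consumer has to know the schema.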
Put it in context: open government requires data formats that are independent of campaign donors.
Beta is broken and the link to classic doesn't work. Stop wasting our time or there won't be anybody left here.
Yeah, and you can hex edit an SWF file too. But change a letter, refresh, change a letter, refresh, is not the kind of editing that graphic designers prefer to do. If that's what SVG has to offer, the market will choose SWF. I can only hope your comment was sarcasm.
PDF/A is already open. However, that doesn't mean that anyone knows how to produce it, especially some R.O.A.D. staffer or random hourly GS1.
Open or not, PDF/A is a display format and, in most cases, useless for information retrieval or automated data processing. PDF/A is a useful alternative to paper. However, the open government initiative is not talking about paper. It's about 'born digital', machine readable data.
Just recently I had to look at, and print a few pages from, a PDF document. Knowing where it came from, a corporation that is only very slowly dipping a toe in the water of software other than the big names, I'm sure it was done with Adobe.
Now I don't even have the Adobe Acrobat reader on my system, when I try to install it, the install crashes. But Fedora comes with several other PDF readers, and the default is set to "Evince" which works fine MOST of the time.
But I got this PDF, and one page was a picture of a tax form, and when I tried to print it, the tax form came out as a big black blob - man, does that waste ink! Obviously I killed the print job to try something else. (Just VIEWING this tax form was fine, only printing messed up.)
I remembered using "Xpdf" a while ago, so I tried that, and voila, the tax form printed perfectly. Since I knew there were more tax forms in there, I used Xpdf for the rest of the job.
So here is a case where two different PDF viewers reacted differently to the same PDF file. I think what we need is an OPEN DEFINITION for PDF files, probably a subset of Adobe's definition, that any OSS viewer can follow and get the proper results - and ask the user what to do with files that don't follow it.
And tell Adobe they can either follow the open definition, or stuff it where the sun don't shine!
Teen Angel - a Ghost Story
Adobe ships Flash/PDF readers/plugins to: Windows, OS X, Symbian (in some form), Linux, *BSD and various, uncountable tiny platforms. iPhone/iPod does not count because of obvious reasons.
Let's see what MS Silverlight ships to: Windows and Intel Macs. The damn thing is so tied to Windows that they couldn't even convert/ship V2 for PPC Macs, or they simply abandoned them. (like we cared!)
MS's XPS format and viewer is the answer to PDF, which some people who don't use Windows have never, ever heard of. It is that Windows-centric. Despite all the rude attempts by MS (adding XPS printer without etc), it has never, ever taken off.
What we need is something that combines ODF and PDF. You can add a binary file to a PDF document as a kind of layer. ROM LogicWare, the lesser-known developer of the Papyrus office suite, does it right now. The files are both PDF and their own edit format, transparent to PDF readers and NOT a hack.
Of course, people will spend time flaming ("omg flash, pdf, Adobe is slow") rather than finding a solution to a real problem. Asking government to use Flash is really absurd, but the real ones to blame here are MS and the large open-source-based companies. If they offer no alternative, Adobe will suggest PDF, of course. What else should they use? MS XPS?
I work with PDFs a lot, especially on OS X. I am telling you this from an OS on which you can have 60 KB 1080p screenshots in PDF in some circumstances: whoever did that "text as image" trick is a complete moron.
One of the reasons PDF took off is exactly that it embeds the fonts used in a document, so the document appears pixel-perfect on client machines.
As a last resort (and as good practice), you can embed the unformatted plain text of the entire document in your PDF file. PDF, like QuickTime MOV, is one of those formats where people don't use the features and then bitch about the size of the client etc.
Because Flash is now a crucial part of the internet. Until HTML 5 comes out with video standards and the like, Flash is about the only way you can embed videos in sites without ruining the layout of the site with a third-party media player and without your users searching for codecs.
If Adobe would simply release the source to the Flash player, they could -save- money, have full platform compatibility and perhaps make more money with the Flash creation products. Think of it this way, if there was a fast language (most apps in Flash seem to load, run and interact faster than Java) that you could truly write once and run anywhere, it would be a hit. Flash could be this language if Adobe just opens up the player. Until they open it up, I expect them to do a good job and port it to every single OS or platform where it is allowed because it is good for business for them and helps that platform (which in all honesty Adobe should want to kill Windows as quickly as possible and move the world to OS X and Linux).
Taxation is legalized theft, no more, no less.
There is such; Adobe publishes it and makes it freely available on its web site. It's possible your file didn't follow it, but it's more likely your reader wasn't 100% compliant; it's a very complicated specification.
So there is a partial option for MS-Windows only. Great. Not exactly platform agnostic and open. I suppose it is better than nothing, though.
CSV is kinda evil (see my post above), but it's better for tabular data than JSON or XML. Again, a tabular serialization format such as Avro, Thrift, or Protocol Buffers might well be far better than CSV for tabular data. JSON has quite a bit of format bloat, and would need some standardized way to explain the data's schema for further analysis. XML is the king of format bloat, but at least has standard schema representations. XML is far better for semi-structured or unstructured data than tables.
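A rough way to see the format-bloat point for tabular data is to serialize the same rows both ways and compare sizes. This is a Python sketch on invented data (the `id`/`name`/`value` columns are hypothetical); the exact ratio depends heavily on the data.

```python
import csv
import io
import json

# Hypothetical table: 100 rows with three columns.
rows = [{"id": i, "name": f"item{i}", "value": i * 10} for i in range(100)]

def to_csv(rows):
    """Serialize a list of dicts as CSV with a single header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """Serialize the same rows as a JSON array of objects."""
    return json.dumps(rows)

csv_size = len(to_csv(rows).encode("utf-8"))
json_size = len(to_json(rows).encode("utf-8"))
print(f"CSV: {csv_size} bytes, JSON: {json_size} bytes")
```

JSON repeats every column name in every record, which is where most of the extra bytes come from. In exchange it keeps the numbers typed, whereas CSV hands everything back as strings unless a schema is agreed on separately.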
PostScript is also a free specification, but NeXT was using the Display PostScript implementation licensed from Adobe. They switched to something closer to PDF because, it turned out, no one actually cared about the nicer features in PS. With DPS, you could write view objects entirely in PostScript and have them run on the display server. This was quite slow and had all sorts of problems in that the PS programs could (potentially) run forever. Most people just used the drawing subset of PS, which is also available in PDF, and none of the flow control stuff.
Yes, and then they SUED Microsoft for putting PDF support in Office. It's only "open" as long as you're not big enough to compete with Acrobat. If you even get within a mile of stepping on Adobe's business, you're sued up the wazzoo.
"Free and open" my ass.
Comment of the year
Bullshit.
It's either an open standard, meaning anybody can use it-- ANY BODY-- or it's not. There's no such classification as "it's an open standard, except we don't let companies we don't like use it because they have a big marketshare, but other than that it's an open standard believe me!"
By your argument, Microsoft should also be prevented from parsing HTML files in IE because they're a monopoly. Does that make sense? No. Does your argument make sense? No.
A PDF file produced by the LiveCycle suite is actually an XML document with a thin PDF wrapper. The XML conforms to the XFA standard which is owned by Adobe but is a published standard (http://partners.adobe.com/public/developer/en/xml/xfa_spec_2_4.pdf).
All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe