Aussie Government Gives PDF the Thumbs Down
littlekorea writes "The central IT office of the Australian Government has advised its agencies to offer alternatives to Adobe's Portable Document Format to ensure folks with impaired vision are able to consume information on the Web. A Government-funded study found that PDFs can present themselves as image-only files to screen readers, rendering the information contained within them unreadable for the vision impaired."
ISO has already created the standardized PDF/X subsets, which are widely used in the publishing industry. They lack support for extra features like scripting and other extensions.
The main problem with PDF for document archives is that it is a presentation format and doesn't adequately preserve text structure since everything is broken down into lines of text or individually placed glyphs. Analysis of a page layout can only bring back so much. There are better ways to store data that offer more versatility.
I am becoming gerund, destroyer of verbs.
Not necessarily. PDF does not preserve text flow. It breaks up paragraphs into lines (or less if kerning has been altered), and places them accurately on the page. If you have a multi-column layout, then a pdf-to-text algorithm (first step in screen reading) is likely to put column-2-line-1 between column-1-lines-{1 and 2}. Best of luck sorting that out.
Prediction for end of Universe #42: Fencepost error in Quantum_bogosort.cpp
Not necessarily. PDF does not preserve text flow. It breaks up paragraphs into lines (or less if kerning has been altered), and places them accurately on the page.
This is not true. PDF is capable of preserving text flow if the document contains such information. See this as an example: if you open it in Acrobat Reader and move the text cursor using the down arrow, you'll see it travel correctly among columns and paragraphs.
No page description format will help if the page has been generated in a broken way: for instance, try extracting text from the tables of an HTML page generated by JavaScript.
If you have a multi-column layout, then a pdf-to-text algorithm (first step in screen reading) is likely to put column-2-line-1 between column-1-lines-{1 and 2}. Best of luck sorting that out.
In this case it is the pdf-to-text algorithm that is broken, and it should be fixed.
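To make the disagreement concrete, here is a rough Python sketch of both behaviours. It is not taken from any real pdf-to-text tool: the `Line` record, the coordinates, and the 50-unit `gap` threshold are all made up for illustration, and it assumes each line's position has already been extracted from the page. Real layouts (headings spanning columns, ragged indents, tables) need far more than this.

```python
from dataclasses import dataclass


@dataclass
class Line:
    x: float  # left edge of the line on the page (hypothetical units)
    y: float  # distance from the top of the page
    text: str


def naive_order(lines):
    """Read strictly top-to-bottom, ignoring columns."""
    return [l.text for l in sorted(lines, key=lambda l: (l.y, l.x))]


def column_order(lines, gap=50.0):
    """Group lines into columns by left edge, then read each column top-to-bottom.

    `gap` is an assumed minimum horizontal distance between column starts.
    """
    columns = []  # each entry is a list of lines sharing roughly the same x
    for line in sorted(lines, key=lambda l: l.x):
        if columns and line.x - columns[-1][0].x < gap:
            columns[-1].append(line)
        else:
            columns.append([line])
    return [l.text for col in columns for l in sorted(col, key=lambda l: l.y)]


# Two columns whose baselines are slightly offset from each other.
page = [
    Line(50, 100, "c1-l1"), Line(50, 120, "c1-l2"),
    Line(300, 105, "c2-l1"), Line(300, 125, "c2-l2"),
]

print(naive_order(page))   # column 2 line 1 lands between column 1 lines 1 and 2
print(column_order(page))  # column 1 is read in full before column 2
```

The naive version reproduces the interleaving complained about above, while the grouped version shows that geometry alone can recover the order in the simple case. When the PDF itself carries reading-order information, using that should be more reliable than guessing from positions.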