Paper to XML?
Scott Taylor writes "I have a paper manual that I would like to convert to an HTML browsable manual and to a text-searchable PDF manual. Most of the pages of the manual use the same table layout (albeit an irregular table). My current thought is to scan in the tables and then somehow, using OCR software, convert the data in the table to an XML marked-up file. From there I can use XSLT and FOP to convert the data to HTML and PDF. The problem is that I don't know how I can make the jump from a scanned-in picture of a table to XML. Anyone out there tried this before? Is there any software that lets one mark up OCR text based on the table cell it was found in? I don't mind spending money on commercial software if necessary (as long as it doesn't cost too much). Is there a better way to solve the problem?"
You could always... you know... sort the data by hand. I hear they used to do that back before everyone got lazy. Or you could parse it with Perl, but personally, I prefer the former.
--Dan
Contact the copyright holder and buy an electronic copy made from the original.
What you need to look for is the right OCR software. The prime consideration is that it's good at figuring out tables. If you find something good, don't rule it out just because of limited/nonexistent XML support. The software will certainly support Comma-Separated-Values, or something similar, and that's easily translated into XML.
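To show how small the CSV-to-XML step is, here's a minimal sketch in Python using only the standard library. The element names (`table`, `row`, `cell`) and the sample data are made up for illustration, and it assumes the OCR package exports a CSV whose first row is the column header; adjust to whatever your software actually emits.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, root_tag="table", row_tag="row"):
    """Convert CSV text (first row = column names) into an XML string."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    root = ET.Element(root_tag)
    for record in reader:
        row = ET.SubElement(root, row_tag)
        # One <cell> per column, tagged with the column name it came from.
        for name, value in zip(header, record):
            cell = ET.SubElement(row, "cell", name=name)
            cell.text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical OCR export; note the quoted field containing a comma.
sample = 'part,description,qty\nA-100,Hex bolt,4\nB-220,"Washer, flat",8\n'
xml_out = csv_to_xml(sample)
print(xml_out)
```

The `csv` module handles quoted fields and `ElementTree` escapes characters like `&` and `<`, so stray punctuation from the OCR won't produce malformed XML.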
As an employee of a company that does OCR (but not the type you're looking for - we do pre-defined form OCR only, like surveys) I can tell you you're going to spend a lot of time verifying that the software identified the letters of the manual properly. Even though we know exactly where the letters will be and of what type they are (alpha, numeric, other) it still can't get them right - yes even with machine print. And we actually have some of the best OCR for our industry; although we limit our processing to only about a second per page, because we're more volume-oriented. (I work on our other products) Basically, don't expect a miracle.
Of course, it might be easier just to find the original source (be it MS Word or whatever) and find software to convert that. Even if you have to get on the phone and track some people down it would be easier than re-proofing and re-tweaking.
Kurdt
I'm not anti-social. Just pro-technology.
- OCR your material. Frankly, OCR still sucks, and if there is any possible way you can get your hands on a source file for your material, do it. Even with a .01% error rate (you wish it were that good!) you're still going to have to proof this damn closely and at great effort.
- Next is the output format. Sure, some software will do its best to figure out a table from a paragraph, but it's still tricky stuff, as prone to error as it is to getting it right. Again, the best procedure is doing it by hand so you know it is done properly the first time, rather than spending days reviewing crappy machine-generated code.
Finally, if you just want to quick-n-dirty get the thing into electronic format and don't mind doing a bit of indexing yourself, Adobe Acrobat can suck in material including OCR, supports indexes, tables, etc., and has an XML implementation.
You might do well just hiring a company to retype this for you. There are any number of outfits that do this at fairly reasonable prices, locally and offshore. They'll likely be cheaper than going through the research/purchase/assemble/scan/OCR/review/output cycle on your own, particularly for a one-off.
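For a rough sense of the proofreading load behind that comparison, here's a back-of-envelope sketch. Only the .01% error rate comes from the thread; the page count and characters per page are made-up illustration values, and it assumes errors hit characters independently, which real OCR errors don't.

```python
# Back-of-envelope estimate of OCR proofreading load.
error_rate = 0.0001      # .01% per-character error rate (figure from the thread)
pages = 200              # hypothetical manual length
chars_per_page = 2000    # hypothetical characters per page

total_chars = pages * chars_per_page
expected_errors = total_chars * error_rate

# Chance that a given page comes through with no errors at all,
# assuming independent per-character errors.
p_clean_page = (1 - error_rate) ** chars_per_page

print(f"expected wrong characters: {expected_errors:.0f}")
print(f"pages with at least one error: {1 - p_clean_page:.0%}")
```

With these made-up numbers that works out to about 40 wrong characters, with roughly 18% of pages containing at least one, and you don't know which pages until you've proofed all of them.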
As to XML vs. anything else - who cares. Get it into any reasonable format and without too much effort you can convert it to another. Unless you're trying for buzzword-bingo just accept any text & table formats & post-process to your favorite flavor.
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
You've got two problems here:
As for the first, I don't think that you're going to find a completely automated solution. First of all, OCR isn't terribly accurate to begin with. Second, you have to convert OCR'd plain text to marked-up XML. You'll probably have to do this by hand unless the manual you're entering is very rigidly structured.
However, I'd certainly recommend using DocBook as your intermediate XML format. It's a well-designed language targeted at technical manuals. Don't re-invent the wheel with your own XML format and XSLT style sheets.
DocBook supports RTF, PDF, HTML, PS, LaTeX, and other output formats. Do yourself a favor and use it.
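For anyone who hasn't seen DocBook tables: they use the CALS model (`table`, `tgroup`, `thead`, `tbody`, `row`, `entry`). Here's a minimal sketch in Python that builds one from rows of cell text. The function name and the sample data are made up for illustration, and it only covers a regular grid; an irregular table would also need the spanning attributes, which this doesn't attempt.

```python
import xml.etree.ElementTree as ET

def docbook_table(title, header, rows):
    """Build a simple DocBook (CALS) table element from lists of cell text."""
    table = ET.Element("table")
    ET.SubElement(table, "title").text = title
    # tgroup must declare how many columns the table has.
    tgroup = ET.SubElement(table, "tgroup", cols=str(len(header)))
    head_row = ET.SubElement(ET.SubElement(tgroup, "thead"), "row")
    for name in header:
        ET.SubElement(head_row, "entry").text = name
    tbody = ET.SubElement(tgroup, "tbody")
    for record in rows:
        row = ET.SubElement(tbody, "row")
        for value in record:
            ET.SubElement(row, "entry").text = value
    return table

# Hypothetical data standing in for one proofread page of the manual.
tbl = docbook_table("Fastener list", ["Part", "Qty"],
                    [["A-100", "4"], ["B-220", "8"]])
docbook_xml = ET.tostring(tbl, encoding="unicode")
print(docbook_xml)
```

Once the tables are in this shape, the existing DocBook stylesheets can take over the HTML and PDF output instead of hand-written XSLT.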
--Bruce
There are 10 kinds of people in the world: those who understand binary, and those who don't.
I fail to see why you would go through all those steps when Adobe Acrobat already does what you want pretty much automatically.
You can convert directly from scanned documents to OCR'd PDFs using ABBYY's FineReader, which can be found at FineReader.com.