Paper to XML?

Posted by michael on Saturday March 30, 2002 @07:17AM from the save-a-tree dept.

Scott Taylor writes "I have a paper manual that I would like to convert to an HTML browsable manual and to a text-searchable PDF manual. Most of the pages of the manual use the same table layout (albeit an irregular table). My current thought is to scan in the tables and then somehow, using OCR software, convert the data in the table to an XML marked-up file. From there I can use XSLT and FOP to convert the data to HTML and PDF. The problem is that I don't know how I can make the jump from a scanned-in picture of a table to XML. Anyone out there tried this before? Is there any software that lets one mark up OCR text based on the table cell it was found in? I don't mind spending money on commercial software if necessary (as long as it doesn't cost too much). Is there a better way to solve the problem?"

12 comments

One option springs to mind by Sentry21 · 2002-03-30 07:35 · Score: 2

You could always... you know... sort the data by hand. I hear they used to do that back before everyone got lazy. Or, you could parse it with perl, but personally, I prefer the former.

--Dan
yeah by Anonymous Coward · 2002-03-30 08:02 · Score: 0

contact the copyright holder and buy an electronic copy made from the original.
XML, paper: bad keyword by fm6 · 2002-03-30 08:30 · Score: 2

Another badly-focused Ask Slashdot. Though not as bad as most, and the big mistake is a common one: assuming that XML is a central part of any technical problem where XML is involved. It's just a data format!
What you need to look for is the right OCR software. The prime consideration is that it's good at figuring out tables. If you find something good, don't rule it out just because of limited/nonexistent XML support. The software will certainly support Comma-Separated-Values, or something similar, and that's easily translated into XML.
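
To illustrate the "easily translated" part, here is a minimal sketch in Python, assuming the OCR package can export a table as CSV with a header row. The element names (`table`, `row`, `cell`) and the sample data are made up for illustration; they are not from any particular OCR product.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str) -> str:
    """Turn CSV text (first row = column names) into a small XML document."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    root = ET.Element("table")  # hypothetical element names throughout
    for record in reader:
        row = ET.SubElement(root, "row")
        # Pair each value with its column name from the header row.
        for name, value in zip(header, record):
            cell = ET.SubElement(row, "cell", {"name": name})
            cell.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample data standing in for an OCR export.
sample = 'part,torque\nbolt A,12 Nm\nbolt B,"15 Nm, max"\n'
xml_output = csv_to_xml(sample)
print(xml_output)
```

The `csv` module takes care of quoted fields that contain commas, and ElementTree escapes any characters that are special in XML, so neither needs handling by hand.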
Verification by KurdtX · 2002-03-30 09:09 · Score: 4, Informative

As an employee of a company that does OCR (but not the type you're looking for - we do pre-defined form OCR only, like surveys) I can tell you you're going to spend a lot of time verifying that the software identified the letters of the manual properly. Even though we know exactly where the letters will be and of what type they are (alpha, numeric, other) it still can't get them right - yes even with machine print. And we actually have some of the best OCR for our industry; although we limit our processing to only about a second per page, because we're more volume-oriented. (I work on our other products) Basically, don't expect a miracle.

Of course, it might be easier just to find the original source (be it MS Word or whatever) and find software to convert that. Even if you have to get on the phone and track some people down it would be easier than re-proofing and re-tweaking.

--

Kurdt
I'm not anti-social. Just pro-technology.
1. Re:Verification by 4of12 · 2002-04-01 05:01 · Score: 2
  
  Even if you have to get on the phone and track some people down it would be easier than re-proofing and re-tweaking.
  
  That is the best advice.
  
  Don't resort to the heavy-duty technical solution to the problem before exhausting the lower-tech solution first.
  
  Half a day's phone calls and tracking down the current location of your document's author is easier than what you're proposing to do.
  
  Even then, before resorting to OCR to XML, go with the other suggestions of getting a good typist to regenerate the document.
  
  You can do a favor to your successor by including some highly specific URL for the new document showing when it was created, what format it is in, and where on some networked device it is living at the time that the paper was produced.
  
  It's also a nice gesture to publish the new document in several formats to increase the likelihood of future translatability.
  
  No matter how "universal" a particular format happens to be these days, by presenting your document in more than one format you increase the effective lifetime of the document.
  
  --
  "Provided by the management for your protection."
Forest for the trees by maggard · 2002-03-30 10:43 · Score: 5, Informative
You've got two questions conflated here:
1. OCR your material. Frankly OCR still sucks and if there is any possible way you can get your hands on a source file for your material do it. Even with a .01% error-rate (you wish it were that good!) you're still going to have to proof this damn closely and at great effort.
  You might do well just hiring a company to retype this for you. There are any number of such companies that do this at fairly reasonable prices, locally & off-shore. They'll likely be cheaper than your going through the research/purchase/assemble/scan/OCR/review/output cycle on your own, particularly for a one-off.
2. Next is the output format. Sure, some software will do its best to figure out a table from a paragraph, but it's still tricky stuff, as prone to error as it is to getting it right. Again the best procedure is doing it by hand so you know it is done properly the first time, rather than spending days reviewing crappy machine-generated code.
  As to XML vs. anything else - who cares. Get it into any reasonable format and without too much effort you can convert it to another. Unless you're trying for buzzword-bingo just accept any text & table formats & post-process to your favorite flavor.
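
To put the error rate from item 1 in perspective, a rough back-of-the-envelope sketch in Python. Only the 0.01% figure comes from the post; the page count, the characters per page, and the 1% comparison rate are made-up assumptions, so plug in your own numbers.

```python
def expected_errors(pages: int, chars_per_page: int, error_rate: float) -> float:
    """Expected number of misread characters for a given per-character error rate."""
    return pages * chars_per_page * error_rate


manual_pages = 200      # hypothetical manual length
chars_per_page = 2000   # hypothetical page density

# 0.0001 is the 0.01% rate from the post; 0.01 (1%) is an assumed worse case.
for rate in (0.0001, 0.01):
    errors = expected_errors(manual_pages, chars_per_page, rate)
    print(f"{rate:.2%} error rate: about {errors:.0f} wrong characters")
```

Even the optimistic rate leaves errors scattered through the whole manual, and you can't know where they are without proofing every page.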
Finally if you just want to quick-n-dirty get the thing into electronic format and don't mind doing a bit of indexing yourself Adobe Acrobat can suck in material including OCR, supports indexes, tables, etc, and has an XML implementation.
--
I don't read ACs: If a post isn't worth so much as a nom de plume to its author then I won't bother either.
1. Re:Forest for the trees by 56ker · 2002-03-31 12:12 · Score: 1
  
  Just a thought - you could scan the tables in as graphics for the HTML version (and the pdf version too if you want) - that way it'd get around the OCR problems - however it would mean any data in the tables couldn't form part of a text search. On the plus side though it's probably the quickest method.
  
  --
  
  Video Game cheats, hints a
DocBook XML by bruckie · 2002-03-30 15:29 · Score: 1
You've got two problems here:
1. Convert paper to XML
2. Convert XML to HTML and PDF
As for the first, I don't think that you're going to find a completely automated solution. First of all, OCR isn't terribly accurate to begin with. Second, you have to convert OCR'd plain text to marked-up XML. You'll probably have to do this by hand unless the manual you're entering is very rigidly structured.

However, I'd certainly recommend using DocBook as your intermediate XML format. It's a well-designed language targeted at technical manuals. Don't re-invent the wheel with your own XML format and XSLT style sheets.

DocBook supports RTF, PDF, HTML, PS, LaTeX, and other output formats. Do yourself a favor and use it.
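
For the tables, a minimal sketch in Python of what DocBook markup looks like, assuming the cell text has already been pulled out (by OCR or by hand) into rows of strings. The helper name and sample rows are made up; `informaltable`, `tgroup`, `tbody`, `row`, and `entry` are the standard DocBook table elements. Irregular tables with spanning cells would need more attributes than this shows.

```python
import xml.etree.ElementTree as ET


def rows_to_docbook_table(rows):
    """Wrap rows of cell text in a DocBook informaltable (hypothetical helper)."""
    cols = max(len(cells) for cells in rows)
    table = ET.Element("informaltable")
    # tgroup requires a cols attribute giving the number of columns.
    tgroup = ET.SubElement(table, "tgroup", {"cols": str(cols)})
    tbody = ET.SubElement(tgroup, "tbody")
    for cells in rows:
        row = ET.SubElement(tbody, "row")
        for text in cells:
            ET.SubElement(row, "entry").text = text
    return ET.tostring(table, encoding="unicode")


# Made-up sample rows standing in for one page of the manual.
docbook_table = rows_to_docbook_table([["Part", "Torque"], ["Bolt A", "12 Nm"]])
print(docbook_table)
```

The resulting fragment goes inside a DocBook section, and the stock DocBook stylesheets handle the HTML and PDF output from there.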

--Bruce
--
There are 10 kinds of people in the world: those who understand binary, and those who don't.
Acrobat by Anonymous Coward · 2002-03-31 14:38 · Score: 1, Interesting

I fail to see why you would go through all those steps when Adobe Acrobat already does what you want pretty much automatically.
FineReader by RazorDaze · 2002-04-01 19:11 · Score: 1

You can convert directly from scanned documents to OCR'd PDFs using ABBYY's FineReader, which can be found at FineReader.com