From Paper To PDF?

Posted by Cliff on Friday June 16, 2000 @05:51AM from the digital-to-analog-wasn't-nearly-this-difficult dept.

Spoing dropped this bit of informative info into the bin: "Last week, a friend of mine griped that he didn't know of an easy way -- short of getting Adobe Capture and paying per-use licence fees -- of creating searchable PDFs. I scoffed, and told him I've done it many times, and it was free -- as in beer and speech. Dumbfounded, he pushed me to show him how, and I did; print to a Postscript file, and run ps2pdf on it...done! Since every document could be output as Postscript, his problem was solved. If he wanted to batch process the documents, he could set up a few scripts to simplify the task. While he was impressed, he ended up asking what seemed like an easy question; 'Can you do the same with a scanned image?'" And therein lies the question...

"After a week of on/off searching, I did find some good references as well as nearly all the parts necessary for the job, including open source OCR engines, PDF and Postscript tools, search engines, and the like.

Unfortunately, I came up with only two solutions -- neither of them Open Source, and both quite costly (premium beer): Adobe Capture or dedicated "PDF scanners" like this one.

My question to the Slashdot crowd is this:

Is there a cost-effective way of moving existing dead-tree documents into either HTML, PDF, or other searchable mixed text and graphics format?

We all deal with a mix of electronic and printed documents -- and if you're like me, you've paid for some of them in both formats.

If you're like me, you buy new documents in electronic, searchable, format when you can. How many of us have O'Reilly's Networking Bookshelf, or some other CD texts ready to search on our notebooks and networks?

Yet, I have a four foot wide stack of technical documents and books that just isn't going to come with me on each plane trip. I'm not going to get rid of them -- they are still valuable -- but I can't figure out how to make them useful more often.

The available tools for capturing paper and converting it into searchable PDFs are costly, and are geared toward corporations that can justify the costs by the number of users. To me, a per-use licence of Adobe's Capture (see Adobe Capture - Prices and Adobe Capture - Features) is just not cost effective.

If the document is already a text document -- even if it's in some word processor I don't use -- generating PDF files is easy and cheap;

Print a document to a Postscript file, or create one. For example a simple text document is trivial;

enscript file.txt -p file.ps

Convert the resulting Postscript file to PDF;

ps2pdf file.ps file.pdf
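
For batch processing, the two commands above can be wrapped in a small script. What follows is only a sketch in Python, assuming enscript and ps2pdf are both on the PATH; the function names build_commands and convert_all and the one-directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(txt_path):
    """Return the two command lines that turn one text file into a PDF.

    Mirrors the manual steps: enscript writes Postscript, ps2pdf converts it.
    """
    txt = Path(txt_path)
    ps = txt.with_suffix(".ps")
    pdf = txt.with_suffix(".pdf")
    return [
        ["enscript", str(txt), "-p", str(ps)],
        ["ps2pdf", str(ps), str(pdf)],
    ]


def convert_all(directory):
    """Convert every .txt file in a directory (hypothetical layout)."""
    for txt in sorted(Path(directory).glob("*.txt")):
        for cmd in build_commands(txt):
            # check=True stops at the first step that fails
            subprocess.run(cmd, check=True)
```

Calling convert_all("docs") would leave a .ps and a .pdf next to each .txt file in that directory.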

Converting a paper document to PDF is also easy. Just scan the image and use tiff2ps or jpeg2ps to create the Postscript file. The only problem is that the resulting PDF is a bitmap image and isn't searchable.

Interestingly enough, TIFF -- a format used extensively for scanned documents -- does support TIFF+Text, but usually as an extension to TIFF, and it isn't really an optimal format; see The Unofficial TIFF Home Page.

So, if you want to search the documents and keep the formatting and diagrams, you're back to paying Adobe for Capture or some other nearly as expensive method."

Maybe Vividata by anewsome · 2000-06-16 01:03 · Score: 2

I considered doing the same thing years ago with scanned images. I scan hundreds of images per month and I thought the free form text search of the scanned images was in order.

At the time the only OCR software that I could find on Linux was from a company called Vividata. At that time they were just adding Linux support and it didn't seem to work for shit, but the support was pretty new.

I use shell scripts to drive SANE programs to do the scanning and conversion to PDF using convert (ImageMagick) and then ps2pdf (Ghostscript). If the Vividata product actually works now, it might be nice to scan, then OCR, then convert to PDF. A quick index by ht://Dig will then make a nice searchable archive of scanned docs.

The Vividata products however are not free, if this is a consideration.

--Aaron Newsome
Bad Link by Gleef · 2000-06-16 01:27 · Score: 2

Neither http://www.linuxdoc.org/docs/OCR/OCR-HOWTO-0.1 (what you wrote) nor http://www.microsoft.com (what your link pointed to) gives OCR information. There is a little info in the Access-HOWTO, and a little in the unofficial AI/Alife mini-HOWTO. I couldn't find any OCR-HOWTO, and would love a real link to it if you have one.

----

--

----
Open mind, insert foot.
1. Re:Bad Link by Spoing · 2000-06-16 03:46 · Score: 2
  
  I'd tell you I was sorry for the mistake...but I checked them before I submitted the Ask /. a few weeks ago. Back then, they were valid and worked for me!
  
  --
  A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Easy! by nstrug · 2000-06-16 03:16 · Score: 2

Next time you're in Hong Kong buy 'Adobe Special Edition' for about $10. Every Adobe application and plug-in there is including Capture!
For some reason it comes on a CD-R with a xeroxed insert. I can't imagine why Adobe would let their packaging standards slip so badly...
Nick

--
-- "It's a sad day for American capitalism when a man can't fly a midget on a kite over Central Park" - Jim Moran
Re:PDF, Ugh. by YogSothoth · 2000-06-16 07:22 · Score: 2

That's funny, I generated some pristine pdf documents using php *this* *week* and the pdf library used by php is right here and works wonderfully and comes with source. Things have apparently improved since last you looked, the relevant php documentation is here

--
there are two kinds of people in this world - those who divide people into two groups and those who don't
Re:Interns by pen · 2000-06-16 06:04 · Score: 2

>In all honesty, you're not going to get away going from dead tree to digital paper without proofreading at least once. There is no OCR package that perfect. Same goes for your data entry folks.
I've been thinking about this for a while... can't you just scan and OCR it once, nudge the paper on the scanner, scan and OCR it again, and then use a script to compare the two files? You may use more than two scannings if accuracy is that important.
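
The scan-twice-and-compare idea could be sketched roughly as follows. This is a minimal illustration in Python, assuming each OCR pass has already been saved as plain text; the function name disagreements and the sample strings are invented, and only the standard difflib module is used.

```python
import difflib


def disagreements(pass_a, pass_b):
    """Return (words_from_a, words_from_b) pairs where two OCR passes differ."""
    a, b = pass_a.split(), pass_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((a[i1:i2], b[j1:j2]))
    return flagged


# Made-up output of two passes over the same page
first = "the quick brown fox jumps over the lazy dog"
second = "the quick brovvn fox jumps over the 1azy dog"
for left, right in disagreements(first, second):
    print("check by hand:", left, "vs", right)
```

Note that this only flags places where the passes disagree; an error the OCR makes identically in both scans would not show up.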
Something that's been common in the "warez" ebook scene is that people will often correct mistakes in the book as they're reading it, and then spread the corrected version. After a period of time, the book becomes more and more solid.

--
Re:Adobe Acrobat 4.0 by Soong · 2000-06-16 02:10 · Score: 2

Correct. I've used this on my Mac and it works pretty well. The OCR probably misses 10-20 words per page, but is quite good about flagging them as unsure. It has a good interface for going back to do touch-ups on those. It also has a fair interface for running a scanner, getting the data directly into itself, and doing this for successive pages. If you need this, go spend the $250 and support your non-free developers out there in the world.

--
Start Running Better Polls
That's why you need the verify stage by A+nonymous+Coward · 2000-06-16 01:32 · Score: 2

Old keypunch standard practice was to keypunch the holes in the cards, then someone else repunched in verify mode -- it compared and notched the card if it didn't match. For some reason, that practice seems to have disappeared. Do data entry shops still verify the entered data?

So hire two sets of interns or high school kids. Compare the two. Pretty easy. Twice as expensive to get the data in, but it would be more accurate.

Doesn't solve the problem of unreadable original documents which are misread both times, but that's a different story.

--

--
Infuriate left and right
1. Re:That's why you need the verify stage by georgeha · 2000-06-16 01:44 · Score: 2
  
  >So hire two sets of interns or high school kids. Compare the two. Pretty easy. Twice as expensive to get the data in, but it would be more accurate.
  
  If you had the money, you could hire enough sets of high school kids to get a high-school-kid RAID going; that way, you could hot swap the sick ones and not lose any productivity.
  
  George
Re:PDF, Ugh. by Juggle · 2000-06-16 08:55 · Score: 2

That library is great, but if you read the license agreement it is not free (Beer) for commercial use. And since we were being paid to develop what is most definitely a commercial site, unless we got the client to cough up the cost of the lib it wasn't going to be an option.

Not to mention I tend to prefer free (Beer,speech) software for anything I do and anything I pass along to clients.

Luckily, after a bit of work with Google I found some guy in England who had written his own PDF libraries (not nearly as nice as PDFlib linked above), which were GPL'd and had enough functionality to do what I needed.

--
--- Juggle juggle@hitesman.com
Adobe Acrobat "Paper Capture" can do this by specht · 2000-06-16 01:19 · Score: 2

If you don't have to process huge amounts of pages then Adobe Acrobat can do what you want: It's basically a cheap version of Adobe Capture that is probably not as fast and not as easy to use. The "Paper Capture" option is located under the "Tools" menu. I don't think that Adobe will bring out a Unix version of Acrobat 4.0, therefore this is a MS/Mac only solution. But it's more cost effective than Capture.
this must be *UNIX problem I guess. by josepha48 · 2000-06-16 06:33 · Score: 2

I am not sure but I think that this may be just a UNIX problem. I bought a scanner a while back and it came with windows software to do the conversion for me. I have not tried mixed images and text yet, as I have not had a need. TextBridge is the name of the software. I found some info about it here http://www.digitalriver.com/dr/v2/ec_MAIN.Entry10? SP=10023&PN=1&V1=160950&xid=19198 It is not open source and it is fairly inexpensive IMHO. If you buy a scanner I think that they come with this software. It says it can retain color and images. Maybe this and wine? OR maybe enough people will ask them to port to Linux. I think that right now it outputs to word and wordperfect.
Does this help??

send flames > /dev/null

--
Only 'flamers' flame!
Re:where can one find ps2pdf ? by FPhlyer · 2000-06-16 01:50 · Score: 2

If you have Ghostscript installed on your computer, you probably already have this. Most, if not all, Linux distributions (okay, maybe not the "micro-distributions") install it by default so that Postscript files can be filtered to your printer port for output. Try typing "ps2pdf" at the command line and see if you get anything. Also, you can try www.ps2pdf.com, an online engine that lets you upload the ps file and then download the pdf file. Ghostscript is also available for Windows; there you will have to search the installed subdirectories to find the "ps2pdf.bat" batch file that does the same thing.
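
The same check can be scripted instead of typed by hand. A small sketch in Python using the standard shutil.which; the helper name locate_tools and the default tool list are assumptions for illustration, and on Windows the Ghostscript executable and batch file names may differ from the ones listed here.

```python
import shutil


def locate_tools(names=("ps2pdf", "gs", "enscript")):
    """Map each tool name to its full path, or None if it is not on the PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in locate_tools().items():
    print(f"{tool}: {path or 'not found'}")
```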

--
Brought to you by Frobozz Magic Penguin Fodder.
Re:Violating copyright by Sloppy · 2000-06-16 02:43 · Score: 2

>No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means - graphics, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without the written permission of the publisher.

Yes, but their statement about you not being able to do that, is just plain wrong. Just because they say you can't, that doesn't mean you really can't. You didn't actually put your own signature under those words, did you?

If you didn't sign that page of the book, and you didn't get the book directly from the publisher under the terms of some weirdo contract (as opposed to buying it from a bookstore), then the only real restrictions are the ones stated under copyright law. Moving the book into a computer sounds pretty Fair Use -ish to me. Just don't violate the copyright.

---

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Mod this up! by FascDot+Killed+My+Pr · 2000-06-16 01:27 · Score: 2

He's a troll, but he's funny and subtle. "hot breakfast foods", indeed!
--
Compaq dropping MAILWorks?

--
Linux MAPI Server!
http://www.openone.com/software/MailOne/
(Exchange Migration HOWTO coming soon)
Re:Is it legal to convert PostScript to PDF? by Dr.+Sp0ng · 2000-06-16 01:55 · Score: 2

>Unless I miss my guess, Adobe has patented the PDF format and only Adobe Acrobat (and other related products) can legally generate the PDF material.

File formats can't be patented, they can only be trade secrets (I believe). Otherwise, don't you think Microsoft would have patented the .doc format? That would be an extremely easy way to kill off WordPerfect, StarOffice, AbiWord, etc., dot dot dot.
--
Re:Missing a step? by GregWebb · 2000-06-16 01:44 · Score: 2

That's a pity. I used a Mac version 4-5 years ago and it was fantastic. Zero intervention produced _very_ accurate text. Give it the extra few minutes and it was superb. Sorry to hear it's gone downhill. Wonder why?

--
Greg
(Inside a nuclear plant)
Aaaarrrggh! Run! The canary has mutated!
Re:I see what I missed by SEWilco · 2000-06-16 01:17 · Score: 2

Whether you lose the formatting or not will depend upon the OCR software. The OCR software is looking at the scanned image and can be aware of where on the page it is looking, then use that to create a page which looks similar (with whatever formatting commands the OCR program uses...).
The original article didn't mention which nice public OCR programs he found, so we don't know the capabilities of what he already found.
What he needs is an OCR program which can separate text from images and format the text and images in a similar way on a PS or PDF page. At that point PS or PDF to text programs can be used for indexing.
OT: Opensource OCR by LetterRip · 2000-06-16 02:34 · Score: 2

What opensource OCR have you found? And how "intelligent" is it?

What I'd like to do is enhance the intelligence of OCR, for things like forms. The three things that would be useful are these...

The ability to define rectangles and lines before OCR happens, so that it will interpret them as graphics as opposed to part of the text.

The ability to define columns and groups better, and what type of information the column has. For instance phone numbers, addresses, etc. (and thus quit translating 6 to b ...).

A list of frequent mistranslation pairs - OCR tends to make consistent mistakes - if the spell checker were to substitute the alternative character pair for the mistranslation, I would receive a lot fewer misspellings.

I figure that those three options would increase the accuracy of the OCR software that I've been using by 95% easily. (The other five percent is from "Fax noise", photocopy fade, and handwritten notes...)
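
The second and third wishes can be illustrated with a short sketch. This is Python written for illustration only, not taken from any existing OCR package: the CONFUSIONS list, the DIGIT_FIXES table, and both function names are assumptions, and a real list of pairs would have to be collected from the mistakes a particular OCR engine actually makes.

```python
# Hypothetical confusion pairs: (what the OCR produced, what was probably printed)
CONFUSIONS = [("6", "b"), ("1", "l"), ("0", "o"), ("rn", "m"), ("vv", "w")]

# For columns declared numeric, map look-alike letters back to digits
DIGIT_FIXES = str.maketrans({"b": "6", "l": "1", "O": "0", "o": "0"})


def repair(word, dictionary, confusions=CONFUSIONS):
    """Return a dictionary word reachable by one substitution, else the word unchanged."""
    if word in dictionary:
        return word
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in dictionary:
                return candidate
            start = word.find(wrong, start + 1)
    return word


def fix_numeric_field(text):
    """Clean a field known to hold digits, e.g. a phone number column."""
    return text.translate(DIGIT_FIXES)
```

Here repair is meant for free text checked against a word list, while fix_numeric_field only makes sense once a column has been declared numeric, which is the point of defining column types before the OCR runs.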

LetterRip
OCR system by jpowers · 2000-06-16 02:31 · Score: 2

We're about to set one up here: Teleform takes data right from the scanner, OCRs (reads) it, passes the text and the image (tiff or pdf) to an image database (alchemy or imagexx), which has search tools and links to various webserver software. The whole thing will be stored in a DVD jukebox. It wasn't my call, but even though we have huge SPARCs and stuff at our disposal, this will all be under NT (imagexx runs on either).

Total cost: more than I'm worth.
Value of having 8 million documents in a 2x2 cube: your guess is as good as anyone's.

Errata:
-Number of alternate solutions we looked at: 0.
-Number of comparisons between this and alternate solutions I could find: 0.
-Number of replies I got to a request for comparisons on IWETHEY: 0
-Number of seconds my .org considered my request to look at alternate solutions: 0.
-Rank, among the reasons I'm looking for a new job: 2, right behind "Hey let's get Citrix Metaframe so our lame-ass accounting software can track 100 PCs at your location!"

Anyone need linux support in boston?

-jpowers

--

-jpowers
A possible solution? by cr0sh · 2000-06-16 02:03 · Score: 2

Here is a possible solution (from scanned document to html pages), that could work as long as there weren't any funky symbols, etc. embedded in the text (heck, may even work with that if you are deft with a sharpie - as explained in step 1)...

Steps for conversion:

1. For pages with images, draw a colored border around each image on each page. Make the color something that will sharply stand out (like bright green).

2. Tricky part - process each tiff image (in a looped script) doing the following:

a. Scan each page to color tiff, with sequential filenames (001.tiff, 002.tiff).

b. Using a custom written utility, build two new tiff images - a tiff of the page without the color-bordered images, and a tiff of the color-bordered image(s) on the page. Number the page images like (p001.tiff, p002.tiff), and the images for each page (p001i001.tiff, p001i002.tiff), so that it is known which images go with what page.

c. Convert each page image to postscript, then to html (unless there is a tiff2html tool out there?) - preserve the filenames (p001.html, p002.html), modifying only the extension.

d. Convert each image for each page to a (gif, jpeg, png), preserving the filenames (p001i001.png, p001i002.png), with a new extension.

e. Add IMG tags for the images to the end (or beginning) of the html pages, for each page.

3. After batch conversion, go back and proofread/reformat pages (to position images where they should go, etc).

Everything to do this should exist in some form already - except for maybe step 2b - that might be a completely custom tool that needs to be written, but it shouldn't be very hard to code (loop through bytes of image, looking for the sharp color changes - kinda like edge detection code - saving/masking the areas in the outlines)...
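
As a rough idea of what the custom tool in step 2b might do, here is a sketch in Python that finds the rectangle enclosed by a bright green marker border. It assumes the page has already been loaded into a list of rows of (r, g, b) tuples (the loading step is left out), that only one bordered image appears per page, and that the tolerance value is a guess; the names is_marker and marker_bounding_box are invented.

```python
def is_marker(pixel, tolerance=60):
    """True if an (r, g, b) pixel is close to the bright green marker color."""
    r, g, b = pixel
    return g >= 255 - tolerance and r <= tolerance and b <= tolerance


def marker_bounding_box(pixels):
    """Return (left, top, right, bottom) enclosing all marker pixels, or None.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    """
    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, pixel in enumerate(row):
            if is_marker(pixel):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

The area inside the box would be saved as the image file and blanked out of the page file. Pages with several bordered images would need the marker pixels grouped into separate regions first, which this sketch does not attempt.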

--
Reason is the Path to God - Anon
1. Re:A possible solution? by cr0sh · 2000-06-16 04:40 · Score: 2
  
  Ah, heck - that's where it breaks down - the tiff to postscript utils only make a non-searchable bitmap (I read that, and still wrote my method - I must be stupid today - my bad).
  
  Of course, if such a program existed - tiff -> OCR'd postscript (searchable text), then my solution would work (I am not advocating the manual cutting and pasting of images - a piece of code would have to be written to do that) to convert the stuff to html.
  
  Of course, if one went ahead and built an OCR engine (converting tiff to PS), then they could go all the way and add the extra image stuff in and save all the steps I added...
  
  And here I was thinking I was being smart...
  
  --
  Reason is the Path to God - Anon
You haven't got the right Xerox printer by georgeha · 2000-06-16 20:12 · Score: 2

The Xerox printers I use and support, DocuSP 6180, DocuTech 65 and Sprite Network server are all PS Level 3 compliant, which means they understand PDF's also.

George
Best option: TextBridge Pro 8.0 by 1010011010 · 2000-06-16 07:38 · Score: 2

It scans to "Image + Text" PDFs. This represents each page as an image, but includes the OCRed text for searching purposes. It's the best for legal and archival documents, because it's a true reproduction. Completely OCRed text is often inaccurate in terms of both content and presentation.

I was going to use Acrobat Capture, until Adobe ("The Microsoft of the Graphics World") started charging a penny and a half per page. Suddenly, the job went from costing $800 (old Capture pricing) to $25000 (new capture pricing). I even called the Product Manager at Adobe for Capture and asked her why they made such a bold, stupid move. She said that Capture was now a "server product", which justified the price increase. I asked her if she expected anyone to use capture rather than the $80 Textbridge Pro which did the same thing, and she said yes. "You're on the wrong drugs," I said.

To make TextBridge even sweeter, it turned out to be scriptable. I can hand textbridge specialized configuration files for each job. This allowed me to use Perl to automate the conversion of several tens of thousands of TIFF images into multipage, searchable PDFs. Yay, Textbridge!

Apparently, though, Adobe had some words with Xerox (ScanSoft), because Version 9 does not include PDF support. Wankers.

If you can find a copy of Textbridge Pro 8.0 (I think it's the "'97" release), it'll do the trick!

--
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
Re: Proof reading by underwhelm · 2000-06-16 03:51 · Score: 2

I am not certain about this, but I would presume that OCR software designed to recognize form elements will retain picture elements that do not OCR to text.

Software like Omni Form will let you designate areas on the page to ignore. This should retain picture elements and will put OCRd text in a layout that resembles the original. This, of course, most likely requires user input, at least for each different page layout.

--
I don't need large brains to have a good time.
Re:Adobe Acrobat 4.0 by cetan · 2000-06-16 07:16 · Score: 2

That is wrong. Adobe Acrobat 4.0 captures pages using "Capture" under the Tools menu.

--
In Soviet Russia...michael would be rotting in Siberia!
Re:Adobe Acrobat 4.0 by cetan · 2000-06-16 19:12 · Score: 2

That's WRONG! I create pdf's and capture them with Acrobat and they are FULLY SEARCHABLE. There is an OCR layer created in the file. It's searchable in Acrobat and completely indexable!

--
In Soviet Russia...michael would be rotting in Siberia!
Re:Missing a step? by technos · 2000-06-16 01:18 · Score: 2

Textbridge is, ehrm, messy. It also requires a huge amount of user intervention, and a rather large amount of training..

--
.sig: Now legally binding!
Re:Missing a step? by technos · 2000-06-16 02:26 · Score: 2

I've only used the past few revisions, so I can't really speak for the decline.. It will still produce accurate text with little intervention if you're feeding it plain, crisp ASCII text. Feed it a memo on letterhead with paragraphs, font changes and italics, and it prompts you continually. Not to mention it generically interprets formatting; Any one of a dozen detectable ways of formatting a paragraph (one tab, two tabs, three space indent, doublespaced, etc) are rendered only one way in the result. One tab, single spaced, no indent.

--
.sig: Now legally binding!
Primitive searchables.. by technos · 2000-06-16 01:13 · Score: 2

If the text formatting is primitive, and all you want is ASCII text, there are a couple OCR packages available for Linux. They are rather primitive, and at best about twice as error-prone as an entry level commercial product, but they will handle clean text very well. Graphics, snap exception formatting, etc, are not handled by any of them, but they are scriptable.

Entry level commercial products (read: $200, Windows) will export to a .doc or similar wordprocessor file with the gross formatting intact. A few will actually 'guess' what needs to remain an image, and will include it in the finished product. They always skew the formatting some, graphics are not always detected properly, and I have yet to see one that is scriptable. They are also not free in any sense, and tie you to the Windows platform.

OT: Kind of, but..
Something I would like to see is an OCR search-on-demand application; in most document management systems you use only image files, and the information is only searchable by meta data.

--
.sig: Now legally binding!
Re:the OCR situation is not good by passion · 2000-06-16 01:48 · Score: 2

Textbridge (on the Mac) has a "verify" function that allows for interactivity. As it is OCR'ing, it seems to run each word through a dictionary, and if it's not found, then it asks you to verify what it should be. This process makes it only a little bit faster than raw typing.
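
A dictionary check of the kind described above can be sketched in a few lines. This is Python for illustration only and is not a description of how Textbridge works internally; the function name words_to_verify and the idea of passing the dictionary as a set of lowercase words are assumptions.

```python
import re


def words_to_verify(ocr_text, dictionary):
    """Return OCR'd words that are not in the dictionary, in order, without repeats."""
    seen = set()
    flagged = []
    for word in re.findall(r"[A-Za-z0-9']+", ocr_text):
        key = word.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            flagged.append(word)
    return flagged
```

Each flagged word would then be shown to the user next to the scanned image for confirmation, which is where the time goes.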

--
- passion
Microsoft would have it otherwise... by Greyfox · 2000-06-16 02:25 · Score: 2

According to this Microsoft believes you can patent a file format, if not quite the .doc one. I'm gonna patent me raw ASCII...

--
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Violating copyright by onelove · 2000-06-16 01:56 · Score: 2

You might want to look at the front of your four foot wide stack of reference works before you even consider OCR-ing it.
Most books have something along the following lines printed at the front:
All rights reserved. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means - graphics, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without the written permission of the publisher.
Oops. I hoped that didn't apply to the copyright notice I just pirated from my copy of SNMP versions 1&2, Theory and Practice ! - antoine
The OCR situation is better than you think by Codex+The+Sloth · 2000-06-16 04:02 · Score: 2

Caveat: I used to work on OCR engines for Caere / ScanSoft.

The available OSS engines are what you might call 'research quality'. They have some good ideas, but with OCR "the devil is in the details" and there are a lot of details. This is why you will probably not see any good OSS engines in the foreseeable future -- there is a very iterative process between the algorithm development and testing, and the cost of doing this is significant.

The software that comes with scanners is cut back (big surprise) to get you to buy the real version. 100% accuracy on clean documents is not uncommon. Usually the document formatting (which is a much harder problem) is where things break down.

Just one guy's opinion...

--
I am not a number! I am a man! And don't you ... oh wait, I'm #93427. Ha ha! In your face #93428!
Interns by Tom7 · 2000-06-16 01:04 · Score: 2

When we needed to do something like this, we hired high school kids to retype the text for us. It's much cheaper than an auto-feeding scanner and OCR software. =)
People are doing it... by DeepDarkSky · 2000-06-16 01:48 · Score: 2

In an effort to associate everything with Gnutella/Napster (much like the Beowulf Cluster trend), I'd like to point out that I've seen tons of PDFs on Gnutella of books that are currently on the bookshelves, like all the Teach Yourself xx in xx days books, etc. All copyrighted material, all in either PDF, HTML, or txt format. So obviously, people are able to scan books and convert them into PDFs that are completely searchable and with the graphics intact. Adobe's Acrobat does all of that, including OCR, and if it cannot confidently recognize words, it would retain the bitmap of the text in question, just so you can see and possibly edit.
After you solve the paper to PDF problem... by istartedi · 2000-06-16 04:13 · Score: 2

...could you solve my PDF to HTML problem? I haven't seen any cheap converters for that either. I wouldn't hate PDF so much if I could convert it. I understand that dead tree documents have their place, but that shouldn't come at the expense of on-line documents. Until someone comes up with a free PDF to HTML converter, I will continue to complain to companies and government agencies that post documentation in PDF.

The regular .sig season will resume in the fall. Here are some re-runs:

--
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
Review. by istartedi · 2000-06-16 05:58 · Score: 2

Well, as advertised, it *does* convert PDF to HTML in a way that would work very well for text-to-speech software.

It strips *all* formatting, including many br tags. It's really not much better than a plain text converter.

So, if you're visually impaired and need to read a PDF, this is fine, but it falls far short of what I want: A true free PDF to HTML converter that does its best to preserve the look of the original document.

The regular .sig season will resume in the fall. Here are some re-runs:

--
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
Re:OK MODERATORS by pjl5602 · 2000-06-16 01:28 · Score: 2

The person to whom's post this child posted a solution to allow you to OCR from gimp, which would then allow you to post script, and then quite easily create a pdf. This is a far cry from offtopic, but someone felt the need to mark it offtopic.
At least check the link before you flame others about marking something as offtopic (*HINT* it points to http://www.microsoft.com and NO SUCH HOWTO exists.) Duh. :-)
Re: Proof reading by Mr.+Barky · 2000-06-16 01:49 · Score: 2

Perfect OCR isn't necessary for searching documents. As long as the OCR is pretty good, you can get pretty good searches. Since the question stated that they want to look at the diagrams, the original image obviously needs to be saved.
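
To show why imperfect OCR can still give usable searches, here is a small sketch of approximate matching in Python with the standard difflib module. The function name fuzzy_find and the cutoff of 0.8 are arbitrary choices for illustration; a real index would need tuning against actual OCR output.

```python
import difflib


def fuzzy_find(query, ocr_words, cutoff=0.8):
    """Return OCR'd words similar enough to the query to count as a hit."""
    lowered = [w.lower() for w in ocr_words]
    return difflib.get_close_matches(query.lower(), lowered, n=5, cutoff=cutoff)


# A made-up OCR line with one misread character
page_words = "Convert the resulting Postscrlpt file to PDF".split()
print(fuzzy_find("postscript", page_words))
```

A hit on the misread word is enough to pull up the right page, and the reader then looks at the original image rather than the OCR text.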

One could make the text hidden as suggested by post #27.
Re:Adobe Acrobat 4.0 by drinkypoo · 2000-06-18 00:07 · Score: 2

Looks like it:

From: <Saved by Microsoft Internet Explorer 5> Subject: Ask Slashdot: From Paper To PDF? Date: Sun, 18 Jun 2000 10:02:56 -0700 MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_000_0000_01BFD90C .601E59E0"; type="text/html" X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4029.2901 This is a multi-part message in MIME format. ------=_NextPart_000_0000_01BFD90C.601E59E0 Content-Type: text/html; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable Content-Location: http://slashdot.org/comments.pl?sid=00/06/05/23532 19&cid=171

Et cetera. It's even saving it as if it were an mbox entry... Don't get much more open than that... MIME, HTML, BASE64.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:The age old question by shepd · 2000-06-16 05:25 · Score: 2

>Yeah, just like, say, Linux and Windows.

Bingo. If you have only a one man staff for over 100 people that is... In that case:

Windows 9x: $250 gets you an OS in a box. Hope you like it. Supporting it costs very little because you can do very little with it. Like "A meal in a can", its server capabilities are laudable only as an example not to follow -- don't pack so much crap into something that is already bursting at the seams.

Windows NT/2k: $$$$$ gets you an OS in a paper sleeve. It doesn't matter whether you like it or not because once the managers see it you are stuck with it. Supporting it costs very much because you can't do anything with it properly. Takes about 1 server for like 10 clients. Sorta like duct tape when it is used on anything but ducts.

Linux: No money gets you an OS on an FTP site. For one man, supporting that many users is going to cost extreme $$$$$$. But you can do it all on one machine. Just like a big swiss army knife.

Of course, a smart company (too bad these don't exist) would hire 5 people (one per 20), run Linux, and buy X-Terms. This is cheaper than ANY of the Windows solutions I have ever seen...

Just my $0.02

--
If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
Re:Effective Solution by mtphoto · 2000-06-16 01:28 · Score: 2

An interesting thing to look into is a research project called TOM at Carnegie Mellon University. Its goal is to convert all sorts of file formats from one to the other. I can't check it out to give more information because my firewall at work doesn't allow unusual ports (it's served on 8001).
OCR can retain formatting by Alien54 · 2000-06-16 01:27 · Score: 2

It has to save the document into a file format that has complex formatting features. Usually this is something like Word Perfect, etc.
Omni Page has excellent capabilities for OCR that will scan and retain most, if not all, formatting. It also supports this with WordPerfect, not just the Redmond brand X software that you see around.
Unfortunately, it still requires a win9+ machine, but otherwise it falls into the category of Really Good Stuff(tm)
They were separate from TextBridge a while back, but the companies merged during the past couple of years.
The other option is to see if the companies have copies of the books available on CDs, etc. This depends on the company, of course.

--
"It is a greater offense to steal men's labor, than their clothes"
The Holy Grail by Alien54 · 2000-06-16 01:37 · Score: 2

Just as a note, this is a Holy Grail for many companies. I have a number of potential clients who would love this, as they have a whole wall of file cabinets filled with paper docs that they want to convert to electronic docs but cannot because of time, cost, etc., never mind legal issues (original records for legal disputes, etc.).
One in particular that comes to mind is an auto insurance place, with all of those customers who have to process stuff yearly, etc., never mind the usual database issues...
If you figure it out, you have the makings of a great business plan.

--
"It is a greater offense to steal men's labor, than their clothes"
Re:Is it legal to convert PostScript to PDF? by Tei'ehm+Teuw · 2000-06-16 02:39 · Score: 2

GIF??
PDF, Ugh. by Juggle · 2000-06-16 01:30 · Score: 3

I learned my lesson about researching and testing what I offer before selling it to clients thanks to PDF. I knew that PHP was capable of generating PDF's so I went ahead and accepted a job to create a website which would automagically generate PDF resumes for the visitors. What I then found out was that PHP could only generate PDF's if you bought one of two pricy libraries which actually do the PDF work.

I ended up searching for three days (and submitting an ask /. which was discarded) before I found a set of OS (free as in beer and speech) perl libraries for generating PDF's. But oh what a pain. I ended up designing a sample resume in QuarkXpress then using a pica ruler on the printout to convert it to something I could generate. But after about two weeks of hacking I had a resume generator which spits out very clean professional looking resumes in HTML and PDF for anyone who's willing to register on the site and fill out a few simple forms. Client was happy and I tucked another language into my cap. (Since the libraries I found pretty much required you to know PostScript).

Moral of story: test the technology before selling to a client. And trying to generate PDF's on the cheap is only for those who have way more time than money!

--
--- Juggle juggle@hitesman.com
Re:Is it legal to convert PostScript to PDF? by Azog · 2000-06-16 05:40 · Score: 3

The patent on gif is not the gif file format per se, but the compression algorithm.

Torrey Hoffman (Azog)

--
Torrey Hoffman (Azog)
"HTML needs a rant tag" - Alan Cox
PDF XML by 1010011010 · 2000-06-16 07:54 · Score: 3

We've about finished a tool that will do PDF to XML conversions, and back again. It also sports a native API to allow the creation of documents from scratch. It allows embedding of truetype fonts. It runs on Linux and Windows NT.

It'll be out in the next week or so; check Freshmeat.

The idea behind it is, create a nice layout template in the tool of your choice -- Illustrator, for example. Save as PDF. Convert to XML. Add your markup to it -- extra text, etc., convert back to PDF. Done!

Release 1.5 will include a "template" feature, whereby you can use pages from existing PDFs as templates directly; something along these lines (pseudocode):

    p = new pdf();
    t = new pdftemplate("foo.pdf");
    p.newpage("8.5", "11");
    p.include_from_template(t.page(1));
    p.drawstring("Hi!");
    p.write("bar.pdf");

Does this type of tool sound interesting to anyone?

On a related note, we plan to offer it as both open source and a commercial product. For instance, the ActiveX interface would be commercial. You could negotiate a commercial license. And you can use it under something like the Aladdin license (a la ghostscript, pdflib, etc). Any advice on open source + commercial? I have to justify my department's budget.

--
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
Other ways... by antdude · 2000-06-16 01:16 · Score: 3

I asked a friend about this and he said, "no, but the answer is yes, there are other ways....use other OCR engines, like Omnipage Pro or TextBridge Pro. Adobe Capture 3.0 is really really really nice, but is expensive. The searchability factor is the only reason OCRing is needed in most instances."

Some useful sites:
PDF Research
Planet PDF
AcroBuddies
Codecuts
PDF Zone
Adobe
Deja.com

--
Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
I did that in two hours... by Greyfox · 2000-06-16 02:18 · Score: 3

Easy solution:
1) Write LaTeX resume style class. Mine's pretty primitive because it only has to deal with my resume.
2) Create resume using resume style.
3) pdflatex resume.tex.
Or...
3) latex2html resume.tex (Though latex2html doesn't really generate it to look the way I need it, but it is just a simple perl program so you could always hack it.)
Nice thing about LaTeX is you can also go to XML or DVI or RTF or a number of other fairly widely used formats. Or you could just ship the raw LaTeX if the company you're dealing with is that clueful.

--
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
the OCR situation is not good by Jamie+Zawinski · 2000-06-16 01:31 · Score: 4

Last year, I tried several Linux-based OCR packages, and they basically didn't work at all.
I ended up using the Windows software that came with my scanner to OCR the documents, and at first glance it appeared to do a good job -- it didn't mess up too often. But then I went in and actually proofread and spell-checked its output to find all the typos it had made, and it turns out that this process was so time-consuming that it was faster for me to just type it all in by hand. Even though the OCR software only made a mistake every few lines, finding those mistakes took enough concentration that typing the whole thing took less time.
Your mileage may vary, according to how fast you can type.
embed TIFF images in the PDF by jetson123 · 2000-06-16 01:12 · Score: 4

Many Adobe-converted scanned pages seem to be just a sequence of TIFF images with the OCR'ed text also contained in the PDF file. The OCR'ed text is never displayed, but can be used for searching (in my experience, Adobe's OCR is not very good).
So, a simple conversion would consist of just putting the scanned TIFF images in sequence into a PDF file.
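As a rough illustration of the "image on the page, OCR'ed text never displayed" layout described above, here is a minimal Python sketch that builds the content stream for one such page. It relies on PDF text rendering mode 3 (neither fill nor stroke), which is one standard way to get text that is searchable but not drawn; whether Adobe's tools do it exactly this way is not something I can confirm. The resource names /Im0 and /F1, the page size, and the function names are assumptions for illustration, and a real converter would position each word over its place in the scan rather than dumping lines from the top margin. This only produces the page's drawing instructions, not a complete PDF file.

```python
def pdf_escape(text):
    """Escape the three characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_content_stream(ocr_lines, width=612, height=792, font_size=10, leading=12):
    """Build one page's content stream: the scanned image plus invisible text.

    Assumes the page's resource dictionary maps /Im0 to the scanned image
    and /F1 to a font; both names are hypothetical.
    """
    ops = [
        "q",                              # save graphics state
        f"{width} 0 0 {height} 0 0 cm",   # scale the unit square to the page
        "/Im0 Do",                        # paint the scanned image
        "Q",                              # restore graphics state
        "BT",                             # begin text object
        "3 Tr",                           # render mode 3: neither fill nor stroke
        f"/F1 {font_size} Tf",            # select font and size
        f"{leading} TL",                  # line spacing used by T*
        f"72 {height - 72} Td",           # start one inch in from the top-left
    ]
    for line in ocr_lines:
        ops.append(f"({pdf_escape(line)}) Tj")  # show (invisibly) one OCR'ed line
        ops.append("T*")                        # move down to the next line
    ops.append("ET")                            # end text object
    return "\n".join(ops)
```

Calling page_content_stream(["first OCR'ed line", "second line"]) returns the operator text for a US Letter page; it still has to be wrapped in page, image, and font objects to make a file a viewer will open.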
Re:why bother with PDF? by turg · 2000-06-16 03:39 · Score: 4

I don't know about elsewhere but PDF is essential for dead-tree publishing. The advantage it has over all other formats is not that it displays the same on every screen but that it prints the same on every printer (assuming that the author remembered to embed the @#$! fonts, but that's another story :-)
With PDF, you can design and lay out your ad and transmit it electronically (or on disk) to the newspaper, knowing that it will print exactly how it did for you. Or you can lay out your brochure and send it off to the printers knowing the same thing. With any other format, the publisher/printer's machine is going to have at least one (oh, if only it were ever just one!) setting different than yours, which will change the layout.
PDF is the way that print ads are submitted electronically today. It's either PDF or old-fashioned cut-and-paste (no, even more old-fashioned than you're thinking, I mean with actual scissors and glue). The Associated Press runs a "wire service" called AdSend for ad agencies to transmit PDF ads electronically to newspapers and magazines -- and they are transmitting millions of PDF's a year.
The same thing basically goes for sending anything you want printed to a print shop. In any case, free PDF-making software enables dead-tree publishing the same way that the web enables electronic publishing (though we haven't got any print shops that'll work for free, yet :-)


--
<sig>Guvf vf abg n frperg zrffntr
Missing a step? by sugarman · 2000-06-16 01:05 · Score: 4

You mentioned OCR software, but didn't go much further with it. Wouldn't this be the solution you need?
Scan to OCR to PS to PDF
There are apparently a couple of tools to do this for you. Check out a brief list here.
Seeing as you've looked into Adobe Capture, Windows may be an option. If so, then the other question would be whether you've looked into Textbridge? This looks like it would do exactly what you're asking. No muss, little fuss.
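For the last hop of that chain (PS to PDF), the batch scripting the original question mentions might look something like this Python sketch. It assumes the OCR step has already left a directory of .ps files, that ps2pdf (from Ghostscript) is on the PATH, and that writing each PDF next to its source is what you want; the function names are made up for illustration. The OCR step itself is not covered here.

```python
import shutil
import subprocess
from pathlib import Path


def ps2pdf_commands(directory):
    """List one ps2pdf invocation per .ps file, writing the PDF alongside it."""
    commands = []
    for ps_file in sorted(Path(directory).glob("*.ps")):
        pdf_file = ps_file.with_suffix(".pdf")
        commands.append(["ps2pdf", str(ps_file), str(pdf_file)])
    return commands


def convert_all(directory):
    """Run the conversions; return the PDFs for which ps2pdf exited cleanly."""
    if shutil.which("ps2pdf") is None:
        raise RuntimeError("ps2pdf (part of Ghostscript) was not found on PATH")
    done = []
    for cmd in ps2pdf_commands(directory):
        if subprocess.run(cmd, check=False).returncode == 0:
            done.append(cmd[2])
    return done
```

Something like convert_all("scans/") would then process a whole directory in one go; keeping the command list separate from the runner makes it easy to print what would happen before actually doing it.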

--
--sugarman--
The age old question by underwhelm · 2000-06-16 01:12 · Score: 4

I am asked to do this all the time as a computer services employee of Kinkos.

The short answer is using OCR to create a text file, proof reading the text file, and then printing to a postscript file.

The long answer is, you need to find quality OCR software that does not choke on things like forms. You also *MUST* proof read every OCRd document. No OCR is perfect, and drawn elements will almost certainly trip the software into embedding odd characters or pipes into your text. Different font sizes will cause the software to choke. Thin fonts will cause the software to choke.

If you are OCRing forms, I recommend Omni Form (it's the only software I know of that recognizes forms, but I have never used it personally).

Batch processing of OCR pages is likely easy to set up with professional OCR software (Omni Page does it), but it does not excuse you from proofreading the results. After that, the PDF part is a snap, and can be accomplished with any OCR software you choose to use.

If you are asking which OCR software is best, I can't help you directly. OCR software is a niche software market, and you either get free, disappointing software with your scanner, or you pay big money for something that does a decent job. Just like everything else in life. Have you read any OCR software reviews?

--
I don't need large brains to have a good time.
A former intern... by heliocentric · 2000-06-16 02:20 · Score: 4

Speaking as a former intern under a guy who wanted all these meeting minutes from the early 80s on put on the web, I know what you are asking for. I knew HTML and simple coding then, and was only being asked to translate them to HTML. What I did was OCR a ton of the text, only to reduce the keystrokes (it's much easier to drink coffee while swapping pages in a scanner every few seconds than it is typing all day), then I spell checked them as an initial step and formatted them by hand. Then when I moved on to the next ton, and they were in the scanner bed, I would check the grammar of those which I did in the first batch.

So, I ended up being the cheap labor to get the stuff together, but I incorporated the error checking suggested by the other replies, and I utilized OCR to minimize carpal tunnel damage.

Yeah, it took a while, and yes I got paid little in comparison to the other people at the location, but I got paid, they got their silly meeting minutes online, and they didn't have to hire 1,000 monkeys with 1,000 type-writers and have redundancy of people or invest in vast warehouses of paper feeders.

The scale of my work: I worked on a series of bound volumes that took up 3+ feet on a bookshelf and I completed the work on my own in less than 2 weeks (while also fielding tech support questions from the group). If you have 1,000,000 pages to be put online yesterday, maybe you could use a larger staff - but always remember:

If it takes a farmer 3 days to plow a field, and 3 farmers only a day to plow the same field, and it takes one woman 9 months to have a baby, how many months does it take 9 women to have one baby?

Often putting more people on a project doesn't equate to faster solutions or better ones and usually not cheaper ones.

--
Wheeeee
Adobe Acrobat 4.0 by cetan · 2000-06-16 01:16 · Score: 5

You don't need to spend all that money for Adobe Capture 3.0 when you can buy Adobe Acrobat 4.0. This is NOT the Adobe Reader, but the full version of Adobe Acrobat with all the bells and whistles. A url is: http://www.adobe.com/store/products/acrobat.html.

In addition, you can also buy the Adobe Acrobat Business Tools, which is a slightly broken but still functional version of Acrobat 4.0. That is available here: http://www.adobe.com/store/products/acrbustools.html.

--
In Soviet Russia...michael would be rotting in Siberia!
Save money on OCR by sacrificing quality by AnonymousHero · 2000-06-16 02:08 · Score: 5
Ahh... mass-OCR cost-effectiveness... it takes me back...
I just used an off-the-shelf OCR engine and hacked the text together with the images programmatically myself. We would get TIFF images, which most engines could understand.
On really, really big OCR jobs, though, the real problem is the tradeoff between human intervention and quality. See, OCR engines just guess at stuff. The only reason they work at all is that they guess well. But they guess wrong anywhere from 0.1% to 10% of the time, depending on the quality of the input.
Each mistake must be corrected by a human being. But humans are expensive. If you have lots of documents to OCR, the technology integration costs and the cost of the OCR engines themselves are amortized. They end up dwarfed by the paychecks of the humans.
The cost of massive amounts of OCR, therefore, is directly related to the amount of human correction of OCR mistakes.
Thus, you can save tons of money by selectively sacrificing OCR quality. Getting every page perfectly formatted requires around 60 seconds a page for a skilled OCR operator. It's all about reducing that time. How? Simple. Don't expect everything to be perfect. There are various levels of quality you can get out of OCR engines-human systems:
- no correction: just let 'er run. You can get it fully automated this way, but the quality is crap.
- zoning only: The OCR engines just suck at text with multiple columns, inserts, and tables. You can get people to correct the engine's zoning at a clip of around 5 seconds a page, 10 seconds if you require them to put in tokens representing the excised images.
- spelling correction: Typically, most people object to the spelling mistakes OCR introduces. With good quality text an operator can correct them at around 20-30 seconds a page.
- formatting correction: OCR engines can really mess up indentation and text flow. Unfortunately this is the most time-consuming problem to fix, anywhere from 30 seconds to a couple of minutes per page.
Oh, and it really helps if you get the workflow of the OCR down. Allow the operator to move on to the next document automatically, save them the trouble of remembering the name of the document they're working on, etc. etc. This may require a bit of hacking of the OCR engine you're using, but it's worth it.
So when doing something like this, ask yourself: how perfect does it have to be, really? You can save tons of money if you can cut any quality corners.
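To put rough numbers on that tradeoff, here is a back-of-the-envelope Python sketch using the per-page times quoted above. Several things in it are assumptions rather than anything from the figures themselves: each range is collapsed to its midpoint, "a couple of minutes" is read as 120 seconds, each level is costed on its own rather than stacked on the others, and the engine error rate is treated as per character. The hourly rate and page counts are whatever you plug in.

```python
# Seconds of operator time per page, from the ranges quoted above.
# Where a range is given, the midpoint is used (an assumption).
SECONDS_PER_PAGE = {
    "none": 0.0,         # fully automated, no correction
    "zoning": 7.5,       # 5-10 s
    "spelling": 25.0,    # 20-30 s
    "formatting": 75.0,  # 30 s to "a couple of minutes", read here as 30-120 s
}


def correction_cost(pages, level, hourly_rate):
    """Operator cost for correcting `pages` pages at one quality level."""
    hours = pages * SECONDS_PER_PAGE[level] / 3600
    return hours * hourly_rate


def expected_errors(chars_per_page, error_rate):
    """Expected wrong characters per page, assuming a per-character error rate."""
    return chars_per_page * error_rate
```

With a hypothetical 10,000 pages and a $15/hour operator, correction_cost(10000, "zoning", 15) comes to 312.50 while correction_cost(10000, "formatting", 15) comes to 3125.00, a factor of ten for the same stack of paper. And expected_errors(2000, 0.001) versus expected_errors(2000, 0.10) is the difference between about 2 and about 200 wrong characters on a 2,000-character page, which is why input quality matters so much.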