Slashdot Mirror


From Paper To PDF?

Spoing dropped this bit of info into the bin: "Last week, a friend of mine griped that he didn't know of an easy way -- short of getting Adobe Capture and paying per-use licence fees -- of creating searchable PDFs. I scoffed, and told him I've done it many times, and it was free -- as in beer and speech. Dumbfounded, he pushed me to show him how, and I did: print to a Postscript file, and run ps2pdf on it...done! Since every document could be output as Postscript, his problem was solved. If he wanted to batch process the documents, he could set up a few scripts to simplify the task. While he was impressed, he ended up asking what seemed like an easy question: 'Can you do the same with a scanned image?'" And therein lies the question...

"After a week of on/off searching, I did find some good references as well as nearly all the parts necessary for the job, including open source OCR engines, PDF and Postscript tools, search engines, and the like.

Unfortunately, I came up with only two solutions -- neither of them Open Source, and both quite costly (premium beer): Adobe Capture or dedicated "PDF scanners" like this one.

My question to the Slashdot crowd is this:

  1. Is there a cost-effective way of moving existing dead-tree documents into either HTML, PDF, or other searchable mixed text and graphics format?

We all deal with a mix of electronic and printed documents -- and if you're like me, you've paid for some of them in both formats.

If you're like me, you buy new documents in electronic, searchable format when you can. How many of us have O'Reilly's Networking Bookshelf, or some other CD texts ready to search on our notebooks and networks?

Yet, I have a four foot wide stack of technical documents and books that just isn't going to come with me on each plane trip. I'm not going to get rid of them -- they are still valuable -- but I can't figure out how to make them useful more often.

The available tools for capturing paper and converting it into searchable PDFs are costly, and are geared toward corporations that can justify the costs by the number of users. To me, a per-use licence of Adobe's Capture --

  1. Adobe Capture - Prices

    Adobe Capture - Features

-- is just not cost effective.

If the document is already a text document -- even if it's in some word processor I don't use -- generating PDF files is easy and cheap:

Print a document to a Postscript file, or create one. For example, a simple text document is trivial:

  1. enscript file.txt -p file.ps

Convert the resulting Postscript file to PDF:

  1. ps2pdf file.ps file.pdf
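The two steps above are easy to batch. Here is a minimal sketch in Python, assuming enscript and ps2pdf (from Ghostscript) are on the PATH. The helper names conversion_commands and convert_all are made up for illustration.

```python
import subprocess
from pathlib import Path


def conversion_commands(txt_path):
    """Return the two command lines (as argv lists) for one text file."""
    txt = Path(txt_path)
    ps = txt.with_suffix(".ps")
    pdf = txt.with_suffix(".pdf")
    return [
        ["enscript", "-p", str(ps), str(txt)],  # text -> Postscript
        ["ps2pdf", str(ps), str(pdf)],          # Postscript -> PDF
    ]


def convert_all(directory):
    """Run both steps for every .txt file in a directory (hypothetical helper)."""
    for txt in sorted(Path(directory).glob("*.txt")):
        for argv in conversion_commands(txt):
            # check=True stops the batch at the first failing command.
            subprocess.run(argv, check=True)
```

Calling convert_all("docs") would leave a .ps and a .pdf next to each .txt file. The intermediate Postscript files are kept so a failed run can be inspected.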

Converting a paper document to PDF is also easy. Just scan the image and use tiff2ps or jpeg2ps to create the Postscript file. The only problem is that the resulting PDF is a bitmap image and isn't searchable.

Interestingly enough, TIFF -- a format used extensively for scanned documents -- does support TIFF+Text, but usually as an extension to TIFF, and it isn't really an optimal format; see The Unofficial TIFF Home Page.

So, if you want to search the documents and keep the formatting and diagrams, you're back to paying Adobe for Capture or some other nearly as expensive method."

188 comments

  1. use DjVu instead of PDF by Anonymous Coward · · Score: 1
    DjVu is a much better solution than PDF for paper->web conversion.

    The files are 4 to 8 times smaller than PDF for B&W documents. The DjVu plug-in is 10 times smaller and a whole lot faster than Acrobat Reader. It runs on Linux/Unix, Windoze and Mac.

    The DjVu compressor is free for non-commercial uses, and the decoder source code is available.

    Expervision's OCR software can read DjVu files. They even have an OCR toolkit for Linux.

    Although DjVu supports embedded searchable text, Expervision's engine cannot embed text into DjVu files, only produce a text file (or a number of other formats). For web-based search, you can use a simple CGI script to return the DjVu files that correspond to the text files that contain a match to the search string.
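The core of such a search script could look roughly like this in Python. It is only a sketch: it assumes each DjVu file sits next to an OCR'd text file with the same base name, and the function names load_texts and matching_djvu_files are invented for the example. The CGI plumbing is left out.

```python
from pathlib import Path


def load_texts(text_dir):
    """Read every OCR'd .txt file in a directory into {path: contents}."""
    return {
        txt: txt.read_text(encoding="utf-8", errors="replace")
        for txt in sorted(Path(text_dir).glob("*.txt"))
    }


def matching_djvu_files(texts, query):
    """Return the .djvu path for each text file that contains the query."""
    needle = query.lower()
    return [
        Path(txt).with_suffix(".djvu")  # assumed naming: same base name
        for txt, content in texts.items()
        if needle in content.lower()
    ]
```

A CGI wrapper would call matching_djvu_files(load_texts("scans"), search_string) and print one link per returned path. The match is a plain case-insensitive substring test, so OCR errors in the text files will cause misses.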

  2. Is it legal to convert PostScript to PDF? by Anonymous Coward · · Score: 1

    I know that there's software out there that does it, but that's really not the point. Just because there exists a piece of software to accomplish something doesn't mean that something is legal; see DeCSS (though of course in that case, the law is complete nonsense and it should be legal!)

    Unless I miss my guess, Adobe has patented the PDF format and only Adobe Acrobat (and other related products) can legally generate the PDF material. We had an Adobe rep out here once, and we asked him about third-party PDF authoring software, and he told us the same thing: all roads to PDF generation lie through Adobe. It's kind of a raw deal, if you ask me, but at least there are viewers for Linux. Proprietary formats are never a good thing, but they could be far worse (see MS Word!)

    Anyway I don't intend to tell you how to do business. If you're happy with the way your setup is working for you and you're not worried about the possible legal implications, then by all means, go ahead. Just remember that ignorance is no excuse in the eyes of the law (a lesson that I've learned the hard way a couple of times!)

    --
    Dale Sieven

    1. Re:Is it legal to convert PostScript to PDF? by Simm75 · · Score: 1

      Try using the PDF format in the printing industry. Ugh. Well, QuarkXPress 4.x can use PDFs as images, but only the first page of a doc. You still need to have Acrobat or Acrobat Reader around. A lot of folks also don't realize that when they create Acrobat documents (using the kosher software ;^) they MUST make sure they're doing everything right. If they're using funky (read: non-Adobe :^) fonts, they MUST include these fonts. If the doc is to ever go to press (not just a consumer printer) they MUST create their photos correctly. No RGB JPEGs here; just nice, bloated CMYK EPSs or TIFFs. And before you say it, yes, it really DOES make a difference. I deal with these sorts of problems 40+ hours a week.

      It would be much better if documents were stored in a format similar to..oh, I can't think of it, but AT&T has a format that is essentially a two-layered format that's well-suited to storing scanned docs. Then, include text and images in a componentized format.

      What I'm actually thinking about is something similar to what one can do manually in a product such as Illustrator: convert text to outlines, rather than trying to embed fonts. This is the typesetter's worst nightmare: receiving a file that has, say, embedded TrueType fonts :^) and trying to use the PDF on a Mac-based imagesetter. Ugh. I say, use PostScript and simply convert all text to outlines, rather than doing font embedding, then build this PostScript file in a way that can be pulled apart by a capable program.

    2. Re:Is it legal to convert PostScript to PDF? by Kahrul · · Score: 1

      Adobe has not patented the PDF file format. It is an open standard. You can get your copy of the reference manual from any computer bookstore or from Adobe's website. Create all the PDF you want!

    3. Re:Is it legal to convert PostScript to PDF? by Dr.+Sp0ng · · Score: 2

      Unless I miss my guess, Adobe has patented the PDF format and only Adobe Acrobat (and other related products) can legally generate the PDF material.

      File formats can't be patented, they can only be trade secrets (I believe.) Otherwise don't you think Microsoft would have patented .doc format? That would be an extremely easy way to kill off Wordperfect, StarOffice, AbiWord, etc., dot dot dot.

    4. Re:Is it legal to convert PostScript to PDF? by Tei'ehm+Teuw · · Score: 2

      GIF??

    5. Re:Is it legal to convert PostScript to PDF? by Azog · · Score: 3

      The patent on GIF is not on the GIF file format per se, but on the compression algorithm.


      Torrey Hoffman (Azog)

      --
      "HTML needs a rant tag" - Alan Cox
  3. Making PDFs by Anonymous Coward · · Score: 1

    A friend had asked me the same thing recently... by accident I found that Adobe now offers this service to anyone on their website for free. You upload your file (MS Office/graphic file), they convert it and mail it back to you... Free (as in Beer). Limits are 50 Mb max and/or 15 minutes processing time. http://cpdf1.adobe.com/index.pl?BP=NS - The Hutch-Meister

  4. Re:where can one find ps2pdf ? by Anonymous Coward · · Score: 1

    On my system, ps2pdf is a shell script that checks some arguments, then execs gs. It is included in the ghostscript distribution.

  5. Does Anyone know of a Open Source OCR that works? by Anonymous Coward · · Score: 1

    I have a bunch of jpeg files of some documents that are out of copyright I'd love to convert to text, but I haven't seen anything that actually seems to work. There _is_ the SOCR project but they don't seem very far along.

  6. My solution: by Anonymous Coward · · Score: 1

    At work (Department of Agriculture! woohoo!), I had a Beowulf cluster made up of old (and I mean OLD) AT&T Np-17s, running Linux. Now these are 386-based (hence the need for a Beowulf cluster to get any power out of them), so we were kind of limited in the type of scanner we could use, but someone dug up an old parallel-port greyscale, so we were fine.

    Using SANE, we scanned the documents (all text based), and used enscript to convert them to PS (we didn't use PDF, but you can just filter them with ps2pdf as you already know). There were some problems with a few of the papers, which dealt with the nutritional value of a few hot breakfast foods, but it turned out that the FDA logo at the top was giving the old scanner trouble. It turned out it was set up for text scanning only, so don't forget to check your scanner settings.

    All in all, the higher-ups were happy, and so was I.

  7. Re:But wait, there's more - by shogun · · Score: 1

    I think you might want to consider investing in a 40 metre tall video wall for each side of your building then. Or maybe just some well made telescopes to distribute among your readership.

  8. Re:Postscript more widely used in print houses by Enahs · · Score: 1

    Gee, the paper I work for uses PostScript for negative-making (the negatives are used to make printing plates) yet we get AP AdSend ads all the time, as well as PDFs from other sources.

    How do we use them? Well, we can do a number of things:
    1.) Use Adobe Acrobat to export EPSs.
    2.) Use QuarkXPress 4.x+ and import the PDF as an image file.

    The paper I work for also does job printing; we print a number of papers for a few small towns. We get the pages sent to us on a 250MB Zip disk as PDFs. Most of the time we don't even have to convert them. How? Well, our negative maker (or imagesetter) is really a combination of a dedicated "printer" unit and a Power Mac. The Mac handles some of the conversions necessary; our OPI server can actually make 4-color separations of PDFs automatically.

    Yes, PostScript is the standard. But, hey, when something doesn't come in as a PostScript file, you convert. Either that, or you don't get paid to do the job, and your competition does so for you.

    --
    Stating on Slashdot that I like cheese since 1997.
  9. Re:Adobe Acrobat 4.0 by zander · · Score: 1

    We used this some time back to scan whole books (including graphics). Just scan the pages in b/w and the pictures in color or line-art and use Acrobat to OCR the pages.

    The texts were Dutch and we were amazed at the quality. The texts were about 99% accurate compared with the paper version!

    Definitely a winner in our book!

  10. Cosource has projects working on open source ocr by jhebert · · Score: 1
    There is a group of people working via Cosource.com that aim to produce an open source OCR solution:

    http://www.cosource.com/cgi-bin/cos.pl/wish/info/337

    [Disclaimer: I work for Cosource.]

  11. Re:source-code only solution by Trashman · · Score: 1

    Follow the link and you'll see why....

    --
    Do not read this .sig
  12. Re:ps2pdf produces small files by Spirilis · · Score: 1

    I think part of the size reduction from going to PDF is that PDF does compression (zlib? something like it?) on the contents as part of the file format. (or am I wrong?)
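For what it's worth, PDF content streams are commonly stored with the Flate filter, which is the same deflate algorithm zlib implements. Here is a rough illustration in Python of how well repetitive page-description text compresses. The sample content is made up, and this only shows the compression step, not how ps2pdf works internally.

```python
import zlib

# Made-up stand-in for a page full of repetitive drawing operators.
page = b"72 720 moveto (Hello, world) show\n" * 200

packed = zlib.compress(page)
print(len(page), "bytes raw ->", len(packed), "bytes deflated")

# Compression is lossless: decompressing gives back the same bytes.
assert zlib.decompress(packed) == page
```

Real documents are less repetitive than this sample, so the ratio will be less dramatic, and scanned bitmaps behave differently from text.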

    --
    the real at&t mix
  13. Re:Other ways... by Ares · · Score: 1

    Quoth the poster's friend:

    "The searchability factor is the only reason OCRing is needed in most instances."

    That and the need to keep the total file size down to a manageable level.

  14. Generating PDF files by EricTheRed · · Score: 1

    It doesn't matter what format the source is, you can generate PDF files without problem.

    One of my projects is a Java PDF generator library, that allows anything that can be printed in Java to be sent to a PDF file.

    To do this, I used the PDF specification that Adobe publish deep on their web site (Can't remember the link, but a search will find it). The first few pages actually encourage third party developers to write their own generators.

    The one thing that they do restrict, is that no one can "Change the PDF format", but that is reasonable - why (unless you are a certain unmentionable corporation) would you want to do that?

    --
    Java gaming nut - http://www.retep.org/ or for the rail http://uktra.in/
  15. Re:Adobe Acrobat 4.0 by pen · · Score: 1
    IE5 also has a nice webpage saver that saves all the files needed by the page in a separate dir, and redoes the links on the page to point to them. However, it pukes if there is one image that didn't load, and it inserts some random MS crap into the file it saves, including a comment that says the URL where the page was grabbed from (useful), and a few MSHTML thingies (not very useful).


  16. Re:why bother with PDF? by pen · · Score: 1
    (Off-topic)

    Ok, how many of us just opened another window, went to Freshmeat, and spent 5 minutes searching for such a utility, hoping to score some Karma for a link to it? :)


  17. Never search for "Coke" this way by leonbrooks · · Score: 1

    ...because you'll get hits on photos of racing cars, Tibetan general stores, liquor shops, yachts, delivery trucks...

    --
    Got time? Spend some of it coding or testing
  18. Hint: don't use netscape? by leonbrooks · · Score: 1

    Try KFM, Mozilla, Amaya, almost anything else.

    --
    Got time? Spend some of it coding or testing
  19. Or "law" of averages by leonbrooks · · Score: 1

    If you're scanning these using an autofeeder, run each stack of documents through three times and write a little parser (patch could almost do it) to sync up the text from all three scans, and where there is not unanimity, have a "vote" (and maybe spell check the relevant word(s) to see if one of the votes matches a known word, or a suggested spelling alternative matches one or a majority of the votes) to decide which word fits here.

    This would sometimes fail where the source document is mis-spelled. A side-effect might be electronic copies better than the original.
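A minimal Python sketch of the voting step, assuming the three OCR passes have already been synced up word by word (the hard part, not shown here). The names vote and merge_scans and the tiny dictionary are invented for the example.

```python
from collections import Counter


def vote(words, dictionary=frozenset()):
    """Pick one word from the candidates read at the same position."""
    counts = Counter(words)
    best, best_count = counts.most_common(1)[0]
    if best_count > len(words) / 2:
        return best  # a majority of the scans agree
    # No majority: prefer the first candidate that is a known word.
    for word in words:
        if word.lower() in dictionary:
            return word
    return best  # nothing better to go on


def merge_scans(scans, dictionary=frozenset()):
    """Merge several OCR passes that are already aligned word by word."""
    return [vote(column, dictionary) for column in zip(*scans)]
```

Usage would be along the lines of merge_scans([scan1.split(), scan2.split(), scan3.split()], known_words). Note that zip() stops at the shortest scan, so a dropped or split word in one pass throws off everything after it unless the alignment is fixed first.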

    --
    Got time? Spend some of it coding or testing
  20. Open Source (and Free as in Beer) solution by leonbrooks · · Score: 1
    Visit FooLabs and get a copy of xpdf, if your distro hasn't got one already (I'm using Mandrake 7.1 but I recall xpdf in every version of Mandrake from 6.0 on). Type:
    pdftotext filename

    Remember to add -ascii7 if MeatheadSystems' Index Server doesn't like Latin1 character sets.
    --
    Got time? Spend some of it coding or testing
  21. How about searchable images? by sandler · · Score: 1
    I wonder if there's a way to search an image for text. Let's say you want to leave the document in image form, say, to preserve the original look, perhaps for historical documents. If it's all in the same font, I wonder if it's possible to do a "text search" by searching the image for the appropriate patterns in the image. This may be a reproduction of how OCR works, although I think it's a separate functionality that could be quite useful in some circumstances.

    Any thoughts from image gurus on the viability of this?
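As a toy illustration of the idea, here is a brute-force template match in Python over a black-and-white bitmap stored as strings of '#' and '.'. It is a sketch under generous assumptions (same font, same size, no skew), and find_glyph is a made-up name. Real scans would need something far more tolerant.

```python
def find_glyph(page, glyph, max_mismatches=0):
    """Return (row, col) positions where the glyph bitmap fits the page.

    page and glyph are lists of equal-length strings, one character per pixel.
    """
    gh, gw = len(glyph), len(glyph[0])
    hits = []
    for r in range(len(page) - gh + 1):
        for c in range(len(page[0]) - gw + 1):
            # Count pixels that differ between the glyph and this window.
            mismatches = sum(
                page[r + i][c + j] != glyph[i][j]
                for i in range(gh)
                for j in range(gw)
            )
            if mismatches <= max_mismatches:
                hits.append((r, c))
    return hits
```

Searching for a word would mean chaining glyph hits that sit next to each other on the same line. Raising max_mismatches lets noisy pixels through at the price of more false hits.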

  22. Re:source-code only solution by stimpy · · Score: 1

    didn't check the link, did you?

  23. Monks by Skankmofo · · Score: 1

    The most logical and economical solution to your problem is to start a cult and attract 10-15 monks who will spend their every waking hour recreating your precious technical documents in digital form.

    --
    "A great deal of intelligence can be invested in ignorance when the need for illusion is deep." --Saul Bellow
    1. Re:Monks by DanBari · · Score: 1

      Better yet, start a monastery and have them translate the technical documents into Latin, and then Fourier Transform them, and then put them into digital form. Talk about security plus usage...

      --
      Fruit flies like bananas... Time flies like the wind...
  24. For small runs Adobe Acrobat will work just fine by Baboshka · · Score: 1

    There is an option with Acrobat to do OCR on an image. I worked with a Professor who is also a managing editor of a journal to add a searchable set of PDFs of all issues of the journal. He hired a student to feed the scanner. They then loaded the images into Adobe Acrobat and ran a capture text on it and bampf, searchable. They kept the original image with the text behind so they didn't have to correct the mistakes of the OCR. I don't have a copy of it in front of me, so I can't be more specific.

  25. GNU OCR by thingie · · Score: 1

    A nice GNU OCR package:

    http://www.socr.org/

    Not currently being developed at a noticeable rate

  26. Re:If you have a Mac by tweek · · Score: 1

    The PDF writer plugin does the same thing. When we deliver documents to our clients like finished test plan results, all the people here do a print to PDF (they think Word is actually GOOD for technical documentation).

    --
    "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
  27. But wait, there's more - by wirefarm · · Score: 1

    <RANT>Personally, I *hate* PDF. It's a format for people who print everything before reading. Ugh. Buy a bigger monitor and read on screen!</RANT>
    Anyway, I think the way to do it would be to do the following:
    Acquire the images either by scan or by fax. (Or other docs by email or FTP... Why not make this more comprehensive?)
    Store them in a database.
    OCR them as best you can with the tools available at the time.
    Store this OCR'd text in the same row as the image.
    Create a field of keywords derived from the OCR'd text and use this for searches.
    Now you have a simple database of everything you need. The original image, (or document, or whatever,) and the 'Best Guess' as to the contents of the image.
    If a user wants a PDF, let it be created at runtime - Pages 1,2,3...x are the images. The last page is your searchable index of keywords.
    If a better way presents itself later, do that.
    If a user wants it in HTML, great. You can even embed the images.
    The benefits to using a database are these:
    You can always go back and re-OCR the image when better Open-source tools are available.
    You can search your whole company's documents, not just one at a time.
    You are not limiting your users to using one format.

    Don't think of this as a process that has to require a lot of user intervention and only gives you a dead-end format!
    With this method, you are not limiting the output.
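A rough sketch of the database idea above, using Python's built-in sqlite3 module. The table layout and the helper names (open_archive, keywords_from, add_page, search) are assumptions made up for illustration. The OCR step itself is left out, so add_page simply takes whatever text your OCR tool produced.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the archive: one row per scanned page."""
    db = sqlite3.connect(path)
    # image holds the original scan; ocr_text is the current best guess
    # and can be regenerated later; keywords is derived from ocr_text.
    db.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               id       INTEGER PRIMARY KEY,
               doc      TEXT NOT NULL,
               page_no  INTEGER NOT NULL,
               image    BLOB NOT NULL,
               ocr_text TEXT,
               keywords TEXT
           )"""
    )
    return db


def keywords_from(text, min_length=4):
    """Crude keyword list: unique lowercased words of a minimum length."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return " ".join(sorted(w for w in words if len(w) >= min_length))


def add_page(db, doc, page_no, image_bytes, ocr_text):
    """Store the scan and its OCR'd text in the same row."""
    db.execute(
        "INSERT INTO pages (doc, page_no, image, ocr_text, keywords) "
        "VALUES (?, ?, ?, ?, ?)",
        (doc, page_no, image_bytes, ocr_text, keywords_from(ocr_text)),
    )


def search(db, word):
    """Return (doc, page_no) pairs whose keywords contain the word."""
    rows = db.execute(
        "SELECT doc, page_no FROM pages WHERE keywords LIKE ? "
        "ORDER BY doc, page_no",
        ("%" + word.lower() + "%",),
    )
    return rows.fetchall()
```

Because the image stays in the row, re-running a better OCR engine later is just an UPDATE of ocr_text and keywords. The LIKE match is a substring test, so searching for "scan" also finds "scanner"; a proper full-text index would be the next step.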

    Cheers,
    Jim In Tokyo

    --
    -- My Weblog.
    1. Re:But wait, there's more - by Simm75 · · Score: 1

      I work for a small newspaper, and much of the national advertising we get comes in PDF format. While the monitors we have are quite large, they're not quite large enough for our readership to get much out of them, especially since some of our readership lives a few miles away. :^)

  28. [Off topic] An absolutely usable solution by CAB · · Score: 1

    If you need to create PDFs on-the-fly based on, say, database queries or the like, go grab Zope and the ZpdfDocument plugin.

    Go to:
    http://www.zope.org/

    ... and grab Zope:
    http://www.zope.org/Products/Zope/2.1.6/

    ... and the latest version of ZpdfDocument:
    http://www.zope.org/Members/gaaros/ZpdfDocument/

    We use this where I work (IT dept.) in production.
    Except for the fact that it only handles different kinds of text so far, it does run perfectly.

    Just click the "report link" and there Acrobat opens. Neat.


    Best regards,
    Steen Suder

    --
    -- for email: send to .net
  29. What about Trapeze? by Internet+Ninja · · Score: 1

    While it's a commercial product, Trapeze from Onstream Systems may be a good idea. Basically it uses funky document handling gizmos to scan and process paper-based content. A system has recently been devised to turn a very large chunk of Ireland's marriage and death records (and I think birth records) into a searchable, electronic document system.
    As far as I know, they also do OCR as well. All together, it's pretty darn cool. And no, I don't work for them :)

    Cheers,
    Graeme

  30. Re:A possible solution? by adolf · · Score: 1

    This can't work. Existing utilities to convert PS to HTML are text-only. That is, ASCII text, not graphical representations of text. And even if it did work, it would be a horrendous bother, with manual cutting and pasting of images. But I digress...

    Feeding such a program a Postscript file which contains nothing other than an image will not produce the desired output (if any output at all).

    It doesn't matter how many times you convert from one overlapping format to another; OCR systems don't just materialize out of the ether, someone has to write them. And so far, those who have done so don't see the need to give them away.

  31. Commercial support by gruntvald · · Score: 1

    Contact the support desk for your commercial software to answer this. That's why you pay for proprietary software right? Right?

  32. Re:And now for the obligatory... by SEWilco · · Score: 1

    A Beowulf cluster of high school students?
    You're going to give them swords?
    Take a look at "Romeo and Juliet"... No, wait, Romeo and his peers were of junior high school age...

  33. Printing Linux HOWTOs by SEWilco · · Score: 1
    The Linux HOWTOs only print out like that if you're looking at the multipage HTML version.

    They are created with tools which create documents in several formats. If you want to print the entire document, you should use one of the formats which contains the entire document.

  34. PDF to text, then index. by SEWilco · · Score: 1
    If you can get it to PDF, xpdf's "pdftotext" can get you text for categorization. Indexing to actual pages is a little more complex, but a script can do it because that command can select the page to convert.

    Or there's the related PDFTOHTML if you prefer that for your access method.
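A sketch in Python of the page-by-page approach: pdftotext's -f and -l options select the first and last page to convert, and "-" as the output name sends the text to stdout. The function names page_text and build_index are made up, and decoding the output as Latin-1 is an assumption that may not fit your pdftotext build.

```python
import subprocess
from collections import defaultdict


def page_text(pdf_path, page):
    """Extract a single page as text by running pdftotext."""
    argv = ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]
    result = subprocess.run(argv, check=True, capture_output=True)
    # Assumed encoding; adjust if your pdftotext emits something else.
    return result.stdout.decode("latin-1")


def build_index(pages):
    """Map each lowercased word to the sorted page numbers it appears on.

    pages is a dict of {page_number: text}.
    """
    index = defaultdict(set)
    for number, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?()\"'")].add(number)
    # Drop the empty key left behind by tokens that were pure punctuation.
    return {word: sorted(nums) for word, nums in index.items() if word}
```

Something like build_index({n: page_text("manual.pdf", n) for n in range(1, 11)}) would index the first ten pages of a hypothetical manual.pdf. As the replies point out, this only helps when the PDF contains real text rather than a scanned bitmap.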

    1. Re:PDF to text, then index. by Dr.+Ø · · Score: 1

      I've been using the Xpdf converter pdf2txt for a while now, and it's a blast. I use it as an add-on to a search engine that makes searchable indexes for a website.

      The only thing that I think is træls (Danish for "it's a bugger") is that I have to use different converters depending on the version of the original PDF document AND the lock mode of the file.

      If the file is PDF version 1.3 and is copy protected, then pdf2txt version 0.90 is suitable.

      Else, the pdf2txt version 0.80 is suitable.

      This works for me - and it is implemented in a freetext search tool, that we sell.

      Dr. Ø

      --
      Eih bennek, eih blavek
    2. Re:PDF to text, then index. by robbkidd · · Score: 1
      If you can get it to PDF...

      That's the hitch. Scanning it currently stores the scanned document as a bitmap which doesn't know "text" from "pr0n". Converting the scan to PDF makes a big PDF bitmap, also not recognizing any of the content as text to process. That's the missing link the article is talking about ...

  35. Re:why bother with PDF? by lhand · · Score: 1

    In our environment, we have lots of users of different systems, few of which talk to each other (the systems, that is). Many of the systems we create can only address one small part of the overall need but do solve all of that one part. In a small app we're doing now, we have a need to print out a form for the user to sign after he enters a bunch of data. We also keep that data in a database so we can use it when we automate the rest of the system. That will require working with other divisions and departments and is not a small task. There is no reason not to provide a useful product even though it is not perfect.

    Creating a PDF document from the web app is the best way to make sure the form can be used, since the users may need to have HTML fonts, colors, etc. overridden for their use, but the form must be properly formatted with specified fonts, etc.

  36. DocuLex Alternative to Adobe Acrobat Capture by BeIshmael · · Score: 1
    DocuLex has a program that is certified by Adobe and an alternative to Acrobat Capture. It is actually used by the Ricoh scanner you linked to. It appears to be cheaper.


    I haven't had much luck finding anything cheaper. Ideally, I would like something to hook up to our digital copier and convert the scans to .pdf files. I've talked to every photocopier company and no one has a product. They seem to be missing a huge market, but oh well.

    1. Re:DocuLex Alternative to Adobe Acrobat Capture by rawburt · · Score: 1

      Check out http://www.axis.com/products/document_servers/ I have one of these hooked up to a digital photocopier (Ricoh Aficio 650), works like a charm. You can scan about 15-25 pages per minute and save in tiff or pdf.

      --
      --- oops
  37. Quick And Dirty by miracle69 · · Score: 1

    Since the main problem with OCR is, of course, proof reading, there's a quick and dirty way to do this without having the dreaded proofreading step - at least not up front.

    Set up your script to link the OCR page with the original scan. That way, your search engine will most likely be able to get you to the correct page, but if the OCR hoses some important words, you can always just click "original page here" to see what it said. This would allow near immediate functionality of your new database and would allow you to proofread "on the fly" so to speak and correct errors when you find them.

    This should be a good solution (even though it is a bit of a hack job) especially if the searcher is familiar with the particular documents and can devise several searches - in case a keyword or two is munged by OCR.
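To soften the munged-keyword problem a bit, the search itself can be made fuzzy. Here is a small Python sketch using the standard difflib module. It assumes you keep a mapping from each original scan's filename to that page's OCR text, and fuzzy_hits is a made-up name.

```python
import difflib


def fuzzy_hits(pages, keyword, cutoff=0.75):
    """Find pages whose OCR text holds a word close to the keyword.

    pages maps an original-scan filename to that page's OCR text.
    Returns (scan filename, matched word) pairs.
    """
    hits = []
    for scan, text in pages.items():
        words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
        close = difflib.get_close_matches(
            keyword.lower(), words, n=1, cutoff=cutoff
        )
        if close:
            hits.append((scan, close[0]))
    return hits
```

Each hit carries the scan filename, so the results page can link straight to "original page here". Lowering cutoff catches worse OCR damage but also pulls in more unrelated words.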

    --
    Linux - Because Mommy taught me to Share.
  38. HTML first maybe? by drfalken · · Score: 1

    I think there may be tools for converting scanned documents to HTML for the web since HTML is an open standard and the web is everywhere. Loads of vendors work with the web and there may be more tools than for PS or PDF alone.

    Printing from Netscape to PS and then using ps2pdf gives nice and searchable results.

  39. Re:tiff2ps by hattig · · Score: 1
    He mentions this at the bottom of the article, ye who cannot be bothered to read the article.

    The guy wants to know how to take an image containing text, and create a pdf containing an image, with that text as real text, not a bitmap.

    I.e., some software that will OCR the image, grab the text from the image, create a pdf file with that text in, preferably in the same layout as before. If the original image had images with the text, then the images should be preserved in the new document.

    Why the poster didn't say so in such a clear manner is beyond me though!

    Oh, and moderators, this is "Redundant", not "Informative".

  40. Re:tiff2ps by Sperk · · Score: 1

    That is going to create a bitmap type eps file where each pixel is a point. That will not extract the text from the original document. You will need to use some of the recognition software (warning entering area where I know very little) I remember a product called "Text Bridge" that was supposed to take a scanned image and try to recognize the text within the image.

  41. Re:The age old question by underwhelm · · Score: 1

    Shut the hell up.
    You make copies.


    I'm not insulted because I only work to make money. As long as I am paid well, treated with respect and left alone in my private life to enjoy myself as I will, I don't have any compunctions about making copies.

    Meanwhile, I can work from the inside of a large corporation to fight for the right of consumers to make copies of things.

    Maybe you don't know, but Kinko's ability to make copies for people was hampered by a lawsuit from textbook makers. Kinko's can't make copies of copyrighted things, and are expected to make every effort to prevent customers from doing the same. In spite of the fact that what they want to do might be fair use.

    Because we are not legally permitted to make the distinction, we are not allowed to do anything that could possibly infringe.

    That, and they give me plenty of vacation, holiday and sick time; schedule around my education; and pay for me to go to school.

    Not so bad for just making copies. :)

    --

    I don't need large brains to have a good time.

  42. Re:HP Digital Sender by danford · · Score: 1
    Despite the fact your mail system storage suffers... the digital sender from HP is a very very handy way of distributing paper docs.

    It's as easy as faxing something...

    It even does LDAP lookups in our corporate directory for finding email addresses.

    -Eric

  43. Re:OCR can retain formatting by Cy+Guy · · Score: 1


    It has to save the document into a file format that has complex formatting features. Usually this is something like Word Perfect,


    WordPerfect also has their own export to PDF option built-in (they licensed the format from Adobe but wrote their engine). I don't know if it is in every version though.

  44. Re:why bother with PDF? by DrMaurer · · Score: 1

    PDF represents a reasonably good approximation of the "printed" version of a document. I used to HATE PDFs, but I've grown appreciative after working with them cross-platform while creating something like an on-line resume (for those who prefer printing the things out), and a journal with, uhh, shall I say, unique formatting needs.

    However, I needed a Zip disk to carry them all around. Floppies wouldn't hold them. Even the MS Word transcript of the text and the JPEGs (high quality black and white) fit on a floppy.

    Later

    --
    Dan
  45. Try CPEN by narsiman · · Score: 1

    Have you ever tried a product called CPEN? It is a different kind of solution. You can get the text extracted if that is all you are looking for, and store it in any format that you need to. Of course you would miss images, but I guess you were looking at a pure text extraction solution. And since this device works at your pace, you can make modifications as you scan. Pretty nifty.

  46. Re:Effective Solution by jmccay · · Score: 1

    I checked it out. One small problem, it is only a server that you use on the net.

    --
    At the next eco-hypocrisy-meeting, count the private jets used to get to the meeting. Should be interesting to see that
  47. Re:why bother with PDF? by jmccay · · Score: 1

    If I remember correctly, pdf is smaller than html. Doesn't it also keep the images too? Can anyone verify this? I know they are extremely small documents.

    --
    At the next eco-hypocrisy-meeting, count the private jets used to get to the meeting. Should be interesting to see that
  48. Re:If you have a Mac by heliocentric · · Score: 1

    Ok, that takes care of step 2 (the easy step - there are simple and cheap programs for other OS platforms that do the same thing), it's step 1 that's the problem - getting the paper copy into the computer - including text, pictures, and the proper formatting* - then doing step 2 to output the silly thing into whatever format the ultimate recipient (management, web, german speaking clients, etc...) wants.

    * - getting formatting out of OCR is the problem that many in this discussion are talking about - you can get reliable OCR but that's lacking pictures and formatting - you can get good formatting but lose accurate character recognition - and you can grab the pictures, but lose text and formatting. Since you are bringing up the Mac side, maybe you (or another Mac-using person) can suggest a good product/technique/prayer to help with step 1.

    --
    Wheeeee
  49. Relational algebra, primary keys and the like... by heliocentric · · Score: 1

    This is off the main topic, but responds to the post above...

    There currently exists no good way of searching images. I have been working for some time to come up with a way to index pictures (no, not by file names) so they could be put into a database (using primary keys that the system generates) so that queries could be run. My idea consists of (for example) having pictures of people sitting, standing, walking, jumping, lying down, etc... and a user could draw into a Java applet a stick figure of the pose they are looking for (a use of this would be an art college where students need to do portraits, etc...) and the database has keys of how the general shapes are aligned. Current work is crude (you make big circles for portraits, and horizontal lines for people lying down), but I have had some "functional" results - about as good as typing microsoft into google and getting linux.org - they are both OSes, but not much else would lead you to expect the results.

    Now, the point that is on topic. If I/someone can devise a good keying algorithm for pictures (in keeping with the above example) of people, and we incorporate a "depth" amount for how broadly your input is searched, then I guess you could get the precision down to the point of supplying, say, the AskSlashdot logo (located at the top of this page) as the argument to a query on scans of paper and return all those with text (treated as image bits, not ASCII) resembling that of AskSlashdot.

    The technology isn't here, and may never be. I've looked into a program the army was using to get a system that could spot tanks hidden in images (it was more of an AI issue). And the system worked great on test images that were taken before development of scenes with hidden tanks, and scenes with no tanks at all - but it failed miserably on images taken after development - it was learned later that images with tanks hidden in them were taken on a clear day before development, and those with no tanks were shot on a cloudy day. The system was wonderful at telling if it was cloudy out - not at finding tanks however.

    --
    Wheeeee
  50. Re:People are doing it... or not by k_hokanson · · Score: 1

    I got one of those cheapie 'learn programming fast' kits from best buy last year, and it came with 3 or 4 different 'Teach yourself' books in PDF form.

  51. Re:The age old question by mlogan · · Score: 1

    Perhaps the proofreading issue could be minimized by first running the results through ispell, and perhaps writing a Perl script that asks for confirmation on words which are spelled correctly but are very similar to other words.

    Or, if the only purpose of this is to be able to have technical docs on a CDR that you can carry with you on trips, it probably doesn't matter if there are a few typos.
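
    A minimal sketch of that second check, written in Python rather than Perl. It flags words that are valid dictionary entries but sit exactly one edit away from another entry, so a human can confirm them. The tiny word list and the function names are made up for illustration; a real run would load something like the ispell word list instead.

```python
def one_edit_apart(a, b):
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one character from the longer word must yield the shorter one.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def confusable_words(text, dictionary):
    """Map each correctly spelled word in text to its one-edit neighbours."""
    flagged = {}
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word in dictionary:
            near = sorted(w for w in dictionary if one_edit_apart(word, w))
            if near:
                flagged[word] = near
    return flagged


# Hypothetical word list, standing in for a real dictionary file.
WORDS = {"fill", "out", "the", "form", "farm", "firm"}
print(confusable_words("Fill out the form.", WORDS))
```

    The scan over the whole dictionary is quadratic and only suitable for small lists; it is meant to show the idea, not to be fast.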

    -Mark

  52. images to pdf by userunknown · · Score: 1

    I work for the Marshall Space Flight Documentation Repository - turning documents into pdfs is what we do.

    There is a licence available for Acrobat Capture that does not have a per page cost, but it is expensive. Also there is a product called Alchemy that we use to convert practically any file into a pdf or any other type of file for that matter.

    I know of the open source types too but I can't seem to sell those very well at work..

    -Mark

  53. searching for Morpheus by passion · · Score: 1

    This is my impression - or perhaps just my inspiration - from when I rewatched The Matrix for the nth time and paid attention to what Neo was doing when the night clubbers came to his door.

    It seemed as though he had written a search utility that searched through internet documents - including newspaper microfiches - for info on Morpheus. I imagined him sticking some OCR code into his searching utility. This could easily be run through a file like /usr/dict/words - or hell, concatenate foreign dictionaries, and combine it with a Babel-fish-like translator to give the ultimate search results.

    Who's up for starting another OSS project? :)

    --
    - passion
  54. Re:OK MODERATORS by donutello · · Score: 1

    I guess "Troll" would have been a better characterisation. The URL points to www.microsoft.com and the description of the link points to a nonexistent URL.

    --
    Mmmm.. Donuts
  55. Re:Missing a step? by inburito · · Score: 1
    I actually have this fully functional on my computer.

    I recently bought a scanner to be used in Linux(tried hp scanjet 5300c first, doesn't do scl, so switched to 6300c - very nice, if you don't mind $399).

    Windows software that came with the scanner can scan a page of text, recognizes the areas to be ocr'd and ones that are pictures, scans it, automatically does ocr and imports it into a word document with the pictures(looks just about the same as original).

    Ocr is very good and normally only requires minor corrections. Installation also seems to have contained a program called pdfwriter(adobe or not, don't know.. but it's not acrobat) that happily writes that word document into a pdf-file thus completing the task.

    Yup, it doesn't work in linux. Yup, it's not open source. Nope, I don't mind, it WORKS!

  56. Re:People are doing it... by mcrandello · · Score: 1

    I'm fairly positive that those are either cracked e-books or that the files were originally purchased elsewhere. I remember the SAMS teach yourself Linux, it came with 5.2 on a bonus CD, etc. etc...

  57. Re:Adobe Acrobat 4.0 by mcrandello · · Score: 1

    No one really wants to *read* a document in pdf format, but think about this: the Linux HOWTOs. You want to print one out, so you hit the print button on the index page, and that's all you get. Then you must load each page and print that, ad nauseam. The plaintext versions are harder to read than staring at the screen, and all you want is an offline copy. A pdf solves this problem, and also ensures that you are able to print that document *exactly* the same from any source: the unix command line, dos, Macinsmak, it makes no difference at all.

    It always prints out just the way you want it (except four up on our lexmarks here for some reason:/ )

  58. Re:That's why you need the verify stage by smurfi · · Score: 1
    You could also use an OCR engine as one of the inputs.

    There are a bunch of mistakes that are fairly common for all OCR engines -- messing up 'm' as 'rn' being the most common example. There are also a bunch of mistakes fairly common among inexperienced typists.

    The good part is that the two sets don't overlap much (if at all), thus you catch most errors.
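
    A rough sketch of that verify stage in Python, assuming the OCR output and the typed copy are both available as plain text. It lines the two versions up word by word with the standard `difflib` module and reports only the places where they disagree, which are the spots a proofreader would need to check. The sample strings and the function name are invented for illustration.

```python
import difflib


def disagreements(ocr_text, typed_text):
    """Return (ocr_words, typed_words) pairs wherever the two inputs differ."""
    ocr_words = ocr_text.split()
    typed_words = typed_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=typed_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            result.append((" ".join(ocr_words[i1:i2]), " ".join(typed_words[j1:j2])))
    return result


# One OCR-style slip ('m' read as 'rn') and one typist-style slip (dropped letter).
ocr = "The modern is connected to the serial port"
typed = "The modem is conected to the serial port"
for ocr_side, typed_side in disagreements(ocr, typed):
    print(f"OCR: {ocr_side!r}   typed: {typed_side!r}")
```

    Neither input is treated as correct here; the comparison only narrows down where a person has to look.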

  59. Re:why bother with PDF? by Arker · · Score: 1

    Good question. PDF has one advantage - it comes much closer to being an *exact* duplicate of the paper document, whereas HTML naturally involves some abstraction as far as layout and so forth. However, I've never seen a case where that was relevant, despite being forced to use .pdf documents over and over in business settings. I guess someone somewhere thought it would be a horrible thing if a line broke in a different place in the electronic copy than in the original or something.

    At any rate, if preserving the original layout precisely is of critical importance, then I guess .pdf would make sense - otherwise I would definitely go with the smaller and more portable html. Call me a geek, but 99% of the time when I want to reference a paper document, it's the information I care about, not the lovely and artistic layout. A utility to mount a tarball as a directory would be quite handy, that's a great idea, and should be almost trivial to do, no? Any hackers want to get on it? ;^)

    --
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Friends don't let friends enable ecmascript.
  60. Re:Interns by Zibby · · Score: 1

    Isn't that what interns are for? Even better, teach them some basic html and have them create html documents. I found going from html to pdf more useful because somehow the links survive into the pdf, and show up in acrobat reader as bookmarks. Very cool.

    In all honesty, you're not going to get away going from dead tree to digital paper without proofreading at least once. There is no OCR package that perfect. Same goes for your data entry folks.

    --
    "Only two things are infinite, the universe and human stupidity, and I'm not sure about the former." - Albert Einstein
  61. ps2pdf produces small files by SIGFPE · · Score: 1

    I recently put together a paper using TeX and included some tiff images. I was worried about how I was going to submit it to the journal as it ended up being several hundred megs of postscript. And then I tried ps2pdf on it and the file size was reduced by about 90%! I couldn't believe how good ps2pdf was! But free OCR software doesn't seem to exist if you want to create pdf from scanned documents. If it did you'd have a free alternative to the pdf's IBM offers on its patent server (which I believe are searchable - but I may be mistaken).
    --
    -- SIGFPE
  62. adobe capture is not an option by kuma · · Score: 1

    never even heard of adobe capture... why would you even consider this if you are not a corporation? seems stupid.

    raise your hand if you think this software is somehow magic?

    will adobe capture give you flawless electronic documents? if so, how. and if it does not really improve on mainstream pro ocr applications, what are you paying for?

    while my solution would not be free, it would be easy on a personal scale (and this is mac-think, just to counter your cli mumbo-jumbo):
    -- get a good ocr (thinking omnipage pro)
    -- get adobe acrobat (or page-layout app that will convert the ocr docs)

    (pay once for licenses--paying by the page with your own money is insane)

    hell, you can even script the whole thing up using applescript or perl, but if you want documents which are really searchable, you will have to proof results (otherwise you could use searching software which can handle misspellings, not such a good idea).

    of course, i could be wrong, adobe acrobat could really be magical, but do you trust magic? (the reality last time i checked was that you could not get from paper to text on screen without mistakes, and frankly, you will never get error-free translation, humans cannot even read what is literally on the page without introducing errors: identifying every word is error prone as our attention span is weak, but standard reading using context allows all manner of optical illusion and cultural training to stain transcription).

  63. Re:PDF, Ugh. by humphreybogus · · Score: 1
    For an intranet site that generates signed correspondence and formatted fax cover sheets, I have very successfully used a free HTML to PDF converter from Easy SW, called HTMLDOC.

    Use PHP to write out formatted HTML (that's the tweak-heavy part) to disk, then use a shell command to run HTMLDOC and convert it to PDF, and display it in the browser. We've generated literally thousands of documents this way, and it works great. All free as in all free.

    Much easier than using the PDFlib library that comes with PHP, as you can avoid having to learn anything about PostScript. You are at the mercy of HTMLDOC's formatting, which can be quirky. But it's vastly improved in the short time we've been using it, and new versions are out almost biweekly.
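
    A small sketch of the "write HTML to disk, then shell out to HTMLDOC" step, in Python instead of the PHP described above. It assumes the `htmldoc` binary is installed and on the PATH; the file names and function names are hypothetical. `--webpage` and `-f` are the HTMLDOC options for treating the input as a plain web page and naming the output file.

```python
import subprocess


def htmldoc_command(html_path, pdf_path):
    """Build the argument list for converting one HTML file to PDF."""
    # --webpage: treat the input as an unstructured web page (no book layout).
    # -f: write the result to the given file.
    return ["htmldoc", "--webpage", "-f", pdf_path, html_path]


def convert(html_path, pdf_path):
    """Run HTMLDOC; raises CalledProcessError if the conversion fails."""
    subprocess.run(htmldoc_command(html_path, pdf_path), check=True)


# Show the command that would be run for a hypothetical cover sheet.
print(" ".join(htmldoc_command("cover_sheet.html", "cover_sheet.pdf")))
```

    Passing the arguments as a list rather than one shell string avoids quoting problems when file names contain spaces.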

  64. HP Digital Sender by itswhatsinside · · Score: 1
    I briefly looked into a network scanner from HP - the Digital Sender 9100c.

    The device plugs into the network, and after entering some network info, is ready to go. I'm not sure what platforms the admin software runs on, but it includes a copy of Circulate with it. It will scan up to 15 ppm (if I remember correctly) and save the result as a pdf. You can then 'Circulate' it to make it searchable. You can also hook it into certain internet fax providers, as well as tie it into Domino DBs, and other potentially useful things.

    I'm not sure if this is ideal for your situation, but I thought I would throw it out there. You can get one on loan for a couple of months, so it might be worth a shot.

  65. Re:OK MODERATORS - check the link, its a troll. by davebooth · · Score: 1

    when the destination of the link is www.microsoft.com instead of where it claimed to be then yes it should indeed be moderated down. There isnt an OCR HOWTO, at least not on linuxdocs.org there isnt. Had I any points to fling it would have been marked "troll" not "offtopic" though.

    Its noise, not signal.

    # human firmware exploit
    # Word will insert into your optic buffer
    # without bounds checking

    --
    I had a .sig once. It got boring.
  66. Re:PDF, Ugh. by Tom7 · · Score: 1


    I'm not sure why libraries would be expensive for doing this with PHP, but it's disappointing. I downloaded the PDF spec a while ago and wrote my own PDF compiler, which took less than a week (and involved learning PDF-style postscript). The spec is VERY GOOD, and Adobe should be applauded for publishing such a complete and open spec.

    Why aren't there free replacements for the necessary PHP libraries? I'd do it but I don't like PHP. ;)

  67. annotation of PDFs by chrihart · · Score: 1

    I have a pretty large collection of PDF files. What I really need to be able to do is annotate them with notes while reading. Does anyone know of an open source solution to this?

  68. ps2pdf and ghostscript and html pages by codegen · · Score: 1

    I have not had a great deal of success with this. It has only worked for simple documents. For the rest, ghostscript generates what appears to be a pdf file, but ghostscript itself does not seem to be able to display it, and the adobe reader gives an error message "expression to complex". The source ps was postscript generated by printing an html page to a file from netscape. It happens consistently for anything other than the simplest text page. Anyone have any ideas?

    --
    Atlas stands on the earth and carries the celestial sphere on his shoulders.
  69. Re:The age old question by ralmeida · · Score: 1

    OCR software is a niche software market, and you either get free, disappointing software with your scanner, or you pay big money for something that does a decent job. Just like everything else in life.

    Yeah, just like, say, Linux and Windows.

    --
    This space left intentionally blank.
  70. Postscript more widely used in print houses by dms0 · · Score: 1
    Postscript is the choice of most print vendors and print houses rather than PDF

    why? because xerox haven't got their act together and produced a PDF compliant printer controller

    (and if you've used the PS ones.. they're like someone's uni project gone HORRIBLY wrong.. awful interface)

    *shrug*

    dms0

    --
    You should feel guilty if your just watching - ATR
  71. From Paper To PDF? by ronabramson · · Score: 1

    The Adobe literature makes it sound as if you need to license a $6,000 package with per-copy charges in order to do this. However, I've found (to my surprise), that the ordinary $400 package allows you to do quite a bit of the "paperless office" thing. You can "import" from paper to PDF -- in fact it integrates with the scanner driver. Moreover, the package has an incredibly accurate OCR capability built in. We have scanned (but not OCR'ed) quite a few documents with this package, and put them on an intranet. I haven't (yet) hit any particular limits. In addition, there is a complete 500 page spec. available with the PDF file layout, if you feel like writing your own stuff. I found a Perl module on CPAN that could extract some simple information from a PDF, and in conjunction with that, put together a search engine that allows you to index and search on "keywords" etc. manually coded into the file. It works for me. Writing a full text search program (for OCR'd or "distilled" documents) shouldn't be too hard. As soon as I find some code to deal with the internal text compression, I may do it.
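
    To illustrate why a full text search over extracted text "shouldn't be too hard", here is a minimal inverted-index sketch in Python. It assumes the text has already been pulled out of each PDF by some other tool (that extraction step is the hard part and is not shown); the document names and contents are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical extracted text, keyed by file name.
docs = {
    "router.pdf": "Configuring the router serial port",
    "printer.pdf": "Printer setup: parallel port and PostScript options",
}
index = build_index(docs)
print(sorted(search(index, "serial port")))
```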

  72. If you have a Mac by MrMac · · Score: 1

    If you have a Mac you can use PrintToPDF... it acts like a print driver and if you can print it... you can turn it into a PDF. $20 shareware. And it does images too.

    --
    *** I Know Everything, But Can't Remember It All At Once ***
  73. AutoTrace by Scott+Johnston · · Score: 1

    http://homepages.go.com/~martweb/AutoTrace.htm

  74. Re:why bother with PDF? by Munky_v2 · · Score: 1

    Thank you for volunteering.


    Munky_v2
    "Warning: You are logged into reality as root..."

    --
    Jay
  75. Re:Save money on OCR by sacrificing quality by cryosis · · Score: 1

    Each mistake must be corrected by a human being. But humans are expensive.

    Have you tried Dans' Discount Human Labor? I've had pretty good luck with them so far. 10 'laborers' ( I hate calling them slaves. It's so degrading.) for $5 a month.

  76. PDF and Index Server? by haus · · Score: 1

    I have gotten myself into the unfortunate position of attempting to make Microsoft Index Server create a usable index of a pool of data, which is made up primarily of .pdf documents. And to the best of my knowledge Index Server does not support indexing of .pdf's. So it appears that I will need to convert these files to something more usable [preferably HTML]. Any suggestions?

    all persons, living and dead, are purely coincidental. - Kurt Vonnegut

    1. Re:PDF and Index Server? by zootie · · Score: 1

      Check your Acrobat Exchange CD. There is a MS Index Server filter to add PDF support to IIS.

  77. Hardware by #include · · Score: 1

    There's a new scanner out from Hewlett-Packard called the DigiScanner 9100C. Not only will it scan documents to either PDF or TIFF, it'll email 'em too. And, no, I don't work for HP.

    --

    A genius writes code an idiot can understand, while an idiot writes code the compiler can't understand.
  78. I Just Answered My Own Question... by istartedi · · Score: 1

    ...at this URL: http://www.adobe.com/support/downloads/5efe.htm

    This is a plugin for the acrobat reader. They say it's for visually impaired people--you have to convert the PDF to some kind of ascii text so that text-to-speech programs will work. Hmmm... with Magellan's converter selling for $200, you have to wonder why Adobe buried this under "access for the disabled". Anyhow, this could be the solution that I've been looking for. Watch this space; if I'm in a good mood and we don't lose power tonight, I may post a review.


    The regular .sig season will resume in the fall. Here are some re-runs:
    --
    For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
  79. Oh by Dungeon+Dweller · · Score: 1

    See Topic

    --
    Eh...
  80. Well by Dungeon+Dweller · · Score: 1

    I joined when it was first started, most of the people in it are friends of mine. Most of the founders have stayed around, which makes the cap age about 23 at the moment, we have spoken of changing the name/forming a different group, but could never agree on a name/guidelines.

    --
    Eh...
  81. Re:tiff2ps by hidden · · Score: 1

    I fail to see how this would create a searchable document... please elaborate

  82. Re:why bother with PDF? by buckrogers · · Score: 1

    I am in the process of converting some pdf's into html and no, PDF's are much bigger than the same amount of html and .jpg files.

    I can fit an entire magazine with pictures on a floppy disk, but the same PDF file is 5 times that size.

    Plus, once you get the content onto a web site you can do searches across a wide variety of sources, not just a single pdf file or a single CDROM.

    --
    -- Never make a general statement.
  83. Re:why bother with PDF? by bellings · · Score: 1

    PDF may include internal compression, so they can be smaller than the equivalent HTML or plain text.

    PDF also may include the document's fonts, along with any images, along with (optionally) the location of each and every letter on the page, down to a few hundredths of an inch.

    So, a PDF may be surprisingly small or surprisingly large, depending entirely on the program that generates the document. A well crafted program should be able to take HTML plus images and produce PDFs of roughly equivalent filesize (actually, probably smaller). Just printing from your browser, through the default printer driver, to Adobe Acrobat is probably going to result in a huge file.
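
    As a rough illustration of the compression point: PDF streams can be stored with the Flate filter, which is the same deflate algorithm exposed by Python's `zlib` module. The sketch below compresses a deliberately repetitive piece of markup and compares sizes. The sample data is made up, and real documents will compress far less dramatically than this.

```python
import zlib


def flate_ratio(data):
    """Compressed size divided by original size, at maximum compression."""
    return len(zlib.compress(data, 9)) / len(data)


# Deliberately repetitive sample markup; real text is much less regular.
html = b"<p>Scanned manuals take up shelf space.</p>\n" * 200

print(f"original: {len(html)} bytes")
print(f"deflated: {len(zlib.compress(html, 9))} bytes")
print(f"ratio:    {flate_ratio(html):.3f}")
```

    Embedded fonts and uncompressed or high-resolution images push the size the other way, which is why the generating program matters so much.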

    --
    Slashdot is jumping the shark. I'm just driving the boat.
  84. Re:why bother with PDF? by bellings · · Score: 1

    Unfortunately, HTML completely ignores printed output. If you're at all interested in paper, then PDF is the way to go.

    PDF might also have some value if you want to preserve the formatting of an existing paper document.

    I agree, though, that there are many superior formats for reading documents online. Arguably, HTML might even be one.

    --
    Slashdot is jumping the shark. I'm just driving the boat.
  85. Pagis Pro automatically OCR and indexes by brian_cragun · · Score: 1
    Pagis Pro (was Xerox now ScanSoft) has a nice feature that lets you automatically OCR and index scanned documents.

    I thought I was going to have to scan, OCR, then check each document. But this feature makes it virtually painless. It uses TextBridge to scan automatically as the index is built. It lets you include any text documents on your disk in the index as well.

    There is no way to see what the actual results of the OCR step are, but so far it has found every document I wanted to find, even those with names and other words not in the dictionary. Obviously, there must be some errors in the OCR, but I haven't run into them, yet.

    The search capabilities are very nice. Boolean logic, including nearness and proximity of words, along with a confidence factor.
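
    For readers unfamiliar with proximity searching, here is a tiny Python sketch of the general idea: two words match only if they occur within a given number of word positions of each other. This is not how Pagis Pro works internally (that is not documented here); the function name and sample text are hypothetical.

```python
def near(text, first, second, max_distance):
    """True if first and second occur within max_distance words of each other."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


sample = "The scanner feeds pages to the OCR engine."
print(near(sample, "scanner", "OCR", 5))
print(near(sample, "scanner", "OCR", 3))
```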

    It has been very effective for me.

  86. 80 CD/R's an hour by pjbrewer · · Score: 1
    I saw a really nice machine in Sham Shui Po's Golden Centre:
    • High full-tower case
    • 1 CD Reader
    • 10 CD Writers
    • Buttons marked "power" and "go"
    • No need for a keyboard or monitor
    • Claimed software will automatically recognize all cd formats -- but it is probably just bitwise copy

    Does this count as fair use?
  87. The solution (not) by avandesande · · Score: 1

    It seems as though OCR of any type would be inappropriate - you need something that goes straight to eps or postscript. There was some vectorizing software available a few years ago. Does anyone know what the state of this type of software is?

    --
    love is just extroverted narcissism
  88. Re:Adobe Acrobat 4.0 by homer_ca · · Score: 1

    I know Acrobat 4 has a capture command. It scans it and stores it as a fucking bitmap. You gotta pay up the butt for OCR.

  89. Re:Adobe Acrobat 4.0 by homer_ca · · Score: 1

    Acrobat 4.0 doesn't do OCR. It scans the page and stores it as a bitmap.

    If you can still find it, the best software for this would be Acrobat 3.0 which was available with Capture 1.0. Back then the Capture plugin was unlimited use. None of this crap about buying a license for every 1000 pages or spending $7000 for unlimited pages. The data bank at my old office used this setup. It was a shocker when we saw the pricing for Acrobat 4.0 and Capture 3.0, but fortunately we had no need to upgrade.

  90. Re:Adobe Acrobat 4.0 by homer_ca · · Score: 1

    OK my bad. I checked again, and Acrobat 4.0 does have a "lite" version of capture which is missing some features for running batch jobs. There's no page limit that I could find, but without support for sheetfeeders, it's unlikely you'd have the patience to scan even 1000 pages one at a time.

    Acrobat 3.0 still came with the full Capture plugin with sheetfeeder support, for the equivalent functionality in 4.0 Adobe jacked up the price to $7000.

  91. Effective Solution by aaronhaley · · Score: 1

    I've done that before and although not the most beautiful method it does make it cross platform. Just embed the PDF into an HTML doc and then place keywords into the HTML doc and then set a search engine on top of it. It works. -j

    --
    --And sektor spoke and said unto the people. Hey, buttwipe hand me the cheezeos.
    1. Re:Effective Solution by mtphoto · · Score: 2

      An interesting thing to look into is a research project called TOM at Carnegie Mellon University. Its goal is to convert all sorts of file formats from one to the other. I can't check it out to give more information because my firewall at work doesn't allow unusual ports (it's served on 8001).

  92. Re:Adobe Acrobat 4.0 by thornist-on-dialup · · Score: 1
    Or, for the microsoft crowd, if you use Aieeee! (IE, that is) you can save a webpage (Just one) as a .mht file, which I assume is some sort of zip or cab that contains all the files and is readable by IE5.

    Actually I think that is mhtml there - a standard no less; MIME HTML. It's a mime multipart format, where each part has a content id (cid) and the href's point to the parts via the cid: URI scheme.
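
    The structure described above can be sketched with Python's standard `email` package: a multipart/related container holding an HTML part plus an image part, where the HTML refers to the image through a `cid:` URL that matches the image's Content-ID header. The Content-ID value and the placeholder image bytes are invented for illustration, and a file saved by IE may carry additional headers not shown here.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(html_body, image_bytes, cid):
    """Bundle an HTML page and one image into a multipart/related message."""
    archive = MIMEMultipart("related")
    archive.attach(MIMEText(html_body, "html"))
    # The subtype is given explicitly because the bytes are only a placeholder.
    image = MIMEImage(image_bytes, _subtype="png")
    image.add_header("Content-ID", f"<{cid}>")
    archive.attach(image)
    return archive


# Hypothetical content id; the HTML points at it with the cid: scheme.
html = '<html><body><img src="cid:logo@example.invalid"></body></html>'
fake_png = b"\x89PNG\r\n\x1a\n"  # PNG signature only, not a real image
message = build_mhtml(html, fake_png, "logo@example.invalid")
print(message.get_content_type())
```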

    Sean

  93. OCR actually works? by logistix · · Score: 1

    Has anyone had success with OCR working well? It always seems to work in a half-assed way for me. It always seems to cause more problems than it solves. You really need to PDF the documents from an electronic form to get decent results. People still think they'll be able to scan in a spreadsheet and use OCR to create a useable Excel file(formulas and all :) ).

    Of course if you have thousands of sheets of paper, there's not really any choice. A happy middle-ground is to scan in the images and manually add old-school keywords to the documents.

    --
    - My password is slashdot
  94. Re:The age old question by Mr.+Barky · · Score: 1

    Any decent OCR will already have run the results through a spelling checker (and hopefully flag the ones that aren't in the dictionary). Of course, you'd then have to go and fix errors you care about, which takes quite a lot of time.

  95. Shameless plug by Mr.+Barky · · Score: 1

    If you have a lot of paper to convert, check out

    www.actionpoint.com

    Our product InputAccel is geared towards exactly this kind of problem. It costs $$$, but it's worth it if you have lots of paper to convert.
  96. Re:embed TIFF images in the PDF by drinkypoo · · Score: 1

    However, the idea here is to have them available for text searching.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  97. Re:Adobe Acrobat 4.0 by drinkypoo · · Score: 1

    Or, for the microsoft crowd, if you use Aieeee! (IE, that is) you can save a webpage (Just one) as a .mht file, which I assume is some sort of zip or cab that contains all the files and is readable by IE5.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  98. Re:couldn't you just ... by robbkidd · · Score: 1
    Although I have no idea how evolved ocr on linux is

    Therein lies the catch.

  99. yes you can make PDF files for FREE by rjzak · · Score: 1

    i dont know what OS you're using, but if you're a mac user like me there are 2 chooser extensions, one called print2pdf, that will generate a pdf file from any app. as far as i know print2pdf is freeware or shareware. you can probably find it on a carracho or hotline server.

    --
    Professional Genius
  100. Also TXT 2 PDF by worldom · · Score: 1

    Not only PS to PDF - this seems a little bit complicated. There is also the chance to convert TXT-Files to PDF. There is a utility, which I found at download.de (TXT2PDF). -wc

  101. Re:Effective Solution - TXT2PDF by worldom · · Score: 1

    check out txt2pdf at download.de. - wc

  102. What Capture Does Well by child_of_mercy · · Score: 1
    a) CAPTURE manages Duplexing so that double sided docs all end up with the pages in the right order YES other software does this as well but its a good start

    b) CAPTURE REVIEWER highlights all the words whose recognition or spelling it is unsure about. you can go through and approve or change them depending on what it has done. Words left as suspect will be placed as images into the PDF. while REVIEWER isn't perfect i've tried demos of every other OCR package out there and they aren't in the same ballpark

    c) because the OCR'd Reviewed output is PDF it will keep all the messy bits of image tied up with the words in the right place. Thats why large scale doc scanning needs a PDF output.

    for our business where we scan lots of docs on important letterhead for further distribution its still the only way to go

    we have licences for Capture 1.0 which doesn't charge per page, having said that we find that the saved staff time using the superior algorithms in Capture 3 make it worth the 5c per page.

    so thats why Capture is an option.

    not a good one but until linux gets a good OCR that outputs to ps/pdf with a credible review stage its something we are stuck with.

    --
    'There is a Light that never goes out.'
  103. Why OCR to PDF? by EricEldred · · Score: 1

    I'm not sure anybody answered the original question. Instead, we got plenty of other ideas that pose other problems.

    Apparently there is no good Free Software for OCR. Scanning with SANE to pbm files and then OCR with socr actually works. Since the project is Open Source software we ought to improve it. Then the output can go to HTML or text or PS or PDF or whatever--it can be part of a pipeline.

    But free as in free beer software to OCR does work fairly well--there is a package called gocr that will scan to RTF files, on Windows 95-NT. Then you can print to Postscript and then to PDF.

    A low cost and pretty effective package that works on Windows and Macs is TextBridge 98. Adobe actually uses TextBridge software as part of Capture. TB98 can be configured to scan directly to PDF files, with OCR and graphics.

    It depends on what you want to accomplish, though, and this is where the multitude of responses becomes confusing. Using TB98 to PDF you can't correct misrecognized words. Instead, if the program can't recognize the word it places a tilde in it. When the user selects the word, an image of the word and context is displayed.

    That style of archiving would seem quite appropriate for something like a legal archive. But in most cases going to HTML is simpler.

    Note that (Xerox) ScanSoft's TextBridge Pro 98 is no longer the latest version of TextBridge--the latest version doesn't seem to scan to PDF, and its user interface seems to me to be more awkward than 98, so I actually bought the new version but after using it a while went back to 98. Since 98 is not the latest, you can find it very cheap on the remainder or used market. It's quite good at OCR, but more importantly it is very fast to make corrections without having to look at the paper proof.

    Since Xerox sold off ScanSoft and ScanSoft bought up OmniPage, which had earlier bought Recognita, we can expect within about two years to have one replacement. Don't hold your breath--I don't believe they think there is a market for Linux desktops, and any improvements are not likely to be revolutionary. OCR software works pretty good in my estimate--much better than typing--you make mistakes doing either, but correcting them is easier after OCR.

    In the end, we need more help with socr and related projects. OCR is really an interesting problem and programming it can be fun. Who wants to help--either by joining the programming project, or by donating resources to it?

  104. Re:OCR can retain formatting by Alien54 · · Score: 1

    I forgot: Omnipage is available for the Mac as well, so there may be a back door for running this on a *nix box.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  105. software bundled with scanners by camadas · · Score: 1

    I've seen at least 2 HP scanners that came with a PDF driver, just print as you would do to a normal printer. And it wasn't an option, it was a default feature.

  106. Well, if you can wait a while... by gughunter · · Score: 1

    ...maybe in a couple years the W3C's Scalable Vector Graphics proposal will bear fruit.

    http://www.w3.org/TR/SVG/

  107. I'm sure I'm missing something here, but by GrayMouser_the_MCSE · · Score: 1

    Can't you just scan the pages in and use some OCR software to convert to plain text, then use your method of outputting to postscript and go from there?

    --
    Of course I use Microsoft. Setting up a stable unix network is no challenge ;p
  108. I see what I missed by GrayMouser_the_MCSE · · Score: 1

    Ahh.. but the formatting would get screwed up. I see the dilemma now. Either lose the formatting or spend much time trying to recreate it.

    --
    Of course I use Microsoft. Setting up a stable unix network is no challenge ;p
    1. Re:I see what I missed by SEWilco · · Score: 2
      Whether you lose the formatting or not will depend upon the OCR software. The OCR software is looking at the scanned image and can be aware of where on the page it is looking, then use that to create a page which looks similar (with whatever formatting commands the OCR program uses...).

      The original article didn't mention which nice public OCR programs he found, so we don't know the capabilities of what he already found.

      What he needs is an OCR program which can separate text from images and format the text and images in a similar way on a PS or PDF page. At that point PS or PDF to text programs can be used for indexing.

  109. PDF Capture by #define · · Score: 1

    There's relatively inexpensive package called PDF Capture by Doculux. It works, but the interface is a little crummy. (Windoze only, though)

    I'd love to hear from people who have successfully used Vividata's software for Linux though, as all servers at my employer are running Linux.

  110. PDF Printer by MrEnigma · · Score: 1

    When you install Acrobat 4.0 it adds a "fake" printer...you can take and print any document to it, and it converts it to pdf...

    --
    GeekWares - Buy and Download Today!
  111. Simple using Distiller by nileshch · · Score: 1
    I do converting to pdf in a simple way...
    1. Setup a default Postscript printer
    2. Print anything (an image or a document) from a Windows application to a PRN file.
    3. Save it to a folder marked as a "Watched Folder" in Adobe Distiller 3.02(Distiller I find is available freely from Adobe. I downloaded the same directly from Adobe.com)
    4. Distiller directly detects the PRN file and automatically converts it into a pdf file.


    This works fine for 'PDF'ing anything I want.
  112. Do you really want to do it? by botsie · · Score: 1
    We had a similar situation for one of our clients. Before undertaking the effort we insisted that they evaluate which documents needed to be digitised. They finally came to the conclusion that close to 70% of their paperwork didn't really need to be digitised. The remaining 30% was small enough that it was cheaper to do it manually.

    Of course, this is India where everything is cheaper manually! :-)

    The point is: Make sure you're solving the correct problem

    Bots

    --
    "Rowe's Rule: The odds are five to six that the light at the end of the tunnel is the headlight of an oncoming train."
  113. where can one find ps2pdf ? by ugandy · · Score: 1

    thanks

    1. Re:where can one find ps2pdf ? by FPhlyer · · Score: 2

      If you have Ghostscript installed on your computer, you probably already have this. Most, if not all, Linux distributions (okay, maybe not the "micro-distributions") install it by default so that PostScript files can be filtered to your printer port for output. Try typing "ps2pdf" at the command line and see if you get anything. Also, you can try www.ps2pdf.com, an online engine that lets you upload the ps file and then download the pdf file. Ghostscript is also available for Windows, where you will have to search the installed subdirectories to find the "ps2pdf.bat" batch file that does the same thing.
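
      For batch jobs, the same ps2pdf call can be wrapped in a small script. This is only a sketch: it assumes ps2pdf is on your PATH, and the function names (ps2pdf_command, convert_all) are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def ps2pdf_command(ps_path):
    """Return the argv list that converts one PostScript file to PDF."""
    ps_path = Path(ps_path)
    return ["ps2pdf", str(ps_path), str(ps_path.with_suffix(".pdf"))]


def convert_all(directory):
    """Run ps2pdf on every .ps file in `directory`; return the PDFs written."""
    if shutil.which("ps2pdf") is None:
        raise RuntimeError("ps2pdf not found; is Ghostscript installed?")
    written = []
    for ps_path in sorted(Path(directory).glob("*.ps")):
        cmd = ps2pdf_command(ps_path)
        subprocess.run(cmd, check=True)  # stop on the first failed conversion
        written.append(cmd[-1])
    return written
```

      Calling convert_all(".") would convert every .ps file in the current directory, leaving each PDF next to its source.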

      --
      Brought to you by Frobozz Magic Penguin Fodder.
  114. Another way (maybe) by b0bby · · Score: 1

    Depending on what you're trying to do with the PDFs you might look into PaperMaster (Windoze only, sorry). It's a document manager; you scan in your paper docs to "file cabinets" & can run a batch OCR to enable you to search all files for text. I've found it works pretty well. Seems like the images are just stored as TIFFs. If you just were looking to scan a book & be able to find key words I think it'd do the job; for my limited uses I've been very happy with it. Of course, if you want to distribute PDFs then this will be of no help at all...

  115. Re:The Holy Grail by Griff1973 · · Score: 1

    Wiznet (http://www.wiznet.com) technology is unique in the marketplace. Through the combination of innovative software and proprietary source code, they accelerate the extraction and conversion of raw data into searchable information that can be offered to end-users via the Internet. They also automatically link key elements of unstructured data (such as part numbers) with dynamic information in related databases. By combining electronic data mining and warehousing with self-organizing data indexing, this provides an intuitive natural language search capability unconstrained by traditional keyword indexing. They handle paper catalogs that are scanned in, as well as any electronic format such as pdf or jpeg. This market space is wide open because current players have not been able to provide catalog content that meets the needs of business buyers and suppliers. WIZnet's ability to provide buyers with unabridged catalog content and to allow suppliers to differentiate their products satisfies a critical business need. However, WIZNET deals with large third parties, such as b2b net market makers (e.g. VerticalNet) on a volume (thousands) catalog basis, as opposed to individual buyer/suppliers.

  116. Re:embed TIFF images in the PDF by karlheg · · Score: 1

    DEC SRC has a Virtual Paper web site. They've got a thing called "Lectern" that takes a scan and turns it into a document consisting of a series of compressed bitmaps, and, if your arch has the supported OCR library available, the text annotated with bounding box information so that a search can highlight the right spot on the bitmap. The source is available as a Modula-3 program. Lectern documents are HUGE. If a similar thing could be done with PDF, and the document files were much smaller (better compression), it would be a good thing to have. I also recall seeing a (binary only) scheme system that was up for free download that had OCR stuff built into it. It was kind of like a `festival' for OCR, I guess.

  117. Re:Easy! by Kahrul · · Score: 1

    Yeah, easy but illegal. There is no "Adobe Special Edition" except on the high seas of piracy. Hence the poor packaging standards.

  118. Why Capture is useful by feenberg · · Score: 1

    The advantage Capture has in creating PDFs from OCR is that when Capture is not sure of an interpretation it just puts a little picture of the word into the PDF file, instead of taking a wild guess. The resulting file is searchable using ordinary PDF tools, on any words that were successfully OCRed, but of course the little pictures aren't indexed. We have found that Capture is a bit too optimistic and thinks it has the words down when it doesn't, and this limits our use of Capture, but in principle using a combination of OCR and pictures is the way to go.
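
    The "text where sure, picture where not" idea can be sketched in a few lines. This is not how Capture is implemented (I don't know its internals); it assumes a hypothetical OCR engine that returns each word with a confidence between 0 and 1 plus an id for the cropped word image.

```python
def compose_page(words, threshold=0.9):
    """Split OCR results into page layout items and searchable text.

    `words` is a list of (text, confidence, image_id) tuples from a
    hypothetical OCR engine. Words at or above `threshold` are kept as
    text; the rest are replaced by their cropped image and not indexed.
    """
    layout = []
    searchable = []
    for text, confidence, image_id in words:
        if confidence >= threshold:
            layout.append(("text", text))
            searchable.append(text)
        else:
            layout.append(("image", image_id))
    return layout, searchable
```

    The threshold is the knob that matters: set it too low and you get the "too optimistic" behaviour described above, too high and most of the page ends up as unsearchable pictures.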

  119. SCANNED IMAGES by jaiway · · Score: 1

    Q: Can you do the same with a scanned image?

    RE: Sure. Save it as an .eps file once it is scanned; then it is the same as .ps only with page separators (plus a few other differences that ACROBAT will not care about). I use single image .eps files and DISTILL them to PDF files and it works fine.

    RE2: You can also use any "page layout" type application to bind the photo into its bounding box and make a .ps file to ps2pdf.

  120. OCR, a Primer by yargo · · Score: 1
    I've worked on OCR software for a number of companies. From a Unix based desktop OCR application at Vividata to a high end form processing system at Oyster Software. Doing what you want to do is far from an uncommon wish. Doing what you want in an easy, systematic, scalable and open source way is just not a reality at this point.

    To start with, you need a good OCR engine. There are several out there that I've used that are very good (from Caere, Nestor, Mitek or CGK). These companies all offer libraries for putting together your own document processing engine. They return the text, often return font/pointsize information and even let you know the confidence of the return value. You could use a fullblown app and try and wrap it, but OCR Shop from Vividata is the only app with a command line interface, which you'd need to handle any reasonable volume.

    From this, you can generate all sorts of output with the correct formatting. OCR Shop, which I worked on for Vividata, allowed output to many different formats, including HTML, Word and Framemaker. Depending on the complexity of your document, you can do a fairly good job of outputting what you see on the page. Outputting to PDF wouldn't be all that hard. We set that up as an output format for our scanning software at Vividata. Granted, it was a CCITT G4 bitmap wrapped in a PDF shell, but Acrobat is as close as you can find to a cross platform image viewer that most people will have installed.

    So how to make it searchable. You can go the route of saving the bitmap image and do searching on the accompanying OCR text output. This way, you get the formatting of the document right, but you end up using up a lot more space. Or you can try and do the formatting correctly on the text document, fix up any typos (or not) and use that. Both have advantages.
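
    The first route (keep the bitmap, search the accompanying OCR text) boils down to an index from words to page images. A minimal sketch, assuming you already have the OCR text for each page in a dict keyed by image filename; the names build_index and search are just for illustration:

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page images whose OCR text contains it."""
    index = defaultdict(set)
    for image_name, ocr_text in pages.items():
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            index[word].add(image_name)
    return index


def search(index, query):
    """Return the page images that contain every word of the query, sorted."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

    The user searches the text but is shown the original page image, so OCR typos only cost you a missed hit, not a mangled document.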

    I've gotten the itch on several occasions to put together an open source OCR program, with both command line and GUI interfaces. A lot of the pieces are already there. The best free OCR engine I know of is the NIST OCR Engine. It's a bit old, the code needs some polishing and one would need to train some memories for standard fonts, but it would make for a pretty nice little app. Then it's just a matter of creating some internal representation of the formatting and writing some output functions for the different types of output (HTML, PDF, RTF...). But my copious free time has not yet given me that opportunity.

  121. Fine OCR - FineReader by Abbyy by MATPOC · · Score: 1
    Good news: I've used FineReader OCR by Abbyy and was impressed by its recognition accuracy. It knows 53(!) languages (English too :)

    Bad news: it costs money (though you can use the 30-day trial version) and it is made for Windows.

  122. Maybe Vividata by anewsome · · Score: 2
    I considered doing the same thing years ago with scanned images. I scan hundreds of images per month and I thought the free form text search of the scanned images was in order.


    At the time the only OCR software that I could find on Linux was from a company called Vividata. At that time they were just adding Linux support and it didn't seem to work for shit, but the support was pretty new.


    I use shell scripts to drive SANE programs to do the scanning and conversion to PDF using convert (Image Magick) and then ps2pdf (ghostscript). If the Vividata product actually works now, it might be nice to scan, then OCR, then convert to PDF. A quick index by ht://Dig will then make a nice searchable archive of scanned docs.
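
    For anyone who wants the gist of that pipeline, here is a rough equivalent in Python rather than shell. It is a sketch, not my actual scripts: it assumes scanimage, convert and ps2pdf are all on the PATH and that the default SANE device is the scanner you want.

```python
import subprocess


def pipeline_commands(basename):
    """Return the three steps for one page as (argv, stdout_file) pairs."""
    tiff, ps, pdf = (f"{basename}.{ext}" for ext in ("tiff", "ps", "pdf"))
    return [
        (["scanimage", "--format=tiff"], tiff),  # SANE writes the scan to stdout
        (["convert", tiff, ps], None),           # ImageMagick: TIFF -> PostScript
        (["ps2pdf", ps, pdf], None),             # Ghostscript: PostScript -> PDF
    ]


def scan_to_pdf(basename):
    """Scan one page and convert it to PDF; return the PDF filename."""
    for argv, stdout_file in pipeline_commands(basename):
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_file, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
    return f"{basename}.pdf"
```

    Note that the PDF this produces is only a wrapped image. Making it searchable still needs an OCR step in the middle, which is exactly the missing piece.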


    The Vividata products however are not free, if this is a consideration.


    --Aaron Newsome

  123. Bad Link by Gleef · · Score: 2

    Neither http://www.linuxdoc.org/docs/OCR/OCR-HOWTO-0.1 (what you wrote) nor http://www.microsoft.com (what your link pointed to) gives OCR information. There is a little info in the Access-HOWTO, and a little in the unofficial AI/Alife mini-HOWTO. I couldn't find any OCR-HOWTO, and would love a real link to it if you have one.

    ----

    --

    ----
    Open mind, insert foot.
    1. Re:Bad Link by Spoing · · Score: 2

      I'd tell you I was sorry for the mistake...but I checked them before I submitted the Ask /. a few weeks ago. Back then, they were valid and worked for me!

      --
      A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
  124. Easy! by nstrug · · Score: 2
    Next time you're in Hong Kong buy 'Adobe Special Edition' for about $10. Every Adobe application and plug-in there is including Capture!

    For some reason it comes on a CD-R with a xeroxed insert. I can't imagine why Adobe would let their packaging standards slip so badly...

    Nick

    --
    -- "It's a sad day for American capitalism when a man can't fly a midget on a kite over Central Park" - Jim Moran
  125. Re:PDF, Ugh. by YogSothoth · · Score: 2

    That's funny, I generated some pristine pdf documents using php *this* *week* and the pdf library used by php is right here and works wonderfully and comes with source. Things have apparently improved since last you looked, the relevant php documentation is here

    --
    there are two kinds of people in this world - those who divide people into two groups and those who don't
  126. Re:Interns by pen · · Score: 2
    In all honesty, you're not going to get away going from dead tree to digital paper without proofreading at least once. There is no OCR package that perfect. Same goes for your data entry folks.

    I've been thinking about this for a while... can't you just scan and OCR it once, nudge the paper on the scanner, scan and OCR it again, and then use a script to compare the two files? You may use more than two scannings if accuracy is that important.
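
    A rough sketch of what that comparison script could look like, using Python's difflib to line up the words from two OCR passes and report only the spots where they disagree. The function name is made up, and it assumes both passes are plain text:

```python
import difflib


def disagreements(first_pass, second_pass):
    """Return (words_from_first, words_from_second) pairs where the passes differ."""
    a, b = first_pass.split(), second_pass.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

    Anything it returns goes to a human; anything both passes agree on is probably right, though an error the OCR makes consistently will slip through both times.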

    Something that's been common in the "warez" ebook scene is that people will often correct mistakes in the book as they're reading it, and then spread the corrected version. After a period of time, the book becomes more and more solid.

    --

  127. Re:Adobe Acrobat 4.0 by Soong · · Score: 2

    Correct. I've used this on my Mac and it works pretty well. The OCR probably misses 10-20 words per page, but is quite good about flagging them as unsure. It has a good interface for going back to do touch-ups on those. It also has a fair interface for running a scanner, getting the data directly into itself, and doing this for successive pages. If you need this, go spend the $250 and support your non-free developers out there in the world.

    --
    Start Running Better Polls
  128. That's why you need the verify stage by A+nonymous+Coward · · Score: 2

    Old keypunch standard practice was to keypunch the holes in the cards, then someone else repunched in verify mode -- it compared and notched the card if it didn't match. For some reason, that practice seems to have disappeared. Do data entry shops still verify the entered data?

    So hire two sets of interns or high school kids. Compare the two. Pretty easy. Twice as expensive to get the data in, but it would be more accurate.

    Doesn't solve the problem of unreadable original documents which are misread both times, but that's a different story.

    --

    1. Re:That's why you need the verify stage by georgeha · · Score: 2

      So hire two sets of interns or high school kids. Compare the two. Pretty easy. Twice as expensive to get the data in, but it would be more accurate.

      If you had the money, you could hire enough sets of high school kids to get a high-school-kid-RAID going; that way, you could hot swap the sick ones out and not lose any productivity.

      George

  129. Re:PDF, Ugh. by Juggle · · Score: 2

    That library is great but if you read the license agreement it is not free (Beer) for commercial use. And since we were being paid to develop what is most definitely a commercial site, unless we got the client to cough up the cost of the lib it wasn't going to be an option.

    Not to mention I tend to prefer free (Beer,speech) software for anything I do and anything I pass along to clients.

    Luckily, with a bit of work with google I found some guy in England who had written his own PDF libraries (not nearly as nice as PDFlib linked above) which were GPL'd and had enough functionality to do what I needed.

    --
    --- Juggle juggle@hitesman.com
  130. Adobe Acrobat "Paper Capture" can do this by specht · · Score: 2

    If you don't have to process huge amounts of pages then Adobe Acrobat can do what you want: It's basically a cheap version of Adobe Capture that is probably not as fast and not as easy to use. The "Paper Capture" option is located under the "Tools" menu. I don't think that Adobe will bring out a Unix version of Acrobat 4.0, therefore this is a MS/Mac only solution. But it's more cost effective than Capture.

  131. this must be *UNIX problem I guess. by josepha48 · · Score: 2
    I am not sure but I think that this may be just a UNIX problem. I bought a scanner a while back and it came with windows software to do the conversion for me. I have not tried mixed images and text yet, as I have not had a need. TextBridge is the name of the software. I found some info about it here http://www.digitalriver.com/dr/v2/ec_MAIN.Entry10? SP=10023&PN=1&V1=160950&xid=19198 It is not open source and it is fairly inexpensive IMHO. If you buy a scanner I think that they come with this software. It says it can retain color and images. Maybe this and wine? OR maybe enough people will ask them to port to Linux. I think that right now it outputs to word and wordperfect.

    Does this help??

    send flames > /dev/null

    --

    Only 'flamers' flame!

  132. Re:Violating copyright by Sloppy · · Score: 2

    No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means - graphics, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without the written permission of the publisher.

    Yes, but their statement about you not being able to do that, is just plain wrong. Just because they say you can't, that doesn't mean you really can't. You didn't actually put your own signature under those words, did you?

    If you didn't sign that page of the book, and you didn't get the book directly from the publisher under the terms of some weirdo contract (as opposed to buying it from a bookstore), then the only real restrictions are the ones stated under copyright law. Moving the book into a computer sounds pretty Fair Use -ish to me. Just don't violate the copyright.


    ---
    --
    As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
  133. Mod this up! by FascDot+Killed+My+Pr · · Score: 2

    He's a troll, but he's funny and subtle. "hot breakfast foods", indeed!
    --
    Compaq dropping MAILWorks?

    --
    Linux MAPI Server!
    http://www.openone.com/software/MailOne/
    (Exchange Migration HOWTO coming soon)
  134. Re:Missing a step? by GregWebb · · Score: 2

    That's a pity. I used a Mac version 4-5 years ago and it was fantastic. Zero intervention produced _very_ accurate text. give it the extra few minutes and it was superb. Sorry to hear it's gone downhill. Wonder why?

    --

    Greg

    (Inside a nuclear plant)
    Aaaarrrggh! Run! The canary has mutated!

  135. OT: Opensource OCR by LetterRip · · Score: 2

    What opensource OCR have you found? And how "intelligent" is it?

    What I'd like to do is enhance the intelligence of OCR, for things like forms. The three things that would be useful is thus...

    The ability to define rectangles and lines before OCR happens, so that it will interpret them as graphics as opposed to part of the text.

    The ability to define columns and groups better, and what type of information the column has. For instance phone numbers, addresses, etc. (and thus quit translating 6 to b ...).

    A list of frequent mistranslation pairs - OCR tends to make consistent mistakes - if the spell checker were to substitute the alternative character pair for the mistranslation, I would receive a lot fewer misspellings.

    I figure that those three options would increase the accuracy of the OCR software that I've been using by 95% easily. (The other five percent is from "Fax noise", photocopy fade, and handwritten notes...)
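
    To make the third idea concrete, here is a rough sketch of a confusion-pair pass. It assumes you have a word list to check against and a hand-built list of (misread, intended) pairs; both are stand-ins here, not taken from any real OCR package.

```python
def correct_word(word, dictionary, confusions):
    """Try one substitution from `confusions` at a time until a dictionary word appears."""
    if word.lower() in dictionary:
        return word
    for wrong, right in confusions:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate.lower() in dictionary:
                return candidate
            start = word.find(wrong, start + 1)
    return word  # no fix found; leave it for a human proofreader
```

    With confusions like ("rn", "m") and ("6", "b"), "rnodem" comes back as "modem" and "num6er" as "number". For a column declared as phone numbers you would run the pairs the other way round.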

    LetterRip

  136. OCR system by jpowers · · Score: 2

    We're about to set one up here: Teleform takes data right from the scanner, OCRs (reads) it, passes the text and the image (tiff or pdf) to an image database (alchemy or imagexx), which has search tools and links to various webserver software. The whole thing will be stored in a DVD jukebox. It wasn't my call, but even though we have huge SPARCs and stuff at our disposal, this will all be under NT (imagexx runs either).

    Total cost: more than I'm worth.
    Value of having 8 million documents in a 2x2 cube: your guess is as good as anyone's.

    Errata:
    -Number of alternate solutions we looked at: 0.
    -Number of comparisons between this and alternate solutions I could find: 0.
    -Number of replies I got to a request for comparisons on IWETHEY: 0
    -Number of seconds my .org considered my request to look at alternate solutions: 0.
    -Rank, among the reasons I'm looking for a new job: 2, right behind "Hey let's get Citrix Metaframe so our lame-ass accounting software can track 100 PCs at your location!"

    Anyone need linux support in boston?

    -jpowers

    --

    -jpowers
  137. A possible solution? by cr0sh · · Score: 2

    Here is a possible solution (from scanned document to html pages), that could work as long as there weren't any funky symbols, etc. embedded in the text (heck, may even work with that if you are deft with a sharpie - as explained in step 1)...

    Steps for conversion:

    1. For pages with images, draw a colored border around each image on each page. Make the color something that will sharply stand out (like bright green).

    2. Tricky part - process each tiff image (in a looped script) doing the following:

    a. Scan each page to color tiff, with sequential filenames (001.tiff, 002.tiff).

    b. Using a custom written utility, build two new tiff images - a tiff of the page without the color-bordered images, and a tiff of the color-bordered image(s) on the page. Number the page images like (p001.tiff, p002.tiff), and the images for each page (p001i001.tiff, p001i002.tiff), so that it is known which images go with what page.

    c. Convert each page image to postscript, then to html (unless there is a tiff2html tool out there?) - preserve the filenames (p001.html, p002.html),
    modifying only the extension.

    d. Convert each image for each page to a (gif, jpeg, png), preserving the filenames (p001i001.png, p001i002.png), with a new extension.

    e. Add IMG tags for the images to the end (or beginning) of the html pages, for each page.

    3. After batch conversion, go back and proofread/reformat pages (to position images where they should go, etc).

    Everything to do this should exist in some form already - except for maybe step 2b - that might be a completely custom tool that needs to be written, but it shouldn't be very hard to code (loop through bytes of image, looking for the sharp color changes - kinda like edge detection code - saving/masking the areas in the outlines)...
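
    A rough sketch of the step 2b tool, to show it really is not much code. To keep it self-contained it works on a page already decoded into rows of (r, g, b) tuples (reading the tiff is left to whatever image library you have), and it is simplified to one marked image per page. The function names and the tolerance value are made up.

```python
def find_marked_box(pixels, marker=(0, 255, 0), tolerance=40):
    """Return (left, top, right, bottom) enclosing all marker-coloured pixels, or None."""
    def is_marker(rgb):
        return all(abs(c - m) <= tolerance for c, m in zip(rgb, marker))

    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, rgb in enumerate(row):
            if is_marker(rgb):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def split_page(pixels, box, blank=(255, 255, 255)):
    """Return (page_without_image, cropped_image) for one marked box."""
    left, top, right, bottom = box
    cropped = [row[left:right + 1] for row in pixels[top:bottom + 1]]
    page = [
        [blank if top <= y <= bottom and left <= x <= right else rgb
         for x, rgb in enumerate(row)]
        for y, row in enumerate(pixels)
    ]
    return page, cropped
```

    Handling several images per page would mean grouping the marker pixels into connected outlines first, which is the edge-detection-like part mentioned above.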

    --
    Reason is the Path to God - Anon
    1. Re:A possible solution? by cr0sh · · Score: 2

      Ah, heck - that's where it breaks down - the tiff to postscript utils only make a non-searchable bitmap (I read that, and still wrote my method - I must be stupid today - my bad).

      Of course, if such a program existed - tiff -> OCR'd postscript (searchable text), then my solution would work (I am not advocating the manual cutting and pasting of images - a piece of code would have to be written to do that) to convert the stuff to html.

      Of course, if one went ahead and built an OCR engine (converting tiff to PS), then they could go all the way and add the extra image stuff in and save all the steps I added...

      And here I was thinking I was being smart...

      --
      Reason is the Path to God - Anon
  138. You haven't got the right Xerox printer by georgeha · · Score: 2

    The Xerox printers I use and support, DocuSP 6180, DocuTech 65 and Sprite Network server are all PS Level 3 compliant, which means they understand PDF's also.

    George

  139. Best option: TextBridge Pro 8.0 by 1010011010 · · Score: 2

    It scans to "Image + Text" PDFs. This represents each page as an image, but includes the OCRed text for searching purposes. It's the best for legal and archival documents, because it's a true reproduction. Completely OCRed text is often inaccurate in terms of both content and presentation.

    I was going to use Acrobat Capture, until Adobe ("The Microsoft of the Graphics World") started charging a penny and a half per page. Suddenly, the job went from costing $800 (old Capture pricing) to $25000 (new capture pricing). I even called the Product Manager at Adobe for Capture and asked her why they made such a bold, stupid move. She said that Capture was now a "server product", which justified the price increase. I asked her if she expected anyone to use capture rather than the $80 Textbridge Pro which did the same thing, and she said yes. "You're on the wrong drugs," I said.

    To make TextBridge even sweeter, it turned out to be scriptable. I can hand textbridge specialized configuration files for each job. This allowed me to use Perl to automate the conversion of several tens of thousands of TIFF images into multipage, searchable PDFs. Yay, Textbridge!

    Apparently, though, Adobe had some words with Xerox (ScanSoft), because Version 9 does not include PDF support. Wankers.

    If you can find a copy of Textbridge Pro 8.0 (I think it's the "'97" release), it'll do the trick!

    --
    Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
  140. Re: Proof reading by underwhelm · · Score: 2

    I am not certain about this, but I would presume that OCR software designed to recognize form elements will retain picture elements that do not OCR to text.

    Software like Omni Form will let you designate areas on the page to ignore. This should retain picture elements and will put OCRd text in a layout that resembles the original. This, of course, most likely requires user input, at least for each different page layout.

    --

    I don't need large brains to have a good time.

  141. Re:Adobe Acrobat 4.0 by cetan · · Score: 2

    That is wrong. Adobe Acrobat 4.0 captures pages using "Capture" under the Tools menu.

    --
    In Soviet Russia...michael would be rotting in Siberia!
  142. Re:Adobe Acrobat 4.0 by cetan · · Score: 2

    That's WRONG! I create pdf's and capture them with Acrobat and they are FULLY SEARCHABLE. There is an OCR layer created in the file. It's searchable in Acrobat and completely indexable!

    --
    In Soviet Russia...michael would be rotting in Siberia!
  143. Re:Missing a step? by technos · · Score: 2

    Textbridge is, ehrm, messy. It also requires a huge amount of user intervention, and a rather large amount of training..

    --
    .sig: Now legally binding!
  144. Re:Missing a step? by technos · · Score: 2

    I've only used the past few revisions, so I can't really speak for the decline.. It will still produce accurate text with little intervention if you're feeding it plain, crisp ASCII text. Feed it a memo on letterhead with paragraphs, font changes and italics, and it prompts you continually. Not to mention it generically interprets formatting; Any one of a dozen detectable ways of formatting a paragraph (one tab, two tabs, three space indent, doublespaced, etc) are rendered only one way in the result. One tab, single spaced, no indent.

    --
    .sig: Now legally binding!
  145. Primitive searchables.. by technos · · Score: 2

    If the text formatting is primitive, and all you want is ASCII text, there are a couple OCR packages available for Linux. They are rather primitive, and at best about twice as error-prone as an entry level commercial product, but they will handle clean text very well. Graphics, snap exception formatting, etc, are not handled by any of them, but they are scriptable.

    Entry level commercial products (read: $200, Windows) will export to a .doc or similar wordprocessor file with the gross formatting intact. A few will actually 'guess' what needs to remain an image, and will include it in the finished product. They always skew the formatting some, graphics are not always detected properly, and I have yet to see one that is scriptable. They are also not free in any sense, and tie you to the Windows platform.

    OT: Kind of, but..
    Something I would like to see is a OCR search on demand application; In most document management systems you use only image files, and the information is only searchable by meta data.

    --
    .sig: Now legally binding!
  146. Re:the OCR situation is not good by passion · · Score: 2

    Textbridge (on the Mac) has a "verify" function that allows for interactivity. As it is OCR'ing, it seems to run each word through a dictionary, and if it's not found, then it asks you to verify what it should be. This process makes it only a little bit faster than raw typing.

    --
    - passion
  147. Microsoft would have it otherwise... by Greyfox · · Score: 2

    According to this Microsoft believes you can patent a file format, if not quite the .doc one. I'm gonna patent me raw ASCII...

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  148. Violating copyright by onelove · · Score: 2
    You might want to look at the front of your four foot wide stack of reference works before you even consider OCR-ing it.

    Most books have something along the following lines printed at the front:

    All rights reserved. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means - graphics, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without the written permission of the publisher.

    Oops. I hoped that didn't apply to the copyright notice I just pirated from my copy of SNMP versions 1&2, Theory and Practice ! - antoine

  149. The OCR situation is better than you think by Codex+The+Sloth · · Score: 2

    Caveat: I used to work on OCR Engines for Caere / Scansoft. The available OSS engines are what you might call 'research quality'. They have some good ideas but with OCR "the devil is in the details" and there are a lot of details. This is why you will probably not see any good OSS engines in the foreseeable future -- there is a very iterative process between the algorithm development and testing and the cost of doing this is significant. The software that comes with scanners is cut back (big surprise) to get you to buy the real version. 100% accuracy on clean documents is not uncommon. Usually the document formatting (which is a much harder problem) is where things break down. Just one guy's opinion...

    --
    I am not a number! I am a man! And don't you ... oh wait, I'm #93427. Ha ha! In your face #93428!
  150. Interns by Tom7 · · Score: 2

    When we needed to do something like this, we hired high school kids to retype the text for us. It's much cheaper than an auto-feeding scanner and OCR software. =)

  151. People are doing it... by DeepDarkSky · · Score: 2

    In an effort to associate everything with Gnutella/Napster (much like the Beowulf Cluster trend), I'd like to point out that I've seen tons of PDFs on Gnutella of books that are currently on the bookshelves, like all the Teach Yourself xx in xx days books, etc. All copyrighted material, all in either PDF, HTML, or txt format. So obviously, people are able to scan books and convert them into PDFs that are completely searchable and with the graphics intact. Adobe's Acrobat does all of that, including OCR, and if it cannot confidently recognize words, it retains the bitmap of the text in question, just so you can see and possibly edit it.

  152. After you solve the paper to PDF problem... by istartedi · · Score: 2

    ...could you solve my PDF to HTML problem? I haven't seen any cheap converters for that either. I wouldn't hate PDF so much if I could convert it. I understand that dead tree documents have their place, but that shouldn't come at the expense of on-line documents. Until someone comes up with a free PDF to HTML converter, I will continue to complain to companies and government agencies that post documentation in PDF.


    The regular .sig season will resume in the fall. Here are some re-runs:
    --
    For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
  153. Review. by istartedi · · Score: 2

    Well, as advertised, it *does* convert PDF to HTML in a way that would work very well for text-to-speech software.

    It strips *all* formatting, including many br tags. It's really not much better than a plain text converter.

    So, if you're visually impaired and need to read a PDF, this is fine, but it falls far short of what I want: a truly free PDF to HTML converter that does its best to preserve the look of the original document.


    The regular .sig season will resume in the fall. Here are some re-runs:
    --
    For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
  154. Re:OK MODERATORS by pjl5602 · · Score: 2
    The person to whom's post this child posted a solution to allow you to OCR from gimp, which would then allow you to post script, and then quite easily create a pdf. This is a far cry from offtopic, but someone felt the need to mark it offtopic.

    At least check the link before you flame others about marking something as offtopic (*HINT* it points to http://www.microsoft.com and NO SUCH HOWTO exists.) Duh. :-)

  155. Re: Proof reading by Mr.+Barky · · Score: 2

    Perfect OCR isn't necessary for searching documents. As long as the OCR is pretty good, you can get pretty good searches. Since the question stated that they want to look at the diagrams, the original image obviously needs to be saved.

    One could make the text hidden as suggested by post #27.
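
    To illustrate why imperfect OCR can still be searched, here is a rough sketch using approximate string matching from Python's standard difflib module. The sample sentence, the function name, and the 0.8 cutoff are assumptions picked for the example, not anything from a real OCR product.

```python
import difflib


def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Return words in the OCR output that are close enough to the query."""
    # Normalise the OCR output into a set of lowercase words.
    words = sorted({w.strip(".,;:!?()\"'").lower() for w in ocr_text.split()})
    return difflib.get_close_matches(query.lower(), words, n=10, cutoff=cutoff)


# "Postscnpt" stands in for a typical OCR misread of "Postscript".
ocr_text = "Print to a Postscnpt file and run ps2pdf on it."
print(fuzzy_find("postscript", ocr_text))
```

    An exact search for "postscript" would miss this page; the approximate one still finds the misread word, and the saved page image shows the reader the correct text.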

  156. Re:Adobe Acrobat 4.0 by drinkypoo · · Score: 2

    Looks like it:

    From: <Saved by Microsoft Internet Explorer 5>
    Subject: Ask Slashdot: From Paper To PDF?
    Date: Sun, 18 Jun 2000 10:02:56 -0700
    MIME-Version: 1.0
    Content-Type: multipart/related;
    boundary="----=_NextPart_000_0000_01BFD90C.601E59E0";
    type="text/html"
    X-MimeOLE: Produced By Microsoft MimeOLE
    V5.50.4029.2901
    This is a multi-part message in MIME format.
    ------=_NextPart_000_0000_01BFD90C.601E59E0
    Content-Type: text/html;
    charset="Windows-1252"
    Content-Transfer-Encoding: quoted-printable
    Content-Location: http://slashdot.org/comments.pl?sid=00/06/05/2353219&cid=171

    Et cetera. It's even saving it as if it were an mbox entry... Don't get much more open than that... MIME, HTML, BASE64.
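
    Because the saved file is ordinary MIME, any mail library can take it apart. Below is a minimal sketch with Python's standard email package. The message is a cut-down stand-in for the headers quoted above: the shortened boundary and the one-line HTML body are invented for the example.

```python
from email import message_from_string

# Cut-down stand-in for a page saved by IE 5 (boundary and body are made up).
raw = (
    "From: <Saved by Microsoft Internet Explorer 5>\n"
    "Subject: Ask Slashdot: From Paper To PDF?\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/related;\n"
    '\tboundary="----=_NextPart_000";\n'
    '\ttype="text/html"\n'
    "\n"
    "This is a multi-part message in MIME format.\n"
    "\n"
    "------=_NextPart_000\n"
    'Content-Type: text/html; charset="Windows-1252"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<html><body>caf=E9</body></html>\n"
    "------=_NextPart_000--\n"
)

msg = message_from_string(raw)
print(msg.get_content_type())        # container type from the top-level header
print(msg["Subject"])

# The first part holds the HTML; decode quoted-printable, then the charset.
part = msg.get_payload()[0]
html = part.get_payload(decode=True).decode(part.get_content_charset())
print(html)
```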

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  157. Re:The age old question by shepd · · Score: 2

    >Yeah, just like, say, Linux and Windows.

    Bingo. If you have only a one man staff for over 100 people that is... In that case:

    Windows 9x: $250 gets you an OS in a box. Hope you like it. Supporting it costs very little because you can do very little with it. Like "A meal in a can", its server capabilities are laudable only as an example not to follow -- don't pack so much crap into something that is already bursting at the seams.

    Windows NT/2k: $$$$$ gets you an OS in a paper sleeve. It doesn't matter whether you like it or not because once the managers see it you are stuck with it. Supporting it costs very much because you can't do anything with it properly. Takes about 1 server for like 10 clients. Sorta like duct tape when it is used on anything but ducts.

    Linux: No money gets you an OS on an FTP site. For one man, supporting that many users is going to cost extreme $$$$$$. But you can do it all on one machine. Just like a big swiss army knife.

    Of course, a smart company (too bad these don't exist) would hire 5 people (one per 20), run Linux, and buy X-Terms. This is cheaper than ANY of the Windows solutions I have ever seen...

    Just my $0.02

    --
    If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
  158. OCR can retain formatting by Alien54 · · Score: 2
    It has to save the document into a file format that has complex formatting features. Usually this is something like Word Perfect, etc.

    Omni Page has excellent capabilities for OCR that will scan and retain most, if not all, formatting. It also supports this with WordPerfect, not just the Redmond brand X software that you see around.

    Unfortunately, it still requires a win9+ machine, but otherwise it falls into the category of Really Good Stuff(tm)

    They were separate from TextBridge a while back, but the companies merged during the past couple of years.

    The other option is to see if the companies have copies of the books available on CDs, etc. This depends on the company, of course.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  159. The Holy Grail by Alien54 · · Score: 2
    Just as a Note, this is a Holy Grail for many companies. I have a number of potential clients who would love this as they have a whole wall of file cabinets filled with paper docs that they want to convert to electronic docs, but cannot because of time, cost, etc. never mind legal issues (original records for legal disputes, etc)

    One in particular that comes to mind is an auto insurance place, with all of those customers who have to process stuff yearly, etc., never mind the usual database issues...

    If you figure it out, you have the makings of a great business plan.

    --
    "It is a greater offense to steal men's labor, than their clothes"
  160. PDF, Ugh. by Juggle · · Score: 3

    I learned my lesson about researching and testing what I offer before selling it to clients thanks to PDF. I knew that PHP was capable of generating PDF's so I went ahead and accepted a job to create a website which would automagically generate PDF resumes for the visitors. What I then found out was that PHP could only generate PDF's if you bought one of two pricy libraries which actually do the PDF work.

    I ended up searching for three days (and submitting an Ask /. which was discarded) before I found a set of OS (free as in beer and speech) Perl libraries for generating PDFs. But oh what a pain. I ended up designing a sample resume in QuarkXPress, then using a pica ruler on the printout to convert it to something I could generate. But after about two weeks of hacking I had a resume generator which spits out very clean, professional looking resumes in HTML and PDF for anyone who's willing to register on the site and fill out a few simple forms. Client was happy and I tucked another language into my cap. (Since the libraries I found pretty much required you to know PostScript).
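
    For anyone repeating the ruler trick, the conversion itself is simple arithmetic. The sketch below assumes a US Letter page and a library that uses the default PDF/PostScript coordinate system (points, origin at the bottom-left corner); the function name is made up for illustration.

```python
POINTS_PER_PICA = 12      # 6 picas = 1 inch = 72 points
PAGE_HEIGHT_PT = 792      # US Letter, 11 inches (assumed page size)


def ruler_to_pdf(left_picas, top_picas, page_height_pt=PAGE_HEIGHT_PT):
    """Convert a ruler reading (from the top-left corner, in picas)
    to PDF/PostScript coordinates (from the bottom-left corner, in points)."""
    x = left_picas * POINTS_PER_PICA
    y = page_height_pt - top_picas * POINTS_PER_PICA
    return x, y


# One inch in from the left edge, one inch down from the top edge.
print(ruler_to_pdf(6, 6))
```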

    Moral of story: test the technology before selling to a client. And trying to generate PDF's on the cheap is only for those who have way more time than money!

    --
    --- Juggle juggle@hitesman.com
  161. PDF XML by 1010011010 · · Score: 3

    We've about finished a tool that will do PDF to XML conversions, and back again. It also sports a native API to allow the creation of documents from scratch. It allows embedding of TrueType fonts. It runs on Linux and Windows NT.

    It'll be out in the next week or so; check Freshmeat.

    The idea behind it is: create a nice layout template in the tool of your choice -- Illustrator, for example. Save as PDF. Convert to XML. Add your markup to it -- extra text, etc. Convert back to PDF. Done!

    Release 1.5 will include a "template" feature, whereby you can use pages from existing PDFs as templates directly; something along these lines (pseudocode):


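    // Pseudocode for the planned template feature; names may change.
    p = new pdf();                         // new, empty output document
    t = new pdftemplate("foo.pdf");        // existing PDF used as a template

    p.newpage("8.5","11");                 // add a page (presumably 8.5 x 11 inches)
    p.include_from_template(t.page(1));    // pull in page 1 of the template
    p.drawstring("Hi!");                   // draw extra text on the page

    p.write("bar.pdf");                    // save the result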


    Does this type of tool sound interesting to anyone?

    On a related note, we plan to offer it as both open source and a commercial product. For instance, the ActiveX interface would be commercial. You could negotiate a commercial license. And you can use it under something like the Aladdin license (a la ghostscript, pdflib, etc). Any advice on open source + commercial? I have to justify my department's budget.


    --
    Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
  162. Other ways... by antdude · · Score: 3

    I asked a friend about this and he said, "no, but the answer is yes, there are other ways....use other OCR engines, like Omnipage Pro or TextBridge Pro. Adobe Capture 3.0 is really really really nice, but is expensive. The searchability factor is the only reason OCRing is needed in most instances."

    Some useful sites:
    PDF Research
    Planet PDF
    AcroBuddies
    Codecuts
    PDF Zone
    Adobe
    Deja.com

    --
    Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
  163. I did that in two hours... by Greyfox · · Score: 3
    Easy solution:

    1) Write LaTeX resume style class. Mine's pretty primitive because it only has to deal with my resume.

    2) Create resume using resume style.

    3) pdflatex resume.tex.

    Or...

    3) latex2html resume.tex (Though latex2html doesn't really generate it to look the way I need it, it is just a simple Perl program, so you could always hack it.)

    Nice thing about LaTeX is you can also go to XML or DVI or RTF or a number of other fairly widely used formats. Or you could just ship the raw LaTeX if the company you're dealing with is that clueful.

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  164. the OCR situation is not good by Jamie+Zawinski · · Score: 4

    Last year, I tried several Linux-based OCR packages, and they basically didn't work at all.

    I ended up using the Windows software that came with my scanner to OCR the documents, and at first glance it appeared to do a good job -- it didn't mess up too often. But then I went in and actually proofread and spell-checked its output to find all the typos it had made, and it turns out that this process was so time-consuming that it was faster for me to just type it all in by hand. Even though the OCR software only made a mistake every few lines, finding those mistakes took enough concentration that typing the whole thing took less time.

    Your mileage may vary, according to how fast you can type.

  165. embed TIFF images in the PDF by jetson123 · · Score: 4
    Many Adobe-converted scanned pages seem to be just a sequence of TIFF images with the OCR'ed text also contained in the PDF file. The OCR'ed text is never displayed, but can be used for searching (in my experience, Adobe's OCR is not very good).

    So, a simple conversion would consist of just putting the scanned TIFF images in sequence into a PDF file.
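
    For the curious, the "never displayed but searchable" part can be done with PDF text rendering mode 3 (neither fill nor stroke). The sketch below writes a one-page PDF by hand containing only such an invisible text layer. It is an illustration, not what Adobe's tools emit: the scanned image itself, which would be drawn on the same page as an image XObject, is left out, and the page size, font, and text position are assumptions.

```python
def build_hidden_text_pdf(hidden_text):
    """Return the bytes of a one-page PDF whose only content is invisible text."""
    escaped = (hidden_text.replace("\\", "\\\\")
                          .replace("(", "\\(")
                          .replace(")", "\\)"))
    # "3 Tr" selects text rendering mode 3: the text is placed but not painted.
    stream = f"BT /F1 12 Tf 3 Tr 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))          # byte offset recorded for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf_bytes = build_hidden_text_pdf("four foot wide stack of technical documents")
with open("hidden_text.pdf", "wb") as handle:
    handle.write(pdf_bytes)
```

    The page should look blank in a viewer, while a text search for "technical" should still land on it.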

  166. Re:why bother with PDF? by turg · · Score: 4
    I don't know about elsewhere but PDF is essential for dead-tree publishing. The advantage it has over all other formats is not that it displays the same on every screen but that it prints the same on every printer (assuming that the author remembered to embed the @#$! fonts, but that's another story :-)

    With PDF, you can design and lay out your ad and transmit it electronically (or on disk) to the newspaper, knowing that it will print exactly how it did for you. Or you can lay out your brochure and send it off to the printers knowing the same thing. With any other format, the publisher/printer's machine is going to have at least one (oh, if only it were ever just one!) setting different than yours, which will change the layout.

    PDF is the way that print ads are submitted electronically today. It's either PDF or old-fashioned cut-and-paste (no, even more old-fashioned than you're thinking, I mean with actual scissors and glue). The Associated Press runs a "wire service" called AdSend for ad agencies to transmit PDF ads electronically to newspapers and magazines -- and they are transmitting millions of PDF's a year.

    The same thing basically goes for sending anything you want printed to a print shop. In any case, free PDF-making software enables dead-tree publishing the same way that the web enables electronic publishing (though we haven't got any print shops that'll work for free, yet :-)

    ========

    --
    <sig>Guvf vf abg n frperg zrffntr
  167. Missing a step? by sugarman · · Score: 4
    You mentioned OCR software, but didn't go much further with it. Wouldn't this be the solution you need?

    Scan to OCR to PS to PDF

    There are apparently a couple of tools to do this for you. Check out a brief list here.

    Seeing as you've looked into Adobe Capture, Windows may be an option. If so, then the other question would be whether you've looked into Textbridge. This looks like it would do exactly what you're asking. No muss, little fuss.
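
    On the free side, the Scan to OCR to PS to PDF chain could be scripted roughly as below. The tool choices are my assumptions, not something from the list mentioned above: gocr for the OCR step and enscript for text to PostScript, with ps2pdf as in the original question. The function only builds the command lines, so it can be inspected without any of the tools installed, and note that the result is a text-only PDF with no page images.

```python
from pathlib import Path


def pipeline_commands(scan):
    """Return the three command lines for one scanned page (does not run them)."""
    page = Path(scan)
    text = page.with_suffix(".txt")
    ps = page.with_suffix(".ps")
    pdf = page.with_suffix(".pdf")
    return [
        ["gocr", "-i", str(page), "-o", str(text)],   # OCR: image -> plain text
        ["enscript", "-p", str(ps), str(text)],       # plain text -> PostScript
        ["ps2pdf", str(ps), str(pdf)],                # PostScript -> PDF
    ]


for cmd in pipeline_commands("page01.pnm"):
    # To actually execute a step, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

    The proofreading step belongs between the first and second commands, on the .txt file.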

    --
    --sugarman--
  168. The age old question by underwhelm · · Score: 4

    I am asked to do this all the time as a computer services employee of Kinko's.

    The short answer is using OCR to create a text file, proof reading the text file, and then printing to a postscript file.

    The long answer is, you need to find quality OCR software that does not choke on things like forms. You also *MUST* proofread every OCR'd document. No OCR is perfect, and drawn elements will almost certainly trip the software into embedding odd characters or pipes into your text. Different font sizes will cause the software to choke. Thin fonts will cause the software to choke.

    If you are OCRing forms, I recommend Omni Form (it's the only software I know of that recognizes forms, but I have never used it personally).

    Batch processing of OCR pages is likely easy to set up with professional OCR software (Omni Page does it), but it does not excuse you from proofreading the results. After that, the PDF part is a snap, and can be accomplished with any OCR software you choose to use.

    If you are asking which OCR software is best, I can't help you directly. OCR software is a niche software market, and you either get free, disappointing software with your scanner, or you pay big money for something that does a decent job. Just like everything else in life. Have you read any OCR software reviews?

    --

    I don't need large brains to have a good time.

  169. A former intern... by heliocentric · · Score: 4

    Speaking as a former intern under a guy who wanted all these meeting minutes from the early 80s on put on the web, I know what you are asking for. I knew HTML and simple coding then, and was only being asked to translate them to HTML. What I did was OCR a ton of the text, only to reduce the keystrokes (it's much easier to drink coffee while swapping pages in a scanner every few seconds than it is typing all day). Then I spell checked them as an initial step and formatted them by hand. When I moved on to the next ton, and they were in the scanner bed, I would check the grammar of those I had done in the first batch.

    So, I ended up being the cheap labor to get the stuff together, but I incorporated the error checking suggested by the other replies, and I utilized OCR to minimize carpal tunnel damage.

    Yeah, it took a while, and yes I got paid little in comparison to the other people at the location, but I got paid, they got their silly meeting minutes online, and they didn't have to hire 1,000 monkeys with 1,000 type-writers and have redundancy of people or invest in vast warehouses of paper feeders.

    The scale of my work: I worked on a series of bound volumes that took up 3+ feet on a bookshelf and I completed the work on my own in less than 2 weeks (while also fielding tech support questions from the group). If you have 1,000,000 pages to be put online yesterday, maybe you could use a larger staff - but always remember:

    If it takes a farmer 3 days to plow a field, and 3 farmers only a day to plow the same field, and it takes one woman 9 months to have a baby, how many months does it take 9 women to have one baby?

    Often putting more people on a project doesn't equate to faster solutions or better ones and usually not cheaper ones.

    --
    Wheeeee
  170. Adobe Acrobat 4.0 by cetan · · Score: 5

    You don't need to spend all that money for Adobe Capture 3.0 when you can buy Adobe Acrobat 4.0. This is NOT the Adobe reader, but the full version of Adobe Acrobat with all the bells and whistles. A URL is: http://www.adobe.com/store/products/acrobat.html.

    In addition, you can also buy the Adobe Acrobat Business Tools, which is a slightly broken but still functional version of Acrobat 4.0. That is available here: http://www.adobe.com/store/products/acrbustools.html.

    --
    In Soviet Russia...michael would be rotting in Siberia!
  171. Save money on OCR by sacrificing quality by AnonymousHero · · Score: 5
    Ahh... mass-OCR cost-effectiveness... it takes me back...

    I just used an off-the-shelf OCR engine and hacked the text together with the images programmatically myself. We would get TIFF images, which most engines could understand.

    On really, really big OCR jobs, though, the real problem is the tradeoff between human intervention and quality. See, OCR engines just guess at stuff. The only reason they work at all is that they guess well. But they guess wrong anywhere from 0.1% to 10% of the time, depending on the quality of the input.
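
    To put that range in perspective, here is a back-of-the-envelope calculation. It assumes the error rate is per character and that a page holds about 2,000 characters; both numbers are my assumptions for illustration.

```python
CHARS_PER_PAGE = 2000   # assumed size of a dense page of text


def expected_errors(chars_per_page, error_rate):
    """Expected number of wrong characters on one page."""
    return chars_per_page * error_rate


for rate in (0.001, 0.10):   # the 0.1% and 10% ends of the range
    errors = expected_errors(CHARS_PER_PAGE, rate)
    print(f"{rate:.1%} error rate: about {errors:.0f} mistakes per page")
```

    Under those assumptions, even the good end of the range leaves a couple of mistakes on every page for someone to find.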

    Each mistake must be corrected by a human being. But humans are expensive. If you have lots of documents to OCR, the technology integration costs and the cost of the OCR engines themselves are amortized. They end up dwarfed by the paychecks of the humans.

    The cost of massive amounts of OCR, therefore, is directly related to the amount of human correction of OCR mistakes.

    Thus, you can save tons of money by selectively sacrificing OCR quality. Getting every page perfectly formatted requires around 60 seconds a page for a skilled OCR operator. It's all about reducing that time. How? Simple. Don't expect everything to be perfect. There are various levels of quality you can get out of OCR engines-human systems:

    • no correction: just let 'er run. You can get it fully automated this way, but the quality is crap.
    • zoning only: The OCR engines just suck at text with multiple columns, inserts, and tables. You can get people to correct the engine's zoning at a clip of around 5 seconds a page, 10 seconds if you require them to put in tokens representing the excised images.
    • spelling correction: Typically, most people object to the spelling mistakes OCR introduces. With good quality text an operator can correct them at around 20-30 seconds a page.
    • formatting correction: OCR engines can really mess up indentation and text flow. Unfortunately this is the most time-consuming problem to fix, anywhere from 30 seconds to a couple of minutes per page.
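
    The tradeoff in the list above is easy to put numbers on. This sketch uses the upper end of each per-page time from the list; the job size and the hourly wage are invented for the example, and the level names are just labels.

```python
# Seconds of operator time per page, upper end of each range in the list above.
SECONDS_PER_PAGE = {
    "none": 0,          # fully automated, no correction
    "zoning": 10,       # 5-10 seconds
    "spelling": 30,     # 20-30 seconds
    "formatting": 120,  # 30 seconds to a couple of minutes
}


def correction_cost(pages, level, hourly_wage):
    """Labor cost of correcting a job at the given quality level."""
    hours = pages * SECONDS_PER_PAGE[level] / 3600
    return hours * hourly_wage


PAGES = 10_000          # assumed job size
WAGE = 15.0             # assumed operator wage per hour
for level in SECONDS_PER_PAGE:
    print(f"{level:<10} ${correction_cost(PAGES, level, WAGE):,.2f}")
```

    With these assumptions, formatting correction costs twelve times as much as zoning alone, which is the whole argument for deciding up front how perfect the result has to be.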

    Oh, and it really helps if you get the workflow of the OCR down. Allow the operator to move on to the next document automatically, save them the trouble of remembering the name of the document they're working on, etc. etc. This may require a bit of hacking of the OCR engine you're using, but it's worth it.

    So when doing something like this, ask yourself: how perfect does it have to be, really? You can save tons of money if you can cut any quality corners.