Slashdot Mirror


Digitizing Your Dead Trees?

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
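
The "searchable index" part of the plan can be sketched independently of any scanner or OCR package. The following is a minimal illustration, assuming the OCR step has already produced one plain-text string per page; the sample page texts and the function names `build_index` and `search` are hypothetical, not taken from any OCR tool.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted page numbers that contain it."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical OCR output; page 1 contains a typical misread ("Po1nters").
pages = {
    1: "Chapter 1: Po1nters and arrays in C",
    2: "Pointers to functions; arrays of pointers",
    3: "Memory allocation with malloc and free",
}
index = build_index(pages)
print(search(index, "pointers arrays"))
```

A misread word simply fails to match on that page, which is the sense in which imperfect OCR can still be "good enough": the page images stay authoritative and the text only has to point at roughly the right place.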

347 comments

  1. look online before you scan by cheesyfru · · Score: 5, Informative

    You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)

    1. Re:look online before you scan by MisterBlister · · Score: 2, Informative
      Most of the stuff you find online is training stuff, like Learn Photoshop or Learn HTML in 21 days or whatever.

      There's a dearth of available electronic copies of programming-type texts, except for those where the author/publisher creates their own version (like all of Bruce Eckel's books).

    2. Re:look online before you scan by cheesyfru · · Score: 2, Insightful

      I've got about 30+ O'Reilly books, Design Patterns, Stroustrup C++, etc. They're out there if you look long enough. LimeWire has also been a big help.

    3. Re:look online before you scan by Anonymous Coward · · Score: 0

      You can also go to irc.nullus.net and join #bw, there's a _massive_ quantity of books available for download.

      Posting as AC so they don't bust my ass. :)

    4. Re:look online before you scan by MisterBlister · · Score: 1
      I'll take a look on gnutella, thanks for the tip.

    5. Re:look online before you scan by phungus · · Score: 1

      I've got over 1.5gb of technical books I've been collecting over the last year.

      Try the alt.binaries.ebook[s] groups.

    6. Re:look online before you scan by vladkrupin · · Score: 2

      and consult your lawyer

      'cause Elcomsoft thought they could do the same (minus the scanning part) and they were wrong. I don't think you need to copy an electronic version to be a pirate. You can scan a paper copy and become one.

      But then again, IANAL...

      --

      Jobs? Which jobs?
    7. Re:look online before you scan by rbeattie · · Score: 2


      "A wealth" of ebooks? Yeah right. If you're a total freakin' nerd. There's 1) Programming boooks 2) Sci Fi and Fiction (only from the most popular/oldest authors including Harry Potter) and 3) How to get laid for Dummmies (No joke). And there's absolutely nothing in Spanish (which is a thing of mine since I live here in Spain and want stuff to practice on).

      I've thought of doing EXACTLY what this guy is doing. I hope there's some good advice... I can't wait until ebooks are as popular on Gnutella as MP3s.

      -Russ

      --
      Me
    8. Re:look online before you scan by jonbrewer · · Score: 3, Informative

      O'Reilly actually sells electronic editions of their books, so please buy them! You can also subscribe and read many of their books online. Also a good idea.

      (I personally like my dead tree O'Reilly books, and will stick with them until I have a really hi-res lcd to read electronic versions with.)

    9. Re:look online before you scan by digitalsushi · · Score: 2
      3) How to get laid for Dummmies (No joke).


      anyone got the isbn?

      --
      slashdot: where everyone yells sarcastic metaphors to themselves to understand the issue
    10. Re:look online before you scan by Anonymous Coward · · Score: 0

      no, no - you need "how to get laid for total fucking stank retards"

    11. Re:look online before you scan by limited · · Score: 1

      Hmmmmm, so Is printing a file on paper considered a copy-protection device? Should we outlaw Xerox copiers now?

    12. Re:look online before you scan by SoupIsGoodFood_42 · · Score: 1

      I've got PHP and MySQL Web Development by Sams, It comes with the full PDF version (exact copy of the book) on the CD. I find it very handy. I use the dead tree version at home, but keep the PDF on my laptop, so I can pull it up anywhere. I wish more companies would do this. It's nice to know that there are still some people don't treat their customers like software pirates.

    13. Re:look online before you scan by Anonymous Coward · · Score: 0

      #bookwarez on DALnet.

    14. Re:look online before you scan by Anonymous Coward · · Score: 0

      You certainly aren't a lawyer. You're more like a troll. It's called fair use and if the courts don't rule liberally on fair use, they're going to find themselves replaced by the people of the United States.
      This whole conservative pro-IP court system was created in the ugly 80s (called the CAFC since you're not a lawyer) and as soon as the economy totally wipes out from this foolishness that would have been so familiar to Americans during the turn of the previous century, we'll get back to a real democracy where fair use is defined in the most liberal terms and patent and copyright are held to extreme scrutiny. This is what happened after the last depression.
      Slightly off-topic, so it's an AC.
      AhFoo

    15. Re:look online before you scan by Anonymous Coward · · Score: 0

      ISBN: 076455302X

    16. Re:look online before you scan by Mysticalfruit · · Score: 1

      As long as the person doesn't give them to anybody they should be all right.

      It's only pirating, if you intend to share the information...

      --
      Yes Francis, the world has gone crazy.
    17. Re:look online before you scan by vladkrupin · · Score: 1

      yeah, I guess that's a bit like a troll. But just a bit.

      Fair use rights? Yes, I've heard of them, they used to exist a few years back. But did you check the state of your fair use rights lately? They are hanging by a thin thread anyway, and are being eroded daily. Check out the attempts to implement copy-protection in all hardware! (your harddrive thinking - "hmmm... that looks like an MP3 file to me... let's delete it, I don't think RIAA allows that...")

      You say that if the courts don't rule a certain way they are gonna be replaced by the "people of the United States". You were kidding me, right? People of the USA are in a major state of apathy! Nothing matters to them. Privacy - down the drain! Fair use - oh, who needs that. And if you use keyword 'combatting terrorism' anywhere when you are taking more rights away, it just goes so much smoother now.

      Ok, that's a bit off-topic, but you do have to admit that there is hardly another nation that is more apathetic than Americans, simply because they've got the most to lose, and thus would rather sit tight than cause trouble. Aside from EPIC and EFF there isn't even a single group that tries to protect the interests of a citizen. And they have to go against corporations *and* the government. Apparently, no-one cares.

      BTW: when was the last time that the courts in this country got replaced by the people? You would probably need to refresh my memory in a major way here because I have trouble remembering history that far back.

      yes, i am not a lawyer, yada, yada, rub it in... real democracy... yes, sounds good... fair use... sounds great too... You talk about all those great things, I like them too, and would gladly see them happen. Too bad they aren't going to happen unless the public wakes up. And that's not going to happen. Not with the Americans I see today. At least not until they are hurt (financially or otherwise) badly enough to wake up and do something as a people. But then it might be too late.

      Because right now we get slapped right and left, and offer no response. It's like prompting someone to slap us again. M$ audits the heck out of everyone, costing a fortune regardless of whether you are compliant or not - and no-one except the victims care. Government listens to your conversations/reads your email - everyone applauds (heck, we are preventing terrorism, right?). Sony sells you a CD that kills your imac - everyone is fine with that - we need to fight pirates, whatever the cost. I can patent moving my mouse in a circular motion, and make you license that if you ever need to move it like that - you suffer from unfairness, but everyone agrees - we need to protect IP.

      well, that's really off-topic now. I just feel very sad about how grim things look now in the US, and your post just prompted me to say just that. By the way, it was very much on-topic, I think.

      --

      Jobs? Which jobs?
  2. An easier solution. by SystemFork · · Score: 4, Funny

    Lots of college students at $5/hour.

    --
    Slogan-free since April! We pass the savings on to you!
    1. Re:An easier solution. by Anonymous Coward · · Score: 0

      I'll volunteer for one

      scummy_student@frathouse.com

    2. Re:An easier solution. by psycht · · Score: 1

      Insightful? i believe the author's intent was humor, cause most students would really do it for beer.

    3. Re:An easier solution. by Tekgno · · Score: 1

      No money, I need it to pay off the fscking speeding fine I got last week. I got fined $125 and get $170 a fortnight, no eating for a few days. Not happy jan.

    4. Re:An easier solution. by Lord_Pryo · · Score: 1

      He's right, I would do it for beer. provided its good beer. (ie Canadian beer- not that American piss water) :)

    5. Re:An easier solution. by Anonymous Coward · · Score: 0

      You must be a student. From the type of beer you're describing, you must be strapped for cash. Once you make enough money to buy any kind of beer you want, you'll find that American beer is quite good.

    6. Re:An easier solution. by Mika_Lindman · · Score: 1

      At the end of the month, I'd do it for food! (But that's because I've spent all my money on beer)

  3. Go To Kinko's!!!! by thedbp · · Score: 4, Informative

    Kinko's offers high-volume scan-to-PDF solutions ... at low volume, it is usually 10 - 25¢ per page plus the cost of the media to copy it to, but in large volume, sometimes the cost can go down to 1¢ per page.

    Call Kinko's. Ask for the Territory Representative. They'll help you out!!!

    1. Re:Go To Kinko's!!!! by blindbat · · Score: 0

      Will they let you bulk copy copyrighted books?

    2. Re:Go To Kinko's!!!! by Microsift · · Score: 4, Interesting

      I seriously doubt Kinko's would do this. They are ultra-paranoid about violating copyright. I imagine if you could do it at Kinko's, you'd have to do all the work yourself in the Self-Service area. I doubt they have machines like that in self-service.

      --
      My other sig is extremely clever...
    3. Re:Go To Kinko's!!!! by Anonymous Coward · · Score: 1, Informative

      They won't. I'm working at a K's right now and company policy won't let us copy anything that's copyrighted without proper permission and to hand place that many pages on a scanner bed would be horrendously time consuming.

    4. Re:Go To Kinko's!!!! by Anonymous Coward · · Score: 0

      Simple...go between the hours of 2AM-4AM. The rave-scene trance drop-outs they have working at that hour don't care what the fuck you do.

      Also, they have one of those machines that can cut through a big ass stack of paper and have the edges all come out neat and parallel. So you can chop the binding off a book and then feed it through their scanner.

    5. Re:Go To Kinko's!!!! by Catbeller · · Score: 2

      If you do it in self-service, and they catch you, you will be tossed into the street.

    6. Re:Go To Kinko's!!!! by Hadlock · · Score: 1

      nah. i've copied plenty of legal stuff there. mostly calvin and hobbes books. works great for me. they help me right till the point where it comes to pushing the green button. and then i pay them, and leave. just make sure you're a "student". this is in upscale/lawyer dallas, so it's not like this is a clueless country kinkos (if they exist)

      --
      moox. for a new generation.
    7. Re:Go To Kinko's!!!! by Anonymous Coward · · Score: 0
      I seriously doubt Kinko's would do this. They are ultra-paranoid about violating copyright.
      Hmmm...good thing they didn't notice me photocopying a $125 library book cover to cover... :-p
    8. Re:Go To Kinko's!!!! by Dyolf+Knip · · Score: 2

      Ahhh, but what if the person who wants to copy books and the Kinko's employee are the same person?

      --
      Dyolf Knip
    9. Re:Go To Kinko's!!!! by 3Suns · · Score: 2, Informative

      Don't bother.

      If Kinko's does it like all the copy shops I've seen, the pdf's aren't real digitized texts, they're just the scans, in image format, on a pdf. Not exactly the best way to store a book of info.

      --

      -3Suns

      ~~~~
      The Revolution will be Slashdotted
    10. Re:Go To Kinko's!!!! by muleboy · · Score: 1
      Kinko's offers high-volume scan-to-PDF solutions ... at low volume, it is usually 10 - 25¢ per page plus the cost of the media to copy it to, but in large volume, sometimes the cost can go down to 1¢ per page.

      They must not have told the blockhead managers here at the Boulder, Colorado Kinkos, cause they quoted me $1 per page, even at a volume of 5,000 pages.

      You can probably guess my response.
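
To put the quoted prices side by side, here is a small arithmetic sketch for the 5,000-page job mentioned above. It assumes the "10 - 25" and "1" per-page figures in the parent comment are in cents (the $1 quote only reads as a complaint under that interpretation); the labels and function names are made up for illustration.

```python
def job_cost_cents(pages, cents_per_page):
    """Total cost of a scanning job, kept in integer cents to avoid float drift."""
    return pages * cents_per_page


def as_dollars(cents):
    """Format an integer number of cents as a dollar string."""
    return f"${cents / 100:,.2f}"


PAGES = 5000  # the volume mentioned in the Boulder quote

# Per-page rates in cents; the first three are the assumed reading of the parent post.
rates = [
    ("low volume, cheap end", 10),
    ("low volume, expensive end", 25),
    ("bulk rate", 1),
    ("Boulder quote", 100),
]

for label, cents in rates:
    print(f"{label}: {as_dollars(job_cost_cents(PAGES, cents))}")
```

Under that reading, the Boulder quote is a hundred times the advertised bulk rate for the same stack of pages.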

  4. monkeys by blugecko · · Score: 4, Funny

    hire an infinite amount of monkeys on typewriters and... oh wait, that is for shakespeare

    --
    Lysergic Acid Diethylamide, not just chemistry, reality!
    1. Re:monkeys by HalAshton · · Score: 2, Funny

      What about the Google pigeons? I heard they were out of work.

  5. Safari is your friend by Dredd13 · · Score: 5, Informative
    If you're like me, a good chunk of your collection is ORA books... in which case, you should check out O'Reilly's Safari, which is their online book offering. It also includes non-ORA books as well, actually.

    Quite useful and handy.

    D

    1. Re:Safari is your friend by Skidge · · Score: 2

      But unfortunately, owning the O'Reilly books doesn't entitle you to be able to access them online. You'd have to pay a subscription to access them.

      That being said, the $9.99/month (or so) would probably be worth it, considering all the work tearing apart and OCRing all the books would take, just to get somewhat inaccurate digital versions.

    2. Re:Safari is your friend by Anonymous Coward · · Score: 0

      That's nice, but why would he want to pay a monthly fee to rent books he already owns?

    3. Re:Safari is your friend by SystemFork · · Score: 2, Insightful

      Perhaps the original poster should subscribe to the O'Reilly books they've purchased (for a month) and then save each chapter locally. Even at Safari's upper subscription levels of $100/mo you get access to 200 books. There's no way you could get a quality scanner with a feeder and OCR software for less than $100. Re-inventing the wheel is instructive, but silly.

      --
      Slogan-free since April! We pass the savings on to you!
    4. Re:Safari is your friend by Wanker · · Score: 5, Informative
      I'll second this-- the O'Reilly Safari site is wonderful for anyone with a hoard of tech books.

      I bet about half of your books are already online.

      Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. On the other hand, JPEG performs relatively poorly at compressing large areas of the same color (i.e. white backgrounds.) [Note for the nit-pickers, both of these JPEG issues will be reduced/eliminated in JPEG2000.]

      I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder*) GIF.

      From the Project Gutenberg "Making Etexts from Paper Originals" paper: (You can bet these guys know how to scan...)

      A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG.
      I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.
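
For a sense of why scan mode and file format matter at 300 DPI, here is a rough size calculation. It is only a sketch: the 8.5 x 11 inch page is an assumed size (most books are smaller), row padding and file headers are ignored, and `raw_page_bytes` is a hypothetical helper name.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, ignoring row padding and headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: US letter, scanned at the 300 DPI from the original plan.
for name, bits in [("lineart (1-bit)", 1), ("grayscale (8-bit)", 8), ("colour (24-bit)", 24)]:
    size = raw_page_bytes(8.5, 11, 300, bits)
    print(f"{name}: {size:,} bytes per page")
```

Multiplied by several hundred pages per book, the uncompressed numbers explain why the choice between a lossless format and JPEG gets argued over at all.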

    5. Re:Safari is your friend by Dredd13 · · Score: 4, Insightful
      That's nice, but why would he want to pay a monthly fee to rent books he already owns?

      Because there's something very nice to having access to your 30-odd book collection from home, office, conference, at a job-site, etc. etc., without dragging along 40 pounds of books with you everywhere you go.

      It's a convenience you pay for. Considering how many ORA books many people pay for (and keep current as new editions come out), the annualized cost of simply subscribing and NOT buying the dead-tree version at all is very appealing to some folks, especially if their lifestyle has them wanting ready access to the material "from lots of different places".

    6. Re:Safari is your friend by interiot · · Score: 2

      If you already own the book, then something like this should be a legal and free way to accomplish the same problem, right?

    7. Re:Safari is your friend by PunchMonkey · · Score: 1

      in which case, you should check out O'Reilly's Safari [oreilly.com]

      You should check out O'Reilly's safari, they even offer 30 day trial accounts with complete access to their entire library.

      However, I've found the site slow enough to be annoying and the search interface not the best. I think I feel this way because I'm comparing safari to the O'Reilly CD bookshelves, which I copy onto my webserver (in a password protected folder of course) and can quickly browse and search through.

      --
      I'll have something intelligent to add one of these days...
    8. Re:Safari is your friend by itsdave · · Score: 2, Interesting

      I subscribed to the safari club shortly after they announced it and I was not pleased.

      for starters, I could only have access to three books at any given time, I decided to just choose 3 books right when i signed up and later decided i wanted to trade one of the books in for another which they allowed me to do just fine. However, I then decided I wanted to check out another book and it said, sorry, you can only switch a selection once per month.. oh, isn't that handy, so .. do you really have access to all the books no matter where you are? no, you only get access to a few. then I thought, it would be nice if I could save a local copy and then put it in a nice searchable database. no way, they stopped me in my tracks for turning the pages too fast because they detected that I was a spider.

      thanks oreilly, I love your books but you can keep your safari club.

    9. Re:Safari is your friend by Pituritus+Ani · · Score: 1

      I might be willing to pay a little extra to be able to read the book I already paid for from ORA's servers (without the "swapping" restrictions, etc.), but I'm not going to pay a fee that assumes I don't own a paper copy. And if I wanted to borrow a book, I'd just go to a library and do it for free (ILLing it if my library doesn't have it). Guess it's just a different mentality, but I'd much rather lug the paper versions than pay twice.

      --

      Another proud carrier of the $rtbl flag

    10. Re:Safari is your friend by spectecjr · · Score: 2, Insightful

      Yeah, it really sucks having to pay for convenience, doesn't it? Everything should be free (beer) and handy and no company should ever prevent you from misusing a service they offer just because they have a right to.

      Personally, I subscribe to Safari, and I think it's great. I recognize that the 5 (maybe when you subscribed it was only 3, but now the bottom subscription level is 5) book limit and the "you can only change books once a month" provision and the anti-spidering technology was all to protect O'Reilly's considerable investment in their books and yet still allow me the convenience of reading and searching a selection of their books online.

      But yeah, it really sucks when a company tries hard to both cater to internet geeks *and* protect their investments. They should just post all their books online for free and allow me to write everything to my hard drive so I don't have to pay anymore.


      You're not paying for convenience.

      Since when did you fill your bookshelf with books that expired after a month. Or that you had to pay for continuously?

      Just sell me the E-Book version. ONCE. That's all I ask. Embed my name and address in there if you want; just let me buy the book as a file.

      Preferably, for the same price as the physical book, minus cost of printing / distribution / retailer markup.

      Simon

      --
      Coming soon - pyrogyra
    11. Re:Safari is your friend by realdpk · · Score: 2

      "But unfortunately, owning the OReilly books doesn't entitle you to be able to access them online."

      That is, access them on their web site. You can put them on your own private webspace, on a CD, etc. It's no different than mixing your own music CDs from CDs you legally own.

      But yes, O'Reilly's fees are much less than what you'll pay to scan it all yourself.

    12. Re:Safari is your friend by Anonymous Coward · · Score: 0
      Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. On the other hand, JPEG performs relatively poorly at compressing large areas of the same color (i.e. white backgrounds.) [Note for the nit-pickers, both of these JPEG issues will be reduced/eliminated in JPEG2000.] I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder* [unisys.com]) GIF.

      Scanned images tend to be grayscale images. The white areas tend not to be white or black. I find that jpeg handles it a lot better than PNG.

      A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG.

      Sounds like they like jpeg for stuff as well. They don't like jpeg and greyscale. I don't know why I've had such good luck with it.

    13. Re:Safari is your friend by AJWM · · Score: 2

      I'll second what you just said about formats.

      And the tragedy is, the National Geographic Magazine collection on CD-ROM consists entirely of JPEG pictures of the pages (well, plus some (Win/Mac) indexing software). Okay, the photos are probably what attracts most people to National G, but the articles are damn hard to read.

      The folks (Tinker's Guild) that did the complete collection of The Amateur Scientist columns from Scientific American (admittedly a less ambitious undertaking than National Geo.) converted all the articles to HTML (illustrations in GIF). And the indexing software is in Java. Kudos to them.

      --
      -- Alastair
    14. Re:Safari is your friend by Anonymous Coward · · Score: 0

      Since when did you fill your bookshelf with books that expired after a month. Or that you had to pay for continuously?

      Since when did you buy a shelf-full of books for $9.95? If I want it forever, I'll buy a hardcopy. But if I want to search and access it and then outgrow it, I use Safari.

    15. Re:Safari is your friend by Anonymous Coward · · Score: 1, Informative

      "Perhaps the original poster should subscribe to the O'Reilly books they've purchased (for a month) and then save each chapter locally."

      I tried doing this... of course only to read while traveling and while I'm subscribing to that particular book. O'Reilly's 'spidering detection', although well intended, locked out my account multiple times... it took me weeks to get ahold of a rep via email. By the time I did, I was so fed up that I quit the service.

      Don't get me wrong, the format of the books on Safari is great. Hyperlinked TOC and indices.... No search engine AFAIK though. Still, much better than you're going to get by OCR'ing them.

    16. Re:Safari is your friend by Anonynnous+Coward · · Score: 2
      Since when did you buy a shelf-full of books for $9.95?

      You must live in a darn cramped space if your bookshelf can only hold five books, which is what the $9.95 per month gets you access to. Using your logic, I can fill my shelf with ORA books for free. Using a library. At least until the publishers find a way to outlaw them.

    17. Re:Safari is your friend by SoupIsGoodFood_42 · · Score: 1
      Just thought I'd add something to that.

      TIFF, GIF, PNG are much better than JPG for large areas of blank space, they are also better for preserving quality (except GIF and 8-bit PNGs since they will lose colour).
      But this doesn't mean that the pages you scan will be small if you use TIFF etc. A straight scan from the scanner may look white, but there will be lots of noise in there, and that won't compress well.

      What I do, is open it up in Photoshop (or app of your choice), and use the levels to 'blow the highlights' (usually a bad thing when editing photos). There will still be a bit of noise left unless you went overkill, but you could use the wand, or use the colour selection, and fill it with white.
      Now you have something that will compress a lot better.

      You can also automate it if you have Photoshop. Let me know if you want details.
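
The effect described above can be imitated without Photoshop. The sketch below builds a synthetic noisy grayscale "page", clips near-white pixels to pure white and near-black pixels to pure black, and compares how well each version compresses with zlib (the same DEFLATE method PNG uses). The page generator, the threshold values and the function names are all assumptions made for illustration, not a reconstruction of the poster's workflow.

```python
import random
import zlib


def make_noisy_page(width, height, seed=0):
    """Fake 8-bit grayscale scan: dark 'ink' blocks on light paper, plus sensor noise."""
    rng = random.Random(seed)
    rows = []
    for y in range(height):
        row = bytearray()
        for x in range(width):
            is_ink = (y % 20 < 8) and (x % 10 < 6)  # crude stand-in for text strokes
            base = 30 if is_ink else 235
            row.append(max(0, min(255, base + rng.randint(-15, 15))))
        rows.append(bytes(row))
    return b"".join(rows)


def blow_highlights(pixels, white_point=200, black_point=60):
    """Clip everything at or above white_point to 255 and at or below black_point to 0."""
    table = bytes(
        255 if value >= white_point else 0 if value <= black_point else value
        for value in range(256)
    )
    return pixels.translate(table)


noisy = make_noisy_page(400, 400)
clean = blow_highlights(noisy)

noisy_size = len(zlib.compress(noisy, 9))
clean_size = len(zlib.compress(clean, 9))
print(f"noisy: {noisy_size} bytes, cleaned: {clean_size} bytes")
```

Real scans will not clean up this perfectly, since text edges contain genuine mid-gray pixels, but the direction of the effect is the same: the less random variation in the paper background, the better a lossless format does.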

    18. Re:Safari is your friend by Ed+Avis · · Score: 2

      The best tool for compressing scanned documents is tic98, written as part of someone's PhD thesis. It is GPLed, but unfortunately the website has disappeared except from Google's cache. Does anyone have a copy of the source tarball?

      --
      -- Ed Avis ed@membled.com
    19. Re:Safari is your friend by Ed+Avis · · Score: 2

      Update: tic98, the tightest (lossless) compressor for scanned documents, is back online.

      --
      -- Ed Avis ed@membled.com
  6. Its been a while.... by PepsiProgrammer · · Score: 1

    I haven't used OCR in about 2 years, but the last time I tried it out, it sucked horribly. It's acceptable for small documents that aren't that hard to proofread/correct, but for huge documents like books, don't expect a huge amount of accuracy.

    --
    "The United States has no right, no desire, and no intention to impose our form of government on anyone else." - Bush 05
    1. Re:Its been a while.... by ryanwright · · Score: 1

      I'm with you. OCR, at least the inexpensive (under $1000) software, is worthless. I found it to be faster to retype the whole stupid document by hand than it was to correct the OCR errors.

      --
      -Ryan, with the unoriginal sig
  7. As Krow always says... by bdesham · · Score: 5, Funny

    You can't grep a dead tree.

    --
    Alcohol and Calculus don't mix. Don't drink and derive.
    1. Re:As Krow always says... by technoid_ · · Score: 0, Troll

      But its easier to read while on the toilet.

      --
      Two wrongs don't make a right, but 3 lefts do - Lew of GO magazine
    2. Re:As Krow always says... by Wanker · · Score: 2

      What, you never heard of "igrep"? ;-)

    3. Re:As Krow always says... by MaxVlast · · Score: 1

      How the hell is that a troll? I read books on the can all the time and enjoy doing it. I found his comment both interesting and insightful. Is it a troll because he said 'toilet' or because you disagree with him? Either reason is a poor one.

      --
      There should be a moratorium on the use of the apostrophe.
      Max V.
      NeXTMail/MIME Mail welcome
    4. Re:As Krow always says... by boyprogrammer · · Score: 1

      I use an old Handspring Visor Deluxe w/ a compact flash adapter and 64M CF card- I can read it on the toilet just fine. I can also comfortably hold it close enough to my face (~10") that I can see it without glasses.

    5. Re:As Krow always says... by WNight · · Score: 2

      It's a troll because it's the same irrelevant statement that luddites trot out every time someone on Slashdot discusses electronic books.

      You like reading on the toilet? Good for you. But to imply that ebooks aren't good because they can't be read while shitting is wrong. Moreover, even if it wasn't wrong, it'd still be a stupid reason to dislike them. Most people can't watch TV while on the can and they still like TV. Nobody mocks TiVo because it doesn't come with a free bathroom TV.

      If you really like reading on the can that much, take your ebook/palm with you, or save a few books/magazines and read them.

      But more importantly, don't tell us about it, especially not in a thread dedicated to exactly the opposite. It's as annoying as Star Wars haters who go into AotC threads to mock fans, or SW fans who go into LotR threads to mock those fans, or Jon Katz haters who feel the need to go into every story he posts and tell everyone how much they hate him.

      Get over yourselves. Your luddite toilet habits are irrelevant to the rest of the world and if you persist in babbling on about them you will be modded down.

    6. Re:As Krow always says... by MaxVlast · · Score: 1

      Hehe. This is why Slashdot will never be taken seriously, and it's why I'm glad I don't take it seriously.

      In other words: Worst followup ever.

      --
      There should be a moratorium on the use of the apostrophe.
      Max V.
      NeXTMail/MIME Mail welcome
  8. Great by Quill_28 · · Score: 2, Insightful

    Now the bookseller's will join with the entertainment industry. Nexty we will be seeing books that can't be scanned easily.

    Remeber those passkeys for computer games in the 80's that were black on maroon paper? Or some dial thingy.

    1. Re:Great by Anonymous Coward · · Score: 0

      There's no apostrophe in "booksellers," you stupid twat.

    2. Re:Great by yintercept · · Score: 3, Funny

      Cool idea. You could sell special 3D glasses with an encrypted pattern that you would have to purchase to read a book. With the print on demand technologies, booksellers might create a system where people have to get a special printing of the book that fits only their encrypted readers. That way you can guarantee that only one person reads the book. You could also create a pretty good database of what people read. This would give you a good idea on who are the subversive elements in society.

    3. Re:Great by CaseyB · · Score: 2
      Remeber those passkeys for computer games in the 80's that were black on maroon paper?

      Even back then, every photocopier I ever tried it on could adjust the contrast so that they could be copied legibly.

      I also remember trading copied templates of the dials that you could cut out and assemble.

    4. Re:Great by maniac11 · · Score: 2

      Sounds like AD&D Tomes of Enchanted whatevers... spend three months studying it and it disappears... Now THAT's something the copyright nazis would love.

      --
      Guvegrra?
    5. Re:Great by Quill_28 · · Score: 1

      And next isn't spelled nexty, seem's you missed that. :-0

    6. Re:Great by Alan+Partridge · · Score: 1

      Well YOU missed the "remember" carnage. Who's laughing now?

      --
      That was classic intercourse!
    7. Re:Great by Quill_28 · · Score: 1

      I admit it, I'm lost. What in the sam hill are you talking about?

    8. Re:Great by Reziac · · Score: 2

      All they need to do is use paper with a very porous and slightly greyed surface, and scans will come out looking like the text was dumped into a gravel pit. Or use the same printing trick as with "security lines" on checks, where it's not actually a line, it's a series of small dots, which don't photocopy worth shit. Generally what won't photocopy well won't scan well either. Of course both schemes can probably be defeated with hires digital photography, but somehow this starts to sound like a very expensive ebook.

      But I'm still cringing from the thought of someone slicing up innocent books. When I moved last year, I brought along two full pickup loads of books (even tho I've also scrounged a lot of 'em as ebooks, and even tho some might have only been consulted once in their lives). Yeah, ebooks weigh a lot less and it's easier to search 'em for specific terms, but an ebook does you no good when the computer won't boot. And it's damned hard to flip around thru an ebook looking for *related* material (like when you have only a vague idea what you're looking for but think it might have something to do with some similar topic).

      I find that when I want to do general reading on a topic, I'm likely to use the handy ebook. But if I really need to locate a solution, especially for an ill-defined problem, the dead-tree version is ultimately more efficient.

      --
      ~REZ~ #43301. Who'd fake being me anyway?
  9. Re:How about... by Anonymous Coward · · Score: 0

    he doesn't want google matches retard, he wants help from people who may have already spent time experimenting with different scanner software. The gift of experience. The whole point of open source, and the "ask slashdot" section.

  10. 100 pounds? by NineNine · · Score: 5, Funny

    That's it? Jesus, what are you, a 12 year old girl? That's 2 armloads. Sounds like you need the exercise, fatass.

    1. Re:100 pounds? by Anonymous Coward · · Score: 0

      use a cart. lol

    2. Re:100 pounds? by zulux · · Score: 5, Funny

      That's it? Jesus, what are you, a 12 year old girl?

      Girl? On Slashdot?

      Woah!

      --

      Moneyed corporations, non-working 'poor' and criminal prisoners are turning productive citizens into tax-slaves.

    3. Re:100 pounds? by Anonymous Coward · · Score: 0

      yeah, really... i mean, face it, women have better sense than to come to a shithole like this, full of morons who have no understanding of civility...

    4. Re:100 pounds? by Anonymous Coward · · Score: 1, Informative

      You haven't seen fat until you see this.

      It's https for some reason. Like someone is going to steal the fat recipes or something...

    5. Re:100 pounds? by mikeage · · Score: 5, Funny

      Jesus, what are you, a 12 year old girl

      To the best of my knowledge, Jesus was not a 12 year old girl.

      --
      -- Is "Sig" copyrighted by www.sig.com?
    6. Re:100 pounds? by daeley · · Score: 2

      To the best of my knowledge, the original poster was not Jesus, 12-year-old girl or not. ;)

      --
      I watched C-beams glitter in the dark near the Tannhauser gate.
    7. Re:100 pounds? by Anonymous Coward · · Score: 0

      Hey damnit, I'm a girl, and I love slashdot!!

    8. Re:100 pounds? by Anonymous Coward · · Score: 0

      Hey damnit, I'm a girl, and I love slashdot!!

      OMG what's your phone number? (raised to the 3rd power and modulo 31337^3, for some sort of privacy)

    9. Re:100 pounds? by NineNine · · Score: 1

      The Net used to be nothing but cool links like this. Too bad it's so diluted now with shit. If I had mod points, I'd mod this up just for finding something as cool as this.

    10. Re:100 pounds? by Dr.+Awktagon · · Score: 2

      Actually, it turns out Jesus was a naked black woman.

    11. Re:100 pounds? by wedg · · Score: 2

      He could get a pair of twelve year old girls to carry the books for him. Hell, they'd only take up one seat in the car. If you give 'em ice cream, they'll probably pack and unpack the books for you.

      --
      Jake
      Dating: while( 1 ){ call_girl(); get_rejected(); drink_40(); } return 0;
    12. Re:100 pounds? by Anonymous Coward · · Score: 0

      Fuck you. Is that civil enough? Last I checked women like jokes as much as men...and just maybe a female who would come to this site would have some kind of understanding of this kind of humorous bashing? Stranger things have happened.

      Oh...and if you are trying to get laid with Slashdot...good luck! Heh.

    13. Re:100 pounds? by Anonymous Coward · · Score: 0

      Why is it that a person (assuming) who pushes an MS porno site would be in such a hurry to slur others?

  11. You are the Bizarro me by Microsift · · Score: 1

    I'm getting tired of buying books only to find out that a LOT of the chapters are on CD in PDF form.
    What's even more annoying is when the PDF doesn't let you print!

    --
    My other sig is extremely clever...
    1. Re:You are the Bizarro me by Anonymous Coward · · Score: 1

      Get Elcomsoft's Advanced PDF Password Recovery :-) And support Skylarov (or something like that)

    2. Re:You are the Bizarro me by Anonymous Coward · · Score: 0

      If you have full Acrobat (I don't think Reader will let you do this), all you have to do is re-save the pdf under a different name making sure that you set the security options to 'none'. Security?! Tee-hee...

  12. You're mad, surely? by fractalus · · Score: 2, Insightful

    Most of my technical books contain vast quantities of useful information in charts, diagrams, and illustrations... which are far more of a challenge to OCR than mere printed text.

    I suspect that even were this sort of thing really possible, it's a major time investment. I have several dozen technical books I'd like to scan, each with four hundred or so pages... and I'm not sure I want to spend a week's vacation time doing it.

    And even were it done... there is just something comforting about having a nice printed book that I can set on the desk next to the computer and consult, without having to read it on the screen. Print still looks way better than monitors.

    --
    People are never as simple as their stereotypes. This applies equally to Christians, Muslims, and Emacs-lovers.
    1. Re:You're mad, surely? by jgerman · · Score: 2

      It's a convenience issue. I'd love to have all my books on CDs so I can either 1) leave them at work and use the dead trees at home, or 2) carry them back and forth each day. There have been plenty of times that I needed a resource that I knew I had at home ("I think something out of the Dragon book would help here") but had no way to access it.

      --
      I'm the big fish in the big pond bitch.
    2. Re:You're mad, surely? by Anonymous Coward · · Score: 0
      Most of my technical books contain vast quantities of useful information in charts, diagrams, and illustrations... which are far more of a challenge to OCR than mere printed text.

      Yeah, and always remember that you must either OCR every single thing in the book, or nothing at all. Perhaps some day someone will discover a way to encode a scanned bitmap as an "image file" that you can display on a computer screen!

    3. Re:You're mad, surely? by Anonymous Coward · · Score: 0


      Could you at least bother to read the question
      all the way through before answering it?

    4. Re:You're mad, surely? by TheDarAve · · Score: 1

      -Snip-
      And even were it done... there is just something comforting about having a nice printed book that I can set on the desk next to the computer and consult, without having to read it on the screen. Print still looks way better than monitors.
      -Snip-

      Especially when the reason you have your book out is because you can't get anything to display on the monitor. :)

  13. Do you really need them? by alt.sex.fetish.jesus · · Score: 4, Insightful

    I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?

    Personally, I have about 3 books I consider _essential_, and I've read them cover to cover (mostly while in the crapper ;-) ). The rest of the time, I get what I need off the web or USENET.

    As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.

    1. Re:Do you really need them? by ComputerSlicer23 · · Score: 2, Interesting
      All depends. I have probably 8 C++ books that have lots of different useful information in them. Really, I probably only need 3 of them: the ISO standard (yes, I own a copy), Stroustrup's C++ Language and Josuttis's book (big black book, can't remember the title).

      I own probably 500 computer books that completely cover a 6ft by 6ft section of my wall. No, I haven't read all of them, but I have read 80% of them cover to cover, and I know the table of contents of the rest. It's generally very useful to keep lots of reference material "grey matter indexed". That is, I know which book to find it in and roughly where it is in the book. Personally, I have found on-line documentation to be of very low quality, and I like to peruse books when I don't have a computer handy.

      The other consideration is that it is nice to know the documentation isn't going to change, or move, or do anything weird. Of course it isn't going to get updated either, so it cuts both ways.

    2. Re:Do you really need them? by AyeRoxor! · · Score: 1

      As a programmer, I do need them. I program in C, VB, and for web-based applications, on top of VB for ASPs, I use HTML, VBScript, and JScript, and every now and then I would like another programmer's idea/perspective on how to tackle a particular task, or simply a refresher on particular commands for the more complex tasks, e.g. database access or whatever.

    3. Re:Do you really need them? by sphealey · · Score: 4, Insightful
      I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?
      Because once you have developed the skill of processing technical books/documentation, you can scan through them and pick up critical information rapidly - far faster than you could click through them as hypertext.

      Case in point: I recently took a position where I had to do some work with Oracle, which I had not used previously. After some skimming at B&N, I purchased 5 good texts. A lot of pages, but when you need to figure something out you can open 2 or 3 of them, mark multiple pages, and get the outline of what you need very quickly.

      sPh

    4. Re:Do you really need them? by elmegil · · Score: 2
      I have a couple dozen bookmarks to stuff on the net about html, cgi, php, etc. I also have a half dozen of the O'Reilly books on similar topics, as well as most of their Perl collection. Which one do I find a quicker way to get at what I know is there? The books, hands down. Web pages tend to be broken up into individual "pages" to "simulate" being books, but don't have good indexing. Google doesn't count as good indexing, except insofar as I can find information that I've never seen before, because if I *have* seen it before typically it's still tough to find the right magic words to get exactly what I saw.

      So the answer is yes, I really need them. And I bet the original poster does too. And see, that's the hard part. He can scan and download and so forth all he likes, but finding a good index replacement is not going to be so easy.

      --
      7 November 2006: The day Americans realized corruption and incompetence weren't addressing 11 September 2001
    5. Re:Do you really need them? by BovineSpirit · · Score: 1
      I like good computer books, they tell you stuff you didn't realise you need to know. 'Running Linux' told me how to write letters using LaTeX, and 'MySQL and mSQL' showed me just enough PHP to get me interested and encouraged me to start playing.

      If you do a Google search you can usually find what to do, but without the explanations that let you know why you're doing it.

    6. Re:Do you really need them? by Darth_Burrito · · Score: 1

      I work in a small company where everyone is required to do and know a little of everything. So far that includes some C/C++, VB, COM, ASP, IIS, Oracle, NT, PL/SQL, and lately even a little *nix. Knowing a little of everything usually leaves you knowing a lot of nothing. It's nice to have a book on the shelf you can go to for any situation, even if you only open each up once every few months. They are more for quick reference than for reading cover to cover.

      Now the people who have entire bookshelves filled with ancient Oracle Tomes, I was never sure whether I should be laughing at them or crying with them.

      As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.


      I agree completely. A lot of times someone comes over and asks me a question, and I just type it in verbatim to deja and then read them one of the ten answer posts that comes up. Still though there is something very useful about being able to search through digital copies of books. Books are a good fall back for problems that are too stupid or too complicated to have been asked about or otherwise resolved in a public forum. Anyways, sometimes the problem is as much about staring at a computer screen too long as it is about whatever is actually causing the problem, so books help there too.

    7. Re:Do you really need them? by Anonymous Coward · · Score: 0

      It depends entirely on the type of job you do. If I programmed in C++ all day, I could get by with three books. Instead, I program in RPG, CL, VB, VBA, VBScript, C++. I use SQL, MFC, HTML, XML, DB2, etc. Supporting 15 departments, each with their own departmental software, it is difficult to keep up without referring to a reference once in a while. I haven't read them all, cover-to-cover, but specific sections I needed to accomplish a specific task.

    8. Re:Do you really need them? by gilroy · · Score: 2
      Blockquoth the poster:

      Because once you have developed the skill of processing technical books/documentation, you can scan through them and pick up critical information rapidly - far faster than you could click through them as hypertext.

      .... at least, until you develop a comparable skill with hypertext. The manner of reading is different but not necessarily inferior. Why does everyone assume that what we've used simply due to technical limits will actually prove to be superior in a new context? You can't grep books -- that already limits them.
    9. Re:Do you really need them? by Waffle+Iron · · Score: 5, Insightful
      Do they actually have time to read them? Or are they more for show?

      Back before the Web when I was a hardware designer, books were a kind of currency that engineering salespeople used to entice you to meet with them. Each chip manufacturer printed stacks and stacks of data books covering their various product lines. They'd give these to the sales reps who would cart them in on dollies to hand out to the engineers who showed up to hear their latest pitch.

      In a way, huge bookshelves with hundreds of books were a status symbol, showing that you'd been around a while and a lot of people thought it was worthwhile to give you books. It was useful to have all of that info available, but few people actually used more than 1% of the data that was on their shelves.

      The instant the chip companies put their chip data on the web, all of those books became totally useless. Now I'm doing software, everything is online, and I can go for weeks on end without picking up a technical book.

      I do sometimes miss the office atmosphere you get from row after row of data books neatly segregated by the corporate logos and color schemes on their spines. It had an important look to it.

    10. Re:Do you really need them? by Anonymous Coward · · Score: 0

      "As a programmer, I do need them." ASP, HTML, script? Are you kidding me? Does the table look better centered or left justified? Should I use a pink or green font? Since when is database access a complex task? Read your little books and go on pretending...

    11. Re:Do you really need them? by Anonymous Coward · · Score: 0

      A book's page numbers don't change depending on how you view them. Make your browser a little wider, and the needed info is in a completely different place.

      Also, some ISPs have notorious downtime. Just because your service provider doesn't have their act together, should you miss a deadline?

    12. Re:Do you really need them? by alizard · · Score: 2

      Most data sheets and application notes are downloadable pdfs at the vendor site. If you know what chip vendors you need, who needs a search engine?

    13. Re:Do you really need them? by Max+Webster · · Score: 1

      If someone needs to know about the Oracle database, I'd point them to tahiti.oracle.com. (And not just because I wrote the code! :-) It's the first system I've used that's been able to entirely take the place of printed docs for a library of any size.

    14. Re:Do you really need them? by BingoBoingo · · Score: 1
      Back before the Web when I was a hardware designer, books were a kind of currency that engineering salespeople used to entice you to meet with them. Each chip manufacturer printed stacks and stacks of data books covering their various product lines. They'd give these to the sales reps who would cart them in on dollies to hand out to the engineers who showed up to hear their latest pitch.

      When I was an engineering student and had to make a microprocessor based design, the only company that showed up with data books for our class was Motorola. As you would expect, most of our class used a Motorola 68000 in our designs. Only one design used an Intel chip.

      I guess Motorola needed the business and Intel figured they didn't.

      ---
      When everything is coming your way, you're in the wrong lane.

    15. Re:Do you really need them? by Rande · · Score: 1
      .... at least, until you develop a comparable skill with hypertext. The manner of reading is different but not necessarily inferior.


      Nope, if I hold the spine of a book in the left hand, and the top right corner in the right, I can bend over the corner, and then move the thumb back, making the entire contents of the book flash past my eyes in 2 seconds, allowing me to very quickly know whether the information I require is in that book. Try that with 200 pages of HTML.

    16. Re:Do you really need them? by gilroy · · Score: 2
      Blockquoth the poster:

      allowing me to very quickly know whether the information I require is in that book. Try that with 200 pages of HTML.


      Or with digital text, I can type in exactly what I'm looking for, press "Search", and find in 0.002 seconds whether it has what I am looking for. Right now, image searches are harder, admittedly, but that's because we haven't developed the important skills.
    17. Re:Do you really need them? by WNight · · Score: 2

      Actually, I'm a programmer, as in that's my job title and I get paid for it. I use C and Perl, with whatever other languages are needed for the task at hand, shell script, VB, PHP, whatever. I also have to code up HTML for any of the web-based apps I make.

      Know what I spend more time looking up? HTML. By far.

      With C I know probably 98% of the language blindfolded, the other 2% is really obscure stuff and thus I rarely need to look for it. Perl is weirder, but I've made a quick-ref card with the forms of various data structures for the really bizarre stuff, that's 90% of my Perl-related research. But HTML... Anything more complex than an anchor, simple table, or bold tag likely has some weird syntax, and also likely works slightly differently on various browsers.

      Must look kind of funny when a programmer they pay to write network simulations and other stuff has a book on basic HTML open while working, but it saves me from memorizing trivial crap.

    18. Re:Do you really need them? by Anonymous Coward · · Score: 0

      Read the reply by WNight. Obviously it's YOU who knows nothing about programming for a living. No wonder you're anonymous. You knew you'd screw something up :P I'm anon cuz fucking moron mods will probably mod this offtopic, even though it's in direct reply to an on-topic statement :P

    19. Re:Do you really need them? by Anonymous Coward · · Score: 0

      Keep pretending? What the fuck does that mean? Anyways, go look up how many possible commands and flags for commands there are in HTML. If it's more than a couple hundred, shut the fuck up, you stupid annoying twit.

  14. Unprintable PDFs by Anonymous Coward · · Score: 0

    That's what Ghostscript is for. :-)

  15. check sane by walt-sjc · · Score: 4, Informative

    Check the hardware list for SANE and then pick one of the fastest scanners you can afford. The DB on SANE's web site is your best bet. You will find that to get good scanning speed you will need SCSI, as USB is just too slow.

    JPEG also sucks for this. JPEG is best for full-color images like photographs. You're better off using TIFF or PNG. Most OCR software will require TIFF. I don't know of any OCR software for Linux, although you might get some Windows app to work under WINE. Textbridge from Xerox isn't bad for the money.
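    For a feel of why the format and bit depth matter before any compression is applied, here is a back-of-the-envelope sketch. The 8.5 x 11 inch page size is an assumption for illustration, not something from the question.

```python
# Rough uncompressed size of one scanned page.
# Assumption: a letter-size page (8.5 x 11 inches); real book pages are smaller.

def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one page image at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

for label, bpp in [("1-bit black and white", 1),
                   ("8-bit grayscale", 8),
                   ("24-bit color", 24)]:
    size_mb = raw_page_bytes(300, bpp) / 1_000_000
    print(f"300 DPI, {label}: {size_mb:.1f} MB per page")
```

    Scanning text pages as black and white or grayscale rather than full color shrinks the raw data by a large factor before TIFF or PNG compression even starts.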

    1. Re:check sane by josepha48 · · Score: 3, Informative
      There is gocr or jocr -> http://jocr.sourceforge.net/

      Also there are a few commercial ones. However, scanned-to-text conversion needs at least 600dpi and is only going to have about 97% accuracy.
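      To put the 97% figure in terms of a searchable index: if that number is read as per-character accuracy and errors are assumed to be independent (both are assumptions, real OCR errors tend to cluster), the chance that a whole word comes through intact drops with its length. A quick sketch:

```python
# Chance that every character of a word is recognised correctly,
# assuming independent per-character errors (a simplification).

def word_ok_probability(char_accuracy, word_length):
    return char_accuracy ** word_length

for length in (3, 5, 8, 12):
    p = word_ok_probability(0.97, length)
    print(f"{length:2d}-letter word intact: {p:.0%}")
```

      Under those assumptions, long technical terms are the ones most likely to be missed by an exact-match search.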

      --

      Only 'flamers' flame!

    2. Re:check sane by *xpenguin* · · Score: 1

      Have you ever actually tried gocr? It segfaults on half of the images.

    3. Re:check sane by josepha48 · · Score: 2
      Yes I have tried gocr, and it did not segfault for me. I actually scanned the image at, I think, 300 or 600 dpi and got it to convert the image to text. It was, however, incredibly inaccurate, as every other word was wrong. It probably would have taken me longer to type the document all over again than to scan it and use gocr, but my typing sucks and I have about as many typos as gocr does ;-).

      I would recommend that for book to text conversion like this person wants -> send it out to a professional service.

      --

      Only 'flamers' flame!

  16. We do this all the time at the office...... by diorio · · Score: 4, Informative

    .....we use a Xerox DC265ST. This digital photocopier scans pages at 65 per minute and posts them to an FTP server inhouse. It can scan at 300 or 600 DPI and you can apply OCR after the scans are done. The DC265 is a workhorse and there are about a million of them out there. The scan back feature is an additional price on the device so not everyone spent the money on that feature....but about 1000 Kinko's have these in house and a Kinko's with a good DTP department might actually even know how to use the feature. (Good Luck!)
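    As a rough idea of what 65 pages per minute means for a whole shelf, here is a small estimate. The book count and page count are assumptions borrowed from elsewhere in the thread ("several dozen" books of "four hundred or so pages"), and the figure ignores cutting the bindings, loading the feeder, jams and OCR time.

```python
# Pure feed time for a stack of debound books at a given scanner speed.
# books and pages_per_book below are illustrative assumptions.

def scan_minutes(books, pages_per_book, pages_per_minute):
    return books * pages_per_book / pages_per_minute

minutes = scan_minutes(books=36, pages_per_book=400, pages_per_minute=65)
print(f"{minutes:.0f} minutes, or about {minutes / 60:.1f} hours of feeding")
```

    So the scanner itself is unlikely to be the bottleneck; the handling around it probably is.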

    --
    Ignored Since 1973
  17. when will it end? by bcnarc · · Score: 1

    This has to be one of the dumbest questions I've seen in a long time. If you're ambitious enough to attempt to scan '100lbs of dead trees' you'd think you'd manage to do some research on your own.

    1. Re:when will it end? by ddillman · · Score: 1

      This has to be one of the dumbest questions I've seen in a long time. If you're ambitious enough to attempt to scan '100lbs of dead trees' you'd think you'd manage to do some research on your own.

      And this has got to be one of the dumbest answers I've seen. One of the ways you research is to ask people who may have encountered your situation previously, that you might learn from their experiences. Gee, that's exactly what this guy is doing... I'd be willing to bet this person has also done plenty of Google time as well.

      --
      Little girls, like butterflies, need no excuse. -- L. Long
  18. ooh.. searchable index... by josquint · · Score: 2

    I don't know HOW many times I've looked at a tech manual (or other paper book for that matter) trying to find something I read a while ago and thought "I wish I could just do a text search to find the 3 or so words I remember seeing..." Sure, the index and table of contents get you part of the way there, but if the author mentions something off-hand in an 'unrelated' section of the book...
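    That "3 or so words I remember" search is easy to sketch once the pages exist as text. A minimal inverted index, assuming the OCR output is already available as a dict of page number to text (the sample pages below are made up):

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page numbers it appears on."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return index

def pages_with_all(index, query):
    """Return the sorted page numbers that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical OCR output for three pages of one book.
pages = {
    12: "The index and table of contents list the major topics.",
    87: "As an aside, the scheduler also honours a nice value.",
    203: "Tuning the scheduler: priorities and the nice value.",
}
index = build_index(pages)
print(pages_with_all(index, "scheduler nice value"))  # [87, 203]
```

    Page 87 is the off-hand mention a printed index would probably skip.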

    1. Re:ooh.. searchable index... by lukew · · Score: 0

      I don't know how many times I've misplaced something in my disgusting brothel of a bedroom and wished I could do a grep for it.

  19. Try one of these... by matthew.thompson · · Score: 3, Interesting
    Canon DR-5020

    Canon's 90ppm high speed scanner - only problem with high speed scanning is that they need loose leaves. Any decent books you have and want to copy will need a Stanley knife taking to the spine.

    Please remember to make decent backups on a long lasting medium with a high chance of recoverability. Failing that place the loose leaf versions with a document recovery firm and take their insurance for the full purchase value of the originals.

    --
    Matt Thompson - Actuality - Insert product here.
    1. Re:Try one of these... by JLester · · Score: 2

      We have a couple of these, they work really well .. almost scary fast!

      Jason

      --
      "FORMAT C:" - Kills bugs dead!
  20. Re:How about... by Anonymous Coward · · Score: 0

    You mean the "do my homework" section?

  21. searchable text versus scanned images by pomakis · · Score: 2, Redundant
    The first question you'll want to ask yourself is whether you want the result in searchable text form or scanned image form. Searchable text is achievable with OCR (optical character recognition) software, but has at least two issues:

    • OCR software isn't perfect, and so errors will occur that you'll either have to live with or correct manually. Good OCR software does some validating against a dictionary, but this doesn't help when the source is highly mathematical, etc.
    • You'll lose figures, diagrams and pictures.

    Scanned images solve these problems, but have two problems of their own:

    • They're not searchable.
    • They're bulky (perhaps 100x).

    Perhaps a hybrid solution exists, but I suspect such a solution will require a lot of manual intervention and tweaking, something you'll want to avoid if your goal is to digitize several books.
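    One low-effort form of hybrid, sketched here as an assumption rather than a known product: keep every page as an image for reading, store the uncorrected OCR text next to it, and use the text only to decide which image to open. The file names and sample text are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    image_path: str  # the scanned bitmap, kept for reading and for figures
    ocr_text: str    # uncorrected OCR output, used only for searching

def find_pages(pages, phrase):
    """Return pages whose OCR text contains the phrase, ignoring case."""
    needle = phrase.lower()
    return [p for p in pages if needle in p.ocr_text.lower()]

book = [
    Page(1, "book/page-0001.png", "Chapter 1  Introducti0n to parsing"),
    Page(2, "book/page-0002.png", "Figure 1.1 shows a recursive descent parser"),
]
for page in find_pages(book, "recursive descent"):
    print(page.number, page.image_path)
```

    Figures survive because the image is what gets displayed, and OCR mistakes (like the "Introducti0n" above) only cost a missed search hit instead of corrupting the text you read.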

    1. Re:searchable text versus scanned images by synx · · Score: 2

      I seem to recall a product from Adobe that makes hybrid PDF files using OCR: text where possible, graphics elsewhere. You get the benefits of both. Of course the software is expensive.

    2. Re:searchable text versus scanned images by turbosaab · · Score: 1

      AFAIK, you can create Adobe PDF files where the image is visible and the OCRed text is "underneath" for search capability.

    3. Re:searchable text versus scanned images by br0ck · · Score: 1

      Google seems to have found a way to search for words within images of catalog pages. Look for the cool little yellow boxes.

    4. Re:searchable text versus scanned images by kalidasa · · Score: 2, Informative

      Acrobat can do this. Just scan it in with Acrobat, then "capture text." Works well with good, clear fonts, and a straight scan (not crooked) from a good scanner, though there's like a 0.05% fail rate per character. Yes, I know that sucks, it's one error a page, but it's survivable.
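      The "one error a page" figure checks out if a page holds roughly 2000 characters, which is an assumption on my part (dense technical pages may hold more). The arithmetic as a sketch:

```python
# Expected OCR errors per page for a given per-character failure rate.
# chars_per_page is an assumed round number, not a measured one.

def expected_errors_per_page(char_error_rate, chars_per_page):
    return char_error_rate * chars_per_page

per_page = expected_errors_per_page(0.0005, 2000)  # 0.05% written as a fraction
print(f"about {per_page:.1f} errors per page")
print(f"about {per_page * 400:.0f} errors in a 400-page book")
```

      That is fine for a search index, but a lot to fix by hand if you want clean text.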

    5. Re:searchable text versus scanned images by Anonymous Coward · · Score: 0

      Even it's not all that accurate.
      Look at all the variations of this particular word.

  22. I like my dead trees by SirWhoopass · · Score: 2, Insightful
    Electronic manuals are great, particularly because of the ability to search them. I certainly use plenty of them.

    Personally, however, I still like printed manuals. Using an online manual means either reducing some windows or switching desktops. With a paper manual I can keep the screen exactly as it is. Higher resolution screens, or the use of multiple screens, are making online manuals much more useful (anyone remember what a pain in the ass it was to try and figure out something with only an online manual on a 640x480 screen?). Occasionally I still manage to fill two 1600x1200 screens with a bunch of stuff I want to keep visible while still reading the manual.

    1. Re:I like my dead trees by agent0range_ · · Score: 1

      Two words: Dual Monitors

      The price of a PCI video card and a 15/17" monitor is far outweighed by the usefulness of having a second display available for reference material.

      Still, I like my printed manuals. You can read them anywhere, as long as you remember to bring them with you. Sure, we will find ways of accomplishing this with electronic gadgets, but it's not the same. Not by a long shot.

      As an added bonus, printed books look good on your shelf long after you have outgrown them.

  23. Electronic format is nice for storage, but... by delphin42 · · Score: 2, Informative

    if you are anything like the computer guys I know (myself included), you'd end up printing out portions of the text whenever you wanted to read them anyway!!!

    --
    -- Adam
  24. I have the same goals - and problems by nurb432 · · Score: 1

    Looking at over 2000 books (and magazines) in boxes in my garage, I'm faced with the same issues..

    Like what sort of scanner, software, etc to use for such a massive collection.

    And how to rationally complete the project... am I looking at having to cut(!) the books for a sheet feeder, or squish them on a flatbed.. Never had much luck with 'page scanners'..

    Am I looking at *having* to buy something like Acrobat to make the scanned pages useful??

    --
    ---- Booth was a patriot ----
    1. Re:I have the same goals - and problems by Anonymous Coward · · Score: 0

      Magazines eh.. Getting a bit sticky so you decide to scan them. You fucking pervert.

    2. Re:I have the same goals - and problems by nurb432 · · Score: 1

      Didn't know back issues of Popular Electronics, BYTE or Omni qualified as pervert material..

      Get a life..

      --
      ---- Booth was a patriot ----
  25. already scanned by Anonymous Coward · · Score: 2, Informative

    Yup. There is quite a lot already scanned. The best places to look are usenet (at alt.binaries.e-book, alt.binaries.e-book.technical, alt.binaries.e-books) and IRC at #bookwarez and #bookz on undernet, dalnet, and irc.nullus.net (and most likely other irc nets as well.)

    You could try making a request in abeb, but the biggest selection in one place is irc. So as long as you are not scared by the interface, that is where I would look first.

  26. Tech books shouldn't be dead tree only. by Anonymous Coward · · Score: 1, Interesting

    Think about it.

    People love books in dead tree format for the most part. You don't really want to curl up with a cup of coffee and a nice monitor. No, you want some good old dead tree.

    But when you're coding, you don't want to curl up with a cup of coffee. You want to sit in a chair and hammer out code while quaffing coffee as if it were, well, coffee.

    Most of the time when I look through books for reference, it's annoying. I'd rather be able to just grep for info.

    Thankfully, at least O'Reilly's catching on to this. :)

    1. Re:Tech books shouldn't be dead tree only. by Jonathan · · Score: 2

      People love books in dead tree format for the most part. You don't really want to curl up with a cup of coffee and a nice monitor.

      Why the hell not? Isn't that what we all do while working?

  27. Re:monkeys ** -- MOD PARENT UP!! -- ** by Anonymous Coward · · Score: 0

    that is all.

  28. JPEG? by Catskul · · Score: 1

    Don't use JPEG; it's not good for text. JPEGs are good for photographs because photographs have predictable gradients. Use PNG/GIF for images with sharp/nongradual edges; you will get better compression/quality that way.

    I like the / character. : )

    --

    Im not here now... Im out KILLING pepperoni
  29. Don't use JPEG. by Bistronaut · · Score: 1

    Use PNG! It's lossless and gets compression ratios that are just as good (unless you are using ultra-lossy compression with your JPEGs, in which case they will be a pain to read anyway). Why do people even use JPEG and GIF anymore? JPEG is only good if you need ultra-high compression and don't care about quality, and the only thing GIF has over PNG is animation.

    Sorry about the rant, but there are so many cool computer technologies that people just overlook. It makes me sad.

    1. Re:Don't use JPEG. by lightray · · Score: 2

      You're right that he should not use JPEG for this, but for the wrong reasons. JPEG is simply the wrong format for images that are not like photographs. Specifically, JPEG is not appropriate for images with high spatial frequencies (i.e., distinct lines and shapes, and a small number of colors). Lossless formats (GIF, PNG, TIFF, etc.) are the appropriate choice for scanned text, diagrams, etc. PNG is not a replacement for JPEG.

      Furthermore, if you want animations, you are overlooking the new, cool computer technology called MNG.

    2. Re:Don't use JPEG. by Anonymous Coward · · Score: 0

      You can actually get animated PNGs as well today. They are called MNGs. Read more here: http://www.libpng.org/pub/mng/

    3. Re:Don't use JPEG. by hal9k · · Score: 1

      Don't use an image-only format at all. How often do you flip through books wishing that you can grep them? Use PDF or some other searchable file format.

    4. Re:Don't use JPEG. by artg · · Score: 1

      You don't really want lossless compression - the information density of text on a page is so low that PNG-style encoding comes out pretty poor.

      Acrobat does quite good compression of scanned pages using a Fax-style (run length encoded) method. This works well unless you have greyscale (rather than line art) images.
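
      A rough sketch of the run-length idea behind fax-style coding, in Python. It only counts alternating runs of white (0) and black (1) pixels on a single scanline. Real Group 3/4 fax coding goes further (it entropy-codes the run lengths, and Group 4 codes each line relative to the one above), and none of that is shown here. The function names are made up for illustration.

```python
def rle_encode(scanline):
    """Return run lengths for a bilevel scanline, starting with a white (0) run."""
    runs = []
    current = 0  # fax convention: the first run counts white pixels
    count = 0
    for pixel in scanline:
        if pixel == current:
            count += 1
        else:
            runs.append(count)  # close the finished run (may be 0 at line start)
            current = pixel
            count = 1
    runs.append(count)
    return runs


def rle_decode(runs):
    """Rebuild the scanline from alternating white/black run lengths."""
    scanline = []
    colour = 0
    for length in runs:
        scanline.extend([colour] * length)
        colour = 1 - colour
    return scanline


# A mostly white line with two short black strokes, as on a page of text.
line = [0] * 40 + [1] * 3 + [0] * 5 + [1] * 2 + [0] * 50
runs = rle_encode(line)
print(len(line), "pixels stored as", len(runs), "run lengths:", runs)
```

      A page of text is mostly long white runs, which is why this kind of coding does well on line art and poorly on greyscale images.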

      What you need is a compression method that is tuned to the characteristics of text : not as complex as irregular patterns (use PNG) and without the smooth transitions common in photographs (use JPEG).

      The upcoming standard for text compression is JBIG. This is somewhere between a raster compression and OCR. It generates a table of encoded images that correspond to the font in use, then encodes the differences between those library templates and the actual scanned image in order to get spacings and other variations correct. Note that it doesn't actually OCR : it just recognises that the page is made up of a large number of regularly-spaced cells containing a relatively small number of basic shapes. There's some open-source work on JBIG : check google for JBIG-KIT. JBIG also recognises that a page may contain regions where a different compression method is more appropriate (pictures in the text) and switches as necessary.

  30. Copyright Infringement? by stickytar · · Score: 1
    It seems that all the "essential" books I have rarely get touched except at those special key moments when they are needed. I can't imagine spending more than 15 minutes trying to adapt these "old" knowledge bases into electronic form unless the collection was SO HUGE and needed to be accessed by a lot of people (i.e., the company library?). So then, what? Where is the fair use policy on this? Do I buy O'Reilly's "Java and XML" book and then copy and cut it up to my heart's content? What do the publishers think about all this?

    Fire trucks!! Start your engines!!

    --
    believing the big bang requires a certain amount of supernatural faith
  31. I want both by peterdaly · · Score: 2, Informative

    O'Reilly has many of their Java books available on CD-ROM, although I only own the dead tree versions of the ones I have in that series.

    On a regular basis, I haul 2188 pages' worth (I just added them up) of QUE's Using Java2 Standard Edition and Enterprise Edition between home and the office. (Speaking of which, go to the link in my .sig and buy some of my favorite books!) That's a lot of weight for two books, and I usually haul around a couple of smaller ones as well: O'Reilly's Perl book and their EJB 3rd edition.

    Not only are all of these books heavy, but I have also yet to find an easy way to cart them around; they don't all fit right in any of my bags.

    I want all of these books on CD-ROM, but not just CD-ROM. Half the books I have INCLUDED a CD-ROM; it just doesn't contain the text of the book. With O'Reilly, I'd buy the CD-ROM version, but I want the dead tree version too. I want to use the dead tree version unless I am working from home, in which case I want to haul home the CDs. I don't think I should have to pay any more for it either: I bought the IP (in the property sense), and I am already paying the price for the wood slices, which includes a silver disk.

    PUBLISHERS, GIVE ME THE BOOK ON THE CD TOO! I spend $100/month or so on tech books.

    -Pete

    1. Re:I want both by Reziac · · Score: 2

      As to toting around those hefty books -- if it gets to where it's really too much to carry, you might check out those small two-wheeled shopping carts -- sort of like a lightweight dolly with a basket. They're about the right size for a stack of large technical books (use a cardboard box, a heavy plastic trash bag, or a towel as a basket liner, to protect the books from getting caught in the wire and from getting dirty while being hauled down the street). They're easy to drag along behind you (little old ladies use them), and the better ones are collapsible and fit easily into a small car's trunk. And they're not as bulky as wheeled luggage.

      For lesser stacks, I found a fishing tackle box that's about the right size for 3 or so of those fullsize tech books; cost less than $10 at WalMart.

      --
      ~REZ~ #43301. Who'd fake being me anyway?
  32. Let me get this straight... by deacon · · Score: 5, Insightful
    You are going to cut up thousands of dollars worth of your "essential" books?

    And put them into an inferior visual format you cannot read without the computer being working and on?

    And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.

    All this just so you don't have to make 3 trips to move your books?

    Mmmkayyy.. (backs away slowly)

    Have you ever heard of a dolly?

    1. Re:Let me get this straight... by Anonymous Coward · · Score: 0
      you are going to spend about 100 hours to do this..

      100 hours? try 1000 hours, or 10000 hours. With skill and experience he might be able to do one book start to finish -- everything finished -- in two days, assuming he is trying to do the job right. What he is suggesting is highly labor intensive -- much more labor than moving some books to a new location. If he ever starts, I bet he quits after 1 or 2 books.

    2. Re:Let me get this straight... by Anonymous Coward · · Score: 0

      I have a dolly! It pees and says "Momma"!

    3. Re:Let me get this straight... by Joe+Tie. · · Score: 1

      Having a choice between lugging a dolly of books around, or having all my reference books stored on a PDA in my pocket, I'd sure choose the latter.

      --
      Everything will be taken away from you.
    4. Re:Let me get this straight... by Anonymous Coward · · Score: 0

      "Ruined" is overstating things a bit. Did you ever hear of the "hole-punch" and "binder"? Not that that makes this a good idea.

    5. Re:Let me get this straight... by Suppafly · · Score: 2

      You aren't necessarily ruining the books; you could easily have them rebound (like old hardcover books at the library often are) or spiral bound. A lot of people cut the binding off of tech books and have them spiral bound so they lie flat without closing.

    6. Re:Let me get this straight... by Anonymous Coward · · Score: 0

      Apparently the guy could save a thousand years of his life if he follows one of the arguments in here. You are all insane.

    7. Re:Let me get this straight... by Hoi+Polloi · · Score: 1

      Are you implying that you have to read every book you own at all times? I also doubt your PDA can hold high-quality (NOT ascii text) scans of hundreds of books.

      Even if you could hold all that info and needed access to all of it you'd still be destroying your primary data source. Sort of like a Xerox machine that turns the original to ashes when you are done.

      --
      It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
    8. Re:Let me get this straight... by lommer · · Score: 1

      And put them into an inferior visual format you cannot read without the computer being working and on?

      This is an oft-overlooked aspect. What will you do when your brand-new dead-tree-replacing box goes berserk and suddenly you can't access the 1000 page "how to fix anything to do with computers" book that you need to fix it? I guess it's time to dive into that dumpster full of loose pages that you just trashed... :-)

    9. Re:Let me get this straight... by jfmiller · · Score: 1

      It seems someone always beats me to the punch. With the exception of average sized ORA books, books don't naturally lie flat. I have intentionally had several books that I use often spiral bound because they last longer (paperbacks do, at least) and they stay open.

      JFM

      --
      Strive to make your client happy, not necessarly give them what they ask for
  33. contact your local school for the blind by veggiespam · · Score: 2, Interesting

    Schools for the blind have been doing this for years, especially with technical books. Many of my V.I. friends would remove the binding and feed the pages through a high-speed sheet feeder to a scanner. Then, the books are proofed by sighted people for OCR perfection. Contact your local school and ask if they already have some of your works in pdf/jpeg/tiff/WordPerfect (yes, lots of WordPerfect). They may be willing to give you some legal copies of your books in exchange for you converting some of the books you have that they don't into blind-readable format (which means you'd have to proof your own book for accuracy, but you're doing that anyway). Basically, you're donating your time for a good cause and benefiting yourself.

  34. In a Word: by bpfinn · · Score: 1
  35. are you sure you want to do this? by binaryDigit · · Score: 4, Insightful

    I think you may be underestimating the sheer enormity of your task. Consider getting sheets to all feed right (a little skew and you're skrewed) and in order (feeder issues: what happens when one page mis-scans/feeds, can you go back and insert it into its proper location?), and handling front to back issues (though I would assume that decent scanning software would take care of this for you). Also, your plan to use JPG might be problematic. OCR is finicky enough as it is; back when we were scanning documents we always used 300dpi TIFF (using Group 3 or Group 4 lossless compression) to get the maximum accuracy rates from the OCR package we were using. And speaking of accuracy, keep in mind that OCR software that has a 97% accuracy rate means that it will flub 3 out of every 100 words; in a book that might contain tens/hundreds of thousands of words, that is a whole lot of errors. Now it's been a few years (6-8) since I've done this kind of stuff, so who knows, maybe things are much better now?
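
    To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the 97% figure is a per-word accuracy, as described above, and the book sizes are illustrative guesses rather than measurements.

```python
def expected_word_errors(word_count, accuracy):
    """Expected number of misread words if `accuracy` is the fraction read correctly."""
    return round(word_count * (1.0 - accuracy))


# Illustrative book sizes only: "tens/hundreds of thousands of words".
for words in (10_000, 100_000, 300_000):
    errors = expected_word_errors(words, 0.97)
    print(f"{words:>7} words at 97% accuracy: about {errors} misread words")
```

    Whether that matters depends on the goal: for a rough search index a few thousand bad words may be tolerable, for readable text they are not.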

    I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).

    Keep in mind that this is such a common need that, if it were pretty straightforward, much of it would be done already (perhaps someone out there has the time/hardware/software to have done some of this already?). Not to mention that, with the web, much of the information contained in those books is now available online, which makes you wonder if it's really worth the time and effort, esp. considering that a great many of the technical books are obsolete two weeks before they hit the shelves.

    1. Re:are you sure you want to do this? by hgh · · Score: 2, Informative
      I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).

      Dr. Dobb's (and I'm sure others) offers CDs full of all their articles from the past couple of years for a pretty good price (less than $100, I believe). They also offer collections of books on CD for about the cost of one original.

      Just a thought,

      hgh
    2. Re:are you sure you want to do this? by Hallow · · Score: 4, Informative

      What he's probably looking for is something like PDF. You can leave the image on the front (i.e., it's what shows up in Acrobat Reader), and Adobe's OCR reads the document and indexes it for searches. The problem with this is, you wind up with big PDFs with poor quality.

      Where I work we tried to turn a book that we no longer had an electronic copy of into a PDF, keeping the images up front with OCR text behind, about 300 pages altogether. Even with max compression and the lowest acceptable DPI (300, I think), the PDF came out to 95MB. It didn't help that we scanned the book page by page and generated the PDF by hand, on a slow HP general consumer model scanner, either (the initial PDF took over 120 hours to produce, with rescans and OCR'ing and everything).

      We wound up taking the Acrobat OCR'd text (it was better than the off-the-shelf OCR package we had at the time) via the Adobe accessibility website, and fixing it up. It was a pretty big project.

      We recently hired a document imaging company to PDF a lot of smaller historical documents for us, and that has worked out well. It's kind of pricey, but we also paid them to proof the ocr behind the images, and to hand adjust the images for appearance. It's worked out rather well.

    3. Re:are you sure you want to do this? by binaryDigit · · Score: 2

      Yes, I was aware of DDJ online. It's cool that they are offering this (and have been for several years now, pre-internet, well, pre-widespread-internet anyway). Two problems though. The first is that this only represents a small portion of the stuff that I happen to have. Old Byte, Micro Cornucopia, PC Tech, Compute, etc., etc. are probably not going to make it any time soon (though I guess Byte might, they already have some).

      The second, and the one that many people don't really think of (and to be honest, care about), is the ads. They matter both as a reference (for many old products, the ad can be the only source of information) and for entertainment value (hey, look at the 20MB MFM Seagate for $1200, not including controller). The ads always get lost when companies put their content online, sigh.

    4. Re:are you sure you want to do this? by Anonymous Coward · · Score: 0
      Unless OCR software is dumber than I think, the accuracy should drop for technical material, which will have a fair number of unfamiliar words. I'm assuming OCR can deal with questionable characters in context, like "algo?ithm" becomes "algorithm" even though it couldn't read the "r".
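
      A toy sketch of that kind of context fix-up, in Python. This is only an assumption about how such a step could work, not a description of any real OCR package: unreadable characters are marked with "?" and the word is matched against a small made-up dictionary.

```python
import re


def candidates(ocr_word, dictionary, unknown="?"):
    """Return dictionary words matching ocr_word, where `unknown` marks an unreadable character."""
    # Each unreadable character may stand for any single character.
    pattern = "".join("." if ch == unknown else re.escape(ch) for ch in ocr_word)
    regex = re.compile(pattern)
    return [word for word in dictionary if regex.fullmatch(word)]


# Hypothetical word list; a real one would be a full dictionary.
WORDS = ["algorithm", "logarithm", "regex", "scanner"]

print(candidates("algo?ithm", WORDS))  # an ordinary word with one bad character
print(candidates("qr/?d+/", WORDS))    # a code-like token has no dictionary entry to fall back on
```

      The second call is the catch with technical material: identifiers and code fragments are not in any dictionary, so there is nothing to correct them against.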

      Just for fun, someone should try OCR on Perl code, preferably with lots of regexes, and see if they get 97%.

    5. Re:are you sure you want to do this? by qrys · · Score: 1

      Um. You probably don't want to do this using scanners/OCR. I used to work for a court reporting company, and we wanted to easily scan in an old deposition (for some reason) and it was a super pain in the ass. And I had a nice HP scanner with sheetfeed and some software some company gave us to try out.

      It was a waste of time. We ended up having someone type it all back in (well, it was a court reporter, so they did their court reporter thing).

      This wasn't something that came up often (since most of the depositions in the last 7 years were in some electronic format anyway) so we didn't have any special 'super expensive' systems for it - which I'm sure would have worked better.

    6. Re:are you sure you want to do this? by sam+the+lurker · · Score: 1

      I had a problem similar to the original question.

      However, I did not bother with the OCR step. (I can't electronically search my paper books right now, and I still find them awfully useful :-)

      I scanned an out of print technical book (no pictures) two pages at a time on a flatbed scanner, 300 pages total, 300 DPI, black and white (NOT grayscale). I did 15-20 scans at a time and spread my time over a couple of days. Yes, this is somewhat labor intensive, but you will reap the savings next time you have to move a single Zip disk instead of "over 100 pounds of 'essential' technical books."

      I saved this as a multi-page TIFF document with Group 4 compression. Total disk space used: 8 MB. Yes, that's right, eight megabytes.

      It looks just like a nice photocopy. What more do you want? The OCR requirement seems to add a significant amount of risk to the project, both in terms of time and loss of data.

      Summary: Black and white scanned images saved in the same format that fax machines use will get you what you want. The OCR is secondary; if you can get anything out of it, great, but you don't need it for the project to be valuable.
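
      For a sense of scale, here is a rough Python calculation of what those scans would occupy uncompressed at 1 bit per pixel. The page count, DPI and 8 MB total come from the figures above; the scanned area per pass (US Letter, 8.5 x 11 inches) is purely an assumption, so treat the ratio as a ballpark.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel scan of the given area."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px // 8


SCANS = 300 // 2                 # 300 pages, two pages per scan
DPI = 300
PLATEN_IN = (8.5, 11.0)          # assumption: each scan covers a US Letter sized area
REPORTED_BYTES = 8 * 1_000_000   # the reported "8 MB", taken as decimal megabytes

raw_total = SCANS * raw_bilevel_bytes(*PLATEN_IN, DPI)
print(f"raw: {raw_total / 1e6:.1f} MB, ratio vs. reported: {raw_total / REPORTED_BYTES:.1f}:1")
```

      Under those assumptions the raw bitmaps come to roughly 158 MB, so 8 MB corresponds to something like a 20:1 reduction.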

  36. While you're scanning my books... by DarkHelmet · · Score: 2
    Oh yeah, I have these 100 dollar bills I'd like you to scan and put in a PDF file... I'm not going to reprint them, honest!

    I just wanna be able to look at the dollar bills on my computer instead of having to carry them with me. Is that so bad?

    --
    /^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i
    1. Re:While you're scanning my books... by uberdave · · Score: 1

      $100 bills have a little foil square on them that doesn't copy. (At least in this country)

    2. Re:While you're scanning my books... by toocoolforsocks · · Score: 3, Informative

      Actually if you sign this little buls**t form they have under the counter, they can copy whatever you want. I should know, I work there.

    3. Re:While you're scanning my books... by DNS-and-BIND · · Score: 2
      Yuh, like a $9/hr Kinko's employee is going to be the myrmidon of copyright law, tirelessly vetting every customer's content before allowing access to the duplication technology.

      Get a grip, they could care less what you do as long as you don't cause them any extra work.

      --
      Shutting down free speech with violence isn't fighting fascism. It IS fascism!
    4. Re:While you're scanning my books... by Martin+Blank · · Score: 2

      Most color printers and copiers are designed such that they will not properly reproduce certain colors needed on American currency. Many of them also imprint invisible watermarks unique to each printer, providing a potential method to track a counterfeiting attempt to an address, or failing that, to a suspect's printer should an investigation pin one down.

      --
      You can never go home again... but I guess you can shop there.
  37. Re:OCR has improved by ovit · · Score: 0

    I don't believe he was recommending plain text as the digital format...

    What I read was to use JPG, i.e. images of each page. OCR is just to provide indexing...

  38. Call the cops! by Anonymous Coward · · Score: 0

    Isn't what you are planning on doing technically a copyright violation?

    1. Re:Call the cops! by davidmccabe · · Score: 1

      No, they can't keep you from format-shifting that which you have bought from them.
      It's completely fair use, <jest>thou brainwashed</jest>.

  39. PDF and OCR by 4/3PI*R^3 · · Score: 2, Interesting

    If you really want to go through all this effort use both PDF and OCR.
    OCR sucks royally for large documents, documents with images or diagrams, handwritten comments, etc. However scanning the pages to an image and then creating a PDF of the images does not care about any of that.
    So, scan all of your books as images that your OCR software can process. Use the OCR output to create an index of pages. If a specific word on a specific page doesn't OCR well who cares. With typed and professionally printed books your OCR software should be about 90% accurate. Take the images and create PDF files.
    Now you have your nice clean images but you still have a searchable index. BTW, when you get this done post your procedures, problems, and solutions to a web site somewhere so that you can share your experiences with the rest of the world.
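
    A minimal sketch of that index step in Python, assuming the OCR text has already been saved per page. The sample text, page numbers and the name build_index are hypothetical. A word the OCR mangles simply ends up under its mangled spelling and does no further harm.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted list of page numbers it appears on.

    `pages` is a dict of {page_number: ocr_text}.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Hypothetical OCR output for three pages, including a typical misread ("po1nter").
ocr_pages = {
    41: "A pointer holds the address of another object.",
    42: "Pointer arithmetic scales by the size of the po1nter target.",
    43: "Arrays decay to pointers in most expressions.",
}

index = build_index(ocr_pages)
print(index.get("pointer", []))  # pages to open in the image-based PDF
```

    The lookup returns page numbers, and the page images themselves stay untouched, which is the point of keeping the index and the scans separate.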

    1. Re:PDF and OCR by Bighund · · Score: 1

      FWIW, this is the way a lot of us lawyer-types do it, by creating an OCR/PDF combo file. This is done all the time in litigation discovery. A stack of documents comes in, they're scanned to PDF and OCR'd at the same time to create a searchable index while preserving the original appearance. Sometimes we auto-Bates stamp them too via an Acrobat plug-in, to give a numerical index to a stack of otherwise non-indexed documents. But you'll have the original page number per book to work with, so this step won't be necessary. This gives you the dual benefits of having an identical page image (PDF) plus a searchable index (OCR). It doesn't matter whether the OCR is 100% accurate or not, as you'll likely catch most words and that'll be enough to lead you to the general area of the text where you'll find your answer. Plus, the PDF will allow you to still use the original publisher's page index and table of contents when you need it. Because you'll be working from nice clean page texts, your OCR accuracy rates should be pretty high, too.

  40. Start with google. by bluGill · · Score: 2

    Start with google. There is a lot of technical information online, and google will find it. Not as good as those dead trees, but if you can find it and it is accurate, google is often easier than searching indexes. Best of all, dead trees are limited to the ones you own, while google is limited to whatever someone found useful to put online.

    Note the last line of the above: google is limited to what someone else finds useful to put online. So if you can't find it on google, take some time to put it online for the rest of us. If/when you find yourself going back to the same few sites often, link to them from your homepage so google knows you find them useful. In other words, google is interactive, make it work for you and it will work for everyone. The internet is not a one way street.

    Finally, some things are just plain easier to look up in dead tree format. I would strongly recommend you keep your books intact. Put the information you need on the web (what you can do legally), and keep the books for the rest. If you find you are not using a book anymore because all the information is on the web (including what you put there), then throw it out. My monitor is only 19 inches, not nearly enough to hold all the information I have scattered about my desk.

  41. Re:How about... by Anonymous Coward · · Score: 0

    No, he means the "can anyone please change my damn dirty diapers?" section, of course.

    What? You haven't noticed this section yet?!?!

    Blimey.

  42. typewriters use paper by IIRCAFAIKIANAL · · Score: 1

    So along with his current books, he would have an infinite number of pages that contain some of the works of shakespeare?

    What you need is an infinite number of monkeys with an infinite number of computers

    ...Silly old bear

    --
    Robots are everywhere, and they eat old people's medicine for fuel.
  43. Blackmask.com by KelsoLundeen · · Score: 2
    Blackmask.com

    Tons and tons of e-texts. In multiple formats: text, pdf, lit, HTML.

    Excellent resource!

    1. Re:Blackmask.com by Jack+William+Bell · · Score: 2

      First I thought about modding your post up. Then I went there and looked and afterwards I considered modding you down instead. (I have mod points right now.)

      Why? Because the Blackmask site you refer to has few or no books of the type referred to by the original post. There does seem to be a lot of cool content there, but most of it is stuff you can find just as easily on the Project Gutenberg site or elsewhere.

      So basically your post is somewhat off-topic, almost cool, but not really cool enough to merit a mod up despite the off-topicness of it. If I had wasted a down-mod point on you, someone else would have meta-modded it badly because they probably wouldn't know why I modded as I did. And, as I said, I just don't think the link is worth the mod up, despite the fact such a mod would probably survive a meta-mod.

      All this points out one interesting fact about meta-modding -- it may work better than its critics give it credit for! At the very least it makes a subset of the moderators (a subset with at least one member, me) think twice before bestowing mod points either way. Note that I often lose mod points when the time runs out because I just don't find anything truly worthy of moderating.

      Jack William Bell, who fully expects someone will mod this down as 'Off-topic'...

      --
      - -
      Are you an SF Fan? Are you a Tru-Fan?
    2. Re:Blackmask.com by Anonymous Coward · · Score: 0

      Who the fuck cares? You dumbass, pompous motherfucker.

  44. FAQ: Making Etexts from Paper Originals by ancarett · · Score: 2, Informative

    Anders Borg wrote this FAQ from Project Gutenberg. Lots of field-tested advice there, such as a suggestion to scan at 300dpi or better.

    --
    ancarett, historian and zombie gamer
  45. Goatses evil twin by Anonymous Coward · · Score: 0

    Heres someone else doing the same thing As goatse.cx!

  46. YES, BUT... by Anonymous Coward · · Score: 0

    you'd also need an infinite amount of bananas...

    what a mess, all the flies... oh, the humanity...

    um, wait...

    nevermind.

  47. Somewhat on topic... Historical Papers by Embedded+Geek · · Score: 3, Interesting
    My father passed on Sunday and we were going through all the family papers. We have lots of original documents from my family during the Civil War and earlier. My sister and I were thinking of donating them to a museum, so there would be no risk of their loss should my house get damaged (there's way too many documents to fit in my fire safe).

    Before doing this, though, we were thinking of scanning/copying all the documents to keep copies for ourselves. In doing so, though, we could use some advice:

    What special steps must we take in scanning 150+ year old documents, some very yellowed and fragile?

    What is the best format in which to store them (assuming we want them easily readable in 20+ years for our kids)?

    What is the best media upon which to store the data (again, hoping for readability in 20+ years)? (I'm thinking online storage to allow easy conversion to the media of the moment, but I still want something to stash in the safe deposit box)

    Does anyone have experience with digital preservation/restoration of archival documents? Should I just try cleaning it up in Photoshop or should I find a pro to help out? Maybe I can make it a term of the donation to the museum/library, for that matter.

    Thanks in advance for your advice.

    --

    "Prepare for the worst - hope for the best."

    1. Re:Somewhat on topic... Historical Papers by catfoo · · Score: 1

      1. Contact your local state college/university library and ask if they have a "special collections" department. They can tell you all kinds of stuff.
      2. TIFF and PNG, with searchable copies in PDF (for text).
      3. CD + tape + hard drive (one backup is never enough).
      4. Google it... "Document Conservation" http://www.nedcc.org/

      --
      no sig today, come back tomorrow
    2. Re:Somewhat on topic... Historical Papers by ancarett · · Score: 2, Informative

      I highly suggest you consult an archivist or a librarian trained in archival management. Nineteenth century paper products are notoriously fragile (a result of the switch from rag pulp to acidic, unstable wood pulp). If you don't have the facilities to store these properly, donating them to a local museum or archive is a wonderful idea.

      The National Archives and Records Administration has a FAQ. Their advice on preserving family papers? --

      Paper preservation requires proper storage and safe handling practices. Your family documents will last longer if they are stored in a stable environment, similar to that which we find comfortable for ourselves: 60-70 degrees F; 40-50% relative humidity (RH); with clean air and good circulation. High heat and moisture accelerate the chemical processes that result in embrittlement and discoloration to the paper. Damp environments may also result in mold growth and/or be conducive to pests that might use the documents for food or nesting material. Therefore, the central part of your home provides a safer storage environment than a hot attic or damp basement.

      Light is also damaging to paper, especially that which contains high proportions of ultra violet, i.e., fluorescent and natural day light. The effects of light exposure are cumulative and irreversible; they promote chemical degradation in the paper and fade inks. It is not recommended to permanently display valuable documents for this reason. Color photocopies or photographs work well as surrogates.

      --
      ancarett, historian and zombie gamer
    3. Re:Somewhat on topic... Historical Papers by Anonymous Coward · · Score: 0

      Make paper copies. Paper lasts longer than any other media format. CDs, DVDs, et al will be obsolete in 20 years, even though they'll last 100. Tape drives have always been around, but they aren't as stable. Your original documents, however, have obviously lasted over 150 years .... so copying them over to a long-lasting, high-quality paper is the best choice. Unless something REALLY dramatic happens in 20 years, your kids should be able to read paper documents without any special equipment.

    4. Re:Somewhat on topic... Historical Papers by Seanasy · · Score: 3, Informative

      If you really want to do it right, do it on film. Either pay someone or beg/borrow/steal a medium format camera and try to do it yourself. Film and archive quality prints will probably last longer than CDs and you can get good scans from the negatives if you want digital, too.

      I believe libraries use uncompressed TIFF files for digital archives.

      You might find some discussions of this on photo.net

    5. Re:Somewhat on topic... Historical Papers by Grail · · Score: 1

      Be careful - the high intensity light from the scanner is damaging to old paper. Even a strobe/flash from a film camera can hurt the paper.

      The best media for long term storage is mylar tape - you know, the stuff with holes punched in it.

      And yes, I'd leave it to the experts. This might even be the kind of project that a University student would want as part of their studies relating to the preservation of cultural materials.

      FWIW: I found a bibliography claiming to deal with Archives and Digital Longevity at http://scholar.lib.vt.edu/theses/archivebib.html

    6. Re:Somewhat on topic... Historical Papers by toast0 · · Score: 2

      The best format to store it on is paper.

      As you have experienced, paper lasts 150+ years. 8" floppies from only a few tens of years ago are essentially unreadable now. In 50 years, who knows if we can read CDs. That being said, there's nothing wrong w/ storing on a computer as well as paper.

      As for formatting for computer storage... I'd guess any format with readily available documentation would work. Be sure to include that documentation as a plain text file on the media as well. Just because PNG or JPEG is a big standard now, doesn't mean that it'll be in use in 20 years, but plain text never dies, and having the specification for the graphics format could facilitate the writing of a viewer at a later date. If you're using a CD, you might as well make it a bootable cd w/ a small viewer program set to auto run... Judging by recent moves by AMD and Intel, x86 code will never die, but I wouldn't rely on that alone. :)

    7. Re:Somewhat on topic... Historical Papers by Grail · · Score: 1

      And I also found an article about archival storage of the original work.

  48. Abuse of hardware resources! by Anonymous Coward · · Score: 0

    Listen, chump! Scanning books and storing them online is a waste of hard drive space! Space which would be better utilized for illegal MPAA and RIAA copyrighted material, porn, games, warez and other materials.

  49. paper is superior by Anonymous Coward · · Score: 0

    1. You can keep paper docs next to your computer while you work without having to juggle applications on your PC.
    2. It's much easier to read paper docs while you take a dump or lay in bed. I know that you can use a laptop while shitting or laying in bed, but a book is easier. Save the toilet/bed laptop sessions for important IRC chats.

  50. EEEWWW HE SAID THE 'K' word by chewedtoothpick · · Score: 0

    oh my god... people actually use that word any more? We are talking a word that in the computer world is worse than all the old and slang curse words multiplied by their cumulative power... Are we going to let him get away with it?

    --
    Erutangis ym si siht.
  51. I agree by IIRCAFAIKIANAL · · Score: 1

    Donate them to your local library if they are still relevant.

    I just donated a bunch of books myself.
    Another strategy may be to only scan the stuff you need out of the books.

    I just wish I could get rid of all of the leftover records/reports/legacy app documentation in my office.

    --
    Robots are everywhere, and they eat old people's medicine for fuel.
  52. Hardware by egad_man · · Score: 1

    It's very simple: just take 1000% of the recommended dose of ginseng, add in about 1 gram of caffeine, and then sit down to read the books. Just download it all into your brain, the caffeine to go quickly and the ginseng to remember it. If that doesn't work, then just hit yourself in the head about 100 times with each book; through the process of osmosis/diffusion you will absorb all the information that comes out. Both worked for me with Moby Dick and Great Expectations in high school.

    --
    Hmmm, I have 5 mod pts, its time to metamod, and on top of that I have to meta-metamod? When do I get to read slashdot?
    1. Re:Hardware by Anonymous Coward · · Score: 0

      when you wanna do a brain dump to free up some memory, jest toke a fat British Columbian jizzoint and play video games for a week. worked for me, now my melon's perty empty an' there's a whole lotta free space in there to fill with good stuff like pr0n and trivia. and gambling odds.

  53. Hauling Trees around by DarkHelmet · · Score: 2, Funny
    I'm tired of lugging around dead trees

    Call Paul Bunyan. Cause he's a lumberjack and he's okay!

    --
    /^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i
    1. Re:Hauling Trees around by electric_penguin · · Score: 1

      Call Paul Bunyan. Cause he's a lumberjack and he's okay!

      I believe Babe his big blue ox did the hauling.

  54. Re:fp by Anonymous Coward · · Score: 0

    Apparently, it's vapourware...

  55. Electronic versions from the publishers by truthsearch · · Score: 2

    Have you tried contacting the publishers directly? Or maybe the companies that created any of your software documentation? I know that some companies have PDFs of their manuals and other books, but don't make it well known. They don't usually offer them for free download, but if you prove you have a hard copy some companies will tell you how to get a PDF version. This works especially well for lost instruction manuals, which you can always get for free.

    One good, but old, example is Oracle. Back in the day my company had megs of PDFs of all of Oracle's documentation. There was a main index PDF with links to basically every other possible document. I don't recall Oracle leaving them open for download on the internet. We got them on CD. But it was easy to get since they knew we were a customer.

  56. Aargh! Flashbacks! The pain, the pain... by mccalli · · Score: 2
    The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

    Right then. In 1993/4, this is what I did for a living. The company I worked for did quite a lot of this, and one contract in particular sticks in my mind - the digitising of all books in the French National Library.

    No doubt the equipment we used has moved on in the intervening decade however. We used Bell & Howell scanners fitted with automatic document shredders. Err...feeders. Yes, automatic document feeders. Not shredders at all. No. Honest.

    You see, these were high-speed scanners, and some of the books we received were quite old. Me and the other coder on the project got really quite good at doing "pit stops", or changing the rubber wheels that drove the ADF. What I'm saying is no disrespect to the scanner company - it was the quality of the paper we had to put through it that caused the hassle. Some books, like the 18th century Academie Francais records, were so thin we had to photograph them and scan the photos.

    We then scaled, OCR'd, deskewed and indexed the results on decent machines - 25Mhz 486SX, 4Mb RAM and Kofax graphics cards. Everything was then tarred up to DAT.

    Hardware moves on, but I'll bet the amount of work remains the same. Do not underestimate the preparation required, and also the amount of QA.

    Oh, and don't use JPEG. Lossy compression on text? Use TIFF - the image processing industry standard.

    Cheers,
    Ian

  57. re:isnt that illegal by Anonymous Coward · · Score: 0

    To make a copy such as this, which would be easily transmittable, seems to me you will run up against copyright laws. Simply put, it's illegal, no grey issues there.

  58. From the alt.binaries.e-book FAQ by robert0122 · · Score: 1
    The abeb FAQ lists several options. Here are two links:

    http://www.slack.net/~hermit/ebook/
    http://www.slack.net/~hermit/ebook/documents/page-4-1.html

    I recommend first looking for them online. I use irc.bookwarez.net #bw and have found many computer books (of course, I only download the ones I own. ;-) )

  59. Free the monkeys! by sydb · · Score: 4, Funny

    Isn't an infinite number of computers enough?

    cat /dev/random > ebooks

    --
    Yours Sincerely, Michael.
    1. Re:Free the monkeys! by carlos_benj · · Score: 1

      Isn't an infinite number of computers enough?

      I'd like to see a beowulf cluster of those.... Oh, wait.....

      --

      As a matter of fact, I am a lawyer. But I play an actor on TV.

    2. Re:Free the monkeys! by Anonymous Coward · · Score: 0

      USE urandom! SAVE ENTROPY!

    3. Re:Free the monkeys! by astroboscope · · Score: 0

      You should save entropy by using gunzip, or bunzip2 if you prefer. The output of a good compressor is random (anything nonrandom is a pattern, all patterns should be compressed...), so feeding random bits to a good uncompressor should give you lots and lots of interesting stuff.

      --
      If we were ants living on a Rubik's cube, differential geometry would be a little more confusing.
    4. Re:Free the monkeys! by sydb · · Score: 2

      So why does this not work?

      bash-2.05a$ gunzip /dev/urandom
      gunzip: /dev/urandom is not a directory or a regular file - ignored
      bash-2.05a$ dd if=/dev/urandom of=test.gz count=1024 bs=1024
      1024+0 records in
      1024+0 records out
      bash-2.05a$ file test.gz
      test.gz: data
      bash-2.05a$ gunzip test.gz

      gunzip: test.gz: not in gzip format


      (I'm just joking...)
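
      (For anyone wondering why it really fails: gunzip looks for the two-byte gzip magic number, 0x1f 0x8b, before it tries to inflate anything, and random data almost never starts with it. A minimal Python sketch of that check; `looks_like_gzip` is a name made up for this example, not part of any library.)

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


real = gzip.compress(b"dead trees")
fake = b"\x00\x01" + real[2:]  # same payload, header bytes clobbered

print(looks_like_gzip(real))  # True
print(looks_like_gzip(fake))  # False

try:
    gzip.decompress(fake)
except OSError as exc:  # gzip.BadGzipFile is a subclass of OSError
    print("not in gzip format:", exc)
```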

      --
      Yours Sincerely, Michael.
  60. I'm tired also by Anonymous Coward · · Score: 0

    "I'm tired of lugging around dead trees."
    I'm tired of that stupid term, "dead trees". I challenge you to come up with one good reason not to call them books that could be more important than being easily recognized, being shorter to type, and not sounding like a jackass.

  61. Re:OCR has improved by DEBEDb · · Score: 0, Offtopic

    Everything does double every 18 months, you know.

    Stock prices especially...

    --

    Considered harmful.
  62. DjVu format is Wavelet Based with OCR by elucidus · · Score: 1

    DjVu format is Wavelet Based with OCR
    I have tried the Solo version; file sizes were incredibly small. A nice feature is that the TIFF is compressed with a wavelet compression technology, and if you buy the full version you can add an OCR layer, which means you can text search. The sample content demonstrates the system's capabilities.

    Many libraries are converting rare books and manuscripts because it preserves the original image so well.

    --
    This sig is self referential.
    1. Re:DjVu format is Wavelet Based with OCR by juhtolv · · Score: 1

      Their software is also available as free software: DjVuLibre

      --
      Juhapekka "naula" Tolvanen - http://iki.fi/juhtolv
  63. a girl by Anonymous Coward · · Score: 0

    I know a girl, does that count... Sorry I lied, but I did know a girl, she was on the bus and she asked me the time and I said 2:30. The good old days.

  64. printing electronic docs is for amateurs by shaldannon · · Score: 2

    Real men use the command shell and man() or google ;)

    Seriously, most of the hard-core computer folks I know either open their copy of the ORA book on the subject, steal their neighbor's copy and flip it open, or use some form of online docs w/o printing said docs off. The only reason I've ever known anyone to print anything resembling a doc is when someone I knew had assembled a binder full of pages on tech specs for a project.

    It's just a lot easier to sit at the screen arrowing up and down on the doc than it is to print it, reach over to the printer, pull it out, shuffle through it....and then eventually have to take it out with the trash. I've seen comments about paperless offices vis a vis paperless restrooms, but the fact is that for reference there really isn't a reason to print the online doc.

    --


    What is your Slash Rating?
    1. Re:printing electronic docs is for amateurs by carlfish · · Score: 2

      I tend to print important documentation. Printed documents are:

      1. Easier to read, at least until monitor resolution increases a great deal. Also, I find back-lit screens much harder on the eyes. You can generally read paper documents at about twice the speed you read screens, I believe.
      2. More ergonomic. You can hold them at any angle that feels comfortable.
      3. More flexible. It's easier to attach notes to them, fold them, tape them up on the wall so anyone can see them as they pass, or write "This is all a load of crap" on them in big red marker pen if you disagree.
      4. More convenient. You can read them on the train home from work without investing in a reader. You can take them into a meeting. You can hand them to the guy at the next desk in a tenth of the time it'd take him to load the document himself.
      5. More insistent. If you leave printed documentation on someone's desk, they're more likely to read it than if you stick it on the fileserver.

      Charles Miller

      --
      The more I learn about the Internet, the more amazed I am that it works at all.
  65. Weight or Money? by andyapple · · Score: 0

    From what i can gather there's something that needs to be cleared up: £100 or 100lb?

    --
    Andy
  66. I'd be happy with.. by geekoid · · Score: 2

    ..an index of the book on my system. Just a table with all the words and which pages they appear on. Pretty useless without the book, since it would be practically impossible to create the book from it, and it would be damn convenient.
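
    Something like this could be built from OCR output in a few lines. A rough Python sketch, assuming the OCR text is already available as one string per page; `build_index` and the sample pages are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted list of page numbers it appears on.

    `pages` is a list of strings, one per page, in page order (page 1 first).
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


pages = [
    "The scanner feeds each page through the ADF.",
    "OCR turns each page image into text.",
]
index = build_index(pages)
print(index["page"])     # [1, 2]
print(index["scanner"])  # [1]
```

    Since the table only records which words occur where, not their order, OCR mistakes just cost you a few missed lookups.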

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
  67. Hell yeah by Anonymous Coward · · Score: 0
    I don't know about you but I read a *lot*. I've got an apartment full of books (technical and otherwise), and more in storage. Once I've read a book I'm not likely to read it cover-to-cover again, but I keep it around because I'll refer to it from time to time. They take a lot of space, and are a hassle to truck around when I move.

    Meanwhile, I've got this nifty little device which is capable of storing more text than I'll read in my entire life....the most current programming books I'll keep in physical format, but I'd dearly love to compress everything else into that little box.

    1. Re:Hell yeah by hyperstation · · Score: 1

      same here, and i'd sure as hell (with current tech) rather be reading a wood pulp format book than a digital book anyday. after you destroy all of those books in order to scan them, what will you have to take to the shitter with you? think.

    2. Re:Hell yeah by Anonymous Coward · · Score: 0
      after you destroy all of those books in order to scan them, what will you have to take to the shitter with you?

      A new book. Like I said, I don't re-read the old ones too much, just refer to them now and then.

    3. Re:Hell yeah by Jason+Earl · · Score: 2

      I have outfitted my $100 Visor Handspring with a Compact Flash springboard module and now I can carry around over 100M of books in my shirt pocket. The darn thing is even backlit so that I can read in the dark. What's more I can search for keywords, and annotate the books to my heart's content.

      What really settled it for me was when I started reading Structure and Interpretation of Computer Programs on my Visor and could do the example programs in LispME.

      Needless to say I prefer my Visor over the dead tree version for any book that is text heavy.

  68. I Like Dead Trees by shanestyle · · Score: 0

    I have a hard time reading content on a screen; I'd much rather open a book and make notes in its pages. That's just my personal taste. I totally see the benefits of a digital book, and I suppose someday that might be all we really have. But until then, I will keep buying books.

  69. Just wait... by chrisatslashdot · · Score: 1

    I figure college students will catch on to this idea soon. A few students could go in together on a book, saw off the spine and distribute the chapters. Each one scans a small section then they reassemble the pieces, burn a stack of CDs and sell/give them to the class. I could have bought a cheap 7 pound laptop several times over for what I spent on hundreds of pounds worth of books.

    --


    Simple people talk of people, better people talk of events, great people talk of ideas.
  70. #bookz... by MrSeb · · Score: 1

    Go check the #bookz channel on Undernet IRC.

    It's like the equivalent of Napster for books - and it's still in its early stages... everything goes.

    They also have 'scanathons' where you all start scanning a book at the same time, and the person that finishes scanning first gets... er... the kudos for being the fastest scanner...

    1. Re:#bookz... by Graspee_Leemoor · · Score: 2

      Wouldn't it be more productive if they divided the number of pages by the number of entrants to this sad "scanathon" and saw who finished first ? That way no work would be duplicated.

      If you're going to rip off books, at least be efficient!
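
      The split itself is trivial to compute. A quick Python sketch, assuming contiguous page ranges and equal scanning speed; `split_pages` is a made-up name for this example.

```python
def split_pages(total_pages, scanners):
    """Divide pages 1..total_pages into contiguous ranges, one per scanner.

    Returns a list of (first_page, last_page) tuples. Any remainder is spread
    over the first few ranges so no two ranges differ by more than one page.
    """
    if scanners < 1:
        raise ValueError("need at least one scanner")
    base, extra = divmod(total_pages, scanners)
    ranges = []
    start = 1
    for i in range(scanners):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more scanners than pages
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(split_pages(500, 4))  # [(1, 125), (126, 250), (251, 375), (376, 500)]
print(split_pages(10, 3))   # [(1, 4), (5, 7), (8, 10)]
```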

      graspee

  71. 4DigitalBooks 900 pages/hour - or do it yourself by jukal · · Score: 4, Informative

    I do not have any experience with their products, but the solution offered by this company seems simple and functional. Their system consists of an apparatus that turns pages of your book automatically, scans, turns, scans, turns. The result you can naturally pass to OCR.

    Now, if I were to digitize all my books, I would try to create the 4DigitalBooks kind of solution myself. The only tricky part is finding a cheap enough way to turn pages automatically; see also Kris Mckenzie's automatic page turner. Still, the best start is this document, which is a proposal and overview of how to build an automatic page turner from parts; the total cost is $459.

  72. The Wearable Comp. Guy at Georgia Tech does it... by gte910h · · Score: 1

    ...so he can read them on his wearable. Why don't you ask him? (Thad Starner)

    --
    Want to see every step I took to start my company? http://www.rowdylabs.com/blogs/pitchtothegods
  73. start with a high speed Fujitsu scanner by loneoak · · Score: 0

    both Fuji and Bell & Howell make some tremendous scanners - even on eBay the good ones go for a chunk -look for a Fujitsu 3096E 11x17 high Speed Scanner Item # 2022530016. You don't need to OCR everything if you're gonna take in the whole book(s) intact, just readable images, you'll have page numbers that accurately relate to the TOC and index... and if you absolutely must have the stuff OCR'd this equip can do that too, just takes longer...

  74. (requisite RIAA slam) by mlibby · · Score: 1

    just be prepared for the DTIAA (Dead Tree Industry Association of America) to come after you for violation of the DMCA. it's circumvention of the "ink in fibrous media" encryption scheme, don't you know.

    heck, even just talking about how you would do it means everyone posting to this thread is busted now.

  75. Write your own ... ? by Annamite · · Score: 1

    Used to do this for a while. Document management segment. The software out there kinda sucks. It is better if you can write your own, in the "embrace and extend" direction of available components out there. Even better if you plan to do a lot of this, go all crazy with all the bills, credit card statements and daily papers and whatnots :-)

    Kofax sells a compression board + APIs to help you with all the deskew, strengthen your images and stuff. The APIs that I used were for Windows only, although they come with full support for C and VB.

    Bell and Howell scanners work with Kofax products like a charm. Some models can scan like 30-40 pages/min for a moderate amount of money. Better scan as compressed TIFF. It handles multiple pages much better.

    OCR batches can be scheduled to run as soon as you finish scanning the docs.. I forgot what the package is called. But the current technique has prolly improved a lot since I was doing it ('97)

    Add in a database and some indexing capability, you can actually build a business around this (we did).

    Why all this trouble to make your own? Well if you are a good coder, you can extend and add any functionality that you might want. Cookie-cutter packages also ain't good enough or cheap enough.

    Now you might want books only, but what about billings? statements and even integrate other digital forms of documents? And indexing and searching ... By all means go crazy on it .. it is fun .. :-)

    Annamite

  76. Dual monitors could wean me from dead trees by UsonianAutomatic · · Score: 2

    Reading over these responses I realized what it is that bugs me most about having a reference manual in PDF or some other electronic format versus having a nice book in my lap: I don't have the screen real estate for both a document reader and whatever app it is I'm using the reference for.

    The endless jumping between windows gets old real fast, especially if I need to copy a code snippet out of a document (like a PDF) that won't let me select & copy text.

    But if I had a second monitor right there at eye level, I could just open up the reference doc there. No more switching between windows, and no more neck strain from constantly looking down at a book in my lap and then up at the screen.

  77. How to get around disabled printing by Anonymous Coward · · Score: 0

    There is a trick in windows to get around the PDF no print. You need the full version of acrobat. Open the disabled print pdf and select the acrobat printer, I forget the name. But anyways you will notice the print menu greys it out to disable printing, but if you switch back and forth between the acrobat printer and another printer the print button will be enabled, then quickly disable itself. You can actually use the arrow keys to switch back and forth then with the mouse click on print before it disables it. I did this on a mighty words e-text book and printed a copy of it to acrobat printer which saves it to a new file.

    1. Re:How to get around disabled printing by zeno_2 · · Score: 2

      This sounds a lot like that PDF that was on the NYTimes (i think) where they had a list of names of people, but they were blacked out. Someone with a slow connection or something like that was able to see the names at first, and then the black squares loaded after over the names..

      Pretty strange stuff =P

  78. Funny You should ask. by Fapestniegd · · Score: 3, Informative

    My current setup consists of:
    4 HP scanners with ADF ~$150 ea. (eBay)
    4 Sparc LXs from a property control auction $50
    one flatbed scanner for covers and bad scans. $50 (eBay again)
    Barebones System/w scsi from Compgeeks $80

    (NFS server), an Amtren device (courtesy of the office), and away you go. I've found the best way to cut off the bindings is to use a box cutter and to use your previous cuts as a guide. I use several shell scripts to scan various types of books; it's amazing the page-numbering schemes some publishers use. With this setup I can scan approximately 2-3 college textbooks of 1000 pgs. (grayscale), or 1 in color, in a 10-hour period, including checking for bad scans (SANE ain't perfect, so you'd better check 'em). Also, JPG isn't very good for OCR: I store as PNG and convert a second set to JPG for web viewing. OCR under Linux isn't quite there yet (unless you want to pay through the nose), so I am archiving the PNGs to CD until it is. This also allows me to regenerate the JPGs if I lose a webserver disk. Add a nifty little ImageMagick web viewer and voilà! eBookshelf! Oh, and an NSM CD changer is nice too, to get to the CDs nearline. You can pick these up on eBay for $200-$400.
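
    The page-numbering part is the fiddly bit. A rough Python sketch of the idea (not the shell scripts mentioned above): map scan order to printed page labels, assuming a book whose front matter uses lowercase roman numerals and whose body restarts at 1. `to_roman`, `page_labels`, and the file-name pattern are all made up for illustration.

```python
def to_roman(n):
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_labels(front_matter_pages, body_pages):
    """Return file names in scan order: roman front matter, then arabic body."""
    names = [f"front_{to_roman(i)}.png" for i in range(1, front_matter_pages + 1)]
    names += [f"page_{i:04d}.png" for i in range(1, body_pages + 1)]
    return names


print(page_labels(3, 2))
# ['front_i.png', 'front_ii.png', 'front_iii.png', 'page_0001.png', 'page_0002.png']
```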

  79. PNG vs JPEG by Anonymous Coward · · Score: 1, Informative

    First, I'd use PNG (lossless) or Photoshop's format (lossless) over JPEG (lossy). PNG/PSD will be crisper and color pictures will not be degraded.

    Second, I'd make them HTML/PDF instead of plain text. Mainly because then you can retain the fonts. (Of course, some of the OCR programs will do this for you if you want to save it as MS-Word file but that's another story. :-/ )

    Third, a well scanned book is just as easy to read as the book itself. Honest! ;-) But really the problem I've run into is that the back of the pages tend to show through sometimes. You can help to alleviate this by rescanning the pages by hand. Place a blank piece of paper behind the page. This helps to make the page seem whiter to the scanner. If the paper is too bright then use a darker colored piece of paper (like grey or black). This will help the scanner to tone down the bright white of the paper. Only trial and error can tell you what you will need on a book by book basis. This is because each publisher uses a different brand of paper.

    Last, use an exacto knife to do the cutting and a good ruler with a metal edge. Exacto + wooden ruler means lots of splinters, badly cut pages, and sore thumbs/fingers. :-) Plastic rulers can also lead to problems. Use a metal one! Save time! Save going to the doctor for stitches! Keep those hard to get out red stains from appearing in your books!

    Nuf Said!

  80. Re: Concordence by tigris · · Score: 1

    There's a very handy database program designed for large legal case management (cases with 100,000s of pages of produced documents) that enables one to link OCR text with a scanned image of the document. It comes in very handy when you're doing a keyword search of the OCR'ed text - you can automatically choose to go to an image of the page - circumvents quite nicely the problems that OCR has with images, tables, graphs, etc.

    http://www.lcsweb.com/Software/concord.htm

    But I have to admit that I would just check out Usenet or file-sharing programs for the titles you have - why duplicate work that someone's probably done already?

    Tig

  81. Omni Page by EddydaSquige · · Score: 1
    I've been using OmniPage lately and it does a fairly good job, but it's still only 99%. 99% sounds good, but if you figure 250 to 300 words per page, or 1500 characters, that's 15 mistakes per page. If you're doing a whole book that's a lot of mistakes. The way that many large publishing houses do it is to get three non-native speakers per document to type it up, and then a software package double-checks them against each other.

    It sounds like your best bet is the 'hire a bunch of college students' route.
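
    To put numbers on that, here is a small Python sketch of the error arithmetic plus a naive version of the three-typist cross-check. The 400-page book length is my own assumption, and `expected_errors` and `majority_vote` are made-up names; real comparison software would have to align the transcripts first, which this does not do.

```python
from collections import Counter


def expected_errors(chars_per_page, accuracy, pages=1):
    """Expected number of wrong characters at a given per-character accuracy."""
    return chars_per_page * (1 - accuracy) * pages


def majority_vote(transcripts):
    """Pick the most common character at each position across transcripts.

    Assumes all transcripts have the same length (no inserted or dropped
    characters).
    """
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*transcripts))


print(round(expected_errors(1500, 0.99)))             # 15 per page
print(round(expected_errors(1500, 0.99, pages=400)))  # 6000 in a 400-page book
print(majority_vote(["dead trees", "dcad trees", "dead trcea"]))  # dead trees
```

    The vote works because three independent typists rarely make the same mistake in the same spot.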

  82. From ??AA by Anonymous Coward · · Score: 0

    What? You want to make fair use copies of our copyrighted work. We'll be sending the lawyers out immediately.

  83. Here is a interesting tool. by unlocked · · Score: 1

    http://djvu.research.att.com/home.html

    I think it is free for non-commercial use.
    Makes nice archives of documents.

  84. You have work buy one copy, and you buy another... by gte910h · · Score: 1

    ...then you can spend your extra time in the gym becoming capable of carrying 100 pounds of books if you really need them...

    --
    Want to see every step I took to start my company? http://www.rowdylabs.com/blogs/pitchtothegods
  85. IRIS OCR by Anonymous Coward · · Score: 0

    IRIS OCR is very good software for this. This is not a stupid question at all. Books should go this way.

  86. Useful for non Techie books too?!?! by RalphWigum · · Score: 1

    I have also been wrestling this issue for a while, but with shop manuals for old cars.... Try finding a shop repair manual for a '73 roadrunner. Even after exhaustive searches on the internet to find a place to buy these books, I find one shady site and have to pay $$$ for them.

    My thought was that if I could find them, buy them and then digitize them, I could save some time for some poor sap who might not be so search-engine savvy and get them a searchable digital copy...

    Plus if I could do a keyword search for "Holley carb" rather than thumbing through a three volume repair manual set.... man.. I could get my car up and running in half the time, and not ruin the $$$$ manual set with grease and muck in the process!

    My main question would be, for discontinued or non-traditionally published books (like company distributed manuals) what is the legality involved with re-distributing (or even selling) digitized versions of them??? Even if it is a chore to do, would it legally be worth me doing it for the good of others???

  87. Photo of Danese Cooper by Anonymous Coward · · Score: 0
  88. Some Advice on Scanning Textbooks by xerofud · · Score: 1

    Well I couldn't agree more about lugging around dead trees ... I've got over 200 mathematics textbooks myself that I am in the process of digitizing.

    Having done a fair amount of research on the topic, this is what I can share:

    If you are willing to feed your books to the guillotine, then it should be enough to purchase a home/office level scanner that includes an automatic document feeder (ADF). My advice here is to buy an Epson product. Even their lowest end models are lightning fast for their class, and the quality of the scan is consistently high.

    For more serious scanning of texts that you don't want to destroy I settled on the Ricoh IS450SE, because this model has an 11"x17" glass plate, which is ideal for scanning both pages at the same time. It copies two pages in just over one second, which makes quick work of manually scanning a 500 page textbook. It's even quiet enough to operate while watching your favorite TV program. The native resolution is 400dpi and it can handle grayscale, at a slightly slower speed. The scanner was my top choice because of speed and cost. At $3000 street price (including ADF), I could justify it because I'll make that back selling my textbooks used on the net, not to mention avoiding purchasing any new textbooks. Some of the other competing manufacturers in this category include Fujitsu, Panasonic, Kodak, and Xerox. Visioneer has recently introduced a sub $1000 scanner that is supposed to be competitive in this category of machines which might be worth investigating.

    One last comment about hardware. Since my goal is to read my texts online, I purchased a laptop with a 15" 1600x1200 screen from Dell which I highly recommend. Apparently Dell has recently introduced an improved version of this UXGA screen called "UltraSharp" which supposedly fixes some problems with uniformity of contrast from top to bottom of screen, which might be of interest to someone considering purchasing a 133dpi LCD laptop. (IBM, Compaq, Sony and Hitachi I believe also offer models with UXGA screens.) I'm curious if anyone knows the current lightweight champion for laptops with UXGA screens. My Dell is over 9 lbs :)
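
    For anyone comparing screens, the 133dpi figure falls out of the resolution and the diagonal. A tiny Python sketch, assuming square pixels and that the quoted size is the true diagonal; `pixels_per_inch` is a made-up helper name.

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_inches):
    """Pixel density along the diagonal of a screen."""
    return math.hypot(width_px, height_px) / diagonal_inches


print(round(pixels_per_inch(1600, 1200, 15)))  # 133
```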

    When it comes to software, I had to write my own Python script using the freely downloadable TWAIN interface kit for Python since the driver interface to various software packages I tried was not optimized for doing bulk scanning. In particular, I wanted the software to automatically name/number the images as they became available. One of the Ricoh TWAIN drivers I tried was a little buggy but I was able to correct for that in the Python script I wrote by seeking out the TWAIN spec from www.twain.org and reading up on it. I can make my mods available to anyone interested in driving a Ricoh product from Ricoh's supplied TWAIN drivers.
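
    The naming part is the easy bit to show. This is not my actual script and it makes no TWAIN calls at all; it is just a sketch of the "pick the next free number" logic, with `next_scan_path`, the `page_` prefix, and the `.png` extension chosen purely for illustration.

```python
import re
import tempfile
from pathlib import Path


def next_scan_path(directory, prefix="page", ext=".png", digits=4):
    """Return the next unused numbered path, e.g. page_0003.png after page_0002.png."""
    directory = Path(directory)
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+){re.escape(ext)}$")
    highest = 0
    if directory.is_dir():
        for entry in directory.iterdir():
            match = pattern.match(entry.name)
            if match:
                highest = max(highest, int(match.group(1)))
    return directory / f"{prefix}_{highest + 1:0{digits}d}{ext}"


with tempfile.TemporaryDirectory() as tmp:
    print(next_scan_path(tmp).name)  # page_0001.png
    (Path(tmp) / "page_0001.png").touch()
    (Path(tmp) / "page_0002.png").touch()
    print(next_scan_path(tmp).name)  # page_0003.png
```

    Scanning the directory each time means an interrupted session picks up where it left off instead of overwriting earlier pages.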

    For archiving/image processing of the raw scanned images there are two formats I've looked carefully at using. The first is DjVu available from djvu.sourceforge.net. This uses the open standard called jbg2 to efficiently compress images of scanned text (by indexing recognized glyphs). It can also automatically segment images according to text and image and compress the latter with wavelet technology. AT&T spun off a commercial venture called LizardTech that is based on the same software provided on sourceforge albeit a little faster and more sophisticated. The open source tools should be more than adequate however for non-commercial applications.

    There is also of course PDF, the latest version of which supports jbg2 encoding. Unfortunately the Linux viewers for PDF do not support this feature in the latest PDF spec, and neither the PDF viewers nor the DjVu viewers support sub-pixel rendering of images. (Of course Adobe Acrobat 5 does so through its "CoolType" technology, but that is only available on the Windows platform as far as I know.) I am currently working to upgrade the open source DjVu viewer to support "pixel borrowing" on LCD screens since standard grayscale anti-aliasing (supported by the viewer) does not look so great on LCD screens.

    Then there is the open source OCR project called claraocr. My ultimate goal is to add support to this package for recognizing mathematical formulas and producing equivalent TeX code. This is a HARD problem and I don't expect to make significant progress on it anytime soon.

    Anyway, I hope the above long-winded tour through my adventures in this area can provide some useful insights.

    Good luck, and make sure to make available your handiwork on P2P networks like gnutella and Freenet :)

    1. Re:Some Advice on Scanning Textbooks by markov_chain · · Score: 1

      Speaking of high resolution monitors, check out this new IBM monitor: http://www.provantage.com/YIBML011.HTM. It's a 22" monster with a resolution of 3840 x 2400!!! That's over 200dpi. At $8k, it's a bargain =)

      --
      Tsunami -- You can't bring a good wave down!
    2. Re:Some Advice on Scanning Textbooks by nukebuddy · · Score: 1

      xerofud wrote:
      I'm curious if anyone knows the current lightweight champion for laptops with UXGA screens. My Dell is over 9 lbs :)

      The Dell Inspiron 4100 has an optional UXGA screen. It is only 14" though. Your 8100 has a 15" screen. 14" gives it ~145ppi and is a plus for me since it is smaller and lighter, but it might be a bit too tight for some folks.
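
      For anyone who wants to check the density figures quoted in this thread, here is a minimal sketch in Python. The function name is made up for illustration, and it assumes the advertised diagonal is the true viewable diagonal, which is only roughly the case for real panels.

```python
import math

def pixels_per_inch(width_px: int, height_px: int, diagonal_in: float) -> float:
    """Pixel density from the resolution and the panel diagonal in inches."""
    return math.hypot(width_px, height_px) / diagonal_in

# UXGA is 1600x1200, so the diagonal is exactly 2000 pixels.
for diagonal in (15.0, 14.0):
    ppi = pixels_per_inch(1600, 1200, diagonal)
    print(f'{diagonal:.0f}" UXGA: {ppi:.1f} ppi')
```

      Under that assumption the 15" panel comes out at about 133 ppi, matching the figure given earlier, and the 14" panel at about 143 ppi, a little under the rough ~145 quoted here.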

      4100 link:
      http://www.dell.com/us/en/dhs/products/model_inspn_1_inspn_4000.htm

      -nukebuddy

  89. Re:The Wearable Comp. Guy at Georgia Tech does it. by uberdave · · Score: 1

    Or better yet visit the father of wearable computers at http://eyetap.org/mann/ (I think he's involved in our local linux group, but I'm not entirely sure.)

  90. I've done this by brad3378 · · Score: 4, Insightful
    To do it, I purchased a used HP scanner with a 50 page Automatic Document Feeder (Search for ADF on Ebay).

    I started with the easiest books: books that could be removed from the binding. Scans go smoothly with the ADF, but it is not as easy as you might think. I find that I spend most of my time naming the files because the default naming convention is *01.jpg , *02.jpg , *03.jpg, etc.

    It is a problem for two reasons:

    Most of my books are double sided.
    My HP scanning software for Windows does not let me name files with a 2,4,6,8 or 1,3,5,7 format.

    If books contain more pages than the ADF holds, the first page scanned in each batch will still be named page 1.

    If I knew a little perl, I'd write a script to rename the files between scan batches.
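
    A rough sketch of such a renaming script, in Python rather than perl. Everything here is an assumption for illustration: the helper name `rename_batch`, the `page_NNNN.jpg` output naming, and the idea that sorting a batch by file name gives scan order (true for zero-padded names like *01.jpg, *02.jpg within one ADF load).

```python
from pathlib import Path

def rename_batch(batch_dir, out_dir, first_page, step=2, reverse=False):
    """Move one scan batch into out_dir with real page numbers in the names.

    Hypothetical helper: assumes the scanner wrote the batch as
    *01.jpg, *02.jpg, ... so that sorting by name gives scan order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    scans = sorted(Path(batch_dir).glob("*.jpg"))
    if reverse:
        # Backs scanned from a flipped stack arrive last page first.
        scans.reverse()
    renamed = []
    for offset, scan in enumerate(scans):
        page = first_page + offset * step
        target = out / f"page_{page:04d}.jpg"
        if target.exists():
            # Refuse to overwrite a page that was already placed.
            raise FileExistsError(f"{target} already exists")
        scan.rename(target)
        renamed.append(target.name)
    return renamed
```

    With something like this, a batch of front sides starting at page 51 would be handled with first_page=51, and the matching back sides with first_page=52 (plus reverse=True if the stack was flipped over for the second pass). With step=2 the two passes interleave into one numbered sequence.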

    For scanning full bound textbooks, there are two main problems:

    Scanning the side of the page along the binding requires carefully holding downward pressure on the book to keep it near the scanner glass.

    You cannot scan the book using ADF, so you should expect to spend A LOT of time scanning.

    Do not even consider manually scanning hundreds of pages with a parallel port scanner. WAY WAY too slow. USB scanners are cheap now, and will usually scan as fast as the scanner mechanism can move (assuming black & white scans).

    Lastly, be realistic.
    Know how much time you'll need to invest.
    Rule of thumb: If you need to scan manually, expect to scan about 200 pages per hour at top speed. Is it worth investing six hours to scan that 1200 page book of yours? If money allows, I'd suggest purchasing a second book that you can afford to destroy. Cut the binding off with something like a jigsaw, then insert the pages into an ADF scanner. Hope this helps somebody.

    --

  91. Another place to look... by zaren · · Score: 2, Informative

    is http://docs.rinet.ru:8080/ - I ran across this site a few years back. It almost looks like an online library for a Russian ISP's technical support staff.

    They've got lots and lots of official books, all HTMLized a chapter or a section at a time. They're all a bit old or out of date, too - I know of one Perl book in particular that they have there was one edition behind what was being sold on the shelf at the time I saw it.

    -----
    Is Darwin an evolutionary OS?

    --
    Come to the University of Mars! Classes starting soon!
  92. You fucking Girl... by kung-fu-hippie · · Score: 0, Troll

    "Oh... I'm a litte school girl who can't carry her her books around and I take it up the butt.."

  93. kinkos: a buncha idiots by Anonymous Coward · · Score: 0

    don't go to kinko's, they suck...but you should be able to find a good quick printer in your area, call and ask if they have a digipath scanner, have them scan it (digi's have good feeders) and save it as a pdf. it's only one copy, I think that's fair use...

  94. Software has improved a lot. by Anonymous Coward · · Score: 0

    I've done a few books with Omnipage 11. I'm very happy with it - it does an excellent job.

    After it's done OCRing, it brings up a proofreading dialog - presenting you with any words it wasn't too sure of. Depending on the quality of your scans you are usually looking at 0-10 corrections per page.

    I find that OCR/proofing takes about the same amount of time as scanning. As a final check, I read along with text-2-speech at ~400wpm - you can hear any missed mistakes very easily and fix them as you go.

    The end result is near perfect.

  95. Document Imaging Systems by yum_icecream · · Score: 1
    I work for LaserFiche, a major software vendor that specializes in document imaging and document management. It's typically used in offices that have a huge paper burden. i.e. they have large numbers of documents, they need fast access to the documents, the files take up a lot of expensive real estate, they spend a lot of money photocopying and routing, they want to preserve the fragile originals, keep backups for disaster recovery, etc...

    The idea is simple: All documents (paper and electronic) are stored in a single repository. Retrieval is based on what you know. If you know what folder it was put in, browse around. If you know how it was categorized/indexed, do a db search. If you remember some words that occurred in the document, do a full-text search.

    Documents can then be made available through your PC, over the network, over the web, via CD, emailed, etc.

    Read this overview if you'd like to find a bit more about the basics of document imaging

    Depending on your budget, you could either buy your own system, or hire a service bureau to scan your documents and give you the images/text/index on a CD.

    If you've only got a couple books, use a service bureau. We have reps all over the globe that can offer this service.

    As for the "build your own" option... Well... Let's just say there's a lot of subtleties involved in building a reliable system that might be overlooked at first glance.

    Tom Wayman
    Senior Technologist, LaserFiche
    E-mail: twayman "at" laserfiche.com
    Web: www.laserfiche.com
    Document Imaging for the Real World.

  96. I do this with... by coldmist · · Score: 1

    with a Fujitsu 3092DG (Duplex-SCSI) scanner. I'd recommend either a Fujitsu or a Canon DR* scanner.

    On the rare occasion that a book doesn't want to feed smoothly, I just stand next to the scanner and loosely put each page on the input. Doing this, I can still queue up about 30 pages, leave for a minute, come back and add more, etc. I've done a 500 page book like this once without ever stopping the scanner.

    I chop the spine off at Kinkos, OCR it with TypeReader or FineReader (auto-straighten of skewed images, auto-split of an open-faced 2-page scan on smaller books, despeckle, etc).

    I scan at 400dpi, which seems to give just an error or two less than 300dpi without too much extra disk usage.

    The biggest thing you can do to adjust OCR result quality is to play with the contrast and brightness settings. I've scanned several books that had about 1 OCR error every 3-4 pages. No, I'm not kidding.

    FineReader can output directly to PDF, doc, txt, html, etc.

    --
    Don't steal. The government hates competition.
  97. www.greenstone.org by Anonymous Coward · · Score: 0

    See www.greenstone.org

    Greenstone is a suite of software for building and distributing digital
    library collections. It provides a new way of organizing information and
    publishing it on the Internet or on CD-ROM. Greenstone is produced by
    the New Zealand Digital Library Project at the University of
    Waikato, and developed and distributed in cooperation with UNESCO
    and the Human Info NGO. It is open-source software, issued under the
    terms of the GNU General Public License.

    Includes the document:

    From Paper to Collection (224kb)
    A document describing the entire process of creating a digital library
    collection from paper documents. This includes the scanning and OCR
    process and the use of the "Organizer".

  98. Acrobat perhaps... by Jaycatt · · Score: 1

    Adobe Acrobat I believe allows searches (at least it does allow cutting and pasting, so maybe searching as well?)
    Plus it's a fairly small file size with a multi-platform reader.
    No ideas on what scanning software to use, Omnipage is okay, but I don't think it creates Acrobat files.

    --
    "Shared pain is lessened; shared joy is increased. Thus we refute entropy" - Spider Robinson
  99. Don't be such a wimp! by Anonymous Coward · · Score: 0

    All this over 100 lbs of books? Just put em in 3 boxes and carry them! Even if you changed jobs every 6 months this would be no big deal. The exercise will do you good. Digitizing technical manuals sounds like a big waste of time to me.

  100. You *need* to be aware of OpenDJVu by Effugas · · Score: 5, Interesting

    Run, don't walk, to http://djvu.research.att.com/home.html . DjVu is an image-based competitor to PDF that is a feat of beautiful engineering -- 300DPI scans break down to about 10-30K a page, the viewer is about an order of magnitude faster than PDF, the format cleanly supports separate encoding of page texture/graphics vs. page text, there's a significant amount of open source for it, and more.

    It's truly a brilliant format. Go check it out.

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com

    1. Re:You *need* to be aware of OpenDJVu by Anonymous Coward · · Score: 2, Informative
      The open source implementation of DjVu is called DjVuLibre. It includes a viewer and browser plug-in for Unix/X11 (with binaries for Linux, Irix, and Solaris).

      There is a free online conversion server at Any2DjVu.

      Info can be found at DjVuZone.

  101. Re:Do you really need them? (Red Rubber Ball) by Chris+Y+Taylor · · Score: 2

    A mathematician, a physicist, and an engineer are asked to find the volume of a red rubber ball. The mathematician measures the diameter and calculates the ball's volume. The physicist submerges the ball in a full beaker, and measures the amount of water that spills out to get the volume. The engineer turns the ball over until he finds its serial number, then looks up the volume for that model on his Red Rubber Ball Table.

    Half of the library in my office is catalogs and equipment data sheets for components. A lot of the rest is more generalized data like stress concentration factors for various object geometries and material characteristics; these are things that CANNOT be derived from theory. Only about 4 of my books (which, admittedly, I do use a great deal) are theoretical books. Physics, Advanced Math, Design of Experiments, and a Mech. Eng. Handbook. When you work with real objects, rather than just theory and pure numbers, you tend to need a lot more detailed reference materials. And I'm sure that at least one Engineer in the red rubber ball industry has himself a Red Rubber Ball Table.

  102. Yes. by r_j_prahad · · Score: 2

    Because the yellow highlighter looks like shit on my CRT.

  103. definition of "dearth" by bcrowell · · Score: 3, Informative

    There are hundreds of them here. Very few are the kind of dopey software manuals you're referring to. Is that a "dearth?"

    1. Re:definition of "dearth" by Graspee_Leemoor · · Score: 2

      Tuh- such common everyday stuff as:

      "Introduction to the Theory of Infinite-Dimensional Dissipative Systems"

      ...Where can I find more esoteric stuff ?

      ;)

      graspee

  104. Has everybody missed the obvious? by Anonymous Coward · · Score: 0

    Has everybody missed the obvious? I mean, have you even tried contacting the publisher to see if the books are available in electronic format? So what if they turn you down, it may get them thinking that their customers want these books in electronic format as well as dead tree format.

    They have to have them in electronic format in order to re-release them to the printing presses so maybe they can be reasonable about it. Who's the guy from O'Reilly always posting to Slashdot (Andy O'Reilly?)? He seems quite reasonable about a lot of things, maybe he can get a precedent started in the tech book publishing field just like Eric Flint in sci-fi!

  105. A hacksaw and a page feed scanner by awol · · Score: 1

    Cut off the spines (a hacksaw works well) and drop them in a page feed scanner.

    Interestingly enough this process works. A few years ago now, when I was in university, a friend of mine was looking for a large repository of English prose from which to draw conclusions about his hypothesis regarding aspects of natural language processing.

    Anyway, there was a certain dictionary company that was willing to provide their five million word dictionary for some exorbitant cost. Then I discovered that another friend of mine, working at the dictionary published by my university, had a twenty million word database of actual prose that he had collected by "cutting the spines off books and dropping them in a page feed scanner". Whilst OmniPage was about the only decent OCR at the time, the field now is probably more extensive.

    My friend got to use the 20 million word dictionary for nada, as long as he helped the dictionary guys out with a bit of coding from time to time. Nice huh. Sharing is such a beautiful thing.

    --
    "The first thing to do when you find yourself in a hole is stop digging."
  106. .tiff are fine by Anonymous Coward · · Score: 0

    Well, I am scanning my books in monochrome on an old Epson scanner
    bought for $35 online. I scan them into pnm, then convert to TIFF with -g4 (CCITT Group 4) compression, which gives me 300dpi resolution and 60kB size per page.
    I can later print out a single page I need and
    I get much better quality than a xerox.
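
    To put the 60kB figure in context, here is a back-of-envelope sketch in Python. The page size is an assumption (US Letter, 8.5 x 11 inches; the post does not say), "60kB" is taken as 60,000 bytes, and the function name is invented for illustration.

```python
def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel scan, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row is padded to a whole number of bytes.
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px

raw = raw_bilevel_bytes(8.5, 11, 300)  # assumed US Letter page
compressed = 60_000                    # "60kB" from the post, read as 60,000 bytes
print(f"raw: {raw} bytes, ratio about {raw / compressed:.1f}:1")
```

    Under those assumptions the raw monochrome page is a little over a megabyte, so 60kB per page works out to a compression ratio somewhere around 17:1.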

    Kubus

  107. Scanning Text by Wanker · · Score: 2

    Scanned text pages should be black and white.

    Of course it won't scan this way due to shading, bits of wood chips on the pages, etc. Your image processing software can/should convert it to literally two colors-- black text + background (white). As you can imagine, this kind of "lossy" conversion cuts out a great deal of information and the file size reflects this.

    Combine this with a lossless compression algorithm, which takes these huge areas of the same value and compresses them very tightly, and you have a tiny, high-contrast, easy-to-read (or OCR) image.

    Now with JPEG, it "loses" information by smoothing (forgive my oversimplification of a complex mapping process). With text you *want* unsmoothed (hard) edges-- it makes things easy to read. The JPEG smoothing process results in hard to read text, so you can't use as much of it before the image degrades too badly to read.

    The result: the 2-color conversion with lossless compression gives you a smaller image size for the same relative viewing quality as a JPEG. (Or the flip side, for the same image size, the 2-color image is much more readable than the JPEG.)

    Try this-- take a screenshot of some text. (Only text) From the GIMP, convert it to 2 colors and save as PNG. Then save it as a high-quality JPEG and a low-quality JPEG. Check the file sizes versus the clarity of the text.
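
    For those without the GIMP handy, the lossless half of the argument can be sketched in plain Python. This is only an illustration under loud assumptions: the "scan" is synthetic (noisy light paper with noisy dark bars standing in for text), zlib's deflate stands in for PNG-style lossless compression, and JPEG is not reproduced at all. The function names and the cutoff of 128 are made up for the example.

```python
import random
import zlib

def threshold(pixels, cutoff=128):
    """Map 8-bit grey values to pure black (0) or pure white (255)."""
    return bytes(0 if p < cutoff else 255 for p in pixels)

def fake_scan(width=200, height=100, seed=0):
    """Synthetic 'scan': noisy light paper with noisy dark bars as text."""
    rng = random.Random(seed)
    pixels = []
    for y in range(height):
        for x in range(width):
            is_ink = (y % 20) < 6 and (x % 10) < 7
            base = 40 if is_ink else 220
            # Scanner noise: shading, paper grain, and so on.
            pixels.append(base + rng.randint(-25, 25))
    return bytes(pixels)

grey = fake_scan()
bilevel = threshold(grey)
print("grey, deflated:    ", len(zlib.compress(grey, 9)))
print("2-color, deflated: ", len(zlib.compress(bilevel, 9)))
```

    The thresholding throws away the noise, which is exactly the part a lossless compressor cannot squeeze, so the 2-color version deflates to a small fraction of the grey one.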

    1. Re:Scanning Text by Anonymous Coward · · Score: 0

      Taking a screenshot is not the same thing as scanning a printed page. A screenshot of text is already in 2 colors whereas a scan is true color.

    2. Re:Scanning Text by Wanker · · Score: 2
      Taking a screenshot is not the same thing as scanning a printed page. A screenshot of text is already in 2 colors whereas a scan is true color.
      Perhaps you missed step two of my description where the image gets converted to two colors, making it irrelevant whether the original was 2 color or true color.

      Screenshots on both Windows and X-Windows are created at the color depth of your display-- not 2 colors. There may only be 2 of the 16M colors in use, but the raw data is 16M colors. (If you're running your screen at 24bit.)

      If you must, feel free to try it with a "real" scan. (But don't forget to do the two color conversion. Sometimes a noise reduction transform is useful beforehand to get rid of small grey dots/blotches before they get converted to black.)

  108. Then What? by mugnyte · · Score: 1


    (1) By the time you finish, those books will be obsolete. Hint: Don't buy any more.

    (2) What will you read in the crapper? Get a wireless card and laptop?

  109. Privacy anyone? by Anonymous Coward · · Score: 0

    It shouldn't be Kinko's business at all what you are copying. The copyright commies have struck again.

  110. Re:Copyright Infringement? 2800 years by Herr_Nightingale · · Score: 1

    Fair use policy on this dictates that I can do whatever the hell I please with my books for my own personal enjoyment. Why would a publisher have a problem with that? Why should anybody have a say in my enjoyment of purchased material.

    For that matter, the only really essential papers out there are the Bloom County and Outland collections, for which I'd get only $35 from the used book guy - but which are worth incalculably vast sums to me in terms of necessary mirth on demand. I'm putting the whole schmeer into PDF right now because it's totally 100% worth it to me. I don't care if it takes me 15 years to scan and all that, when I need my mirth I need it now. Dammit. And I don't want to search through 20 books to find it.

    Consider all the times you've had to hunt through 400 million pages of generic-looking documentation for one essential spec. Think how many times in the future you'll hunt through those 400 million pages (we'll use 275 million for the sake of argument because your data will tend towards the later-middle pages rather than the front or back - a pseudo-fact proven out, in my experience, time and time again) and assign a rough estimate - say, 1,350,000 more times in your entire existence if you're not far past 30 with a life expectancy of 80 years. Paging through those 275 million pages 1.35 million times leads to 371,250,000 pages perused in pursuit of tidbits. If you're super-duper-speedy man, you'll visually audit about four to five pages a second (unless you've not read the book - but we're dealing with a best-case scenario here so forget about that for the moment) until you derive that special nugget, and finally you'll have to put away all those books (unless you're like my sister, but we'll hope and pretend that you're not so ignore that also) which will take approximately 83 hours (assuming 3 seconds per book for 100,000 books at 4000 pages per book).

    Now, I'm no math major here, but upon adding it all up, it would appear that one might easily spend 92,812,500 seconds directly searching plus 83 hours per instance of garbage collection (we'll use another thin-air variable, in this case with a value of 300,000 because chances are that you won't be so fastidious about cleaning up as I am) which adds up to 24,925,781.25 hours of your life wasted because you couldn't spare a weekend or three scanning those books.

    Think about it.

  111. Network Scanning by wmaheriv · · Score: 1

    I recently installed a pair of Xerox Document Centre 440s, which have a high-speed network scanning capability. Basically, you set up user templates, point it to an ftp server, place your documents on top of it and hit go. The end result is a .pdf file or multi-page .tiff, dropped onto your network. The .pdf is immediately useful, but the .tiff can be run through an auto-OCR, producing an editable document.

    I just tested this by cutting the binding off of one of my AD&D Player's Handbooks, placing it on the scanner, and creating an OCR'd document out of it! Makes it really easy to extend a resource for personal use, eh?

    Now, the down-side to this beast is that they cost about 45k. *grin* Still, if your company needs a great printer/scanner/copier/fax machine, these things are well worth the price tag.

    --
    ~wmaheriv
    "Shema Yisroel- Adonai Elohenu, Adonai Echad!"
  112. takes time, but is worth it by savetz · · Score: 2, Interesting

    I have scanned several books (in my case, Atari and other classic computing books) for atariarchives.org. The process takes time, but is worth it.

    A scanner with a reliable sheet feeder is essential. This doesn't necessarily mean expensive -- I've seen a lot of reasonable-looking scanners with ADFs on ebay for less than $100.

    I cut the pages off the books using a single-edge razor blade -- non-ragged cuts are essential. Then I scan them into TIFF format at 300 DPI, greyscale. If I want searchable PDFs, I use OmniPage X on a Mac to create image-over-text PDF; it's quick and easy.

    But most of the time, these books are for Web viewing. So I use a graphics conversion program with batch capability (GraphicConverter on the Mac) to a) increase the contrast dramatically -- near 100%; b) trim the whitespace from the edge of the images; c) scale the pages as necessary; d) scale them more to create thumbnail versions.

    There are no hard-and-fast rules for choosing the final file type. Just got to balance file size and readability, and this varies from book to book. Sometimes I go with JPEG, sometimes 8-bit GIF, and sometimes 4-bit GIF. Sometimes I'll convert every page to GIF and also to JPG, then use a little script to select the smallest one for each page.
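
    The "little script" for picking the smallest file per page could look roughly like this in Python. This is a sketch, not the script actually used: the function name and the side-by-side layout (page001.gif next to page001.jpg in one directory) are assumptions for illustration.

```python
from pathlib import Path

def smallest_per_page(directory, extensions=(".gif", ".jpg")):
    """For each page stem, return the path of the smallest candidate file.

    Hypothetical layout: page001.gif and page001.jpg sit side by side.
    """
    best = {}
    for path in Path(directory).iterdir():
        if path.suffix.lower() not in extensions:
            continue
        size = path.stat().st_size
        current = best.get(path.stem)
        # Keep whichever version of this page is smaller so far.
        if current is None or size < current[0]:
            best[path.stem] = (size, path)
    return {stem: path for stem, (size, path) in sorted(best.items())}
```

    The returned mapping can then drive a copy into the Web directory. Size alone says nothing about readability, so a quick visual check of a few pages is still worthwhile.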

  113. Pitfalls of converting to film by ancarett · · Score: 1

    I'd be careful about trying this. Film is not an especially stable medium. Photographs require careful conservation as well.

    --
    ancarett, historian and zombie gamer
  114. This is against the DMCA. by barfy · · Score: 2, Interesting

    The digital representation of the "copyrighted" work as existed in a "page layout" program, using a technological means to prevent digital copying: Imaged to paper using digitally created "Plates".

    By attempting to "recreate" the digital representation by using technological means to defeat the digital copy protection of a bound book, you are criminally liable to the owner of the copyright.

    (Now if you were just copying this to another piece of paper, you may be ok under existing laws. But moving it to digital... Um, hands up scofflaw!)

  115. Future compatibility??? by zytheran · · Score: 1

    Some sorts of information are timeless, and won't go out of date. Some sorts of information are only relevant to today's technology and can be discarded or moved to a museum when no longer used. Currently I can read books/notes that were written hundreds of years ago in my local library. I don't need any special skills or hardware apart from my ability to read and a set of eyes. It can always be accessed by me, my children and all future generations. Do people really think it is wise to put *important* material onto media that will be redundant (and unreadable) in 20 years' time?? This article is a nerd feeding frenzy, a whole lot of technical solutions which miss the big picture.

  116. National Geographic by Wanker · · Score: 2

    I know exactly what you mean about the National Geographic CD-ROM set. I was very excited about having the complete archives available and was deeply disappointed in the quality of the final product.

    Much of the text is completely unreadable because of over-JPEGging. (Is that a word? It is now.)

    However, it did teach me to be very careful before plunking down $200+ for online books in the future. Now, I insist on a preview before I buy. (And yes, this does mean that many electronic collections don't get purchased simply because I can't find them in any libraries to view...)

  117. I'll wipe my ass with the pages by Anonymous Coward · · Score: 0

    The old destroyed books. I'll use them to wipe my ass.

  118. Our experiences over 5 years... by LucienMP · · Score: 1

    We, that is two of us, have been doing this since 1997. Our site Internet Technical Documentation Archive (ITDA) houses a lot of freely available Field Service Manuals.

    We started with borrowing local scanning resources and manually page flipping. That's one page per every 5-6mins! Then we bought our first LPT scanner and it was a little faster but ate pages....

    ...ack depends on what you want to do. Like most people say do you want to totally destroy your books? How much do you want to spend? Are you ever going to use those physical books again?

    If it's just low cost and a personal copy with reasonable quality, and you have LOTS of time, then.... just grab a copy of OmniPage OCR v11, an HP ADF scanner [ hp scanjet 5490cxi (C9863A)], a copy of Adobe Acrobat and get a professional company to despine your books.

    We spent a total of $800 on software/hardware to do this. We spend, on average, about 50 - 200 hours per book to process it - that's scanning, OCR, OCR proofing and format rework and then final PDF output. Some of the books we're doing I have given to students to work on. They'll do it for next to nothing ;-)

    It's possible to outsource this work to companies. For example Crowley do this and they also handle large documents. You have to be aware of how they are going to process your book and of the copyright problems. However, as someone said, some don't care about copyright and some do (eg Kinkos). Again this comes down to whether you care about the books and how much you wanna pay for a digital copy...

    In our case we don't make money off this site so we can't afford to out-source. So our biggest problem now is how we are going to get the over-size PDP-11 documents into PDF. The Minolta PS7000 looks like the beast we need but its way too expensive for a non-profit. We'll probably be out-sourcing and eating the costs.

    My suggestion is to either go the HP scanner+Omni+Adobe PDF route OR out-source it if you can afford. At least with the out-source option you get to keep your books intact.

    ITDA Team

  119. And then... by Pvt_Waldo · · Score: 2, Insightful
    The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.


    And then 3 weeks after you chuck it, go "Damn, I can't read this page!" when you go to look up something and it says, "It is extremely important that you fark dnf2 gib oefll or else you will damage your hard disk."

    Stick with books. There's a reason why they are popular. They work really well. Besides, the trees are already dead so you're not doing them a favor. And you'll just have to kill more trees to get more books to scan more stuff.
  120. TINLC by benh57 · · Score: 1

    There is no lumber cartel. Move along.

  121. Portability? by Black+Jack+Hyde · · Score: 1
    smart2000 asks: "I'm tired of lugging around dead trees.

    You haven't investigated all the transportation options for your collection. Wouldn't you rather put all that potential scanning time to better use? :-)

    Jack
    We want more, but we're getting Jack instead.

  122. how do i copy this windows xp cd? by Graspee_Leemoor · · Score: 2

    Put Cd in drive. Run Sad Old Easy CD Creator that came free with your cd burner, select "copy cd", select source and destination cd drive, click copy and follow on-screen prompts about changing cds over.

    Just remember to search for a crack on the web too!

    graspee

  123. FirsT1 by Anonymous Coward · · Score: 0

    you slashdott peoplez ar ea bunhca fuckin lame asshole kocksukerz and u''s can up and suck my fat boner

  124. You are insane by labradore · · Score: 4, Insightful
    Ask yourself this question:
    What is the oldest file that I have?
    and ask:
    What is the oldest useful file that I have?
    For most people their papers and books are much older than the data they keep and the paper version is always available and easy to read.

    You are much more likely to lose or corrupt your data if it is on a disk or a tape than if it is in a book. Your electronic version is going to be of much lesser quality than the books you had and you will have a lot of "adventures" getting your ebooks to be as easy to read as your paper books. What happens to your portable ebook when your reader runs out of batteries? Ebooks have failed because ... THEY SUCK. Let us all know how much time you wasted tweaking your ebook setup and worrying about how to make them sustainable. Also, please tell us when you go back to the store and buy new "dead trees" copies of the ones you destroyed.

    1. Re:You are insane by heelrod · · Score: 1

      I like my ebook.

      I've never had a crash or had to reset anything.

      I do agree however that Paper (Technical) books are easier to use than in digital format.

  125. This was a TROLL??? by Anonymous Coward · · Score: 0

    Wow. Mr. Moderator, you're so clueless that when you spend time with your friends they won't let you play Clue for fear of you causing the board to spontaneously combust.

  126. DMCA save us!! by Anonymous Coward · · Score: 0

    Someone notify the authorities! Adobe's software is circumventing copy protection!

  127. Occams Razor? by Sakhmet · · Score: 1

    *click*
    *drag*
    *ctrl-c*
    *alt-tab*
    *ctrl-v*
    *ctrl-p*
    *click*
    *wait*
    *read*

    Why, exactly, is 2.6 chars per line considered too few? Is this the infamous lameness filter? The formatting was so much nicer the first time.

    Sakhmet.

    --
    Ban the Nukes! Save the Whales! Screw it. Nuke the Whales!
  128. Use JBIG - not GIF by mangu · · Score: 3, Informative

    For bi-level images, the standard to use is JBIG, comes from an ISO group similar to those that created JPEG and MPEG.

    It generates much smaller files than GIF for printed text, with none of the inconveniences of JPEG. Grey scale pictures come reasonably well, if done at 300 dpi, dithered.

    I don't know exactly why JBIG never caught on like those other standards. There don't seem to be many JBIG programs around, but, if you are handy with source code, there's jbigkit, a library for reading and writing JBIG files. I wrote my own software with that, and converted a half-ton of old magazines into a 20-pack caselogic of CD's.

  129. dead trees to CDs by Simonetta · · Score: 2, Interesting

    I am also faced with the task of converting thousands of pages from paper to text files. I suggest looking into using a high resolution digital camera in a custom docking station above a flat surface that holds the printed material. (a photo enlarger comes to mind). Then instead of waiting for the scanner carriage to pass downward over the page, you can take a snapshot of the page.
    Send the image directly from the camera to the OCR program. I find that the Xerox TextBridge program can do OCR on a page almost as fast as I could turn the page were I not using a scanner to input the text. TextBridge is quite ackward to use and not very customizable for new types of applications such as this.
    Using a high resolution digital camera to input OCR text is also a good way to get around the question of whether or not to cut off the binding of the book.
    By the way, I assume that you're wishing to scan European language text. Doing OCR on Japanese, Chinese, or Korean is, I would assume, much slower than recognizing ASCII. Does anyone know of an available program that will do OCR on Chinese?
    With our friends in the middle east obsessed with blowing the shit out of us, it might be time to develop an open-source program that will do OCR on Arabic and Farsi, along with a translation program companion. Would Arabic be much more difficult to OCR because all of the phonetic symbols are joined together? I sometimes wonder about these things when I'm bumming about not having a life.

  130. Re:Do you really need them?-E-publishers. by Anonymous Coward · · Score: 0

    Two questions.
    One, how many companies & publishers are putting their material in an electronic format?
    Two, are there any E-publishers out there? You know, strictly CD, floppy, etc., no dead trees.

  131. could be good by ironfroggy · · Score: 1

    Couple this with one of those new dual-screen laptops and you're set. I'd love to do this myself actually, I probably have around 100 to 150 pounds of technical dead trees. Of course, I'd look and see what digital documents are already available. For example, don't scan that XML Blackbook, just download the official spec.

    1. Re:could be good by ironfroggy · · Score: 1
      Just some follow-up...

      What about things like tables and lists and things like that? How well will the OCR handle that sort of thing? Or custom fonts, images in the book, etc.? Buy a nice 120 gig hard drive and just keep those jpegs!

  132. One book a weekend by tombabu · · Score: 1

    I scanned one book (500 pages) into Word format in a weekend on a basic flatbed scanner, using the OCR package OmniPage Pro 10 by Caere. It is very reliable for text, and proofreading is minimal. Maybe 10% of the pages require any human input. If you go with OCR, this is the best one out there.

  133. Use DjVu by Anonymous Coward · · Score: 1, Informative

    For scanned documents, nothing can beat DjVu. Bitonal documents are 3 to 10 times smaller than with TIFF or PNG. Color documents are 5 to 10 times smaller than JPEG or PDF. There is a free online conversion service at Any2DjVu.

  134. .pdf sucks by Complust · · Score: 1

    It sounds good in theory. However, when I want to get something like a pin-out description on an IC, it is actually easier to pick up a vendor-supplied tome for the info desired.

    BTW, a lot of the manufacturers and businesses are publishing their own CDROMs. Case in point:

    Mouser sells electronic stuff. They did pretty much just as you recommended, cutting and scanning their paper catalog into .pdf files. Their first catalog on CD was quite pathetic....

    Having learned their lesson, they apparently found a way to clean up the mess and now they have both a decent CDROM catalog and also publish on paper.

    Now, about .pdf....

    I HATE IT! AND I HATE ADOBE!

    Let me explain. Why make a %$#%# CDROM in a format meant for PAPER?!?!? Don't get me wrong, I'm no bosom buddy of Billy G, but Microsoft lets you zip thru text easier than anything Adobe ever did.

    They can keep their silly little free reader. I translate .pdf to text whenever possible.

    Yours,
    Complust

  135. Re: Giuliani is a fucking fascist by Anonymous Coward · · Score: 0

    What part of "no text" don't you understand...?

  136. Re:Copyright Infringement? 2800 years by odin53 · · Score: 1

    Fair use policy on this dictates that I can do whatever the hell I please with my books for my own personal enjoyment.

    Except that this isn't a fair use. It's definitely copyright infringement. First of all, fair use doesn't at all say what you think it says. You're thinking of first sale doctrine. First sale doctrine doesn't help because he's making a copy. Neither does the statutory exemption for backups of software. And although this copying (or derivative work, same result either way) isn't for commercial purposes (arguable, of course), and the copied work is factual and informational, the other factors actually considered for fair use weigh against it (i.e., the copying) being fair use.

    But this is /., and even the most obvious cases of copyright infringement aren't considered copyright infringement.

  137. Re:ooh.. searchable pr0n... by Anonymous Coward · · Score: 0

    "Goddammit, where is that June 1993 issue of Hustler's Barely Legal?! All I can find are dozens of Mayfairs and Club International..."

    ;-)

  138. Re:Babe was a pig! by Anonymous Coward · · Score: 0

    >>I believe Babe his big blue ox did the hauling.

    No, Babe was a PIG and he did HERDING.

  139. For all the Perl and Python programmers... by stephanruby · · Score: 1
    Here is an excellent "Ruby Programming" eBook that can be downloaded for free from http://www.pragmaticprogrammer.com/ruby/downloads/book.html

    Here, you can download the Ruby source or the executables for your platform at http://www.ruby-lang.org/en/index.html

  140. Well, YMMV, but that wasn't my experience. by Lord+Vipor+Scorpion · · Score: 2, Interesting

    I was locked out because of their spidering filter, too. But I called up at like eight o'clock one night & someone unlocked it for me (& set it so that it wouldn't happen again).

    Safari also has a very good search engine, although it's weird that they coded it in MS ASP.

    The spidering filter seems intent on inhibiting the casual copier. I thought this was lame, but there's actually a certain logic to it. If you go to all the trouble to download & reassemble the books, then you've put enough work into it not to just throw the book out there on Gnutella.

    At its most expensive, Safari books cost $2 per month. So I'm not impeding anyone's education, and I'd like to see this service stick around. In fact, I can save people a bundle if I get them to use it the way it's meant to be used.

    The one lame thing is that O'Reilly pads their selection with multiple editions of the same book and also with books that are available for free on the openbook site--ok, that's like five books, but still... They're really starting to get a good selection now.

    In college, I used a free (as in stolen beer) html copy of a textbook for a class, and realized at the end of the year that someone had purposefully altered the book so that a lot of information was horribly incorrect. They'd basically cut out the word "not" all through the book, and inserted it after "is" in other places. Most people would not do that, but some a-hole did. Ah, college, what a hellhole.

    1. Re:Well, YMMV, but that wasn't my experience. by Anonymous Coward · · Score: 0

      In college, I used a free (as in stolen beer) html copy of a textbook for a class, and realized at the end of the year that someone had purposefully not altered the book so that a lot of information was not horribly incorrect. Most people would do that, but some a-hole did not.

  141. somebody setup us the bomb by Anonymous Coward · · Score: 0

    I go to check irc.nullus.net tonight for my daily dose of bookwarez and I can't connect. I think it got slashdotted! I just hope this never happens to my server. ;)

  142. Here it comes by DataGrok · · Score: 1

    This situation has been approaching for a while now.

    What will the ALA, librarians, and book publishers around the world say, what insanely stupid legislation will they lobby to enact, when people begin creating legitimate electronic duplicates of books and other printed material under their fair use rights?

    I'll tell you what. The DMCA. All over again. And worse. Applied to literature, rather than music. You heard it here first.

    But one thing still puzzles me. Typically, librarians are super cool people, full of common-sense, against stupid legislation like that of internet censorship. (See also one of my most favorite 'sites on the net.) I wonder what the reaction to electronification of information will be of the level-headed, pro-freedom librarians of the world. Will it be librarians vs. publishers and the ALA, side-by-side with programmers and technophiles vs. the MPAA and RIAA?

    What needs to happen is a complete and total revolution and upheaval in the way we think about intellectual "property" and copyright law. But that will, of course, never happen in our corporate-ruled capitalist society.

    I think it's time to pay a visit, and hit the information desk. It's been way too long since I visited the local library.

  143. To OCR, or not OCR... by satanicultwhackjob · · Score: 1

    Depending upon manual type, none of the suggested methods will guarantee more than 90% accuracy when you attempt to turn them into grep-able text. (My assumption is that you're not going to spend a couple grand burning CD / DVD documents that you can't then search.) This effectively renders the output useless, particularly those containing code. Either commit to typing in the important docs in by hand, perhaps scanning the images, or face up to the fact that the technology does not yet exist at the user level.

    Note 1: The above comes after months of attempting to turn The C Programming Language, Rev. 2 into searchable text as a test model for a large government project. (Don't worry boys and girls, your paper documents will not be turned into easily referenced HTML anytime in the near future.) I did, however, type it in by hand so I could run a diff against various Co's output. The aggregate result was closer to 87% accuracy, and the average error rate showed only 5% duplication, meaning that further refinements in the OCR would probably only bring the errors back down to a level of 10% -- 90% accuracy.
    (Scanning method was left entirely to Co's interested in bidding, with the proviso that this would be used for counties up to 200,000 residents - records, time line of 6 months.)
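
    For anyone wanting to repeat the "type it in by hand, then diff it against the OCR output" test, here is a minimal sketch using Python's standard difflib. Assumptions: accuracy is measured per character against the hand-typed reference, which may not be the metric used for the figures above, and the function name and sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def char_accuracy(reference, ocr_output):
    """Fraction of reference characters that the OCR output reproduces in order."""
    if not reference:
        return 1.0
    # autojunk=False stops difflib from ignoring very frequent characters
    # (spaces, for instance) in long texts.
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


if __name__ == "__main__":
    typed_by_hand = "int main(void) { return 0; }"
    from_ocr = "int rnain(void) { retum 0; }"  # typical m/rn confusions
    print(f"{char_accuracy(typed_by_hand, from_ocr):.1%}")
```

    Even a small character error rate is fatal for code listings, since a single wrong character per line is enough to break a compile or a grep.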

  144. Laws are begging to be hypertext. by ahfoo · · Score: 2

    Amen on that hypertext comment. The battle has not even begun.
    Most folks aren't lawyers, but generally people have seen some texts of court opinions at one point or another. I was just going over some court documents related to the patent courts --AKA, the CAFC-- and I was struck by how computer code-like the text was. The only reason people think it's hard to read court cases, especially patent court cases, is because they're riddled with links to other cases. Since the system was developed in a book only format in a rather rag-tag fashion, the text becomes very difficult to read because of all the notations they've used to indicate varying types of links.
    In my opinion, requiring the legal system to use electronic hyperlinked texts for court opinions and other legal documents is absolutely essential to any kind of IP reform. Until judges are benefitting from hypertext in an immediate way, they're going to fail to see the urgency of advocating its use or deciding in favor of electronic formats.
    Law and court documents should be readable by anyone with standard high school level English skills. The same is true for patents themselves. The core of a patent isn't the drawings. In fact, the drawings are often intentionally misleading to avoid disclosure of important information valuable to competitors. The important part of a patent is the references to other works; these are natural places for hyperlinks. I bet Bounty Quest would move a lot quicker if patents had hyperlinks.

  145. ClaraOCR by Jeff+Knox · · Score: 2

    http://www.claraocr.org/
    "Clara OCR is a free (GPL) OCR for systems that support the C library and the X windows system (e.g. most flavours of Unix). The development platform of Clara OCR is 32-bit Intel running GNU/Linux.

    Clara OCR is intended for large scale digitalization projects....."

    Haven't tried it, but it looks good.

    --
    Jeff Knox
  146. Re:Copyright Infringement? 2800 years by Anonymous Coward · · Score: 0

    So.. it isn't "fair use" to copy your -own- book for your -own- use?

    oh-k

    If that's legally the case, then maybe /. readers don't consider the 'most obvious cases' because /. readers know that some laws need to be rewritten.

  147. Re:Do you really need them? (Red Rubber Ball) by HeyLaughingBoy · · Score: 1
    The engineer turns the ball over until he finds its serial number, then looks up the volume for that model on his Red Rubber Ball Table.


    I'm working on a home motor control project using an Atmel MCU. All my Atmel data is on CDROM or their website. I'm just about ready to call a local Atmel rep and have them send paper books to my office cause it is sooo much easier to flip through pages than to page through pdf files. Not to mention all those times you want to quickly flip back and forth between two sections. Having multiple browser windows open to do this just takes up valuable resources, which makes the entire development system run slower. Electronic documentation is not always the best.
    Now, my 50 issues of Circuit Cellar on CD: that's a whole other deal :-)
  148. My GreatGrandfather!!! by itwerx · · Score: 2

    "A man observed by the celebrated Dutch physician Hermann Boerhaave took his meals at a table that had been cut away in a semicircle to accommodate his circumference"

    No kidding. I never saw him, but my Grandmother has stories about this.
    (But I weigh all of 170# without any flab at all. :)

  149. Linux OCR: Try claraocr.org by madstork2000 · · Score: 1

    Try http://www.claraocr.org/ I tried it a while ago and it worked well, but haven't had the need for ocr in a while. -ms2k

  150. Re:Copyright Infringement? 2800 years by odin53 · · Score: 1

    You have a valid point; why do you hide behind anonymity?

    Think about it this way. You had one book. You scan it in to make a copy. You now have two books, one of which you didn't pay for. This is a very, very simple case of copyright infringement.

    Problem is, /.'ers only have a lot of experience with dealing with software and music, both of which have statutory exemptions. Section 117 lets you make a backup copy of software, and the Audio Home Recordings Act lets you do the same for, well, audio recordings. But there's no such exemption for books.

    Look, I'm not saying there should or shouldn't be a copyright violation. I am saying, though, that it's important to know what the laws say. It's also important to be able to think about why we have exemptions for audio recordings and software. To start off your thinking, think about this: fair use DOES allow you to make partial copies of a copyrighted work (assuming some other factors). This you and everyone else have probably been doing for years. How often do you ever need to copy an entire book? Really, if you did that, given the nature of what you're doing, it should be (rebuttably) presumed that you're infringing. But with music and software, you almost always *have* to copy the entire work, if you're going to have some sort of fair use. (If you only copy a bit of a piece of music, then that, given other factors, is then a pretty simple case of fair use.) Since people who were trying to fairly use software and audio recordings kept running into the copyright infringement problem, Congress passed exemptions for those two cases.

    That would be the backdrop. What do you think? Do the laws "need" to be rewritten? Should books have a statutory ban? (Realize, though, that would pretty much gut our meaning of "fair use" and a lot of copyright jurisprudence. Which may or may not be a bad thing, of course.)

  151. Journal Store by RockDoctor · · Score: 1

    2 points:
    Firstly, many people are reading the questioner's comment about "100lbs of TECHNICAL books" to mean "100lbs of COMPUTER books". Just looking over to my bookshelves, sure there's a good few computer books out there; also about 30 kilos of reference works on palaeontology, some with print runs that made it to 3 figures; also a few tens of kilos of mineralogy references; lots of oilfield structure and stratigraphy analyses and reports ... Very little chance of any of them being on the web anywhere, particularly the "in-house" ones 25 years old.
    Second point: a number of the learned journals are addressing this very issue because of library storage space issues. Go to Jstor and particularly to the process description to see how one industrial-scale program goes about doing this. Note in particular the twin parallel paths they use: OCR to produce searchable indices but delivering fax quality PDF images of the original journal pages to preserve complex images, typography and editorial quality.
    Another interesting source might be http://www.octavo.com , who amongst other things produce high-quality PDF distributions of historical documents, again linked to a searchable back end produced by OCR.
    The combination of batch, offline OCR and PDF'd images to automatically generate some sort of usable indices to the images seems to have been selected by a number of independent groups.
    If you can't be bothered to go the whole hog to build a database, at least scan in the indices so you can do as much searching in the scanned books as you could in the originals.
    Not JPEG images - TIFF or PNG. That's a no-brainer. Image contrast is the issue here (for the text sections at least), not absolute image size. My reference library stretches to 1.6GB and is growing steadily (and yes, most of it is copyrighted and legal).
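
    To make the "OCR for searching, page images for reading" split concrete, here is a small sketch in Python. It assumes you already have one chunk of (imperfect) OCR text per page image; the file names, sample text and function names are invented for illustration and have nothing to do with how Jstor or Octavo actually implement it.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each lowercase word to the set of page-image names whose OCR text contains it."""
    index = defaultdict(set)
    for image_name, ocr_text in pages.items():
        for word in WORD.findall(ocr_text.lower()):
            index[word].add(image_name)
    return index


def search(index, query):
    """Return the page images that contain every word of the query, sorted by name."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical OCR output keyed by the image file it came from.
    pages = {
        "vol12_p001.png": "Stratigraphy of the North Sea basin",
        "vol12_p002.png": "Trilobite rnorphology and stratigraphy",  # OCR slip left as-is
    }
    index = build_index(pages)
    print(search(index, "trilobite stratigraphy"))
```

    The reader only ever sees the page image, so OCR slips cost you a missed hit on that one word rather than a corrupted text.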

    --
    Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
  152. Re:Don't use JPEG... well it depends... by Anonymous Coward · · Score: 0

    Actually I *do* use jpeg for storage. I scan the originals at 150x150 dpi RGB.

    Why? Because mostly I'm doing archival documents. There is more than text on most pages. Many documents have text over images, text over colours, and so forth.

    I give the documents good long descriptive titles in a format

    yyyymmdd-source-title-author-keywds-pageno

    or

    yyyymmdd-docname-pageno-articlename-author-keywds

    so they all collate nicely and I can usually find the relevant item with locate.
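
    A rough sketch of generating names in the first format with Python. The underscore separators inside a field, the zero-padded page number and the .jpg extension are assumptions added here so that names sort cleanly and stay shell-friendly; the format above doesn't specify them, and the function names are made up.

```python
import re
from datetime import date


def slug(text):
    """Lowercase the text and collapse anything but letters and digits into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def page_filename(day, source, title, author, keywords, page, ext="jpg"):
    """Build a yyyymmdd-source-title-author-keywds-pageno file name."""
    fields = [
        day.strftime("%Y%m%d"),
        slug(source),
        slug(title),
        slug(author),
        "_".join(slug(k) for k in keywords),
        f"{page:04d}",  # zero-padded so page 10 sorts after page 9
    ]
    return "-".join(fields) + "." + ext


if __name__ == "__main__":
    # Hypothetical document details, for illustration only.
    print(page_filename(date(2002, 5, 14), "Acme Corp", "Annual Report",
                        "J. Smith", ["budget", "q1"], 3))
```

    Because the date comes first, a plain directory listing is already in chronological order, and locate or a shell glob on any field will find the page.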

    A very small percentage of documents are pure black text on white paper unless you are doing plain text books, in which case you may well be right.

    It all depends what you are up to. If you were a museum archive (or law firm), you'd probably do tiff at 600x600 or better so you'd have a good image of Einstein's coffee cup ring on his notebook page :-)

    jpegs do quite nicely for me. Over a period of 5 years I've probably done 100K pages that way, mostly corporate or personal documents or research materials. I have no trouble at all reading them on screen, and although open source OCR sucks at present, it will eventually reach a point of sufficiency, at which time I'll just back the images with .txt versions so I can both grep and use the original document.

    Dale Amon, amon from the vnl.com domain.