Google Pushes Open Source OCR

Sign of times to come? by Anonymous Coward · 2007-04-10 06:05 · Score: 3, Interesting

Now that they will be able to recognize and tag our images, I wonder if Picasa will finally get increased storage. Google will be able to deliver targeted ads based on our pictures.

Re:Sign of times to come? by Instine · 2007-04-10 07:48 · Score: 4, Insightful

What about a free service to upload scanned images to and receive html in return?... Please....

--
Because you can - or because you should?
Re:Sign of times to come? by EmperorKagato · 2007-04-10 07:59 · Score: 1

Wait.

--
----- You know you have ego issues when you register a domain in your name.
Re:Sign of times to come? by drinkypoo · 2007-04-10 09:25 · Score: 1, Insightful

Now wait a second... you would rather upload a scanned image, which should be at a pretty decent resolution if you want good results, than run the OCR software locally? What, are you using a system with a 33MHz CPU or something?

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Sign of times to come? by Instine · 2007-04-10 10:51 · Score: 1

If it does a better job than my local (currently non-existent) OCR software, then yep! I'll use that, and not worry about installing something new on my system. I like Google Docs for similar reasons.

--
Because you can - or because you should?

From? by Anonymous Coward · 2007-04-10 06:05 · Score: 0

from the google-has-taken-all-knowledge-to-be-its-provice dept.

Did you mean: province

Re:From? by MightyYar · 2007-04-10 08:25 · Score: 2, Funny

Ha! It even works on the whole string:
Did you mean: google-has-taken-all-knowledge-to-be-its-province

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.

Build instructions are outdated by What+the+Frag · 2007-04-10 06:06 · Score: 2, Informative

Use this line to check out ocropus: svn co http://ocropus.googlecode.com/svn/trunk/ ocropus

Re:Build instructions are outdated by mattatwork · 2007-04-10 08:06 · Score: 1

I tried your link, but it turned out to be a dead link....

Try the project's Subversion repository. You can access the trunk and then the 28 or so other links....

--
I've refrained from profanity, racial/ethnic epitaphs and am 5'11" - how can I be ranked as troll?
Re:Build instructions are outdated by Anonymous Coward · 2007-04-10 08:57 · Score: 0

billions of captcha images are cringing in their digital repositories...

search different

The goal of the project by user24 · 2007-04-10 06:07 · Score: 4, Insightful

The goal of the project is to stop the damn email image spammers.

among other things, sure, but it's got to be a high priority for google.

Re:The goal of the project by sammy+baby · 2007-04-10 06:15 · Score: 3, Insightful

And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)
Re:The goal of the project by UbuntuDupe · 2007-04-10 06:53 · Score: 2, Informative

Isn't that the same principle behind PGP? Correct me if I'm wrong (and I freely admit encryption is not my area of expertise), but to crack (in reasonable time) PGP-encrypted data, you have to solve a problem no one in the world has been able to solve yet (quick solution for a certain class of problems). Similarly, if captchas get to the point where you need a major theoretical advance to beat them, thanks to wide use of OCR-type programs, that would either foil all spammers, or cause them to solve a mathematically/AI-significant problem.

I'm wrong, eh?

--
Apology to Ubuntu forum.
Re:The goal of the project by user24 · 2007-04-10 07:04 · Score: 1

spot on, I think.
Re:The goal of the project by ajs · 2007-04-10 07:25 · Score: 3, Interesting

The goal of the project is to stop the damn email image spammers.

among other things, sure, but it's got to be a high priority for google.

I don't buy either one. I think the goal of the project is to get sued.

Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.

I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....
Re:The goal of the project by w_mute · 2007-04-10 07:25 · Score: 1

>The goal of the project is to stop the damn email image spammers.
>
> among other things, sure, but it's got to be a high priority for google.

OCR's application to image spam is useful but limited without lots of tweaking. OCR is geared toward dealing with readable text. Image spammers are already doing font swapping, kerning tweaks, applying image rotation to subsections, random backgrounds, etc. Warping text similar to CAPTCHAs isn't that much further along.

Also, OCR is much more computationally expensive than other text/image recognition methods. Anti-CAPTCHA algorithms can be used to segment and recognize warped text, but it's much more problematic (and expensive) than plain OCR. OCR may be an OK last resort, but there are other less finicky, faster methods that work on most image spam.

-Greg
Re:The goal of the project by slashbob22 · 2007-04-10 07:50 · Score: 4, Insightful

Ok, I'll bite and play DA for a bit.

Why Google wouldn't want this:
1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.

IANIGHQ (In Google's HQ) but I don't see the value of getting sued at this point in time. Besides, if Google is doing this under appropriate conditions there shouldn't be concern of suits - but I suppose their Chinese plagiarism case doesn't support this point.

// End DA

--
Proof by very large bribes. QED.
Re:The goal of the project by m0rph3us0 · 2007-04-10 07:50 · Score: 1

Yeah, it's not like a bunch of Comp Sci students couldn't figure out an algorithm to break them.

http://www.cs.sfu.ca/~mori/research/gimpy/
Re:The goal of the project by ajs · 2007-04-10 08:36 · Score: 3, Insightful

Why Google wouldn't want this:
1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.

Google takes the same stand on patent reform as IBM, as far as I know: the current law hurts innovation. They're not looking to have all of their patents stripped, just to reform the system so that innovation is encouraged. At the very least, IBM has (and I think Google too) lobbied for open source exemption. Keep in mind that IBM and Google hold tons of patents, but they mostly use them as a "warchest" to dissuade others from filing patent-related suits.

2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.

I don't buy that one. Patent and copyright law are radically different, and in the copyright case Google is just trying to argue for existing interpretation of the law, not a change.

Google is doing this under appropriate conditions there shouldn't be concern of suits

That's not how patent law works. If someone holds a patent on looking at the pixel to the left of the one you're evaluating, and Google's software does that, then the holder could sue. What's more, there are many dozens of such simple patents surrounding OCR. It's probably the second-most over-patented area of CS next to color-space management.[1]
Re:The goal of the project by cheater512 · 2007-04-10 10:06 · Score: 1

I'm already using gocr to ocr every image in my email. Works very nicely.

So much for captcha by Red+Flayer · 2007-04-10 06:08 · Score: 1, Redundant

Oh great. I, for one, do not welcome the increase in message board spamming.

--
"Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai

Re:So much for captcha by UbuntuDupe · 2007-04-10 06:35 · Score: 1

Well, I don't welcome reliance on security by obscurity.

--
Apology to Ubuntu forum.
Re:So much for captcha by cyphercell · 2007-04-10 06:42 · Score: 3, Informative

Captcha (warped text) will probably remain for a long time. This OCR has more practical uses when applied to text that is meant to be legible.

--
Under the influence of Post-Cyberpunk Gonzo Journalism
Re:So much for captcha by Gregory+Cox · 2007-04-10 06:43 · Score: 1

Captchas are a good thing, but taking a long-term view, isn't it a better thing that technology is progressing? I'm sure the positive uses of OCR outweigh the problem of spamming, and it'd be a shame if no-one wanted to work on OCR just because of captchas.

--
If you all Google Slashdot, will it Slashdot Google?
Re:So much for captcha by triso · 2007-04-10 09:16 · Score: 1

Captcha (warped text) will probably remain for a long time. This OCR has more practical uses when applied to text that is meant to be legible.

Captchas don't have to be text only nor do they have to be a simple "Type what you see in the box" question. How about, "Type the letters that remain after removing the vowels," or "How many solid blue or red squares can you see?"
Re:So much for captcha by Fred_A · 2007-04-10 19:17 · Score: 1

Those can quickly become language centric. The purely "copy the letters and numbers" captchas are language agnostic. That's a big bonus.

--

May contain traces of nut.
Made from the freshest electrons.

The beginning of the end? by Iphtashu+Fitz · 2007-04-10 06:08 · Score: 3, Insightful

... for Captchas? If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.

Re:The beginning of the end? by X0563511 · 2007-04-10 06:32 · Score: 1

When the computer can parse a Captcha better than a human can... it means that we need to move on to something else. What that else is (do NOT mention kitten-captcha) I don't know.

--
For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:The beginning of the end? by lawpoop · 2007-04-10 06:38 · Score: 5, Informative

I doubt it.

Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.

A good captcha has a non-sense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.
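To make the "clues" point concrete, here is a minimal sketch of dictionary-assisted correction, one kind of language prior that helps on real text and does nothing for a nonsense string. The tiny word list and the `correct_word` helper are assumptions made up for illustration; this is not how OCRopus or any particular engine is known to work.

```python
import difflib

# Tiny stand-in vocabulary (assumption); a real system would use a full
# dictionary or a statistical language model.
VOCAB = ["google", "knowledge", "province", "image", "letter"]

def correct_word(raw, vocab=VOCAB, cutoff=0.8):
    """Snap a noisy OCR token to the closest dictionary word, if one is close enough."""
    matches = difflib.get_close_matches(raw.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

print(correct_word("provlnce"))  # a misread dictionary word gets repaired
print(correct_word("xK7qZp"))    # a random captcha string has nothing to snap to
```

With real prose the dictionary quietly fixes single-character misreads; with a random captcha string every character has to be recognized on its own merits.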

--
Computers are useless. They can only give you answers.
-- Pablo Picasso
Re:The beginning of the end? by dotoole · 2007-04-10 06:43 · Score: 1

OCR is already at the stage where simply distorting letters isn't sufficient anymore. The real trick now is to generate the letters and background clutter in such a way that the software cannot segment the image into separate characters.
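A rough sketch of why segmentation is the weak point, assuming the simplest possible approach: split the image wherever a column contains no ink. The `segment_columns` function and the toy bitmaps are illustrative assumptions, not any real OCR engine's method.

```python
def segment_columns(bitmap):
    """Split a binary image (list of rows of 0/1) into character spans.

    Returns (start, end) column ranges, end exclusive, separated by
    columns that contain no ink at all.
    """
    width = len(bitmap[0])
    ink = [any(row[x] for row in bitmap) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a glyph begins
        elif not has_ink and start is not None:
            spans.append((start, x))       # a blank column ends the glyph
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

clean = [
    [1, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
]
# Same glyphs with one clutter stroke drawn across the gaps.
cluttered = [row[:] for row in clean] + [[1, 1, 1, 1, 1, 1, 1]]

print(segment_columns(clean))      # [(0, 2), (3, 5), (6, 7)]
print(segment_columns(cluttered))  # [(0, 7)]
```

A single line through the gaps is enough to merge three glyphs into one blob for this naive splitter; smarter segmenters exist, but that is the arms race being described.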
Re:The beginning of the end? by walt-sjc · 2007-04-10 06:48 · Score: 1

Then we need to move to simple logic questions such as "what is the sum of 5 and 4?" or "how many inches in a foot", etc.
Re:The beginning of the end? by mrchaotica · 2007-04-10 06:56 · Score: 1

Well, your first question fooled it (probably due to the unusual phrasing), but Google can already answer your second one.

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Re:The beginning of the end? by user24 · 2007-04-10 06:58 · Score: 4, Insightful

Please, please, please, everybody, stop claiming that "what is 2+2?" is a hard AI question. I could code something in an hour to defeat most of this sort of question, and give me a week and a budget and I'll write something to get past 95% of these types of questions.

If the text is parsable, it takes nothing to google it.
I mean, those two examples you give; just slap it into google and screenscrape it. So you're going to need harder questions than that.
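For a sense of scale, a quick sketch of the "code something in an hour" idea for the arithmetic kind of question. The word lists, the `solve` name, and the two-operand restriction are all assumptions for illustration; a knowledge question like the inches-in-a-foot one would need a lookup this sketch does not attempt.

```python
import re

WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# (pattern that signals the operation, function that performs it)
OPS = [
    (r"\b(sum|plus|add|added)\b|\+", lambda a, b: a + b),
    (r"\b(difference|minus|subtract)\b|-", lambda a, b: a - b),
    (r"\b(product|times|multiplied)\b|\*", lambda a, b: a * b),
]

def solve(question):
    """Answer a two-operand arithmetic question, or return None if unsure."""
    text = question.lower()
    tokens = re.findall(r"\d+|[a-z]+", text)
    nums = [int(t) if t.isdigit() else WORDS[t]
            for t in tokens if t.isdigit() or t in WORDS]
    if len(nums) != 2:
        return None
    for pattern, fn in OPS:
        if re.search(pattern, text):
            return fn(*nums)
    return None

print(solve("what is the sum of 5 and 4?"))  # 9
print(solve("What is 2+2?"))                 # 4
```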

So the next generation of crapchas will ask "what color is the sky".
Go and take a glance at ultraHal or another relatively advanced NLP AI; a large knowledgebase is not hard to construct. When it doesn't know, it guesses. If it gets it right, then the knowledgebase increases by one fact.

So then, what, you have to ask "Given that all bleeps are blue, and blank is a bleep, is blank blue?"
Not only is that also easily computationally solved, but also a lot of people aren't going to be able to answer (smartass questions about stopping spam and idiots aside)

So *then* I suppose you have to ask "In the first mathematical antinomy, does Kant conclusively prove both that there can have been no beginning to time and that there must have been a beginning to time?"
and give the user a 255 character textarea to put their answer in.

So... please, text question based captchas are DOOMED TO FAIL. stop thinking that they could work. They can't.
Re:The beginning of the end? by dimeglio · 2007-04-10 07:14 · Score: 1

I must be a computer/cyborg. I have trouble reading 50% of captchas (on first try). Can't wait to get this enhancement.

--
Views expressed do not necessarily reflect those of the author.
Re:The beginning of the end? by Matt+Perry · 2007-04-10 07:16 · Score: 1

If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.
I hope so. I'm looking forward to a Firefox extension that'll let me decode a captcha so I don't have to figure it out. Some of the captchas I've seen lately are so confusing, with warped text, noise, and fonts that make zero and oh look identical, that I have to go through two or three of them before I can get an entry correct.

--
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
Re:The beginning of the end? by thePowerOfGrayskull · 2007-04-10 07:42 · Score: 2, Funny

A good captcha has a non-sense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.

Hell, if we obscure it enough it can be practically buried under geometric noise; and once we do that, we've solved the AC problem on /.!
Re:The beginning of the end? by asninn · 2007-04-10 07:46 · Score: 1

So... please, text question based captchas are DOOMED TO FAIL. stop thinking that they could work. They can't.

While I agree with you in principle, I think your definition of "work" with regard to captchas is flawed. Captchas don't need to be 100% undefeatable; they just need to work well enough so that the time/energy/computing power/manpower/money needed to solve them en gros makes sure doing so isn't worth it to the spammer.

Your claim that they're useless because they don't work perfectly makes as much sense as saying that postage paid on snail mail letters doesn't make sense since it's possible for postal spammers to just shell out the amount necessary to send a letter, anyway (especially given that they'll receive bulk discounts). Still, in reality, I hardly ever get postal spam; the rate is probably less than 1 unsolicited letter per month, while my email spam, on the other hand, is measured in thousands of mails per day.

I'd argue that the fact that bulletin boards, blogs etc. are generally pretty spam-free proves that captchas ARE working - not perfectly, but well enough.

--
butter the donkey
Re:The beginning of the end? by el_gordo101 · 2007-04-10 07:56 · Score: 0

So *then* I suppose you have to ask "In the first mathematical antinomy, does Kant conclusively prove both that there can have been no beginning to time and that there must have been a beginning to time?" and give the user a 255 character textarea to put their answer in.
Why 255 characters? Wouldn't a couple of radio buttons suffice?

O Yes | O No

Now the bots will get in 50% of the time, even if they are only taking a guess. I think a captcha would work better.

--
TODO: Insert witty sig
Re:The beginning of the end? by pushing-robot · 2007-04-10 08:03 · Score: 1

If the text is parsable, it takes nothing to google it.

Isn't it obvious? Engrish captchas.

--
How can I believe you when you tell me what I don't want to hear?
Re:The beginning of the end? by Iphtashu+Fitz · 2007-04-10 08:19 · Score: 2, Informative

Part of what makes OCR work is that it assumes that the text was written to communicate meaning.

As computing power continues to grow that kind of assumption is less and less important. Ten years ago I worked for a speech recognition company that developed tools similar to what Google is now using for their 800-GOOG-411 search line. Back then the state of the art was to carefully guide what a caller was likely to say, and to rely on massive dictionaries to help with the recognition. Now, 10 years later, with more research and more powerful computers, it's much easier to develop more free-formed speech recognition systems that can accurately recognize arbitrary strings of numbers/letters. (account numbers, phone numbers, etc) Given that the capabilities of speech recognition systems have grown so much I'd be willing to bet that OCR capabilities have grown in similar ways.
Re:The beginning of the end? by Reverend528 · 2007-04-10 09:29 · Score: 1

I could see it eventually becoming good enough to parse at least some types of captchas.

That'll be great. Then the spammers can crack the weak captchas to get free e-mail addresses and can flood everyone's inbox with captchaesque text that's strong enough to fool the OCR. This seems like a brilliantly thought-out plan.

--
Badass Resumes
Re:The beginning of the end? by Anonymous Coward · 2007-04-10 09:33 · Score: 0

http://www.google.com/search?hl=en&en-US%3Aofficial&hs=U32&q=5+plus+4&btnG=Search

?
Re:The beginning of the end? by user24 · 2007-04-10 09:35 · Score: 1

that is true. even if you just have three submit buttons, and only one submits to the right place*, you'll still cut your spam by 66%.
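The arithmetic behind that figure (and the 50% for a yes/no choice) can be sketched as below, under the assumption that a bot picks uniformly at random and may retry; `pass_rate` is a name invented here for illustration.

```python
def pass_rate(choices, attempts=1):
    """Chance that a bot guessing uniformly at random gets through at least once."""
    return 1 - (1 - 1 / choices) ** attempts

print(pass_rate(2))      # yes/no radio buttons: 0.5
print(pass_rate(3))      # three submit buttons: about 0.33, so about 66% blocked
print(pass_rate(3, 10))  # the same bot allowed ten tries gets through almost always
```

The single-try numbers look decent; the retry case shows why a small fixed set of choices only helps while nobody bothers to automate the guessing.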

But that is only true today.

If everyone did that, spammers would soon figure out the system and bypass it.

Captcha authors** are trying to avoid an arms race. sure, you can upgrade your simple crapcha every 6 months to keep up to date with spammers, or you can put a good one in place once. Much better the latter, methinks.

It's not about what spammers are doing now, it's about what they could do tomorrow.

* of the other two, one tells the truth all the time, one lies all the time, and the other stabs people who ask tricky questions

** (like myself)
Re:The beginning of the end? by user24 · 2007-04-10 09:43 · Score: 2, Insightful

because the answer is up for debate. I'm currently writing a 5000 word paper on it. both answers "yes" and "no" are right, depending on your reasoning, and whether you believe non-Euclidean geometries are anything more than an intellectual curiosity. :-)
Re:The beginning of the end? by UbuntuDupe · 2007-04-10 09:48 · Score: 1

Okay, then how would you defeat a text captcha like:

"alright now I want you to tell me basically, that number three, which ever number comes after it, wait, make that before it, what is that again?"

Would google or a knowledge base beat it? And you can arbitrarily increase the complexity like with pictures.

--
Apology to Ubuntu forum.
Re:The beginning of the end? by werfele · 2007-04-10 09:52 · Score: 1

Still, in reality, I hardly ever get postal spam; the rate is probably less than 1 unsolicited letter per month.
I'd like to know how you manage that. It's a good day that I have only 1 unsolicited letter. We get something like 10 to 12 per day, and we have sacks of the stuff to bring out for recycling. On the up side, I'll never have to buy address labels again (nonprofits tend to include them as incentive to actually open the envelope). On the down side, I hardly ever send mail anymore, so I don't have anything to use the labels on.

On the other hand, if it were free to send junk mail, I think I'd be buried in the stuff, so your point is well taken.
Re:The beginning of the end? by user24 · 2007-04-10 10:10 · Score: 1

if it's generated by an algorithm, it can be deconstructed using an algorithm.

find a representation of a number, find the last word relating to precedence that is not prefaced by a "not" word, do the math, enter the answer.
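Spelled out as a minimal sketch, with toy word lists as assumptions (the names `NUMBERS`, `OFFSETS`, `NEGATIONS` and `answer` are made up here, and "first number found" is one reading of "a representation of a number"):

```python
import re

NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
           "six": 6, "seven": 7, "eight": 8, "nine": 9}
OFFSETS = {"after": 1, "before": -1}   # precedence words
NEGATIONS = {"not"}                    # words that cancel the next precedence word

def answer(question):
    """First number found, plus the last precedence word not prefaced by a negation."""
    words = re.findall(r"[a-z]+", question.lower())
    base = next((NUMBERS[w] for w in words if w in NUMBERS), None)
    offset = None
    for prev, word in zip([""] + words, words):
        if word in OFFSETS and prev not in NEGATIONS:
            offset = OFFSETS[word]     # later matches overwrite earlier ones
    if base is None or offset is None:
        return None
    return base + offset

q = ("alright now I want you to tell me basically, that number three, "
     "which ever number comes after it, wait, make that before it, "
     "what is that again?")
print(answer(q))  # 2
```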

next super-hard conundrum?

notice, if I'd said
"find a representation of a number, find the last word relating to precedence, do the math, enter the answer."
you could reply "ahh, but what if I write this:"
"tell me basically, that number three, which ever number comes after it, wait, not before it, what is that"
the 'not' rule is easily accounted for.

you can make it as complicated as you like, but if the complications are only sometimes applied, I'll just keep re-requesting the text until I get a nice simple one
If they're always applied then you're going to end up with questions like this:

"three seven nine. take the second number of the previous sentence. add the first. times by two. ignore all of that and add the last to the first and minus the second, then move the number two up, what's that?"

and then the problem is that anyone who's either not english/american, or is stupid or lazy can't post.

and I could easily write a script to deal with that sort of sentence, too..
Re:The beginning of the end? by UbuntuDupe · 2007-04-10 10:27 · Score: 1

if it's generated by an algorithm, it can be deconstructed using an algorithm.

Yeah, but P probably isn't NP.

find a representation of a number, find the last word relating to precedence that is not prefaced by a "not" word, do the math, enter the answer.

Yes, you can write an algorithm for *that* particular sentence, and any other with the identical template. But give me a little credit here: I didn't ask you to write an algorithm to beat that template; I asked you to write an algorithm, given a site with that as the first string captcha. You don't yet know how it generates them. You don't know what it's randomizing across. Maybe it randomly chooses to say "that number three, the one that comes after, as differentiated from before, it." Then your algorithm failed -- or did you remember to include "as differentiated from" as a negation?

In fact, your algorithm as written doesn't work -- there are two numbers referenced -- "one" and "three". Which to pick?

But then, my randomizer changes the word order, and the junk words, and whether "ignore the previous sentence" is added. And so on. So your work at coming up with the algorithm gets longer and longer.

--
Apology to Ubuntu forum.
Re:The beginning of the end? by user24 · 2007-04-10 11:07 · Score: 1

"I asked you to write an algorithm, given a site with that as the first string captcha. You don't yet know how it generates them"
no you didn't, and if you had, it would have represented an unrealistic scenario. I can refresh your site a million times and then work out the way you perform the transformation. what's your point? mine is that by reducing a hard AI problem* to a text-processing problem, you've made spamming a fuckload easier.

"did you remember to include "as differentiated from" as a negation?"
if it appeared on your site while I was testing, then yes, I did.

unless you've come up with a program that can ask infinitely-random-yet-humanly-answerable-but-not-computer-answerable questions, in which case, go pick up that honorary degree from MIT instead of wasting your time with spambots.

"In fact, your algorithm as written doesn't work -- there are two numbers referenced"
no, there weren't.

* ok, it's becoming an easier AI problem, but still not as easy as NLP. remember that I (the spammer) don't have to get it right 100% of the time. 70% would do nicely for this task.
Re:The beginning of the end? by user24 · 2007-04-10 11:15 · Score: 1

of course, if we're talking only about what's practical rather than what's possible, then sure, your crapcha will work nicely to stop spam. for now. As I said above, even a simple "are you going to spam this forum" yes/no works for the moment. It won't work forever.

For the foreseeable future, only real image based captchas will work*, and even they are susceptible to human labour circumvention.

I don't really care about what works now, I want something that will work forever.

*any kind; kittenauth is a great idea, but normal warped text is still good (but I am biased in that regard, having written a popular PHP captcha...)
Re:The beginning of the end? by mrchaotica · 2007-04-10 11:19 · Score: 1

It didn't get the weird phrasing, i.e. "what is the sum of" instead of "add" or "+." Figuring out the former is a much more difficult natural language processing problem.

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Re:The beginning of the end? by Dr.+Spork · 2007-04-10 12:26 · Score: 1

There is a good point in your comment. This is what I take away from it: For any OCR software, there is a captcha that can defeat it. Therefore, the captcha generator must have an advanced OCR program as a part of it. It will publish a captcha only if its own OCR guesser can't correctly parse that captcha. That would be the best test to ensure that the captcha isn't machine-solvable. Of course, to implement this, you need an excellent OCR guesser, and lo, Google are about to start working on one!
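A sketch of that generate-and-test loop, assuming the renderer and the OCR guesser are supplied by the caller. `render` and `ocr_guess` are hypothetical placeholders, not real library calls, and the toy stand-ins at the bottom only exist so the loop can be run.

```python
import random
import string

def generate_hard_captcha(render, ocr_guess, max_tries=100, length=6, rng=random):
    """Return (text, image) for the first candidate the OCR guesser misreads.

    `render(text)` and `ocr_guess(image)` are caller-supplied callables;
    both are hypothetical stand-ins for a real renderer and OCR engine.
    """
    for _ in range(max_tries):
        text = "".join(rng.choice(string.ascii_uppercase) for _ in range(length))
        image = render(text)
        if ocr_guess(image) != text:   # publish only what our own OCR cannot read
            return text, image
    raise RuntimeError("OCR solved every candidate; distortion is too weak")

# Toy stand-ins: the "image" is just the reversed string, and this OCR fails.
text, image = generate_hard_captcha(render=lambda t: t[::-1],
                                    ocr_guess=lambda img: "")
print(len(text))  # 6
```

The catch is that this only proves the captcha beats your own guesser, and it says nothing about whether a human can still read the result.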
Re:The beginning of the end? by StikyPad · 2007-04-10 14:49 · Score: 1

Done and done.

--
https://www.eff.org/https-everywhere
Re:The beginning of the end? by will.murnane · 2007-04-10 15:27 · Score: 1

How about "What animal is this?" or even "Is this animal, mineral, or vegetable?" It's not very usable for blind people, granted, but you could try "What genre would you place this music in?" and accept a variety of answers per clip. People are still better at pattern recognition than machines in some categories.

PS: the captcha for /. is pretty awful. Oh well.
Re:The beginning of the end? by UbuntuDupe · 2007-04-11 02:07 · Score: 1

I was going to leave this alone, since it's clear you have no idea what you're talking about, but since apparently you're writing a paper on this and kept getting modded up, I'm not going to let it go as easily (though I doubt I'll match your persistence.)

no you didn't, and if you had, it would have represented an unrealistic scenario

Well, it was unrealistic to interpret my challenge as "break this one, specific, question", but whatever.

I can refresh your site a million times and then work out the way you perform the transformation.

Sure, you can spend a significant fraction of your life trying to beat the current implementation of the algorithm. The critical question is whether it requires more (or about the same) resources than it does to beat a picture captcha. Once you're spending weeks using a fairly intelligent person (how many times does one of the syntax forms have to come up before you detect the pattern?), you're better off hiring thirdworlders/passing them off to other people, than using a human to decipher the algorithm. And then when you finally do it ... oops, I switch to a completely different scheme, rendering your first moot.

what's your point?

That you completely trivialized the capabilities of text captchas, and continue to do so through strawman argumentation.

mine is that by reducing a hard AI problem* to a text-processing problem, you've made spamming a fuckload easier.

no, I've changed a hard AI problem into another hard AI problem that a human (through significant effort) can turn into a text-processing problem with evanescent usefulness -- like they can do with picture captchas.

unless you've come up with a program that can ask infinitely-random-yet-humanly-answerable-but-not-computer-answerable questions, in which case, go pick up that honorary degree from MIT instead of wasting your time with spambots.

*sigh* Picture captchas don't even meet the standard of "infinitely-random-yet-humanly-answerable-but-not-computer-answerable questions", and no, MIT wouldn't give me any award of any kind for making a good-enough text captcha.

"In fact, your algorithm as written doesn't work -- there are two numbers referenced"
no, there weren't.

Okay, in the first one, there was only one, but again, you completely missed the point by focusing on an extremely narrow problem.

--
Apology to Ubuntu forum.
Re:The beginning of the end? by FrostedChaos · 2007-04-11 07:41 · Score: 1

Captcha authors** are trying to avoid an arms race. sure, you can upgrade your simple crapcha every 6 months to keep up to date with spammers, or you can put a good one in place once. Much better the latter, methinks.
Yes, because we all know how much effort it is to put in a new captcha.
You have to update your bulletin board software, and then click a button. (Or edit a text config file.)

It's not about what spammers are doing now, it's about what they could do tomorrow.
No, it really is about what spammers are doing now.
If you're a web admin, spending time and money on inane hypotheticals is stupid.

Besides, if you believe the strong-AI hypothesis, there is no limit on what "spammers could do tomorrow."

--
"Any connection between your reality and mine is purely coincidental." -Slashdot
Re:The beginning of the end? by FrostedChaos · 2007-04-11 07:50 · Score: 1

If it's generated by an algorithm, it can be deconstructed using an algorithm.

False.

There are a lot of algorithms that are one-way in the sense that you can never reconstruct the input from the output.
There are even some algorithms where it is possible but computationally infeasible.

For example, it is relatively easy to multiply together large prime numbers. But factoring an extremely large number is very hard. This is the basis of RSA encryption.
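A small sketch of the asymmetry, assuming the most naive attack (trial division) and primes far smaller than anything used in practice. `factor_semiprime` is a name made up for this example, and real factoring attacks are much smarter, though still far slower than the multiplication.

```python
def factor_semiprime(n):
    """Recover (p, q) from n = p * q by trial division; returns None if n is prime.

    Fine for toy sizes, hopeless for the hundreds-of-digits numbers RSA uses.
    """
    if n % 2 == 0:
        return 2, n // 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    return None

p, q = 104729, 1299709       # small primes, for illustration only
n = p * q                    # the easy direction: a single multiplication
print(factor_semiprime(n))   # the hard direction: tens of thousands of divisions
```

Every extra digit in the primes adds little to the cost of multiplying and multiplies the cost of this search, which is the gap the argument relies on.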

And I could easily write a script to deal with that sort of sentence, too..

The point is that it costs you time and effort to do this. Which is all that the admin wanted to do.
If you actually spend some face time with your web browser, you can be as stupid as you want on unmoderated forums. (Hey, look at slashdot!)

You seem to have the misconception that there is some sort of "unbreakable captcha" that people should be spending their time on. Care to enlighten us as to what that is? Because I don't believe that such a thing can exist.

--
"Any connection between your reality and mine is purely coincidental." -Slashdot
Re:The beginning of the end? by zobier · 2007-04-11 14:30 · Score: 1

If the text is parsable, it takes nothing to google it. I mean, those two examples you give; just slap it into google and screenscrape it. So you're going to need harder questions than that.

So the next generation of crapchas will ask "what color is the sky".

But Google even answers that one.

--
Me lost me cookie at the disco.

the presidential papers by User+956 · 2007-04-10 06:10 · Score: 4, Funny

The goal of the project is to ... deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis

So, will it work on documents written in crayon? It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria. No, seriously.

--
The theory of relativity doesn't work right in Arkansas.

Re:the presidential papers by adickerson0 · 2007-04-10 06:57 · Score: 2, Funny

No need, Dubya turns his work in to the Secretary of Education so she can put a gold star on each page. While this may seem like a childish system, it is really the only sort of oversight he would agree to. The original plan was to scan everything and place an RFID Gold Star on each page for tracking, so that the Executive Branch's work could be preserved; however, this led to a few problems. Apparently the Sec of Ed got too busy and turned the work over to an intern. The intern decided not only to put a Gold Star on each page but actually started grading the papers. This led to the "Inbasion of Iwack Plans" scandal. Dubya's plan, which included a drawing of himself in a jet holding an American flag, was given an "A+ Good Work" stamp. This of course was given back to the President, who decided that if it was an "A+" then there was no way his plan would fail.
Re:the presidential papers by drinkypoo · 2007-04-10 10:22 · Score: 1

It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria.

I disagree. digg provides everything we need to research stupidity exhaustively.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Finally... by Searinox · 2007-04-10 06:10 · Score: 3, Interesting

An OCR system that runs on Linux. I've been waiting for quite some time for something like this.

Re:Finally... by Cocoronixx · 2007-04-10 06:31 · Score: 1

GOCR http://jocr.sourceforge.net/
Tesseract-OCR http://sourceforge.net/projects/tesseract-ocr

--
"Obscenity is the crutch of the inarticulate motherfucker." - cloak42
Re:Finally... by Anonymous Coward · 2007-04-10 06:45 · Score: 0

I run ABBYY FineReader 5.0 on my slackware laptop using WINE...
Re:Finally... by Feyr · 2007-04-10 07:00 · Score: 4, Insightful

have you tried gocr? it's nice as a random number generator, but besides that... it's pretty much garbage
Re:Finally... by stilbon · 2007-04-10 07:04 · Score: 2, Informative

Vividata OCR Shop XTR

http://www.vividata.com/index.html

It's not free software, but it works extremely well.
Re:Finally... by Anonymous Coward · 2007-04-10 07:08 · Score: 0

Google probably wants it to help with their effort to put the Library of Congress online. I was surprised not to see this linked as a recent related article.
Re:Finally... by Cocoronixx · 2007-04-10 07:11 · Score: 1

I've tried both, Tesseract gave way better results than gocr, for sure.

--
"Obscenity is the crutch of the inarticulate motherfucker." - cloak42
Re:Finally... by smchris · 2007-04-10 07:58 · Score: 1, Offtopic

Good one. Yeah, GOCR is crap.

As someone who was consistently getting high 90s% recognition on OmniPage with preservation of basic layout and images for work in 1996, linux is a non-starter and pathetically WAY, WAY behind in this area. It isn't even a GIMP vs. Photoshop ("Yeah, well GIMP is just different and 'special'!") argument. I'll look at a couple of the other suggestions here but I had basically just given up and said this is a linux blind spot.

So if Google _also_ wants to use it to torture kittens, or whatever, I'd have to say, "Well, let's weigh the pros and cons before we make a hasty judgement."
Re:Finally... by Anonymous Coward · 2007-04-10 08:29 · Score: 0

When you say "linux" here, you actually mean "open source" or "free software". Most high quality commercial OCR packages run on Linux.
Re:Finally... by Anonymous Coward · 2007-04-10 09:16 · Score: 0

torrent please!

Captchas by Radon360 · 2007-04-10 06:11 · Score: 1

So will something like this eventually render captchas used as a security/anti-spam measure obsolete?

Not like something wasn't bound to eventually come out to counter that idea, anyway.

Re:captchas by arrrrg · 2007-04-10 06:59 · Score: 3, Informative

Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

Um, the whole point of CAPTCHAS is that they are human-readable (albeit often not easily so) but not machine-readable.
Re:captchas by gEvil+(beta) · 2007-04-10 07:17 · Score: 1

Decipherable is different from readable.

--
This guy's the limit!
Re:captchas by MoriaOrc · 2007-04-10 07:21 · Score: 1

Captchas are far from human-readable (the good ones at least)

While I've run into not a few captchas that are not human-readable, I would argue that they are not, in fact, the good ones. Good Captchas are human readable, but extremely difficult to solve using automation (this, other OCR software, what have you).
Re:captchas by Anonymous Coward · 2007-04-10 07:48 · Score: 0

How so?
Re:captchas by AeroIllini · 2007-04-10 07:51 · Score: 1

Captchas are far from human-readable (the good ones at least)...

Yeah, that's why they suck.

Some forums, I have to try *four* times to get past the captcha, just to post a message about how libsomething won't compile.

If they really wanted good captchas, they need to start using problems that are very easy for humans to solve, but very hard for computers to solve. For example, picking the one photo of a puppy out of a matrix of photos of full-grown dogs.

Computers are currently really bad at recognizing images in photos, but they do a decent job of recognizing text with commercial OCR programs (that ability will only increase when there are some hardcore OSS versions available, such as Google's project). So why are we spending our time mangling the text so that neither computers nor humans can read it, and not focusing on something computers actually are bad at, like recognizing a puppy?

--
For security, the MD5 hash of this message and sig is 09f911029d74e35bd84156c5635688c0.
Re:captchas by ChaosDiscord · 2007-04-10 08:50 · Score: 1

If they really wanted good captchas, they need to start using problems that are very easy for humans to solve, but very hard for computers to solve. For example, picking the one photo of a puppy out of a matrix of photos of full-grown dogs.

Image identification raises its own set of problems. If you're working with photos, what is your source of photos? You're going to need a lot; if you've only got a thousand or so images, spammers will scrape them from your site and flag them by hand (possibly outsourcing the work, say through Amazon's Mechanical Turk). If you use a shared resource so you can get mind boggling numbers of photographs ("Bob's Puppy Captcha Service") you just create more incentive for attackers to index all of the service's images. (Yes, the service will try to make it hard, but armed with a botnet you can make large numbers of requests every day by visiting actual sites using the authentication system.) You won't want to go with free, publicly available images (say, pulling from Flickr's Creative Commons licensed images), since an attacker can easily scrape those images and pull the tags and description to automatically generate identification information. The only practical way to limit access to the database is to charge for it with a price high enough to discourage spammers, suddenly making the proposal fairly expensive.
Once you've got your system, you have to assume spammers are building up counter-databases that identify images they've seen before. You could try to fight back with simple distortions of the image to make them harder for a computer to identify, but you can't do much without making it hard for humans. And while having a computer identify objects in photographs is hard, having a computer identify highly similar images is pretty easy.
Ultimately a spammer or similar attacker interested in defeating Captchas doesn't need a 100% success rate. 50% is probably plenty; they're typically going for volume. Assuming a typical "three tries before we block your IP for a while or the exponential back off gets too long" system, they can get a 50% success rate on those three attempts with a mere 21% success rate on each attempt. If the attacker wants a 90% success rate over those three tries, they only need a 55% success rate on each attempt.
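
For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes each attempt succeeds independently with the same probability, which is a simplification, and the function names are made up for illustration.

```python
def overall_success(per_attempt, tries=3):
    """Chance that at least one of `tries` independent attempts succeeds."""
    return 1 - (1 - per_attempt) ** tries

def per_attempt_needed(target, tries=3):
    """Per-attempt success rate needed to reach `target` within `tries` attempts."""
    return 1 - (1 - target) ** (1 / tries)

print(round(per_attempt_needed(0.50), 3))  # 0.206
print(round(per_attempt_needed(0.90), 3))  # 0.536
```

So the 21% figure is right on the line for 50%, and the exact threshold for 90% is a bit under 54%, which the 55% quoted above comfortably clears.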

--
Search 2010 Gen Con events
Re:captchas by drinkypoo · 2007-04-10 09:18 · Score: 1

If you use a shared resource so you can get mind boggling numbers of photographs ("Bob's Puppy Captcha Service") you just create more incentive for attackers to index all of the service's images.

this problem is simply solved. create images with the subjects to be recognized in the center. now use image processing utilities to cut a semirandom rectangle out of the image (producing images of varying sizes) and to apply some effects to the image which will change all values in the image significantly without making it unrecognizable. Change the filename for each generated image.

If you produce images with little enough quality, this doesn't have to become a major bandwidth problem - and you can throttle that bandwidth anyway, especially if you use sessions and only require that people solve your image captcha once, or every n minutes, or what have you.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:captchas by user24 · 2007-04-10 10:00 · Score: 1

that solution is simply circumvented; crop the resultant images to a 5x5 pixel rectangle in the center, and use the md5 hash of that. i'm sure you'd end up with a workable lookup table after a while.

"but my transform function is changing the RGB values of each pixel"

yes, but with a 5x5 chunk you can account for +1/-1/0 deviations easily, and even higher transformations won't be too hard. I'm sure each image would represent a unique range of possible deviations. Even if you end up needing a 300GB database, that's not too hard to auto-generate, you know..
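
a rough sketch of the lookup key in Python, under some loud assumptions: the image is a plain 2D list of grayscale values (0-255), the crop keeps the true center in place, and instead of enumerating every +1/-1/0 deviation it quantizes each pixel into coarse buckets before hashing. the function names are made up, and values sitting right on a bucket boundary will still hash differently.

```python
import hashlib

def center_patch(pixels, size=5):
    """Return the size x size block at the center of a 2D list of gray values."""
    rows, cols = len(pixels), len(pixels[0])
    top, left = (rows - size) // 2, (cols - size) // 2
    return [row[left:left + size] for row in pixels[top:top + size]]

def fingerprint(pixels, bucket=16):
    """MD5 of the center patch after coarse quantization.

    Quantizing stands in for enumerating small per-pixel deviations; it is an
    illustrative shortcut, not a robust image-matching method.
    """
    patch = center_patch(pixels)
    coarse = bytes(v // bucket for row in patch for v in row)
    return hashlib.md5(coarse).hexdigest()
```

the fingerprint then becomes the key in the lookup table, with the hand-labelled answer as the value. if the crop shifts the subject off center, this simple version stops matching.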

imo, the only solution is a centrally managed, trusted database of known spammer IPs. yes, I know there are massive problems with that (new IPs entering spamhood, old ones being dropped and becoming legit). Those problems can be overcome, trust me.
Re:captchas by ChaosDiscord · 2007-04-10 10:05 · Score: 1

this problem is simply solved. create images with the subjects to be recognized in the center. now use image processing utilities to cut a semirandom rectangle out of the image (producing images of varying sizes) and to apply some effects to the image which will change all values in the image significantly without making it unrecognizable.

As I said, it's not good enough. Detecting a subset of an image is such a well understood problem that a $20 optical mouse does it 30 or so times per second. Ultimately you have to keep the focus of the image present, and eventually the spammer will identify that focus, making further comparison even easier.
As for mutilating the picture to make it harder to identify, it doesn't buy you much. Researchers have been working on identifying "similar" images for a long time. There are lots of reasons why people want the technology. A disorganized digital photographer might use it to find all images he's made from a given source; I've used such functionality myself. Companies doing copyright infringement crawls across the web face much the same problem. Detecting an "image" from a database that may be at a different angle, colored funny, and slightly modified is exactly the problem being tackled by security companies trying to do facial recognition of crowds. Yes, the technology sucks now, but remember that 50% success rate is plenty for a spammer, and the technology is going to only get better.
If we're looking for the next CAPTCHA, we need to look to areas that computer scientists are still baffled by. Generalized computer vision is a good idea, but an attacker can replace it with a much easier problem: image comparison. You want to avoid areas in which there has been years of productive research. This is just such an area.
One possible improvement that leaps to mind is procedurally generated images; that is, rendering "3D" images from models, with random (but constrained) angles, backgrounds, positions, colors, lighting, and poses. This way your image set can be extremely large. Unfortunately this means you'll still need a large number of models to have enough variation and ensure that someone doesn't find a simplified attack against the Human_Male_Tall model which is 1/20th of the source set. Similarly, you need to be careful with all of the possible variations as you can easily generate images that a human cannot distinguish. The more you constrain your randomization, the more likely that an attacker can find a simplified attack.

--
Search 2010 Gen Con events
Re:captchas by drinkypoo · 2007-04-10 10:09 · Score: 1

One possible improvement that leaps to mind is procedurally generated images; that is, rendering "3D" images from models, with random (but constrained) angles, backgrounds, positions, colors, lighting, and poses. This way your image set can be extremely large.

I've actually seen a prototype of this approach used. It's a nifty idea. But it's horribly computationally expensive compared to basically any other proposed alternative; a major site would spend far more using this approach than spammers would pay people to decode the captcha mturk-style.

One idea I like is turning other people's computers into compute nodes. They get a small applet dumped to them and then a block of work which they must complete before they get access. That way even if the spammers are bypassing your system by throwing a lot of compute time at it, you're getting a benefit. But that approach has a crapload of problems too of course.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:captchas by Anonymous Coward · 2007-04-10 12:22 · Score: 0

Oh for fuck's sake...

- CAPTCHAs are human readable, as the GP stated.
- Readable and decipherable are synonyms.
- We know what you meant, but instead of saying "OK true, CAPTCHAs are human readable, I meant standard formatted text" you've just proven you're an overly semantic ass by clinging to your original wording and trying to shoehorn in your own post-hoc definitions.

Have a nice day.
Re:captchas by AeroIllini · 2007-04-11 03:25 · Score: 1

And while having a computer identify objects in photographs is hard, having a computer identify highly similar images is pretty easy.

That's true, and admittedly not something I thought of when I first suggested the puppy pictures. But thinking about it more, all it means is that each image needs to be created on the fly, different enough from the last time the image was used that a simple hash or outline recognition would not be sufficient to match the image with a previous image, but still easily recognizable as a puppy for a human.

So I propose the following: instead of using photographs of actual puppies or kittens, let's model one in 3D, in a program like Blender. Then write some scripts that give the modeled character a number of poses, and an infinite variation between the poses with adjustable values. Then add in a few more adjustable characteristics (size of eyes/ears/mouth, color of fur, background image, etc.) and make sure the model is simple enough to be rendered very quickly. When the user requests a captcha, the computer renders a number of images with randomized values of those variable characteristics, and sends them to the user for identification. These renders will be small in size, and the models will be simple enough, that this should not take very long on a reasonably beefy webserver (you could outsource the actual rendering to a small renderfarm, too, if your operation can afford it). That way, the images are different every time they're generated, but they're not distorted text that's impossible to read--they are easily recognizable as "choose the turtle from this set of pictures of hubcaps", for example. Obviously, the more models you have in the database and the more adjustable those models' characteristics are, the better this service will be.
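
Here's a rough Python sketch of just the randomization step, to show what I mean. Everything in it is hypothetical: the parameter names, the ranges, and the idea that a separate renderer (Blender or whatever) would consume these dictionaries. The actual rendering is not shown.

```python
import random

def make_challenge(rng, target="turtle", decoy="hubcap", count=9):
    """Describe `count` renders: exactly one of `target`, the rest `decoy`.

    Returns (renders, answer), where `answer` is the index of the target.
    """
    answer = rng.randrange(count)
    renders = []
    for i in range(count):
        renders.append({
            "model": target if i == answer else decoy,
            "pose_blend": rng.random(),            # 0..1 blend between two key poses
            "yaw_degrees": rng.uniform(-45, 45),   # constrained camera angle
            "scale": rng.uniform(0.8, 1.2),
            "hue_shift": rng.uniform(-0.1, 0.1),
            "background": rng.randrange(20),       # index into a background set
        })
    return renders, answer

renders, answer = make_challenge(random.Random())
print(answer, renders[answer]["model"])
```

On a real site you would probably want `random.SystemRandom()` instead of the default generator, so an attacker can't predict the answer index from earlier challenges.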

--
For security, the MD5 hash of this message and sig is 09f911029d74e35bd84156c5635688c0.

Very cool. by Kadin2048 · 2007-04-10 06:17 · Score: 5, Insightful

I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while.

There is a major need for an OSS OCR package, and right now the field is pretty bare. There's GOCR, and a commercial offering called OCRShop, and, at least as far as I've run across, that's about it. Nothing really on par with Omnipage, or other commercial packages for other platforms.

I think there are some really neat applications for OCR that have never really been investigated, because it's so expensive to build that capability into other products. A free OCR engine that really worked could lead to some very neat book-scanning applications, just for starters. I don't think that there's really any integrated packages around for helping people scan books and manuscripts. (Right now you have to photograph the pages, keep them organized, then OCR them and proofread the text against the images. Bit of a nightmare.) I'd love to see a free application for libraries that let a user batch scan (via a digital camera -- let's not get into what I think of SANE and scanners generally) a book, and then provided a nice interface for proofreading the OCRed text against the original image.

Something like that could have a huge social impact. There are a lot of libraries where I'm sure they'd love to scan some of their out-of-copyright assets and provide them to patrons in a digital form, but it's just too technically complicated. An easy-to-use program that let the proofreading be done by nontechnical users (maybe remotely, as long as we're dreaming) could vastly increase the volume of digital materials available.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Re:Very cool. by Gregory+Cox · 2007-04-10 06:51 · Score: 1

Yes, this is good... but I'd be even happier if there were more definite plans about support for other scripts (like Japanese?) But that's probably a lot of work for something that's not a top priority. Maybe in a few years...

--
If you all Google Slashdot, will it Slashdot Google?
Re:Very cool. by CodeShark · 2007-04-10 06:58 · Score: 1

Actually for Japanese, etc. this would still be a godsend, because most OCR work comes from typed sources, and for "typed" Chinese characters (also used in Japan, etc.) there are only a limited number of fonts in use. That would presumably lead to a lot of work on identifying the font libraries and characters that cause problems in the OCR, and the gradual elimination of those problems by inclusion in the recognition files.

Essentially, what this would open up is a process of converting the vast library of pre-'Net hand-typed texts via OCR, and since it's open source, it doesn't necessarily have to run on Google's machines.

--
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
Re:Very cool. by CastrTroy · 2007-04-10 07:12 · Score: 1

You aren't going to get a good shot of the document with a digital camera for a lot of reasons. First of all, the lighting is uneven. Then there's the problem with the lens distorting things. Then there's problems with getting it to focus properly. I'm sure lots of people would love to point out other problems with using a digital camera to capture documents. It may work fine for a human looking at the picture, but it's going to make the job of the OCR program a lot harder. Even things like dust can throw off "good" OCR programs.

--
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Very cool. by ZERO1ZERO · 2007-04-10 07:24 · Score: 1

I get the feeling that is basically what this is: http://www.iiri.com/i2s/copibook.htm It looks a bit home grown, but it costs about £25,000, I believe. If you had some kind of framework, a lens with large DOF and a 6-10MP sensor, you could do this on the cheap, I reckon. Oh, and some skills in processing the resulting image.
Re:Very cool. by ZERO1ZERO · 2007-04-10 07:27 · Score: 1

Doh! I forgot to mention this thing runs Linux, which is why I was reminded of it in the first place. It's basically a PC hooked up inside the scanner. Slow Down Cowboy! Slashdot requires you to wait between each successful posting of a comment to allow everyone a fair chance at posting a comment. It's been 1 minute since you last successfully posted a comment Chances are, you're behind a firewall or proxy, or clicked the Back button to accidentally reuse a form. Please try again. If the problem persists, and all other options have been tried, contact the site administrator.
Re:Very cool. by smellsofbikes · 2007-04-10 08:39 · Score: 1

I'd love to have a reasonably accurate OCR to read LCD screens. It would make for a vastly cheaper automated test equipment market when I can use a cheap webcam and some cheap digital power supplies and handheld voltmeters to do mass measurements of power conversion efficiency and characterization. Right now, that's all done via GPIB or ethernet, either of which options adds about $600 onto instruments that already cost a minimum of $600. I have played with using GOCR, with my multimeter face-down on the scanner, but even with huge easy-to-read displays, the scanner/ocr accuracy is nothing to write home about.

--
Nostalgia's not what it used to be.
Re:Very cool. by Anonymous Coward · 2007-04-10 08:46 · Score: 0

Right now this is just basic bug fixes and enhancements to Tesseract, which was abandonware until Google came along. Hopefully these guys can modernize it and really improve upon the solid base they inherited, but it looks like their 20% pet project so progress will be slow and steady. So at the moment this is mostly undeserved PR, not yet a real contribution.
Re:Very cool. by Anonymous Coward · 2007-04-10 10:00 · Score: 0

Like http://books.google.com/ ?
Re:Very cool. by Anonymous Coward · 2007-04-10 20:48 · Score: 0

http://books.google.com/
Re:Very cool. by Ikester8 · 2007-04-11 14:58 · Score: 1

Probably the best format for the open OCR to look at is something many modern photocopiers create these days: PDFs. To use them now, you've got to convert them into TIFF files or other image format, but being able to perform OCR on PDFs seems like a natural solution. That, or fax images...

--
That's the last time I run code posted in somebody's sig...

Small price if it helps email spam. by Kadin2048 · 2007-04-10 06:22 · Score: 4, Insightful

And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)

True, but CAPTCHAs always seemed like a bit of an inelegant hack anyway. First, they're horrible from a disabled-access standpoint, and second they're really not all that effective against a concerted enemy when there's a lot of money on the line. Spammers can just pay a few kids in some Third World country to sit there all day and solve CAPTCHAs if they want to.

Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted. Lameness filters working on the same principles as email spam-filters are probably helpful, too.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Re:Small price if it helps email spam. by Pxtl · 2007-04-10 07:39 · Score: 4, Interesting

You've obviously never fought off a bb spammer. They don't use one or two accounts to spam one or two messages - they inundate the board from a long list of IPs. Even without spamming messages, they create hordes of accounts just for the pagerank provided by the links within their personal account pages. Plus, admin-approval-delays degrade quality for the user. It creates a huge headache all around to handle maintaining banlists and cleaning out garbage.

Captchas are by far the better solution.

The problem is that, long term, they will eventually be cracked. I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.
Re:Small price if it helps email spam. by mypalmike · 2007-04-10 07:56 · Score: 1

they're really not all that effective against a concerted enemy when there's a lot of money on the line... Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email.

Little fiefdom bulletin boards don't generally have a lot of money on the line, which is why captcha works so well. The cost of paying human captcha solvers is high enough that it's fairly rare to see spam on a captcha-protected site. The effect of captchas on my own tiny personal fiefdom brought spam down from a significant daily annoyance to zero. I simply don't get spam on my site anymore.

--
There are 0x40000000 types of people: those who understand 32-bit IEEE 754 floating point, and those who don't.
Re:Small price if it helps email spam. by Flwyd · 2007-04-10 09:42 · Score: 1

Even good OCR will have trouble with captchas. Heck, even I have trouble with captchas and I beat good OCR most of the time.

FWIW, LiveJournal, which is essentially several million easy-to-find blogs, has remarkably little comment spam without captchas. And most of the comment spams I've received are devoid of things like websites I could click on. I don't know what their technique is, though.

--
Ceci n'est pas une signature.
Re:Small price if it helps email spam. by Nullav · 2007-04-10 10:23 · Score: 1

Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted.

And for those boards/blogs getting bombarded with tens to hundreds of spambots a week? Spambots go after anything on the list that they can. That includes personal blogs and small community boards with only two or three administrators. As for the idea of IP-banning, a lot are behind proxies, so it only ends up as list bloat.
Lameness filters working on the same principles as email spam-filters are probably helpful, too.

"Lameness"? You mean like auto-deleting posts beginning with "hello, friend" (when posted by an unvalidated user) and adding the poster to a list of suspected spambots? I probably got the wrong idea from the word "lameness", but it does sound like a good idea.

Someone else has probably already posted this idea by now, but I think validation should be done with questions that can't be solved by a machine (...Well, currently.) such as "What object is in the picture to the right?" or maybe a simple riddle.

--
I just read Slashdot for the articles.
Re:Small price if it helps email spam. by user24 · 2007-04-10 12:13 · Score: 1

"what object is in the picture?"

dog
doggy
poodle
puppy
pet
animal
collar
chien
hund
lead

...and so on. it's more complicated than you think. (but not impossible, I concede)
Re:Small price if it helps email spam. by user24 · 2007-04-10 12:27 · Score: 1

re: third-world kids.

they could pay them, but only if they also paid for broadband and hardware first. third-world kids need water, not RSI. My point is that it's much more likely that your neighbor will sit there solving captchas for cash than some starving kid with cholera.
get some perspective, please.
Re:Small price if it helps email spam. by ralphdaugherty · 2007-04-10 14:26 · Score: 1

I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.

I do that for my board. Just banning .info and .biz and various former and present Communist country domains stops most of it. But I require a real ISP email address. I would have tens of thousands of registered users with spyware URLs if I didn't. So I have only a few legitimate registered users. :)

I used to add IP address ranges of foreign ISPs where a spam attack came from, but given my site content is of little interest to non-US readers, I finally just banned non-ARIN IP address ranges.

That's pretty much what it takes to block the attacks.

rd
Re:Small price if it helps email spam. by Skim123 · 2007-04-10 15:30 · Score: 1

The only surefire way is moderation of posts, but that, as you noted, slows the flow of the discussion. One option to help mitigate this (other than to have a large volunteer staff to help with moderation) is to moderate user accounts that have made less than X posts, and let the others go through freely. Yes, this too can be gamed, but I don't know how many spammers would have the patience to get 15 posts approved for an account to get in a handful of spam posts before their account would be suspended.
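
As a bare-bones illustration of that rule in Python (the function name and the default of 15 are just placeholders, not how any particular forum package does it):

```python
def route_post(approved_posts, threshold=15):
    """Return "hold" for accounts still on probation, else "publish"."""
    return "hold" if approved_posts < threshold else "publish"

print(route_post(3))    # hold
print(route_post(40))   # publish
```

Posts marked "hold" go to the moderation queue; everything else goes straight through.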

--
I could not justify my existence if I were a turkey farmer. Would I terminate myself? Undoubtably, yes.
Re:Small price if it helps email spam. by Anonymous Coward · 2007-04-10 19:14 · Score: 0

I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.

You're imagining wrong. Imagine something different.
Re:Small price if it helps email spam. by dkf · 2007-04-11 03:39 · Score: 1

I don't know how many spammers would have the patience to get 15 posts approved for an account to get in a handful of spam posts before their account would be suspended.
If they know it's just 15? Quite a few, alas. Best way to fix that problem is to not publicise the number of posts that need to be approved before approving the account. (It's probably also a good idea to offer a virtual view of the site to non-approved accounts so that people see their own changes, even if nobody else does except moderators. That encourages fools to rush in and show themselves up to be the spammers that they are, rather than playing the careful-careful game.)

--
"Little does he know, but there is no 'I' in 'Idiot'!"
Re:Small price if it helps email spam. by Pxtl · 2007-04-12 01:04 · Score: 1

Once again: you've never done this yourself. Moderation of posts just moves the problem out of the public eye - the spammer is a robot, it neither knows nor cares whether its attempts are successful. It will inundate the website with hundreds of posts, which moderators must manually crawl through for human posters. Whether these moderators are the site's admins or a crowdsourced force of slashdot-style mods, it's an enormous waste of people's time and energy.
Re:Small price if it helps email spam. by Skim123 · 2007-04-12 03:46 · Score: 1

Once again: you've never done this yourself.

That's a pretty bold assertion that is very far from the truth. I have set up, administered, and even created commercial forum software in the past. It's a topic I've given a lot of thought to because of my experience and history. I agree that moderation is a draining time sucker, but it's the only sure-fire way to stop spam. And it is possible. And it's not that bad if you only allow users with accounts to make posts, and you make creating an account a multi-step process that involves a CAPTCHA somewhere in the pipeline.

Look at the ASP.NET Forums, for instance (a messageboard site I helped moderate at one time). Here are the current stats: 278,860 users have contributed to 688,376 threads and 1,533,621 posts. And it's all moderated! Now, with the ASP.NET Forums moderators can mark users as "Trusted," which means their posts are automatically approved, so that cuts down on your frequent posters, but the vast, vast majority of users are moderated. But that's ok, because in most communities there are a select few who make the lion's share of posts. For example, in another messageboard I run, there are 861,398 posts in total. The top 10 posters account for about 150,000 posts, or nearly 18% of the total number of posts!

--
I could not justify my existence if I were a turkey farmer. Would I terminate myself? Undoubtably, yes.

Orcopus? by voice_of_all_reason · 2007-04-10 06:23 · Score: 4, Funny

Orcopus:

Level: 15
Race: Fell Marine
HP: 290/290
EP: 200/200
Water elemental
Drops: Tentacle

Wonderful! by jshriverWVU · 2007-04-10 06:27 · Score: 4, Insightful

This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.

Re:Wonderful! by Anonymous Coward · 2007-04-10 06:59 · Score: 1, Informative

This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.

Join the Distributed Proofreaders
and do any or all of:
1) do some proofreading or formatting of a PG text
or
2) Smooth read a near-finished text looking for overlooked oddities
or
3) help improve DP's processing software. Lots of extra features wanted...
or
4) Get copyright clearance, scan a book and upload to DP's OCR pool
or
5) Run your Windows OCR under WINE like I do...

More details here

One thing leads to another... by jojoba_oil · 2007-04-10 06:29 · Score: 2, Insightful

Okay, so one thing will lead to another and soon Google will be creating technology to recognize non-symbol shapes... How long before I can login to my G-Accounts by smiling at my computer?

Re:One thing leads to another... by Anonymous Coward · 2007-04-11 03:36 · Score: 0

LOL, I can just imagine a future where to get into your computer, you have to give it a special look. A "signature" look. The kind of facial expression only you make really well. And your wife wants to shoot the computer every time you do it ...

no rly gize, i no how 2 spel! by Anonymous Coward · 2007-04-10 06:39 · Score: 0

And Zonk has taken all editing to be his...provice.

captchas by gEvil+(beta) · 2007-04-10 06:40 · Score: 4, Insightful

All you people who are worried about this breaking captchas seem to be missing something--there have been a number of fairly decent OCR packages out there for a long time. The goal of this Google project is to create an open-sourced one that does a good job deciphering HUMAN-READABLE TEXT. Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

--
This guy's the limit!

searchable pdfs by radarsat1 · 2007-04-10 06:44 · Score: 4, Interesting

Anyone know of an open source utility that can convert scanned image-based PDF files into searchable PDFs ?
(Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)

Perhaps this library could be used to build such an application if none exists...

Re:searchable pdfs by Nasarius · 2007-04-10 07:11 · Score: 1

Not even the best commercial software (ABBYY, OmniPage) can do more than a half-assed job of that. If you want accurate, well-formatted results, expect to do a lot of manual work.

--
LOAD "SIG",8,1
Re:searchable pdfs by gEvil+(beta) · 2007-04-10 07:24 · Score: 1

Although Acrobat's OCR engine leaves a bit to be desired, the approach there works pretty well. You can have it create the OCR'd text page layout that uses the original image as an overlay. So, in essence you get a page that looks like the original scanned image, but that lets you highlight/select the text from the background text layer. I'm sure other programs out there can do this, too. None that are OS (to my knowledge), as per the GP's requirements, though.

--
This guy's the limit!
Re:searchable pdfs by Anonymous Coward · 2007-04-10 07:43 · Score: 0

1. run tesseract / gocr / other OCR package to generate the text

2. use a2ps utility to convert to postscript

3. use ghostscript ps2pdf to generate your pdfs
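
The three steps above can be sketched as a small script. This is only a sketch: the file names are hypothetical, and it assumes the tesseract, a2ps, and ps2pdf commands are installed and on the PATH. Note that the result is a PDF of the recognised text, not the original page image with a text layer behind it.

```python
import subprocess


def ocr_to_pdf_commands(image, base):
    """Return the three commands for the OCR -> PostScript -> PDF pipeline.

    `image` and `base` are hypothetical file names chosen by the caller.
    """
    return [
        # tesseract writes the recognised text to <base>.txt
        ["tesseract", image, base],
        # a2ps typesets the plain text as PostScript
        ["a2ps", "-o", base + ".ps", base + ".txt"],
        # ps2pdf (a Ghostscript wrapper) converts PostScript to PDF
        ["ps2pdf", base + ".ps", base + ".pdf"],
    ]


def run_pipeline(image, base):
    """Run each step in order, stopping at the first one that fails."""
    for cmd in ocr_to_pdf_commands(image, base):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("page1.tif", "page1") would leave page1.pdf next to the scan, provided all three tools are present.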
Re:searchable pdfs by auxsvr · 2007-04-10 08:47 · Score: 1

There exists a document format that is better suited for the task you describe in my opinion. Djvu files contain the image and the OCR text so as to be able to search them, while they compress a typical 500 page book (without images, black and white) in about 5 MB. All the tools are open source and djvu files are already supported in KDE. I have been using them with excellent results (I reduced a 100 MB PDF book with images into 20 MB djvu file without any noticeable loss in quality; actually they look much better when printed).
Re:searchable pdfs by Nasarius · 2007-04-10 08:49 · Score: 1

Yeah. Though IMO, Acrobat's OCR feature is only a toy at the moment, since the tools for manual intervention are awful. So you get things like images being interpreted as text, frequent mistakes not marked as "suspect", and total butchery of anything with umlauts, accents, etc. It does a nice job of deskewing scanned pages of text, though.

--
LOAD "SIG",8,1
Re:searchable pdfs by Anonymous Coward · 2007-04-10 11:27 · Score: 0

The indexing part of your question can be well handled by Greenstone. While notoriously difficult to configure, it is capable of indexing and presenting a lot of information in the form of a digital library. Quite a mature project, I used it for my bachelor thesis back at the University of the South Pacific in Suva, Fiji. All of the OCR still has to be done externally though.

Language? by ceeam · 2007-04-10 06:45 · Score: 4, Interesting

English only I suppose?

Re:Language? by fireboy1919 · 2007-04-10 07:07 · Score: 5, Funny

Since the official language of the Googleplex is Googlese, and the original project was developed by the US Census bureau - notorious for their use of no languages except Esperanto, it goes without saying (though I'm saying it anyway), that it will read only Klingon.

Remember kids, there are no stupid questions.
Only people who don't RTFA who ask questions.

--
Mod me down and I will become more powerful than you can possibly imagine!
Re:Language? by xlv · 2007-04-10 07:39 · Score: 1

It looks like the current OCR engine they use, Tesseract OCR, only supports English as its roadmap includes "support for languages other than English" but from a quick look at the various links, they are developing other engines as well.

Besides, the research group being based in Germany, you'd assume that German and latin based languages will be supported pretty soon...

What's wrong with kitten captcha? by brunes69 · 2007-04-10 06:47 · Score: 2, Insightful

When we can make a computer that can tell the difference between a kitten and an adult cat (or hell even another furred mammal) with any kind of accuracy, I think the LEAST of your problems at that point is coming up with captchas. You should be more worried about how you're going to escape from Skynet.

Re:What's wrong with kitten captcha? by maxume · 2007-04-10 08:31 · Score: 1

The solution isn't to escape, it's to make it friendly.

--
Nerd rage is the funniest rage.

Re:Captcha killer? by Anonymous Coward · 2007-04-10 06:51 · Score: 0

Could they be prosecuted under the DMCA for this?

Mathematics? by ObsessiveMathsFreak · 2007-04-10 06:56 · Score: 0

And will it be able to recognise and LaTeXify handwritten mathematics? The world and its mother can do OCR, but I've yet to see an honest attempt at making writing mathematics papers easier.

--
May the Maths Be with you!

Re:Mathematics? by frogstar_robot · 2007-04-10 08:07 · Score: 1

I suppose this could be used to build such a beast once it's a bit more fully baked. A good general purpose FOSS OCR is necessary for what you want even if it isn't entirely sufficient.
Re:Mathematics? by nireus · 2007-04-10 08:32 · Score: 1

I wouldn't say so, check http://www.inftyproject.org/ Their OCR claims 99% success on printed documents (I've tried it, and it's true). And wait a few years: there are some really promising papers out there, and I bet you'll be amazed at the number of people working on this problem since the late '90s. Three years from now I am almost sure you'll be able to enter any kind of math expression by hand using a digitizer (don't ask for handwritten offline OCR just yet, though :( )
Check out this guy as well: http://www.cs.berkeley.edu/~fateman/ His work is groundbreaking; I hope they will have a solid open-source system in a couple of years.

Re:Captcha killer? by Anonymous Coward · 2007-04-10 07:01 · Score: 0

yeah that's what they're trying to do you fucking idiot

Open Source Ballot Scanning! by Soong · 2007-04-10 07:05 · Score: 1

Ok, I got excited too early. Actually, ballot scanning is a specialized task and general purpose OCR probably doesn't play much of a part in that, but if any part of it does apply, then this is still awesome.

--
Start Running Better Polls

Re:Open Source Ballot Scanning! by drinkypoo · 2007-04-10 09:23 · Score: 1

Ok, I got excited too early. Actually, ballot scanning is a specialized task and general purpose OCR probably doesn't play much of a part in that, but if any part of it does apply, then this is still awesome.

The only thing you need to scan a ballot is Scan-Tron technology. It's a reflect/no-reflect tech like the optical write detect hardware in your floppy disk drive and very, very simple (as there is a sync signal at the side of the page.) Very little processing power is involved.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Adoption? by kurbchekt · 2007-04-10 07:13 · Score: 0

Hopefully, the Linux community will adopt some of this, as some of it can be utilized for accessibility. After perusing some patents from the 1800's, it's clear that Google has made some headway in this department. There were errors in translation (namely K's and R's/P's and B's), but for several documents, things come across as intended.

Re:Adoption? by Anonymous Coward · 2007-04-10 07:46 · Score: 0

As usual, the main reason OCR has not taken off in the open source world is patents. Americans have patents essentially on the concept of doing OCR. Now the hard bit of OCR,like most software, is implementing it, not having the idea "wouldn't it be nice if computers could, like, read text?". But in short, if you do all the hard work of implementing OCR, some patent troll will swoop in and claim it - unless, perhaps, there's a giant lump like Google backing you up (but that might _encourage_ the trolls...). Maybe in ten years time... But don't forget, Americans have quietly started pushing on the international for patent terms renewable beyond the traditional 20-year mark, so they may never expire if they get their way...

The world leader in closed-source OCR is based in and operates out of Russia, partly because they're Russian, but mostly for this reason.
Re:Adoption? by Anonymous Coward · 2007-04-10 09:16 · Score: 0

>Americans have patents essentially on the concept of doing OCR. Now the hard bit of OCR,like most software, is implementing it, not having the idea "wouldn't it be nice if computers could, like, read text?".

Be a dear and find an example of a US patent covering OCR that is on the basic idea and is not enabling of that idea. If you can't find one then, well you know....

>But in short, if you do all the hard work of implementing OCR, some patent troll will swoop in and claim it - unless, perhaps, there's a giant lump like Google backing you up (but that might _encourage_ the trolls...).

When has a troll enforced an OCR patent?

>Americans have quietly started pushing on the international for patent terms renewable beyond the traditional 20-year mark, so they may never expire if they get their way...

Where is evidence of this? Oh, I forget it's a secret. Woah, I almost got you there!!

My, my, we do have the gift for tall tales here now don't we.

Let's see, how do I put this?

put up or shutup, asshole

Leave the conversion to those skilled at it by kence · 2007-04-10 07:17 · Score: 1

I think the potential of new Google-backed OCR software is pretty high but I'm not certain that your average library would have the manpower and technical know-how to manage a book-to-ebook conversion, Google OCR software or not.

If libraries are interested in getting their out-of-copyright assets into digital form, they need only contact someone with Distributed Proofreaders to get the ball rolling. DPers would take care of the scanning, proofing, formatting, and post-processing of the book on behalf of the library, requiring nothing but a temporary loan of the book or manuscript (something the libraries already excel at :)

Comics by rbanffy · 2007-04-10 07:26 · Score: 2, Interesting

Will I be able to search my comic strips (downloaded since ever) by keyword?

I would love that!

--
http://www.dieblinkenlights.com

captcha's by mithras+invictus · 2007-04-10 07:27 · Score: 2, Insightful

captcha's are not restricted to images of letters. For example: you could ask people to solve a regular text question (this would also fix accessibility issues)

Re:captcha's by cmacb · 2007-04-10 08:43 · Score: 4, Funny

captcha's are not restricted to images of letters. For example: you could ask people to solve a regular text question (this would also fix accessibility issues)

You mean as in:

Describe what the following expression does in 30 words or less: {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}

Man, I'll never get into forum postings if they do that!
Re:captcha's by JamesTRexx · 2007-04-10 09:31 · Score: 4, Funny

Describe what the following expression does in 30 words or less:
{"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}

Answer: Makes my head hurt...

*click* Access to MySpace granted, have a nice day.

Which forum were you talking about again? :-)

--
home
Re:captcha's by Anonymous Coward · 2007-04-10 11:06 · Score: 0

Cartesian product?
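
For what it's worth, the expression reads as pairwise concatenation over the Cartesian product of the two sets. A quick sketch (the function name is made up for illustration):

```python
from itertools import product


def concat_sets(left, right):
    """Concatenate every string in `left` with every string in `right`."""
    return [a + b for a, b in product(left, right)]


print(concat_sets(["ab", "c"], ["d", "ef"]))
# ['abd', 'abef', 'cd', 'cef']
```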
Re:captcha's by Anonymous Coward · 2007-04-10 19:21 · Score: 0

FOIL? :)

Well... by Shawn+is+an+Asshole · 2007-04-10 07:40 · Score: 4, Informative

If you're sick of image spam, you can do what I did. Add the OpenProtect channel to SpamAssassin and then add these lines to your SpamAssassin config:

required_hits 5
score SARE_GIF_ATTACH 5

I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.

--
"It ain't a war against drugs.it's a war against personal freedom" --Bill Hicks

Re:Well... by drinkypoo · 2007-04-10 08:03 · Score: 2, Interesting

All I want is a plugin for thunderbird that will detect when a message is written in a language other than English and mark it spam if it is. No one ever sends me an email in anything other than English except for spam. I have no fucking idea why this has not yet been implemented. I get absolute shitloads of russian spam.
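
A crude version of that check can be sketched by looking at the script of the letters rather than the language itself. This is an assumption-laden heuristic, not a Thunderbird feature: the function names and the 0.5 threshold are made up, and it would catch Cyrillic mail but not Italian or Spanish, which use the same Latin alphabet as English.

```python
import unicodedata


def non_latin_ratio(text):
    """Fraction of alphabetic characters that are not Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = [
        ch for ch in letters
        if not unicodedata.name(ch, "").startswith("LATIN")
    ]
    return len(non_latin) / len(letters)


def looks_foreign(text, threshold=0.5):
    """Flag text whose letters are mostly from a non-Latin script."""
    return non_latin_ratio(text) > threshold
```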

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Well... by Auntie+Virus · 2007-04-10 08:07 · Score: 2, Informative

"required_hits 5 score SARE_GIF_ATTACH 5 I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day."

Brilliant. You just automatically blocked messages from companies whose PHBs insist on attaching a .gif of the company logo. SARE_GIF_ATTACH is ok with a lower score, adding to other scoring parameters. What you REALLY want for image spam is the FuzzyOCR plugin.
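
To illustrate why the score matters: SpamAssassin adds up the scores of every rule that fires and compares the total with the threshold. The sketch below mimics that arithmetic only; it is not SpamAssassin code, and every rule name and score other than SARE_GIF_ATTACH is hypothetical.

```python
def is_spam(rule_hits, scores, required_hits=5.0):
    """Sum the scores of the rules that fired and compare with the threshold."""
    total = sum(scores[rule] for rule in rule_hits)
    return total >= required_hits


# Hypothetical score tables; OTHER_RULE and THIRD_RULE are placeholders.
aggressive = {"SARE_GIF_ATTACH": 5.0, "OTHER_RULE": 2.0}
moderate = {"SARE_GIF_ATTACH": 1.5, "OTHER_RULE": 2.0, "THIRD_RULE": 2.0}
```

With the aggressive table a lone GIF attachment is enough to reach 5; with the moderate table the GIF only tips a message over when other rules fire as well.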

--
Why yes, I *AM* new here. Why?
Re:Well... by ConceptJunkie · 2007-04-10 08:17 · Score: 1

That and the ability to import and export from your message store. I love Thunderbird and have been using it exclusively since about version 0.4, but simply cannot believe some of the functionality it lacks.

--
You are in a maze of twisty little passages, all alike.
Re:Well... by cheater512 · 2007-04-10 10:08 · Score: 1

I simply OCR every image attachment. Works flawlessly.
Re:Well... by plover · 2007-04-10 12:37 · Score: 1

I get absolute shitloads of russian spam.

That's funny, I get sh!tloads of Italian spam, but the only Italian I know is from restaurant menus and the Pidgin Italian I occasionally hear on the Sopranos. It must have something to do with whoever harvests our addresses, and to which lists we were sold.
Long before I had Bayesian filters, I wrote a rule that said "if the sender's domain ends in .it, silently trash it." I then added .ru and .kr to the list. And then the spammers won. :-(
Lately it's not been too bad, but I'm really parsimonious with doling out my address. Sneakemail has been a big help in that respect, allowing me to shut down the few spammy websites I failed to recognize up front (and tipped me off that the spammers have harvested Sourceforge more than once.)

--
John
Re:Well... by Fred_A · 2007-04-10 19:11 · Score: 1

What you REALLY want for image spam is the FuzzyOCR plugin.
Wouldn't something like that eat cycles like crazy ?
(not sure of the current state of OCR)

--

May contain traces of nut.
Made from the freshest electrons.
Re:Well... by gottabeme · 2007-04-10 20:24 · Score: 1

If you set SA up right, it will only run OCR on the attachment if the message isn't already detected as spam. A lot of image spam is still detectable without OCR. For those that aren't, running gocr doesn't actually take that long at all. With enough volume, I'm sure it could become a problem, but if you use OCR as a supplement to SA's rules and Bayes filters, it can make a big difference in accuracy without making a big difference in performance.
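
The ordering described above can be sketched as follows. This is not how any particular plugin is written; cheap_score and ocr_score are hypothetical callables standing in for the normal rules and the OCR pass.

```python
def classify(message, cheap_score, ocr_score, threshold=5.0):
    """Run the expensive OCR check only when the cheap checks are inconclusive.

    Returns a tuple (is_spam, ocr_was_run).
    """
    score = cheap_score(message)
    if score >= threshold:
        return True, False      # already spam: skip OCR entirely
    if not message.get("images"):
        return False, False     # nothing to OCR
    score += ocr_score(message)
    return score >= threshold, True
```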

--
"Those who consume the bulk of goods are those who make them. We must never forget this secret of our prosperity."
Re:Well... by Inda · 2007-04-11 01:59 · Score: 1

I wish gmail would implement the same. I have even requested the feature.

I found a search result once that interested me and ended up at a Spanish forum. The forum was generic and I worked out that it said "You cannot read posts until you have registered". So I registered, read the English post, and never returned again.

I still get 20 emails a week in Spanish which I cannot read.

It can't be that hard to implement, can it?

--
This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.

Already done - 1GB and counting! by Anonymous Coward · 2007-04-10 07:41 · Score: 2, Informative

Where have you been lately? Picasaweb.google.com has already increased from a mere 250MB to 1GB+ and counting!

Sheesh.... by Rick+Richardson · 2007-04-10 07:54 · Score: 0

make[3]: Entering directory `/home/rick/tesseract-ocr/wordrec'
if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -g -O2 -MT tface.o -MD -MP -MF ".deps/tface.Tpo" -c -o tface.o tface.cpp; \
then mv -f ".deps/tface.Tpo" ".deps/tface.Po"; else rm -f ".deps/tface.Tpo"; exit 1; fi
../cutil/globals.h:46: error: previous declaration of 'int optind' with 'C++' linkage
../ccutil/getopt.h:23: error: conflicts with new declaration with 'C' linkage
../cutil/globals.h:47: error: previous declaration of 'char* optarg' with 'C++' linkage
../ccutil/getopt.h:24: error: conflicts with new declaration with 'C' linkage
make[3]: *** [tface.o] Error 1
make[3]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/rick/tesseract-ocr'

Re:Sheesh.... by Anonymous Coward · 2007-04-10 10:14 · Score: 0

Your gcc version is probably too new. From the Tesseract 1.0 release:
History:
========
The engine was developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
compiler. Currently it builds under Linux with gcc2.95 and under Windows
with VC++6. The C++ code makes heavy use of a list system using macros.
This predates stl, was portable before stl, and is more efficent than stl
lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug. Another "feature" of the C/C++ split is that the C++
data structures get converted to C data structures to call the low-level C
code. This is ugly, and the C++izing of the C code is a step towards
eliminating the conversion, but it has not happened yet.
(emphasis added)

I built Tesseract 1.0 under VC++6 last year, it built fine, a few warnings but the binary worked fine. However, I seriously doubt that it would compile under VC++2005. You can bet the same applies to gcc. The current release of gcc is 4.1.2, tesseract was last known to build under 2.95 which is considerably older than the current gcc release. Try building it on a test system with an older version of gcc and I imagine you'll have better luck. Either that or start updating it to compile under gcc 4.1.2 and maybe contribute any patches you develop.

Caveat: I have not downloaded and tested ocropus, but you can bet little to nothing has been done with the original HP / MITRE Corporation source since the evolution of tesseract into ocropus.

P.S. The engine is extremely accurate as promised, though restrictive in input format: it handles single-column text only and has only a command-line interface (though this makes it highly suitable for automation and reuse).
Re:Sheesh.... by Anonymous Coward · 2007-04-10 10:35 · Score: 0

Short Version: RTFMS!

Actually it's done all the time. by Kadin2048 · 2007-04-10 08:00 · Score: 1

Actually lots of people do book "scanning" with digital cameras. In fact, you can sometimes get much better results off of a book using a digital camera than you can by pressing it down against the bed of a flatbed scanner (because if the page wasn't typeset with a wide gutter, you'll start to distort some of the letters as you get close to the binding). Plus, it's a lot easier on the books, which is important when you're talking about books that are all going to be 75 years old and some much, much older.

The best way to use a flatbed scanner to scan books is actually to run them through a guillotine first, chop off the binding, and then scan the loose pages; this produces good results but it's not something most libraries are going to be willing to do.

Here's a commercial non-destructive book scanner which uses cameras. Basically, what you do, is you have two cameras, each pointing at one side of the book. You use lights held at an angle to the paper with reflectors and diffusers so that it's evenly lit, and then you just flip the pages and fire the cameras once per page turn. You can build a setup to do this (with manual page turning) for a few hundred bucks plus the cost of the cameras. The auto page-turning is really what drives up the cost.

People were photographing text using cameras for a lot longer than photocopiers have been around. The standard way of reproducing photographs was by using a copy stand and a fixed camera in order to make an internegative, and prior to the introduction of all-digital typesetting, almost all offset printing was done by photographing a paste-up of the final product with a special camera, which produced the plate used in the press.

So in short, although you're correct that just holding a digital camera over a book and clicking the shutter wouldn't give great results, the issues surrounding lighting, lens distortion, and focus are all solved problems. (And if you really wanted to be slick about things like barrel distortion and dust, you could start each run by photographing a standard grey field and a checkerboard, and use that to remove dust and correct for distortion digitally, rather than mechanically/optically.)
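
The grey-field half of that idea is usually called flat-field correction, and a minimal sketch of it fits in a few lines. The function name and the plain nested-list image format are assumptions made for illustration; undoing barrel distortion from the checkerboard shot is a separate calibration step and is not shown.

```python
def flat_field_correct(raw, flat):
    """Divide out uneven lighting and dust using a photo of a plain grey field.

    `raw` and `flat` are equal-sized 2-D lists of pixel intensities.
    Each pixel is scaled by (mean of flat) / (flat value at that pixel).
    """
    pixels = [value for row in flat for value in row]
    mean_flat = sum(pixels) / len(pixels)
    return [
        [r * mean_flat / f if f else 0.0 for r, f in zip(raw_row, flat_row)]
        for raw_row, flat_row in zip(raw, flat)
    ]
```

A page that is evenly grey but was photographed under uneven light comes out with the same value everywhere after correction.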

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Never saw that one specifically by Kadin2048 · 2007-04-10 08:10 · Score: 1

Yeah that's similar to what I was thinking about. Actually, what I was recalling was this thing, which seems to pretty clearly use off-the-shelf DSLR cameras (not sure on the lenses though, they're not visible). It probably costs a fortune because of the robotics and vacuum system necessary for the automatic page turning, but I think you could DIY something similar out of two copy stands for a lot less if you were okay with flipping pages.

The one you linked to seems like it would have more distortion of the pages because the cameras aren't being held constantly perpendicular to the page, but maybe it just corrects for that in software afterwards. (It wouldn't be hard, in fact I think all the code you'd need to do it is part of the Panorama Tools / Hugin package.)

What I think is a bigger problem for most libraries isn't the scanning per se, because that at least is a problem that most non-technical people can understand, but it's the storage and document-management that's the issue. Once you have the book scanned, you have a giant pile of JPEGs or TIFF files...unless you're careful about organization, it could become a real mess in a hurry.

So I think the missing piece has to do with getting from raw images to an actual ebook. The hardest problem seems to be in the proofreading step; if you run each image through an OCR program, and then you want to proofread it, you need some way of distributing pages out to proofreaders, and letting each of them have a page of text and the image from that page, side by side. And then managing their edits and checking changes back in, etc. It's nothing really novel -- they're all solved problems in other areas (document management, change management, remote access, web services) -- but I've never seen them combined.

If you had a software package that handled all the document management and proofreading (preferably something that your proofreaders could log into remotely and work, while storing everything centrally), then the hardware required is mostly off-the-shelf. It goes from being a $25,000 grant proposal, to some undergrad's thesis/semester project.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Re:Never saw that one specifically by ZERO1ZERO · 2007-04-10 10:19 · Score: 1

That's a pretty neat machine linked to there, the problem with these things is a lot depends on the actual book, you can have super-duper scanner vacuum this and that, but if your book is old and stiff they don't work so well.
You're right about managing the images though, I have been scanning some books recently full colour uncompressed 200dpi tiffs, and a couple of TB later most of the work becomes file management and backup. Scanning is the easy bit.

Vividata is just OK by hirschma · 2007-04-10 08:21 · Score: 1

it actually has many issues, and it is lagging behind the Windows version that Nuance produces. My company owns several licenses.

it is, however, the best OCR on Linux right now. I'm looking forward to having an alternative.

More build info; Ubuntu Feisty by drinkypoo · 2007-04-10 08:25 · Score: 4, Informative

Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.

To build tesseract-ocr you must install autoconf.

If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.

I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.

to build ocropus I ended up installing jam, libaspell-dev and libtiff-dev.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

Re:More build info; Ubuntu Feisty by si618 · 2007-04-10 18:05 · Score: 1

> If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout.

Aside from data-test-pages, there are 12 other directories under trunk.

You can svn co -N to grab trunk, then you will need to individually checkout each of those 12 under trunk.

I understand your grievance, but personally I wouldn't worry about it unless you're on dial-up, as it's only ~11MB.

--
Sometimes I doubt your commitment to Sparkle Motion
Re:More build info; Ubuntu Feisty by Anonymous Coward · 2007-04-10 22:56 · Score: 1, Informative

> You can svn co -N to grab trunk, then you will need to individually checkout each of those 12 under trunk.

Better IMO is to "svn switch" the directories you don't want to an empty directory in the ocropus tree (assuming you can find one) then you can carry on working as if you had a complete tree.

Where's the Package? by Doc+Ruby · 2007-04-10 08:44 · Score: 1

All the OCR available to my Ubuntu 6.10 (Edgy) APT are worthless (< 50% correct characters), after trying them on real scans (usually faxes) that are perfectly clear to my eye:

clara - Free OCR program for Unix Systems
gocr - A command line OCR
ocrad - Optical Character Recognition program
unpaper - post-processing tool for scanned pages

Will this Google OCR really work, and can I install it with APT?

Meanwhile, why is it all Optical Character Recognition, when the accuracy we expect is really Optical Word Recognition? How come spelling, grammar and phrase frequency (including typos etc) isn't used to error correct at a symbolic level higher than pixels?

--

--
make install -not war

Re:Where's the Package? by drinkypoo · 2007-04-10 09:20 · Score: 1

Will this Google OCR really work, and can I install it with APT?

Yes and no. (I've tested it, but you have to install from subversion.)

Meanwhile, why is it all Optical Character Recognition, when the accuracy we expect is really Optical Word Recognition? How come spelling, grammar and phrase frequency (including typos etc) isn't used to error correct at a symbolic level higher than pixels?

Teas Willis, and the sticky tours
Did gym and Gibbs in the wake.
All mimes were the borrowers,
And the moderate Belgrade.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Where's the Package? by Doc+Ruby · 2007-04-10 09:36 · Score: 1

I'd prefer to get mistakes turning nonsense into sense than the ones I get the other way around that don't even preserve meaningful nonsense.

Do you have a result from scanning Jabberwocky (or other verse in a similar vein) with Google's OCR?

--
--
make install -not war
Re:Where's the Package? by timmarhy · 2007-04-10 10:00 · Score: 1

"How come spelling, grammar and phrase frequency (including typos etc) isn't used to error correct at a symbolic level higher than pixels? "
from what I've seen they are struggling just to get the words right, let alone begin to think about grammar. Think about it: if it can't get the basic wording right, it'll just create errors in the grammar check for itself.

--
If you mod me down, I will become more powerful than you can imagine....
Re:Where's the Package? by drinkypoo · 2007-04-10 10:06 · Score: 3, Insightful
Do you have a result from scanning Jabberwocky (or other verse in a similar vein) with Google's OCR?

Just for you, I made one, because I'm that fucking cool.
1. Visited http://www.jabberwocky.com/carroll/jabber/jabberwocky.html.
2. Printed page 1 (all but one link at the bottom of the page) with default settings on a HP LaserJet 2300.
3. Scanned on an Epson 3170 as a 300 dpi grayscale PNG with otherwise default settings. (God DAMN this scanner is fast. But then my scanner at home is a shitty Mustek 1200UB since I broke my Canon LiDe.) 2528x3281 pixels.
4. mespinoza@sec2lpt7-linux:~/ocropus/ocropus-cmd$ ./ocropus ocr ~/Desktop/out.png | tee /home/mespinoza/Desktop/jabberwocky.html (lots of output)
Prepare to be unimpressed, because Results follow:

JABBERWOCKY Lewis Carroll

(from Through the Looking-Glass and What Alice Found There, 1872) `Twas bri11ig,_ andjghe 4s1it_hy toyes Digl gyre amid gimblejn thg wabe: All xiiimsy wei^e thg borogovgs, And theamome raths outigrabe. ''ggwqre thg Jalgbervvpck,_my sqn! The jaw; that bijtel the clayksathat catch! Bgyvaiie the Jubjub bird, anti shun The frumidus Bandersnatch!' I-Ie took his yorpal sword in hand: Long timg tlgewmangome foe he sought So rgSted he by the Tu_mtum tree, And stood awhile in thought. And, as in uffish thought he stood, The Jalgbgjwoclg, with eyes of flame, Cqmgwhjfflixgg through fhe tulgey wood, And burbled as it came! Qne, two! One, two! And through and thIi`Ollgh The jrorpgal b]ade went; snicaker-snack! I-Ie left iifdead, and with its head He went galumphing back. ''And, has thou slain thejabbexfwpck? Cpmg to my a_rxps!_my ljgaxjgishboyl Ojralqjousi dwgy! Qalladhl Callayl' He chortled in his joy. S

\ A S

X A ?`^s :

, ' Was ga. ka%#* mm. -- M 1 1 Q at ) a iv 2. `Ail A it 3*,* `i 2 (V H ;. ````( * 4 ^Nq@ Eu..*s..%im X M is ? lgh ~ ``A? S [ A Fax I /),2*gE it ^`* 4 ~ *: ' X A mg x ix, ,t~;;;..: v' it ix '~ t ~ ^ ,4~ ---= =-^ A A i gv ; * XX, x> . . N S A ft 1 A-`A 3; `> ' ''YY \Jh ^***`(?i* , ~~ x `* at -;v- *<~ ' H ~~~-=.- ; `Twas bri11ig,_ and_the 4s1it_hy toyes Dig gyre arid gimblejn the wabe; All Qiixjnsy wei^e thq borogovgs, And thdmome raths outvgrabe.

dshaw@iabbenNockv.com

Return to Glorious Nonsense Return to Lewis Carroll

Results End.

Beautiful, eh? I also tried a 100 dpi grayscale scan, which came out even more like hash (one big paragraph) and a 300 dpi bitmap (1bpp) which was about the same as the 100 dpi gray scan in quality, though a bit better.

Looks like ocropus has a while to go before it can slay the Jabberwock instead of thejabbexfwpck.
--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Where's the Package? by Doc+Ruby · 2007-04-10 10:13 · Score: 1

The point is to leverage the language's grammar to enforce more order, and weed out more mistakes. It's not a grammar check, it's grammar clues to what the words probably are.

--
--
make install -not war
Re:Where's the Package? by Doc+Ruby · 2007-04-10 10:17 · Score: 1

You are that fucking cool :).

I think spelling/grammar feedback would have fixed the replacement of 1s for ls, etc, especially now that Carroll's words are in the lexicon. A phrase recognition could also search the web to fill in low-confidence recognized text by matching against high-confidence recognized text. That's how humans do it.
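A rough sketch of the lexicon-feedback idea, not anything ocropus actually does: normalise a few common OCR confusions (1 for l, 0 for o), strip stray punctuation, then snap each token to the nearest word in a lexicon. The word list, the confusion table and the cutoff are all assumptions picked for illustration.

```python
import difflib

# Hypothetical mini-lexicon; a real system would load a full word list,
# including Carroll's coinages as suggested above.
LEXICON = ["twas", "brillig", "and", "the", "slithy", "toves",
           "gyre", "gimble", "wabe", "jabberwock", "vorpal"]

# Assumed table of common OCR digit-for-letter confusions.
CONFUSIONS = str.maketrans({"1": "l", "0": "o"})


def correct_word(word, cutoff=0.6):
    """Return the closest lexicon word, or the input unchanged if none is close."""
    cleaned = word.lower().translate(CONFUSIONS)
    # Drop anything that is not a letter (underscores, commas, stray digits).
    cleaned = "".join(ch for ch in cleaned if ch.isalpha())
    if not cleaned:
        return word
    matches = difflib.get_close_matches(cleaned, LEXICON, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line):
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(token) for token in line.split())
```

Feeding it tokens from the scan above, such as "bri11ig,_" or "toyes", pulls them back to lexicon entries. It only helps when the right word is in the lexicon and the damage is small; merged words like "thejabbexfwpck" would need segmentation first.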

--
--
make install -not war
Re:Where's the Package? by iggymanz · 2007-04-10 13:16 · Score: 1

maybe a TIFF would do better than png
Re:Where's the Package? by Ceriel+Nosforit · 2007-04-10 18:20 · Score: 1

Looks like ocropus has a while to go before it can slay the Jabberwock instead of thejabbexfwpck.

Alpha Release (Q3 2007)

1.0 Release (Q3 2008)

There has been no training, optimization, or parameter tuning yet (beyond what has been done on Tesseract), and no code for document cleanup or deskewing is included.

- Goo'

Half a year to alpha and a year more to beta, to be more precise. This software isn't even Alpha yet, so it seems silly to me to expect that much.
I guess your effort could be considered a proof-of-concept, but you really shouldn't see it as an indication of future performance. =/

--
All rites reversed 2010
Re:Where's the Package? by drinkypoo · 2007-04-11 03:33 · Score: 1

maybe a TIFF would do better than png

You're welcome to repeat my experiment with your own variables and are encouraged to post your results here for posterity (or at least link them).

But I doubt it would make any difference whatsoever. It shouldn't matter whether you use TIFF or PNG, because either way the file is decoded to the same raw bitmap. AFAIK there's little in TIFF that you can count on being supported that isn't in PNG. Maybe nothing, by that criterion :)

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"

The CAPTCHA solution by dj245 · 2007-04-10 09:01 · Score: 3, Interesting

Look, any illiterate kid in a third world country can play type-in-the CAPTCHA all day long. I think the solution is to put up an array of 9 or so pictures, and ask the reader to click on the kitten. The other 8 being something other than a kitten, and all the files having random names which rotate with every view. You could also change the item being asked for to defeat simple image recognition, and have several pictures of kittens/what-have-yous.

To defeat *this*, you would need someone with a greater command of the English language than simple recognition of characters, or very advanced image recognition software. I wouldn't worry about the software anytime soon if you chose images carefully.

--
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.

Re:The CAPTCHA solution by drinkypoo · 2007-04-10 09:13 · Score: 1

Look, any illiterate kid in a third world country can play type-in-the CAPTCHA all day long. I think the solution is to put up an array of 9 or so pictures, and ask the reader to click on the kitten. The other 8 being something other than a kitten, and all the files having random names which rotate with every view.

The problem with your idea is that we've seen multiple examples of image search tools lately (well, two) that are capable of doing that kind of analysis. That idea is better than nothing but will probably only last for a couple more years before it's utterly useless as well.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:The CAPTCHA solution by Anonymous Coward · 2007-04-10 09:21 · Score: 0

Most third world country kids have a greater command of the English language. Actually greater than yours.
Re:The CAPTCHA solution by Espinas217 · 2007-04-10 09:57 · Score: 2, Insightful

This just slows down the spammers but can't stop them. If you have a small number of choices it's just a matter of how many times the spammer must try to get through. You present 10 images with 1 correct, the spammer has a 10% chance and that's enough to make his business work. I'm not really in favor of captchas but multiple choices won't work for long.
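To put numbers on the guessing argument, here is a small sketch. It assumes each attempt is an independent uniform guess among the images and that nothing rate-limits the bot; both are simplifications.

```python
def success_probability(choices, attempts):
    """Chance that at least one of `attempts` blind guesses is right."""
    p = 1.0 / choices
    return 1.0 - (1.0 - p) ** attempts


def expected_successes(choices, attempts):
    """Average number of correct guesses over `attempts` independent tries."""
    return attempts / choices
```

With 10 images, a bot that submits 1000 forms expects about 100 of them to go through, which is plenty for a spammer who pays nothing per attempt.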

--
La vida no es una pastafrola. :wq
Re:The CAPTCHA solution by cheater512 · 2007-04-10 10:22 · Score: 1

You'll be needing a very large collection of images for that to be effective.
Plus no one else can be using your collection of images and ideally you should be 'rotating' your collections of images so you use a different set every week or so.

It's an impossible task.
Re:The CAPTCHA solution by plover · 2007-04-10 13:04 · Score: 1

I was chatting with a webmaster who did a lot of work with this kind of system. Basically, he automated the spotting of spam, but he didn't immediately delete it because that led to the "l33t c1@L1$" variants. Instead, everyone who posted spam had their IP address added to a "fake" list. Anyone on the fake list got to see all the spam and thought they were posting publicly, but all the "non-fake" users (especially Google) had it filtered and they never saw it. He planned to extend the idea to CAPTCHAs -- anyone trying to hack the CAPTCHA with a script would end up on the fake list.
He then put together a script to purge the databases of "fake" postings that were over a month old, just to keep them clean. Far as I know, his system is still working fine because the spammers haven't figured it out yet.
I still think stealth-segregation is a damn clever idea, as long as you can absorb their traffic. If you don't give the spammers the feedback that you're on to them, they won't know when you can spot their tactics automatically. And they won't retaliate the way that some of the really evil bot herders do.
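For anyone who wants to try it, a minimal in-memory sketch of the stealth-segregation scheme as I understood it. The class, the `is_spam` callable and the timestamps are my own placeholders, not the webmaster's actual code, and a real site would key on more than a bare IP address.

```python
class Board:
    """Toy message board with a hidden 'fake' list of spammer IPs."""

    def __init__(self, is_spam):
        self.is_spam = is_spam   # hypothetical classifier: text -> bool
        self.fake_ips = set()    # posters who have been caught spamming
        self.posts = []          # dicts with ip, text, time, fake

    def submit(self, ip, text, now):
        if self.is_spam(text):
            self.fake_ips.add(ip)
        fake = ip in self.fake_ips
        self.posts.append({"ip": ip, "text": text, "time": now, "fake": fake})
        return "posted"          # identical response either way: no feedback

    def visible_posts(self, viewer_ip):
        # Flagged viewers see everything; everyone else sees the clean view.
        show_fake = viewer_ip in self.fake_ips
        return [p["text"] for p in self.posts if show_fake or not p["fake"]]

    def purge(self, now, max_age):
        # Drop fake posts older than max_age to keep the database clean.
        self.posts = [p for p in self.posts
                      if not (p["fake"] and now - p["time"] > max_age)]
```

The key design point is that `submit` returns the same thing to everyone, so the spammer never gets a signal that the filter caught him.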

--
John

IBM and OCR patents by Anonymous Coward · 2007-04-10 09:04 · Score: 0

>IBM has (and I think Google too) lobbied for open source exemption.
IBM has donated certain patents for open source utilization.
They have in no way lobbied for universal open source exemption from patent laws.
They have collected 1b+ last year in IP licensing revenues, including software patent licensing revenues.
>What's more, there are many dozens of such simple patents surrounding OCR.
From this statement you logically imply that there are no complex patents surrounding OCR technology. Thus you state that the area of research is overpatented.

It could more reasonably be argued that OCR patents are numerous because it is a universal, difficult problem and many investors have spent significant resources on people attempting to gain accuracy improvements related to it.

Read some of the patents in the link you provided. You will find numerous non trivial breakthroughs.

Re:IBM and OCR patents by ajs · 2007-04-11 02:47 · Score: 1

IBM has been very clear that they want to see open source software as prior art, and they've created an open source only pool of patent licensing for many of their patents. It's true that they've not been as forward as OSDL, EFF and others about the need to allow open source development to use patents, but where they're attempting to move the process is clearly well along the way to that goal, and one could argue that by partnering with OSDL on this, they've tacitly endorsed that long-term outcome, even if their rhetoric has not specifically gone that far.

Apache license is incompatible with GPLv3 by morganew · 2007-04-10 09:36 · Score: 1

It's fascinating that Google has chosen the Apache license for the release of this product. Given that Eben Moglen has explicitly stated that the Apache License is incompatible with GPLv3, what does this mean for mixing this code into other projects?

Even though v3 no longer has the anti-google Affero provisions, Google still chooses Apache instead of GPLv3 or even v2 with a rider to upgrade to v3. You gotta believe the Google lawyers were thinking about this issue before release...

--
A sig?!? I don't think so.....

Re:Apache license is incompatible with GPLv3 by Anonymous Coward · 2007-04-10 10:59 · Score: 0

Google didn't release it under the Apache License Version 2.0, HP did, through ISRI. There is also the additional license for the Aspirin/MIGRAINES system (see the Tesseract readme). This is probably because HP uses Apache code frequently in its products and is familiar and satisfied with the license.

http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html

http://www.isri.unlv.edu/ISRI/Software#Experimental_Open_Source_OCR

Google would have had little choice in the matter.

Deep pockets aren't required for this by Anonymous Coward · 2007-04-10 09:40 · Score: 0

"I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while."

It sounds like they're just providing support for a few PhD students (i.e. cheap labor) and want the community to perform a lot of the work. What are deep pockets needed for?

Ocrad by ThreeDayMonk · 2007-04-10 09:42 · Score: 1

I don't often need to do OCR, but I had passable results with Ocrad recently. Like some of the other respondents, I couldn't get much useful output from GOCR.

--
If your comment title says 'Re: Foo', I'm not likely to read it.

Ocropus? by 6Yankee · 2007-04-10 10:10 · Score: 2, Funny

Is that a Chinese mispronunciation? ;)

Do this for term papers to detect plagiarists? by Dr.+Spork · 2007-04-10 12:15 · Score: 2, Interesting

I have no doubt at all that this is coming in the future. Why? Because Google wants to see all data, analyze that data, and catalog it. That's exactly what would happen if you uploaded your scanned document to Google: Sure they would OCR it and do a good job, but they would also save the OCR'ed copy for later data mining.

One awesome application of this: I teach university courses that require term papers. If I could scan and upload the term papers I receive and Google could OCR them and tell me whether they're plagiarized (and of course Google would know; they know all!), I'd be prepared to pay them a bit of money for this. Or, more accurately, my university would be prepared to pay them a decent sum of money on my behalf. Then, they could keep the data from the term papers for the future, to make sure that nobody turns in that same paper in a later semester. Google not only gets money for this, but a whole lot of data to crawl through. Who knows what they would learn if a curious goog starts cleverly mining that data? If they do this, I would really love to work for them and use my 20% "downtime" to code a sentence structure analyzer that could predict a grade based just on syntactic features of the writing. In order to get more data, Google might even offer the OCR + plagiarism detection for free if the instructor agrees to use a Google grading and feedback system, so that Google could correlate each essay with a grade and an explanation of the grade. After tens of thousands of examples, Google might learn how to assign fairly accurate grades on its own (machine agrees with human to almost the same degree that humans agree with each other about what grade is deserved), and after that, who knows, Google might learn how to write B- term papers without any human input!

BTW, I am aware of plagiarism.org and their plagiarism-detection service which works like the thing that I want Google to do. Of course, if Google enters this market, they will crush all competition immediately, and plausibly, they'll do a better job because their database is just bigger. Also, Google could charge less, because a part of the payment will be access to the data itself. In fact, Google is already looking like it will accept information as payment for many of its services! And why not?

Re:Do this for term papers to detect plagiarists? by simm1701 · 2007-04-10 20:23 · Score: 1

I've always wondered why US schools and universities don't consider changing the assignments each year? And more importantly changing the exams they give.

I did a 1 year exchange program from the UK to the US and almost got thrown off a course by a very irate professor when I asked him for copies of the last couple of years' exam papers for revision purposes.

He thought I was trying to bribe him to cheat. It took me a while to realise he used the same exam paper each year (or similar) and then it took me even longer to explain to him that EVERY exam paper for each course in my (and most other) UK universities can be found in bound editions in the library. They are one of the best forms of revision material, and most lecturers (or at least your tutor) will be happy to go over your attempts at the papers and help correct your mistakes in preparation for the exam. Brand new courses even have to be provided with 3 sample exam papers of what the exam would have been like if the course had existed in previous years.

Plagiarism on papers and assignments happens here too, but generally it's not between students, it's from published papers and essays available on the internet. And since most students use Google with the same search terms for a given assignment to find the information, and far too many just copy verbatim, it can be pretty obvious when several students have copied from the same source. However, that's a very different problem to people handing in previous years' students' work.

--
$_="Slashdotter";$syn="OTT";s;..;;;sub _{print shift||$_};s!ash!Perl !;s=$syn=ack=i;tr+LLEd+BLAH+;_"Just Another ";_
Re:Do this for term papers to detect plagiarists? by Instine · 2007-04-10 20:27 · Score: 1

If I hadn't already commented in the thread I'd mod you up as interesting. I'm not worried about Big Google Brother. But I do like your use for the tech... Really like your A.I. take.

I know it's not cool to pat backs in here, but I really like your thinking...

--
Because you can - or because you should?
Re:Do this for term papers to detect plagiarists? by indifferent+children · 2007-04-11 00:02 · Score: 2

I've always wondered why US schools and universities don't consider changing the assignments each year?
Bitching about lazy students is easy. Creating new exams and paper assignments is work.

--
Censorship is telling a man he can't have a steak just because a baby can't chew it. --Mark Twain
Re:Do this for term papers to detect plagiarists? by FrostedChaos · 2007-04-11 07:13 · Score: 1

Because Google wants to see all data, analyze that data, and catalog it. That's exactly what would happen if you uploaded your scanned document to Google: Sure they would OCR it and do a good job, but they would also save the OCR'ed copy for later data mining.
Google doesn't necessarily want to see ALL data. Nobody wants to see ALL data. People want to see the data that is most useful to them.

In Google's case, they want to learn as much as possible about users, so that they can turn around and sell that information to advertisers. They also want to sell ad space on their page directly. Remember their primary revenue model-- selling eyeballs to advertisers.

To accomplish these goals, they offer information that people might be interested in, like search results, Google maps, and Google image search.

BTW, I am aware of plagiarism.org and their plagiarism-detection service which works like the thing that I want Google to do. Of course, if Google enters this market, they will crush all competition immediately, and plausibly, they'll do a better job because their database is just bigger.
Well, I know that google already has their Google Scholar service that provides access to university-level research papers that are in the public domain. I'm sure that it brings a lot of traffic to their page, in addition to being a nifty public resource.

Term papers that bachelor's students and high schoolers write are a little bit different. It's not likely that anyone else will care about your high school paper about Moby Dick. I doubt that professors are eager to cite "Joe Random high schooler" or "Bob Random undergrad" in their papers. If they want to learn general background, they'll visit wikipedia or another such webpage. If they want to read cutting-edge research, they'll trawl through arXiv.org or Google Scholar. (Or even, gasp, read a peer-reviewed journal that their university pays for a subscription to.) What niche does that leave for term papers?

Then, they could keep the data from the term papers for the future, to make sure that nobody turns in that same paper in a later semester. Google not only gets money for this, but a whole lot of data to crawl through. Who knows what they would learn if a curious goog starts cleverly mining that data?
It's not obvious to me that term papers contain any really useful information "for the future." Maybe if you indexed them by author name in a really creepy, big-brotherish way, you could get information about the students.

If they do this, I would really love to work for them and use my 20% "downtime" to code a sentence structure analyzer that could predict a grade based just on syntactic features of the writing.
Well, you don't have to work for Google to get involved with Natural Language processing. :)

Most universities have faculty who are working on this topic. I once passed up a staff job working for such a professor (greener pastures and all that.)

--
"Any connection between your reality and mine is purely coincidental." -Slashdot

HP Tesseract by chill · 2007-04-10 12:17 · Score: 1

Patents last 20 years in the U.S., IIRC.

This OCR is a refined version of HP's Tesseract, which HP handed over to UNLV some time ago. The original code was developed starting in 1985, so there is a good possibility patents are not valid.

"You might wonder why Google is interested in OCR? In a nutshell, we are all about making information available to users, and when this information is in a paper document, OCR is the process by which we can convert the pages of this document into text that can then be used for indexing."

Charles

--
Learning HOW to think is more important than learning WHAT to think.

Very cool -- thanks. by Kadin2048 · 2007-04-10 12:39 · Score: 1

I had never heard of Distributed Proofreaders before; that's very cool. Their system seems to be very close to what I was envisioning (allows distributed proofreading via a web interface, automatically assembles books together and puts them in a central repository for access).

Thanks for the link. The next time I'm talking to any of my librarian friends, I'll have to mention it. I didn't see anything on their FAQ though about accepting books from libraries for digitization, just on starting a project yourself (meaning scan the book and submit the scans and OCRed files for proofreading). But the scanning is really the easy part relative to the proofreading, so it still is a big step forward.

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

Advancement in AI by chip33550336 · 2007-04-10 13:37 · Score: 1

It is interesting that web forms have become a measure of AI strength in the world wide web. As soon as Captchas are largely solved, there will be new and improved human tests. I am guessing the next step will be identifying logos, or some sort of symbol. Eventually that problem will be solved too. So what do we do when we can't tell a human from a machine?

Please send me your registered DNA sequence, a voice recording reading this message, and a picture of you in the current location...

I guess a central database of information (identification and secure communication channel) is going to be the only way to ensure you are who you say you are.

Eventually I guess it won't really matter if you are human.

Re:Finally... NOT so final... by WrongDecision · 2007-04-10 16:03 · Score: 2, Informative

Actually, GOCR works very well (100%) on the image-based text that some sites use to prevent screen scraping.
1. Download and save the image.
2. If it's a gif, convert it to a jpg.
gif2jpg -a tmp.gif
3. Reduce the colors to 2 (black & white).
djpeg -colors 2 -greyscale -dither none tmp.jpg tmp.pnm
4. If there is a border, crop it off.
pnmcut a b c d tmp.pnm > OCR.pnm
(The dimensions a,b,c,d can be determined by any tool that returns useful info about an image, in general remove 1 or 2 pixels from the edges to get rid of borders.)
5. OCR it.
gocr -n 1 OCR.pnm >> OCR.txt
Of course, this is all automated within the screen scraper, I just broke it out here to explain the steps.
For CAPTCHAs, you have to demorph the severely distorted images after step 4, before you OCR it. I'm still working on the demorpher, but it's about 50% accurate now. Basically, it unstretches long strings of pixels to the average of other strings of pixels in the x and y axis. Works even better if you determine the angle of the pixel string and shrink on that, along with some rotation to the nearest x or y axis.

Re:Very cool by Thomas+Shaddack · 2007-04-10 18:02 · Score: 1

Use a multimeter with built-in RS232 interface. Either use two, or use one which can act as a wattmeter (eg. Metex 3860M) which can be obtained for $170 at tequipment.net and perhaps cheaper elsewhere.

Alternatively, slap a USB or RS232 or RS485 interface onto a cheap microcontroller with built-in ADC (usually 10-bit, usually multiplexed to several possible input pins) and suitable analog circuits for sensing the values required, and log to a computer or to a data storage (eg. smartcard).

Yet another alternative, using your approach, is taking the image, finding out the approximate positions of the centers of the LCD segments, finding the image brightness thresholds for segment on/off, and getting on/off values for each segment of the display you want to watch. Then a simple decoding algorithm that turns the list of segments switched on to the displayed value.
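The decoding step can be sketched like this. It assumes you have already sampled one brightness value at the centre of each segment, that lit LCD segments read darker than the background, and that segments use the conventional a-g lettering (a = top, then clockwise, g = middle). The threshold is something you would calibrate for your own camera and lighting.

```python
# Standard seven-segment patterns, keyed by the set of segments that are on.
SEGMENT_PATTERNS = {
    frozenset("abcdef"): 0,
    frozenset("bc"): 1,
    frozenset("abdeg"): 2,
    frozenset("abcdg"): 3,
    frozenset("bcfg"): 4,
    frozenset("acdfg"): 5,
    frozenset("acdefg"): 6,
    frozenset("abc"): 7,
    frozenset("abcdefg"): 8,
    frozenset("abcdfg"): 9,
}


def segments_on(brightness, threshold):
    """Segments darker than the threshold count as switched on (LCD)."""
    return frozenset(name for name, value in brightness.items()
                     if value < threshold)


def decode_digit(brightness, threshold):
    """Return the digit shown, or None for a blank or unrecognised pattern."""
    return SEGMENT_PATTERNS.get(segments_on(brightness, threshold))


def decode_display(digits, threshold):
    """Decode a list of per-digit brightness dicts, left to right."""
    return [decode_digit(d, threshold) for d in digits]
```

For an LED readout the comparison flips (lit segments are brighter), so you would test `value > threshold` instead.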

Skeptic by Anonymous Coward · 2007-04-10 21:23 · Score: 0

ABBYY FineReader Sprint 6.0 for Windows, which is available "free" bundled with a lot of different cheap page scanners, is a simple but extremely fine product: it does 99.5% or better justice to any kind of printed text in about 25 languages, without training needed, even if the source is low-quality, like 72dpi JPEGs.

All in all, it would be very difficult for Google to make a GNU GPL and patent-free general purpose OCR implementation that comes anywhere near the recognition reliability that ABBYY, Recognita and other top-notch commercial software titles achieved during more than a decade of continuous development.

The problem with producing CAPTCHAs by Poromenos1 · 2007-04-10 23:38 · Score: 1

The problem with producing good CAPTCHAs is that it is hard to find a problem that the computer can easily generate and have the answer for, but cannot solve trivially. Our current CAPTCHAs are a good compromise, but I, at least, have no idea how to create text CAPTCHAs with those properties.

--
Send email from the afterlife! Write your e-will at Dead Man's Switch.

Faxes seem to be worthless for affordable OCR by name_already_taken · 2007-04-11 01:45 · Score: 1

I'll preface this with "this is just my experience"...

I'm involved in a project to capture a library of technical documents to PDF (we've done 40,000 pages so far). The software being used is Acrobat Capture 3.0 on Windows 2000 running on a 3GHz P4. Once the documents have been through Acrobat Capture, we use Acrobat 5 to retouch them (strangely, later versions of Acrobat give you less control and less ability to fix problems in the documents - we actually downgraded from Acrobat 7 back to Acrobat 5).

Our pages are scanned at 600 DPI, 1 bit per pixel, using a Kodak i65 that automatically deskews the pages (a small amount of skew seems to confuse Acrobat Capture to no end, and if there is a graphic on the page you get aliased lines instead of clean, straight lines).

We've found that the error rate goes up when you drop to 300 DPI. Normal fax resolution is 100 x 200 DPI ("fine" is 200 x 200 DPI), so you can expect to have very poor performance at fax resolutions. Basically, Acrobat Capture acts like OCR'ing a fax image is a torture test for the OCR because it seems like there just aren't enough pixels to give the OCR engine enough hints about what it's looking at. I'd be more interested to find out how the Google OCR does with a clean page of Helvetica text.
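To make the "not enough pixels" point concrete, here is the back-of-envelope arithmetic. It uses the nominal type size (1 point = 1/72 inch), so actual letter shapes are smaller than these figures; the 10-point body text is just an assumed example.

```python
def nominal_height_px(point_size, dpi):
    """Pixels spanned by the nominal type size at a given scan resolution."""
    return point_size / 72.0 * dpi


def pixels_by_resolution(point_size, resolutions):
    """Map each DPI value to the rounded pixel height of the type size."""
    return {dpi: round(nominal_height_px(point_size, dpi), 1)
            for dpi in resolutions}
```

For 10-point text, `pixels_by_resolution(10, [600, 300, 200, 100])` works out to roughly 83, 42, 28 and 14 pixels. At the 100 DPI axis of a normal fax the whole line height is about 14 pixels, and individual strokes are only a pixel or two wide.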

Most of the problems we are now having come from a very old set of documents (early 1940s) that were created using typewriters that apparently weren't very well taken care of.

The person who is doing the work is using some macro software that has let him automate the process of fixing the text in Acrobat to some degree, but it's still slow going (average seems to be 100-200 pages per day).

--
Putting moderation advice in your .sig lowers your karma!

Re:Very cool by smellsofbikes · 2007-04-11 03:11 · Score: 1

I've built my own ADC's and stuff, but I question their accuracy and it's a *lot* of work to get one built that works nicely. I've built a couple 12-bit ADC's that output parallel to something like the sparkfun usb interface (16 lines of IO) and that's functional. But it's really nice to have a rugged, precalibrated multimeter that, out of the box, already has its voltage, amperage, and such calibrated and ready to go. What I've ended up doing is buying GPIB-equipped multimeters on ebay, and that works. But I often have situations where I'd like to be running five or six -- measuring efficiency on dual or triple-channel switching power supply chips, for instance -- so I don't have enough RS232's without kludging things onto the computer. GPIB works beautifully, as would USB, given their extensibility.

I like your idea of the point decoding of the LCD. I'll have to think about that and see if I can come up with an easy implementation. That'd be a lot simpler with LED readouts, which a couple of my power supplies have. What a great idea! Thanks.

--
Nostalgia's not what it used to be.

Patents? by BubbaFett · 2007-04-11 04:55 · Score: 1

One would assume that OCR is a heavily patented space, and a patent search seems to agree. Caere could make things difficult for the competition.

Baby with the Bathwater Solution by Slashdot+Parent · 2007-04-11 05:55 · Score: 1

required_hits 5
score SARE_GIF_ATTACH 5

Ever think you might be throwing out the baby with the bathwater on this one? I mean, you never get ham with a GIF in it?

I've been looking for a good solution to the image spam problem, but this is not it.

--
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock

Re:Baby with the Bathwater Solution by Shawn+is+an+Asshole · 2007-04-11 14:22 · Score: 1

Ever think you might be throwing out the baby with the bathwater on this one?
I got pushed to it. On average I was getting 100+ image spams a day. At first using OCR worked, but that stopped working months ago as just about every image spam I get is CAPTCHA'd.

I mean, you never get ham with a GIF in it?
No.

--
"It ain't a war against drugs.it's a war against personal freedom" --Bill Hicks

Re:Very cool by Thomas+Shaddack · 2007-04-11 07:29 · Score: 1

More modern instruments have USB instead of RS232. To my great annoyance, I have to admit, as connecting a RS232 device to eg. a microcontroller-based datalogger is much easier than when you have to mess with USB.

For more instruments you can deal via USB, USB-to-serial converters, or eg. a pair of Netmos serial port cards, which will add 8 more RS232 ports to your machine. GPIB is IMHO an overpriced monstrosity.

Re:Very cool by smellsofbikes · 2007-04-11 08:28 · Score: 1

I have to admit a lot of fondness for GPIB since my dad helped design it and I have a rack of equipment that he subsequently designed around it. In 1982, it beat the hell out of anything else on the market.

What I'd really like, rather than usb or rs232, is ethernet. Our newer tektronix scopes have a network jack on the back and somewhere inside their weird little insides, a webserver, so I can run the scope from anywhere in the building and get data out of it. That's amazingly useful. No drivers, no special cables, no limits on how many instruments I can work with, just pure functionality.

--
Nostalgia's not what it used to be.

Re:Very cool by Thomas+Shaddack · 2007-04-11 09:22 · Score: 1

Again, RS232 comes to rescue here. For some $50, there are eg. the Lantronix XPort adapters available, which are UART/TCP converters. They can either sit and listen for a connection (and then relay the bytes back and forth between UART pins and the socket), or actively open a connection to a defined IP:port. I have some supervisory hardware made this way. The UART/RS232 level converter can be made of two transistors, and all the other stuff you need is 3.3V/100(or so) mA for the module.
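On the computer side, reading from such a converter in listen mode is plain socket code. A sketch, assuming the instrument sends newline-terminated ASCII readings; the address and port in the comment are made up, and the function itself works on any connected socket.

```python
import socket


def read_lines(sock, max_lines):
    """Collect up to max_lines newline-terminated readings from a socket."""
    buf = b""
    lines = []
    while len(lines) < max_lines:
        chunk = sock.recv(1024)
        if not chunk:          # peer closed the connection
            break
        buf += chunk
        while b"\n" in buf and len(lines) < max_lines:
            line, buf = buf.split(b"\n", 1)
            lines.append(line.rstrip(b"\r").decode("ascii", errors="replace"))
    return lines


# Hypothetical usage against a converter listening at 192.168.1.50:10001:
#   with socket.create_connection(("192.168.1.50", 10001), timeout=5) as s:
#       for reading in read_lines(s, 10):
#           print(reading)
```

No drivers and no special cables on the PC end, and each instrument is just another IP:port pair.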

Slashdot Mirror

212 comments