Search Engines for Handwritten Documents
An anonymous reader writes "Researchers at the University of Massachusetts have created a tool for automatically searching handwritten historical documents, such as the 140,000 pages that make up George Washington's personal papers in the Library of Congress. The most interesting part is that the papers are scanned versions of the originals and the search tool actually recognizes the handwritten text from these images."
In America, handwriting is only for old people.
>The most interesting part is that the papers are scanned versions of the originals and the search tool actually recognizes the handwritten text from these images.
How else would it search handwritten documents? Am I missing something here?
Huh? Well, let's see how well it keeps up with my doctor's handwriting...
slashdot page down after 0 posts!
Somebody invented a way for computers to recognize handwriting.
Like, so 10 years ago.
No OCR is performed on the documents. The search tool operates on the image.
Fair is where you take your cow to be judged.
Wow, looking at some of those examples, I was amazed by the fact that I couldn't READ most of the words. It looks completely foreign to me, might as well be trying to read Japanese.
...will this tool be open source, or at least free to use?
How good is the accuracy? The OCR technology of today might not be able to recognize the "flowery" text of most historical documents (look at "We the People" in the Constitution).
got sig?
RTFA :) It actually looks pretty cool, the software is looking through the actual handwritten pages.
These documents are old and handwritten. Why waste the processing power deciphering results for each search when you can decipher the text once with a similar algorithm and search an index built that way? It's not like the information is ever going to change. (unless we do rewrite history)
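A minimal sketch of the index-once idea in the comment above. It assumes some recognizer has already produced a (possibly noisy) transcript for each page; the `transcripts` data and the function names `build_index` and `search` are hypothetical and do not come from the article.

```python
from collections import defaultdict
import re


def build_index(transcripts):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in transcripts.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the sorted page ids that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    pages = set(index.get(words[0], set()))
    for word in words[1:]:
        pages &= index.get(word, set())
    return sorted(pages)


# Hypothetical transcripts, deciphered once up front.
transcripts = {
    "page-001": "Orders to the Governor of Virginia",
    "page-002": "Letter regarding supplies for the regiment",
    "page-003": "The Governor requests supplies",
}
index = build_index(transcripts)
print(search(index, "governor supplies"))
```

The expensive recognition step runs once per page; each query afterwards is only a few set intersections. The catch is that any recognition error gets baked into the index.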
Google already did it! Well, it's not handwritten, but that's just a logical progression.
such as the 140,000 [handwritten] pages that make up George Washington's personal papers in the Library of Congress.
In related news, the family of Tobias Lear, George Washington's personal secretary, who took his own life (arguably due to the horrible pain in his wrists), has filed suit.
I hate reading/producing anything longer than a post-it note that's in handwriting.
The owls are not what they seem
Why not use OCR? It's not like there aren't techniques for dealing with the OCR errors, such as language models for error correction, n-gram document retrieval and relevance feedback.
Also OCR systems are trainable to learn handwriting styles, even on a per-document basis. But I guess it's a cool hack.
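One of the techniques named above, n-gram retrieval, can be sketched roughly like this. It is a toy illustration, not how any particular OCR package works; the function names and the 0.3 threshold are assumptions picked for the example.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '#'."""
    padded = "#" + word.lower() + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_find(query, ocr_words, threshold=0.3):
    """Return OCR tokens whose n-gram overlap with the query meets the
    threshold, best match first."""
    scored = [(ngram_similarity(query, w), w) for w in ocr_words]
    return [w for score, w in sorted(scored, reverse=True) if score >= threshold]


# "govemor" imitates a typical OCR confusion of "rn" with "m".
print(fuzzy_find("governor", ["govemor", "general", "governor"]))
```

Because the misread token still shares most of its trigrams with the query, it is retrieved even though an exact string match would miss it.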
... eh eh !gniddik tsuJ. !skoobeton inciV ad eht no esool ti teL.
I took a lot of notes in College. I took a lot more notes in graduate school. I've even taken notes on books I've read for the fun of it. If I could run all of these through my scanner & search them from an application on my desktop, I could be really obnoxious in an argument.
Trying to use sarcasm in text-based forums does not work.
>No OCR is performed on the documents.
Yeah but, um, why?
The article points out that the handwriting reader is a Newton.
With ink pots and nibs and everything. But, like fish-tickling and lice-picking, it's a dying art.
Sometimes seventeen/Syllables aren't enough to/Express a complete
You have to be able to handle a quill pen to use it.
It's an interesting approach that should be extended to languages other than English. Most of the world's history is not about the US and it has certainly not been written down in English. What I would really like to have is a similar tool that can search, say, Greek, or Latin, (or whatever) handwritten text. Imagine being able to query Ovid for an item of interest without having to consult everything he's written. I can imagine that this might encourage people to study the classics (a pet peeve of mine is that many people lack historical sense...) and it would certainly facilitate research in this area.
If you can put the queries in English, with the search engine taking care of translation, it would be even better. Then, extended historical study comes within everyone's reach and the classical studies (or humaniora) might be transformed.
----- One learns to itch where one can scratch.
How pleafant that they've done what waf neceffary to make this happen. How did they train the foftware to recognize the quirky 18th Century handwriting?
And the brethren went away edified.
You are new here. (PS, your post is spam too).
Now, if the Library of Congress starts instead storing its data willy-nilly in random image formats, possibly with unpredictable compression algorithms, we are truly on a slippery slope. We risk losing altogether any meaningful standard for what it really means to have a LOC's worth of information. Is it the ASCII text version? Is it the scanned image file? Is it the sum of both? The numbers vary wildly based on the arbitrary choices about which data we include in the LOC. What's worse, there is no single right or wrong answer to this subjective data classification question, so we will never have agreement on this most fundamental of issues.
Clearly, the risks presented by this new untested technology experiment outweigh any possible benefits for the few people who might be interested in these obsolete documents. Consistency must be preserved. Boycott this search system!
We could use it as a jobs program for monks. Their predecessors wrote the manuscripts, and now they could transcribe them into digital form...
A fine is a tax you pay for doing wrong and a tax is a fine you pay for doing all right.
Their handwriting recognition system doesn't work for shit. It couldn't even correctly retrieve results from words that I know are in its scanned letters. The word "governor" appears as a result from one of their suggested queries (*cough* hard coded results *cough*), but if you do a separate search for governor it returns stuff that doesn't even contain the word.
Any man who afflicts the human race with ideas must be prepared to see them misunderstood. -- H. L. Mencken
So they didn't already do it, then?
Long live inglish!!!!!!!!1
It's "Pixelative Text Cognizance."
It's different. With OCR these rays of light scan the original, translate each scanpoint to discrete RGB values, and do pattern recognition.
With this system, they just read the discrete RGB values directly from pixels of documents scanned in with rays of light, then they do recognition of patterns. See, it's totally different.
They aren't doing OCR
Yes, they are. They are not using an off-the-shelf OCR package. The OCR functionality is embedded into their software, it is highly specialized, but it is OCR. For those who are fixated on the letter 'C', recognizing multiple characters as a single unit is nothing new.
And college students during exam season. (Can't speak for the Koreans.)
Blue-stained hands-up, all those who remember those glorious essay exams from the mandatory humanities courses, where your grade ceases to be based on the merits of your ideas (and/or your ability to parrot your professor's ideas), but is solely a function of how well-developed the muscles in your right hand are, in order to keep scribbling for the entire three hours what would have taken you 90 minutes to type.
Of course, even in the dark days before I discovered Slashdot, my CS education had proven to be more than ample preparation for the worst that any Philosophy, History, or (worst of all) English prof could throw at me. *rimshot*
So, when does henscratch.google.com (searchable handwritten blogs) come out?
Convert the search text into an image to look as written by hand.
Then do an image search on the documents. You will need a powerful image recognition software.
This would be news.
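A toy sketch of the "search by image" idea above: turn each word image into a per-column ink profile and compare profiles with dynamic time warping, so a stretched version of the same shape still matches. This is an illustration under simplifying assumptions (tiny binary images written as strings, one feature per column) and is not claimed to be the researchers' method; all names here are hypothetical.

```python
def column_profile(image):
    """Count ink pixels ('#') in each column of a tiny binary word image."""
    width = len(image[0])
    return [sum(1 for row in image if row[x] == "#") for x in range(width)]


def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def rank_matches(query_image, word_images):
    """Sort candidate word images by DTW distance to the query image."""
    query = column_profile(query_image)
    scored = [(dtw_distance(query, column_profile(img)), name)
              for name, img in word_images.items()]
    return [name for _, name in sorted(scored)]


# Hypothetical query image (rendered from the search text) and candidates.
query = ["#..#", "####", "#..#"]
candidates = {
    "other": [".##.", ".##.", ".##."],
    "stretched": ["#...#", "#####", "#...#"],
}
print(rank_matches(query, candidates))
```

The warping step is what lets a wider or narrower rendering of the same word line up with the query, which a pixel-by-pixel comparison would not tolerate.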
>No OCR is performed on the documents. The search tool operates on the image
The search tool is doing the OCR then. OCR is simply taking an image and analyzing it to recognize text.
do you have a newsletter I can subscribe to?
If only Nicholas Cage had this tool at his disposal, it would have made things much, much easier.
Just run the text through an existing handwriting-aware scanner, then run your favorite search tool on it.
Step 1: Combine two existing software technologies.
Step 2: ???
Step 3: Profit!
Wait, that's how software patents work...
Holy shnikes! Optical Character Recognition! Bah.. I'm part of a research team at the Center for Cybermedia Research who are working on new algorithms for OCR with $4 million from Homeland Security. It's to be used on a gi-normous database containing scanned images of documents relating to Yucca Mountain.
On top of that, OCR has been around for years. Yes, it isn't the best, but it's functional. Doesn't the Census Bureau use OCR for its census forms?
So, yeah.. where is the news in the article?
Yes, only give me your email address and you can read all about my aberrant sexual encounters.
>Somebody invented a way for computers to recognize handwriting. Like, so 10 years ago.
I worked on an OCR system about 20 years ago. No pre-defined bitmaps of text, you trained the system on the font to be recognized. After a few hours you could turn it loose and it did fairly well. While goofing off we tried handwritten text. With good penmanship it worked to a degree.
No, OCR stands for Optical Character Recognition. This is Digital Character Recognition on an Optically Acquired Digital Image. Don't you see the difference?
It would surely cost five times as much and need a more complicated algorithm if it were used to search a doctor's handwriting.
Wow, a moderator who has never seen the Simpsons. Next they'll have dating Slashdot posters and editors who spell-check posts.
Danny Dunn and the Homework Machine.
I keep a handwritten log of daily work: when I arrive, when I leave, and what I did. Every week these logs are run through an HP Digital Sender and a PDF version is emailed to me. I then take these PDF files and post them on my personal website. If I can add search capabilities, then that's about as ideal a setup as I can imagine.
This is really, really, really, really stupid. It would be faster just to hand-type the documents into the database and then search it. You could link to pictures of the documents if you really needed them.
There is no sig
I've been using this feature in OneNote for a long time now. It searches through my handwriting with amazing accuracy
They already have a full time job. Praying.
Great. Now people are just going to spoof documents and put pr0n or enlargement spam in the PDFs when I search for anything academic. I'm glad I don't have that problem yet when finding PDF papers via Google.
I think maybe 3rd or 4th grade is the last time you have to use cursive. I do, however, highly recommend giving your kids touch-typing classes, so that they too can keyboard with fluidity (and rapidly lose their writing skills too).
For me, it is a speed issue - I can type MUCH faster than writing, when I have a lot to do, typing on a computer is the way to go (plus, I can't live without speelcheking).
That said, I do agree with others that sometimes, pen and paper is the right way to go - for me that is pen and composition books that I scribble in on a daily basis to keep track of what I was doing, when. (I am a software consultant) - There is nothing faster than flipping thru a comp book with dates on every page to see what I was doing, say August 10 (testing and OSD application). However, the penmanship on those notes is really bad - and anything I learn and jot down I do type into at least a plain text document so that I can search for it later (and have it be legible when I find it)!
Heh, what about shorthand? My Mother used to write in shorthand whenever she wanted to write notes to herself that no one else in the family could read.
Finally - how many of you have even tried to type, on a manual typewriter (if you can find one) lately? I learned on one, and was a speed demon, back in the day. Now, after years of these soft-touch keyboards, I tried punching a few keys on a manual and had a hard time making marks on the paper. Sheesh, you really need to whack those things. Good Riddance!
This issue is a bit more complicated than you think.
No.
That's my point.
The only real threat is fire, and it is no more dangerous than it is to CDs or hard drives.
Go back and look at some old notebooks - if they used acid-based paper, then they'll be getting rather fragile.
Although it is hard to OCR text, and very hard to OCR the cursive text in historical documents, searching those documents does not require complete comprehension of the text and is therefore much easier to do.
For instance, the software may be unable to distinguish the word bug from dog in one person's handwriting, but it can still mark that word with a probability for each possible reading.
If a person later searches for bug or dog along with other terms, the likelihood of a match can be calculated, and the searcher can make his or her own judgment about the meaning of the text.
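The bug/dog example above can be sketched as follows. This is a minimal illustration, assuming each word image has already been given a probability for each candidate reading; the data, the function names, and the choice to multiply the best per-term probabilities together are all assumptions made for the example.

```python
def score_page(page_words, query_terms):
    """Multiply, over the query terms, the best probability that any word
    on the page reads as that term."""
    score = 1.0
    for term in query_terms:
        best = max((readings.get(term, 0.0) for readings in page_words), default=0.0)
        score *= best
    return score


def rank_pages(pages, query_terms):
    """Return (page_id, score) pairs with nonzero score, best first."""
    scored = [(score_page(words, query_terms), page_id)
              for page_id, words in pages.items()]
    return [(page_id, score) for score, page_id in sorted(scored, reverse=True)
            if score > 0]


# Hypothetical recognizer output: one dict of candidate readings per word image.
pages = {
    "letter-a": [{"bug": 0.6, "dog": 0.4}, {"barked": 0.9, "backed": 0.1}],
    "letter-b": [{"bug": 0.5, "dog": 0.5}, {"crawled": 0.8, "called": 0.2}],
}
print(rank_pages(pages, ["dog"]))
print(rank_pages(pages, ["dog", "barked"]))
```

The ambiguous word is never forced to a single reading. The extra query term ("barked") is what tips the ranking, and the searcher still sees the original image to judge for themselves.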
---
Conrad Barski
can it make sense of square roots? Matthew Leung
This reminds me of the _many_ times I've wanted to hit Control-F to find something in a book at school or in some long handwritten paper.
Though it may not seem important to most of us who are used to Microsoft Word, the search engine for handwritten documents is important for the following two reasons: 1) It is an innovation in computer science, and this technology may have applications elsewhere. 2) There are old handwritten documents for which it is not practical to create typed versions. This is an inexpensive method of creating easy access to those documents on the Internet.
The difference might become more obvious when it is applied to the field of digital authentication common in porn and free email providers.
You know, the little images of a couple of chars you have to type over to 'prove' you're human.
I wonder what will be the next step in anti-bot techniques now that this last hurdle seems to be taken as well....
According to this search, the famous Patrick Henry was noted to have actually said "Give me Liberty, or eat up martha!"
Cursive is only slightly faster to write. The fact is that cursive writing encourages sloppy habits and rushing, which degrades quality, as a sample of most folks' cursive will show. Reading cursive not only takes much longer, it also turns reading into a guessing game. Is that an a or an e?
Of course, you use a ballpoint pen for lab notebooks, not fountain pens or other pens based on water-soluble inks. Of course, this won't help you if you spill vodka. :-)
Anyway, in lab situations you might not have a place nearby to put a laptop and you might be running between different laboratories so a laptop is often not very convenient. I was taught that you should write observations directly in a notebook instead of waiting and writing them down later. Moreover, you are not supposed to change the notes once they are written down, a temptation that might be hard to resist if the notes are in a computer file.
"Do you think that OCR is actually the wrong way to think about this problem? After all, we don't really care about characters, but rather about what words and ideas have been written. Do you have a strong background in pattern recognition, machine learning, image processing and computer graphics? Google currently "reads" almost every web page in the world. Come help us read all the printed material as well!"
:)
Requires MS/PhD in CS/EE. Position available only in Mountain View.
http://www.google.com/jobs/eng/sw.html#ocre
(Note: I don't work for Google -- just thought someone on this thread would like
Corollary to Moore's Law: The IQ of new computer owners is declining.
For those with strong views (one way or another) about handwriting, please let me know what you think about the information on this web-page:
Somehow, the URL I mentioned didn't come through. So, again ...
http://www.global2000.net/handwritingrepair
Comparing a language to an operating system is quite ridiculous.
The user interface of an operating system is the language through which users interact with a computer. Its ABI is the language through which users teach a computer to do tasks. Its driver model is the language through which users teach a computer to interact with their devices.
To some experienced Windows users, learning to maintain a GNU/Linux system is like learning another language.