Inkwell No Longer From the Newton?

Posted by pudge on Thursday August 1, 2002 @12:34AM from the and-it-hates-him-too dept.

CrezzyMan writes "From this post on the Newtontalk.net mailing list: Some of you may be interested to know that in the Inkwell section on Apple's website the following original text (straight after the keynote): 'Based on the Newton's 'Print Recognizer'-widely considered to be the world's first genuinely usable handwriting recognition solution-Inkwell's handwriting recognition is highly accurate and extensively tested' has been changed to: 'Built on Apple's Recognition Engine - Inkwell's handwriting recognition is the best in the industry.' Steve must really hate the Newton..." I'd be more likely to consider Inkwell a good technology if I knew it was from the Newton, but I was an actual Newton user. Most people erroneously think the HWR in Newton OS was bad (thanks to The Simpsons!).

The Simpsons? by Captain Pedantic · 2002-08-01 00:54 · Score: 3, Informative

Wasn't it Doonesbury which effectively killed off any hope for the Newton?

--

None are more hopelessly enslaved than those who falsely believe they are free. Johann Wolfgang von Goethe.

It's from Apple ATG (RIP) by Anonymous Coward · 2002-08-01 00:54 · Score: 5, Informative

Neural networks provide robust character recognition for Newton PDAs

Larry Yaeger, Apple Computer

While on-line handwriting recognition is an area of long-standing and ongoing research, the recent emergence of portable, pen-based computers (personal digital assistants, or PDAs) has focused urgent attention on usable, practical solutions.

Pen-based PDAs depend wholly on fast and accurate handwriting recognition, because the pen serves as the primary means for inputting data to the devices. To meet this need, we have combined an artificial neural network (ANN) character classifier with context-driven search over character segmentation, word segmentation, and word recognition hypotheses to provide robust recognition of hand-printed English text in new models of Apple Computer's Newton MessagePad.

Earlier attempts at handwriting recognition used strong, limited language models to maximize accuracy. However, this approach failed in real-world applications, generating disturbing and seemingly random word substitutions known colloquially within Apple as "The Doonesbury Effect" (due to Garry Trudeau's biting satire based on first-generation Newton recognition performance). We have taken an alternative approach, using bottom-up classification techniques based on trainable ANNs, in combination with comprehensive but weakly applied language models. By simultaneously providing accurate character-level recognition, via the ANN, with dictionaries exhibiting very wide coverage of the language (as well as special constructs such as date, time, and phone numbers), plus the ability to write entirely outside those dictionaries (at a low probability), we have produced a hand-print recognizer that some have called the first usable handwriting recognition system.

The core of Apple's print recognizer is the ANN character classifier. We chose ANN technology at the outset for a number of key attributes. First, it is inherently data-driven: it learns directly from examples of the kind of data it must ultimately classify. Second, ANNs can carve up the sample space effectively, with nonlinear decision boundaries that yield excellent generalization, given sufficient training data. This results in an ability to accurately classify similar but novel patterns, and avoids certain classic, subtle data dependencies exhibited by hidden Markov models (HMMs), template matching, and other schemes, such as over-sensitivity to hooks on tails, pen skips, and the like. In addition, there is a rich literature demonstrating the applicability of ANNs to producing accurate estimates of a posteriori probabilities for each class, given the inputs.

In some respects, our ANN classifier is quite generic, being trained with standard error backpropagation (BP). Our network's architecture takes advantage of previous work, indicating that combined, multiple recognizers can be much more accurate than any single classifier. However, we combine those parallel classifiers in a unique fashion, tying them together into a single, integrated multiple-representations architecture, with the last hidden layer for each, otherwise independent, classifier connected to a final, shared output layer. We take one classifier that sees primarily stroke features (tangent slope resampled to a fixed number of points), and another classifier that sees primarily an anti-aliased image, and combine them only at the final output layer. This architecture allows standard BP to learn the best way to combine the multiple classifiers, which is both powerful and convenient.
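The two-branch arrangement described above can be sketched roughly as follows. This is an illustration only, not Apple's code: the sigmoid units, the single hidden layer per branch, the toy layer sizes, and all names (`dense`, `forward`, `params`) are assumptions. The point it shows is that the stroke branch and the image branch stay independent until their last hidden layers feed one shared output layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] is the weight row for unit j."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(stroke_features, image_pixels, params):
    # Each branch sees only its own representation of the character.
    h_stroke = dense(stroke_features, params["stroke_w"], params["stroke_b"])
    h_image = dense(image_pixels, params["image_w"], params["image_b"])
    # The branches meet only here, at the shared output layer.
    return dense(h_stroke + h_image, params["out_w"], params["out_b"])


# Toy sizes (hypothetical): 3 stroke inputs, 4 pixels,
# 2 hidden units per branch, 3 output classes. All weights start at zero.
params = {
    "stroke_w": [[0.0] * 3 for _ in range(2)], "stroke_b": [0.0] * 2,
    "image_w": [[0.0] * 4 for _ in range(2)], "image_b": [0.0] * 2,
    "out_w": [[0.0] * 4 for _ in range(3)], "out_b": [0.0] * 3,
}
outputs = forward([0.1, 0.2, 0.3], [0.0, 1.0, 1.0, 0.0], params)
```

Because the output layer's weights span both hidden layers, a single backpropagation pass through this structure would adjust how much each branch contributes to each class.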

Training an ANN character classifier for use in a maximum-likelihood word recognition system has different constraints than would training such a network for stand-alone character recognition. In particular, we have devised several innovative network training techniques, all of which modestly degrade the accuracy of the network as a pure character classifier, yet dramatically improve the accuracy of the word recognition system as a whole.

The first of these techniques we refer to as NormOutErr, short for "normalized output error." Training an ANN to classify 1-of-N targets with standard BP produces a classifier that does a fine job of estimating p(class|input) for the top-choice class. However, BP's least mean-squared error solution, together with typical classification vectors (which consist of all 0s except for a single 1 corresponding to the target class), results in a classifier that does not estimate second- and third-choice probabilities well. Rather, such classifiers tend to make unambiguous single-choice classifications of patterns that are, in fact, inherently ambiguous. The result is a class of recognition errors involving a single misclassified letter (where the correct interpretation is assigned a zero or near-zero probability) that causes the search to reject the entire, correct word.

We speculated that this effect might be due to the preponderance of 0s relative to 1s in the target vectors, as seen at any given output unit. Lacking any method for accurately reflecting target ambiguity in the training vectors, we tried partially normalizing this "pressure toward 0" relative to the "pressure toward 1." We did this by modulating the error seen at nontarget output units by a scale factor, while leaving the error at the target output unit unmodified. This generally increased the activation levels of the output units, and forced the network to allocate more of its resources to the modeling of low probability samples and classes. Most significantly, it allowed the network to model second- and third-choice probabilities, thus making the ANN classifier a better citizen in the larger recognition system. While this technique reduced top-choice character accuracy on the order of a percent, it dramatically increased word-level accuracy, resulting in approximately a 30% reduction in word-level error rate.
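A minimal sketch of the error modulation just described, assuming a plain (target minus output) error signal. The function name and the `nontarget_scale` parameter are hypothetical; the text above does not say how the scale factor was chosen, only that the nontarget error is scaled and the target error is left alone.

```python
def norm_out_err(outputs, target_index, nontarget_scale):
    """Per-unit error signals with the nontarget errors scaled down.

    outputs: current activations of the output units.
    target_index: index of the correct class (its target is 1, all others 0).
    nontarget_scale: factor below 1 that weakens the "pressure toward 0".
    """
    errors = []
    for i, y in enumerate(outputs):
        target = 1.0 if i == target_index else 0.0
        err = target - y
        if i != target_index:
            err *= nontarget_scale  # only nontarget units are modulated
        errors.append(err)
    return errors


errors = norm_out_err([0.2, 0.7, 0.1], target_index=1, nontarget_scale=0.5)
```

These scaled errors would then be backpropagated in place of the raw ones, which leaves second- and third-choice units freer to keep nonzero activations.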

Another of the techniques we apply routinely in our ANN training is what we call frequency balancing. Training data from natural English words and phrases exhibit very nonuniform priors for the various character classes, and ANNs readily model these priors. However, as with NormOutErr, we find that reducing the effect of these priors on the net, in a controlled way, and thus forcing the net to allocate more of its resources to low-frequency, low-probability classes, significantly benefits the overall word recognition process. To this end, we explicitly (partially) balance the frequencies of the classes during training. We do this by probabilistically skipping and repeating patterns, based on a precomputed repetition factor. (Each presentation of a repeated pattern is "warped" uniquely, as discussed later.) This balancing of class frequencies is conceptually related to a common method for converting from ANN estimates of posterior probability p(class|input), to the value needed in an HMM or Viterbi search p(input|class), which is to divide by p(class) priors. However, our approach avoids potentially noisy estimates of low-probability classes resulting from division by small numbers, and eliminates the need for subsequent renormalization. Again, character-level accuracy suffers slightly by the application of this technique, but word-level accuracy improves significantly.
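The skip-and-repeat idea can be sketched like this. The power-law form of the repetition factor, the `strength` knob (0 means no balancing, 1 means full balancing), and the function names are assumptions made for illustration; the text only says that a precomputed repetition factor drives probabilistic skipping and repeating.

```python
import random


def repetition_factors(class_counts, strength):
    """Hypothetical repetition factor per class: rarer classes get factors above 1."""
    mean_count = sum(class_counts.values()) / len(class_counts)
    return {c: (mean_count / n) ** strength for c, n in class_counts.items()}


def presentations(factor, rng):
    """How many times to present one pattern this epoch.

    The integer part is a guaranteed repeat count; the fractional part is
    the probability of one extra presentation. A factor below 1 therefore
    means the pattern is sometimes skipped altogether.
    """
    whole = int(factor)
    extra = 1 if rng.random() < factor - whole else 0
    return whole + extra


counts = {"e": 900, "q": 100}          # toy class frequencies
factors = repetition_factors(counts, strength=0.5)
rng = random.Random(0)
epoch = {c: presentations(f, rng) for c, f in factors.items()}
```

With `strength=0.5` the rare class is repeated more often and the common class is sometimes skipped, but the original imbalance is only partly removed, which matches the "partially balance" wording above.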

While frequency balancing corrects for under-represented classes, it cannot account for under-represented writing styles. We use a probabilistic skipping of patterns to address this problem as well, but this time for just those patterns that the net correctly classifies in its forward/recognition pass, which results in a form of error emphasis. We define a correct-train probability for use as a biased coin to determine whether a particular pattern, having been correctly classified, will also be used for the backward/training pass. This only applies to correctly segmented, or positive patterns, and misclassified patterns are never skipped. Especially during early stages of training, we set this parameter fairly low, thus concentrating most of the training time and the net's learning capability on patterns that are more difficult to correctly classify. This is the only way we were able to get the net to learn to correctly classify unusual character variants, such as a three-stroke "5" as written by only one training writer.
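The biased-coin decision described above might look like the following sketch. The function and parameter names are hypothetical, and the handling of negative (missegmented) patterns is an assumption: the text says only that this skipping does not apply to them, so the sketch always trains on them.

```python
import random


def use_for_training(is_positive, correctly_classified, correct_train_prob, rng):
    """Decide whether a pattern gets a backward/training pass."""
    if not is_positive:
        return True   # assumption: error emphasis never skips negative patterns
    if not correctly_classified:
        return True   # misclassified patterns are never skipped
    # Correctly classified positive pattern: flip the biased coin.
    return rng.random() < correct_train_prob


rng = random.Random(0)
# With a low probability, most easy patterns are skipped...
kept_easy = sum(use_for_training(True, True, 0.1, rng) for _ in range(1000))
# ...while every hard pattern is still trained on.
kept_hard = sum(use_for_training(True, False, 0.1, rng) for _ in range(1000))
```

A low `correct_train_prob` early in training concentrates the updates on the patterns the net still gets wrong, such as rare character variants.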

Other special training techniques include negative training (presenting missegmented collections of strokes as training patterns, along with all-zero target vectors) and stroke warping (deliberate random variations in stroke data, consisting of small changes in skew, rotation, and x and y linear and quadratic scalings). During recognition, the ANN classifier will necessarily encounter both valid and invalid combinations of strokes, and must classify them as characters. Negative training helps by tuning the net to suppress its output activations for invalid combinations, thus reducing the likelihood that those missegmentations will find a place in the optimum search path. Stroke warping effectively extends the data set to similar, but subtly different writing styles, and enforces certain useful invariances.
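Stroke warping as described above could be sketched as follows. The order in which the distortions are applied, the magnitude limits, and the assumption that points are normalized around the origin are all illustrative choices, not details given in the text.

```python
import math
import random


def warp_stroke(points, rng, max_rot=0.1, max_skew=0.1, max_lin=0.1, max_quad=0.05):
    """Return a randomly distorted copy of a list of (x, y) stroke points."""
    rot = rng.uniform(-max_rot, max_rot)        # rotation in radians
    skew = rng.uniform(-max_skew, max_skew)     # horizontal shear
    sx = 1.0 + rng.uniform(-max_lin, max_lin)   # linear x scaling
    sy = 1.0 + rng.uniform(-max_lin, max_lin)   # linear y scaling
    qx = rng.uniform(-max_quad, max_quad)       # quadratic x scaling
    qy = rng.uniform(-max_quad, max_quad)       # quadratic y scaling
    cos_r, sin_r = math.cos(rot), math.sin(rot)

    warped = []
    for x, y in points:
        x = sx * x + qx * x * x                 # linear plus quadratic scaling
        y = sy * y + qy * y * y
        x = x + skew * y                        # skew
        warped.append((cos_r * x - sin_r * y,   # rotation
                       sin_r * x + cos_r * y))
    return warped


stroke = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
warped = warp_stroke(stroke, random.Random(0))
```

Drawing fresh random parameters on every call means each presentation of a repeated pattern is distorted differently.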

Two practical considerations in building an ANN-based system for a hand-held device are speed and memory limitations. Especially for the ARM 610 chip that drives the Newton MessagePad 120 and 130 units, 8-bit integer operations are much faster than either longer-integer or floating-point operations, and cache coherency benefits from reduced data sizes. In addition, memory is at a premium in these devices. So, despite previous work that suggests ANN training requires roughly 16-bit weights, we were highly motivated to make 8-bit weights work. We took advantage of the fact that the ANN's forward/recognition pass is significantly less demanding, in terms of precision, than is the backward/learning pass. It turns out that 1-byte (8-bit) weights are sufficient if the weights are properly trained. We limit the dynamic range of floating-point weights during training, and then round to the desired precision after convergence. If the weight limit is enforced during high-precision training, the net's resources will adapt to compensate for the limit. Because bias weights are few in number, however, and very important, we let them use 2 bytes with essentially unlimited range. Performing our forward/recognition pass with low-precision, 1-byte weights (a 3.4 fixed-point representation, ranging from almost -8 to +8 in 1/16 increments), we find no noticeable degradation relative to floating-point, 4- or 2-byte weights using this scheme. We have also developed a net training algorithm based on 8-bit weights, by appending an additional 2 bytes, during the backward/training pass only, that accumulate low-order changes, only occasionally carrying over into the primary 8-bit range, which affects the forward/recognition pass.
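A rough sketch of the 1-byte "3.4" weight format mentioned above: each weight is stored as an integer count of 1/16 steps. The signed two's-complement range (-128 to 127), the helper names, and the float inputs in `weighted_sum` are assumptions; the text gives only the format and the train-with-limits, then-round procedure.

```python
FRACTION_BITS = 4              # 3.4 format: 4 fractional bits
SCALE = 1 << FRACTION_BITS     # 16 steps per unit
Q_MIN, Q_MAX = -128, 127       # assumed signed 8-bit integer codes


def clamp_weight(w):
    """Limit a floating-point weight during training to the representable range."""
    return max(Q_MIN / SCALE, min(Q_MAX / SCALE, w))


def quantize(w):
    """Round a trained floating-point weight to its 8-bit integer code."""
    return max(Q_MIN, min(Q_MAX, round(w * SCALE)))


def dequantize(q):
    """Value represented by an 8-bit integer code."""
    return q / SCALE


def weighted_sum(inputs, q_weights, bias):
    """Forward-pass sum using integer weight codes, rescaled once at the end."""
    acc = sum(q * x for q, x in zip(q_weights, inputs))
    return acc / SCALE + bias


codes = [quantize(w) for w in (0.5, -1.0, 100.0)]
total = weighted_sum([1.0, 2.0], [8, -16], bias=0.25)
```

Clamping during high-precision training, rather than only at the end, is what would let the other weights adapt to the limit before rounding.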

So, in summary, we have devised several techniques for using and training an ANN classifier that is to be embedded in a higher-level recognition system. Some, such as limited precision weights, are a direct result of physical limitations of the device. Others derive from the fact that an ANN classifier providing class probability estimates to a search engine necessarily has different constraints than does such a classifier operating alone. Despite the seemingly disparate nature of the various techniques we've described, there does seem to be a unifying theme, which is that reducing the effect of a priori biases in the data on network learning significantly improves the system's overall accuracy. Normalization of output error prevents overrepresented nontarget classes from biasing the net against underrepresented target classes. Frequency balancing prevents over-represented target classes from biasing the net against under-represented target classes. And error emphasis prevents over-represented writing styles from biasing the net against under-represented writing styles.

One could even argue that negative training eliminates an absolute bias toward properly segmented characters, and that stroke warping reduces the bias toward those writing styles found in the training data, although these techniques provide wholly new information to the system as well. The general effect may be related to the technique of dividing out priors, as is sometimes done to convert from p(class|input) to p(input|class). In any event, it is clear that paying attention to such biases and taking steps to modulate them represent a vital component of effectively training a neural network serving as a classifier in a maximum-likelihood recognition system. It is also clear that ANN classifiers in conjunction with optimal search strategies provide a degree of accuracy and robustness that is otherwise difficult to obtain.

This work was performed in collaboration with Richard Lyon (Apple), Brandyn Webb (The Future), Bill Stafford (Apple), and Les Vogel (Angel Island Technologies). We are also indebted to many supportive and contributing colleagues at Apple and in the connectionist community. A more detailed, technical discussion of our recognition system is available through my Web page (Larry Yaeger, pen-based character recognition: http://www.atg.apple.com/personal/yaeger).

Larry Yaeger is technical lead at Apple Computer in the development of the neural network-based hand-print recognition system used in second-generation Newton PDAs. At Digital Productions, he used a Cray X-MP supercomputer to generate the computer-graphics special effects for Hollywood films The Last Starfighter, 2010, and Labyrinth. While with Alan Kay's Vivarium Program at Apple, he designed and programmed a computer "voice" for Koko the gorilla, and created the PolyWorld artificial-life computational ecology that evolves neural architectures resulting from the mutation and recombination of genetic codes, via behavior-based, sexual reproduction of artificial organisms. Contact him at larryy@apple.com

Re:Class notes by PD · 2002-08-01 05:59 · Score: 2, Informative

The computer is trained to recognise your writing, over time. You might have crappy writing, but if you're consistent about it the Newton can read it.

--
If tits were wings it'd be flying around.

Re:Oh grow up. by DJSpray · 2002-08-01 10:20 · Score: 5, Informative

All right, to try and keep this from turning into an "it sucked! no, it didn't!" debate, some background.

I used (and programmed) every version of Newton device. There were several generations of Newton recognition software. The first generation was actually licensed from a Russian company called Paragraph. There was some speculation that it would perform better with Cyrillic. You could set it for cursive recognition, and could also tweak the individual character shapes it was looking for.

The algorithms were largely dictionary-based. Hence, it had a tendency to either do really well, getting the words completely right, or really badly (substituting a really wacky word choice that triggered a match). It was also possible to have the settings quite wrong for your handwriting, so that, for example, it did not know when you were breaking words (via letter spacing or pauses in your writing).

People had very mixed results with it. If you did not use cursive, it tended to be even worse. There was some idle speculation that it probably did really well with Cyrillic cursive, due to the software's origins, but I never heard any substantiation of that rumor.

The original Newton (through OS 1.05) had many other problems, and clearly came out of the oven a bit too early. Battery life was poor. Memory was very limited. Recognition was extremely slow. One of the most noticeable problems was that the recognizer had a tendency to lock up and stop recognizing text; you had to hit the reset button to get it moving again. Fortunately, the early Newton stored data in flash, so even doing a reset after a severe crash, you were unlikely to lose any data. (You pretty much had to forcibly wipe the flash to do that.)

Version 2.0 of the Newton software, dubbed "Newton Intelligence", came about initially with the MessagePad 130 or a ROM update to the 120. Developers were able to get the ROM from Apple and do the replacement themselves. 2.0 featured a new recognition strategy: support the cursive recognizer, and also offer a new character-by-character print recognizer. This one worked much better for me, and a lot of other users thought so too. It would also work for cursive. This is the recognizer that presumably Apple has ported. Allegedly it was based on an ATG project. One of the things that made it better was that its algorithms were more character-based than dictionary-based; it tended not to pick completely incorrect words. Instead, you'd see what you wrote with perhaps one letter wrong. You could use gestures similar to proofreading marks to edit that one letter, or even over-write the single wrong letter.

The Apple employees who described it at the Newton developer conferences spoke of it this way: when it made a mistake, instead of saying "HUH???", the user would say "huh."

When I first used a Palm, I was very disappointed that they were not able to use a recognition engine like the Newton character recognizer. It really did work much better. Yes, I learned Graffiti, but I never liked it, and to this day I don't use a Palm. I would use the Newton character recognizer on a portable Palm-sized device quite happily.

Apple doesn't get much credit for its innovations. Remember that the MessagePad 2100 could do all this, although somewhat awkwardly and perhaps just barely:

- run Pocket Quicken or other checkbook apps
- do shape recognition and editing
- run a spreadsheet
- run a graphing calculator/solver
- store text as raw "ink" (compressed vector graphics) and recognize it at a later time
- provide 2 PCMCIA memory card slots
- do infrared data exchange
- drive a modem card
- run a mini web browser
- do desktop sync
- record and playback voice memos
- do text-to-speech (in a pretty primitive Macintalk-1.0 way)
- support a keyboard
- run applications written in a very cool dynamic, interpreted, byte-coded language optimized for low memory footprint, using ideas from languages like Scheme and Self, but with a simple Pascal-like syntax. (This was pre-Java.)

Of course, it was also way too expensive, and Apple was not able to get the price down in time to gain market share. Flash memory was expensive. Static memory was expensive. The screen was expensive. It was expensive to assemble. If you ever took one apart, it was clear that it required a great deal of skilled labor to assemble. The screen itself was an elaborate sandwich of the recognizer, the LCD screen, and a backlight. There were wires running all over the innards. Compare that to a Palm device, which was designed for minimal chip count and minimal cost to manufacture. It was not just Apple's desire to maintain profit margins at work, although that no doubt played a part too.

I personally was very fond of using the Newton to keep notes and balance my checkbook. The character-based recognizer and StrongARM chip made it usable.

Nevertheless, when we developed software for naive new-to-the-device end users, we did everything with popups and radio buttons. Trying to get a novice user to successfully use any computer handwriting recognizer immediately is not yet feasible (and may not be for some time).

I personally am rather annoyed to see Steve's childish behavior in ignoring and marginalizing all the R&D and many smart people's hard work that went into the Newton. Sure, it may have been a failure as a product in the long term, but it caught people's imagination and was definitely a technological success in many ways, and its technologies have not been equalled in another product.

The engineers who designed it don't deserve more snide behavior from Apple's self-appointed "savior." They took enough flak at the time for trying to create something so far ahead of the curve, and getting yanked around by having the Newton group spun in, spun out, and unceremoniously killed.

The marketers and project managers leading it certainly deserve a healthy share of the blame for the Newton's failure; in a way, they were letting the engineers design the product and stuff it with features, which gave it a high geek value but not much chance of mass-market success and not much cost-effectiveness, but then nickel-and-diming them on things like memory. It's unclear in retrospect what outcome they really could have expected under those circumstances.

Paul R. Potts

Re:Oh grow up. by Mr. Protocol · 2002-08-01 10:48 · Score: 3, Informative

"I used (and programmed) every version of Newton device. There were several generations of Newton recognition software. The first generation was actually licensed from a Russian company called Paragraph."

I had a chance to talk to the folks at ParaGraph International at a mobile computing conference once. It was a very enlightening conversation. They all used to work at the Soviet Academy of Sciences, and had decided that rather than slowly starve in the post-Soviet era, they'd rather form a Russian equivalent of Bell Labs. Now, it turns out one thing they were really good at was curve-fitting. They could use higher-order polynomials to compress and characterize curves of arbitrary shape. As a demonstration, they'd taken a Picasso pencil sketch (in color) and compressed it down to 17K, then re-expanded it into something indistinguishable from the original.

These folks told me that when Apple first contracted with them, they were held at such arm's length that they didn't even know what kind of device they were writing a recognizer for. They never even saw a Newton until it hit the market. Hence, they had no opportunity to tune the recognizer. Those who've used Newtons know that things were difficult at best until Newton OS 2.0 came out for the 120. After that, it got much better (and the Rosetta printed recognizer really helped). That was the release that used the 'tuned' cursive recognizer, and with further tweaking in Newton OS 2.1, it pretty much rocks. No more Egg Freckles.

Inkwell, of course, is based on Rosetta, not the ParaGraph recognizer, but the latter is available as a separate package for other PDAs.