Statistics For Data Entry: The Brave New Step



Posted by Hemos on Monday October 25, 2004 @12:54AM from the moving-ever-onward dept.

A reader writes: "First there was Dasher, a novel application of statistical theory that lets free text be written using only a pointing device. Dasher works by predicting the continuations of the text being written, based on what has been written so far; there is a probability associated with each offered continuation and the presentation is designed to make it easier to choose more probable continuations. A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones. Now the same approach has been extended to writing maths. Apropos is a Javascript application (it supports IE6 & Firefox) to create mathematical expressions. It represents the math using MathML, the official XML spec for mathematics. It is definitely clunky when compared to Dasher, but better than MS Equation Editor etc. It is interesting to consider whether this approach can be extended to other XML vocabularies (for example, a model for HTML that suggests the markup as you go along - a properly trained one will make it harder to create pages with blinking text, loads of images etc.), or to formal languages other than XML (e.g. programming languages). Stochastic modeling can also be used as a basis for speech recognition, with the recognizer using the model to choose a continuation when the speech signal is ambiguous or indistinct."


Like t9 by xabi · 2004-10-25 00:57 · Score: 4, Interesting

It seems to be the same concept as t9.

--
Check populicio.us
1. Re:Like t9 by xabi · 2004-10-25 00:59 · Score: 3, Insightful
  
  More info in the same dasher web site here
  
  --
  Check populicio.us
2. Re:Like t9 by a_hofmann · 2004-10-25 01:49 · Score: 3, Interesting
  
  While the concept is the same, the application goes way further than t9. This is where I see such ideas bound to fail.
  
  t9 is a great technology because the vocabulary used in writing SMS is pretty narrow. After entering the first few characters of a word, the contextual information in the dictionary is good enough (most of the time) to suggest the wanted word very fast. t9 is even able to derive this information without the user specifying the exact characters, but rather just one of the 3-4 on any mobile key.
  
  As said this is possible because of the small dictionary of probable words and the good contextual information for characters and their position in words.
  
  Extending this tech (or even better methods) to larger domains makes the problem much harder very fast; my feeling is that the increase is exponential in the mathematical sense.
  
  There may be a small, well-defined set of possible mathematical formulae, if you divide things into small enough chunks. Saying the same about XML documents or even natural-language text (beyond the character/word level) is imho foolish.
  
  I would like to stress that such technologies are very important and promising for lowering the input barriers for disabled people, as they already do for anyone on very limited devices (like mobile keyboards). On the other hand, I don't think such things will change the way most people already put information into their computers.
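
The dictionary lookup described in this reply (one key press stands for any of its 3-4 letters, and a dictionary resolves the ambiguity) can be sketched roughly as follows. This is a minimal illustration and not the actual T9 implementation: the word counts are invented, and `word_to_keys` and `build_index` are hypothetical helper names.

```python
from collections import defaultdict

# Letter groups on a standard phone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def word_to_keys(word):
    """Map a word to the digit sequence that would type it."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(word_counts):
    """Index words by key sequence, most frequent word first."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[word_to_keys(word)].append((count, word))
    return {
        keys: [w for _, w in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for keys, entries in index.items()
    }


# Hypothetical word counts, invented for illustration only.
WORD_COUNTS = {"good": 50, "home": 40, "gone": 20, "hood": 5, "tomorrow": 30}
INDEX = build_index(WORD_COUNTS)

# All four short words share one key sequence, so frequency decides the order.
print(INDEX["4663"])
```

With a small dictionary few words collide on the same keys, which is the point made above; a larger vocabulary produces more collisions per key sequence.
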
Old technology by Inigo+Soto · 2004-10-25 00:58 · Score: 4, Interesting

That is hardly news. Mobile phones have been offering this kind of interface for years. True, they are useful, but there is nothing new here.
It sounds good .. in theory by Anonymous Coward · 2004-10-25 00:58 · Score: 5, Funny

"You appear to be writing a letter, and here's what you're probably going to say..."
Re:correctness? by Anonymous Coward · 2004-10-25 01:01 · Score: 2, Insightful

Or the poster uses the British form of English where, I believe, this is correct usage. Not everyone is a 'Merican, you know.
I Can Only Hope...... by ObsessiveMathsFreak · 2004-10-25 01:01 · Score: 2, Interesting

...That this guy will GPL this software rather than start up a private company.
Then maybe I'd get it in the next version of Fedora.

I'm so sick of *Tex.

*sigh*

--
May the Maths Be with you!
1. Re:I Can Only Hope...... by Carnildo · 2004-10-25 06:33 · Score: 2, Informative
  
  First, it's TeX, not Tex. Secondly, TeX goes through email, and most people who care can read it unrendered very easily, so they don't need to install any dopey software just to read two little formulas in my e-mail. Plus, TeX math notation is fast to type, and you only need to learn a page or so from the TeX manual in order to be able to use it for math. So, how is this Dasher thing better?
  
  You're comparing apples and aardvarks here. Dasher is an input method that tries to predict what letter you'll input next based on what you've input so far -- think of it as an improvement on those silly on-screen keyboards. TeX is a typesetting notation -- you could theoretically use Dasher to write TeX, if you wanted to do such a thing on your handheld.
  
  --
  "They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
I'm not hopeful by Anonymous Coward · 2004-10-25 01:03 · Score: 2, Insightful

Dasher works because there is a small number of words that are likely to follow on from where you are. The same does not apply to MathML or HTML. The most useful you are likely to get is tab-completion for tag names, attribute names, etc.
1. Re:I'm not hopeful by weierstrass · 2004-10-25 01:30 · Score: 2, Insightful
  
  What we write is only predictable to the extent that it is redundant: ie when i type "tomor" into my mobile phone, if it's obvious to the phone i'm going to write "tomorrow", i could just send a msg saying "C U tomor".
  
  It doesn't seem to me that there's anything like as much redundancy in mathematical formulae as there is in written language. When the professor writes "X=..." on the board, it's very hard to predict the next symbol unless you know what x is in fact equal to.
  
  --
  my password really is 'stinkypants'
2. Re:I'm not hopeful by KevinKnSC · 2004-10-25 02:53 · Score: 2, Informative
  
  You're incorrect when you say that what we write is only as predictable as it is redundant.
  There are over 90,000 words in the English language (based on number of entries in the American Heritage Dictionary), but nobody uses all of them. Good predictive data entry is not just a matter of waiting until you've typed "tomor" and concluding that you're going to write "tomorrow" because no other words begin that way, it's a matter of noticing when you get to "tom" that, based on your past word usage, the most likely word for you to use is "tomorrow".
  You can apply the same concepts to mathematics. When the professor in your example writes "X=..." on the board, you can guess that what's coming next is either another symbol, a literal number, or a mathematical expression. If the professor indicates that it's a mathematical expression, you can then guess that it will probably be the same kind of expression that he uses most often. For example, if you're in a calculus course, you could make a good guess that the expression will be an integral or derivative, and so on.
  Apropos does this all with a point and click interface, which is, as mentioned before, much better than MS Equation Editor or writing out the MathML by hand.
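
The "tom" to "tomorrow" argument in this reply amounts to ranking completions by how often the user has typed them before. A rough sketch of that idea, assuming a simple word-count model; the history text and the function names `train` and `predict` are made up for illustration and are not taken from Dasher or Apropos:

```python
from collections import Counter


def train(history):
    """Count how often each word appears in the user's past text."""
    return Counter(history.lower().split())


def predict(counts, prefix, k=3):
    """Return up to k words starting with prefix, most frequent first."""
    matches = [(n, w) for w, n in counts.items() if w.startswith(prefix)]
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [w for _, w in matches[:k]]


# Hypothetical usage history, invented for illustration only.
HISTORY = "see you tomorrow tomorrow is fine tomato soup tomorrow then tom called"
COUNTS = train(HISTORY)

# "tom" is still ambiguous, but past usage puts "tomorrow" first.
print(predict(COUNTS, "tom"))
```

The same ranking could in principle be applied to expression types instead of words, which is the extension to mathematics suggested above.
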
Riiiiiiiiight by $RANDOMLUSER · 2004-10-25 01:05 · Score: 4, Funny

"Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe."

It knew he was going to say that.
More likely, it's going to predict that someone's going to say "Let's circle back and touch base tomorrow".

--
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
Why this isn't the same as T9. by pkhuong · 2004-10-25 01:06 · Score: 4, Informative

Having used both Dasher and T9, it seems to me that T9 only takes into account the keystrokes entered for each word. It then correlates them to a dictionary. Dasher, on the other hand, is based on Markov chains (yes, like those word/text generators), and thus takes into account the last [n] characters. That makes it much more accurate, and, interestingly enough, should make it particularly well-suited to editing programs in most mainstream languages, since they have a lot of noise words and frequently used sequences.
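
For readers who have not seen one, a character-level Markov model over the last n characters might look like the sketch below. It is a fixed-order toy written under that assumption, not Dasher's actual language model, and the names `train_char_model` and `next_char_probs` are hypothetical.

```python
from collections import Counter, defaultdict


def train_char_model(text, n=3):
    """Count which character follows each n-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        model[text[i:i + n]][text[i + n]] += 1
    return model


def next_char_probs(model, context, n=3):
    """Return (char, probability) pairs for the last n chars of context."""
    counts = model.get(context[-n:])
    if not counts:
        return []  # context never seen in training
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(ch, c / total) for ch, c in ranked]


# Tiny code-like sample, invented for illustration only.
SAMPLE = (
    "for (i = 0; i < n; i++) { sum += a[i]; }\n"
    "for (j = 0; j < m; j++) { sum += b[j]; }\n"
)
MODEL = train_char_model(SAMPLE, n=3)
print(next_char_probs(MODEL, "for", n=3))
```

Source code repeats short sequences such as `for (` and `; }` very often, so even a small context gives sharply peaked predictions.
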

--
Try Corewar @ www.koth.org - rec.games.corewar
Quick test by potifar · 2004-10-25 01:13 · Score: 4, Insightful

MathML was never really intended for writing by hand, and even if Apropos makes it easier, I can't see myself switching from (La-)TeX anytime soon. I can enter extremely complex mathematical expressions at least 20-30 times faster by typing them in TeX than I ever could do clicking around an interface like Apropos.
MathML is a good idea in theory, but until there are good tools for writing and editing MathML, there will be very few people using it (either for publishing or for archival purposes.)
Failures of inattention by hussar · 2004-10-25 01:17 · Score: 4, Interesting

As other posters have noted, this sounds a lot like T9, which is used in cell phones for predictive text entry. T9 is a great utility, but it sometimes happens that what I am writing is less predictable, or that there is a more commonly used combination of letters for the keys I have hit. If I don't pay attention, I get the wrong word.

I can't help but think of someone entering a mathematical equation and concentrating more on his idea than what is being written to the screen. Due to this inattention, the equation doesn't work, he figures he's just wrong, and spends hours/days to find the point at which the computer put in its prediction and not what he thought he entered. Worst case, he could abandon what would have been a great idea.

Or, imagine this applied to writing computer programs. Say for example, you are writing a program to calculate the correct distance the probe should hold above the atmosphere so it doesn't burn up. Your cube mate distracts you briefly, and...

--

Bureaucracy loves company.
GIGO would be proud by 10am-bedtime · 2004-10-25 01:23 · Score: 2, Insightful

data integrity starts w/ data entry. when data entry is reduced to "no" vs "yes-for-now-we-can-fix-it-later", the game is lost; GIGO prevails, then.
Correctness, huh? by The_REAL_DZA · 2004-10-25 01:24 · Score: 2, Funny

A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.

They obviously didn't include many PHBs' writings in their calculations...
I'm frequently amazed at some of the grammatical... umm... experimentations undertaken by the upper two or three levels of management in their memos -- and the speeling, good grief, the SPEELING!! Is [F7] the last great secret of our civilization?!?!

--

This space intentionally left (almost) blank.
Dasher and stats rock by palad1 · 2004-10-25 01:26 · Score: 4, Funny

I did a quick test run of Dasher instead of RTFA, and as far as I understand, it works by presenting the most statistically-probable letter in the middle of the input area.

So, by dragging a perfectly horizontal line with my mouse cursor, I was able to create the most statistically-probable sentence.

Here goes, for Science:

Kennedy insider&xeathGhed a noviceable. Punt.uetGrance beganic or Central believe t, space ship,' Alice, it is deleasantB.Carzone.That's luJbi

Conspiracy theorists, area51 nuts and cypherpunks are going to be thrilled!
Ahem by kahei · 2004-10-25 01:33 · Score: 3, Insightful

A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.

In a rigorous, technical environment, being _usually_ correct is not enough and a statistics-based approach to ensuring correctness is not very useful.

In an informal environment, correctness is not nearly as common as you might hope, so again a statistics-based approach may well not be as good as actually enforcing definite correctness.

--
Whence? Hence. Whither? Thither.
Re:correctness? by r_j_howell · 2004-10-25 01:48 · Score: 3, Interesting

from the dasher site http://www.inference.phy.cam.ac.uk/djw30/dasher/ :
With version 3, as with version 1.6, every language requires a text file full of natural writing (about 300K or more); a specification of the alphabet of the language is also required.
It wouldn't be hard at all to make it work for English, as opposed to Americanese; all you have to do is train it on text written with your own preferred idiosyncrasies.
This approach favors bloated, redundant encodings by alispguru · 2004-10-25 01:56 · Score: 3, Interesting

The reason predictive interfaces work is that most encodings have some degree of redundancy in them. English text is about 50% redundant information, in an information-theoretic sense, and anything based on XML is going to be more so.

To see this for yourself, pick a nice big hunk of English text and gzip it. You'll get about 50-60% compression. Now, pick a similar-sized hunk of XML and gzip it - you'll probably get 75% compression or more.
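
One way to run that comparison is a few lines of Python around the standard `gzip` module. This is only a sketch: `compression_saving` and `saving_for_file` are hypothetical helper names, and the in-memory sample is invented and far more repetitive than real XML.

```python
import gzip


def compression_saving(data: bytes) -> float:
    """Return the fraction of bytes saved by gzip (0.0 means no saving)."""
    if not data:
        return 0.0
    return 1.0 - len(gzip.compress(data)) / len(data)


def saving_for_file(path):
    """Measure the gzip saving for one file on disk."""
    with open(path, "rb") as fh:
        return compression_saving(fh.read())


# Highly repetitive markup, invented for illustration only.
xml_like = b"<item><name>value</name></item>\n" * 500
print(f"repetitive markup: {compression_saving(xml_like):.0%} saved")
```

To reproduce the comparison, call `saving_for_file` on a large English text file and on an XML file of similar size. Very small inputs can come out negative because of the fixed gzip header and trailer.
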

Tools like this make using bloated, redundant encodings more tolerable by automating some of the redundancy away. It's not clear to me that this is a good thing.

--

To a Lisp hacker, XML is S-expressions in drag.
Re:Why? by r3m0t · 2004-10-25 01:58 · Score: 2, Informative

'Why should it? What if I want to create such a page? Why should someone (or something) tell me what to say, or how to say it? And who will "train" such a thing? The Government??'

To make the other (more likely) options more easily available, spend a lot of time poking around for tags with smaller targets *or* type it by hand *or* change the settings to lower the effect of prediction *or* replace the training files *or* just use the damn thing since it'll learn, nobody's telling you to do anything, and The Government (as you call it) wouldn't bother.

Happy now?
Re:Training by Dracolytch · 2004-10-25 03:01 · Score: 2, Informative

Ok, OpenOffice.org proved to be too large for me to really use, so I hopped over to the GIMP instead. I grabbed a copy of their source, and created a text file that appended all of the c files I could find in one directory... About 750k.
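
Building a training file like that takes only a few lines. The sketch below assumes the training corpus is a plain concatenation of the `.c` files in a single directory; `build_training_file` is a hypothetical helper, and the demo runs against throwaway files in a temporary directory rather than the real GIMP source.

```python
import tempfile
from pathlib import Path


def build_training_file(source_dir, output_path, pattern="*.c"):
    """Append every file matching pattern in source_dir into one text file."""
    paths = sorted(Path(source_dir).glob(pattern))
    with open(output_path, "w", encoding="utf-8") as out:
        for path in paths:
            # Replace undecodable bytes instead of failing on odd encodings.
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
    return len(paths)


# Demo on throwaway files, invented for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "a.c").write_text("int a;\n", encoding="utf-8")
    (src / "b.c").write_text("int b;\n", encoding="utf-8")
    (src / "notes.txt").write_text("ignored\n", encoding="utf-8")
    out_file = Path(tmp) / "training.txt"
    n_files = build_training_file(src, out_file)
    corpus = out_file.read_text(encoding="utf-8")

print(n_files, len(corpus))
```
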

I took the "English with lots of punctuation", and copied the .xml file. It turns out that using their little interface for creating a language is a PITA, and just copying an existing file works pretty well. I tweaked it to change the name of the language, and point to the right training document.

It needs a little work, because there's no way to tell the difference between a space and an underscore, but for the most part it works pretty well. As a fairly quick test, I'd call it a great success.

I also did the same thing using PHP. Similar results. I got a chuckle when I was able to visually see the probability of me typing _POST or _REQUEST after any $.

Pretty neat. Slower than typing, but it has some interesting possibilities.

~D

--
This sig has been enciphered with a one-time pad. It could say almost anything.
this idea has been around for a long, long time by TheDemotic · 2004-10-25 06:45 · Score: 2, Interesting

Claude Shannon, the father of information theory, used the idea referenced here in his famous 1950 experiment to calculate the entropy of the English language. See "Shannon Game" at, for example, http://www.math.ucsd.edu/~crypto/java/ENTROPY/ There's also an entire field, often referred to as "Natural Language Processing," which uses empirical observations of large amounts of language data (text or speech) to construct statistical models which do speech recognition, language translation, text summarization, spelling correction (and, yes, people at Microsoft Research have worked on this), etc. Finally, Hemos writes "Stochastic modeling can also be used as a basis for speech recognition, with the recognizer using the model to choose a continuation when the speech signal is ambiguous or indistinct." FYI, speech signal is _always_ ambiguous, from the perspective of a machine trying to transcribe it to text. I very much doubt there's been any successful speech recognition work in the last 15 years on a non-statistical system.
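
As a rough companion to the entropy discussion, the sketch below computes bits per character from single-character frequencies alone. This is not Shannon's guessing experiment: it ignores all context, so it overstates the entropy compared with a predictor that sees the preceding text. The function name `unigram_entropy` is hypothetical.

```python
import math
from collections import Counter


def unigram_entropy(text):
    """Bits per character under a model that ignores all context."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A string of one repeated symbol carries no information per character;
# four equally likely symbols need two bits each.
print(unigram_entropy("aaaa"), unigram_entropy("abcd"))
```

Running it on a large sample of English and comparing the result with the gzip figure mentioned elsewhere in the thread gives a feel for how much of the redundancy comes from context rather than letter frequency.
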