Statistics For Data Entry: The Brave New Step
A reader writes: "First there was Dasher, a novel application of statistical theory that lets free text be written using only a pointing device. Dasher works by predicting the continuations of the text being written, based on what has been written so far; there is a probability associated with each offered continuation and the presentation is designed to make it easier to choose more probable continuations. A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones. Now the same approach has been extended to writing maths. Apropos is a JavaScript application (it supports IE6 & Firefox) to create mathematical expressions. It represents the math using MathML, the official XML spec for mathematics. It is definitely clunky when compared to Dasher, but better than MS Equation Editor etc. It is interesting to consider whether this approach can be extended to other XML vocabularies (for example, a model for HTML that suggests the markup as you go along - a properly trained one will make it harder to create pages with blinking text, loads of images etc.), or to formal languages other than XML (e.g. programming languages). Stochastic modeling can also be used as a basis for speech recognition, with the recognizer using the model to choose a continuation when the speech signal is ambiguous or indistinct."
It seems to be the same concept as T9.
Check populicio.us
That is hardly news. Mobile phones have been offering this kind of interface for years. True, they are useful, but there is nothing new here.
"You appear to be writing a letter, and here's what you're probably going to say..."
...my mind about apropos is the *nix program
"NAME
apropos - search the manual page names and descriptions"
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
Or the poster uses the British form of English where, I believe, this is correct usage. Not everyone is a 'Merican, you know.
...That this guy will GPL this software rather than start up a private company.
Then maybe I'd get it in the next version of Fedora.
I'm so sick of *TeX.
*sigh*
May the Maths Be with you!
Dasher works because there is a small number of words that are likely to follow on from where you are. The same does not apply to MathML or HTML. The most useful you are likely to get is tab-completion for tag names, attribute names, etc.
I'm hopeful that this will eventually make it into word processors like those in the OpenOffice or Microsoft Office suites. It seems the best standard fare we have is a little paperclip/dog/wizard/other nuisance asking how he can "help" make a cover letter.
( o ) one could say I'm rather baked
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe."
It knew he was going to say that.
More likely, it's going to predict that someone's going to say "Let's circle back and touch base tomorrow".
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
Having used both Dasher and T9, it seems to me that T9 only takes into account the keystrokes entered for each word and then matches them against a dictionary. Dasher, on the other hand, is based on Markov chains (yes, like those word/text generators), and thus takes into account the last [n] characters. That makes it much more accurate and, interestingly enough, should make it particularly well suited to editing programs in most mainstream languages, since they have a lot of noise words and frequently used sequences.
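To make the "last [n] characters" idea concrete, here is a minimal sketch of a character-level Markov predictor. It is not Dasher's actual model; the function names `train` and `predict`, the toy corpus, and the choice of n=3 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train(text, n=3):
    """Count which character follows each n-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        context = text[i:i + n]
        model[context][text[i + n]] += 1
    return model

def predict(model, history, n=3, k=3):
    """Return up to k (char, probability) pairs, most probable first."""
    counts = model.get(history[-n:])
    if not counts:
        # Unseen context: a real system would back off to a shorter context.
        return []
    total = sum(counts.values())
    return [(ch, c / total) for ch, c in counts.most_common(k)]

# Toy training text (hypothetical); a usable model needs far more data.
corpus = "the cat sat on the mat. the cat ate the rat."
model = train(corpus, n=3)
print(predict(model, "on th"))   # candidates after the context " th"
print(predict(model, "the "))    # candidates after the context "he "
```

A predictive interface would then give the more probable continuations more screen space or a shorter key sequence.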
Try Corewar @ www.koth.org - rec.games.corewar
I read, "a novel application of statistical theory" as "a novel of statistical theory" - and I was still interested!
MathML is a good idea in theory, but until there are good tools for writing and editing MathML, there will be very few people using it (either for publishing or for archival purposes.)
As other posters have noted, this sounds a lot like T9, which is used in cell phones for predictive text entry. T9 is a great utility, but sometimes what I am writing is less predictable, or there is a more commonly used word that results from the keys I have hit. If I don't pay attention, I get the wrong word.
I can't help but think of someone entering a mathematical equation and concentrating more on his idea than what is being written to the screen. Due to this inattention, the equation doesn't work, he figures he's just wrong, and spends hours/days to find the point at which the computer put in its prediction and not what he thought he entered. Worst case, he could abandon what would have been a great idea.
Or, imagine this applied to writing computer programs. Say for example, you are writing a program to calculate the correct distance the probe should hold above the atmosphere so it doesn't burn up. Your cube mate distracts you briefly, and...
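The T9 failure mode described above (several words sharing one key sequence, with the most common one winning) can be sketched roughly as follows. The keypad layout is the standard phone one; the word frequencies and the helper names `to_keys`, `build_index`, and `suggest` are made up for illustration and are not how any real T9 implementation is known to work.

```python
from collections import defaultdict

# Standard phone keypad letter groups.
KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
LETTER_TO_KEY = {ch: key for key, letters in KEYS.items() for ch in letters}

def to_keys(word):
    """Translate a word into the digit sequence that types it."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())

def build_index(word_freqs):
    """Map each digit sequence to its candidate words, most frequent first."""
    index = defaultdict(list)
    for word, freq in word_freqs.items():
        index[to_keys(word)].append((freq, word))
    for candidates in index.values():
        candidates.sort(reverse=True)
    return index

def suggest(index, keys):
    """Return the candidate words for a digit sequence, best guess first."""
    return [word for _, word in index.get(keys, [])]

# Hypothetical frequency counts, invented for this example.
freqs = {"good": 50, "home": 40, "gone": 20, "hood": 5}
index = build_index(freqs)
print(suggest(index, "4663"))  # all four words share the keys 4-6-6-3
```

Anyone who wanted "home" but did not look at the screen would get "good", which is exactly the inattention problem.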
Bureaucracy loves company.
correct strings are more probable than incorrect ones
apparently they havnt taken my writing into account
Nathan Friedly
I hope so, then I might actually use it! :D
-- Boycott Shell
data integrity starts w/ data entry. when data entry is reduced to "no" vs "yes-for-now-we-can-fix-it-later", the game is lost; GIGO prevails, then.
They obviously didn't include many PHBs' writings in their calculations...
I'm frequently amazed at some of the grammatical... umm... experimentations undertaken by the upper two or three levels of management in their memos -- and the speeling, good grief, the SPEELING!! Is [F7] the last great secret of our civilization?!?!
This space intentionally left (almost) blank.
A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.
:-)
Though probably college educated, the writer of the above sentence has probably NOT BEEN a TA in an English class. Truly correct strings are a rare find.
I did a quick test run of Dasher instead of RTFA, and as far as I understand, it works by presenting the most statistically-probable letter in the middle of the input area.
So, by dragging a perfectly horizontal line with my mouse cursor, I was able to create the most statistically-probable sentence.
Here goes, for Science:
Conspiracy theorists, area51 nuts and cypherpunks are going to be thrilled!
before long you'll have to write only half of your program. the other half is predicted by some neat tool.
;)
Or imagine the possibilities for book writers. You write half and the rest is predicted based on your previous works. It seems as if some authors already use such a technique.
A big advantage of statistics-based interfaces is that they automatically enforce correctness, because correct strings are more probable than incorrect ones.
In a rigorous, technical environment, being _usually_ correct is not enough and a statistics-based approach to ensuring correctness is not very useful.
In an informal environment, correctness is not nearly as common as you might hope, so again a statistics-based approach may well not be as good as actually enforcing definite correctness.
Whence? Hence. Whither? Thither.
APOSTROPHE! APOSTROPHE!
The Orthographic Commandos have been notified, and a kill squad is now on its way to your location. For your own comfort and convenience, please choose not to resist. Have a nice day!
Has anybody tried to compile Dasher for my beloved Zaurus (and succeeded)?
Bye egghat.
-- "As a human being I claim the right to be widely inconsistent", John Peel
...we find the unpredictable more interesting.
And there are no predictable new ideas. Who could've guessed that Einstein would follow the equals sign with "mc^2"?
Why should it? What if I want to create such a page? Why should someone (or something) tell me what to say, or how to say it? And who will "train" such a thing? The Government??
from the dasher site http://www.inference.phy.cam.ac.uk/djw30/dasher/ :
With version 3, as with version 1.6, every language requires a text file full of natural writing (about 300K or more); a specification of the alphabet of the language is also required.
It wouldn't be hard at all to make it work for English, as opposed to Americanese; all you have to do is train it on text written with your own preferred idiosyncrasies.
Have been using this approach for decades.
Overuse of this technology will result in repetitive and boring prose. Yes, well-written prose does have some redundancy/predictability -- it helps the reader stay on track, reinforces key points, reminds the reader, etc. This technology will help some writers create more consistent text. Yet I fear that too many will rely too much on this crutch.
The problem is that the best prose contains unexpected novelty such as plot twists, new facets of a character, joke punch lines, etc. In a true "page-turner" the reader can't predict what will happen next. This novelty (appropriate for a good "novel") is the opposite of what this technology offers.
Two wrongs don't make a right, but three lefts do.
The reason predictive interfaces work is that most encodings have some degree of redundancy in them. English text is about 50% redundant information, in an information-theoretic sense, and anything based on XML is going to be more so.
To see this for yourself, pick a nice big hunk of English text and gzip it. You'll get about 50-60% compression. Now, pick a similar-sized hunk of XML and gzip it - you'll probably get 75% compression or more.
Tools like this make using bloated, redundant encodings more tolerable by automating some of the redundancy away. It's not clear to me that this is a good thing.
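The gzip experiment is easy to try. The sketch below uses Python's zlib module (DEFLATE, the same algorithm gzip uses) and measures how much of the input is squeezed out. The two samples are tiny and artificial, so their numbers will not match the 50-60% and 75% figures above; feed the function large real files to get meaningful ratios. The name `compression_ratio` is just an illustrative choice.

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """Fraction of the input removed by DEFLATE at maximum compression."""
    if not data:
        return 0.0
    return 1 - len(zlib.compress(data, 9)) / len(data)

# Artificial samples (assumptions for the demo, not real-world corpora).
english = (b"The reason predictive interfaces work is that most encodings "
           b"have some degree of redundancy in them.")
xml = b"<row><name>alpha</name><value>1</value></row>" * 50

print(f"english sample: {compression_ratio(english):.0%}")
print(f"xml sample:     {compression_ratio(xml):.0%}")

# For a real measurement, read whole files instead, e.g.:
#   with open(path, "rb") as f: print(compression_ratio(f.read()))
```

The repeated XML row exaggerates the effect, but the direction is the point: the more redundant the encoding, the more a compressor (or a predictor) can take off your hands.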
To a Lisp hacker, XML is S-expressions in drag.
That said, I have been feeling that TeX is a bit outdated as a system, but then I discovered TeXmacs. This is a fully WYSIWYG editor for TeX, where you type in TeX code and see the formatting instead of the code. I have switched to using it, and would definitely recommend it to others.
Looks like you can train this thing by giving it large amounts of text in the language of your choice.
I'm going to pop over to OpenOffice.org, and use their source to create a training document.
Stay tuned for details.
~D
This sig has been enciphered with a one-time pad. It could say almost anything.
"correct strings are more probable than incorrect" doesn't "enforce" correctness at all.
What's the probability that all of the texts written this way will be similar?
The statistical properties of languages are utilized in most (successful) approaches for natural language processing, from part-of-speech tagging, information extraction, syntactic parsing, machine translation to question answering; you could almost say that NLP=S(tatistical)NLP nowadays.
This is a test of dasher.
I find it a bitch to get proper punctuation, never mind capitalization, and the routine stuttery freezes are amazingly annoying. I suppose if I were incapacitated to the point that I could only type by looking around I would appreciate it a lot more though.
So I'll just call it a really cool toy that is in fact worth trying out and hope some games incorporate some of this technology at some point in the future.
Someone set us up the bomb, so shine we are!
Using this technology on source code (for instance) would be an extremely bad thing since it would encourage cut-and-paste or copy-and-mutate approaches to coding. The result would be highly regular and poorly factored source. But, I don't think anyone was actually suggesting this for program code... just a thought.
Aim such a product at programmers, and you'll learn a few things about programmers.
:-) Programmers as a class are notoriously poor spellers.
Correct spelling is no longer more probable than incorrect spelling.
Some misspellings are intentional. I knew a guy who frequently wanted to use MODE as a variable name in his COBOL programs. But MODE is a COBOL keyword and the compiler would hiss at him. So he now always spells it MOAD.
Likewise some misspellings are due to local culture. Paw through some DEC code and you'll find that "controller" is always spelled "kontroller". It's not an error; it's probably to do with the more intelligent bits of DEC gear being given K-series board/unit designations.
It is definitely clunky when compared to Dasher, but better than MS Equation Editor etc.
I will be first to cheer anybody who invents a worse way of typing math than MS Equation Editor. Being better than that is not an achievement at all. Can't they simply learn TeX for their math?
"Long run is a misleading guide to current affairs. In the long run we are all dead." (John Maynard Keynes)
Statistics has been used for decades in handwriting input, OCR, speech recognition, systems like T9, and other input modalities. Dasher seems pretty cumbersome in comparison to most of those.
And the fact that it only generates "correct" input can be a real problem: names, foreign words, etc. just don't come out right.
"3GL", third generation programming languages, were supposed to do for programming what these stat predictors do for data entry. They were menu interfaces, using syntax and grammar to offer only the valid options for the "next word" in a program. Usually with dropdown/popup menus for mousing in windows, the new computing paradigm back in the 1980s. But human expression turned out to be much less modal, and the UI always got in the way. Wake me when these interfaces have been playtested, and survive the arena.
--
make install -not war
1) Submit "Penthouse Letters" for statistical analysis.
2) ?????
3) Profit!
For math notation, no matter how good this might be, TeX is better. First, it goes through e-mail, and it's easy to read unrendered. I.e., people I send TeX notation to are guaranteed to be able to read it, without having to install software that doesn't necessarily exist for their favorite OS. Unless they are too lazy to learn the two pages of TeX documentation that list the math notation :-). Secondly, it's fast to type, and you don't need to take your hands off the keyboard. I doubt that there is any input system for math notation that's better than TeX.
Feed the entire contents of /usr/dict/words into a Markov generator and you get pretty much the same thing: random words which, whilst not having any meaning, are reasonably syntactically correct.
http://www.fourteenminutes.com/fun/words/.
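For anyone who wants to try the word-list experiment, a rough sketch of such a generator follows. It is not the code behind the linked page; the names `build_chain` and `make_word`, the order-2 context, and the three-word sample list are assumptions. To use a real dictionary, load the lines of your system's word list instead (the path varies; /usr/share/dict/words is common).

```python
import random
from collections import defaultdict

START, END = "^", "$"  # padding markers; assumes words contain neither

def build_chain(words, order=2):
    """Record which character follows each `order`-character context."""
    chain = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            chain[padded[i:i + order]].append(padded[i + order])
    return chain

def make_word(chain, order=2, rng=random, max_len=20):
    """Walk the chain from the start state until END or max_len is reached."""
    state, out = START * order, []
    while len(out) < max_len:
        nxt = rng.choice(chain[state])
        if nxt == END:
            break
        out.append(nxt)
        state = (state + nxt)[-order:]
    return "".join(out)

# Tiny stand-in word list (hypothetical).
words = ["banana", "bandana", "cabana"]
chain = build_chain(words)
rng = random.Random(0)  # seeded so runs are repeatable
print([make_word(chain, rng=rng) for _ in range(5)])
```

The output looks word-like because every two-letter context is followed only by letters that followed it in the training list.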
I was about to point out in response to "A big advantage of statistics-based interfaces is that they automatically enforce correctness..." that rather than enforce correctness they will more likely introduce common errors.
When designing a language, be that a simple one that can be captured in an XML schema or a complex natural language, there is a trade-off between being efficiently terse and introducing enough redundancy to let communicants tell signal from noise. If you enter data in a format that is too terse, you are more likely to make errors that can't easily be detected; if you enter it in a language that is too redundant, you will find it tedious, but errors are more likely to be detectable at a syntactic level.
For this reason I'd like to see exactly the opposite approach... I'd like to see long-hand ways of entering data where my errors are detected and flagged - which are then parsed and stored or transmitted in a more efficient format.
Dasher indeed looks interesting. The heuristics remind me of the input methods for Japanese keyboards where hiragana or katakana are entered, and depending upon the context, a short list of matching kanji is presented to choose from. Elegant solutions to a complex problem.
However, while Dasher can be compared to the JavaScript application that works with MathML, Dasher and MathML cannot be directly compared. Determining correctness would be from a program reading the DTD or schema of MathML. MathML would just be the serialized form (the data format).
(Not that I'm suggesting it be done but...) It's like saying that a C program would be written with prediction on raw parentheses and curly braces in the C source file. If anything, the predictive algorithm would be supplied with BNF notation. The C code would just be the output format.
I don't see why the technology associated with Dasher could not be applied to parsing DTD or schema files for output to XML syntaxes like MathML.
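As a rough illustration of driving suggestions from a grammar rather than from raw characters, the sketch below hard-codes a toy grammar for arithmetic expressions (expr := term (op term)*, term := NUM | "(" expr ")") and reports which token kinds may legally come next. The grammar, the token names, and `valid_next` are all invented for this example; a real tool would derive the same information from a DTD, schema, or BNF file, and could then rank the legal options statistically.

```python
OPS = {"+", "-", "*", "/"}

def valid_next(tokens):
    """Return the set of token kinds that may legally follow `tokens`."""
    depth = 0              # number of currently open parentheses
    expect_operand = True  # True when a NUM or "(" must come next
    for tok in tokens:
        if expect_operand:
            if tok == "(":
                depth += 1
            elif tok == "NUM":
                expect_operand = False
            else:
                raise ValueError(f"unexpected token {tok!r}")
        else:
            if tok in OPS:
                expect_operand = True
            elif tok == ")" and depth > 0:
                depth -= 1
            else:
                raise ValueError(f"unexpected token {tok!r}")
    if expect_operand:
        return {"NUM", "("}
    allowed = set(OPS)
    # Close an open parenthesis if there is one; otherwise the input may end.
    allowed.add(")" if depth > 0 else "END")
    return allowed

print(sorted(valid_next([])))
print(sorted(valid_next(["(", "NUM"])))
print(sorted(valid_next(["NUM", "+", "NUM"])))
```

The serialized output (MathML, C source, whatever) never needs to be predicted directly; only the choices the grammar allows are offered.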
- I don't need to go outside, my CRT tan'll do me just fine.
One of the things that sucks about MS Office is AutoCorrect. Granted, it helps fix a few typos, especially of the "teh" type. What is really annoying is how it converts URLs, UNC paths, etc., into "hyperlinks" unless you turn that off.
T9 input, imho, kind of sucks also, unless you're IM'ing and just doing a lot of simple "dood, where U B", adding new entries into the phone book on the phone, etc.
If the lexical hierarchy has too many words that have very similar SOUNDEX values, same set of initial characters, etc., it's not going to save much time or effort. It takes much longer to pick a word out of a list.
My bias: I'm a touch-typist, so for me it's usually just easier to keep plowing through the typing, rather than stopping to select the "right" choice presented to me. Nor do I IM. It's hard to IM when you're driving.
Plus, again, it all comes down to the quality of the statistical set (or dictionary). Is your target "business speak"? Is it medical or legal docs? Or whatever. A set for "general" writing will be just about as bad as Word's grammar checker or spelling dictionary...
Claude Shannon, the father of information theory, used the idea referenced here in his famous 1950 experiment to calculate the entropy of the English language. See "Shannon Game" at, for example, http://www.math.ucsd.edu/~crypto/java/ENTROPY/ There's also an entire field, often referred to as "Natural Language Processing," which uses empirical observations of large amounts of language data (text or speech) to construct statistical models which do speech recognition, language translation, text summarization, spelling correction (and, yes, people at Microsoft Research have worked on this), etc. Finally, Hemos writes "Stochastic modeling can also be used as a basis for speech recognition, with the recognizer using the model to choose a continuation when the speech signal is ambiguous or indistinct." FYI, the speech signal is _always_ ambiguous, from the perspective of a machine trying to transcribe it to text. I very much doubt there's been any successful speech recognition work in the last 15 years on a non-statistical system.
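For a feel of what "entropy of a language" means numerically, here is a small sketch that computes the per-character entropy of a string from single-character frequencies alone. This is not Shannon's guessing-game procedure, which conditions on preceding context and so gives much lower figures; it is only the simplest (zeroth-order) estimate, and the name `unigram_entropy` is made up for this example.

```python
import math
from collections import Counter

def unigram_entropy(text):
    """Bits per character, treating characters as independent draws."""
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(f"{unigram_entropy(sample):.2f} bits per character")
```

The better a model predicts the next character from context, the lower the measured entropy, and the less the user has to supply by hand.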
The linked article neglects to mention Unicode compatibility in its list, but a good read nonetheless.
"I knew a guy who frequently wanted to use MODE as a variable name in his COBOL programs. But MODE is a COBOL keyword and the compiler would hiss at him. So he now always spells it MOAD."
Similarly for "list" in Lisp or Scheme. I use "lyst" since I learnt the basics off Douglas Hofstadter from "Metamagical Themas".
I did after the first two words, perhaps it's just well-read.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Please read my report for a detailed description of how Apropos works. I have contrasted the system with both T9 & TeX.
Three o'clock is always too late or too early for anything you want to do. - Jean-Paul Sartre