Programmer's Language-Aware Spell Checker?
Jerry Asher writes "Not all of my coworkers are careful about spelling errors. Sometimes this causes real embarrassment as spelling errors creep into software interfaces. Does anyone know of spell checkers for programming languages? I don't want a text spell checker, I want a programming-language-aware spell checker. A spell checker that I can pass all of my code through and will flag spelling errors in function names, variable names, and comments, but will ignore language keywords, language constructs and expressions, and various programming styles (camel code, or underscores, or...). I want a spell checker that knows that void *functionSigniture(char *myRoutine) contains one spelling error. Does anyone have such a thing for Java or C++? Are there any Eclipse plugins that do this?"
The version of Eclipse I run, Eclipse WTP 3.3, does spell checking on comments as standard, though not on variable names, function names and the like. It's a decent first attempt. In truth, I turned it off within the first few hours. It underlines any mistakes in red, which I find really annoying when scanning code, as I keep thinking I've seen syntax errors. More often than not my eyes are drawn to a spelling mistake, which in many cases isn't even really a mistake, and that distracts me from what I'm actually trying to look at.
Visual Assist for Visual Studio does this.
Next silly question, please.
Some people call using it a "code review". If you are really serious about it, post the code to /. - plenty of people here seem to have time to point out any spelling errors.
.... that if you want your code to read like English, you consider a language like COBOL? Not that it would help you with spell checking, per se... but if one is going to be so pedantic about making sure that their procedure names can be found in an actual English dictionary, why not go the whole 9 yards and write the whole program that way?
File under 'M' for 'Manic ranting'
And not too hard to implement - all you need is a lexer and a few functions to classify different naming styles. lexertl even comes ready with a full example for C++, so get to it ;)
How about the Built-in OS X spell checker?
We're talking about programming, friend.
I particularly like the spelling feature in new vim, right-click menu (:set mousemodel=popup) to select a corrected word or remember current word as correct. Perhaps writing a vim plugin as you explain could be possible? I'd be very glad to use it too ;)
Colonize Mars
A small script to split up camelCase into separate words, then feed the result through a normal spell checker. Then after that just whitelist certain words like maybe "m" as found in "mSomeVariable".
We've got code here that refers to 'insurrances', 'insurances', 'insurrences' and 'insurences', I'm not kidding.
People here making fun of his request and saying that this should be set in stone in design documents, or be checked in peer code reviews, are obviously not working in a run-of-the-mill software company where there's neither the inclination nor the time to do everything the formal way. Also, I have yet to see the first design document that correctly enumerates all the requirements for the software, let alone all the names for the variables to be used.
---
"The chances of a demonic possession spreading are remote -- relax."
Okay, so it's only for Managed Assemblies (C#, VB.NET, J#, etc), but it does spell-checking, acronym-checking, and case-checking, which is nifty. Along with the other slew of introspection rules (some of which are a PITA to implement, even if it does increase the quality of the finished product).
The $$$ version of Visual Studio (the Team Suite version) comes with an introspection engine for VC++, though; it's not as flexible as FxCop, but it does the basics.
Then there's the countless "Spellchecker" plugins available for IDEs everywhere, VS, Eclipse, NetBeans, etc...
TextMate on OS X has spell checking functionality that is semi-useful, but it's not really "aggressive" enough, and there doesn't seem to be a way to make it such with prefs/configuration.
You can right-click on any "word" (variable name, subroutine name, whatever, just generally a whitespace-delimited group of characters) and it will check the spelling and present alternatives in the context menu. It also recognizes things like perl's sigils so correcting '$teh' turns into '$the', not 'the'.
It _won't_ automatically check spelling except in strings (so e.g. if I have '$teh = "This is a tset.";', 'tset' will be underlined, '$teh' won't). It doesn't include comments in its automatic checking either, which is probably the most annoying part about it.
Overall I typically just don't bother with it, but someone _has_ thought along these lines, at least.
If you maintain a library that is used by customers, that would be a *very* big issue. You just broke backwards compatibility for a spelling fix.
Overall, the answers to the submitter's question are absolutely horrible so far. If the tool he's searching for doesn't exist, it damn well should.
For the record, 'I' is a word. Also plenty of spellcheckers will ignore one- or two-letter words.
The idea isn't anywhere near as nuts as you think it is, provided you make a habit of using meaningful variable/class names.
++ Say to Elrond "Hello.".
Elrond says "No.". Elrond gives you some lunch.
It's not so simple when you're not the one writing the code, but have to deal with the results. There's an SDK that I use as a part of my job, developed by our head office in Japan - it's a set of C# classes, and nothing annoys me more than typing "Connection foo = new Connection();", then noticing Visual Studio isn't highlighting it as I'd expect. I then hunt around for anywhere up to a minute and eventually find out it is actually "Conectin" instead of "Connection". If there were a good "programmers spellchecker", I may not need to use it myself, but I could give it to my Japanese colleagues to make MY life easier! (note: the above example is fictitious, but is an illustration of the type of error that I deal with that this would prevent)
My book about LSD and Self-Discovery
Also on facebook as: DroppingAcidDaleBewan
The question wasn't about user interface strings. It was about spelling in APIs. e.g. One issue at my last company, which was British, is that they standardized on US spelling, but some British spellings crept in too. So sometimes you'd get a function containing "Initialize" and sometimes "Initialise".
Only three things are certain; death, taxes, and apocryphal quotations - Ben Franklin.
More like WTF are you on, man?? If a compiler is able to work out what a variable is, what piece of code does what, and which bits of text are going to be displayed, then a spell-checking program can be written to recognise this too!! It would be tricky, and there are many circumstances where it could be circumvented, but why not still use it to prevent a possible spelling error? And in the circumstances where it cannot tell what the word is, so what? For those you learn to spell, but there's nothing wrong with another program helping to prevent errors!
This is a good idea, and one that can be implemented. Just because it's hard to do it right, and would need to be done separately for different languages, doesn't change the fact it would still be useful and help prevent errors.
Who need's speling and grammar?
I remember from spellchecking some HTML documents a while back that aspell is at least aware of HTML. I do not know how well it works with other kinds of documents.
Well, I'm a total newbie in terms of compiler architectures and such, but throwing it out there for the purpose of discussion...
I assume a compiler will parse the source and in the process identify which tokens are keywords and literals, and which are programmer-defined identifiers in the code. The spell checker would either use the same algorithm, or latch into that part of the algorithm to get at all of the identifiers. There are two possible word separators in typical code--either capital letters or underscores. (If you have something more bizarre, then I think it's a lost cause). So pass those identifiers through a filter that chops them up at each capital letter or underscore (with some exceptions, say, if the identifier is all caps). So, now you've got a pile of strings which are either oddball programming convention stuff, like "p" and "g" for pointers and globals, or things that should generally be words. The rules can include "toss out single character identifiers", "toss out everything up to first capital or underscore", etc. If you have coding guidelines that enforce variable naming conventions, this should get you most of the way.
Now you have English words that you can pass through your standard spelling engine, possibly with a dictionary tweaked for your field of endeavor to decrease false positives and escapes.
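For the sake of discussion, here is a rough Python sketch of the chopping filter described above. It is only one way to read those rules: the function names, the drop_prefix switch and the tiny dictionary in the usage note are all made up for illustration, and a real tool would get its identifiers from a lexer rather than a hand-written list.

```python
import re

def split_identifier(identifier, drop_prefix=False):
    """Chop an identifier into lowercase candidate words."""
    words = []
    for chunk in identifier.split("_"):
        if chunk.isupper():
            # All-caps exception: keep MAX or BUFFER as a single word.
            words.append(chunk.lower())
        else:
            # Cut before each capital letter: pFileName -> p, File, Name.
            words.extend(piece.lower()
                         for piece in re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]", chunk))
    if drop_prefix and len(words) > 1:
        # Optional rule: toss everything up to the first capital or underscore.
        words = words[1:]
    # Toss single-character leftovers such as the p and g prefixes.
    return [word for word in words if len(word) > 1]

def misspelled_words(identifiers, dictionary):
    """Return the sorted candidate words that are not in the dictionary."""
    found = set()
    for identifier in identifiers:
        found.update(w for w in split_identifier(identifier) if w not in dictionary)
    return sorted(found)
```

With a toy dictionary of {"function", "my", "routine"}, calling misspelled_words(["functionSigniture", "myRoutine"], ...) leaves only "signiture". The drop_prefix rule is off by default because it would also eat the first real word of an identifier like fileName; it only makes sense if your coding guidelines force a prefix on everything.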
-- "This world is a comedy to those who think, a tragedy to those who feel."
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
- string literals (not what the poster wanted, but this is what needs spelchekars the most)
- identifiers
The former can be done by a simple regexp, the latter... you can do a LALR parser, but why even bother? Just look for _any_ potential identifier; in most languages, that's [a-zA-Z_][a-zA-Z_0-9]+; and simply add the few keywords which are not English words to your dictionary. In fact, this would be nearly programming language agnostic.
When it comes to StudlyCaps, anything identified as an identifier can be split _before_ any uppercase letter. This would produce a lot of single-letter tokens for ALL-CAPS #defines and the like, but as a nearby post said, you're going to ignore one-two letter tokens anyway. The usual conventions say XMLHttpRequest or XML_http_request so I wouldn't bother with XMLhttpRequest (and thus "lhttp").
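A minimal sketch of that no-parser approach in Python, assuming the regexp quoted above is good enough for your language. The names candidate_words and unknown_words are invented here, and the dictionary and keyword sets are whatever you choose to pass in; nothing below comes from an existing tool.

```python
import re

# The identifier pattern quoted above: a letter or underscore, then one or more
# letters, digits or underscores (so one-character names are never matched).
IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]+")

def candidate_words(source, min_length=3):
    """Collect lowercase words from anything that looks like an identifier."""
    words = set()
    for identifier in IDENTIFIER.findall(source):
        # Split before any uppercase letter, as suggested for StudlyCaps.
        spaced = re.sub(r"([A-Z])", r" \1", identifier)
        # Underscores and digits are separators too.
        spaced = re.sub(r"[_0-9]+", " ", spaced)
        # Ignore one- and two-letter tokens (this also drops ALL-CAPS debris).
        words.update(w.lower() for w in spaced.split() if len(w) >= min_length)
    return words

def unknown_words(source, dictionary, keywords=frozenset()):
    """Words that are neither in the dictionary nor in the keyword list."""
    return sorted(candidate_words(source) - set(dictionary) - set(keywords))
```

Because the pattern runs over the raw text, it picks up words in comments and string literals as well as identifiers, which is arguably what you want here. The price is the one already mentioned: ALL-CAPS names dissolve into single letters and are silently skipped.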
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Yes, this is a legitimate problem. I work on code that has spelling mistakes embedded into interfaces and it's very annoying. The fashionable use of StudlyCaps in programming (why? who decided that TextLikeThis is more readable than text_like_this?) makes the job a little harder but not impossible, as long as you follow the sane rule of making each word start with capital and continue lowercase, even if an acronym (so XmlParser not XMLParser or, God forbid, XMLparser - though of course XML_parser would be better than any of those).
Enough rant. How about this:
perl -ne 's/([a-z])([A-Z])/$1 $2/g; tr/A-Za-z/ /c; foreach (split) { print qq{$_\n} unless $seen{lc $_}++ }' source_file...
That will give a list of unique words in your source code (use find and xargs to scan the whole source tree). Then you can run that list of words through an ordinary spellchecker such as ispell. Unfortunately when you find a mistake you have to go back and grep for it to find where it occurs. You would also need a personal dictionary for things that are not English words but nonetheless appear in code.
I would probably keep the private word list containing things like 'foreach' and 'const' with the program source code, and have a makefile target 'make spellcheck' that runs a command like the above and then prints out all words found that are not in /usr/share/dict/words or in the private word list. Indeed, why not this:
find . -type f -name '*.c' | xargs perl -ne 's/([a-z])([A-Z])/$1 $2/g; tr/A-Za-z/ /c; foreach (split) { print qq{$_\n} unless $seen{lc $_}++ }' | sort >found_words
sort -u private_word_list /usr/share/dict/words >allowed_words
diff -u allowed_words found_words | grep -E '^[+][^+]'
The private word list can be kept under version control and checked in whenever you add a new non-English word like 'Frobule' to your source code.
Adding filenames and line numbers to the output is left as an exercise for the reader. You might also want to change the perl command to ignore words with length < 5.
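One possible take on that exercise, sketched in Python rather than perl. It mirrors the same steps (split at a lowercase-to-uppercase boundary, keep runs of letters, skip short words), but find_unknown_words and its arguments are names invented for this sketch, and the allowed set is assumed to hold lowercase words loaded from your dictionary plus the private word list.

```python
import re

def find_unknown_words(path, lines, allowed, min_length=5):
    """Yield (path, line_number, word) for each word not in the allowed set."""
    for line_number, line in enumerate(lines, start=1):
        # Same camelCase split as the perl one-liner: aB -> a B.
        spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", line)
        # Everything that is not a letter acts as a separator.
        for word in re.findall(r"[A-Za-z]+", spaced):
            if len(word) >= min_length and word.lower() not in allowed:
                yield path, line_number, word

# Usage sketch: pass an open file (or any iterable of lines).
#     with open("foo.c") as handle:
#         for path, line_number, word in find_unknown_words("foo.c", handle, allowed):
#             print(f"{path}:{line_number}: {word}")
```

Unlike the one-liner, this reports every occurrence instead of only the first, which saves the grep afterwards.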
-- Ed Avis ed@membled.com
His next project is to have a handy little helper with a RAM chip avatar. His name is chippy and he comes out with helpful phrases like:
"You appear to be creating an infinite loop. Would you like me to increment your counter variable?"
"You appear to be writing a virus, would you like a list of the latest Windows Vista sploits?"
which is totally what she said
For .net languages, FxCop does some of this checking, even understanding camel casing and underscores in tokens. And a bunch more, since it is a static code analysis tool.
http://www.gotdotnet.com/Team/FxCop/
Doesn't Visual Assist from Whole Tomato do this? I've used it in the past and I'm sure spelling mistakes (and a whole host of other things) were pointed out.
I'm not associated with Whole Tomato, but if anyone from WT sees this, can I have a free subscription? :-)
No sharp objects, I'm a programmer!
I had your problem once because I was working with people whose first language was not English. I don't write US English either and I always left English spellings in by mistake.
I used aspell and went through huge parts of the source, telling it what wasn't misspelled. It was an incredible pain in the neck because it got confused over all the variable names, bits of C syntax etc etc.
Once I had a dictionary, though, I could recheck the source periodically and although there were a lot of false warnings, we still caught a lot of problems that would have gone into the production release.
As you can work out, I didn't restrict the test to strings - this is because misspelled variable names can cause bugs too so I checked for them as well.
Cheers,
Tim
This is all just my personal opinion.
True, identifier names containing spelling errors can be a real annoyance, but I somehow doubt you'll ever find a usable solution, at least not as long as you'll need to interface to code beyond your control. What spell checker wouldn't choke on regular C++? Just picking a random declaration from MSDN (feel free to choose any other API, it won't change anything):
HRESULT MFGetService(
    IUnknown* punkObject,
    REFGUID guidService,
    REFIID riid,
    LPVOID* ppvObject
);
You'll probably just end up spending all your day removing false positives.
Yeah, not to nitpick but, you see; 'i', being a variable-name, would be a properly camel-cased 'I' from the point of view of the spellchecker.
Religion is what happens when nature strikes and groupthink goes wrong.
Man Dies Waiting for Eclipse to Launch
A software engineer in San Jose, CA was found dead at his desk yesterday, apparently having died while waiting for his Java editing program, Eclipse, to finish its boot process. Coworkers say the engineer came in that morning vowing to "get Eclipse working on his box or die trying." The last thing anyone heard him say aloud was the cryptic comment: "I see the splash screen is appropriately blue." Nobody knows what he meant. The man was then thought to have fallen asleep, but hours later it was discovered that the engineer had died suddenly of apparent natural causes. The forensics team's investigation that evening was reportedly interrupted unexpectedly when the dead man's Eclipse program suddenly finished launching. The team tried to interact with it to see if they could find clues about the man's death, but the program was unresponsive and the machine ultimately had to be rebooted. At this time, the police commissioner says there is no evidence of foul play, and they currently believe the man simply died of either boredom or frustration.
Sure, it's the halting problem. We all know that. But there are several common cases where you can deduce that there is an infinite loop in the code. It won't catch all infinite loops, but that doesn't make it useless.
(Sun's Java seems to be good at detecting some of those by default when it complains about an unreachable return statement)
Spell checkers are fine but they make mistakes as well. The best thing I have found, and this goes for any project, software or printed word, is to have someone who is not connected to the project or better yet not even connected with the subject proofread what the public sees. They will often catch mistakes that jump off the page but people working on the project just don't notice. I have made some really stupid mistakes that I never saw but were on the cover of a book I was publishing. I am SO glad it was proofread before it went to press.
Attempting to tell programmers the correct grammar or spelling does not always go well. While most will thank you for catching their mistakes, others take it as if you had stepped on their baby's head.
"Any douche who doesn't realise a misspelt function name will fail to compile clearly hasn't written any code yet."
;)
You clearly fail to see that a programmer can also create their own function names, as well as use other people's functions. So you prove you are a very inexperienced (and closed-minded) programmer, which adds weight to the idea that you are either young or just arrogant. Also, your very apparent need to show hostility shows a degree of insecurity: you are overcompensating by verbally hitting out at others in an attempt to appear more knowledgeable than you really are.
The easiest way to become a better programmer is to be more open-minded. So far you have failed to demonstrate this.
As a side note (back in the DOS days of programming), I found the spell checker in Multiedit very useful, especially when having to work very late at night, after the coffee stopped working!
There are 10 kinds of people in the world... those who understand binary and those who don't.
how about hacking the linker map file to generate a list of function/variable names? ie, "ld -M". then run the resulting word list through a standard spell checker. the thing is, all you really need is a way to generate a list of names...
I'm not sure spell-checking can really be made to work because, by definition, spell-checkers flag anything that is not in the allowed list (also called the dictionary) as an error. But source code always contains tons of identifiers that are not real words, like pid, ret, req, riid, etc. The problem is that there are hundreds if not thousands of them in a large project, and you keep getting a ton of new ones, which makes maintaining a custom dictionary a pain.
But I've been annoyed by spelling errors too and what I noticed is that the same errors come over and over again. So what I did is write a script that specifically checks for common typos. And I've very imaginatively called it 'typos'.
What's great with this approach is that, no matter whether you're writing a C, Perl, PHP or HTML file, 'seperate' is never going to be a real word. So we can identify these with no cumbersome custom dictionary, and a very very low false positive rate.
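To make the idea concrete, here is a small Python sketch of a blacklist-style checker. To be clear, this is not the 'typos' script itself: the word list, COMMON_TYPOS, TYPO_PATTERN and find_typos are all invented for illustration, and a useful list would be far longer.

```python
import re

# A hypothetical starter list of misspellings that are never real words.
COMMON_TYPOS = {
    "seperate": "separate",
    "recieve": "receive",
    "occured": "occurred",
    "definately": "definitely",
}

# Match case-insensitively and anywhere in a token, so that identifiers
# such as seperateWords or RECIEVE_BUFFER are caught as well.
TYPO_PATTERN = re.compile("|".join(map(re.escape, COMMON_TYPOS)), re.IGNORECASE)

def find_typos(text):
    """Return (line_number, typo_as_written, suggestion) for each hit."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in TYPO_PATTERN.finditer(line):
            typo = match.group(0)
            hits.append((line_number, typo, COMMON_TYPOS[typo.lower()]))
    return hits
```

Substring matching only stays safe while every entry is long enough not to turn up inside a correctly spelled word, so short typos like "teh" would need word-boundary handling instead.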
Typos is open-source (GPL) and has no dependency that I know of (besides perl). So you can try it out just by downloading it, making the script executable, and running it with no argument on your source:
You have one again confirmed Hartman's Law (or Skitt's, depending on preference; see http://en.wikipedia.org/wiki/Hartman's_law).
"Misspelt" is a legitimate spelling in British English. It's in the OED, with examples from 1762 to 1990.
Since I have just corrected you, I assume I have made an error somewhere in this post, though I haven't managed to find it.
.sig withheld by request
It's in the third word. You missed a letter.
Remember... your code will run faster if you remove some, but not all, vowels from your variable names.
To the original question: is strncpy misspelled? What about foo? sqrt? exp? Impl? Programese has an interesting linguistic history and its lexicon contains much not found in English.
While misspelled variable and function names are annoying, a refactor tool and a compile make them relatively painless. Perhaps the best approach would be to take your API documentation, run a script to split CamelCase and words_with_underscores, then feed that document to the spell checker. If it's not in your public API, it shouldn't matter how it's spelled.
Also, externalize your strings so that people with English writing training can write your field labels and error messages. Even programmers who spell check strings often misgrammarize them.
Ceci n'est pas une signature.
I was doing work at NASA. NASA was still into punch cards years after very powerful text editors came into existence. I remember the day my girlfriend offered to keypunch onto cards the PDP-11 code I had written on a coding pad. "Honey, you sure can't spell very good. Good thing I caught it. Move is spelled with an 'e'." :-(
Wow, 240 comments about spelling and programming and no-one's mentioned the famous Ken Thompson quote:
"If I had to do it over again? Hmm... I guess I'd spell 'creat' with an 'e'."
All ideas^H^H^H^H^Hprocesses in this post are Patent Pending. (as well as the process of patenting all postings)
Still a great IDE after all these years...
"There is more worth loving than we have strength to love." - Brian Jay Stanley
no, clearly he meant you need to keep all your _identifiers_ in external files too, by "interface" he means API
We've secretly replaced Slashdot with new Folgers Crystals - let's see if it notices.
1. The halting problem is undecidable only on a device with infinite RAM; if you put an upper bound on the RAM, you get a decidable problem (in theory only).
2. There are some practical ways to construct proofs that a loop ends (remember the CS lectures). Sure, it's not a perfect solution, but if you can't construct a proof that the loop ends, you'd better rethink the loop, and possibly rewrite it.
It's impossible to construct a computer program that can do so for all cases (hence, the halting problem), but that doesn't mean that it's impossible to detect some infinite loops, or to detect constructs which are particularly likely to be infinite loops, either of which could, in theory, be useful features in an IDE.
Spelling/grammar checkers for human language aren't flawless, either, but they still have utility. The fact that perfection in a task is impractical or even provably impossible doesn't rule out useful applications.
If you are too damn lazy or too stupid to type your language properly, then you shouldn't be a programmer. Become an insurance adjuster or something less demanding.
I don't think I'd like to hire someone who can't spell. It speaks volumes about you.
Intelligence starts with a keen understanding and application of your language.
if you simply must have it, editplus has syntax highlighting and offers spellchecking dictionaries.
They're using their grammar skills there.
hate replying to myself, but it missed the key part...
;; mostly copied from flyspell-small-region
;; (flyspell-word)
(require 'flyspell)
(require 'cc-subword)
(defvar ps-flyspell-check-subwords nil
  "*Non-nil if Flyspell should check subwords separately.
If this variable is set to non-nil, an identifier such as
MyLongFunctionName will be treated as four separate words (My,
Long, Function, Name) for the purposes of Flyspell.")
(defadvice flyspell-region (around subword-checking (beg end))
  "Check individual subwords if ps-flyspell-check-subwords is set."
  (if ps-flyspell-check-subwords
      (save-excursion
        (if (> beg end)
            (let ((old beg))
              (setq beg end)
              (setq end old)))
        (goto-char beg)
        (let ((count 0))
          (while (< (point) end)
            (if (and flyspell-issue-message-flag (= count 100))
                (progn
                  (message "Spell Checking...%d%%"
                           (* 100 (/ (float (- (point) beg)) (- end beg))))
                  (setq count 0))
              (setq count (+ 1 count)))
            (if (>= (length (car (save-excursion (flyspell-get-word nil)))) 5)
                (flyspell-word))
            (sit-for 0)
            (let ((cur (point)))
              (c-forward-subword 1)
              (if (and (< (point) end) (> (point) (+ cur 1)))
                  (backward-char 1)))))
        (backward-char 1)
        (if flyspell-issue-message-flag (message "Spell Checking completed."))
        (if (>= (length (car (save-excursion (flyspell-get-word nil)))) 5)
            (flyspell-word) 'nil))
    ad-do-it))
(ad-activate 'flyspell-region)
(defun flyspell-get-word (following &optional extra-otherchars)
  "Return the word for spell-checking according to Ispell syntax.
If optional argument FOLLOWING is non-nil or if `flyspell-following-word'
is non-nil when called interactively, then the following word
\(rather than preceding\) is checked when the cursor is not over a word.
Optional second argument contains otherchars that can be included in word
many times.
Word syntax described by `flyspell-dictionary-alist' (which see)."
  (let* ((flyspell-casechars (flyspell-get-casechars))
         (flyspell-casechars-non-initial (if ps-flyspell-check-subwords
                                             (downcase flyspell-casechars)
                                           flyspell-casechars))
         (flyspell-not-casechars (flyspell-get-not-casechars))
         (ispell-otherchars (ispell-get-otherchars))
         (ispell-many-otherchars-p (ispell-get-many-otherchars-p))
         (word-regexp (concat flyspell-casechars
                              flyspell-casechars-non-initial
                              "*\\("
                              (if (not (string= "" ispell-otherchars))
                                  (concat ispell-otherchars "?"))
                              (if extra-otherchars
                                  (concat extra-otherchars "?"))
BSD is for people who love UNIX. Linux is for those who hate Microsoft.
Yep. Programmers should know how to spell correctly in their native language. But hey, all through school those technonerds were likely the same ones who never missed a chance to whine about how they hated their English (or whatever) classes and thought that learning grammar and spelling was a waste of time when they could be doing cool geek stuff. The rise of 1337-speak and txtspeak hasn't helped.
At least in the real writing business there are editors trained and paid to catch these errors.
Being unable to spell correctly makes you look really stupid to most people.
Just FYI, if you have a decent programming environment, it should at least flag cases where you've mistyped an existing identifier. If there's an ImmediateFlag in your code, you'd get a warning if you typed ImediateFlag or ImmediateFalg or whatever. Not much help when the programmer is creating new identifiers, of course. Although I've seen cases where the programmer in question for whatever reason decided that because ImediateFlag was undefined then they would just define it, even though ImmediateFlag existed and was what they meant. That ought to get you fired in my book.
Hey by the way, pair programming is a great way to have continuous code reviews and a check on some of the more typical fumble-finger errors.