Falsehoods Programmers Believe About Names
Jamie points out this interesting article about how hard it is for programmers to get names right. Since software ultimately is used by and for humans, and we humans are pretty tightly linked to our names (whatever the language, spelling, or orthography), this is a big deal. This piece notes some of the ways that names get mishandled, and suggests rules of thumb (in the form of anti-suggestions) to encourage programmers to handle names more gracefully.
I found the piece very interesting.
Though my inability to post this comment appears to have outlived the slashdotting of the site.
3Jane Tessier-Ashpool, for one.
homonyms?
Hey, learn a little tolerance, bud.
OMG! Wau!
Mr. Ochocinco
For those who aren't privy to American football: apparently some guy with the number 85 renamed himself 85.
I am fortunate enough to be the child of a professional smart-ass who intentionally gave all his children two middle names so that we would not fit into the computer systems of the era.
When I grew up my parents used my first middle name as a "given nickname" (it's actually in quotation marks on my birth certificate). So most of the time when I give my name for something I use my "given nickname" as my first name. Unless I feel like using my legal first name as my first name in which case I use that. There are probably four or five different versions of my name attached to my SSN in various different databases.
I've also got a suffix: III. I don't have two ancestors with the exact same name as me, but since the various parts come from two different relatives my parents settled on III.
After just 15 minutes of the story being posted?
Wow, that's gotta be a personal best for /. (or, the site is a wee bit underpowered... ;)
Here's the Google cache in the meanwhile: http://webcache.googleusercontent.com/search?q=cache:http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/
Chinese, written in pinyin, has numbers. Pinyin is how Chinese is typed. The numbers represent tones and every word in Chinese has a tone.
John Graham-Cumming wrote an article today complaining about how a computer system he was working with described his last name as having invalid characters. It of course does not, because anything someone tells you is their name is--by definition--an appropriate identifier for them. John was understandably vexed about this situation, and he has every right to be, because names are central to our identities, virtually by definition.
I have lived in Japan for several years, programming in a professional capacity, and I have broken many systems by the simple expedient of being introduced into them. (Most people call me Patrick McKenzie, but I'll acknowledge as correct any of six different "full" names, and many systems I deal with will accept precisely none of them.) Similarly, I've worked with Big Freaking Enterprises which, by dint of doing business globally, have theoretically designed their systems to allow all names to work in them. I have never seen a computer system which handles names properly and doubt one exists, anywhere.
So, as a public service, I'm going to list assumptions your systems probably make about names. All of these assumptions are wrong. Try to make less of them next time you write a system which touches names.
Names of what?!
Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
Well, for starters, Thurston B. Howell, III. Malcolm X, and Jimmy Two Times.
You are welcome on my lawn.
Software is NOT designed to be perfect and cover every case. Have a numeral in your name? Too bad. Need some names to be case sensitive, and others case insensitive? Sucks to be you. Have a 200 character name that doesn't fit in the 100 characters the designers thought no crazy person would ever have? Tough.
I started reading through the list, and it's just ridiculous. There are a few good points, like names don't change, or names are unique. But they're so obvious that the vast majority of the time it's not a big problem. More often it's just a matter of training the data edit/entry folks how to change someone's name, or how to not assume a name is a sole identifier.
But assuming the worst and trying to design a system that'll allow people's names to be Chinese characters when you don't do business in China, have presence in China, or ever ever plan to? That's ridiculous. Software doesn't have to be perfect out of the chute. It should be adaptable though if some unforeseen shortcoming becomes a larger problem. Gee, I guess if you ever chose to do business in China and need Chinese character names you might have to re-write part of the damn software. Oh well, that's what software developers are FOR!
If you don't even HAVE a name, then I submit you're crazier than the artist formerly known as the artist formerly known as Prince. At least HE had a name, though it was an unpronounceable symbol. The world can't accommodate every possibility, and software is no exception.
AccountKiller
He's essentially arguing that, because names vary a lot and are complex, your software should never do anything useful with them. Sorry, but that's a stupid answer. In a lot of systems, being able to sort by surname may well be more important than being able to handle people who claim they have no surname.
Of course, you shouldn't gratuitously do stupid things, and interfaces should aim to be relatively clear. But most people can figure out how to enter their names into relatively standardized forms, and those that don't should probably figure out how.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Thanks, Prince
Bo3b? Presumably, the 3 is silent because he wants to point out how individual he is (ironically, by rehashing a joke made over 50 years ago.)
From Tom Lehrer's introduction to "We will all go together when we go":
I am reminded at this point of a fellow I used to know whose name was Henry, only to give you an idea of what an individualist he was he spelt it H-E-N-3-R-Y. The 3 was silent, you see.
Ahh - My eye!
The doctor said I'm not supposed to get Slashdot in it!
You are a little confused. Please reread the Wikipedia article on Hanyu Pinyin. It normally uses diacritics - namely macron, acute, hacek ("caron"), and grave - to represent the Mandarin tones other than neutral tone. Numbers have been used by people who lack diacritics on their typewriter or input system, but using numbers is not standard in Hanyu Pinyin, instead it's a kludge.
That said, if your input form doesn't allow some guy to type in his name with tone number suffixes on a US Windows keyboard layout where he lacks access to diacritics, then you're not a very thoughtful programmer.
Also, people who make software with input fields that accept Unicode but specify a particular font that has a tiny character repertoire suck.
Oh, and Slashdot sucks even more for only supporting ASCII and stripping everything else.
A database MUST treat all of these names the same: McClean, MacClean, MCLean, Mc Clean, Mac Clean. McCleen, ...
Are you sure? What if "Mac Clean" is actually somebody's first and last names?
I know plenty of people whose legal name is a single word, such as "Alex", "Max" or "Virgil." Would your system put that in the first_name, middle_name or surname column? Storing names and using them sensibly is hard, as TFA acknowledges.
You'd think that e-mail addresses by comparison would be simpler, but I have a hard time trying to register my e-mail address with sites that won't allow even simple things like "+", "-" or "." characters in the local part.
A database MUST treat all of these names the same: McClean, MacClean, MCLean, Mc Clean, Mac Clean. McCleen, ...
I assume you left out a "not" in that sentence? I think there are quite a few people that will kindly (or maybe not-so-kindly) explain why "Mc" and "Mac" are not the same.
My last name is O'Leary and over the past 5 years web sites have not gotten any better, and arguably have gotten worse, at handling the apostrophe in my last name.
Help me Slashdot, you're my only hope.
First thing I learned back in 1993 when I got started.
1) George Foreman has five boys named George Foreman. Your database better be able to handle that. :-)
2) Your database better be able to handle Cher (no last name).
3) People are not required to have Social Security numbers. (it's an optional program - you don't have to participate).
4) Not everyone's last name starts with a capital letter.
5) Mexican people's names break ASCII (the tilde n, ñ).
6) People named O'Grady have a hard time getting their name in a database sometimes and have a hard time getting their name passed via a URL sometimes and generally mess stuff up.
7) People from Sri Lanka will break your name length limits.
8) Some people's name is only a single letter.
9) Some people go by their middle name god damn it!
My first name: "where 1=1 "
My last name: "'; drop table users; --"
I code to spec. The product and marketing departments write the spec (what little there is); the QA department amends the spec with overly specific test cases. I suggest that the spec is incomplete and won't handle...but I'm told, just code it to spec. I recommend changes, but we don't have time for edge cases. I point out potential problems, but we're unlikely to get any of those. I warn of potential compatibility problems but we don't care. Are you just trying to be difficult? If there's a problem QA will catch it. The project is overdue already, and by the way here are some new requirements that need to make it in, and we can't change the release date because we already promised the stockholders. Why is your code so complicated, my twelve-year-old kid could write this.
It's not my fault. I code to spec.
It seems to me that most misconceptions about names can be fixed by the following:
Allow a single, Unicode-enabled field of "unlimited" length (let's say 4 kilobytes) which represents "name". Several would be defined by different roles -- "Real name", "Nickname", "login", where only login (sometimes simply an email address) is required to be globally unique.
Now let's look at what that breaks:
First, #1, 2, 4, and 5. How am I supposed to avoid assuming these? People should be allowed to enter an arbitrary number of names for themselves? I suppose that's possible, but it immediately kills most of the potential uses of this data. If I want to set a nickname that goes with my forum posts, say, what good is it for me to have five nicknames? Seems like the only potential use would be making people easy to find by real name -- so, a social network.
#6 -- surely 4k is enough, but this is also not a terribly difficult assumption to change later. Annoying, but not devastating, not even as hard as changing from the first name / last name combination into one "real name" field.
#7, 8 -- most systems would make it trivial for people to change their names.
#9, 10 -- UTF-8 is easy.
#11 -- very, very curious to see an example. And wouldn't that be a bug in Unicode? And this is again one where I have to ask -- how do you change this? Allow arbitrary images?
#12, 13 -- obvious solution is to make the name system case-preserving, thus allowing both case-sensitive and case-insensitive searches.
#14 -- again, avoid by simply allowing the name to be a single opaque field.
#15, 16, 17 -- if your name supports random Unicode, no idea why these would be a problem.
#18 -- not sure why it matters.
#19, 20 -- again, if it's just arbitrary text, it just works.
#21, 22, 23 -- not sure how I'd make that assumption.
#24, 25, 26, 27 -- again, the name is just an opaque bunch of characters.
#28 -- what?
#29 -- opaque characters.
#30 -- keep the original text as-is. If you want to try to split people out by naming scheme, do it later, but keep the original. This should be a "duh" concept -- always preserve the original user input. Cache transformations for speed, if you like, but they're a cache -- keep the original. Your algorithm might change.
#31 -- bad idea to assume bad words won't cause problems in general. I currently play an MMO in which I physically can't talk about Emily Dickinson, and have occasion to more frequently than you might suspect.
#32-36 -- why would it matter? Unless...
#37 -- Fine, but how would I otherwise connect the same person?
#38 -- How about Unicode-equivalent? And of course, they might not -- one might make a mistake, or the name might be represented differently. But you'd have to deal with typos anyway, so this isn't exactly shocking.
#39 -- I'm going to have to agree with the assumption, though. If I develop a system which works well for people who only follow the US standard, and I suddenly have a ton of people from China wanting to use my service -- enough that this is actually a problem for me -- that's a nice problem to have.
#40 -- People can make up names. I guess this explains #32-36, though.
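For what it's worth, the case-preserving idea in #12-13, the "keep the original" rule in #30, and the Unicode-equivalence question in #38 fit together in a few lines. This is only a sketch in Python: `StoredName` and `match_key` are names made up for illustration, and NFC normalization plus `casefold()` is an assumption about what "equivalent" should mean, not something the article specifies.

```python
import unicodedata
from dataclasses import dataclass


def match_key(name: str) -> str:
    """Derive a comparison key; never store this in place of the original."""
    # Normalize so "e" + combining accent and precomposed "é" compare equal,
    # then case-fold for case-insensitive matching, then normalize again
    # because case folding can produce un-normalized sequences.
    folded = unicodedata.normalize("NFC", name).casefold()
    return unicodedata.normalize("NFC", folded)


@dataclass(frozen=True)
class StoredName:
    # Hypothetical record type: exactly what the user typed, kept opaque.
    original: str

    @property
    def key(self) -> str:
        # Recomputed on demand; treat any cached copy as disposable.
        return match_key(self.original)

    def same_as(self, other: "StoredName") -> bool:
        return self.key == other.key
```

Note that this deliberately does not merge McClean and MacClean. Fuzzy matching of that kind is a separate decision, and the original text is still there if the matching rule changes later.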
The sense I get is that half the list is stuff you'd almost have to be stupid to run into (seriously, who doesn't use Unicode?), and the other half involves some seriously weird names and cultures that are going to have to meet me halfway, if they expect me to do anything interesting with their name. As I understand it, the only way to get this right would be to allow people to have zero or more names, each of which is either an unlimited amount of text in any encoding, or an image (raster or vector) of unlimited size. To query such a system requires insane amounts of logic just to deal with the text, and throw in some OCR for good measure.
I think this is a case where I would much rather see people evolve to match the technology, rather than the other way around.
Don't thank God, thank a doctor!
Pinyin is how Chinese is typed. The numbers represent tones...
No it isn't. Pinyin is how Chinese is romanized. Chinese is typed using an IME to produce Han characters. Pinyin is typically only used to represent pronunciation, for example in dictionaries, and to represent names in contexts where romanization is necessary (such as international contexts, like Western media), as well as a few other limited contexts. Writing Chinese in Pinyin, even with tone marks, is often inadequate because each syllable/tone combination corresponds to several characters, and the distinction between them is easily lost in romanization. For example, Zhang Zilin and Zhang Ziyi do not have the same surname, even though both are Zhang1 in pinyin.
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
Yep, there's rampant homophonia around here.
A database MUST treat all of these names the same: McClean, MacClean, MCLean, Mc Clean, Mac Clean. McCleen, ...
I assume you left out a "noot" in that sentence? I think there are quite a few people that will kindly (or maybe not-so-kindly) explain why "Mc" and "Mac" are noot the same.
fixed that
A database MUST treat all of these names the same: McClean, MacClean, MCLean, Mc Clean, Mac Clean. McCleen, ...
I assume you left out a "not" in that sentence? I think there are quite a few people that will kindly (or maybe not-so-kindly) explain why "Mc" and "Mac" are not the same.
Yeah, one goes in front of 'Donald's' and the other goes in front of 'beth'.
True. I run into email validation problems constantly. I have a two-part first name that has "-" in the middle, so my firstname.lastname email addresses (usually work addresses) always have a "-". In addition at the moment I'm a consultant in a large company, where they put "ext-" in front of everyone who is not employed by them but works for them and has an email account from them. I also often run into problems with length, because my name is 19 characters and the last place I worked for had a 15 character company name and when you add TLD to that, you sum to an email address that is 39 characters long, which for some seems to be too much. I really don't get why you would use only 32 characters to store an email address..
This problem very often bites in name fields, too, that don't accept "-" and two capital letters in my first name.
And I used to live near a border of two cities, where my postal address was from one city while my real city of residence was the other one. I have had a lot of problems with that, when the guys who made the systems were trying to deduce my city of residence from my postal address. Which is also impossible in my country, because the national post office also permits addresses that have postalnumber + company (instead of city) for large companies who take their mail in one place and deliver it themselves the rest of the way.
The regular expression, if one must be used, doesn't need to be any more complex than:
^[^@]+@[^@]+$
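In Python that check might look like the sketch below. `looks_like_email` is a hypothetical helper name, and the pattern is the one given above, so it only confirms there is exactly one "@" with something on each side. Treat it as a sanity check, not proof that the address is deliverable.

```python
import re

# The deliberately loose pattern from the comment above.
LOOSE_EMAIL = re.compile(r"^[^@]+@[^@]+$")


def looks_like_email(address: str) -> bool:
    """Return True if the address has one '@' with text on both sides."""
    return LOOSE_EMAIL.match(address) is not None


# Addresses with "+", "-" and "." in the local part pass unchanged.
print(looks_like_email("ext-anna-maria.smith+lists@example.co.uk"))
print(looks_like_email("no-at-sign"))
```

Anything stricter than this tends to reject real addresses, so the usual next step is to send a confirmation mail rather than tighten the pattern.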
Sending out response emails to an improperly validated address just turned you into an open relay. Spammers can use your server to send spam by embedding their entire message as the email address, trailed by '\x004@.'
Validate your inputs. Always.
Wow, if you consider McLean and MacLean the same, I suggest you never visit Scotland.
The Mc's and the Mac's consider the correct usage as a matter of extreme pride. You could end up with one or more bruises if you get it wrong and then insist that "well, they're the same anyway".
The author must have missed his history lesson explaining that family names only became popular in Western European culture when governments started tabulating people. In a rural village everyone knows that Jack the butcher is different from Jack the baker.
Hence Butcher, Baker, Smith, Brewer, Tanner, Farmer, etc became "family names".
*Even if the system did a conversion to a Latin representation of an Asian name most people can't pronounce them because they are based on different sound primitives.
Such a "translation" can easily be one to many, dependent on various factors.
Which is why Asians tend to adopt westernised versions of their real names.
Or they adopt a regular English, German, French, Spanish, etc name to be known by.
I thought the article was about the inability of programmers to remember names and recognise people. Maybe I should have read the article.
It's a real problem though - is it just me? I often know things about people (ah yes, plays squash, good at making cakes, father of that kid who rides a unicycle), but their actual name - no. It's a miracle if I recognise them at all.
Mind you, it means if anyone says "Hello" to me, I am obliged to be polite to them as I might actually know them quite well, but haven't recognised them yet - and certainly don't know their name.
It's a right pain. Anybody else suffer from this - and what the heck do they do about it? (I'd like a camera attachment that would whisper in my ear "that's Mrs Jones; her daughter, Kira, is in the same class at school as your daughter. Likes chess and is obsessed with kayaking" - something tiny that could clip on my glasses, maybe).
"Cats like plain crisps"
Sometimes I despair when I read or hear somebody referring to e.g. Djengis Khan as "Mr Khan" ("Khan" is a title, not a name) or even calling Hu Jintao "Mr Jintao"; you would have thought people would, by now, have caught on to the idea that something like half the world's population has the family name first.
Oh, come now - are you seriously saying you expect every single person to understand every subtle nuance of every other culture's use of titles and names? Here are some non-English equivalents to Mr., are you seriously telling us you know all of these? Here are the various forms of address in the UK alone, do you know all of these and every other culture's equivalent? How many of these should I learn before I go from being someone you despair of to someone you feel is welcome in your titular elite?
If half the world's population has the family name first, which half do I choose to offend when I don't know the exact rule for the home country of the person I'm speaking to? That's even assuming I know which country they're from. There's no reason to assume in this shrinking planet that someone who looks like they're from country A wasn't in fact born in country B to parents from countries A and C - a person born in Japan but with lineage in China might take great offence if I use Chinese honorifics to address him, surely it's better to be polite within the confines of my own known culture than to make such crass assumptions about his? The key thing I take from someone saying "Mr Khan" or "Mr Jintao" is that they're at least making the effort to communicate in a civil manner, which certainly causes me no despair.
To make things worse, it's not necessarily the family name you use to address someone politely.
If you have to speak to Paul McCartney (of Beatles' fame), you have to formally address him as "Sir Paul". No, "Sir McCartney" is impolite, you shouldn't use it.
If you have to speak to Vladimir Putin, you won't address him as "Mr. Putin". It's "Vladimir Vladimirovich", please!
You know, attitudes like yours are IMHO the root of all that's wrong with computers today. And I'm saying that as a programmer, not as Jane Grandma. The whole idiotic OCD idea that you _must_ make up rules about everything, and that your rules are more important than what people are actually trying to do. The idea that if even someone's name doesn't fit "your" database, then you can just brush them off and have a beer.
Here's some free clue: yes, you can't handle every edge case in the universe, but you'll find it's easier if you don't create such edge cases in the first place. If your database (actually more likely the program in front of it) can't handle last names with more than one capital letter, or with a dash in the middle, or which are more than 32 bytes long (which with UTF-8 might mean less than you'd think), then guess what? _You_ created an artificial edge case that had no reason to be there in the first place. Instead of handling every edge case in the universe, how about not creating them in the first place?
I find that about 90% of the problems boil down to the above: some idiot put some artificial limits or rules, that really aren't needed anywhere else. Just because he has the delusion that he's some kind of Moses on the mountain and just _has_ to come down with some rules.
E.g., he just had to define a byte limit, because he's prematurely optimizing a non-problem he doesn't understand. God forbid wasting space in the database by allowing 256 or 2000 byte strings... never mind that if he actually understood that underlying database, he'd know that a VARCHAR is not padded to that max length. If someone just entered "Alex", the same 4 bytes will be actually used in the database, regardless if the field is defined as maximum 4, 32, 256 or 2000 characters. But nah, he has to put some restrictive number there, 'cause it looks more like he's doing some smart job.
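To put numbers on the "32 bytes might mean less than you'd think" point, here is a small Python sketch. The helper names `utf8_len` and `truncate_utf8` are made up for illustration, and the 32-byte limit is just the example figure from above.

```python
def utf8_len(name: str) -> int:
    """Number of bytes the name occupies when encoded as UTF-8."""
    return len(name.encode("utf-8"))


def truncate_utf8(name: str, max_bytes: int) -> str:
    """Cut a name to at most max_bytes of UTF-8 without splitting a character."""
    clipped = name.encode("utf-8")[:max_bytes]
    # A character cut in half leaves invalid trailing bytes; drop them.
    return clipped.decode("utf-8", errors="ignore")


# Character count and byte count diverge as soon as you leave ASCII.
for name in ["Alex", "Jos\u00e9", "章子怡"]:
    print(name, len(name), "characters,", utf8_len(name), "bytes")
```

Each Han character takes 3 bytes in UTF-8, so a name of eleven such characters already needs 33 bytes and would not fit in a 32-byte column. Blindly slicing the bytes can also leave half a character behind, which is why the truncation helper decodes with `errors="ignore"`.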
There is hardly any reason to even use a user name for anything other than display purposes. (You do have a primary key for that record for everything else, right?) As such there is no reason to make any assumptions about it, or enforce any particular format, or anything. There's no reason to even disallow SQL keywords (just effing quote it before using it in SQL) or angle brackets (just quote it before using it in HTML.)
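A minimal sketch of what "just quote it" can look like in Python, using the standard `sqlite3` and `html` modules. The `users` table, its columns, and the helper names are made up for illustration. On the SQL side this uses placeholder binding rather than hand-quoting, which is a substitution on the assumption that letting the driver handle it is the safer reading.

```python
import html
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table: the name is display data, the id is the key.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def save_name(conn: sqlite3.Connection, name: str) -> int:
    # The "?" placeholder passes the name as data, never as SQL text.
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    return cur.lastrowid


def load_name(conn: sqlite3.Connection, user_id: int) -> str:
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def render_name(name: str) -> str:
    # Escape only at output time, so the stored value stays untouched.
    return html.escape(name)


conn = sqlite3.connect(":memory:")
create_schema(conn)
uid = save_name(conn, "'; drop table users; --")
print(load_name(conn, uid))
print(render_name("<b>O'Grady</b>"))
```

The stored name comes back byte for byte, the table survives, and the apostrophe and angle brackets only get escaped when the name is written into HTML.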
There is no reason to create any edge cases in the first place.
And really it's not even just about names. Names are just one case where people make up BS rules just to feel more like they did the great design job. One could make the same case for the gazillion other pointless rules imposed upon the user or his work-flow or data, not because they're actually needed anywhere, but just because some OCD idiot feels like he _must_ impose some rigid structure upon things that really have none and don't need any. But he'd just feel naked without defining that kind of rigid structure, or without imposing upon humans some data structures theory that was intended only for use by programs.
A polar bear is a cartesian bear after a coordinate transform.