A Japanese company once decided that they needed "virtual" employees in a particular system, for example to support adding a job to the org chart before that position had been filled (and another dozen use cases), so they had the clever idea "Hey, if we need to do this, we'll just input their 'name in Japanese' as one of a dozen status flags, like XX_JOB_REQUEST or XX_INCOMING_TRANSFER."
One developer at this company, who was annoyed with having to tweak a particular system every time they added a new possible status flag, wrote code which was, essentially:
if (InternalStringUtils.isAllLatinCharacters(employee.getJapaneseName())) {
/* no need to pay this 'employee' so remove them from batch
before we retrieve bank details for salary transfers */
...
}
Do I have to explain why I'm aware of this curious implementation choice?
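For anyone who wants to see the failure mode concretely, here is a minimal sketch. The real InternalStringUtils.isAllLatinCharacters is not shown above, so the implementation below is purely an assumption about what such a helper might do (treat a string as "Latin" when every letter in it belongs to the Latin script). The class name is hypothetical too.

```java
final class LatinNameCheck {
    // ASSUMPTION: hypothetical stand-in for InternalStringUtils.isAllLatinCharacters.
    // Non-letters (underscores, spaces, digits) are ignored; every letter must be Latin script.
    static boolean isAllLatinCharacters(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        return s.codePoints()
                .filter(Character::isLetter)
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }

    public static void main(String[] args) {
        // A status flag stored in the name field: treated as a "virtual" employee.
        System.out.println(isAllLatinCharacters("XX_JOB_REQUEST"));        // true
        // A real employee whose name is simply written in Latin letters: also true,
        // so this person would be dropped from the salary batch as well.
        System.out.println(isAllLatinCharacters("MCKENZIE PATRICK"));      // true
        // A name written in kanji (escaped here to keep the source ASCII-only).
        System.out.println(isAllLatinCharacters("\u7530\u4E2D\u592A\u90CE")); // false
    }
}
```

The check cannot tell a flag from a person, which is the whole problem. A separate, explicit "is this a placeholder record" field would avoid overloading the name.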
Many systems in China depend on your Chinese name and identification number. Suffice it to say, foreigners who work in China don't have either of these: a made up Chinese name is not meaningful or legal (China also lacks any kind of kana), a passport number is not a "valid" ID number and changes every 10 years anyways. I don't get to use many online services accordingly, and every year there is some problem with how they handle my last name (MC DIRMID, there is a freaking space after the MC in my passport, causes all sorts of problems).
When I worked in Taiwan they came up with a Chinese name for me and gave me a little official wooden stamp I could use to "sign" official documentation. The name was indeed useful to fill in all sorts of forms that expected the Chinese format. That was a neat time.
It's meaningful in some contexts. I had a different Chinese name on my work permit and marriage certificate (in both cases transliterated without my input), and it caused me no end of hassle when I was applying for a mortgage - we got refused the first time and had to get the work permit changed to make them the same.
Yes, we (my wife and I) were careful about that when we got married last year. But to be honest, I think that's the only case (and the name on my work permit is not even my own chosen Chinese name!)
Tibetans (with Hukou anyways) have Chinese names though. Usually they are phonetically chosen. It's something that must be done when you are born in China I guess, even if your native language isn't Chinese. This applies to all minorities who use different writing systems (Uyghur, Manchu, Mongolian, etc.).
Japanese of Korean descent can also choose Kanji names I think, to use as legal aliases.
Anyone who lives in Japan can register a legal alias, regardless of their citizenship. The special permanent residents that you mentioned (who typically hold North / South Korean or Chinese citizenship) are probably the most common users, but some Japanese people who are divorced use them, too.
An alias with Kanji is really useful for living in Japan. I'm an American citizen but I use an alias with a Kanji last name for just about everything I can (including my job, bank account, and apartment contract). Immigration paperwork and credit cards are just about the only things where the alias can't be used.
I worked in a security software development department where the primary security request application had to allow a request from anyone, for anyone (Approval was more stringent). I personally found several bugs in the system in my first few months, because I, personally, conflicted with the various "uniqueness" constraints in the system... like lastname + ssn-last-four, or dob + firstname, etc.
The org had 380K active entries, so it was definitely interesting being a dev on such a project, with a relatively common name, and conflicting dob and last 4-5 of my ssn.
Your DOB and last 4-5 of your SSN matched someone else?
The reason I ask is because rules like this are often used to de-duplicate records. It's not perfect but it is useful, especially when trying to integrate data from more than one system. It's also used quite a bit in fraud detection etc. to find connections in the data.
There were about 380K users in the various systems... so conflict chances were pretty high... I can't imagine what it would be like to have a name like "John Smith" or "Adam Jones" ... even more common...
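A rough back-of-the-envelope sketch of why this is not surprising. It uses the birthday-problem approximation: with n records and k equally likely key values, the expected number of pairs sharing a key is n(n-1)/(2k). The 380K figure is from the thread; the 60-year spread of birth dates and the uniform distribution are my assumptions, and real data is lumpier, so treat the result as a lower bound.

```java
final class KeyCollisionEstimate {
    /** Expected number of record pairs sharing a key, assuming k equally likely key values. */
    static double expectedCollidingPairs(long n, long k) {
        return (double) n * (n - 1) / 2.0 / k;
    }

    public static void main(String[] args) {
        long records = 380_000L;        // figure quoted in the thread
        long birthDates = 365L * 60;    // ASSUMPTION: birth dates spread evenly over 60 years
        long ssnSuffixes = 10_000L;     // four decimal digits
        double pairs = expectedCollidingPairs(records, birthDates * ssnSuffixes);
        System.out.printf("expected pairs sharing DOB + last four: %.0f%n", pairs);
    }
}
```

Under those assumptions that comes out to roughly 330 colliding pairs, so a "unique" constraint on DOB plus last four digits would be expected to fail hundreds of times at this scale.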
That was imprecision because I was trying to avoid the quick discussion of Japanese orthography. Like most systems in Japan dealing with names, there are separate fields for 漢字名 and カナ名. (Some systems also have ローマ字名.)
Japanese systems have wide, wide variability in what they do for 漢字名 for people who, ahem, don't have one. Some repeat the カナ名. Some do so but use half-width kana (半角 vs. 全角). Some managers who believe that there is such a thing as an "official name" think that one's official name should go in 漢字名, regardless of whether it is 漢字 or not.
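If two systems disagree only on width (半角 vs. 全角), one possible way to compare their values is to fold both sides through Unicode NFKC normalization first. A minimal sketch using the standard java.text.Normalizer; the class and method names are hypothetical, and string literals are escaped so the source stays ASCII-only.

```java
import java.text.Normalizer;

final class KanaWidthNormalizer {
    /** Folds compatibility variants (half-width kana, full-width Latin) to one form via NFKC. */
    static String canonicalWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String halfWidth = "\uFF76\uFF9E"; // half-width KA followed by half-width voiced sound mark
        String fullWidth = "\u30AC";       // full-width GA as a single code point

        // Naive comparison: the two spellings of the same kana do not match.
        System.out.println(halfWidth.equals(fullWidth));
        // Comparison after folding both sides.
        System.out.println(canonicalWidth(halfWidth).equals(canonicalWidth(fullWidth)));
    }
}
```

One caveat: NFKC also turns full-width Latin letters into plain ASCII, so it suits matching and lookup better than deciding what to store or display.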
A related problem: what happens when you have two systems which have different behaviors on this? For example, let's say you're a Japanese bank, and your branch employees were instructed in 2012 to update any 漢字名 of foreigners to be the name written on their foreigner registration card, in double-width characters. Let's further suppose that your web tier does Javascript validations when you try to sign up for online banking, and because any engineer can see that DOUBLEWIDTH latin characters are not 漢字, this means that it is literally impossible for the web tier to match the DB for affected customers.
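To make the mismatch concrete, here is a sketch of the kind of "is it really 漢字?" rule a web tier might apply. The validator below is hypothetical (the bank's actual validation is not shown anywhere in this thread), and the stored value is just full-width Latin letters spelling a name, escaped to keep the source ASCII-only.

```java
final class KanjiFieldValidator {
    // ASSUMPTION: hypothetical web-tier rule, true only if every code point is Han script.
    static boolean isAllHan(String s) {
        return !s.isEmpty()
                && s.codePoints()
                    .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // What the branch typed into the DB: MCKENZIE in full-width Latin letters.
        String storedByBranch = "\uFF2D\uFF23\uFF2B\uFF25\uFF2E\uFF3A\uFF29\uFF25";
        // A name actually written in kanji.
        String kanjiName = "\u7530\u4E2D";

        System.out.println(isAllHan(kanjiName));      // true
        System.out.println(isAllHan(storedByBranch)); // false: the user can never type a value that matches the DB
    }
}
```

Full-width Latin letters still belong to the Latin script as far as Unicode is concerned, so a script-based check rejects exactly the value the branch was told to store.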
It's not quite the same, and it doesn't break things so much as make them rubbish, but try getting a radiology information system to talk to a scanner of some description (CT, MRI etc). GE scanners accept Surname^Name. If someone has a middle name it doesn't display or come across to the scanner, so as to save space (I assume). This is fine until you get someone who has a first name with 2 separate words. I discovered it with someone called something like Al Amen as a first name. No hyphen. So now he is called Al. To make the medical images correct we have to incorrectly spell his name and make the RIS incorrect. Since then I look out for this and I have seen lots of patient names broken in this manner. Mid-name capitalization also breaks and it all becomes lower case. McDonald to Mcdonald etc. Names are horrid to deal with and people (myself included) like them to be correct.
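A minimal sketch of the handling I would want instead: split the caret-delimited string on carets only, and never guess that a space inside a component marks a middle name. The class name and the surname in the example are made up, and this is not any vendor's actual parser.

```java
final class CaretDelimitedName {
    /** Splits "Family^Given^Middle" on carets only; spaces and capitalization are kept verbatim. */
    static String[] parse(String raw) {
        String[] parts = raw.split("\\^", -1);
        String[] out = {"", "", ""};
        for (int i = 0; i < out.length && i < parts.length; i++) {
            out[i] = parts[i];
        }
        return out;
    }

    /** Rebuilds the caret-delimited form, omitting the middle component when it is empty. */
    static String format(String family, String given, String middle) {
        String base = family + "^" + given;
        return middle.isEmpty() ? base : base + "^" + middle;
    }

    public static void main(String[] args) {
        String[] name = parse("McDonald^Al Amen"); // hypothetical patient
        System.out.println(name[0]); // McDonald (capitalization untouched)
        System.out.println(name[1]); // Al Amen (two-word given name kept whole)
        System.out.println(format(name[0], name[1], name[2])); // McDonald^Al Amen
    }
}
```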
For a few years my airplane boarding passes said I was PAULA JUNGWIRTH because A is my middle initial. I got a few questions trying to board. I've noticed the last couple years that they print with a space now.
When I flew Lufthansa, I was granted an honorary Ph.D.
My official name is "Aleksandr Feinberg" -- it's transliterated from Cyrillic (hence the "ks") and the Russian version of "Alexander" does not put an e between the d and r -- which Lufthansa decided to print on my boarding pass as Dr. Aleksan Feinberg
I recently moved to a house on a street named "Martin Luther King Junior Way East". The service rep must have bumped the tab key while I was signing up for DSL, because I now get mail addressed to "Mars Saxman Junior Way East".
The funny part is that the street name is still filled out in all of its glorious detail - but the capitalization is different, which tells me they did some kind of zip code based address sanitization.
I've never understood why people entering data into a system enter an initial. It's a partial entry. You wouldn't enter a date of birth as 3.
On further thought, actually, they do. Then dismiss all the error messages and quit the program to get past the system keeping them in the field waiting for completion. Users seem hell bent on breaking our databases.
I frequently have to enter my first initial and middle name as my "first name". Why? Because that is how it appears in numerous official places, such as my credit card.
The users aren't broken, your database (and your assumptions about names) is.
The database has errors and faulty assumptions, yes. I have some too, but I do not allow names entered into a big medical system to be anything other than the person's name, minimum of first name and last name, but I go over every record that passes through our scanners and enter middle names too. We have an AKA field where the patient can be called whatever they want, characters and numbers allowed. This is not stuck into medical image DICOM headers but appears on the information system which is used when talking to patients or browsing records. DICOM files are transmitted across the hospital; our information system data isn't, but it does transmit reports with a limited amount of patient data on them. Screw-ups with identification happen too often already (once is too many) and matter too much to have a load of bad data in the system. Abbreviate anything at great risk. We have lots of people with the same name and same date of birth already, so extreme care is needed.
Wow, that first one is great, thanks. I'm saving that for future reference. Japan seems a hotbed for hard database problems.
There is no way any of our systems could handle some of that, and it's mostly not our fault - try getting access to change stuff on medical database software or imaging equipment. It isn't possible.
That example came along fast. I didn't expect an Anglo-American example (assumption based on links) and I wonder about the origin? The lack of a period makes it somewhat simpler for system handling, but I wouldn't bet on it sailing through without issue.
My father's middle "name" was a single letter. My grandparents didn't give him a middle name (at least not in English), but the nurse took it upon herself to record what sounded like a middle initial to her, and that's what ended up on his birth certificate. It stood for nothing but the letter itself.
I love it when I get asked that as a security question, only to be told it's invalid (too short). Tell that to Harry S Truman!
Ok, more accurately, people enter names into our database as an initial followed by a period. So John Andrew Doe becomes Doe^John^A., where the period is not part of their name. I agree that it's possible that is someone's name, but I am 100% certain that every instance I have encountered is incorrect. The various manufacturers' software we deal with either gets confused by middle names or drops them. They also commonly assume that having 2 names in the first name field means that one is a middle name, and drop it. This isn't useful. I have never encountered someone with a single-letter name in my workplace (first, last, anything) and so hadn't considered it. I am confident of this as I compare what every person writes on a form as their name with what our system says.
Some people might find their middle name embarrassing, thus choosing to use an initial to keep it secret where it isn't required for anything but disambiguation.
When the middle name avoids confusion with another person and the situation is medical files, it's about as important as it gets. There is also a legal obligation for it to be accurate with some of our government contract work. That said, they don't seem to monitor accuracy. I spent a lot of time monitoring it though (checking, correcting, hounding data entry inaccuracy serial offenders), and it still causes me cold sweats every now and again.
Particularly names in southern regions of India, where traditionally people would have only a single name (no surname) and when a surname is required they might give the first letter of their father's name.
The name of someone is a delicate problem, but all of these are straightforward questions.
There is an "official name" and it's on your registration certificate.
That's the one that goes in the 漢字名 field and that's the one that the bank will accept (my registration had both the double-width Latin characters and the katakana in parentheses: no questions asked, they both go in, parentheses included).
The validation of the kanji name is not done on the exact range of the characters ('is it really a kanji?') but on whether it's double width or not. Double-width Latin characters are OK; you could use emoji and the validation would pass. Half-width katakana would get rejected.
Protip: if you have a shitty registration name, have it changed; that's easy and it's for your own good.
There is an "official name" and it's on your registration certificate.
The thing about official names is that I have so many to choose from! Alien registration certificate? MCKENZIE PATRICK JOHNATHAN. No kana because town hall called up the local immigration authorities and heard "'Nicknames' are not required for the administration of Japanese immigration law and accordingly should not be registered. You should only register him under the exact name printed on his passport." (This is official policy, but many local government authorities ignore it, including half of the clerks at Ogaki. I drew the short straw on my most recent visit though and "had to change.")
Mr. Short Straw did not, however, actually use the name written in my passport, because some genius at the US Passport Control Center thinks Irish people get an extra space in their last names and, after substantial argument with town hall, I was able to convince them that a lifetime of being addressed as Mc-san would be very inconvenient for my wife and me.
But wait there's more! As a result of marriage the McKenzie household finally exists on the books in Japan as a 戸籍, whereas before it was just little ol' me happily residing here as a foreigner. An hour of investigation with a totally different part of the Ministry of Justice later, Town Hall refused to register a 戸籍 with Latin characters, and was actually able to produce an authoritative Least Frequently Asked Questions At Ogaki City Hall internal guidelines document on what to do in the event of international marriages. So my "official" name in that part of the system is different: ミッケンジー、パトリックジョナサン. Mr. Short Straw remarked, direct quote, "Cripes, that seems like an inconvenient name to go around with. Have you considered just changing it? I've got the forms and I'm pretty sure you could be Tanaka Taro by the end of today." (Bonus points: We filed a name change for Ruriko at the same time as getting married, and hers is based on what's written in the 戸籍 and her 住民票, which gives us the wonderful circumstance where "Wife took husband's name after marriage but, important note, their names will still fail naive string compares... well, some of the time, depending on which agency and what data source we're querying.")
But wait there's more! City Hall is my single point of contact for Japanese Social Security, Japanese national insurance, and the Gifu prefectural revenue office. I think I count four different official names there unless one or more decided to change policies recently. Gifu extends its apologies but it is physically incapable of handling sole proprietors with given names which are 7 letters long because, quote, "Who does that to a child?!", so Kalzumeus Software is on the books as being owned by MCKENZIE P.
The decision not to manage "nicknames" (通称名) under the new immigration law because they aren't necessary from an administrative standpoint is illustrative of the disconnect between the people making these laws in Japan and the people that are subject to them. I realize that this is inevitable because foreign residents can't vote, but it's frustrating that the government doesn't seek input from them when formulating new policy that will have large effects on them.
Because of the difficulties in using foreign names with Japanese computer systems and paperwork that you mentioned, 通称名 ("nicknames") are essential for many foreign residents. Some groups have been using them for decades now, so even a cursory attempt to get feedback on the new laws would have identified this problem.
Still, some groups of special permanent residents have organized and successfully overturned some of the more odious aspects of the immigration law, like the fingerprinting requirement for alien registration. In particular, the Korean special permanent resident community has some degree of influence on policy because of their size and organization.
Given the general ignorance of the central government (and the immigration bureaucracy as a whole) towards the real needs of foreign residents, I see this decision as ignorance on the importance of 通称名 rather than an attempt to quash the rights of foreign residents. In my experience the local governments tend to be more sympathetic towards the actual needs of foreign residents, perhaps because they have more prolonged interaction with them. (Though as with every government organ in Japan, the interpretation of the law varies wildly depending on which clerk you interact with.)
Troubles like the ones you describe are a large part of why I registered 長瀬ダニエル as a 通称名 and use it for everything I possibly can.
Dealing with names in Japan is really a life experience in itself. Yes, I was referring to the Alien registration certificate.
As you say, there are so many to choose from. I changed 4 times during my stay (at the end I had my name twice in the same field, one in Latin characters and one in katakana, plus a 'nickname' with my wife's family name). BTW it was the best choice so far, even if it's awkward to fill in bank papers with 'SOME ROMAJI NAME (カタカナ名)'. Immigration Office staff really do a shitty job at dealing with the registration, but I had it changed in the prefecture where I lived. They are much more forthcoming, and will agree to use anything reasonable as a name.
Could this have been solved by using "外人第一" as a suffix or instead of the romaji/katakana name?
I once tried to apply for a Japanese credit card online about 10 years ago (certainly things have changed with some banks since then). IIRC the form would not accept romaji and my kana name was too long for the kanji input field.
This was painfully frustrating at the time but helped frame my approach to forms and DB specification when I got into web development (e.g. always using UTF8 in MySQL, full name as a single field in some applications, etc.).
I think the problem you ran into wasn't so much a user interface problem, but a "gaijin aren't our target clientele" thing. 10 years ago a lot of Japanese banks regarded "gaijin without permanent residence" as riskier than a 20-year-old Japanese student; this included gaijin with good credit records and income above average. Things have been changing, though.
As a Norwegian in the UK without any problematic things about my name, I think you don't need to assume malice or lack of imagination, simply that they did not want to handle anything but the braindead "safe" situations online.
I similarly often prefer to apply offline or in person about things, because despite more than a decade here, and great credit history, I occasionally get hit by UK banks assuming that not being on the electoral roll means increased risks (it can mean you're trying to keep your real address out of official registers). They have no problems dealing with me in person, when someone manually reviews the situation, but either they've decided it's not worth the hassle to try to deal with this online, or that it's safer to just point me to a branch or call for extra verification.
Fair enough, but I'm not exactly assuming "malice", I'm just saying that such "lack of imagination" wouldn't happen if foreigners were considered an important target for the banks.
One thing that's annoying to me is that governments and employers increasingly believe many of these things, partly because they want to cross-reference names and match canonical forms.
My given names in English are Mark Jason, and that's on my birth certificate. In Greek, they're Μάρκος Ιάσονας, which are the equivalents, and that's on my municipal birth records there (registered as a foreign birth at the time of baptism). There seems to be a move towards wanting to use "accurate" transliterations, though, rather than the more traditional method of translating names to equivalents (Mark<->Markos, George<->Georgios, Paul<->Pavlos, etc.). Sometimes people desire that: maybe someone named Михаил in Russian really doesn't want to be turned into Michael, but wants to go by Mikhail. That's fine, if they prefer. But in my case, I consider each of these translated forms to be my name in the respective languages, and do not consider the transliterated forms to be my name.
But in trying to sort out some paperwork, it appears that what I am supposed to do is one of these two things: 1) change my name in English from Mark Jason to Markos Iasonas, the transliteration of my Greek name; or 2) change my name in Greek from Μάρκος Ιάσονας to Μαρκ Τζέισον, the transliteration of my English name. But I don't want to do either of those things. #2 in particular is ridiculous, because it doesn't decline properly, and is trying to approximate a 'j' sound with 'tz'.
Growing up, my parents called me by my middle name, as I share a first name with my dad. (I'd rather be an Edward than a Ralph anyway.) When giving my name to someone, I tell them I'm Edward <Lastname>, as telling them I'm R. Edward <Lastname> just sounds pretentious. But if I'm beginning a relationship with a doctor's office or lawyer, or filling in a tax form, it's Ralph E. Lastname, because that's what's on my birth certificate and SSA record. It is quite annoying when the phone rings, and I don't recognize the calling number, so I answer with a guarded, "This is Ed..." and hear the caller ask, "May I speak to Ralph?" and have to explain to them that I really am Ralph, even though I said I was Ed. But I have to say, my problems are nothing compared to yours!
CSB: My mom signed me up for a book club when I was 6 or 7. For the Firstname field, she wrote, "R Edward" for reasons known only to her. For the next three years, every couple months, I'd get a package addressed to Redward <Lastname>. I could just imagine the shipping clerk in that company reading my shipping label and saying to himself, "Redward... what a goofy name."
My name is Kim <Lastname> and I'm a male. Try convincing Americans (and other English speaking countries) about that...
One example: Many years ago I subscribed to TIME and filled out a form where I checked "Mr." Apparently the person who typed in my name decided to "correct" this error and I became a "Mrs."... and I wasn't even married :-)
The company I work at has offices in different cities, so most of the communication is done by email and instant messaging. I see a clear difference between the messages from people who know my gender, and those who probably think I'm female. Even attempts at flirting...
I went to school with a male Kim in Aus. Never even realised it could be a girl's name until the 80s when there were several female singers called Kim. Lots of male names seem to become girls' names. Ashley is another that seems to have been lost in living memory. Apparently Shirley was once a male name and I am not joking. Between that, boys once wearing dresses until breeching, pink clothes and long hair, time travel must be really confusing.
Males named Kim go relatively unremarked in Australia at least, perhaps because there have been three well-known male national political figures named Kim in the last 40 years.
In Scotland there's a fair amount of male Kellys and Lesleys which also appear to have slowly evolved into female names in the rest of the English speaking world (as far as I can see)
Many years ago Safeway, a commonly seen grocery store in the US, started using a Safeway Card to gather data on its customers. Shoppers get lower prices if they use it, so I filled out my paper form. Some non-English speaking data entry clerk in Mexico somehow mistook my middle initial for an "Os" and prepended it to my last name, creating quite a tongue twister.
Safeway checkout clerks are apparently required to thank me by name, using the name that pops up on their screens when I swipe my card. For nearly twenty years now, all over the country, every harried Safeway checker has sent me on my way with, "Thank you Mr., uh, Asperger", or "Thanks, um, Mr. Ostrich", or whatever that bizarre cluster of letters randomly turned into on the way out of their mouths.
At first, I thought I should fix it, but I quickly grew to enjoy the show. I also enjoy the thought of them trying to cross-match Mr. Asperger with other consumer databases.
At home, I only pick up for known numbers, and let the answering machine screen the rest. At work, they kind of expect you to identify yourself when you pick up the phone. But yes, what you suggest is a workable alternative, where allowed.
This is my situation as well. I can't count the number of times I've gotten the "Redward" equivalent.
To compound the problems caused by this, I switched to using my first name as my primary name around the same time as I switched coasts. People on the east coast know me by my middle name, people from the west coast know me by my first name. This can come in handy sometimes as it gives me a very quick indication of where I know somebody from, but gave me a good deal of trouble recently at a west-coast wedding with lots of east-coast people attending... The fact that some of my friends were introducing me to other people with just my last name made it quite... interesting.
It's an annoying thing to have to deal with on a regular basis. I'm in the same boat, in that my father and I both have the same first and last names, so I've always gone by my middle name.
Except now when I go to networking events or interviews, I tell people my middle name L* and then they point out that my name tag or application says my first name is M* and I have to go through the whole song and dance of explaining the situation.
I'm frustrated enough to be looking into getting it legally changed. You might want to consider that as well if only to not have to deal with those phone calls any more.
Same here. I get around this by never mentioning my first name unless I'm in a situation where it's legally required, like at a doctor's office or the DMV. That boils down to a small number of cases:
1. If they're working for me, like at the doctor's office, I ask them to please call me by my middle name. They're generally respectful about it and are used to dealing with nicknames and other aliases anyway.
2. In the DMV and other situations, I just grit my teeth and answer by my first name. It's not worth the hassle of explaining and they don't care anyway.
3. If I'm being hired, I fill out my paperwork "officially" and give it to HR, with the explanation that I go by my middle name for all legal purposes.
4. Banks are kind of weird because they perform official government functions, but they're still ultimately working for me. I've only had one bank flat-out refuse to put my middle name on my debit card and checks, and I explained to the branch manager why I was walking out the door before we'd finished opening my account.
I've thought about it, but as soon as I do, I think about all the paperwork inevitably involved in making sure my medical history follows me, my pensions and other financial records get updated, and all that other nonsense, and, having nearly half a century of paperwork that ought to be updated, and being an essentially lazy old cuss, I decide that I can live with the annoyance.
Not sure why, but your story reminded me of a friend of mine who's always gone by the name "Mick" (a common shortening of the name Michael in Australia). The thing is, neither his first nor his middle name is Michael. It was just a nickname that stuck when he was a kid.
I kid you not, at his wedding when the celebrant said, "Do you Susan, take this man Brian..." his bride exclaimed, "Who's Brian?"
Yea, my daughter's name is a bit unusual, and so the transliteration produces a different name than English. Her name is Thamina. It's an Arabic origin name, so in Arabic it's ثمينة. Not that she's Arabic at all, she's part Russian, and was born in Russia, so her name is Тхамина in Russian. They transliterate her name in a standard way to Tkhamina on her Russian passport, but on her USA passport it's Thamina.
I feel your pain, although we Romanians use the (standard?) Latin alphabet. I had to spell 'Andrei Simionescu' over the phone so many times that I'm sure I could win a couple of spelling bees easily.
Speaking of which, why don't all companies just move to automated support systems already? These guys are doing it right http://www.zocdoc.com/
Human names are an excellent illustration of the reason that you shouldn't use 'real' data as a key in a database. If the key is arbitrary and meaningless then it doesn't need to be mutable.
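A minimal sketch of that idea, assuming an in-memory map standing in for a table and a random UUID as the arbitrary key. Both are illustration choices of mine, not a recommendation for any particular database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class PersonRegistry {
    // The key is meaningless and never changes; the name is just mutable data.
    private final Map<UUID, String> displayNames = new HashMap<>();

    /** Stores a new person and returns the surrogate key. Duplicate names are allowed. */
    UUID register(String displayName) {
        UUID id = UUID.randomUUID();
        displayNames.put(id, displayName);
        return id;
    }

    /** Changes the name without touching the key, so references elsewhere stay valid. */
    void rename(UUID id, String newName) {
        if (displayNames.replace(id, newName) == null) {
            throw new IllegalArgumentException("unknown id: " + id);
        }
    }

    String nameOf(UUID id) {
        return displayNames.get(id);
    }

    public static void main(String[] args) {
        PersonRegistry registry = new PersonRegistry();
        UUID first = registry.register("John Smith");
        UUID second = registry.register("John Smith"); // same name, different person
        registry.rename(first, "John Q. Smith");       // marriage, correction, transliteration...
        System.out.println(registry.nameOf(first));
        System.out.println(registry.nameOf(second));
    }
}
```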
My name fails to register surprisingly often, even here in Brazil.
It is Hélder Maurício Gomes Ferreira Filho
Common reasons for failure are being too long and having non-ASCII characters, but sometimes it fails for other reasons: for example, not allowing me to register without a middle name (I don't have one, actually...), me being confused and not knowing how to register Filho (it is not a family name, nor a surname or a last name, but it is still part of my name. It means Son; my father has the same name as me, without the Filho part), or breaking when it cross-checks with somewhere else (for several reasons I ended up registering my name in several different ways, usually omitting Hélder, which I did not even know was in my name until I got to school and was forced to use it because of stupid rules that assume your first name is your typical name).
Jr. is absolutely part of my official name in the USA. It is the only distinction between my and my father's name. Many forms have a specific spot for suffix.
No, I'm stating that Jr. is an official part of Norman John Harman Jr., my name. It is on my birth certificate, filled out on my tax return, etc. In any case where I'm required to use my real name, if I used Norman John Harman Sr. I would be committing fraud. Likewise fraud if I left off the Jr. in an attempt to confuse with or impersonate my father.
By your reasoning, there is no such person. If there can be a "Norman John Harman Sr." then there can also be a "Norman John Harman Jr." who does not have it listed on official documents.
When your father dies, and you've named your son Norman John Harman as well, don't you become senior? It can't be an immutable part of your name if your junior/senior status changes.
To me, Americans with III after their name always look like they're pretending to be royalty. Where I live, only monarchs have that (and only after their first name).
One memorable day at work a POS Kodak system decided that it wouldn't store a particular record. Nothing worked. This happened a few times until it became clear that the only thing these cases had in common was that the people's first names started with BRE. I can remember a rather sad looking IT manager nodding in agreement, having tried everything, when I suggested we just get them to change their names. Bug is still there. It's just a historic archive now, thank god. Worst software ever.
I found your comments on "Filho" interesting. In English, the equivalent is "Junior" with the father sometimes using "Senior." It is not part of your name per se, and thus would not be typed into the name field of a form. However, in places where naming sons after their fathers is common, there is often an additional drop-down box listing (Jr, Sr, I, II, III, and so on).
Oh yeah, these drop-down boxes piss me off, especially because here in Brazil there is BOTH Junior and Filho... And I am Filho, not Junior.
(also we have "Neto" that means Grandson, it is quite popular, I know a bunch of guys like that, I don't think drop-down boxes in other countries will expect that)
Generally not. On the Internet, ASCII generally means ANSI_X3.4-1968, a 7-bit standard with 128 code points. (Run "man ascii" on a Unix system to see this.) There aren't any accented characters.
By contrast, there were national variants of ISO/IEC 646 (also a 7-bit character set, and essentially the internationalized version of ASCII) that included accented characters within those 128 code points. Generally these swapped out things like the at-sign (@) and the curly braces and vertical pipe character for accented vowels instead.
There were also lots of 8-bit character sets in ISO/IEC 8859 (e.g. Latin-1, or ISO/IEC 8859 part 1) that included accented characters within the "extended" set of code points 128-255.
There are a number of different "extended sets" (IBM code pages and ISO/IEC 8859 parts, for instance), and they're "extended" because they're not ASCII but supersets of it (as is UTF-8).
ASCII is the 7-bit encoding ANSI_X3.4-1968, composed of 95 printable and 33 control characters.
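For anyone who wants to check those counts, here is a small Python sketch. The variable and function names are mine, purely for illustration:

```python
# Code points 0-31 and 127 are control characters; 32-126 are printable
# (the space at 32 is counted as printable).
control = [cp for cp in range(128) if cp < 32 or cp == 127]
printable = [cp for cp in range(128) if 32 <= cp <= 126]


def is_ascii(text: str) -> bool:
    """Return True if every character fits in the 7-bit ASCII range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(len(control), len(printable))
print(is_ascii("Mauricio"), is_ascii("Maurício"))
```

Any accented letter fails the `is_ascii` check, which is the whole point: whatever encoding holds it is something other than ASCII.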
Or they account wrongly ;) (ie: from the wrong set)
I love how sometimes, even within the same company, each system treats ASCII differently.
I remember registering for an IM service: on one info screen my name was Maur&cio, on the site info screen it was Maur€cio, on the search screen it was Maur£cio, and so on...
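A rough sketch of how that kind of thing happens: the name is stored as bytes in one single-byte encoding and each screen reads the bytes back under a different one. The three codecs below are only examples; I have no idea which ones that IM service actually used.

```python
name = "Maurício"
raw = name.encode("latin-1")  # í becomes the single byte 0xED

# The same bytes, read back under different single-byte code pages.
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp437", "cp850")}

for codec, text in decoded.items():
    print(codec, text)
```

Only the screen that agrees with the encoding used for storage shows the original name; the others each show a different wrong character in the same position.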
Extended ASCII is not ASCII. There is ASCII, which has no accented characters, and there are other character encodings based on ASCII, which often do. Those other character encodings are not ASCII.
Hyphenated last names seem to break many customer service people - they'll do things like insist one is the "real" name, assume you are married and ask when, etc. I think it's that hyphen is such a simple thing, it screams "understand me!" instead of simply being entered verbatim.
Assuming hyphen = married = wife's father's name-husband's father's name is just so ignorant. I've known many people who have always had hyphenated names, a few I've gone to primary school with. I have a good friend with an always-hyphenated last name who gets asked personal questions by strangers and near strangers about her name. How about "It isn't any of your business, I can have as many names as I want?"
My best friend in elementary school had the misfortune of a hyphenated last name that was exactly sixteen characters long. The systems at the school evidently limited you to fifteen characters, because I saw her name printed in a lot of places without the final character.
The result is kinda wonky (lots of people write it wrong, usually "Elder", which of course resulted in my video game savvy friends nicknaming me "Mr. Scrolls")
In our app we neither validate nor escape user strings for any free form text (eg. "names" and descriptions)[1]. We only validate the max length.
If text is truly free form then you don't need to validate or white list anything. Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it. That combined with using prepared statements with bind variables (aka named parameters) and you don't have any issues with user inputs.
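A minimal sketch of that combination in Python, using the standard `sqlite3` and `html` modules. The table name and the two helper functions are made up for the example, not taken from our app:

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")


def store_name(name: str) -> None:
    # Bind variable: the value travels separately from the SQL text,
    # so quotes and semicolons in it are just data.
    conn.execute("INSERT INTO items (name) VALUES (?)", (name,))


def render_name(name: str) -> str:
    # Escape at display time, for the HTML context specifically.
    return "<td>" + html.escape(name) + "</td>"


inputs = ["O'Reilly", "<script>alert('Haxors!');</script>", "'; DROP TABLE items --"]
for raw in inputs:
    store_name(raw)

stored = [row[0] for row in conn.execute("SELECT name FROM items ORDER BY rowid")]
print(stored == inputs)
print(render_name(stored[1]))
```

The stored values come back byte for byte as entered, and the markup only becomes harmless text at the moment it is written into a page.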
One other benefit of this approach is that you end up with proper i18n support without doing anything special. From your app's perspective all text is the same. If users want to use unicode characters or put html tags in their descriptions then let them. If you escape it then there's no XSS issue. Plus it's WYSIWYG[2] from a user's perspective.
Who am I to judge that a user putting "<script>alert('Haxors!');</script>" as the name of an object is a bad idea?
[1]: "Names" don't include usernames, which generally should have a whitelisted character set (ex: ASCII [a-z][a-z0-9+]), or email addresses (use a real validator ... not a regex!).
"Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it."
I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average). The approach you describe is the generally correct approach; you need to ensure that the proper levels of escaping are being applied. Unfortunately this is nontrivial in practice, but it's still the correct solution.
The "sanitization" meme has resulted in me smacking down at least 3 commits from developers in my organization trying to "solve" XSS by scrubbing out all less than characters across all input from the user, or eliminating all quotes, apostrophes, less than, greater than, backticks (for shell interpolation problems), etc etc. Unfortunately, the problem is, these are in general all perfectly valid input values, and some of them really smack you in the face immediately. (For instance, names may contain apostrophes. You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.
(There are still some sanitization components in the resulting solution, I just don't think they are the way you should think about it. For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)
It's a shame there's such a proximity in terminology between 'sanitize' and 'sanity check'. I wonder if that's where this whole confusion began in the first place. Yes, it is extremely unlikely that a user's given name contains a <script> tag, but there are few reasons why your software should really care about it on a technical level - least of all if the way you choose to care about it leads to it also complaining when someone claims their name is O'Reilly. The correct response to someone claiming their name is "'; DROP TABLE Users --" should, ideally, be to say "Are you really sure about that?" but defer to the human decision on whether it's really the right thing to do.
> I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average).
I've had this view for a long while. I think there's a common sense to it that either clicks or it doesn't. Plus people hear/read "escape your inputs!" so often it becomes a cargo cult.
> You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.
Exactly. Whitelisting the values that can be stored in field should be done to maintain the data integrity of the field. It's not an approach to solve security problems or prevent SQL injection.
> For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)
We ran into something like this in our app as well. When displaying meta data for an object we create related objects in the dom and reference them by id. Originally the ids were generated by simply escaping the name of the raw object but that doesn't work because as you mention there are additional restrictions on what can be used in an "id" field. The solution? Hash it! Obviously that's a very specific solution as we only cared about it being unique and tied to the other object on the same page but it worked.
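Roughly what that looks like, as a Python sketch rather than our actual code. The `dom_id` name, the `obj-` prefix and the choice of SHA-256 truncated to 16 hex digits are all assumptions for the example:

```python
import hashlib


def dom_id(name: str, prefix: str = "obj-") -> str:
    """Derive a stable id made only of [0-9a-f] from an arbitrary name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # 16 hex digits is plenty for uniqueness within a single page.
    return prefix + digest[:16]


print(dom_id("<weird name='x'> with spaces"))
```

The id says nothing readable about the name, which is fine when all you need is a unique, repeatable handle that is safe to drop into an attribute.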
If you're going to accept all characters by default, be prepared to sanitize the outputs for every use, not just your website.
Maybe you will output a data dump for someone else to print mailouts. Or you'll share the user database with a vendor's web forum. Or payment processing. Or any SaaS.
No. That completely doesn't work. This is really important: You CAN'T "sanitize" for every possible use. You can not correctly figure out in advance how to represent an input, because the different possibilities are numerous and actively self-contradictory.
To "sanitize" for "every possible use" is pretty much to remove everything that isn't an ASCII letter. Even unexpected spaces can cause crazy behavior. Commas can cause CSV-injections. And you might still have length problems even so. Oh, and you still can't guarantee something won't screw up even so! https://news.ycombinator.com/item?id=6140631
You can not, at the time input comes in to a system, even pretend to know where all the data might end up, someday, given the whims of who knows whom, and who knows when. The only thing that works is for each system to correctly encode its output as needed, and if you output the correct thing and a subsequent system blows it up, it's the subsequent system's fault. You can't prevent it. You only think you can, but you're wrong.
To be clear, if you could defend against those systems messing up, I'd be willing to consider it. But you can't. It's impossible, both in theory and in practice.
There's no easy answer to writing secure code. (Though it would help a lot if people used type systems to better effect in this problem.) Filtering out certain "dirty" characters isn't an easy answer either, on the grounds that it isn't even an answer. (It turns out to often become not easy, too, because as you gradually and inevitably learn exactly how it isn't working for you, the subsequent frantically flailing addition of heuristics becomes very not easy itself. It is easier in the long run to do it correctly.)
Perhaps I was unclear, but I did not claim that there could be one single sanitized version of the data, safe for all use cases. I was saying that you have to do different sanitization for every output.
That's not called 'sanitizing', it's called 'escaping' and 'encoding'.
The byte sequence I need to store to communicate the name "Kei$ha O'Shaughnessey, Jr." in a UTF-8 JSON string literal, a UTF-8 HTML attribute, a UTF-16 bigendian CSV file, or an ISO-8859 SQL parameter, are going to be different - but so long as all the characters I need to pass are representable in all of those domains all I have to do is perform the correct escaping and encoding. At no point do I need to 'sanitize' the name. It's a name, it's not dirty.
If there are characters there that I can't represent in the target domain, then I need to handle the loss of information.
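To make that concrete, a Python sketch with the same name pushed through three of those output contexts using only standard library encoders. The second CSV column and the `title` attribute are placeholders I added:

```python
import csv
import html
import io
import json

name = "Kei$ha O'Shaughnessey, Jr."

# JSON string literal
as_json = json.dumps(name)

# HTML attribute value (quote=True also escapes quote characters)
as_html_attr = 'title="' + html.escape(name, quote=True) + '"'

# CSV field: the embedded comma forces the writer to quote the field
buf = io.StringIO()
csv.writer(buf).writerow([name, "42"])
as_csv = buf.getvalue()

# Every representation decodes back to the identical name.
assert json.loads(as_json) == name
assert html.unescape(html.escape(name, quote=True)) == name
assert next(csv.reader(io.StringIO(as_csv))) == [name, "42"]
```

Three different representations, zero characters removed. Each one round-trips to the original string, which is the difference between encoding and "sanitizing".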
A strategy of 'escaping' assumes that the partner system does the right thing with its data. This is not always the case.
For instance, it may be perfectly fine in my system to have a user named '<script>alert("ha!")</script>'. Are you sure that's okay in your PHP-based web forum? Really sure? Every place they've ever shown a username to the user, it's well-escaped?
And even if that's true today, what about the day when someone decides to change the web forum software to something else? What about the day when someone turns on a feature that copies certain forum threads to an internal support system, also provided by a third party?
Somewhat related example anecdote: For several years, Vimeo was sending me newsletter emails addressed to "Dear Jarek_Piórkowski" (previously "Hi Jarek Pi??rkowski"). The ó that should be there shows up fine on the Vimeo website and I even cleared and re-input the name into my profile to give them a chance to re-encode it. Still continued.
I unsubscribed from the newsletter eventually.
And ó isn't even a difficult character, it's in ISO 8859-1 for crying out loud.
I expect Vimeo used a Linux system to collect your data, and I bet the thing that blasts emails out is ultimately Linux as well. So the Windows-1252 bungle probably happened in a third system in between, maybe a Windows system chosen for its ease of administration by the community managers.
Not that this is relevant to data sanitization (they're just being fuckups here) but it shows how complex this can get.
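For what it's worth, here is one plausible way to get exactly two question marks out of a single ó. This is a guess at the mechanism in Python, not a claim about what Vimeo's pipeline actually did:

```python
name = "Piórkowski"
utf8_bytes = name.encode("utf-8")  # ó becomes two bytes: 0xC3 0xB3

# A stage that wrongly assumes Windows-1252 sees two characters, not one.
misread = utf8_bytes.decode("cp1252")

# A later ASCII-only stage replaces each of them with a question mark.
flattened = misread.encode("ascii", errors="replace").decode("ascii")

print(misread)
print(flattened)
```

Re-entering the name in the profile can't fix this, because the stored value was never wrong. The damage happens downstream, every time the mail is generated.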
Just to be a bit pedantic, unfortunately you don't get "proper i18n support" just by putting everything in UTF-8.
Unicode lets you represent lots of abstract characters, from different languages and societies, in one character set. That doesn't quite tell you how to render the characters. For that, you need to know what language the text is in. Unicode wants you to provide that information out-of-band, e.g. in an HTML "lang" attribute, which the renderer can use to paint the proper glyphs.
For example, the Arabic digits 4 through 7 (۴ U+06F4 .. ۷ U+06F7) have different glyphs in Persian, Sindhi, and Urdu. And a character like 直 (U+76F4) has Chinese and Japanese glyphs that may not be mutually recognizable.
Bottom line: if you want an internationalized system that can store and render multilingual text, storing the text in Unicode is a good start, but you will need to store additional info (like the language) to be able to properly render the text.
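A minimal sketch of what "store the language too" could look like in Python. The `TaggedText` class and `render_span` helper are hypothetical names; the only real-world piece is the HTML `lang` attribute with a BCP 47 tag:

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    text: str
    lang: str  # BCP 47 language tag, e.g. "ja", "zh-Hans", "fa", "ur"


def render_span(item: TaggedText) -> str:
    # The lang attribute is the out-of-band hint a renderer can use
    # to pick the right glyphs for the same code point.
    return '<span lang="{}">{}</span>'.format(
        html.escape(item.lang, quote=True), html.escape(item.text)
    )


print(render_span(TaggedText("直", "ja")))
print(render_span(TaggedText("直", "zh-Hans")))
```

The stored code point is identical in both cases. Only the tag differs, and whether the glyph changes depends on the fonts the renderer has available.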
I found http://en.wikipedia.org/wiki/Eastern_Arabic_numerals which shows examples of the differences in those numerals, but it looks like the different representations have different Unicode codepoints. So, there's no need for the lang attribute. (The page uses them, but if you take them off there's no difference in the display.)
You probably need to know the language to do things like sorting, comparison, regex, etc. But if you're just storing and displaying user-entered strings and your software has no need to understand the meaning of the strings, I think it's enough to do what the parent says.
Not quite. The Wikipedia article shows the difference between U+0660 .. U+0669 (Arabic-Indic digits) on the top row and U+06F0 .. U+06F9 (Eastern Arabic-Indic digits) on the bottom row.
But what I'm talking about are the different glyphs used to represent the bottom row (U+06F0 .. U+06F9) depending on whether the text is in Persian, Sindhi, or Urdu. See
http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf, table 8-2.
There is also the issue I mentioned about Chinese vs. Japanese glyphs for the same coded character, which is at least as important in practice.
This list is useless, because trying to follow it is impossibly ambitious. Which of these do I need to support for my system to work for X% of users with X+Y % being able to work around the limitations?
Logical fallacy (bifurcation): either you correctly implement all of the requirements, or it makes no sense trying at all. Note that the article even explicitly says "try to make _fewer_ of these assumptions," not "you MUST explicitly support all this."
Similar example: Do you lock your door, or does that make no sense to you? (Because if there's no absolute, perfect, 100% protection, there's apparently no difference at all between locked, closed and wide open; right?)
Logical fallacy (non sequitur), as the comment you're responding to said nothing of the kind, and argued specifically for a middle point in the second of only two sentences.
Well, you can still get bitten by "11. People’s names are all mapped in Unicode code points," as well as the sets 1-8 and 32-36 (people have exactly X names at a given point in time, where X>0); that's not to mention ordering and collation (12,13,18,30). But it's definitely the easiest option, and avoids many common pitfalls (if I had a nickel for every database using latin1 + latin1_swedish_ci because that's the first charset + collation in the list, I'd have a lot of nickels).
I can see 11, but as long as you're not using the name as a unique key but just as a label then the mutability, non singularity, and non-orderedness aren't such problems.
Reminds me of a story a police reservist told me. Guy got a license plate called "none," and instantly had thousands of outstanding warrants. (The cop thought "none" was trying a fast one, and so deserved it.)
In 1986, Robert Barbour applied for a personalized plate. The DMV gives you three choices: he picked SAILING, BOATING. If he couldn't get those, he didn't want a personalized plate, so he put NO PLATE. Of course, he ended up with NO PLATE. Since that's what the police put as the license if a car doesn't have a plate, he ended up with 2500 tickets. Eventually the DMV told the police to write "NONE" instead of "NO PLATE" and said "We're just hoping that no one will come up with plates that say NONE." (Which would be ironic if the parent comment is accurate.)
Reminds me of "the Phantom of Heilbronn", an alleged female serial killer whom the German police used DNA to link to more than 40 crimes. It baffled the investigators for more than two years until they found out that the DNA came from contamination of the cotton swabs used for collecting DNA samples.
If I could change one thing about human history, I'd remake the 0 and O's (# and letter) and the I, l, and 1 (letter i, letter L, and #) so you could never confuse them in hard to read CAPTCHAs.
We had a customer with the last name "Echo" who couldn't make a credit card payment. Turns out that the card processor was looking for strings which were common Unix commands and not allowing them.
Security procedures for vendors hosting websites for Members of Congress apparently require them to look for sql injection attacks and redirect to 404 if they think one was found. The result appears to be that many just keep a list of keywords and characters and fail if found. Is your first name "Walter"? Oh, you tried to run the "alter" command in your message to your Congressman... we will take you to a 404 page. Oh you used semi-colons and single quotes in your message? ...hacker alert! off to blank page with you. Completely inconsistent between vendors/forms of course.
The system that prints our shipment labels stripped "var" from customer and street names. Sorry Halvar, you're now known as Hal. Customer names weren't so bad, the mail service got the right people anyway, but reducing street names like "Vardegade" to "degade" is a bit more troublesome.
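I can only guess at the actual implementations, but a naive keyword blacklist along these lines reproduces every one of these stories. The word list and function names here are invented for the sketch:

```python
import re

BLOCKED = ("alter", "drop", "echo", "var")
_pattern = re.compile("|".join(re.escape(word) for word in BLOCKED), re.IGNORECASE)


def looks_like_attack(text: str) -> bool:
    # Substring match: flags any input that merely contains a keyword.
    return _pattern.search(text) is not None


def strip_blocked(text: str) -> str:
    # "Sanitizing" by deletion: silently mangles legitimate data.
    return _pattern.sub("", text)


for value in ("Walter", "Echo", "Halvar", "Vardegade"):
    print(value, looks_like_attack(value), repr(strip_blocked(value)))
```

None of these inputs is an attack, and none of them would have been a problem if the value had simply been bound as a parameter or escaped for its output context.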
They never told me what the bug was, only that they fixed it. I have some idea though.
God, what a terrible idea. That would exclude Man Ray, Sir Thomas More, Murray Head, Tex Avery, Kimiko Date, Rollie Fingers, James Last, and even Steve Jobs!
This problem isn't about funny employee names. It's about thick, untransparent software stacks that make simple problems difficult.
SOAP is maybe the most popular example and this is really why it's lost popularity against REST. However, similar "It works with values W, X, Y, but not Z" situations are found in any stack or standard that has too much magic going on. Rails certainly comes to mind.
This is the biggest argument in favour of using many small, isolated components rather than one big all-encompassing framework, in my opinion. If every piece of third party functionality you import into your project can be easily understood, problems like these shrink in size, because there's a limit to how deep the magic can go.
I'm very fond of the Node.js ecosystem for particularly this aspect (even though I dislike the language). There's a big bucketload of tiny components there, rather than 90% of the community relying on a single humongous framework, like is common for e.g. Ruby or C#.
>It's about thick, untransparent software stacks that make simple problems difficult.
In this case, the flex framework is handling the marshaling/unmarshaling process incorrectly, but these are not simple problems. Handling things like names, or dates is deceptively difficult.
>There's a big bucketload of tiny components there...
Are you sure that's a good thing? First, anything works when you have a small code-base, and a significant portion of JavaScript ecosystem involves building fairly small self-contained applications. I have a strong feeling micro-frameworks fail hard when the line count gets over a certain threshold.
For all that is wrong with flash as a platform, I have always found Actionscript 3 to be a far better structured and easy to work with language, compared to something like javascript, which has a very 'organic' structure.
I think in this case, and the comments seem to confirm, that it's an issue with the SOAP/XML encoder's handling of null values. That aside, AS3 doesn't bug me too much, my biggest issue with Flash/Flex was how it handled programmatic audio elements so differently from video/clip elements, that is/was annoying.
Of course since Adobe has all but abandoned the platform, it probably won't be much of an issue in the future... Kind of a shame, as Flex was actually pretty nice. If adobe was more open with the flash client as a platform, and focused on the tooling (where they make their money), it could have been integrated into browsers, and had a much better chance of sticking around.
There was a point where Adobe had a chance to evolve flash into a platform for creating HTML sites and applications (much like coffeescript, LESS, or any other intermediate might be). It certainly would have involved open sourcing large components of the platform, which they seemed to be against.
It's a real shame, since the closed source aspect of Flash really led to a brain drain within the community. Flash for many years was the superior platform for making complex web applications, but sadly, it did not have the community and brain trust that more open standards like HTML and javascript did. In the long run, this led to serious stagnation with the platform.
Also AS3 has strong static typing, a huge advantage, as we've just heard once again yesterday:
[...] the value of types is super super important. [...] Everything that is syntactically legal, that the compiler will accept, will eventually wind up in your code base and that's why I think that static typing is so valuable because it cuts down on what can kinda make it pass down there. I'm only getting stronger in my stands on static typing, static analysis.[...] - John Carmack, qcon keynote 2013
I actually feel like AS3 could go further with static typing. Due to things like promotions with arithmetic operators, you lose some performance. I always felt it would be better to either force the user to explicitly cast between floating point values and ints, or crash if they tried to do something with an int that would lead to a float.
What's the issue? CF makes writing web services stupid easy. Just add access="remote" to a CFC's function. Instantly introspected and performs all SOAP conversion for you.
Now, I'll admit, SOAP anything is crazy, but based on the age of the application, may have been the best option at the time.
Python makes a distinction between types, but it has a concept of falsyness (as do many languages, though it is often more restricted e.g. in Ruby only false and nil are falsy IIRC).
Falsyness is used in boolean-ish contexts (if, while and explicit boolean conversion), by default all of None (null), False, 0, 0.0, "" (the empty string, unicode or binary) and empty collections ([], (), {}, set()) are falsy and although UDTs are "truthy" it can be overridden.
This has nothing to do with making the distinction between ints and strings. You could actually implement the same thing in e.g. Haskell (by creating a "Booleanish" typeclass and implementing it on all the types you care for).
edit: please note that — in Python — 0 (the integer) is falsy but "0" (the string containing a single character 0x30) is truthy. I expect gtaylor talks about bugs in handling of IDs or sequence numbers and the like, not usernames. I know I've hit them when not being very attentive.
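The kind of bug I mean, as a tiny Python sketch (function names are just for illustration):

```python
def has_value_buggy(value) -> bool:
    # Truthiness test: treats 0, 0.0, "" and empty collections like None.
    return bool(value)


def has_value(value) -> bool:
    # Identity test: only None counts as "missing".
    return value is not None


for candidate in (None, 0, "0", "", []):
    print(repr(candidate), has_value_buggy(candidate), has_value(candidate))
```

A record with ID 0 passes `has_value` but fails `has_value_buggy`, which is exactly how the first row of a table ends up being treated as absent.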
Vaguely related but we had a customer using some feature that split text blobs using a record separator. One time someone wanted to not split at all and set separator="none" without bothering to look up how to actually turn off record splitting. It worked well enough in quick tests, but by the time we'd gotten the support call, they'd corrupted a massive database where every product with a description containing the string "none" was now corrupted.
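In Python terms the mistake looks something like this. The `split_records` function is a stand-in I made up; the real product's API was different:

```python
def split_records(text, separator=None):
    """Split text into records; separator=None means 'do not split'."""
    if separator is None:
        return [text]
    return text.split(separator)


description = "Cable, 2m, none of the connectors are gold plated"

intended = split_records(description)            # one record, untouched
actual = split_records(description, "none")      # the string "none" is a real separator

print(intended)
print(actual)
```

The quick tests passed because none of the test descriptions happened to contain the word. An in-band magic string is indistinguishable from data the moment the data contains it.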
SOAP: the gift that keeps on giving. I built two SOAP APIs for Google (Search and AdWords) and spent way, way too much time on this kind of data interop nonsense. I wrote up a quick summary of why SOAP sucks, it's still my most popular blog post. http://www.somebits.com/weblog/tech/bad/whySoapSucks.html
XML is a terrible encoding for data. Fortunately JSON does pretty well and has mostly replaced it for new stuff.
Funny, I read this as "I blame C [the language]". I was taken aback because obviously "null" in C is a 5 byte character array and would never be confused for NULL.
All I meant was that all three of those technologies are famous for their unreliability. OK perhaps the flash runtime not so much, it's disliked for different reasons. It's definitely true of the other two. I've had to talk soap to a coldfusion service before and it's not an environment conducive to safe programming.
All it takes is one piece of code in the chain deciding to "helpfully" "fix" cases where people have stringified null values in a system that stringifies null to "null" by "translating them back" to null, and voila, you have a broken system.
A lot of problems like these are down to programmers that try to fix bugs by fixing the symptoms.
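A toy version of that chain in Python, with JSON for contrast. The `to_wire` and `from_wire_helpful` functions are invented to mimic the behaviour described; they are not from Flex, ColdFusion or any real SOAP library:

```python
import json


def to_wire(value):
    # Lossy encoder: None and every string share one flat text form.
    return "null" if value is None else str(value)


def from_wire_helpful(text):
    # The "helpful" fix: anything that looks like null becomes None.
    return None if text.lower() == "null" else text


surname = "Null"
broken = from_wire_helpful(to_wire(surname))

# JSON keeps the two cases apart: "Null" (a string) versus null.
intact = json.loads(json.dumps(surname))
missing = json.loads(json.dumps(None))

print(repr(broken), repr(intact), repr(missing))
```

The decoder is patching a symptom of the encoder. Once the wire format cannot tell a missing value from a string, no amount of guessing downstream gets Mr Null back.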
I recently moved to Japan. My name, as it appears on my passport, is 26 characters long and consists of a-z letters and spaces. This means I run into three problems:
First, many computer systems here can't fit names that long, leading to truncation, or in some cases, denial of account creation (some places have rules that whatever's on the account must match exactly what's on the passport; one was nice enough to add a new column to their database for the rest of my name).
Second, many systems don't support a-z, so we throw the name in as katakana or double-byte romaji, depending on what works. Sometimes neither will work, and we just have to give up and go somewhere else.
Finally, many computer systems can only link accounts if the names match exactly, for example inter-bank transfers and bill payments. Since my name is truncated differently in different places, and is formatted differently in different systems, in many cases it's just impossible.
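The full-width versus half-width part of this, at least, is mechanically solvable. A Python sketch using Unicode NFKC normalization, with a placeholder name and a made-up helper; I'm not suggesting any bank here actually does this:

```python
import unicodedata


def normalize_for_matching(name: str) -> str:
    # NFKC folds full-width Latin letters and the ideographic space
    # to their ordinary ASCII counterparts.
    folded = unicodedata.normalize("NFKC", name)
    return " ".join(folded.split()).upper()


ascii_form = "John Smith"
# The same name in full-width ("double-byte") romaji with U+3000 as the space.
fullwidth_form = "\uff2a\uff4f\uff48\uff4e\u3000\uff33\uff4d\uff49\uff54\uff48"

print(ascii_form == fullwidth_form)
print(normalize_for_matching(ascii_form) == normalize_for_matching(fullwidth_form))
```

It does nothing for truncation or for names entered as katakana, so it would only remove one of the three ways the same name ends up not matching itself.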
For example, right now I'm charged a fee for transferring money between two bank accounts, since the free transfers only apply if the names match exactly. I just withdraw cash from one and deposit it into the other as a workaround.
Another example is my cell phone bill. To pay it by CC my name must match exactly, but all my cards have slightly different spellings.
Don't even get me started on my wife, who is ethnically Japanese but came with me as a foreigner and hence has a name written using a-z letters. It has literally set off fraud detection systems.
My last name has a space in it. The three years that I have filed paper taxes, IRS has split up my last name into two words and put one of them as my middle name. This despite the fact that I clearly label the middle name box as "n/a". I was told by the IRS representative that I should change my name.
I remember having "funny" situations with AS3 SOAP marshaling and unmarshaling. For example, in some cases it would just fail silently, providing half populated objects without any errors. In the end, the workaround was to just invoke SOAP methods directly using HTTP and manually parse the XML response. Good old days.
That's a nonsensical suggestion, the guy's name is the string "Null" which is munged by some intermediate system, if you're looking for Mr Null you don't want to get people with empty names back.