International domain names: where does https://meßagefactory.ca lead you? (lemire.me)
127 points by Amorymeltzer on Jan 24, 2023 | 79 comments



This is a difference between what's called "transitional" and "nontransitional" IDNA processing.

The WHATWG URL standard mandates nontransitional processing, which says meßagefactory.ca becomes xn--meagefactory-m9a.ca. Firefox and Safari follow the standard.

Chrome and Edge are still using transitional processing. Here's the Chrome bug to switch to nontransitional processing (which appears to be very close to shipping): https://bugs.chromium.org/p/chromium/issues/detail?id=694157

In 2021, Go's HTTP client switched from transitional to nontransitional processing and the relevant issue is quite informative: https://github.com/golang/go/issues/46001
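To see the two behaviours side by side, here is a minimal Python sketch. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules (these agree with transitional processing for ß). It approximates nontransitional processing by lowercasing, NFC-normalizing and Punycode-encoding each label. The helper names are made up for illustration, and the second function is a simplification, not a full UTS #46 implementation.

```python
import unicodedata


def to_ascii_idna2003(host: str) -> str:
    """IDNA 2003 ToASCII via Python's built-in codec (nameprep maps ß to ss)."""
    return host.encode("idna").decode("ascii")


def to_ascii_nontransitional_sketch(host: str) -> str:
    """Rough approximation of nontransitional processing: ß is kept as-is.

    Simplifying assumption: only lowercasing and NFC normalization happen
    before Punycode; the real UTS #46 mapping and validity checks are skipped.
    """
    labels = []
    for label in host.split("."):
        label = unicodedata.normalize("NFC", label.lower())
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(to_ascii_idna2003("meßagefactory.ca"))
print(to_ascii_nontransitional_sketch("meßagefactory.ca"))
```

The first call yields the plain `messagefactory.ca` name and the second the `xn--` form quoted above, so the two processing modes really do point at different hosts.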


The transition in question is from IDNA 2003 processing to IDNA 2008 processing. The internet was and is rather late in adopting the latter. Just ten years ago it wasn’t a sure thing that it would be adopted in the WHATWG spec and in browsers:

https://annevankesteren.nl/2012/11/idna-hell


The article seems to skip over the fact that under the Canadian top-level domain only the characters `é, ë, ê, è, â, à, æ, ô, œ, ù, û, ü, ç, î, ï, and ÿ` are allowed[0][1] (not every Unicode code point, and not every valid Punycode label). Also, the Canadian Internet Registration Authority (CIRA) supports "administrative bundling" (all variants, including plain ASCII, are reserved for the same registrant), so, for example, cira.ca has 18 possible domain variants in total[2]. meßagefactory.ca is presently invalid.

[0]: https://www.cira.ca/blog/byron-holland/internationalized-dom...

[1]: https://en.wikipedia.org/wiki/Country_code_top-level_domain

[2]: https://www.cira.ca/ca-domains/register-your-ca-domain/domai...
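As a rough illustration of the character rule and of administrative bundling, here is a small Python sketch. The allowed accented characters are the ones listed above. The variant groups (which accented letters count as variants of which ASCII letter) are purely my assumption, chosen because they happen to reproduce the 18 variants quoted for cira.ca; CIRA's real tables may differ, and the function names are hypothetical.

```python
from itertools import product

# Accented characters listed in the comment above as allowed under .ca.
CA_ACCENTED = set("éëêèâàæôœùûüçîïÿ")
ASCII_LABEL_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789-")


def is_allowed_ca_label(label: str) -> bool:
    """True if every character is ASCII letter/digit/hyphen or a listed accent."""
    return all(
        ch in ASCII_LABEL_CHARS or ch in CA_ACCENTED for ch in label.lower()
    )


# ASSUMPTION: hypothetical variant groups, not taken from CIRA documentation.
VARIANT_GROUPS = {
    "a": "aâà", "c": "cç", "e": "eéëêè", "i": "iîï",
    "o": "oô", "u": "uùûü", "y": "yÿ",
}


def bundle_variants(ascii_label: str) -> list[str]:
    """Enumerate every accented spelling of an ASCII label under the groups above."""
    choices = [VARIANT_GROUPS.get(ch, ch) for ch in ascii_label]
    return ["".join(combo) for combo in product(*choices)]


print(is_allowed_ca_label("meßagefactory"))  # ß is not in the allowed set
print(len(bundle_variants("cira")))          # 2 * 3 * 1 * 3 combinations
```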


This is great! It's usually more expensive to have a non-ASCII letter in your surname, as you have to register two domains.

And if we ever ditch punycode for real UTF-8, one will have to buy three for backwards compatibility with punycode.

My national registry does not do administrative bundling - I have a domain with a (correctly) placed š, but some other guy has the š in his surname substituted with an s in the domain he owns (:

Does CIRA actually register alternative domain representations or just reserve them and wait for you to pay?


There are, sadly, no cost savings: you have to register/pay for each variant (they're not activated by default[2]). But it does prevent cases like yours - two domains with different owners/intents varying by one or more accents on letters. If UTF-8 is adopted (seems like a huge if... maybe after we adopt IPv6?) hopefully they'd make a similar decision to bundle Unicode variants.


I'm sorry but I have accents in my name and your choice is not smart. People will probably have a hard time getting to your site and might even think it's a mistake. The guy that used the normal s did the right thing; you have the equivalent of a novelty domain with an emoji in it.


What the hell, how can you be so colonial you think it is normal to deny someone the right to their name just because it doesn't come from that one language you ever bothered to learn?


Like I said, my own language has accents and my own name has accents. I value practicality and being available, if I'm taking the time to create a website and host it. I would think the OP would be familiar with the fact that every single website in their country, even when the brands have accents in their names, do not use them in their URLs. Their choice is a novelty one. I'm not sure why you're getting offended over something that is purely practical and that affects me personally and I still do the practical thing.


Not dumbing down your own culture to submit to the outdated limitation of the American Standard Code for Information Interchange is not a novelty. Forgoing your language is not practicality.

It's normal to expect technology to evolve to be usable by all rather than expect people to change to conform to technology. People should fight for that more often.


Even us, users of IDN domains, still submit to the outdated and always present ASCII. Punycode is, after all, still ASCII and "real" UTF-8 characters are rarely and exceptionally seen in DNS.

It could be argued that the introduction of punycode hindered support for real accented and non-ASCII characters.

https://pi.cr.yp.to/ experimented with UTF-8 in domain names before punycode, and this will rarely ever work in the future, purely because now we have a half-solved problem with punycode and no one will bother to implement UTF-8 domains - it would be ambiguous.


I'd rather get my website seen if I'm putting work into it than have a novelty URL that far fewer people will navigate to, that will be harder to communicate verbally and to fill into online forms that assume ASCII, just for some ideological fight against internet standards.


Why respect your origin when you can just submit to American imperialism after all?


I don't think the person - who also has non-ascii characters in their name - is saying we should deny people the right to their name. Just that only certain characters should be in URLs.

I'm not picking a side; you just seem to have supremely misinterpreted their position.


Lately I've been noticing more commenters who immediately launch into accusations that ignore and contradict what was written.


I would argue that this is a pragmatic argument, not a colonial one. If you see 你好.com written on a billboard, are you likely to be able to visit that site by typing it in on your computer? Unless you have a Chinese keyboard installed (and also know what those characters mean), you simply cannot visit that site, whether or not you are a xenophobe. Nihao.com, on the other hand, is more or less universally accessible.


On the other hand, because of domain restrictions and other such nonsense Chinese websites will often use numbers that sound like words when spoken out loud rather than actual words for their domains.

Imagine being forced to only register 1337-code domains because some other country doesn't have Latin key caps on their keyboards.

Not understanding the local writing system or language does suck for foreigners, but most people in the world aren't foreigners in the country they live in.

Why should countries almost exclusively using the Arabic script be forced to switch between Latin and their normal way of writing because they want to edit the URL between writing comments? Why shouldn't people native in the 120+ languages using Devanagari be able to register domains?

Pragmatically speaking, the lack of proper language support in many non-Latin websites has led to a significant number of them using images for labels rather than text, making Google/Bing/Baidu Translate worthless for navigating them. If you know what 你好.com is supposed to provide or sell to you, you can find it in a search engine and manually picture-check the domain if you want to be sure; the burden of being available to people outside your target demographic shouldn't fall on you just because a tiny sliver of your user base doesn't understand your writing system, but you can opt into making it easier for them to find your website regardless.


In fact, many Chinese sites use numbers instead. Easier to type, easier to memorize.

https://mediaoptions.com/blog/understanding-numeric-domain-v...


Some years ago I had the weird idea of localized domain names. Some mechanism like rel=canonical, that would indicate first that two domain names in different scripts are the same and second indicates the preferred displayed domain name according to the locale of the user. So sesamstraße.example could be displayed as is for de-DE and de-AT, but as sesamstrasse.example for de-CH, assuming both domain names are registered and indicate sameness. With that at least for the main locale of a domain the users do have the minimal courtesy of having the domain displayed right and for other locales there exists an ASCII compatibility display.

But of course giving websites control over the display of their domain name is a major security headache. No idea what kind of stuff would be possible and how it would interact with TLS.


This is probably a reasonable way to handle sesamstraße.example, but how do we manage if skroutz.gr wishes to become σκρουτζ.com? Do we grant them scrooge.com in a non-Greek locale, as Skroutz is the Greek transliteration of Scrooge, or do they get stuck with the letter-for-letter skroutz.com? Will I go to a different site depending on my locale if I type scrooge.com?

What happens when someone registers θεν.com and someone else gets δεν.com, since both transliterate into ASCII as then.com?

In short, the idea has many potential pitfalls.


That is why I thought that, for this to work, the entity must already have registered both (or more) domain names, and both domains must, possibly in DNS records, indicate their respective equivalence. So no automatic translation, but a definitive opt-in by the domain owners.


I had a similar, though unrelated, idea for web browsers. Currently domains are mainly lowercase on the web, so we could use an HTTP header to specify the correct capitalization.

capitalization: HarryPotter.Hogwarts.UK

Of course we'd have a lot of lookalike domain problems, like we have with IDN now (:


On the other hand: If you're Chinese and don't know what nihao means, would you not rather be able to simply go to 你好.com? (Chinese might not be the best example here because pinyin is very widespread, but that's not the case for every language)

If you can't type 你好.com you're simply not the website's intended audience.


Some around here have argued that allowing filenames with spaces was a bad idea. I disagree, but the rationale is not "colonial", it's a matter of practical limitations and whether we should modify our behavior to make the system simpler or try to conform the system to our ludicrously messy standards and, as a consequence, make them more complex and therefore fragile.

The standards do have limitations anyway. For instance, you cannot have an underscore in a hostname on the internet. You can have one in a CNAME, technically, but most CAs will not sign a certificate for any name with an underscore.


https://_.4a.si. works for me with a Let's Encrypt wildcard cert in Firefox (:


Bad case of "computer says no"-ism


Yes, I agree. I hoped for the domain with s to be forgotten by the owner. It's impossible to use this domain with š if I don't own the other one, that's why I don't use the domain under the national TLD and instead use .eu for personal email and website.

It's weird that Google prefers the IDN variant under national TLD instead of the non-IDN variant with an IDN counterpart under .eu.

Another interesting phenomenon is that chromium detects suspicious activity and informs the user if he meant to go to š.tld when visiting s.tld. Or vice-versa, depending on which site he opened first I think. It's nice how the browser detects different ownership - this does not happen for other TLDs where I own both domains.


Slightly off-topic: Don't know if that's just overly nerdy, but I giggle when I think about the folks over at messagefactory.ca, who are now seeing a spike in page visits, wondering which campaign helped them get more visibility.


The article doesn’t investigate which behaviour is correct, or whether both are justifiable.

A very quick, probably flawed investigation suggests the Chromium-family behaviour (meßagefactory → messagefactory) might be correct, and Firefox and Safari (meßagefactory → xn--meagefactory-m9a) incorrect: RFC 3492 (Punycode) defers to RFC 3491 (nameprep), which is a profile of RFC 3454 (stringprep), explicitly using its mapping tables B.1 and B.2, and table B.2 says to map ß to ss (“00DF; 0073 0073; Case map”).

I welcome correction or more detail from anyone more knowledgeable on the matter. It’s fiddly stuff and I haven’t dealt with this stuff very much, so I could very easily have missed a trick.
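The table B.2 lookup can be checked from Python, whose standard library ships the RFC 3454 tables in the `stringprep` module and an RFC 3491 `nameprep` function in `encodings.idna`. This only confirms what IDNA 2003 does; it says nothing about which behaviour the current standards require.

```python
import stringprep
from encodings.idna import nameprep

# Table B.2 (case folding used with NFKC) maps U+00DF to the two letters "ss".
print(stringprep.map_table_b2("\u00df"))

# nameprep applies the B.1/B.2 mappings (plus NFKC and prohibited-character
# checks) to a whole label.
print(nameprep("meßagefactory"))
```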


According to the German (.de) NIC, Chromium's behaviour appears to be incorrect,[0] though it appears to be the old IDN standard. While for the most part, one can reasonably normalise ß into ss, it can be used to distinguish words, like Buße (penance) and Busse (buses).

[0] https://www.denic.de/en/know-how/idn-domains/


This has not stopped the Swiss from ditching the ß entirely.

Then again, the Swiss are also crazy people who say nonante for 90, instead of being French and saying quatre-vingt-dix (4x20+10).


That's a shame, since in this particular example (Buße / Busse) the spelling is a good pronunciation guide and that disappears when the difference is flattened out.


> Then again, the Swiss are also crazy people who say nonante for 90, instead of being French and saying quatre-vingt-dix (4x20+10).

That's not crazy, it's just common sense that for some reason the Académie Française, the authority on the French language, refuses to have.


> That's not crazy, it's just common sense that for some reason the Académie Française, the authority on the French language, refuses to have.

The Académie Française was in favor of septante, octante and nonante ("octante" is not used anywhere today, by the way, "huitante" is sometimes used in parts of Switzerland but the rest of the French-speaking world uses "quatre-vingts") from the moment it was created, and National Education also recommended these terms until 1945.

They just never really became widespread and eventually completely died out in France (Belgium also uses septante and nonante) but the Académie did try to promote them.

I'm not sure why one would think the contrary, as the Académie has always been quite progressive, probably too much for their suggestions to actually enter usage I guess.


Well, if most French people say quatre-vingt-dix-sept for 97, why should the Academy invent a new word of its own to replace it?

On the other hand, if people on the street do typically say nonante-sept for that, then I agree that the Academy should recognize the change. As far as I know though, this is not the case at all, and septante/huitante/nonante sound strange to actual French people.

Either way, the GP was definitely being sarcastic.


I took it as sarcasm. The way 90/90-something is said in French usually causes eye rolling.


Per the WHATWG URL spec, Chrome is wrong. Chrome is in the process of aligning with the spec: https://bugs.chromium.org/p/chromium/issues/detail?id=694157


Any idea if the fact that ß is not a legal character per the .ca domain rules impacts how it should be read?


It's irrelevant that .ca forbids ß. The URL spec is clear, and .de very much allows ß so it's not an academic concern: browsers need to be able to handle it correctly.


If the ß character is a forbidden character then the Chromium-family behaviour is correct. If it's a legal character then the Firefox and Safari behaviour is correct.

If it is a forbidden character then registrars will not allow you to register xn--meagefactory-m9a.ca, and thus the problem will never exist.


Domain names can get pretty weird. Case in point, I got excited when I realized I could purchase this one:

https://xn--lcal-5qa.host/

(Hacker News renders it as described in the article, but once you navigate to it your browser will show it)


The javascript[1] that's adding/removing the umlaut really messed with me for a bit. I was wondering what was broken with my browser/gpu/monitor/whatever when I noticed it in the footer, first.

[1] https://xn--lcal-5qa.host/js.js


what the heck are the images on there? AI generated?



That’s awesome! This is what I like to call internet art, and it reminds me of the vibe of the old internet, the way it used to be before social media like Facebook etc became a thing.

Kudos to the author!

I recently tried to do something similar on my own website. In terms of bringing back fun and exploration I mean.

My own website has very little content as of yet though.

The most relevant part of my website here is a directory listing of text files that I have. But it only has a couple of files in it yet. One of which is a json file actually because I allowed myself to compromise a little on what is acceptable as “text”.

I might compromise a bit more soon, in order to make the viewing experience for that section of my site more enjoyable on mobile devices. After all, the goal is not specifically for me to restrict myself technologically to the web as it was, but more so to be about like I said, a general vibe of fun and exploration.

I want my site to be something someone can run across randomly, be intrigued by, and find themselves exploring for some time. And preferably that I will make enough content that visitors will come back the next day or the next week to explore a bit more.


This is a really neat idea! It's a shame it didn't get more attention at the time, you should resubmit it.


Definitely AI generated. I describe those images as how I see people/things in my dreams. Recognizable, but not really. (ya creepy I know)


That's pretty funny. Sounds like an IKEA product name


Unrelated but localho.st is very useful.


For finding local sex workers in São Tomé?


Golfing for single-letter international domain names is a fun way to spend an hour, though not sure what I will do with 𐰡.cc


I wonder if there is an online tool that finds you a list of the shortest possible domain names.


https://micro.domains/ seems to be somewhat useful for this


Small XSS payloads in more unsuspecting places.


For a while we had a Hebrew domain name (rendered in punycode) that redirected to our main company website with an English alphabet domain name. (We have offices in Tel Aviv and Sunnyvale.)

I looked at the logs after and noticed that the only hits to the Hebrew one were from me. I don't bother with it anymore. The fact that browsers inconsistently resolve them makes it an idea not worth considering for most countries/languages.


I noticed that when the .fr TLD opened up to domains fewer than three letters long.

I already had ssz.fr which I occasionally accessed through "ßz.fr", and then ßz.fr became available for registration. I contacted AFNIC but they never really understood what I was talking about. Not that it's a real problem anyway, nobody uses ß in domain names and surely not for .fr domains.


In programming languages it's much worse. Identifiers can be unidentifiable, and everybody has a different opinion on what "identifiable" means. Even the standard on identifiers, UTS #39, is buggy and has too many interpretations, leading to a complete disaster. https://github.com/rurban/libu8ident/blob/master/doc/c11.md

In punycode domain names it's quite simple still.

With other names, it's even worse. No-one cares. Linkers do not, username and filesystem drivers do not. The Apple HFS+ did care a bit one day, until someone in the higher ranks decided that no-one needs unicode security anymore and switched the new APFS to unsafe again.


Unicode domains are interesting.

I registered nick<ninja unicode character>.eth as an ENS domain (which is allowed by the ENS spec[1]). It's pretty hit and miss what services will work with it and what won't. https://xn--nick-ow14c.eth.xyz/ does work though.

I actually had to edit this comment because HN strips the <ninja unicode character> out.

[1] https://docs.ens.domains/frequently-asked-questions#what-abo...



Very nice work!

Just noting that the ".link" "check" link on https://adraffy.github.io/ens-normalize.js/test/resolver.htm... isn't working. Unclear if that is a problem with my domain or with .link


Only today did I learn you could use 。 as a separator. https://xn--googlecom-0w64c (which is google 。 com without the spaces) actually leads somewhere.
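For what it's worth, IDNA 2003 (RFC 3490) lists four characters that must be treated as label separators, and Python's built-in `idna` codec follows that rule. A quick sketch (the `split_labels` helper is just for illustration):

```python
import re

# Separators listed in RFC 3490: full stop, ideographic full stop,
# fullwidth full stop, halfwidth ideographic full stop.
IDNA_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def split_labels(host: str) -> list[str]:
    """Split a host name on any of the four IDNA dot characters."""
    return IDNA_DOTS.split(host)


print(split_labels("google\u3002com"))
# The built-in codec performs the same split before encoding each label.
print("google\u3002com".encode("idna"))
```

So once the ideographic full stop is read as a separator, the name is simply google.com.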


I think allowing unicode in domain names is a mistake. It does make sense to allow more characters, because ascii is very limited, but there's a lot of characters in unicode, including some that look deceptively like different ascii characters, which could really mislead people. I wouldn't mind if browsers gave some warning for non-ascii domain names.


This is a known problem and registrars are handling it. For example, forbidding multiple mixed scripts in a domain. Of course, typosquatting and similar characters are an issue in English too. The registrar deals with it.

While I understand your position, it is very anglocentric. Other countries don't use the Roman alphabet at all and shouldn't be forced to use it, especially when a simple whitelist of Unicode characters would have solved the security issue.
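A very rough idea of what a mixed-script check looks like, in Python. The standard library has no Unicode script property, so this sketch uses the first word of each character's Unicode name as a stand-in for its script. That is an assumption made for illustration only; real registries and browsers work from proper script data and per-TLD IDN tables.

```python
import unicodedata


def rough_scripts(label: str) -> set[str]:
    """Approximate the scripts in a label from Unicode character names.

    ASSUMPTION: the first word of the character name ("LATIN", "CYRILLIC",
    "GREEK", ...) is used as a crude proxy for the script property.
    Digits and hyphens are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def looks_mixed_script(label: str) -> bool:
    """True when letters from more than one (approximate) script appear."""
    return len(rough_scripts(label)) > 1


print(rough_scripts("meßagefactory"))       # ß is a Latin letter
print(looks_mixed_script("p\u0430ypal"))    # second letter is Cyrillic а
```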


> This is a known problem and registrars are handling it

Nitpick: registries are handling it. Registrars usually just use the IDN tables provided by the registries. Mixed script rules in particular are a nightmare to implement perfectly when the tables are not given directly by the registry (for some ccTLDs mainly).

For gTLDs these tables are directly available on iana: https://www.iana.org/domains/idn-tables


My position isn't specifically anglocentric (I'm not English myself), and like I said, it makes a lot of sense to allow more characters, but it opens up a lot of pitfalls. Banning mixed script is definitely a great idea.


The ß is not really mixed script: It’s a (mostly) normal character in the German language, which itself is based on Latin script. It’s part of common words in the language and part of personal names, including mine. I’d rather like having the ability to register a domain with my last name and was rather annoyed that the Internet infrastructure made that difficult, first with ASCII-centrism, then with IDNA2003 and now with a decades-long transition.


Yeah, it's a known issue: https://en.wikipedia.org/wiki/IDN_homograph_attack (see the Defending against the attack section for client- and registry-side measures)


How to steal money from old people via unicode obfuscated domain names.


Slightly off-topic, but does anyone know why this works on macOS?

  $ echo "foo" >"ß.txt"
  $ cat "ss.txt"
  foo


macOS’s default filesystems, HFS+ and APFS, historically have been case-insensitive, although case-sensitive variants exist. In the Unicode database the uppercase variant of ‘ß’ is recorded as upper('ß') → 'SS', according to most recent and common usage. A capital ẞ exists, but is rather new. One can assume that your filesystem does its filename comparisons and possible storage with `lower(upper(filename))` or such.

Another pitfall of Mac filesystems is Unicode normalization of precomposed characters, which changed between HFS+ and APFS I think.


> One can assume that your filesystem does their filename comparisons and possible storage with `lower(upper(filename))` or such.

The correct way to compare Unicode strings case-insensitively involves "case-folding" which directly maps "ß" to "ss".
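Python exposes the same Unicode case mappings, so the effect is easy to reproduce. This only demonstrates the Unicode rules; whether the filesystem actually compares names this way is not something the snippet can confirm, and `same_name_caseless` is a made-up helper.

```python
def same_name_caseless(a: str, b: str) -> bool:
    """Compare two names using Unicode full case folding."""
    return a.casefold() == b.casefold()


sharp_s = "\u00df"          # ß
capital_sharp_s = "\u1e9e"  # ẞ

print(sharp_s.upper())             # uppercases to the two letters SS
print(sharp_s.casefold())          # case-folds to ss
print(capital_sharp_s.casefold())  # the capital form folds to ss as well
print(same_name_caseless("ß.txt", "ss.txt"))
```

Note that `lower(upper("ß"))` also ends up as `ss`, so the guess in the parent comment and case folding agree for this particular character.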


Gotcha! The crucial piece I was missing was that Unicode case-folding can turn a single codepoint into many.

One more thing to the list of falsehoods programmers believe about Unicode I guess :-)


In German you can write 'ss' for 'ß'. So 'spaß' (fun) becomes 'spass'.

And an 'e' can be used to encode an umlaut. For example 'spät' (late) becomes 'spaet'.

It makes sense to apply this encoding for filenames, since not all software may support Unicode filenames. The file may one day be transferred to another OS, etc.


Anyone else seeing this behavior in Safari mobile?

  Safari cannot open the page because the server cannot be found.


Link doesn’t even load on my iPhone in Safari


Yeah, I'm not clicking on that


I’ll do it for you


I read .ca as .cn which was a big part of my hesitation. Thanks for making the (not so) risky click for me


Remember when we vainly attempted to train people how to distinguish between legitimate and spoofed domain names, such as in phishing emails?

Remember when big tech companies started sending confirmation/survey emails with links to domain names indistinguishable from spoofed ones? 1drv.ms, I'm looking at you.

Now good luck convincing anyone that "xn--meagefactory-m9a.ca" is totally legit.


Yeah, punycode is a bad "solution", it just makes most URLs look suspect.

The comment about the Canadian rules shows that the confusion problem should be handled on the registrar level rather than the url bar.



