by re on 3/24/2024, 7:25:37 PM
by mglz on 3/24/2024, 9:21:45 PM
My last name contains an ü and it has been consistently horrible.
* When I try to preemptively replace ü with ue, many institutions and companies refuse to accept it because it does not match my passport
* Especially in France, clerks try to emulate ü with the tréma used on e, as in ë. This makes it virtually impossible to find me in a system again
* Sometimes I can enter my name as-is and there seems to be no problem, only for some other system to mangle it to � or a box. This often triggers errors downstream that I have no way of fixing
* Sometimes, people print a u and add the diacritics by hand on the label. This is nice, but still somehow wrong.
I wonder what the solution is. Give up and ask people to consistently use an ASCII-only name? Allow everybody 1000+ Unicode characters as a name and go off that string? Officially change my name?
by weinzierl on 3/25/2024, 4:40:14 PM
This article is about a failure to do normalization properly and is not really about an issue with Unicode. Regardless of what some comments seem to allude to, an Umlaut-ü should always render exactly the same, no matter how it is encoded.
There is, however, a real ü/ü conundrum, regarding ü-Umlaut and ü-diaeresis. The ü's in the words Müll and aigüe should render differently. The dots in the French word are too close to the letter. In printed French material this is usually not the case.
Unfortunately, Unicode does not capture the nuance of the semantic difference between an Umlaut and a Tréma or Diaeresis.
The Umlaut is a letter in its own right, with its own place in the alphabet. An ü-Umlaut can never be replaced by a u alone. That would be just as wrong as replacing a p with a q: just because they look similar does not mean they are interchangeable. [1]
The Tréma, on the other hand, is a modifier that helps with proper pronunciation of letter combinations. It is not a letter in its own right, just additional information. It can sometimes even move over an adjacent letter (aiguë = aigüe; both are possible).
Some say this should be handled by the rendering system similar to Han-Unification, but I strongly disagree with this. French words are often used in German and vice versa. Currently there is no way to render a German loan word with Umlaut (e.g. führer) properly in French.
[1] The only acceptable replacement for ü-Umlaut is the combination ue.
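The renders-identically-but-compares-unequal behavior described above is easy to demonstrate. A minimal sketch with Ruby's stdlib (any language with Unicode normalization behaves the same way):

```ruby
# Two encodings of the same rendered "ü":
composed   = "\u00FC"    # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 + U+0308 COMBINING DIAERESIS

composed == decomposed   # false: different code points, same glyph

# After canonical normalization (NFC here), both compare equal:
composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)  # true
```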
by noodlesUK on 3/24/2024, 7:48:45 PM
One thing that is very unintuitive about normalization is that macOS is much more aggressive about normalizing Unicode than Windows or Linux distros. Even if you copy and paste non-normalized text into a text box in Safari on a Mac, it will be normalized before it gets posted to the server. This leads to strange issues with string matching.
by jesprenj on 3/24/2024, 7:40:28 PM
Should you really change the filenames of users' files and depend on the fact that they are valid UTF-8? Wouldn't it be better to keep the original filename and use it most of the time, normalizing only for searching and indexing?
Why not normalize Latin-alphabet filenames even further for indexing -- allowing searches for "Führer" with queries like "Fuehrer" and "Fuhrer"?
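That indexing idea can be sketched with Ruby's stdlib. The mapping table and method name here are hypothetical illustrations, not any real library's API; a production system would use full locale-aware collation tailoring instead:

```ruby
# Hypothetical search-index folding: store each name under its original
# NFC form, a German-style "ue" transliteration, and a bare mark-stripped
# form, so all three query spellings can find the same record.
GERMAN_MAP = { "ä" => "ae", "ö" => "oe", "ü" => "ue",
               "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue", "ß" => "ss" }

def index_keys(name)
  nfc      = name.unicode_normalize(:nfc)
  german   = nfc.gsub(/[äöüÄÖÜß]/, GERMAN_MAP)                # Führer -> Fuehrer
  stripped = nfc.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")   # Führer -> Fuhrer
  [nfc, german, stripped].uniq
end

index_keys("Führer")  # => ["Führer", "Fuehrer", "Fuhrer"]
```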
by josephcsible on 3/24/2024, 9:09:51 PM
IMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".
by layer8 on 3/24/2024, 5:36:47 PM
The more general solution is specified here: https://unicode.org/reports/tr10/#Searching
by blablabla123 on 3/24/2024, 8:15:54 PM
As a German macOS user with a US keyboard, I run into a related issue every now and then. What's nice about macOS is that I can easily type Umlaute, and other common letters from European languages, without any extra configuration. But some (web) applications stumble over the two-step entry, because the input arrives as: 1. ¨ (Option-u) 2. ü (u pressed)
by chuckadams on 3/24/2024, 7:39:39 PM
Clearly the author already knows this, but it highlights the importance of always normalizing your input, and consistently using the same form instead of relying on the OS defaults.
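One common shape for that advice is to normalize exactly once, at the input boundary, and treat strings as opaque bytes everywhere downstream. A sketch (the method name is hypothetical):

```ruby
# Normalize at the boundary; everything downstream can then rely on one
# canonical form (NFC here) and compare bytes directly.
def accept_input(str)
  str.unicode_normalized?(:nfc) ? str : str.unicode_normalize(:nfc)
end

a = accept_input("blo\u0308b")  # decomposed input (e.g. pasted on macOS)
b = accept_input("bl\u00F6b")   # composed input
a == b                          # true: both stored as the same bytes
```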
by userbinator on 3/24/2024, 8:26:54 PM
its[sic] 2024, and we are still grappling with Unicode character encoding problems
More like "because it's 2024." This wouldn't be a problem before the complexity of Unicode became prevalent.
by _nalply on 3/24/2024, 7:24:55 PM
Sometimes it makes sense to reduce to Unicode confusables.
For example, the Greek capital letter Alpha looks like an uppercase A. Other characters look very similar too, like the slash and the fraction slash. Yes, Unicode has separate scalar values for them.
There are Open Source tools to handle confusables.
This is in addition to the search specified by Unicode.
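Worth noting that normalization alone does not help here: canonical and compatibility forms deliberately leave confusables distinct, which is why separate tooling exists. A small Ruby sketch (the two-entry map is a hypothetical stand-in for the full Unicode confusables data that real tools ship):

```ruby
greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA
greek_alpha == "A"                           # false: distinct code points
greek_alpha.unicode_normalize(:nfkc) == "A"  # false: NFKC keeps them apart

# Tiny hand-rolled confusable fold (real tools use Unicode's confusables.txt):
CONFUSABLES = { "\u0391" => "A", "\u2044" => "/" }  # Alpha -> A, fraction slash -> slash
"blob\u2044x".gsub(/[\u0391\u2044]/, CONFUSABLES)   # => "blob/x"
```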
by Havoc on 3/24/2024, 8:01:00 PM
For those intrigued by this sort of thing, check out the tech talk “Plain Text” by Dylan Beattie.
Absolute gem. His other talks are entertaining too
by mawise on 3/24/2024, 7:56:09 PM
I ran into this building search for a family tree project. I found out that Rails provides `ActiveSupport::Inflector.transliterate()` which I could use for normalization.
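For anyone without Rails at hand, a rough stdlib-only approximation of that transliteration (an assumption on my part: plain mark-stripping via NFKD, without ActiveSupport's locale-specific rules, so e.g. ß is not handled) looks like:

```ruby
# Decompose compatibly, then drop combining marks (\p{Mn}); what remains
# is the base ASCII letter for most Latin-script input.
def transliterate(str)
  str.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")
end

transliterate("Jürgen")  # => "Jurgen"
```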
by anewhnaccount2 on 3/25/2024, 6:35:28 AM
Reminded of this classic diveintomark post http://web.archive.org/web/20080209154953/http://diveintomar...
by CoastalCoder on 3/24/2024, 9:39:33 PM
Isn't ü/ü-encoding a solved problem on Unix systems?
</joke>
by philkrylov on 3/25/2024, 7:52:10 PM
The article suggests using NFC normalization as a simple solution, but fails to mention that HFS+ always does NFD normalization to file names, and APFS kinda does not but some layer above it actually does (https://eclecticlight.co/2021/05/08/explainer-unicode-normal...), and ZFS has this behavior controlled by a dataset-level option. I don't see how applying its suggestion literally (just normalize to NFC before saving) can work.
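Given that filesystems disagree on the stored form, one workaround is to compare names form-insensitively at lookup time rather than trusting the stored bytes. A sketch (the helper name is hypothetical):

```ruby
# Compare filenames after projecting both sides to one canonical form,
# so an NFD name returned by HFS+ matches NFC user input.
def same_filename?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

stored = "blo\u0308b.txt"      # as an NFD filesystem might return it
query  = "bl\u00F6b.txt"       # as a user might type it
stored == query                # false: raw bytes differ
same_filename?(stored, query)  # true
```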
by jph on 3/24/2024, 7:31:50 PM
Normalizing can help with search. For example for Ruby I maintain this gem: https://rubygems.org/gems/sixarm_ruby_unaccent
by kazinator on 3/24/2024, 7:20:52 PM
Oh that Mötley Ünicöde.
by raffy on 3/24/2024, 8:58:50 PM
I created a bunch of Unicode tools during development of ENSIP-15 for ENS (Ethereum Name Service):
ENSIP-15 Specification: https://docs.ens.domains/ensip/15
ENS Normalization Tool: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
Browser Tests: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht...
0-dependency JS Unicode 15.1 NFC/NFD Implementation [10KB] https://github.com/adraffy/ens-normalize.js/blob/main/dist/n...
Unicode Character Browser: https://adraffy.github.io/ens-normalize.js/test/chars.html
Unicode Emoji Browser: https://adraffy.github.io/ens-normalize.js/test/emoji.html
Unicode Confusables: https://adraffy.github.io/ens-normalize.js/test/confused.htm...
by WalterBright on 3/25/2024, 3:03:16 AM
> Can you spot any difference between “blöb” and “blöb”?
That's where Unicode lost its way and went into a ditch. Identical glyphs should always have the same code point (or sequence of code points).
Imagine all the coding time spent trying to deal with this nonsense.
by ulrischa on 3/24/2024, 9:20:04 PM
It is really so awful that we have to deal with encoding issues in 2024.
by ComputerGuru on 3/24/2024, 10:46:10 PM
ZFS can be configured to force the use of a particular normalized Unicode form for all filenames. Amazing filesystem.
by NotYourLawyer on 3/24/2024, 8:17:45 PM
ASCII should be enough for anyone.
by earthboundkid on 3/24/2024, 10:38:10 PM
This isn’t an encoding problem. It’s a search problem.
by juujian on 3/24/2024, 8:04:10 PM
I ran into encoding problems so many times, I just use ASCII aggressively now. There is still kanji, Hanzi, etc. but at least for Western alphabets, not worth the hassle.
by keybored on 3/24/2024, 8:02:33 PM
I try to avoid Unicode in filenames (I’m on Linux). It seems that a lot of normal users might have the same intuition as well? I get the sense that a lot will instinctually transcode to ASCII, like they do for URLs.
> Can you spot any difference between “blöb” and “blöb”?
It's tricky to try to determine this because normalization can end up getting applied unexpectedly (for instance, on Mac, Firefox appears to normalize copied text as NFC while Chrome does not), but by downloading the page with cURL and checking the raw bytes I can confirm that there is no difference between those two words :) Something in the author's editing or publishing pipeline is applying normalization and not giving her the end result that she was going for.
Let's see if I can get HN to preserve the different forms:Composed: ü Decomposed: ü
Edit: Looks like that worked!
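For anyone who wants to verify such strings themselves, dumping the code points makes the form visible. Ruby stdlib shown here, but any language can do the same:

```ruby
"bl\u00F6b".codepoints.map { |c| format("U+%04X", c) }
# => ["U+0062", "U+006C", "U+00F6", "U+0062"]  (composed: one code point for ö)

"blo\u0308b".codepoints.map { |c| format("U+%04X", c) }
# => ["U+0062", "U+006C", "U+006F", "U+0308", "U+0062"]  (decomposed: o + combining mark)
```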