• by at_a_remove on 8/14/2020, 8:56:45 PM

    I was looking for the catch. Here it is: "It's really simple: Know what encoding a certain piece of text, that is, a certain byte sequence, is in, then interpret it with that encoding."

    That's like "knowing" the truth. How?

    I have received some very interesting files that made Python yack Unicode errors, again and again. Why? Not only did I not "know" what encoding they were in; the encoding actually changed at different points in the stream of bytes. I call this "slamming bytes together," because somewhere along the line someone's program did exactly that.
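
    In the end the best I could do was guess. A rough sketch of the kind of fallback loop I ended up writing (the candidate encodings are just guesses, nothing authoritative):

      # Try a few likely encodings, then fall back to latin-1, which
      # accepts any byte sequence but may map it to the wrong characters.
      CANDIDATES = ["utf-8", "cp1252"]

      def decode_best_effort(data: bytes) -> str:
          for enc in CANDIDATES:
              try:
                  return data.decode(enc)
              except UnicodeDecodeError:
                  continue
          return data.decode("latin-1")  # never fails, but only a guess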

    Everything is simple -- until it isn't.

  • by jbandela1 on 8/14/2020, 8:49:50 PM

    Note: This post is basically a TLDR of https://www.theregister.com/2013/10/04/verity_stob_unicode/ by Verity Stob.

    One of the reasons there is a lot of confusion about encodings vs Unicode is that Unicode was initially an encoding. It was thought that 65K characters were enough to represent all the characters in actual use across languages, so you just needed to switch from an 8-bit char to a 16-bit char and all would be well (apart from the issue of endianness). Thus Unicode initially specified what each symbol would look like encoded in 16 bits (see http://unicode.org/history/unicode88.pdf, particularly section 2). Windows NT, Java, and ICU all embraced this.

    Then it turned out that you needed a lot more characters than 65K, and instead of each character being 16 bits you would need 32-bit characters (or else weird 3-byte data types). Whereas people could justify going from 8 bits to 16 bits as the cost of not having to worry about charsets, most developers balked at 32 bits for every character. In addition, you now had a bunch of early adopters (Java and Windows NT) that had already embraced 16-bit characters. So encodings such as UTF-16 (surrogate pairs of 16-bit units for code points beyond the original range) were hacked on.
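
    For illustration (the particular character is arbitrary), this is roughly how a code point beyond the 16-bit range gets split into a surrogate pair, here in Python:

      # U+1F600 doesn't fit in 16 bits, so UTF-16 splits it in two.
      cp = 0x1F600
      offset = cp - 0x10000                   # the 20 bits left over
      high = 0xD800 + (offset >> 10)          # lead surrogate
      low = 0xDC00 + (offset & 0x3FF)         # trail surrogate
      print(hex(high), hex(low))              # 0xd83d 0xde00
      print("\U0001F600".encode("utf-16-be").hex())  # d83dde00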

    I think, if it had been understood better at the start that you have a lot more characters than will fit in 16 bits, then something like UTF-8 would likely have been chosen as the canonical encoding and we could have avoided a lot of these issues. Alas, such is the benefit of 20/20 hindsight.
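
    A quick way to see the appeal of UTF-8 as the canonical encoding: ASCII stays one byte per character, and only rarer characters pay for more (a small sketch; the sample characters are arbitrary):

      # UTF-8 is variable-width: 1 byte for ASCII, up to 4 for emoji.
      for ch in ["A", "é", "€", "😀"]:
          print(ch, len(ch.encode("utf-8")), "byte(s)")
      # A 1, é 2, € 3, 😀 4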

  • by sgopalra on 8/15/2020, 12:46:16 AM

    Interesting article from Joel Spolsky on Unicode and character sets: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

  • by UpdatedFolders on 8/14/2020, 9:56:29 PM

    I personally had a good time re-reading this over and over again when I was migrating from Python 2 to Python 3; it's a great resource: http://farmdev.com/talks/unicode/
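
    The habit that stuck with me from that migration was decoding at the boundary and working with str everywhere inside. A minimal sketch (the filename and the utf-8 guess are just placeholders):

      # Python 3 keeps bytes and str strictly separate, so the decode
      # has to happen explicitly somewhere.
      with open("data.txt", "rb") as f:
          raw = f.read()               # bytes
      text = raw.decode("utf-8")       # str; raises if the guess is wrong
      print(type(raw), type(text))     # <class 'bytes'> <class 'str'>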

  • by neiesc on 8/15/2020, 11:24:57 AM

    I think the article does not explore the UTF-8 BOM (byte order mark): https://en.wikipedia.org/wiki/Byte_order_mark
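
    For example, Python's "utf-8-sig" codec strips it on decode, while plain "utf-8" leaves a stray U+FEFF at the front (a small sketch; the byte string is made up):

      # EF BB BF is the UTF-8 encoding of the BOM, U+FEFF.
      data = b"\xef\xbb\xbfhello"
      print(repr(data.decode("utf-8")))      # '\ufeffhello'
      print(repr(data.decode("utf-8-sig")))  # 'hello'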

  • by ExtremisAndy on 8/14/2020, 10:17:43 PM

    I love C++ so much, and it has brought me such joy as a hobbyist programmer, but good grief, this one aspect of it (dealing with encodings & charsets) is so depressing I just want to cry sometimes.

  • by nunez on 8/14/2020, 11:20:28 PM

    F to pay respects to everyone who got wrecked by the BOM (byte-order mark) and CRLF vs LF.