• by nuc1e0n on 5/1/2024, 9:04:23 PM

    Code points only take 1 to 4 bytes in UTF-8. The original UTF-8 bit pattern can extend up to 6 bytes, but Unicode stops at U+10FFFF (1,114,111 in decimal), so there are only 1,114,112 code points, and U+10FFFF takes 4 bytes to encode in its shortest (non-overlong) form. I guess you could encode it overlong, but the standard requires the shortest form, so anything else is invalid and potentially harmful: an overlong sequence can sneak a character like "/" past a filter that only checks raw bytes.
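
    To make the byte counts concrete, here is a minimal Python sketch using only the built-in codec. The helper names `utf8_len` and `is_valid_utf8` are made up for illustration; the sketch assumes CPython's strict UTF-8 decoder, which rejects overlong forms and anything above U+10FFFF.

```python
def utf8_len(cp: int) -> int:
    """Return how many bytes the built-in encoder uses for code point cp."""
    return len(chr(cp).encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict built-in decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Code points on either side of each length boundary (surrogates left out,
# since they cannot be encoded at all).
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in boundaries:
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")

# 0xC0 0xAF is the overlong 2-byte spelling of U+002F '/'.
overlong_slash = b"\xc0\xaf"
# 0xF4 0x90 0x80 0x80 follows the 4-byte pattern but would mean U+110000.
past_the_end = b"\xf4\x90\x80\x80"

print(is_valid_utf8(b"/"))             # shortest form of '/'
print(is_valid_utf8(overlong_slash))   # overlong form of '/'
print(is_valid_utf8(past_the_end))     # beyond U+10FFFF
```

    Only the one-byte "/" should be accepted; the other two sequences fit the bit pattern but fall outside what the standard allows.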

  • by nuc1e0n on 5/3/2024, 3:41:51 PM

    Also, I think the step you feel you are missing is the one where code points are combined into ligatures and the text is laid out on screen. Google Chrome uses a library called Pango to do this IIRC. Edit: it's actually HarfBuzz that does the shaping, with Skia drawing the glyphs; Pango is the layout library GTK uses. https://en.wikipedia.org/wiki/Complex_text_layout