I came up with a scheme a number of years ago that takes advantage of the illegality of overlong encodings [0].
Obviously UTF-8 has 256 code units (<00> to <FF>). 128 of them are always valid within a UTF-8 string (ASCII, <00> to <7F>), leaving 128 code units that could be invalid within a UTF-8 string (<80> to <FF>).
There also happen to be exactly 128 2-byte overlong representations (overlong representations of ASCII characters).
Basically, any byte in some input that can't be interpreted as valid UTF-8 can be replaced with a 2-byte overlong representation. This can be used as an extension of WTF-8 so that UTF-16 and UTF-8 errors can both be stored in the same stream. I called the encoding WTF-8b [2], though I'd be interested to know if someone else has come up with the same scheme.
This should technically be "fine" WRT Unicode text processing, since it involves transforming invalid Unicode into other invalid Unicode. This principle is already used by WTF-8.
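A minimal sketch of the byte-level escape, assuming the natural bijection (invalid byte 0x80+k becomes the overlong two-byte encoding of k, which well-formed UTF-8 can never contain); the function names are mine, and the actual WTF-8b tables may differ in details:

```rust
fn escape_invalid_byte(b: u8) -> [u8; 2] {
    debug_assert!(b >= 0x80);               // only bytes that can't stand alone in UTF-8
    let v = b & 0x7F;                       // 0x00..=0x7F
    [0xC0 | (v >> 6), 0x80 | (v & 0x3F)]    // overlong 2-byte form of v (lead byte is C0 or C1)
}

fn unescape_invalid_byte(lead: u8, cont: u8) -> Option<u8> {
    // Only C0/C1 lead bytes are overlong 2-byte forms; real UTF-8 never produces them,
    // so decoding is unambiguous.
    if (lead == 0xC0 || lead == 0xC1) && (0x80..=0xBF).contains(&cont) {
        Some((((lead & 0x01) << 6) | (cont & 0x3F)) | 0x80)
    } else {
        None
    }
}
```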
I used it to improve preservation of invalid Unicode (ie, random 8-bit data in UTF-8 text or random 16-bit data in JSON strings) in jq, though I suspect the PR [1] won't be accepted. I still find the changes very useful personally, so maybe I'll come up with a different approach some time.
[0] https://github.com/Maxdamantus/jq/blob/911d01aaa5bd33137fadf...
[1] https://github.com/jqlang/jq/pull/2314
[2] I think I used the name "WTF-8b" as an allusion to UTF-8b/surrogateescape/PEP-383 which also encodes ill-formed UTF-8, though UTF-8b is less efficient storage-wise and is not compatible with WTF-8.
That's brilliant, tbh. I guess the challenge is how you represent those in the decoded character space. Maybe they should allocate 128 characters somewhere and define them as "invalid byte values".
In my jq PR I used negative numbers to represent them (the original byte, negated), since they're already just using `int` to represent a decoded code point, and it's somewhat normal to return distinguishable errors as negative numbers in C. I think it would also make sense to represent the UTF-16 errors ("unpaired surrogates") as negative numbers, though I didn't make that change internally (maybe because they're already used elsewhere). I did make it so that they are represented as negatives in `explode` however, so `"\uD800" | explode` emits `[-0xD800]`.
In something other than C, I'd expect them to be distinguished as members of an enumeration or something, eg:
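(a rough sketch of what I mean; the type and variant names here are hypothetical, not from any existing library)

```rust
// One unit of decoded "text" that can also carry both error cases losslessly.
enum DecodedUnit {
    Scalar(char),           // a well-formed Unicode scalar value
    InvalidByte(u8),        // a 0x80..=0xFF byte that wasn't valid UTF-8 (the UTF-8 error case)
    UnpairedSurrogate(u16), // a lone 0xD800..=0xDFFF unit (the UTF-16 error case)
}
```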
I don't expect anyone to adopt this. Listing complaints about a heavily used standard, and proposing something else incompatible won't gain any traction.
Compare to WTF-8, which solves a different problem (representing invalid 16-bit characters within an 8-bit encoding).
Yeah, WTF-8 is a very straightforward "the spec semi-artificially says we can't do this one thing, and that prevents you from using utf8 under the hood to represent JS and Java strings (which allow unpaired utf16 surrogates), so in practice utf8-except-this-one-thing is the only way to do an in-memory representation in anything that wants to implement, or round-trip with, those".
It's literally the exact opposite of this proposal, in that it addresses an actual, concrete problem and shows how to make it not a problem. This one is a list of weird grievances that aren't actually problems for anyone, like the max code point number.
I'm surprised this doesn't mandate one of the Unicode Normalization Forms. Normalization is both obscure and complex. Unicode should have a single canonical binary encoding for all character sequences.
It's a missed opportunity that this isn't already the case - but if you're going to replace utf8, we should absolutely mandate one of the normalization forms along the way.
https://unicode.org/reports/tr15/
I don't think you can mandate that in this kind of encoding. This just encodes code points, with some choices so certain invalid code points are unable to be encoded.
But normalized forms are about sequences of code points that are semantically equivalent. You can't make the non-normalized code point sequences unencodable in an encoding that only looks at one code point at a time. You wouldn't want to anchor the encoding to any particular version of Unicode either.
Normalized forms have to happen at another layer. That layer is often omitted for efficiency or because nobody stopped to consider it, but the code point encoding layer isn't the right place.
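To make that concrete, here are two canonically equivalent but byte-wise different spellings of "é"; both are perfectly valid code point sequences, so a codec that sees one code point at a time has no basis for rejecting either:

```rust
fn main() {
    let nfc = "\u{00E9}";         // precomposed: LATIN SMALL LETTER E WITH ACUTE
    let nfd = "\u{0065}\u{0301}"; // decomposed: "e" + COMBINING ACUTE ACCENT
    assert_ne!(nfc, nfd);     // different code points, different bytes
    assert_eq!(nfc.len(), 2); // 2 UTF-8 bytes
    assert_eq!(nfd.len(), 3); // 3 UTF-8 bytes
    // Their equivalence only exists at the normalization layer, above the encoding.
}
```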
Normalization is annoying but understandable - you have common characters that are clearly SOMETHING + MODIFIER, and they are common enough that you want to represent them as a single character to avoid byte explosion. SOMETHING and MODIFIER are also both useful on their own, potentially combining with other, less common characters that are less valuable to precompose (infrequent, but still needed).
If you skip all the modifiers, you end up with an explosion in code space. If you skip all the precomposed characters, you end up with an explosion in bytes.
There's no good solution here, so normalization makes sense. But then the committee says ".. and what about this kind of normalization" and then you end up.. here.
Right. But if we had a chance for a do-over, it'd be really nice if we all just agreed on a normalization form and used it from the start in all our software. Seems like a missed opportunity not to.
I think NFC is the agreed-upon normalization form, is it not? The only real exception I can think of is HFS+, which forced filenames into (a variant of) NFD; APFS dropped that behavior and just preserves whatever you give it.
He got very close to killing SOH (U+0001), which is useful in various technical specifications. Seems to still want to put the boot in.
I don't understand the desire to make existing characters unrepresentable for the sake of what? Shifting used characters earlier in the byte sequence?
Forbidding \r\n line endings in the encoding just sort of sinks the whole idea. The first couple of ideas are nice, but then you suddenly get normative about which characters are allowed to be encoded? That creates a very large initial hurdle to clear to get people to use your encoding. Suddenly you need to forbid specific texts instead of just handling everything. Why put such a huge footgun in your system when it's not necessary?
Yeah it doesn’t make much sense. In addition to being the default line ending on Windows, \r\n is part of the syntax of many text-based protocols (e.g. SMTP and IMAP) that support UTF-8, so clients/servers of all these protocols would be broken.
A lot of this makes sense to me, but as we can all guess, this will never become a thing :(
But the "magic number" thing to me is a waste of space. If this standard is accepted, if no magic number you have corrected UTF-8.
As for \r\n, not a big deal to me. I would like to see it forbidden, if only to force Microsoft to use \n like UN*X and Apple. I still have to deal with \r\n showing up in files every so often.
"If this standard is accepted, if no magic number you have corrected UTF-8."
That's true only if "corrected UTF-8" is accepted and existing UTF-8 becomes obsolete. That can't happen. There's too much existing UTF-8 text that will never be translated to a newer standard.
Magic numbers do appear a lot in C# programs. The default text encoder will output a BOM marker.
> I would like to discard almost all of the C0 controls as well—preserving only U+0000 and U+000A
What's wrong with horizontal tab?
This scheme skips over 80 through 9F because they claim it's never appropriate to send those control characters in interchanged text, but it just seems like a very brave proposal to intentionally have codepoints that can't be encoded.
I think the offset scheme should only be used to fix overlength encodings, not to patch over an ad hoc hole at the same time. It seems safer to make it possible to encode all codepoints, whether those codepoints should be used or not. Unicode already has holes in various ranges anyway.
1) Adding offsets to multi-byte sequences breaks compatibility with existing UTF-8 text, while generating text which can be decoded (incorrectly) as UTF-8. That seems like a non-starter. The alleged benefit of "eliminating overlength encodings" seems marginal; overlength encodings are already invalid. It also significantly increases the complexity of encoders and decoders, especially in dealing with discontinuities like the UTF-16 surrogate "hole".
2) I really doubt that the current upper limit of U+10_FFFF is going to need to be raised. Past growth in the Unicode standard has primarily been driven by the addition of more CJK characters; that isn't going to continue indefinitely.
3) Disallowing C0 characters like U+0009 (horizontal tab) is absurd, especially at the level of a text encoding.
4) BOMs are dumb. We learned that lesson in the early 2000s - even if they sound great as a way of identifying text encodings, they have a nasty way of sneaking into the middle of strings and causing havoc. Bringing them back is a terrible idea.
Yes, it should be completely incompatible with UTF-8, not just partially. As in, anything beyond ASCII should be invalid and not decodable as UTF-8.
If you do need the expansion of code point space, https://ucsx.org/ is the definitive answer; it was designed by actual Unicode contributors.
I was going to object to using something new at all, but their recommendation for up to 31 bits is the same as the original UTF-8. They only add new logic for sequences starting with FF.
I'm not super thrilled with the extensions, though. They jump directly from 36 bits to 63/71 bits with nothing in between and then use a complicated scheme to go further.
The proposed extension mechanism itself is quite extensible in my understanding, so you should be able to define UCS-T and UCS-P (for tera and peta respectively) with minimal changes. The website offers an FAQ for this very topic [1], too.
[1] https://ucsx.org/why#3.1
That FAQ doesn't address my issues with their UTF-8 variants. And I don't want more extensions, I want it to be simpler. Once your prefix bits fill up, go directly to storing the number of bytes. Don't have this implicit jump from 7 to 13. And arrange the length encoding so you don't have to do that weird B4 thing to keep it in order.
The length encoding is fun to think about, if you want it to go all the way up to infinity, and avoid wasting bytes.
My thought: Bytes C2–FE begin 1 to 6 continuation bytes as usual. "FF 80+x", for x ≤ 0x3E, begins an (x+7)-byte sequence. "FF BF 80+x", again for x ≤ 0x3E, begins an (x+2)-byte length for the following sequence, offset as necessary to avoid overlong length encodings. (Length bits are expressed in the same 6-bit "80+x" encoding as the codepoint itself.) "FF BF BF 80+x" begins an (x+2)-byte length for the encoded length of the sequence. And so on, where the number of initial BF bytes denotes the number of length levels past the first. (I believe there's a name for this sort of representation, but I cannot find it.)
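A sketch of my reading of those header rules (first two levels only; the deeper "FF BF BF ..." length-of-length levels are left out, and the names are mine):

```rust
enum Header {
    Ascii,                               // 00..=7F: the code point itself
    Classic(u8),                         // C2..=FE lead byte: 1 to 6 continuation bytes "as usual"
    Long { sequence_bytes: u8 },         // FF 80+x (x <= 0x3E): an (x+7)-byte sequence
    ExplicitLength { length_bytes: u8 }, // FF BF 80+x: an (x+2)-byte length field follows
}

fn classify(input: &[u8]) -> Option<Header> {
    match input {
        &[b, ..] if b <= 0x7F => Some(Header::Ascii),
        &[b, ..] if (0xC2..=0xFE).contains(&b) => Some(Header::Classic(b)),
        &[0xFF, x, ..] if (0x80..=0xBE).contains(&x) => {
            Some(Header::Long { sequence_bytes: x - 0x80 + 7 })
        }
        &[0xFF, 0xBF, x, ..] if (0x80..=0xBE).contains(&x) => {
            Some(Header::ExplicitLength { length_bytes: x - 0x80 + 2 })
        }
        _ => None, // continuation byte, deeper length level, or invalid
    }
}
```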
Assuming offsets are used properly, decoders would have an easy time jumping off the wagon at whatever point the lengths would become too long for them to possibly work with. In particular, you can get a simple subset up to 222 codepoint bits by just using "FF 80" through "FF BE" as simple lengths, and leaving "FF BF" reserved.
The magic prefix (similar to a byte order mark, BOM) also kills the idea. The reason for the success of any standard is the ability to establish consensus while navigating existing constraints. UTF-8 won over codepages and UTF-16/32 by being purely ASCII-compatible. A magic prefix kills that compatibility.
"UTF-16 is now obsolete."? That's news to me.
I wish it were true, but it's not.
Yeah, for example it's how Java stores strings to this day. But I think it's more or less never transmitted over the Network.
Even if all wire format encoding is utf8, you wouldn't be able to decode these new high codepoints into systems that are semantically utf16. Which is Java and JS at least, hardly "obsolete" targets to worry about.
And even Swift is designed so the strings can be utf8 or utf16 for cheap objc interop reasons.
Discarding compatibility with 2 of the top ~5 most widely used languages kind of reflects how disconnected the author is from the technical realities that determine whether any fixed utf8 would be feasible outside of the most toy use cases.
What about encoding it in such a way that we don't need huge tables to figure out the category of each code point?
It means that you are encoding those categories into the code point itself, which is a waste for every single use of the character encoding.
It seems plausible that this could be made efficiently doable byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with UCS-2 or UTF-32 in mind, not UTF-8.
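For what it's worth, today's UTF-8 already has a sliver of this property in the Latin-1 range, where the case pair differs only in the continuation byte; a narrow illustration (not a general algorithm, and not the hypothetical C3 xx to C4 xx remapping above):

```rust
// U+00E0..=U+00FE (except '÷' U+00F7) encode as C3 A0..C3 BE; their uppercase
// counterparts U+00C0..=U+00DE encode as C3 80..C3 9E, i.e. continuation byte - 0x20.
fn uppercase_latin1_suffix(lead: u8, cont: u8) -> (u8, u8) {
    if lead == 0xC3 && (0xA0..=0xBE).contains(&cont) && cont != 0xB7 {
        (lead, cont - 0x20)
    } else {
        (lead, cont) // leave everything else alone
    }
}
// uppercase_latin1_suffix(0xC3, 0xA9) == (0xC3, 0x89)   // 'é' (C3 A9) -> 'É' (C3 89)
```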
It’s also not clear to me that the code point is a good abstraction in the design of UTF8. Usually, what you want is either the byte or the grapheme cluster.
> Usually, what you want is either the byte or the grapheme cluster.
Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/
"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."
I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use cases beyond emojis. Of course I read the section saying it's used in actual languages, but the few examples described could have been handled with a dedicated 32-bit codepoint...)
Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatoric explosion of infrequently used characters.
But they don't have that explosion if you only encode the combinatoric primitives those characters are made of and then use composing rules?
Character case is a locale-dependent mess; trying to represent it in the values of code points (which need to be universal) is a terrible idea.
For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, they are treated as two separate letters with their own case pairs: U+0049/U+0131 ("I" / "ı") and U+0130/U+0069 ("İ" / "i").
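A quick illustration of why this can't be baked into code point values: built-in, locale-independent case mapping (Rust's shown here; most languages' defaults behave similarly) cannot produce the Turkish result, which needs locale data from something like ICU:

```rust
fn main() {
    // Unicode's default (locale-independent) simple case mappings:
    assert_eq!('i'.to_uppercase().to_string(), "I");        // U+0069 -> U+0049
    assert_eq!('\u{0131}'.to_uppercase().to_string(), "I"); // 'ı' (U+0131) -> 'I'
    // The *Turkish* uppercase of 'i' is 'İ' (U+0130), which no per-code-point
    // rule baked into an encoding can know; it requires locale data.
    assert_ne!('i'.to_uppercase().to_string(), "\u{0130}");
}
```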
Of course you sometimes need tailoring to a particular language. On the other hand, I don't see how encoding untailored casing would make tailored casing harder.
Relevant: https://www.ietf.org/archive/id/draft-bray-unichars-15.html - IETF approved and will have an RFC number in a few weeks.
Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are the "characters" you shouldn't use. Includes a bunch of stuff the OP mentioned.
The most important bit of that is the “Unicode Assignables” subset <https://www.ietf.org/archive/id/draft-bray-unichars-15.html#...>:
This is really helpful - thanks. I write a CRDT library for text editing. I should probably restrict the characters that I transport to the "Unicode Assignables" subset. I can't think of any sensible reason to let people insert characters like U+0000 into a collaborative text document.
> The original design of UTF-8 (as "FSS-UTF," by Pike and Thompson; standardized in 1996 by RFC 2044) could encode codepoints up to U+7FFF FFFF. In 2003 the IETF changed the specification (via RFC 3629) to disallow encoding any codepoint beyond U+10 FFFF. This was purely because of internal ISO and Unicode Consortium politics; they rejected the possibility of a future in which codepoints would exist that UTF-16 could not represent. UTF-16 is now obsolete, so there is no longer any reason to stick to this upper limit, and at the present rate of codepoint allocation, the space below U+10 FFFF will be exhausted in something like 600 years (less if private-use space is not reclaimed). Text encodings are forever; the time to avoid running out of space is now, not 550 years from now.
UTF-16 is integral to the workings of Windows, Java, and JavaScript, so it's not going away anytime soon. To make things worse, those systems don't even handle surrogates correctly, to the point where we had to build WTF-8, an encoding for carrying the potentially ill-formed UTF-16 of these early adopters in UTF-8-like form. Before we can start talking about characters beyond plane 16, we need an answer for how those existing systems should handle characters beyond U+10FFFF.
I can't think of a good way for them to do this, though:
1. Opting in to an alternate UTF-8 string type to migrate these systems off UTF-16 means loads of old software that just chokes on new characters. Do you remember how MySQL decided you had to opt into utf8mb4 encoding to use astral characters in strings? And how basically nobody bothered to do this up until emoji forced everyone's hand? Do you want to do that dance again, but for the entire Windows API?
2. We can't just "rip out UTF-16" without breaking compatibility. WCHAR strings in Windows are expected to be arrays of 16-bit units holding Unicode codepoints, and programs can index those directly. JavaScript strings are a bit better in that they could be UTF-8 internally, but they still have length and indexing semantics inherited from Unicode 1.0.
3. If we don't "rip out UTF-16", though, then we need some kind of representation of characters beyond plane 16. There is no space left in the BMP for this; we already used a good chunk of it for surrogates. Furthermore, it's a practical requirement of Unicode that all encodings be self-synchronizing: deleting or inserting a byte shouldn't change the meaning of more than one or two characters.
The most practical way forward for >U+10FFFF "superastrals" would be to reserve space for super-surrogates in the currently unused plane 4-13 space. A plane for low surrogates and half a plane for high would give us 31 bits of coding, but they'd already be astral characters. This yields the rather comical result of requiring 8 bytes to represent a 4 byte codepoint, because of two layers of surrogacy.
If we hadn't already dedicated codepoints to the first layer of surrogates, we could have had an alternative with unlimited coding range like UTF-8. If I were allowed to redefine 0xD800-0xDFFF, I'd change them from low and high surrogates to initial and extension surrogates, as such:
- 2-word initial surrogate: 0b1101110 + 9 bits of initial codepoint index (U+10000 through U+7FFFF)
- 3-word initial surrogate: 0b11011110 + 8 bits of initial codepoint index (U+80000 through U+FFFFFFF)
- 4-word initial surrogate: 0b110111110 + 7 bits of initial codepoint index (U+10000000 through U+1FFFFFFFFF)
- Extension surrogate: 0b110110 + 10 bits of additional codepoint index
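A worked sketch of the 2-word case under one reading of this layout (no offset applied: 9 high bits in the initial word, 10 low bits in the extension word; entirely hypothetical, not part of any standard):

```rust
// Encode a code point in U+10000..=U+7FFFF as an (initial, extension) word pair.
fn encode_2word(cp: u32) -> [u16; 2] {
    assert!((0x1_0000..=0x7_FFFF).contains(&cp));
    let initial = (0b110_1110u16 << 9) | (cp >> 10) as u16;     // lands in 0xDC00..=0xDDFF
    let extension = (0b11_0110u16 << 10) | (cp & 0x3FF) as u16; // lands in 0xD800..=0xDBFF
    [initial, extension]
}
// encode_2word(0x2_0000) == [0xDC80, 0xD800] -- note the "reversed" order relative
// to a normal high/low surrogate pair, as discussed below.
```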
U+80000 to U+10FFFF now take 6 bytes to encode instead of 4, but in exchange we now can encode U+110000 through U+FFFFFFF in the same size. We can even trudge on to 37-bit codepoints, if we decided to invent a surrogacy scheme for UTF-32[0] and also allow FE/FF to signal very long UTF-8 sequences as suggested in the original article. Suffice it to say this is a comically overbuilt system.
Of course, the feasibility of this is also debatable. I just spent a good while explaining why we can't touch UTF-16 at all, right? Well, most of the stuff that is married to UTF-16 specifically ignores surrogates, treating them as a headache for the application developer. In practice, mispaired surrogates never break anything, which is exactly why we had to invent WTF-8 to clean up after that mess.
You may have noticed that initial surrogates in my scheme occupy the coding space for low surrogates. Existing surrogates are supposed to be sent in the order high, low, so an initial/extension pair is in the opposite order from what existing code expects. Unfortunately, this isn't quite self-synchronizing in the world we currently live in: deleting an initial surrogate will change the meaning of all following 2-word pairs to high/low pairs, unless you have some out-of-band way to signal that some text is encoded with initial/extension surrogates instead of high/low pairs. So I wouldn't recommend sending anything like this on the wire, and UTF-16 parsers would need to forbid mixed surrogacy ordering.
But then again, nobody sends UTF-16 on the wire anymore, so I don't know how much of a problem this would be. And of course, there's the underlying problem that the demand for codepoints beyond U+10FFFF is very low. Hell, the article itself admits the current Unicode growth rate has 600 years before it runs into this problem.
[0] Un(?)fortunately this would not be able to reuse the existing surrogate space for UTF-16, meaning we'd need to have a huge amount of the superastral planes reserved for even more comically large expansion.