I came up with a scheme a number of years ago that takes advantage of the illegality of overlong encodings [0].
Obviously UTF-8 has 256 code units (<00> to <FF>). 128 of them are always valid within a UTF-8 string (ASCII, <00> to <7F>), leaving 128 code units that could be invalid within a UTF-8 string (<80> to <FF>).
There also happen to be exactly 128 2-byte overlong representations (overlong representations of ASCII characters).
Basically, any byte in some input that can't be interpreted as valid UTF-8 can be replaced with a 2-byte overlong representation. This can be used as an extension of WTF-8 so that UTF-16 and UTF-8 errors can both be stored in the same stream. I called the encoding WTF-8b [2], though I'd be interested to know if someone else has come up with the same scheme.
This should technically be "fine" WRT Unicode text processing, since it involves transforming invalid Unicode into other invalid Unicode. This principle is already used by WTF-8.
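A minimal sketch of the byte-level escape, assuming the natural bijection (invalid byte 0x80+k becomes the overlong two-byte encoding of k, which well-formed UTF-8 can never contain); the function names are mine, and the actual WTF-8b tables may differ in details:

```rust
fn escape_invalid_byte(b: u8) -> [u8; 2] {
    debug_assert!(b >= 0x80);               // only bytes that can't stand alone in UTF-8
    let v = b & 0x7F;                       // 0x00..=0x7F
    [0xC0 | (v >> 6), 0x80 | (v & 0x3F)]    // overlong 2-byte form of v (lead byte is C0 or C1)
}

fn unescape_invalid_byte(lead: u8, cont: u8) -> Option<u8> {
    // Only C0/C1 lead bytes are overlong 2-byte forms; real UTF-8 never produces them,
    // so decoding is unambiguous.
    if (lead == 0xC0 || lead == 0xC1) && (0x80..=0xBF).contains(&cont) {
        Some((((lead & 0x01) << 6) | (cont & 0x3F)) | 0x80)
    } else {
        None
    }
}
```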
I used it to improve preservation of invalid Unicode (ie, random 8-bit data in UTF-8 text or random 16-bit data in JSON strings) in jq, though I suspect the PR [1] won't be accepted. I still find the changes very useful personally, so maybe I'll come up with a different approach some time.
[0] https://github.com/Maxdamantus/jq/blob/911d01aaa5bd33137fadf...
[1] https://github.com/jqlang/jq/pull/2314
[2] I think I used the name "WTF-8b" as an allusion to UTF-8b/surrogateescape/PEP-383 which also encodes ill-formed UTF-8, though UTF-8b is less efficient storage-wise and is not compatible with WTF-8.
That's brilliant, tbh. I guess the challenge is how you represent those in the decoded character space. Maybe they should allocate 128 characters somewhere and define them as "invalid byte values".
In my jq PR I used negative numbers to represent them (the original byte, negated), since they're already just using `int` to represent a decoded code point, and it's somewhat normal to return distinguishable errors as negative numbers in C. I think it would also make sense to represent the UTF-16 errors ("unpaired surrogates") as negative numbers, though I didn't make that change internally (maybe because they're already used elsewhere). I did make it so that they are represented as negatives in `explode` however, so `"\uD800" | explode` emits `[-0xD800]`.
In something other than C, I'd expect them to be distinguished as members of an enumeration or something, eg:
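(a rough sketch of what I mean; the type and variant names here are hypothetical, not from any existing library)

```rust
// One unit of decoded "text" that can also carry both error cases losslessly.
enum DecodedUnit {
    Scalar(char),           // a well-formed Unicode scalar value
    InvalidByte(u8),        // a 0x80..=0xFF byte that wasn't valid UTF-8 (the UTF-8 error case)
    UnpairedSurrogate(u16), // a lone 0xD800..=0xDFFF unit (the UTF-16 error case)
}
```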
I don't expect anyone to adopt this. Listing complaints about a heavily used standard, and proposing something else incompatible won't gain any traction.
Compare to WTF-8, which solves a different problem (representing invalid 16-bit characters within an 8-bit encoding).
Yeah, WTF-8 is a very straightforward "the spec semi-artificially says we can't do this one thing, and that prevents you from using utf8 under the hood to represent JS and Java strings (which allow unpaired utf16 surrogates), so in practice utf8-except-this-one-thing is the only way to do an in-memory representation in anything that wants to implement, or round-trip with, those".
It's literally the exact opposite of this proposal, in that it addresses an actual, concrete problem and shows how to make it not a problem. This one is a list of weird grievances that aren't actually problems for anyone, like the max code point number.
I'm surprised this doesn't mandate one of the Unicode Normalization Forms. Normalization is both obscure and complex. Unicode should have a single canonical binary encoding for all character sequences.
It's a missed opportunity that this isn't already the case - but if you're going to replace utf8, we should absolutely mandate one of the normalization forms along the way.
https://unicode.org/reports/tr15/
I don't think you can mandate that in this kind of encoding. This just encodes code points, with some choices so certain invalid code points are unable to be encoded.
But normalized forms are about sequences of code points that are semantically equivalent. You can't make the non-normalized code point sequences unencodable in an encoding that only looks at one code point at a time. You wouldn't want to anchor the encoding to any particular version of Unicode either.
Normalized forms have to happen at another layer. That layer is often omitted for efficiency or because nobody stopped to consider it, but the code point encoding layer isn't the right place.
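To make that concrete, here are two canonically equivalent but byte-wise different spellings of "é"; both are perfectly valid code point sequences, so a codec that sees one code point at a time has no basis for rejecting either:

```rust
fn main() {
    let nfc = "\u{00E9}";         // precomposed: LATIN SMALL LETTER E WITH ACUTE
    let nfd = "\u{0065}\u{0301}"; // decomposed: "e" + COMBINING ACUTE ACCENT
    assert_ne!(nfc, nfd);     // different code points, different bytes
    assert_eq!(nfc.len(), 2); // 2 UTF-8 bytes
    assert_eq!(nfd.len(), 3); // 3 UTF-8 bytes
    // Their equivalence only exists at the normalization layer, above the encoding.
}
```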
Normalization is annoying but understandable - you have common characters that are clearly SOMETHING + MODIFIER, and they are common enough that you want to represent them as a single character to avoid byte explosion. SOMETHING and MODIFIER are also both useful on their own, potentially combining with other, less common characters that are less valuable to precompose (infrequent, but still needed).
If you skip all the modifiers, you end up with an explosion in code space. If you skip all the precomposed characters, you end up with an explosion in bytes.
There's no good solution here, so normalization makes sense. But then the committee says ".. and what about this kind of normalization" and then you end up.. here.
Right. But if we had a chance for a do-over, it'd be really nice if we all just agreed on a normalization form and used it from the start in all our software. Seems like a missed opportunity not to.
I think NFC is the agreed-upon normalization form, is it not? The only real exception I can think of is HFS+, which forced filenames into (a variant of) NFD; APFS dropped that behavior and just preserves whatever you give it.
He got very close to killing SOH (U+0001), which is useful in various technical specifications. Seems to still want to put the boot in.
I don't understand the desire to make existing characters unrepresentable for the sake of what? Shifting used characters earlier in the byte sequence?
Forbidding \r\n line endings in the encoding just sort of sinks the whole idea. The first couple of ideas are nice, but then you suddenly get normative about which characters are allowed to be encoded? That creates a very large initial hurdle to clear to get people to use your encoding. Suddenly you need to forbid specific texts instead of just handling everything. Why put such a huge footgun in your system when it's not necessary?
Yeah it doesn’t make much sense. In addition to being the default line ending on Windows, \r\n is part of the syntax of many text-based protocols (e.g. SMTP and IMAP) that support UTF-8, so clients/servers of all these protocols would be broken.
A lot of this makes sense to me, but as we can all guess, this will never become a thing :(
But the "magic number" thing to me is a waste of space. If this standard is accepted, if no magic number you have corrected UTF-8.
As for \r\n, not a big deal to me. I would like to see it forbidden, if only to force Microsoft to use \n like UN*X and Apple. I still have to deal with \r\n showing up in files every so often.
"If this standard is accepted, if no magic number you have corrected UTF-8."
That's true only if "corrected UTF-8" is accepted and existing UTF-8 becomes obsolete. That can't happen. There's too much existing UTF-8 text that will never be translated to a newer standard.
Magic numbers do appear a lot in C# programs. The default text encoder will output a BOM marker.
> I would like to discard almost all of the C0 controls as well—preserving only U+0000 and U+000A
What's wrong with horizontal tab?
This scheme skips over 80 through 9F because they claim it's never appropriate to send those control characters in interchanged text, but it just seems like a very brave proposal to intentionally have codepoints that can't be encoded.
I think the offset scheme should only be used to fix overlength encodings, not to patch over an ad hoc hole at the same time. It seems safer to make it possible to encode all codepoints, whether those codepoints should be used or not. Unicode already has holes in various ranges anyway.
1) Adding offsets to multi-byte sequences breaks compatibility with existing UTF-8 text, while generating text which can be decoded (incorrectly) as UTF-8. That seems like a non-starter. The alleged benefit of "eliminating overlength encodings" seems marginal; overlength encodings are already invalid. It also significantly increases the complexity of encoders and decoders, especially in dealing with discontinuities like the UTF-16 surrogate "hole".
2) I really doubt that the current upper limit of U+10_FFFF is going to need to be raised. Past growth in the Unicode standard has primarily been driven by the addition of more CJK characters; that isn't going to continue indefinitely.
3) Disallowing C0 characters like U+0009 (horizontal tab) is absurd, especially at the level of a text encoding.
4) BOMs are dumb. We learned that lesson in the early 2000s - even if they sound great as a way of identifying text encodings, they have a nasty way of sneaking into the middle of strings and causing havoc. Bringing them back is a terrible idea.
Yes, it should be completely incompatible with UTF-8, not just partially. As in, anything beyond ASCII should be invalid and not decodable as UTF-8.
If you do need the expansion of code point space, https://ucsx.org/ is the definitive answer; it was designed by actual Unicode contributors.
I was going to object to using something new at all, but their recommendation for up to 31 bits is the same as the original UTF-8. They only add new logic for sequences starting with FF.
I'm not super thrilled with the extensions, though. They jump directly from 36 bits to 63/71 bits with nothing in between and then use a complicated scheme to go further.
The proposed extension mechanism itself is quite extensible in my understanding, so you should be able to define UCS-T and UCS-P (for tera and peta respectively) with minimal changes. The website offers an FAQ for this very topic [1], too.
[1] https://ucsx.org/why#3.1
That FAQ doesn't address my issues with their UTF-8 variants. And I don't want more extensions, I want it to be simpler. Once your prefix bits fill up, go directly to storing the number of bytes. Don't have this implicit jump from 7 to 13. And arrange the length encoding so you don't have to do that weird B4 thing to keep it in order.
The length encoding is fun to think about, if you want it to go all the way up to infinity, and avoid wasting bytes.
My thought: Bytes C2–FE begin 1 to 6 continuation bytes as usual. "FF 80+x", for x ≤ 0x3E, begins an (x+7)-byte sequence. "FF BF 80+x", again for x ≤ 0x3E, begins an (x+2)-byte length for the following sequence, offset as necessary to avoid overlong length encodings. (Length bits are expressed in the same 6-bit "80+x" encoding as the codepoint itself.) "FF BF BF 80+x" begins an (x+2)-byte length for the encoded length of the sequence. And so on, where the number of initial BF bytes denotes the number of length levels past the first. (I believe there's a name for this sort of representation, but I cannot find it.)
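A sketch of my reading of those header rules (first two levels only; the deeper "FF BF BF ..." length-of-length levels are left out, and the names are mine):

```rust
enum Header {
    Ascii,                               // 00..=7F: the code point itself
    Classic(u8),                         // C2..=FE lead byte: 1 to 6 continuation bytes "as usual"
    Long { sequence_bytes: u8 },         // FF 80+x (x <= 0x3E): an (x+7)-byte sequence
    ExplicitLength { length_bytes: u8 }, // FF BF 80+x: an (x+2)-byte length field follows
}

fn classify(input: &[u8]) -> Option<Header> {
    match input {
        &[b, ..] if b <= 0x7F => Some(Header::Ascii),
        &[b, ..] if (0xC2..=0xFE).contains(&b) => Some(Header::Classic(b)),
        &[0xFF, x, ..] if (0x80..=0xBE).contains(&x) => {
            Some(Header::Long { sequence_bytes: x - 0x80 + 7 })
        }
        &[0xFF, 0xBF, x, ..] if (0x80..=0xBE).contains(&x) => {
            Some(Header::ExplicitLength { length_bytes: x - 0x80 + 2 })
        }
        _ => None, // continuation byte, deeper length level, or invalid
    }
}
```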
Assuming offsets are used properly, decoders would have an easy time jumping off the wagon at whatever point the lengths would become too long for them to possibly work with. In particular, you can get a simple subset up to 222 codepoint bits by just using "FF 80" through "FF BE" as simple lengths, and leaving "FF BF" reserved.
The magic prefix (similar to a byte order mark, BOM) also kills the idea. The reason for the success of any standard is the ability to establish consensus while navigating existing constraints. UTF-8 won over codepages and UTF-16/32 by being purely ASCII-compatible. A magic prefix kills that compatibility.
"UTF-16 is now obsolete."? That's news to me.
I wish it were true, but it's not.
Yeah, for example it's how Java stores strings to this day. But I think it's more or less never transmitted over the Network.
Even if all wire format encoding is utf8, you wouldn't be able to decode these new high codepoints into systems that are semantically utf16. Which is Java and JS at least, hardly "obsolete" targets to worry about.
And even Swift is designed so the strings can be utf8 or utf16 for cheap objc interop reasons.
Discarding compatibility with 2 of the top ~5 most widely used languages kind of reflects how disconnected the author is from the technical realities that determine whether any fixed utf8 would be feasible outside of the most toy use cases.
What about encoding it in such a way that we don't need huge tables to figure out the category of each code point?
It means that you are encoding those categories into the code point itself, which is a waste for every single use of the character encoding.
It seems plausible that this could be made efficiently doable byte-wise. For example, C3 xx could be made to uppercase to C4 xx. Unicode actually does structure its codespace to make certain properties easier to compute, but those properties are mostly related to legacy encodings, and things are designed with UCS-2 or UTF-32 in mind, not UTF-8.
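For what it's worth, today's UTF-8 already has a sliver of this property in the Latin-1 range, where the case pair differs only in the continuation byte; a narrow illustration (not a general algorithm, and not the hypothetical C3 xx to C4 xx remapping above):

```rust
// U+00E0..=U+00FE (except '÷' U+00F7) encode as C3 A0..C3 BE; their uppercase
// counterparts U+00C0..=U+00DE encode as C3 80..C3 9E, i.e. continuation byte - 0x20.
fn uppercase_latin1_suffix(lead: u8, cont: u8) -> (u8, u8) {
    if lead == 0xC3 && (0xA0..=0xBE).contains(&cont) && cont != 0xB7 {
        (lead, cont - 0x20)
    } else {
        (lead, cont) // leave everything else alone
    }
}
// uppercase_latin1_suffix(0xC3, 0xA9) == (0xC3, 0x89)   // 'é' (C3 A9) -> 'É' (C3 89)
```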
It’s also not clear to me that the code point is a good abstraction in the design of UTF8. Usually, what you want is either the byte or the grapheme cluster.
> Usually, what you want is either the byte or the grapheme cluster.
Exactly! That's what I understood after reading this great post: https://tonsky.me/blog/unicode/
"Even in the widest encoding, UTF-32, [some grapheme] will still take three 4-byte units to encode. And it still needs to be treated as a single character. If the analogy helps, we can think of the Unicode itself (without any encodings) as being variable-length."
I tend to think it's the biggest design decision in Unicode (but maybe I just don't fully see the need and use cases beyond emojis. Of course I read the section saying it's used in actual languages, but the few examples described could have been handled with a dedicated 32-bit codepoint...)
Can you fit everything into 32 bits? I have no idea, but Hangul and Indic scripts seem like they might have a combinatoric explosion of infrequently used characters.
But they don't have that explosion if you only encode the combinatoric primitives those characters are made of and then use composing rules?
Character case is a locale-dependent mess; trying to represent it in the values of code points (which need to be universal) is a terrible idea.
For example: in English, U+0049 and U+0069 ("I" and "i") are considered an uppercase/lowercase pair. In the Turkish locale, they are treated as two separate letters with their own case pairs: U+0049/U+0131 ("I" / "ı") and U+0130/U+0069 ("İ" / "i").
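A quick illustration of why this can't be baked into code point values: built-in, locale-independent case mapping (Rust's shown here; most languages' defaults behave similarly) cannot produce the Turkish result, which needs locale data from something like ICU:

```rust
fn main() {
    // Unicode's default (locale-independent) simple case mappings:
    assert_eq!('i'.to_uppercase().to_string(), "I");        // U+0069 -> U+0049
    assert_eq!('\u{0131}'.to_uppercase().to_string(), "I"); // 'ı' (U+0131) -> 'I'
    // The *Turkish* uppercase of 'i' is 'İ' (U+0130), which no per-code-point
    // rule baked into an encoding can know; it requires locale data.
    assert_ne!('i'.to_uppercase().to_string(), "\u{0130}");
}
```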
Of course you sometimes need tailoring to a particular language. On the other hand, I don't see how encoding untailored casing would make tailored casing harder.
Relevant: https://www.ietf.org/archive/id/draft-bray-unichars-15.html - IETF approved and will have an RFC number in a few weeks.
Tl;dr: Since we're kinda stuck with Uncorrected UTF-8, here are the "characters" you shouldn't use. Includes a bunch of stuff the OP mentioned.
The most important bit of that is the “Unicode Assignables” subset <https://www.ietf.org/archive/id/draft-bray-unichars-15.html#...>:
This is really helpful - thanks. I write a CRDT library for text editing. I should probably restrict the characters that I transport to the "Unicode Assignables" subset. I can't think of any sensible reason to let people insert characters like U+0000 into a collaborative text document.
> The original design of UTF-8 (as "FSS-UTF," by Pike and Thompson; standardized in 1996 by RFC 2044) could encode codepoints up to U+7FFF FFFF. In 2003 the IETF changed the specification (via RFC 3629) to disallow encoding any codepoint beyond U+10 FFFF. This was purely because of internal ISO and Unicode Consortium politics; they rejected the possibility of a future in which codepoints would exist that UTF-16 could not represent. UTF-16 is now obsolete, so there is no longer any reason to stick to this upper limit, and at the present rate of codepoint allocation, the space below U+10 FFFF will be exhausted in something like 600 years (less if private-use space is not reclaimed). Text encodings are forever; the time to avoid running out of space is now, not 550 years from now.
UTF-16 is integral to the workings of Windows, Java, and JavaScript, so it's not going away anytime soon. To make things worse, those systems don't even handle surrogates correctly, to the point where we had to build WTF-8, an encoding for carrying the potentially ill-formed UTF-16 of these early adopters in UTF-8-like form. Before we can start talking about characters beyond plane 16, we need an answer for how those existing systems should handle characters beyond U+10FFFF.
I can't think of a good way for them to do this, though:
1. Opting in to an alternate UTF-8 string type to migrate these systems off UTF-16 means loads of old software that just chokes on new characters. Do you remember how MySQL decided you had to opt into utf8mb4 encoding to use astral characters in strings? And how basically nobody bothered to do this up until emoji forced everyone's hand? Do you want to do that dance again, but for the entire Windows API?
2. We can't just "rip out UTF-16" without breaking compatibility. WCHAR strings in Windows are expected to be arrays of 16-bit units holding Unicode codepoints, and programs can index those directly. JavaScript strings are a bit better in that they could be UTF-8 internally, but they still have length and indexing semantics inherited from Unicode 1.0.
3. If we don't "rip out UTF-16", though, then we need some kind of representation of characters beyond plane 16. There is no space left in the BMP for this; we already used a good chunk of it for surrogates. Furthermore, it's a practical requirement of Unicode that all encodings be self-synchronizing: deleting or inserting a byte shouldn't change the meaning of more than one or two characters.
The most practical way forward for >U+10FFFF "superastrals" would be to reserve space for super-surrogates in the currently unused plane 4-13 space. A plane for low surrogates and half a plane for high would give us 31 bits of coding, but they'd already be astral characters. This yields the rather comical result of requiring 8 bytes to represent a 4 byte codepoint, because of two layers of surrogacy.
If we hadn't already dedicated codepoints to the first layer of surrogates, we could have had an alternative with unlimited coding range like UTF-8. If I were allowed to redefine 0xD800-0xDFFF, I'd change them from low and high surrogates to initial and extension surrogates, as such:
- 2-word initial surrogate: 0b1101110 + 9 bits of initial codepoint index (U+10000 through U+7FFFF)
- 3-word initial surrogate: 0b11011110 + 8 bits of initial codepoint index (U+80000 through U+FFFFFFF)
- 4-word initial surrogate: 0b110111110 + 7 bits of initial codepoint index (U+10000000 through U+1FFFFFFFFF)
- Extension surrogate: 0b110110 + 10 bits of additional codepoint index
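A worked sketch of the 2-word case under one reading of this layout (no offset applied: 9 high bits in the initial word, 10 low bits in the extension word; entirely hypothetical, not part of any standard):

```rust
// Encode a code point in U+10000..=U+7FFFF as an (initial, extension) word pair.
fn encode_2word(cp: u32) -> [u16; 2] {
    assert!((0x1_0000..=0x7_FFFF).contains(&cp));
    let initial = (0b110_1110u16 << 9) | (cp >> 10) as u16;     // lands in 0xDC00..=0xDDFF
    let extension = (0b11_0110u16 << 10) | (cp & 0x3FF) as u16; // lands in 0xD800..=0xDBFF
    [initial, extension]
}
// encode_2word(0x2_0000) == [0xDC80, 0xD800] -- note the "reversed" order relative
// to a normal high/low surrogate pair, as discussed below.
```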
U+80000 to U+10FFFF now take 6 bytes to encode instead of 4, but in exchange we now can encode U+110000 through U+FFFFFFF in the same size. We can even trudge on to 37-bit codepoints, if we decided to invent a surrogacy scheme for UTF-32[0] and also allow FE/FF to signal very long UTF-8 sequences as suggested in the original article. Suffice it to say this is a comically overbuilt system.
Of course, the feasibility of this is also debatable. I just spent a good while explaining why we can't touch UTF-16 at all, right? Well, most of the stuff that is married to UTF-16 specifically ignores surrogates, treating them as a headache for the application developer. In practice, mispaired surrogates never break anything, which is exactly why we had to invent WTF-8 to clean up after that mess.
You may have noticed that initial surrogates in my scheme occupy the coding space for low surrogates. Existing surrogates are supposed to be sent in the order high, low, so an initial/extension pair is in the opposite order from what existing code expects. Unfortunately, this isn't quite self-synchronizing in the world we currently live in: deleting an initial surrogate will change the meaning of all following 2-word pairs to high/low pairs, unless you have some out-of-band way to signal that some text is encoded with initial/extension surrogates instead of high/low pairs. So I wouldn't recommend sending anything like this on the wire, and UTF-16 parsers would need to forbid mixed surrogacy ordering.
But then again, nobody sends UTF-16 on the wire anymore, so I don't know how much of a problem this would be. And of course, there's the underlying problem that the demand for codepoints beyond U+10FFFF is very low. Hell, the article itself admits the current Unicode growth rate has 600 years before it runs into this problem.
[0] Un(?)fortunately this would not be able to reuse the existing surrogate space for UTF-16, meaning we'd need to have a huge amount of the superastral planes reserved for even more comically large expansion.