4-byte UTF-8 and DC software


Whilst updating my hub’s MOTD (for reference, I have put it up at http://klondike.es/motd.txt) I noticed that messages containing 4-byte UTF-8 characters, although sent properly, are filtered by many DC engines.

Ironically, the character I was testing was the one at http://www.fileformat.info/info/unicode/char/1f572/index.htm (this forum’s SQL database rejects inserting it).
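For illustration, here is a minimal Python sketch showing why this code point counts as "4-byte UTF-8". Python is used only because it is convenient for a quick check; none of the hubs or clients mentioned here are written in it.

```python
# U+1F572 lies outside the Basic Multilingual Plane,
# so its UTF-8 encoding needs four bytes.
ch = "\U0001F572"
encoded = ch.encode("utf-8")

print(hex(ord(ch)))   # 0x1f572
print(encoded.hex())  # f09f95b2
print(len(encoded))   # 4
```

The pineapple code point U+1F34D mentioned later in the thread behaves the same way (it encodes as f0 9f 8d 8d).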

So far my results are these:

Hub software:

  • ADCH: Can send messages with the character but filters incoming ones
  • uhub: Can send messages with the character but filters incoming ones
  • Flexhub: Replaces the character in the message with a ‘?’ (question mark)

Client software:

  • eiskaltdc: Can send messages with the character but filters incoming ones

I’m unsure what’s causing it as the character is valid utf-8 and shouldn’t be filtered.

Pretorian asked if this could be caused by an old Unicode engine, as the character is fairly recent. I didn’t test as deeply with the pineapple codepoint http://www.fileformat.info/info/unicode/char/1f34d/index.htm as I did with the other one, but it seems to trigger similar behaviour where I tested.

MySQL’s manual probably explains why the forum has rejected your 4-byte “NO PIRACY” codepoint:

I haven’t checked uhub or FlexHub, but adchpp’s Text.cpp doesn’t support 4-byte UTF-8 encodings in either Text::utf8ToWc or Text::wcToUtf8, which explains your observations.

4-byte UTF-8 shouldn’t be filtered, no. RFC 3629 defines 4-byte UTF-8; however, it was simply never implemented.
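As a rough sketch of what RFC 3629-style decoding with 4-byte support involves, here is a small Python decoder. It is not the code from adchpp’s Text.cpp or from uhub; the function name decode_utf8 and the table layout are made up for this illustration.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points, rejecting ill-formed input."""
    # (lead-byte mask, expected lead bits, payload mask, length, smallest code point)
    forms = [
        (0x80, 0x00, 0x7F, 1, 0x0000),
        (0xE0, 0xC0, 0x1F, 2, 0x0080),
        (0xF0, 0xE0, 0x0F, 3, 0x0800),
        (0xF8, 0xF0, 0x07, 4, 0x10000),  # the 4-byte form the filters were missing
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, bits, payload, length, minimum in forms:
            if (lead & mask) == bits:
                break
        else:
            # Stray continuation byte, or a 5/6-byte lead from the old RFC 2279.
            raise ValueError("invalid lead byte at offset %d" % i)
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("bad continuation bytes at offset %d" % i)
        cp = lead & payload
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError("overlong encoding at offset %d" % i)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("code point out of range at offset %d" % i)
        out.append(cp)
        i += length
    return out


print([hex(cp) for cp in decode_utf8(b"\xf0\x9f\x95\xb2")])  # ['0x1f572']
```

A decoder whose table stops at the 3-byte row would hit the "invalid lead byte" branch for 0xF0, which is one plausible way to end up with the filtering described above.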

Yeah I found the code doing the filtering in uhub too, see https://github.com/janvidar/uhub/blob/master/src/util/misc.c

We should probably explain this in the ADC standard and maybe provide some use cases (if possible with something other than that stupid “no piracy” icon).

The main question this raises is: should we lay the groundwork in case they decide to expand Unicode to 5 or even 6 bytes, or just stick with the standard?

Hopefully you’ve alerted janvidar to the issue in uhub.

I agree, this many hubs getting it wrong in the same way suggests a widespread misunderstanding with an unambiguously correct fix. It’s worth mentioning somewhere, with example use cases.

I’m disinclined to encourage preparation for 5+ byte UTF-8 support. The previous UTF-8 RFC did allow for “sequences of 1 to 6 octets”. However, not only does the superseding UTF-8 RFC actually enumerate exactly 4 possible codepoint representation lengths, but, after five years’ accumulated experience between 1998 and 2003, it deliberately “Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range)” versus RFC 2279, all representable within 4 bytes. I’d suggest just sticking with the current standard.
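A quick way to see the 4-byte ceiling, sketched in Python and relying only on its built-in strict UTF-8 codec:

```python
# The largest code point RFC 3629 allows still fits in four bytes.
top = chr(0x10FFFF).encode("utf-8")
print(top.hex(), len(top))  # f48fbfbf 4

# A 5-byte sequence in the old RFC 2279 style is rejected by a strict decoder.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
try:
    five_byte.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)  # False
```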

Nah, I’ll just fix the code myself and send a merge request his way instead.

A good example would probably be the emoticon block, which can be found at http://www.fileformat.info/info/unicode/block/emoticons/list.htm. These are 4-byte UTF-8, likely to be used, and provide a good baseline for what is historically an evolution of IRC.

Sounds like a good idea. I’d like to see what the others say, so that if this bites us again in the far future (let’s hope not) whoever has to fix it can know the reasons why we didn’t take another approach.

Works for me.

Agree that documenting rationales for this sort of decision is important.

Additionally, both XML 1.0 (2000) and XML 1.1 (2006) define supported character ranges topping out at #x10FFFF, just as RFC 3629 does. For filelists, this likewise restricts valid Unicode codepoints to those fitting in 4-byte UTF-8 representations.
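As a sketch of that limit, the following Python helper mirrors the Char production of XML 1.0 as I understand it. The name is_xml10_char is made up for this example, and the ranges should be checked against the XML specification before relying on them.

```python
def is_xml10_char(cp):
    """Return True if code point cp falls in the XML 1.0 Char ranges."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


print(is_xml10_char(0x1F572))   # True: allowed in a filelist, needs 4 bytes
print(is_xml10_char(0x110000))  # False: beyond the top of the range
```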

Cool, I suppose Pretorian knows better where we should keep this noted down.

For the record (and in case it helps those needing a reference) here is the patch for uhub https://github.com/janvidar/uhub/pull/27

Yes, Pretorian would be the person to ask here.

Your commit adds support for 4-byte UTF-8 characters and stricter character checking, including a specific check for what appears to be CESU-8 or some similar UTF-16 surrogate encoding. That check puzzles me; RFC 3629 notes that:

While your code indeed appears to treat this as invalid input, I’m puzzled why it should have to be specifically checked for and prohibited at all; to the extent it’s invalid, it should be invalid without any specific reference to this not-meant-for-internet-usage surrogate encoding. What is the use case here; has some ADC software historically used such encodings?

I realize that adchpp’s Text::utf8ToWc has a similar “Ugly utf-16 surrogate catch” and I’m mostly confused why it’s there at all. I was intending to remove it, so I’m especially curious why you specifically (re-)added it.

The reason why they recommend treating surrogates as invalid is the same as why they recommend checking character ranges: to prevent multiple ways of encoding the same thing, which may (for example) allow bypassing filters.

Some encodings allow encoded surrogates because there are systems built on UTF-16 (from back when 16 bits were thought to be enough for everybody) which, for example when storing text in a database, encode each 16-bit value directly into UTF-8 instead of handling the surrogates.

The problem with such an approach is that a system using UTF-16 as its internal representation will decode the surrogates and then interpret them as the symbol they represent, instead of converting the 3 or 4 byte UTF-8 sequence with the symbol into the appropriate surrogate set. This makes the same information representable in two ways, which is a bad thing for filters.
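To make the "two representations" problem concrete, here is a small Python sketch. It uses Python’s "surrogatepass" error handler to imitate a lenient surrogate-style encoder and decoder; it is an illustration only, not what any particular DC software does.

```python
symbol = "\U0001F572"

# Proper UTF-8: one 4-byte sequence.
proper = symbol.encode("utf-8")

# UTF-16 code units for the same symbol: a high and a low surrogate.
units = symbol.encode("utf-16-be")
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")

# Surrogate-style encoding: each 16-bit unit becomes its own 3-byte sequence.
sloppy = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")

print(hex(high), hex(low))  # 0xd83d 0xdd72
print(proper.hex())         # f09f95b2
print(sloppy.hex())         # eda0bdedb5b2

# A lenient UTF-16-based reader recovers the same symbol from the second form.
lenient = sloppy.decode("utf-8", "surrogatepass")
recombined = lenient.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(recombined == symbol)  # True

# A byte-level filter looking for the proper form does not see it.
print(proper in sloppy)      # False

# A strict decoder refuses the surrogate form outright.
try:
    sloppy.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
print(strict_accepts)        # False
```

Rejecting encoded code points in the U+D800 to U+DFFF range at decode time closes this gap, since only the 4-byte form remains valid.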

Fixed in ADCH++. I’ve removed the previous surrogate handling, though I may (probably should) add it back for code points between U+D800 and U+DFFF; the previous check wasn’t ideal for that anyway (e.g., didn’t check for encoding said characters).