Sounds like a good idea; I’d like to see what the others say, so that if this bites us again in the far future (let’s hope not), whoever has to fix it will know why we didn’t take another approach.
Agree that documenting rationales for this sort of decision is important.
Additionally, both XML 1.0 (2000) and XML 1.1 (2006) define supported character ranges topping out at #x10FFFF, just as RFC 3629 does. For filelists, this likewise restricts valid Unicode code points to those that fit in 4-byte UTF-8 representations.
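For illustration, a minimal sketch of such a range check on an already-decoded code point (the function name is hypothetical, not anything from adchpp):

```cpp
#include <cstdint>

// Sketch: validity check for a decoded Unicode code point, combining the
// RFC 3629 / XML ceiling with the UTF-16 surrogate exclusion.
bool isValidCodePoint(uint32_t cp) {
    if (cp > 0x10FFFF)                 // beyond what 4-byte UTF-8 can encode
        return false;
    if (cp >= 0xD800 && cp <= 0xDFFF) // UTF-16 surrogate range, never valid scalar values
        return false;
    return true;
}
```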
While your code indeed appears to treat this as invalid input, I’m puzzled why it has to be specifically checked for and prohibited at all; to the extent it’s invalid, it should be invalid without any specific reference to this not-meant-for-Internet-usage surrogate encoding. What is the use case here; has some ADC software historically used such encodings?
I realize that adchpp’s Text::utf8ToWc has a similar “Ugly utf-16 surrogate catch” and I’m mostly confused why it’s there at all. I was intending to remove it, so I’m especially curious why you specifically (re-)added it.
The reason they recommend treating surrogates as invalid is the same as the reason they recommend checking character ranges: to prevent multiple encodings of the same thing, which may, for example, allow slipping past filters.
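The classic example of the filter-bypass problem is overlong encodings; a sketch of a byte-level check (illustrative names, not adchpp code):

```cpp
#include <cstdint>

// Sketch: 2-byte UTF-8 sequences with lead byte 0xC0 or 0xC1 are always
// overlong -- they encode code points below U+0080, which already have a
// 1-byte form. E.g. a lenient decoder may accept 0xC0 0xAF as a second
// spelling of '/' (U+002F), letting it slip past a byte-level path filter.
bool isOverlongLead(uint8_t b) {
    return b == 0xC0 || b == 0xC1;
}
```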
The reason some encodings allow encoding surrogates is that there are systems using UTF-16 (from back when people thought 16 bits would be enough for everybody) which encode each 16-bit value directly into UTF-8 instead of handling the surrogates, for example for storage in a database.
The problem with such an approach is that a system using UTF-16 as its internal representation will decode the surrogates and then interpret them as the symbol they represent, instead of converting the 3- or 4-byte UTF-8 sequence containing the symbol into the appropriate surrogate pair. This makes the same information representable in two ways, which is a bad thing for filters.
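To make the dual representation concrete, here are the two byte sequences involved for U+10000 (first code point outside the BMP); this is just an illustration, not code from either project:

```cpp
#include <string>

// U+10000 is represented in UTF-16 as the surrogate pair 0xD800 0xDC00.
// A naive encoder that pushes each 16-bit unit through the 3-byte UTF-8
// rules (CESU-8 style) produces a different byte string than proper UTF-8,
// so a byte-level filter sees two spellings of the same character.
const std::string utf8  = "\xF0\x90\x80\x80";          // correct 4-byte UTF-8
const std::string cesu8 = "\xED\xA0\x80\xED\xB0\x80";  // surrogates encoded directly
```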
Fixed in ADCH++. I’ve removed the previous surrogate handling, though I may (probably should) add it back for code points between U+D800 and U+DFFF; the previous check wasn’t ideal for that anyway (e.g., it didn’t catch UTF-8-encoded forms of those characters).
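For reference, a re-added check could reject encoded surrogates at the byte level without fully decoding; a sketch (hypothetical function, not the actual adchpp code):

```cpp
#include <cstdint>

// Sketch: U+D800..U+DFFF would encode in UTF-8 as ED A0 80 .. ED BF BF,
// so a 3-byte sequence with lead byte 0xED and a continuation byte >= 0xA0
// is always an encoded surrogate and can be rejected immediately.
// (ED 80..9F xx covers U+D000..U+D7FF, which remain valid.)
bool isEncodedSurrogate(uint8_t b0, uint8_t b1) {
    return b0 == 0xED && b1 >= 0xA0;
}
```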