@Lady No fan of U+FFFC?
@aschmitz U+FFFC is allowed in XML documents; i need a character which is NOT allowed in XML documents but which is still a valid Unicode character
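counting only what XML 1.0 and XML 1.1 both forbid, there are three of these: U+0000 (not ideal), U+FFFE, and U+FFFF (both of these last two are great); XML 1.0 additionally forbids most of the C0 controls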
@aschmitz (well, do not use U+FFFE in a UTF‐16 environment where it might be confused for a byte‐swapped U+FEFF)
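For anyone following along, here is a small sketch of that UTF‐16 hazard, using only the Python standard library. The variable names are just for illustration.

```python
import codecs

# U+FFFE serialized as big-endian UTF-16 is the byte pair FF FE ...
swapped = "\ufffe".encode("utf-16-be")

# ... which is byte-for-byte the little-endian byte order mark (U+FEFF as FF FE).
looks_like_le_bom = swapped == codecs.BOM_UTF16_LE

# A BOM-sniffing decoder therefore treats a leading U+FFFE as a BOM,
# drops it, and reads the rest of the data with the wrong byte order.
data = "\ufffea".encode("utf-16-be")
misread = data.decode("utf-16")

print(looks_like_le_bom, repr(misread))
```

Here the "a" that followed the marker comes back as a different character entirely, which is the confusion being warned about.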
@Lady Fair enough I suppose, though expecting that your input will always be valid feels like asking for a certain kind of trouble. But if you're the one writing it you're probably okay. (And yeah, though FFFE is theoretically allowed I'd avoid it for the reason you say unless you can guarantee it won't show up early.)
@aschmitz usually the best‐practice as i understand it when using noncharacters (which FFFE and FFFF are) is to first search for them in the string and replace any existing ones with FFFD
this would need to happen to make the XML valid anyway, so that seems acceptable to me; i agree that in the general case you probably shouldn’t assume valid input tho
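A minimal sketch of that search‐and‐replace step, assuming U+FFFF is the marker and that only the two noncharacters mentioned here need handling. `sanitize`, `mark`, and `SENTINEL` are hypothetical names, not anything from an existing library.

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER
SENTINEL = "\uffff"     # assumed choice of private marker

# Only the two noncharacters discussed above; the other Unicode
# noncharacters are deliberately out of scope for this sketch.
_NONCHARS = re.compile("[\ufffe\uffff]")

def sanitize(text: str) -> str:
    """Replace any U+FFFE or U+FFFF already in the input with U+FFFD."""
    return _NONCHARS.sub(REPLACEMENT, text)

def mark(text: str, index: int) -> str:
    """Sanitize first, then insert the sentinel at the given offset.

    The replacement is one character for one character, so offsets
    computed against the original text remain valid.
    """
    clean = sanitize(text)
    return clean[:index] + SENTINEL + clean[index:]
```

After `mark`, the only U+FFFF in the string is the one the program put there, so finding it later is unambiguous.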
@aschmitz the other best‐practice with noncharacters is to never store them in a place where anyone other than the program which understands their meaning will see them
having the noncharacters produce XML which isn’t valid provides a bit of a guarantee against that; a downstream recipient SHOULD error out if it receives a document where the noncharacter wasn’t handled/removed
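A rough sketch of what that downstream check could look like, assuming the recipient validates against the `Char` production from XML 1.0 before doing anything else. `check_xml_chars` is a made‐up helper name; a real XML parser may already reject these characters on its own.

```python
import re

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    "[^\t\n\r\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)

def check_xml_chars(document: str) -> None:
    """Raise ValueError if the text contains a character XML 1.0 forbids."""
    match = _NOT_XML_CHAR.search(document)
    if match:
        code_point = ord(match.group())
        raise ValueError(
            f"U+{code_point:04X} at offset {match.start()} is not allowed in XML"
        )
```

A leftover U+FFFF marker fails this check, while U+FFFC passes, which is the property that makes the noncharacter useful as a tripwire.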