@Lady No fan of U+FFFC?
@aschmitz U+FFFC is allowed in XML documents; i need a character which is NOT allowed in XML documents but which is still a valid Unicode character
there are three of these (going by XML 1.1; XML 1.0 additionally forbids most C0 controls): U+0000 (not ideal), U+FFFE, and U+FFFF (both of these last two are great)
@aschmitz (well, do not use U+FFFE in a UTF‐16 environment where it might be confused for a byte‐swapped U+FEFF)
@aschmitz usually the best practice, as i understand it, when using noncharacters (which U+FFFE and U+FFFF are) is to first search the string for them and replace any existing ones with U+FFFD
this would need to happen to make the XML valid anyway, so that seems acceptable to me; i agree that in the general case you probably shouldn’t assume valid input tho
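
(a minimal sketch of that scrub step, in Python; the names `scrub_noncharacters` and `_SCRUB_TABLE` are made up for illustration and not from any library. it only assumes the two noncharacters discussed here are the ones being reserved for internal use)

```python
# Map the two noncharacters discussed above to U+FFFD REPLACEMENT CHARACTER.
# (hypothetical helper, not part of any XML library)
_SCRUB_TABLE = {0xFFFE: 0xFFFD, 0xFFFF: 0xFFFD}


def scrub_noncharacters(text: str) -> str:
    """Replace any pre-existing U+FFFE / U+FFFF in the input with U+FFFD.

    After this runs, the program can use U+FFFE / U+FFFF as private
    sentinels without colliding with anything that arrived in the input.
    """
    return text.translate(_SCRUB_TABLE)


if __name__ == "__main__":
    dirty = "a\uFFFEb\uFFFFc"
    print([f"U+{ord(ch):04X}" for ch in scrub_noncharacters(dirty)])
```

doing the scrub once, at the point where input enters the program, means every later occurrence of a sentinel is known to have been put there deliberately.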
@Lady Ideally! (In my world, most XML parsers are extremely far from validating, but a final check that things are valid as they depart is feasible enough, at least.)
@aschmitz i am very disappointed in the state of XML parsers as well
@aschmitz the other best practice with noncharacters is to never store them in a place where anyone other than the program which understands their meaning will see them
having the noncharacters produce XML which isn’t valid provides a bit of a guarantee against that; a downstream recipient SHOULD error out if it receives a document where the noncharacter wasn’t handled/removed
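
(a rough sketch of what such a check on the way out could look like, in Python; `check_outgoing` is a hypothetical name, and the character class is written from the XML 1.0 `Char` production, so treat it as an assumption to verify against whichever XML version is actually in use)

```python
import re

# Matches any code point outside the XML 1.0 `Char` production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_outgoing(text: str) -> str:
    """Raise ValueError if `text` contains a character XML 1.0 forbids.

    A leftover U+FFFE / U+FFFF sentinel is caught here, because neither
    falls inside the allowed ranges. Returns `text` unchanged otherwise.
    """
    match = _NOT_XML_CHAR.search(text)
    if match:
        raise ValueError(
            f"forbidden character U+{ord(match.group()):04X} "
            f"at offset {match.start()}"
        )
    return text
```

this checks characters only, not overall structure; it is meant as the last line of defence against a sentinel escaping, not as a replacement for a real parser.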