U+FFFF is a VALID UNICODE CHARACTER and i WILL use it as a delimiting sigil when processing strings which are restricted to the Char production!!

@aschmitz U+FFFC is allowed in XML documents; i need a character which is NOT allowed in XML documents but which is still a valid Unicode character

in XML 1.1 there are three of these: U+0000 (not ideal), U+FFFE, and U+FFFF (both of these last two are great); XML 1.0 additionally forbids most of the C0 control characters

@aschmitz (well, do not use U+FFFE in a UTF‐16 environment where it might be confused for a byte‐swapped U+FEFF)

@Lady Fair enough I suppose, though expecting that your input will always be valid feels like asking for a certain kind of trouble. But if you're the one writing it you're probably okay. (And yeah, though FFFE is theoretically allowed I'd avoid it for the reason you say unless you can guarantee it won't show up early.)

@aschmitz usually the best practice, as i understand it, when using noncharacters (which U+FFFE and U+FFFF are) is to first search the string for them and replace any existing ones with U+FFFD

this would need to happen to make the XML valid anyway, so that seems acceptable to me; i agree that in the general case you probably shouldn’t assume valid input tho
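
A minimal sketch of that scrub-then-delimit approach, in Python. The names `SIGIL`, `scrub`, `join_fields`, and `split_fields` are hypothetical, made up for illustration; nothing in the thread says this is how the actual code looks.

```python
SIGIL = "\uffff"        # assumption: U+FFFF is the internal delimiter
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def scrub(text: str) -> str:
    """Replace any U+FFFE / U+FFFF already present in the input with U+FFFD."""
    return text.replace("\ufffe", REPLACEMENT).replace("\uffff", REPLACEMENT)


def join_fields(fields: list[str]) -> str:
    """Scrub each field, then join with the sigil, so the sigil only ever
    appears where this program put it."""
    return SIGIL.join(scrub(field) for field in fields)


def split_fields(joined: str) -> list[str]:
    """Recover the (scrubbed) fields by splitting on the sigil."""
    return joined.split(SIGIL)
```

Because every field is scrubbed before joining, splitting on the sigil always gives back the same number of fields that went in, even if the input already contained a stray U+FFFF.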

@aschmitz the other best practice with noncharacters is to never store them anywhere that anyone other than the program which understands their meaning will see them

having the noncharacters produce XML which isn’t valid provides a bit of a guarantee against that; a downstream recipient SHOULD error out if it receives a document where the noncharacter wasn’t handled/removed

@Lady Ideally! (In my world, most XML parsers are extremely far from validating, but a final check that things are valid as they depart is feasible enough, at least.)


@aschmitz i am very disappointed in the state of XML parsers as well :Eevee_awkward:
