Yesterday, I finally got JSON parsing working, so today, I must fight HTML

:vaporeon: <- reading things about SGML and the HTML4 DTD and such

This was a reasonable introduction, but I still don't understand:

  • For some elements, e.g. <BODY>, it's possible to omit both the start tag and the end tag... So how do I know where such an element begins and where it ends? (Since the HTML of Mastodon posts doesn't use <HEAD> and <BODY>, I suppose I can ignore this and expect every element to have a start tag, but I would still like to know...)
  • The DTD says which elements an element is allowed to contain, for example:
<!ELEMENT P - O (%inline;)*            -- paragraph -->

This says that a <P> must have a start tag, that its end tag may be omitted, and that its children are zero or more %inline; elements... So what is my parser supposed to do if a forbidden element is encountered? Since the closing tag of <P> may be omitted, I guess the reasonable thing to do would be to close that paragraph and open whatever the new tag is. E.g. in <P>Line one<P>Line two, upon encountering the second <P>, the parser would close the first paragraph and open a new one, since <P> is not allowed to contain a <P>.

But in general, if the end tag may not be omitted, what do I do then? Throw an error?
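
The guess above about <P> (close the open paragraph when a non-inline start tag shows up) can be sketched in Python. This is only a rough illustration under assumptions, not a conforming HTML parser: it uses the standard library's html.parser as a tokenizer (which does not apply any DTD rules itself), and the names CLOSES_P, VOID, Node, SketchParser, parse_fragment and serialize are all made up for this example. The CLOSES_P set is a hand-picked, incomplete subset of block-level elements, and attributes are dropped to keep it short.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete subset of non-inline elements whose start
# tag implicitly ends an open <p>. The real list is longer.
CLOSES_P = {"p", "div", "ul", "ol", "pre", "blockquote", "h1", "h2", "h3"}
# Assumption: a few elements that never have content or an end tag.
VOID = {"br", "hr", "img"}


class Node:
    """Hypothetical tree node; children are Node objects or plain strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class SketchParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def _close_open_p(self):
        # Walk up from the current element; if a <p> is open, end it
        # (and everything nested inside it).
        node = self.current
        while node is not self.root:
            if node.tag == "p":
                self.current = node.parent
                return
            node = node.parent

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag names; attrs are ignored here.
        if tag in CLOSES_P:
            self._close_open_p()
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def parse_fragment(html):
    parser = SketchParser()
    parser.feed(html)
    parser.close()  # flushes any text still buffered at the end of input
    return parser.root


def serialize(node):
    """Write the tree back out with every end tag made explicit."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#root":
        return inner
    if node.tag in VOID:
        return f"<{node.tag}>"
    return f"<{node.tag}>{inner}</{node.tag}>"


print(serialize(parse_fragment("<P>Line one<P>Line two")))
```

Serializing the tree makes the implied end tags visible: the second <P> ends the first paragraph instead of nesting inside it. The sketch does not answer what to do when a required end tag is missing; it just silently closes whatever is still open at the end of the input.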

@vaporeon_ the HTML Living Standard is much more explicit about answering these kinds of questions. See, e.g., what it has to say about <P>

@aescling That page ate my laptop's entire remaining RAM and I had to close it... I did see that there's a multipage version, can you perhaps link that? Sorry, I'm still on the laptop with less than 700MB of RAM...
