There are so many odd edge cases in HTML; a good one I found was with forms. If you open a <form> but don't have a closing tag, the browser will close the form block "visually" at the end of the form's immediate parent, as you would expect. All styles are applied to it, or to its children via selectors, up to that automatically inserted end point. It's how browsers handle most unclosed block tags.
However, the form's "functionality" isn't closed at that point: any inputs further down the page (outside of the form's DOM tree) are included in the post/get when the form is submitted. Or at least until another form is found in the DOM. Effectively an unclosed form is two things: a visual block that is closed automatically, and an "overlapping" form capturing inputs indefinitely.
> The form element pointer points to the last form element that was opened and whose end tag has not yet been seen. It is used to make form controls associate with forms in the face of dramatically bad markup, for historical reasons.
And search through the rest of the page for the term to find how it’s implemented—it’s straightforward, just set on a <form> open tag and reset on an (explicit) </form> close tag.
This is somewhat unreliable: browsers support it, but tools using XML pipelines are allowed to ignore it (§13.2.9), and lots of JavaScript code will assume hierarchy rather than using form.elements, and thus not catch such elements, or elements that manually specify a form owner via the form attribute.
③ Look through the DOM interface listed, elements sounds promising. Find the explanation of that IDL attribute below: “The elements IDL attribute must return an HTMLFormControlsCollection rooted at the form element's root, whose filter matches listed elements whose form owner is the form element, with the exception of input elements whose type attribute is in the Image Button state, which must, for historical reasons, be excluded from this particular collection.” Roll your eyes at the bizarre exclusion of <input type=image>, then focus on the term form owner which sounds relevant. That links you to https://html.spec.whatwg.org/multipage/form-control-infrastr....
④ Hmm… null, parser inserted flag, nearest ancestor form element, form attribute. Parser inserted flag sounds relevant (though it’s just a flag, not the actual association link). Also the note “They are also complicated by rules in the HTML parser that, for historical reasons, can result in a form-associated element being associated with a form element that is not its ancestor.”
⑤ This is where having the whole spec open, rather than the multipage version, is handy: you can search the entire document for the term “parser inserted flag” to see where that gets set. You can also guess that it’s going to be in §13.2 Parsing HTML documents (parsing.html). In the end, it’s https://html.spec.whatwg.org/multipage/parsing.html#creating...: “… then associate element with the form element pointed to by the form element pointer and set element's parser inserted flag.” Ah hah!
⑥ You have found the concept in the parser: “form element pointer”. You can then look through where it’s used and quickly see how it’s set on <form> and unset on </form>, thus deliberately handling the missing-</form> case.
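To make the steps above concrete, here is a toy model of the form element pointer in Python. It is only a sketch: the class name FormPointerSketch and the helper form_owners are made up for illustration, it leans on the standard library's html.parser rather than a real HTML tree builder, and it ignores the form attribute, templates and everything else the spec covers. Ignoring a second <form> while the pointer is set is my reading of the spec, so treat that as an assumption.

```python
from html.parser import HTMLParser


class FormPointerSketch(HTMLParser):
    """Toy model of the form element pointer; not the real tree builder."""

    def __init__(self):
        super().__init__()
        self.form_pointer = None  # label of the form whose </form> is still pending
        self.owners = {}          # input name -> label of its form owner (or None)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            # Set the pointer on <form>. A <form> seen while the pointer is
            # already set is ignored here (assumption: my reading of the spec).
            if self.form_pointer is None:
                self.form_pointer = attributes.get("id", "(anonymous form)")
        elif tag == "input":
            # Associate with whatever the pointer says, regardless of nesting.
            self.owners[attributes.get("name")] = self.form_pointer

    def handle_endtag(self, tag):
        # Only an explicit </form> resets the pointer; </div> or </p> do not.
        if tag == "form":
            self.form_pointer = None


def form_owners(markup):
    """Return a dict mapping each input's name to the form it would submit with."""
    parser = FormPointerSketch()
    parser.feed(markup)
    parser.close()
    return parser.owners


# The <div> ends before </form>, so "stray" sits outside the form's subtree.
PAGE = """
<div><form id="search"><input name="q"></div>
<p><input name="stray"></p>
</form>
<input name="after">
"""

print(form_owners(PAGE))
```

The input named "stray" is outside the <div> that visually contains the form, yet it is still associated with it, because nothing other than </form> clears the pointer.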
You develop a feeling for this kind of thing over time. I didn’t know about the form element pointer (though I feel I should have known about it), but this is a loose description of what I did, though I was able to speed through some of the steps, and I really should have just started by looking at “An end tag whose tag name is "form"”, but at first I thought the claim was bogus.
I think I got to point 2, found no reference in the form tag section, and gave up.
But what's fascinating is that it describes the HTML parser effectively implementing "overlapping markup", as in the Wikipedia article, for this edge case, for backwards compatibility.
There’s a lot of messy stuff the HTML syntax supports for historical reasons.
https://html.spec.whatwg.org/multipage/parsing.html#an-intro... covers a variety of fun ones, like how inline formatting elements basically support overlapping markup, and how malformed nesting can be achieved in a few different ways. It’s a useful section of the spec for understanding these things because it explains what’s going on, with links to the precise parser details.
It's not in the DOM; from memory, Chrome dev tools even shows a closing form tag where it's been inserted. I have no idea how it's implemented internally.
It confused me for a while when debugging a legacy website. It had actually been done intentionally to work around a rather complex architecture.
There exists a "form"-attribute for input elements that can be used to associate input elements outside the form hierarchy to be included in the form submission.
So the semantics of "form field outside the actual form" are available anyway. When parsing a not-closed <form> the browsers just make use of that.
XHTML was an attempt at such strictness. It failed, though you can still use XML syntax for HTML if you choose to (serve it with MIME type application/xhtml+xml instead of text/html). I leave you to research why it failed if you want; plenty has been written on the topic.
In the mid-2000s, HTML syntax was finally codified. There is now no undefined behaviour: all inputs have a defined output, however surprising it may be, because a lot of it was being relied upon.
I've frequently wondered why a hierarchical approach is the norm for text formatting. It seems that many problems could be solved trivially using a text buffer and a list of formatting sequences defined by a starting index and a length. The only place I've seen this in practice is in Telegram's TL Schema [1]. Is this method found anywhere else?
Edit to note: there is one obvious advantage to in-band markup such as HTML -- streaming formatted content. Though I wonder if this could be done with a non-hierarchical method, for example using in-band start tags which also encode the length.
Edit 2: looks like Condé Nast maintains a similar technology called atjson [2].
There are a number of rich text editors that model documents as a flat array of characters and a separate table of formatting modifiers (each with an offset and length). Medium's text editor is one of them. This post [1] on their engineering blog introduced me to the idea, and I think it's a good starting point for anyone interested in this topic.
ProseMirror (a JavaScript library for building rich text editors) also employs a document model like this. The docs for that project [2] do a good job of explaining how their implementation of this idea works, and what problems it solves.
"I've frequently wondered why a hierarchical approach is the norm for text formatting."
80/20, if not 90/10, effectiveness. Most people are not trying to do what the Wikipedia article is talking about. About the most complicated thing that people want to do is the moral equivalent of <i>italic <b>bold and italic</i> bold</b>, and you can losslessly convert that to <i>italic <b>bold and italic</b></i><b> bold</b> for almost all practical purposes.
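That lossless conversion is mechanical. A minimal sketch in Python, assuming the formatting is held as (start, length, tag) spans over a plain text buffer; render_nested is a hypothetical helper, and it does no HTML escaping:

```python
def render_nested(text, spans):
    """Render (start, length, tag) spans over text as well-nested markup."""
    # Outer spans first: earlier start wins, then the longer span.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    # Every span boundary is a place where the set of active tags can change.
    cuts = sorted({0, len(text)} | {p for s, n, _ in spans for p in (s, s + n)})
    out, open_tags = [], []
    for left, right in zip(cuts, cuts[1:]):
        wanted = [tag for s, n, tag in ordered if s <= left and right <= s + n]
        # Keep the longest shared prefix of open tags; close everything after it.
        keep = 0
        while (keep < min(len(open_tags), len(wanted))
               and open_tags[keep] == wanted[keep]):
            keep += 1
        while len(open_tags) > keep:
            out.append(f"</{open_tags.pop()}>")
        # Open whatever this segment still needs.
        for tag in wanted[keep:]:
            out.append(f"<{tag}>")
            open_tags.append(tag)
        out.append(text[left:right])
    while open_tags:
        out.append(f"</{open_tags.pop()}>")
    return "".join(out)


TEXT = "italic bold and italic bold"
SPANS = [(0, 22, "i"), (7, 20, "b")]  # the two spans overlap on "bold and italic"
print(render_nested(TEXT, SPANS))
```

The cost is that one logical bold span becomes two <b> elements, which is harmless for presentation but loses information once the tags carry meaning.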
It isn't until you're getting very precise about what your tags mean, for tags that intrinsically "cross" hierarchies like that, that you start seeing these issues. And then by the time you've gotten that far, you realize you have all sorts of problems, as the article says.
But a good deal of the answer is that while the stuff mentioned in the Wikipedia article is true and important, it's also fairly specialist.
As for "The only place I've seen this in practice is in Telegram's TL Schema [1]. Is this method found anywhere else?", tag-based formatting is the norm for rich text widgets, which generally can natively represent my first HTML example above in its internal format. Generally if you dig into your favorite language you'll find someone has already implemented this efficiently as a library you can pick up if you want to use the capability directly outside of a text widget. It has its own consequences, as anyone who has ever fought with them may realize, but it's not impossibly difficult to deal with.
It isn't a magic solution to everything either, though. Even if it is what you think you want, a widget able to represent a bold section starting in the middle of a paragraph, then proceeding through the first three rows of a table, then stopping in the middle of a paragraph in the third column of the next row is generally weird. To some extent, people have a certain hierarchiness to their thinking about these matters too, whether it's cause or effect. But that hierarchiness is messy; I think it's fair to say most people wouldn't "mean" that bold to mean something in my table case, we don't necessarily expect tags to proceed through tables like that, but <i>i<b>bi</i>b</b> is something that people might intuitively expect to be able to do. It's a fractally messy space both in the computer science and human expectations, and the fractal messiness only gets messier when we try to harmonize those two things.
I guess because it would be a total pain for humans to read and write without specialised tooling. Imagine trying to add a word at the start of your document.
That list of formatting sequences would have to be updated with new indexes when the content of the buffer changed. Keeping the two in sync wouldn't be trivial (for a computer or a human), a tree of nodes fixes that and works for 99.99% of use cases.
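For illustration, the bookkeeping looks roughly like this in Python. This is a sketch with a hypothetical insert_text helper and (start, length, tag) spans; what happens when the insertion lands exactly on a span edge is a policy choice, and here such text stays outside the span:

```python
def insert_text(text, spans, at, new):
    """Insert `new` at index `at`, returning the new text and adjusted spans."""
    shifted = []
    for start, length, tag in spans:
        if at <= start:
            start += len(new)    # insertion at or before the span: move it right
        elif at < start + length:
            length += len(new)   # insertion strictly inside the span: stretch it
        # insertion at or after the span's end: nothing to adjust
        shifted.append((start, length, tag))
    return text[:at] + new + text[at:], shifted


text, spans = "hello world", [(6, 5, "b")]          # "world" is bold
text, spans = insert_text(text, spans, 0, "Oh, ")   # add a word at the start
print(text, spans)
```

Every edit has to walk the span list like this (or use a smarter index structure), which is the part a tree gives you for free.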
It may not be trivial, but it's a solved problem. Many rich text UI widgets and corresponding backing data structures exist today, based on a tagging system where tags can trivially define regions that overlap with each other. It's tricky and full of corner cases, but not that hard if you put your mind to it, and it's not computationally inefficient either.
Wouldn't one obvious solution be to allow tags from different namespaces to overlap? Maybe it is mentioned in the article but I could not see it:
<ns1:root>
<ns2:root>
<ns1:elemA>This is some <ns2:mark>content</ns1:elemA>
<ns1:elemB>that is split</ns2:mark> into two nodes</ns1:elemB>
</ns2:root>
</ns1:root>
Then in this case two trees with common leaf nodes (4 text nodes) are constructed.
From the point of view of ns2-root there are only 3 children (the 2 text nodes outside <mark> and the <mark> itself), and from the point of view of ns1-root there are two children (elemA and elemB).
Then when parsing one could even pre-select which namespaces to parse and skip all other, for example if I am only interested in ns1, ns2 could be skipped during parsing.
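A rough sketch of that pre-selection in Python. No real XML parser accepts the overlapping input, so this uses a regex pre-pass (the project function is hypothetical) that drops every tag outside the chosen prefix and strips the prefix from the rest, then hands the result to the standard ElementTree parser. The closing mark tag is written as ns2:mark here, and attributes are not handled.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<ns1:root>
<ns2:root>
<ns1:elemA>This is some <ns2:mark>content</ns1:elemA>
<ns1:elemB>that is split</ns2:mark> into two nodes</ns1:elemB>
</ns2:root>
</ns1:root>"""

PREFIXED_TAG = re.compile(r"<(/?)(\w+):(\w+)>")


def project(markup, keep):
    """Keep only tags with prefix `keep` (prefix removed); drop all other tags."""
    def rewrite(match):
        slash, prefix, name = match.groups()
        return f"<{slash}{name}>" if prefix == keep else ""
    return PREFIXED_TAG.sub(rewrite, markup).strip()


ns1_tree = ET.fromstring(project(DOC, "ns1"))
ns2_tree = ET.fromstring(project(DOC, "ns2"))

print([child.tag for child in ns1_tree])   # element children seen through ns1
print([child.tag for child in ns2_tree])   # element children seen through ns2
```

Each projection is an ordinary tree on its own; the overlap only exists when you look at both at once, and the shared text is what ties them together.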
Back in the day when I was in school, and there was an IE monopoly, I wrote a simple HTML parser. Instead of parsing it into a tree, it just recorded the beginning and end positions of tags as indices into the string. I think I did use a stack to match nested tags properly. But overlapping markup was common back then, and IE rendered it "correctly" IIRC. This simple parser was enough to power a scraper (I don't remember what I was scraping. Maybe a competitor's emule link site or something like that :-P) and a crude rich text renderer, which I was very proud of.
The "Approaches and implementations" section includes some clear (to my eyes at least) examples of overlapping lines and sentences in poetry represented as HTML-like markup.
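A loose reconstruction of that idea in Python, not the original code: the tag_ranges name, the regex and the matching policy are all assumptions. It records where each tag's content starts and ends as indices into the markup string, and lets a closing tag match the nearest opener of the same name even when that opener is not the innermost one.

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_ranges(markup):
    """Return (tag, start, end) index triples, tolerating misnested tags."""
    open_tags = []  # stack of (tag name, index where the tag's content begins)
    ranges = []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            open_tags.append((name, match.end()))
            continue
        # Search from the top of the stack for the nearest matching opener,
        # so </b> can close a <b> that is not the innermost open tag.
        for i in range(len(open_tags) - 1, -1, -1):
            if open_tags[i][0] == name:
                ranges.append((name, open_tags.pop(i)[1], match.start()))
                break
        # A closing tag with no opener is simply ignored.
    return ranges


print(tag_ranges("<p>he<b>ll<i>o w</b>or</i>ld</p>"))
```

Tags that are never closed (such as <br>) just stay on the stack and produce no range, which is good enough for scraping.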
What sort of examples would improve the article's clarity for you?
Wrt the existing examples, perhaps there should be a small section before that, explicitly called "examples", that contains a minimal summary of those examples to illustrate the concept before the reader delves deeper.
Yeah, I agree with what you are saying. I was viewing this article on mobile and it was hard to spot these examples, because all sections are collapsed by default and none of the sections had the examples stand out at a cursory glance. Now that I am on a laptop I can easily spot them. I also agree that an explicit section named "examples" would be good, especially for mobile reading.
SGML's CONCUR feature (criticized but not described in that Wikipedia article) allows tags to have optional name groups specifying one or more document type names (that must be declared in the prolog) to which the tag applies, and allows tag pairs with disjoint document type name qualifiers to overlap like this:
<(a)x>bla <(b|c)y>bla</(a)x></(b|c)y>
Traditionally used for poetry and lyrics/drama but could also be useful for postal addresses, lyrics in certain types of musical notation, in translations, and maybe specific text apps such as subtitles/tracks for the hearing impaired. Basically, wherever there's a desire to markup text in more than a single hierarchy.
Book-like indices are a notable example. A book index is a list of topics with references to pages. A topic may very easily overlap with another topic and all indexing tools imply that. For example, XSL-FO is an XML-based notation, so in general it is strictly hierarchical: a tag encloses other tags. For index entries, however, it uses disjoint tags like `<index-range-begin>` and `<index-range-end>` linked via the `id/ref-id` attributes.
overlapping b and i elements
<p>he<b>ll<i>o w</b>or</i>ld</p>
Contrary to the article, it can still be represented as a tree, by decomposing the children into their own nodes (so in this case characters become nodes with child nodes expressing what formatting is active, followed by the letter, and then turn off all the active formatting).
No that's just nesting. It's overlapping if the lifetime of a child tag is greater than the lifetime of the parent tag.
For example, if you have two paragraphs and bold the end of one and the start of the next:
<p>hello <b>world</p> <p>this is</b> your captain speaking</p>
Obviously bold is a poor example as you can just terminate and start a new bold without penalty. But if these were more semantic elements like "sections" and "verses" and "lines" then it might not be possible.
It’s actually fiddlier than you may think. Take “Ta” for an example: in most decent fonts, there will be a kerning pair that tightens those characters, tucking the “a” underneath the beam of the “T” a little. The shaper thus needs to follow the actual fonts being used, for kerning purposes, rather than the markup—but this is still visible at the element level, with getBoundingClientRect().
Take this demo (which depends on your default font having such a kerning pair; if it doesn’t, you may need to find one that does and change the font by inserting <html style="font-family:sans-serif"> or similar after the comma):
This shows five variants of “Ta”, with the last two being <b>Ta</b> and <b>T</b><b>a</b>, and prints five numbers to the console, the widths of each <b> element. Numbers one and four (both corresponding to a <b>T</b>) differ if you have a kerning pair such as I describe: for me, the first is 11.7px, and the second 10.73333px (though it overflows that width in its rendering) because of the <b>a</b> that follows it. If you gave bold elements the style `display: inline-block`, it wouldn’t kern the pair and would thus go back to 11.7px.
Most fonts could really use italic-aware kerning (that is, kerning a pair where one glyph is regular and the other italic), but it’s sadly not a thing.
FrontPage used an interval tree to represent misnested HTML tags. The browsers would interpret the tags as covering the contained area using ad-hoc parsing techniques (e.g. simply keeping a running character property stack to handle font, b, i tags). Frontpage tried to preserve all the tags and the structure it saw on input while also producing the same visual output so it had a tougher problem. (I wrote that interval tree code.)
'Tag soup' might not be perfect, but then neither are widely-used languages that - if they spot a single misplaced period or missing comma or a couple of extra characters after a right-curly-bracket - just QUIT. Why that sort of fragility is tolerated, even lauded is beyond me. That HTML just ignores any tag it can't recognize, without stomping a foot, means all the others are available for other uses. Great!
Change my view: given any data storage medium, the smallest granularity of data must also be the most-child element of any markup language. Given the immense overhead of storing markups on a granular level, processing markup therefore must be a perpetual exercise in recursion.
Therefore, any given letter (here as a <char> type) can retain a back reference of parents, so the <char> object retains a hashset of {Line,Word,P} parent type references representing three domains, but really needs to be a Dictionary of key values, the key being the domain name, the value being the parent name, so that would be:
Domain: Poetry, Value: Line
Domain: Book Object Model, Value: Word
Domain: HTML, Value: P Element
We could then ask any letter arbitrarily "what is your Font Style in your HTML context?" and it would be able to walk up the parent P which obtains its style from a CSS markup, and return that correctly. Or "What is your Poem's name in your Poetry context?" and it could recurse up to the Poem element to find its Title.
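A minimal Python sketch of that model, to pin down what I mean. Node, Char and lookup are illustrative names, and the property values are placeholders rather than anything from a real document:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str                                   # e.g. "Line", "Poem", "P"
    props: dict = field(default_factory=dict)   # properties set on this node
    parent: Optional["Node"] = None             # next node up in the same domain


@dataclass
class Char:
    value: str
    parents: dict = field(default_factory=dict)  # domain name -> parent Node

    def lookup(self, domain, prop):
        """Walk up one domain's parent chain until some node defines `prop`."""
        node = self.parents.get(domain)
        while node is not None:
            if prop in node.props:
                return node.props[prop]
            node = node.parent
        return None


# Placeholder hierarchy: one character that lives in two domains at once.
poem = Node("Poem", {"title": "(some title)"})
line = Node("Line", parent=poem)
body = Node("Body", {"font-style": "italic"})
paragraph = Node("P", parent=body)

letter = Char("I", {"Poetry": line, "HTML": paragraph})
print(letter.lookup("HTML", "font-style"))
print(letter.lookup("Poetry", "title"))
```

Each domain is still a plain tree; only the leaves are shared, and each leaf holds exactly one parent per domain.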
Are you claiming the parents will always be unique? Because as the article says, you can easily have this, where going to the right is a parent relationship:
-> Line -> Verse -> Poem
char -> Word
-> Clause -> Sentence -> Poem
You can try adding a further constraint that any given property must have only one path, so you can then recurse over the tree and find the one match, but as your model gets richer you will find that breaks.
And it's that last clause that is the killer for pretty much anything: "As your model gets richer you will find that breaks."
Plus the UI experience for that is awful. "I want to add this property to this Line but you're telling me it's a duplicate for some particular character? What the hell does that mean? I'm not adding a property to the character!" etc. etc.
If I'm understanding you correctly, in that model a Paragraph should have a parent Page (and there should be a clear answer to the question "what page is this paragraph on?"). Is that correct? If so, that doesn't match how most paginated texts are formatted, where paragraphs frequently start on one page but finish on another.
This is how styled SimpleText read-me files worked in classic Mac OS. A normal file was plain text, but styles could be appended based on indices (much like selection and regions work in modern web APIs).
CCS/CCMS (Component Content Systems / Component Content Management Systems) require overlap in order to function. Sort of a "the phone call is coming from inside the house" moment.
It's probably worth noting that I've never seen a CCMS that works well, yet CCS is to-this-day the Holy Grail for most content technologists. It's a little bit mad.
For interest, here is a WYSIWYG standoff property text editor in JS. It allows changes to the text stream and management of annotations (called here "standoff properties").
Many, if not most, computer models represent data as a tree. Some data, however, can't really be represented by a tree, because a "thing" can have multiple parents.
The example in the link:
Example, with lines marked up:
<line>I, by attorney, bless thee from thy mother,</line>
<line>Who prays continually for Richmond's good.</line>
<line>So much for that.—The silent hours steal on,</line>
<line>And flaky darkness breaks within the east.</line>
With sentences marked up:
<sentence>I, by attorney, bless thee from thy mother,
Who prays continually for Richmond's good.</sentence>
<sentence>So much for that.</sentence>
<sentence>—The silent hours steal on, And flaky darkness breaks within the east.</sentence>
If you care about lines and sentences, this is difficult to represent as a tree.
One way to solve this could be to provide separate start/end tags without inner content.
<line-start/><sentence-start/>I, by attorney, bless thee from thy mother,<line-end/>
<line-start/>Who prays continually for Richmond's good.<sentence-end/><line-end/>
Yeah, that's how the linked article does it, but that's ... icky? It's still a token spanning multiple parents, it's just masquerading as a couple of self-closing tags.
Which, of course, is the point of the article, and why this is a difficult problem.
Ah you're right, I should have read the article before commenting, haha. I agree it's not an ideal solution. A disadvantage I imagine is that this syntax pushes the problem onto the parser/consumer to keep track of overlapping regions.
> Milestones are empty elements that mark the beginning and end of a component, typically using the XML ID mechanism to indicate which "begin" element goes with which "end" element.
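To show what that bookkeeping on the consumer side amounts to, here is a small Python sketch that turns the name-start/name-end milestones from the example above into plain text plus offset ranges. The milestone_spans helper is hypothetical, and it pairs milestones by element name (most recent unmatched start) rather than by XML ID as the quote describes, which is a simplification.

```python
import re

MILESTONE = re.compile(r"<(\w+)-(start|end)/>")


def milestone_spans(markup):
    """Return (plain_text, spans); spans are (name, start, end) text offsets."""
    text_parts, spans, pending = [], [], {}
    offset = last = 0
    for match in MILESTONE.finditer(markup):
        chunk = markup[last:match.start()]   # text since the previous milestone
        text_parts.append(chunk)
        offset += len(chunk)
        last = match.end()
        name, edge = match.group(1), match.group(2)
        if edge == "start":
            pending.setdefault(name, []).append(offset)
        elif pending.get(name):
            # Pair the end with the most recent unmatched start of that name.
            spans.append((name, pending[name].pop(), offset))
    text_parts.append(markup[last:])
    return "".join(text_parts), spans


SAMPLE = ("<line-start/><sentence-start/>ab<line-end/>"
          "<line-start/>cd<sentence-end/><line-end/>")
print(milestone_spans(SAMPLE))
```

The sentence range covers both line ranges without either containing the other's tags, which is exactly the information a single tree cannot hold.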