I’m surprised to see the highlights don’t include another common detail of the parsing algorithm that often trips people up: table rows and cells (tr/th/td) must be in one of thead/tbody/tfoot. If they’re not, they’re implicitly nested into a tbody. As in:
<table>
<!-- <tbody> -->
<tr>
<th>Column one</th>
<th>Column two</th>
</tr>
<tr>
<td>Row one col one</td>
<td>Row one col two</td>
</tr>
<!-- </tbody> -->
</table>
I’ve frequently seen it cause a variety of issues with VDOM libraries, and even plain DOM libraries with a notion of declarative templates, ranging from hydration mismatch logs (meh) to actual logic errors (corruption of the real DOM when nodes aren’t where they’re expected to be).
Other implied/omitted tags like body can cause similar issues too, but I think that’s become a far less common “mistake” (all of these are totally valid since at least HTML5) in recent years.
Another interesting table quirk: tr/td/th outside of a <table> will never appear in the DOM. You can make up your own tags and they'll appear anywhere, but those three are magic and can only exist inside a table.
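Both table quirks can be observed outside a browser with html5lib, a Python implementation of the WHATWG parsing algorithm (a third-party package, `pip install html5lib`). A minimal sketch:

```python
# Demonstrates two HTML tree-construction quirks using html5lib,
# a Python implementation of the WHATWG HTML parsing algorithm.
import html5lib

# 1. <tr>/<td> directly inside <table> get wrapped in an implicit <tbody>.
tree = html5lib.parse("<table><tr><td>x</td></tr></table>",
                      namespaceHTMLElements=False)
table = tree.find(".//table")
assert [child.tag for child in table] == ["tbody"]

# 2. A <td> outside any <table> is dropped entirely: its text content
#    survives, but the element itself never reaches the DOM.
orphan = html5lib.parse("<div><td>orphan</td></div>",
                        namespaceHTMLElements=False)
assert orphan.find(".//td") is None
```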
Forms are also weird: if you leave off the closing tag, an implicit one is included in the DOM. However, if you have inputs further down the page, technically outside the form, they are still included in the submitted form data.
Also fun: you can't have a form inside a form, but last time I ran into this, sticking a form inside a form inside a form still left you with a form inside a form in the DOM anyway.
Perhaps a more intuitive name would be "round-trip serialization HTML". That is, if you use the browser to parse and print some HTML, it matches the source code.
Or in other words, it's formatted the same way that the browser would do it. So, you use the browser to pretty-print the HTML page, and save the code as the source. It's not hard at all and could be done automatically.
Round-trip tests are often used to check that a deserialization routine outputs data that can be serialized again and no data is lost. It even lets you change the serialization format, provided that you change the parser and printer to match.
I expect these sorts of tests are a lot more useful with fuzzing, though. Finding one example that works mostly just tells you that the browser's HTML printing code isn't completely broken; a single test of that sort is only useful for catching stupid bugs quickly.
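The round-trip property can be sketched in a few lines of Python, with JSON standing in for HTML since the idea is format-agnostic (all names here are made up for illustration):

```python
import json
import random
import string

def round_trips(value) -> bool:
    """A value is a serialization fixed point if parsing its printed
    form reproduces the value exactly: parse(print(x)) == x."""
    return json.loads(json.dumps(value)) == value

# One hand-picked example only shows the printer isn't completely broken...
assert round_trips({"a": [1, 2, None], "b": "text"})

# ...so fuzz random inputs to get meaningful coverage.
def random_value(depth=0):
    kinds = [
        lambda: random.randint(-1000, 1000),
        lambda: "".join(random.choices(string.printable, k=5)),
        lambda: None,
    ]
    if depth < 2:  # bound nesting so generation terminates
        kinds.append(lambda: [random_value(depth + 1) for _ in range(3)])
    return random.choice(kinds)()

for _ in range(100):
    assert round_trips(random_value())
```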
This is called print-read consistency in the Lisp world: an object is printed in such a way that the syntax can be read to produce a similar object, or else is given a deliberately unreadable notation like #<...>, where the #< combination is required to produce a read error.
In Python, there is a distinction between the text representation of an object, and the result of converting the object to a string. Classes can implement both methods independently, and it's not uncommon to have a repr method that returns something that you could (at least in theory) evaluate as literal Python code. This is very useful for debugging and logging, although not nearly as cool or powerful as the Lisp equivalent.
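A minimal illustration of that split (the `Point` class is hypothetical):

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):
        # Aim for something that eval()s back into an equal object.
        return f"Point({self.x!r}, {self.y!r})"

    def __str__(self):
        # Human-friendly form for display.
        return f"({self.x}, {self.y})"

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

p = Point(3, 4)
print(str(p))             # (3, 4)
print(repr(p))            # Point(3, 4)
assert eval(repr(p)) == p  # the repr round-trips via eval
```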
My favorite example of the technical failings of HTML: https://research.securitum.com/mutation-xss-via-mathml-mutat... is an HTML sanitizing vulnerability that came about because some HTML not only doesn't survive a parse-stringify cycle, but the generated DOM tree does not survive a stringify-parse cycle!
There actually may be! Depending on what you’re trying to do and what’s inconsistent between your markup and the actual DOM. As noted in my earlier comment, implicit insertion/wrapping of certain elements can cause structural changes which lead to actual code errors or unexpected behavior.
> the real reason to code in Fixed-Point HTML is simply the satisfaction of knowing that you and the browser are in total agreement about the HTML.
Interesting idea. I've been trying to achieve something similar, but in reverse: rather than make my source match the browser, make the browser match my source by making it not ignore spacing.
i.e. the basics are `white-space: pre;` on the body element and fixed-width, fixed-size fonts. But I still want an HTML document so I can opt in to HTML where it matters. My reasons are: A) avoiding a pre-processor and build-toolchain complexity, sticking to nice simple static files; B) getting something similar to WYSIWYG, but as source code; and C) I like fixed-width fonts and plain-text formatting (reducing decisions is helpful for focus).
Before now I've explicitly reduced the size of my HTML docs (nothing critical/production-facing, all passion projects) by omitting certain HTML tags (e.g. the DOCTYPE, closing tags, etc.) because I know modern browsers will still render them correctly.
This means there are minuscule savings from a bandwidth-serving perspective. I wonder what the trade-off is between the HTTP call and document parse/paint.
E.g. is it correct to assume the browser will parse/paint the HTML content, fixing incorrectly closed tags on the fly, faster than the few extra milliseconds it would take to serve fixed-point HTML from the server?
Thanks [I'm the author]. I tested with Chrome 105 on macOS and it succeeded. Possibly there are OS/plugin/etc issues?
Of course, I know there is no guarantee that every browser's innerHTML implementation will produce exactly the same result, but so far I haven't found any variation (Chrome, FF, Safari, Edge).
In Firefox 105.0.1 on macOS, the button also always fails when I click it.
EDIT:
In my case, it appears to be some extra "<div style=\"position: static !important;\"></div>" text added before the closing </body> tag. I suspect this is introduced by a plugin, probably LastPass.
Doesn’t make sense. What’s wrong with <br />? It’s a hell of a lot easier to parse than having an exception for <br>, which is then transformed into <br></br>.