The problem with XHTML is that the "abort on parse failure" behaviour simplifies a computer science problem at the expense of creating a business problem; now if you have an error somewhere in your content generation pipeline, the site goes down. That's a pretty difficult tradeoff given that most CMSes are absolutely not designed in a way that ensures well-formed markup.
Back in the dim and distant past when XHTML was a fashionable term to throw around, Evan Goer did a study of whether sites claiming to serve XHTML were actually doing so. The results were not pretty [1]. Some people took the results as a challenge and tried to ensure they were sending valid XHTML with the correct MIME type, so that browsers would catch fire in the case of a parsing failure. In almost every case it turned out to be possible to break their sites with user-generated content (e.g. searching for XML-invalid characters which were then echoed back onto the page).
So I contend we ran the XHTML experiment pretty thoroughly 15 years ago, and it turns out that it doesn't really work. Once you accept that parsing errors being fatal isn't viable for publishing, you have to have some kind of error recovery system. The one in HTML isn't ideal since it's basically just the codification of many years of improvisation and reverse engineering. Maybe something like XML5 [2] would be better. But figuring out how to move the world in that direction is an unsolved problem. Meanwhile HTML Just Works for most of the people most of the time.
The right solution is pervasive programming-language support for interpolation inside XML/HTML code, like JavaScript now has with JSX, and like HHVM has with XHP. (Interesting that both are Facebook innovations.)
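To illustrate what that buys you, a minimal sketch (a hypothetical React component; the names are made up):
// Interpolated values become text nodes and are never re-parsed as markup,
// so a payload like '<img src onerror=alert(1)>' renders as literal text.
function Comment({ userInput }) {
  return <p className="comment">{userInput}</p>
}
// <Comment userInput={'<img src onerror=alert(1)>'} /> renders the payload as escaped text, not as an <img> element.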
The industry didn't defeat SQL injection through more permissive query parsers, but by generating queries via ORMs and parametrised templates instead of dumb string concatenation.
Also, have you noticed that JSON injection bugs are almost never heard of? That's because in many programming languages this very problem comes pretty much pre-solved before you even add any JSON support to them.
The error is still there, it's just a non-breaking error.
Imagine a Word or Excel document that's trying to silently recover. I do not see programming languages adopting a "never fail" approach:
1 + "2"
//"12"
1 - "2"
// -1
There were PHP sites with MySQL connection errors all around. As an industry we've chosen the Airbrake approach: fail and notify developers. HTML makes it easier to edit plain text, but there is a price: what you load is not what you stored. HTML is a lossy serialization [1] [2] [3].
When a program, not a human, produced the DOM, it should be safe to serialize and deserialize. It could be JSON, XML, or s-expressions. With HTML it is unsafe.
It is very easy to author XHTML. Start DOM-first (HTML shown for brevity):
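A minimal sketch of the idea, using only standard DOM APIs (the content here is made up):
const doc = document.implementation.createHTMLDocument('demo')
const p = doc.createElement('p')
const ul = doc.createElement('ul')
const li = doc.createElement('li')
li.textContent = 'item1'
ul.append(li)
p.append('List: ', ul, ' of items') // nesting <ul> inside <p> is fine in the DOM
doc.body.append(p)
// XHTML serialization preserves the tree exactly as built:
new XMLSerializer().serializeToString(doc.body)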
I can't mess it up when I edit the DOM, and the browser restores it as it was.
I may have <ul> in <p> (we had it in 1978), I may have <a> in <script> (and it works like a comment), I may have <pre>\n and not worry that it disappears each time I save the document. I may have nested <script type="foo"> tags [1].
DOM supports it. XHTML supports it. HTML breaks my content on save-load.
+1 for XHTML. I never understood why people think it's a good idea to avoid closing elements. It's like a dangling brace to me... it nags me and it just doesn't look right. How is it seen as acceptable practice?
It's not a good idea to close all elements in HTML, because the browser does not care about your closing tags: elements will be closed automatically regardless of whether you closed them, and elements you closed will be re-opened automatically.
<p>List: <ul><li>item1</li></ul> of items</p>
becomes
<p>List: </p><ul><li>item1</li></ul> of items<p></p>
and that's probably not what you wanted. So you have to understand the auto-closing behaviour either way. And if you understand it, you can spare yourself the closing tags.
Now with XHTML, that's a different thing. But I think that ship sailed many years ago and HTML is a preferred way to go nowadays.
> you have to understand auto-closing behaviour either way
You do, if you use HTML, but my read of the above two comments was that they would've preferred if the world had stuck on the XHTML path.
> Now with XHTML, that's a different thing. But I think that ship sailed many years ago and HTML is a preferred way to go nowadays.
Actually, XHTML is a (little-known) part of the HTML5 spec [0], so going the strict path is still an option. In the past, this would've required complex content negotiation for media-type backward compatibility, but that's no longer an issue unless a non-negligible % of your visitors are using IE8.
The only remaining issue is that of draconian error handling, which is something browsers would surely have fixed the UX of had the mainstream stayed on the XHTML track, but sadly that never happened. Still, good modern support for server-side validation of well-formed XML documents means this is also less of an issue than it once was (though tbh, still a significant issue imo).
It's only because there's a known structure for HTML. A block-level element cannot be within another block, but you have to know that "p" and "ul" are block-level. Explicit closing does not require you to know the structure and thus makes the processing simpler.
It's as in JavaScript: you can omit the ";" at the end of a statement and the parser will figure it out, but that only makes the parser more complex and introduces subtle differences you have to learn. If the JS parser were stricter, it would be simpler both internally and conceptually.
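A classic example of one of those subtle differences (automatic semicolon insertion):
function getConfig() {
  return        // ASI inserts a semicolon here,
    { ok: true } // so this object is never returned
}
getConfig()     // undefined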
The second sentence is wrong: there's nothing wrong with nested block elements, but you never nest block elements inside inline elements. For example <div> elements are block-level and they are nested all the time. There are some exceptions though such as your example of an <ul> in a <p>. The <p> tag only permits "phrasing content" within itself, and <ul> isn't considered phrasing content.
> they'll be closed automatically regardless of whether you closed them or not.
Actually that is not really true. The HTML5 spec defines which elements are closed automatically when another opening tag is being parsed.
The <p> element has always had a weird flow-root behaviour; that is why it is always closed automatically.
Other than that, iirc it's mostly form-related elements that are also closed automatically, like optgroup, option, select, input and such.
Additionally, it's pretty much only table (same flow-root problem) and body.
I can understand that a lot of people are confused about why that is. But the reason is not the difference between XML and SGML per se (XHTML will simply break if a p is within a p)... it's the flow-root model and the difference in layout behaviour that is specified here, not the notation structure.
So SGML was the better choice, to allow all the PHP crapsites that never test their HTML validity to still run, rather than forcing the end user to (fix?) the XML.
I mean, at some point Opera didn't even work on BBCode-based forum software, so people quickly started abandoning it.
Woah this is actually very nice, very nice indeed.
I'm currently building my own HTML5 parser for my browser, Stealth [1], and I aim to be spec-compliant with it, so this might come in very handy for testing against.
What I kind of miss with SGML as a feature is something similar to XSLT stylesheets that can transform chunks of a website into another chunk.
Currently I'm kind of reinventing the wheel here due to my optimizer having the idea to "upgrade" websites on the fly before they get to the client.
If all websites were XHTML1.1 strict based, that part would have been so much easier.
SGML itself has link process declarations, an additional type of declaration set that can appear in an SGML prolog next to DTDs and that can be used to remap elements (in SGML you can have multiple DTDs and LPDs, pipeline LPDs, and so on). sgmljs uses this and adds templating to capture attributes at call sites for passing these into templates as regular entities, allowing for parametric macro expansion. Basically, if you have eg
<div bla=x>
in your main doc, you can make SGML expand it using
<!DOCTYPE div SYSTEM [
<!ENTITY bla SYSTEM>
]>
<div>
<p>Value of bla is &bla</p>
</div>
honoring escaping/sanitizing etc. LPDs can apply rules in a context-dependent way using an automaton capturing much of core CSS.
Now, for arbitrary markup manipulation (XSLT is Turing-complete), don't tell the HN crowd that SGML has/had Scheme-based DSSSL (precursor of XSLT) ;) My opinion, having done large, nontrivial XSLT projects (including extracting the DTD grammar rules you see on the site from spec text) is that the more complex it gets, the more a general-purpose language with unit testing etc becomes a better choice over XSLT.
Edit: best of luck with your browser project! Don't hesitate to use my code or ask questions (here or on StackOverflow, tagged sgml).
Wow, today I learned! I didn't realize that transformation would take place. I wonder how many people writing HTML are aware of these... your comment basically made me realize I in fact simply don't know HTML. Thanks!
Your linter should warn you about these. E.g. if you use prettier the first `<p>` will be automatically closed on auto-format before the `<ul>`, and the second (closing) `</p>` will cause a parsing error.
Sometimes we want <ul> inside <p> [1]. It works in the DOM, it works in XHTML. And to pre-empt the "you should not want to do it" argument: GML (SGML's predecessor) had a paragraph continuation without indentation so that text can continue after a list [2].
> The pc tag identifies a paragraph continuation -- that is, one or more sentences related by their subject matter to a paragraph which has been interrupted by an address, example, figure, list, or long quotation.
> Usage: The paragraph continuation can occur after the sequence consisting of a paragraph unit followed by an address, example, figure, list, or long quotation.
:p.The subject of a paragraph might be continued through
:sl
:li.an address, a list,
:li.an example or figure, or
:li.a long quotation, :esl
:pc.and continue to be discussed in flowing text.
The discussion could continue indefinitely through
Too long to quote entirely [3]; unfortunately IBM destroyed the online version.
Interesting. Any kind of sanitizer like this seems doomed to fail repeatedly--I like the discussion at http://langsec.org/
I wrote a Semgrep rule (I'm one of the maintainers) to look for calls to the .sanitize() method and suggest the final recommendation in the post, which is to use either RETURN_DOM or RETURN_DOM_FRAGMENT. To use it:
Personally, I find the level of success DOMPurify has had quite amazing. I also thought they were doomed to fail when I first heard about it, but it's had very few bypasses, all things considered.
These XSS sanitization bugs don't seem to ever stop.
The simplest and safest solution for sites would be to implement Content-Security-Policy with a good script-src (with limited sources and without the 'unsafe-inline' escape hatch).
Unfortunately, even HN itself uses a script-src with unsafe-inline, and it allows everything on Cloudflare's CDN...
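For reference, a reasonable baseline looks something like this (the hostname is a placeholder):
Content-Security-Policy: default-src 'self'; script-src 'self' https://static.example.com; object-src 'none'; base-uri 'none'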
I had a situation lately where CSP was not enough because of Safari, so I had to fall back to DOMPurify. It is not always possible to just rely on it, unfortunately.
I think the best solution is CSP _and_ injection mitigations - even without XSS there is still DOM injection which can be equally damaging reputationally.
iirc, IE11 (under Windows 7, specifically) does not support CSP. I don't think CSP mitigates all XSS vectors either (`<a href="javascript:alert(1)">` for example). Sure IE11 is deprecated but that doesn't mean you don't need to account for it when building an enterprise application.
I'm curious if you can provide any details on what / how Safari was exploitable with CSP - https://caniuse.com/?search=content-security-policy indicates that it should be pretty uniform across popular browsers. If you'd prefer a private channel @yoloClin on twitter.
A CSP without unsafe-inline will block your example as well.
I agree that one should still sanitize input, at least for fields which allow HTML*, but it's obvious that XSS filtering/sanitization can introduce XSS as much as prevent it. This article is merely one example; there were enough to make Chrome give up and turn off its XSS filter. So the main defence should be CSP, and sanitization is just a nice-to-have.
* Because sanitized input is often saner than the nonsense users can insert when they are allowed to put in tags. Basically use sanitization as an HTML Tidy with extra filtering. Also for very old browsers.
Sorry, I just re-read my initial comment and I think the way I wrote it was misleading.
My case was about the allow-scripts directive of the sandboxed iframe, which I thought was linked to the csp mechanism, but now that I checked the documentation, it seems that I was wrong.
I basically display a random HTML document in a sandboxed iframe with scripts disabled. When you do so on Chrome and Firefox, the event listeners injected inside the iframe from a script in the parent frame still work, but on Safari they do not, because all the scripts (or events) inside the iframe are disabled.
So rather than relying on this mechanism, I used DOMPurify to filter out all the scripts.
Hm, wouldn't the proper way to defend against such mutation attacks be to run parse-then-serialize in a loop until the output stabilizes? And throw some sort of error if it takes too many iterations. Only after stabilization would you go making any sort of modifications.
Although, I guess, now that I think about it, the fact that you are making modifications means there could still be attacks that rely on mutations that potentially happen after those modifications...
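Something along these lines (a rough sketch, not DOMPurify's actual API for this; the iteration cap is arbitrary):
function sanitizeToFixedPoint(html, maxIterations = 10) {
  let current = html
  for (let i = 0; i < maxIterations; i++) {
    const next = DOMPurify.sanitize(current) // parse, sanitize, re-serialize
    if (next === current) return next        // output has stabilized
    current = next
  }
  throw new Error('sanitizer output did not stabilize')
}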
This would work, and it is something that the Angular DomSanitizer does [1]. But I personally am not a big fan of this solution, as it has a large performance penalty if the sanitized string is huge.
In the beginning, the form element pointer is set to the one with id="outer". Then, a div is being started, and the </form> end tag set the form element pointer to null. Because it’s null, the next form with id="inner" can be created; and because we’re currently within div, we effectively have a form nested in form.
---------
I was under the impression that a closing tag automatically also closed any tags that were currently open, but contained within the tag being closed. Is it not a bug that, when the </form> tag is encountered, the open <div> tag isn't closed at that point?
Under HTML and the DOM (no JavaScript) you can do any ordering of the tags; that's pretty normal. A <strong><div>hello</strong></div> is perfectly valid HTML.
This seems to be the wrong part. If "</math><img src onerror=alert(1)>" was interpreted as part of a #text node, it should have been serialized into "&lt;/math&gt;&lt;img src onerror=alert(1)&gt;" so that it becomes harmless in any subsequent round-trip.
Then it needs to be better aware of what kind of syntax is allowed in CSS, because </math> clearly isn't valid CSS.
Sanitizing HTML isn't just about removing unwanted tags and attributes. Assuming you want to preserve some styling, the sanitizer needs to be aware of the spec for not only HTML but also CSS, JS, SVG, and whatnot. It must ensure that it never emits output that would be invalid in any of those contexts. (JS is easy because you normally just delete it.)
To date, I haven't seen any HTML sanitizer that's as serious about enforcing spec as HTMLPurifier. It strips out everything that isn't in a whitelist and guarantees that the output is syntactically valid. Unfortunately, it's written in PHP, its whitelist is kinda outdated, and nobody seems to want to port it to any other language. But the only alternative is to take dangerous shortcuts.
Not sure strict CSS parsing would have avoided this issue, since you could just surround what the purifier thought was a style text node with /* and */ (which is valid CSS, being a comment)
Sanitizers are not required to keep the original content intact. They absolutely can, and should, strip invalid and/or suspicious content from CSS strings.
Any idea if <script> has similar issues? I've always had it in the back of my mind that HTML parsing must have some problems if you mess with the code inside <script> tags too hard, but I've never gotten motivated enough to try to investigate.
<script> is even more complicated than <style> as far as the HTML spec goes, I think. (There’s parsing weirdness where the behaviour of <!-- in <script> depends on whether </script> followed by --> is present somewhere, for example.) Not sure about whether DOMPurify supports non-executable <script>s.
Oof. I'm not so worried about DOMPurify specifically but I've been thinking surely there's got to be some inconsistency between different browsers or libraries...
const sample = `<form><math><mtext></form><form><mglyph><style></math><img src onerror=alert(1)>`
const fragment = DOMPurify.sanitize(sample, {RETURN_DOM_FRAGMENT: true})
const body = (new XMLSerializer()).serializeToString(fragment)
//...<style></math><img src onerror=alert(1)></style>...
const div = document.createElement('div')
const iframe = document.createElement('iframe')
div.innerHTML = body   // re-parses the serialized string
div.append(fragment)   // inserts the sanitized nodes directly
iframe.srcdoc = body   // also re-parses the serialized string
I think you need to do something more than that, or round trips will give you text that looks like &amp;lt;img src onerror=alert(1)&amp;gt;
> Of course, special characters that are already escaped, i.e. already valid in a text node, don't need to be escaped again.
But this isn't true at all. You could have text that's supposed to look like "<span>" -- just look at the discussion we're having right now. And that text needs to be rendered as "&lt;span&gt;", but it also needs not to be rendered as "&amp;lt;span&amp;gt;". You have to keep track of what you've done to it.
Any decent WYSIWYG editor (which I assume your data is coming from if you're trying to sanitize arbitrary HTML) will render "<span>" as "&lt;span&gt;" in the first place. The sanitizer doesn't need to do anything to it, as the markup is already valid.
You've managed to miss the whole point of the discussion thread and the article above it. Your data is coming from a user. You need to change "<img src onerror=alert(1)>" to "&lt;img src onerror=alert(1)&gt;".
You will not be able to do this if your approach is "doesn't matter; we'll just trust whatever we get".
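A minimal sketch of escaping exactly once, at the output boundary (the helper is made up, not from any particular library):
const escapeHTML = (s) => s.replace(/[&<>"']/g, (c) =>
  ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }[c]))
escapeHTML('<img src onerror=alert(1)>')
// => "&lt;img src onerror=alert(1)&gt;"
// Apply this to raw user text when emitting markup, and never to text that is already escaped.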
Not at all true. As an example, HTML emails may contain <style>, and any webmail client will want to render that, and they’ll definitely want to sanitise the HTML.
Fastmail uses DOMPurify for this purpose, augmenting it with rewriting the selectors in style blocks in order to scope them safely.
(Fastmail also sponsors the DOMPurify bug bounty; I presume the latest mXSS listed at https://www.fastmail.com/about/bugbounty/ corresponds to this thing, and as noted, this issue didn't affect Fastmail. The reason for that is that Fastmail uses the DOM that DOMPurify returns, rather than the problematic and less efficient serialise-and-deserialise approach.)
And this is hardly the only justifiable reason for allowing <style>. A couple of other examples that occur to me are platforms allowing full user-generated and -styled pages, and validation tools.
Remember that with this, each application can choose what it wants to allow or deny. It would be very harmful for DOMPurify to disallow certain sorts of tags just because you can't immediately think up a reason why you might want to use them.
How do people do proper XSS protection and, more importantly, validate against it?
It's effectively a manual process of calling the appropriate escaping (HTML, CSS, JS) at every dynamic output site. Ensuring you are doing it 100% of the time across thousands, if not millions, of calls is daunting.
I have seen security tools that crawl your app and use payloads to test for vulnerabilities, which sounds nice in theory. But in a really complex application, where you need to understand the business rules just to crawl the whole application (and not get caught up on validation errors), how do people ensure that?
In general: never mix user input indistinguishably from control content.
In HTML generation, that means you never dynamically parse HTML generated directly from raw data. E.g. in React, you define your content entirely in terms of components that are explicitly defined in your code (no HTML parsing - they're objects in your code, explicitly built by your code) and content for those components (still no HTML parsing, they directly become text either for attribute params, or for text nodes that sit within your HTML tags).
It's the same as protecting against SQL injection. Don't interpolate user input directly into a SQL string and then parse it. Define/generate the query in your code separately, with explicit parameter placeholders for where dynamic input will go, and then provide variables that can be safely used for those parameters, but which are never parsed as part of the query itself.
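For instance, with a parameterized query (sketched here with node-postgres; the query and variables are made up, and any driver with placeholders works the same way):
const { Client } = require('pg')
async function findUser(userEmail) {
  const client = new Client()
  await client.connect()
  // The user-supplied value travels as a bound parameter, never as query text.
  const result = await client.query(
    'SELECT id, name FROM users WHERE email = $1',
    [userEmail]
  )
  await client.end()
  return result.rows
}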
For HTML, the only time this becomes challenging is when you accept user input that is itself HTML already. In that case, you have to parse it, and you have a big problem.
Don't do that, wherever possible. Accept structured data, accept plain text, accept Markdown (with HTML tags either disabled or _extremely_ carefully limited), but wherever possible, don't accept and output HTML given to you by a user. For similar reasons, don't accept SQL queries uploaded by a user.
DOMPurify exists for the case where you can't do this, but in general it's better, and not that hard, to avoid needing it at all.
Not sure about validation, but Content-Security-Policy is the best tool we have at our disposal right now to prevent XSS - define what content the browser is allowed to load and execute.
I have a feeling it will remain a very manual and diligent process. Be always on top of new techniques and solutions, have a good understanding how everything works in detail and reduce your attack surface by keeping things simple.
I agree with CSP, but as I've commented on another thread I recommend CSP _with_ other mitigation factors due to DOM/HTML injection, and browser support.
You use libraries with safe apis that make it easy to do the right thing and hard to do the wrong thing. Often this means you generate some sort of data structure which gets serialized into html by a library as a last step (e.g. React, but there are lots of other examples).
If you're always using safe APIs, it's really easy to verify that you don't do the wrong thing (e.g. via static analysis, or even grep; browsers are even now supporting Trusted Types to dynamically verify this at runtime).
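A rough sketch of what the Trusted Types route looks like (the policy name and variables are made up; enforcement comes from a `require-trusted-types-for 'script'` CSP directive):
// With enforcement on, assigning a raw string to innerHTML throws;
// only values produced by a registered policy are accepted.
const policy = trustedTypes.createPolicy('sanitizer', {
  createHTML: (input) => DOMPurify.sanitize(input),
})
element.innerHTML = policy.createHTML(userInput)  // allowed
// element.innerHTML = userInput                  // rejected under enforcement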
I remember back in the days when SQL injections were a big thing. I was a newb and had no idea what I was doing, but I was using a Scheme library that created the queries by something akin to format strings (or was it something like quasiquoted s-expressions? It was almost 20 years ago). Anyway, it did the right thing: it escaped everything.
The result? The website of my little Diablo 2 clan was the only one in my group of friends that did not get hacked. Due to me using an obscure library that did the right thing, in a language nobody liked. It wasn't any victory of mine. Had I used PHP I would have been just as pwned as my friends.
>Had I used PHP I would have been just as pwned as my friends.
This is a myth; the issue is if you learn from Google instead of using something like a book. All languages and libraries (that I have used) allow you to run raw SQL if you need to, and all books/good learning materials, when introducing SQL, will explain how to do the correct thing with escaped parameters, or even better, bound parameters.
That was the thing I was trying to convey: what was offered to me was a safe API. The thing all the PHP tutorials of 2001 taught was "glue strings together and it will do what you want".
That was the knowledge offered to us 13-year-olds wanting to write a discussion board for our game groups. I think it is a lovely example of "the simple thing should be the safe thing, and if you need more there should be ample warning". That was not the case with PHP 4.0.
Anecdote: the Swedish PHP book (I recall there being only one) that most of us got taught the string-building approach without any escaping. The oldest things I can find Googling now are from 2007 and are almost all safe, either through PDO or a safe mysqli interface.
But this is not a language issue, it's an educational problem; it depends on luck whether you land on a bad book or tutorial. If the language offers performance, it must give me access to unsafe stuff, like running raw SQL.
Of course. I am trying to find it now, but I recently used a SQLite library where, barring bugs in escaping, there was no way to execute SQL queries in the "simple API" with strings that were not compile-time constants. A dynamically generated string would be refused with a clear error message pointing to the correct part of the manual.
The raw queries were hidden in a sqlite3/DANGEROUS library. Despite doing things like stepping queries, bypassing the statement cache or mucking around and changing parameterized queries I didn't have to touch the DANGEROUS API.
The problem is that most libraries/frameworks also have a raw-output option, and those are valid for a lot of use cases (system-generated IDs and such). I guess one option is never allowing any call to it, even if you know it's valid.
The trouble comes in an existing code base: you may know 95% of those uses are valid, but to pass the check you now have to migrate them all and do QA.
It depends a lot on the library, I guess: does it make it easier to do the right thing or not?
AFAIK the Go template library parses HTML and applies the appropriate escaping depending on context. I think that most sane HTML template libraries will do at least HTML escaping by default. Dynamic CSS and JS are usually rare, so you can pay extra attention to them.
These are meant to be run through the DOMPurify library, which serializes back to a different DOM tree than the input, so the input is not expected to work straight away.
All this tracking of string types (encoded or decoded) makes me wonder if we should have a separate string type AND strong-typing language support for handling HTML strings.
Last I checked, it was only used for stuff provided to the user themselves (so a bypass in DOMPurify would allow you to XSS yourself but not others). However, that was a long time ago, so I'm not sure what the current state is.
Yeah, that's true. We also always use the RETURN_DOM_FRAGMENT / RETURN_DOM options, which avoid the issue according to the article (I don't think we serialize/reparse the output anywhere ourselves). And also, we forbid the 'style' tag, which seems to be required by the exploits (although that is just a lucky coincidence).
> DOMPurify’s job is to take an untrusted HTML snippet, supposedly coming from an end-user, and remove all elements and attributes that can lead to Cross-Site Scripting
The error here is to use a blacklist instead of a whitelist.
If you read the article, you will find that DOMPurify uses a whitelist, that DOMPurify’s model thinks the offending snippet is safe, that nothing in particular is wrong with the model per se, and that the bug is something else entirely.
I disagree; DOMPurify takes a blacklist approach of trying to remove evil, instead of only allowing through application-specific sane things (that it uses whitelists internally is a separate thing).
You need defaults, and that default is an allow-list of sane HTML. If you want, you can initialize DOMPurify with your own allow-list, which might be just bold, italics, underline, and strikethrough.
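Something like this (the tag list is just an example; `dirty` stands for whatever untrusted string you have):
// Restrict DOMPurify to an application-specific allow-list.
const clean = DOMPurify.sanitize(dirty, {
  ALLOWED_TAGS: ['b', 'i', 'u', 's'],
  ALLOWED_ATTR: [],
})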
The fact that their list is not namespace-specific is a bigger issue imo.
I know why I don't like automatically closed elements. XHTML for the win!