Defusedxml – defusing XML bombs and other exploits (github.com/tiran)
92 points by gudzpoz 54 days ago | 20 comments



Back in the day… I saw my first xml/html exploit by another script kiddie on AOL. I was like wtf and began trying to crash my own account. A simple quote mismatch with a bunch of ampersands eventually did it (something like 1000 nested ampersand escape sequences). Anyway, I was pretty proud of what I built, so I found a guy smarting off in an AOL chat and decided to bomb him. And yeah, he was a moderator for AOL… so at 12 years old I got our family account blocked, and my dad had an uncomfortable call with the AOL reps about how his son was hacking.

Ah those were the days


Fascinating reading:

> The majority of developers are unacquainted with features such as processing instructions and entity expansions that XML inherited from SGML. At best they know about <!DOCTYPE> from experience with HTML but they are not aware that a document type definition (DTD) can generate an HTTP request or load a file from the file system.

I was one of them!
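
For anyone else who was in that group, here is a minimal sketch of what entity expansion does, using only the Python standard library. The tiny document and the names `TINY`, `expanded_text` and `bomb_size` are made up for illustration, and the arithmetic assumes the classic nested pattern where each entity references the previous one a fixed number of times.

```python
import xml.etree.ElementTree as ET

# Harmless one-level version of the nested-entity pattern:
# entity "b" references entity "a" three times.
TINY = '<!DOCTYPE r [<!ENTITY a "lol"><!ENTITY b "&a;&a;&a;">]><r>&b;</r>'


def expanded_text(doc: str) -> str:
    """Parse the document and return the root element's expanded text."""
    return ET.fromstring(doc).text


def bomb_size(levels: int, fanout: int, payload_len: int) -> int:
    """Characters produced when each of `levels` entities references
    the previous one `fanout` times (hypothetical helper)."""
    return payload_len * fanout ** levels


print(len(expanded_text(TINY)), bomb_size(1, 3, 3))
print(bomb_size(9, 10, 3))
```

The growth is exponential in the nesting depth: nine levels of ten references each, with a three-character payload, already work out to three billion characters from a very small input.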


Developers are even less aware that SGML has (and always had) quantities in the SGML declaration, which among other things make it possible to restrict the nesting/expansion level of entities (and hence to counter EE attacks without resorting to heuristics).

Regarding DOCTYPE and DTDs, browsers at best made use of those to switch into or out of "quirks mode", on seeing special hardcoded public identifiers but ignored any declarations. WHATWG's cargo cult "<!DOCTYPE html>" is just telling an SGML parser that the "internal and external subset is empty", meaning there are no markup declarations necessary to parse HTML which is of course bogus when HTML makes abundant use of empty elements (aka void/self-closing elements in HTML parlance), tag omission, attribute shortforms, and other features that need per-element declarations for parsing. Btw that's what defines the XML subset of SGML: that XML can always be parsed without a DTD, unlike HTML or other vocabularies making use of above stated features.

Keep in mind SGML is a markup language for text authoring, and it would be pretty lame for a markup language to not have text macros (entities). In fact, the lack of such a basic feature is frequently complained about in browsers. The problems came when people misused XML for service payloads or other generic data exchange. Note SOAP did forbid DTDs, and stacks checked for presence of DTDs in payloads. That said, XML and XML Schema with extensive types for money/decimals, dates, hashes, etc. is heavily used in eg ISO 20022 payments and other financial messages, and to this date, there hasn't evolved a single competitor with the same coverage and scope (with the potential exception of ASN.1 which is even older and certainly more baroque).


> Regarding DOCTYPE and DTDs, browsers at best made use of those to switch into or out of "quirks mode", on seeing special hardcoded public identifiers but ignored any declarations.

Not when processing XML MIME types. In modern browsers that mostly means SVG files, but I think XHTML is still possible.

(Modern) HTML is neither SGML nor XML, so it doesn't follow the rules of either.


"Modern" WHATWG HTML is still following SGML rules to the letter in its dealings with tag inference and attribute shortforms ([1]). Which isn't surprising when it's supposed to hold up backward compat. To say that "HTML is not SGML" is a mere political statement so as not be held accountable to SGML specs. But (the loose group of Chrome devs and other individuals financed by Google to write unversioned HTML spec prose that changes all the time, and that you're calling "modern HTML" even though it doesn't refer to a single markup language) WHATWG had actually better used SGML DTDs or other formal methods, since their loose grammar presentation and its inconsistent, redundant procedural specification in the same doc is precisely were they dropped the ball with respect to the explicitly enumerated elements on which to infer start- and end-element tags. This was already the case with what became W3C HTML 5.1 shortly after Ian Hickson's initial HTML 5 spec (which captured SGML very precisely) ([1]). But despite WHATWG's ignorance, even as recent as two or three years ago, backward compatibility was violated [2]. Interestingly, this controversity (hgroup content model) showed up in a discussion about HTML syntax checkers/language servers just the other day ([3]).

Where HTML did violate SGML was already when CSS and JS were introduced, to prevent legacy browsers from displaying inline CSS or JS as content. The original sin was placing these into content rather than into attributes or strictly into external resources in the first place.

Regarding SVG and XHTML, note browsers basically ignore most DTD declarations in those.

[1]: XML Prague 2017 proceedings pp. 101 ff. available at <https://archive.xmlprague.cz/2017/files/xmlprague-2017-proce...>

[2]: <https://sgmljs.net/blog.html>

[3]: <https://lobste.rs/s/o9khjn/first_html_lsp_reports_syntax_err...>


> "Modern" WHATWG HTML is still following SGML rules to the letter...To say that "HTML is not SGML" is a mere political statement so as not be held accountable to SGML specs.

That is self-contradictory and makes no sense. If it's following SGML to the letter, then there is nobody to be held accountable for violating the SGML spec and hence nobody to hide behind "political statements".

You can't have this both ways.

> Regarding SVG and XHTML, note browsers basically ignore most DTD declarations in those.

They listen to DTDs for entity references and default attribute values. I'd hardly call that ignoring.


Most of these exploits are so famous that common xml processors have disabled the underlying features.

So in practice you probably don't have to worry too much as long as you don't enable optional features in your XML library. (There are probably exceptions.)
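
To make "optional features" concrete, here is a small sketch with Python's stdlib `xml.sax`, assuming Python 3.7.1 or later, where the docs say the SAX parser stopped processing external general entities by default. The helper name `external_entities_enabled` is hypothetical; nothing here parses any input, it only inspects and flips the feature flag.

```python
import xml.sax
from xml.sax.handler import feature_external_ges


def external_entities_enabled(parser) -> bool:
    """Report whether the SAX parser would process external general entities."""
    return bool(parser.getFeature(feature_external_ges))


parser = xml.sax.make_parser()

# Out of the box (assumption: Python 3.7.1+), the feature is off.
default_setting = external_entities_enabled(parser)

# Opting in takes an explicit call; this is the step to avoid on untrusted input.
parser.setFeature(feature_external_ges, True)
opted_in_setting = external_entities_enabled(parser)

print(default_setting, opted_in_setting)
```

Other libraries expose different switches, so this only illustrates the general pattern of a risky feature being off until you ask for it.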


> I was one of them!

I'm still one of them!


This is largely historic. I had lengthy discussions about this with expat's maintainer.

expat, the xml library underlying python's etree and other xml interfaces, has either mitigated these standard xml vulnerabilities or disables the dangerous features by default.

The python docs are still a bit confusing there, but if you look at this table: https://docs.python.org/3/library/xml.html#xml-vulnerabiliti...

While this table has a lot of "Vulnerable" in it, they all come with footnotes saying that up-to-date versions of expat are not vulnerable.

So... if you want to have more secure xml parsing in python, make sure you use an up-to-date expat library or one where security fixes have been backported. You don't need anything else.
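
A rough way to check which expat your interpreter is linked against, as a sketch. The 2.4.1 threshold is taken from the footnotes on that docs page as I read them, and the helper name `expat_is_patched` is made up. A plain version comparison cannot see distro backports, so a `False` here does not prove the library is vulnerable.

```python
import pyexpat

# Assumption: the docs footnotes name Expat 2.4.1 as the first release that is
# not vulnerable to billion laughs / quadratic blowup.
MIN_PATCHED = (2, 4, 1)


def expat_is_patched(version=pyexpat.version_info, minimum=MIN_PATCHED) -> bool:
    """Compare an expat version tuple against the assumed minimum."""
    return tuple(version) >= tuple(minimum)


print(pyexpat.EXPAT_VERSION, expat_is_patched())
```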


DefusedXML is an amazing piece of code.

This being said, many of the mitigations it enables are now also available by default in many “standard” libraries. For example, bandit will often tell you to not use lxml in Python, but instead use defusedxml. However, modern versions don’t suffer the same issues at all, and this is a case where automatically following the advice of the linter/SCA is not a great idea.


Do you mean that it is, in fact, a mistake to use defusedxml instead of lxml in Python?


From the author themselves, 6 years ago:

> defusedxml.lxml is no longer needed and supported. Nowadays libxml2 has builtin limitation for entity expansion.

https://github.com/tiran/defusedxml/issues/25#issuecomment-4...


Note that this is not enabled by default, although there is an upper bound on tree size which does limit the reach of the issue.

See https://lxml.de/FAQ.html#is-lxml-vulnerable-to-xml-bombs for more about the tuning knobs.


OK, so the defusedxml.lxml submodule is deprecated and one should use the other APIs from defusedxml instead. That does not mean that defusedxml in its entirety would be useless.


libxml2 segfaults on me whenever I give it vaguely complicated xsl templates so I'm doubtful about how effective that handling will be.


If you’re trying to use it for lxml then yes, it was only ever experimental and has been deprecated (it also failed to define some interfaces correctly, causing issues).

If you’re using it over the stdlib then no.


> XML Bomb

This reminds me of Zip Bomb [1], aka, Zip of Death (ZOD) [2]

1. https://en.m.wikipedia.org/wiki/Zip_bomb

2. https://github.com/iamtraction/ZOD


I’ve always appreciated their drop-in replacement support. It’s so nice to just change an import and move on. I’ve used it on multiple legacy projects with great success: never a single compatibility issue. Great project!


Does `lxml` match `etree` in the table?


No. etree is xml.etree.ElementTree.



