Python module that makes working with XML feel like working with JSON

hafabnew · on Aug 23, 2013

Cool, although LXML (mature, fast, reliable, generally awesome) can do this (and more): http://lxml.de/FAQ.html#how-can-i-map-an-xml-tree-into-a-dic...

It also has the 'objectify' API, where you can access XML nodes via regular object access (i.e. `access.nodes[0].like.this`). http://lxml.de/objectify.html

wcarss · on Aug 23, 2013

I have two major gripes with lxml that this library solves, but I agree that for serious projects, lxml is the correct choice.

1) you have to build libxml to use lxml

2) lxml has a large, powerful, complicated API

For 1), a friend of mine had to do an annoying workaround to use lxml on his box, because its limited memory prevented him from building libxml. Because xmltodict is Expat[1]-based, you don't have to build libxml in your environment to use it.

For 2), when I went to write a simple rss-reader project this past weekend, I dreaded going back to lxml. I knew that I'd have to go peruse its huge documentation to answer questions about whether to use etree.XML or etree.fromstring, whether methods I wanted were on Elements or ElementTrees, xpath syntax, custom parsers, etc. If I'd ever seen the objectify API I'd forgotten about it, because there's just so much _other_ stuff in lxml.

I happen to have found xmltodict in a brief search for lxml alternatives. It's in PyPI, so pip grabbed it with no complaints. It installed without building anything. And in less than a minute of glancing at the README, I grokked the API as "pydict = xmltodict.parse(xml_string)". I don't know if there are other things. I never had to find out.

Less than 10 minutes from finding it to forgetting I was reading XML as a source -- really a wonderful project. But if I were doing something 'serious', I'd absolutely use lxml. That large API and those byzantine docs exist for a good reason: they're dealing with XML properly. But sometimes coders just wanna have fun, or build a quick prototype or hack.

[1] http://docs.python.org/2/library/pyexpat.html
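
To make the Expat point concrete, here is a minimal sketch of an XML-to-dict conversion built only on the standard library's `xml.parsers.expat`. It is not xmltodict's actual implementation: the `parse_to_dict` helper and the `@` / `#text` key conventions are assumptions chosen for illustration, and the mapping is lossy (namespaces and the position of mixed text are ignored).

```python
import xml.parsers.expat


def parse_to_dict(xml_string):
    """Convert XML text into nested dicts (simplified, lossy sketch)."""
    # Each stack entry is [tag name, dict of attributes/children, text chunks].
    stack = [[None, {}, []]]

    def start(name, attrs):
        # Attributes get an "@" prefix so they cannot collide with child tags.
        stack.append([name, {"@" + k: v for k, v in attrs.items()}, []])

    def char_data(data):
        stack[-1][2].append(data)

    def end(name):
        _, children, chunks = stack.pop()
        text = "".join(chunks).strip()
        if children:
            value = children
            if text:
                value["#text"] = text
        else:
            # A leaf without attributes collapses to its text (or None).
            value = text or None
        parent = stack[-1][1]
        if name in parent:
            # Repeated sibling tags are collected into a list.
            if not isinstance(parent[name], list):
                parent[name] = [parent[name]]
            parent[name].append(value)
        else:
            parent[name] = value

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end
    parser.Parse(xml_string, True)
    return stack[0][1]


doc = parse_to_dict('<feed id="1"><item>a</item><item>b</item></feed>')
print(doc)
```

Nothing here needs a compiler or a third-party package, which is the practical advantage being described; the price is that anything beyond simple documents needs more care than this sketch takes.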

hafabnew · on Aug 23, 2013

Agree that lxml is certainly quite large and batteries-included, but I've always found I can do pretty much everything I want to just using a few select methods, namely etree and xpath.

E.g. for extracting URLs from a sitemap:

    from lxml import etree

    root = etree.XML(data)

    urls = root.xpath(
        './/sitemap:loc/text()',
        namespaces={
            'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        }
    )

For RSS reading, here's a straightforward example. Obviously you can factor out the index access on each item, and calling `/text()` too.

    from urllib.request import urlopen

    from lxml import etree

    # Fetch the raw feed bytes (Python 3 spelling of the urllib2 call).
    data = urlopen('https://news.ycombinator.com/rss').read()

    root = etree.XML(data)

    for i, item in enumerate(root.xpath('.//item')):
        print(i, item.xpath('title/text()')[0])
        print(item.xpath('description/text()')[0])
        print(item.xpath('link/text()')[0])
        print()

All pretty simple, though XPath selectors can get a bit gnarly at times. The tradeoff is that you can be very expressive with them.
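
One way to factor out the `[0]` indexing and the `/text()` step is `findtext()`, which is part of the ElementTree API that lxml also implements. The sketch below uses the standard library's `xml.etree.ElementTree` and an inline, made-up feed so it runs without a network connection; the feed contents are assumptions.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for a downloaded feed; the contents are made up.
RSS = """<rss version="2.0"><channel>
  <item><title>First</title><link>http://example.com/1</link></item>
  <item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(RSS)

# findtext() returns the text of the first matching child, or the default
# when the child is missing, so no [0] indexing or /text() step is needed.
items = [
    (
        item.findtext("title"),
        item.findtext("link"),
        item.findtext("description", default=""),
    )
    for item in root.iter("item")
]

for i, (title, link, description) in enumerate(items):
    print(i, title, link, description)
```

A missing `description` yields an empty string here instead of an `IndexError`, which is usually what a quick feed reader wants.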

raverbashing · on Aug 23, 2013

Good, I had never seen this transformation tip in LXML

But usually, yes, LXML is "good" meaning it's the least bad way of dealing with XML

Also, it has some idiosyncrasies, like insisting on adding the namespace on tag names, so you end with something like {http://example.com/your.xlsd}.index (I don't remember it exactly and I don't have an example here with me)

toyg · on Aug 23, 2013

Correctly handling namespaced QNames is a requirement, not a bug. It makes things awkward at times, but that's a job for lib writers to provide decent interfaces.

I haven't used LXML in a while, but ElementTree, for example, forces you to use the QName in XPath expressions, which is technically correct but a huge pain; it would be nice if there was a ScrewNamespace option that would allow "simple" searches, although this might blow up in your face one day (when two namespaces define the same element name, and your xpath search brings up elements you didn't really want).
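
For what it's worth, a sketch of both styles with the standard library's `xml.etree.ElementTree`: the strict search binds a prefix to the namespace URI, and the loose one uses the `{*}` wildcard, which assumes Python 3.8 or newer. The sample document is made up; only the sitemap namespace URI comes from the example earlier in the thread.

```python
import xml.etree.ElementTree as ET

XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

root = ET.fromstring(XML)

# A bare name matches nothing, because the elements live in a namespace.
bare = root.findall(".//loc")

# Strict option: bind a prefix of your choosing to the namespace URI.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
strict = [e.text for e in root.findall(".//sm:loc", ns)]

# Loose option (Python 3.8+): "{*}" matches the tag in any namespace,
# so two namespaces defining "loc" would both be returned.
loose = [e.text for e in root.findall(".//{*}loc")]

print(bare, strict, loose)
```

The loose form is convenient for one-off scripts; the strict form is the one that stays correct when a second namespace with the same element name shows up.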

raverbashing · on Aug 23, 2013

Actually the short name worked for the searches, the problem was reading element names from a subtree

It's not so much dodging the requirements as an inconsistency in the API.

TimSAstro · on Aug 23, 2013

I also found the namespacing to be a bit weird, and it took quite a while to grok the documentation. In case anyone wants a working example, I implemented a wrapper to drop the namespacing (resulting in simple objectify attribute access) for one particular XML schema here: https://github.com/timstaley/voevent-parse/blob/master/voepa...

georgebashi · on Aug 23, 2013

That's the "node-name", i.e. the fully qualified name of the tag in its namespace. You probably wanted to ask for the local-name, which in your example would just be "index". Not sure how to do that with LXML, but it's a common mistake people make when dealing with XML.
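
A small sketch of the difference, using the standard library's `xml.etree.ElementTree`, which stores tags in the same `{uri}name` form. The `local_name` helper and the `http://example.com/ns` URI are hypothetical; lxml additionally provides `etree.QName(element).localname` for the same purpose.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<index xmlns="http://example.com/ns"><entry/></index>')


def local_name(tag):
    """Strip a leading "{namespace-uri}" from an ElementTree tag, if present."""
    return tag.rpartition("}")[2]


print(root.tag)              # {http://example.com/ns}index
print(local_name(root.tag))  # index
```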

martinblech · on Aug 23, 2013

There is, however, a small feature in xmltodict that most people overlook: the streaming mode. I actually wrote xmltodict the day I tried to parse a Wikipedia dump: I just couldn't keep it all in memory, but needed something more high-level than SAX.
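
The same goal (records out, bounded memory) can be sketched with the standard library alone. This is not xmltodict's streaming API; it uses `xml.etree.ElementTree.iterparse`, and the `iter_pages` helper and the `page` / `title` / `id` element names are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_pages(source, tag="page"):
    """Yield one small dict per matching element as the parser reaches it."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop the element's children and text once it has been handled;
            # the emptied element itself stays attached to its parent.
            elem.clear()


# Tiny stand-in for a large dump; the element names are made up.
dump = io.BytesIO(
    b"<dump>"
    b"<page><title>A</title><id>1</id></page>"
    b"<page><title>B</title><id>2</id></page>"
    b"</dump>"
)

pages = list(iter_pages(dump))
print(pages)
```

In real use the caller would consume the generator one record at a time instead of calling `list()` on it, since collecting everything defeats the purpose.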

martinblech · on Aug 23, 2013

xmltodict is in no way trying to compete with LXML feature-wise (no support for namespaces yet, just to name one thing). It's just a lightweight approach to roundtrip between XML and JSON documents that worked for my use case, so I decided to share it.

drunkpotato · on Aug 23, 2013

Yes, my first thought when reading the headline was that it would be about LXML.

However, while LXML can do this, and makes it easy, the documentation does not stress this way of using LXML. I like this project's emphasis on simplicity and doing one thing. It's the difference between "You can use LXML to get a dict" vs "Here is how to use xmltodict to get a dict". And it's right there in the name. Emphasis and naming are important when getting started.

aidos · on Aug 23, 2013

LXML is a little unapproachable when you first use it but it's all of the other great things you mentioned too. Now that I've used it a lot I would never consider trading it for something simpler. It can handle any situation you're going to run into. I'd suggest that if people were looking to do anything more than a quick hack they invest an afternoon in learning the LXML API.

JimmaDaRustla · on Aug 23, 2013

I use it as the processor for BeautifulSoup! This should be the default IMHO - the default caused me lots of issues.

chernevik · on Aug 23, 2013

The objectify API looks a lot better than the __getattr__ hacks I've been using for this. Thanks.

toyg · on Aug 23, 2013

There are umpteen versions of this sort (there was even an ActiveState cookbook recipe at one point, I think). The problem is that they tend to be coded up with certain requirements in mind, without really going through XML specs, so inevitably you'll find places where they break down.

In this case, I don't see any namespace in examples nor (more damning) in unit tests, which tells me it's likely going to fail on complex documents with multiple namespaces.

clarkevans · on Aug 23, 2013

In 1999, there was a group of us working on a subset of XML (we called it SML at the time) that had a simpler information model and syntax limitations so it'd be easier to work with. It seems that anyone doing a simple encoding of XML onto JSON might find the deliberations interesting (if quaint and outdated): http://tech.groups.yahoo.com/group/sml-dev/message/0

fosap · on Aug 23, 2013

Today there is MicroXML. http://www.w3.org/community/microxml/

dkhenry · on Aug 23, 2013

I think a better title would be "make working with XML feel like working with native dictionaries". There are many ways of interacting with JSON; dictionaries are only one of them.

For example, you can do XPath-style interaction with JSON in Python, which for many things works out just as good if not better.
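
As a rough idea of what path-style access to JSON can look like, here is a minimal sketch in plain Python. The `select` helper and its `/`-separated path syntax are hypothetical, not the API of any particular library, and it supports only simple key steps.

```python
import json


def select(data, path):
    """Return every value reached by following a '/'-separated key path.

    Lists met along the way are fanned out, loosely like an XPath step
    that matches several sibling elements.
    """
    current = [data]
    for key in path.strip("/").split("/"):
        next_level = []
        for node in current:
            # Fan out lists so each member is tried against the key.
            members = node if isinstance(node, list) else [node]
            for member in members:
                if isinstance(member, dict) and key in member:
                    next_level.append(member[key])
        current = next_level
    # Flatten a final list layer so callers get individual values.
    result = []
    for value in current:
        result.extend(value if isinstance(value, list) else [value])
    return result


doc = json.loads('{"feed": {"item": [{"title": "A"}, {"title": "B"}]}}')
titles = select(doc, "/feed/item/title")
print(titles)  # ['A', 'B']
```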

fr33k3y · on Aug 23, 2013

The XML::Simple Perl module has been around for ages, and its Ruby "copy" of the same name has existed for a long time too.

I've used the perl one and it only makes sense in very simple use-cases.

nmcfarl · on Aug 23, 2013

I’ve used the perl one as well. And while I’ll agree the XML handled needs to be simple, I’d say the use case needn’t be.

At one job I dealt with 10 and 20 Megabyte XML files on a daily basis - and I never thought of throwing XML::Simple at those. The results would be anything but Simple.

On the other hand I’ve run production code for years with an XML::Simple client to an XML-only RESTish API with ~100 endpoints. And that sucker’s still critical infrastructure - and barely being touched. Just humming along.

tootie · on Aug 23, 2013

XML::Simple was glorious for working with 1-100K documents. I've used it on bigger docs and it actually handles them pretty gracefully although I've never tried at scale.

Erwin · on Aug 23, 2013

10 years ago (he was just about 16 then!!), Aaron Swartz wrote this which has been very useful to me: http://www.aaronsw.com/2002/xmltramp/

Since then the builtin interfaces have improved and I use mostly xml.etree.cElementTree

zamalek · on Aug 23, 2013

What about namespaces? If you are not using namespaces you are very likely using XML as the wrong solution for a problem. Still, your library is useful for cleaning up old code that did use XML as the wrong solution. Good work!

eli · on Aug 23, 2013

The overwhelming majority of XML I encounter in the real world does not use namespaces.

anaphor · on Aug 23, 2013

Doesn't this defeat some of the advantages of using XML though? Like XPath?

raverbashing · on Aug 23, 2013

Using XPath is good but it's a small comfort in the bloated and overcomplicated world of XML

I don't see any advantage in using XML to be honest

emn13 · on Aug 23, 2013

I've never understood this criticism. XML is quite simple - perhaps too simple.

If you're really offended by needing to repeat an element name in the closing tag... well... that's not exactly a serious criticism is it?

I mean, you could make various reasonable criticisms - XML sure isn't perfect. You might say the object model maps to documents, but not so well to typical programming constructs such as unordered hashes or ordered lists of unnamed items. You might complain that its data model is too simple in that it can only easily represent printable text-strings, not binary data, numbers, dates etc. You might complain that namespaces are a hassle in many APIs. You might complain that the distinction between attributes and elements is odd and somewhat arbitrary.

But bloated? You're just ranting.

arethuza · on Aug 23, 2013

XML isn't too bad in itself but, a bit like Java, it appears to be one of the favoured tools for those who wish to build the eldritch abominations of the enterprise software world.

I've seen people do terrible mind rotting things with XML (e.g. encoded XML documents as attribute values in other XML documents and nested a few levels deep, using XML documents as keys in databases, using formats that aren't quite valid XML and then there is SOAP....)

haberman · on Aug 23, 2013

I think "bloated" is a reasonable description when a feature that almost no one uses (or even knows the existence of) can DoS an XML parser in 15 lines of code.

http://en.wikipedia.org/wiki/Billion_laughs

And it especially is bloated if you throw in the whole suite of XML-related technologies, such as Schema (you may think that's unfair, but XML advocates frequently point to such technologies as the solution to various deficiencies of pure XML).
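
For readers who have not seen the feature in question, here is a deliberately scaled-down sketch of nested entity expansion, parsed with the standard library's `xml.etree.ElementTree`. It uses three levels with a factor of three, so it expands to only nine copies; the real attack applies the same trick with more levels and a factor of ten to reach about a billion copies. The entity names are illustrative.

```python
import xml.etree.ElementTree as ET

# Each entity repeats the previous one three times: 1 -> 3 -> 9 copies.
DOC = """<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>"""

root = ET.fromstring(DOC)
expanded = root.text

# One short reference in the body becomes nine copies of "lol".
print(len(expanded))  # 27
```

The growth is exponential in the number of levels, which is why a document of a dozen or so lines can exhaust a parser that expands entities without limits.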

Goladus · on Aug 23, 2013

I agree with the sentiment of this post but I believe the parent is specifically referring to "bloat" that comes from XML being "perhaps too simple".

  For example:

     <e1> first <e2> second </e2> third </e1>

If you're parsing this into a data structure, how do you store "first" and "third"? There are a few different, very similar options. This ambiguity means XML parsers account for this differently and a particular programmer is going to have to sort through the various options to find what he needs (dom/etree/xpath/sax/etc). JSON (mostly) forces the author to decide how it should be parsed before encoding the data.

I'd say it's unfair to say XML is bloated, but it's reasonable to say that the "world of XML parsing" is bloated and confusing, especially if you're a programmer looking for a data serialization format.
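
As one concrete answer among the several possible, this is how the standard library's `xml.etree.ElementTree` stores the example above: text before the first child goes on the parent's `text`, and text following a child goes on that child's `tail`.

```python
import xml.etree.ElementTree as ET

e1 = ET.fromstring("<e1> first <e2> second </e2> third </e1>")
e2 = e1[0]

# Text before the first child lives on the parent element.
print(repr(e1.text))  # ' first '
# Text inside the child lives on the child.
print(repr(e2.text))  # ' second '
# Text after the child's closing tag lives on the child's "tail".
print(repr(e2.tail))  # ' third '

# itertext() walks all the pieces in document order.
joined = "".join(e1.itertext())
print(repr(joined))
```

A DOM parser would instead give `e1` three child nodes (text, element, text), which is exactly the kind of difference being described.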

mbell · on Aug 23, 2013

> XML is quite simple - perhaps too simple.

I disagree pretty strongly with that statement: http://www.jelks.nu/XML/xmlebnf.txt

XML's grammar is far from simple.

bct · on Aug 23, 2013

A significant portion of that grammar is for parsing DTDs, which isn't really a fair inclusion.

mbell · on Aug 24, 2013

jbaiter · on Aug 23, 2013

> I don't see any advantage in using XML to be honest

Well, it's great for what it was intended for: structured documents. I work at a library, where I'm dealing with thousands of digitized books and I haven't yet found a proper non-XML alternative to something like TEI.

sigzero · on Aug 23, 2013

I was going to pipe up on this. XML is for documents. JSON is for data. Sadly, the world isn't black and white like that.

Dru89 · on Aug 23, 2013

I haven't had much of a chance to play with it yet, but I wonder how it would handle a query to find all the "baz" elements for a structure like this: http://pastebin.com/m7qTy8kw. Does it just support a loop, or possibly something similar to xpath like /document/foo/bar/baz to get an array of all of them. (Perhaps document["foo"]["bar"]["baz"], though I think that would break other things.)
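
A sketch of both options, under stated assumptions: the sample document is hypothetical (the pastebin's contents are not reproduced here), the path search uses the standard library's `xml.etree.ElementTree`, and the dict is hand-written in the shape a converter would plausibly produce, with a repeated tag becoming a list. The `as_list` helper is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with "baz" elements spread over several parents.
XML = """<document>
  <foo><bar><baz>1</baz><baz>2</baz></bar></foo>
  <foo><bar><baz>3</baz></bar></foo>
</document>"""

root = ET.fromstring(XML)

# Path search, relative to the root element, so "document" is omitted.
via_path = [baz.text for baz in root.findall("foo/bar/baz")]


def as_list(value):
    """Normalise 'one child' versus 'many children' to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hand-written dict form of the same data: a repeated tag is a list,
# a single occurrence is a plain value.
as_dict = {"document": {"foo": [
    {"bar": {"baz": ["1", "2"]}},
    {"bar": {"baz": "3"}},
]}}

via_loop = [
    baz
    for foo in as_list(as_dict["document"]["foo"])
    for bar in as_list(foo["bar"])
    for baz in as_list(bar["baz"])
]

print(via_path, via_loop)
```

Plain chained indexing such as `as_dict["document"]["foo"]["bar"]["baz"]` fails here because `foo` is a list, which is the breakage suspected in the question.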

wulczer · on Aug 23, 2013

We wrote a similar utility for our ORM-for-SaaS library some time ago, using lxml if available and falling back to ElementTree:

https://github.com/ducksboard/libsaas/blob/master/libsaas/xm...

Bear in mind that both approaches are lossy - trying to support every XML quirk would quickly lead to reimplementing the libxml wheel...

cronin101 · on Aug 23, 2013

The Ruby equivalent would be XmlSimple: http://xml-simple.rubyforge.org/

eCa · on Aug 23, 2013

And the Perl equivalent is XML::Simple: https://metacpan.org/module/XML::Simple

fuzzix · on Aug 23, 2013

I would not recommend XML::Simple, there are various cases (not even that uncommon - say having an attribute and an element of the same name) where it breaks down.

It may be awkward to use at times, but XML::LibXML is robust and consistent.

brianmcconnell · on Aug 23, 2013

I work with a lot of translation services. While the machine translation services are easy to deal with, the professional translation agencies _love_ XML. They also generally know f--- all about building usable web APIs. This looks pretty useful.

lechevalierd3on · on Aug 23, 2013

This reminds me of PHP's SimpleXML (http://php.net/manual/en/book.simplexml.php), which belongs to the good parts of PHP.

larvaetron · on Aug 23, 2013

I used this to build http://showhen.me and it worked very well for my use case.

level09 · on Aug 23, 2013

Does this mean we can simply convert any xml api to a json equivalent by just json.dumps(xmltodict.parse(xml)) ?

I am going to give it a try.

watchdogtimer · on Aug 23, 2013

I used this to parse some XML when creating a shipping calculator. It was a huge time-saver.

pknerd · on Aug 23, 2013

I recently used it for a project and I am quite a happy consumer of this library.

mathattack · on Aug 23, 2013

This seems so obvious that I wonder why it wasn't done before. Well done!

martinblech · on Aug 23, 2013

Thanks!

vittore · on Aug 24, 2013

There are two older similar projects - beautiful soup and untangle.

xmjee · on Aug 24, 2013

Made a port for FreeBSD, thanks much!

bonemachine · on Aug 23, 2013

Which is a pity, because... XML is nothing like JSON.