Python module that makes working with XML feel like working with JSON (github.com/martinblech)
186 points by dagw on Aug 23, 2013 | 51 comments



Cool, although LXML (mature, fast, reliable, generally awesome) can do this (and more): http://lxml.de/FAQ.html#how-can-i-map-an-xml-tree-into-a-dic...

It also has the 'objectify' API, where you can access XML nodes via regular object access (i.e. `access.nodes[0].like.this`). http://lxml.de/objectify.html


I have two major gripes with lxml that this library solves, but I agree that for serious projects, lxml is the correct choice.

1) you have to build libxml to use lxml

2) lxml has a large, powerful, complicated API

For 1), a friend of mine had to do an annoying workaround to use lxml on his box, because the machine didn't have enough memory to build libxml. Because xmltodict is Expat[1]-based, you don't have to build libxml in your environment to use it.

For 2), when I went to write a simple rss-reader project this past weekend, I dreaded going back to lxml. I knew that I'd have to go peruse its huge documentation to answer questions about whether to use etree.XML or etree.fromstring, whether methods I wanted were on Elements or ElementTrees, xpath syntax, custom parsers, etc. If I'd ever seen the objectify API I'd forgotten about it, because there's just so much _other_ stuff in lxml.

I happen to have found xmltodict in a brief search for lxml alternatives. It's in PyPI, so pip grabbed it with no complaints. It installed without building anything. And in less than a minute of glancing at the README, I grokked the API as "pydict = xmltodict.parse(xml_string)". I don't know if there are other things. I never had to find out.

Less than 10 minutes from finding it to forgetting I was reading XML as a source -- really a wonderful project. But if I were doing something 'serious', I'd absolutely use lxml. That large API and those byzantine docs exist for a good reason: they're dealing with XML properly. But sometimes coders just wanna have fun, or build a quick prototype or hack.

[1] http://docs.python.org/2/library/pyexpat.html


Agree that lxml is certainly quite large and batteries-included, but I've always found I can do pretty much everything I want to just using a few select methods, namely etree and xpath.

E.g. for extracting URLs from a sitemap:

    from lxml import etree

    root = etree.XML(data)

    urls = root.xpath(
        './/sitemap:loc/text()',
        namespaces={
            'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        }
    )

For RSS reading, here's a straightforward example. Obviously you can factor out the index access on each item, and calling `/text()` too.

    from urllib.request import urlopen

    from lxml import etree

    # Fetch the raw feed bytes and parse them into an element tree.
    data = urlopen('https://news.ycombinator.com/rss').read()

    root = etree.XML(data)

    # Each xpath() call returns a list, hence the [0] on every lookup.
    for i, item in enumerate(root.xpath('.//item')):
        print(i, item.xpath('title/text()')[0])
        print(item.xpath('description/text()')[0])
        print(item.xpath('link/text()')[0])
        print()

All pretty simple, though xpath selectors can get a bit gnarly at times. The tradeoff is that you can be very expressive with them.


Good, I had never seen this transformation tip in LXML

But usually, yes, LXML is "good", meaning it's the least bad way of dealing with XML

Also, it has some idiosyncrasies, like insisting on adding the namespace to tag names, so you end up with something like {http://example.com/your.xlsd}index (I don't remember it exactly and I don't have an example here with me)


Correctly handling namespaced QNames is a requirement, not a bug. It makes things awkward at times, but that's a job for lib writers to provide decent interfaces.

I haven't used LXML in a while, but ElementTree, for example, forces you to use the QName in XPath expressions, which is technically correct but a huge pain; it would be nice if there was a ScrewNamespace option that would allow "simple" searches, although this might blow up in your face one day (when two namespaces define the same element name, and your xpath search brings up elements you didn't really want).
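
For anyone who hasn't hit this yet, here is a minimal sketch of what the qualified-name requirement looks like with the standard library's ElementTree. The two-entry sitemap is made up for illustration; the namespace URI is the sitemap one from the example further up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical two-entry sitemap, used only for illustration.
data = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

root = ET.fromstring(data)

# A bare name matches nothing, because every element lives in the namespace.
unqualified = root.findall(".//loc")

# Either spell out the qualified name in {uri}name form...
clark = [el.text for el in root.findall(".//{%s}loc" % SITEMAP_NS)]

# ...or map a prefix of your choosing to the namespace URI.
prefixed = [el.text for el in root.findall(".//sm:loc", {"sm": SITEMAP_NS})]
```

The prefix (`sm` here) is arbitrary; only the URI has to match the document.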


Actually the short name worked for the searches, the problem was reading element names from a subtree

It's not so much about dodging the requirement as about an inconsistency in the API.


I also found the namespacing to be a bit weird, and it took quite a while to grok the documentation. In case anyone wants a working example, I implemented a wrapper to drop the namespacing (resulting in simple objectify attribute access) for one particular XML schema here: https://github.com/timstaley/voevent-parse/blob/master/voepa...


That's the "node-name", i.e. the fully qualified name of the tag in its namespace. You probably wanted to ask for the local-name, which in your example would just be "index". Not sure how to do that with LXML, but it's a common mistake people make when dealing with XML.
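
A rough sketch of getting the local name out of an ElementTree-style tag, using the standard library rather than lxml. `local_name` is a hypothetical helper, and the namespace URI is just the placeholder from the comment above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the part of a {uri}name tag after the '}', or the tag itself."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring('<doc xmlns="http://example.com/your.xlsd"><index/></doc>')
child = root[0]

qualified = child.tag          # namespace URI in braces, then the name
short = local_name(child.tag)  # just the name
```

As far as I recall, lxml offers the same thing without string slicing through `etree.QName(element).localname`.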


However, there's a small feature in xmltodict that most people overlook: the streaming mode. I actually wrote xmltodict the day I tried to parse a Wikipedia dump. I just couldn't keep it all in memory, but I needed something more high-level than SAX.
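
To illustrate the general idea of streaming a large document one record at a time, here is a sketch using the standard library's `iterparse`. This is not xmltodict's streaming API, only an analogous technique; the tiny in-memory "dump" and the `iter_titles` helper are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a large dump; a real one would be opened from disk.
dump = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)


def iter_titles(source):
    """Yield each page title, emptying every page element once it has been read."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            yield elem.findtext("title")
            elem.clear()  # drop the page's children so they can be freed


titles = list(iter_titles(dump))
```

The point is that only one `<page>` subtree is fully populated at any moment, instead of the whole document.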


xmltodict is in no way trying to compete with LXML feature-wise (no support for namespaces yet, just to name one thing). It's just a lightweight approach to roundtripping between XML and JSON documents that worked for my use case, so I decided to share it.


Yes, my first thought when reading the headline was that it would be about LXML.

However, while LXML can do this, and makes it easy, the documentation does not stress this way of using LXML. I like this project's emphasis on simplicity and doing one thing. It's the difference between "You can use LXML to get a dict" vs "Here is how to use xmltodict to get a dict". And it's right there in the name. Emphasis and naming are important when getting started.


LXML is a little unapproachable when you first use it but it's all of the other great things you mentioned too. Now that I've used it a lot I would never consider trading it for something simpler. It can handle any situation you're going to run into. I'd suggest that if people were looking to do anything more than a quick hack they invest an afternoon in learning the LXML API.


I use it as the processor for BeautifulSoup! This should be the default IMHO - the default caused me lots of issues.


The objectify API looks a lot better than the __getattr__ hacks I've been using for this. Thanks.


There are umpteen versions of this sort (there was even an ActiveState cookbook recipe at one point, I think). The problem is that they tend to be coded up with certain requirements in mind, without really going through XML specs, so inevitably you'll find places where they break down.

In this case, I don't see any namespace in examples nor (more damning) in unit tests, which tells me it's likely going to fail on complex documents with multiple namespaces.


In 1999, there was a group of us working on a subset of XML (we called it SML at the time) that had a simpler information model and syntax limitations so it'd be easier to work with. It seems that anyone doing a simple encoding of XML onto JSON might find the deliberations interesting (if quaint and outdated): http://tech.groups.yahoo.com/group/sml-dev/message/0


Today there is MicroXML. http://www.w3.org/community/microxml/


I think a better title would be "makes working with XML feel like working with native dictionaries". There are many ways of interacting with JSON; dictionaries are only one of them.

For example, you can do XPath-style interaction with JSON in Python, which for many things works out just as well, if not better.


The XML::Simple Perl module has been around for ages, and its Ruby "copy" with the same name has been around for a long time too.

I've used the Perl one and it only makes sense in very simple use cases.


I’ve used the Perl one as well. And while I’ll agree the XML handled needs to be simple, I’d say the use case needn’t be.

At one job I dealt with 10 and 20 megabyte XML files on a daily basis - and I never thought of throwing XML::Simple at those. The results would be anything but Simple.

On the other hand, I’ve run production code for years with an XML::Simple client to an XML-only RESTish API with ~100 endpoints. And that sucker's still critical infrastructure - and barely being touched. Just humming along.


XML::Simple was glorious for working with 1-100K documents. I've used it on bigger docs and it actually handles them pretty gracefully although I've never tried at scale.


10 years ago (he was just about 16 then!!), Aaron Swartz wrote this which has been very useful to me: http://www.aaronsw.com/2002/xmltramp/

Since then the builtin interfaces have improved and I use mostly xml.etree.cElementTree


What about namespaces? If you are not using namespaces, you are very likely using XML as the wrong solution for a problem. Still, your library is useful for cleaning up old code that did use XML as the wrong solution. Good work!


The overwhelming majority of XML I encounter in the real world does not use namespaces.


Doesn't this defeat some of the advantages of using XML though? Like Xpath?


Using XPath is good but it's a small comfort in the bloated and overcomplicated world of XML

I don't see any advantage in using XML to be honest


I've never understood this criticism. XML is quite simple - perhaps too simple.

If you're really offended by needing to repeat an element name in the closing tag... well... that's not exactly a serious criticism is it?

I mean, you could make various reasonable criticisms - XML sure isn't perfect. You might say the object model maps to documents, but not so well to typical programming constructs such as unordered hashes or ordered lists of unnamed items. You might complain that its data model is too simple in that it can only easily represent printable text strings, not binary data, numbers, dates etc. You might complain that namespaces are a hassle in many APIs. You might complain that the distinction between attributes and elements is odd and somewhat arbitrary.

But bloated? You're just ranting.


XML isn't too bad in itself but, a bit like Java, it appears to be one of the favoured tools for those who wish to build the eldritch abominations of the enterprise software world.

I've seen people do terrible mind rotting things with XML (e.g. encoded XML documents as attribute values in other XML documents and nested a few levels deep, using XML documents as keys in databases, using formats that aren't quite valid XML and then there is SOAP....)


I think "bloated" is a reasonable description when a feature that almost no one uses (or even knows the existence of) can DoS an XML parser in 15 lines of code.

http://en.wikipedia.org/wiki/Billion_laughs

And it especially is bloated if you throw in the whole suite of XML-related technologies, such as Schema (you may think that's unfair, but XML advocates frequently point to such technologies as the solution to various deficiencies of pure XML).
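
For anyone unfamiliar with the feature in question, this is a deliberately tiny sketch of nested entity expansion using the standard library's ElementTree. It has only two levels of tenfold expansion, so it is harmless; the real attack simply adds more levels. Recent Expat releases cap this kind of amplification, so behaviour on large inputs may differ from what older parsers did.

```python
import xml.etree.ElementTree as ET

# Two levels of tenfold expansion: &c; becomes 100 copies of "ha".
doc = """<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
]>
<lolz>&c;</lolz>"""

root = ET.fromstring(doc)

# The element body was five characters of source (&c;) before expansion.
expanded_chars = len(root.text)
```

Each extra level multiplies the output by ten again, which is where the "billion" comes from.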


I agree with the sentiment of this post but I believe the parent is specifically referring to "bloat" that comes from XML being "perhaps too simple".

For example:

     <e1> first <e2> second </e2> third </e1>

If you're parsing this into a data structure, how do you store "first" and "third"? There are a few different, very similar options. This ambiguity means XML parsers account for this differently and a particular programmer is going to have to sort through the various options to find what he needs (dom/etree/xpath/sax/etc). JSON (mostly) forces the author to decide how it should be parsed before encoding the data.

I'd say it's unfair to say XML is bloated, it's reasonable to say that the "world of XML parsing" is bloated and confusing, especially if you're a programmer looking for a data serialization format.
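
As one concrete answer to the "where do first and third go" question, this is how the standard library's ElementTree stores that exact snippet: text before the first child goes on the parent's `.text`, and text after a child's closing tag goes on that child's `.tail`. Other APIs (DOM text nodes, SAX character events) make different choices.

```python
import xml.etree.ElementTree as ET

e1 = ET.fromstring("<e1> first <e2> second </e2> third </e1>")
e2 = e1[0]

leading = e1.text   # text before the first child
inner = e2.text     # text inside the child
trailing = e2.tail  # text after the child's closing tag, stored on the child
```

Whitespace is preserved as-is, so each value keeps its surrounding spaces.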


> XML is quite simple - perhaps too simple.

I disagree pretty strongly with that statement: http://www.jelks.nu/XML/xmlebnf.txt

XML's grammar is far from simple.


A significant portion of that grammar is for parsing DTDs, which isn't really a fair inclusion.


why?


> I don't see any advantage in using XML to be honest

Well, it's great for what it was intended for: structured documents. I work at a library, where I'm dealing with thousands of digitized books and I haven't yet found a proper non-XML alternative to something like TEI.


I was going to pipe up on this. XML is for documents. JSON is for data. Sadly, the world isn't black and white like that.


I haven't had much of a chance to play with it yet, but I wonder how it would handle a query to find all the "baz" elements for a structure like this: http://pastebin.com/m7qTy8kw. Does it just support a loop, or possibly something similar to xpath like /document/foo/bar/baz to get an array of all of them? (Perhaps document["foo"]["bar"]["baz"], though I think that would break other things.)
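
For comparison, this is roughly what the path-style query looks like with the standard library's ElementTree. The document below is a made-up stand-in with the document/foo/bar/baz nesting described above, since the pastebin contents aren't reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure: several <foo> branches, each holding some <baz> leaves.
doc = """<document>
  <foo><bar><baz>1</baz><baz>2</baz></bar></foo>
  <foo><bar><baz>3</baz></bar></foo>
</document>"""

root = ET.fromstring(doc)

# root is already <document>, so the path is written relative to it.
values = [el.text for el in root.findall("./foo/bar/baz")]
```

The query collects every match across all branches in document order, which is the part a plain nested-key lookup on a dict can't express directly.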


We wrote a similar utility for our ORM-for-SaaS library some time ago, using lxml if available and falling back to ElementTree:

https://github.com/ducksboard/libsaas/blob/master/libsaas/xm...

Bear in mind that both approaches are lossy - trying to support every XML quirk would quickly lead to reimplementing the libxml wheel...


The Ruby equivalent would be XmlSimple: http://xml-simple.rubyforge.org/


And the Perl equivalent is XML::Simple: https://metacpan.org/module/XML::Simple


I would not recommend XML::Simple, there are various cases (not even that uncommon - say having an attribute and an element of the same name) where it breaks down.

It may be awkward to use at times, but XML::LibXML is robust and consistent.


I work with a lot of translation services. While the machine translation services are easy to deal with, the professional translation agencies _love_ XML. They also generally know f--- all about building usable web APIs. This looks pretty useful.


This reminds me of PHP's SimpleXML (http://php.net/manual/en/book.simplexml.php), which belongs to the good parts of PHP.


I used this to build http://showhen.me and it worked very well for my use case.


Does this mean we can simply convert any XML API to a JSON equivalent with just json.dumps(xmltodict.parse(xml))?

I am going to give it a try.
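
To get a feel for what such a conversion involves, here is a stdlib-only sketch of the XML-to-dict-to-JSON idea. It is not xmltodict's implementation: `element_to_dict` is a hypothetical helper, and it is knowingly lossy (attributes and mixed-content text are dropped).

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Lossy conversion: attributes and mixed-content text are ignored."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tag: promote the existing entry to a list, then append.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = existing = [existing]
            existing.append(value)
        else:
            result[child.tag] = value
    return result


root = ET.fromstring("<order><id>7</id><item>pen</item><item>ink</item></order>")
as_json = json.dumps({root.tag: element_to_dict(root)}, sort_keys=True)
```

Note that every leaf comes out as a string, and a tag that appears once gives a scalar while a repeated tag gives a list, so consumers of the JSON have to handle both shapes.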


I used this to parse some XML when creating a shipping calculator. It was a huge time-saver.


I recently used it for a project and I am quite a happy consumer of this library.


This seems so obvious that I wonder why it wasn't done before. Well done!


Thanks!


There are two older similar projects - Beautiful Soup and untangle.


Made a port for FreeBSD, thanks much!


Which is a pity, because... XML is nothing like JSON.



