It also has the 'objectify' API, where you can access XML nodes via regular object access (i.e. `access.nodes[0].like.this`).
http://lxml.de/objectify.html
I have two major gripes with lxml that this library solves, but I agree that for serious projects, lxml is the correct choice.
1) you have to build libxml to use lxml
2) lxml has a large, powerful, complicated API
For 1), a friend of mine had to do an annoying workaround to use lxml on his box, because its limited memory kept him from building libxml. Because xmltodict is Expat[1]-based, you don't have to build libxml in your environment to use it.
For 2), When I went to write a simple rss-reader project this past weekend, I dreaded going back to lxml. I knew that I'd have to go peruse its huge documentation to answer questions about whether to use lxml.XML or lxml.fromstring, whether methods I wanted were on Elements or ElementTrees, xpath syntax, custom parsers, etc. If I'd ever seen the objectify API I'd forgotten about it, because there's just so much _other_ stuff in lxml.
I happen to have found xmltodict in a brief search for lxml alternatives. It's in PyPI, so pip grabbed it with no complaints. It installed without building anything. And in less than a minute of glancing at the README, I grokked the API as "pydict = xmltodict.parse(xml_string)". I don't know if there are other things. I never had to find out.
Less than 10 minutes from finding it to forgetting I was reading XML as a source -- really a wonderful project. But if I were doing something 'serious', I'd absolutely use lxml. That large API and those byzantine docs exist for a good reason: they're dealing with XML properly. But sometimes coders just wanna have fun, or build a quick prototype or hack.
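For anyone curious what that one-call API boils down to, here's a rough stdlib-only sketch of an XML-to-dict conversion in the same spirit. This is not xmltodict's actual implementation; real converters also handle attributes, mixed content, and namespaces:

```python
import xml.etree.ElementTree as ET

def to_dict(elem):
    # Leaf element: just return its text content.
    if len(elem) == 0:
        return elem.text
    d = {}
    for child in elem:
        value = to_dict(child)
        if child.tag in d:
            # Repeated sibling tags collapse into a list.
            if not isinstance(d[child.tag], list):
                d[child.tag] = [d[child.tag]]
            d[child.tag].append(value)
        else:
            d[child.tag] = value
    return d

xml_string = '<feed><item><title>a</title></item><item><title>b</title></item></feed>'
root = ET.fromstring(xml_string)
pydict = {root.tag: to_dict(root)}
assert pydict == {'feed': {'item': [{'title': 'a'}, {'title': 'b'}]}}
```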
Agree that lxml is certainly quite large and batteries-included, but I've always found I can do pretty much everything I want to just using a few select methods, namely etree and xpath.
For RSS reading, here's a straightforward example. Obviously you can factor out the index access on each item, and calling `/text()` too.
    import urllib2
    from lxml import etree

    data = urllib2.urlopen('https://news.ycombinator.com/rss').read()
    root = etree.XML(data)
    for i, item in enumerate(root.xpath('.//item')):
        print i, item.xpath('title/text()')[0]
        print item.xpath('description/text()')[0]
        print item.xpath('link/text()')[0]
        print
All pretty simple, xpath selectors can get a bit gnarly at times though. The tradeoff being that you can be very expressive with them.
Good, I had never seen this transformation tip in LXML
But usually, yes, LXML is "good" meaning it's the least-bad way of dealing with XML
Also, it has some idiosyncrasies, like insisting on adding the namespace to tag names, so you end up with something like {http://example.com/your.xlsd}.index (I don't remember it exactly and I don't have an example here with me)
Correctly handling namespaced QNames is a requirement, not a bug. It makes things awkward at times, but that's a job for lib writers to provide decent interfaces.
I haven't used LXML in a while, but ElementTree, for example, forces you to use the QName in XPath expressions, which is technically correct but a huge pain. It would be nice if there were a ScrewNamespace option allowing "simple" searches, although this might blow up in your face one day (when two namespaces define the same element name and your xpath search brings up elements you didn't really want).
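A minimal stdlib illustration of the QName requirement (the element names and namespace URI are invented):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root xmlns:a="http://example.com/a">'
    '<a:index>42</a:index>'
    '</root>'
)

# A "simple" search by local name finds nothing:
assert doc.find('index') is None

# You must spell out the fully qualified name...
assert doc.find('{http://example.com/a}index').text == '42'

# ...or pass a prefix-to-URI map to find():
assert doc.find('a:index', {'a': 'http://example.com/a'}).text == '42'
```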
I also found the namespacing to be a bit weird, and it took quite a while to grok the documentation. In case anyone wants a working example, I implemented a wrapper to drop the namespacing (resulting in simple objectify attribute access) for one particular XML schema here:
https://github.com/timstaley/voevent-parse/blob/master/voepa...
That's the "node-name", i.e. the fully qualified name of the tag in its namespace. You probably wanted to ask for the local-name, which in your example would just be "index". Not sure how to with LXML, but it's a common mistake people make when dealing with XML.
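For what it's worth, XPath 1.0 has local-name() for this (lxml's xpath() accepts it, e.g. //*[local-name()="index"]), and even with plain ElementTree the qualified tag is just a string you can split. A quick sketch with an invented namespace:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<i xmlns="http://example.com/ns"><index>1</index></i>')
elem = doc[0]
assert elem.tag == '{http://example.com/ns}index'  # the fully qualified node-name

# The local-name is everything after the closing brace:
local = elem.tag.split('}', 1)[-1]
assert local == 'index'
```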
There's however a small feature in xmltodict that most people overlook: the streaming mode. I actually wrote xmltodict the day I tried to parse a Wikipedia dump, I just couldn't keep it all in memory but needed something more high-level than SAX.
xmltodict is in no way trying to compete with LXML feature-wise (no support for namespaces yet, just to name one thing). It's just a lightweight approach to roundtrip between XML and JSON documents that worked for my use case, and I decided to share it.
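For comparison, the keep-memory-bounded pattern the stdlib itself offers is ElementTree's iterparse, which yields parse events incrementally and lets you discard elements as you go (toy data below; this is not xmltodict's streaming API, just the same idea):

```python
import io
import xml.etree.ElementTree as ET

xml_data = io.BytesIO(
    b'<dump><page><title>A</title></page><page><title>B</title></page></dump>'
)

titles = []
# iterparse yields elements incrementally instead of building the whole tree.
for event, elem in ET.iterparse(xml_data, events=('end',)):
    if elem.tag == 'page':
        titles.append(elem.find('title').text)
        elem.clear()  # release children so memory stays bounded

assert titles == ['A', 'B']
```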
Yes, my first thought when reading the headline was that it would be about LXML.
However, while LXML can do this, and makes it easy, the documentation does not stress this way of using LXML. I like this project's emphasis on simplicity and doing one thing. It's the difference between "You can use LXML to get a dict" vs "Here is how to use xmltodict to get a dict". And it's right there in the name. Emphasis and naming are important when getting started.
LXML is a little unapproachable when you first use it but it's all of the other great things you mentioned too. Now that I've used it a lot I would never consider trading it for something simpler. It can handle any situation you're going to run into. I'd suggest that if people were looking to do anything more than a quick hack they invest an afternoon in learning the LXML API.
There are umpteen versions of this sort (there was even an ActiveState cookbook recipe at one point, I think). The problem is that they tend to be coded up with certain requirements in mind, without really going through XML specs, so inevitably you'll find places where they break down.
In this case, I don't see any namespaces in the examples nor (more damning) in the unit tests, which tells me it's likely going to fail on complex documents with multiple namespaces.
In 1999, there were a group of us working on a subset of XML (we called it SML at the time) that had a simpler information model and syntax limitations so it'd be easier to work with. Anyone attempting a simple encoding of XML onto JSON might find the deliberations interesting (if quaint and outdated): http://tech.groups.yahoo.com/group/sml-dev/message/0
I think a better title would be "make working with XML feel like native dictionaries". There are many ways of interacting with JSON, and dictionaries are only one of them.
For example, you can do XPath-style interaction with JSON in Python, which for many things works out just as well, if not better.
I’ve used the Perl one as well. And while I’ll agree the XML being handled needs to be simple, I’d say the use case needn’t be.
At one job I dealt with 10- and 20-megabyte XML files on a daily basis - and I never thought of throwing XML::Simple at those. The results would be anything but Simple.
On the other hand, I’ve run production code for years with an XML::Simple client to an XML-only RESTish API with ~100 endpoints. And that sucker’s still critical infrastructure - and barely being touched. Just humming along.
XML::Simple was glorious for working with 1-100K documents. I've used it on bigger docs and it actually handles them pretty gracefully although I've never tried at scale.
What about namespaces? If you are not using namespaces, you are very likely using XML as the wrong solution for the problem - still, your library is useful for helping clean up old code that did use XML as the wrong solution. Good work!
I've never understood this criticism. XML is quite simple - perhaps too simple.
If you're really offended by needing to repeat an element name in the closing tag... well... that's not exactly a serious criticism is it?
I mean, you could make various reasonable criticisms - XML sure isn't perfect. You might say the object model maps to documents, but not so well to typical programming constructs such as unordered hashes or ordered lists of unnamed items. You might complain that its data model is too simple in that it can only easily represent printable text-strings, not binary data, numbers, dates etc. You might complain that namespaces are a hassle in many APIs. You might complain that the distinction between attributes and elements is odd and somewhat arbitrary.
XML isn't too bad in itself but, a bit like Java, it appears to be one of the favoured tools for those who wish to build the eldritch abominations of the enterprise software world.
I've seen people do terrible mind-rotting things with XML (e.g. XML documents encoded as attribute values in other XML documents, nested a few levels deep; XML documents used as keys in databases; formats that aren't quite valid XML; and then there is SOAP...)
I think "bloated" is a reasonable description when a feature that almost no one uses (or even knows the existence of) can DoS an XML parser in 15 lines of code.
And it especially is bloated if you throw in the whole suite of XML-related technologies, such as Schema (you may think that's unfair, but XML advocates frequently point to such technologies as the solution to various deficiencies of pure XML).
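The feature in question is DTD entity expansion; the classic "billion laughs" payload fits comfortably in 15 lines and expands to gigabytes of text on a naive parser (lol3 through lol8 elided, each defined the same way as the level below it):

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  ...
  <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
```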
I agree with the sentiment of this post but I believe the parent is specifically referring to "bloat" that comes from XML being "perhaps too simple".
For example:
<e1> first <e2> second </e2> third </e1>
If you're parsing this into a data structure, how do you store "first" and "third"? There are a few different, very similar options. This ambiguity means XML parsers account for this differently, and a particular programmer is going to have to sort through the various options to find what he needs (dom/etree/xpath/sax/etc). JSON (mostly) forces the author to decide how it should be parsed before encoding the data.
I'd say it's unfair to say XML is bloated, it's reasonable to say that the "world of XML parsing" is bloated and confusing, especially if you're a programmer looking for a data serialization format.
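For the curious, here's how the stdlib's ElementTree stores the "first"/"third" example above: "first" becomes the parent's .text and "third" becomes the child's .tail, which surprises almost everyone the first time:

```python
import xml.etree.ElementTree as ET

e1 = ET.fromstring('<e1> first <e2> second </e2> third </e1>')

assert e1.text == ' first '   # text before the first child, stored on e1
e2 = e1[0]
assert e2.text == ' second '
assert e2.tail == ' third '   # text after e2's close tag, stored on e2
```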
> I don't see any advantage in using XML to be honest
Well, it's great for what it was intended for: structured documents.
I work at a library, where I'm dealing with thousands of digitized books and I haven't yet found a proper non-XML alternative to something like TEI.
I haven't had much of a chance to play with it yet, but I wonder how it would handle a query to find all the "baz" elements for a structure like this: http://pastebin.com/m7qTy8kw. Does it just support a loop, or possibly something similar to xpath like /document/foo/bar/baz to get an array of all of them. (Perhaps document["foo"]["bar"]["baz"], though I think that would break other things.)
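Guessing at the pastebin's shape (something like <document><foo><bar><baz>...</baz></bar></foo></document> with several baz elements), dict-style converters typically turn repeated siblings into a list, so plain key access does work. Hypothetical data below:

```python
# Hypothetical result of parsing a document with several <baz> elements;
# converters in this style usually return a list for repeated siblings.
doc = {"document": {"foo": {"bar": {"baz": ["one", "two", "three"]}}}}

bazzes = doc["document"]["foo"]["bar"]["baz"]
assert bazzes == ["one", "two", "three"]
```

The catch is that a document with a single baz would likely yield a bare value instead of a one-element list, which is exactly the sort of inconsistency to watch for.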
I would not recommend XML::Simple, there are various cases (not even that uncommon - say having an attribute and an element of the same name) where it breaks down.
It may be awkward to use at times, but XML::LibXML is robust and consistent.
I work with a lot of translation services. While the machine translation services are easy to deal with, the professional translation agencies _love_ XML. They also generally know f--- all about building usable web APIs. This looks pretty useful.