Hacker Public Radio

HPR2013: Parsing XML in Python with Xmltodict



If untangle is too simple for your XML parsing needs, check out xmltodict. Like untangle, xmltodict is simpler than the usual suspects (lxml, Beautiful Soup), but it has some advanced features as well.
If you're reading this article, I assume you've read at least the introduction to my article about Untangle, and you should probably also read, at some point, my article on using JSON just so you know your options.
A quick recap of XML:
XML is a way of storing data in a hierarchical arrangement so that the data can be parsed later. It's explicit and strictly structured, so one of its benefits is that it paints a fairly verbose definition of data. Here's an example of some simple XML:
<?xml version="1.0"?>
<book>
<chapter id="prologue">
<title>
The Beginning
</title>
<para>
This is the first paragraph.
</para>
</chapter>
<chapter id="end">
<title>
The Ending
</title>
<para>
Last para of last chapter.
</para>
</chapter>
</book>
And here's some info about the xmltodict library that makes parsing that a lot easier than the built-in Python tools:
Install
Install xmltodict manually, or from your repository, or using pip:
$ pip install xmltodict
Or, if you need to install it for your own user account only:
$ pip install --user xmltodict
Xmltodict
With xmltodict, each element in an XML document gets converted into a dictionary (specifically an OrderedDict; newer releases of xmltodict may give you a plain dict instead), which you then treat basically the same as you would JSON (or any Python dict).
First, ingest the XML document. Assuming it's called sample.xml and is located in the current directory:
>>> import xmltodict
>>> with open('sample.xml') as f:
...     data = xmltodict.parse(f.read())
If you're a visual thinker, you might want or need to see the data. You can look at it just by dumping data:
>>> data
OrderedDict([('book', OrderedDict([('chapter',
[OrderedDict([('@id', 'prologue'),
('title', 'The Beginning'),
...and so on...
Not terribly pretty to look at. Slightly less ugly is your data set piped through json.dumps:
>>> import json
>>> json.dumps(data)
'{"book": {"chapter": [{"@id": "prologue",
"title": "The Beginning", "para": "This is the first paragraph."},
{"@id": "end", "title": "The Ending",
"para": "Last para of last chapter."}]
}}'
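If that single long line is still hard to read, json.dumps takes an indent argument. Here's a minimal sketch of that. Note that the data dict below is a hand-typed stand-in for what xmltodict.parse() returns for the sample document (my assumption, so that the snippet runs without sample.xml on disk):

```python
import json

# Stand-in for the result of xmltodict.parse() on the sample XML.
# Assumption: typed by hand here so the snippet runs without sample.xml.
data = {
    "book": {
        "chapter": [
            {
                "@id": "prologue",
                "title": "The Beginning",
                "para": "This is the first paragraph.",
            },
            {
                "@id": "end",
                "title": "The Ending",
                "para": "Last para of last chapter.",
            },
        ]
    }
}

# indent=2 puts each key on its own line, nested by two spaces per level.
pretty = json.dumps(data, indent=2)
print(pretty)
```

The output is the same data as before, just spread over several lines so the nesting of book, chapter, and each chapter's keys is visible at a glance.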
You can try other feats of pretty printing, if they help:
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> pp.pprint(data)
{ 'book': { 'chapter': [{'@id': 'prologue',
'title': 'The Beginning',
'para': 'This is the ...
...and so on...
More often than not, though, you're going to be "walking" the XML tree, looking for specific points of interest. This is fairly easy to do, as long as you remember that syntactically you're dealing with a Python dict, while structurally, the nesting matters: to reach an element, you go through its parent elements first.
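To make "walking" concrete, here's a rough sketch using plain dict and list lookups. The data dict is again a hand-typed stand-in for the parsed sample document, and as_list is a hypothetical helper of my own, not part of xmltodict. It's there because an element that appears only once comes back as a single dict rather than a list:

```python
# Stand-in for the result of xmltodict.parse() on the sample XML.
# Assumption: typed by hand here so the snippet runs without sample.xml.
data = {
    "book": {
        "chapter": [
            {"@id": "prologue", "title": "The Beginning",
             "para": "This is the first paragraph."},
            {"@id": "end", "title": "The Ending",
             "para": "Last para of last chapter."},
        ]
    }
}


def as_list(value):
    """Hypothetical helper: wrap a lone element in a list so loops always work."""
    return value if isinstance(value, list) else [value]


# Start at the root element, then step down one level at a time.
book = data["book"]
chapters = as_list(book["chapter"])

# Child elements are plain keys; attributes are keys prefixed with "@".
first_title = chapters[0]["title"]
ids = [chapter["@id"] for chapter in chapters]

for chapter in chapters:
    print(chapter["@id"], "->", chapter["title"])
```

Repeated elements (the two chapter elements here) arrive as a list, which is why the chapters are addressed by index or looped over, while the single book element is reached by key alone.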
Elements (Tags)
Exploring the data element-by-element is very easy. Calling your data set by its root element (in our cur