April 20, 2016

HPR2013: Parsing XML in Python with Xmltodict

16 minutes

If untangle is too simple for your XML parsing needs, check out xmltodict. Like untangle, xmltodict is simpler than the usual suspects (lxml, Beautiful Soup), but it's got some advanced features as well.

If you're reading this article, I assume you've read at least the introduction to my article about Untangle, and you should probably also read, at some point, my article on using JSON just so you know your options.

Quick recap about XML:

XML is a way of storing data in a hierarchical arrangement so that the data can be parsed later. It's explicit and strictly structured, so one of its benefits is that it paints a fairly verbose definition of data. Here's an example of some simple XML:

<?xml version="1.0"?>
<book>
  <chapter id="prologue">
    <title>
      The Beginning
    </title>
    <para>
      This is the first paragraph.
    </para>
  </chapter>
  <chapter id="end">
    <title>
      The Ending
    </title>
    <para>
      Last para of last chapter.
    </para>
  </chapter>
</book>

And here's some info about the xmltodict library that makes parsing that a lot easier than the built-in Python tools:

Install

Install xmltodict manually, or from your repository, or using pip:

$ pip install xmltodict

or, if you need to install it for your own user account only (no root access required):

$ pip install --user xmltodict

Xmltodict

With xmltodict, each element in an XML document gets converted into a dictionary (specifically an OrderedDict), which you then treat basically the same as you would parsed JSON (or any Python OrderedDict).

First, ingest the XML document. Assuming it's called sample.xml and is located in the current directory:

>>> import xmltodict
>>> with open('sample.xml') as f:
...     data = xmltodict.parse(f.read())
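If you don't have a sample.xml on disk yet, the same call works on a string. Here's a minimal sketch, assuming xmltodict is installed. SAMPLE is just a name made up for this illustration; its content mirrors the book example, with the chapter ids that show up in the output in this article.

```python
import xmltodict

# SAMPLE is a hypothetical stand-in for the contents of sample.xml.
SAMPLE = """<?xml version="1.0"?>
<book>
  <chapter id="prologue">
    <title>The Beginning</title>
    <para>This is the first paragraph.</para>
  </chapter>
  <chapter id="end">
    <title>The Ending</title>
    <para>Last para of last chapter.</para>
  </chapter>
</book>
"""

# parse() takes a string, just like the f.read() result in the file example.
data = xmltodict.parse(SAMPLE)

# Attributes are stored under keys prefixed with "@";
# a repeated element (chapter) comes back as a list.
first = data["book"]["chapter"][0]
print(first["@id"], "-", first["title"])
```

Parsing from a string like this is handy for experimenting in the interpreter before pointing the code at a real file.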

If you're a visual thinker, you might want or need to see the data. You can look at it just by dumping data:

>>> data
OrderedDict([('book', OrderedDict([('chapter',
[OrderedDict([('@id', 'prologue'),
('title', 'The Beginning'),
...and so on...

Not terribly pretty to look at. Slightly less ugly is your data set piped through json.dumps:

>>> import json
>>> json.dumps(data)
'{"book": {"chapter": [{"@id": "prologue",
"title": "The Beginning", "para": "This is the first paragraph."},
{"@id": "end", "title": "The Ending",
"para": "Last para of last chapter."}]
}}'
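json.dumps also takes an indent argument, which makes the dump a good deal easier to read. Here's a small sketch using only the standard library. The book dict is typed in by hand to mirror the parsed sample, so the snippet runs without sample.xml or xmltodict; with real data you would pass the result of xmltodict.parse instead.

```python
import json

# Hand-written stand-in for the structure xmltodict returns for the sample.
book = {
    "book": {
        "chapter": [
            {"@id": "prologue", "title": "The Beginning",
             "para": "This is the first paragraph."},
            {"@id": "end", "title": "The Ending",
             "para": "Last para of last chapter."},
        ]
    }
}

# indent=2 puts each key on its own line, nested two spaces per level.
pretty = json.dumps(book, indent=2)
print(pretty)
```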

You can try other feats of pretty printing, if they help:

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>> pp.pprint(data)
{ 'book': { 'chapter': [{'@id': 'prologue',
'title': 'The Beginning',
'para': 'This is the ...
...and so on...

More often than not, though, you're going to be "walking" the XML tree, looking for specific points of interest. This is fairly easy to do, as long as you remember that syntactically you're dealing with a Python dict, while structurally, the nesting of elements (which element sits inside which) matters.
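As a rough sketch of what walking the tree can look like: ordinary dict keys take you down one level at a time, and an element that repeats (here, chapter) comes back as a list you can loop over. The data dict below is a hand-typed stand-in for the parsed sample, and toc is a name invented for this example.

```python
# Hand-written stand-in for the result of xmltodict.parse() on the sample.
data = {
    "book": {
        "chapter": [
            {"@id": "prologue", "title": "The Beginning",
             "para": "This is the first paragraph."},
            {"@id": "end", "title": "The Ending",
             "para": "Last para of last chapter."},
        ]
    }
}

# Walk down from the root element to the list of chapters,
# collecting each chapter's id attribute and title.
toc = []
for chapter in data["book"]["chapter"]:
    toc.append((chapter["@id"], chapter["title"]))

for chapter_id, title in toc:
    print(f"{chapter_id}: {title}")
```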

Elements (Tags)

Exploring the data element-by-element is very easy. Calling your data set by its root element (in our cur


By Hacker Public Radio
