This show has been flagged as Explicit by the host.
More Command line fun: downloading a podcast
In the show hpr4398 :: Command line fun: downloading a podcast, Kevie walked us through a command to download a podcast.
He used some techniques here that I hadn't used before, and it's always great to see how other people approach the problem.
Let's have a look at the script and walk through what it does, then we'll have a look at some "traps for young players", as they say on the EEVBlog.
Analysis of the Script
wget `curl https://tuxjam.otherside.network/feed/podcast/ | grep -o 'https*://[^"]*ogg' | head -1`
It chains four different commands together to "Save the latest file from a feed".
Let's break it down so we can have checkpoints between each step.
I often do this when writing a complex one-liner: first do it as steps, and then combine it.
The curl command fetches the feed at
https://tuxjam.otherside.network/feed/podcast/
To do this ourselves, and keep a copy to work with, we will call
curl https://tuxjam.otherside.network/feed/podcast/ --output tuxjam.xml
where the --output option saves the result to a named file instead of writing it to standard output.
This gives us an xml file, and we can confirm it's well-formed xml with the xmllint command:
$ xmllint --format tuxjam.xml >/dev/null
$ echo $?
0
Here the output of the command is discarded by redirecting it to /dev/null. Then we check the exit code of the last command; as it's 0, it completed successfully.
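The same >/dev/null and $? pattern can be tried on any command. As a quick illustration with grep (the sample input here is made up):

```shell
# A successful search exits with status 0.
printf '<rss version="2.0"/>\n' | grep 'rss' >/dev/null
echo $?    # 0

# A search that finds nothing exits with status 1.
printf '<rss version="2.0"/>\n' | grep 'atom' >/dev/null
echo $?    # 1
```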
Kevie then passes the output to the grep search command with the option -o, looking for any string starting with http, then zero or more s characters, then ://, then a run of characters containing no double quote, ending in ogg. From the grep manual:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line
We can do the same with the following command. I was not aware that grep defaulted to basic regular expressions, as I tend to add the --perl-regexp argument.
grep --only-matching 'https*://[^"]*ogg' tuxjam.xml
http   matches the characters http literally (case sensitive)
s*     matches the character s literally, between zero and unlimited times, as many times as possible (greedy)
://    matches the characters :// literally
[^"]*  matches any single character that is not a double quote, between zero and unlimited times, as many times as possible (greedy)
ogg    matches the characters ogg literally (case sensitive)
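One quirk worth noting: https* does not mean "http or https" the way a shell glob would; it means http followed by any number of s characters. A small sketch of the difference, with an invented URL:

```shell
# 'https*' = "http" plus zero or more "s", so it even matches "httpsss".
printf 'httpsss://example.org/a.ogg\n' | grep -o 'https*://[^"]*ogg'
# -> httpsss://example.org/a.ogg

# 'https\?' = "http" plus at most one "s" -- closer to what is usually
# intended, and it rejects the bogus scheme above.
printf 'httpsss://example.org/a.ogg\n' | grep -o 'https\?://[^"]*ogg' || echo 'no match'
# -> no match
```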
When we run this ourselves we get the following
$ grep --only-matching 'https*://[^"]*ogg' tuxjam.xml
https://archive.org/download/tuxjam-121/tuxjam_121.ogg
https://archive.org/download/tuxjam-120/TuxJam_120.ogg
https://archive.org/download/tux-jam-119/TuxJam_119.ogg
https://archive.org/download/tuxjam_118/tuxjam_118.ogg
https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://archive.org/download/tuxjam_116/tuxjam_116.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://ogg
http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
https://ogg
https://archive.org/download/tuxjam_114/tuxjam_114.ogg
https://archive.org/download/tuxjam_113/tuxjam_113.ogg
https://archive.org/download/tuxjam_112/tuxjam_112.ogg
The head -1 command then returns only the first line, so we end up with
https://archive.org/download/tuxjam-121/tuxjam_121.ogg
Finally that line is used as the argument to the wget command, which downloads the file.
Problems with the approach
Relying on grep with structured data like xml or json can lead to problems.
When we looked at the output of the command in step 2, some of the results gave
https://ogg
When we run the same command without the --only-matching argument we can see the full lines that were matched.
$ grep 'https*://[^"]*ogg' tuxjam.xml
This episode may not be live as in TuxJam 115 from Oggcamp but your friendly foursome of Al, Dave (thelovebug), Kevie and Andrew (mcnalu) are very much alive to treats of Free and Open Source Software and Creative Commons tunes.
https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/
https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/#respond
https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/feed/
With the group meeting up together for the first time in person, it was decided that a live recording would be an appropriate venture. With the quartet squashed around a table and a group of adoring fans crowded into a room at the Pendulum Hotel in Manchester, the discussion turns to TuxJam reviews that become regularly used applications, what we enjoyed about OggCamp 2024 and for the third section the gang put their reputation on the line and allow open questions from the sea of dedicated fans.
OggCamp 2024 on Saturday 12 and Sunday 13 October 2024, Manchester UK.
Two of the hits are not enclosures at all; they are references in the text to OggCamp, such as "what we enjoyed about OggCamp 2024".
Normally grep works line by line, so you only get one result per matching line, and if the xml is minimised the whole file can come across as one big line, causing matches to be missed or run together. To simulate this I minimised the feed:
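We can reproduce the run-together effect on a tiny one-line file. The elements below are invented, but mimic the feed: between a guid and a link there is no double quote, so [^"]* runs straight across the element boundaries.

```shell
# Two separate values fuse into one bogus "match" on a single line,
# because nothing in [^"]* stops at the end of an element.
printf '<guid>https://x/?p=1</guid><link>https://x/oggcamp/</link>\n' |
  grep -o 'https*://[^"]*ogg'
# -> https://x/?p=1</guid><link>https://x/ogg
```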
xmllint --noblanks tuxjam.xml > tuxjam-min.xml
I then edited it and replaced the newlines with spaces. I have to say that the --only-matching argument does a great job of pulling out the matches. That said, the results were not perfect either.
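As an aside, that manual edit can be scripted; one way, using an invented demo file, is tr:

```shell
# Create a small multi-line file, then replace every newline with a
# space (tr reads standard input and writes standard output).
printf 'line one\nline two\n' > demo.txt
tr '\n' ' ' < demo.txt
```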
$ grep --only-matching 'https*://[^"]*ogg' tuxjam-min.xml
https://archive.org/download/tuxjam-121/tuxjam_121.ogg
https://archive.org/download/tuxjam-120/TuxJam_120.ogg
https://archive.org/download/tux-jam-119/TuxJam_119.ogg
https://archive.org/download/tuxjam_118/tuxjam_118.ogg
https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://archive.org/download/tuxjam_116/tuxjam_116.ogg
https://tuxjam.otherside.network/tuxjam-115-ogg
https://tuxjam.otherside.network/?p=1029https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/#respondhttps://tuxjam.otherside.network/tuxjam-115-ogg
https://ogg
http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
https://ogg
https://archive.org/download/tuxjam_114/tuxjam_114.ogg
https://archive.org/download/tuxjam_113/tuxjam_113.ogg
https://archive.org/download/tuxjam_112/tuxjam_112.ogg
You could fix it by modifying the grep arguments and adding additional searches looking for enclosure. The problem with that approach is that you'll forever and a day be chasing issues when someone changes something. So the approach is officially "Grand", but it's very likely to break if you're not babysitting it.
Suggested Applications
I recommend never parsing structured documents, like xml or json, with grep. You should use a dedicated parser that understands the document markup and can intelligently address parts of it.
xml: use xmlstarlet
json: use jq
yaml: use yq
Of course anyone that looks at my code on the hpr gitea will know this is a case of "do what I say, not what I do." Never parse xml with grep, where the only possible exception is to check whether a string is in a file in the first place.
grep --max-count=1 --files-with-matches
That's justified by the fact that grep is going to be faster than having to parse the file and build an XML Document Object Model.
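As a sketch of that exception, with invented file names and content:

```shell
# Create one file that mentions an enclosure and one that doesn't.
printf '<item><enclosure url="a.ogg"/></item>\n' > feed.xml
printf '<html><body>nothing here</body></html>\n' > page.html

# --files-with-matches prints only the names of matching files, and
# --max-count=1 stops scanning each file after its first hit.
grep --max-count=1 --files-with-matches 'enclosure' feed.xml page.html
# -> feed.xml
```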
Some Tips
Always refer to examples and the specification
A specification is just a set of rules that tell you how the document is formatted.
There is a danger in just looking at example files and not reading the specifications. I had a situation once where a software developer raised a bug because the supplied files didn't begin with ken-test- followed by a uuid, as the example files had. They were surprised to learn that the files were under no obligation to follow the naming convention used in the examples. Suffice to say that bug was rejected.
For us there are the rules from the RSS specification itself, but as it's an XML file there are also the XML Specifications. While the RSS spec is short, the XML one is not, so people tend to use dedicated libraries to parse XML. Using a dedicated tool like xmlstarlet will allow us to mostly ignore the details of XML.
RSS is a dialect of XML. All RSS files must conform to the XML 1.0 specification, as published on the World Wide Web Consortium (W3C) website.
The first line of the tuxjam feed shows it's an XML file.
The specification goes on to say "At the top level, a RSS document is a <rss> element, with a mandatory attribute called version, that specifies the version of RSS that the document conforms to. If it conforms to this specification, the version attribute must be 2.0." And sure enough the second line shows that it's an RSS file.
Use the best tool for the job
You wouldn't grep an Excel file, so why would you grep an XML file?
We could go on all day but I want to get across the idea that there is structure in the file. As XML is everywhere you should have a tool to process it. More than likely
xmlstarlet
is in all the distro repos, so just install it.
The help looks like this:
$ xmlstarlet --help
XMLStarlet Toolkit: Command line utilities for XML
Usage: xmlstarlet [<options>] <command> [<cmd-options>]
where <command> is one of:
ed (or edit) - Edit/Update XML document(s)
sel (or select) - Select data or query XML document(s) (XPATH, etc)
tr (or transform) - Transform XML document(s) using XSLT
val (or validate) - Validate XML document(s) (well-formed/DTD/XSD/RelaxNG)
fo (or format) - Format XML document(s)
el (or elements) - Display element structure of XML document
c14n (or canonic) - XML canonicalization
ls (or list) - List directory as XML
esc (or escape) - Escape special XML characters
unesc (or unescape) - Unescape special XML characters
pyx (or xmln) - Convert XML into PYX format (based on ESIS - ISO 8879)
p2x (or depyx) - Convert PYX into XML
<options> are:
-q or --quiet - no error output
--doc-namespace - extract namespace bindings from input doc (default)
--no-doc-namespace - don't extract namespace bindings from input doc
--version - show version
--help - show help
Wherever file name mentioned in command help it is assumed
that URL can be used instead as well.
Type: xmlstarlet <command> --help <command-name> for command help
XMLStarlet is a command line toolkit to query/edit/check/transform
XML documents (for more information see http://xmlstar.sourceforge.net/)
You can get more help on a given topic by calling the
xmlstarlet
command
$ xmlstarlet el --help
XMLStarlet Toolkit: Display element structure of XML document
Usage: xmlstarlet el [<options>] <xml-file>
where
  <xml-file> - input XML document file name (stdin is used if missing)
  <options> is one of:
  -a    - show attributes as well
  -v    - show attributes and their values
  -u    - print out sorted unique lines
  -d<n> - print out sorted unique lines up to depth <n>
XMLStarlet is a command line toolkit to query/edit/check/transform
XML documents (for more information see http://xmlstar.sourceforge.net/)
To prove that it's a structured document we can run the command xmlstarlet el -u, which lists the unique elements.
$ xmlstarlet el -u tuxjam.xml
rss
rss/channel
rss/channel/atom:link
rss/channel/copyright
rss/channel/description
rss/channel/generator
rss/channel/image
rss/channel/image/link
rss/channel/image/title
rss/channel/image/url
rss/channel/item
rss/channel/item/category
rss/channel/item/comments
rss/channel/item/content:encoded
rss/channel/item/description
rss/channel/item/enclosure
rss/channel/item/guid
rss/channel/item/itunes:author
rss/channel/item/itunes:duration
rss/channel/item/itunes:episodeType
rss/channel/item/itunes:explicit
rss/channel/item/itunes:image
rss/channel/item/itunes:subtitle
rss/channel/item/itunes:summary
rss/channel/item/link
rss/channel/item/pubDate
rss/channel/item/slash:comments
rss/channel/item/title
rss/channel/item/wfw:commentRss
rss/channel/itunes:author
rss/channel/itunes:category
rss/channel/itunes:explicit
rss/channel/itunes:image
rss/channel/itunes:owner
rss/channel/itunes:owner/itunes:name
rss/channel/itunes:subtitle
rss/channel/itunes:summary
rss/channel/itunes:type
rss/channel/language
rss/channel/lastBuildDate
rss/channel/link
rss/channel/podcast:guid
rss/channel/podcast:license
rss/channel/podcast:location
rss/channel/podcast:medium
rss/channel/podcast:podping
rss/channel/rawvoice:frequency
rss/channel/rawvoice:location
rss/channel/sy:updateFrequency
rss/channel/sy:updatePeriod
rss/channel/title
That is the xpath representation of the xml structure. It's very similar to a unix filesystem tree. There is one rss branch, under that is one channel branch, and that can have many item branches.
"Save the latest file from a feed"
The ask here is to "Save the latest file from a feed".
The solution Kevie gave gets the "first entry in the feed", which is correct for his feed but is not safe in general, as nothing requires the newest item to be listed first.
However, let's see how we could replace grep with xmlstarlet.
The definition of enclosure in the RSS specification is:
<enclosure> is an optional sub-element of <item>.
It has three required attributes. url says where the enclosure is located,
length says how big it is in bytes, and type says what its type is, a standard MIME type.
The url must be an http url.
The location of the files must be in
rss/channel/item/enclosure
or it's not a Podcast feed.
In each enclosure there has to be an XML attribute called url which points to the media.
xmlstarlet
has the select command to select locations.
$ xmlstarlet sel --help
XMLStarlet Toolkit: Select from XML document(s)
Usage: xmlstarlet sel <global-options> {<template>} [ <xml-file> ... ]