Hacker Public Radio

HPR4404: Kevie nerd snipes Ken by grepping xml



This show has been flagged as Explicit by the host.

More Command line fun: downloading a podcast

In the show
hpr4398 :: Command line fun: downloading a podcast
Kevie walked us through a command to download a podcast.

He used some techniques here that I hadn't used before, and it's always great to see how other people approach the problem.

Let's have a look at the script and walk through what it does, then we'll have a look at some "traps for young players" as the
EEVBlog
is fond of saying.


Analysis of the Script
wget `curl https://tuxjam.otherside.network/feed/podcast/ | grep -o 'https*://[^"]*ogg' | head -1`

It chains four different commands together to "Save the latest file from a feed".

Let's break it down so we can have checkpoints between each step.

I often do this when writing a complex one liner - first do it as steps, and then combine it.
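Following that advice, here is one way the one-liner might be split into checkpointed steps (the intermediate file names are my own choice, not from Kevie's show):

```shell
# Step 1: fetch the feed to a file we can inspect.
curl -s https://tuxjam.otherside.network/feed/podcast/ --output tuxjam.xml

# Step 2: pull the candidate .ogg URLs out into a second file.
grep -o 'https*://[^"]*ogg' tuxjam.xml > matches.txt

# Step 3: keep only the first match.
head -1 matches.txt

# Step 4 would then be: wget "$(head -1 matches.txt)"
```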


  1. The curl command gets
    https://tuxjam.otherside.network/feed/podcast/
    .

    To do this ourselves we will call
    curl https://tuxjam.otherside.network/feed/podcast/ --output tuxjam.xml
    , as without --output curl writes the download to standard output rather than to a file.

    This gives us an XML file, and we can confirm it's valid XML with the
    xmllint
    command.


    $ xmllint --format tuxjam.xml >/dev/null
    $ echo $?
    0

    Here the output of the command is ignored by redirecting it to
    /dev/null
    . Then we check the exit code of the last command. As it's
    0
    , it completed successfully.
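    The same check can be folded into a script. A sketch using xmllint's --noout option, which parses the file but prints nothing, so no redirect is needed:

```shell
# The exit status of xmllint drives the branch: 0 means well-formed.
if xmllint --noout tuxjam.xml; then
    echo "tuxjam.xml is well-formed XML"
else
    echo "tuxjam.xml is not well-formed XML" >&2
fi
```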


    2. Kevie then pipes the output to the
      grep
      search command with the option
      -o
      , looking for any string that starts with http, optionally followed by s, then ://, then any run of characters that doesn't contain a double quote, ending in ogg.
      -o, --only-matching
      Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line

      We can do the same ourselves. I was not aware that grep defaulted to regular expressions, as I tend to add the
      --perl-regexp
      option to ask for them explicitly.


      grep --only-matching 'https*://[^"]*ogg' tuxjam.xml
      http matches the characters http literally (case sensitive)
      s* matches the character s literally (case sensitive)
      Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
      : matches the character : literally
      / matches the character / literally
      / matches the character / literally
      [^"]* match a single character not present in the list (any character except ")
      Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
      " the literal character " (case sensitive) is the one character excluded by the list
      ogg matches the characters ogg literally (case sensitive)
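      One nitpick: s* means zero or more s characters, so the pattern is looser than intended; https? (zero or one s) says what was actually meant. A quick illustration with made-up input:

```shell
# 'https*' happily accepts http, https ... and the bogus httpss.
printf 'http://a/x.ogg\nhttps://b/y.ogg\nhttpss://c/z.ogg\n' |
    grep --only-matching 'https*://[^"]*ogg'

# 'https\?' (one optional s) rejects the httpss line.
printf 'httpss://c/z.ogg\n' |
    grep --only-matching 'https\?://[^"]*ogg' || echo "no match"
```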

      When we run this ourselves we get the following


      $ grep --only-matching 'https*://[^"]*ogg' tuxjam.xml
      https://archive.org/download/tuxjam-121/tuxjam_121.ogg
      https://archive.org/download/tuxjam-120/TuxJam_120.ogg
      https://archive.org/download/tux-jam-119/TuxJam_119.ogg
      https://archive.org/download/tuxjam_118/tuxjam_118.ogg
      https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
      https://tuxjam.otherside.network/tuxjam-115-ogg
      https://archive.org/download/tuxjam_116/tuxjam_116.ogg
      https://tuxjam.otherside.network/tuxjam-115-ogg
      https://tuxjam.otherside.network/tuxjam-115-ogg
      https://tuxjam.otherside.network/tuxjam-115-ogg
      https://ogg
      http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
      https://ogg
      https://archive.org/download/tuxjam_114/tuxjam_114.ogg
      https://archive.org/download/tuxjam_113/tuxjam_113.ogg
      https://archive.org/download/tuxjam_112/tuxjam_112.ogg
      3. The head -1 command returns only the first line, so we are left with
        https://archive.org/download/tuxjam-121/tuxjam_121.ogg
      4. Finally that line is used as the input to the
        wget
        command.

        Problems with the approach

        Relying on grep to parse structured data like XML or JSON can lead to problems.

        When we looked at the output of the command in step 2, some of the results were just
        https://ogg
        .

        When we run the same command without the
        --only-matching
        argument we can see the whole lines that matched.


        $ grep 'https*://[^"]*ogg' tuxjam.xml

        This episode may not be live as in TuxJam 115 from Oggcamp but your friendly foursome of Al, Dave (thelovebug), Kevie and Andrew (mcnalu) are very much alive to treats of Free and Open Source Software and Creative Commons tunes.

        https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/
        https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/#respond
        https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/feed/

        With the group meeting up together for the first time in person, it was decided that a live recording would be an appropriate venture. With the quartet squashed around a table and a group of adoring fans crowded into a room at the Pendulum Hotel in Manchester, the discussion turns to TuxJam reviews that become regularly used applications, what we enjoyed about OggCamp 2024 and for the third section the gang put their reputation on the line and allow open questions from the sea of dedicated fans.

        OggCamp 2024 on Saturday 12 and Sunday 13 October 2024, Manchester UK.

        Two of the hits are not enclosures at all, they are references in the text to OggCamp, such as
        "what we enjoyed about OggCamp 2024"

        Normally running
        grep
        will only report one entry per line, and if the XML is minimised into one big line, matches can run together or be missed entirely.
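        Here is a small, made-up illustration of that failure mode: two URLs sit in element text with no double quote between them, so the greedy [^"]* glues them into one match.

```shell
# One minified line, two URLs, no double quotes between them:
printf '<guid>https://example.org/?p=1</guid><link>https://example.org/ep-1-ogg</link>\n' |
    grep --only-matching 'https*://[^"]*ogg'
# The two URLs come back as a single run-together match,
# just like the ?p=1029 line in the results below.
```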

        I did this myself using


        xmllint --noblanks tuxjam.xml > tuxjam-min.xml

        I then edited it and replaced the newlines with spaces. I have to say that the
        --only-matching
        argument does a great job of pulling out the matches.
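        That hand edit can be scripted; a sketch with tr doing the newline-to-space replacement:

```shell
# Strip ignorable whitespace nodes, then squash every newline to a
# space so the whole feed ends up on one long line.
xmllint --noblanks tuxjam.xml | tr '\n' ' ' > tuxjam-min.xml
```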

        That said the results were not perfect either.


        $ grep --only-matching 'https*://[^"]*ogg' tuxjam-min.xml
        https://archive.org/download/tuxjam-121/tuxjam_121.ogg
        https://archive.org/download/tuxjam-120/TuxJam_120.ogg
        https://archive.org/download/tux-jam-119/TuxJam_119.ogg
        https://archive.org/download/tuxjam_118/tuxjam_118.ogg
        https://archive.org/download/tux-jam-117-uncut/TuxJam_117.ogg
        https://tuxjam.otherside.network/tuxjam-115-ogg
        https://archive.org/download/tuxjam_116/tuxjam_116.ogg
        https://tuxjam.otherside.network/tuxjam-115-ogg
        https://tuxjam.otherside.network/?p=1029https://tuxjam.otherside.network/tuxjam-115-oggcamp-2024/#respondhttps://tuxjam.otherside.network/tuxjam-115-ogg
        https://ogg
        http://tuxjam.otherside.network/wp-content/uploads/sites/5/2024/10/tuxjam_115_OggCamp2024.ogg
        https://ogg
        https://archive.org/download/tuxjam_114/tuxjam_114.ogg
        https://archive.org/download/tuxjam_113/tuxjam_113.ogg
        https://archive.org/download/tuxjam_112/tuxjam_112.ogg

        You could fix it by modifying the
        grep
        arguments and adding additional searches looking for
        enclosure
        . The problem with that approach is that you'll forever and a day be chasing issues when someone changes something.

        So the approach is officially "Grand", but it's very likely to break if you're not babysitting it.


        Suggested Applications

        I recommend never parsing
        structured documents
        , like XML or JSON, with grep.

        You should use dedicated parsers that understand the document markup, and can intelligently address parts of it.

        I recommend:


        • xml
          use
          xmlstarlet
        • json
          use
          jq
        • yaml
          use
          yq

          Of course anyone that looks at my code on the
          hpr Gitea
          will know this is a case of "do what I say, not what I do."

          Never parse XML with grep; the only possible exception is to check whether a string is in a file in the first place.


          grep --max-count=1 --files-with-matches

          That's justified by the fact that
          grep
          is going to be faster than having to parse, and build an
          XML Document Object Model
          , when you don't have to.
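          A sketch of that pre-filter, with the long option spelt out (grep --quiet just sets the exit status; --files-with-matches would print the matching file names instead):

```shell
# Only hand files to the XML parser if they mention an enclosure at all.
for feed in *.xml; do
    if grep --quiet 'enclosure' "$feed"; then
        echo "worth parsing: $feed"
    fi
done
```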


          Some Tips
          Always refer to examples and specification

          A specification is just a set of rules that tell you how the document is formatted.

          There is a danger in just looking at example files and not reading the specifications. I had a situation once where a software developer raised a bug because the supplied files didn't begin with
          ken-test-
          followed by a
          uuid
          , as all the example files had. Suffice it to say that bug was rejected.

          For us there are the rules from the
          RSS specification
          itself, but as it's an XML file there are also the
          XML Specifications
          . While the RSS spec is short, the XML one is not, so people tend to use dedicated libraries to parse XML. Using a dedicated tool like
          xmlstarlet
          will allow us to mostly ignore the details of XML.


          RSS is a dialect of XML
          . All RSS files must conform to the XML 1.0 specification, as published on the World Wide Web Consortium (W3C) website.

          The first line of the tuxjam feed shows it's an XML file.


          The specification goes on to say "At the top level, a RSS document is a <rss> element, with a mandatory attribute called version, that specifies the version of RSS that the document conforms to. If it conforms to this specification, the version attribute must be 2.0." And sure enough the second line shows that it's an RSS file.
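          The feed's opening lines aren't quoted in these notes, but a WordPress podcast feed typically starts along these lines (an illustration of the shape, not the exact bytes of the TuxJam feed):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
     xmlns:atom="http://www.w3.org/2005/Atom">
  <!-- channel and items follow -->
</rss>
```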


          Use the best tool for the job

          You wouldn't grep an Excel file, so why would you grep an XML file?

          We could go on all day but I want to get across the idea that there is structure in the file. As XML is everywhere you should have a tool to process it. More than likely
          xmlstarlet
          is in all the distro repos, so just install it.

          The help looks like this:


          $ xmlstarlet --help
          XMLStarlet Toolkit: Command line utilities for XML
          Usage: xmlstarlet [<options>] <command> [<cmd-options>]
          where <command> is one of:
          ed (or edit) - Edit/Update XML document(s)
          sel (or select) - Select data or query XML document(s) (XPATH, etc)
          tr (or transform) - Transform XML document(s) using XSLT
          val (or validate) - Validate XML document(s) (well-formed/DTD/XSD/RelaxNG)
          fo (or format) - Format XML document(s)
          el (or elements) - Display element structure of XML document
          c14n (or canonic) - XML canonicalization
          ls (or list) - List directory as XML
          esc (or escape) - Escape special XML characters
          unesc (or unescape) - Unescape special XML characters
          pyx (or xmln) - Convert XML into PYX format (based on ESIS - ISO 8879)
          p2x (or depyx) - Convert PYX into XML
          <options> are:
          -q or --quiet - no error output
          --doc-namespace - extract namespace bindings from input doc (default)
          --no-doc-namespace - don't extract namespace bindings from input doc
          --version - show version
          --help - show help
          Wherever file name mentioned in command help it is assumed
          that URL can be used instead as well.
          Type: xmlstarlet <command> --help for command help
          XMLStarlet is a command line toolkit to query/edit/check/transform
          XML documents (for more information see http://xmlstar.sourceforge.net/)

          You can get more help on a given command by calling
          xmlstarlet
          with that command and
          --help
          , like this:


          $ xmlstarlet el --help
          XMLStarlet Toolkit: Display element structure of XML document
          Usage: xmlstarlet el [<options>] <xml-file>
          where
          <xml-file> - input XML document file name (stdin is used if missing)
          <options> is one of:
          -a - show attributes as well
          -v - show attributes and their values
          -u - print out sorted unique lines
          -d<n> - print out sorted unique lines up to depth <n>
          XMLStarlet is a command line toolkit to query/edit/check/transform
          XML documents (for more information see http://xmlstar.sourceforge.net/)

          To prove that it's a structured document we can run the command
          xmlstarlet el -u
          - show me unique elements


          $ xmlstarlet el -u tuxjam.xml
          rss
          rss/channel
          rss/channel/atom:link
          rss/channel/copyright
          rss/channel/description
          rss/channel/generator
          rss/channel/image
          rss/channel/image/link
          rss/channel/image/title
          rss/channel/image/url
          rss/channel/item
          rss/channel/item/category
          rss/channel/item/comments
          rss/channel/item/content:encoded
          rss/channel/item/description
          rss/channel/item/enclosure
          rss/channel/item/guid
          rss/channel/item/itunes:author
          rss/channel/item/itunes:duration
          rss/channel/item/itunes:episodeType
          rss/channel/item/itunes:explicit
          rss/channel/item/itunes:image
          rss/channel/item/itunes:subtitle
          rss/channel/item/itunes:summary
          rss/channel/item/link
          rss/channel/item/pubDate
          rss/channel/item/slash:comments
          rss/channel/item/title
          rss/channel/item/wfw:commentRss
          rss/channel/itunes:author
          rss/channel/itunes:category
          rss/channel/itunes:explicit
          rss/channel/itunes:image
          rss/channel/itunes:owner
          rss/channel/itunes:owner/itunes:name
          rss/channel/itunes:subtitle
          rss/channel/itunes:summary
          rss/channel/itunes:type
          rss/channel/language
          rss/channel/lastBuildDate
          rss/channel/link
          rss/channel/podcast:guid
          rss/channel/podcast:license
          rss/channel/podcast:location
          rss/channel/podcast:medium
          rss/channel/podcast:podping
          rss/channel/rawvoice:frequency
          rss/channel/rawvoice:location
          rss/channel/sy:updateFrequency
          rss/channel/sy:updatePeriod
          rss/channel/title

          That is the
          xpath
          representation of the XML structure. It's very similar to a unix filesystem tree. There is one
          rss
          branch, under that is one
          channel
          branch, and that can have many
          item
          branches.
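          Because it behaves like a tree, XPath expressions can address it the way paths address a filesystem. As a taster, counting the item branches in the feed we saved earlier (-t starts a template, -v prints the value of an XPath expression):

```shell
# Count how many item branches the channel has.
xmlstarlet sel -t -v 'count(/rss/channel/item)' -n tuxjam.xml
```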


          "Save the latest file from a feed"

          The ask here is to "Save the latest file from a feed".

          The solution Kevie gave gets the "first entry in the feed", which is correct for his feed but is not safe in general, as nothing in the specification requires items to be sorted newest first.

          However let's see how we could replace
          grep
          with
          xmlstarlet
          .

          The definition of
          enclosure
          is:


          <enclosure> is an optional sub-element of <item>.
          It has three required attributes. url says where the enclosure is located,
          length says how big it is in bytes, and type says what its type is, a standard MIME type.
          The url must be an http url.

          The location of the files must be in
          rss/channel/item/enclosure
          or it's not a podcast feed.

          In each
          enclosure
          there has to be an XML attribute called
          url
          which points to the media.
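          xmlstarlet can address that attribute directly. As a sketch against the tuxjam.xml we saved earlier: -m matches every enclosure element, -v prints its url attribute, and -n ends each one with a newline:

```shell
# List every enclosure URL in the feed, one per line.
xmlstarlet sel -t -m '/rss/channel/item/enclosure' -v '@url' -n tuxjam.xml
```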

          xmlstarlet
          has the select command to select locations.


          $ xmlstarlet sel --help
          XMLStarlet Toolkit: Select from XML document(s)
          Usage: xmlstarlet sel <global-options> {<template>} [ <xml-file> ... ]
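          The full sel help is long, so skipping to the punchline: a sketch of the original one-liner with xmlstarlet standing in for grep and head, selecting the url attribute of the first item's enclosure. This is my guess at a replacement given the structure shown above, not a tested production script:

```shell
# Fetch the feed, select the url attribute of the first item's
# enclosure, and hand that single URL to wget.
url=$(curl -s https://tuxjam.otherside.network/feed/podcast/ |
    xmlstarlet sel -t -v '/rss/channel/item[1]/enclosure/@url')

# Only download if we actually extracted a URL.
if [ -n "$url" ]; then
    wget "$url"
else
    echo "no enclosure url found" >&2
fi
```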