The PDF Minute

Putting the "Portable" into documents



When PDF was introduced in 1993, one of the most persistent problems in
mainstream computing was that reliably publishing documents (either literally
via printing, or simply electronically distributing them for others to view) was
hard.

There were a lot of hurdles:

  • Simply moving a document (whether an office document, PostScript file, or
    something else) from one computer to another could result in an unreadable or
    unpleasant display.
  • Printers (from consumer models up to high-end typesetters) each had their
    own proprietary formats and requirements.
  • Many document formats were tied to a single vendor, or a single operating
    system.

    One of PDF's initial design criteria and fundamental promises was to address
    this family of problems, so that one could distribute and use documents with any
    display, any operating system, and any print device, with confidence that the
    result would remain faithful to the author's intent. This was such a pressing,
    unmet concern that it gave the file type its name: the Portable Document
    Format. Let's talk about how that portability is accomplished.

    Documents are heterogeneous…

    Most document formats focus on text: oftentimes its logical structure, sometimes
    some aspects of its appearance, and occasionally some metadata. However, for
    a document to be faithfully rendered away from its author's computer, a host of
    other data is needed: fonts, images (if any), vector graphics, essential
    auxiliary data, and so on. Documents are definitionally heterogeneous, and
    missing any part of a document's data or dependencies can render it useless.

    The way that web content handles this is by referring to these external data,
    with the expectation that browsers will fetch and integrate them appropriately.
    This is how most non-PDF document formats are also structured: for example,
    PostScript files (PDF's predecessor format) refer to fonts and images in much
    the same way as HTML (though using names and sometimes hard-coded relative file
    paths instead of URLs), and those resources have to be carried around alongside
    the document(s) that refer to them. But if a PostScript or HTML file refers to
    some resources that aren't available or have moved unexpectedly, the document's
    rendering will be fundamentally broken.
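
To make that failure mode concrete, here is a minimal Python sketch that lists the resources an HTML document refers to and reports the ones that cannot be found next to it. The names `ResourceCollector` and `dangling_references` are hypothetical, and it only looks at `img src` and `link href` attributes with relative paths; real documents refer to resources in many more ways.

```python
from html.parser import HTMLParser
from pathlib import Path


class ResourceCollector(HTMLParser):
    """Collects src/href values from <img> and <link> tags (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.references: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Which attribute carries the external reference for this tag, if any.
        wanted = {"img": "src", "link": "href"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.references.append(value)


def dangling_references(html: str, base_dir: Path) -> list[str]:
    """Return the referenced relative paths that do not exist under base_dir."""
    collector = ResourceCollector()
    collector.feed(html)
    return [ref for ref in collector.references
            if not (base_dir / ref).is_file()]
```

Copying only the HTML file to another machine leaves every reference dangling, which is exactly the situation described above.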

    …so every PDF carries what it needs

    PDF's solution to this problem is to avoid referring to external resources
    entirely[1]. Instead, PDF documents are self-contained: all of the data
    needed to render the document is included, from fonts to images to metadata to
    interactive elements and auxiliary data. Satisfying this most basic premise —
    knowing that a document's resources would always travel with it — clears the
    lowest bar of portability.
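
The idea of resources travelling with the document can be sketched in a few lines of Python. This is only an illustration of the principle, not PDF's actual file structure: `SelfContainedDocument` and `embed` are hypothetical names, and resources are simply stored as raw bytes keyed by their original path.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SelfContainedDocument:
    """A document that carries the bytes of everything it depends on."""
    text: str
    resources: dict[str, bytes] = field(default_factory=dict)


def embed(text: str, resource_paths: list[str], base_dir: Path) -> SelfContainedDocument:
    """Read each referenced file once and store its bytes inside the document."""
    doc = SelfContainedDocument(text)
    for rel in resource_paths:
        doc.resources[rel] = (base_dir / rel).read_bytes()
    return doc
```

Once `embed` has run, the original files can be moved or deleted without affecting the document, at the cost of a larger file.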

    Next time, I'll talk about the (very cool) fundamental structures within every
    PDF document, and how they are designed to support including all of these
    disparate data types and resources in a single container file.

    Rendering documents to different devices is hard…

    At the time of PDF's introduction, document rendering was done in a bespoke way
    by each individual application, and often was tied to the particular operating
    system and output device being targeted. That is, a word processing program
    would need to use a completely different rendering approach when rendering to a
    display on Windows vs. a display on a Mac vs. sending a document to a printer.

    …so PDF uses an abstract rendering model for all of them

    Adobe changed that by introducing[2] (as part of PostScript) what would later come
    to be known as the Adobe Imaging Model, a high-level procedural rendering
    approach that provided abstractions over the details of operating system and
    output device. The model includes command primitives for drawing text, lines,
    shapes, and images, and for setting fonts, colors, clipping paths, and so on.
    PDF adopted most of the PostScript graphics model's semantics, and then extended
    it over the decades to support new features, media types, and usage patterns.

    It was a good abstraction, in large part because it neatly separated concerns
    between groups with different incentives and requirements: applications could
    target a relatively high-level rendering model, a far simpler task than needing
    to know the details of each class of display or printer they might render to;
    and groups responsible for implementing displays (usually operating system
    vendors) and manufacturing printers could focus on distilling those high-level
    graphics commands into concrete actions to color pixels, move print heads, and
    so on.
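
That separation of concerns can be sketched in Python. This is a toy illustration under stated assumptions, not the Adobe Imaging Model itself: the `Device` interface, its two operations, and both backends are hypothetical names invented for the example. The application function `draw_page` is written once against the abstract interface, and each backend decides how to turn the commands into output.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class Device(ABC):
    """The only thing application code knows about its output target."""

    @abstractmethod
    def fill_rect(self, x: int, y: int, w: int, h: int) -> None: ...

    @abstractmethod
    def show_text(self, x: int, y: int, text: str) -> None: ...


class SvgDevice(Device):
    """Vector backend: records each command as an SVG element."""

    def __init__(self, width: int, height: int) -> None:
        self.width, self.height = width, height
        self.elements: list[str] = []

    def fill_rect(self, x, y, w, h):
        self.elements.append(f'<rect x="{x}" y="{y}" width="{w}" height="{h}"/>')

    def show_text(self, x, y, text):
        self.elements.append(f'<text x="{x}" y="{y}">{escape(text)}</text>')

    def output(self) -> str:
        body = "".join(self.elements)
        return (f'<svg xmlns="http://www.w3.org/2000/svg" '
                f'width="{self.width}" height="{self.height}">{body}</svg>')


class CharGridDevice(Device):
    """Raster backend: a very coarse 'display' with one character per unit."""

    def __init__(self, width: int, height: int) -> None:
        self.rows = [[" "] * width for _ in range(height)]

    def _plot(self, x: int, y: int, ch: str) -> None:
        # Silently clip anything that falls outside the grid.
        if 0 <= y < len(self.rows) and 0 <= x < len(self.rows[0]):
            self.rows[y][x] = ch

    def fill_rect(self, x, y, w, h):
        for row in range(y, y + h):
            for col in range(x, x + w):
                self._plot(col, row, "#")

    def show_text(self, x, y, text):
        for i, ch in enumerate(text):
            self._plot(x + i, y, ch)

    def output(self) -> str:
        return "\n".join("".join(row) for row in self.rows)


def draw_page(device: Device) -> None:
    """Application code: no knowledge of which backend it is talking to."""
    device.fill_rect(0, 0, 12, 1)
    device.show_text(1, 2, "Hello, PDF")
```

Calling `draw_page` with an `SvgDevice` and then with a `CharGridDevice` produces two very different outputs from the same drawing commands, which is the property that lets applications and device implementers work independently.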

    This imaging model was such a successful abstraction that it effectively
    redefined how 2D graphics are programmed and rendered. If you've done any
    graphics programming in the last 30 years, you've benefitted from the results of
    that progress, as you've surely used a library or API that provides a similar
    abstraction; the Adobe Imaging Model was the direct precursor to the most
    widely-used modern 2D graphics APIs like Java's Graphics2D, .NET's
    System.Drawing, Skia's Canvas, and the web-standard canvas API. We'll talk
    a lot about this graphics model in future posts.

    Proprietary document formats actively prevent portability…

    Before PDF, most document formats were proprietary, and choices were regularly made
    by vendors to use document formats as competitive leverage, usually to the
    detriment of users' interests.

    Microsoft Word was a particularly notorious offender, as
    there was not a single "Word document format", but rather a matrix of format
    variations depending on the version of Word and the operating system being used,
    each with its own quirks and limitations when it came to importing other variants.
    While this was a great benefit to Microsoft's Word and Windows businesses, it was
    a nightmare for users who needed to share documents with others using different
    programs or operating systems.

    When Adobe first introduced PDF in 1993, it could have kept the format strictly
    proprietary, so that only Adobe and its designated partners could implement PDF
    generators, viewers, and so on. After all, other peer companies and file formats
    (e.g. Microsoft with Word, Apple with QuickTime) had taken that approach, to
    great commercial success.

    …so PDF was "open" from the start

    Instead of introducing yet another proprietary file format, Adobe did two things
    with PDF that were quite unusual:

    1. They published a detailed specification of the format in 1993, including the
       algorithms and data structures used to encode and decode PDF documents.
       Further, they explicitly encouraged software vendors, printer manufacturers,
       and others to adopt and implement PDF. This was a big deal: it meant that
       anyone could write software to read or write PDF documents, without needing
       to reverse engineer the format. This made it possible for a wide variety of
       software to support PDF, from word processors and web browsers to
       printers and image editors.
    2. Later, Adobe submitted the PDF specification to the International
       Organization for Standardization (ISO), which published it as an open
       standard, ISO 32000-1, in 2008. It has since been further refined and
       expanded in concert among a diversity of interested vendors. As part of
       this, Adobe also issued a public patent license[3], in which they
       explicitly swore off any claim to enforce patents that covered technologies
       within the PDF standard[4].

      If Adobe had treated PDF as a strictly proprietary format, existing only to
      enrich themselves and provide them with a unique competitive advantage, I don't
      think PDF would be as widely-used as it is today. More importantly, though,
      without a coordinated expectation of "openness" (however vaguely defined or
      informal in the early days), and then the tangible commitment to remove all
      remaining proprietary interests from the PDF landscape[5], it's
      likely that other vendors and groups would have attempted to create
      mutually-incompatible PDF variants over time.

      Such fragmentation would have significantly degraded the real-world portability
      of PDF documents: just imagine if Microsoft or Apple or Google had successfully
      pushed their own incompatible PDF variants (or some other wholly-different
      document format[6]), to the extent that "real" PDF documents were no longer
      guaranteed to render correctly on Windows, or Mac, or iPhone or Android devices.
      The promise of PDF's portability would have been broken.

      PDF effectively solved the problem of document portability by addressing
      these three fundamental issues: structurally guaranteeing that document
      resources would always move with the document; disentangling document rendering
      from any particular display, device, or operating system via an abstract
      rendering model; and being first an open and then a standardized
      specification that anyone could implement. This accomplishment did not come
      without its own set of tradeoffs, which we'll come back to in later posts.

      Footnotes
      1. PDF documents are allowed to refer to certain types of resources
        using file paths, but this rare practice is a concession to certain specialized
        workflows where it would be extremely costly to repeatedly embed
        frequently-updated resources on every edit.

      2. The actual graphics model was first introduced in a 1982 paper,
        published shortly before Adobe was founded later that year.
        'A device independent graphics imaging model for use with raster
        devices' is short, easy to read, and well worth taking in to better
        understand the design decisions that underpin the graphics model, and
        thus, PDF itself.

      3. https://www.adobe.com/pdf/pdfs/ISO32000-1PublicPatentLicense.pdf

      4. Prior to this, Adobe had made informal assurances about their
        disinterest in enforcing PDF-related patents against third party vendors and
        open source projects that implemented PDF support. Those assurances were not
        legally binding, so the formal patent grant took the legal risk associated
        with implementing PDF software off the table for good.

      5. This is not to say that Adobe has not benefitted from making PDF an
        open standard. They have, and continue to do so, in many ways. However, the
        point is that the benefits of making PDF an open standard have been widely
        distributed, and have accrued to many parties, not just Adobe.

      6. Microsoft did try to push their own document format, XPS, as a
        competitor to PDF. It never gained significant traction, and Microsoft has since
        deprecated it.


        The PDF Minute, by Chas Emerick