January 30, 2025

Putting the "Portable" into documents

8 minutes

When PDF was introduced in 1993, one of the most persistent problems in

mainstream computing was that reliably publishing documents (either literally

via printing, or simply electronically distributing them for others to view) was

hard.

There were a lot of hurdles:

- Simply moving a document (whether an office document, PostScript file, or something else) from one computer to another could result in an unreadable or unpleasant display.

- Printers (from consumer models up to high-end typesetters) each had their own proprietary formats and requirements.

- Many document formats were tied to a single vendor, or a single operating system.

One of PDF's initial design criteria and fundamental promises was to address

this family of problems, so that one could distribute and use documents with any

display, any operating system, and any print device, with confidence that the

result would remain faithful to the author's intent. This was such a pressing,

unmet concern that it gave the file type its name: the Portable Document

Format. Let's talk about how that portability is accomplished.

Documents are heterogeneous…

Most document formats focus on text: oftentimes its logical structure, sometimes

some aspects of its appearance, and occasionally some metadata. However, for

a document to be faithfully rendered away from its author's computer, a host of

other data is needed: fonts, images (if any), vector graphics, essential

auxiliary data, and so on. Documents are definitionally heterogeneous, and

missing any part of a document's data or dependencies can render it useless.

The way that web content handles this is by referring to these external data,

with the expectation that browsers will fetch and integrate them appropriately.

This is how most non-PDF document formats are also structured: for example,

files written in PostScript, PDF's predecessor, refer to fonts and images in much the same way

as HTML (though using names and sometimes hard-coded relative file paths instead

of URLs), and those resources have to be carried around alongside the

document(s) that refer to them. But if a PostScript or HTML file refers to some

resources that aren't available or have moved unexpectedly, the document's

rendering will be fundamentally broken.

…so every PDF carries what it needs

PDF's solution to this problem is to avoid referring to external resources

entirely[1]. Instead, PDF documents are self-contained: all of the data

needed to render the document is included, from fonts to images to metadata to

interactive elements and auxiliary data. Satisfying this most basic premise —

knowing that a document's resources would always travel with it — clears the

lowest bar of portability.
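To make the contrast concrete, here is a toy sketch in Python. It is not PDF's actual file structure: the class names (`LinkedDocument`, `SelfContainedDocument`), the dictionaries standing in for each machine's file system, and the resource names are all hypothetical, chosen only to show why embedded resources survive a move while referenced ones may not.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedDocument:
    """Toy document that only names its resources (as HTML or PostScript do)."""
    text: str
    resource_paths: list[str] = field(default_factory=list)

    def render(self, filesystem: dict[str, bytes]) -> str:
        # Every resource must be found on whatever machine is rendering.
        missing = [p for p in self.resource_paths if p not in filesystem]
        if missing:
            raise FileNotFoundError(f"missing resources: {missing}")
        return f"{self.text} (using {len(self.resource_paths)} external resources)"


@dataclass
class SelfContainedDocument:
    """Toy document that carries its resources' bytes inside itself."""
    text: str
    embedded: dict[str, bytes] = field(default_factory=dict)

    def render(self, filesystem: dict[str, bytes]) -> str:
        # The local file system is never consulted.
        return f"{self.text} (using {len(self.embedded)} embedded resources)"


def embed(doc: LinkedDocument, filesystem: dict[str, bytes]) -> SelfContainedDocument:
    """Copy every referenced resource into the document itself."""
    return SelfContainedDocument(
        text=doc.text,
        embedded={p: filesystem[p] for p in doc.resource_paths},
    )


# Hypothetical author's machine: it has the font and the image.
author_machine = {"fonts/Body.otf": b"<font bytes>", "img/logo.png": b"<image bytes>"}
# Hypothetical reader's machine: it has neither.
reader_machine: dict[str, bytes] = {}

linked = LinkedDocument("Quarterly report", ["fonts/Body.otf", "img/logo.png"])
portable = embed(linked, author_machine)

print(linked.render(author_machine))    # works where it was authored
print(portable.render(reader_machine))  # works without any local resources

try:
    linked.render(reader_machine)
except FileNotFoundError as exc:
    print("linked document failed:", exc)
```

The trade-off visible even in this toy is size: the self-contained document duplicates bytes that the linked one merely points at, in exchange for never depending on the machine it lands on.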

Next time, I'll talk about the (very cool) fundamental structures within every

PDF document, and how they are designed to support including all of these

disparate data types and resources in a single container file.

Rendering documents to different devices is hard…

At the time of PDF's introduction, document rendering was done in a bespoke way

by each individual application, and often was tied to the particular operating

system and output device being targeted. That is, a word processing program

would need to use a completely different rendering approach when rendering to a

display on Windows vs. a display on a Mac vs. sending a document to a printer.

…so PDF uses an abstract rendering model for all of them

Adobe changed that by introducing[2] (as part of PostScript) what would later come

to be known as the Adobe Imaging Model, a high-level procedural rendering

approach that provided abstractions over the details of operating system and

output device. The model includes command primitives for drawing text, lines,

shapes, and images, and for setting fonts, colors, and clipping paths. PDF adopted

most of the PostScript graphics model's semantics, and then extended it over the

decades to support new features, media types, and usage patterns.

It was a good abstraction, in large part because it neatly separated concerns

between groups with different incentives and requirements: applications could

target a relatively high-level rendering model, a far simpler task than needing

to know the details of each class of display or printer they might render to;

and groups responsible for implementing displays (usually operating system

vendors) and manufacturing printers could focus on distilling those high-level

graphics commands into concrete actions to color pixels, move print heads, and

so on.
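Here is a minimal sketch of that separation of concerns, in Python. It is an illustration of the idea only, not the Adobe Imaging Model's real operator set: the `Device` interface, the single `fill_rect` primitive, the two toy devices, and the `RECT ... FILL` command syntax are all assumptions made up for this example. The "document" is a list of device-independent drawing commands in abstract page units, and each device decides for itself how to turn them into output.

```python
from abc import ABC, abstractmethod


class Device(ABC):
    """What a display or printer implementer provides (hypothetical interface)."""

    @abstractmethod
    def fill_rect(self, x: float, y: float, w: float, h: float) -> None: ...

    @abstractmethod
    def output(self) -> str: ...


class CharGridDevice(Device):
    """A toy 'display': rasterizes onto a grid of character cells."""

    def __init__(self, cols: int, rows: int, units_per_cell: float) -> None:
        self.scale = units_per_cell
        self.grid = [["." for _ in range(cols)] for _ in range(rows)]

    def fill_rect(self, x: float, y: float, w: float, h: float) -> None:
        # Map abstract page units onto this device's own resolution.
        for row in range(int(y / self.scale), int((y + h) / self.scale)):
            for col in range(int(x / self.scale), int((x + w) / self.scale)):
                if 0 <= row < len(self.grid) and 0 <= col < len(self.grid[0]):
                    self.grid[row][col] = "#"

    def output(self) -> str:
        return "\n".join("".join(row) for row in self.grid)


class CommandLogDevice(Device):
    """A toy 'printer': translates each primitive into its own command language."""

    def __init__(self) -> None:
        self.commands: list[str] = []

    def fill_rect(self, x: float, y: float, w: float, h: float) -> None:
        self.commands.append(f"RECT {x:g} {y:g} {w:g} {h:g} FILL")

    def output(self) -> str:
        return "\n".join(self.commands)


# The "document": drawing commands that know nothing about any device.
PAGE = [("fill_rect", 0, 0, 40, 10), ("fill_rect", 10, 10, 20, 20)]


def render(page: list[tuple], device: Device) -> str:
    """Replay the same page description on whichever device is supplied."""
    for name, *args in page:
        getattr(device, name)(*args)
    return device.output()


print(render(PAGE, CharGridDevice(cols=4, rows=3, units_per_cell=10)))
print(render(PAGE, CharGridDevice(cols=8, rows=6, units_per_cell=5)))
print(render(PAGE, CommandLogDevice()))
```

The application side (whatever produced `PAGE`) never changes; a higher-resolution grid or an entirely different kind of device only requires a different `Device` implementation.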

This imaging model was such a successful abstraction that it effectively

redefined how 2D graphics are programmed and rendered. If you've done any

graphics programming in the last 30 years, you've benefitted from the results of

that progress, as you've surely used a library or API that provides a similar

abstraction; the Adobe Imaging Model was the direct precursor to the most

widely-used modern 2D graphics APIs like Java's Graphics2D, .NET's

System.Drawing, Skia's Canvas, and the web-standard canvas API. We'll talk

a lot about this graphics model in future posts.

Proprietary document formats actively prevent portability…

Before PDF, most document formats were proprietary, and choices were regularly made

by vendors to use document formats as competitive leverage, usually to the

detriment of users' interests.

Microsoft Word was a particularly notorious offender, as

there was not a single "Word document format", but rather a matrix of format

variations depending on the version of Word and the operating system being used,

each with its own quirks and limitations when it came to importing other variants.

While this was a great benefit to Microsoft's Word and Windows businesses, it was

a nightmare for users who needed to share documents with others using different

programs or operating systems.

When Adobe first introduced PDF in 1993, it could have kept the format strictly

proprietary, so that only Adobe and its designated partners could implement PDF

generators, viewers, and so on. After all, other peer companies and file formats

(e.g. Microsoft with Word, Apple with QuickTime) had taken that approach, to

great commercial success.

…so PDF was "open" from the start

Instead of introducing yet another proprietary file format, Adobe did two things

with PDF that were quite unusual:

They published a detailed specification of the format in 1993, including the

algorithms and data structures used to encode and decode PDF documents.

Further, they explicitly encouraged software vendors, printer manufacturers,

and others to adopt and implement PDF. This was a big deal: it meant that

anyone could write software to read or write PDF documents, without needing

to reverse engineer the format. This made it possible for a wide variety of

software to support PDF, from word processors and web browsers to

printers and image editors.

Later, Adobe submitted the PDF specification to the International Organization

for Standardization (ISO), which published it as an open standard (ISO 32000-1)

in 2008. It has since been further refined and expanded in concert among a diversity of

interested vendors. As part of this, Adobe also issued a public patent

license[3], where they explicitly swore off any claim to enforce

patents that covered technologies within the PDF standard[4].

If Adobe had treated PDF as a strictly proprietary format, existing only to

enrich themselves and provide them with a unique competitive advantage, I don't

think PDF would be as widely-used as it is today. More importantly, though,

without a coordinated expectation of "openness" (however vaguely defined or

informal in the early days), and then the tangible commitment to remove all

remaining proprietary interests from the PDF landscape[5], it's

likely that other vendors and groups would have attempted to create

mutually-incompatible PDF variants over time.

Such fragmentation would have significantly degraded the real-world portability

of PDF documents: just imagine if Microsoft or Apple or Google had successfully

pushed their own incompatible PDF variants (or some other wholly-different

document format[6]), to the extent that "real" PDF documents were no longer

guaranteed to render correctly on Windows, or Mac, or iPhone or Android devices.

The promise of PDF's portability would have been broken.

PDF effectively solved the problem of document portability by addressing

these three fundamental issues: structurally guaranteeing that document

resources would always move with the document; disentangling document rendering

from any particular display, device, or operating system via an abstract

rendering model; and being first an open and then a standardized

specification that anyone could implement. This accomplishment did not come

without its own set of tradeoffs, which we'll come back to in later posts.

Footnotes

[1] PDF documents are allowed to refer to certain types of resources

using file paths, but this rare practice is a concession to certain specialized

workflows where it would be extremely costly to repeatedly embed

frequently-updated resources on every edit. ↩

[2] The actual graphics model was first introduced in a 1982 paper,

published before Adobe was founded later that year.

'A device independent graphics imaging model for use with raster

devices' is a short, easy-to-read paper that is well worth taking in to better understand the design

decisions that underpin the graphics model, and thus, PDF itself. ↩

[3] https://www.adobe.com/pdf/pdfs/ISO32000-1PublicPatentLicense.pdf ↩

[4] Prior to this, Adobe had made informal assurances about their

disinterest in enforcing PDF-related patents against third party vendors and

open source projects that implemented PDF support. Those assurances were not

legally binding, so the formal patent grant took the legal risk associated

with implementing PDF software off the table for good. ↩

[5] This is not to say that Adobe has not benefitted from making PDF an

open standard. They have, and continue to do so, in many ways. However, the

point is that the benefits of making PDF an open standard have been widely

distributed, and have accrued to many parties, not just Adobe. ↩

[6] Microsoft did try to push their own document format, XPS, as a

competitor to PDF. It never gained significant traction, and Microsoft has since

deprecated it. ↩


By Chas Emerick