PDF as XML
Discerning Software, San Francisco, CA 94107, US
mda@discerning.com
http://discerning.com/
General
Internet-Draft
This document specifies two translations of PDF ("Portable Document Format") into XML. One format, "Raw", preserves physical file layout decisions; the other format, "Logical", does not.
taken: bcdifghijklmnqsvwy
issues in removing generations. xref streams in 1.5.
references to fonts and XObjects
what instructions aren't covered?
q/Q are nested?
content stream hoisting
xml namespace
specversion
use of xml::id
use of xml namespaces
tagged structure
The PDF (Portable Document Format) has been in use since 1993, undergoing several revisions.
In addition to the general PDF Specification published by Adobe, a print industry consortium
has defined a series of standards known collectively as PDF/X, including several ISO standards such as PDF/X-3.
A separate group has produced another ISO specification, PDF/A, for archival use.
While it is possible to create a PDF document entirely in ASCII, most are not, for
reasons of compression. Even in purely ASCII form, a PDF document is not particularly easy for
an engineer to read directly, nor for a computer program to parse. It is therefore necessary to
use PDF-specific software libraries to parse the data, and tools to extract human-readable text or images.
Meanwhile there are many mature tools for processing XML, and XML is much more widely understood in the
engineering community than the PDF format.
It would therefore be useful to have a defined XML encoding of PDF which would enable tools to act at the XML level.
This document defines two profiles for representing a PDF document as XML:
"Raw" and "Logical".
The "Raw" profile preserves physical layout decisions in a PDF document, while "Logical" does not.
The two XML documents have different xml root elements ("pdfraw" vs. "pdf").
Neither is a superset of the other, though they share many common features.
The "Raw" profile is useful for debugging issues with PDF files, and for doing low-level
manipulations prior to conversion back to PDF. It preserves most (but not all) of the file structure.
The "Logical" profile is useful for performing conversions to other document formats.
There is no information lost in the "Logical" profile that might be useful
when converted to some other document format. The difference between "Raw" and "Logical"
is never one that would make a difference visually to a user.
Here is a table of what is preserved in each:
Aspect | Physical (Raw) | Logical | Comment
object as reference vs. direct | Y | N | Logical might not even be compliant with PDF rules over when direct objects vs. references are required.
identical object numbering | Y | N |
object generations | Y | N |
unused and freed objects | Y | N |
stored bytes (including binary) between header and first object, or in holes between objects | N | N |
null: present vs. missing | Y | N | Logical simply omits any dictionary entries with null values. Note that you cannot omit empty dictionaries, because PDF sometimes distinguishes null versus empty in inheritance.
string objects: stored as hex vs. literal | Y | N |
text strings: using PDFDocEncoding vs. UTF-16BE | Y | N |
strings: use of character escapes, octal escapes, and line continuations in literals (e.g. one character vs. two-character "\n") | Y | N | Note that strings containing 8-bit binary characters still have to be escaped for XML; see
names: choice of characters being hash-encoded | Y | N |
pdf comments: preserved | N | N | Raw MAY but it is not required.
non-significant white space differences preserved (EOL choice, multiple spaces, etc.) | N | N |
content streams: collection linked by Extends vs. single | Y | N | Logical does not support Extends.
object streams (PDF 1.5): use or not, with or collections, etc. | Y | N |
text streams (PDF 1.5): used or not instead of normal text strings | Y | N |
cross-reference table ("xref" section) and cross-reference streams (PDF 1.5) | Y | N | Offsets are for debugging purposes and should not be trusted when converting back to PDF.
encryption | Y | Y | See
This section describes the structure of the two XML formats. There are full examples in the Appendix.
[Figure: An example Raw file]
As can be seen, the Raw format preserves the structure of the PDF file, including incremental updates and object generations.
Practically all aspects of the PDF file are retained, including offsets, which are meaningless in the XML file
and may not be accurate if the file is converted back to PDF.
In Raw format, all XML elements have lower case names specific to the format: no PDF names such as "Catalog" appear as element names.
All basic objects are individual XML elements; for example, the number 5 is held in an element of its own.
Because the Raw format is so close to the PDF layout, it is relatively robust against new PDF features.
For example, it is not necessary for a tool that interconverts between PDF file format and Raw to
know about "object streams" or "cross-reference streams". As long as no new basic object types are introduced,
and no new low-level sections (like the xref section) are added,
the Raw format should work well with future PDF versions.
[Figure: An example Logical file]
In the Logical format, the immediate children of the "pdf" root element are:
- Entries of the trailer, exclusive of "Root", "Prev", "Size". This currently means "Encrypt", "Info", and "ID" (all of which are optional in PDF documents).
- Any children of the "Root" dictionary. Currently this means "Catalog".
The Logical format does not need object references to the same degree as the Raw format (which uses them to match the PDF file).
In the Logical format, as long as an object is used just once, it is simply inserted as a direct child.
When object references are used, they are not accomplished in the same way as the Raw format.
In Raw format (as in the PDF file format itself), the "indirect objects" are marked with a surrounding object ("o" in Raw and "obj" in PDF)
that specifies the object number and generation, using an "objnum" and "gen" attribute in Raw.
In Logical format, there is no such extra surrounding object (the object itself has an "id" attribute), and generations don't exist.
Also, because all relevant objects must be reachable by the trailer or Root, the Logical format has no need for an "objects" element
the way that the Raw format does (to hold unused and freed objects, etc.).
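The inlining rule for the Logical format can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the specification: the `Ref` class, the plain dict/list object model, and the function names are all hypothetical. Objects referenced exactly once are inserted directly; objects referenced more than once keep their reference.

```python
from collections import Counter


class Ref:
    """Hypothetical stand-in for a PDF indirect reference such as `5 0 R`."""

    def __init__(self, objnum):
        self.objnum = objnum


def count_refs(value, counts):
    """Count how many times each object number is referenced under `value`."""
    if isinstance(value, Ref):
        counts[value.objnum] += 1
    elif isinstance(value, dict):
        for child in value.values():
            count_refs(child, counts)
    elif isinstance(value, list):
        for child in value:
            count_refs(child, counts)


def inline_single_use(value, objects, counts):
    """Replace references to objects used exactly once with the object itself."""
    if isinstance(value, Ref):
        if counts[value.objnum] == 1:
            return inline_single_use(objects[value.objnum], objects, counts)
        return value  # shared object: keep the reference
    if isinstance(value, dict):
        return {k: inline_single_use(v, objects, counts) for k, v in value.items()}
    if isinstance(value, list):
        return [inline_single_use(v, objects, counts) for v in value]
    return value


def to_logical(root, objects):
    """Inline single-use objects, starting from the trailer/Root dictionary.

    Simplification: references held by unreachable objects are counted too.
    """
    counts = Counter()
    count_refs(root, counts)
    for obj in objects.values():
        count_refs(obj, counts)
    return inline_single_use(root, objects, counts)
```

A page tree node referenced both by the catalog and by a page's "Parent" entry is counted twice, so it stays a reference; this also keeps the recursion from looping on Parent/Kids cycles.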
Any "outline dictionary" is treated as a normal dictionary so that these do not appear: First, Last, Count, Prev, Next.
Any "name trees" and "number trees" are flattened so that these do not appear: Kids, Names, Nums, Limits. These are identified by any dictionary with keys "Kids", "Names", or "Nums".
The tree under the dictionary is replaced by a flattened sequence of XML <entry> elements identical to what it would have been as a normal dictionary. Order is preserved in the XML.
The dictionary above any stream has "Length", which is not preserved in Logical.
A "structure tree" and a "structure element dictionary" have "K" and "StructParents". These are identifier numbers into marked content (MCID),
which are removed in favor of XML containment.
A "cross-reference stream dictionary" (which has Size, Prev among others) is not preserved in Logical format, so no transformation is needed.
A "navigation node dictionary" has Next and Prev which are supplanted by document ordering. (TBD: but what if it isn't commutative? so that a node's Next doesn't have that node as its Prev?)
Other higher-level objects are preserved as they are: dates are left as strings, file specification dictionaries are preserved as dictionaries, etc.
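The flattening of name trees and number trees described above can be sketched as below. This is only an illustration: it assumes the tree has already been loaded into nested Python dicts with all references resolved, and the function name `flatten_tree` is hypothetical. "Limits" entries are simply ignored, since they carry no information once the tree is flat.

```python
def flatten_tree(node):
    """Return the (key, value) pairs of a name or number tree, in document order.

    Assumes `node` is a dict whose "Kids" entries are already resolved to
    dicts (not references). Leaf data is a flat array that alternates
    key, value, key, value, ...
    """
    entries = []
    pairs = node.get("Names") or node.get("Nums") or []
    for i in range(0, len(pairs) - 1, 2):
        entries.append((pairs[i], pairs[i + 1]))
    for kid in node.get("Kids", []):
        entries.extend(flatten_tree(kid))
    return entries
```

Each resulting pair would become one <entry> element, in the same order.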
The Logical format allows for some transformations that make for shorter and more readable content:
- Any reference can be replaced with the referenced object, as long as any other references to the same object are adjusted accordingly.
- A "dict" (dictionary) element containing a "Type" entry can be replaced with an element with a name equal to the value of that "Type" entry, as long as that value is a legal XML NCName.
- An entry in a dictionary can become an attribute of its parent if the name of the entry is a legal XML NCName, and its value is any of:
    - a number,
    - a name whose characters would not require "xmlenc" escaping,
    - an array of numbers,
    - an array of names whose characters are in [a-zA-Z0-9\.\-\_],
    - a string whose characters are printable ASCII + SP (octal 040-176) exclusive of backslash and PDF delimiters [\(\)\<\>\[\]\{\}/%\\].
- An "inst" (instruction) element in a content stream can be replaced by an element whose name is equal to its operator, if the operator is a legal NCName. The following operators are not legal NCNames and are mapped to these NCNames:
    Operator | NCName
    b* | bstar
    B* | Bstar
    f* | fstar
    T* | Tstar
    W* | Wstar
    ' | apos
    " | quot
  If any new operators are defined in the future that are not NCNames, this table would have to be extended (and meanwhile processors would have to handle the "inst" element without hoisting).
- The operands of an "inst" element can be hoisted to an "operands" attribute if the operands consist of a single value meeting the criteria for dictionary values, or consist of multiple values meeting those criteria, as long as none of them are strings (only numbers and names).
- Certain known begin-end pairs of operators are replaced with a single element, with the intervening instructions nested between. In particular:
    Operators | XML
    BX/EX | Compat
    BI/EI | Image
    BT/ET | Text
    BDC/EMC | MarkedContent (with propertylist child)
    BMC/EMC | MarkedContent (without propertylist child)
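The operator-to-element mapping and the begin-end nesting above can be sketched as follows. This is a hedged illustration only: the function names, the tuple-based tree, and the "content" root label are assumptions, operands are omitted, and the NCName test covers only ASCII (which is enough for PDF operators). Inline image data between BI and EI is binary and would need separate handling that is not shown.

```python
import re

# Operators that are not legal NCNames, mapped as in the table above.
SPECIAL_NAMES = {
    "b*": "bstar", "B*": "Bstar", "f*": "fstar",
    "T*": "Tstar", "W*": "Wstar", "'": "apos", '"': "quot",
}

# Begin operator -> (end operator, replacement element name).
PAIRS = {
    "BX": ("EX", "Compat"), "BI": ("EI", "Image"), "BT": ("ET", "Text"),
    "BDC": ("EMC", "MarkedContent"), "BMC": ("EMC", "MarkedContent"),
}

# ASCII-only approximation of the XML NCName production.
ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def element_name_for(operator):
    """Pick the element name for an operator; fall back to the generic "inst"."""
    if operator in SPECIAL_NAMES:
        return SPECIAL_NAMES[operator]
    if ASCII_NCNAME.match(operator):
        return operator
    return "inst"


def nest(operators):
    """Build a (name, children) tree, nesting instructions inside begin-end pairs."""
    root = ("content", [])
    stack = [root]
    closers = []
    for op in operators:
        if op in PAIRS:
            end, name = PAIRS[op]
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
            closers.append(end)
        elif closers and op == closers[-1]:
            stack.pop()
            closers.pop()
        else:
            stack[-1][1].append((element_name_for(op), []))
    return root
```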
Names always have their leading slash whether in element content or "hoisted" to an attribute value.
When a string value is hoisted from a string element,
it is always in the PDF "literal" format, including the surrounding parentheses (though it is a severely restricted subset, not allowing backslash or escapes).
Note that a processing application will still have to handle the "unhoisted" cases, because for example there are instructions that are not valid NCNames, such as ' and ".
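The string and name hoisting rules above could be checked with something like the following sketch. The function names are hypothetical, and the name check uses the character class given for arrays of names, which is an assumption for single names (the text only requires that no "xmlenc" escaping be needed).

```python
# Backslash and the PDF delimiter characters, which block hoisting of a string.
FORBIDDEN_IN_HOISTED_STRING = set("()<>[]{}/%\\")
NAME_CHARS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789.-_"
)


def hoist_string(text):
    """Return the attribute value for a string, or None if it must stay unhoisted.

    Hoisted strings keep the PDF literal form, including the parentheses.
    """
    for ch in text:
        if not (0x20 <= ord(ch) <= 0x7E):  # printable ASCII plus space only
            return None
        if ch in FORBIDDEN_IN_HOISTED_STRING:
            return None
    return "(" + text + ")"


def hoist_name(name):
    """Return the attribute value for a name (without its slash), or None."""
    if name and all(ch in NAME_CHARS for ch in name):
        return "/" + name  # names always keep their leading slash
    return None
```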
The PDF storage format allows for any 8-bit value to be used, for example in image data.
In contrast, it is illegal for some 8-bit values to ever appear in an XML document,
regardless of the document charset (for example, bytes in the range 0x0-0x1F besides CR, LF, and TAB).
So if identical bytes from some portion of a PDF file were copied directly into XML,
the result would not be valid.
We therefore introduce an xmlenc attribute, which may be used on leaf elements in an XML representation,
to indicate an encoding that is applied, at the level of the XML representation, to the element's character content.
This encoding is applied in addition to any encryption or compression that might have been done at the PDF level.
The legal values for xmlenc are "none", "base16", and "base64".
Both "base16" and "base64" are as per RFC3548, with no limitations on line length, and with no permitted "ignorable" characters.
Note that "base16" differs from the "hexadecimal encoding" used in PDF, because the PDF hex encoding allows any white space
and either upper case A-F or lower case a-f letters, and it allows the final hexadecimal digit to be omitted. We do not permit any of that variability in "base16".
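As a rough sketch of the xmlenc values, the following uses Python's standard base64 module, whose base16 and base64 alphabets match RFC 3548. The function names are hypothetical. `pdf_hex_to_base16` shows the normalization needed to turn the looser PDF hexadecimal form (given here without its angle brackets) into strict "base16"; a missing final digit is taken as 0, as the PDF specification states.

```python
import base64
import re

# PDF white-space characters: NUL, TAB, LF, FF, CR, SP.
PDF_WHITESPACE = re.compile(r"[\x00\t\n\x0c\r ]+")


def xml_encode(data, xmlenc):
    """Encode raw bytes as element character content for the given xmlenc."""
    if xmlenc == "base16":
        return base64.b16encode(data).decode("ascii")
    if xmlenc == "base64":
        return base64.b64encode(data).decode("ascii")
    raise ValueError("unsupported xmlenc: %r" % xmlenc)


def xml_decode(text, xmlenc):
    """Decode strictly: lower case hex, white space, and stray characters are errors."""
    if xmlenc == "base16":
        return base64.b16decode(text, casefold=False)
    if xmlenc == "base64":
        return base64.b64decode(text, validate=True)
    raise ValueError("unsupported xmlenc: %r" % xmlenc)


def pdf_hex_to_base16(pdf_hex):
    """Normalize the contents of a PDF hex string to strict base16."""
    digits = PDF_WHITESPACE.sub("", pdf_hex)
    if len(digits) % 2:
        digits += "0"  # PDF treats an omitted final digit as 0
    return digits.upper()
```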
The document character set in XML is Unicode (this is not to be confused with the charset value, which indicates a character encoding).
This is not the case in PDF. Strings in PDF "content streams" are indexes into glyph tables in fonts.
Strings outside of "content streams" might be Unicode (as UTF-16BE) or might be in the "PDFDocEncoding",
a Latin-like single byte charset defined in the PDF specification.
As with any other binary data, such as images, bytes from strings may not appear in the XML if they fall outside those valid for XML, or outside those legal for the document encoding.
This is discussed further in .
We place no constraints on the XML document character encoding ("charset" value),
but we do note that a Unicode encoding (such as utf-8, utf-16be, etc.)
may reduce the need for numerical entities in strings.
This section describes how to encode each of the "basic" objects in PDF. Here is a summary table:
Object | PDF Syntax | XML Raw Syntax | XML Transformed Syntax | Comments
Array | [1 2 3] | 1 2 3 | "[1 2 3]" |
Boolean | true | <true> | Foo="true" |
Dictionary | << /Type /Foo /Size 17 ... >> | /Foo 17 ... | ... |
Indirect Object | 1 0 obj | ... | ... |
Object Reference | 1 0 R | [element markup not recoverable] | |
Name | /Foo | /Foo | "/Foo" |
Null | null | [element markup not recoverable] | (not present) | does not appear in Logical
Number | 17 | 17 | "17" | both integers and reals
Stream | stream ... endstream | ... | same |
String, literal | (hello world) | hello world | "hello world" |
String, hex | <68> | 68 | "h" |
In a PDF document, character encoding is addressed differently inside and outside a content stream.
Any text outside of a content stream is called a "text string." Examples are in
annotations, bookmarks, article names, and document information. These are always
either in the PDFDocEncoding, or in UTF-16BE.
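Telling the two text string encodings apart can be sketched as below. PDF text strings in Unicode begin with the byte order mark FE FF; everything else is PDFDocEncoding. Python has no PDFDocEncoding codec, so this sketch makes a loud assumption: it approximates PDFDocEncoding with Latin-1 and refuses bytes in the ranges where the two are known to differ (0x18-0x1F and 0x80-0xA0). The function name is hypothetical.

```python
def decode_text_string(raw):
    """Decode the bytes of a PDF text string (outside content streams) to str."""
    if raw.startswith(b"\xfe\xff"):
        # Unicode text string: UTF-16BE after the byte order mark.
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding, which is only safe
    # outside the ranges rejected here. A real converter needs the full
    # PDFDocEncoding table from the PDF specification.
    for b in raw:
        if 0x18 <= b <= 0x1F or 0x80 <= b <= 0xA0:
            raise NotImplementedError("byte 0x%02X needs a PDFDocEncoding table" % b)
    return raw.decode("latin-1")
```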
String objects in a content stream of a PDF are, strictly speaking, byte streams acting as glyph indices into fonts.
The so-called "simple" fonts use single bytes.
Composite fonts define code ranges in their "CMaps" which can map from 1, 2, 3, or 4 successive bytes.
A font might be "symbolic" (such as ZapfDingbats) or "nonsymbolic".
It can declare a standard character encoding (such as "WinAnsiEncoding"),
or define its own "encoding dictionary". It can also define a "Differences" array relative to a BaseEncoding.
The font mapping table determines how to map from a byte sequence to the appropriate glyphs. In most cases, it is also possible
to determine corresponding Unicode characters.
If it is a simple font with a standard character mapping (such as "WinAnsiEncoding"), conversion to Unicode is easy.
In other cases, such as some composite fonts, it is not always possible (unless there is a "ToUnicode" map, or
it uses a predefined "CMap").
Note that in some encodings (including the single-byte MacExpertEncoding), a single glyph index might
be for a ligature (e.g. "ff" or "fi"), which may correspond to multiple Unicode characters.
A glyph might also be used for a fraction such as "5/8".
A font might have different glyphs for subscript and superscript numbers; in general, the same Unicode character
might have multiple glyphs in a font. And of course there are many glyphs that correspond to no character
in a human language (nor appear on any keyboard): arrows, bullets, and so on.
Note that TrueType fonts can have platform-dependent mappings to glyphs, so that the same byte stream might map
to different glyphs on different platforms, by use of a "platform-specific encoding id".
All of which is to say that mapping between Unicode characters and glyph indices is not always possible, nor
simple when it is possible.
The possible ways that strings may appear in XML are:
Name | Example | Comments
Unicode | hello | In Logical, only to be used for text strings (outside of content streams), and to be used for all of them. In Raw, this may be used for text strings or string objects; it signifies that the string should be written as UTF-16BE (versus PDFDocEncoding). May require numerical entities in some document charsets.
Literal | hello | Matches string literals in PDF, without the surrounding parentheses. To deal with arbitrary 8-bit bytes, one of the following must be done: all bytes outside of printable ASCII + TAB + SP MUST be done as octal escapes, OR the entire element contents must be encoded and an xmlenc attribute added to indicate how. The latter choice is only legal for Raw; it is intended to support direct translation from PDF files.
Hex | 68656C6C6F | In Logical, the encoding must be compliant with "base16" encoding, which is more strict than PDF hexadecimal. In Raw, any legal hexadecimal is permitted; for example, the final hexadecimal digit may be omitted.
Note that PDFDocEncoding cannot be used literally because it uses bytes octal 030-037 (= hex 18-1F), illegal in XML.
In all cases, significant XML characters (<>& and "' in attribute values) are escaped as necessary to produce valid XML,
using XML mechanisms (such as named entities or numeric character references).
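The two layers of escaping for the Literal form can be sketched as follows: first PDF-level octal escapes for bytes outside printable ASCII + TAB + SP, then XML-level escaping of significant characters. The function names are hypothetical, and escaping backslash and parentheses with a backslash is this sketch's choice for keeping the result a valid PDF literal; the text above does not mandate it.

```python
from xml.sax.saxutils import escape


def to_pdf_literal(data):
    """Render bytes as PDF literal string content (no surrounding parentheses)."""
    out = []
    for b in data:
        ch = chr(b)
        if ch in "\\()":
            out.append("\\" + ch)  # keep the literal unambiguous in PDF
        elif b == 0x09 or 0x20 <= b <= 0x7E:
            out.append(ch)  # TAB, SP, and printable ASCII pass through
        else:
            out.append("\\%03o" % b)  # three-digit octal escape
    return "".join(out)


def literal_element_text(data):
    """PDF-level escaping followed by XML escaping of &, <, and >."""
    return escape(to_pdf_literal(data))
```

For attribute values, `xml.sax.saxutils.quoteattr` could be used in place of `escape` so that quote characters are handled as well.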
PDF allows for a pipeline of multiple filters to be applied ("Filter" can be an array).
All information about the former (or desired) filters is preserved in the XML.
It may not be the case that these filters have been applied to the byte contents, as indicated by the ispdfenc and xmlenc attributes.
There are these cases:
Interpretation of byte streams
ispdfenc | xmlenc | Meaning
true | none | exact bytes that the PDF has (or will have)
true | base64 | the stored bytes, with an additional escaping in XML to preserve 8-bit values
false | none | all PDF-level encoding and encryption is undone
false | base64 | all PDF-level encoding and encryption is undone, and then an additional escaping is done in XML to preserve 8-bit values
The default values for these attributes are ispdfenc="true" xmlenc="none".
The value xmlenc="none" is only legal if the (final) encoding results in only printable ASCII,
or if there are no filters and the string itself is only printable ASCII.
The value xmlenc="none" is also permitted for PDF strings (not images), where there are no filters
and all the bytes may be mapped to Unicode without loss of information (see ).
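A consumer of either format might recover the fully decoded stream bytes along these lines. This is a sketch under assumptions: the function name and parameters are hypothetical, only the FlateDecode filter is handled (via zlib, which matches the format PDF uses for it), filter parameters and encryption are ignored, and `ispdfenc` is passed as a Python bool rather than the attribute text.

```python
import base64
import zlib


def decoded_stream_bytes(text, ispdfenc=True, xmlenc="none", filters=()):
    """Return stream bytes with XML-level and PDF-level encodings undone."""
    # Step 1: undo the XML-level encoding named by xmlenc.
    if xmlenc == "base64":
        data = base64.b64decode(text, validate=True)
    elif xmlenc == "base16":
        data = base64.b16decode(text, casefold=False)
    elif xmlenc == "none":
        data = text.encode("ascii")  # legal only for printable ASCII content
    else:
        raise ValueError("unsupported xmlenc: %r" % xmlenc)
    # Step 2: if the bytes are still PDF-encoded, run the filter pipeline in order.
    if ispdfenc:
        for name in filters:
            if name == "/FlateDecode":
                data = zlib.decompress(data)
            else:
                raise NotImplementedError("filter %s is not handled in this sketch" % name)
    return data
```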
Note that the xmime:contentType attribute MAY be used, as specified in but as that mechanism
relies on XML Schema for determination of the xml-level encoding, it is not mandated here.
We treat PDF "Encoding" in the same way as any PDF compression filters.
In PDF, encryption is applied after all encoding filters (and decryption is performed before any decoding).
Also, PDF 1.5 introduces "crypt filters".
PDF already allows streams to be stored externally to the PDF file (starting with PDF 1.2), by using the "F" key in the stream dictionary.
Images
Fonts
Forms (external FDF files)
Multimedia
Catalog/StructTreeRoot and Catalog/MarkInfo, siblings to Catalog/Pages
certain set of predefined tags (Document, Part, Art, Sect, Div) but can define your own and provide a "role map" to suggest its best interpretation
Also, can have an "Alt" entry in the "structure element dictionary"
can have "multi-language text arrays".
can have OPI proxies for images. OPI comments
a system akin to xml namespaces of "attribute owners". attributes of a single structural hierarchy..
Recommend consult "Tagged PDF" in the PDF Specification
Implement as marked content and "Tagged PDF". Preserve parameters/heuristics.
soft hyphens
layout discontinuities
hidden content (clipped or same foreground and background)
reversed chars for RTL fonts
xml:base, xml:lang, xml:space
encryption.
IPR: compression, PDF format. encryption.
The application/xml mime type is specified in RFC3778.
References:
PDF Reference, Fifth Edition, Version 1.6
The application/pdf Media Type
ISO 19005-1. Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF (PDF/A)
ISO 15930-3:2002 Graphic technology -- Prepress digital data exchange using PDF -- Part 6: Complete exchange of printing data suitable for colour-managed workflows using PDF 1.4 (PDF/X-3)
The Base16, Base32, and Base64 Data Encodings
Describing Media Content of Binary Data in XML, W3C Working Group Note, World Wide Web Consortium
Namespaces in XML
The author gratefully acknowledges the contributions of: