Mark D. Anderson
 Discerning Software
 October 16, 2005

PDF as XML

Abstract

This document specifies two translations of PDF ("Portable Document Format") into XML. One format, "Raw", preserves file layout decisions; the other format, "Logical", does not.



Table of Contents

1.  Introduction
2.  File Format
    2.1.  Raw File Format
    2.2.  Logical File Format
3.  Character Set Issues
4.  Basic Objects
    4.1.  Strings
5.  Filters
6.  Encryption
7.  External Content
8.  Alternate Content
    8.1.  Text Extraction
9.  XML Content
10.  Security Considerations
11.  IANA Considerations
12.  References
Appendix A.  pdf DTD
Appendix B.  Examples
Appendix C.  Acknowledgements
§  Author's Address





1. Introduction

The PDF (Portable Document Format) [1] has been in use since 1993, undergoing several revisions. In addition to the general PDF specification published by Adobe, a print industry consortium has defined a series of standards known collectively as PDF/X, including several ISO standards such as PDF/X-3 [4]. Another group has produced a further ISO specification, PDF/A [3], for archival use.

While it is possible to create a PDF document entirely in ASCII, most are not, for reasons of compression. Even in purely ASCII form, a PDF document is not particularly easy for an engineer to read directly, nor for a computer program to parse. This means that it is necessary to use PDF-specific software libraries to parse the data, and tools to extract human-readable text or images. Meanwhile, there are many mature tools for processing XML, and XML is much more widely understood in the engineering community than the PDF format. It would therefore be useful to have a defined XML encoding of PDF that enables tools to act at the XML level.

This document defines two profiles for representing a PDF document as XML: "Raw" and "Logical". The "Raw" profile preserves physical layout decisions in a PDF document, while "Logical" does not. The two XML documents have different XML root elements ("pdfraw" vs. "pdf"). Neither is a superset of the other.

The "Raw" profile is useful for debugging issues with PDF files, and for doing low-level manipulations prior to conversion back to PDF. It preserves most (but not all) of the file structure.

The "Logical" profile is useful for performing conversions to other document formats. There is no information lost in the "Logical" profile that might be useful when converted to some other document format. The difference between "Raw" and "Logical" is never one that would make a difference visually to a user.

Here is a table of what is preserved in each:

Aspect | Raw | Logical | Comment
object as reference vs. direct | Y | N | Logical might not even be compliant with PDF rules over when direct objects vs. references are required.
identical object numbering | Y | N |
object generations | Y | N |
unused and freed objects | Y | N |
stored bytes (including binary) between header and first object, or in holes between objects | N | N |
null: present vs. missing | Y | N | Logical simply omits any dictionary entries with null values. Note that empty dictionaries cannot be omitted, because PDF sometimes distinguishes null from empty in inheritance.
string objects: stored as hex vs. literal | Y | N |
text strings: using PDFDocEncoding vs. UTF-16BE | Y | N |
strings: use of character escapes, octal escapes, and line continuations in literals (e.g. one character vs. two-character "\n") | Y | N | Strings containing 8-bit binary characters still have to be escaped for XML; see Section 3 (Character Set Issues).
names: choice of characters being hash-encoded | Y | N |
pdf comments: preserved | N | N | Raw MAY preserve comments, but is not required to.
non-significant white space differences preserved (EOL choice, multiple spaces, etc.) | N | N |
content streams: collection linked by Extends vs. single | Y | N | Logical does not support Extends.
object streams (PDF 1.5): use or not, with or collections, etc. | Y | N |
text streams (PDF 1.5): used or not instead of normal text strings | Y | N |
cross-reference table ("xref" section) and cross-reference streams (PDF 1.5) | Y | N | Offsets are for debugging purposes and should not be trusted when converting back to PDF.
encryption | Y | Y | See Section 5 (Filters).



2. File Format

This section describes the structure of the two XML formats. Full examples are given in Appendix B (Examples).




2.1. Raw File Format

An example Raw file

 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE pdfraw SYSTEM "mda-pdfraw.dtd">
 <pdfraw pdfversion="1.4">
   <objects>
     <o num="1" gen="0">
       <dict>
         <entry name="Type"><s>Catalog</s></entry>
         ...
       </dict>
     </o>
     ...
   </objects>
   <xrefs first="1" count="5">
     <xref offset="998" num="1" gen="0" free="n"/>
     ...
   </xrefs>
   <trailer>
     <dict>
       <entry name="Size"><i>5</i></entry>
       <entry name="Root"><ref num="1" gen="0"/></entry>
     </dict>
   </trailer>
   <xrefs first="6" count="7">
     ...
   </xrefs>
   ...
 </pdfraw>
 Figure 1 

As can be seen, the Raw format preserves the structure of the PDF file, including incremental updates. Practically all aspects of the PDF file are retained, including offsets, which are meaningless in the XML file and may no longer be accurate if the file is converted back to PDF.

In Raw format, all XML elements have lower case names specific to the format: no PDF names such as "Catalog" appear as element names. All basic objects are individual XML elements such as <i>5</i>. Because the Raw format is so close to the PDF layout, it is relatively robust against new PDF features. For example, it is not necessary for a tool that interconverts between PDF file format and Raw to know about "object streams" or "cross-reference streams". As long as no new basic object types are introduced, and no new low-level sections (like the xref section) are added, the Raw format should work well with future PDF versions.
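As a rough illustration of working at the XML level, the following Python sketch indexes the indirect objects of a Raw document and follows the trailer's Root reference. The element and attribute names are taken from Figure 1 rather than from a finished DTD, and the function names index_objects and resolve_root are hypothetical.

```python
import xml.etree.ElementTree as ET

# A trimmed-down Raw document modeled on Figure 1 (illustrative only).
RAW_SAMPLE = """\
<pdfraw pdfversion="1.4">
  <objects>
    <o num="1" gen="0">
      <dict>
        <entry name="Type"><s>Catalog</s></entry>
      </dict>
    </o>
  </objects>
  <xrefs first="1" count="1">
    <xref offset="998" num="1" gen="0" free="n"/>
  </xrefs>
  <trailer>
    <dict>
      <entry name="Size"><i>5</i></entry>
      <entry name="Root"><ref num="1" gen="0"/></entry>
    </dict>
  </trailer>
</pdfraw>
"""


def index_objects(root):
    """Map (object number, generation) to the element wrapped by each <o>."""
    table = {}
    for o in root.iter("o"):
        key = (int(o.get("num")), int(o.get("gen")))
        table[key] = o[0]  # first child element holds the object's value
    return table


def resolve_root(root):
    """Follow the first trailer's Root reference to the object it names."""
    objects = index_objects(root)
    for entry in root.findall("./trailer/dict/entry"):
        if entry.get("name") == "Root":
            ref = entry.find("ref")
            return objects[(int(ref.get("num")), int(ref.get("gen")))]
    raise KeyError("trailer has no Root entry")


root = ET.fromstring(RAW_SAMPLE)
catalog = resolve_root(root)
print(catalog.tag)
```

A document with incremental updates contains several trailer elements; this sketch simply takes the first Root entry it finds, whereas a real tool would presumably prefer the most recent one.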




2.2. Logical File Format

An example Logical file

 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE pdf SYSTEM "mda-pdf.dtd">
 <pdf pdfversion="1.4">
   <Catalog>
     <Pages>
       <Page MediaBox="[0 0 612 792]">
         <Contents>
           <stream>
             <T>
               <Tf pdfvalue="/F1 24"/>
               <Td pdfvalue="100 100"/>
               <Tj pdfvalue="(Hello World)"/>
             </T>
           </stream>
         </Contents>
         <Resources ProcSet="[/PDF /Text]">
           <Font Subtype="Type1" Name="F1" BaseFont="Helvetica" Encoding="MacRomanEncoding"/>
         </Resources>
       </Page>
       ...
     </Pages>
     ...
   </Catalog>
   <Encrypt>
   ...
   </Encrypt>
   <Info>...</Info>
   <ID>...</ID>
 </pdf>
 Figure 2 

In the Logical format, the immediate children of the "pdf" root element are those shown in Figure 2: "Catalog", "Encrypt", "Info", and "ID".

The Logical format does not need object references to the same degree as the Raw format (which uses them to match the PDF file). In the Logical format, as long as an object is used just once, it is simply inserted as a direct child. When object references are used, they are not expressed in the same way as in the Raw format. In Raw format (as in the PDF file format itself), "indirect objects" are marked with a surrounding wrapper ("o" in Raw, "obj" ... "endobj" in PDF) that specifies the object number and generation, using the "num" and "gen" attributes in Raw (see Figure 1). In Logical format, there is no such surrounding wrapper (the object itself has an "objnum" attribute), and generations do not exist.

Also, because all relevant objects must be reachable by the trailer or Root, the Logical format has no need for an "objects" element the way that the Raw format does (to hold unused and freed objects, etc.).

The Logical format allows for some transformations that make for shorter and more readable content:

  1. Any reference can be replaced with the referenced object (as long as any other references to the same object are adjusted accordingly).
  2. A "dict" (dictionary) element containing a "Type" entry can be replaced with an element with a name equal to the value of that "Type" entry, as long as that value is a legal XML NCName [7].
  3. An entry in a dictionary can become an attribute of its parent if the name of the entry is a legal XML name, and its value is any of: a number, a name whose characters would not require "xmlenc" escaping (Section 3, Character Set Issues), an array of numbers, an array of names whose characters are in [a-zA-Z0-9\.\-\_], or a string whose characters are printable ASCII + SP (040-176) exclusive of backslash and PDF delimiters [\(\)\<\>\[\]\{\}/%\\].

Names always have their leading slash, whether in element content or "hoisted" to an attribute value. When a string value is hoisted from a <u>, <s>, or <h> element (see Section 4.1, Strings), it is always in the PDF "literal" format, including the surrounding parentheses (though it is a severely restricted subset, not allowing backslash or escapes).
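The string case of transformation 3 can be sketched as a simple predicate. This is a minimal Python illustration under the character rules stated above; the function names can_hoist_string and hoisted_string are hypothetical, and escaping of "&" and quote characters is assumed to be left to the XML serializer.

```python
# Characters a hoisted string may not contain: backslash and the PDF
# delimiters ( ) < > [ ] { } / %
_FORBIDDEN = set("()<>[]{}/%\\")


def can_hoist_string(value: str) -> bool:
    """True if every character is printable ASCII or SP (octal 040-176)
    and is neither a backslash nor a PDF delimiter."""
    return all(0x20 <= ord(c) <= 0x7E and c not in _FORBIDDEN for c in value)


def hoisted_string(value: str) -> str:
    """Return the attribute value in PDF literal form, parentheses included."""
    if not can_hoist_string(value):
        raise ValueError("string must stay in element content")
    return "(" + value + ")"


print(hoisted_string("Hello World"))
```

A string that fails the predicate is not an error in the Logical format; it simply remains a child element instead of becoming an attribute.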




3. Character Set Issues

The PDF storage format allows for any 8-bit value to be used, for example in image data. In contrast, it is illegal for some 8-bit values to ever appear in an XML document, regardless of the document charset (for example, bytes in the range 0x0-0x1F besides CR, LF, and TAB). So if identical bytes from some portion of a PDF file were copied directly into XML, the result would not be valid.

We therefore introduce an xmlenc attribute, which may be used on leaf elements in an XML representation to indicate an encoding applied, at the level of the XML representation, to the element's character content. This encoding is applied in addition to any encryption or compression that might have been done at the PDF level.

The legal values for xmlenc are "none", "base16", and "base64". Both "base16" and "base64" are as per [5], with no limitations on line length, and with no permitted "ignorable" characters. Note that "base16" differs from the "hexadecimal encoding" used in PDF, because the PDF hex encoding allows any white space, allows either upper case A-F or lower case a-f letters, and allows the final hexadecimal digit to be omitted. We do not permit any of that variability in "base16".
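A strict decoder for the three xmlenc values might look like the following Python sketch. The function name decode_xmlenc is hypothetical. Two assumptions are made: "base16" accepts only the upper case alphabet given in [5], and "none" is handled only for ASCII content, leaving aside the Unicode string case.

```python
import base64
import binascii


def decode_xmlenc(text: str, xmlenc: str = "none") -> bytes:
    """Recover the bytes of a leaf element's character content."""
    if xmlenc == "none":
        # Assumption: only the plain ASCII case is handled here.
        return text.encode("ascii")
    if xmlenc == "base16":
        # casefold=False rejects lower case digits; white space and odd
        # lengths raise binascii.Error as well.
        return base64.b16decode(text, casefold=False)
    if xmlenc == "base64":
        # validate=True rejects any character outside the alphabet,
        # including line breaks.
        return base64.b64decode(text, validate=True)
    raise ValueError("unknown xmlenc value: " + xmlenc)


print(decode_xmlenc("68656C6C6F", "base16"))
```

The PDF-style hexadecimal string "68656c6c6f" would be rejected by this decoder, which is the intended difference between "base16" and PDF hexadecimal.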

The document character set in XML is Unicode (this is not to be confused with the charset value, which indicates a character encoding). This is not the case in PDF. Strings in PDF "content streams" are indexes into glyph tables in fonts. Strings outside of "content streams" might be Unicode (as UTF-16BE) or might be in the "PDFDocEncoding", a Latin-like single-byte charset defined in the PDF specification. As with any other binary data, such as images, string bytes must not be copied into the XML if they fall outside the values valid for XML or outside those legal for the document encoding. This is discussed further in Section 4.1 (Strings). We place no constraints on the XML document character encoding ("charset" value), but we do note that a Unicode encoding (such as utf-8, utf-16be, etc.) may reduce the need for numeric character references in strings.




4. Basic Objects




4.1. Strings

In a PDF document, character encoding is addressed differently inside and outside a content stream. Any text outside of a content stream is called a "text string." Examples are in annotations, bookmarks, article names, and document information. These are always either in the PDFDocEncoding, or in UTF-16BE.
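The choice between the two text string encodings can be detected from the leading bytes: in PDF, a UTF-16BE text string begins with the byte order marker FE FF. The Python sketch below illustrates this. The function name decode_text_string is hypothetical, and the sketch deliberately handles only the printable ASCII range of PDFDocEncoding (assumed here to coincide with ASCII) instead of carrying the full PDFDocEncoding table.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string (one that is outside any content stream)."""
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE, introduced by the byte order marker.
        return raw[2:].decode("utf-16-be")
    if all(0x20 <= b <= 0x7E for b in raw):
        # PDFDocEncoding restricted to its printable ASCII range.
        return raw.decode("ascii")
    # Other PDFDocEncoding bytes need the table from the PDF specification.
    raise NotImplementedError("full PDFDocEncoding table not included")


print(decode_text_string(b"\xfe\xff\x00h\x00i"))
```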

String objects in a content stream of a PDF are, strictly speaking, byte streams acting as glyph indices into fonts. The so-called "simple" fonts use single bytes. Composite fonts define code ranges in their "CMaps" which can map from 1, 2, 3, or 4 successive bytes. A font might be "symbolic" (such as ZapfDingbats) or "nonsymbolic". It can declare a standard character encoding (such as "WinAnsiEncoding"), or define its own "encoding dictionary". It can also define a "Differences" array relative to a BaseEncoding.

The font encoding determines how to map from a byte sequence to the appropriate glyphs. In most cases, it is also possible to determine corresponding Unicode characters. If it is a simple font with a standard character mapping (such as "WinAnsiEncoding"), conversion to Unicode is easy. In other cases, such as some composite fonts, it is not always possible (unless there is a "ToUnicode" map, or the font uses a predefined "CMap"). Note that in some encodings (including the single-byte MacExpertEncoding), a single glyph index might be for a ligature (e.g. "ff" or "fi"), which may correspond to multiple Unicode characters. A glyph index might also be used for a fraction such as "5/8". A font might have different glyphs for subscript and superscript numbers; in general, the same Unicode character might have multiple glyphs in a font. Note that TrueType fonts can have platform-dependent mappings to glyphs, so that the same byte stream might map to different glyphs on different platforms, by use of a "platform-specific encoding id". All of which is to say that mapping between Unicode characters and glyph indices is not always possible, nor simple when it is possible.

The possible ways that strings may appear in XML are:

Name | Example | Comments
Unicode | <u>hello</u> | In Logical, only to be used for text strings (outside of content streams), and to be used for all of them. In Raw, this may be used for text strings or string objects; it signifies that the string should be written as UTF-16BE (versus PDFDocEncoding). May require numeric character references in some document charsets.
Literal | <s>hello</s> | Matches string literals in PDF, without the surrounding parentheses. To deal with arbitrary 8-bit bytes, one of the following must be done: all bytes outside of printable ASCII + TAB + SP MUST be written as octal escapes, OR the entire element content must be base64 encoded and xmlenc="base64" added as an attribute. The latter choice is only legal for Raw; it is intended to support direct translation from PDF files.
Hex | <h>68656C6C6F</h> | In Logical, the encoding must be compliant with "base16" encoding, which is more strict than PDF hexadecimal. In Raw, any legal hexadecimal is permitted; for example, the final hexadecimal digit may be omitted.

Note that PDFDocEncoding cannot be used literally because it uses bytes octal 030-037 (= hex 18-1F), which are illegal in XML.

In all cases, significant XML characters (<>& and "' in attribute values) are escaped as necessary to produce valid XML, using XML mechanisms (such as named entity references or numeric character references).
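The first option for the Literal form (octal escapes for everything outside printable ASCII, TAB, and SP) followed by XML escaping can be sketched in Python as below. The function names to_literal_text and literal_element are hypothetical. Escaping backslash and both parentheses with a backslash is an assumption of this sketch; PDF itself also permits balanced parentheses to appear unescaped.

```python
from xml.sax.saxutils import escape


def to_literal_text(data: bytes) -> str:
    """Render bytes as the body of a PDF literal string (no parentheses)."""
    out = []
    for b in data:
        ch = chr(b)
        if ch in "\\()":
            out.append("\\" + ch)      # assumption: always escape these
        elif b == 0x09 or 0x20 <= b <= 0x7E:
            out.append(ch)             # TAB, SP, and printable ASCII
        else:
            out.append("\\%03o" % b)   # three-digit octal escape
    return "".join(out)


def literal_element(data: bytes) -> str:
    """Wrap the literal text in <s>, escaping &, < and > for XML."""
    return "<s>" + escape(to_literal_text(data)) + "</s>"


print(literal_element(b"a<b\n\xff"))
```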




5. Filters

PDF allows for a pipeline of multiple filters to be applied ("Filter" can be an array).

All information about the former (or desired) filters is preserved in the XML. It may not be the case that these filters have been applied to the byte contents, as indicated by the ispdfenc and xmlenc attributes. There are these cases:

Interpretation of byte streams

ispdfenc | xmlenc | Meaning
true | none | exact bytes that the PDF has (or will have)
true | base64 | the stored bytes, with an additional escaping in XML to preserve 8-bit values
false | none | all PDF-level encoding and encryption is undone
false | base64 | all PDF-level encoding and encryption is undone, and then an additional escaping is done in XML to preserve 8-bit values

The default values for these attributes are ispdfenc="true" and xmlenc="none". The value xmlenc="none" is only legal if the (final) encoding results in only printable ASCII, or if there are no filters and the string itself is only printable ASCII. The value xmlenc="none" is also permitted for PDF strings (not images) where there are no filters and all the bytes can be mapped to Unicode without loss of information (see Section 3 (Character Set Issues)).
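One way a writer might pick the attribute pair for a stream is sketched below in Python. The function name encode_stream_bytes is hypothetical. The sketch covers only the printable ASCII rule for xmlenc="none"; the Unicode string case is not handled, and XML escaping of "<" and "&" is assumed to be left to the serializer.

```python
import base64


def encode_stream_bytes(data: bytes, ispdfenc: bool):
    """Return (attributes, character content) for a stream's bytes.

    ispdfenc states whether `data` still carries the PDF-level filters
    (True) or has had them undone (False).
    """
    attrs = {"ispdfenc": "true" if ispdfenc else "false"}
    if all(0x20 <= b <= 0x7E for b in data):
        attrs["xmlenc"] = "none"
        return attrs, data.decode("ascii")
    # Anything else is protected by an extra XML-level encoding.
    attrs["xmlenc"] = "base64"
    return attrs, base64.b64encode(data).decode("ascii")


print(encode_stream_bytes(b"\x00\xff", True))
```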

Note that the xmime:contentType attribute MAY be used, as specified in [6], but as that mechanism relies on XML Schema for determination of the XML-level encoding, it is not mandated here.




6. Encryption




7. External Content

PDF already allows streams to be stored externally to the PDF file (starting with PDF 1.2), by using the "F" key in the stream dictionary.

Images Fonts Forms (external FDF files) Multimedia




8. Alternate Content

Implement as marked content.




8.1. Text Extraction




9. XML Content




10. Security Considerations




11. IANA Considerations

The application/pdf MIME type is specified in RFC 3778 [2].




12. References

[1] “PDF Reference, Fifth Edition, Version 1.6.”
[2] “The application/pdf Media Type,” RFC 3778, May 2004.
[3] “ISO 19005-1. Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF (PDF/A).”
[4] “ISO 15930-3:2002 Graphic technology -- Prepress digital data exchange using PDF -- Part 6: Complete exchange of printing data suitable for colour-managed workflows using PDF 1.4 (PDF/X-3).”
[5] “The Base16, Base32, and Base64 Data Encodings.”
[6] World Wide Web Consortium, “Describing Media Content of Binary Data in XML,” W3C XML, May 2005.
[7] “Namespaces in XML,” January 1999.



Appendix A. pdf DTD





Appendix B. Examples




Appendix C. Acknowledgements

The author gratefully acknowledges the contributions of:




Author's Address

  Mark D. Anderson
  Discerning Software
  San Francisco, CA 94107
  US
Phone: 
Email:  mda@discerning.com
URI:  http://discerning.com/