| Network Working Group | Mark D. Anderson |
| INTERNET DRAFT | Discerning Software |
| <draft-mda-docformats-pdf-as-xml-00> | October 2005 |
| FYI: 0 | |
| Category: Informational | |
| Expires: April 2006 |
PDF as XML
draft-mda-docformats-pdf-as-xml-00
By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress”.
The list of current Internet-Drafts can be accessed at <http://www.ietf.org/ietf/1id-abstracts.txt>.
The list of Internet-Draft Shadow Directories can be accessed at <http://www.ietf.org/shadow.html>.
This Internet-Draft will expire in April 2006.
Copyright © The Internet Society (2005). All Rights Reserved.
This document specifies representations of PDF as XML.
This document defines two profiles for representing a PDF document as XML: "Raw" and "Logical". The "Raw" profile preserves physical layout decisions in a PDF document, while "Logical" does not. The two profiles use different XML root elements ("pdfraw" vs. "pdf"). Neither is a superset of the other.
The "Raw" profile is useful for debugging issues with PDF files, and for doing low-level manipulations prior to conversion back to PDF. It preserves most (but not all) of the file structure.
The "Logical" profile is useful for performing conversions to other formats. It discards no information that might be useful when converting to some other document format. The differences between "Raw" and "Logical" never affect what a user sees when the document is rendered.
Here is a table of what is preserved in each:
| Aspect | Raw | Logical | Comment |
|---|---|---|---|
| object as reference vs. direct | Y | N | Logical might not even be compliant with PDF rules over when direct objects vs. references are required. |
| identical object numbering | Y | N | |
| object generations | Y | N | Logical only preserves the most recent generation. |
| unused, freed, and older generation objects | Y | N | |
| stored bytes (including binary) between header and first object, or in holes between objects | N | N | |
| null: present vs. missing | Y | N | Logical simply omits any dictionary entries with null values. Note that you cannot omit empty dictionaries, because PDF sometimes distinguishes null versus empty in inheritance. |
| string objects: stored as hex vs. literal | Y | N | |
| text strings: using PDFDocEncoding vs. UTF-16BE | Y | N | |
| strings: use of character escapes, octal escapes, and line continuations in literals (e.g. one character vs. two-character "\n") | Y | N | note that strings containing 8-bit binary characters still have to be escaped for XML; see Section 3 |
| names: choice of characters being hash-encoded | Y | N | |
| pdf comments: preserved | N | N | Raw MAY preserve comments, but it is not required to. |
| non-significant white space differences preserved (EOL choice, multiple spaces, etc.) | N | N | |
| content streams: collection linked by Extends vs. single | Y | N | Logical does not support Extends |
| object streams (PDF 1.5): use or not, with or collections, etc. | Y | N | |
| text streams (PDF 1.5): used or not instead of normal text strings | Y | N | |
| cross-reference table ("xref" section) and cross-reference streams (PDF 1.5) | Y | N | offsets are for debugging purposes and should not be trusted when converting back to PDF |
| encryption | Y | Y | See Section 4 |
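The null-handling row of the table can be illustrated with a minimal sketch. It assumes a PDF object tree has already been parsed into Python dicts, lists, and scalars, with `None` standing for the PDF null object; the function name `strip_nulls` is hypothetical, and leaving nulls inside arrays untouched is an assumption, since the table only mentions dictionary entries.

```python
def strip_nulls(obj):
    """Drop dictionary entries whose value is None (PDF null), recursively.

    Empty dictionaries are kept, because PDF can distinguish a null entry
    from an empty dictionary during inheritance.
    """
    if isinstance(obj, dict):
        return {key: strip_nulls(value)
                for key, value in obj.items()
                if value is not None}
    if isinstance(obj, list):
        # Assumption: array elements are positional, so nulls in arrays stay.
        return [strip_nulls(value) for value in obj]
    return obj


if __name__ == "__main__":
    print(strip_nulls({"A": None, "B": {}, "C": [None, {"D": None}]}))
```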
The document charset SHOULD be a Unicode encoding (such as utf-8, utf-16be, etc.), because that reduces the need for numeric character references in strings.
In a PDF document, character encoding is addressed differently inside and outside a content stream. Any text outside of a content stream is called a "text string." Examples are in annotations, bookmarks, article names, and document information. These are always either in the PDFDocEncoding (a latin-like single byte charset defined in the PDF specification), or in UTF-16BE.
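As a rough illustration of telling the two text string encodings apart, the sketch below relies on the PDF convention that a UTF-16BE text string begins with the byte order mark FE FF. The function name `decode_text_string` is hypothetical. Python has no built-in PDFDocEncoding codec, so the fallback uses Latin-1 as an approximation; this is an assumption that gives wrong results for the code points where the two charsets differ.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string (text outside a content stream)."""
    if raw.startswith(b"\xfe\xff"):
        # A leading byte order mark FE FF signals UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding here. A real
    # converter needs the full PDFDocEncoding table from the PDF spec.
    return raw.decode("latin-1")


if __name__ == "__main__":
    print(decode_text_string(b"\xfe\xff\x00h\x00i"))
    print(decode_text_string(b"hello"))
```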
String objects in a content stream of a PDF are, strictly speaking, byte streams acting as glyph indices into fonts. The so-called "simple" fonts use single bytes. Composite fonts define code ranges in their "CMaps" which can map from 1, 2, 3, or 4 successive bytes. A font might be "symbolic" (such as ZapfDingbats) or "nonsymbolic". It can declare a standard character encoding (such as "WinAnsiEncoding"), or define its own "encoding dictionary". It can also define a "Differences" array relative to a BaseEncoding.
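The way a "Differences" array overrides a base encoding can be sketched as follows. In PDF, the array alternates between an integer character code and one or more glyph names, which are assigned to consecutive codes starting at that integer. The function name `apply_differences` and the tiny base table are illustrative assumptions, not part of this draft.

```python
def apply_differences(base_encoding, differences):
    """Return a copy of base_encoding (code -> glyph name) with overrides.

    `differences` is a flat list such as [65, "Aacute", "Adieresis"]:
    each integer resets the current code, and each following glyph name
    is assigned to the current code, which then advances by one.
    """
    table = dict(base_encoding)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            table[code] = item
            code += 1
    return table


if __name__ == "__main__":
    base = {65: "A", 66: "B", 67: "C"}  # hypothetical fragment of an encoding
    print(apply_differences(base, [65, "Aacute", "Adieresis", 128, "Euro"]))
```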
The font encoding determines how to map from a byte sequence to the appropriate glyphs. In most cases, it is also possible to determine corresponding Unicode characters. If it is a simple font with a standard character mapping (such as "WinAnsiEncoding"), conversion to Unicode is easy. In other cases, such as some composite fonts, it is not always possible (unless there is a "ToUnicode" map, or it uses a predefined "CMap"). Note that in some encodings (including the single-byte MacExpertEncoding), a single glyph index might be for a ligature (e.g. "ff" or "fi"), which may correspond to multiple Unicode characters. A glyph index might also be used for a fraction such as "5/8". A font might have different glyphs for subscript and superscript numbers; in general, the same Unicode character might have multiple glyphs in a font. Note that TrueType fonts can have platform-dependent mappings to glyphs, so that the same byte stream might map to different glyphs on different platforms, by use of a "platform-specific encoding id". All of which is to say that mapping between Unicode characters and glyph indices is not always possible, nor simple when it is possible.
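The one-glyph-to-many-characters relationship can be seen with Unicode itself, which encodes some ligatures and fractions as single presentation characters. The sketch below uses Python's standard `unicodedata` module and NFKC normalization to expand them; this is only an illustration of the relationship, not a method prescribed by this draft, and the function name `expand_compat` is hypothetical.

```python
import unicodedata


def expand_compat(text: str) -> str:
    """Expand compatibility characters (ligatures, fractions) via NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # U+FB01 is the single-character "fi" ligature.
    print(expand_compat("\ufb01"))
    # U+215D is the single-character fraction five eighths; it expands to
    # "5", U+2044 FRACTION SLASH, "8".
    print(expand_compat("\u215d"))
```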
The possible ways that strings may appear in XML are:
| Name | Example | Comments |
|---|---|---|
| Unicode | <u>hello</u> | In Logical, used for all text strings (outside of content streams) and only for them. In Raw, this may be used for text strings or string objects; it signifies that the string should be written as UTF-16BE (versus PDFDocEncoding). May require numeric character references in some document charsets. |
| Literal | <s>hello</s> | Matches string literals in PDF, without the surrounding parentheses. To deal with arbitrary 8-bit bytes, one of the following MUST be done: either all bytes outside of printable ASCII, TAB, and SP are written as octal escapes, or the entire element content is base64 encoded and an xmlenc="base64" attribute is added. The latter choice is only legal for Raw; it is intended to support direct translation from PDF files. |
| Hex | <h>68656C6C6F</h> | In Logical, it is illegal to omit the final hexadecimal digit (vs. defaulting it to 0) |
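The Hex and Literal forms in the table can be sketched as below. The helper names `hex_element` and `literal_element` are hypothetical. Escaping backslashes and parentheses with a backslash follows PDF literal string syntax and is an assumption here, as is the choice of three-digit octal escapes; the draft only requires octal escapes for bytes outside printable ASCII, TAB, and SP.

```python
from xml.sax.saxutils import escape


def hex_element(data: bytes) -> str:
    """Render bytes as a Hex string element with all digits present."""
    return "<h>" + data.hex().upper() + "</h>"


def literal_element(data: bytes) -> str:
    """Render bytes as a Literal string element using octal escapes."""
    parts = []
    for byte in data:
        char = chr(byte)
        if char in "\\()":
            # Assumption: follow PDF literal syntax for these characters.
            parts.append("\\" + char)
        elif byte == 0x09 or 0x20 <= byte <= 0x7E:
            parts.append(char)  # printable ASCII, TAB, or SP
        else:
            parts.append("\\%03o" % byte)  # three-digit octal escape
    # XML-significant characters still need XML-level escaping.
    return "<s>" + escape("".join(parts)) + "</s>"


if __name__ == "__main__":
    print(hex_element(b"hello"))
    print(literal_element(b"a\x00<b"))
```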
Note that PDFDocEncoding cannot be used literally because it uses bytes octal 030-037 (= hex 18-1F), which are illegal in XML.
In all cases, significant XML characters (<>& and "' in attribute values) are escaped as necessary to produce valid XML, using XML mechanisms (such as named entity references or numeric character references).
The PDF storage format allows for any 8-bit value to be used, for example in image data. In contrast, it is illegal for some 8-bit values to ever appear in an XML document, regardless of document charset (for example, 0x00-0x1F other than TAB, LF, and CR). So it is not always possible to copy the bytes present in a PDF into XML unchanged.
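A minimal sketch of this escaping step, using the `escape` function from Python's standard `xml.sax.saxutils` module. The wrapper names `xml_text` and `xml_attr` are hypothetical; escaping both quote characters in attribute values is a conservative choice rather than a requirement of this draft.

```python
from xml.sax.saxutils import escape


def xml_text(value: str) -> str:
    """Escape &, < and > for use in element content."""
    return escape(value)


def xml_attr(value: str) -> str:
    """Escape &, <, > and both quote characters for attribute values."""
    return escape(value, {'"': "&quot;", "'": "&apos;"})


if __name__ == "__main__":
    print(xml_text("a<b & c>d"))
    print(xml_attr('say "hi"'))
```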
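One way a converter might decide whether raw bytes can be copied into XML is to look for control bytes that XML 1.0 forbids. The sketch below checks only the range below 0x20 mentioned above; bytes of 0x80 and higher depend on the document charset and are out of scope here. The function name `xml_illegal_bytes` is hypothetical.

```python
def xml_illegal_bytes(data: bytes) -> list:
    """Return the sorted distinct control bytes that XML 1.0 forbids."""
    allowed_controls = {0x09, 0x0A, 0x0D}  # TAB, LF, CR
    return sorted({byte for byte in data
                   if byte < 0x20 and byte not in allowed_controls})


if __name__ == "__main__":
    sample = b"\x00\x18abc\x1f"
    bad = xml_illegal_bytes(sample)
    # A non-empty result suggests base64 or escaping is needed.
    print(bad)
```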
PDF allows for a pipeline of multiple filters to be applied ("Filter" can be an array).
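A filter pipeline can be sketched as a loop over decoder functions, applied in the order the filters are listed, which is the order PDF specifies for decoding. Only two standard PDF filters are shown, implemented with Python's `zlib` and `binascii` modules; the names `DECODERS`, `ascii_hex_decode`, and `decode_stream` are hypothetical, and decode parameters are ignored in this sketch.

```python
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digits, whitespace, '>' terminator."""
    digits = b"".join(data.split())  # drop ASCII whitespace
    if digits.endswith(b">"):
        digits = digits[:-1]
    if len(digits) % 2:
        digits += b"0"  # PDF treats a missing final digit as 0
    return binascii.unhexlify(digits)


DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "FlateDecode": zlib.decompress,
}


def decode_stream(data: bytes, filters) -> bytes:
    """Apply each named filter in order; accepts a name or a list of names."""
    if isinstance(filters, str):
        filters = [filters]
    for name in filters:
        data = DECODERS[name](data)
    return data


if __name__ == "__main__":
    stored = binascii.hexlify(zlib.compress(b"hello")) + b">"
    print(decode_stream(stored, ["ASCIIHexDecode", "FlateDecode"]))
```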
All information about the former (or desired) filters is preserved in the XML. It may not be the case that these filters have been applied to the byte contents, as indicated by the <verb>
Interpretation of byte streams
| ispdfenc | xmlenc | Meaning |
|---|---|---|
| true | none | exact bytes that the PDF has (or will have) |
| true | base64 | the stored bytes, with an additional escaping in XML to preserve 8-bit values |
| false | none | all PDF-level encoding and encryption is undone |
| false | base64 | all PDF-level encoding and encryption is undone, and then an additional escaping is done in XML to preserve 8-bit values |
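The four rows of the table can be read as two independent steps: first undo the XML-level escaping, then note whether the resulting bytes are the stored PDF bytes or the fully decoded bytes. The sketch below assumes the attribute values have already been parsed into a boolean and a string; the function name `stream_bytes` and the "stored"/"decoded" labels are hypothetical. Mapping element text to bytes with Latin-1 when xmlenc is "none" is also an assumption.

```python
import base64


def stream_bytes(text: str, ispdfenc: bool, xmlenc: str):
    """Return (bytes, state) for a stream element's text content.

    state is "stored" when the bytes are exactly what the PDF holds
    (filters and encryption still applied), else "decoded".
    """
    if xmlenc == "base64":
        data = base64.b64decode(text)
    elif xmlenc == "none":
        # Assumption: one character per byte in the XML text.
        data = text.encode("latin-1")
    else:
        raise ValueError("unsupported xmlenc: %r" % xmlenc)
    state = "stored" if ispdfenc else "decoded"
    return data, state


if __name__ == "__main__":
    print(stream_bytes("aGVsbG8=", True, "base64"))
    print(stream_bytes("hello", False, "none"))
```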
The default values for these attributes are <verb>
The legal values for <verb>
Note that the <verb>
PDF already allows streams to be stored externally to the PDF file (starting with PDF 1.2), by using the "F" key in the stream dictionary.
Images Fonts Forms (external FDF files) Multimedia
Implement as marked content.
| [1] | World Wide Web Consortium, “Describing Media Content of Binary Data in XML”, W3C XML, May 2005. |
The author gratefully acknowledges the contributions of:
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at <http://www.ietf.org/ipr>.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.
This document and the information contained herein are provided on an “AS IS” basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Copyright © The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
Funding for the RFC Editor function is currently provided by the Internet Society.