Mark D. Anderson (mda@discerning.com)
October 2005
Proposal for PDF documents as XML
* Introduction
xmlns:xmlmime='http://www.w3.org/2004/11/xmlmime'
Suggestions for XHTML.
SHOULD use a default XML namespace, to ease DTD validation and to accommodate XML processors that are not namespace-aware.
No encryption
Forms
Annotations
Bookmarks
XObjects
* Basic Objects:
** Boolean
Examples:
true
false
White space should not be present.
** Numeric
Examples:
-38
89.980
The presence of a period is sufficient to lexically distinguish integer from pdf real,
but we distinguish anyway.
White space should not be present.
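The lexical rule above can be sketched as follows. This is only an illustration: the function name classify_number is hypothetical, and the patterns assume the usual PDF number syntax (optional sign, decimal digits, at most one period, no exponent).

```python
import re

# Assumed lexical forms for PDF numbers: optional sign, decimal digits,
# and a period only in reals. PDF has no exponent notation.
_INT_RE = re.compile(r"[+-]?[0-9]+\Z")
_REAL_RE = re.compile(r"[+-]?([0-9]+\.[0-9]*|\.[0-9]+)\Z")


def classify_number(token: str) -> str:
    """Return 'integer' or 'real' for a PDF numeric token (hypothetical helper)."""
    if token != token.strip():
        raise ValueError("white space should not be present in a numeric value")
    if _INT_RE.match(token):
        return "integer"
    if _REAL_RE.match(token):
        return "real"
    raise ValueError(f"not a PDF number: {token!r}")
```

With the examples above, "-38" classifies as an integer and "89.980" as a real.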
** String
Examples:
(This is a \ntwo-line string.)
This is a
two-line string.
901FA
The "pdfliteral" encoding is the same as in a PDF document, including the surrounding parentheses.
The "none" encoding is the result of undoing the "pdfliteral" or other native PDF encoding.
The "pdfhex" encoding is the same as in a PDF document, without the surrounding <>.
Unlike in PDF, where a missing final hex digit is assumed to be 0, it is NOT permitted to omit the final hex digit.
The default encoding is "none".
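A minimal sketch of producing the "pdfhex" form from the hex string as it appears in a PDF file. The function name to_xml_pdfhex is hypothetical; the padding step relies on the PDF rule that a missing final digit is read as 0.

```python
import string


def to_xml_pdfhex(pdf_hex: str) -> str:
    """Normalize a PDF hex string into the 'pdfhex' XML form (hypothetical helper)."""
    body = pdf_hex.strip()
    # Drop the surrounding <> if the caller passed them along.
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    # PDF ignores white space inside hex strings.
    digits = "".join(body.split())
    if any(c not in string.hexdigits for c in digits):
        raise ValueError("non-hex character in hex string")
    # PDF reads a missing final digit as 0; the XML form must spell it out.
    if len(digits) % 2:
        digits += "0"
    return digits
```

The result always has an even number of digits, so it can be handed to bytes.fromhex() to obtain the "none" form.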
When a "pdfliteral" encoding is used, the XML representation need not preserve:
- the distinction between an actual newline (a single character) and an encoded one ("\n" as two characters).
- the presence or location of line continuations (backslash at end of a line).
- the choice of which characters are written with an octal representation.
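Undoing the "pdfliteral" encoding to reach the "none" form could look like the sketch below. It works on bytes and assumes the standard PDF literal-string escapes (\n \r \t \b \f \( \) \\, up to three octal digits, and backslash-newline as a line continuation). The name decode_pdfliteral is hypothetical.

```python
# Single-character escapes, keyed by the byte that follows the backslash.
_SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): ord("("), ord(")"): ord(")"), ord("\\"): ord("\\"),
}


def decode_pdfliteral(raw: bytes) -> bytes:
    """Decode a PDF literal string, parentheses included (hypothetical helper)."""
    if not (raw.startswith(b"(") and raw.endswith(b")")):
        raise ValueError("pdfliteral value must keep its surrounding parentheses")
    body = raw[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:  # anything but a backslash is copied through
            out.append(c)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # a trailing lone backslash is dropped
        c = body[i]
        if c in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[c])
            i += 1
        elif 0x30 <= c <= 0x37:
            # One to three octal digits; high-order overflow is discarded.
            j = i
            while j < len(body) and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif c in (0x0D, 0x0A):
            # Line continuation: the backslash and the end-of-line vanish.
            i += 1
            if c == 0x0D and body[i:i + 1] == b"\n":
                i += 1
        else:
            # Unknown escape: the backslash is ignored, the character kept.
            out.append(c)
            i += 1
    return bytes(out)
```

Because the decoded bytes carry no trace of which escapes were used, a round trip back to "pdfliteral" loses exactly the three details listed above.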
As with any valid XML, the contents MUST NOT contain 8-bit characters if the document encoding
is us-ascii, and MUST NOT contain 8-bit characters that are in "holes" of the document
encoding (as is the case for some byte values in ISO-8859-1, for example).
Text strings for display in PDF are internally either Unicode, encoded as UTF-16BE, or as "PDFDocEncoding",
which is a single-byte encoding similar to latin-1 or Windows CP1252, although it differs from
both of them outside of printable ascii (even some 7-bit chars such as 030-033).
The two cases are distinguished in a PDF document by the first two bytes: a Unicode string begins with the bytes 254 and 255 (FE FF, the UTF-16BE byte order mark).
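The test on the first two bytes can be sketched as below. Note the loud assumption: Python has no built-in PDFDocEncoding codec, so latin-1 is used here only as a stand-in, and it is wrong for exactly the code points where the two encodings differ. The name decode_text_string is hypothetical.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string to a Python str (hypothetical helper)."""
    if raw[:2] == b"\xfe\xff":
        # Bytes 254 255 mark Unicode; the rest is UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # ASSUMPTION: latin-1 is a placeholder for PDFDocEncoding. A real
    # implementation needs the PDFDocEncoding table from the PDF reference.
    return raw.decode("latin-1")
```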
Also, the document SHOULD NOT use 8-bit characters if the document encoding is a multi-byte encoding like UTF-8,
to avoid confusion.
The XML encoding MUST perform XML-escaping of < and & by any of the available
XML mechanisms (&lt; or &#60; or <![CDATA[ ]]>).
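As an illustration of the escaping requirement, the sketch below uses the Python standard library. The element name S and both function names are hypothetical, chosen only for the round-trip check.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def xml_escape_text(value: str) -> str:
    """Escape &, < and > so the value is safe as XML character data."""
    return escape(value)


def roundtrip(value: str) -> str:
    """Wrap the escaped value in a hypothetical <S> element and parse it back."""
    element = ET.fromstring("<S>" + xml_escape_text(value) + "</S>")
    return element.text or ""
```

Any of the three mechanisms parses back to the same string, so the export is free to pick whichever is most convenient.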
** Name
Examples:
/Adobe#20Green
Adobe Green
The "pdfname" encoding includes the leading slash and uses the # encoding.
The default encoding is "none".
All of the encodings for a String may be used with a Name.
The same restrictions on 8-bit characters and XML special characters
that apply to String also apply to Name, except that as with PDF,
the string contents MAY be encoded as UTF-8 if the document encoding
is also UTF-8.
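Undoing the "pdfname" encoding amounts to dropping the leading slash and expanding each # followed by two hex digits. A minimal sketch, with the hypothetical name decode_pdfname:

```python
import re

# A '#' followed by exactly two hex digits stands for one byte.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdfname(raw: bytes) -> bytes:
    """Convert a 'pdfname' value such as /Adobe#20Green to the 'none' form."""
    if not raw.startswith(b"/"):
        raise ValueError("a pdfname value must start with a slash")
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw[1:])
```

For the example above, decode_pdfname(b"/Adobe#20Green") gives b"Adobe Green", which can then be decoded as UTF-8 when the document encoding is UTF-8.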
** Array
Examples:
1
2
White space is allowed as mixed content, and is ignored.
** Dictionary
Examples:
17
81.9
The D element marks the dictionary.
Each E is an entry, with its key (which is always a Name in PDF) as the value of its "key" attribute.
The "E" MAY have an "enc" attribute which applies to the interpretation of the key attribute value.
TBD: allow Type and Subtype as attributes of D?
** Streams
Examples:
gobbledey gook
In PDF, a Stream is an extension of a dictionary.
There are these attributes:
Length
Filter
DecodeParams
F
FFilter
FDecodeParams
It is permitted to omit the "Length" in XML (it is mandatory in PDF).
TBD: allow omitting the surrounding dictionary if all dictionary keys can go in as xml attributes?
The XML export need not preserve the particular choice of Filter.
Starting with PDF 1.5 there are "Object Streams". These are identifiable because
the "Type" of the stream dictionary has value "ObjStrm".
In XML, these are expanded as:
99
Alternatively, in XML the whole object stream MAY be removed and replaced with an
equivalent sequence of objects. Also, the XML export need not preserve the
original splitting of a collection into multiple object streams linked by Extends.
TBD: cross-reference streams.
** Null
Examples:
It can also just be omitted entirely (in compliance with the PDF spec).
** Indirect Objects
Examples:
Brillig
The gen can be omitted and defaults to 0.
The XML export MUST preserve the numbering in the original.
The XML export MAY change the choice as to which objects are indirect, but it MUST NOT
duplicate objects that were not duplicated in the PDF.
* File Structure
Example:
...
No header is needed, since the header information is just attributes in the document.
The XRef is necessary to indicate deleted objects.
The Trailer and XRef need not be present.
If present, they should indicate offsets in the PDF just read.
* Content Streams
Text
BT
ET
* Higher-Level Objects
Root
Catalog
Pages
Page
(Thumbnail)
Annot
(Contents)
Outlines
(outline entries)
(ArticleThreads)
(Thread)
Dests
(Form)
Page Objects:
Resources MediaBox CropBox BleedBox TrimBox ArtBox BoxColorInfo
Contents Rotate Group Thumb B Dur Trans Annots
...
* Transforms
- Type to element
- value as attribute in 'entry'
- entry as attribute in 'dict'
- value as attribute in 'array'
- top-level dictionaries
- xml content
- omit 'Parent'
- rectangles as values
- omit freed objects
- only show latest generations
- flatten content streams that are arrays
- exclusion of comments
Text analysis:
- know or heuristically determine the zones on the page where text goes
from horiz and vertical lines, as per http://www.idealliance.org/papers/dx_xml03/papers/05-03-03/05-03-03.html
- combine letters and words into lines
- combine lines into paragraphs when in the same text object
- deal with skipping around of position
- heuristics about similar height and width
- heuristics about letter spacing to mark a new word
- heuristics about baseline change to be on same line
- heuristics to identify subscript and superscript
- assign to zones
- apply rules to zones to assign roles (font height, number of words, position)
See:
xpdf-NNN/xpdf/TextOutputDev.cc
multivalent-NNN/src/multivalent/std/adaptor/pdf/PDF.java
pdfbox-NNN/src/org/pdfbox/util/PDFTextStripper.java
* TBD
binary
0x0-0x1F illegal besides CR LF TAB
original pdf is already safe.
additional for xml
hexBinary (only upper case letters)
base64Binary (RFC 2045, no line length limitations, must be multiple of 4)
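Both binary forms are available from the Python standard library. A minimal sketch, with hypothetical function names:

```python
import base64


def to_hex_binary(data: bytes) -> str:
    """hexBinary form: two hex digits per byte, upper case letters only."""
    return data.hex().upper()


def to_base64_binary(data: bytes) -> str:
    """base64Binary form: no line breaks, padded to a multiple of 4 characters."""
    return base64.b64encode(data).decode("ascii")
```

b64encode pads with "=" and never inserts line breaks, which matches the two constraints noted above.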
single file
MIME
OpenDocument uses "jar" format (zip including content.xml)
iXF
ixfstd.org
.xml or .pdfxml application/vnd.discerning.pdf
application/vnd.discerning.pdfjar
Thumbnails
Pictures
Fonts
Images
Graphics
Forms
Scripts
An XObject and a Pattern ColorSpace can contain a content stream.
"Rich Text" (xhtml with css)
PDF tables
Page-Piece Dictionary to hold its own xml info?
(external) File Specifications