Mark D. Anderson (mda@discerning.com)
October 2005

Proposal for PDF documents as XML

* Introduction

xmlns:xmlmime='http://www.w3.org/2004/11/xmlmime'

Suggestions for XHTML.

SHOULD use a default xml namespace, to ease DTD validation and processing by non-namespace-aware XML processors.

No encryption
Forms
Annotations
Bookmarks
XObjects

* Basic Objects:

** Boolean

Examples:
  true
  false

White space should not be present.

** Numeric

Examples:
  -38
  89.980

The presence of a period is sufficient to lexically distinguish a PDF integer from a PDF real, but we distinguish them anyway.

White space should not be present.

** String

Examples:
  (This is a \ntwo-line string.)
  This is a two-line string.
  901FA

The "pdfliteral" encoding is the same as in a PDF document, including the surrounding parentheses.
The "none" encoding is the result of undoing the "pdfliteral" or other native PDF encoding.
The "pdfhex" encoding is the same as in a PDF document, but without the surrounding <>, and it is NOT permitted to omit the final hex digit.
The default encoding is "none".

When a "pdfliteral" encoding is used, the xml representation need not preserve:
- the distinction between actual ("\n" as a single character) and encoded ("\n" as two characters) forms
- the presence or location of line continuations (backslash at end of a line)
- the choice of which characters get an octal representation

As with any valid XML, the contents MUST NOT contain 8-bit characters if the document encoding is us-ascii, and MUST NOT contain 8-bit characters that are in "holes" of the document encoding (as is the case for some byte values in ISO-8859-1, for example).

Text strings for display in PDF are internally either Unicode, encoded as UTF-16BE, or "PDFDocEncoding", which is a single-byte encoding similar to latin-1 or Windows CP1252, although it differs from both of them outside of printable ascii (even for some 7-bit chars such as 030-033).
The two cases are distinguished in a PDF document by the first two bytes: a Unicode string begins with the bytes 254 and 255 (the UTF-16BE byte order mark).
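The byte-order-mark test described above can be sketched in Python. This is only an illustration, not part of the proposal: the function name decode_pdf_text_string is hypothetical, and because Python has no built-in PDFDocEncoding codec the fallback branch uses latin-1 as a stand-in, which is only reliable for printable ascii.

```python
# The two leading bytes (254, 255) that mark a UTF-16BE text string.
UTF16BE_BOM = bytes([254, 255])


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the raw bytes of a PDF text string (hypothetical helper)."""
    if raw.startswith(UTF16BE_BOM):
        # Unicode case: drop the two marker bytes and decode the rest.
        return raw[2:].decode("utf-16-be")
    # PDFDocEncoding case. ASSUMPTION: latin-1 is used as a placeholder.
    # It agrees with PDFDocEncoding for printable ascii only; a real
    # converter needs the full PDFDocEncoding table from the PDF spec.
    return raw.decode("latin-1")


if __name__ == "__main__":
    print(decode_pdf_text_string(b"\xfe\xff\x00H\x00i"))
    print(decode_pdf_text_string(b"Hi"))
```

Once decoded this way, the value can be written with the "none" encoding, subject to the 8-bit and escaping restrictions on the XML document.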
The document also SHOULD NOT use 8-bit characters if the document is in a multi-byte encoding like UTF-8, to avoid confusion.

The xml encoding MUST perform xml-escaping of < and & by any of the available xml mechanisms (an entity reference such as &lt;, a numeric character reference such as &#60;, or a CDATA section).

** Name

Examples:
  /Adobe#20Green
  Adobe Green

The "pdfname" encoding includes the leading slash and uses the # encoding.
The default encoding is "none".
All of the encodings for a String may be used with a Name.

The same restrictions on 8-bit characters and XML special characters that apply to String also apply to Name, except that as with PDF, the string contents MAY be encoded as UTF-8 if the document encoding is also UTF-8.

** Array

Examples:
  1
  2

White space is allowed as mixed content, and is ignored.

** Dictionary

Examples:
  17
  81.9

The D element marks the dictionary.
Each E is an entry, with its key value (which is always a Name in PDF) as its value.
The "E" MAY have an "enc" attribute which applies to the interpretation of the key attribute value.

The Type

TBD: allow Type and Subtype as attributes of D?

** Streams

Examples:
  gobbledey gook

In PDF, a Stream is an extension of a dictionary. There are these attributes:
  Length
  Filter
  DecodeParams
  F
  FFilter
  FDecodeParams

It is permitted to omit the "Length" in XML (it is mandatory in PDF).

TBD: allow omitting the surrounding dictionary if all dictionary keys can go in as xml attributes?

The XML export need not preserve the particular choice of Filter.

Starting with PDF 1.5 there are "Object Streams". These are identifiable because the "Type" of the stream dictionary has value "ObjStrm".
In XML, these are expanded as:
  99

Alternatively, in XML the whole object stream MAY be removed and replaced with an equivalent sequence of objects.
Also, the XML export need not preserve the original splitting of a collection into multiple object streams linked by Extends.

TBD: cross-reference streams.

** Null

Examples:

It can also just be omitted entirely (in compliance with the PDF spec).
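As an illustration of the Name encodings described earlier, the following Python sketch undoes the "pdfname" encoding to obtain the "none" form, using the example value from the Name section. The function name decode_pdfname is hypothetical, and the sketch assumes each #xx escape stands for a single character with that code; names whose bytes are really UTF-8 would need an additional byte-level decoding step.

```python
import re

# Matches one "#xx" escape, where xx is two hex digits.
_HEX_ESCAPE = re.compile(r"#([0-9A-Fa-f]{2})")


def decode_pdfname(encoded: str) -> str:
    """Turn a "pdfname"-encoded Name into its "none" form (hypothetical helper)."""
    if not encoded.startswith("/"):
        raise ValueError("a pdfname-encoded Name must start with a slash")
    body = encoded[1:]  # drop the leading slash
    # ASSUMPTION: each escape maps to the single character with that code.
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)


if __name__ == "__main__":
    print(decode_pdfname("/Adobe#20Green"))
```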
** Indirect Objects

Examples:
  Brillig

The gen can be omitted and defaults to 0.
The XML export MUST preserve the numbering in the original.
The XML export MAY change the choice as to which objects are indirect, but it MUST NOT duplicate objects that were not duplicated in the PDF.

* File Structure

Example:
  ...

No header, since just attributes in the document.
The XRef is necessary to indicate deleted objects.
The Trailer and XRef need not be present. If present, they should indicate offsets in the PDF just read.

* Content Streams

Text
BT
ET

* Higher-Level Objects

Root
Catalog
Pages
Page
(Thumbnail)
Annot
(Contents)
Outlines
(outline entries)
(ArticleThreads)
(Thread)
Dests
(Form)

Page Objects:
  Resources MediaBox CropBox BleedBox TrimBox ArtBox BoxColorInfo
  Contents Rotate Group Thumb B Dur Trans Annots ...

* Transforms

- Type to element
- value as attribute in 'entry'
- entry as attribute in 'dict'
- value as attribute in 'array'
- top-level dictionaries
- xml content
- omit 'Parent'
- rectangles as values
- omit freed objects
- only show latest generations
- flatten content streams that are arrays
- exclusion of comments

Text analysis:
- know or heuristically determine the zones on the page where text goes, from horizontal and vertical lines, as per http://www.idealliance.org/papers/dx_xml03/papers/05-03-03/05-03-03.html
- combine letters and words into lines
- combine lines into paragraphs when in the same text object
- deal with skipping around of position
- heuristics about similar height and width
- heuristics about letter spacing to mark a new word
- heuristics about baseline change to be on same line
- heuristics to identify subscript and superscript
- assign to zones
- apply rules to zones to assign roles (font height, number of words, position)

See:
  xpdf-NNN/xpdf/TextOutputDev.cc
  multivalent-NNN/src/multivalent/std/adaptor/pdf/PDF.java
  pdfbox-NNN/src/org/pdfbox/util/PDFTextStripper.java

* TBD

binary
0x0-0x1F illegal besides CR LF TAB
original pdf is already safe.
additional for xml
hexBinary (only upper case letters)
base64Binary (RFC 2045, no line length limitations, must be multiple of 4)

single file
MIME
OpenDocument uses "jar" format (zip including content.xml)
iXF ixfstd.org

.xml or .pdfxml
application/vnd.discerning.pdf
application/vnd.discerning.pdfjar

Thumbnails
Pictures
Fonts
Images
Graphics
Forms
Scripts

An XObject and a Pattern ColorSpace can contain a content stream.

"Rich Text" (xhtml with css)
PDF tables
Page-Piece Dictionary to hold its own xml info?
(external) File Specifications
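The binary notes above (control characters below 0x20 other than CR, LF and TAB being illegal in XML, with hexBinary and base64Binary as possible carriers) might be sketched in Python as follows. The helper names are hypothetical and the proposal does not yet settle which carrier to use; this only shows what each option would produce.

```python
import base64

# Below 0x20, XML 1.0 allows only TAB, LF and CR.
XML_SAFE_CONTROLS = {0x09, 0x0A, 0x0D}


def needs_binary_encoding(data: bytes) -> bool:
    """True if data holds a control byte that XML text cannot carry."""
    return any(b < 0x20 and b not in XML_SAFE_CONTROLS for b in data)


def to_hex_binary(data: bytes) -> str:
    """hexBinary form, written with upper case letters only."""
    return data.hex().upper()


def to_base64_binary(data: bytes) -> str:
    """base64Binary form: no line breaks, length a multiple of 4."""
    return base64.b64encode(data).decode("ascii")


if __name__ == "__main__":
    sample = b"\x90\x1f\xa0"
    print(needs_binary_encoding(sample))
    print(to_hex_binary(sample))
    print(to_base64_binary(sample))
```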