<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes" private="Mark D. Anderson" ?>
<rfc category="info" ipr="full3978" docName="draft-mda-docformats-pdf-as-xml-00" seriesNo="0">
  <front>
    <title>PDF as XML</title>
    <author initials="Mark D." surname="Anderson" fullname="Mark D. Anderson">
      <organization>Discerning Software</organization>
      <address>
        <postal>
          <street></street>
          <city>San Francisco</city> <region>CA</region> <code>94107</code>
          <country>US</country>
        </postal>
        <phone></phone>
        <email>mda@discerning.com</email>
        <uri>http://discerning.com/</uri>
      </address>
    </author>
    <date month="October" year="2005"/>
    <area>General</area>
    <keyword>Internet-Draft</keyword>
    <abstract>
      <t>
This document specifies two translations of PDF ("Portable Document Format") into XML. One format, "Raw", preserves physical file layout decisions; the other format, "Logical", does not.
      </t>
    </abstract>
  </front>
  <middle>
    <section anchor="tbd" title="TBD">
taken: bcdifghijklmnqsvwy
issues in removing generations. xref streams in 1.5.
references to fonts and XObjects
what instructions aren't covered?
q/Q are nested?
content stream hoisting
xml namespace
specversion
use of xml::id
use of xml namespaces
tagged structure
    </section>
    <section anchor="intro" title="Introduction">
      <t>
The <xref target="refs.PDF">PDF (Portable Document Format)</xref> has been in use since 1993, undergoing several revisions.
In addition to the general PDF Specification published by Adobe, a <eref  target="http://www.pdf-x.com/">print industry consortium</eref>
has defined a series of standards known collectively as PDF/X, including several ISO standards such as <xref target="refs.PDFX">PDF/X-3</xref>.
A separate group has produced a further ISO specification, <xref target="refs.PDFA">PDF/A</xref>, for archival use.
      </t>
      <t>
While it is possible to create a PDF document entirely in ASCII, most PDF documents are not, for
reasons of compression. Even in a purely ASCII form, a PDF document is not particularly easy for
even an engineer to read directly, nor for a computer program to parse. This means that it is necessary to
use special PDF-specific software libraries to parse the data, and tools to extract human-readable text or images.
Meanwhile, there are many mature tools for processing XML, and XML is much more widely understood in the
engineering community than the PDF format.
It would therefore be useful to have a defined XML encoding of PDF, which would enable tools to act at the XML level.
      </t>
      <t>
This document defines two profiles for representing a PDF document as XML:
"Raw" and "Logical".
The "Raw" profile preserves physical layout decisions in a PDF document, while "Logical" does not.
The two XML documents have different xml root elements ("pdfraw" vs. "pdf").
Neither is a superset of the other, though they share many common features.
      </t>
      <t>
The "Raw" profile is useful for debugging issues with PDF files, and for doing low-level
manipulations prior to conversion back to PDF. It preserves most (but not all) of the file structure.
      </t>
      <t>
The "Logical" profile is useful for performing conversions to other document formats.
The "Logical" profile loses no information that would be useful
when converting to some other document format. The differences between "Raw" and "Logical"
are never ones that would be visible to a user.
      </t>
      <texttable>
        <preamble>Here is a table of what is preserved in each:</preamble>
  <ttcol align="left">Aspect</ttcol>
  <ttcol align="left">Raw</ttcol>
  <ttcol align="left">Logical</ttcol>
  <ttcol align="left">Comment</ttcol>
  <c>object as reference vs. direct</c><c>Y</c><c>N</c><c>Logical might not even be compliant with PDF rules about when direct objects vs. references are required.</c>
  <c>identical object numbering</c><c>Y</c><c>N</c><c></c>
  <c>object generations</c><c>Y</c><c>N</c><c></c>
  <c>unused and freed objects</c><c>Y</c><c>N</c><c></c>
  <c>stored bytes (including binary) between header and first object, or in holes between objects</c><c>N</c><c>N</c><c></c>
  <c>null: present vs. missing</c><c>Y</c><c>N</c><c>Logical simply omits any dictionary entries with null values. Note that you cannot omit empty dictionaries, because PDF sometimes distinguishes null versus empty in inheritance.</c>
  <c>string objects: stored as hex vs. literal</c><c>Y</c><c>N</c><c></c>
  <c>text strings: using PDFDocEncoding vs. UTF-16BE</c><c>Y</c><c>N</c><c></c>
  <c>strings: use of character escapes, octal escapes, and line continuations in literals (e.g. one character vs. two-character "\n")</c><c>Y</c><c>N</c><c>note that strings containing 8-bit binary characters still have to be escaped for XML; see <xref target="charsets" format="title"/></c>
  <c>names: choice of characters being hash-encoded</c><c>Y</c><c>N</c><c></c>
  <c>pdf comments: preserved</c><c>N</c><c>N</c><c>Raw MAY preserve comments, but this is not required.</c>
  <c>non-significant white space differences preserved (EOL choice, multiple spaces, etc.)</c><c>N</c><c>N</c><c></c>
  <c>content streams: collection linked by <v>Extends</v> vs. single</c><c>Y</c><c>N</c><c>Logical does not support <v>Extends</v></c>
  <c>object streams (PDF 1.5): used or not, with or without collections, etc.</c><c>Y</c><c>N</c><c></c>
  <c>text streams (PDF 1.5): used or not instead of normal text strings</c><c>Y</c><c>N</c><c></c>
  <c>cross-reference table ("xref" section) and cross-reference streams (PDF 1.5)</c><c>Y</c><c>N</c><c>Offsets are kept for debugging purposes and should not be trusted when converting back to PDF.</c>
  <c>encryption</c><c>Y</c><c>Y</c><c>See <xref target="filters" format="title"/></c>
      </texttable>
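      <figure>
        <preamble>As a small illustration of the "null: present vs. missing" row, the sketch below shows one dictionary in PDF syntax, in Raw, and in Logical before any optional hoisting. The dictionary and its "PageMode" and "Dests" entries are chosen only for illustration, and entry names are written as in the example in <xref target="file-raw" format="title"/>.</preamble>
        <artwork><![CDATA[
 PDF:      << /PageMode /UseNone /Dests null >>

 Raw:      <dict>
             <entry name="PageMode"><n>/UseNone</n></entry>
             <entry name="Dests"><null/></entry>
           </dict>

 Logical:  <dict>
             <entry name="PageMode"><n>/UseNone</n></entry>
           </dict>
]]></artwork>
        <postamble>Raw keeps the explicit null so that the dictionary can be written back as it was stored; Logical drops the entry, but would still keep a dictionary that ended up empty.</postamble>
      </figure>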

    </section>

    <section anchor="file" title="File Format">
      <t>
	This section describes the structure of the two XML formats. There are full <xref target="examples">examples</xref> in the Appendix.
      </t>

      <section anchor="file-raw" title="Raw File Format">

        <figure anchor="example-raw">
          <preamble>An example Raw file</preamble>
          <artwork><![CDATA[
 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE pdfraw SYSTEM "mda-pdfraw.dtd">
 <pdfraw pdfversion="1.4">
   <objects>
     <o num="1" gen="0">
       <dict>
         <entry name="Type"><n>/Catalog</n></entry>
         ...
       </dict>
     </o>
     ...
   </objects>
   <xrefs first="1" count="5">
     <xref offset="998" num="1" gen="0" free="n"/>
     ...
   </xrefs>
   <trailer>
     <dict>
       <entry name="Size"><i>5</i></entry>
       <entry name="Root"><ref num="1" gen="0"/></entry>
     </dict>
   </trailer>
   <xrefs first="6" count="7">
     ...
   </xrefs>
   ...
 </pdfraw>
]]></artwork>
	</figure>
	<t>
As can be seen, the Raw format preserves the structure of the PDF file, including incremental updates and object generations.
Practically all aspects of the PDF file are retained, including offsets, which are meaningless in the XML file
and may no longer be accurate if the file is converted back to PDF.
	</t>
	<t>
In Raw format, all XML elements have lower case names specific to the format: no PDF names such as "Catalog" appear as element names.
All basic objects are individual XML elements such as <q><![CDATA[<i>5</i>]]></q>.
Because the Raw format is so close to the PDF layout, it is relatively robust against new PDF features.
For example, it is not necessary for a tool that interconverts between PDF file format and Raw to
know about "object streams" or "cross-reference streams". As long as no new basic object types are introduced, 
and no new low-level sections (like the xref section) are added,
the Raw format should work well with future PDF versions.
	</t>
      </section>

      <section anchor="file-logical" title="Logical File Format">
        <figure anchor="example-logical">
          <preamble>An example Logical file</preamble>
          <artwork><![CDATA[
 <?xml version="1.0" encoding="utf-8"?>
 <!DOCTYPE pdf SYSTEM "mda-pdf.dtd">
 <pdf pdfversion="1.4">
   <Catalog>
     <Pages>
       <Page MediaBox="[0 0 612 792]">
         <Contents>
           <stream>
             <Text> 
               <Tf operands="/F1 24"/>
               <Td operands="100 100"/>
               <Tj operands="(Hello World)"/>
             </Text> 
           </stream>
         </Contents>
         <Resources ProcSet="[/PDF /Text]">
           <Font Subtype="Type1" Name="F1" BaseFont="Helvetica" Encoding="MacRomanEncoding"/>
         </Resources>
       </Page>
       ...
     </Pages>
     ...
   </Catalog>
   <Encrypt>
   ...
   </Encrypt>
   <Info>...</Info>
   <ID>...</ID>
 </pdf>
]]></artwork>
        </figure>

	<t>
In the Logical format, the immediate children of the "pdf" root element are:
<list style="symbols">
  <t>Entries of the trailer, exclusive of "Root", "Prev", "Size". This currently means "Encrypt", "Info", and "ID" (all of which are optional in PDF documents).</t>
  <t>Any children of the "Root" dictionary. Currently this means "Catalog".</t>
</list>
	</t>
	<t>
The Logical format does not need object references to the same degree as the Raw format (which uses them to match the PDF file).
In the Logical format, as long as an object is used just once, it is simply inserted as a direct child.
When object references are used, they are not expressed in the same way as in the Raw format.
In Raw format (as in the PDF file format itself), "indirect objects" are marked with a surrounding object ("o" in Raw and "obj" in PDF)
that specifies the object number and generation, using "objnum" and "gen" attributes in Raw.
In Logical format, there is no such extra surrounding object (the object itself has an "id" attribute), and generations don't exist.
	</t>
	<t>
Also, because all relevant objects must be reachable by the trailer or Root, the Logical format has no need for an "objects" element
the way that the Raw format does (to hold unused and freed objects, etc.).
	</t>
<t>
Any "outline dictionary" is treated as a normal dictionary so that these do not appear: First, Last, Count, Prev, Next.
</t>
	<t>
Any "name trees" and "number trees" are flattened so that these do not appear: Kids, Names, Nums, Limits. These are identified by any dictionary with keys "Kids", "Names", or "Nums".
The tree under the dictionary is replaced by a flattened sequence of XML &lt;entry&gt; elements identical to what it would have been as a normal dictionary. Order is preserved in the XML.
	</t>
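	<figure>
	  <preamble>A minimal sketch of name tree flattening, under stated assumptions: the keys "Apple" and "Pear" and the object numbers are invented for illustration, the "..." stands for whatever value each key maps to, and using the key string as the "name" attribute is one reading of "identical to a normal dictionary" rather than something this document spells out.</preamble>
	  <artwork><![CDATA[
 PDF (a root node with one leaf node):

   << /Kids [ 11 0 R ] >>

   11 0 obj
   << /Limits [ (Apple) (Pear) ]
      /Names  [ (Apple) 21 0 R (Pear) 22 0 R ] >>
   endobj

 Logical (Kids, Limits, and Names no longer appear):

   <dict>
     <entry name="Apple">...</entry>
     <entry name="Pear">...</entry>
   </dict>
]]></artwork>
	</figure>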
	<t>
The dictionary preceding any stream has a "Length" entry, which is not preserved in Logical.
	</t>
	<t>
A "structure tree" and "structure element dictionary" have: "K", "StructParents". These are identifier numbers into marked content (MCID),
which are removed in favor of XML containment.
	</t>
	<t>
A "cross-reference stream dictionary" (which has Size, Prev among others) is not preserved in Logical format, so no transformation is needed.
	</t>
<t>
A "navigation node dictionary" has Next and Prev which are supplanted by document ordering. (TBD: but what if it isn't commutative? so that  a node's Next doesn't have that node as its Prev?)
</t>
      <t>
Other higher-level objects are preserved as they are: dates are left as strings, file specification dictionaries are preserved as dictionaries, etc.
      </t>

	<t>
	  The Logical format allows for some transformations that make for shorter and more readable content:
	  <list style="numbers">
	    <t>Any reference can be replaced with the referenced object, as long as any other references to the same object are adjusted accordingly.</t>
	    <t>A "dict" (dictionary) element containing a "Type" entry can be replaced with an element with a 
name equal to the value of that "Type" entry, as long as that value is a legal XML <xref target="refs.XMLNAMES">NCName</xref>.
	    </t>
	    <t>An entry in a dictionary can become an attribute of its parent if the name of the entry is a legal XML NCName, 
and its value is any of: 
<list style="symbols">
  <t>a number,</t>
  <t>a name whose characters would not require an <xref target="charsets">"xmlenc" escaping</xref>,</t>
  <t>an array of numbers,</t>
  <t>an array of names whose characters are in [a-zA-Z0-9\.\-\_],</t>
  <t>a string whose characters are printable ASCII plus SP (octal 040-176), exclusive of
backslash and PDF delimiters [\(\)\&lt;\&gt;\[\]\{\}/%\\].</t>
</list>
	    </t>
	    <t>An "inst" (instruction) element in a content stream can be replaced by an element whose name is equal to its operator, if the operator is a legal NCName.
The following operators are not legal NCNames and are mapped to these NCNames: 
<texttable>
  <ttcol align="left">Operator</ttcol>
  <ttcol align="left">NCName</ttcol>
  <c>b*</c><c>bstar</c>
  <c>B*</c><c>Bstar</c>  
  <c>f*</c><c>fstar</c>  
  <c>T*</c><c>Tstar</c>  
  <c>W*</c><c>Wstar</c>
  <c>'</c><c>apos</c>
  <c>"</c><c>quot</c>
</texttable>
If any new operators are defined in the future that are not NCNames, this table would have to be extended (and meanwhile processors would have to handle such "inst" elements without hoisting).
	    </t>
	    <t>The operands of an "inst" element can be hoisted to an "operands" attribute if the operands consist of a single value meeting the criteria for dictionary values, or consist of multiple values meeting those criteria, as long as none of them are strings (only numbers and names).</t>
	    <t>Certain known begin-end pairs of operators are replaced with a single element, with the intervening instructions nested between. In particular: 
	    <texttable>
	      <ttcol align="left">Operators</ttcol>
	      <ttcol align="left">XML</ttcol>
	      <c>BX/EX</c><c>Compat</c>
	      <c>BI/EI</c><c>Image</c>
	      <c>BT/ET</c><c>Text</c>
	      <c>BDC/EMC</c><c>MarkedContent (with propertylist child)</c>
	      <c>BMC/EMC</c><c>MarkedContent (without propertylist child)</c>
	    </texttable>
	    </t>
	  </list>
	  Names always have their leading slash whether in element content or "hoisted" to an attribute value.
When a string value is hoisted from a <xref target="strings"><q><![CDATA[<u>, <s>, or <h>]]></q></xref>,
it is always in the PDF "literal" format, including the surrounding parentheses (though it is a severely restricted subset, not allowing backslash or escapes).
	</t>
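	<figure>
	  <preamble>The transformations above can be sketched on a short content stream. This is an illustration under assumptions, not a normative example: the stream and its operand values are invented, and the XML uses only the rules stated above (element named after the operator, the "operands" attribute, the operator-to-NCName table, and the BT/ET pair becoming "Text").</preamble>
	  <artwork><![CDATA[
 PDF content stream:

   BT
     /F1 12 Tf
     14 TL
     72 720 Td
     (Hello) Tj
     T*
     (World) '
   ET

 Logical, fully hoisted:

   <Text>
     <Tf operands="/F1 12"/>
     <TL operands="14"/>
     <Td operands="72 720"/>
     <Tj operands="(Hello)"/>
     <Tstar/>
     <apos operands="(World)"/>
   </Text>
]]></artwork>
	  <postamble>The "Tf" operands are a name and a number, so they may share one attribute; each string operand is hoisted only because it is the sole operand of its instruction and contains no backslash or delimiter characters.</postamble>
	</figure>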
	<t>
Note that a processing application will still have to handle the "unhoisted" cases, because for example there are instructions that are not valid NCNames, such as ' and ".
	</t>
      </section>
    </section>

    <section anchor="charsets" title="Character Set Issues">
      <t>
The PDF storage format allows for any 8-bit value to be used, for example in image data.
In contrast, it is illegal for some 8-bit values to ever appear in an XML document,
regardless of the document charset (for example, bytes in the range 0x0-0x1F besides CR, LF, and TAB).
So if identical bytes from some portion of a PDF file were copied directly into XML,
the result would not be valid.
      </t>
      <t>
We therefore introduce an <q>xmlenc</q> attribute, which may be used on leaf elements in an XML representation
to indicate an encoding that is applied, at the level of the XML representation, to the element's character content.
This encoding is applied in addition to any encryption or compression that might have been done at the PDF level.
      </t>
      <t>
The legal values for <q>xmlenc</q> are "none", "base16", and "base64".
Both "base16" and "base64" are as per <xref target="refs.RFC3548">RFC3548</xref>, with no limitations on line length, and with no permitted "ignorable" characters.
Note that "base16" differs from the "hexadecimal encoding" used in PDF, because the PDF hex encoding allows any white space,
either upper case A-F or lower case a-f letters, and omission of the final hexadecimal digit. We do not permit any of that variability in "base16".
      </t>
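      <figure>
        <preamble>To make the difference concrete, the sketch below takes a PDF hex string that uses white space, lower-case digits, and an omitted final digit (which PDF treats as 0), and shows the same three bytes (hex 00 FF 10) under each non-trivial "xmlenc" value. The byte values are invented for illustration, and placing the attribute on an "s" element follows the description in <xref target="strings" format="title"/>.</preamble>
        <artwork><![CDATA[
 PDF hex string (legal in PDF):   <00 ff 1>

 xmlenc="base16":   <s xmlenc="base16">00FF10</s>
 xmlenc="base64":   <s xmlenc="base64">AP8Q</s>
]]></artwork>
        <postamble>A strict "base16" reader would reject the PDF spelling "00 ff 1" on three counts: the embedded spaces, the lower-case digits, and the odd number of digits.</postamble>
      </figure>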
      <t>
The document character set in XML is Unicode (this is not to be confused with the charset value, which indicates a character encoding).
This is not the case in PDF. Strings in PDF "content streams" are indexes into glyph tables in fonts.
Strings outside of "content streams" might be Unicode (as UTF-16BE) or might be in the "PDFDocEncoding",
a Latin-like single byte charset defined in the PDF specification.
As with any other binary data, such as images, we do not permit string content to include bytes outside of those valid for XML, or outside of those legal for the document encoding.
This is discussed further in <xref target="strings" format="title"/>.
We place no constraints on the XML document character encoding ("charset" value),
but we do note that a Unicode encoding (such as utf-8, utf-16be, etc.)
may reduce the need for numerical entities in strings.
      </t>
    </section>
    <section anchor="basic" title="Basic Objects">
      <t>This section describes how to encode each of the "basic" objects in PDF. Here is a summary table:
      <texttable>
	<ttcol align="left">Object</ttcol>
	<ttcol align="left">PDF Syntax</ttcol>
	<ttcol align="left">XML Raw Syntax</ttcol>
	<ttcol align="left">XML Transformed Syntax</ttcol>
	<ttcol align="left">Comments</ttcol>
	<c>Array</c>
	   <c><q>[1 2 3]</q></c>
	   <c><q><![CDATA[<array><i>1</i><i>2</i><i>3</i></array>]]></q></c>
	   <c><q>"[1 2 3]"</q></c>
	   <c></c>
	<c>Boolean</c>
	   <c><q>true</q></c>
	   <c><q>&lt;true/&gt;</q></c>
	   <c><q>Foo="true"</q></c>
	   <c></c>
	<c>Dictionary</c>
	   <c><q>&lt;&lt; /Type /Foo /Size 17 ... &gt;&gt;</q></c>
	   <c><q><![CDATA[<dict>
  <entry name="/Type"><n>/Foo</n></entry>
  <entry name="/Size"><i>17</i></entry>
  ...
</dict>]]></q></c>
	   <c><![CDATA[<Foo Size="17">...</Foo>]]></c>
	   <c></c>
	<c>Indirect Object</c>
	   <c><q>1 0 obj</q></c>
	   <c><q><![CDATA[<obj objnum="1" gen="0">...</obj>]]></q></c>
	   <c><q><![CDATA[<Foo objnum="1">...</Foo>]]></q></c>
	   <c></c> 
	<c>Object Reference</c><c><q>1 0 R</q></c><c><q><![CDATA[<ref objnum="1" gen="0"/>]]></q></c><c></c><c></c> 
	<c>Name</c><c><q>/Foo</q></c><c><q><![CDATA[<n>/Foo</n>]]></q></c><c>"/Foo"</c><c></c>
	<c>Null</c><c><q>null</q></c><c><q><![CDATA[<null/>]]></q></c><c>(not present)</c><c>does not appear in Logical</c>
	<c>Number</c><c><q>17</q></c><c><q><![CDATA[<r>17</r>]]></q></c><c>"17"</c><c>both integers and reals</c>
	<c>Stream</c><c><q>stream ... endstream</q></c><c><q><![CDATA[<stream>...</stream>]]></q></c><c>same</c><c></c>
	<c>String, literal</c><c><q>(hello world)</q></c><c><q><![CDATA[<s>hello world</s>]]></q></c><c>"hello world"</c><c></c>
	<c>String, hex</c><c><q>&lt;6A&gt;</q></c><c><q><![CDATA[<h>6A</h>]]></q></c><c>"j"</c><c>because "j" is ascii hex 6A </c>
      </texttable>
      </t>
    <section anchor="strings" title="Strings">
      <t>
In a PDF document, character encoding is addressed differently inside and outside a content stream.
Any text outside of a content stream is called a "text string." Examples are in
annotations, bookmarks, article names, and document information. These are always
either in the PDFDocEncoding, or in UTF-16BE.
      </t>
      <t>
String objects in a content stream of a PDF are, strictly speaking, byte streams acting as glyph indices into fonts.
The so-called "simple" fonts use single bytes.
Composite fonts define code ranges in their "CMaps" which can map from 1, 2, 3, or 4 successive bytes.
A font might be "symbolic" (such as ZapfDingbats) or "nonsymbolic".
It can declare a standard character encoding (such as "WinAnsiEncoding"),
or define its own "encoding dictionary". It can also define a "Differences" array relative to a BaseEncoding.
      </t>
      <t>
The font mapping table determines how to map from a byte sequence to the appropriate glyphs. In most cases, it is also possible
to determine corresponding Unicode characters.
If it is a simple font with a standard character mapping (such as "WinAnsiEncoding"), conversion to Unicode is easy.
In other cases, such as some composite fonts, it is not always possible (unless there is a "ToUnicode" map, or
it uses a predefined "CMap").
Note that in some encodings (including the single-byte MacExpertEncoding), a single glyph index might
represent a ligature (e.g. "ff" or "fi"), which may correspond to multiple Unicode characters.
A glyph might also be used for a fraction such as "5/8".
A font might have different glyphs for subscript and superscript numbers; in general, the same Unicode character
might have multiple glyphs in a font. And of course there are many glyphs that correspond to no character
in a human language (and do not appear on any keyboard): arrows, bullets, and so on.
Note that TrueType fonts can have platform-dependent mappings to glyphs, so that the same byte stream might map
to different glyphs on different platforms, by use of a "platform-specific encoding id".
All of which is to say that mapping between Unicode characters and glyph indices is not always possible, nor
simple when it is possible.  
      </t>
      <t>
The possible ways that strings may appear in XML are:
      <texttable>
        <ttcol align="left">Name</ttcol>
        <ttcol align="left">Example</ttcol>
        <ttcol align="left">Comments</ttcol>
        <c>Unicode</c><c><q><![CDATA[<u>hello</u>]]></q></c><c>In Logical, only to be used for text strings (outside of content streams), and to be used for all of them. In Raw, this may be used for text strings or string objects; it signifies that the string should be written as UTF-16BE (versus PDFDocEncoding). May require numerical entities in some document charsets.</c>
        <c>Literal</c><c><q><![CDATA[<s>hello</s>]]></q></c><c>Matches string literals in PDF, without the surrounding parentheses. To deal with arbitrary 8-bit bytes, one of the following must be done: all bytes outside of printable ascii + TAB + SP MUST be done as octal escapes, OR the entire element contents must be encoded and an <q>xmlenc</q> attribute added to indicate how. The latter choice is only legal for Raw; it is intended to support direct translation from PDF files.</c>
        <c>Hex</c><c><q><![CDATA[<h>68656C6C6F</h>]]></q></c><c>In Logical, the encoding must be compliant with "base16" encoding, which is more strict than PDF hexadecimal. In Raw, any legal hexadecimal is permitted; for example, the final hexadecimal digit may be omitted.</c>
      </texttable>
Note that PDFDocEncoding cannot be used literally because it uses bytes octal 030-037 (= hex 18-1F), which are illegal in XML.
      </t>
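      <figure>
        <preamble>As a sketch of the three forms, consider the text string "caf" followed by e-acute, stored in PDFDocEncoding, where e-acute is the single byte octal 351 (hex E9). The string itself is invented for illustration; the element names and rules are those of the table above.</preamble>
        <artwork><![CDATA[
 PDF literal string:             (caf\351)
   Logical (text string):        <u>caf&#xE9;</u>
   Raw, octal escape kept:       <s>caf\351</s>
   Raw, whole element encoded:   <s xmlenc="base64">Y2Fm6Q==</s>

 PDF hex string, same bytes:     <636166E9>
   Raw:                          <h>636166E9</h>
]]></artwork>
        <postamble>The numeric character reference in the "u" form is only needed when the XML document encoding cannot represent the character directly; in a UTF-8 document the character could appear as itself.</postamble>
      </figure>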
      <t>
In all cases, significant XML characters (&lt;&gt;&amp; and "' in attribute values) are escaped as necessary to produce well-formed XML,
using XML mechanisms (such as named entities or numeric character references).
      </t>

    </section>
    </section>

    <section anchor="filters" title="Filters">
      <t>
PDF allows for a pipeline of multiple filters to be applied ("Filter" can be an array).
      </t>
      <t>
All information about the former (or desired) filters is preserved in the XML.
These filters have not necessarily been applied to the byte contents; the <q>ispdfenc</q> and <q>xmlenc</q> attributes indicate what has been applied.
The cases are:
      <texttable>
        <preamble>Interpretation of byte streams
        </preamble>
        <ttcol align="left">ispdfenc</ttcol>
        <ttcol align="left">xmlenc</ttcol>
        <ttcol align="left">Meaning</ttcol>
        <c><q>true</q></c><c><q>none</q></c><c>exact bytes that the PDF has (or will have)</c>
        <c><q>true</q></c><c><q>base64</q></c><c>the stored bytes, with an additional escaping in XML to preserve 8-bit values</c>
        <c><q>false</q></c><c><q>none</q></c><c>all PDF-level encoding and encryption is undone</c>
        <c><q>false</q></c><c><q>base64</q></c><c>all PDF-level encoding and encryption is undone, and then an additional escaping is done in XML to preserve 8-bit values</c>
      </texttable>
      </t>
      <t>
The default values for these attributes are <q>ispdfenc="true" xmlenc="none"</q>.
The value <q>xmlenc="none"</q> is only legal if the (final) encoding results in only printable ASCII,
or if there are no filters and the string itself is only printable ASCII.
The value <q>xmlenc="none"</q> is also permitted for PDF strings (not images) where there are no filters
and all the bytes may be mapped to Unicode without loss of information (see <xref target="charsets"/>).
      </t>
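      <figure>
        <preamble>A hedged sketch of two of the cases above for one content stream that is Flate-compressed in the PDF. Placing "ispdfenc" and "xmlenc" on the "stream" element is an assumption made for this illustration, the "..." stands for the base64 text of the compressed bytes (not reproduced here), and the stream dictionary carrying the "Filter" entry is omitted.</preamble>
        <artwork><![CDATA[
 Stored bytes kept, then escaped for XML:

   <stream ispdfenc="true" xmlenc="base64">...</stream>

 PDF-level filters undone; the result is printable ASCII:

   <stream ispdfenc="false" xmlenc="none">
     BT /F1 24 Tf 100 100 Td (Hello World) Tj ET
   </stream>
]]></artwork>
        <postamble>In both cases the filter information is still recorded, so a converter knows which filters to re-apply (or to leave alone) when writing the PDF.</postamble>
      </figure>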
      <t>
Note that the <q>xmime:contentType</q> attribute MAY be used, as specified in <xref target="xmime"/> but as that mechanism
relies on XML Schema for determination of the xml-level encoding, it is not mandated here.
      </t>
    </section>
    <section anchor="encryption" title="Encryption">
      <t>
We treat PDF encryption in the same way as any PDF compression filters.
In PDF, encryption is applied after all encoding filters (and decryption is done before any decoding).
Also, PDF 1.5 introduces "crypt filters".
      </t>

    </section>

    <section anchor="files" title="External Content">
      <t>
PDF already allows streams to be stored externally to the PDF file (starting with PDF 1.2), by using the "F" key in the stream dictionary.
      </t>
      <t>
<list style="symbols">
  <t>Images</t>
  <t>Fonts</t>
  <t>Forms (external FDF files)</t>
  <t>Multimedia</t>
</list>
      </t>
    </section>
    <section anchor="alternate" title="Alternate Content">
      <t>
Catalog/StructTreeRoot and Catalog/MarkInfo, siblings to Catalog/Pages

certain set of predefined tags (Document, Part, Art, Sect, Div) but can define your own and provide a "role map" to suggest its best interpretation

Also, can have an "Alt" entry in the "structure element dictionary"

can have "multi-language text arrays".

can have OPI proxies for images. OPI comments

a system akin to xml namespaces of "attribute owners". attributes of a single structural hierarchy..


Recommend consult "Tagged PDF" in the <xref target="refs.PDF">PDF Specification</xref>
Implement as marked content and "Tagged PDF". Preserve parameters/heuristics.

      </t>
      <section anchor="text-extract" title="Text Extraction">
soft hyphens
layout discontinuities
hidden content (clipped or same foreground and background)
reversed chars for RTL fonts

 
      </section>
    </section>
    <section anchor="xmlcontent" title="XML Content">
xml:base, xml:lang, xml:space
    </section>

    <section anchor="security" title="Security Considerations">
encryption.
IPR: compression, PDF format. encryption.
    </section>
    <section anchor="iana" title="IANA Considerations">
      <t>The application/pdf media type is specified in <xref target="refs.RFC3778">RFC3778</xref>.</t>
    </section>
  </middle>
  <back>
    <references>
      <reference anchor="refs.PDF" target="http://partners.adobe.com/public/developer/pdf/index_reference.html">
        <front>
          <title>PDF Reference, Fifth Edition, Version 1.6</title>
        </front>
      </reference>
      <reference anchor="refs.RFC3778">
        <front>
          <title>The application/pdf Media Type</title>
          <date month="May" year="2004"/>
        </front>
      </reference>

      <reference anchor="refs.PDFA" target="http://www.aiim.org/pdf_a/">
        <front>
          <title>ISO 19005-1. Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF (PDF/A)</title>
        </front>
        <format type="DOC" target="http://www.aiim.org/documents/standards/ISO_19005-1_(E).doc"/>
	
      </reference>

      <reference anchor="refs.PDFX" target="http://www.iso.org/iso/en/CatalogueListPage.CatalogueList?ICS1=35&amp;ICS2=240&amp;ICS3=30&amp;scopelist=">
        <front>
          <title>ISO 15930-3:2002 Graphic technology -- Prepress digital data exchange using PDF -- Part 6: Complete exchange of printing data suitable for colour-managed workflows using PDF 1.4 (PDF/X-3)</title>
        </front>
      </reference>

      <reference anchor="refs.RFC3548">
        <front>
          <title>The Base16, Base32, and Base64 Data Encodings</title>
        </front>
      </reference>
      <reference anchor="xmime" target="http://www.w3.org/TR/xml-media-types/">
        <front>
          <title>Describing Media Content of Binary Data in XML</title>
          <area>W3C Working Group Note</area>
          <date month="May" year="2005"/>
          <author><organization abbrev="W3C">World Wide Web Consortium</organization></author>
        </front>
        <!--<format type="TXT" target="http://www.w3.org/TR/xml-media-types/"/>-->
        <seriesInfo name="W3C" value="XML"/>
      </reference>
      <reference anchor="refs.XMLNAMES" target="http://www.w3.org/TR/REC-xml-names/">
	<front>
	  <title>Namespaces in XML</title>
	  <date month="January" year="1999"/>
	</front>
      </reference>
    </references>

    <section anchor="pdf.dtd" title="pdf DTD">
      <figure>
        <artwork><![CDATA[
]]></artwork>
      </figure>
    </section>

    <section anchor="examples" title="Examples">
    </section>

    <section title="Acknowledgements">
      <t>The author gratefully acknowledges the contributions of:
      </t>
    </section>

  </back>
</rfc>

