anolis 1.0

Documentation — 28 August 2008

Contents

  1 Introduction
  2 Installing anolis
    2.1 Requirements
    2.2 Obtaining a copy
    2.3 Installation
    2.4 Running the test suite
  3 Using anolis
  4 Processes
    4.1 Cross-referencing
    4.2 Table of contents/section numbering
    4.3 Substitution
  Acknowledgements

1 Introduction

Anolis grew out of the need for long technical documents to include niceties such as cross-references and a table of contents for easy navigation. Maintaining these by hand is a great chore, especially when sections are numbered: adding one section changes the numbering of many others. It is therefore advantageous to generate them programmatically.

Anolis does this on HTML documents, as a number of sequential processes. Currently cross-referencing, section numbering, table of contents creation, and a number of substitutions are done (mainly relating to the current date).

2 Installing anolis

2.1 Requirements

The following are the minimum requirements: later versions should also work without issue.

2.2 Obtaining a copy

Releases are occasionally made. A link to the latest release can be found from the anolis website.

Alternatively, a copy can be obtained from our Mercurial repository: this is where our ongoing development occurs, and allows any revision (and therefore any release) to be downloaded. Our repository is located at http://hg.gsnedders.com/anolis/.

2.3 Installation

Normally, installation is done through setuptools, with the following command:

python setup.py install

Please see setuptools' documentation for information on installation options (such as installing in non-standard locations).

2.4 Running the test suite

The source distribution and the current development copy (in Mercurial) both contain a test suite. It can be run with the following command:

python runtests.py

Any test failures should be reported at our bug tracker.

3 Using anolis

Anolis is invoked through the anolis command. The --help (or -h) option gives some basic help.

The --enable and --disable options respectively enable and disable the process given as the option value (e.g., --disable=toc disables building the table of contents and numbering sections). The default processes are sub (substitution), toc (table of contents/section numbering), and xref (cross-referencing). Any enabled process is loaded via from processes import foo and, if that fails, import foo (where foo is the process), and is then called as foo.foo(ElementTree, **kwargs).
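The loading order described above can be sketched roughly as follows. This is only an illustration, not anolis's own source: the helper names load_process and run_process are hypothetical, and importlib stands in for the two import statements.

```python
import importlib


def load_process(name):
    """Import processes.<name>, falling back to a top-level module <name>."""
    try:
        return importlib.import_module("processes." + name)
    except ImportError:
        return importlib.import_module(name)


def run_process(name, tree, **kwargs):
    """Call the function named after the process, e.g. toc.toc(tree, **kwargs)."""
    module = load_process(name)
    return getattr(module, name)(tree, **kwargs)
```

Under this scheme a custom process is simply a module that exposes a function with the same name as the module.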

Some options alter what is used to parse and serialize the document: by default, html5lib is used to parse the document; passing the --lxml.html option uses libxml2's HTML parser and serializer instead (this is quicker, but does not comply with the HTML 5 standard, and sometimes results in a fatal error).

anolis offers a compatibility mode, which aims to be compatible with the CSS3 module postprocessor (within reason). This is mainly provided for the sake of pre-existing W3C documents. The --w3c-compat option turns on this compatibility mode, although specific options that turn on just one compatibility feature at a time are also available (and are documented below under each process). These are all implied by the --w3c-compat option, with one exception: --w3c-compat-crazy-substitutions, as it can lead to undesirable results.

The options --newline-char and --indent-char set the newline and indent strings respectively (they do not have to be a single character). They default to U+000A LINE FEED (LF) and U+0009 CHARACTER TABULATION (tab) respectively. These are only used when generating large trees of generated markup, such as the table of contents.

Other process-specific options are documented under the process to which they belong.

Upon a fatal error, processing of the document is terminated and the output file is left unchanged.

The textContent property is the same as that defined in DOM Level 3 Core on the Node interface.

Whitespace is as defined in HTML 5: U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

Interactive content is as defined in HTML 5: the a, bb, details, and datagrid elements; the audio and video elements when they have a controls attribute; the menu element when the type attribute is case-insensitively equal to toolbar.

When an id attribute is needed, it is created as follows:

  1. If the element already has an id attribute, its value is used, and this algorithm stops; otherwise:
  2. If the title attribute is present and its value is not empty and does not consist of whitespace only, let generated_id be equal to its value; otherwise, let generated_id be equal to textContent.
  3. The generated_id is stripped of leading/trailing whitespace and converted to lowercase (behaviour of this is dependent on the current locale setting of Python).
  4. The first of the following list whose condition matches the current state of the string is done:
    1. If generated_id is an empty string, generated_id is set to generatedID.
    2. If the DOCTYPE's public identifier is one of -//W3C//DTD HTML 4.0//EN, -//W3C//DTD HTML 4.0 Transitional//EN, -//W3C//DTD HTML 4.0 Frameset//EN, -//W3C//DTD HTML 4.01//EN, -//W3C//DTD HTML 4.01 Transitional//EN, -//W3C//DTD HTML 4.01 Frameset//EN, ISO/IEC 15445:2000//DTD HyperText Markup Language//EN, ISO/IEC 15445:2000//DTD HTML//EN, -//W3C//DTD XHTML 1.0 Strict//EN, -//W3C//DTD XHTML 1.0 Transitional//EN, -//W3C//DTD XHTML 1.0 Frameset//EN, or -//W3C//DTD XHTML 1.1//EN; or the --force-html4-id option is used:
      1. All runs of characters apart from U+002D HYPHEN-MINUS (-), U+002E FULL STOP (.), U+0030 DIGIT ZERO to U+0039 DIGIT NINE (0–9), U+003A COLON (:), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z), U+005F LOW LINE (_), and U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z (a–z) are replaced by a single U+002D HYPHEN-MINUS (-) character within generated_id.
      2. Leading and trailing U+002D HYPHEN-MINUS (-) characters are removed from generated_id.
      3. If generated_id is not empty and its first character is not in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z) or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z (a–z), generated_id is prefixed by a single U+0078 LATIN SMALL LETTER X (x) character; if generated_id is empty, it is set to generatedID.
    3. Otherwise, runs of characters that do not match the ifragment production in RFC 3987 are replaced by a single U+002D HYPHEN-MINUS (-) character within generated_id, and then leading and trailing U+002D HYPHEN-MINUS (-) characters are removed from generated_id.
  5. If generated_id matches an already-existing ID, continue to the next step; otherwise, jump to step 8.
  6. Increment i by one, or set it to one if it doesn't already exist.
  7. Go to step 4.
  8. The generated ID is generated_id.
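As a rough illustration of steps 3 and 4 for the HTML 4 case, the following sketch cleans up a candidate string. It is not anolis's own code: the function name html4_id is hypothetical, it uses Python's str.lower() rather than a locale-aware conversion, and it covers neither the RFC 3987 branch nor the uniqueness check in the later steps.

```python
import re

# Runs of characters outside the set allowed by step 4.2.1.
_DISALLOWED = re.compile(r"[^-.0-9:A-Z_a-z]+")
_WHITESPACE = " \t\n\f\r"


def html4_id(source):
    """Turn a title or textContent string into an HTML 4 style ID."""
    generated_id = source.strip(_WHITESPACE).lower()
    if generated_id == "":
        return "generatedID"
    # Replace each disallowed run by one hyphen, then trim hyphens.
    generated_id = _DISALLOWED.sub("-", generated_id).strip("-")
    if generated_id == "":
        return "generatedID"
    # An HTML 4 ID must start with a letter, so prefix "x" if it does not.
    if not re.match(r"[A-Za-z]", generated_id):
        generated_id = "x" + generated_id
    return generated_id
```

For instance, a heading reading "2.1 Requirements" would come out as x2.1-requirements under these assumptions.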

4 Processes

The element names listed in the processes below are, except where otherwise stated, the local names of elements in the null namespace.

4.1 Cross-referencing

Cross-referencing has two essential parts: definitions that define terms, and instances of those terms.

Definitions are marked-up using the dfn element: the definition itself is taken from the title attribute if it is present, otherwise it is taken from the textContent property of the dfn element. By default, anolis will throw a fatal error if a term is defined more than once: this behaviour can be turned off (causing the final definition of the term to be the one that is used) by the --allow-duplicate-dfns option.
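A minimal sketch of how definitions might be collected, using the standard library's ElementTree. The names collect_definitions and DuplicateDefinition are hypothetical stand-ins (the latter for anolis's fatal error), and no normalization of terms is applied here.

```python
import xml.etree.ElementTree as ET


class DuplicateDefinition(Exception):
    """Hypothetical error type standing in for a fatal error."""


def collect_definitions(root, allow_duplicate_dfns=False):
    """Map each defined term to the dfn element that defines it."""
    definitions = {}
    for dfn in root.iter("dfn"):
        # The title attribute wins if present; otherwise use textContent.
        term = dfn.get("title")
        if term is None:
            term = "".join(dfn.itertext())
        if term in definitions and not allow_duplicate_dfns:
            raise DuplicateDefinition(term)
        definitions[term] = dfn  # with duplicates allowed, the last one wins
    return definitions
```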

Instances are marked-up with various elements, depending on the setting of --w3c-compat-xref-elements: if it is disabled (the default), the abbr, code, i, span, and var elements are used for instances; if it is enabled, the abbr, acronym, b, bdo, big, code, del, em, i, ins, kbd, label, legend, q, samp, small, span, strong, sub, sup, tt, var elements are used for instances. Those that are only there in compatibility mode are there because either they should not semantically be used for an instance, or because they are not present in HTML 5. Similar to definitions, the instance is taken from the title attribute if it is present, otherwise it is taken from the textContent property. An instance is only used if it does not have an interactive content or dfn element as either a parent or a child.

Both definitions and instances are normalized as follows:

If the instance is contained within a span element, the span element is turned into an a element, and a href attribute is added to link it to the definition (e.g., <span>foo</span> becomes <a href=#foo>foo</a>) — all other attributes are preserved. Otherwise (when the instance is not contained within a span element), the location of the a element when linking an instance is dependent on the --w3c-compat-xref-a-placement option: if it is disabled (the default), the a element is placed around the element containing the instance (e.g., <i>foo</i> becomes <a href=#foo><i>foo</i></a>); if it is enabled, the a element goes within the element containing the instance and goes around all of its content (e.g., <i>foo</i> becomes <i><a href=#foo>foo</a></i>).
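The three placements can be illustrated with ElementTree as below. This is a sketch under stated assumptions, not the actual implementation: link_instance and its parameters are hypothetical names, and the caller is assumed to know the element's parent and the ID of the definition.

```python
import xml.etree.ElementTree as ET


def link_instance(parent, element, target_id, compat_a_placement=False):
    """Link an instance element to the definition with ID target_id."""
    href = "#" + target_id
    if element.tag == "span":
        # The span itself becomes the link; its other attributes are kept.
        element.tag = "a"
        element.set("href", href)
    elif compat_a_placement:
        # The a element goes inside the element, around all of its content.
        link = ET.Element("a", {"href": href})
        link.text = element.text
        for child in list(element):
            element.remove(child)
            link.append(child)
        element.text = None
        element.append(link)
    else:
        # The a element goes around the element and takes over its position.
        link = ET.Element("a", {"href": href})
        link.tail, element.tail = element.tail, None
        parent[list(parent).index(element)] = link
        link.append(element)
```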

4.2 Table of contents/section numbering

To create a table of contents, and to number the sections of the document, an outline is created. An outline is a list of sections, each of which can contain more sections; a section represents a part of the document and often has a heading associated with it (for more detailed definitions see HTML 5). This means that not only the h1–h6 elements are supported: elements such as section are also used to create the outline. After creating the outline, every section with a depth between those provided by --min-depth and --max-depth (defaulting to two and six respectively), and which has a heading, is numbered if it does not have no-num as a class, and is added to the table of contents if it does not have no-toc as a class. Sections without a heading are treated as if they did not exist, unless they have children, in which case they will appear to exist while not existing all at once (e.g., they increment the section numbering, though that is not output anywhere; and they get a list item in the table of contents, with only the children within it, and no link to the section itself).

The format of section numbers should comply with ISO 2145:1978, Numbering of divisions and subdivisions in written documents. This means that each section number is given by Arabic numerals, separated by a single U+002E FULL STOP character, and there is no trailing U+002E FULL STOP character.
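The number format can be illustrated with a short sketch. The function number_sections is hypothetical, and it assumes the outline has already been flattened into a list of depths in document order, where depth 1 is the shallowest numbered level.

```python
def number_sections(depths):
    """Return ISO 2145 style numbers for sections given by their depths."""
    counters = []
    numbers = []
    for depth in depths:
        del counters[depth:]           # leaving deeper levels resets them
        while len(counters) < depth:   # entering a deeper level starts at 0
            counters.append(0)
        counters[depth - 1] += 1
        # Arabic numerals joined by full stops, with no trailing full stop.
        numbers.append(".".join(str(n) for n in counters))
    return numbers
```

For the depths 1, 2, 2, 1, 2 this gives 1, 1.1, 1.2, 2 and 2.1.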

The section number is inserted as the first child node of the section heading as a span element with the class attribute set to secno: this is copied into the table of contents.

Pre-existing span elements with a class of secno are removed from all section headings, regardless of whether their depth falls within the range given by --min-depth and --max-depth.

The table of contents is built up as an ordered list (an ol element), with each section marked up as a li element, and child sections are marked up with an ol within that li (and this continues recursively, ad infinitum). By default, the root element of the table of contents (an ol element) is given a class attribute set to toc; however, with the --w3c-compat-class-toc option this is placed on every ol within the table of contents. The entire section heading is copied to be the content of the list item, with all interactive content elements and id attributes removed.
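The nesting and the effect of --w3c-compat-class-toc can be sketched as follows. This is a simplification rather than the real implementation: build_toc is a hypothetical name, each section is assumed to be a (title, children) pair, and the list item holds only the title text instead of a copy of the whole heading.

```python
import xml.etree.ElementTree as ET


def build_toc(sections, compat_class_toc=False, _is_root=True):
    """Build a nested ol/li tree from (title, children) pairs."""
    ol = ET.Element("ol")
    # By default only the root list carries the class; in compatibility
    # mode every nested list carries it too.
    if _is_root or compat_class_toc:
        ol.set("class", "toc")
    for title, children in sections:
        li = ET.SubElement(ol, "li")
        li.text = title
        if children:
            li.append(build_toc(children, compat_class_toc, _is_root=False))
    return ol
```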

A normal comment substitution is done with sub_identifier equal to toc, and the table of contents as the replacement.

4.3 Substitution

Various strings are replaced in magic ways: a normal string substitution takes the form of [xxx], where xxx case-sensitively names the substitution and may be followed by any characters apart from U+005D RIGHT SQUARE BRACKET (]) before the final U+005D RIGHT SQUARE BRACKET character. These extra characters are effectively a comment: they carry absolutely no meaning, and vanish into some as-of-yet unknown abyss when the string replacement is done. The entire string must be contained within a single text node.
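A normal string substitution on the contents of one text node can be approximated with a regular expression. The helper substitute_string is a hypothetical name used only for this sketch.

```python
import re


def substitute_string(text, name, replacement):
    """Replace [name...] in text, discarding any comment before the ]."""
    pattern = re.compile(r"\[" + re.escape(name) + r"[^\]]*\]")
    # A function is used so that backslashes in replacement stay literal.
    return pattern.sub(lambda match: replacement, text)
```

With this, the text "Draft of [DATE: filled in automatically]." and the name DATE yield "Draft of 31 July 2008." when the replacement is 31 July 2008.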

A normal comment substitution is one where there is a string, sub_identifier, that identifies the comment for the substitution, and the replacement. All nodes between a comment with a value equal to (with leading and trailing whitespace removed) begin- followed by sub_identifier and one with a value equal to (with leading and trailing whitespace removed) end- followed by sub_identifier are removed, and replaced with the replacement. Additionally, any comment (with leading and trailing whitespace removed) with a value equal to sub_identifier is replaced with a comment with a value of begin- followed by sub_identifier, the replacement, and then a comment with a value of end- followed by sub_identifier.
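The two forms of comment substitution can be sketched with ElementTree as below. This is an illustration under several assumptions: comment_substitute is a hypothetical name, the replacement is a single element, both marker comments are direct children of the same parent, and the tree was parsed with comments retained (TreeBuilder(insert_comments=True), available from Python 3.8).

```python
import copy
import xml.etree.ElementTree as ET

_WHITESPACE = " \t\n\f\r"


def _comment_value(node):
    """Return the stripped comment text, or None for non-comment nodes."""
    if node.tag is ET.Comment:
        return (node.text or "").strip(_WHITESPACE)
    return None


def comment_substitute(parent, sub_identifier, replacement):
    """Apply one comment substitution among the children of parent."""
    begin_value = "begin-" + sub_identifier
    end_value = "end-" + sub_identifier
    values = [_comment_value(child) for child in parent]
    if begin_value in values and end_value in values[values.index(begin_value):]:
        begin = values.index(begin_value)
        end = values.index(end_value, begin)
        # Drop everything between the markers, including stray text.
        for node in list(parent)[begin + 1:end]:
            parent.remove(node)
        parent[begin].tail = None
        parent.insert(begin + 1, copy.deepcopy(replacement))
    elif sub_identifier in values:
        index = values.index(sub_identifier)
        marker = parent[index]
        # Reuse the bare marker as the begin comment and add an end comment.
        end_marker = ET.Comment(end_value)
        end_marker.tail, marker.tail = marker.tail, None
        marker.text = begin_value
        parent.insert(index + 1, copy.deepcopy(replacement))
        parent.insert(index + 2, end_marker)
```

Because the bare marker is rewritten into a begin/end pair, running the substitution again on the output replaces the old content rather than duplicating it.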

The W3C status is found, when needed by one of the substitutions, by iterating all text nodes in document order (i.e., attribute values and comments have no effect), and for each node, the following is done (in this order):

  1. If the node contains, case-insensitively, "latest", followed by one or more whitespace characters, followed by "version", searching stops, and the default is used (ED).
  2. Otherwise, if the node, case-sensitively, contains "http://www.w3.org/TR/" followed by one of "MO", "WD", "CR", "PR", "REC", "PER", or "NOTE", which in turn is followed by U+002D HYPHEN-MINUS (-), then searching stops, and the status is whatever matched the previous list of options by the first match in the text node.

A side-effect of doing it in this order is that if a node contains both of these strings, the latter is ignored, meaning that the default (ED) is used.
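The search can be sketched as follows. The function find_w3c_status is a hypothetical name, it takes the text nodes as plain strings in document order, and it assumes that the default (ED) also applies when no text node matches at all.

```python
import re

_LATEST_VERSION = re.compile(r"latest[ \t\n\f\r]+version", re.IGNORECASE)
_TR_LINK = re.compile(r"http://www\.w3\.org/TR/(MO|WD|CR|PR|REC|PER|NOTE)-")


def find_w3c_status(text_nodes):
    """Return the W3C status implied by the document's text nodes."""
    for text in text_nodes:
        # The "latest version" check runs first, so it wins within a node.
        if _LATEST_VERSION.search(text):
            return "ED"
        match = _TR_LINK.search(text)
        if match:
            return match.group(1)
    return "ED"  # assumed default when nothing matches
```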

There is also a long W3C status, which correlates to the W3C status under the following mapping:

W3C Status    Long W3C Status
MO            W3C Member-only Draft
ED            Editor's Draft
WD            W3C Working Draft
CR            W3C Candidate Recommendation
PR            W3C Proposed Recommendation
REC           W3C Recommendation
PER           W3C Proposed Edited Recommendation
NOTE          W3C Working Group Note

By default, the normal string substitutions are:

[DATE]
This is replaced with the current date for UTC±0 in the form of, e.g., 31 July 2008. The word used for the month is dependent on the current locale of Python. The number of the day of the month has no leading zeros.
[CDATE]
This is replaced with the current date for UTC±0 in the form YYYYMMDD, e.g., 20080731. This is a conforming ISO 8601:2004 date.
[YEAR]
This is replaced with the current year for UTC±0, in the form YYYY, e.g., 2008. This is a conforming ISO 8601:2004 year.
[TITLE]
This is replaced with the textContent of the first title element which is within the first head of the document, or an empty string if such a title element does not exist.
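The three date-based replacements could be computed along these lines. The function date_substitutions is a hypothetical name; the optional now parameter exists only so the example can be run with a fixed date, and the month name depends on the current locale.

```python
import datetime


def date_substitutions(now=None):
    """Return the replacement strings for [DATE], [CDATE] and [YEAR]."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return {
        # now.day is an int, so the day has no leading zero.
        "DATE": "%d %s" % (now.day, now.strftime("%B %Y")),
        "CDATE": now.strftime("%Y%m%d"),
        "YEAR": now.strftime("%Y"),
    }
```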

There is one comment substitution by default. Any nodes between a comment with a value equal to (with leading and trailing whitespace removed) begin-link and one with a value equal to end-link, with interactive content elements removed (though the children of those elements are preserved), are effectively wrapped in an a element which has a href attribute equal to the textContent of all the nodes between the two comments concatenated in document order. The two comments must have the same parent, otherwise a fatal error occurs.

If --w3c-compat-substitutions is enabled, the following normal string substitutions are done in addition to those above:

[STATUS]
This is replaced with the W3C status.
[LONGSTATUS]
This is replaced with the long W3C status.

Additionally, the following normal comment substitutions are done:

sub_identifier equal to logo
Replacement is equal to: <p><a href="http://www.w3.org/"><img alt="W3C" src="http://www.w3.org/Icons/w3c_home"/></a></p> (parsed as an XML fragment, and serialized into the output document in the needed format).
sub_identifier equal to copyright
Replacement is equal to: <p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © [YEAR] <a href="http://www.w3.org/"><acronym title="World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>, <a href="http://www.ercim.org/"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>, <a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>, <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a> and <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a> rules apply.</p> (parsed as an XML fragment, and serialized into the output document in the needed format).

There is one further string substitution, and this is only done when --w3c-compat-crazy-substitutions is enabled (note that this is not included in --w3c-compat). A string of http://www.w3.org/StyleSheets/TR/W3C- followed by one or more characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z (A–Z) is replaced with whatever http://www.w3.org/StyleSheets/TR/W3C-[STATUS] would evaluate to be. Like the normal string substitutions, this string must be contained in a single text node.
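For a single text node this substitution amounts to a regular expression replacement, sketched below. The function substitute_stylesheet is a hypothetical name, and the status is assumed to have been determined already.

```python
import re

_STYLESHEET = re.compile(r"http://www\.w3\.org/StyleSheets/TR/W3C-[A-Z]+")


def substitute_stylesheet(text, status):
    """Point any W3C TR stylesheet URL in text at the given status."""
    new_url = "http://www.w3.org/StyleSheets/TR/W3C-" + status
    return _STYLESHEET.sub(lambda match: new_url, text)
```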

Acknowledgements

Thanks to Andrew Sidwell, Anne van Kesteren, Henri Sivonen, Ian Hickson, James Graham, Lachlan Hunt, Magnus Kristiansen, Michael(tm) Smith, and Philip Taylor for their ever needed help.

Special thanks to Bert Bos for creating the CSS3 Module Postprocessor, on which anolis is partially based and with which it claims (with --w3c-compat) to be partially compatible. Further special thanks to Bert Bos for creating a number of things (especially the algorithm for finding the W3C status) that took the author of anolis many hours to reverse engineer.