Anolis grew out of the need for long technical documents to include niceties such as cross-references and a table of contents for easy navigation. Maintaining these by hand is a chore, especially when sections are numbered: adding one section changes the numbering of many others, so it is better done programmatically.
Anolis does this on HTML documents as a series of sequential processes. Currently these are cross-referencing, section numbering, table of contents creation, and a number of substitutions (mainly relating to the current date).
The following are the minimum requirements: later versions should also work without issue.
Releases are occasionally made. A link to the latest release can be found from the Anolis website.
Alternatively, a copy can be obtained from our Mercurial repository: this
is where our ongoing development occurs, and allows any revision (and therefore
any release) to be downloaded. Our repository is located at
http://hg.gsnedders.com/anolis/.
Normally, installation is done through setuptools, with the following command:
python setup.py install
Please see setuptools' documentation for information on installation options (such as installing in non-standard locations).
The source distribution and the current development copy (in Mercurial) both contain a test suite. It can be run with the following command:
python runtests.py
Any test failures should be reported at our bug tracker.
Anolis is invoked through the anolis command. The
--help (or -h) option gives
some basic help.
The --enable and --disable
options enable/disable respectively the process given as the option value
(e.g., --disable=toc disables building the table of contents and numbering sections).
The default processes are sub (substitution),
toc (table of contents/section numbering), and
xref (cross-referencing). Any enabled process is loaded
via from processes import foo, and if that fails, via import
foo (where foo is the process name); it is then called as
foo.foo(ElementTree, **kwargs).
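The loading rule above can be sketched as follows. This is a minimal illustration rather than Anolis's actual code: the helper names load_process and run_process are hypothetical, and importing the submodule processes.foo is assumed to stand in for "from processes import foo".

```python
import importlib


def load_process(name):
    """Return the module for a process, trying the processes package first."""
    try:
        # Assumed equivalent of "from processes import foo".
        return importlib.import_module("processes." + name)
    except ImportError:
        # Fall back to the equivalent of "import foo".
        return importlib.import_module(name)


def run_process(name, tree, **kwargs):
    """Call foo.foo(tree, **kwargs) for the process named foo."""
    module = load_process(name)
    return getattr(module, name)(tree, **kwargs)
```

Under this convention a third-party process only needs to be an importable module that exposes a callable with the same name as the module.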
Some options alter what is used to parse and serialize the document: by
default, html5lib is used to parse the document; passing the
--lxml.html option uses libxml2's HTML parser and
serializer instead (this is quicker, but does not conform to the HTML 5 standard, and sometimes results in a
fatal error).
Anolis offers a compatibility mode, which aims to be compatible
with the CSS3
module postprocessor (within reason). This is mainly provided for the sake
of pre-existing W3C documents. The
--w3c-compat option turns on this compatibility mode,
although specific options that turn on just one compatibility feature at a time
are also available (and are documented below under each process). These are all implied by the
--w3c-compat option, with one exception:
--w3c-compat-crazy-substitutions, which is left out because it can lead to undesirable
results.
The options --newline-char and
--indent-char set the newline and indent strings
respectively (they do not have to be a single character). They default to U+000A LINE
FEED (LF) and U+0020 SPACE. These are only used when generating
large trees of generated markup, such as the table of contents.
Other process specific options are documented under the process to which they belong.
Upon a fatal error, processing of the document is terminated and the output file is left unchanged.
The textContent property is the same as that defined in DOM Level 3 Core on the Node interface.
Whitespace is as defined in HTML 5: U+0020 SPACE, U+0009 CHARACTER
TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D
CARRIAGE RETURN (CR).
Interactive content is as defined in HTML 5: the a, bb,
details, and datagrid elements; the
audio and video elements when they have a
controls attribute; the menu element when the
type attribute is case-insensitively equal to
toolbar.
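The definition of interactive content above can be expressed as a small predicate. This is a sketch only: the function name is_interactive is hypothetical, and it assumes elements are xml.etree.ElementTree elements whose tags are plain local names in the null namespace.

```python
import xml.etree.ElementTree as ET

# Elements that are always interactive content.
ALWAYS_INTERACTIVE = {"a", "bb", "details", "datagrid"}


def is_interactive(element):
    """Return True if element counts as interactive content."""
    tag = element.tag
    if tag in ALWAYS_INTERACTIVE:
        return True
    if tag in ("audio", "video"):
        # Only interactive when a controls attribute is present.
        return element.get("controls") is not None
    if tag == "menu":
        # The type attribute is compared case-insensitively.
        return element.get("type", "").lower() == "toolbar"
    return False
```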
When an id attribute is needed, it
is created as follows:
If the element already has an id attribute, return its
value, and terminate this algorithm.
If the title attribute is present and its value is not
empty and does not consist of whitespace only, let
generated_id be equal to its value; otherwise, let
generated_id be equal to textContent.
generatedID.
If the --force-html4-id option is used, or the DOCTYPE's public identifier is one of:
-//W3C//DTD HTML 4.0//EN
-//W3C//DTD HTML 4.0 Transitional//EN
-//W3C//DTD HTML 4.0 Frameset//EN
-//W3C//DTD HTML 4.01//EN
-//W3C//DTD HTML 4.01 Transitional//EN
-//W3C//DTD HTML 4.01 Frameset//EN
ISO/IEC 15445:2000//DTD HyperText Markup Language//EN
ISO/IEC 15445:2000//DTD HTML//EN
-//W3C//DTD XHTML 1.0 Strict//EN
-//W3C//DTD XHTML 1.0 Transitional//EN
-//W3C//DTD XHTML 1.0 Frameset//EN
-//W3C//DTD XHTML 1.1//EN
generatedID.
The elements listed in the below processes, except where otherwise stated, are the local name of the element in null namespace.
Cross-referencing has two essential parts: definitions that define terms, and instances of those terms.
Definitions are marked-up using the
dfn element: the definition itself is taken from the
title attribute if it is present, otherwise it is taken from the
textContent property of the dfn element. By default,
Anolis will throw a fatal error if a term is defined
more than once: this behaviour can be turned off (causing the final
definition of the term to be the one that is used) by
the --allow-duplicate-dfns option.
Instances are marked-up with various elements,
depending on the setting of --w3c-compat-xref-elements:
if it is disabled (the default), the abbr, code,
i, span, and var elements are used for
instances; if it is enabled, the
abbr, acronym, b, bdo,
big, code, del, em,
i, ins, kbd, label,
legend, q, samp, small,
span, strong, sub, sup,
tt, and var elements are used for instances. The elements used only in
compatibility mode are excluded by default either because they should not
semantically be used for an instance, or because they are not
present in HTML 5. Similar to definitions, the instance is taken from
the title attribute if it is present, otherwise it is taken from
the textContent property. An instance is only used if
it does not have an interactive content or dfn
element as either a parent or a child.
Both definitions and instances are normalized as follows:
If --w3c-compat-xref-normalization is enabled,
all characters apart from U+0020 SPACE CHARACTER, U+002D HYPHEN-MINUS (-),
U+0030 DIGIT ZERO to U+0039 DIGIT NINE (0–9), U+0041 LATIN CAPITAL LETTER A
to U+005A LATIN CAPITAL LETTER Z (A–Z), and U+0061 LATIN SMALL LETTER A to
U+007A LATIN SMALL LETTER Z (a–z) are removed.
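The compatibility normalization step can be sketched with a single regular expression. The function name compat_strip is hypothetical, and only the character-removal step described above is shown; any other normalization steps are not covered here.

```python
import re

# Everything except space, hyphen-minus, 0-9, A-Z, and a-z is removed.
_DISALLOWED = re.compile(r"[^ \-0-9A-Za-z]")


def compat_strip(term):
    """Remove the characters dropped under --w3c-compat-xref-normalization."""
    return _DISALLOWED.sub("", term)
```

For instance, compat_strip("foo's bar-2 (baz)") yields "foos bar-2 baz".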
If the instance is contained within a span
element, the span element is turned into an a
element, and a href attribute is added to link it to the
definition (e.g., <span>foo</span> becomes
<a href=#foo>foo</a>) — all other attributes are
preserved. Otherwise (when the instance is not contained within a
span element), the location of the a element when
linking an instance is dependent on the
--w3c-compat-xref-a-placement option: if it is disabled
(the default), the a element is placed around the element
containing the instance (e.g., <i>foo</i>
becomes <a href=#foo><i>foo</i></a>); if it is
enabled, the a element goes within the element containing the
instance and goes around all of its content (e.g.,
<i>foo</i> becomes <i><a
href=#foo>foo</a></i>).
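The three linking cases can be illustrated with xml.etree.ElementTree. This is a sketch under assumptions, not Anolis's implementation: the function link_instance and its parameters are hypothetical, and the caller is assumed to supply the parent element, since ElementTree elements do not record their parents.

```python
import xml.etree.ElementTree as ET


def link_instance(parent, element, target_id, compat_placement=False):
    """Link an instance element to the definition with the given id."""
    href = "#" + target_id
    if element.tag == "span":
        # The span itself becomes the a element; other attributes are kept.
        element.tag = "a"
        element.set("href", href)
    elif compat_placement:
        # The a element goes inside the instance, around all of its content.
        a = ET.Element("a", {"href": href})
        a.text = element.text
        element.text = None
        for child in list(element):
            element.remove(child)
            a.append(child)
        element.append(a)
    else:
        # The a element goes around the instance; trailing text stays outside.
        index = list(parent).index(element)
        a = ET.Element("a", {"href": href})
        a.tail = element.tail
        element.tail = None
        parent.remove(element)
        a.append(element)
        parent.insert(index, a)
```

Moving the tail text onto the new a element in the default case keeps the text that follows the instance outside the link.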
To create a table of contents, and to number the sections of the document, an outline is
created (this is a list of sections, which can
each contain more sections, where a
section represents a part of the document, and often has a heading associated with it — for more detailed
definitions see HTML 5). This means not
only are the h1–h6 elements supported, but also
elements such as section are used to create the
outline. After creating the outline, every
section with a depth between those provided by
--min-depth and --max-depth
(defaulting to two and six respectively), and which has a heading, is numbered if it does not have no-num as
a class, and is added to the table of contents if it does not have
no-toc as a class. Sections without a
heading are treated as if they did not
exist, unless they have children, in which case they appear to exist while not
existing all at once (e.g., they increment the section numbering,
though that is not output anywhere; and they get a list item in the table of
contents, with only the children within it, and no link to the
section itself).
The format of section numbers should comply with ISO 2145:1978, Numbering of divisions and subdivisions in written documents. This means that each section number is given by Arabic numerals, separated by a single U+002E FULL STOP character, and there is no trailing U+002E FULL STOP character.
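As a small illustration of this format, the following sketch joins per-level counters into a section number. The function name format_section_number and the idea of passing the counters as a list are assumptions made for the example.

```python
def format_section_number(counters):
    """Join per-level counters with full stops, with no trailing full stop."""
    return ".".join(str(n) for n in counters)
```

For example, the first subsection of the third subsection of section 2 would have the counters [2, 3, 1] and the number "2.3.1".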
The section number is inserted as the first child node of the
section heading as a span element with the
class attribute set to secno: this is copied into the
table of contents.
Pre-existing span elements with a class of secno
are removed from all section headings,
regardless of whether their depth falls within the range given by
--min-depth and --max-depth.
The table of contents is built up as an ordered list (an ol
element), with each section marked up as a li
element, and child sections are marked up with an
ol within that li (and this continues recursively, ad
infinitum). By default, the root element of the table of contents (an
ol element) is given a class attribute set to
toc; however, with the
--w3c-compat-class-toc option this is placed on every
ol within the table of contents. The entire section
heading is copied to be the content of the list item, with all
interactive content elements and id attributes
removed.
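The nesting and the placement of the toc class can be sketched as follows. This is a simplified illustration: the function build_toc is hypothetical, the outline is assumed to be given as nested (title, children) tuples, and the links, section numbers, and copied heading markup of the real table of contents are omitted.

```python
import xml.etree.ElementTree as ET


def build_toc(sections, compat_class_toc=False, is_root=True):
    """Build a nested ol element from (title, children) tuples."""
    ol = ET.Element("ol")
    # By default only the root ol gets class="toc"; with the
    # --w3c-compat-class-toc behaviour every ol gets it.
    if is_root or compat_class_toc:
        ol.set("class", "toc")
    for title, children in sections:
        li = ET.SubElement(ol, "li")
        li.text = title
        if children:
            li.append(build_toc(children, compat_class_toc, is_root=False))
    return ol
```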
A normal comment substitution is done with
sub_identifier equal to toc, and the table of contents
as the replacement.
Various strings are replaced in magic ways: a normal string
substitution takes the form of [xxx] where xxx is
case-sensitively the replacement, which may be followed by any characters apart
from U+005D RIGHT SQUARE BRACKET (]) before the final U+005D RIGHT SQUARE
BRACKET character — these extra characters are effectively a comment, and
carry absolutely no meaning, and vanish into some as-of-yet unknown abyss when
the string replacement is done. The entire string must be contained within a
single text node.
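A normal string substitution on the contents of a single text node can be sketched with a regular expression. The function name substitute is hypothetical, and the sketch works on a plain string rather than on the document tree.

```python
import re


def substitute(text, name, replacement):
    """Replace [name...] in text, discarding any trailing comment characters."""
    # The name is matched case-sensitively; anything up to the closing
    # bracket (except another closing bracket) is treated as a comment.
    pattern = re.compile(r"\[" + re.escape(name) + r"[^\]]*\]")
    # A function is used so backslashes in the replacement are kept literally.
    return pattern.sub(lambda match: replacement, text)
```

With this sketch, substitute("Published [DATE: 01 Jan 2000]", "DATE", "5 May 2009") gives "Published 5 May 2009", while "[date]" is left alone because matching is case-sensitive.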
A normal comment substitution is one where there is a string,
sub_identifier, that identifies the comment for the substitution,
and the replacement. All nodes between a comment with a value equal to (with
leading and trailing whitespace removed) begin-
followed by sub_identifier and one with a value equal to (with
leading and trailing whitespace removed) end-
followed by sub_identifier are removed, and replaced with the
replacement. Additionally, any comment (with leading and trailing
whitespace removed) with a value equal to
sub_identifier is replaced with a comment with a value of
begin- followed by sub_identifier, the replacement, and
then a comment with a value of end- followed by
sub_identifier.
The W3C status is found, when needed by one of the substitutions, by iterating all text nodes in document order (i.e., attribute values and comments have no effect), and for each node, the following is done (in this order):
A side-effect of doing it in this order is that if a node contains both of these possible strings, the latter is ignored, meaning that the default (ED) is used.
There is also a long W3C status, which correlates to the W3C status under the following mapping:
| W3C Status | Long W3C Status |
|---|---|
| MO | W3C Member-only Draft |
| ED | Editor's Draft |
| WD | W3C Working Draft |
| CR | W3C Candidate Recommendation |
| PR | W3C Proposed Recommendation |
| REC | W3C Recommendation |
| PER | W3C Proposed Edited Recommendation |
| NOTE | W3C Working Group Note |
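The mapping in the table can be written directly as a dictionary. The names LONG_STATUS and long_status are hypothetical; the keys and values are taken from the table above.

```python
# W3C status to long W3C status, as listed in the table.
LONG_STATUS = {
    "MO": "W3C Member-only Draft",
    "ED": "Editor's Draft",
    "WD": "W3C Working Draft",
    "CR": "W3C Candidate Recommendation",
    "PR": "W3C Proposed Recommendation",
    "REC": "W3C Recommendation",
    "PER": "W3C Proposed Edited Recommendation",
    "NOTE": "W3C Working Group Note",
}


def long_status(status):
    """Return the long W3C status for a short status code."""
    return LONG_STATUS[status]
```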
By default, the normal string substitutions are:
[DATE]
[CDATE]
[YEAR]
[TITLE]
title element which is within the first head of the
document, or an empty string if such a title element does not
exist.
There is one comment substitution by default. Any nodes between a comment
with a value equal to (with leading and trailing whitespace
removed) begin-link and one with a value equal to
end-link, with interactive content elements removed
(though children of those elements preserved), are effectively wrapped in an
a element which has a href attribute equal to the
textContent of all the nodes between the two comments concatenated
in document order. The two comments must have the same parent, otherwise a
fatal error occurs.
If --w3c-compat-substitutions is enabled, the
following normal string
substitutions are done in addition to those above:
[STATUS]
[LONGSTATUS]
Additionally, the following normal comment substitutions are done:
logo
<p><a
href="http://www.w3.org/"><img alt="W3C"
src="http://www.w3.org/Icons/w3c_home"/></a></p> (parsed as an XML
fragment, and serialized into the output document in the needed format).
copyright
<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
© [YEAR] <a href="http://www.w3.org/"><acronym title="World
Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a
href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of
Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym title="European Research Consortium
for Informatics and Mathematics">ERCIM</acronym></a>, <a
href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a
href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>
and <a
href="http://www.w3.org/Consortium/Legal/copyright-documents">document
use</a> rules apply.</p> (parsed as an XML fragment, and
serialized into the output document in the needed format).
There is one further string substitution, and this is only done when
--w3c-compat-crazy-substitutions is enabled (note that
this is not included in --w3c-compat). A string of
http://www.w3.org/StyleSheets/TR/W3C- followed by one or more
characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL
LETTER Z (A–Z) is replaced with whatever
http://www.w3.org/StyleSheets/TR/W3C-[STATUS] would evaluate to
be. Like the normal string
substitutions, this string must be contained in a single text node.
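This substitution can be sketched as a regular expression applied to the contents of a single text node. The function name fix_stylesheet_url is hypothetical, and the W3C status is assumed to have been determined already and passed in as a string.

```python
import re

# The stylesheet URL prefix followed by one or more letters A-Z.
_STYLESHEET = re.compile(r"http://www\.w3\.org/StyleSheets/TR/W3C-[A-Z]+")


def fix_stylesheet_url(text, status):
    """Rewrite the stylesheet URL in text to use the given W3C status."""
    new_url = "http://www.w3.org/StyleSheets/TR/W3C-" + status
    return _STYLESHEET.sub(lambda match: new_url, text)
```

Because only the uppercase letters after W3C- are consumed, a suffix such as .css is left in place.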
Thanks to Andrew Sidwell, Anne van Kesteren, Henri Sivonen, Ian Hickson, James Graham, Lachlan Hunt, Magnus Kristiansen, Michael(tm) Smith, and Philip Taylor for their ever needed help.
Special thanks to Bert Bos for creating the CSS3 Module Postprocessor, on
which Anolis is partially based, and with which it claims (with --w3c-compat) to
be partially compatible. Further special thanks to Bert Bos for creating a
number of things (especially the algorithm for finding the W3C
status) that took the author of Anolis many hours to reverse
engineer.