Anolis grew out of the need for long technical documents to include niceties such as cross-references and a table of contents for easy navigation. Maintaining these by hand is a great chore, especially when sections are numbered: adding one section changes the numbering of many others, so it is advantageous to generate them programmatically.
Anolis does this for HTML documents by running a number of processes in sequence. Currently these perform cross-referencing, section numbering, table of contents creation, and a number of substitutions (mainly relating to the current date).
The following are the minimum requirements: later versions should also work without issue.
Releases are occasionally made. A link to the latest release can be found from the anolis website.
Alternatively, a copy can be obtained from our Mercurial repository: this is
where our ongoing development occurs, and allows any revision (and therefore any
release) to be downloaded. Our repository is located at
http://hg.gsnedders.com/anolis/.
Normally, installation is done through setuptools, with the following command:
python setup.py install
Please see setuptools' documentation for information on installation options (such as installing in non-standard locations).
The source distribution and the current development copy (in Mercurial) both contain a test suite. It can be run with the following command:
python runtests.py
Any test failures should be reported at our bug tracker.
Anolis is invoked through the anolis command. The
--help (or -h) option gives some
basic help.
The --enable and --disable
options enable/disable respectively the process given as the option value (e.g.,
--disable=toc disables building the table of contents and numbering sections).
The default processes are sub (substitution),
toc (table of contents/section numbering), and
xref (cross-referencing). Any enabled process is loaded
via from processes import foo, and if that fails import
foo (where foo is the process), and is then called as
foo.foo(ElementTree, **kwargs).
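The loading rule above can be sketched roughly as follows. This is a minimal illustration rather than anolis's actual code: the helper names load_process and run_process are hypothetical, and importlib stands in for the literal import statements.

```python
import importlib


def load_process(name):
    """Import a process module, preferring the `processes` package."""
    try:
        # Rough equivalent of: from processes import <name>
        return importlib.import_module("processes." + name)
    except ImportError:
        # Fallback, rough equivalent of: import <name>
        return importlib.import_module(name)


def run_process(name, tree, **kwargs):
    """Call the process entry point, i.e. foo.foo(ElementTree, **kwargs)."""
    module = load_process(name)
    # The entry point is assumed to share its name with the module.
    return getattr(module, name)(tree, **kwargs)
```

Because of the fallback, any importable module that exposes a callable with the same name as the module could in principle be enabled as a process.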
Some options alter what is used to parse and serialize the document: by
default, html5lib is used to parse the document; passing the
--lxml.html option uses libxml2's HTML parser and
serializer instead (this is quicker, but does not comply with the HTML 5 standard, and sometimes results in a
fatal error).
anolis offers a compatibility mode, which aims to be compatible
with the CSS3
module postprocessor (within reason). This is mainly provided for the sake
of pre-existing W3C documents. The
--w3c-compat option turns on this compatibility mode,
although specific options that turn on just one compatibility feature at a time
are also available (and are documented below under each process); these are all implied by the
--w3c-compat option, with one exception:
--w3c-compat-crazy-substitutions, as it can lead to undesirable
results.
The options --newline-char and
--indent-char set the newline and indent strings (they
do not have to be a single character) respectively. They default to U+000A LINE
FEED (LF) and U+0009 CHARACTER TABULATION (tab) respectively. These are only
used when generating large trees of generated markup, such as the table of
contents.
Other process specific options are documented under the process to which they belong.
Upon a fatal error, processing of the document is terminated and the output file is left unchanged.
The textContent property is the same as that defined in DOM Level 3 Core on the Node interface.
Whitespace is as defined in HTML
5: U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED
(LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).
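As a small illustration, the whitespace definition above could be expressed in Python like this. The names HTML_WHITESPACE, strip_whitespace, and is_whitespace_only are hypothetical and are not taken from anolis itself.

```python
# The five whitespace characters listed above.
HTML_WHITESPACE = "\u0020\u0009\u000A\u000C\u000D"


def strip_whitespace(value):
    """Remove leading and trailing whitespace, as defined above."""
    return value.strip(HTML_WHITESPACE)


def is_whitespace_only(value):
    """True if the string is empty or consists solely of whitespace."""
    return strip_whitespace(value) == ""
```

Passing the character set explicitly matters: Python's bare str.strip() also removes other characters, such as U+00A0 NO-BREAK SPACE, which are not whitespace under this definition.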
Interactive content is as defined in
HTML 5: the a,
bb, details, and datagrid elements; the
audio and video elements when they have a
controls attribute; the menu element when the
type attribute is case-insensitively equal to
toolbar.
When an id attribute is needed, it is
created as follows:
If the element has an id attribute, its value is
used, and this algorithm stops; otherwise:
If the title attribute is present and its value is not
empty and does not consist of whitespace only, let
generated_id be equal to its value; otherwise, let
generated_id be equal to textContent.
generatedID.
-//W3C//DTD
HTML 4.0//EN, -//W3C//DTD HTML 4.0 Transitional//EN,
-//W3C//DTD HTML 4.0 Frameset//EN, -//W3C//DTD HTML
4.01//EN, -//W3C//DTD HTML 4.01 Transitional//EN,
-//W3C//DTD HTML 4.01 Frameset//EN, ISO/IEC 15445:2000//DTD
HyperText Markup Language//EN, ISO/IEC 15445:2000//DTD
HTML//EN, -//W3C//DTD XHTML 1.0 Strict//EN,
-//W3C//DTD XHTML 1.0 Transitional//EN, -//W3C//DTD XHTML 1.0
Frameset//EN, or -//W3C//DTD XHTML 1.1//EN; or the
--force-html4-id option is used:
generatedID.
The elements listed in the below processes, except where otherwise stated, are the local name of the element in null namespace.
Cross-referencing has two essential parts: definitions that define terms, and instances of those terms.
Definitions are marked-up using the
dfn element: the definition itself is taken from the
title attribute if it is present, otherwise it is taken from the
textContent property of the dfn element. By default,
anolis will throw a fatal error if a term is defined
more than once: this behaviour can be turned off (causing the final
definition of the term to be the one that is used) by
the --allow-duplicate-dfns option.
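A minimal sketch of how definitions might be collected, using the standard library's ElementTree. The function name, the DuplicateDefinitionError class, and the allow_duplicate_dfns parameter are hypothetical stand-ins for anolis's fatal error and command-line option, and term normalization is left out.

```python
import xml.etree.ElementTree as ET


class DuplicateDefinitionError(Exception):
    """Hypothetical stand-in for the fatal error described above."""


def collect_definitions(root, allow_duplicate_dfns=False):
    """Map each defined term to the dfn element that defines it."""
    definitions = {}
    for dfn in root.iter("dfn"):
        # The title attribute wins if present; otherwise use textContent.
        term = dfn.get("title")
        if term is None:
            term = "".join(dfn.itertext())
        if term in definitions and not allow_duplicate_dfns:
            raise DuplicateDefinitionError(term)
        # When duplicates are allowed, the final definition is kept.
        definitions[term] = dfn
    return definitions
```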
Instances are marked-up with various elements,
depending on the setting of --w3c-compat-xref-elements:
if it is disabled (the default), the abbr, code,
i, span, and var elements are used for
instances; if it is enabled, the
abbr, acronym, b, bdo,
big, code, del, em,
i, ins, kbd, label,
legend, q, samp, small,
span, strong, sub, sup,
tt, and var elements are used for instances. The elements that are used only in
compatibility mode are excluded by default either because they should not
semantically be used for an instance, or because they are not
present in HTML 5. As with definitions, the instance is taken from
the title attribute if it is present, otherwise it is taken from
the textContent property. An instance is only used if
it does not have an interactive content or dfn element
as either a parent or a child.
Both definitions and instances are normalized as follows:
If --w3c-compat-xref-normalization is enabled,
all characters apart from U+0020 SPACE CHARACTER, U+002D HYPHEN-MINUS (-),
U+0030 DIGIT ZERO to U+0039 DIGIT NINE (0–9), U+0041 LATIN CAPITAL LETTER A to
U+005A LATIN CAPITAL LETTER Z (A–Z), and U+0061 LATIN SMALL LETTER A to U+007A
LATIN SMALL LETTER Z (a–z) are removed.
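The character filtering applied in this compatibility mode can be sketched with a regular expression. The function name w3c_compat_strip is hypothetical, and the sketch covers only the removal step described here.

```python
import re

# Everything except SPACE, HYPHEN-MINUS, 0-9, A-Z, and a-z.
_DISALLOWED = re.compile(r"[^ \-0-9A-Za-z]")


def w3c_compat_strip(term):
    """Remove every character outside the permitted set."""
    return _DISALLOWED.sub("", term)
```

For example, w3c_compat_strip("foo-bar (baz)!") gives "foo-bar baz".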
If the instance is contained within a span element,
the span element is turned into an a element, and a
href attribute is added to link it to the definition
(e.g., <span>foo</span> becomes <a
href=#foo>foo</a>) — all other attributes are preserved. Otherwise
(when the instance is not contained within a span
element), the location of the a element when linking an
instance is dependent on the
--w3c-compat-xref-a-placement option: if it is disabled
(the default), the a element is placed around the element
containing the instance (e.g., <i>foo</i>
becomes <a href=#foo><i>foo</i></a>); if it is enabled,
the a element goes within the element containing the
instance and goes around all of its content (e.g.,
<i>foo</i> becomes <i><a
href=#foo>foo</a></i>).
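The three linking cases above can be sketched with the standard library's ElementTree. This is an illustrative sketch only: link_instance and its a_inside parameter (standing in for --w3c-compat-xref-a-placement) are hypothetical, and the parent is passed explicitly because ElementTree elements do not record their parent.

```python
import xml.etree.ElementTree as ET


def link_instance(parent, elem, target_id, a_inside=False):
    """Link elem (a child of parent) to the definition with id target_id."""
    href = "#" + target_id
    if elem.tag == "span":
        # A span simply becomes the link; other attributes are preserved.
        elem.tag = "a"
        elem.set("href", href)
        return
    a = ET.Element("a", {"href": href})
    if a_inside:
        # Compatibility placement: the a element wraps all of the content.
        a.text, elem.text = elem.text, None
        for child in list(elem):
            elem.remove(child)
            a.append(child)
        elem.append(a)
    else:
        # Default placement: the a element wraps the element itself.
        index = list(parent).index(elem)
        a.tail, elem.tail = elem.tail, None
        parent.remove(elem)
        a.append(elem)
        parent.insert(index, a)
```

The tail handling is an ElementTree detail: text following an element is stored on that element, so it has to be moved to whichever node ends up outermost.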
To create a table of contents, and to number the sections of the document, an outline is
created (this is a list of sections, which can each
contain more sections, where a section
represents a part of the document, and often has a heading associated with it — for more detailed definitions see
HTML 5). This means not only are the
h1–h6 elements supported, but also elements such as
section are used to create the outline. After
creating the
outline, every section with a depth between those
provided by --min-depth and
--max-depth (defaulting to two and six respectively),
and which has a heading, is numbered if it
does not have no-num as a class, and is added to the table of
contents if it does not have no-toc as a class. Sections without a heading are treated as if they did not exist, unless they have
children, in which case they will appear to exist while not existing, all at once
(e.g., they increment the section numbering, though that is not
output anywhere; and they get a list item in the table of contents, with only
the children within it, and no link to the section itself).
The format of section numbers should comply with ISO 2145:1978, Numbering of divisions and subdivisions in written documents. This means that each section number is given by Arabic numerals, separated by a single U+002E FULL STOP character, and there is no trailing U+002E FULL STOP character.
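The numbering format can be sketched as follows. Both function names are hypothetical. The sketch assumes depths are counted from one at the shallowest numbered level, and that a skipped level contributes a zero; neither assumption is stated by anolis.

```python
def format_section_number(counters):
    """Join per-level counters with FULL STOP, with no trailing one."""
    return ".".join(str(n) for n in counters)


def number_depths(depths):
    """Return a section number for each depth in a document-order list."""
    counters = []
    numbers = []
    for depth in depths:
        # Forget counters belonging to deeper levels that have ended.
        del counters[depth:]
        # Pad with zeros if this depth has not been reached before.
        while len(counters) < depth:
            counters.append(0)
        counters[depth - 1] += 1
        numbers.append(format_section_number(counters))
    return numbers
```

For the depths 1, 2, 2, 1 this yields the numbers 1, 1.1, 1.2, and 2.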
The section number is inserted as the first child node of the
section heading as a span element with the
class attribute set to secno: this is copied into the
table of contents.
Pre-existing span elements with a class of secno
are removed from all section headings,
regardless of whether their depth falls within the range given by
--min-depth and --max-depth.
The table of contents is built up as an ordered list (an ol
element), with each section marked up as a li element,
and child sections are marked up with an
ol within that li (and this continues recursively, ad
infinitum). By default, the root element of the table of contents (an
ol element) is given a class attribute set to
toc; however, with the
--w3c-compat-class-toc option this is placed on every
ol within the table of contents. The entire section
heading is copied to be the content of the list item, with all
interactive content elements and id attributes
removed.
A normal comment substitution is done with
sub_identifier equal to toc, and the table of contents
as the replacement.
Various strings are replaced in magic ways: a normal string
substitution takes the form of [xxx] where xxx is
case-sensitively the replacement, which may be followed by any characters apart
from U+005D RIGHT SQUARE BRACKET (]) before the final U+005D RIGHT SQUARE
BRACKET character — these extra characters are effectively a comment, and
carry absolutely no meaning, and vanish into some as-of-yet unknown abyss when
the string replacement is done. The entire string must be contained within a
single text node.
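A normal string substitution on the contents of one text node can be sketched with a regular expression. The function name substitute_string is hypothetical.

```python
import re


def substitute_string(text, identifier, replacement):
    """Replace [identifier...] in text, discarding the trailing comment."""
    # "[", the identifier, any characters other than "]", then "]".
    pattern = re.compile(r"\[" + re.escape(identifier) + r"[^\]]*\]")
    # A function is used so the replacement is inserted literally.
    return pattern.sub(lambda match: replacement, text)
```

Read literally, the rule places no separator between the identifier and the comment, so this sketch also matches a string such as [DATEfoo] when the identifier is DATE.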
A normal comment substitution is one where there is a string,
sub_identifier, that identifies the comment for the substitution, and
the replacement. All nodes between a comment with a value equal to (with leading
and trailing whitespace removed) begin- followed by
sub_identifier and one with a value equal to (with leading and
trailing whitespace removed) end- followed by
sub_identifier are removed, and replaced with the replacement.
Additionally, any comment (with leading and trailing whitespace
removed) with a value equal to sub_identifier is replaced with a
comment with a value of begin- followed by
sub_identifier, the replacement, and then a comment with a value of
end- followed by sub_identifier.
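A rough sketch of a normal comment substitution using the standard library's ElementTree. All names here are hypothetical, and the sketch makes two simplifying assumptions that the text above does not state: only the direct children of a single parent are examined, and the replacement is a single element produced by a make_replacement callable.

```python
import xml.etree.ElementTree as ET

WHITESPACE = " \t\n\f\r"


def is_comment(node, value):
    """True if node is a comment whose stripped value equals value."""
    text = (node.text or "").strip(WHITESPACE)
    return node.tag is ET.Comment and text == value


def index_of(parent, node):
    """Position of node among parent's children, or None if absent."""
    for i, child in enumerate(parent):
        if child is node:
            return i
    return None


def comment_substitute(parent, sub_identifier, make_replacement):
    begin_value = "begin-" + sub_identifier
    end_value = "end-" + sub_identifier
    for node in list(parent):
        index = index_of(parent, node)
        if index is None:
            continue  # already removed by an earlier begin/end pair
        if is_comment(node, sub_identifier):
            # Bare marker: becomes begin comment, replacement, end comment.
            end = ET.Comment(end_value)
            end.tail, node.tail = node.tail, None
            node.text = begin_value
            parent.insert(index + 1, make_replacement())
            parent.insert(index + 2, end)
        elif is_comment(node, begin_value):
            following = list(parent)[index + 1:]
            end = next((s for s in following if is_comment(s, end_value)), None)
            if end is None:
                continue  # no matching end comment under this parent
            # Remove everything between the two comments, text included.
            for sibling in following:
                if sibling is end:
                    break
                parent.remove(sibling)
            node.tail = None
            parent.insert(index + 1, make_replacement())
```

Running the substitution a second time on its own output replaces the previously generated content, which is what allows a document to be reprocessed repeatedly.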
The W3C status is found, when needed by one of the substitutions, by iterating all text nodes in document order (i.e., attribute values and comments have no effect), and for each node, the following is done (in this order):
A side-effect of doing it in this order is that if a node contains both of these possible strings, the latter is ignored, meaning that the default (ED) is used.
There is also a long W3C status, which correlates to the W3C status under the following mapping:
| W3C Status | Long W3C Status |
|---|---|
| MO | W3C Member-only Draft |
| ED | Editor's Draft |
| WD | W3C Working Draft |
| CR | W3C Candidate Recommendation |
| PR | W3C Proposed Recommendation |
| REC | W3C Recommendation |
| PER | W3C Proposed Edited Recommendation |
| NOTE | W3C Working Group Note |
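The mapping in the table above translates directly into a lookup table. The names LONG_STATUS and long_status are hypothetical.

```python
# W3C status mapped to long W3C status, as in the table above.
LONG_STATUS = {
    "MO": "W3C Member-only Draft",
    "ED": "Editor's Draft",
    "WD": "W3C Working Draft",
    "CR": "W3C Candidate Recommendation",
    "PR": "W3C Proposed Recommendation",
    "REC": "W3C Recommendation",
    "PER": "W3C Proposed Edited Recommendation",
    "NOTE": "W3C Working Group Note",
}


def long_status(status):
    """Return the long W3C status for a short W3C status."""
    return LONG_STATUS[status]
```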
By default, the normal string substitutions are:
[DATE]
[CDATE]
[YEAR]
[TITLE]
title element which is within the first head of the
document, or an empty string if such a title element does not
exist.
There is one comment substitution by default. Any nodes between a comment
with a value equal to (with leading and trailing whitespace
removed) begin-link and one with a value equal to
end-link, with interactive content elements removed
(though children of those elements preserved), are effectively wrapped in an
a element which has a href attribute equal to the
textContent of all the nodes between the two comments concatenated
in document order. The two comments must have the same parent, otherwise a
fatal error occurs.
If --w3c-compat-substitutions is enabled, the
following normal string
substitutions are done in addition to those above:
[STATUS]
[LONGSTATUS]
Additionally, the following normal comment substitutions are done:
logo
<p><a
href="http://www.w3.org/"><img alt="W3C"
src="http://www.w3.org/Icons/w3c_home"/></a></p> (parsed as an XML
fragment, and serialized into the output document in the needed format).
copyright
<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
© [YEAR] <a href="http://www.w3.org/"><acronym title="World
Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a
href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of
Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym title="European Research Consortium for
Informatics and Mathematics">ERCIM</acronym></a>, <a
href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a
href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>
and <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document
use</a> rules apply.</p> (parsed as an XML fragment, and serialized
into the output document in the needed format).
There is one further string substitution, and this is only done when
--w3c-compat-crazy-substitutions is enabled (note that
this is not included in --w3c-compat). A string of
http://www.w3.org/StyleSheets/TR/W3C- followed by one or more
characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL
LETTER Z (A–Z) is replaced with whatever
http://www.w3.org/StyleSheets/TR/W3C-[STATUS] would evaluate to be.
Like the normal string
substitutions, this string must be contained in a single text node.
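This substitution can be sketched with a regular expression applied to the contents of one text node. The function name substitute_stylesheet is hypothetical, and the W3C status is passed in rather than looked up.

```python
import re

PREFIX = "http://www.w3.org/StyleSheets/TR/W3C-"

# The prefix followed by one or more characters in the range A-Z.
_STYLESHEET = re.compile(re.escape(PREFIX) + r"[A-Z]+")


def substitute_stylesheet(text, status):
    """Rewrite the status part of a W3C stylesheet URL."""
    return _STYLESHEET.sub(lambda match: PREFIX + status, text)
```

Only the run of capital letters is replaced, so anything following it, such as a file extension, is left untouched.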
Thanks to Andrew Sidwell, Anne van Kesteren, Henri Sivonen, Ian Hickson, James Graham, Lachlan Hunt, Magnus Kristiansen, Michael(tm) Smith, and Philip Taylor for their ever needed help.
Special thanks to Bert Bos for creating the CSS3 Module Postprocessor, on
which this is partially based, and (with --w3c-compat) claims to be
partially compatible with. Further special thanks to Bert Bos for creating a
number of things (especially the algorithm for finding the W3C
status) that took the author of anolis many hours to reverse
engineer.