
Abstract
This article explains how to use Pygments as a syntax highlighter for DocBook documents processed with the DocBook XSL stylesheets, replacing the rather poor standard XSLTHL syntax highlighting.
The standard xslthl highlighter used by the DocBook stylesheets is very limited, both in the languages it supports and in its highlighting capabilities. Moreover, it is implemented in Java and thus only works with Java XSL processors like Saxon or Xalan, which are tedious to use.
An alternative to the limited xslthl is Pygments, which, however, is not supported by any XSLT processor. Pygments support for the DocBook stylesheets must therefore be implemented manually.
In these stylesheets, syntax highlighting is performed by the template apply-highlighting, which is defined in highlighting/common.xsl in the DocBook XSL distribution. This template is applied to all elements that support highlighting (programlisting, screen and synopsis). It determines the language to be used for highlighting (from the language attribute of the element and from the highlight.default.language stylesheet parameter, if set), extracts the content nodes and finally calls the XPath function highlight with these parameters [1]. This function is looked up in three different namespaces (in this order):
If it fails to determine a language (the language attribute and the highlight.default.language parameter are unset) or to look up highlight in these namespaces, it simply copies the contents.
The highlight function returns a list of XML and text nodes. XML nodes with the xslthl namespace prefix represent tokens from the highlighted source code. For instance, there are keyword and comment nodes. Refer to the Processing xslthl results section in the xslthl documentation for detailed information.
These xslthl nodes are then transformed into the proper output format by the format-specific DocBook highlighting stylesheets. Note that these must be included explicitly in the customization layer. For instance, if the html/chunk.xsl stylesheet is used to generate chunked HTML output, the html/highlight.xsl stylesheet must be included in the customization layer for highlighting to work. keyword nodes would then be transformed to HTML b tags with the hl-keyword class, for example.
These observations lead to an obvious conclusion: to use a custom highlighting routine, this highlight function must be re-implemented and registered in one of the mentioned namespaces.
Unfortunately, there is no XSLT processor that can be extended in Python. Fortunately, the powerful XML package lxml provides an XSLT processing library based on libxslt (the engine behind xsltproc), which can be extended with custom functions and custom elements. Using this package, implementing a custom XSLT processor is an easy task:
from lxml import etree
transform = etree.XSLT(etree.parse(xslt_file))
document = etree.parse(xml_file)
transform(document)
print(transform.error_log)
To add XPath extension functions, a function namespace must be created before calling transform:
xhl = etree.FunctionNamespace('http://net.sf.xslthl/ConnectorXalan')
xhl.prefix = 'xhl'
xhl['highlight'] = custom_highlight
This adds the namespace http://net.sf.xslthl/ConnectorXalan with the prefix xhl to the global list of function namespaces maintained by lxml. The namespace is therefore automatically known to the XSLT stylesheet bound to transform under the given prefix. Finally, the function custom_highlight is added to this namespace as highlight. The stylesheet can now call the XPath function xhl:highlight.
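To see the registration mechanism in isolation, here is a minimal sketch with a toy extension function. The function name shout, the namespace URI http://example.org/demo-functions and the demo prefix are made up for illustration and have nothing to do with DocBook or xslthl. The sketch stylesheet declares the namespace itself, to keep the example independent of any prefix handling.

```python
from lxml import etree

# Hypothetical namespace URI, chosen only for this illustration.
DEMO_NS = 'http://example.org/demo-functions'


def shout(context, value):
    """Toy extension function: return the argument in upper case."""
    # A node-set argument arrives as a list, a string argument as a str.
    if isinstance(value, list):
        value = ''.join(value)
    return value.upper()


# Register the function globally; lxml looks it up by namespace URI and name.
demo = etree.FunctionNamespace(DEMO_NS)
demo['shout'] = shout

stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:demo="http://example.org/demo-functions">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="demo:shout(string(/code))"/>
  </xsl:template>
</xsl:stylesheet>''')

transform = etree.XSLT(stylesheet)
result_text = str(transform(etree.XML('<code>print</code>')))
```

The first argument of every extension function is the XPath evaluation context; the remaining arguments are the values passed in the XPath call.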
As you will remember, highlight returns special XML nodes, which are transformed to proper output. This leads to a first approach, which uses Pygments for lexing only and maps Pygments tokens to the corresponding xslthl tags. This mapping roughly looks like this:
from pygments.token import Token

# Pygments token types mapped to xslthl element names
XSLTHL_TAGS = {
    Token.Keyword: 'keyword',
    Token.String: 'string',
    Token.Number: 'number',
    Token.Comment: 'comment',
    Token.Comment.Special: 'doccomment',
    Token.Comment.Preproc: 'directive',
    Token.Name.Decorator: 'annotation',
    Token.Name.Tag: 'tag',
    Token.Name.Attribute: 'attribute',
}
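Pygments token types form a hierarchy (Token.Keyword.Declaration is a child of Token.Keyword), so a mapping like this only needs entries for the general types. The following sketch shows how a specific token type can be resolved by walking up its parents; the helper name resolve_tag and the shortened TAGS mapping are introduced here purely for illustration.

```python
from pygments.token import Token

# Shortened, illustrative version of the mapping.
TAGS = {
    Token.Keyword: 'keyword',
    Token.Comment: 'comment',
    Token.Comment.Special: 'doccomment',
}


def resolve_tag(token):
    """Return the tag name for ``token`` or ``None`` if no ancestor maps."""
    # The root ``Token`` type is an empty tuple and therefore falsy,
    # which terminates the walk for unmapped token types.
    while token and token not in TAGS:
        token = token.parent
    return TAGS[token] if token else None


declaration_tag = resolve_tag(Token.Keyword.Declaration)  # inherits from Keyword
special_tag = resolve_tag(Token.Comment.Special)          # exact match wins
plain_tag = resolve_tag(Token.Text)                       # no mapping at all
```

More specific entries take precedence automatically, because the walk stops at the first type found in the mapping.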
Using this mapping, an xslthl_highlight function can now be implemented:
from functools import partial

from lxml import etree
from pygments import lex
from pygments.lexers import get_lexer_by_name


def xslthl_highlight(context, language, code, config):
    """
    Highlight the given ``code`` in the given ``language``.  ``context`` is
    the XPath context, in which this function was applied.  ``config`` is
    ignored.

    Return an xslthl root element, whose child elements and text contain
    the tokenized source code.
    """
    namespace = 'http://xslthl.sf.net'
    XslthlName = partial(etree.QName, namespace)
    # necessary, because lxml somehow fails to correctly pass code (reason
    # is unknown to me)
    if not code:
        code = context.context_node.xpath('.//text()')
    lexer = get_lexer_by_name(language[0].lower())
    root = etree.Element(XslthlName('xslthl'), nsmap={'xslthl': namespace})
    text = []
    for token, value in lex(''.join(code), lexer):
        # walk up the token hierarchy until the token type maps to a xslthl
        # tag
        while token and token not in XSLTHL_TAGS:
            token = token.parent
        # collect plain text, if the token doesn't map to any tag
        if not token:
            text.append(value)
        else:
            # flush collected plain text in front of the new element
            if not len(root):
                root.text = ''.join(text)
            else:
                root[-1].tail = ''.join(text)
            text = []
            el = etree.SubElement(root, XslthlName(XSLTHL_TAGS[token]))
            el.text = value
    # flush trailing plain text; root has no children, if no token mapped
    if len(root):
        root[-1].tail = ''.join(text)
    else:
        root.text = ''.join(text)
    return root
The inconvenient handling of text nodes using the text list is necessary, because the ElementTree API exposed by lxml doesn’t support text nodes directly. Instead, text nodes are represented by the text and tail attributes of real elements. The lxml.etree Tutorial explains this concept in the section Elements contain text.
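As a small illustration of the text and tail attributes, the following sketch builds mixed content by hand. The element names listing, keyword and comment are arbitrary and only serve this example.

```python
from lxml import etree

root = etree.Element('listing')
# Text before the first child element belongs to the parent's ``text``.
root.text = 'x = '

keyword = etree.SubElement(root, 'keyword')
keyword.text = 'None'
# Text following an element, up to the next sibling, is that element's ``tail``.
keyword.tail = '  '

comment = etree.SubElement(root, 'comment')
comment.text = '# nothing'

serialized = etree.tostring(root, encoding='unicode')
```

The serialized result interleaves the strings and elements in document order, which is why plain text between two highlighted tokens has to be attached to the preceding element.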
This approach largely maintains compatibility with DocBook by reusing as much of the DocBook stylesheets as possible. While it makes it possible to use Pygments for syntax highlighting and thus to abandon the Java XSLT processors, it doesn’t improve the limited xslthl highlighting capabilities.
The HTML formatter of Pygments produces a much more colorful and feature-rich highlighting than the xslthl highlighter. When using the HTML stylesheets, it is possible to completely abandon xslthl and use the Pygments formatter:
from lxml.html import fragment_fromstring
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name


def html_highlight(context, language, code, config):
    """
    Highlight the given ``code`` in the given ``language``.  ``context`` is
    the XPath context, in which this function was applied.  ``config`` is
    ignored.

    Return a list of HTML nodes containing the highlighted code.
    """
    # necessary, because lxml somehow fails to correctly pass code (reason
    # is unknown to me)
    if not code:
        code = context.context_node.xpath('.//text()')
    lexer = get_lexer_by_name(language[0].lower())
    # join all text nodes, in case the listing consists of more than one
    html = highlight(''.join(code), lexer, HtmlFormatter(nowrap=True))
    highlight_div = fragment_fromstring(html, create_parent=True)
    highlight_div.set('class', 'pygments_highlight')
    return [highlight_div]
This function uses the HTML formatter to render the source code to HTML. The resulting HTML is then parsed using lxml.html. As the stylesheets already wrap highlighted elements in pre tags, nowrap is specified to keep Pygments from wrapping them again. Instead, the returned tokens are wrapped in a simple div element.
Note that the tags generated by this function do not carry style information. A proper CSS stylesheet must be created and included manually (using pygmentize and the stylesheet parameter html.stylesheet).
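The rendering step can be tried outside of any XSLT transformation. The following sketch runs it on a hard-coded Python snippet; the sample source line is arbitrary, while the class name pygments_highlight is the one used above.

```python
from lxml.html import fragment_fromstring
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

source = 'print("hi")\n'

# ``nowrap=True`` yields only the token spans, without the usual
# surrounding <div class="highlight"><pre> wrapper.
html = highlight(source, get_lexer_by_name('python'),
                 HtmlFormatter(nowrap=True))

# ``create_parent=True`` wraps the loose spans in a new <div> element.
wrapper = fragment_fromstring(html, create_parent=True)
wrapper.set('class', 'pygments_highlight')

span_classes = [span.get('class') for span in wrapper.findall('span')]
plain_text = wrapper.text_content()
```

Each span carries one of the short Pygments CSS class names, which is what the CSS stylesheet has to target.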
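As an alternative to the pygmentize command line tool, the CSS rules can presumably also be produced from Python with the formatter's get_style_defs method. The sketch below assumes the friendly style and scopes all rules to the pygments_highlight class of the wrapping div; whether to scope the rules at all is a design choice, not something the stylesheets require.

```python
from pygments.formatters import HtmlFormatter

# Build the CSS rules for the "friendly" style.  The argument prefixes
# every selector, so the rules only apply inside the wrapper div.
formatter = HtmlFormatter(style='friendly')
css = formatter.get_style_defs('.pygments_highlight')

# Writing the rules to a file is left to the caller, for example:
# with open('highlight.css', 'w') as stream:
#     stream.write(css)
```

Scoping the selectors avoids clashes between the short Pygments class names and other classes used by the DocBook HTML output.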
docbook_build.py is a build script for DocBook documents that implements syntax highlighting as described in this article. It takes an XSLT stylesheet as its first argument and a DocBook XML file as its second:
docbook_build.py [-v] [--html] XSLT XML
It transforms XML using the given XSLT stylesheet after resolving XIncludes and (optionally) validating the document. By default it uses xslthl highlighting; direct HTML highlighting can be enabled using --html.
To create HTML output with highlighting using docbook_build.py, set up a customization layer that enables highlighting and includes the Pygments CSS stylesheet:
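How the script resolves XIncludes is not shown in this article. A minimal sketch of that step with lxml, assuming the xinclude method of parsed trees, could look like the following; the helper name load_document and the two temporary files are purely illustrative.

```python
import os
import tempfile

from lxml import etree


def load_document(path):
    """Parse the XML file at ``path`` and resolve its XIncludes in place."""
    tree = etree.parse(path)
    tree.xinclude()
    return tree


with tempfile.TemporaryDirectory() as directory:
    # An included file and a main document referring to it.
    with open(os.path.join(directory, 'chapter.xml'), 'w') as stream:
        stream.write('<chapter><title>Included</title></chapter>')
    index = os.path.join(directory, 'index.xml')
    with open(index, 'w') as stream:
        stream.write('<book xmlns:xi="http://www.w3.org/2001/XInclude">'
                     '<xi:include href="chapter.xml"/></book>')

    document = load_document(index)
    # The xi:include element has been replaced by the chapter element.
    titles = [str(title) for title in
              document.xpath('//chapter/title/text()')]
```

The resulting tree can then be passed to the compiled stylesheet like any other parsed document.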
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:import href="http://docbook.sourceforge.net/release/xsl-ns/current/html/chunk.xsl"/>
  <!-- highlight templates -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl-ns/current/html/highlight.xsl"/>
  <!-- the pygments stylesheet -->
  <xsl:param name="html.stylesheet">highlight.css</xsl:param>
  <!-- enable highlighting -->
  <xsl:param name="highlight.source" select="1"/>
</xsl:stylesheet>
Generate the stylesheet using pygmentize:
pygmentize -S friendly -f html > highlight.css
Invoke docbook_build.py (assuming html.xslt is the customization layer and index.xml the DocBook document):
docbook_build.py --html html.xslt index.xml
Footnotes
[1] Actually, it also passes the xslthl configuration file in the third parameter. However, explanation and implementation of this parameter exceed the scope of this article. Refer to the xslthl documentation for more information.