How to extend the HtmlPipeline class

We've already configured an HtmlPipeline by changing the HtmlPipelineContext: we defined an ImageProvider and a LinkProvider and applied them using the setImageProvider() and setLinkProvider() methods. But there's more.

Each time a new XMLWorker/XmlParser is started with the same HtmlPipeline, the context is cloned using some defaults. You can change these defaults with the following methods:

In previous examples, we've also used the setTagFactory() method. We can completely change the way HtmlPipeline interprets tags by creating a custom TagProcessorFactory.

XMLWorker creates Tag objects that contain attributes, styles and a hierarchy (one parent, zero or more children). HtmlPipeline transforms these Tags into com.itextpdf.text.Element objects with the help of TagProcessors. You can find a series of ready-made TagProcessor implementations in the com.itextpdf.tool.xml.html package.

The default TagProcessorFactory can be obtained from the Tags class, using the getHtmlTagProcessorFactory() method. Not all tags are enabled by default. Some tags are linked to the DummyTagProcessor (a processor that doesn't do anything), other tags result in a TagProcessor with a very specific implementation. You can extend the HtmlPipeline by adding your own TagProcessor implementations to the TagProcessorFactory with the addProcessor() method. This will either replace the default functionality of already supported tags, or add functionality for new tags.
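The replace-or-add behaviour of addProcessor() can be illustrated with a small, self-contained model. This is only a sketch of the idea, not the XML Worker API: the Processor interface, the Factory class and the defaultFactory() helper are hypothetical stand-ins for TagProcessor, TagProcessorFactory and Tags.getHtmlTagProcessorFactory(), and plain strings stand in for Element objects. Throwing an exception for a tag that was never registered is also an assumption of this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TagFactorySketch {

    /** Hypothetical stand-in for a tag processor: turns tag content into output elements. */
    interface Processor {
        List<String> process(String content);
    }

    /** Plays the role of a dummy processor: it accepts the tag but produces nothing. */
    static final Processor DUMMY = content -> Collections.emptyList();

    /** Hypothetical stand-in for the factory: maps tag names to processors. */
    static class Factory {
        private final Map<String, Processor> processors = new HashMap<>();

        /** Registers a processor; an existing mapping for the same tag is replaced. */
        void addProcessor(Processor processor, String... tags) {
            for (String tag : tags) {
                processors.put(key(tag), processor);
            }
        }

        /** Looks up the processor for a tag (case-insensitive in this sketch). */
        Processor getProcessor(String tag) {
            Processor processor = processors.get(key(tag));
            if (processor == null) {
                throw new IllegalArgumentException("No processor for <" + tag + ">");
            }
            return processor;
        }

        private static String key(String tag) {
            return tag.toLowerCase(Locale.ROOT);
        }
    }

    /** Builds a tiny "default" factory: one real processor and one disabled tag. */
    static Factory defaultFactory() {
        Factory factory = new Factory();
        factory.addProcessor(c -> Collections.singletonList("Paragraph(" + c + ")"), "p");
        factory.addProcessor(DUMMY, "script");
        return factory;
    }

    public static void main(String[] args) {
        Factory factory = defaultFactory();

        // Replace the behaviour of a tag that is already supported.
        factory.addProcessor(
                c -> Collections.singletonList("Paragraph(" + c.toUpperCase(Locale.ROOT) + ")"), "p");

        // Add behaviour for a tag the default factory does not know.
        factory.addProcessor(c -> Collections.singletonList("UserData(" + c + ")"), "userdata");

        System.out.println(factory.getProcessor("p").process("hello"));     // [Paragraph(HELLO)]
        System.out.println(factory.getProcessor("userdata").process("42")); // [UserData(42)]
        System.out.println(factory.getProcessor("script").process("x"));    // []
    }
}
```

In XML Worker itself, the corresponding steps would be to obtain the factory from the Tags class, call addProcessor() on it, and hand it to the HtmlPipelineContext with setTagFactory().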

Suppose that you have HTML code in which you've used a custom tag that should trigger a call to a database, for example a <userdata> tag. XMLWorker will detect this tag and pass it to the HtmlPipeline. As a result, HtmlPipeline looks for the appropriate TagProcessor in its HtmlPipelineContext. You can implement the TagProcessor interface or extend the AbstractTagProcessor class in such a way that it performs a database query, adding its ResultSet to the Document in the form of a (list of) Element object(s). You should prefer extending AbstractTagProcessor, as this class comes with precanned page-break-before, page-break-after, and fontsize handling.
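To make the <userdata> scenario more concrete, here is a minimal sketch of the pattern, again as a self-contained model rather than real XML Worker code. SimpleTag, BaseProcessor and UserDataProcessor are hypothetical names: BaseProcessor imitates the way a shared base class can take care of page-break-before and page-break-after, an in-memory map stands in for the database, the id attribute on the tag is an invented convention, and strings stand in for Element objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UserDataSketch {

    /** Hypothetical, simplified tag: a name, its attributes and its resolved CSS. */
    static class SimpleTag {
        final String name;
        final Map<String, String> attributes;
        final Map<String, String> css;

        SimpleTag(String name, Map<String, String> attributes, Map<String, String> css) {
            this.name = name;
            this.attributes = attributes;
            this.css = css;
        }
    }

    /** Hypothetical base class: handles page breaks so that subclasses do not have to. */
    abstract static class BaseProcessor {
        final List<String> process(SimpleTag tag) {
            List<String> out = new ArrayList<>();
            if ("always".equals(tag.css.get("page-break-before"))) {
                out.add("NEW_PAGE");
            }
            out.addAll(end(tag));
            if ("always".equals(tag.css.get("page-break-after"))) {
                out.add("NEW_PAGE");
            }
            return out;
        }

        /** Subclasses return the elements produced for the closed tag. */
        abstract List<String> end(SimpleTag tag);
    }

    /** Processor for the custom tag; the map stands in for a database query. */
    static class UserDataProcessor extends BaseProcessor {
        private final Map<String, List<String>> database;

        UserDataProcessor(Map<String, List<String>> database) {
            this.database = database;
        }

        @Override
        List<String> end(SimpleTag tag) {
            // Unknown or missing ids simply yield no rows.
            List<String> rows =
                    database.getOrDefault(tag.attributes.get("id"), Collections.emptyList());
            List<String> elements = new ArrayList<>();
            for (String row : rows) {
                elements.add("ListItem(" + row + ")");
            }
            return elements;
        }
    }

    /** Processes <userdata id="7" style="page-break-before: always"> against sample data. */
    static List<String> demo() {
        Map<String, List<String>> database = new HashMap<>();
        database.put("7", Arrays.asList("Ada Lovelace", "ada@example.org"));

        Map<String, String> attributes = new HashMap<>();
        attributes.put("id", "7");
        Map<String, String> css = new HashMap<>();
        css.put("page-break-before", "always");

        SimpleTag tag = new SimpleTag("userdata", attributes, css);
        return new UserDataProcessor(database).process(tag);
    }

    public static void main(String[] args) {
        // [NEW_PAGE, ListItem(Ada Lovelace), ListItem(ada@example.org)]
        System.out.println(demo());
    }
}
```

The point of the split is that the subclass only decides which elements the tag produces, while the behaviour common to all tags lives in one place. That is the same reason for preferring AbstractTagProcessor over a bare TagProcessor implementation.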

Note that your TagProcessor can use CSS only if you introduce a CssResolverPipeline before each pipeline that needs to apply styles. The CssResolverPipeline is responsible for setting the right CSS properties on each tag. This pipeline requires a CSSResolver that contains your CSS file. Let's take a look at the StyleAttrCssResolver that is shipped with XML Worker.