Parsing HTML into a list of Element objects

When iText parses an XML file, it interprets the different tags and, whenever possible, creates a corresponding Element object. Suppose you're not interested in creating a PDF, but you just want to parse the HTML into a list of iText Element objects. You can do this by passing your own ElementHandler to the parseXHtml() method:

XMLWorkerHelper.getInstance().parseXHtml(new ElementHandler() {
    public void add(final Writable w) {
        if (w instanceof WritableElement) {
            List<Element> elements = ((WritableElement)w).elements();
            // write class names of elements to file
        }
    }
}, HTMLParsingToList.class.getResourceAsStream("/html/walden.html"), null);

See HTMLParsingToList and the resulting text file objects.txt
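
The comment in the handler leaves the "write class names" step open. Here is a minimal sketch of that step in plain Java. The ClassNameDump class and its method names are hypothetical and not part of iText. The methods accept a List<?> so that the sketch runs without iText on the classpath; inside the handler you would pass the elements list. Because add() is called once per Writable, the sketch appends to the file instead of overwriting it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it is not part of iText.
class ClassNameDump {

    // Returns the runtime class name of every object in the list, in order.
    static List<String> classNames(List<?> elements) {
        List<String> names = new ArrayList<>();
        for (Object element : elements) {
            names.add(element.getClass().getName());
        }
        return names;
    }

    // Appends one class name per line to the target file (UTF-8),
    // creating the file if it does not exist yet.
    static void appendClassNames(List<?> elements, Path target) throws IOException {
        Files.write(target, classNames(elements), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in objects instead of iText elements (assumption for the demo).
        List<Object> sample = Arrays.asList("text", Integer.valueOf(1), new Object() { });
        Path target = Files.createTempFile("objects", ".txt");
        appendClassNames(sample, target);
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
    }
}
```

The third stand-in object is an anonymous subclass, so its name ends in $1. That is the Java naming convention for anonymous classes, which is why names of that form can show up in objects.txt.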

Let's take a look at the first handful of objects.

com.itextpdf.tool.xml.html.head.Title$1
com.itextpdf.text.Chunk
com.itextpdf.text.Paragraph
com.itextpdf.tool.xml.html.Header$1
com.itextpdf.text.Paragraph

The first object is a Title header. It will result in a bookmark. The first real Element is a Chunk. This is the chunk of text between the <pre> tags in the HTML file:

<pre>

The Project Gutenberg EBook of Walden, and On The Duty Of Civil
Disobedience, by Henry David Thoreau

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. ...
</pre>

This snippet is converted to a Chunk, and a Chunk is an Element that doesn't have a leading of its own. Hence the gibberish: all the lines between the <pre> tags are written on top of each other. This continues until the first <p> tag is encountered, which results in a Paragraph object.
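
To see why a missing leading makes lines overlap, consider this toy model in plain Java. It is not iText's layout code: the LeadingDemo class, the start position of 800, and the leading of 18 are all assumptions chosen for illustration. Each line is placed at a baseline, and the baseline moves down by the leading after every line.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of vertical text placement; this is not iText's layout code.
class LeadingDemo {

    // Returns the baseline y-coordinate of each line, starting at startY
    // and moving down by 'leading' points after every line.
    static List<Float> baselines(int lineCount, float startY, float leading) {
        List<Float> ys = new ArrayList<>();
        float y = startY;
        for (int i = 0; i < lineCount; i++) {
            ys.add(y);
            y -= leading;
        }
        return ys;
    }

    public static void main(String[] args) {
        // Leading of 0: every line lands on the same baseline.
        System.out.println(baselines(3, 800f, 0f));  // [800.0, 800.0, 800.0]
        // Leading of 18 (assumed value): each line gets its own baseline.
        System.out.println(baselines(3, 800f, 18f)); // [800.0, 782.0, 764.0]
    }
}
```

With a leading of zero, all baselines coincide, which corresponds to the overlapping text described above.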