Parsing HTML into a list of Element objects
When iText parses an XML file, it interprets the different tags and, whenever possible, creates a corresponding Element object.
Suppose you're not interested in creating a PDF, but just want to parse the HTML into a list of iText Element objects. In that case, you can pass your own ElementHandler to parseXHtml():
    XMLWorkerHelper.getInstance().parseXHtml(new ElementHandler() {
        public void add(final Writable w) {
            if (w instanceof WritableElement) {
                List<Element> elements = ((WritableElement) w).elements();
                // write class names of elements to file
            }
        }
    }, HTMLParsingToList.class.getResourceAsStream("/html/walden.html"), null);
See HTMLParsingToList and the resulting file objects.txt.
Let's take a look at the first handful of objects.
    com.itextpdf.tool.xml.html.head.Title$1
    com.itextpdf.text.Chunk
    com.itextpdf.text.Paragraph
    com.itextpdf.tool.xml.html.Header$1
    com.itextpdf.text.Paragraph
The first object is a Title header, which will result in a bookmark. The first real Element is a Chunk.
This is the chunk of text between the <pre> tags in the HTML file:
    <pre>
    The Project Gutenberg EBook of Walden, and On The Duty Of Civil Disobedience, by Henry David Thoreau
    This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. ...
    </pre>
This snippet is converted to a Chunk, and a Chunk is an Element that doesn't have a leading (the vertical distance between the baselines of consecutive lines) of its own. Hence the gibberish: all the lines between the <pre> tags are written on top of each other until the first <p> tag is encountered, which results in a Paragraph object.
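To see why a missing leading makes lines overlap, here is a minimal sketch in plain Java. It is not iText code and does not reproduce iText's layout logic. The class LeadingSketch, the method baselines, and the start position of 800 points are hypothetical and serve only as illustration. The sketch assumes that each new line is placed one leading below the previous one.

```java
import java.util.Arrays;

public class LeadingSketch {

    // Hypothetical helper, not part of iText: returns the y coordinate of the
    // baseline of each line, assuming every new line moves down by `leading`.
    public static float[] baselines(int lineCount, float leading, float startY) {
        float[] ys = new float[lineCount];
        for (int i = 0; i < lineCount; i++) {
            ys[i] = startY - i * leading;
        }
        return ys;
    }

    public static void main(String[] args) {
        // Leading of 0: all lines share the same baseline, so they overlap.
        System.out.println(Arrays.toString(baselines(3, 0f, 800f)));
        // prints [800.0, 800.0, 800.0]

        // Leading of 18: each line sits 18 points below the previous one.
        System.out.println(Arrays.toString(baselines(3, 18f, 800f)));
        // prints [800.0, 782.0, 764.0]
    }
}
```

With a leading of 0, every line ends up at the same y position, which matches the overlapping text described above. A nonzero leading gives each line its own baseline.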