Monthly Archives: January 2009

How to use HTML Parser

Published / by Renan Huanca

I was looking for an HTML parser that could parse XHTML. I tested many parsers, such as TagSoup, Apache Sax, and Jericho Parser, but I ran into the following problems:

  1. My XHTML file was not well formed.
  2. Most of the parsers have a somewhat complicated API, because they are oriented toward well-formed documents.

Then I found HTML Parser, which was very easy to use. Here is an example. Suppose we have an HTML file called test.html like this:

<html>
    <head></head>
    <body>
        <img>
        <h1>Hello World</h1>
    </body>
</html>

You want to get the <h1> tag that contains “Hello World”. Note that the <img> tag is incomplete. Let’s code. 🙂

import org.htmlparser.Parser;
import org.htmlparser.filters.*;
import org.htmlparser.util.NodeList;
Parser parser = new Parser("test.html");  // open the HTML file
NodeList list = parser.parse(new HasChildFilter(new StringFilter("Hello World")));
System.out.println("tag found = " + list.elementAt(0).toHtml());

The previous code finds the <h1> tag and prints:

tag found = <h1>Hello World</h1>

Here is the explanation:

The Parser class is in charge of loading the HTML file.

parser.parse() is the method that returns all the nodes, or only the nodes accepted by the filter you pass in.

HasChildFilter and StringFilter are in charge of filtering the nodes. In the example, we are looking for all tags that have a child containing the text ‘Hello World’.
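To make the idea of nesting one filter inside another more concrete, here is a minimal sketch in plain Java. It is only a toy model and not HTML Parser's actual implementation: ToyNode, textContains, hasChild and findMatches are hypothetical names made up for this illustration, and the tree is built by hand instead of being parsed from test.html.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: these types are hypothetical and are NOT HTML Parser's classes.
class FilterSketch {

    /** A minimal tree node: either a tag with children or a text node. */
    static final class ToyNode {
        final String tagName;                      // null for text nodes
        final String text;                         // null for tag nodes
        final List<ToyNode> children = new ArrayList<>();

        private ToyNode(String tagName, String text) {
            this.tagName = tagName;
            this.text = text;
        }

        static ToyNode tag(String name, ToyNode... kids) {
            ToyNode node = new ToyNode(name, null);
            for (ToyNode kid : kids) {
                node.children.add(kid);
            }
            return node;
        }

        static ToyNode textNode(String text) {
            return new ToyNode(null, text);
        }

        String toHtml() {
            if (tagName == null) {
                return text;
            }
            StringBuilder sb = new StringBuilder("<" + tagName + ">");
            for (ToyNode child : children) {
                sb.append(child.toHtml());
            }
            return sb.append("</").append(tagName).append(">").toString();
        }
    }

    /** Accepts text nodes containing the pattern (the role StringFilter plays in the post). */
    static Predicate<ToyNode> textContains(String pattern) {
        // Simplification: this toy version is a plain case-sensitive substring match.
        return node -> node.text != null && node.text.contains(pattern);
    }

    /** Accepts nodes with at least one direct child accepted by inner (the role of HasChildFilter). */
    static Predicate<ToyNode> hasChild(Predicate<ToyNode> inner) {
        return node -> node.children.stream().anyMatch(inner);
    }

    /** Walks the tree depth-first and collects every node the filter accepts. */
    static List<ToyNode> findMatches(ToyNode root, Predicate<ToyNode> filter) {
        List<ToyNode> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(ToyNode node, Predicate<ToyNode> filter, List<ToyNode> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (ToyNode child : node.children) {
            collect(child, filter, out);
        }
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the test.html tree from the post.
        ToyNode doc = ToyNode.tag("html",
                ToyNode.tag("head"),
                ToyNode.tag("body",
                        ToyNode.tag("img"),
                        ToyNode.tag("h1", ToyNode.textNode("Hello World"))));

        for (ToyNode match : findMatches(doc, hasChild(textContains("Hello World")))) {
            System.out.println("tag found = " + match.toHtml());
        }
    }
}
```

In this sketch the outer filter only looks at direct children, so <body> is not matched even though the text sits somewhere below it. Only <h1> is, which is the same result the real example gives.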

These filters saved me a lot of work. HTML Parser has other filters too.

Well, that's it for now :).