Well, let us finish up this scraping saga. As mentioned in the previous post, I have been trying to scrape content using Java, and ended up using two forgiving parsers plus a suite of helpers to get an almost-there solution. More often than not, "almost there" and "in the way" are the chants of the Java cult. The parsers are (in order of my preference):
- TagSoup – worked quite nicely. It implements a SAX parser, copes well with invalid markup, and ultimately yields a palatable tree to work with.
- jTidy – an almost-there solution; it does not do a good job with invalid markup (which is most of the web).
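To show the callback-driven style these lenient parsers use, here is a minimal sketch. TagSoup itself ships as a separate jar, so as a stand-in this example uses the JDK's own forgiving Swing HTML parser (`ParserDelegator`); with TagSoup you would register a SAX `ContentHandler` instead of a `ParserCallback`, but the shape is the same: feed in broken markup, receive clean events.

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;

public class LenientParseDemo {
    public static void main(String[] args) throws Exception {
        // Deliberately invalid markup: unclosed <li> and <b> tags.
        String html = "<html><body><ul><li>one<li>two<b>bold</body></html>";

        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Collect each text run the parser recovers from the soup.
                text.append(data).append('|');
            }
        };

        // The parser silently repairs the markup and fires callbacks anyway;
        // TagSoup delivers the same forgiveness as SAX events.
        new ParserDelegator().parse(new StringReader(html), cb, true);
        System.out.println(text);
    }
}
```

Despite the missing end tags, all three text runs come through intact, which is exactly the property that makes this family of parsers usable on real-world HTML.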
Then one uses a myriad of X* friends in Javaland to get the work done. Be prepared for them not to work all the time for all inputs, with exceptions thrown somewhere deep in their implementations. (Remember the almost-there mantra ;-)
- SAX2DOM – can be used as a ContentHandler for a SAX parser (TagSoup) and gives us a DOM tree to work with
- XPathAPI – can be used to jiggle the DOM tree from the previous step
- Serializing (X)HTML back out is another matter; try looking up TransformerFactory.newTransformer() and Transformer.transform(..)
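The downstream steps above can be sketched with the JDK alone. SAX2DOM and XPathAPI come from Xalan, so this example substitutes a plain DocumentBuilder (standing in for the tree TagSoup + SAX2DOM would hand back) and the JDK's javax.xml.xpath; the serialization step uses the TransformerFactory identity transform mentioned above. The markup snippet is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomPipelineDemo {
    public static void main(String[] args) throws Exception {
        // A well-formed snippet standing in for the DOM tree that
        // TagSoup + SAX2DOM would produce from messy input.
        String xhtml = "<html><body><p class='x'>first</p><p>second</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));

        // Jiggle the tree: JDK's javax.xml.xpath plays the role of Xalan's XPathAPI.
        NodeList ps = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//p", doc, XPathConstants.NODESET);
        System.out.println("paragraphs=" + ps.getLength());
        System.out.println("first=" + ps.item(0).getTextContent());

        // Serialize back to (X)HTML with the identity transform.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);
    }
}
```

The "html" output method makes the Transformer emit browser-style markup rather than strict XML, which is usually what you want when writing scraped content back out.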
Another thing I wonder about these invalid websites (particularly those of large corporations): they have spent millions of dollars to get where they are now, on costly content management systems, complex document workflows, and rocket-science web page/site construction tools. If only they would validate and take in content that is at least somewhat valid (if there is such a thing).
Finally, I would like to close with a hats off! to the browser vendors, who consume this manure and produce milk!