Well, let us finish up this scraping saga.  As mentioned in the previous post, I have been trying to scrape content using Java.  I ended up using two forgiving parsers and a suite of helpers to get an almost-there solution.  More often than not, "almost there" and "in the way" are the chants of the Java cult.  The parsers are (in order of my preference):

  • TagSoup – worked quite nicely.  It implements a SAX parser, copes with invalid markup gracefully, and ultimately yields a palatable tree to work with.
  • jTidy – an almost-there solution; it does not do a good job with invalid markup (which is most of the web).
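
To make the TagSoup route concrete, here is a minimal sketch of feeding it deliberately broken markup through the standard SAX callbacks.  This assumes the TagSoup jar is on the classpath (Maven coordinate `org.ccil.cowan.tagsoup:tagsoup`); the class and method names here are my own illustration, not from a particular project.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupDemo {

    // Collect href attributes from (possibly invalid) HTML.
    // TagSoup's Parser is a SAX XMLReader that repairs bad markup
    // on the fly instead of throwing, so the callbacks below see a
    // sane stream of events even for tag soup.
    static List<String> extractLinks(String html) throws Exception {
        final List<String> links = new ArrayList<>();
        XMLReader reader = new Parser();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("a".equals(localName) && atts.getValue("href") != null) {
                    links.add(atts.getValue("href"));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(html)));
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Deliberately invalid: unclosed tags, unquoted attribute
        String html = "<html><body><a href=/one>first<p><a href='/two'>second";
        System.out.println(extractLinks(html));
    }
}
```

Note that TagSoup reports elements in the XHTML namespace, so matching on `localName` (rather than `qName`) is the safer choice.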

Then one uses a myriad of X* friends in Javaland to get the work done.  Be prepared for them not to work all the time, for all inputs, with exceptions thrown somewhere deep in their implementation.  (Remember the almost-there mantra ;-)
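As one example of wiring those X* friends together, the sketch below runs TagSoup as a SAX source through the JAXP identity transform to build a DOM, then queries it with XPath.  Again this assumes the TagSoup jar on the classpath; the `local-name()` predicate sidesteps the XHTML namespace TagSoup assigns, and the method name is mine, purely for illustration.

```java
import java.io.StringReader;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;

public class XPathScrape {

    // Parse (possibly invalid) HTML with TagSoup, build a DOM via the
    // identity transform, and return the text of the first <h1>.
    static String firstHeading(String html) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        DOMResult dom = new DOMResult();
        t.transform(new SAXSource(new Parser(),
                new InputSource(new StringReader(html))), dom);

        // TagSoup puts elements in the XHTML namespace; local-name()
        // avoids having to register a NamespaceContext for the query.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate("//*[local-name()='h1']",
                dom.getNode(), XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstHeading("<h1>Title</h1><p>no closing tags"));
    }
}
```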

Another thing I wonder about these invalid websites (particularly those of large corporations): they have spent millions of dollars to get where they are now, on costly content management systems, complex document workflows, and rocket-science web page/site construction tools.  If only they would validate, and take in content that is at least somewhat valid (if there is such a thing).

Finally, I would like to close with a hats off! to the browser vendors, who consume this manure and produce milk!

