Bambarbia Kirkudu

Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more. Master's degree in Mathematics, Lomonosov Moscow State University.

Friday, April 11, 2008


Java HTML Parsers Comparison

I found this interesting post and repeated the tests:
- The fastest is NekoHTML.
- The most correct is NekoHTML.

Only one URL, to a nice internet shop (for beauties!), shows a difference: 144 links found with HtmlCleaner, and 116 with NekoHTML. After a quick copy-paste into Excel and sorting the links, I found that some links are simply repeated by HtmlCleaner, probably due to a bug... so apart from that, all parsers behave the same, correctly parsing even the ugliest HTML.
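The duplicate check I did in Excel can also be done in a few lines of Java. This is only a minimal sketch, not the removed test code: the class name `LinkDedup` and the sample URLs are made up for illustration, and it assumes the links were already extracted by a parser as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the original test code): counts how often
// each extracted link occurs, so repeated links are easy to spot.
final class LinkDedup {

    // Returns link -> number of occurrences, in first-seen order.
    static Map<String, Integer> countOccurrences(List<String> links) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String link : links) {
            counts.merge(link, 1, Integer::sum);
        }
        return counts;
    }

    // Returns only the links that were reported more than once.
    static List<String> repeated(List<String> links) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countOccurrences(links).entrySet()) {
            if (e.getValue() > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one parser's output.
        List<String> links = List.of(
                "http://example.com/a",
                "http://example.com/b",
                "http://example.com/a");
        Map<String, Integer> counts = countOccurrences(links);
        System.out.println("total links:  " + links.size());
        System.out.println("unique links: " + counts.size());
        System.out.println("repeated:     " + repeated(links));
    }
}
```

If two parsers report different totals but the same number of unique links, the difference is most likely just repeated entries, as it seemed to be with HtmlCleaner here.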

NekoHTML is also the best by performance (2 times faster than the closest competitor).

I also compared with TagSoup, which is the slowest one...
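For anyone who wants to repeat the speed comparison, a timing harness along these lines is enough. It is a rough sketch, assuming each parser is wrapped in a function that takes the HTML as a string and returns the links it found; the `ParserTimer` name, the warm-up and run counts, and the toy stand-in parser are my placeholders, not the NekoHTML, HtmlCleaner or TagSoup APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical timing harness (not the original, removed test code).
final class ParserTimer {

    // Runs the extractor warmup + runs times and returns the average
    // wall-clock time per measured run, in nanoseconds.
    static long averageNanos(Function<String, List<String>> extractor,
                             String html, int warmup, int runs) {
        if (runs <= 0) throw new IllegalArgumentException("runs must be positive");
        for (int i = 0; i < warmup; i++) {
            extractor.apply(html); // let the JIT compile the hot path first
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            extractor.apply(html);
        }
        return (System.nanoTime() - start) / runs;
    }

    // Toy stand-in for a real parser: collects the values of href="..."
    // attributes by plain string search. A real test would call the
    // parser library here instead.
    static List<String> naiveLinks(String html) {
        List<String> links = new ArrayList<>();
        String marker = "href=\"";
        int pos = html.indexOf(marker);
        while (pos >= 0) {
            int from = pos + marker.length();
            int to = html.indexOf('"', from);
            if (to < 0) break; // unterminated attribute, stop
            links.add(html.substring(from, to));
            pos = html.indexOf(marker, to);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/one\">1</a><a href=\"/two\">2</a>";
        long nanos = averageNanos(ParserTimer::naiveLinks, html, 1000, 10000);
        System.out.println(naiveLinks(html) + " in ~" + nanos + " ns per run");
    }
}
```

Passing every parser through the same `Function` keeps the measurement loop identical for all of them, so only the parsing itself differs between runs.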

Here are the Java files, enjoy! The performance test is commented out; I don't have time to refactor it...

I removed the Java source due to a bug in RSS.


