Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more.
Master's degree in Mathematics, Lomonosov Moscow State University
I found this interesting post and repeated its tests:
- The fastest is NekoHTML.
- The most correct is NekoHTML.
Only one URL, a nice internet shop (for beauties!), shows a difference: 144 links found with HtmlCleaner and 116 with NekoHTML. After a quick copy-paste into Excel and sorting the links, I found that HtmlCleaner simply repeats some links, probably due to a bug. Apart from that, all the parsers behave the same and correctly parse even the ugliest HTML.
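The copy-paste-to-Excel check can also be done in a few lines of Java. This is only a minimal sketch using the standard library: the class name LinkListDiff, its methods, and the sample links are made up for illustration and are not the source files from this post. It assumes each parser's extracted links are already collected into a list of strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration; not the removed source from this post.
class LinkListDiff {

    // Returns every link that occurs more than once, with its occurrence count.
    static Map<String, Integer> duplicates(List<String> links) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String link : links) {
            counts.merge(link, 1, Integer::sum);
        }
        // Drop links that were seen only once.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }

    // Returns the distinct links in sorted order, like sorting a column in Excel.
    static TreeSet<String> distinct(List<String> links) {
        return new TreeSet<>(links);
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for two parsers' output.
        List<String> fromParserA = Arrays.asList("/cart", "/lipstick", "/lipstick", "/mascara");
        List<String> fromParserB = Arrays.asList("/cart", "/lipstick", "/mascara");

        System.out.println("Repeated by parser A: " + duplicates(fromParserA));
        System.out.println("Same distinct links: "
                + distinct(fromParserA).equals(distinct(fromParserB)));
    }
}
```

If the distinct sets are equal and only one list has duplicates, the difference in link counts comes from repetition and not from links one parser missed.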
NekoHTML is also the best on performance (2 times faster than its closest competitor).
I also compared with TagSoup, which is the slowest one...
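For anyone who wants to repeat the speed comparison, here is a rough timing sketch. It is not my original performance test: the ParserTimer class and the stand-in "parser" that just counts anchor tags are hypothetical, and a real run would plug in a function that calls NekoHTML, HtmlCleaner, or TagSoup. It uses plain wall-clock timing with a warm-up phase, so treat the numbers as approximate.

```java
import java.util.function.Function;

// Hypothetical timing harness for illustration; plug a real parser call into it.
class ParserTimer {

    // Runs the parse function `warmup` times untimed, then `runs` times timed,
    // and returns the average wall-clock milliseconds per timed call.
    static <T> double averageMillis(Function<String, T> parse, String html,
                                    int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            parse.apply(html);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            parse.apply(html);
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    // Stand-in for a real parser: counts occurrences of "<a " in the input.
    static int countAnchors(String html) {
        int count = 0;
        int from = 0;
        while ((from = html.indexOf("<a ", from)) >= 0) {
            count++;
            from += 3;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href='/x'>x</a><a href='/y'>y</a></body></html>";
        double ms = averageMillis(ParserTimer::countAnchors, html, 1000, 10000);
        System.out.println("Average ms per parse: " + ms);
    }
}
```

The warm-up loop gives the JIT compiler a chance to optimize the code before measuring, otherwise the first parser tested tends to look slower than it is.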
Here are the Java files, enjoy! The performance test is commented out; I don't have time to refactor it...
I removed the Java source due to a bug in RSS.
Labels: HTML Parser Java Comparison