Bambarbia Kirkudu

Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more. M.Sc. in Mathematics, Lomonosov Moscow State University.

Friday, January 23, 2009

 

SOLR + Lucene + HBase vs. DBSight/Compass + SQL?! No need to normalize!

I published this comment at The Server Side:
============================================

Congratulations, and thank you for sharing this very interesting Lucene implementation! Don't forget: SOLR itself started as a shopping engine at CNET.

I hadn't used DBSight, but I had noticed some noisy posts on Lucene-related message boards. I heard about DBSight from a colleague who suggested it "to have full-text search for a database" and believed it was a quick and easy solution.

When I tried to evaluate DBSight, the first thing I did was browse the available configuration settings directly in the WEB-INF folder and its subfolders. It looks weak... I had tried Compass before SOLR.

For a "search add-on" to an existing database, SOLR offers the most freedom. You don't even need a database for it: indeed, Lucene internals do the "data normalization" for you automatically. Behind the scenes, Apache Hadoop/HBase applies several layers of compression of different kinds (different algorithms), which is also "data normalization", just not in the way a DBA understands it...
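
As an illustration of "no need to normalize": a rough sketch, assuming the Lucene 2.4-era API and made-up field names and paths, of indexing one flat, denormalized product document instead of joining normalized tables:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class DenormalizedIndexing {
    public static void main(String[] args) throws Exception {
        // One flat Lucene document carries everything a normalized schema
        // would spread across product, manufacturer and category tables.
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/tmp/products-index"), // illustrative path
                new StandardAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("name", "Canon PowerShot A590",
                Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("manufacturer", "Canon",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("category", "Digital Cameras",
                Field.Store.YES, Field.Index.NOT_ANALYZED));

        writer.addDocument(doc);
        writer.commit();
        writer.close();
    }
}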

Never ever try to automate full-text searches with databases!!!

For instance, Compass (Hibernate + Lucene) promises "transactional support", but... in some cases a "commit" may take a few minutes in Lucene (merging a few segment files), and what about "optimize"?
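
To see why "transactional" is a stretch, here is a rough sketch (Lucene 2.4-era API, illustrative index path): commit() has to flush buffers and may trigger segment merges, and optimize() rewrites the whole index into a single segment, so either call can block for minutes on a big index.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class CommitCost {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/data/big-index"), // illustrative path
                new StandardAnalyzer(), false,
                IndexWriter.MaxFieldLength.UNLIMITED);

        long t0 = System.currentTimeMillis();
        writer.commit();   // flushes buffered documents; may trigger segment merges
        long t1 = System.currentTimeMillis();
        writer.optimize(); // rewrites the whole index into a single segment
        long t2 = System.currentTimeMillis();

        System.out.println("commit:   " + (t1 - t0) + " ms");
        System.out.println("optimize: " + (t2 - t1) + " ms");
        writer.close();
    }
}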

Recently I got a call from a well-known technology company: they have a client who needs SOLR to implement database full-text search over about 8-10 billion documents, and SOLR was chosen as the "simplest" solution. Are you kidding? Even pure Lucene can't handle that in a single index, and SOLR's distributed search would need something like 64 shard URLs passed in the shards parameter of every request!!!
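
For scale: in Solr 1.3 distributed search, every query carries the full shard list in the shards parameter, something like this (hosts are made up):

http://host1:8983/solr/select?q=ipod&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr,...,host64:8983/solr

With 8-10 billion documents that means dozens of shards in every single URL, plus the cost of merging results on the coordinating node.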

Lucene uses a FieldCache internally for performance optimizations; it is the primary cause of the hundreds or thousands of posts about OutOfMemoryError on the solr-user and lucene-user mailing lists (including posts from DBSight technical staff). What is it? An in-memory array holding the value of each non-tokenized, non-boolean field for every document in the index. For 10 billion documents with even the simplest field, such as a Social Insurance Number or ISBN, a single Lucene index would need an array of roughly 1 terabyte. SOLR can't handle such a distribution (unless you have hardware with a few terabytes of RAM).
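
What the FieldCache actually does, as a rough sketch against the Lucene 2.4-era API (the "isbn" field name and the index path are just examples): the first sort or facet on a field un-inverts it into one big in-memory array sized by maxDoc().

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.FSDirectory;

public class FieldCacheFootprint {
    public static void main(String[] args) throws Exception {
        // Read-only reader over an existing index (illustrative path).
        IndexReader reader = IndexReader.open(
                FSDirectory.getDirectory("/data/big-index"), true);

        // First access builds the cache: one array entry per document in the index.
        String[] isbns = FieldCache.DEFAULT.getStrings(reader, "isbn");

        // Rough footprint: one reference per document, plus the String objects themselves.
        long refsOnly = (long) reader.maxDoc() * 8; // 8-byte references on a 64-bit JVM
        System.out.println(isbns.length + " docs, " + refsOnly + " bytes of references alone");
        reader.close();
    }
}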

A lot of work is going on in Lucene: for instance, removing the synchronization on the isDeleted() method, which is called for every query. It would be nice to have non-synchronized versions for read-only indexes.
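
Lucene 2.4 already lets you ask for a read-only reader, which is where that work is heading; a minimal sketch (illustrative path):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ReadOnlyReader {
    public static void main(String[] args) throws Exception {
        // The boolean argument requests a read-only reader, so Lucene can skip
        // the locking it needs on writable readers.
        IndexReader reader = IndexReader.open(
                FSDirectory.getDirectory("/data/big-index"), true);
        System.out.println("documents: " + reader.numDocs());
        reader.close();
    }
}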

SOLR is not as huge as the Lucene, LingPipe, or GATE projects, but it is an extremely effective tool. It is much easier to configure an XML schema than to work directly with the Lucene API. The main selling point of SOLR (since the CNET-based project was contributed to Apache) is so-called "faceted search", which is simply calculating the sizes of intersections of different DocSets (just look at the search results of modern price-comparison sites: they show subset counts for different categories). However, that was an architectural mistake. Look at http://issues.apache.org/jira/browse/SOLR-711: counting the frequencies of terms for a given DocSet is faster than counting intersections.
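
The two counting strategies, as a toy sketch with plain java.util.BitSet standing in for Solr's DocSet (field values and data are made up): the "intersection" approach does one AND plus cardinality() per facet value, while the approach described above walks the matching documents once and tallies each document's term.

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class FacetCounting {
    public static void main(String[] args) {
        int maxDoc = 8;
        // Toy data: docId -> manufacturer ordinal (0=Canon, 1=Nikon, 2=Sony).
        int[] manufacturerOfDoc = {0, 1, 0, 2, 1, 0, 2, 0};
        String[] lookup = {"Canon", "Nikon", "Sony"};

        // Documents matching the current query.
        BitSet queryDocs = new BitSet(maxDoc);
        queryDocs.set(0); queryDocs.set(2); queryDocs.set(3); queryDocs.set(5);

        // Strategy 1: one DocSet per facet value, intersected with the query DocSet.
        BitSet[] valueDocs = new BitSet[lookup.length];
        for (int v = 0; v < lookup.length; v++) valueDocs[v] = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) valueDocs[manufacturerOfDoc[doc]].set(doc);
        for (int v = 0; v < lookup.length; v++) {
            BitSet intersection = (BitSet) queryDocs.clone();
            intersection.and(valueDocs[v]);
            System.out.println("intersection  " + lookup[v] + ": " + intersection.cardinality());
        }

        // Strategy 2: walk the matching documents once and count their terms.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int doc = queryDocs.nextSetBit(0); doc >= 0; doc = queryDocs.nextSetBit(doc + 1)) {
            String term = lookup[manufacturerOfDoc[doc]];
            Integer c = counts.get(term);
            counts.put(term, c == null ? 1 : c + 1);
        }
        System.out.println("term counting: " + counts);
    }
}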

Lucene + Database: transactional???...

I started with Compass, then moved to Nutch, then SOLR!!! Now I am using HBase simply because the power of MySQL + InnoDB is not enough for a highly concurrent application. No need to index the database: instead, I am indexing the data :)

Thanks,


Robot-based Shopping Engine



Comments:
Interesting, interesting...
:)
 
Glad to hear Russian,


Thank you!

I'm writing here... well, I don't know why... but really, Lucene & Hadoop use several levels of compression, which is quite comparable to data normalization, so it is not entirely appropriate to use an RDBMS in the traditional sense...
 
About the current FieldCache (in both SOLR and Lucene):

1. SOLR's FieldCache is powered by the Lucene FieldCache for non-tokenized, single-valued, non-boolean fields.
2. The Lucene FieldCache is based on a WeakHashMap (keyed by the IndexReader).
3. Specifically, for SOLR use cases, it holds (IndexReader.maxDoc() + 1) references to String objects during warm-up.
4. After warm-up, it keeps one int[] array per such field, and the total size of each array is IndexReader.maxDoc().

Because of that... you need at least [Number-of-Non-Tokenized-Fields] * [MaxDoc] * [12 bytes] for this.

For instance, if your index contains 1,000,000 products, and each product has Manufacturer, Category, and Color attributes, you need:
3 * 1,000,000 * 12 bytes = 36 MB of RAM.
But this is an extremely simple case... in most cases you need to index tons of documents with 10-20 non-tokenized attributes (part number, ISBN, author, etc.).
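
A quick back-of-the-envelope check of those numbers (the 12-bytes-per-entry constant is the same rough estimate as above):

public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 1000000L;     // documents in the index
        int nonTokenizedFields = 3; // Manufacturer, Category, Color
        int bytesPerEntry = 12;     // rough per-document cost per field

        long bytes = nonTokenizedFields * maxDoc * bytesPerEntry;
        System.out.println(bytes + " bytes = ~" + (bytes / (1024 * 1024)) + " MB");
        // Prints 36000000 bytes = ~34 MB. With 20 such fields and 100,000,000
        // documents the same formula gives ~22 GB, which is why heaps of 16 GB
        // and more show up in practice.
    }
}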

For instance, I had to allocate 16 GB for my simple shopping engine at tokenizer.org.

Without that, you may get:
- Unexpected OutOfMemoryErrors
- Garbage collection taking 15% of CPU time
- Bad performance
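
In practice that means starting the JVM with a heap sized for the FieldCache up front; a typical (illustrative) command line for the Solr example distribution would look something like:

java -Xms16g -Xmx16g -XX:+UseConcMarkSweepGC -jar start.jar

A concurrent collector is one way to keep long garbage-collection pauses off the query path; the exact flags depend on your JVM and hardware.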
 
