Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more.
M.Sc. in Mathematics, Lomonosov Moscow State University
I published this comment at TheServerSide:
Congratulations, and thank you for sharing this very interesting Lucene implementation! Don't forget: it started as a shopping engine for CNET.
I hadn't tried DBSight, but I noticed some noisy posts about it on Lucene-related message boards. I first heard about DBSight from a colleague who suggested it "to have full text search for a database" and believed it was a quick and easy solution.
I tried to evaluate DBSight and started by browsing the available configuration settings directly in the WEB-INF folder and its subfolders. Looks weak... I had tried Compass before SOLR.
As a "search add-on" for an existing database, SOLR offers the most freedom. You don't even need a database at all: Lucene's internals effectively handle "data normalization" for you. Behind the scenes, Apache Hadoop/HBase applies several layers of compression of different kinds (different algorithms), which is also "data normalization" of a sort, though not in the sense a DBA understands it...
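To make the denormalization point concrete, here is a toy sketch (all names and data are hypothetical, not from any real schema): instead of indexing normalized tables and joining at query time, you flatten the join into one self-contained document per item, which is exactly the shape a Lucene/SOLR index stores.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy denormalization sketch: collapse a product row plus its joined
// vendor and category rows into one flat "document" ready for indexing.
public class Denormalize {
    static Map<String, String> flatten(String title, String vendor, List<String> categories) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", title);
        doc.put("vendor", vendor);                          // joined in from a vendor table
        doc.put("category", String.join(" ", categories));  // multi-valued field collapsed
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(flatten("MP3 Player", "Acme", List.of("electronics", "audio")));
    }
}
```

No join happens at search time; the price is that updating the vendor name means re-indexing every affected document.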
Never ever try to automate full-text searches with databases!!!
For instance, Compass (Hibernate + Lucene) promises "transactional support", but... in some cases a "commit" may take a few minutes in Lucene (merging a few files). And what about "optimize"?
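The reason a commit can stall is that it may trigger a segment merge, which rewrites entire postings lists. A minimal sketch of the core operation, with hypothetical sorted doc-id lists standing in for segments: a merge touches every element of every segment involved, so for multi-gigabyte segments that is minutes of I/O sitting inside your "transaction".

```java
import java.util.ArrayList;
import java.util.List;

// Toy two-way segment merge: both input lists are sorted doc-id lists,
// and the merge visits every element of both exactly once (O(n) total).
public class SegmentMerge {
    static List<Integer> merge(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));  // drain remainder of segment a
        while (j < b.size()) out.add(b.get(j++));  // drain remainder of segment b
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(1, 4, 7), List.of(2, 4, 9))); // [1, 2, 4, 4, 7, 9]
    }
}
```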
Recently I got a call from a well-known technology company: they have a client who needs SOLR to implement database full-text search over about 8-10 billion documents, and SOLR was chosen as the "simplest" solution. Are you kidding? Even pure Lucene can't handle that in a single index, and SOLR Shards would need 64 additional GET request parameters for such a distributed search!!!
Lucene uses FieldCache internally for performance optimizations, and it is the primary cause of the hundreds of posts about OutOfMemoryError on the SOLR-user and Lucene-user mailing lists (including posts from DBSight technical staff). What is it? An array storing the field value for each document, for every non-tokenized, non-boolean field in the index. For 10 billion documents with even the simplest field, such as a Social Insurance Number or an ISBN, a single Lucene index would need an array averaging around 1 terabyte. SOLR can't handle such a distribution (unless you have hardware with a few terabytes of RAM).
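The terabyte figure is simple arithmetic. The cache keeps one entry per document; assuming a short cached String (ISBN-sized) costs on the order of 100 bytes on the heap (object header plus character data — my estimate, not a measured number):

```java
// Rough FieldCache sizing: one cached entry per document in the index.
public class FieldCacheSize {
    public static void main(String[] args) {
        long docs = 10_000_000_000L;
        long bytesPerEntry = 100L; // assumed average heap cost of a short String
        double terabytes = docs * bytesPerEntry / 1e12;
        System.out.println(terabytes + " TB"); // 1.0 TB
    }
}
```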
A lot of work is going on in Lucene: for instance, removing the synchronization on the isDeleted() method, which is called for every query. It would be nice to have non-synchronized versions for read-only indexes.
SOLR is not as huge as the Lucene, LingPipe, or GATE projects, but it is an extremely effective tool. It is much easier to configure an XML schema than to work directly with the Lucene API. The main selling point of SOLR (ever since the CNET-based project was started and contributed to Apache) is so-called "faceted search", which is simply calculating the intersection sizes of different DocSets (just look at the search results of modern price comparison sites - they show subset counts for different categories). However, that was an... architectural mistake. Look at http://issues.apache.org/jira/browse/SOLR-711
- Counting frequencies of Terms for a given DocSet is faster than counting intersections.
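The two faceting strategies can be sketched in miniature, with DocSets modeled as BitSets over doc ids (all data hypothetical). The original approach intersects the query's DocSet with one cached DocSet per facet value; the SOLR-711 style walks the matching documents once and tallies term frequencies instead:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Faceting two ways over a tiny 4-document index.
public class Facets {
    public static void main(String[] args) {
        BitSet queryDocs = new BitSet();            // docs matching the user's query
        queryDocs.set(0); queryDocs.set(2); queryDocs.set(3);

        // Strategy 1: one cached DocSet per facet value, count intersections.
        BitSet electronics = new BitSet();
        electronics.set(0); electronics.set(1); electronics.set(2);
        BitSet hits = (BitSet) queryDocs.clone();
        hits.and(electronics);                      // one AND per facet value
        System.out.println("electronics: " + hits.cardinality()); // 2

        // Strategy 2: walk the matching docs once, tallying each doc's term.
        String[] categoryByDoc = {"electronics", "electronics", "electronics", "books"};
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int doc = queryDocs.nextSetBit(0); doc >= 0; doc = queryDocs.nextSetBit(doc + 1)) {
            counts.merge(categoryByDoc[doc], 1, Integer::sum);
        }
        System.out.println(counts); // {electronics=2, books=1}
    }
}
```

Strategy 1 does one intersection per distinct facet value, which explodes on high-cardinality fields; strategy 2 costs a single pass over the matching documents regardless of how many distinct values exist.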
Lucene + Database: transactional???...
I started with Compass, then moved to Nutch, then - SOLR!!! Now I am using HBase simply because the power of MySQL + InnoDB is not enough for a highly concurrent application. No need to index the database: instead, I am indexing the data :)
Thanks, Robot-based Shopping Engine
Labels: Compass, DBSight, HBase, Lucene, SOLR, SQL