Bambarbia Kirkudu

Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more. Master's degree in Mathematics, Lomonosov Moscow State University.

Sunday, August 31, 2008

 

SOLR Performance Tuning: the best experiment at www.tokenizer.org (Shopping Price Engine)

I am going to move away from SOLR.

Recently someone asked me about 7-9 billion documents powered by SOLR and the memory requirements... 8 GB is not enough for them (OutOfMemoryError). Funny, very funny! Lucene uses FieldCache, and even 1000 GB of RAM won't handle 10 billion documents with a simple nontokenized ISBN field!!!

It must be a distributed Lucene index. It could be powered by Hadoop. It would need 64 hardware boxes with 64 GB of RAM each.

Even SOLR Distributed Search (Shards) can't deal with that!

(64 additional "shards" in HTTP GET method parameters? Are you kidding?!!)

The explanation? Very easy: FieldCache needs an array of String objects, and the size of that array must be 10,000,000,000. Imagine an ISBN field (or a similar nontokenized one) at 100 bytes per value... and multiply to get the size of the array for just one field:
10 billion x 100 bytes = 1000 GB
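
The arithmetic above can be sketched in a few lines of Java. This is only a back-of-the-envelope estimate, not Lucene code: the class and method names are made up for illustration, and the 100 bytes per value is the assumption from the paragraph above. The extra check relies on the fact that Java arrays are indexed by int, so a single array can never hold 10 billion elements, whatever the heap size.

```java
final class FieldCacheEstimate {

    /** Bytes needed when every document contributes one value of bytesPerValue bytes. */
    static long estimateBytes(long documents, long bytesPerValue) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(documents, bytesPerValue);
    }

    /** Converts bytes to decimal gigabytes (1 GB = 10^9 bytes), rounding down. */
    static long toDecimalGb(long bytes) {
        return bytes / 1_000_000_000L;
    }

    /** True if one array slot per document cannot fit into a single Java array. */
    static boolean exceedsJavaArrayLimit(long documents) {
        // Java arrays are indexed by int, so Integer.MAX_VALUE is a hard upper bound
        return documents > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long documents = 10_000_000_000L; // 10 billion documents
        long bytesPerValue = 100L;        // assumed size of one ISBN-like value

        long bytes = estimateBytes(documents, bytesPerValue);
        System.out.println(toDecimalGb(bytes) + " GB for one field");
        System.out.println("Times larger than an 8 GB heap: " + toDecimalGb(bytes) / 8);
        System.out.println("Too big for one Java array: " + exceedsJavaArrayLimit(documents));
    }
}
```

The estimate ignores object headers and references, so the real footprint of an array of String objects would be higher, not lower.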

If you are interested in Lucene, SOLR, Nutch, Hadoop, HBase, LingPipe, GATE, and Data Mining & Natural Language Processing: call Fuad Efendi at 416-993-2060 (Toronto, Canada; mail to: fuad AT efendi DOT ca)




...After several years of trusting the genius CNET developers, I hacked SOLR; see SOLR-665. Unfortunately, some Apache committers tried to point me to some books... and kindly asked me to learn Java... But they simply don't understand the differences between concurrency strategies, and they can't even see that ConcurrentHashMap uses a spin-loop in order to avoid synchronization instead of giving up the CPU a.s.a.p...

Books are written for those who can criticize them.

Lucene & SOLR core developers didn't even notice my last post regarding the extremely competitive LingPipe (Natural Language Processing), which has excellent and extremely fast Map implementations with an optimistic concurrency strategy:
FastCache
HardFastCache
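
To show what a cache without synchronization can look like, here is a minimal sketch. It is not LingPipe's FastCache or HardFastCache code and not the SOLR-665 patch: the class name LossyCache and its design (one slot per bucket, last writer wins) are assumptions made only for this illustration. Readers and writers never block; a collision simply overwrites the older entry, which is acceptable for a cache because a miss only costs a recomputation.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Hypothetical lock-free, lossy cache: one entry per bucket, last writer wins. */
final class LossyCache<K, V> {

    /** Immutable key/value pair, so a reader always sees a consistent entry. */
    private static final class Entry<K, V> {
        final K key;
        final V value;

        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final AtomicReferenceArray<Entry<K, V>> buckets;

    LossyCache(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        buckets = new AtomicReferenceArray<>(size);
    }

    private int indexOf(Object key) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), buckets.length());
    }

    /** Returns the cached value, or null if the key is absent or was overwritten. */
    V get(K key) {
        Entry<K, V> entry = buckets.get(indexOf(key));
        return (entry != null && entry.key.equals(key)) ? entry.value : null;
    }

    /** Stores the pair, replacing whatever entry occupied the bucket before. */
    void put(K key, V value) {
        buckets.set(indexOf(key), new Entry<>(key, value));
    }

    public static void main(String[] args) {
        LossyCache<String, Integer> cache = new LossyCache<>(1024);
        cache.put("isbn:12345", 42);
        System.out.println(cache.get("isbn:12345"));
        System.out.println(cache.get("isbn:missing"));
    }
}
```

The trade-off is deliberate: the cache may forget entries at any time, so callers must treat a null result as "compute it again" and never as "does not exist".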

Instead, someone (forgetting the Code of Conduct) created SOLR-667.

OK, here is what I finally have: SOLR-711

And it is so obvious! I implemented it for the Price Engine at Tokenizer. For such data, the out-of-the-box SOLR response time would be around 15-20 seconds; for me, 10 milliseconds!!!

Tokenizer is an extremely fast Shopping Engine. The index updates each night at 12:00 AM (Central Time; corresponding to Google's date change in California).

Over 10,000 shops are listed, 7,000 of them in America. You can even find Gift Cookies here!!! Computers & Software, Jewelry & Watches, Arts & Crafts, Health & Beauty, Babies & Toddlers, Home & Garden, Food & Beverages, Office, Automotive, Books, Movies, Music, Clothing, Electronics, Pets, Sports, Recreation, Toys, Games, Weddings, and more.

Feel free to submit your online shopping website to our Shopping Robot.

Current index size: 38,000,000 pages, 700,000 unique tokens used for faceted browsing.

What about Facets? See this: SOLR Faceted Browsing. It was initially designed for CNET, and at that time CNET data contained only 400,000 products. "Faceting" means "Set Intersections" for SOLR, but not for me. There is a lot of code that could easily be optimized. But... Apache! Not everything open-source is good enough. For instance, I am using a specific 'hack' for Lucene where index files are read-only, and I fully avoid synchronization bottlenecks.
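
For readers who have not seen it, "faceting as set intersections" can be sketched like this. It is a simplified illustration with java.util.BitSet, not SOLR's actual implementation and not my optimized version; the class name FacetCounts and the sample data are made up. Each facet value has a set of matching document ids, and its count is the size of the intersection with the query result.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

final class FacetCounts {

    /**
     * For every facet value, counts the documents that match both the query
     * and that value: one set intersection per facet value.
     */
    static Map<String, Integer> count(BitSet queryDocs, Map<String, BitSet> facetDocs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> facet : facetDocs.entrySet()) {
            BitSet intersection = (BitSet) queryDocs.clone(); // keep queryDocs unchanged
            intersection.and(facet.getValue());
            counts.put(facet.getKey(), intersection.cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet queryDocs = new BitSet();
        queryDocs.set(0, 4); // documents 0..3 match the query

        Map<String, BitSet> facetDocs = new LinkedHashMap<>();
        BitSet books = new BitSet();
        books.set(1);
        books.set(2);
        books.set(5);
        facetDocs.put("Books", books);
        BitSet music = new BitSet();
        music.set(3);
        facetDocs.put("Music", music);

        System.out.println(count(queryDocs, facetDocs));
    }
}
```

Done this naively, the cost grows with the number of facet values: with 700,000 unique tokens that is 700,000 intersections per query, which is where a different approach starts to pay off.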

Welcome to Shopping Tokenizer!


P.S.
I am going to move to HBase from MySQL. Currently, I have 4 GB of transaction logs with InnoDB. I can't compute any statistics or do even very simple calculations (such as auto-defining category pages and product pages, auto-categorizing products, etc.)

It's extremely simple... but MySQL can't do it! InnoDB is the best for performance (concurrent updates).

HBase is a clone of Google Bigtable, powered by Hadoop (a clone of Google File System and MapReduce).


