Bambarbia Kirkudu

Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more. M.Sc. in Mathematics, Lomonosov Moscow State University.

Sunday, August 31, 2008

 

SOLR Performance Tuning: the best experiment at www.tokenizer.org (Shopping Price Engine)

I am going to move away from SOLR.

Recently someone asked me about powering 7-9 billion documents with SOLR and the memory requirements... 8 GB is not enough for them (OutOfMemoryError). Funny, very funny! Lucene uses FieldCache, and even 1000 GB of RAM won't handle 10 billion documents with a simple non-tokenized ISBN field!

It would have to be a distributed Lucene index. It could be powered by Hadoop. It would take 64 hardware boxes with 64 GB of RAM each.

Even SOLR Distributed Search (shards) can't deal with that!

(64 additional "shards" parameters in an HTTP GET request? Are you kidding?!)

The explanation is simple: FieldCache needs an array of String objects, and the size of that array would be 10,000,000,000. Imagine an ISBN field (or a similar non-tokenized field) at 100 bytes per value, and multiply to get the size of the array for one single field:
10 billion × 100 bytes = 1000 GB
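The arithmetic above can be put in code form as a back-of-the-envelope estimate (a sketch only: the 100-byte-per-value figure is the assumption from the text, and a real FieldCache would add further overhead for object headers and array references):

```java
public class FieldCacheMemory {
    public static void main(String[] args) {
        long docs = 10_000_000_000L;  // 10 billion documents
        long bytesPerValue = 100L;    // assumed size of one non-tokenized ISBN value
        long totalBytes = docs * bytesPerValue;
        // Decimal gigabytes, matching the estimate in the text
        System.out.println(totalBytes / 1_000_000_000L + " GB");  // prints "1000 GB"
    }
}
```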

If you are interested in Lucene, SOLR, Nutch, Hadoop, HBase, LingPipe, GATE, Data Mining, and Natural Language Processing: call Fuad Efendi at 416-993-2060 (Toronto, Canada; mail to: fuad AT efendi DOT ca)




...After several years of trusting the genius cNET developers, I hacked SOLR; see SOLR-665. Unfortunately some Apache committers tried to point me to some books... and kindly asked me to learn Java... But they simply don't understand the differences between concurrency strategies, and they can't even see that ConcurrentHashMap uses a spin loop to avoid synchronization instead of giving up the CPU as soon as possible...

Books are written for those who can criticize them.

Lucene and SOLR core developers didn't even notice my last post about the extremely competitive LingPipe (Natural Language Processing), which has excellent and extremely fast Map implementations with an optimistic concurrency strategy:
FastCache
HardFastCache
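For a flavor of what an optimistic-concurrency cache can look like, here is a minimal sketch (this is not LingPipe's actual FastCache/HardFastCache code; the class and structure are hypothetical): reads never take a lock, and a write may silently overwrite a colliding entry or lose a race, which is acceptable for a cache.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical lossy cache in the spirit of an optimistic concurrency
// strategy: get() is lock-free; put() may overwrite a colliding entry.
public class LossyCache<K, V> {
    private static final class Entry<K, V> {
        final K key; final V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }
    private final AtomicReferenceArray<Entry<K, V>> slots;
    private final int mask;

    public LossyCache(int capacityPowerOfTwo) {
        slots = new AtomicReferenceArray<>(capacityPowerOfTwo);
        mask = capacityPowerOfTwo - 1;
    }
    public V get(K key) {
        Entry<K, V> e = slots.get(key.hashCode() & mask); // lock-free read
        return (e != null && e.key.equals(key)) ? e.value : null;
    }
    public void put(K key, V value) {
        slots.set(key.hashCode() & mask, new Entry<>(key, value)); // may evict
    }
}
```

The trade-off is lossiness: under contention an entry can silently disappear, but no reader ever blocks or spins on a lock.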

Instead, someone (forgetting the Code of Conduct) created SOLR-667.

OK, here is what I finally have: SOLR-711

And it is so obvious! I implemented it for the Price Engine at Tokenizer. For such data, out-of-the-box SOLR response time would be around 15-20 seconds; for me, 10 milliseconds!

Tokenizer is an extremely fast shopping engine. The index updates each night at 12:00 AM Central Time (corresponding to Google's date change in California).

Over 10,000 shops listed, 7,000 of them in America. You can even find Gift Cookies here! Computers & Software, Jewelry & Watches, Arts & Crafts, Health & Beauty, Babies & Toddlers, Home & Garden, Food & Beverages, Office, Automotive, Books, Movies, Music, Clothing, Electronics, Pets, Sports, Recreation, Toys, Games, Weddings, and more.

Feel free to submit your online shopping website to our Shopping Robot.

Current index size: 38,000,000 pages, 700,000 unique tokens used for faceted browsing.

What about facets? See this: SOLR Faceted Browsing. It was initially designed for CNET, and at that time the CNET data contained only 400,000 products. For SOLR, "faceting" means set intersections, but not for me. There is a lot of code that could easily be optimized. But... Apache! Not everything open-source is good enough. For instance, I am using a specific 'hack' for Lucene where index files are read-only, so I fully avoid synchronization bottlenecks.
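The "set intersections" view of faceting can be sketched in a few lines (an illustration, not SOLR's actual implementation): each facet value keeps a bitset of matching document ids, and a facet count is the cardinality of its intersection with the query's result set.

```java
import java.util.BitSet;

// Sketch of faceting as set intersection: one BitSet per facet value,
// one BitSet for the current query's matching documents.
public class FacetCount {
    public static int count(BitSet queryDocs, BitSet facetDocs) {
        BitSet intersection = (BitSet) queryDocs.clone();
        intersection.and(facetDocs); // docs matching both query and facet value
        return intersection.cardinality();
    }
}
```

On read-only index segments such bitsets can be built once and then shared across threads without synchronization, which is the point about avoiding synchronization bottlenecks.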

Welcome to Shopping Tokenizer!


P.S.
I am going to move from MySQL to HBase. Currently I have 4 GB of InnoDB transaction logs. I can't use any statistics or do even very simple calculations (such as automatically distinguishing category pages from product pages, auto-categorizing products, etc.)

It's extremely simple... but MySQL can't do it! InnoDB is the best for performance (concurrent updates).

HBase is a clone of Google's Bigtable, powered by Hadoop (a clone of the Google File System and MapReduce).



Comments:
Just spotted your blog... we work with Xapian/Flax, which would probably scale better than Lucene for your requirement. We have a test cluster with 100M documents, and Xapian was built for a half-billion-page collection. Check out www.mydeco.com for an implementation.

HTH
 
NUTCH, Hadoop, and HBASE scale to thousands of nodes! Yahoo, AOL, and others use NUTCH in clusters of several thousand computers.
NUTCH is powered by Hadoop, which is a clone of the Google File System; NUTCH uses distributed search powered by Lucene.
I've never heard about Xapian/Flax... Who uses it?
 
 
