Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more.
M.Sc. in Mathematics, Lomonosov Moscow State University
I am going to move away from SOLR.
Recently someone asked me about 7-9 billion documents powered by SOLR and its memory requirements: 8 GB was not enough for them (OutOfMemoryError). Funny, very funny! Lucene uses FieldCache, and even 1000 GB of RAM won't handle 10 billion documents with a simple non-tokenized ISBN field!!!
It must be a distributed Lucene index. It could be powered by Hadoop. It would need 64 hardware boxes with 64 GB of RAM each.
(64 additional "shards" in HTTP GET method parameters? Are you kidding?!!)
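For reference, this is what Solr's distributed search looks like on the wire: the client lists every shard in the request itself via the `shards` parameter. The host names and the query below are placeholder examples, but the parameter syntax is Solr's own:

```text
http://box1:8983/solr/select?q=isbn:9780321356680
    &shards=box1:8983/solr,box2:8983/solr,...,box64:8983/solr
```

With 64 shards, that is 64 host/path entries in a single GET parameter on every query.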
The explanation is easy: FieldCache needs an array of String objects, and the size of that array would be 10,000,000,000. Imagine an ISBN field (or a similar non-tokenized field) at 100 bytes per value, and multiply to get the size of the array for one single field:
10 billion x 100 bytes = 1000 GB...
After several years of trusting the genius CNET developers, I hacked SOLR; see SOLR-665. Unfortunately, some Apache committers tried to point me at some books... and kindly asked me to learn Java... But they simply don't understand the differences between concurrency strategies, and they can't even see that ConcurrentHashMap uses a spin loop to avoid synchronization instead of giving up the CPU as soon as possible...
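The FieldCache arithmetic above is easy to check. A back-of-the-envelope sketch (the 100-byte per-value cost is the post's rough assumption, not a measured figure):

```java
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long docs = 10_000_000_000L;   // 10 billion documents
        long bytesPerValue = 100L;     // assumed cost of one cached ISBN String
        long totalBytes = docs * bytesPerValue;

        // 1,000,000,000,000 bytes = 1000 GB (decimal) for a single field's cache
        System.out.println(totalBytes / 1_000_000_000L + " GB");
    }
}
```

And that is per non-tokenized field; every additional cached field pays the same bill again.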
Books are written for those who can criticize them.
Lucene & SOLR core developers didn't even notice my last post about the extremely competitive LingPipe
(Natural Language Processing), which has excellent and extremely fast Map implementations with an optimistic concurrency strategy: FastCache and HardFastCache.
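To illustrate the general idea of an optimistic, read-mostly strategy (a minimal sketch only, not LingPipe's actual FastCache code): reads take no lock at all, and writers pay the full cost by republishing a fresh copy of the map.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal copy-on-write map: lock-free reads, copied writes.
// Suits workloads where reads vastly outnumber writes, e.g. caches.
public class ReadMostlyMap<K, V> {
    private volatile Map<K, V> snapshot = new HashMap<K, V>();

    public V get(K key) {
        return snapshot.get(key);          // no lock on the read path
    }

    public synchronized void put(K key, V value) {
        Map<K, V> copy = new HashMap<K, V>(snapshot);
        copy.put(key, value);
        snapshot = copy;                   // volatile write publishes the new copy
    }
}
```

The trade-off: a reader never blocks and never spins, while each write copies the whole map, which is exactly the kind of strategy choice the paragraph above is arguing about.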
Instead, someone (forgetting the Code of Conduct) created SOLR-667.
Ok, here is what I finally have: SOLR-711.
And it is so obvious! I implemented it for the Price Engine
at Tokenizer. For such data, the out-of-the-box SOLR response time would be around 15-20 seconds; for me: 10 milliseconds!!!
Tokenizer is an extremely fast Shopping Engine.
The index updates each night at 12:00 AM Central Time (corresponding to Google's date change in California).
Over 10,000 shops listed, 7,000 in America. You can even find Gift Cookies here!!! Computers & Software, Jewelry & Watches, Arts & Crafts, Health & Beauty, Babies & Toddlers, Home & Garden, Food & Beverages, Office, Automotive, Books, Movies, Music, Clothing, Electronics, Pets, Sports, Recreation, Toys, Games, Weddings, and more.
Feel free to submit your online shopping website
to our Shopping Robot
Current index size: 38,000,000 pages, 700,000 unique tokens used for faceted browsing.
What about facets? See this: SOLR Faceted Browsing
. It was initially designed for CNET, and at that time the CNET data contained only 400,000 products. For SOLR, "faceting" means "set intersections", but not for me. There is a lot of code that could easily be optimized. But... Apache! Not everything open-source is good enough. For instance, I use a specific 'hack' for Lucene where the index files are read-only, and I fully avoid synchronization bottlenecks.
Welcome to Shopping Tokenizer
I am going to move from MySQL to HBase. Currently I have 4 GB of transaction logs with InnoDB, I can't use any statistics, and I can't do even very simple calculations (such as auto-detecting category pages and product pages, auto-categorizing products, etc.).
It's extremely simple... but MySQL can't do it! InnoDB is the best for performance (concurrent updates).
HBase is a clone of Google Bigtable, powered by Hadoop (a clone of the Google File System and MapReduce).
Labels: Hadoop, HBase, LingPipe, Lucene, Nutch, SOLR