Bambarbia Kirkudu

Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more. M.Sc. in Mathematics, Lomonosov Moscow State University.

Friday, January 23, 2009

 

SOLR + Lucene + HBase vs. DBSight/Compass + SQL?! No need to normalize!

I published this comment at The Server Side:
============================================

Congratulations, and thank you for sharing this very interesting Lucene implementation! Don't forget: it started as a shopping engine for CNET.

I hadn't tried DBSight, but I noticed some noisy posts on Lucene-related message boards. I heard about DBSight from a colleague who suggested it "to have full-text search for a database" and believed it to be a quick and easy solution.

I tried to evaluate DBSight, and first of all browsed the available configuration settings directly in the WEB-INF folder and its subfolders. It looks weak... I tried Compass before SOLR.

For a "search add-on" for an existing database, SOLR offers the most freedom. You don't even need a database for it: indeed, Lucene's internals implement "data normalization" automatically for you. Behind the scenes, Apache Hadoop/HBase uses several layers of data compression of different kinds (different algorithms), which is also "data normalization", though not in the sense a DBA understands it...

Never ever try to automate full-text searches with databases!!!

For instance, Compass (Hibernate + Lucene) promises "transactional support", but... in some cases a "commit" may take a few minutes in Lucene (merging a few files), and what about "optimize"?

Recently I got a call from a well-known technology company; they have a client who needs SOLR to implement database full-text search for about 8-10 billion documents, and SOLR was chosen as the "simplest" solution. Are you kidding? Even pure Lucene can't handle that in a single index, and even SOLR Shards would need 64 additional GET request parameters for such a distributed search!!!

Lucene uses FieldCache internally for performance optimizations; it is the primary cause of the hundreds of posts about OutOfMemoryException on the SOLR-user and Lucene-user mailing lists (including posts from DBSight technical staff). What is it? It is an array storing the field value for each non-tokenized, non-boolean field, for all documents stored in an index. For 10 billion documents with the simplest field, such as a Social Insurance Number or ISBN, a single Lucene index would need an array of roughly 1 terabyte. SOLR can't handle such a distribution (unless you have hardware with a few terabytes of RAM).
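A quick back-of-the-envelope check of that claim (the 100-byte average per cached value is an assumption, following the post's own figures):

```java
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long numDocs = 10_000_000_000L; // 10 billion documents
        long bytesPerValue = 100L;      // assumed average per cached value (ISBN, SIN, ...)

        // FieldCache keeps one entry per document for each cached field,
        // so a single non-tokenized field alone needs roughly:
        long totalBytes = numDocs * bytesPerValue;
        double totalGb = totalBytes / 1_000_000_000.0;

        System.out.println(totalGb + " GB"); // prints 1000.0 GB, i.e. about 1 TB for one field
    }
}
```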

A lot of work is going on in Lucene: for instance, removing the synchronization on the isDeleted() method, which is called for each query. It would be nice to have non-synchronized versions for read-only indexes.

SOLR is not as huge as the Lucene, LingPipe, or GATE projects, but it is an extremely effective tool. It is very easy to configure an XML schema instead of working directly with the Lucene API. The main selling point of SOLR (since the CNET-based project started and was contributed to Apache) is so-called "faceted search", which is simply calculating the intersection sizes of different DocSets (just look at the search results of modern price comparison sites: they show subset counts for different categories). However, that was an architectural mistake. Look at http://issues.apache.org/jira/browse/SOLR-711 - counting the frequencies of Terms for a given DocSet is faster than counting intersections.
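What "calculating intersection sizes of DocSets" means can be sketched with plain java.util.BitSet (a simplification for illustration; SOLR uses its own DocSet implementations, and the sample data is made up):

```java
import java.util.BitSet;

public class FacetSketch {
    // Count how many documents in the current search results also belong
    // to a facet value's document set, e.g. category = "Books".
    static int facetCount(BitSet searchResults, BitSet facetDocs) {
        BitSet intersection = (BitSet) searchResults.clone(); // don't mutate inputs
        intersection.and(facetDocs);
        return intersection.cardinality();
    }

    public static void main(String[] args) {
        BitSet results = new BitSet();
        results.set(0); results.set(2); results.set(5); results.set(7);

        BitSet booksCategory = new BitSet();
        booksCategory.set(2); booksCategory.set(5); booksCategory.set(9);

        // The facet UI would then display: Books (2)
        System.out.println(facetCount(results, booksCategory)); // prints 2
    }
}
```

The cost grows with the number of facet values, since each one requires its own intersection; that is exactly why counting term frequencies over a single DocSet, as SOLR-711 proposes, can be faster.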

Lucene + Database: transactional???...

I started with Compass, then moved to Nutch, then - SOLR!!! Now I am using HBase because the power of MySQL + InnoDB is not enough for a highly concurrent application. No need to index the database: instead, I am indexing the data :)

Thanks,


Robot-based Shopping Engine



Sunday, August 31, 2008

 

SOLR Performance Tuning: the best experiment at www.tokenizer.org (Shopping Price Engine)

I am going to move from SOLR.

Recently someone asked me about 7-9 billion documents powered by SOLR, and about memory requirements... 8 GB is not enough for them (OutOfMemoryException). Funny, very funny! Lucene uses FieldCache, and even 1,000 GB of RAM won't handle 10 billion documents with a simple non-tokenized ISBN field!!!

It must be a distributed Lucene index. It could be powered by Hadoop. It must be 64 hardware boxes with 64 GB of RAM each.

Even SOLR Distributed Search (Shards) can't deal with that!

(64 additional "shards" in HTTP GET method parameters? Are you kidding?!!)

The explanation? Very easy: FieldCache needs an array of String objects, and the size of that array would be 10,000,000,000. Imagine an ISBN field (or a similar non-tokenized field) at 100 bytes per value... and multiply to get the size of the array for one single field only:
10 billion x 100 bytes = 1,000 GB
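The "64 boxes" figure above can be checked with the same arithmetic (the shard count and 100-byte average are the post's own assumptions):

```java
public class ShardSizing {
    public static void main(String[] args) {
        long totalDocs = 10_000_000_000L; // 10 billion documents
        int shards = 64;                  // distributed Lucene indexes
        long bytesPerValue = 100L;        // assumed average cached value size

        long docsPerShard = totalDocs / shards; // 156,250,000 docs per shard
        double cachePerShardGb = docsPerShard * bytesPerValue / 1_000_000_000.0;

        // Each shard's FieldCache for one field then needs ~15.6 GB,
        // which fits in a 64 GB box with room left for the JVM and OS.
        System.out.printf("%d docs/shard, %.1f GB FieldCache/shard%n",
                docsPerShard, cachePerShardGb);
    }
}
```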

If you are interested in Lucene, SOLR, Nutch, Hadoop, HBase, LingPipe, GATE, and Data Mining & Natural Language Processing: call to Fuad Efendi at 416-993-2060 (Toronto, Canada, mail to: fuad AT efendi DOT ca)




...After several years of trusting the genius CNET developers, I hacked SOLR; see SOLR-665. Unfortunately, some of the Apache committers tried to point me to some books... and kindly asked me to learn Java... But they simply don't understand the differences between concurrency strategies, and they can't even see that ConcurrentHashMap uses a spin loop to avoid synchronization instead of giving up the CPU a.s.a.p...

Books are written for those who can criticize them.

The Lucene & SOLR core developers didn't even notice my last post regarding the extremely competitive LingPipe (Natural Language Processing), which has excellent and extremely fast Map implementations with an optimistic concurrency strategy:
FastCache
HardFastCache

Instead, someone (forgetting the Code of Conduct) created SOLR-667.

OK, what I finally have: SOLR-711

And it is so obvious! I implemented it for the Price Engine at Tokenizer. For such data, out-of-the-box SOLR response time would be around 15-20 seconds; for me - 10 milliseconds!!!

Tokenizer is an extremely fast shopping engine. The index updates each night at 12:00 AM (Central Time, corresponding to Google's date change in California).

Over 10,000 shops are listed, 7,000 of them in America. You can even find Gift Cookies here!!! Computers & Software, Jewelry & Watches, Arts & Crafts, Health & Beauty, Babies & Toddlers, Home & Garden, Food & Beverages, Office, Automotive, Books, Movies, Music, Clothing, Electronics, Pets, Sports, Recreation, Toys, Games, Weddings, and more.

Feel free to submit your online shopping website to our Shopping Robot.

Current index size: 38,000,000 pages, 700,000 unique tokens used for faceted browsing.

What about facets? See this: SOLR Faceted Browsing. It was initially designed for CNET, and at that time the CNET data contained only 400,000 products. "Faceting" means "set intersections" for SOLR, but not for me. There is a lot of code which could easily be optimized. But... Apache! Not everything open-source is good enough. For instance, I am using a specific "hack" for Lucene where the index files are read-only, and I fully avoid synchronization bottlenecks.

Welcome to Shopping Tokenizer!


P.S.
I am going to move to HBase from MySQL. Currently, I have 4 GB of transaction logs with InnoDB. I can't run any statistics or do even very simple calculations (such as auto-detecting category pages and product pages, auto-categorizing products, etc.)

It's extremely simple... but MySQL can't do it! InnoDB is the best for performance (concurrent updates).

HBase is a clone of Google's Bigtable, powered by Hadoop (a clone of the Google File System and MapReduce).



Friday, July 4, 2008

 

Liferay Portal, Enterprise SOA

The problem:
"Ok, I realize that anonymous blog comments are not currently available (as of v4.3.5). I'm just wondering when EXACTLY blog comments will be available (Are they currently being developed? Are the changes in the trunk and I can check it out from SVN and build from source right now? Are the changes scheduled to be done soon?) Any information would be helpful.
I'm asking because I've added the ability to do guest comments in the blogs in my own customized liferay instance. However, I had to modify the core liferay source to do it (couldn't do it with just the extension environment). So, now I'm stuck using 4.3.1 because I absolutely cannot upgrade my liferay instance without this functionality and It's been so long since I made the modifications that I'm not sure I could duplicate them. Nor do I desire to duplicate the same work over again."

Just a teaser :)

Have you tried to manage /resource-actions/blogs.xml in Liferay version 5.0.1 RC1?


VIEW, ADD_DISCUSSION

DELETE, DELETE_DISCUSSION, PERMISSIONS, UPDATE, UPDATE_DISCUSSION



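The action keys above presumably lived in Liferay's resource-actions XML, which the blog stripped of its tags. A sketch of the idea - adding ADD_DISCUSSION to the guest defaults - where the element names are assumptions based on Liferay's resource-actions format of that era, not a copy of the real file:

```xml
<model-resource>
    <model-name>com.liferay.portlet.blogs.model.BlogsEntry</model-name>
    <permissions>
        <guest-defaults>
            <action-key>VIEW</action-key>
            <action-key>ADD_DISCUSSION</action-key>
        </guest-defaults>
        <supports>
            <action-key>DELETE</action-key>
            <action-key>DELETE_DISCUSSION</action-key>
            <action-key>PERMISSIONS</action-key>
            <action-key>UPDATE</action-key>
            <action-key>UPDATE_DISCUSSION</action-key>
        </supports>
    </permissions>
</model-resource>
```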

You may also need to remove the themeDisplay.isSignedIn() checks from /html/taglib/ui/discussion.

There is also a bug in ext-impl/build-parent.xml; something is missing:





Liferay 5.0.1 RC1 EXT:

1. Modify /html/taglib/ui/discussion/page.jsp and remove some code:
<c:if test="<%= themeDisplay.isSignedIn() && MBDiscussionPermission.contains(permissionChecker, portletGroupId.longValue(), className, classPK, ActionKeys.ADD_DISCUSSION) %>">
At least remove themeDisplay.isSignedIn(); note that this may affect all portlets using this tag - you could create your own tag instead...

2. Look at Liferay 4.4, Portlet Development, Chapter 5, modify blogs.xml & portal-ext.properties; see also Wiki - Portlet Permissions

3. Verify MBMessageServiceImpl; you need to allow addDiscussionMessage():
use getGuestOrUserId() instead of getUserId()




Thursday, July 3, 2008

 

Liferay Portal in Canada & USA: Faster Development Time!

Liferay Portal:



Linux Magazine Recognizes Liferay as a Leading Solution for Enterprise Communication. Ashley Wilson writes in the June 2008 issue, "Liferay has been around for years and it keeps getting bigger and better... From an e-commerce platform to a collaborative intranet portal, Liferay has the scalability and functionality built-in to deploy quickly and grow as the needs of the company grow. Liferay is just about the best portal/CMS out there and a superb foundation for unifying and simplifying company communication."

Behind The Scenes


You will have to work directly with the source code in the Extension Environment. Even if you only need a little, you will have to change some HTML tags in JSPs.

You will have to fix some bugs.

You will discover that 99% CPU overload happens with Mozilla, and almost never with IE. Of course, never on the server side.

You will enjoy faster development time.

Tuesday, June 17, 2008

 

Shopping Price Engines - Do We Need Price Comparison?!

Do we really need "comparison"? A search for MacBook at ShopWiki.com retrieves an image of a box of apples (Macintosh?!)
Can't we go easier, with a direct link to the merchant's product page!?


In the United States, the first two internet comparison shopping services were Jango and RoboShopper. These services were initially implemented as client-side add-ins to the Netscape and Internet Explorer browsers, and both required that additional software be downloaded and installed. After these initial efforts, comparison shopping migrated to the server so that the service would be accessible to anyone with a browser.

Currently some of the major U.S.-based comparison shopping services are PriceGrabber, Shopzilla, Dealtime, and NexTag. Major portals like Yahoo!, AOL, and MSN also offer comparison shopping services. In the UK, some of the major comparison shopping services are DealClick and CompareStorePrices, as well as the aforementioned U.S. websites, which also provide UK services. The financial comparison sector has seen significant growth in the UK, with a large number of new sites emerging over recent years. Such sites include Money Expert and UK Financial Options[3].

The original Roboshopper.com site still exists and has been re-targeted as a "Meta" tool which gives results from the leading comparison shopping sites, as well as product review and rating sites.
--



What about Dulance? I didn't have a chance to look at it... they sold themselves to Google. What is he doing in Moscow - any research? Hee-hee-hee!!!


Samples of "Description" meta tags found on Google:

- We offer you a complete and fully functional price comparison website based ... You keep 100% of any commission; Search Engine Friendly, fully customizable. ...

- Best Shopping Price Guide, that allows users to make buying decisions through compare price shopping, use our website daily on ...

- Brick Marketing offers shopping comparison & shopping feed management service to help rank well with shopping comparison sites.

- PriceGrabber.com allows you to compare prices on all the most popular products. We also have product reviews by consumers like you. Our comparison shopping ...

BTW, PriceGrabber is the WORST one and the best SEO-optimized, including a stupid, ugly robots.txt file (which stupid, ugly SEOs follow; see WMW)...

;)



Friday, April 11, 2008

 

Java HTML Parsers Comparison

I found this interesting post and repeated the tests:
- The fastest is NekoHTML.
- The most correct is NekoHTML.

Only the URL of one nice internet shop (for beauties!) shows a difference: 144 links found with HtmlCleaner and 116 with NekoHTML. After a quick copy-paste to Excel and sorting the links, I found that some links are simply repeated by HtmlCleaner, probably due to a bug... so all the parsers behave the same, correctly parsing even the ugliest HTML.
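The copy-paste-to-Excel duplicate check can be scripted instead; a minimal sketch (the sample links are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DuplicateLinks {
    // Return the links that appear more than once in a parser's output,
    // in the order they were first repeated.
    static Set<String> duplicates(List<String> links) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new LinkedHashSet<>();
        for (String link : links) {
            if (!seen.add(link)) { // add() returns false for an already-seen link
                dups.add(link);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> htmlCleanerLinks = Arrays.asList(
                "/shoes", "/bags", "/shoes", "/sale", "/bags");
        System.out.println(duplicates(htmlCleanerLinks)); // prints [/shoes, /bags]
    }
}
```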

NekoHTML is also the best by performance (two times faster than the closest competitor).

I also compared with TagSoup, which is the slowest one...

Here are the Java files, enjoy! The performance test is commented out; I don't have time to refactor it...


...
P.S.
I removed the Java source due to a bug in RSS.



Wednesday, October 24, 2007

 

BEA Workshop, Adobe Flex!

Simply go to the shopping cart and enter the discount code:

flex399103107

(extremely limited time offer!)

(printed without explicit permission... ask BEA! BEA Workshop Studio 10.1 with Adobe Flex 2 & Charting)

Only $399, and even cheaper in Canada

(Is the US dollar really cheap? Street prices go down even faster than the US$; isn't that misunderstood? - a subject for another discussion, not related to this promotion from BEA)

You'll get the best JEE IDE. It is faster/better/smarter than MyEclipse. And it is bundled with Adobe Flex with Charting.

The retail price for Workshop Studio is about $900-$1,000. Adobe Flex: $500; with charting: $750.

Of course, you can say it's tricky... it is a 1-year subscription... The truth is that $399 is an affordable price (for us, contractors), and it is even cheaper than possible upgrade options for "lifetime" products (I have never seen lifetime products... smile ;) - compare with Rational XDE, etc.)


That's it!!!

I was thinking about Adobe Flex with Charting, and the cheapest prices are around $650-$750 (do not believe it if you find cheaper via PriceGrabber, NexTag, or CNET! I found some for $280, with "academic license" in the description... professional licenses are around $700).

I used the MyEclipse IDE for more than 3 years: rich functionality, but not so deep... imagine how many times you needed additional plugins, and the XML editor or something else crashed after an incompatible upgrade... At the least, I additionally needed DbVisualizer. BEA will also collaborate with DbVisualizer (currently it is a Swing application; I hope that will change!)

Thanks for reading this, and do not try to run through deterministic garbage collection!

Almost forgot... I initially tried to buy from Adobe, because the IDE license includes an option for deployment/debugging, which is of course JRun + Adobe LiveCycle ES, and which costs $10,000 by itself! It provides synchronous services between JEE and Flash; asynchronous ones are free (you can use a Spring facade on WLS without LiveCycle).

Of course, the licensing agreement includes single-CPU LiveCycle ES; it is a must for Flex development. I read somewhere at Adobe that you can even use it in production (single CPU) when you buy the Flex IDE.

Need to confirm.


P.S.

Confirmation just arrived... Unfortunately, not from Adobe; they sell it for $10k per CPU. LiveCycle Data Services ES Express (a very long name... check it frequently, they are extremely fast) is FREE.

"In addition to the clustering and load balancing restrictions detailed above, the Express version of LiveCycle Data Services ES does not include LiveCycle Remoting for easy integration with LiveCycle document and process services. In addition, the generated PDF documents using the new RIA-to-PDF feature are watermarked in the Express edition."
- we don't need that; more details at http://www.adobe.com/products/livecycle/dataservices/faq.html. And it is production-level(!) licensing, without any restrictions (single CPU only).
Most important: Business Facade - Flex communication is synchronous(!) with LiveCycle ES Express, and asynchronous with SOAP/REST/etc.
A few months ago I downloaded the JRun-based LiveCycle ES bundle and easily moved some jar files to Tomcat (+ a separate JMS implementation, etc.). It would be nice to have a WebLogic-based ES.

This article was first published at Bambarbia; Liferay has some limitations with anonymous posting (I don't know it in depth yet), so I moved this post here... www.bambarbia.com is currently powered by BEA JRockit 6 R27.3 on SuSE 10 ES, AMD64, Tomcat 6 + APR, over ADSL (700 kbps upload, asymmetric) - can you believe that?!!

P.S.

I added some thoughts at http://dev2dev.bea.com/blog/phumphrey/archive/2007/


