Bambarbia Kirkudu

Independent Consultant (Toronto, Canada) specializing in Lucene, Hadoop, HBase, Nutch, SOLR, LingPipe, GATE, Data Mining, Search Engines, WebLogic, Oracle, Liferay Portal, Java, J2EE, SOA, and more. M.Sc. in Mathematics, Lomonosov Moscow State University.

Thursday, May 31, 2007


Google Supplemental Results - in Beta since June 2002

Supplemental Index: it may look like new stuff from Google to some satellite SEOs, although Google has been trying to implement it properly since at least June 2002.

The Supplemental Index has been well known since... June 2000, with Inktomi:

A later article (Danny Sullivan):

And finally this page, an open-air discussion without any technical understanding:

Just as a sample: initial versions of Google Sitemaps said it was for webmasters who want to submit form-based (form-submission-generated) dynamic URLs to Google, URLs that Google can't find the regular way from anchor links etc. And it failed. Google even has a new rule "to restrict results pages"; see another blog post.

It looks like it is something new even for Matt Cutts ;)


Friday, May 25, 2007


SEO Tools and Articles - interesting!


Monday, May 21, 2007


MySQL: InnoDB Outperforms SolidDB in Real World Application

Here is what I found.

I had used Oracle 10g before that, and wanted to evaluate MySQL.

MySQL Server:
2x AMD Opteron 852 (2.6 GHz, dual-core), 14 GB RAM, SuSE Linux Enterprise Server (SLES 10)

InnoDB: 8 GB allocated, etc. MySQL 5.0.41, UTF-8

SolidDB: 8 GB allocated. MySQL 5.0.27, LATIN1

Client Machine:
2x AMD Opteron 246 (2.0 GHz, single-core), 8 GB RAM, SuSE Linux Enterprise Server (SLES 10)

Client Application:
Web crawler, Java-based. 300 Java threads concurrently fetch HTML pages from the Internet, parse them, and store them in a database. Each [PARSE] operation generates on average 300 new records in the database within a single transaction, including a LONGTEXT column (about 128 KB on average). The most frequently used DML on an initially empty database: SELECT, INSERT. Only two tables are involved, parent-child.
Each thread crawls a specific Internet host and has a 2.5-second delay between subsequent fetches, plus the delay introduced by transaction time. Data is organized by Internet host (indexed), so I don't have any 'concurrency' or competition for data locks.
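As a rough illustration of this transaction pattern, here is a minimal sketch in Python, with SQLite standing in for MySQL/InnoDB; the table and column names (pages, links) are made up for the example and are not the crawler's actual schema:

```python
import sqlite3

# SQLite stands in for MySQL/InnoDB; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, body TEXT)")
conn.execute("CREATE TABLE links (page_id INTEGER REFERENCES pages(id), href TEXT)")

def store_parse_result(conn, url, body, hrefs):
    # One transaction per [PARSE]: insert the parent row plus all ~300
    # child rows, then commit once. Committing per row would multiply
    # the per-transaction (fsync) cost by ~300.
    with conn:  # opens a transaction; commits on success, rolls back on error
        cur = conn.execute("INSERT INTO pages (url, body) VALUES (?, ?)",
                           (url, body))
        page_id = cur.lastrowid
        conn.executemany("INSERT INTO links (page_id, href) VALUES (?, ?)",
                         [(page_id, h) for h in hrefs])

store_parse_result(conn, "http://example.com/", "<html>...</html>",
                   ["http://example.com/page%d" % i for i in range(300)])
print(conn.execute("SELECT COUNT(*) FROM links").fetchone()[0])  # 300
```

The key point is the single commit per parse; each of the 300 crawler threads would run this pattern independently against its own host's data.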

I initially had a very strange performance problem with both InnoDB and SolidDB: very long-running transactions, 1-3 minutes on average. Note that I had 300 concurrent transactions, each inserting 300 new records into the database. With Oracle 10g I had only 10-20 seconds per transaction!
Fortunately, I found that I needed to disable IP-to-hostname resolution via my.cnf.
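For reference, the setting in question is skip-name-resolve; a minimal my.cnf fragment (file location and section layout may differ on your distribution):

```ini
# /etc/my.cnf -- disable reverse-DNS lookups on new connections.
# Note: GRANTs must then use IP addresses (or localhost), not host names.
[mysqld]
skip-name-resolve
```

Without it, MySQL performs a reverse-DNS lookup for every new connection, which can stall badly under hundreds of concurrent clients.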

Here are the final results after 10 hours of execution (once the system has stabilized and has enough data):

InnoDB - 450 transactions per minute, 95% CPU on the client, 10% CPU on the server

SolidDB - 250 transactions per minute (I need to retest it; it was 250 during the first hour, then I switched to InnoDB), 60% CPU on the client, 60% CPU on the server

Oracle 10g - 150... but the server was on the same machine as the client.

It looks like InnoDB can perform much better than 450/minute; the client application was simply overloaded.

SolidDB doesn't even support ucs2 (at least for a JDBC-based client). Even with latin1 it is outperformed by InnoDB (with UTF-8).


It's really weird... most people still believe that 'pure' TPS means everything and forget about concurrency in the real world.


Thursday, May 10, 2007


There are so many talks around... Supplemental.

There are so many talks around... Matt Cutts has a huge blog!

Sorry, guys, but all search engines have main and supplemental indexes. It hasn't been "know-how" since 1997; you can find a lot of very old articles!



What is a 301 Redirect?

Can I use multiple redirects, and what actually happens... from The Robot's viewpoint:

URL1 (HTTP 301)-> URL2 (HTTP 301)-> URL3 (HTTP 200 Ok)-> The Page

The Page will be indexed.
Whether URL1 or URL3 is associated with the Page depends on how Google implements its algorithms.

1. URL1 <-> Page
Pros: no need to handle session IDs for dynamic sites (which use a redirect if the browser does not support session cookies)
Good for: in-site (constrained) crawls (only internal redirects)
Cons: bad for external redirects; it is easy to steal content and PR from any site

2. URL3 <-> Page
Pros: good for preventing theft of external content; can penalize URL1
Cons: bad for internal redirects (session IDs, moved pages, etc.)

I believe this is obvious, and Google follows the same logic. Unfortunately, some other spiders do not follow 301/302 at all.

And each algorithm has some constraints, such as:
- a limit on redirects = 10 (some programming frameworks have a default setting of 100)
- throwing away circular redirects
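The robot-side logic above can be sketched as a toy resolver; the `fetch` table here merely simulates HTTP responses (the URL names come from the chain example above, not from a real site):

```python
MAX_REDIRECTS = 10  # the limit mentioned above

def resolve(url, fetch):
    """Follow 301/302 hops until HTTP 200; return the final URL,
    or None on a circular redirect or too many hops."""
    seen = set()
    for _ in range(MAX_REDIRECTS):
        if url in seen:          # circular redirect -> throw away
            return None
        seen.add(url)
        status, target = fetch[url]
        if status == 200:        # found The Page
            return url
        url = target             # 301/302: keep following
    return None                  # redirect limit exceeded

# URL1 (301) -> URL2 (301) -> URL3 (200 Ok)
chain = {
    "URL1": (301, "URL2"),
    "URL2": (301, "URL3"),
    "URL3": (200, None),
}
print(resolve("URL1", chain))  # URL3

loop = {"A": (301, "B"), "B": (301, "A")}
print(resolve("A", loop))      # None (circular)
```

Whether the indexer then associates URL1 or URL3 (the value returned here) with the page is exactly the design choice discussed above.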



Google Anchor Text and Authority: free special widget

free special widget
If you see it... you will probably agree with my post at WebmasterWorld (WMW)

The subject is not even about reciprocal/incoming links, nor PR, nor competition, nor anchor text.

A Google search for "free special widget" (with quotes!) returns only one page, and it is from THIS forum. Anyone with PR0 could be in second place by tomorrow.

Do you have exactly the same sequence of tokens, "free special widget", on your homepage, or at least "special widget"?

Google queries do not remove stop words anymore; try "To be, or not to be" (without quotes): you won't see "Too hot to be truth" like last year.


free special widget, free special widgets, and more!

Part 2:
WMW uses some kind of keyword cloaking: you will see a lot of "special widgets" there.

As a sample: a very real website! It even had PR9 last month and an Under Construction homepage; now PR = N/A, probably because I published a message at Google Groups.

"My Website is repeated in so many online guides" - a "non-actual" word ((c) WebmasterWorld).


