Be wary of functions in SQL in Magento, such as MAX(). There may be a better way.

I just wrapped up a Healthcheck for a Magento ECG customer and ran into an interesting issue.  IIRC, the problem wasn’t in the client’s code but rather in a community extension.  I won’t mention which one.

This extension was generating a query that was looking for the maximum minimum price for all products in a category.  The resulting query was something like this.

SELECT MAX(min_price) FROM catalog_product_index_price AS cpip
INNER JOIN catalog_category_product_index AS ccpi
    ON ccpi.product_id = cpip.entity_id
WHERE ccpi.category_id = 2;

So it was doing a lot of things right.  It was using the index tables, not the attribute tables, and it was hitting indexes, so that was right too.  But take out the MAX() call and see what happens.  Running this on the Magento sample data set returns 1275 rows, and as we all know the sample data is not like most production systems.  The customer I was working with saw about 40k rows on some queries, and under load this query was taking 20 seconds in some cases, though never less than 200ms.  The reason is that MySQL essentially has to scan the whole joined result set to find the row with the highest value.

So how do we make this faster?

It’s super easy, and fun to do.

SELECT min_price FROM catalog_product_index_price AS cpip
INNER JOIN catalog_category_product_index AS ccpi
    ON ccpi.product_id = cpip.entity_id
WHERE ccpi.category_id = 2
ORDER BY min_price DESC
LIMIT 1;

Sorting by min_price descending and limiting to 1 gives you the same result with a much more efficient query.  On the client’s test system the query went from a minimum of 200ms (and sometimes tens of seconds) to under 1ms.
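If you need the same result from inside Magento code rather than raw SQL, a minimal sketch using Zend_Db_Select might look like the following.  This is illustrative only: the table names are resolved through the resource singleton and category ID 2 is just the one from the query above.

// Sketch: maximum min_price for a category via ORDER BY ... LIMIT 1.
$resource   = Mage::getSingleton('core/resource');
$connection = $resource->getConnection('core_read');

$select = $connection->select()
    ->from(
        array('cpip' => $resource->getTableName('catalog_product_index_price')),
        array('min_price')
    )
    ->join(
        array('ccpi' => $resource->getTableName('catalog_category_product_index')),
        'ccpi.product_id = cpip.entity_id',
        array()
    )
    ->where('ccpi.category_id = ?', 2)
    ->order('min_price DESC')
    ->limit(1);

$maxMinPrice = $connection->fetchOne($select);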

If you are well versed in SQL this makes a lot of sense, and it may be what you would have done anyway.  But it isn’t what this community extension did, and I suspect that because a lot of Magento developers are API-oriented rather than SQL-oriented, the MAX() approach is what they tend to reach for.

He put this in a blog and was amazed at what happened.

Sorry.  I saw about 50 of those linkbait posts and it bugged me so I did one of my own.  I didn’t click on them but those just really, really bug me.  Call this link bait, call it a protest piece or call it performance art.  I just really, really hate those posts.

I don’t know why I’m writing this post.  Maybe it’s because the fire alarm went off for a few seconds at 3:00am.  Geez, fire alarm, if you’re going to startle me awake like that, have the common decency to have something to check for.  It is easier to sleep when things happen for reasons you can explain.

Maybe that’s a life lesson.  Who knows?

 


The server requested authentication method unknown to the client [mysql_old_password]

I updated my server this morning, moving from Zend Server PHP 5.3 to 5.6, but when I did I got a really weird error.

The server requested authentication method unknown to the client [mysql_old_password]

I spent way too much time debugging it, but it turns out that this is an incompatibility between MySQLnd and MySQL 5.5 (with the defaults I was running, at least).  MySQLnd does not know about the mysql_old_password authentication method, and so this error was being thrown.

The information I found said to update your user’s password and things should be fine.  I did, and they weren’t.  When I was updating the password it wasn’t using the new password hash for some reason.  So what I did was execute the following commands.

set session old_passwords=0;
update mysql.user set Password=PASSWORD('password') where User = 'user';
flush privileges;

For whatever reason I needed to disable old_passwords on the MySQL session I was connected to before PASSWORD() would generate the new hash format.
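One way to check which hash format an account is actually using is the length of the stored hash: the old pre-4.1 format is 16 hexadecimal characters, while the newer format is 41 characters starting with an asterisk.  Here is a quick sketch of that check; it assumes you can still connect as some account that MySQLnd accepts (otherwise just run the query from the mysql CLI), and the credentials are placeholders.

// Sketch: list the password hash length per account so you can spot
// any that are still on the old 16-character format.
$mysqli = new mysqli('127.0.0.1', 'admin_user', 'admin_password');
if ($mysqli->connect_error) {
    die('Connect failed: ' . $mysqli->connect_error);
}

$result = $mysqli->query(
    'SELECT User, Host, LENGTH(Password) AS hash_length FROM mysql.user'
);
while ($row = $result->fetch_assoc()) {
    // 16 = old_password hash, 41 = newer native hash
    printf("%s@%s: %d\n", $row['User'], $row['Host'], $row['hash_length']);
}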

Load Testing the Magento Checkout with JMeter

Load testing the Magento checkout is, in theory, difficult.  Well, not so much difficult as time consuming.  It is my personal opinion that most Magento load testing is done either poorly or incompletely.  Load testing read-only pages usually means hitting the same page over and over again.  This is counterproductive because it allows the system to cache re-used data and skew the load testing results to the positive.  I have yet to see a load test that yields performance results that match production.

Testing the checkout is often even worse.  It is also the point in the load test that is most likely to change system behavior due to changing inventory, API calls, add-to-cart operations, etc.  It’s also hard to build a JMeter test for.  Most of the time orgs don’t have a JMeter expert on staff and so rely on the developers to build the load test.  Developers spend most of their time writing code, not load tests, and so the tests tend not to include some of the harder scenarios, such as checkout.  I don’t blame anyone for that; a dev’s core skill is not load testing.

To help with that I’ve written a basic checkout scenario for JMeter.  It uses the JMeter Plugins Extras Set.  The test uses the Guest Checkout, and the product added to the cart is from the Magento sample data, so you will need to change that.  In addition you will need to change the host name in the HTTP Request Defaults, as shown below.

[Screenshot: JMeter HTTP Request Defaults showing the host name to change]

You can download the JMeter script here.

Magento, HHVM and the Enterprise Full Page Cache

*** Personal note: It has been suggested that this may reflect an official Magento position.  It does not.  I like doing research and I like publishing what I find as long as I find it interesting and it isn’t tied to a customer.  So if you’re wondering if this reflects a change in Magento or some kind of future roadmap, it does not.  ***

I was recently asked by a client about Magento and HHVM support.  While the answer is “don’t touch it in production and there’s no guarantee of support” it did spur in me the insatiable desire to create charts.  I’m curious about it and I’m sure you are too.

One of the things I’ve been curious about is how much juice could we squeeze out of the HHVM/Magento combo?  In other words, how fast could Magento Enterprise 1.14 actually be?

My test setup is my crappy local VM: 8GB of RAM with 4x Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz.  I had intended to test with Redis (boy do I have a fascinating blog post coming up there!) but there seems to be an odd issue in HHVM when reading from sockets that have 16k or more of buffered data.  Since HHVM uses FastCGI, that meant that for a proper comparison I needed to use Nginx with PHP-FPM.  So I used Cm_Cache_Backend_File instead.  While my goal was to see if HHVM was faster than standard PHP for Magento, I also wanted to see how many more requests per second I could squeeze out of it.

The answer to the latter question is: a whole shitload.

The test was simple: hit the front page of an FPC-enabled site with JMeter, ramping from 0 to 25 concurrent connections and back down again.  Here’s what happened.

PHP-FPM peaks at just under 200 requests per second

[Chart: PHP-FPM requests per second over the ramping load]

HHVM Peaks at, wait for it, over 300 requests per second.

[Chart: HHVM requests per second over the ramping load]

That is 300 freaking MAGENTO requests per second on a 4 CPU box!

Now, lest you think I’m suggesting you use HHVM, I’m not.  It was not easy getting it to work and I wouldn’t trust it with a full workload.  PHP 7 is also supposed to have similar improvements.  And this was a test of one page, to see what was possible with one piece of functionality.  But it’s an interesting indication of the improvements that are being made in the PHP ecosystem as a whole thanks to the introduction of HHVM.  Perhaps PHP 7 will render HHVM moot.  If that’s what happens, it’s fine by me.  I’m just glad that we are finally starting to see some forward momentum on addressing some of the performance issues that have been inherent in PHP for quite some time.  A little bit of creative destruction, now and then, is a good thing.

We don’t need better authentication


I saw a tweet today concerning authentication.


Reading that, the first thing that came to my mind was “with what?”  When will that one be hacked and then replaced by something else, which will then be hacked and replaced by something else?  For all of its faults, a good password is already stored in the most secure storage repository around: your brain.

The problem, however, is that our brain is really good at remembering concepts, abstractions, and gists, but it is horribly bad at fine precision details.  When it comes to identity, I know who I am and you know who I am (if you know me).  Computers, on the other hand, are really good with precision information but really, really bad at concepts, abstractions, and gists.  So when we identify ourselves to the computer we have to stoop to its level, and we need to continually improve our techniques for identifying ourselves to these dumb machines.

Multi-factor authentication is a good thing to examine.  However, it has a flaw.  It requires two or more means of authenticating yourself: something you know (a passcode), something you have (a phone), and/or something you are (biometrics).  This is all good stuff.  It really is.  But what if you were out to dinner, about to pay, and you realized you forgot your phone (because you were so engrossed in the conversation that you even forgot to check email or to Like your companion’s check-in at said restaurant)?  What biometrics would you use to authenticate yourself?  Eye?  Biochip?  Anyone seen Demolition Man?


“Would you leave me alone, I’m trying to go to the bathroom here!”

Perhaps what we should be doing, as well as looking to increased authentication criteria, is building a system that is expected to fail.  If a company is storing credit card information and it leaks to the public, have a second mechanism that invalidates fraudulent activity and resets the account.  Have the proper security, but also know that security alone is insufficient for dealing with the modern world.

In truth I don’t know what that would look like or how it would operate.  But I don’t think that the problem of identity management is going to be solved by providing more complex authentication methods.  I honestly think that we need to presume failure in security as one of the layers of defense in depth.

Note, again, that my title is link-bait.  Yes, we do need better authentication, but we also need better corruption recovery methods.

Net Neutrality is about money and power, not openness

For all the talk of getting money out of politics we don’t really seem to understand what draws it there in the first place.  Why does money funnel into politics?  Because in politics there is influence and power.  Influence and power give you control.  They help you get what you want.  And those in power don’t mind utilizing their power in exchange for some money.  If we really want to address the question of money in politics, we need to remove what draws it there in the first place: power.  Yet for some reason I do not understand, we keep handing more power to those in government and then complain that the money follows.

Consider some of the most heavily regulated industries in the US.  There is a strong correlation between regulation and lobby spending.  The top 6 spenders are Pharmaceuticals/Health Products, Business Associations, Insurance, Oil and Gas, Computers and Internet, followed by Electrical Utilities.  The total money spent by the top 6 was $936 million.  All except Computers and Internet are heavily regulated entities and Computers and Internet wants to be.

Why would that be?  Try to count the number of new businesses in the Pharma, Insurance, Oil and Gas, and Utilities sectors that have been started over the past 10 years or so.  I can’t.  Regulation allows stasis.  Stasis is great if you are the incumbent.  That way you can focus on your front lines without having to watch your supply lines.  If someone goes after your supply lines, all you have to do is tell your lawyer to file suit and tie them up in court until they run out of money.  You, as the incumbent, have more money and better lawyers than they do.  Rather than defending your position you can simply get the government to do your dirty work.

Think this is fantasy?  Consider this quote from an article in the WSJ featuring the CEO of Goldman Sachs.  I quote it at length.

“More intense regulatory and technology requirements have raised the barriers to entry higher than at any other time in modern history,” said Mr. Blankfein. “This is an expensive business to be in, if you don’t have the market share in scale. Consider the numerous business exits that have been announced by our peers as they reassessed their competitive positioning and relative returns.”

Longer term, Mr. Blankfein sees more opportunities for global giants like Goldman to grab even more market share, as “only a handful of players” will likely be able “to effectively compete on a global basis.”

While the Goldman boss wasn’t endorsing all of the added directives from Washington, he said his bank is “prepared to have this relationship with our regulators”—and the regulators are prepared to have a deep relationship with Goldman—“for a long time.”

Is that really what we want for the Internet?

You might be wondering how this applies to Net Neutrality, and how it affects you.  Why should you care if the rules around big business make big business’ lives more difficult?  Beyond the assertion that regulation stymies innovation there is a very real, practical side to this.  In a WSJ opinion piece on Feb 8, 2015 the author quoted a 2005 statement by the Supreme Court that if Internet Service Providers are regulated under Title II by the FCC, any service provider on the Internet is subject to the regulators.  In other words, if you provide a service on the Internet it is possible (likely?) that you will be subject to the regulators’ whims.  If the FCC deems that it has the power to regulate how service providers provide bandwidth, it has the power to regulate anyone who can be called a service provider.  Who will be called a service provider?  Anyone whom a large company with sufficient lawyers and an agreeable judge decide to call one.

Given how regulation tends to favor the incumbent, why is it that Net Neutrality detractors are smeared as cozying up to big business?  Comcast and Time Warner suck as ISPs (I have Time Warner (I don’t any more.  I switched over the weekend)).  Do we really want them cementing their hold on the Internet?  Do we want that to stay the same?  The FCC promises, cross their heart and hope to die, that they will only use 1% of the power that they have under Title II to regulate ISPs.  And we can be assured of this because other government organizations, such as the NSA, the IRS, the CIA, etc. are all such warm and fuzzy operations.  Don’t get me wrong.  They’re all needed in our modern world.  But they all wield way too much power and operate with way too much impunity.  But the FCC will be different, right?

Proponents of Net Neutrality state that they do not want ISPs to have the power to manage content.  I quote an article from The Nation at length.

Without net neutrality, your Internet service provider could block or slow online content, depending on which websites or apps they wish to preference. For example, an ISP might speed up your access to NBC.com, but slow or degrade your access to AlJazeera.com. They could also charge different prices for different content. An ISP might charge NBC.com more to host last week’s episode of Parks and Recreation than to feature an article about it. Internet service providers could also charge fees to Internet companies for providing that content to you. They might, for example, begin charging Netflix a fee for carrying online video over its network, which it likely will pass on along to its customers.

Is that a possibility?  Yes, it is possible (though more likely they will charge Netflix.  More on that later).  And the article goes on to highlight several times when this has been the case.  Interestingly enough, in most of the cases highlighted the ISPs lost.  But how is this possible given that we have very limited Net Neutrality protections in place?  In its attempt to defend Net Neutrality the article ends up making the case that Title II regulations are NOT required to force access to services because, in large part, the services were restored.

But if existing conflicts were largely resolved without Net Neutrality, why are these issues used to argue for its necessity?

I wonder if the proponents of Net Neutrality have noticed that the companies in favor of Title II regulation are companies that use gobs of bandwidth, but don’t actually pay for the infrastructure itself?  In many cases they build out large, concentrated data centers whereas the ISPs build out vast networks.  Netflix consumes a reported 35% of ISP bandwidth, but wants to pay only for its datacenters and media licensing, as if those were the only costs.  Might that explain why they are so vocal about it?  Did it not occur to anyone that Netflix became a vocal proponent of Net Neutrality AFTER it saw an 88% drop in profitability and public outrage over their change in packages?  Do you really think that Netflix, Amazon, Microsoft, Google, etc. are in favor of Net Neutrality out of a spirit of goodness and Comcast and Time Warner are against it out of greed?  Do you really think that Netflix is concerned about bunnies and lollipops but Time Warner is only concerned about shareholders and 3 martini lunches?  (mmmm, Martini)

If it is bandwidth that we are truly concerned about, why not, instead of tying the hands of those providing the bandwidth (a massively expensive endeavor), make it easier and cheaper to increase bandwidth?  Instead of regulating the size of the pie and who controls the pie, why don’t we make the pie bigger so all can enjoy its sugary goodness?  Does that mean that Comcast is innocent in all of this?  By no means.  But be careful about who you blame.  What Net Neutrality really is is a proxy fight between large corporations, with one side simply holding the stronger rhetorical high ground.  “Freedom!  Openness!” vs. “Margins!  Cost!”

If you are prepared to hand over regulatory power to the FCC you also need to understand that the regulated companies will be “prepared to have this relationship with the regulators”—and the regulators will be  prepared to have a deep relationship with the regulated—“for a long time”.  What this means is that Comcast and Time Warner, because they are smart companies with smart lawyers, will find ways to use this regulation to their advantage.  Why not?  It’s worked for Goldman Sachs.

[UPDATE]

If you question this premise (which you are free to) I have a chart that is worth looking at.  I did some searching on the WSJ website to see what the earliest mention of Netflix and Net Neutrality was.  I found a video from April 17, 2012.  The following chart shows Netflix’s margins.  The arrow marks the approximate date of the video.  Correlation does not prove causality, but that’s a whole lot of correlation there.

[Chart: Netflix margins over time; the arrow marks April 17, 2012]

Reasons why Redis is a great option for a worker queue in Magento


Alan Kent asked on Twitter


You may know that I am partial to Redis as a worker queue for Magento.  Here are some reasons why.

  1. It is stupid simple to set up (meaning newbies and power users alike can use it)
  2. It is blazing fast (its slow log is measured in microseconds)
  3. It does pubsub for one to many
  4. It does BLPOP for many to one
  5. It’s already supported with the Cm_Redis stuff

Things you do not get

  1. Durability
  2. Complex routing rules

When I worked on my last talk for Imagine I spent a week trying to get a fast, working implementation of ActiveMQ and Rabbit.  I looked at it and thought “this is way more complicated than it needs to be” and tried Redis.  What I spent a week working on was up and running in 4 hours, including a Java-based worker and PHP refactoring.

The HA portion is missing out of the box, but I believe you can achieve it with Sentinel.  In some ways it’s like MySQL: it’s beastly easy to get set up and running, but if you need to do more you can.
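To give a flavor of the many-to-one BLPOP case, here is a minimal sketch using the phpredis extension.  This is not the Cm_Redis code; the queue name and payload are made up, and it is only meant to show the shape of a producer and a blocking worker.

// Producer: push a JSON-encoded job onto a list.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->lPush('queue:jobs', json_encode(array('product_id' => 1001)));

// Worker: block for up to 30 seconds waiting for the next job.
$worker = new Redis();
$worker->connect('127.0.0.1', 6379);
while (true) {
    $job = $worker->blPop(array('queue:jobs'), 30);
    if (empty($job)) {
        continue; // timed out with nothing to do; block again
    }
    list($queueName, $payload) = $job;
    $data = json_decode($payload, true);
    // ... process $data ...
}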

If you develop for Magento, know your indexes


When I first got into Magento development, in my mind, there were two ways of getting data from the database.  You would either call Mage::getModel('catalog/product')->load($id) or you would work with the collection.  If you wanted to get a filtered list of something you would use the ORM to get it.

But as I’ve gained more experience (fairly quickly, I might add) I realized that there was more to the puzzle.  A good portion of this is because I work with Magento ECG, where some of the best Magento devs and architects can be found, and I’m a quick learner.

One of the things I’ve consistently seen in production deployments is that a lot of devs don’t get beyond the 1:1 model/collection view.  Consider a scenario where you wanted to list all of the visible products in the same category as a product provided in the URL.  Your code might look something like the following.

public function testAction()
{	
	$product = Mage::getModel('catalog/product')->load(
		$this->getRequest()->getQuery('id')
	);
	/* @var $product Mage_Catalog_Model_Product */
 
	$category = $product->getCategoryCollection()->getFirstItem();
	$productCollection = $category->getProductCollection();
 
	foreach ($productCollection as $product) {
		if ($product->isVisibleInCatalog()) {
			echo $product->getSku() . '<br>';
		}
	}
}

I will admit that this is somewhat contrived, but you will see the pattern an awful lot in Magento deployments.  Though, truth be told, to accurately reflect a lot of the code I see, the example should look more like this.

public function testAction()
{	
	$product = Mage::getModel('catalog/product')->load(
		$this->getRequest()->getQuery('id')
	);
	/* @var $product Mage_Catalog_Model_Product */
 
	$category = $product->getCategoryCollection()->getFirstItem();
	$productCollection = $category->getProductCollection();
 
	foreach ($productCollection as $product) {
                $product = Mage::getModel('catalog/product')->load($product->getEntityId()); //WHAT!!  yep.
		if ($product->isVisibleInCatalog()) {
			echo $product->getSku() . '<br>';
		}
	}
}

What is one way that we could do this better?  Use an index!  In particular catalog_category_product_index.  Following is its CREATE DDL (sans foreign keys and indexes).

CREATE TABLE `catalog_category_product_index` (
  `category_id` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Category ID',
  `product_id` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Product ID',
  `position` int(11) DEFAULT NULL COMMENT 'Position',
  `is_parent` smallint(5) unsigned NOT NULL DEFAULT '0' COMMENT 'Is Parent',
  `store_id` smallint(5) unsigned NOT NULL DEFAULT '0' COMMENT 'Store ID',
  `visibility` smallint(5) unsigned NOT NULL COMMENT 'Visibility'
);

It contains the calculated relations between products and categories.  And not just category membership: it also holds sort order, visibility, and a few other things.  We can use that knowledge to build out a mechanism that, while more complicated, will execute much less logic.

public function test2Action()
{
    $productCollection = Mage::getModel('catalog/product')->getCollection();
    /* @var $productCollection Mage_Catalog_Model_Resource_Product_Collection */
    $select = $productCollection->getSelect();
    $select->join(
        ['ccpi' => Mage::getSingleton('core/resource')->getTableName('catalog/category_product_index')],
        'ccpi.product_id = e.entity_id',
        null
    );
 
    $cond = $productCollection->getConnection()->quoteInto(
        'ccpi2.category_id = ccpi.category_id
        AND ccpi2.is_parent = 1
        AND ccpi2.product_id = ?
        AND ccpi2.visibility IN (2,4)',
        $this->getRequest()->getQuery('id') 
    );
    $select->join(
        ['ccpi2' => Mage::getSingleton('core/resource')->getTableName('catalog/category_product_index')],
        $cond,
        null
    );
 
    foreach ($productCollection as $product) {
        echo $product->getSku() . '<br>';
    }
}

Now you might be thinking “how in the world did he come up with that!?”  It’s actually quite simple.  I knew I needed the data from catalog_category_product_index, so what I needed to figure out was the query that would get me there.  I eventually came up with this query:

SELECT cpe.sku, ccpi.category_id FROM catalog_product_entity AS cpe
INNER JOIN catalog_category_product_index AS ccpi
 ON ccpi.product_id = cpe.entity_id
INNER JOIN catalog_category_product_index AS ccpi2 
 ON ccpi2.category_id = ccpi.category_id
 AND ccpi2.is_parent = 1
 AND ccpi2.product_id = 1001
 AND ccpi2.visibility IN (2,4);

From there I constructed the Zend_Db_Select statement incrementally, adding clauses until it looked kind of similar.
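One trick that makes this incremental construction much easier is dumping the SQL that the collection is actually going to run and comparing it against the hand-written query as you add each clause.  A quick sketch:

// Casting the Zend_Db_Select object to a string gives you the generated SQL.
Mage::log((string) $productCollection->getSelect());

// Or, for quick-and-dirty debugging straight to the browser:
echo $productCollection->getSelect()->__toString();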

For this type of scenario, with this size of dataset, the benefit is not that much.  But as you add complexity, dataset size, and server load, the drawbacks of the previous method become noticeably worse.

Does this mean that you can build out your own indexes and provide the same level of benefit in your custom code with custom logic?  Yes!  But not from this blog post.  :-)

(New(ish)) Indexing in Magento or “The wonderful world of materialized views”

The Wonderful World of Materialized Views

(or, Why Indexing in Magento 1.13 Is So Much Faster)

When Magento Enterprise 1.13 came out it boasted massive improvement in index operation performance.  On the Magento blog introducing 1.13 (http://www.magentocommerce.com/blog/comments/magento-enterprise-edition-113/) a simple statement was made: “The introduction of incremental indexing reduces the need to perform a full re-index and most indexing operations are now automated”.  While this statement is true it provides a very simplified explanation of what has changed and why the improvements in performance are so great.

Materialized Views

Part of the problem with index performance was simply due to what the indexes in Magento are.  They are not indexes like a B-tree or hash; they act more like pre-calculation tables.  In SQL parlance this is often known as a Materialized View.  It is similar to a regular view in that it contains the results of a query, but unlike a regular view, which reflects changes to the underlying rows as they happen, a Materialized View does not.

MySQL does not support Materialized Views, so Magento implemented a sort of pseudo-Materialized View.  Prior to 1.13 Magento had functionality that behaved like a Materialized View but did not track changes in products, so it was not aware of which items needed to be updated; a re-index was always a full re-index.  It was a kind of pseudo-pseudo-Materialized View.

The effect of this can be seen with even a very cursory, top-level look.  In the example I profiled, one product was loaded, the inventory changed, the product saved, the indexes loaded, and a re-index run, all via the administrative UI.

One of the changes has been the inclusion of a new module called Enterprise_Index.  If you were to start looking here for the answer you would only be partially right, and really only if you looked in the setup script.  The setup script defines a number of different events.  But these events are different from the typical event/observer paradigm that most developers are familiar with.

The events in the Enterprise_Index setup are actually MySQL triggers on individual tables that respond to various modifications to individual rows.  These triggers are used to completely invalidate an entire index.  As such there are only a few operations that actually trigger this, most of which revolve around store, attribute, or certain configuration modifications.  But automated index invalidation alone does not explain why Enterprise 1.13 made such large strides when it comes to re-indexing.

But it gives us a starting place to look.

Partial Indexing

A big part of the performance gains comes from partial re-indexing.  Re-indexing small changes is never much of a problem: just modify the items that were changed and apply the same logic that you use for larger changes.

But how does one determine which items have changed?  That can be a huge problem.  One option would be to set an event observer for “model_save_after”, but that would add a lot of overhead to the execution flow.  Plus it wouldn’t take into account a model class that overrides the _afterSave() method and forgets to call the parent.  So reliably determining when a model has changed is something that cannot be done in the application code; there are too many moving parts that need to be taken into account.

To get around these and other issues, triggers are used to note changes.  But unlike the triggers described earlier, which simply invalidate an entire index, these triggers run on individual tables and simply watch for, and make note of, changes.

One can see how this is set up in the new Enterprise_Catalog module.  The installation script makes use of a model from a new module called Mview.  The Mview module is Magento’s implementation of a MySQL-based pseudo-Materialized View.

To configure an individual Mview requires three distinct components: 1) the metadata, 2) a client, and 3) an action.  The metadata and client are managed by the Mview module.  The action is implemented by whichever module is utilizing the Mview.

This is done by telling Mview to subscribe to actions in a given table, which are referred to as events.  The events can be INSERT, UPDATE, or DELETE, correlating to the related SQL actions.  Mview will set up a trigger on the provided table.  This trigger will take the provided column, most likely the primary key, and inject it into a changelog table that is unique for the table provided.  The changelog is a two-column table: one column is a primary key referred to as the version_id, and the other holds the value of the key provided.  Whenever a subscribed-to change occurs, the provided value (again, usually the primary key) is inserted into the changelog table and the version_id is incremented via AUTO_INCREMENT.

An action can be run manually, but most of the time actions are automatically managed by the new Enterprise_Index module.  It adds a cron job, and the Enterprise_Index_Model_Observer::refreshIndex() method is called every time that cron job runs.  This method retrieves an Mview client and iterates over the index configuration nodes found under global/index/indexer.

Under this configuration node are several unique children, and beneath each child are several more nodes.  The one important to the Enterprise_Index observer is “action_model”, which contains the Magento URNs of the classes that handle the actual re-index process.  There are currently two nodes under it: “changelog” and “all”.

<catalog_product_price>
   <model>enterprise_catalog/index_dummy_price</model>
   <action_model>
       <changelog>enterprise_catalog/index_action_product_price_refresh_changelog</changelog>
       <all>enterprise_catalog/index_action_product_price_refresh</all>
   </action_model>
   <index_table>catalog_product_index_price</index_table>
   <sort_order>40</sort_order>
</catalog_product_price>

Example of Full and Partial Reindex Action Configurations

All re-indexing processes that are using the Materialized View should be initiated by the Mview client and should implement Enterprise_Mview_Model_Action_Interface.  The Mview client expects that any view reindexing will be done by a class that implements that interface, often with several sub-classes in between to handle some of the mundane details of managing the partial re-indexing functionality.

Prior to executing a full re-index the Mview client will check the enterprise_mview_metadata table to see if the currently requested index is actually dirty.  If it is not marked as dirty then the index will not be run.  As noted earlier, there are only certain conditions which will cause the table to be flagged and so just because a table row has changed does not mean that the index will be marked dirty, inducing a full re-index.

But while this partial indexing is often managed via cron it can also happen inline with a regular request.  There are several examples of a partial re-index being implemented on saving an item; product inventory being an example.  The product inventory indexer listens for changes in a product and will automatically do a partial re-index for that product when the product saves.

This, like most things in Magento, is a configurable option.  Most of the partial indexers will check a configuration prior to re-indexing.  In Enterprise 1.13 there is a new System Configuration section called “Index Management” which allows the administrator to control this behavior.

The effect of this is clear.  The actual number of queries for a save operation is significantly lower.

Using Materialized Views For Your Own Profit

But while the materialized view implementation has helped make the indexing functionality much more efficient, one of the real benefits of the new Mview module is that you can use it in a standardized way to manage your own indexing requirements.  It will help you implement module functionality when you have COUNT(*), GROUP BY, or DISTINCT queries, or when you need to pre-calculate values.  That’s not to say that you can’t manage it on your own, but by using the Mview module you can offload some of the responsibility for managing these types of calculations to Magento.

In order to do that you need three elements:

  1. A setup script that creates your table subscriber
  2. An index table
  3. Indexer action models that deal with either the changelog or a complete reindex

Before we move on we should note that #3 has some caveats.  Generally, if you are going to be working with the changelog you will not be configuring an index to run with the shell indexer as you might expect.  Instead, it will be run automatically through cron and managed by the Enterprise_Index_Model_Observer class, which will have its refreshIndex() method called each time cron is executed, just like it does in the Enterprise_Catalog module.  In other words, you will generally have no direct control via the administration UI or the CLI if you use this method.

For this example we are going to write an extension that will print a summary of the browsers that have viewed a given page in the footer of that page.  Granted, this is an extension that would not go on a production front-facing Magento implementation but it will demonstrate how you can integrate with the Mview extension to do calculations that you would not want to do during a normal request.  Additionally, for the purposes of clarity, we are going to omit things like Layout and Blocks and standard configuration options and focus on the Mview functionality.  This should not be a problem for most Magento developers, though the full code is available on GitHub.

To implement this kind of functionality you might have considered creating a new table with columns for the URL, the user agent, and a count.  However, if you were to calculate the results on the fly you would be adding at least one extra write during regular traffic and increasing the possibility of locks or write contention in the database.  Those kinds of writes should generally be avoided.  Not to mention that in our scenario the information we need is already being stored in the log_ tables, so we would be duplicating a write for no reason.

There is an additional reason that we would want to do this separately.  User agent strings are long strings of mostly useless information (for our purposes, at least), and they have many different permutations which may or may not follow a specific pattern.  So we have one of two options: either we build the mechanism to normalize the user agents into something useful ourselves, or we use what someone else has done so we don’t have to maintain it.  The get_browser() function is a good fit for this, but it is a little on the slow side if you want accurate results.

So we are going to process those results offline as part of a regularly scheduled cron job that is already set up via Enterprise_Index.

To get started we need to set up a subscriber on the log_url table.  This is the table where each and every HTTP request is registered.  The Mview module allows your module to watch a given table for changes.  When a change occurs it is noted by the triggers defined for that table, and the view’s internal version ID counter is incremented.

When creating a subscription you do it by telling the Mview client to watch a given column on a given table.  There is something interesting to note here: because of how the system tracks changes it is beneficial to make sure that you are watching a column that has a unique value.  Otherwise it is possible that your indexer may accidentally select data that has already been processed.  A good example of this is watching a DATETIME column.  It is quite possible that two events occur during the same second in which the indexer runs, and those rows would end up being re-processed the next time the indexer is run.  When retrieving the changed values from Mview the indexer looks for a version ID in the enterprise_mview_metadata table, which is matched against the actual value in the changelog table.  If there is any possibility of duplication here then it is possible that your data will be skewed.  Simply watching the primary key of the table you’re interested in is by far the easiest way to accomplish this.
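To make the version ID comparison concrete, here is a rough sketch of the kind of query a changelog-based indexer ends up running.  The changelog table and column names are assumptions for illustration (Mview names the changelog after the watched table and stores the watched key column in it); the real select is built for you by the refresh-changelog action class we will use later.

// Sketch: fetch the keys recorded in the changelog since the last run.
$lastVersionId = 42; // placeholder: in practice this comes from enterprise_mview_metadata
$connection = Mage::getSingleton('core/resource')->getConnection('core_read');

$select = $connection->select()
    ->from(array('cl' => 'log_url_cl'), array('url_id')) // assumed table and column names
    ->where('cl.version_id > ?', $lastVersionId)
    ->distinct();

$changedIds = $connection->fetchCol($select);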

To go about setting these relationships up you will generally do this from within a setup script for the module that you are creating.  In the setup script we are creating here there are two tables that need to be created.  One is for the summary report data and the second is for mapping individual user agents onto their parent browser values.  For example, mapping “Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12 Snapper/1.0” onto “Firefox 10“.  These tables, however, we will not document here for the sake of clarity, but you can see them in the GitHub repository for this module.

To start creating our materialized view and its subscription we’re going to start with our setup script.  First we get the name of the log_url table, which is the table we’re going to watch.

$logTable = Mage::getModel('log/log')->getResource()->getTable('url_table');

Why not just type in “log_url”?  This goes to an important principle in software development: hardcode as little as you can.  What if the table name changes?  Then you need to go and figure out all of the places where you hardcoded the value.  This is why, in our next few lines of code, we are a little more verbose when determining the primary key.

$pkData = $this->_conn->getIndexList($logTable);
if (!isset($pkData['PRIMARY']['COLUMNS_LIST'][0])) {
     Mage::throwException('Unable to find log table primary key');
}
$logPrimaryKey = $pkData['PRIMARY']['COLUMNS_LIST'][0];

With these two variables defined we can now configure the Mview client and subscriber.

Mage::getModel('enterprise_mview/metadata')
    ->setTableName($logTable)
    ->setViewName($logTable)
    ->setKeyColumn($logPrimaryKey)
    ->setGroupCode('eschrade_useragent_report')
    ->setStatus(Enterprise_Mview_Model_Metadata::STATUS_INVALID)
    ->save();

This first step in the setup process involves retrieving an instance of the Mview metadata class which is responsible for managing the basic view information.  The class itself basically functions as an interface to the database, as can be seen from the skeleton-like definition of its resource and collection methods.  The metadata class, however, is used by the Mview client as a means of differentiating between various materialized views.  The client loads the metadata information prior to executing an individual indexer.

setTableName() sets the actual name of the table that is the source of our change log: the table whose changes we are interested in, and the table that the triggers will be set up against when the subscription is created later on.  Only one materialized view metadata entry can watch a given table, but multiple indexers can process it based on configuration settings (which we will look at later on).

The setViewName() method is used to determine the source of changes that you are interested in noting.  While this value can be a different table from the one specified in setTableName() it will generally be the same and it must be unique.

setKeyColumn() is used to tell the Mview client which is the unique value that we want to store in our changelog.  Generally, this will be the primary key.

setGroupCode() is largely unused but must be unique due to table index requirements.  However, it does allow you to load the metadata by code, and it also lets you define the group code in your index class.

Setting the status to “invalid” via setStatus() is how you force an index to be completely rebuilt; an index that is merely out of date is not invalid.  During normal cron runs each materialized view that is configured will be run, but whether or not that view is invalid determines which index class is called (we will get into building those classes later on).  By setting the status to “invalid” here we are forcing a full refresh of the index during the next cron run.
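You can use the same flag later on, outside of the setup script, to force a full rebuild.  A small sketch, assuming the group code column is named group_code and that loading by that field works the way it does for ordinary flat models:

// Sketch: flag our view as invalid so the next Enterprise_Index cron run
// performs a full rebuild instead of a changelog-based partial one.
$metadata = Mage::getModel('enterprise_mview/metadata')
    ->load('eschrade_useragent_report', 'group_code');

if ($metadata->getId()) {
    $metadata->setInvalidStatus()->save();
}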

After the metadata has been created you will need to create an instance of the client associated with the view name to create the actual subscription.

$client = Mage::getModel('enterprise_mview/client');
 
/* @var $client Enterprise_Mview_Model_Client */
$client->init($logTable);

The value that is passed into the init() method is the name of the table that you are watching.  This initializes the client and loads the metadata that was previously saved.

The next thing you need to do in your setup is create the changelog table.  This is done by asking the client to do it for you, passing in the Magento URN of a class that will create the changelog.  Generally speaking you can use the base class to do this, which is what we do here.  The base class requires the table_name as a parameter.  Generally the changelog will be named {$tableName}_cl.

$client->execute('enterprise_mview/action_changelog_create', array(
    'table_name' => $logTable
));

Now that the changelog has been created we need to create the subscription.  The subscription class will create a trigger that watches for CUD operations and inserts the primary key value for the changed row into the changelog table.  For each change that is made to the watched table a row will be inserted into the changelog that contains the previously defined watch column along with a version ID.  This version ID corresponds to the version ID stored for this view in the enterprise_mview_metadata table.  So when you are creating an indexer later on, the way the indexer knows which rows to include is by selecting all of the rows in the changelog with a version ID greater than or equal to the one found in the metadata table.

$client->execute('enterprise_mview/action_changelog_subscription_create', array(
    'target_table' => $logTable,
    'target_column' => $logPrimaryKey
));

At this point we have Magento configured to watch for changes on a specified table and note row changes in the database.  Now we need to do something with them.

In order to make this example useful we need to define at least two classes (Eschrade_UserAgent_Model_Index_Ua_Changed and Eschrade_UserAgent_Model_Index_Ua_All) to interact with the Mview components.  Those two classes are there to either build the index partially or to build it fully.  While two classes generally should be defined, the nature of the indexing operation is similar between the two; the primary difference is the size of the data set.

Most of the time the base indexer class can be built by extending Enterprise_Mview_Model_Action_Mview_Refresh_Changelog.  While there are classes that can be used to define each class individually this might end up making more work than is necessary.  In this case we will build a base class for refreshing the view, but provide some simple hooks that will allow the full re-indexer to easily do its job at the same time.

In the example here there are several methods that we are going to describe in words instead of code since they are either helper functions or are used to modify internal SQL queries to get what we’re looking for.  If you would like to see the full code you can find it on GitHub at https://github.com/kschroeder/Magento-Enterprise-Materialized-Views-Example/blob/master/app/code/local/Eschrade/UserAgent/Model/Index/Ua/Changed.php.

class Eschrade_UserAgent_Model_Index_Ua_Changed extends Enterprise_Mview_Model_Action_Mview_Refresh_Changelog
{
    public function execute()
    {
        try {
            $this->_connection->beginTransaction();
            $this->preIndexHook();
            $this->populateUserAgentTable();
            $versionId = $this->_selectLastVersionId();
            $this->processIndex();
            $this->_metadata->setValidStatus()
                ->setVersionId($versionId)
                ->save();
            $this->_connection->commit();
        } catch (Exception $e) {
            $this->_connection->rollBack();
            $this->_metadata->setInvalidStatus()->save();
            throw $e;
        }
        return $this;
    }
}

This class will, by default, only process changed items.  But when we look at the class we’ve defined for processing all of the user agent results you will see that there is very little left for it to do.

The first thing we do is start a transaction.  We don’t strictly need to, but it is often a good idea to wrap this functionality in a transaction to handle cases where an error might corrupt the index or cause data to be processed multiple times.

The preIndexHook() method allows any subclasses to execute any logic prior to actually calculating the index.  This is primarily used to allow the full re-index class to truncate the data.

Next up we call a method named populateUserAgentTable().  This is a helper method that does a LEFT JOIN between the log_visitor_info table and a table that stores the friendly names of the user agents.  If there are a large number of user agents that have not been processed, this method can actually take several seconds to execute (full version on GitHub).  This is done so that MySQL can do some of the processing for us later on rather than doing it in our code.

At this point we get the most current version ID from the changelog to use later when setting the last processed version ID.  This could be done at the end of the process, but there would be more of a chance that a version ID gets skipped if we waited until the end.  Perhaps a better option would be to check the version IDs of the changelog rows that were actually processed and set the highest of those.  But for clarity, and because the elapsed time between when this query is made and when the unprocessed IDs are retrieved will be very low, we do it here.

Next up is the processIndex() method.  This is where the magic is implemented.

public function processIndex()
{
    $indexTable = Mage::getModel('eschrade_useragent/agentreport')->getResource()->getMainTable();
    $select = $this->_selectChangedRows();
    $this->joinSelectWithUrlAndVisitor($select);
    $stmt = $select->query();

    while (($row = $stmt->fetch()) !== false) {
        $this->_connection->insertOnDuplicate(
            $indexTable,
            array(
                'page' => $row['url'],
                'agent' => $row['normalized_user_agent'],
                'request_count' => $row['page_count']
            ),
            array(
                'request_count' => new Zend_Db_Expr('request_count + VALUES(request_count)')
            )
        );
    }
}

The first thing we need is the table that we are going to be inserting the values into.  That would be the main agentreport table.  Next up we retrieve the SELECT object that is pre-configured to retrieve only the primary keys of rows that have changed; this is defined in the Enterprise_Mview_Model_Action_Mview_Refresh_Changelog class.  We make a small modification to the method to deal with some of the complex JOINs we do later, but it is essentially the same.

Retrieving that pre-configured SELECT object is the part directly pertinent to the Mview discussion; modifying it to suit our needs is not.  The joinSelectWithUrlAndVisitor() method is custom to this example and does a number of JOINs on various tables to pull the data we need into one result set.  This data is then inserted with an ON DUPLICATE clause that adds any new page/user agent counts to existing ones, or inserts them if no counts exist.

It is worth noting that using Mview forces us to do things in a somewhat less efficient manner here.  Generally, when you are building a Materialized View the goal is to do as much as possible with INSERT ... SELECT statements.  This greatly cuts down on the possibility of hitting memory limits in PHP and also lets MySQL do the heavy lifting for you.  A good rule of thumb is that you want to work as close to the data as possible.

But because we are applying partial changes, and because the table may already contain previously calculated data for those rows, we are not able to do that.  This is actually a limitation of the database adapter: fields defined with a Zend_Db_Expr (which is how you would express the addition calculation for duplicate entries) are not handled in a way that we can use.  The problem is that when you specify a column, the expression you use for the field is also the expression used in the ON DUPLICATE clause.  Because we are updating existing rows with an addition expression, we cannot use the same value in the column list as we do in the ON DUPLICATE clause.  Therefore, in this case, we need to process the rows manually.

This is why we first generate a result set and iterate over it, calling insertOnDuplicate().  insertOnDuplicate() allows us to define the columns and the ON DUPLICATE clause separately, which we can’t do via insertFromSelect().

At this point when we run the index nothing will happen.

That is because we have not defined the class that will (re)process the entire view.  Remember back to our setup script where we set the view as invalid?  That will result in a full re-index.

So let’s take a look at that class in its entirety.

class Eschrade_UserAgent_Model_Index_Ua_All extends Eschrade_UserAgent_Model_Index_Ua_Changed
{
    public function preIndexHook()
    {
        $this->_connection->truncateTable(
            $this->getReportResource()->getTable('agentreport')
        );
    }
 
    protected function joinSelectWithUrlAndVisitor(Varien_Db_Select $select)
    {
        $select->reset(Varien_Db_Select::WHERE);
        parent::joinSelectWithUrlAndVisitor($select);
    }
 
}

Because of the way we built out Eschrade_UserAgent_Model_Index_Ua_Changed we only need to do two things: 1) truncate the report table, and 2) remove the version_id condition from the WHERE clause.

And with that our classes are done.

But our indexers will still not run.  That is because we have not defined our indexes in the configuration.  Our indexers will be added under the /config/global node in our config.xml file.

<index>
    <indexer>
        <eschrade_useragent_report>
            <model>eschrade_useragent/index_ua_dummy</model>
            <action_model>
                <changelog>eschrade_useragent/index_ua_changed</changelog>
                <all>eschrade_useragent/index_ua_all</all>
            </action_model>
            <index_table>log_url</index_table>
        </eschrade_useragent_report>
    </indexer>
</index>

The first thing you might notice is that there is a dummy class defined under the <model> tag.  This node is typically used by the index CLI script to process the individual indexes.  However, because Mview-based indexers should be run via cron, we define a simple class that renders the name and the description for the admin UI but does nothing else.  The main indexer, which is managed by the new Enterprise_Index module, uses the same configuration nodes as Mage_Index but relies on the <action_model> and <index_table> nodes instead.
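For completeness, here is a sketch of what such a dummy model might look like.  The base class and method names are assumptions on my part (one way of keeping the admin grid happy); the dummy models that ship with Enterprise may be built differently.

// Sketch: a do-nothing indexer model whose only job is to give the
// admin Index Management grid a name and a description.
class Eschrade_UserAgent_Model_Index_Ua_Dummy extends Mage_Index_Model_Indexer_Abstract
{
    public function getName()
    {
        return 'User Agent Report';
    }

    public function getDescription()
    {
        return 'Rebuilt automatically via cron through Enterprise_Index.';
    }

    // No-ops: the real work is done by the Mview action models.
    protected function _registerEvent(Mage_Index_Model_Event $event) {}

    protected function _processEvent(Mage_Index_Model_Event $event) {}
}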

When the Mview index manager is run it will check the status column in the mview metadata table for indexes that are invalid.  For each row whose status is “invalid” the table_name will be used to match up which indexer configuration should be used.  The main indexer will check for a URN under <action_model><all> and run that indexer.  This is the first part of the indexing process.

The second part of the indexing process is when Enterprise_Index iterates over all of the indexers defined in the merged configuration.  An index is ignored if either the <action_model> or the <index_table> is missing.  If a full re-index has already been run for a given indexer entry the changelog-based indexer will be omitted.  Beyond that caveat every index configuration that has an <action_model><changelog> and <index_table> node will be run.

Conclusion

While there are several places in the new Magento Enterprise Edition 1.13 that make indexing more efficient, there is perhaps no mechanism more important than the implementation of Materialized Views.  It allows for a standardized method of building summary/calculation tables in a predictable manner.  And rather than being an implementation intended strictly for the new version of Magento Enterprise, the Mview module allows developers to utilize the same functionality that is largely responsible for some of the performance improvements noted for EE 1.13.
