Magento, HHVM and the Enterprise Full Page Cache

*** Personal note: It has been suggested that this may reflect an official Magento position.  It does not.  I like doing research and I like publishing what I find as long as I find it interesting and it isn’t tied to a customer.  So if you’re wondering if this reflects a change in Magento or some kind of future roadmap, it does not.  ***

I was recently asked by a client about Magento and HHVM support.  While the answer is “don’t touch it in production and there’s no guarantee of support” it did spur in me the insatiable desire to create charts.  I’m curious about it and I’m sure you are too.

One of the things I’ve been curious about is how much juice could we squeeze out of the HHVM/Magento combo?  In other words, how fast could Magento Enterprise 1.14 actually be?

My test setup is my crappy local VM: 8GB of RAM with 4x Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz.  I had intended to test with Redis (boy do I have a fascinating blog post coming up there!) but there seems to be an odd issue in HHVM when reading from sockets that have 16k or more of buffered data.  Since HHVM uses FastCGI, a proper comparison meant using Nginx with PHP-FPM as well.  So I used the Cm_Cache_Backend_File instead.  While my goal was to see whether HHVM was faster than standard PHP for Magento, I also wanted to see how many more requests per second I could squeeze out of it.

The answer to the latter question is: a whole shitload.

The test was simple: hit the front page of an FPC-enabled site with JMeter, ramping from 0 to 25 concurrent connections and back down.  Here’s what happened.

PHP-FPM peaks at just under 200 requests per second

[Chart: PHP-FPM requests per second under load]

HHVM peaks at, wait for it, over 300 requests per second.

[Chart: HHVM requests per second under load]

That is 300 freaking MAGENTO requests per second on a 4 CPU box!

Now, lest you think I’m suggesting you use HHVM, I’m not.  It was not easy getting it to work and I wouldn’t trust it with a full workload.  PHP 7 is also supposed to have similar improvements.  And this was a test of one page, to see what was possible with one piece of functionality.  But it’s an interesting indication of the improvements that are being made in the PHP ecosystem as a whole thanks to the introduction of HHVM.  Perhaps PHP 7 will render HHVM moot.  If that’s what happens, it’s fine by me.  I’m just glad that we are finally starting to see some forward momentum on addressing some of the performance issues that have been inherent in PHP for quite some time.  A little bit of creative destruction, now and then, is a good thing.

We don’t need better authentication


I saw a tweet today concerning authentication.


When reading that the first thing that came to my mind was “with what?”  When will that one be hacked and then replaced by something else, which will then be hacked and replaced by something else?  For all of its faults, a good password is already stored in the most secure storage repository around: your brain.

The problem, however, is that our brain is really good at remembering concepts, abstractions and gists; it is horribly bad at fine-precision details.  When it comes to identity, I know who I am and you know who I am (if you know me).  Computers, on the other hand, are really good with precise information but really, really bad at concepts, abstractions and gists.  So when we identify ourselves to the computer we have to stoop to its level, and we need to continually improve our techniques for identifying ourselves to these dumb machines.

Multi-factor authentication is a good thing to examine.  However, it has a flaw: it requires two or more means of authenticating yourself.  You need something you know (a passcode), something you have (a phone) and/or something you are (biometrics).  This is all good stuff.  It really is.  But what if you were out to dinner, about to pay, and you realized you forgot your phone (because you were so engrossed in the conversation that you even forgot to check email or to Like your companion’s check-in at said restaurant)?  What biometrics would you use to authenticate yourself?  Eye?  Biochip?  Anyone seen Demolition Man?


“Would you leave me alone, I’m trying to go to the bathroom here!”

Perhaps what we should be doing, as well as looking at increased authentication criteria, is building a system that is expected to fail.  If a company is storing credit card information and it leaks into the public, have a second methodology that invalidates fraudulent activity and resets the account.  Have the proper security, but also know that security alone is insufficient for dealing with the modern world.

In truth I don’t know what that would look like or how it would operate.  But I don’t think that the problem of identity management is going to be solved by providing more complex authentication methods.  I honestly think that we need to presume failure in security as one of the layers of defense in depth.

Note, again, that my title is link-bait.  Yes, we do need better authentication, but we also need better corruption recovery methods.

Net Neutrality is about money and power, not openness

For all the talk of getting money out of politics we don’t really seem to understand what draws it there in the first place.  Why does money funnel into politics?  Because in politics there is influence and power.  Influence and power give you control.  They help you get what you want.  And those in power don’t mind utilizing their power in exchange for some money.  If we really want to address the question of money in politics, we need to remove what draws it there in the first place: power.  Yet, for some reason I do not understand, we keep handing more power to those in government and then complain that the money follows.

Consider some of the most heavily regulated industries in the US.  There is a strong correlation between regulation and lobby spending.  The top 6 spenders are Pharmaceuticals/Health Products, Business Associations, Insurance, Oil and Gas, Computers and Internet, followed by Electrical Utilities.  The total money spent by the top 6 was $936 million.  All except Computers and Internet are heavily regulated entities and Computers and Internet wants to be.

Why would that be?  Try to count the number of new businesses in the Pharma, Insurance, Oil and Gas, and Utilities sectors that have been started over the past 10 years or so.  I can’t.  Regulation allows stasis.  Stasis is great if you are the incumbent.  That way you can focus on your front lines without having to watch your supply lines.  If someone goes after your supply lines all you have to do is tell your lawyer to file suit and tie them up in court until they run out of money.  You, as the incumbent, have more money and better lawyers than they do.  Rather than defending your position you can simply get the government to do your dirty work.

Think this is fantasy?  Consider this quote from an article in the WSJ featuring the CEO of Goldman Sachs.  I quote it at length.

“More intense regulatory and technology requirements have raised the barriers to entry higher than at any other time in modern history,” said Mr. Blankfein. “This is an expensive business to be in, if you don’t have the market share in scale. Consider the numerous business exits that have been announced by our peers as they reassessed their competitive positioning and relative returns.”

Longer term, Mr. Blankfein sees more opportunities for global giants like Goldman to grab even more market share, as “only a handful of players” will likely be able “to effectively compete on a global basis.”

While the Goldman boss wasn’t endorsing all of the added directives from Washington, he said his bank is “prepared to have this relationship with our regulators”—and the regulators are prepared to have a deep relationship with Goldman—“for a long time.”

Is that really what we want for the Internet?

You might be wondering how this applies to Net Neutrality, and how it affects you.  Why should you care if the rules around big business make big business’ lives more difficult?  Beyond the assertion that regulation stymies innovation there is a very real, practical side to this.  In a WSJ opinion piece on Feb 8, 2015 the author quoted a 2005 statement by the Supreme Court that if Internet Service Providers are regulated under Title II by the FCC, any service provider on the Internet is subject to the regulators.  In other words, if you provide a service on the Internet it is possible (likely?) that you will be subject to the regulators’ whims.  If the FCC deems that it has the power to regulate how service providers provide bandwidth, it has the power to regulate anyone who can be called a service provider.  Who will be called a service provider?  Anyone whom a large company with sufficient lawyers and a judge agree should be called one.

Given how regulation tends to favor the incumbent, why is it that Net Neutrality detractors are smeared with cozying up to big business?  Comcast and Time Warner suck as ISPs (I have Time Warner (I don’t any more.  I switched over the weekend)).  Do we really want them cementing their hold on the Internet?  Do we want that to stay the same?  The FCC promises, cross their heart and hope to die, that they will only use 1% of the power that they have under Title II to regulate ISPs.  And we can be assured of this because other government organizations, such as the NSA, the IRS, the CIA, etc. are all such warm and fuzzy operations.  Don’t get me wrong.  They’re all needed in our modern world.  But they all wield way too much power and operate with way too much impunity.  But the FCC will be different, right?

Proponents of Net Neutrality state that they do not want ISPs to have the power to manage this content.  I quote an article from The Nation at length.

Without net neutrality, your Internet service provider could block or slow online content, depending on which websites or apps they wish to preference. For example, an ISP might speed up your access to NBC.com, but slow or degrade your access to AlJazeera.com. They could also charge different prices for different content. An ISP might charge NBC.com more to host last week’s episode of Parks and Recreation than to feature an article about it. Internet service providers could also charge fees to Internet companies for providing that content to you. They might, for example, begin charging Netflix a fee for carrying online video over its network, which it likely will pass on along to its customers.

Is that a possibility?  Yes, it is possible (though more likely they will charge Netflix.  More on that later).  And the article goes on to highlight several times when this has been the case.  Interestingly enough, in most of the cases highlighted the ISPs lost.  But how is this possible given that we have very limited Net Neutrality protections in place?  In its attempt to defend Net Neutrality the article ends up making the case that Title II regulations are NOT required to force access to services because, in large part, the services were restored.

But if existing conflicts were largely resolved without Net Neutrality, why are these issues used to argue for its necessity?

I wonder if the proponents of Net Neutrality have noticed that the companies in favor of Title II regulation are companies that use gobs of bandwidth, but don’t actually pay for the infrastructure itself?  In many cases they build out large, concentrated data centers whereas the ISPs build out vast networks.  Netflix consumes a reported 35% of ISP bandwidth, but wants to only pay for its datacenters and media licensing, as if those were the only costs.  Might that explain why they are so vocal about it?  Did it not occur to anyone that Netflix became a vocal proponent of Net Neutrality AFTER it saw an 88% drop in profitability and public outrage over their change in packages?  Do you really think that Netflix, Amazon, Microsoft, Google, etc. are in favor of Net Neutrality out of a spirit of goodness and Comcast and Time Warner are against it out of greed?  Do you really think that Netflix is concerned about bunnies and lollipops but Time Warner is only concerned about shareholders and 3 martini lunches?  (mmmm, Martini)

If it is bandwidth that we are truly concerned about why not, instead of tying the hands of those providing the bandwidth (a massively expensive endeavor), make it easier and cheaper to increase bandwidth?  Instead of regulating the size of the pie and who controls the pie why don’t we make the pie bigger so all can enjoy its sugary goodness?  Does that mean that Comcast is innocent in all of this?  By no means.  But be careful about who you blame.  What Net Neutrality really is is a proxy fight between large corporations, with one side simply holding the stronger rhetorical high ground.  “Freedom!  Openness!” vs. “Margins!  Cost!”

If you are prepared to hand over regulatory power to the FCC you also need to understand that the regulated companies will be “prepared to have this relationship with the regulators”—and the regulators will be  prepared to have a deep relationship with the regulated—“for a long time”.  What this means is that Comcast and Time Warner, because they are smart companies with smart lawyers, will find ways to use this regulation to their advantage.  Why not?  It’s worked for Goldman Sachs.

[UPDATE]

If you question this premise (which you are free to) I have a chart that is worth looking at.  I did some searching on the WSJ website to see what the earliest mention of Netflix and Net Neutrality is.  I found a video from April 17, 2012.  The following chart shows Netflix’s margins.  The arrow marks the approximate date of the video.  Correlation does not prove causality, but that’s a whole lot of correlation there.

[Chart: Netflix margins over time, with the approximate date of the video marked]

Reasons why Redis is a great option for a worker queue in Magento


Alan Kent asked on Twitter


You may know that I am partial to Redis as a worker queue for Magento.  Here are some reasons why.

  1. It is stupid simple to set up (meaning newbies and experts alike can use it)
  2. It is blazing fast (its slow log is measured in microseconds)
  3. It does pubsub for one-to-many
  4. It does BLPOP for many-to-one (a minimal sketch of both follows this list)
  5. It’s already supported with the Cm_Redis stuff
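
As a minimal sketch of points 3 and 4 (assuming the phpredis extension; the "jobs" list and "events" channel names are made up for illustration), a producer can LPUSH work onto a list and publish broadcast messages, while a worker blocks on BLPOP:

// Producer: push a job onto a hypothetical "jobs" list (many-to-one work queue)
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->lPush('jobs', json_encode(array('type' => 'reindex', 'product_id' => 1001)));

// Broadcast: publish a message to anyone subscribed (one-to-many)
$redis->publish('events', json_encode(array('type' => 'cache_flush')));

// Worker: block for up to 30 seconds waiting for the next job
$worker = new Redis();
$worker->connect('127.0.0.1', 6379);
while (true) {
    $job = $worker->blPop(array('jobs'), 30); // returns array(listName, payload) or an empty array on timeout
    if (!empty($job)) {
        $payload = json_decode($job[1], true);
        // ... hand the payload off to whatever does the actual work ...
    }
}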

Things you do not get

  1. Durability
  2. Complex routing rules

When I worked on my last talk for Imagine I spent a week trying to get a fast, working implementation of ActiveMQ and Rabbit.  I looked at it and thought “this is way more complicated than it needs to be” and tried Redis.  What I spent a week working on was up and running in 4 hours, including a Java-based worker and PHP refactoring.

The HA portion is missing out of the box but I believe with Sentinel you can achieve it.  In some ways it’s like MySQL.  It’s beastly easy to get set up and running but if you need to do more you can.

If you develop for Magento, know your indexes


When I first got into Magento development, in my mind, there were two ways of getting data from the database.  You would either call Mage::getModel('catalog/product')->load($id) or you would work with the collection.  If you wanted to get a filtered list of something you would use the ORM to get it.
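
For instance, the collection/ORM route typically looks something like this (a generic example, not tied to any particular store):

// Fetch enabled products, selecting only the attributes we need
$products = Mage::getModel('catalog/product')->getCollection()
    ->addAttributeToSelect(array('name', 'sku'))
    ->addAttributeToFilter('status', Mage_Catalog_Model_Product_Status::STATUS_ENABLED);

foreach ($products as $product) {
    echo $product->getSku() . '<br>';
}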

But as I’ve gained more experience (fairly quickly, I might add) I realized that there was more to the puzzle.  A good portion of this is because I work with Magento ECG, where some of the best Magento devs and architects can be found, and because I’m a quick learner.

One of the things I’ve consistently seen in production deployments is that a lot of devs don’t get beyond the 1:1 model/collection view.  Consider a scenario where you wanted to list all of the visible products in the same category as a product provided on the URL.  Your code might look something like the following.

public function testAction()
{	
	$product = Mage::getModel('catalog/product')->load(
		$this->getRequest()->getQuery('id')
	);
	/* @var $product Mage_Catalog_Model_Product */
 
	$category = $product->getCategoryCollection()->getFirstItem();
	$productCollection = $category->getProductCollection();
 
	foreach ($productCollection as $product) {
		if ($product->isVisibleInCatalog()) {
			echo $product->getSku() . '<br>';
		}
	}
}

I will admit that this is somewhat contrived, but you will see the pattern an awful lot in Magento deployments.  Though, truth be told, to accurately reflect a lot of the code I see, the example should look more like this.

public function testAction()
{	
	$product = Mage::getModel('catalog/product')->load(
		$this->getRequest()->getQuery('id')
	);
	/* @var $product Mage_Catalog_Model_Product */
 
	$category = $product->getCategoryCollection()->getFirstItem();
	$productCollection = $category->getProductCollection();
 
	foreach ($productCollection as $product) {
                $product = Mage::getModel('catalog/product')->load($product->getEntityId()); //WHAT!!  yep.
		if ($product->isVisibleInCatalog()) {
			echo $product->getSku() . '<br>';
		}
	}
}

What is one way that we could do this better?  Use an index!  In particular catalog_category_product_index.  Following is its CREATE DDL (sans foreign keys and indexes).

CREATE TABLE `catalog_category_product_index` (
  `category_id` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Category ID',
  `product_id` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Product ID',
  `position` int(11) DEFAULT NULL COMMENT 'Position',
  `is_parent` smallint(5) unsigned NOT NULL DEFAULT '0' COMMENT 'Is Parent',
  `store_id` smallint(5) unsigned NOT NULL DEFAULT '0' COMMENT 'Store ID',
  `visibility` smallint(5) unsigned NOT NULL COMMENT 'Visibility'
);

It contains the calculated relations between products and categories.  And not just the category relations: it also contains sort order, visibility and a few other things.  We can use that knowledge and build out a mechanism that, while more complicated, will execute much less logic.

public function test2Action()
{
    $productCollection = Mage::getModel('catalog/product')->getCollection();
    /* @var $productCollection Mage_Catalog_Model_Resource_Product_Collection */
    $select = $productCollection->getSelect();
    $select->join(
        ['ccpi' => Mage::getSingleton('core/resource')->getTableName('catalog/category_product_index')],
        'ccpi.product_id = e.entity_id',
        null
    );
 
    $cond = $productCollection->getConnection()->quoteInto(
        'ccpi2.category_id = ccpi.category_id
        AND ccpi2.is_parent = 1
        AND ccpi2.product_id = ?
        AND ccpi2.visibility IN (2,4)',
        $this->getRequest()->getQuery('id') 
    );
    $select->join(
        ['ccpi2' => Mage::getSingleton('core/resource')->getTableName('catalog/category_product_index')],
        $cond,
        null
    );
 
    foreach ($productCollection as $product) {
        echo $product->getSku() . '<br>';
    }
}

Now you might be thinking “how in the world did he come up with that!?”  It’s actually quite simple.  I knew I needed the data from catalog_category_product_index.  So what I needed to figure out was the query that would get me there.  So I eventually came up with the query:

SELECT cpe.sku, ccpi.category_id FROM catalog_product_entity AS cpe
INNER JOIN catalog_category_product_index AS ccpi
 ON ccpi.product_id = cpe.entity_id
INNER JOIN catalog_category_product_index AS ccpi2 
 ON ccpi2.category_id = ccpi.category_id
 AND ccpi2.is_parent = 1
 AND ccpi2.product_id = 1001
 AND ccpi2.visibility IN (2,4);

From there I constructed the Zend_Db_Select statement incrementally, adding clauses until it looked kind of similar.

For this type of scenario with this size of dataset the benefit is not that much.  But as you add complexity, dataset size and server load, the drawbacks of the previous method become noticeably worse.

Does this mean that you can build out your own indexes and provide the same level of benefit in your custom code with custom logic?  Yes!  But not from this blog post.  :-)

(New(ish)) Indexing in Magento or “The wonderful world of materialized views”

The Wonderful World of Materialized Views

(or, Why Indexing in Magento 1.13 Is So Much Faster)

When Magento Enterprise 1.13 came out it boasted a massive improvement in index operation performance.  On the Magento blog introducing 1.13 (http://www.magentocommerce.com/blog/comments/magento-enterprise-edition-113/) a simple statement was made: “The introduction of incremental indexing reduces the need to perform a full re-index and most indexing operations are now automated”.  While this statement is true it provides a very simplified explanation of what has changed and why the improvements in performance are so great.

Materialized Views

Part of the problem with index performance was simply due to what the indexes in Magento are.  They are not indexes like a BTree or hash; they act more like a pre-calculation table.  In SQL parlance this is often known as a Materialized View.  It is similar to a regular view in that it contains the results of a query, but unlike a regular view, which reflects changes to the underlying rows as they happen, a Materialized View does not.

MySQL does not support Materialized Views and so Magento implemented a sort of pseudo-Materialized View.  Prior to 1.13 Magento implemented functionality that was like a Materialized View but did not track changes in products and so it was not aware of which items needed to be updated.  So a re-index was always a full re-index.  So it was a kind of pseudo-pseudo-Materialized View.

The effect of this can be seen in even very cursory, top-level views.  In the following example one product was loaded, the inventory changed, the product saved, the indexes loaded and a re-index was done all via the administrative UI.

One of the changes has been the inclusion of a new module called Enterprise_Index.  If one were to start looking here for the answer they would only be partially right and really only if they looked in the setup script.  The setup script defines a number of different events.  But these events are different from the typical event/observer paradigm that most developers are familiar with.

The events in the Enterprise_Index setup are actually triggers in MySQL on individual tables that respond to various modifications to individual rows.  These triggers are used to completely invalidate an entire index.  As such there are only a few operations that actually trigger this, most of which revolve around store, attribute or certain configuration modifications.  And automated index invalidation does not explain why Enterprise 1.13 made such large strides when it comes to re-indexing.

But it gives us a starting place to look.

Partial Indexing

A big part of the performance gains comes from the use of partial re-indexing.  Partially re-indexing small changes is never much of a problem: just modify the items that were changed and apply the same logic that you use for larger changes.

But how does one determine which items have changed?  That can be a huge problem.  One option would be to set an event observer for “model_save_after”.  But that would result in a lot of overhead added to the execution flow.  Plus it wouldn’t account for a model class that overrides the _afterSave() method and forgets to call the parent.  So determining when a model has changed is something that cannot be done reliably in the application code.  There are too many moving parts that need to be taken into account.

To get around these and other issues, triggers are used to note changes.  But unlike the triggers in the Enterprise_Index setup, which simply invalidate a full index, these triggers run on individual tables and simply watch for, and make note of, changes.

One can see how this is set up in the new Enterprise_Catalog module.  The installation script makes use of a model from a new module called Mview.  The Mview module is Magento’s implementation of a MySQL-based pseudo-Materialized View.

Configuring an individual Mview requires three distinct components: 1) the metadata, 2) a client, and 3) an action.  The metadata and client are managed by the Mview module.  The action is implemented by whichever module is utilizing the Mview.

This is done by telling Mview to subscribe to actions in a given table, which are referred to as events.  The events can be INSERT, UPDATE or DELETE, correlating to the related SQL actions.  Mview will set up a trigger on a provided table.  This trigger will take the provided column, most likely the primary key, and inject it into a changelog table that is unique for the table provided.  The changelog table is a two-column table: one column is a primary key referred to as the version_id and the other holds the value of the key provided.  Whenever a subscribed-to change occurs the provided value (again, usually the primary key) will be inserted into the changelog table and the version_id will be incremented via AUTO_INCREMENT.

An action can be run manually, but most of the time actions are automatically managed by the new Enterprise_Index module.  It adds a cron job which calls Enterprise_Index_Model_Observer::refreshIndex() on every cron run.  This method retrieves an Mview client and iterates over the index configuration nodes found under global/index/indexer.

Under this configuration node will be several unique children.  Under these children should be several nodes, with the ones important to the Enterprise_Index observer living under the node “action_model”.  Under this node are the Magento URNs for the classes that will handle the actual re-index process.  There are currently two nodes under this parent node: “changelog” and “all”.

<catalog_product_price>
   <model>enterprise_catalog/index_dummy_price</model>
   <action_model>
       <changelog>enterprise_catalog/index_action_product_price_refresh_changelog</changelog>
       <all>enterprise_catalog/index_action_product_price_refresh</all>
   </action_model>
   <index_table>catalog_product_index_price</index_table>
   <sort_order>40</sort_order>
</catalog_product_price>

Figure 2 – Example of Full and Partial Reindex Action Configurations

All re-indexing processes that are using the Materialized View should be initiated by the Mview client and should implement Enterprise_Mview_Model_Action_Interface.  The Mview client expects that any view reindexing will be done by a class that implements that interface, often with several sub-classes in between to handle some of the mundane details of managing the partial re-indexing functionality.

Prior to executing a full re-index the Mview client will check the enterprise_mview_metadata table to see if the currently requested index is actually dirty.  If it is not marked as dirty then the index will not be run.  As noted earlier, there are only certain conditions which will cause the table to be flagged and so just because a table row has changed does not mean that the index will be marked dirty, inducing a full re-index.

But while this partial indexing is often managed via cron it can also happen inline with a regular request.  There are several examples of a partial re-index being implemented on saving an item; product inventory being an example.  The product inventory indexer listens for changes in a product and will automatically do a partial re-index for that product when the product saves.

This, like most things in Magento, is a configurable option.  Most of the partial indexers will check a configuration prior to re-indexing.  In Enterprise 1.13 there is a new System Configuration section called “Index Management” which allows the administrator to control this behavior.

The effect of this is clear.  The actual number of queries for a save operation is significantly lower.

Using Materialized Views For Your Own Profit

But while the materialized view implementation has helped make the indexing functionality much more efficient, one of the real benefits of the new Mview module is that you can use it in a standardized way to manage your own indexing requirements.  It will help you implement module functionality when you have COUNT(*), GROUP BY or DISTINCT queries or when you need to pre-calculate values.  That’s not to say that you can’t manage it on your own, but by using the Mview module you can offload some of the responsibilities of managing these types of calculations to Magento.

In order to do that you need 3 elements:

  1. A setup script that creates your table subscriber
  2. An index table
  3. Indexer action models that deal with either the changelog or a complete reindex

Before we move on we should note that #3 has some caveats.  Generally if you are going to be working with the changelog you will not be configuring an index to run with the shell indexer as you might expect.  Instead, it will be run automatically through cron and managed by the Enterprise_Index_Model_Observer class, which will have its refreshIndex() method called each time cron is executed, just like it does in the Enterprise_Catalog module.  In other words, you will generally have no direct control via the administration UI or the CLI if you use this method.

For this example we are going to write an extension that will print a summary of the browsers that have viewed a given page in the footer of that page.  Granted, this is an extension that would not go on a production front-facing Magento implementation but it will demonstrate how you can integrate with the Mview extension to do calculations that you would not want to do during a normal request.  Additionally, for the purposes of clarity, we are going to omit things like Layout and Blocks and standard configuration options and focus on the Mview functionality.  This should not be a problem for most Magento developers, though the full code is available on GitHub.

For implementing this kind of functionality you might have considered creating a new table that contains a column for the URL, the User Agent and a count.  However, if you were to calculate the results on the fly you would be adding at minimum an extra write during regular load, and you would increase the possibility of locks or write contention in the database.  Those kinds of writes are generally something that should be avoided.  Not to mention that, in our scenario, the information we need is already being stored in the log_ tables, so we would be duplicating a write for no reason.

There is an additional reason that we would want to do this separately.  User Agent strings are long strings of mostly useless information (for our purposes, at least).  These strings have many different permutations which may or may not follow a specific pattern.  So we have one of two options: either we build the mechanism to normalize the user agents into something useful ourselves, or we use what someone else has done so we don’t have to maintain it.  The get_browser() functionality is a good fit for this, but it is a little on the slow side if you want accurate results.

So we are going to process those results offline as part of a regularly scheduled cron job that is already set up via Enterprise_Index.

To get started we need to set up a subscriber on the log_url table.  This is the table where each and every HTTP request is registered.  The Mview module allows your module to watch a given table for changes.  When a change occurs it is noted by the triggers that have been defined for that table, and the individual view has an internal version ID counter incremented.

When creating a subscription you will do it by telling the Mview client to watch a given column on a given table.  There is something interesting to note here: because of how the system tracks changes it is beneficial to make sure that you are watching a column that has a unique value.  Otherwise it is possible that your indexer may accidentally select data that has already been processed.  A good example of this is if you are watching a DATETIME column.  It is quite possible that two events occur in the same second that the indexer is run, and those rows would end up getting re-processed the next time the indexer runs.  When retrieving the changed values from Mview the indexer looks for a version ID in the enterprise_mview_metadata table which will be matched against the actual value in the changelog table.  If there is any possibility of duplication here then it is possible that your data will be skewed.  Simply watching the primary key of the table you’re interested in is by far the easiest way to accomplish this.

You will generally set these relationships up from within a setup script for the module that you are creating.  In the setup script we are creating here there are two tables that need to be created.  One is for the summary report data and the second is for mapping individual user agents onto their parent browser values.  For example, mapping “Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12 Snapper/1.0” onto “Firefox 10”.  For the sake of clarity we will not document those tables here, but you can see them in the GitHub repository for this module.

To create our materialized view and its subscription we are going to start with our setup script.  First we get the name of the log_url table, which is the table we’re going to watch.

$logTable = Mage::getModel('log/log')->getResource()->getTable('url_table');

Why not just type in “log_url”?  This goes to an important principle in software development: hardcode as little as you can.  What if the table name changes?  Then you need to go and figure out all of the places where you hardcoded the value.  This is why, in our next few lines of code, we are a little more verbose when determining the primary key.

$pkData = $this->_conn->getIndexList($logTable);
if (!isset($pkData['PRIMARY']['COLUMNS_LIST'][0])) {
     Mage::throwException('Unable to find log table primary key');
}
$logPrimaryKey = $pkData['PRIMARY']['COLUMNS_LIST'][0];

With these two variables defined we can now configure the Mview client and subscriber.

Mage::getModel('enterprise_mview/metadata')
    ->setTableName($logTable)
    ->setViewName($logTable)
    ->setKeyColumn($logPrimaryKey)
    ->setGroupCode('eschrade_useragent_report')
    ->setStatus(Enterprise_Mview_Model_Metadata::STATUS_INVALID)
    ->save();

This first step in the setup process involves retrieving an instance of the Mview metadata class which is responsible for managing the basic view information.  The class itself basically functions as an interface to the database, as can be seen from the skeleton-like definition of its resource and collection methods.  The metadata class, however, is used by the Mview client as a means of differentiating between various materialized views.  The client loads the metadata information prior to executing an individual indexer.

setTableName() is where we set the actual name of the table that is the source of our changed action log.  It is the table whose changes we are interested in when they occur.  It is the table that the triggers will get set up against when the subscription is created later on.  Only one materialized view metadata entry can watch the table, but multiple indexers can process it based off of configuration settings (which we will look at later on).

The setViewName() method is used to determine the source of changes that you are interested in noting.  While this value can be a different table from the one specified in setTableName() it will generally be the same and it must be unique.

setKeyColumn() is used to tell the Mview client which is the unique value that we want to store in our changelog.  Generally, this will be the primary key.

setGroupCode() is largely unused, but the value must be unique due to table index requirements.  It does, however, allow you to load the metadata by code, or to define the group code in your index class.

Setting the status to “invalid” via setStatus() is how you force an index to be completely rebuilt.  An index that is out of date is not invalid.  What happens during normal cron runs is that each materialized view that is configured will be run, but whether or not that index is invalid will determine which index class is called.  We will get into building those classes later on.  By setting the index status to “invalid” you will be forcing a full refresh of the index during the next cron run.

After the metadata has been created you will need to create an instance of the client associated with the view name to create the actual subscription.

$client = Mage::getModel('enterprise_mview/client');
 
/* @var $client Enterprise_Mview_Model_Client */
$client->init($logTable);

The value that is passed into the init() method is the name of the table that you are watching.  This initializes the client and loads the metadata that was previously saved.

The next thing you need to do in your setup is create the changelog table.  This is done by asking the client to do it for you by passing in a Magento URN to a class that will create the changelog.  Generally speaking you can use the base class to do this, which is what we do here.  The base class requires the table_name as a parameter.  Generally the changelog will be named {$tableName}_cl.

$client->execute('enterprise_mview/action_changelog_create', array(
    'table_name' => $logTable
));

Now that the changelog has been created we need to create the subscription.  The subscription class will create a trigger that watches for CUD operations and inserts the primary key value for the changed row into the changelog table.  For each change that is made to the watched table a row will be inserted into the changelog that contains the previously defined watch column along with a version ID.  This version ID corresponds to the version ID stored for this view in the enterprise_mview_metadata table.  So when you are creating an indexer later on, the way the indexer knows which rows to include is by selecting all of the rows in the changelog whose version ID is newer than the version ID recorded in the metadata table.

$client->execute('enterprise_mview/action_changelog_subscription_create', array(
    'target_table' => $logTable,
    'target_column' => $logPrimaryKey
));
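
To make that last point concrete, the query that a changelog-based indexer later builds is roughly equivalent to the sketch below.  This is illustrative only: $connection and $metadata stand in for the DB adapter and the metadata model, and in practice the Enterprise_Mview_Model_Action_Mview_Refresh_Changelog base class assembles this for you via _selectChangedRows().

// Select every key recorded in the changelog since the version ID stored for this
// view in enterprise_mview_metadata (the exact boundary handling is the base class's job)
$changelogTable = $logTable . '_cl';
$lastVersionId  = (int) $metadata->getVersionId();

$select = $connection->select()
    ->from($changelogTable, array($logPrimaryKey))
    ->where('version_id > ?', $lastVersionId);

$changedIds = $connection->fetchCol($select);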

At this point we have Magento configured to watch for changes on a specified table and note row changes in the database.  Now we need to do something with them.

In order to make this example useful we need to define at least two classes (Eschrade_UserAgent_Model_Index_Ua_Changed and Eschrade_UserAgent_Model_Index_Ua_All) to interact with the Mview components.  Those two classes are there to either build the index partially or to build the index fully.  While two classes generally should be defined, the nature of the indexing operation is similar between the two; the primary difference will be the size of the data set.

Most of the time the base indexer class can be built by extending Enterprise_Mview_Model_Action_Mview_Refresh_Changelog.  While there are classes that could be used to define each indexer individually, that might end up being more work than is necessary.  In this case we will build a base class for refreshing the view, but provide some simple hooks that will allow the full re-indexer to easily do its job at the same time.

In the example here there are several methods that we are going to describe in words instead of code since they are either helper functions or are used to modify internal SQL queries to get what we’re looking for.  If you would like to see the full code you can find it on GitHub at https://github.com/kschroeder/Magento-Enterprise-Materialized-Views-Example/blob/master/app/code/local/Eschrade/UserAgent/Model/Index/Ua/Changed.php.

class Eschrade_UserAgent_Model_Index_Ua_Changed extends Enterprise_Mview_Model_Action_Mview_Refresh_Changelog
{
    public function execute()
    {
        try {
            $this->_connection->beginTransaction();
            $this->preIndexHook();
            $this->populateUserAgentTable();
            $versionId = $this->_selectLastVersionId();
            $this->processIndex();
            $this->_metadata->setValidStatus()
                ->setVersionId($versionId)
                ->save();
            $this->_connection->commit();
        } catch (Exception $e) {
            $this->_connection->rollBack();
            $this->_metadata->setInvalidStatus()->save();
            throw $e;
        }
        return $this;
    }
}

This class will, by default, only process changed items.  But when we look at the class we’ve defined for processing all of the user agent results you will see that there is very little that we need to do.

The first thing we do is start a transaction.  Strictly speaking we don’t need to, but it is often a good idea to wrap this functionality in a transaction to handle cases where an error might corrupt the index or cause data to be processed multiple times.

The preIndexHook() method allows any subclasses to execute any logic prior to actually calculating the index.  This is primarily used to allow the full re-index class to truncate the data.

Next up we call a method called populateUserAgentTable().  This is a helper method that does a LEFT JOIN on the log_visitor_info table and a table that is used to store the friendly names of the user agents.  If there are a large number of user agents that have not been processed, this method can actually take several seconds to execute (full version on GitHub). This is done so that MySQL can do some processing for us later on rather than doing it in our code.
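
Below is a rough sketch of the kind of thing populateUserAgentTable() does.  The map table name (eschrade_useragent/agentmap) and its columns are illustrative stand-ins rather than the actual code from the repository; the point is that unmapped user agents get normalized offline, where get_browser()’s slowness doesn’t matter.

protected function populateUserAgentTable()
{
    $visitorTable = Mage::getSingleton('core/resource')->getTableName('log/visitor_info');
    $mapTable     = Mage::getModel('eschrade_useragent/agentmap')->getResource()->getMainTable();

    // Find user agents in the visitor log that have no friendly name mapped yet
    $select = $this->_connection->select()
        ->from(array('lvi' => $visitorTable), array('http_user_agent'))
        ->joinLeft(
            array('map' => $mapTable),
            'map.user_agent = lvi.http_user_agent',
            null
        )
        ->where('map.user_agent IS NULL')
        ->group('lvi.http_user_agent');

    foreach ($this->_connection->fetchCol($select) as $userAgent) {
        // get_browser() is slow, which is exactly why this runs offline via cron
        $info     = get_browser($userAgent);
        $friendly = $info ? $info->parent : 'Unknown';

        $this->_connection->insertOnDuplicate($mapTable, array(
            'user_agent'            => $userAgent,
            'normalized_user_agent' => $friendly
        ));
    }
}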

At this point we get the most current version ID from the changelog to use later when setting the last processed version ID.  This could be done at the end of the process, but there would be more of a possibility of a version ID being skipped if we waited until the end.  Perhaps a better option would be to check the version IDs actually pulled from the changelog and set the highest one.  But for clarity, and because the elapsed time between when this query is made and when the unprocessed IDs are retrieved will be very low, we do it here.

Next up is the processIndex() method.  This is where the magic is implemented.

public function processIndex()
{
    $indexTable = Mage::getModel('eschrade_useragent/agentreport')->getResource()->getMainTable();
    $select = $this->_selectChangedRows();
    $this->joinSelectWithUrlAndVisitor($select);
    $stmt = $select->query();

    while (($row = $stmt->fetch()) !== false) {
        $this->_connection->insertOnDuplicate(
            $indexTable,
            array(
                'page' => $row['url'],
                'agent' => $row['normalized_user_agent'],
                'request_count' => $row['page_count']
            ),
            array(
                'request_count' => new Zend_Db_Expr('request_count + VALUES(request_count)')
            )
        );
    }
}

The first thing we need is the table that we are going to be inserting the values on.  That would be the main agentreport table.  Next up we retrieve the SELECT object that is pre-configured to retrieve only the primary keys of rows that have changed.  This is defined in the Enterprise_Mview_Model_Action_Mview_Refresh_Changelog class.  We make a small modification to the method to deal with some of the complex JOINs we do later, but it is essentially the same.

After we have retrieved the pre-configured SELECT object (which is directly pertinent to the Mview discussion) we modify it to suit our needs (which is not pertinent to our example).  That method is custom to this example and does a number of JOINs on various tables to pull the data we need into one result set.  This data is then inserted with an ON DUPLICATE clause that will add any new page/user agent counts to existing ones, or insert them if no counts exist.

At this point it is worth noting that using Mview forces us to do things in a slightly less efficient manner.  Generally, when you are building a Materialized View the goal is to do as much via INSERT ... SELECT statements as possible.  This can greatly cut down on the possibility of hitting memory limits in PHP and also allows MySQL to do the heavy lifting for you.  A good rule of thumb is that you want to work as close to the data as possible.
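
For contrast, a full rebuild with no per-row arithmetic could be pushed almost entirely into MySQL along these lines (purely illustrative and not part of the example module; $connection is assumed to be the Magento DB adapter and the source table/columns are made up):

// Let MySQL aggregate and populate the index table in one statement
$select = $connection->select()
    ->from(
        array('src' => $sourceTable),
        array(
            'page'          => 'src.url',
            'agent'         => 'src.normalized_user_agent',
            'request_count' => new Zend_Db_Expr('COUNT(*)')
        )
    )
    ->group(array('src.url', 'src.normalized_user_agent'));

$connection->query(
    $select->insertFromSelect($indexTable, array('page', 'agent', 'request_count'))
);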

But because we are doing partial changes, and because those partial changes may already have some previously-calculated data in the table, we are not able to do that here.  This is actually a limitation in the database adapter: fields defined with a Zend_Db_Expr (which is how you would express the addition calculation for duplicate entries) are not handled in a way that we can use.  The problem is that when you specify a column, the expression you use for the field is also the expression used for the ON DUPLICATE clause.  Because we are possibly updating existing rows with an addition expression we are not able to use the same field value in the column clause as we do in the ON DUPLICATE clause.  Therefore, in this case, we need to process the rows manually.

This is why we first generate a result set and iterate over them calling insertOnDuplicate().  insertOnDuplicate() allows us to distinctly define columns and ON DUPLICATE clauses, which we can’t do via insertFromSelect().

At this point when we run the index nothing will happen.

That is because we have not defined the class that will (re)process the entire view.  Remember back to our setup script where we set the view as invalid?  That will result in a full re-index.

So let’s take a look at that class in its entirety.

class Eschrade_UserAgent_Model_Index_Ua_All extends Eschrade_UserAgent_Model_Index_Ua_Changed
{
    public function preIndexHook()
    {
        $this->_connection->truncateTable(
            $this->getReportResource()->getTable('agentreport')
        );
    }
 
    protected function joinSelectWithUrlAndVisitor(Varien_Db_Select $select)
    {
        $select->reset(Varien_Db_Select::WHERE);
        parent::joinSelectWithUrlAndVisitor($select);
    }
 
}

Because of the way we built out Eschrade_UserAgent_Model_Index_Ua_Changed we only need to do two things: 1) truncate the report table, and 2) remove the version_id condition from the WHERE clause.

And with that our classes are done.

But our indexers will still not run.  That is because we have not defined our indexes in the configuration.  Our indexers will be added under the /config/global node in our config.xml file.

<index>
    <indexer>
        <eschrade_useragent_report>
            <model>eschrade_useragent/index_ua_dummy</model>
            <action_model>
                <changelog>eschrade_useragent/index_ua_changed</changelog>
                <all>eschrade_useragent/index_ua_all</all>
            </action_model>
            <index_table>log_url</index_table>
        </eschrade_useragent_report>
    </indexer>
</index>

The first thing you might notice is that there is a dummy class defined under the <model> tag.  This node is typically used by the index CLI script to process the individual indexes.  However, because Mview based indexers should be run via cron we define a simple class that renders the name and the description for the admin UI but does nothing else.  The main indexer, which is managed by the new Enterprise_Index module, uses the same configuration nodes as Mage_Index, but uses the <action_model> and <index_table> nodes instead.

When the Mview index manager is run it will check the status column in the mview metadata table for indexes that are invalid.  For each row whose status is “invalid” the table_name will be used to match up which indexer configuration should be used.  The main indexer will check for a URN under <action_model><all> and run that indexer.  This is the first part of the indexing process.

The second part of the indexing process is when Enterprise_Index iterates over all of the indexers defined in the merged configuration.  An index is ignored if either the <action_model> or the <index_table> is missing.  If a full re-index has already been run for a given indexer entry the changelog-based indexer will be omitted.  Beyond that caveat every index configuration that has an <action_model><changelog> and <index_table> node will be run.

Conclusion

While there are several places in the new Magento Enterprise Edition 1.13 that make indexing more efficient, there is perhaps no mechanism more important than the implementation of Materialized Views.  It allows for a standardized method of building summary/calculation tables in a predictable manner.  And rather than being an implementation intended strictly for the new version of Magento Enterprise, the Mview module allows developers to utilize the same functionality that is largely responsible for some of the performance improvements noted for EE 1.13.

Realtime logging for Magento

There are a couple of things that animate me in the PHP/Magento world.  The first is queuing.  I’ve written and spoken about that several times and do as much with it as possible.  The second is something that I’ve not written about nor done a lot of work in.

One of the other things that I have had in the back of my mind for the past several years is a cloud-based, completely integrated development environment.  Personally, I think that the development tooling in the PHP realm is still very young and basically comes down to an IDE.  Debuggers are still not used as much as they should be.  var_dump() is our friend, right?

Ever since the Zend Developer Cloud started up it stirred in me some really interesting possibilities of what could be done.  Sadly they never happened, but the ideas remained.  If that doesn’t inspire you it’s because I’m not describing what I have in my head.  I don’t have the time to do that.

But this blog post is one part of it.  One of the things that is part of what I envision is a realtime logger that shows what is happening, as it is happening.  There is good software out there that shows what happened during a request, but it is always after the fact.  The Zend Server ZRay feature is a good example.  It’s a cool feature but doesn’t quite go as far as I have in mind.

With that in mind I spent some time last week working on a very small piece of this vision for Magento.  I wrote it really, really quickly, so don’t laugh when you look at the code and see obvious errors.  One error is the mechanism that I intended to use to shut down logging when the watcher is not running; sadly the pcntl code does not seem to do what I expect it to, which is odd because I copied and pasted the code from the documentation.

It is a combination of a Magento extension and a command line program that utilizes Redis PubSub for message passing.  It

  • Hooks in to every Magento event and provides instrumentation for that
  • Contains a log writer so that if an exception is logged or some message is logged it will pass it through to the watcher (no running through var/report again)
  • Contains an SQL profiler that will provide any SQL queries in real time or a summary at the end of the request.

The logger in Magento is controlled by the watcher, which sends logging commands when it is started up.  Following is the help menu.

[kschroeder@devvm1 bin]$ php watcher.php --help
This script is used to connect to the front end of the Eschrade_PubSubLogger utility.
Note that if you are using the File cache backend you may need to clear it from /tmp if the user you are running this script under does not have permission in {magento}/var/cache.
Usage: watcher.php [ options ]
--all-events Listen for all events
--enable-blocks Enable block events (overrides all-events)
--enable-models Enable model events (overrides all-events)
--events <string> Comma-separated list of events to listen for (overrides any settings)
--resolve-classes Look up the class based off of group class names
--hide-event-data Hides the event data when events are turned on
--hide-event-observers Hides the event observers when events are turned on
--delay-messages <integer> Delay messages by X milliseconds (default 75ms) - Makes it easier to follow the log
--no-color Turns off color coding of elapsed times
--show-request-id Show the unique ID for each request
--show-sql [ <string> ] Show the SQL in real time. Optionally add a preg_match() filter
--show-sql-summary Show the SQL summary at the end of the request
--help This help page

By default it just captures the URL, the request start time, the full action name and the elapsed time of the request.  Following are explanations for each option, with an example invocation after the list.

  • all-events – Turns on event monitoring.  It will provide the event name, all the observers for that event, and a summary of the data provided to the observers.  By default it turns off block and model events since they end up being noise most of the time.  This will only attach to configured events, not all events.  In other words, if an event has a configuration value in a config.xml it will be reported on.  An event that is triggered somewhere but never configured will not show up.
  • enable-blocks – Turns on block events
  • enable-models – Turns on model events
  • events – provide a comma separated list of the specific events you want to watch for.
  • resolve-classes – when examining event observers this flag will have it look up the class definition from the Magento config
  • hide-event-data – Turns off the event data.  For example, if you are listening to controller_action_predispatch you will not see the controller action class that is passed to the observer
  • hide-event-observers – Turns off printing the observers.  When done in conjunction with hide-event-data this will simply print out a list of all the observers as they are called with their elapsed execution time from the last event
  • delay-messages – One of the things I noticed was that a lot of information comes in very quickly.  So I added a 75ms delay to each message.  Set to zero if you want no delay.
  • no-color – If you are running it with event introspection it will color code elapsed time in events.  However those escape codes look like gibberish if you are piping to a file.  This turns that off
  • show-request-id – The Magento extension automatically adds a unique ID to each request so you can track the source of logs or events.  However, the watcher does not turn this on by default.  This will turn it on.
  • show-sql – Will show the SQL and its elapsed execution time in realtime as it happens.
  • show-sql-summary – Shows a count of SQL operations, SELECT, INSERT, etc. at the end of a request.
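
For example, to watch a couple of specific events, show each request’s unique ID and stream SQL as it runs, the invocation might look something like this (flags taken from the help output above):

[kschroeder@devvm1 bin]$ php watcher.php --events controller_action_predispatch,controller_action_postdispatch --show-request-id --show-sql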

Download it at github.

3 things to do that can make you completely successful

We all have important things to do but some things are more important than others.  Using these 3 strategies you will find that success comes easier.

  1. Find something you love to do – it’s easier to do something you love
  2. Do things for other people – blah blah blah
  3. blah blah blah

Things that are worth pursuing cannot be accomplished by reading stupid lists, nor by writing them.  What they really are is pointless marketing.

Worthwhile things are hard and require internal fortitude, not external well-wishing.  In other words,

  1. Work on your internal fortitude
  2. Do something you hate
  3. Work with difficult people

Force yourself to do that and you will come out stronger.  Stop reading those stupid lists.

Wait a sec…

Why I’m not for Net Neutrality

I’m not necessarily against it either, which makes that one of the more passive-aggressive titles I’ve written in a while (that, or wildly successful link-bait).  But while I’m not really for or against it, I tend to be wary of calls for more government control of something.  But truth be told, I almost did come out in favor of Net Neutrality when I saw this graphic over at Re/Code.

[Chart from Re/Code: Netflix/Comcast complaint calls]

Being an avid user of Netflix and Amazon Prime Instant Video this graphic kinda pisses me off because I have been experiencing poor streaming speeds for a while.  I pay for more than enough bandwidth to cover my streaming needs.  Since I pay for it why shouldn’t I get it?  That’s essentially the crux of the complaint.

But here is the chart that shows the other side of the issue.

[Chart: Sandvine data on streaming services’ share of Internet traffic]

And so here is why I’m not quite on board with Net Neutrality.

Nothing.
In.
Life.
Is.
Free.

There is always a cost.

The only thing in your life that does not have a monetary cost directly tied to it is the air that you breathe.  Unless you are living naked at the mouth of a natural stream in the mountains EVERYTHING ELSE has a price.  The clothes you wear, the food you eat, the car you drive, etc.  It’s just a question of who will pay for it.  Up until I became a grown-up my parents paid for a lot of that.  But as a grown-up I have had to learn that I need to pay for the things I use.  The same thing goes for Internet service.  That 50% of traffic that is going to streaming services?  Someone is going to pay for it.  Hint: it won’t be the broadband providers.  Why not?  It’s quite simple.  They don’t have the legal authority to print money.  They have revenue and they have costs.  If things are working as they should the revenue will exceed the costs.

But what happens if costs start exceeding revenue, particularly if increased costs are expected to be long term?  There are two general options.  1) Increase income to cover costs, or 2) decrease costs to cover lost income.  Granted there will be a lot of variables in there but at the end of the day it will be one, the other or both.

While it may sound like I’m on the side of the cable companies, I’m not.  I’m not particularly satisfied with my Internet service but I am aware of the economic realities of it.  The US in particular has some unique challenges, geography being one of them.  Things that work a certain way in other countries may not work here.  Things that work here may not work in other countries.  Things that work on the East Coast may not work in the South, where I live.  And things that work for someone who watches Netflix for 8 hours a day will not be the same as for someone who uses their Internet connection for getting pictures of the grand-kids.

We will be incredibly short-sighted if we just say “don’t throttle my Internet connection because it annoys me.”  SOMEONE is going to be paying the costs of an un-throttled Internet connection, and with something like Net Neutrality my concern is that the costs will disproportionately go to the people who are not responsible for the bandwidth usage.  It so often seems that as soon as someone says “someone needs to do something about X” everyone gets thrown in the boat whether they want to be in the boat or not.

Is it greed on the part of the cable companies?  Perhaps.  But no more so than Netflix, me or you, the person reading this.  But we need to be cautious before we start waving the magic wand of government to make our lives easier.  Like magic, everything comes with a cost, deary.  Before we wave the magic wand of legislation and regulation we need to make sure we know what the costs are going to be when we get what we ask for.

So before picking up a magic wand of your own, it is worthwhile to think through what the effect of it will be.  Buffering-free episodes of Once Upon a Time will be only *one* of the effects.

Testing Magento with random URLs in JMeter

One of the things that I think is important when testing a web server is to introduce some level of randomness.  In the wild you have almost no control over what customers are doing on your web site, and adding some randomness might help you gather additional data, particularly if you are doing it under load.

With JMeter this is actually fairly easy.  You have the ability to modify HTTP requests based off of variable values.  In other words, you can have your URL endpoint be ${url}.  But how do you get that random data into JMeter in the first place?  I think one of the easiest ways to do it is by having a really small script that gets a random URL from core_url_rewrite.  It would look something like this.

<?php
$pdo = new PDO('mysql:dbname=magento_ee', 'root', '');
$sql = 'SELECT COUNT(url_rewrite_id) FROM core_url_rewrite';
$sth = $pdo->prepare($sql);
$sth->execute();
$count = $sth->fetchColumn(0);
$row = rand(0, $count - 1); // the offset must be less than the total row count

$sql = 'SELECT request_path FROM core_url_rewrite LIMIT 1 OFFSET ' . $row;
 
$sth = $pdo->prepare($sql);
$sth->execute();
$url = $sth->fetchColumn(0);
 
echo $url;

Then you would need a JMeter script that would look something like this (Eschrade Random URL JMX file).

 
