Asserting page load times in Magento with Magium


Validating SLAs (Service Level Agreements) is often part of executing browser tests. With Magium you can now include page timings as part of your test. Say, for example, that part of your SLA is that a page must be rendered in less than 5 seconds. This is done by using the TTFBLoadedLessThan, DOMPageLoadedLessThan or DOMContentLoadedLessThan …



Excluding fields in the mongodb/mongodb library


I am using the mongodb/mongodb library for a project of mine.  The API seems fairly different from the old PECL library and it also came with some other, albeit unexpected, baggage.

My understanding of the library is that it is intended to be a replacement for the existing PECL library and that it intends to strike a balance between core library functionality and a more agile, for lack of a better term, release cycle.  What that means is that there is some basic MongoDB functionality written in C with the forward-facing API written in PHP.

In and of itself this is not an issue… except for the one thing I was trying to do.  One of the practices I’ve heard about Mongo is to get Mongo to do as much as it can, but not to worry too much about complicated joins and such as you would in SQL.  In other words, don’t shy away from bringing data into the application to do some processing.

That was the practice I followed, which worked fine up until my data size started to increase.  Then I ran into an issue where I was seeing 50% of my response time being eaten up by BSONDocument and BSONArray unserialization calls.  This was due to a sub-document I had in the document that could actually be quite large.

“Hmm,” I thought to myself.  “I’m sure glad I didn’t see this a month from now.”  I did some research and waffled a little between using MapReduce or using the newish aggregation features.  I opted for the aggregation feature, which is so incredibly powerful.  I see myself misusing this quite often.

But after changing the code to make it work with aggregation I still had the problem of the unserialization.  The documents are returned as a whole.  In the core functionality you can exclude individual fields by doing something like this:

db.inventory.find(
   { type: 'food', _id: 3 },
   { "classification.category": 0}
)

The second parameter ({ "classification.category": 0 }) tells Mongo to omit that field from the result set.  But I couldn’t find a way to mimic this behavior in the new library.  So I opened an issue on GitHub (when you run into a problem, contact the library maintainers or post on SO.  Don’t just leave it) asking how I might run this.

Jeremy Mikola responded with the solution.  There is a second option that you can pass in that is documented, but I don’t think it is directly clear from the documentation exactly what it does.  Or, at least, the description isn’t very SEO friendly (Google-Based-Development FTW!).

How does that look?

Like this:

$find = [
    'customer_id' => $this->getId(),
    'invoked_test' => $testName
];
$options = [
    'sort' => ['created_at' => -1],
    'limit' => 10,
    'projection' => ['events' => 0]
];

$currentResults = $tests->find($find, $options);

BOOM! 7.5 seconds off the response wall clock time.
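
Worth noting: the projection option also works the other way around.  If you only need a handful of fields you can include them explicitly rather than excluding the heavy ones.  A quick sketch, reusing the field names from the example above (standard MongoDB projection behavior; the exact fields you include are up to you):

$options = [
    'sort'  => ['created_at' => -1],
    'limit' => 10,
    // 1 = include only these fields; _id comes back unless explicitly excluded
    'projection' => ['invoked_test' => 1, 'created_at' => 1]
];

$currentResults = $tests->find($find, $options);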


How to (properly) harness the Magento EE Full Page Cache (part 2 – application states)


In part 1 we took a look at some of the basics of the Magento EE FPC.  We talked about the general structure and flow of the FPC.

In this article we are going to take a look at the application states in the FPC.  I don’t know if “states” is the official term, but it’s what I use to describe what they are.

There are, essentially, 4 states that the Full Page Cache can be in.

Completely cached

All containers can be rendered by the processor.  This is the best state to be in.  No DB connection is required.  The only infrastructure required is a connection to the cache.  Each container will pull content from the cache and inject its content in the page.

Completely cached, with processing

Most containers will pull their content from the cache, but some might render content inline.  This is done by overriding the applyWithoutApp() method.  We’ll take a look at this in a future article.
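
As a rough sketch of what that override can look like (the class name, cookie and markup are made up for illustration; the method signature follows the abstract container, Enterprise_PageCache_Model_Container_Abstract):

// Hypothetical container that renders its content inline during the
// request-processor phase, keeping the page in the "completely cached,
// with processing" state.
class My_Module_Model_Container_Greeting extends Enterprise_PageCache_Model_Container_Abstract
{
    public function applyWithoutApp(&$content)
    {
        // Build something cheap (here, from a cookie) instead of pulling a
        // pre-rendered block from the cache or bootstrapping the application.
        $name = isset($_COOKIE['customer_name'])
            ? htmlspecialchars($_COOKIE['customer_name'])
            : 'Guest';
        // Replace this container's placeholder in the cached page content.
        $this->_applyToContent($content, 'Welcome back, ' . $name);
        return true; // true tells the FPC this container needs no bootstrap
    }
}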

Partially bootstrapped without blocks

This state is when the applyWithoutApp() method returns false.  The FPC will first call that method.  If ANY of the containers returns false the FPC will render as many containers as it can but keep track of the ones that did not.  Then it will bootstrap the request and forward it to the FPC’s request controller, Enterprise_PageCache_RequestController.  The processAction() method calls each un-processed container’s applyInApp() method.

Partially bootstrapped without layout

The difference between the two partially bootstrapped states depends on whether the individual container overrides the applyInApp() method.  The abstract container will try to render a block when its applyInApp() method is called.  So if you do not want to render the blocks (they will likely have recursion and hierarchy to them) you will need to override the applyInApp() method for that container so it does not automatically try to render its block.
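
A hedged sketch of such an override (the class name and the data being rendered are invented; only the method and hook names come from the FPC abstract container):

// Hypothetical container that overrides applyInApp() so the FPC does not try
// to instantiate and render the underlying block once the app is bootstrapped.
class My_Module_Model_Container_Counter extends Enterprise_PageCache_Model_Container_Abstract
{
    public function applyInApp(&$content)
    {
        // Produce the output directly from session data rather than letting
        // the abstract container render the block (and its child blocks).
        $count = (int) Mage::getSingleton('core/session')->getData('view_count');
        $this->_applyToContent($content, '<span class="view-count">' . $count . '</span>');
        return true;
    }
}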

All of the different states can also be categorized by how they are processed.  The following diagram splits them up.

[Diagram: Magento EE FPC processing types]

So which should you use?  Well, that’s the wrong question to ask.  The right question is “which should I build for?”  And for that question the answer is obvious.  Build to make your application use the request processor as much as possible.  Why?  Take a look at the code.  Here is Mage_Core_Model_App::run().

public function run($params)
{
  $options = isset($params['options']) ? $params['options'] : array();
  $this->baseInit($options);
  Mage::register('application_params', $params);
 
  if ($this->_cache->processRequest()) {
    $this->getResponse()->sendResponse();
  } else {
    $this->_initModules();
    $this->loadAreaPart(Mage_Core_Model_App_Area::AREA_GLOBAL, Mage_Core_Model_App_Area::PART_EVENTS);
 
    if ($this->_config->isLocalConfigLoaded()) {
      $scopeCode = isset($params['scope_code']) ? $params['scope_code'] : '';
      $scopeType = isset($params['scope_type']) ? $params['scope_type'] : 'store';
      $this->_initCurrentStore($scopeCode, $scopeType);
      $this->_initRequest();
      Mage_Core_Model_Resource_Setup::applyAllDataUpdates();
    }
 
    $this->getFrontController()->dispatch();
  }
  return $this;
}

The full page cache is executed in the line that says processRequest().  If that method returns true, meaning that all of the containers were able to process their content, then the response object is called and the request is essentially done.  No routing, no controllers, and, importantly, no layout.  All of those are integral to executing a Magento request, but they do come at a cost.  If you can bypass that cost as often as possible your application will be much, much faster.

How much faster?  Consider the response times of this quick load test on my very, very under-powered virtual machine.

[Chart: cached vs. uncached response times]


How to (properly) harness Magento EE’s Full Page Cache (part 1)


TL;DR

  1. The Full Page Cache is one of the most important performance features in Magento EE and very few people know how to use it
  2. Containers control your content
  3. Processors manage containers
  4. Knowing containers and processors gets you 90% of the way to solving #1

With my work with ECG one of the more common things I see is performance problems with web sites.  Most often it is due to problems that developers have caused with how they interact with the database.  However, that represents the problems I see.  How about the things that people could do better?  A big issue that a lot of Enterprise Magento merchants have is that their full page cache is underutilized.

You may be saying to yourself “Why don’t I just use Varnish?  It’s loads faster anyway.”  You can… as part of an overall caching strategy.  Varnish is really good for static content.  But what happens when someone adds an item to the shopping cart?  You are now running fully bootstrapped and laid-out Magento requests.  That, or you’re now using Ajax and cookies to do things to your layout to make it appear like the shopping cart is being rendered.  But it doesn’t just have to be the shopping cart.  Anything that needs some level of unique display is a potential cache invalidation point.

What is unique about the Magento Enterprise Full Page Cache is that you get the best of both worlds.  You get cached pages AND you get non-static content.  Varnish is still going to be faster, but now the difference is 0.5ms vs 20ms instead of 0.5ms vs 2000ms.  In many cases, a properly interfaced EE FPC negates the need for Varnish.

Following is a diagram that roughly corresponds to how the FPC works and where it hooks into the process.

[Diagram: FPC request order]

There are two places where the FPC hooks in.  The first is the request processors.  This is at the front of the request.  It is done immediately after the app/etc/*.xml files have been loaded.  This is because that is where the cache configuration exists.  You can’t use a cache without cache configuration.  The FPC is a processor.  The processor (Enterprise_PageCache_Model_Processor) is responsible for loading the cache metadata and initial cache content.  It then delegates the processing of the cached content to a sub-processor.  A sub-processor is in charge of retrieving cached content and merging content into the correct place.  There are three types of sub-processors:

  1. Default (typically used by the CMS)
  2. Category
  3. Product

For the most part you do not need to worry about the sub-processors. The real magic occurs in the containers.  The containers are responsible for caching and rendering block output.  The caching component injects itself into the block output and the rendering injects itself into the sub-processor. Containers are the glue between the cached and uncached content.

A container “injects” itself at certain places in the code and it does this by hooking in to the core_block_abstract_to_html_after event.  This happens when the layout is being rendered.  If there is a container defined for the block being executed then the FPC will do two things.

  • Extract the block’s HTML and save it to the cache
  • Surround the HTML with a placeholder tag.

The placeholder tag is used on subsequent requests to find cached output for certain containers when rendering the page from the cache. We’ll take a look at the structure of a full page cached page in a future article. But next up will be the states that the full page cache can be in.
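
To give a sense of how a container gets attached to a block in the first place, here is a hedged sketch of the kind of entry that goes in a module’s etc/cache.xml (the module, block alias, placeholder name and container class are all hypothetical; the node structure follows the pattern Enterprise_PageCache uses for its own placeholders, so treat it as a sketch rather than a definitive reference):

<config>
    <placeholders>
        <my_module_greeting>
            <block>my_module/greeting</block>
            <placeholder>MY_MODULE_GREETING</placeholder>
            <container>My_Module_Model_Container_Greeting</container>
            <cache_lifetime>86400</cache_lifetime>
        </my_module_greeting>
    </placeholders>
</config>

When the block is rendered the FPC caches its output under that placeholder; on later requests the named container decides how the placeholder gets filled.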


Be wary of functions in SQL in Magento, such as MAX(). There may be a better way.


I just wrapped up a Healthcheck for a Magento ECG customer and I ran into an interesting issue.  IIRC, the problem wasn’t a client one but rather code that a community extension used.  I won’t mention which one.

This extension was generating a query that was looking for the maximum minimum price for all products in a category.  The resulting query was something like this.

SELECT MAX(min_price) FROM catalog_product_index_price AS cpip
INNER JOIN catalog_category_product_index AS ccpi
    ON ccpi.product_id = cpip.entity_id
WHERE ccpi.category_id = 2;

So, it was doing a lot of things right.  It was using the index tables, not the attribute tables.  It was also using indexes, so that was right too.  But take out the MAX() call and see what happens.  I’m running this on the Magento sample data set and it returns 1275 results.  And as we all know the sample data is not like most production systems.  The customer I was working with saw about 40k results on some queries and under load this was taking 20 seconds in some cases, though never less than 200ms.  The reason is that it’s essentially doing a table scan on the intermediate results to find the record with the highest value.

So how do we make this faster?

It’s super easy, and fun to do.

SELECT min_price FROM catalog_product_index_price AS cpip
INNER JOIN catalog_category_product_index AS ccpi
    ON ccpi.product_id = cpip.entity_id
WHERE ccpi.category_id = 2
ORDER BY min_price DESC
LIMIT 1;

Sorting by min_price and limiting to 1 gives you the same result with a much more efficient query.  On the client’s test system it went from a minimum of 200ms (possibly tens of seconds) to under 1ms.

If you are well versed in SQL this makes a lot of sense, and might have been what you did normally.  But it isn’t what this community extension did, and I suspect that, given that a lot of Magento developers are API-oriented, this is the kind of query many of them tend to write.
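
If you would rather stay in Magento’s database API than write raw SQL, the same ORDER BY/LIMIT approach can be expressed through the adapter.  A rough, untested sketch (the table aliases assume the standard catalog index tables used in the query above):

// Same idea as the query above, built with Magento's read adapter and
// Varien_Db_Select instead of hand-written SQL.
$resource = Mage::getSingleton('core/resource');
$read     = $resource->getConnection('core_read');

$select = $read->select()
    ->from(array('cpip' => $resource->getTableName('catalog/product_index_price')), array('min_price'))
    ->join(
        array('ccpi' => $resource->getTableName('catalog/category_product_index')),
        'ccpi.product_id = cpip.entity_id',
        array()
    )
    ->where('ccpi.category_id = ?', 2)
    ->order('min_price DESC')
    ->limit(1);

$maxMinPrice = $read->fetchOne($select);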


Load Testing the Magento Checkout with JMeter (and enjoying it)


Load Testing the Magento checkout is, in theory, difficult.  Well, not so much difficult as time consuming.  It is my personal opinion that most Magento load testing is done either poorly or incompletely.  Load testing read-only pages usually means hitting the same page over and over again.  This is counterproductive because it allows the system to cache re-used data and skew load testing results to the positive.  I have yet to see a load test that yields performance results that match production.

Testing the checkout is often even worse.  It is also the point in the load test that is most likely to change system behavior due to changing inventory, API calls, add-to-cart operations, etc.  It’s also hard to build a JMeter test for.  Most of the time orgs don’t have a JMeter expert on staff and so rely on the developers to build the load test.  Developers spend most of their time writing code, not load tests, and so the tests tend to not include some of the harder tests, such as checkout.  I don’t blame anyone for that; a dev’s core skill is not load testing.

To help with that I’ve written a basic checkout scenario for JMeter.  It uses the JMeter Plugins Extras Set.  The test uses the Guest Checkout and the product added to the cart is from the Magento sample data, so you will need to change that.  In addition you will need to change the HTTP Default host name, as shown below.

[Screenshot: HTTP Request Defaults host name setting]

You can download the JMeter script here.


Magento, HHVM and the Enterprise Full Page Cache


*** Personal note: It has been suggested that this may reflect an official Magento position.  It does not.  I like doing research and I like publishing what I find as long as I find it interesting and it isn’t tied to a customer.  So if you’re wondering if this reflects a change in Magento or some kind of future roadmap, it does not.  ***

I was recently asked by a client about Magento and HHVM support.  While the answer is “don’t touch it in production and there’s no guarantee of support” it did spur in me the insatiable desire to create charts.  I’m curious about it and I’m sure you are too.

One of the things I’ve been curious about is how much juice could we squeeze out of the HHVM/Magento combo?  In other words, how fast could Magento Enterprise 1.14 actually be?

My test setup is my crappy local VM.  8GB of RAM with 4x Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz.  I had intended to test with Redis (boy do I have a fascinating blog post coming up there!) but there seems to be an odd thing that happens in HHVM when reading from sockets that have 16k or more of buffered data.  Since HHVM uses FastCGI that meant that for a proper comparison I needed to use Nginx with PHP-FPM.  So I used Cm_Cache_Backend_File instead.  While my goal was to see if HHVM was faster than standard PHP for Magento I also wanted to see how many more requests per second I could squeeze out of it.

The answer to the latter question is: a whole shitload.

The test was simple: hit the front page of an FPC-enabled site with JMeter, ramping from 0 to 25 concurrent connections and back down again.  Here’s what happened.

PHP-FPM peaks at just under 200 requests per second

[Chart: Nginx + PHP-FPM requests per second]

HHVM peaks at, wait for it, over 300 requests per second.

[Chart: HHVM requests per second]

That is 300 freaking MAGENTO requests per second on a 4 CPU box!

Now, lest you think I’m suggesting you use HHVM, I’m not.  It was not easy getting it to work and I wouldn’t trust it with a full workload.  PHP 7 is also supposed to have similar improvements.  And this was a test of one page, to see what was possible with one piece of functionality.  But it’s an interesting indication of the improvements that are being made in the PHP ecosystem as a whole thanks to the introduction of HHVM.  Perhaps PHP 7 will render HHVM moot.  If that’s what happens, it’s fine by me.  I’m just glad that we are finally starting to see some forward momentum on addressing some of the performance issues that have been inherent in PHP for quite some time.  A little bit of creative destruction, now and then, is a good thing.


Reasons why Redis is a great option for a worker queue in Magento


Alan Kent asked on Twitter


You may know that I am partial to Redis as a worker queue for Magento.  Here are some reasons why.

  1. It is stupid simple to set up (meaning nubes and philes can both use it)
  2. It is blazing fast (its slow log is measured in microseconds)
  3. It does pubsub for one to many
  4. It does BLPOP for many to one
  5. It’s already supported with the Cm_Redis stuff

Things you do not get

  1. Durability
  2. Complex routing rules

When I worked on my last talk for Imagine I spent a week trying to get a fast, working implementation of ActiveMQ and Rabbit.  I looked at it and thought “this is way more complicated than it needs to be” and tried Redis.  What I spent a week working on was up and running in 4 hours, including a Java-based worker and PHP refactoring.

The HA portion is missing out of the box but I believe with Sentinel you can achieve it.  In some ways it’s like MySQL.  It’s beastly easy to get set up and running but if you need to do more you can.
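
To make the many-to-one point concrete, here is a minimal sketch of the BLPOP pattern using the phpredis extension (the queue name and payload are made up; Credis or Predis would look very similar):

// Producer: push a job onto a list (e.g. from an observer in Magento).
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->lPush('magento:jobs', json_encode(array('type' => 'resize_image', 'id' => 42)));

// Worker (one of many): block for up to 30 seconds waiting for the next job.
// BLPOP hands each job to exactly one worker, which is what makes the
// many-to-one pattern work without any extra coordination.
$job = $redis->blPop(array('magento:jobs'), 30);
if (!empty($job)) {
    $payload = json_decode($job[1], true);
    // ... do the work described by $payload ...
}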


(New(ish)) Indexing in Magento or “The wonderful world of materialized views”


The Wonderful World of Materialized Views

(or, Why Indexing in Magento 1.13 Is So Much Faster)

When Magento Enterprise 1.13 came out it boasted massive improvement in index operation performance.  On the Magento blog introducing 1.13 (http://www.magentocommerce.com/blog/comments/magento-enterprise-edition-113/) a simple statement was made: “The introduction of incremental indexing reduces the need to perform a full re-index and most indexing operations are now automated”.  While this statement is true it provides a very simplified explanation of what has changed and why the improvements in performance are so great.

Materialized Views

Part of the problem with index performance was simply due to what the indexes in Magento are.  They are not indexes like a BTree or hash but act more like a pre-calculation table.  In SQL parlance this is often known as a Materialized View.  It is similar to a regular view in that the table contains the results of a query, but unlike a regular view, which updates its contents when the underlying rows change, a Materialized View does not.

MySQL does not support Materialized Views and so Magento implemented a sort of pseudo-Materialized View.  Prior to 1.13 Magento implemented functionality that was like a Materialized View but did not track changes in products and so it was not aware of which items needed to be updated.  So a re-index was always a full re-index.  So it was a kind of pseudo-pseudo-Materialized View.

The effect of this can be seen in even very cursory, top-level views.  In the following example one product was loaded, the inventory changed, the product saved, the indexes loaded and a re-index was done all via the administrative UI.

One of the changes has been the inclusion of a new module called Enterprise_Index.  If one were to start looking here for the answer they would only be partially right and really only if they looked in the setup script.  The setup script defines a number of different events.  But these events are different from the typical event/observer paradigm that most developers are familiar with.

The events in the Enterprise_Index setup are actually triggers in MySQL on individual tables that respond to various modifications to individual rows.  These triggers are used to completely invalidate an entire index.  As such there are only a few operations that actually trigger this, most of which revolve around store, attribute or certain configuration modifications.  And automated index invalidation does not explain why Enterprise 1.13 made such large strides when it comes to re-indexing.

But it gives us a starting place to look.

Partial Indexing

A big part of the performance gains comes through the use of partial re-indexing.  Partially re-indexing small changes is never much of a problem.  Just modify the items that were changed and apply the same logic that you would use for larger changes.

But how does one determine which items have changed?  That can be a huge problem.  One option would be to set an event observer for “model_save_after”.  But that would result in a lot of overhead added to the execution flow.  Plus it wouldn’t take into account if a model class overrides the _afterSave() method and forgets to call the parent.  So reliably determining when a model has changed is something that cannot be done in the application code.  There are too many moving parts that need to be taken into account.

To get around these and other issues triggers are used to note changes.  But unlike the triggers in the Enterprise_Index setup, which simply invalidate a full index, these triggers run on individual tables and simply watch for, and make note of, changes.

One can see how this is set up in the new Enterprise_Catalog module.  The installation script makes use of a model from a new module called Mview.  The Mview module is Magento’s implementation of a MySQL-based pseudo-Materialized View.

To configure an individual Mview requires three distinct components: 1) the metadata, 2) a client, and 3) an action.  The metadata and client are managed by the Mview module.  The action is implemented by whichever module is utilizing the Mview.

This is done by telling Mview to subscribe to actions in a given table, which are referred to as events.  The events can be INSERT, UPDATE or DELETE, correlating to the related SQL actions.  Mview will set up a trigger on a provided table.  This trigger will take the provided column, most likely the primary key, and inject it into a changelog table that is unique for the table provided.  The changelog table is a two column table, one with a primary key that is referred to as the version_id and another that holds the value of the key provided.  Whenever a subscribed-to change occurs the provided value (again, usually the primary key) will be inserted into the changelog table and the version_id will be incremented via AUTO_INCREMENT.
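
Conceptually, what Mview creates for a subscribed table boils down to something like the following (a hedged sketch only; the real DDL is generated by Mview and the actual names will differ):

-- Changelog table for a watched table (Mview names these {table}_cl)
CREATE TABLE catalog_product_entity_cl (
    version_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    entity_id  INT UNSIGNED NOT NULL
);

-- One trigger per subscribed event (INSERT, UPDATE, DELETE); the UPDATE
-- trigger would look roughly like this:
CREATE TRIGGER catalog_product_entity_cl_after_update
AFTER UPDATE ON catalog_product_entity
FOR EACH ROW
    INSERT INTO catalog_product_entity_cl (entity_id) VALUES (NEW.entity_id);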

An action can be run manually, but most of the time they are automatically managed by the new Enterprise_Index module.  It adds a cron job that runs every time cron is executed.  The Enterprise_Index_Model_Observer::refreshIndex() method is called every time the cron job is run.  This method retrieves an Mview client and iterates over the index configuration nodes found under global/index/indexer.

Under this configuration node will be several unique children.  Under these children should be several nodes, with the ones important to the Enterprise_Index observer being under the node “action_model”.  Under this node are the Magento URNs to the class that is going to be handling the actual re-index process.  There are currently two nodes under this parent node: “changelog” and “all”.

<catalog_product_price>
   <model>enterprise_catalog/index_dummy_price</model>
   <action_model>
       <changelog>enterprise_catalog/index_action_product_price_refresh_changelog</changelog>
       <all>enterprise_catalog/index_action_product_price_refresh</all>
   </action_model>
   <index_table>catalog_product_index_price</index_table>
   <sort_order>40</sort_order>
</catalog_product_price>

Figure 2 – Example of Full and Partial Reindex Action Configurations

All re-indexing processes that are using the Materialized View should be initiated by the Mview client and should implement Enterprise_Mview_Model_Action_Interface.  The Mview client expects that any view reindexing will be done by a class that implements that interface, often with several sub-classes in between to handle some of the mundane details of managing the partial re-indexing functionality.

Prior to executing a full re-index the Mview client will check the enterprise_mview_metadata table to see if the currently requested index is actually dirty.  If it is not marked as dirty then the index will not be run.  As noted earlier, there are only certain conditions which will cause the table to be flagged and so just because a table row has changed does not mean that the index will be marked dirty, inducing a full re-index.

But while this partial indexing is often managed via cron it can also happen inline with a regular request.  There are several examples of a partial re-index being implemented on saving an item; product inventory being an example.  The product inventory indexer listens for changes in a product and will automatically do a partial re-index for that product when the product saves.

This, like most things in Magento, is a configurable option.  Most of the partial indexers will check a configuration prior to re-indexing.  In Enterprise 1.13 there is a new System Configuration section called “Index Management” which allows the administrator to control this behavior.

The effect of this is clear.  The actual number of queries for a save operation is significantly lower.

Using Materialized Views For Your Own Profit

But while the materialized view implementation has helped make the indexing functionality much more efficient, one of the real benefits of the new Mview module is that you can use it in a standardized way to manage your own indexing requirements.  It will help you implement module functionality when you have COUNT(*), GROUP BY or DISTINCT queries, or when you need to pre-calculate values.  That’s not to say that you can’t manage it on your own but by using the Mview module you can offload some of the responsibilities of managing these types of calculations to Magento.

In order to do that you need 3 elements:

  1. A setup script that creates your table subscriber
  2. An index table
  3. Indexer action models that deal with either the changelog or a complete reindex

Before we move on we should note that #3 has some caveats.  Generally if you are going to be working with the changelog you will not be configuring an index to run with the shell indexer as you might expect.  Instead, it will be run automatically through cron and managed by the Enterprise_Index_Model_Observer class, which will have its refreshIndex() method called each time cron is executed, just like it does in the Enterprise_Catalog module.  In other words, you will generally have no direct control via the administration UI or the CLI if you use this method.

For this example we are going to write an extension that will print a summary of the browsers that have viewed a given page in the footer of that page.  Granted, this is an extension that would not go on a production front-facing Magento implementation but it will demonstrate how you can integrate with the Mview extension to do calculations that you would not want to do during a normal request.  Additionally, for the purposes of clarity, we are going to omit things like Layout and Blocks and standard configuration options and focus on the Mview functionality.  This should not be a problem for most Magento developers, though the full code is available on GitHub.

For implementing this kind of functionality you might have considered creating a new table that might contain a column for the URL, the User Agent and a count.  However if you were to calculate the results on the fly you are adding a minimum of an extra write during regular load times and you increase the possibility of locks or write contention in the database.  Those kinds of writes are generally something that should be avoided.  This is not to mention, as well, that in our scenario here the information we need is already being stored in the log_ tables and so we would be duplicating a write for no reason.

There is an additional reason that we would want to do this separately.  User Agent strings are long strings of mostly useless information (for our purposes, at least).  These strings have many different permutations which may or may not follow a specific pattern.  So we have one of two options: we build the mechanism to normalize the user agents into something useful ourselves, or we use what someone else has done so we don’t have to maintain it.  The get_browser() functionality is a good fit for this, but it is a little on the slow side if you want accurate results.

So we are going to process those results offline as part of a regularly scheduled cron job that is already set up via Enterprise_Index.

To get started we need to set up a subscriber on the log_url table.  This is the table where each and every HTTP request is registered.  The Mview module allows your module to watch a given table for changes.  When a change occurs it is noted by the triggers that have been defined for that table and the individual view will have an internal version ID counter incremented.

When creating a subscription you will do it by telling the Mview client to watch a given column on a given table.  There is something interesting to note here: because of how the system tracks changes it is beneficial to make sure that you are watching a column that has a unique value.  Otherwise it is possible that your indexer may accidentally select data that has already been processed.  A good example of this is if you are watching a DATETIME column.  It is quite possible that two events occurred in the same second the indexer was run, and they would end up getting re-processed the next time the indexer runs.  When retrieving the changed values from Mview the indexer looks for a version ID in the enterprise_mview_metadata table which will be matched against the actual value in the changelog table.  If there is any possibility of duplication here then it is possible that your data will be skewed.  Simply watching the primary key of the table you’re interested in is by far the easiest way to accomplish this.

You will generally set these relationships up from within a setup script for the module that you are creating.  In the setup script we are creating here there are two tables that need to be created.  One is for the summary report data and the second is for mapping individual user agents onto their parent browser values.  For example, mapping “Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20100101 Firefox/10.0.12 Iceweasel/10.0.12 Snapper/1.0” onto “Firefox 10“.  These tables we will not document here for the sake of clarity, but you can see them in the GitHub repository for this module.

To start creating our materialized view and its subscription we’re going to start with our setup script.  We are going to get the name of the log_url table, which is the table we’re going to watch.

$logTable = Mage::getModel('log/log')->getResource()->getTable('url_table');

Why not just type in “log_url”?  This goes to an important principle in software development: hardcode as little as you can.  What if the table changes?  Then you need to go and try and figure out all of the places where you hardcoded the value.  This is why, in our next few lines of code, we are a little more verbose when determining the primary key.

$pkData = $this->_conn->getIndexList($logTable);
if (!isset($pkData['PRIMARY']['COLUMNS_LIST'][0])) {
     Mage::throwException('Unable to find log table primary key');
}
$logPrimaryKey = $pkData['PRIMARY']['COLUMNS_LIST'][0];

With these two variables defined we can now configure the Mview client and subscriber.

Mage::getModel('enterprise_mview/metadata')
    ->setTableName($logTable)
    ->setViewName($logTable)
    ->setKeyColumn($logPrimaryKey)
    ->setGroupCode('eschrade_useragent_report')
    ->setStatus(Enterprise_Mview_Model_Metadata::STATUS_INVALID)
    ->save();

This first step in the setup process involves retrieving an instance of the Mview metadata class which is responsible for managing the basic view information.  The class itself basically functions as an interface to the database, as can be seen from the skeleton-like definition of its resource and collection methods.  The metadata class, however, is used by the Mview client as a means of differentiating between various materialized views.  The client loads the metadata information prior to executing an individual indexer.

setTableName() is where we set the actual name of the table that is the source of our changed action log.  It is the table whose changes we are interested in when they occur.  It is the table that the triggers will get set up against when the subscription is created later on.  Only one materialized view metadata entry can watch the table, but multiple indexers can process it based off of configuration settings (which we will look at later on).

The setViewName() method is used to determine the source of changes that you are interested in noting.  While this value can be a different table from the one specified in setTableName() it will generally be the same and it must be unique.

setKeyColumn() is used to tell the Mview client which is the unique value that we want to store in our changelog.  Generally, this will be the primary key.

setGroupCode() is largely unused but must be unique due to table index requirements.  However, it does allow you to load the metadata by the code, and you can also define the group code in your index class.

Setting the status to “invalid” via setStatus() is how you force an index to be completely rebuilt.  An index that is out of date is not invalid.  What happens during normal cron runs is that each materialized view that is configured will be run, but whether or not that index is invalid will determine which index class is called.  We will get into building those classes later on.  By setting the index status to “invalid” you will be forcing a full refresh of the index during the next cron run.

After the metadata has been created you will need to create an instance of the client associated with the view name to create the actual subscription.

$client = Mage::getModel('enterprise_mview/client');
 
/* @var $client Enterprise_Mview_Model_Client */
$client->init($logTable);

The value that is passed into the init() method is the name of the table that you are watching.  This initializes the client and loads the metadata that was previously saved.

The next thing you need to do in your setup is create the changelog table.  This is done by asking the client to do it for you by passing in a Magento URN to a class that will create the changelog.  Generally speaking you can use the base class to do this, which is what we do here.  The base class requires the table_name as a parameter.  Generally the changelog will be named {$tableName}_cl.

$client->execute('enterprise_mview/action_changelog_create', array(
    'table_name' => $logTable
));

Now that the changelog has been created we need to create the subscription.  The subscription class will create a trigger that watches for CUD operations and inserts the primary key value for the changed row into the changelog table.  For each change that is made to the watched table a row will be inserted into the changelog that contains the previously defined watch column along with a version ID.  This version ID corresponds to the version ID stored for this view in the enterprise_mview_metadata table.  So when you are creating an indexer later on the way the indexer knows which rows to include is by selecting all of the rows in the changelog that are greater or equal to the version ID found in the metadata table.

$client->execute('enterprise_mview/action_changelog_subscription_create', array(
    'target_table' => $logTable,
    'target_column' => $logPrimaryKey
));

At this point we have Magento configured to watch for changes on a specified table and note row changes in the database.  Now we need to do something with them.

In order to make this example useful we need to define at least two classes (Eschrade_UserAgent_Model_Index_Ua_Changed and Eschrade_UserAgent_Model_Index_Ua_All) to interact with the Mview components.  Those two classes are there to either build the index partially or to build the index fully.  While two classes generally should be defined, the nature of the indexing operation should be similar between the two; the primary difference will be the size of the data set.

Most of the time the base indexer class can be built by extending Enterprise_Mview_Model_Action_Mview_Refresh_Changelog.  While there are classes that can be used to define each class individually this might end up making more work than is necessary.  In this case we will build a base class for refreshing the view, but provide some simple hooks that will allow the full re-indexer to easily do its job at the same time.

In the example here there are several methods that we are going to describe in words instead of code since they are either helper functions or are used to modify internal SQL queries to get what we’re looking for.  If you would like to see the full code you can find it on GitHub at https://github.com/kschroeder/Magento-Enterprise-Materialized-Views-Example/blob/master/app/code/local/Eschrade/UserAgent/Model/Index/Ua/Changed.php.

class Eschrade_UserAgent_Model_Index_Ua_Changed extends  Enterprise_Mview_Model_Action_Mview_Refresh_Changelog
{
    public function execute()
    {
        try {
            $this->_connection->beginTransaction();
            $this->preIndexHook();
            $this->populateUserAgentTable();
            $versionId = $this->_selectLastVersionId();
            $this->processIndex();
            $this->_metadata->setValidStatus()
                ->setVersionId($versionId)
            ->save();
            $this->_connection->commit();
        } catch (Exception $e) {
            $this->_connection->rollBack();
            $this->_metadata->setInvalidStatus()->save();
            throw $e;
        }
        return $this;
    }
}

This class will, by default, only process changed items.  But when we look at the class we’ve defined for processing all of the user agent results you will see that there is very little that we need to do.

The first thing we do is start a transaction.  We don’t strictly need to, but it is often a good idea to wrap this functionality in a transaction to handle cases where there is an error that might corrupt the index or cause data to be processed multiple times.

The preIndexHook() method allows any subclasses to execute any logic prior to actually calculating the index.  This is primarily used to allow the full re-index class to truncate the data.

Next up we call a method called populateUserAgentTable().  This is a helper method that does a LEFT JOIN on the log_visitor_info table and a table that is used to store the friendly names of the user agents.  If there are a large number of user agents that have not been processed, this method can actually take several seconds to execute (full version on GitHub). This is done so that MySQL can do some processing for us later on rather than doing it in our code.

At this point we get the most current version ID from the change log to use later when setting the last processed version ID.  This could be done at the end of the process, but there is more of a possibility that a version ID may be skipped if we wait until the end to do this.  Perhaps a better option would be to check the changelog version IDs that were actually processed and use the highest one.  But for clarity, and because the elapsed time between when this query is made and when the unprocessed IDs are retrieved will be very low, we do it here.

Next up is the processIndex() method.  This is where the magic is implemented.

public function processIndex()
{
    $indexTable = Mage::getModel('eschrade_useragent/agentreport')->getResource()->getMainTable();
    $select = $this->_selectChangedRows();
    $this->joinSelectWithUrlAndVisitor($select);
    $stmt = $select->query();

    while (($row = $stmt->fetch()) !== false) {
        $this->_connection->insertOnDuplicate(
            $indexTable,
            array(
                'page' => $row['url'],
                'agent' => $row['normalized_user_agent'],
                'request_count' => $row['page_count']
            ),
            array(
                'request_count' => new Zend_Db_Expr('request_count + VALUES(request_count)')
            )
        );
    }
}

The first thing we need is the table that we are going to be inserting the values into.  That would be the main agentreport table.  Next up we retrieve the SELECT object that is pre-configured to retrieve only the primary keys of rows that have changed.  This is defined in the Enterprise_Mview_Model_Action_Mview_Refresh_Changelog class.  We make a small modification to the method to deal with some of the complex JOINs we do later, but it is essentially the same.

After we have retrieved the pre-configured SELECT object (which is directly pertinent to the Mview discussion) we modify it to suit our needs (which is not pertinent to our example).  That method is custom to this example and does a number of JOINs on various tables to pull the data we need into one result set.  This data is then inserted with an ON DUPLICATE clause that will add any new page/user agent counts to existing ones, or insert them if no counts exist.

At this point it is worth noting that using Mview forces us to do things in a bit of a less efficient manner.  Generally, when you are building a Materialized View the goal is to do as much with INSERT FROM SELECT statements as possible.  This can greatly cut down on the possibility of hitting memory limits in PHP and also allows MySQL to do the heavy lifting for you.  A good rule of thumb is that you want to work as close to the data as possible.

But because we are doing partial changes, and because those partial changes may already have some previously-calculated data in the table, we are not able to do that.  This is actually a limitation in the database adapter, where fields defined with a Zend_Db_Expr class (which is how you would make the addition calculation for duplicate entries) are not handled in a way that can be used here.  The problem is that when you specify a column the expression that you use for the field is the expression that is used for the ON DUPLICATE clause.  Because we are possibly updating existing columns with an addition expression we are not able to use the same field value in the column clause as we do in the ON DUPLICATE clause.  Therefore, in this case, we need to process them manually.

This is why we first generate a result set and iterate over it, calling insertOnDuplicate().  insertOnDuplicate() allows us to distinctly define columns and ON DUPLICATE clauses, which we can’t do via insertFromSelect().

At this point when we run the index nothing will happen.

That is because we have not defined the class that will (re)process the entire view.  Remember back to our setup script where we set the view as invalid?  That will result in a full re-index.

So let’s take a look at that class in its entirety.

class Eschrade_UserAgent_Model_Index_Ua_All extends Eschrade_UserAgent_Model_Index_Ua_Changed
{
    public function preIndexHook()
    {
        $this->_connection->truncateTable(
            $this->getReportResource()->getTable('agentreport')
        );
    }
 
    protected function joinSelectWithUrlAndVisitor(Varien_Db_Select $select)
    {
        $select->reset(Varien_Db_Select::WHERE);
        parent::joinSelectWithUrlAndVisitor($select);
    }
 
}

Because of the way we built out Eschrade_UserAgent_Model_Index_Ua_Changed we only need to do two things: 1) truncate the report table, and 2) remove the version_id condition from the WHERE clause.

And with that our classes are done.

But our indexers will still not run.  That is because we have not defined our indexes in the configuration.  Our indexers will be added under the /config/global node in our config.xml file.

<index>
    <indexer>
        <eschrade_useragent_report>
            <model>eschrade_useragent/index_ua_dummy</model>
            <action_model>
                <changelog>eschrade_useragent/index_ua_changed</changelog>
                <all>eschrade_useragent/index_ua_all</all>
            </action_model>
            <index_table>log_url</index_table>
        </eschrade_useragent_report>
    </indexer>
</index>

The first thing you might notice is that there is a dummy class defined under the <model> tag.  This node is typically used by the index CLI script to process the individual indexes.  However, because Mview-based indexers should be run via cron we define a simple class that renders the name and the description for the admin UI but does nothing else.  The main indexer, which is managed by the new Enterprise_Index module, uses the same configuration nodes as Mage_Index, but uses the <action_model> and <index_table> nodes instead.
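
Such a dummy model can be little more than the following (a sketch that satisfies the Mage_Index abstract class and the admin grid; the actual class in the GitHub repository may be built differently):

class Eschrade_UserAgent_Model_Index_Ua_Dummy extends Mage_Index_Model_Indexer_Abstract
{
    public function getName()
    {
        return 'User Agent Report';
    }

    public function getDescription()
    {
        return 'Rebuilt automatically via cron by Enterprise_Index.';
    }

    // The real work happens in the Mview action models, so the event hooks
    // required by the abstract class are intentionally no-ops.
    protected function _registerEvent(Mage_Index_Model_Event $event)
    {
        return $this;
    }

    protected function _processEvent(Mage_Index_Model_Event $event)
    {
        return $this;
    }
}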

When the Mview index manager is run it will check the status column in the mview metadata table for indexes that are invalid.  For each row whose status is “invalid” the table_name will be used to match up which indexer configuration should be used.  The main indexer will check for a URN under <action_model><all> and run that indexer.  This is the first part of the indexing process.

The second part of the indexing process is when Enterprise_Index iterates over all of the indexers defined in the merged configuration.  An index is ignored if either the <action_model> or the <index_table> is missing.  If a full re-index has already been run for a given indexer entry the changelog-based indexer will be omitted.  Beyond that caveat every index configuration that has an <action_model><changelog> and <index_table> node will be run.

Conclusion

While there are several places in the new Magento Enterprise Edition 1.13 that make indexing more efficient there is perhaps no mechanism more important than the implementation of Materialized Views.  It allows for a standardized method of building summary/calculation tables in a predictable manner.  But rather than building an implementation intended strictly for the new version of Magento Enterprise, the Mview module allows developers to utilize the same functionality that is largely responsible for some of the performance improvements noted for EE 1.13.


Testing Magento with random URLs in JMeter


One of the things that I think is important when testing a web server is to introduce some level of randomness.  In the wild you have almost no control over what customers are doing on your web site and adding some level of randomness might help you gather additional data, particularly if you are doing it under load.

With JMeter this is actually fairly easy.  You have the ability to modify HTTP requests based off of variable values.  In other words, you can have your URL endpoint be ${url}.  But how do you get that random data into JMeter in the first place?  I think one of the easiest ways to do it is by having a really small script that gets a random URL from core_url_rewrite.  It would look something like this.

<?php
$pdo = new PDO('mysql:dbname=magento_ee', 'root', '');
$sql = 'SELECT COUNT(url_rewrite_id) FROM core_url_rewrite';
$sth = $pdo->prepare($sql);
$sth->execute();
$count = $sth->fetchColumn(0);
// Offsets are zero-based, so the valid range is 0 to $count - 1
$row = rand(0, $count - 1);

$sql = 'SELECT request_path FROM core_url_rewrite LIMIT 1 OFFSET ' . $row;

$sth = $pdo->prepare($sql);
$sth->execute();
$url = $sth->fetchColumn(0);

echo $url;

Then you would need a JMeter script that would look something like this (Eschrade Random URL JMX file).