How to (properly) evaluate Zend Server – Event Monitoring

So, you just received instructions to download and try out Zend Server.  Or, you heard that Zend Server is a “PHP Application Server”, but you have no idea what that means and you want to find out.  What do you do?

What I have often seen is that people will download and install Zend Server, try a PHP application on it and see if it works.  They see that it does work.  Then they say “OK, it works, but it’s not worth the price so I’ll just go back to what I was using before.”  The problem here is that more often than not, and I’m sure it’s MUCH more often than not, “worth” is not defined.

The first thing that people do is look at cost.  Price, but also performance.  In other words, the first thing people often do to test Zend Server’s Event Monitoring is check to see what the performance overhead is.  What they’ll do is take JMeter, run a test session against Zend Server, and then become concerned because Event Monitoring has overhead.  Yes, Event Monitoring has overhead.  Yes, it will probably be a little more than what the marketing material states, but we’re still only talking about a few percentage points.  That may make a difference for Facebook or Google but it will probably not make that much of a difference to you.

But what if that few percent of overhead showed you problems that you either a) didn’t know existed, or b) were unable to replicate?

For example, one of the hardest problems to diagnose is a slow page request.  There could be literally hundreds of thousands of expressions that could be the culprit.  Or it could be that you HAVE hundreds of thousands of expressions.  But you don’t know.  In fact, it’s just as likely that you do not even know that you’re having a performance problem, or to what level you’re having it.  And if you do, your only production measurement is external.  In other words, all you know is what the elapsed time is.  There is also a relatively good chance that a lot of the information you need to reproduce the issue is not going to be readily available.  Things like $_POST, $_SERVER, environment variables and such are often not seen in a PHP log.

I’ve worked in support before, and there is something I know: when the alarms start going off it’s too late.  When the alarms go off your customers are already being affected.  Most monitoring systems do not do real-time monitoring.  At least none of the ones I’ve worked with.  Most systems monitor the OS or HTTP responses to determine if there are issues, they do it at periodic intervals, and they don’t record the cause of the error condition.  Recording the cause, at the moment it happens, is what Event Monitoring does.

Why am I saying all of this stuff?  Note my statement at the beginning of this posting.  I made the claim that people make the error of testing the performance of Event Monitoring when what they should be doing is measuring the cost of not having monitoring.  In other words, what does it cost you to have a 5-minute lag time before you know something is wrong?  What does it cost to have your customers start calling you, asking why your site is running slowly?  What does it cost for your developers to spend a week finding a performance issue when it could take them 4 hours?  I’m not joking on that last one.  It happened to me once.  It doesn’t happen often but it does happen.

But let me make a controversial statement.  At least it will be controversial if our sales people read it.  While Zend Server is about benefit and not cost, I would venture to say that when you first install it, it will actually end up costing you more.  I remember talking with someone once who looked at Event Monitoring and specifically stated they were not going to use Zend Server because it threw too many errors.  Let me translate that for you.  They were not going to use Zend Server because it was going to show them how much their application sucked.  I was absolutely flabbergasted at that.  I had no response.  Nothing.  Not a word.  How does one respond to that? … nicely?

I tell that story to emphasize a point.  Zend Server will not immediately begin to save you money, unless you are an absolute top-notch PHP programmer and you know exactly what’s going on in your app, all the time.  What it will immediately do is bring to the forefront the level of effort that is going to be required to make your application compliant with the PHP language.  Paying this cost will help you to write PHP code of a much higher quality than you did before.  Too many developers write their PHP code so as to lean on the type juggling and forgiveness of the engine as much as possible.  This is not good PHP code.  Applications like that will definitely cause a lot of errors to be thrown with Event Monitoring.  However, applications like that are also much more likely to exhibit unexpected behaviors that you do not want in a production application.

Evaluating Event Monitoring (for the admins)

As an admin you will be primarily responsible for managing the system in your production environment and you would also be the one signing off on it saying that it will run.

When you are testing and want to see what the overhead is, do not use the defaults unless you have an extremely well-written and high-performance application. The way to test is to do some kind of a load test (I use JMeter myself) and record the response times. The reason why you want to do this is that there is a series of events that you can monitor for, and each has its own threshold. Having your values set too low will cause an inordinate number of events to be thrown. Remember, the purpose here is not to set the thresholds at what you would like the application to run at, or what your SLA says. The purpose here is to find out what your typical response time is, not to find out what requests are problematic. If your SLA for individual requests is lower than your average request time then you need to modify your application, not your thresholds. Here is a list of events that can be used to discover performance problems.

  • Severe Slow Request Execution (Absolute)
  • Slow Request Execution (Absolute)
  • Severe Slow Request Execution (Relative)
  • Slow Request Execution (Relative)
  • Severe High Memory Usage (Absolute)
  • High Memory Usage (Absolute)
  • Severe High Memory Usage (Relative)
  • High Memory Usage (Relative)
  • Inconsistent Output Size

Start by forcing events to be thrown by dropping your thresholds to 5-10ms, for performance, and 10-20k for memory issues. Run your test. Wait. Run again. Look at your second bunch of numbers. As I’m sure you’re aware, a cold cache is no cache, so give it time to warm up.

Assuming a relatively constant response time (±50% or so), take the mean time and double it for your Slow Request Execution (Absolute) threshold. Then double it again for the Severe level. These are not hard and fast numbers, but they can help you start to tune it. Remember, you’re tuning for realistic application performance, not for your SLA. If you can’t deliver the app within your SLA you either get better servers or better programmers. 🙂
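Here is a minimal sketch of that arithmetic. It assumes you have already pulled the elapsed times (in milliseconds) out of your warm-cache load test run. The function name, the return shape and the sample timings are all made up for illustration; only the "double it, then double it again" rule comes from the text above.

```php
<?php
/**
 * Suggest starting thresholds from warm-cache response times (milliseconds).
 * Hypothetical helper: not part of Zend Server or JMeter.
 */
function suggestSlowRequestThresholds(array $elapsedMs): array
{
    if ($elapsedMs === []) {
        throw new InvalidArgumentException('Need at least one sample.');
    }

    $mean = array_sum($elapsedMs) / count($elapsedMs);

    return [
        'mean_ms'   => $mean,
        'slow_ms'   => (int) round($mean * 2), // Slow Request Execution (Absolute)
        'severe_ms' => (int) round($mean * 4), // Severe level: doubled again
    ];
}

// Made-up sample timings from the second (warm) test run.
$thresholds = suggestSlowRequestThresholds([120, 95, 140, 110, 135]);

printf(
    "mean: %d ms, slow: %d ms, severe: %d ms\n",
    $thresholds['mean_ms'],
    $thresholds['slow_ms'],
    $thresholds['severe_ms']
);
```

You would still type the resulting numbers into the Event Monitoring settings yourself. Treat them as a starting point to tune from, not as a rule.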

Once you’ve reset the monitoring numbers, run the test again and see if the number of events has dropped off significantly.  If it has not, up your times.  If it has, then you can start to examine the events that you collected from the first run and start asking yourself questions like:

  1. What is the actual performance impact?
    1. Will it significantly affect throughput?
  2. Will this reduce the time it takes to open a ticket and start working on a resolution?
    1. If so, by how much? If you say that it won’t reduce the time to start working on a resolution
  3. Does this give the developers better information than they get right now?

After asking these questions run the tests again with the tuned Event Monitoring settings and see if you get some events. Examine these and ask yourself, right off the top, “am I learning something about how my application behaves?” If you are, you are getting to see the benefit of working with Event Monitoring.

So, when testing Event Monitoring

  1. Set your thresholds low to force events to occur
  2. Use the timings from those events to tune the thresholds
  3. Run the test again
  4. If the event counts are decently lower, go to the next step, otherwise go to step 2
  5. Examine the data collected from the events to see if you can use it to reduce your Mean Time to Resolution
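The loop in steps 1 through 4 can be sketched roughly as below. This is an illustration under assumptions, not how Zend Server works internally: the "load test" is simulated by replaying recorded timings, "decently lower" is assumed to mean "at most 10% of the original event count", and raising the threshold by doubling is just one possible way to up your times. Both function names are hypothetical.

```php
<?php
/**
 * Count how many requests would raise a slow-request event at a threshold.
 * Stands in for "run the load test and count the events"; here it simply
 * replays recorded timings. Hypothetical helper.
 */
function countSlowEvents(array $elapsedMs, float $thresholdMs): int
{
    $events = 0;
    foreach ($elapsedMs as $ms) {
        if ($ms > $thresholdMs) {
            $events++;
        }
    }
    return $events;
}

/**
 * Raise the threshold until the event count falls to a fraction ($keepRatio)
 * of the count seen with the deliberately low starting threshold.
 */
function tuneThreshold(array $elapsedMs, float $startMs, float $keepRatio = 0.1, int $maxRounds = 10): float
{
    $baseline  = countSlowEvents($elapsedMs, $startMs); // step 1: force events
    $threshold = $startMs;

    for ($round = 0; $round < $maxRounds; $round++) {
        $threshold *= 2;                                   // step 2: up the times (assumed doubling)
        $events = countSlowEvents($elapsedMs, $threshold); // step 3: run again
        if ($events <= $baseline * $keepRatio) {           // step 4: decently lower?
            break;
        }
    }

    return $threshold;
}

// Made-up timings in milliseconds, with one obvious outlier.
$timings = [120, 95, 140, 110, 135, 900, 105, 130, 115, 125];
$tuned   = tuneThreshold($timings, 10.0);

printf("tuned threshold: %d ms, events left: %d\n", $tuned, countSlowEvents($timings, $tuned));
```

Whatever is still raising events after the loop settles is what step 5 asks you to examine.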

But there’s even more. However, we’ll look at that in a later article.

Evaluating Event Monitoring (for the developers)

Developers are of a different breed than administrators. I believe that is because developers don’t have pagers that go off in the middle of the night. Administrators like things that keep the pager from going off. Developers like cool things that inevitably cause the pager to go off. When it comes to Event Monitoring there’s not much you need to do.  Believe it or not, that’s actually kind of a good thing.  Event Monitoring is intended to run external to your code.  In other words, you don’t need to hook into it.  It’s mostly an operational thing.

That said, there are some hooks that you can use and you do have an important part to play in the evaluation process.  We claim that Event Monitoring will help you reduce your MTtR (Mean Time to Resolution).  Test that.  Install Zend Server on your local machine.  Take an old version of your code, one that has an error to reproduce, and test it.  The best type of error to test is one that is difficult to reproduce. Slow page requests with no specific issue are a good place to start.

One of the things you want to do is write an application that does not fatally error out. Ideally you want to catch every error. Practically, that’s not possible, but that’s the goal. The problem with catching every error, as you do if you’re building a ZF MVC application, is that Event Monitoring can’t automatically pick those errors up. It can catch slow request execution and such, but application errors are more difficult because, technically, the application hasn’t errored out. It caught the error.

However, all is not lost! There is an API call you can make. Let me show you in the error controller for my blog.

 

class ErrorController extends Zend_Controller_Action
{
    public function errorAction()
    {
        $errors = $this->_getParam('error_handler');
        switch ($errors->type) {
            case Zend_Controller_Plugin_ErrorHandler::EXCEPTION_NO_CONTROLLER:
            case Zend_Controller_Plugin_ErrorHandler::EXCEPTION_NO_ACTION:
                // 404 error -- controller or action not found
                $this->getResponse()->setHttpResponseCode(404);
                $this->view->message = 'Page not found';
                break;
            default:
                // application error
                $this->getResponse()->setHttpResponseCode(500);
                $this->view->message = 'Application error';
                $exception = $errors->exception;
                /* @var $exception Exception */
                if (function_exists('zend_monitor_set_aggregation_hint')) {
                    zend_monitor_set_aggregation_hint($exception->getMessage());
                    zend_monitor_custom_event(
                        'Internal Server Error Caught',
                        $exception->getMessage(),
                        array(
                            'trace' => $exception->getTraceAsString()
                        )
                    );
                }
                break;
        }
        $this->view->exception = $errors->exception;
        $this->view->request = $errors->request;
    }
}

 

There are two API calls here. The first one sets an aggregation hint. The way Event Monitoring works is that rather than generate multiple events it will aggregate similar events. Given that my event class is always “Internal Server Error Caught”, every error that is caught would be aggregated under that one title. So instead we aggregate according to the message of the exception, by calling zend_monitor_set_aggregation_hint() prior to calling zend_monitor_custom_event().

zend_monitor_custom_event() takes the following parameters.

  • Class – The nature of the event (aggregated)
  • Text – The description of the event (not aggregated)
  • User data – Any data that you want to store along with the event
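If you catch exceptions in more than one place, you could wrap the two calls in a small helper. The sketch below is one way to do that. reportCaughtException() is a hypothetical name of my own, and the only Zend Server functions it uses are the two already shown in the error controller, with the same arguments. It checks that both functions exist, so the same code still runs on a machine without Zend Server.

```php
<?php
/**
 * Report a caught exception to Event Monitoring when it is available.
 * Hypothetical helper wrapping the two zend_monitor_* calls shown above.
 *
 * @return bool true if an event was sent, false if the API is missing
 */
function reportCaughtException(Exception $exception, string $class = 'Internal Server Error Caught'): bool
{
    if (!function_exists('zend_monitor_set_aggregation_hint')
        || !function_exists('zend_monitor_custom_event')) {
        return false; // not running under Zend Server, or monitoring not loaded
    }

    // Aggregate by exception message rather than by the shared class title.
    zend_monitor_set_aggregation_hint($exception->getMessage());
    zend_monitor_custom_event(
        $class,                                           // Class (aggregated)
        $exception->getMessage(),                         // Text (not aggregated)
        array('trace' => $exception->getTraceAsString())  // User data
    );

    return true;
}

try {
    throw new RuntimeException('Could not load blog post');
} catch (Exception $e) {
    $sent = reportCaughtException($e);
    echo $sent ? "event sent\n" : "monitoring API not available\n";
}
```

Returning a boolean instead of throwing keeps the reporting from ever becoming the reason your error page fails.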

When your admin is doing their testing, make sure you check out the events that are being generated.

When an event is captured, a lot of the information that you need to debug it is captured along with it.  So from the event you can start a debug session in your browser that is kicked off using the same context as what caused the issue: things like GET and POST variables.  Additionally, if you click on the Settings button you can take that context and initiate it on a different server from where the event was collected.  So you could have an event kick off in production, but debug it in development with the same context.

Like with the admins, there is more, but we’ll look at that in a later article.

Conclusion

At this point we’ve really only scratched the surface of what we’re going to be looking at.  Event Monitoring is actually a two-part process.  This is the nifty part.  Next we’re going to look at the freaking cool part.

One thought on “How to (properly) evaluate Zend Server – Event Monitoring”

  1. Jesse LaVere

    Thanks for the article Kevin. I attended ZendCon last week and was happily impressed by the entire experience. Thanks for all of your effort. I look forward to future articles regarding Zend Server. In particular I’m interested in debugging the High memory usage errors.
