archive 2007 August

Welcome ITSMF

Posted on Friday 24 August 2007

I’d like to welcome the ITSMF folks to EffectiveMonitoring.com!

I’ve had requests for the slides from the presentation that I gave on Thursday. I’m going to post them, along with the slides from any other presentations that I give, on the presentations page of this website.




The Best of Breed Debate

Posted on Monday 20 August 2007

If you’ve gotten through determining what you’d like to cover at a general level on all of your systems, it’s time to pick out the tools that you’re going to use if you haven’t done so already. The techniques that I’ll be covering at EffectiveMonitoring.com will work no matter what vendor you use, but the choice of tool will possibly make your job easier.

Most of the documentation on how to monitor is vendor specific, and written in such a way that the only solution to the “problems” that it brings up is the vendor’s own product. There’s rarely a good debate about what tools should be like in general, and definitely not about choosing the right mix of tools to get your job done. I don’t know about you, but I personally find that a lot of the articles written about systems monitoring read more like vendor press releases than good comparisons between tools.

The most heated debate, one that has been argued throughout the entire decade that I’ve been involved in systems monitoring, is whether to get “Best of Breed” tools or “Jack of All Trades” tools that try to manage the entire infrastructure.

The Best of Breed camp says that it’s necessary to drill deeply down into each application in order to do a good job monitoring it. This sometimes leads to tools that work for just a few platforms, which may cause you to have to purchase and maintain many solutions to cover your entire enterprise. It also can put a burden on your operations team (or whoever watches the consoles), because they may have to contend with multiple places to get alerts.

The Jack of All Trades camp says that you must have tools that span everything in your enterprise. The simplified version of their argument is that you should have alerts that correlate across all of your platforms. Unix, Windows, Network, SQL, Linux, applications, and everything else should have alerts. They say that alerts should go to just one place so that the tools can do correlation between alerts. Root cause analysis is easier at that point, because the problems are all on one console. It’s also simpler operationally, because there is only a single console to watch. Unfortunately, these solutions tend to do a few things fairly well, and then provide mediocre coverage for the rest. Two sayings come to mind for these tools: “Jack of all trades, master of none.” And “A mile wide and an inch deep.”

And because this is IT, a third camp has sprung up. Some believe in consolidation tools that will roll up alerts from any solution into their console. Once these alerts are in a single tool, it can perform correlation or other analysis.

Because correlation is a crux issue for this debate, I need to cover it briefly now. Correlation is the concept that you should filter the “symptom” alerts from the “cause” alerts. For example, if your entire database server is down, then the fact that your application server is writing to a logfile that it can’t contact the database is a symptom, not a cause. There’s only one alert that matters here, and a good correlation tool will filter out the rest. But correlation as it relates to this debate has a very strong tendency to favor Best of Breed. The reason is simple: correlation assumes that you’ve done a complete job of putting alerts on all of your critical points of failure first. Otherwise, you have nothing to correlate. The Jack of All Trades tools can miss “deep” alerts. My other observation about correlation is that it rarely works in practice. There is too much manual work involved, and these tools can generate too many false alerts due to incorrect correlations. I’ll cover correlation in detail in a future article, because it’s quite a large topic.
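To make the symptom-versus-cause idea concrete, here is a minimal sketch in Python. It is not how any particular product does correlation. The Alert class, the DEPENDS_ON map, and the component names are all hypothetical, and a real tool would have to discover or be told the dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    component: str
    message: str


# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "web-server": ["app-server"],
    "app-server": ["db-server"],
    "db-server": [],
}


def upstream(component, depends_on):
    """Return every component that `component` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(component, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def root_causes(alerts, depends_on):
    """Keep only the alerts whose component has no alerting upstream dependency."""
    alerting = {alert.component for alert in alerts}
    return [
        alert
        for alert in alerts
        if not (upstream(alert.component, depends_on) & alerting)
    ]
```

Given one alert for db-server and one for app-server, root_causes keeps only the db-server alert. If the database server has no alert defined at all, the app-server alert is the only thing left to report, which is the point made above: correlation can only work with the alerts that you have already put in place.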

After a decade of using various tools, I would suggest that IT shops use tools that are best of breed within the systems monitoring space that can cover as much of your environment as possible, and then use specific tools to solve the rest of the issues. I haven’t seen any tools that do a good job of mixing network alerting (routers, switches, and cable plant monitoring) with the systems alerting, especially if your company is large enough to have a networking team. Their needs tend to be so different they need their own console and control over their own tools. And, besides, they tend to ignore systems monitoring alerts. That’s only fair, because systems monitoring folks often have to ignore network alarms because sometimes traffic can be routed through other infrastructure, and the alerts aren’t meaningful.

I do believe that having fewer consoles is a goal that you should always strive towards, and this is why your application alerting and your operating system alerting need to go in the same tool as much as possible. There’s a simple rule of thumb for this decision if you need to evaluate possible solutions: You must be able to do deep monitoring on each aspect of your system. Leave none out. Your set of monitoring solutions must cover your database servers, your web servers, your custom applications, and all other critical aspects of your operating systems. If you can’t find an overall solution that covers all of these, you need to bring in a best-of-breed solution that will be able to handle the alerts on all of the parts that you haven’t covered yet.
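The rule of thumb above can be checked mechanically once you write down what each tool covers. The following Python sketch assumes a hand-maintained inventory. The area names come from the paragraph above, while the tool names and the coverage_gaps function are made up for illustration.

```python
# Areas that the rule of thumb says must be covered. Extend this for your own shop.
REQUIRED_AREAS = {
    "database servers",
    "web servers",
    "custom applications",
    "operating system",
}


def coverage_gaps(required, tools):
    """Return the required areas that no tool in `tools` covers, sorted by name.

    `tools` maps a tool name to the set of areas it monitors in depth.
    """
    covered = set()
    for areas in tools.values():
        covered |= set(areas)
    return sorted(set(required) - covered)


# Hypothetical inventory of the tools in one environment.
TOOLS = {
    "general-suite": {"operating system", "web servers"},
    "db-specialist": {"database servers"},
}
```

Calling coverage_gaps(REQUIRED_AREAS, TOOLS) with this inventory reports that custom applications are not covered, which is the signal to bring in another tool or to write your own alerts for that area.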

I prefer category solutions in the systems space that allow me to write custom scripts. I often find that I want to alert on areas that the out-of-box monitoring doesn’t cover, and I need the freedom to add in a new alert type. But I want to emphasize again that the upcoming techniques and articles are vendor-neutral, and that whatever tools you choose, you will be able to use these techniques. As long as you make sure that your set of solutions covers all of the areas that we talked about in the Defining “Down” article, you will find the next articles usable almost immediately.
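As an illustration of the kind of custom script I mean, here is a small Python sketch that alerts when a file has not been updated recently, for example a heartbeat or log file that an application is supposed to touch. The function name, the status strings, and the idea of a heartbeat file are assumptions for this example. How a script reports its result to your monitoring tool depends entirely on the tool.

```python
import os
import time


def check_file_freshness(path, max_age_seconds, now=None):
    """Return a (status, message) pair, where status is "OK" or "CRITICAL".

    `now` defaults to the current time; it is a parameter so the check
    can be exercised without waiting for a file to grow stale.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as a failure.
        return "CRITICAL", f"{path} is missing or unreadable"
    if age > max_age_seconds:
        return "CRITICAL", f"{path} last updated {age:.0f}s ago (limit {max_age_seconds}s)"
    return "OK", f"{path} updated {age:.0f}s ago"
```

A wrapper would call check_file_freshness with a path and a limit that suit the application, then hand the status and message to whatever mechanism your tool provides for custom alerts.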

The next article in this series covers a surefire way to make certain that every single one of your alerts is meaningful.




Defining “Down”

Posted on Tuesday 14 August 2007

One of the points that’s often missed when it comes to monitoring is what we mean when we say that a system, server, or application is “down.” Indeed, it’s not as clear a concept as it may seem. This is where we need to start if we’re going to design a comprehensive set of alerts for a system.

What we do know for sure is if your network, server hardware, operating system, or application components are having trouble, your users will call and simply say: “The system is down.” The goal of monitoring is to know about these faults before they happen, or at worst when they happen. Note that even if you find out at the same time as your users, you will be cutting out all of the sleuthing that you’d be doing if you just got that user call. Not only that, you’re saving the time in between when they first try to contact you, and when the actual message arrives. If you have a large organization, sometimes tickets can spend hours in various help desk queues.

I’ve seen many shops complain about trouble with their monitoring, but fail to take the simple first step of identifying the actual components of their systems that can cause a failure. This process is part of a methodology I call identifying the critical path, which I am going to cover in detail in a future article. For now, I just want to focus on the overall areas of failure for systems in general, and talk about different monitoring solutions for each. You must have monitoring that covers all of these in your set of monitoring tools.

Network: Network monitoring is entirely different from systems monitoring. The best network monitoring tools will check all of the network links, as well as the status of your network hardware. These articles won’t cover network monitoring in depth, so I will assume that you have this covered. The good news about monitoring networks is that they have a more regular set of faults that can occur, and so monitors can be implemented without as much alerting design required. This is often true no matter what kind of hardware or cable plant you have in your data center.

Hardware: Whether your systems have a bad hard drive, correctable memory errors, a failed motherboard, or a dead fan, your system can crash as a result. You need to know the status of your hardware at all times. Fortunately, most hardware monitoring systems are predictive. That is, they will often tell you which component needs to be replaced before it fails completely. Hardware monitoring is also out of scope for this series of articles, although the alarms for these faults can certainly be sent to the systems monitoring console. The best hardware monitoring solutions usually come directly from your hardware vendors because they have the tightest integration with the actual hardware.

Operating System: If you run out of disk space or memory, or have any other operating system failure, your users are still going to call you and say that your system has failed. This alerting should be part of the same software as your application monitoring solution. Operating system monitoring is a complex topic that will be handled in a series of future articles, but it has the advantage of using the same set of monitors no matter what applications are running on the system.
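A disk space monitor is the classic example of an operating system alert that is the same on every server. The Python sketch below uses the standard library’s shutil.disk_usage. The 80 and 90 percent thresholds and the function names are assumptions chosen for illustration, not recommendations.

```python
import shutil


def classify_usage(percent_used, warn_at=80.0, crit_at=90.0):
    """Map a percent-used figure to "OK", "WARNING", or "CRITICAL"."""
    if percent_used >= crit_at:
        return "CRITICAL"
    if percent_used >= warn_at:
        return "WARNING"
    return "OK"


def check_disk(path, warn_at=80.0, crit_at=90.0):
    """Return (status, percent_used) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    percent_used = 100.0 * usage.used / usage.total
    return classify_usage(percent_used, warn_at, crit_at), percent_used
```

Because the check takes only a path and two thresholds, the same monitor can be pointed at every filesystem on every server, whatever applications happen to be running there.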

Application Monitoring: Applications in this case have a very broad definition. The term includes infrastructure software such as databases and web servers, but also application processes, services, and daemons. Monitoring for applications is irregular and difficult, because there is no catch-all monitoring that works for every application. In fact, most applications are unmanaged because administrators just don’t have a method for understanding their applications from a monitoring perspective, rather than just an administrative perspective. We’ll be spending considerable time on this monitoring because, although many vendors claim that they can do this automatically, it requires administrators to design these alerts. I will cover how to do this in detail, and provide a simple-to-follow guide on how to break apart an application into the alerts that matter.

Now that we’ve covered the overall areas that can cause a system to be down, the next article will talk about the longest-standing argument about the tools that can catch these failures: The Best of Breed debate.




How Do You Know When You’re Down?

Posted on Monday 13 August 2007

Imagine that your entire job is to watch a single server.

Sounds easy, doesn’t it? You would just need to stay connected to it, watch disk space and other basic statistics, and connect to the applications running on it occasionally.

Now imagine that your job is to watch five servers.

It still sounds possible, right? Although you need to go through them, and switch between them to see if any of them are having problems.

Now imagine that it’s your job to watch 200 servers. You’re in the middle of a data center. The fans are roaring in your ears, and the hard drives are whirring. What servers are having problems at this point? Out of curiosity, I’ve asked many application groups how they know when their systems are down. Inevitably, they give me the same answer:

“Our users call us when there’s a problem.”

How can this be an acceptable solution? An IT emergency in this case might be in the network, in the server hardware, an operating system fault, a problem with the application itself, or any number of places. A lot of applications actually run across many servers, and so when users call, it’s impossible to tell what component is having the problem without some detective work.

My data center is more than 10 times the size of the above example, with over 2200 servers, and at least 15,000 applications running. But when I get asked the question “Are there any problems?” I can glance at a console, and give an answer. In fact, most of the time, my consoles have NO alerts flashing on them. That does not mean that there are no problems, but it does mean that there are no unhandled problems. All of the alerts have an open ticket, and have people on the job, working to get the problem fixed.
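The distinction between “no problems” and “no unhandled problems” comes down to one filter on the console. As a rough Python sketch, with a made-up Alert record and a hypothetical ticket_id field standing in for whatever your ticketing integration provides:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    host: str
    message: str
    ticket_id: Optional[str] = None  # None means nobody has picked this up yet


def unhandled(alerts):
    """Return the alerts that have no ticket attached."""
    return [alert for alert in alerts if alert.ticket_id is None]


def console_is_quiet(alerts):
    """True when every active alert already has a ticket."""
    return not unhandled(alerts)
```

With this view, a console can hold many active alerts and still be quiet, as long as each one already has a ticket and someone working on it.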

You can get this same level of comfort while monitoring your own environment. And the answer isn’t in the latest tool. In fact, it doesn’t matter what monitoring tool you use. It’s in how you implement monitoring, and the procedures that you base around the alerts that you get.

Although many companies blame the monitoring tools, many systems monitoring implementations fail simply because the alerts haven’t been implemented properly. This blog will share how to do this effectively. I will cover troubleshooting, alerting, and determining what the critical points of a server are. Also, I will cover how to collect troubleshooting and performance statistics, so you’ll be able to answer the dreaded question: “What happened?” With the right data, you can often tell them. In fact, in my own environment, we often are able to tell people what caused a failure on a server within the last month.

This blog covers effective monitoring for IT systems based on over a decade of experience implementing systems monitoring in enterprise environments. I’m sharing this information simply because I needed to bring it together for some presentations that I’ve been asked to give for conferences and ITIL organizations, and there is more material than I can cover in the short amount of time that I have for the presentations. I wanted to share my full experience and methodology, which isn’t vendor specific. I will release new articles regularly in between my job as a systems monitoring implementor for a Fortune 100 company, and the presentations that I will be giving.

The next topic will lay out the scope of the problem, and identify everything that can break, so that we can understand the real problem that we’re trying to solve.



