
Rule #3: Only Alert The Ones Responsible

Posted on Wednesday 7 November 2007

Can you imagine a fire alarm going off in your building for a fire in a building five hundred miles away?

No? Why?

I have a guess: because it’s not relevant to you. But based on the three rules here, it meets #1: It’s certainly an emergency. It also meets #2: It’s actionable. But of course, that alert doesn’t have anything to do with you, so for you, that fire alarm is just noise.

Now, let’s stretch everyone’s imagination and say that an IT department is not much different from a fire department. Would you want your fire department to get all fire alarms for buildings in a 500-mile radius? Of course not. So why would you treat your IT department that way? Yet shops often do exactly that, by sending every event to every member of an administration team.

What I don’t understand is why some IT shops do a good job designing alerts, but then just hope that someone will do something about them when they appear. There’s a simple way to solve this. Rule #3 states: “Alerts must only go to people who are responsible to act on it.” This is almost a part of the actionable nature of good alarms. If you are going to define what action you should take if you get a certain alert–say, an application process being down–you should also specify who is supposed to take the action.

I’ll never forget the first time that I implemented Systems Monitoring, in 1997/1998. It was an instance of hardware monitoring for our servers. In this case, it was one of the very first versions of Compaq Insight Manager (CIM). As soon as it was installed, CIM set off an alarm that a server drive was in a degraded state. This meant that the drive had not yet failed, but it would shortly and should be replaced immediately. Of course, the implementation team was excited because the tool was working out of the box, and they could prove the value of the tool right away. They flagged down an admin, and showed them the alert.

The subsequent conversation went something like this:

Team: “Hey, you have a degraded hard drive!”

Admin: “Wow. You’re right.”

Team: “Um. Aren’t you going to do something about it?”

Admin: “No. That’s not my job.”

Team: “Oookay.”

They looked for other admins, and had no luck finding anyone to take action. Naturally the drive failed the next day. The follow-up conversation went something like this:

Admin: “Hey, the drive failed! The system is supposed to tell us.”

Team: “Um, it did. We told you. You didn’t do anything about it.”

Admin: “This Systems Monitoring stuff isn’t working.”

This kind of attitude is why I like to say that the technology is no more than 40% of the Systems Monitoring solution. The technology could work fine, but if you don’t act properly on what it tells you, the technology can’t help you.

Here’s what I’d suggest to successfully implement rule #3:

  • When you determine the action for each alert (rule #2), you should also determine who is to take that action, and in what time frame. Include off-hours coverage if you are a 24/7 shop.
  • Even if there are groups of alerts on the same server, determine who the appropriate recipient is for each one, and only send it to those recipients. We have many servers that have alerts that go to DBAs, and other alerts that go to the web team, for example. They don’t get each other’s alerts because they aren’t responsible for them.
  • Unless you have an operations team, never depend on your mobile admins to “watch the console” for problems. Instead, page them or send them email. If you do a good job with the three rules, every alert will be an actionable emergency that they will be responsible for, and they will never get any alerts that are just noise.
  • When you must page a group of people for a problem rather than just individuals, make sure that there is always one person on call who is responsible for acting on the alerts that go to the team. Otherwise, everyone assumes that someone else will handle the problems. We always have a primary and a backup.
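To make the suggestions above a little more concrete, here is a minimal sketch of a routing table in Python. Everything in it is made up for illustration: the server name, the alert names, the team names, the people, and the response times are assumptions, not the configuration of any real monitoring tool. The idea is only that each alert on a server maps to one owning team, one action, and one time frame, and that each team has a named primary and backup.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    team: str                    # who is responsible (rule #3)
    action: str                  # what they are expected to do (rule #2)
    respond_within_minutes: int  # agreed time frame


@dataclass(frozen=True)
class OnCall:
    primary: str
    backup: str


# Hypothetical routing table: two alerts on the same server go to
# different teams, so neither team sees the other's alerts.
ROUTES = {
    ("db01", "sql_service_down"): Route("dba", "Restart the database service", 15),
    ("db01", "web_pool_down"): Route("web", "Recycle the web application pool", 15),
}

# Hypothetical on-call roster: every team has exactly one primary and one backup.
ON_CALL = {
    "dba": OnCall(primary="alice", backup="bob"),
    "web": OnCall(primary="carol", backup="dave"),
}


def escalation_order(server: str, alert: str) -> list[str]:
    """Return who to page for this alert: the primary first, then the backup."""
    route = ROUTES.get((server, alert))
    if route is None:
        # An alert with no owner should not be sent to "everyone" by default.
        raise LookupError(f"No owner defined for {alert!r} on {server!r}")
    on_call = ON_CALL[route.team]
    return [on_call.primary, on_call.backup]
```

With this table, `escalation_order("db01", "sql_service_down")` pages the DBA primary and then the DBA backup, and the web team never hears about it. An alert with no entry raises an error instead of being broadcast, which forces someone to decide who owns it before it goes live.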

You need to be as ruthless about enforcing this rule as you are about the first two. If anyone receives alerts that are just informational for them, they will delay looking at their alerts because they might not be directly responsible. You are depending on them to do the filtering. Again, think of the fire department. Do they get “informational” fire alarms for fires that are hundreds of miles away? Only if they’re really serious, and in those cases, they are contacted directly because then, they have an actionable emergency that they are responsible for. You should set up your own system the same way.

The next major topic is designing custom monitoring. In particular, I’m going to cover a technique called Critical Path Monitoring that will let you monitor any application. This will be a series of articles, because, as you can imagine, this topic isn’t simple. But it has worked for years in our departments, helping us make thousands of alerts, and monitor over a hundred custom applications.




NetConnect Lessons: No More CYA Alerts

Posted on Monday 5 November 2007

It’s taken me some time to integrate what I heard at NetConnect this year. I don’t mean that I learned a ton of new things. It’s that a lot of environments are in more trouble than I thought. One of the things that it’s caused me to do is change the third rule. I’ve been stunned at both the large number of deployments of tools out there in the world and the number of them that just aren’t used.

I never thought that another name for “monitoring” would be CYA. That is to say: “Cover Your Donkey” but, of course, using one of the synonyms for donkey.

It seems that many environments would rather have alerts that no one uses than have their system miss something and get blamed for a production issue. This leads to a proliferation of alerts that, very quickly, no one watches because they break the rules that I’ve been talking about here.

One of the rules that I felt was just a component of the first one needs to be a separate rule. So I’m replacing rule #3. The old rule #3 is going to become one of the rules of creating custom monitoring, which I’m going to explain soon when I talk about critical path monitoring. I’m going to talk about the new rule #3 in the next article. In this article, I want to talk about the CYA problem.

But, just to get it out of the way, here are the new three rules:

  1. Every alert must mean “Run to the console now!”
  2. All alerts must be actionable.
  3. Alerts must only go to people who are responsible to act on it.

Like I said, #3 is next article. I just want to expand on the point that I mentioned in #1. For now, let’s talk about the CYA issue.

If you let administrators depend completely on the alerts from your systems tools, you will always be surprised by some events that happen on the servers. No matter how good a job you do developing your alerts, there are always oddball issues that can come up that you will miss. I inform every administrator of this issue, and tell them that they should still check their servers on an occasional basis, just as they should have been doing before monitoring was put in place.

Because an alert is proof of a problem, though, some management and administrative groups have used it to assign blame rather than to fix things. I ran into more than one group at NetConnect that had been told that they may not turn off any alerts because they might “miss something,” even if they are deluged with false alerts. But unfortunately, because many of these shops are not being strict about reducing the alerts down to the ones that follow the rules, a lot of alerts are meaningless and they ignore all of them, good and bad. This makes adding new alerts when they are necessary next to useless, because even the current set of alerts is already ignored.

It does take time and effort to reduce the number of alerts down to the ones that matter, but if you want to get any value out of Systems Monitoring tools at all, it’s a necessity.
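One way to start that reduction is to audit every alert definition against the three rules and list the ones that fail. Here is a rough sketch in Python. The `AlertDef` fields and the example alerts are hypothetical and are not taken from any particular monitoring product. You would have to fill in the fields from your own alert inventory.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class AlertDef:
    name: str
    is_emergency: bool      # rule #1: does it mean "Run to the console now!"?
    action: Optional[str]   # rule #2: what should the recipient do?
    owner: Optional[str]    # rule #3: who is responsible for acting on it?


def violations(alert: AlertDef) -> List[str]:
    """List which of the three rules this alert definition breaks."""
    problems = []
    if not alert.is_emergency:
        problems.append("rule #1: not an emergency")
    if not alert.action:
        problems.append("rule #2: no defined action")
    if not alert.owner:
        problems.append("rule #3: no responsible owner")
    return problems


def audit(alerts: Iterable[AlertDef]) -> Dict[str, List[str]]:
    """Return only the alerts that break at least one rule, keyed by name."""
    report = {}
    for alert in alerts:
        problems = violations(alert)
        if problems:
            report[alert.name] = problems
    return report
```

Anything that shows up in the report is a candidate to fix or turn off. An alert that passes all three checks on paper can still be noise in practice, so treat this as a first pass, not a replacement for reviewing the alerts with the people who receive them.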

In fact, the conversation that inspired this article was with an administrator at a company that was monitoring 2400 servers in its environment. I asked him how many alerts they get a day. He told me “thousands.” I asked what they did with them. He answered: “Nothing.” And so I asked why, and he said: “CYA.”

Now, why is this so easy to believe?



