Can you imagine a fire alarm going off in your building for a fire in a building five hundred miles away?
No? Why?
I have a guess: because it’s not relevant to you. But based on the three rules here, it meets #1: It’s certainly an emergency. It also meets #2: It’s actionable. But of course, that alert doesn’t have anything to do with you, so for you, that fire alarm is just noise.
Now, let’s stretch everyone’s imagination and say that an IT department is not much different than a fire department. Would you want your fire department to get all fire alarms for buildings in a 500-mile radius? Of course not. So why would you treat your IT department that way? Yet shops often do exactly that, by sending every event to every member of an administration team.
What I don’t understand is why some IT shops do a good job designing alerts, but then just hope that someone will do something about them when they appear. There’s a simple way to solve this. Rule #3 states: “Alerts must only go to people who are responsible to act on it.” This is almost a part of the actionable nature of good alarms. If you are going to define what action you should take if you get a certain alert–say, an application process being down–you should also specify who is supposed to take the action.
I’ll never forget the first time that I implemented Systems Monitoring, in 1997/1998. It was an instance of hardware monitoring for our servers, using one of the very first versions of Compaq Insight Manager (CIM). As soon as CIM was installed, it set off an alarm that a server drive was in a degraded state. This meant that the drive had not yet failed, but it would fail shortly and should be replaced immediately. Of course, the implementation team was excited because the tool was working out of the box, and they could prove the value of the tool right away. They flagged down an admin and showed them the alert.
The subsequent conversation went something like this:
Team: “Hey, you have a degraded hard drive!”
Admin: “Wow. You’re right.”
Team: “Um. Aren’t you going to do something about it?”
Admin: “No. That’s not my job.”
Team: “Oookay.”
They looked for other admins, but had no luck finding anyone to take action. Naturally, the drive failed the next day. The follow-up conversation went something like this:
Admin: “Hey, the drive failed! The system is supposed to tell us.”
Team: “Um, it did. We told you. You didn’t do anything about it.”
Admin: “This Systems Monitoring stuff isn’t working.”
This kind of attitude is why I like to say that the technology is no more than 40% of the Systems Monitoring solution. The technology could work fine, but if you don’t act on it properly, the technology can’t help you.
Here’s what I’d suggest to successfully implement rule #3:
- When you determine the action for each alert (rule #2), you should also determine who is to take that action, and in what time frame. Include off-hours coverage if you are a 24/7 shop.
- Even if there are groups of alerts on the same server, determine who the appropriate recipient is for each one, and only send it to those recipients. We have many servers that have alerts that go to DBAs, and other alerts that go to the web team, for example. They don’t get each other’s alerts because they aren’t responsible for them.
- Unless you have an operations team, never depend on your mobile admins to “watch the console” for problems. Instead, page them or send them email. If you do a good job with the three rules, every alert will be an actionable emergency that they will be responsible for, and they will never get any alerts that are just noise.
- When you must page a group of people for a problem rather than just individuals, make sure that there is always one person on call who is responsible for taking action on the alerts that go to the team. Otherwise, everyone assumes that someone else will handle the problems. We always have a primary and a backup.
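To make these suggestions concrete, here is a minimal sketch in Python of what an ownership table might look like. It is not taken from any particular monitoring product; the server name, alert kinds, team names, and on-call people are all hypothetical, and a real tool would have its own way of expressing the same idea. The point is only that each alert resolves to exactly one responsible person, with a backup, and that an alert without an owner is treated as a design gap rather than broadcast to everyone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    primary: str  # on-call person who is paged first
    backup: str   # paged only when the alert is escalated


@dataclass(frozen=True)
class Alert:
    server: str
    kind: str     # e.g. "database" or "web"
    message: str


# Hypothetical teams and on-call names, for illustration only.
DBA = Team("dba", primary="alice", backup="bob")
WEB = Team("web", primary="carol", backup="dave")

# One owner per (server, alert kind): alerts on the same server
# can go to different teams, and neither team sees the other's.
ROUTES = {
    ("app01", "database"): DBA,
    ("app01", "web"): WEB,
}


def recipient_for(alert: Alert, escalate: bool = False) -> str:
    """Return the single person who must act on this alert."""
    try:
        team = ROUTES[(alert.server, alert.kind)]
    except KeyError:
        # No owner defined: fail loudly instead of paging everyone.
        raise LookupError(
            f"no owner defined for {alert.kind} alerts on {alert.server}"
        ) from None
    return team.backup if escalate else team.primary
```

With this table, `recipient_for(Alert("app01", "database", "replication lag"))` resolves to the DBA primary, and passing `escalate=True` resolves to the backup. Raising an error for unrouted alerts is a deliberate choice in this sketch: it forces someone to decide who owns an alert before it ever goes live.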
You need to be ruthless about enforcing this rule as much as the first two. If anyone receives alerts that are just informational for them, they will delay looking at their alerts because they might not be directly responsible. You are depending on them to do filtering. Again, think of the fire department. Do they get “informational” fire alarms for fires that are hundreds of miles away? Only if they’re really serious, and in those cases, they are contacted directly because then, they have an actionable emergency that they are responsible for. You should set up your own system the same way.
The next major topic is designing custom monitoring. In particular, I’m going to cover a technique called Critical Path Monitoring that will let you monitor any application. This will be a series of articles, because, as you can imagine, this topic isn’t simple. But it has worked for years in our departments, helping us build thousands of alerts and monitor over a hundred custom applications.
