Are you responsible for monitoring IT within your organization? Do problems keep occurring with your IT services that your monitoring systems are silent about? Are you constantly having to swap monitoring tools or write custom scripts because "new" monitoring requirements keep cropping up that your current monitoring systems cannot handle?
I have been in those situations while working for the enterprise monitoring department of a large bank. Having been responsible for working with dozens of support teams to monitor hundreds of services running on thousands of servers, I can attest to how daunting trying to monitor an enterprise can be. But what drove me and my team to successfully align systems was seeking the answers to the Five Essential Questions I ask below.
The Five Essential Questions are both strategic and tactical. The strategic questions expose potential weaknesses in your portfolio of monitoring systems that may require long-term planning to rectify. The tactical questions expose weaknesses in keeping your monitoring systems aligned with day-to-day operations.
1. Are we monitoring all services and technologies in our environment? (Strategic)
This is a big-picture question, and as such, we are not as concerned about how comprehensively we are monitoring each technology (depth) but rather whether we have any coverage at all (breadth). The tactical questions that follow will deal with the depth aspect.
Conceptually, the way to determine the answer is to create a list of all the technologies and technology-based services in your organization and put a check mark next to each that is monitored. Any that do not have checks are the monitoring gaps.
You should include manual procedures, such as data center walkthroughs and daily error reports, in the survey if you are confident that they are rigorously followed and result in remediation when problems are spotted.
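The checklist described above can be sketched in a few lines of Python. This is only an illustration: the technology names and the `covered_by` mapping are made-up placeholders, and in practice the inventory would come from whatever asset or service catalog you maintain.

```python
# Hypothetical inventory of technologies and technology-based services.
technologies = [
    "Linux servers",
    "Relational databases",
    "Load balancers",
    "Email service",
    "Data center cooling",
]

# Hypothetical coverage map: what, if anything, watches each technology.
# Manual procedures count only if they are rigorously followed.
covered_by = {
    "Linux servers": "agent-based monitoring tool",
    "Relational databases": "database monitoring tool",
    "Data center cooling": "daily data center walkthrough (manual)",
}


def breadth_gaps(technologies, covered_by):
    """Return the technologies that have no check mark, i.e. no coverage at all."""
    return [tech for tech in technologies if not covered_by.get(tech)]


if __name__ == "__main__":
    for tech in breadth_gaps(technologies, covered_by):
        print(f"GAP: {tech}")
```

Note that the sketch only answers the breadth question. It says nothing about how well each covered technology is monitored.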
2. Are we monitoring all instances of a technology in our environment? (Tactical)
You may have configured the most in-depth alert conditions for a type of server, but if your monitoring system is not aware of a given server, it does not matter. That is why this is the first tactical question I present: the gaps uncovered by this answer need to be addressed as soon as possible.
In all but the smallest, most static environments, this question has to be answered in an automated fashion. When I worked for the bank, we received a daily report of servers entering and leaving production status, which we manually acted on. If you are in a more dynamic environment or make use of ephemeral servers, you will need this discovery and instrumentation process to be fully automated.
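As a rough sketch of what automating this comparison might look like, the Python below diffs a list of production servers against the hosts a monitoring system knows about. The host names are invented, and both lists are assumed to come from exports you already have (for example, an inventory report and a monitoring tool's host list). No specific product's API is implied.

```python
def instance_gaps(production_hosts, monitored_hosts):
    """Compare the production inventory with the monitored host list.

    Returns two sorted lists:
      unmonitored: in production but unknown to the monitoring system
      stale:       still monitored but no longer in production
    """
    production = {h.strip().lower() for h in production_hosts}
    monitored = {h.strip().lower() for h in monitored_hosts}
    return {
        "unmonitored": sorted(production - monitored),
        "stale": sorted(monitored - production),
    }


if __name__ == "__main__":
    # Hypothetical data standing in for the two exports.
    production = ["app01", "app02", "db01", "web01"]
    monitored = ["APP01", "db01", "web01", "web-old"]
    print(instance_gaps(production, monitored))
```

Host names are normalized to lowercase here because inventories and monitoring tools often disagree on case. Adjust that to whatever naming rules apply in your environment.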
3. Are we monitoring for all incidents support staff commonly encounter? (Tactical)
The intent of this question is to discover all the types of incidents that a support team encounters and understand how they were detected and reported to the support team. The responsibility for detecting and reporting should be with your monitoring systems, so any incidents not coming through that channel are the gaps.
Conceptually, you are creating a list of such incidents and cross-checking them against what your monitoring systems are configured to alert on today, are capable of monitoring for (a fillable gap), and will not be able to monitor with the tools in hand (a persistent gap).
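The cross-check above amounts to sorting each incident type into one of three buckets. Here is a minimal Python sketch. The incident names and the two sets (`alerting_today`, `tool_capable`) are hypothetical and would be filled in from interviews with the support team and a review of your tools.

```python
def classify_incidents(incident_types, alerting_today, tool_capable):
    """Sort incident types into covered, fillable gap, and persistent gap."""
    report = {"covered": [], "fillable gap": [], "persistent gap": []}
    for incident in incident_types:
        if incident in alerting_today:
            # The monitoring systems already alert on this.
            report["covered"].append(incident)
        elif incident in tool_capable:
            # The current tools could detect this but are not configured to.
            report["fillable gap"].append(incident)
        else:
            # No tool in hand can detect this.
            report["persistent gap"].append(incident)
    return report


if __name__ == "__main__":
    # Hypothetical incident types gathered from a support team.
    incidents = ["disk full", "batch job overran", "expired certificate"]
    alerting_today = {"disk full"}
    tool_capable = {"disk full", "expired certificate"}
    for bucket, items in classify_incidents(incidents, alerting_today, tool_capable).items():
        print(f"{bucket}: {items}")
```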
4. Are we monitoring for failure and performance degradation scenarios that subject matter experts (SMEs) anticipate? (Strategic and Tactical)
Conceptually, you build a list of failure and performance degradation scenarios and cross-check this list with what you are monitoring for today. Anything not monitored for is a gap.
There are several methods you can use to generate the scenarios. I am partial to a Lean Six Sigma method called Failure Modes and Effects Analysis (FMEA), which not only generates a list of scenarios but helps prioritize them. Another method would be to take documented system functional requirements and ask the subject matter expert what could cause each function to not behave correctly. And yet another way would be to sit with the SME while looking at a diagram of the system, point to different components and ask questions like, "What could make this component not perform correctly?" and "What would happen to the system if it did?"
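To show how FMEA helps with prioritization, here is a small Python sketch of its usual scoring scheme: each failure mode is rated from 1 to 10 for severity, occurrence, and difficulty of detection, and the product of the three is the Risk Priority Number (RPN). The scenarios and ratings below are invented for illustration. Real ratings would come from your SMEs.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible impact) to 10 (severe impact)
    occurrence: int  # 1 (very unlikely) to 10 (very frequent)
    detection: int   # 1 (almost certain to be detected) to 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


if __name__ == "__main__":
    # Hypothetical scenarios with made-up ratings.
    modes = [
        FailureMode("Disk fills on database server", 8, 5, 3),
        FailureMode("Message queue backs up silently", 7, 4, 9),
        FailureMode("TLS certificate expires", 9, 2, 6),
    ]
    for mode in prioritize(modes):
        print(f"{mode.rpn:4d}  {mode.description}")
```

A high detection rating means the failure is hard to spot today, so scenarios with poor monitoring naturally float toward the top of the list.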
Be sure to choose your subject matter expert wisely. They not only have to be an expert in the technology but have to be an expert in how it is actually deployed within your organization. You might consider getting together your lead engineer, an admin and a consultant from the vendor to help you answer this question for a given technology.
5. Do we have the capability of monitoring technology in the pipeline / on the roadmap? (Strategic)
To be proactive and prepare your monitoring system portfolio for the future, you need to know what technology changes are coming down the pipe. These changes can be the introduction of new technologies, major updates to existing ones, or their decommissioning. For your monitoring systems, these changes can trigger the need for more / different licenses, increased capacity, system upgrades, module purchases, custom scripting, or complete replacements of monitoring tools. Each change brings its own monitoring challenges and it is up to you to be prepared before these changes go live.
If you have answered the previous four essential questions, you have likely uncovered monitoring requirements your current systems cannot handle. My advice to you is to leverage changes in your environment to address these shortcomings. If you are proactive by routinely answering this final essential question, you will be better positioned to ask projects for money at the beginning of their effort and not after they go live.
Good luck with your monitoring!