<<Project Name>>

Monitoring Plan

Customer Name

Directions for using template:

Read the Guidance (Arial blue font in brackets) to understand the information that should be placed in each section of this template. Then delete the Guidance and replace the placeholder within <<Begin text here>> with your response. There may be additional Guidance in the Appendix of some documents, which should also be deleted once it has been used.

Some templates have four levels of headings.  They are not indented, but can be differentiated by font type and size:

You may elect to indent sections for readability.

Author

Author Position

Date

Version: 1.0


© 2002 Microsoft Corporation. All rights reserved.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

Microsoft and Visual Basic are either registered trademarks or trademarks of Microsoft in the United States and/or other countries.


 

Revision & Sign-off Sheet

Change Record

Date

Author

Version

Change Reference

Reviewers

Name

Version Approved

Position

Date

Distribution

Name

Position

Document Properties

Item

Details

Document Title

Monitoring Plan

Author

Creation Date

Last Updated


 


Table of Contents

Summary

Objectives

Anticipating Failures

Resource Threshold Monitoring

Performance Monitoring

Trend Analysis

Application Health and Performance Monitoring

Detecting Failures (Incidents)

Error Detection

SNMP

Event Logs

Exception Trapping

Notifications

Diagnosing Failures (Problems)

Resolving Failures (Known Errors)

Recovering from Failures

Tools


 


 

[Introduction to the Template

Description: The Monitoring Plan defines the process by which the operational environment will monitor the solution. It describes what will be monitored, what monitoring is looking for, how monitoring will be done, and how the results of monitoring will be reported and used. Customers use automated procedures to monitor many aspects of their solutions. Automated monitoring is a key best practice that enables identification of failure conditions and potential problems. Monitoring helps to reduce the time needed to recover from failures.

Justification: The plan will provide the details of the monitoring process, which will be incorporated into the functional specification. Once incorporated into the functional specification, the monitoring process (manual and automated) will be included in the solution design. Monitoring ensures that operators are made aware that a failure has occurred so they can initiate procedures to restore service. Additionally, some organizations monitor their servers’ performance characteristics to spot usage trends. This proactive best practice allows organizations to identify the conditions that contribute to system failure and take action to prevent those conditions from occurring.

{Team Role Primary: Program Management is responsible for ensuring that the plan is completed and has acceptable quality, as well as incorporating it into the Master Project Plan and Operations Plan. Release Management will contribute heavily to the content of the plan in its responsibility for designing an effective solution monitoring process.

Team Role Secondary: Development will review the plan to ensure that the functional specification and project deliverables are in sync with the monitoring plan. Product Management will review the plan to ensure that external customer needs are met by the monitoring plan. Test and User Experience will review the plan to ensure that what is monitored supports their functional areas of interest.}]


 

Summary

[Description: Provide an overall summary of the contents of this document.

Justification: Some project participants may need to know only the highlights of the plan, and a summary gives them that view. It also gives readers of the full document its essence before they examine the details.]

<<Begin text here>>

Objectives

[Description: The Objectives section describes the business and technical drivers of the monitoring process and what key objectives are targeted for the monitoring process.

Justification: Identifying the drivers and monitoring objectives signals to the customer that Microsoft has carefully considered the situation and solution and created an appropriate monitoring approach.]

<<Begin text here>>

Anticipating Failures

[Description: The Anticipating Failures section should

This information could be documented in a matrix with columns such as:

Component | Single Point of Failure (yes or no) | Mean Time between Failures | Conditions and Circumstances leading to Failure | Probability of Component Failure | Impacts of Failure

Justification: Anticipating failures will enable operations either to avoid them or be prepared to deal with them when they occur.]

<<Begin text here>>

Resource Threshold Monitoring

[Description: The Resource Threshold Monitoring section identifies the solution resources that will be monitored, defines the conditions and circumstances to be monitored for each type of resource, and defines the thresholds used to judge whether resources are working properly and are sufficient to support the solution. Resources include hard drives, CPU, memory, and threads.]
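As an illustration only, the following Python sketch shows one way a threshold table could be written down and evaluated against a set of readings. The resource names, the warning and critical levels, and the `evaluate` helper are all assumptions made for this example; they are not part of the template or of any Microsoft tool, and real values should come from the solution's own requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    resource: str    # hypothetical resource name, e.g. "cpu_percent"
    warning: float   # level at which operations should be alerted
    critical: float  # level at which the resource is judged insufficient


# Illustrative values only; each plan must define its own thresholds.
THRESHOLDS = [
    Threshold("cpu_percent", 75.0, 90.0),
    Threshold("memory_percent", 80.0, 95.0),
    Threshold("disk_percent", 85.0, 95.0),
]


def evaluate(readings, thresholds):
    """Return (resource, level, value) for every reading that needs attention."""
    breaches = []
    for t in thresholds:
        value = readings.get(t.resource)
        if value is None:
            # A missing reading is itself worth reporting.
            breaches.append((t.resource, "missing", None))
        elif value >= t.critical:
            breaches.append((t.resource, "critical", value))
        elif value >= t.warning:
            breaches.append((t.resource, "warning", value))
    return breaches


if __name__ == "__main__":
    sample = {"cpu_percent": 92.0, "memory_percent": 60.0, "disk_percent": 88.0}
    for resource, level, value in evaluate(sample, THRESHOLDS):
        print(f"{level.upper()}: {resource} = {value}")
```

Keeping the thresholds in a data table, separate from the checking logic, lets operations adjust them without changing code.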

<<Begin text here>>

Performance Monitoring

[Description:  The Performance Monitoring section defines the monitoring process that gathers and records information about the performance of the total solution and the individual components in the solution. For each type of solution event it includes

<<Begin text here>>

Trend Analysis

[Description: The Trend Analysis section defines the analysis that will take place on the data collected during performance monitoring. Trend analysis uses the information gathered and recorded by performance monitoring to predict solution and component performance and health under different conditions and circumstances, such as a larger user set and a changing solution environment.]
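A minimal sketch of one possible trend calculation follows, assuming that a straight-line (least-squares) fit is an acceptable model for the metric being analyzed. The function names `fit_line` and `periods_until` and the sample figures are hypothetical; a real plan may call for seasonal or non-linear models instead.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all samples share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(threshold, xs, ys):
    """Estimate how many periods after the last sample the trend reaches threshold.

    Returns None when the fitted trend is flat or falling.
    """
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])


if __name__ == "__main__":
    days = [0, 1, 2, 3]
    disk_percent = [50.0, 52.0, 54.0, 56.0]  # made-up performance monitoring data
    print(periods_until(90.0, days, disk_percent))
```

The result is only as good as the linear assumption, so it is best treated as an early warning rather than a forecast.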

<<Begin text here>>

Application Health and Performance Monitoring

[Description: The Application Health and Performance Monitoring section should list and describe each software application in the solution and describe the plan for monitoring each application:

<<Begin text here>>

Detecting Failures (Incidents)

[Description: The Detecting Failures section should describe how the development team, operations, and maintenance will utilize the functional specifications and user acceptance criteria to detect failure incidents. The functional specifications clearly define the success criteria for a solution and for each of its components. User Acceptance Criteria, based on the functional specifications, precisely define user expectations for the correct and effective operation of the solution.]

<<Begin text here>>

Error Detection

[Description: The Error Detection section describes the processes, methods, and tools teams will use to detect and diagnose solution errors. The goal of an error detection strategy should be that each error is detected and resolved, and the solution recovered, without the user community being aware of it.

Justification: Error detection in a Windows environment will enhance a solution’s reliability and availability. Early detection and handling of application and system errors can help avoid a shutdown, or at least allow for an orderly shutdown. It can also increase availability by allowing the solution to continue operating in a degraded state.]

<<Begin text here>>

SNMP

[Description: The Simple Network Management Protocol (SNMP) captures, or traps, configuration and status information from a Windows NT server.]

<<Begin text here>>

Event Logs

[Description: The Event Logs section describes the logs that will provide a system for capturing and reviewing significant application and system events. Describe the logs operations will maintain and the procedures they will use to record events and their times in the logs.]
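For illustration, the sketch below uses Python's standard `logging` module to write timestamped event records to a stream. The logger name, the record layout, and the `make_event_logger` helper are assumptions for this example; a Windows solution would more likely write to the system event log, which is not shown here.

```python
import io
import logging


def make_event_logger(name, stream):
    """Return a logger that writes timestamped event records to stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of any root handlers
    handler = logging.StreamHandler(stream)
    # Each record carries the time, the severity, the source, and the message.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for a log file in this sketch
    events = make_event_logger("solution.events", buffer)
    events.info("service started")
    events.warning("disk at %d%% of capacity", 91)
    print(buffer.getvalue(), end="")
```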

<<Begin text here>>

Monitoring for Failure

[Description: The Monitoring for Failure section should describe the processes, methods, and tools teams will use to detect and report solution failures.]

<<Begin text here>>

Monitoring for Success

[Description: The Monitoring for Success section describes the processes, methods, and tools teams will use to determine that the solution is working correctly and is meeting user expectations. Monitoring for success includes the use of monitoring tools and interaction with solution users to gather information about solution successes.]

<<Begin text here>>

Monitoring for Alarms

[Description: The Monitoring for Alarms section describes how solution alarms will signal that a problem is about to occur or has occurred in a solution. It should identify all solution alarms, indicate how they will signal users and operations, and define what each alarm means.]

<<Begin text here>>

Exception Trapping

[Description: The Exception Trapping section describes a type of monitoring built into a solution that recognizes incidents, indicating a solution has produced a result that is an exception to acceptable results (i.e., the result lies outside the range of acceptability). This section should identify where the development team will build exception traps into the solution that continually monitor solutions or that operations will turn on when they suspect problems within a solution. Exception trapping capabilities allow for reliable programmer and program control over responses to exceptions that occur during the execution of a solution.]
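The idea of an exception trap can be sketched as follows. This is a hedged example, not a prescribed design: `OutOfRangeError`, `check_range`, `run_trapped`, and the logger name are hypothetical names introduced here, and the acceptable range is made up.

```python
import logging

logger = logging.getLogger("solution.exception_traps")  # hypothetical logger name


class OutOfRangeError(Exception):
    """A result lies outside the range of acceptability."""


def check_range(name, value, low, high):
    """Return value unchanged, or raise OutOfRangeError if it is unacceptable."""
    if not (low <= value <= high):
        raise OutOfRangeError(f"{name}={value} is outside [{low}, {high}]")
    return value


def run_trapped(operation, fallback, trap_log):
    """Run operation(); if it raises, record the incident and return fallback."""
    try:
        return operation()
    except Exception as exc:
        # Record the incident so operations can review it later.
        trap_log.append((type(exc).__name__, str(exc)))
        logger.warning("exception trapped: %s", exc)
        return fallback


if __name__ == "__main__":
    incidents = []
    result = run_trapped(
        lambda: check_range("response_ms", 5000, 0, 2000),
        fallback=None,
        trap_log=incidents,
    )
    print(result, incidents)
```

Returning a fallback value lets the caller continue in a degraded state instead of stopping, while the recorded incident still reaches operations.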

<<Begin text here>>

Notifications

[Description: The Notifications section describes how people will be notified when monitoring and exception trapping have detected solution failures. This should include notification for errors and cases in which user performance expectations have not been met.]
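One way to document notification rules is as a routing table from severity to recipients. The sketch below assumes three severity levels and invented recipient names; the `ROUTES` table and `build_notifications` function are hypothetical, and actual delivery (e-mail, pager, console) is deliberately left out.

```python
# Hypothetical routing table: severity -> who is notified.
ROUTES = {
    "critical": ["on-call operator", "operations manager"],
    "warning": ["operations queue"],
    "info": [],
}


def build_notifications(event, routes=ROUTES):
    """Return (recipient, message) pairs for a detected event.

    Unknown severities are routed like critical events so nothing is dropped.
    """
    recipients = routes.get(event["severity"], routes["critical"])
    message = f"[{event['severity'].upper()}] {event['source']}: {event['summary']}"
    return [(recipient, message) for recipient in recipients]


if __name__ == "__main__":
    event = {"severity": "warning", "source": "web01", "summary": "response time above target"}
    for recipient, message in build_notifications(event):
        print(f"notify {recipient}: {message}")
```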

<<Begin text here>>

Diagnosing Failures (Problems)

[Description: The Diagnosing Failures section describes the processes, methods, and tools teams will employ to diagnose the problems detected in solutions by monitoring and exception trapping.]

<<Begin text here>>

Resolving Failures (Known Errors)

[Description: The Resolving Failures section describes the procedures teams will use to correct the errors detected and diagnosed in solutions and to improve solutions that do not meet user expectations.]

<<Begin text here>>

Recovering from Failures

[Description: The Recovering from Failures section defines how the solution will be recovered from failure or references the Backup and Recovery Plan.]

<<Begin text here>>

Tools

[Description: The Tools section lists and describes the tools teams can employ to detect, diagnose, and correct errors and to improve a solution’s performance. The table below is an example of this.]

<<Begin text here>>

Tool

Description

Microsoft Systems Management Server

Integrated inventory, distribution, installation, and remote troubleshooting tools for centralized management of hardware and software. Microsoft Systems Management Server can be used in medium to large multi-site Windows-based environments to reduce the cost of change and configuration management of Windows-based desktop and server computers. Details available at http://www.microsoft.com/backoffice

Microsoft Performance Monitor (Perfmon)

Windows NT administrative tool that enables viewing behavior of processors, memory, cache, threads, and process objects. Each object has an associated set of counters that provide information about device usage, queue length, delays, and other data that measures throughput and internal congestion. Details available at http://www.microsoft.com/ntserver

Microsoft Windows NT Resource Kit, version 3.51
Microsoft Windows NT Server 4.0 Resource Kit
Microsoft Windows NT Workstation 4.0 Resource Kit

Microsoft Press® kits contain both technical documentation and a CD-ROM with useful utilities and accessory programs to help install, configure, and troubleshoot Microsoft Windows NT. Details available at http://www.mspress.microsoft.com

Tivoli Management Software

Family of products with a single management framework integrating disparate IBM systems management applications. Details available at http://www.tivoli.com

Microsoft HTTPMon

Multithreaded Windows NT service that monitors web server performance by measuring how quickly the web server responds to requests from client browsers. Details available at http://www.microsoft.com/ntserver

HP OpenView

Hewlett Packard family of products designed to manage distributed computer systems and networks from computers running Windows or UNIX operating systems. Details available at http://www.hp.com

NetManage

Single-source PC-to-host connectivity solutions from NetManage. The company develops integrated applications, servers, and development tools for Microsoft Windows, Windows® 95 and Windows NT operating systems. Details available at http://www.netmanage.com

PerlEx

Utility for Web servers running under Windows NT that improves the performance of Perl scripts. Details available at http://www.activestate.com

SeNTry

An SNMP-based monitoring tool. Details available at http://www.missioncritical.com