Design for Operations
Building Health, Task, and State Models

 

Microsoft® Corporation

Published: October 2003

 

Abstract

As part of its Dynamic Systems Initiative, Microsoft is making a number of investments to improve the manageability of individual applications and distributed systems. These investments include enhancements to the management infrastructure in Microsoft Windows® operating systems, development of management protocols based on the Web Services Architecture, and capabilities in our development tools that make it easier for developers to construct and publish management capabilities.

This white paper provides guidance on the development and usage of Models for Management, which Microsoft developed. These models for system health, administrative tasks, and application state provide a structured basis for building management into your applications and services. Using these models will help you create instrumentation, monitoring, and automation that support the administrative needs of your application or service.

This white paper is intended for service or application developers who want to increase the manageability of their products.


 

Legal Page

 

This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein. 

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.  Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only.  MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user.  Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document.  Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2003 Microsoft Corporation.  All rights reserved.

Microsoft, Active Directory, ActiveX, Outlook, SharePoint, Windows, Windows NT, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

 


 

Contents

Introduction

The Models

Document Flow

Health Model - Introduction, Customer Impact, Understanding the Health Model, Building the Health Model, Implementation

Task Model - Introduction, Customer Impact, Understanding the Task Model, Building the Task Model, Implementation

State Model - Introduction, Customer Impact, Understanding the State Model, Building the State Model, Implementation

Appendix A – Health Model Documents - Terminal Server System Diagram, Terminal Server Licensing Spreadsheet excerpt, Terminal Server Licensing Health State Diagram

Appendix B – Customer Issues Template

Appendix C – Task Model Example, Sample Task Breakout, Sample Operation Breakout

Appendix D – Common Tasks

 


 

Introduction

Decisions made during the application design phase have a profound effect on the deployment and ongoing operations of that application. An analogy can be drawn here to the “Design for Manufacturability” movement that occurred in the manufacturing industry many years ago. In those product development cycles, studies have shown that only 8% of the total product budget has been spent by the time a product is designed, but those design decisions have locked in 80% of the cost of the product.[1] The manufacturing industry learned from this and began to have production workers sit with the product design teams as they made design choices to ensure that the choices they made factored in manufacturing concerns.

The same principles can be applied to software. Manageability must be engineered into an application from the beginning; it can no more be added on after the fact than scalability or security. Applications must have manageability interfaces built in. These interfaces and their schemas must be published, together with the models and behaviors behind them, prescriptive guidance for how to operate them, and any policies and service-level agreements that apply.

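As part of its Dynamic Systems Initiative, Microsoft is making a number of investments to improve the manageability of individual applications and distributed systems. These investments include enhancements to the management infrastructure in Microsoft Windows operating systems, development of management protocols based on the Web Services Architecture, and capabilities in development tools that make it easier for developers to construct and publish management capabilities.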

Microsoft recommends that application developers follow the practices in this white paper and leverage the forthcoming Longhorn infrastructures to ensure the most manageable enterprises possible. Longer term, our development tools will provide additional capability to make following these best practices even simpler.

The processes and infrastructures described here and being developed for Longhorn will be deeply integrated into Microsoft components.

Before manageability is coded into services or applications, it is important to define the basis of manageability. You should consider what the administrator will see and what it means to the system. Going deeper, you should identify the issues that will help administrators manage systems proactively and perform the necessary tasks. Microsoft has developed three models as the basis for implementing management within a service or application: the Health Model, the Task Model, and the State Model. This white paper is meant to guide you through the process of developing these models.

The Models

The models are meant to provide a prescriptive and iterative mechanism for ensuring that management is built into every service and application and that the management is aligned with the needs of the administrator who will be running your service. There are three management models:

The Health Model defines what it means for a system to be healthy or unhealthy, and it defines how a system transitions in and out of such states. Good information on a system’s health is necessary for the maintenance and diagnosis of running systems. The contents of the health model become the basis for system events and instrumentation on which monitoring and automated recovery is built. All too often, system information is supplied in a developer-centric way, which does not help the administrator know what is going on. The Health Model seeks to guide both what kinds of information should be provided and how the system or the administrator should respond to the information.

The Task Model enumerates the activities that are performed in managing the system. These may be maintenance tasks performed on a routine basis, such as backup; event-driven tasks, such as adding a user; or diagnostic tasks performed to correct system failures. Defining these tasks guides the development of administration tools and interfaces, and it becomes the basis for automation. Used in conjunction with the Health Model and ensuing instrumentation, the Task Model also drives self-correcting systems.

The State Model catalogs the state and settings associated with an application and defines the scope and type for each. State may be associated with the computer or the user, it may be temporary or permanent, and it might be user data or operational parameters. Having a strict association of every state entity with its scope and category allows the administrator flexibility in deployment and provides a powerful tool for control. It means an administrator can separately store user data, migrate a user easily from one computer to another, and replicate computer configuration across a data center.

The following diagram shows how management fundamentals build on these models: 


Note that the models are developed prior to the code, although the process below shows you how to leverage existing code to start the process. Then the models are incorporated into the design and the application, leveraging the infrastructures Microsoft provides. Using the defined processes and infrastructures allows administrative tools to operate across servers and services. The process is meant to be iterative. Working with administrators, you will learn more over time about system failure states for the health model and tasks needed to run the system. The processes below describe how to use the feedback from your customers, internal testers, and support organizations to continually improve the models, which will drive changes in the application itself.

Document Flow

This document is divided into three key sections: Health Model, Task Model, and State Model. These are the three fundamental areas in which development teams participate in management. Each section describes how to think about the area, what work is appropriate for what applications, how to prioritize that work, and how to incorporate it.

After describing the model, each section introduces you to the infrastructures that support it. Often there are multiple mechanisms; some are available today, and some will be delivered in Longhorn. The section helps you figure out which ones apply to your application or service.

 

Health Model

Introduction

Currently, most Windows-based applications and services expose events, performance counters, tracing, and Windows Management Instrumentation (WMI) objects. However, they lack a clearly defined relationship between this instrumentation, the overall health of the services an application provides, and its possible failure modes. Often insufficient information is provided to help detect, diagnose, or recover from specific problems, which limits the benefit administrators and users receive from this instrumentation.

The Health Model has the following goals:

·         Document all management instrumentation exposed by an application or service.

·         Document all service health states and transitions that the application can experience when running.

·         Determine the instrumentation (events, traces, performance counters, and WMI objects/probes) necessary to detect, verify, diagnose, and recover from bad or degraded health states.

·         Document all dependencies, diagnostics steps, and possible recovery actions.

·         Identify what conditions will require intervention from an administrator.

·         Improve the model over time by incorporating feedback from customers, product support, and testing resources.

The Health Model is initially built from the management instrumentation you expose. By analyzing this instrumentation and the system failure modes, you can also identify where your application lacks the proper instrumentation. Events, WMI providers, performance counters, probes, and Event Tracing for Windows (ETW) are the infrastructure pieces that allow you to expose the information indicated by your model. Today, monitoring builds on top of this instrumentation using Microsoft Operations Manager (MOM) management packs. In Longhorn, Windows Monitoring Service will build on the instrumentation, and MOM will build on top of that.

Customer Impact

Building from a strong understanding of service health for your applications allows instrumentation to be aligned with customer needs. Coupled with the monitoring and diagnostic infrastructures, this will allow administrators to quickly obtain the information appropriate to their circumstances. The guidelines below on management instrumentation and documentation will ensure that the structured information delivered to the administrator is meaningful and that the appropriate actions are clear. These improvements will support prescriptive guidance, automated monitoring, and troubleshooting, which, in turn, will simplify data center operations, reduce help desk support time, and lower operational costs.

The more complete and accurate an application’s model is, the fewer support escalations can be expected. This is simply because the known possible failures and corrective actions have already been described. With more automation, you enable customers to manage a larger number of computers per operator with higher uptime.

In addition, the modeling documents created can be directly used in producing the deployment, operations, and prescriptive guidance documents for your customers when the product ships.

Understanding the Health Model

Currently, most applications and services expose events, performance counters, and WMI objects with no clearly defined relationship between this instrumentation, application health, and the application's failure modes. This contributes to the following problems:

·         Administrators do not know when things are going wrong until something breaks.

·         When something breaks, it is hard to determine what is broken and what to do about it.

·         Automatic monitoring tools do not have sufficient knowledge about the system to repair the problem.

·         Product support does not have information required to troubleshoot the application.

The Health Model addresses the above problems by:

·         Prioritizing an application’s top known support and customer issues.

·         Documenting all management instrumentation that an application contains that can be used to determine health.

·         Documenting all known health states and transitions that the application can potentially go through during its lifecycle.

·         Documenting the detection, verification, diagnosis, and recovery steps for all “bad” health states.

·         Identifying instrumentation (events, traces, and performance counters) necessary to detect, verify, diagnose, and recover from bad health states.

·         Refining the model as new states, transitions, and diagnostic steps are identified through customer, support, test, and community inputs.

The primary deliverables of the Health Model are:

·         Spreadsheet(s) that enumerate all management instrumentation, health states, health state transitions, verification and diagnosis steps, and recovery actions.

·         A Health State diagram of the application with all health states and transitions between them.

·         Monitoring and troubleshooting rules to automate failure diagnosis and recovery.

“Appendix A – Health Model Documents” provides an example of this using the Terminal Server Licensing area. It includes a system-overview diagram, the spreadsheet, and the Health State diagram. These can be used as examples to help understand the process.

The following diagram shows the generic lifecycle of an application and the high-level health states a service can be in:

 

 

The states in this diagram indicate conditions that the system can be in. The sample in “Appendix A – Health Model Documents” includes states such as Full Run, Stopped, and Not Activated.

Red, Green, and Yellow colors indicate the basic health buckets an application or service can be in:

·         Green – Fully operational, executing normally.

·         Yellow – Partially operational. Can perform some functions, but some problem has been detected. Non-critical functionality is missing, performance is degraded, or security is compromised.

·         Red – Cannot provide its services at all.

 

All applications will have an ‘Executing’ state in which they are functioning normally. In some cases, there may be more than one such state where there are alternate, viable modes of execution. All applications will have a Red state, indicating the application is not running. There may be multiple Red states, such as in the example above, indicating stopped versus failure or alternate instances of total failure. Yellow states vary significantly and are frequently unique to the architecture of the service being modeled because applications fail differently and can have numerous states of partial functionality.

 

The transitions between states show the activities that cause or indicate the change from one state to another. Understanding these transitions is crucial to properly instrumenting the system. Building instrumentation from the Health Model exposes the transitions into red or yellow unhealthy states, ensuring that these changes are surfaced to the administrator. Transitions back to a green healthy state are also expressed in the model and also must be instrumented. These ‘anti-alerts’ show administrators that the system is no longer in trouble, either because of self-correction or actions they have taken.
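
The buckets, states, and transitions described above can be captured in a small data structure. The following Python sketch is illustrative only: the state names (Running, Degraded, Stopped) and the event-group names are assumptions made for this example and do not come from any Microsoft infrastructure.

```python
from enum import Enum


class Bucket(Enum):
    GREEN = "fully operational"
    YELLOW = "partially operational"
    RED = "cannot provide its services"


# Hypothetical health states for an example service, each mapped to a bucket.
STATES = {
    "Running": Bucket.GREEN,
    "Degraded": Bucket.YELLOW,
    "Stopped": Bucket.RED,
}

# Hypothetical transitions: (state before, event group) -> state after.
TRANSITIONS = {
    ("Running", "DependencySlow"): "Degraded",
    ("Degraded", "DependencyRecovered"): "Running",
    ("Running", "ServiceStopped"): "Stopped",
    ("Degraded", "ServiceStopped"): "Stopped",
    ("Stopped", "ServiceStarted"): "Running",
}


def apply_event(state, event_group):
    """Return the new state and whether the change is an alert or an anti-alert."""
    new_state = TRANSITIONS.get((state, event_group), state)
    if new_state == state:
        kind = "none"        # event does not change health in this state
    elif STATES[new_state] is Bucket.GREEN:
        kind = "anti-alert"  # transition back to a healthy state
    else:
        kind = "alert"       # transition into a red or yellow state
    return new_state, kind
```

Keeping the transition table explicit makes it easy to see which state changes are surfaced to the administrator and which ones have no signal at all.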

 

Building the Health Model

The following is the process for developing a Health Model for an application or service. This process results in a spreadsheet and a diagram that comprehensively describe the health states and all other related information about the application.

Requirements

Building the Health Model requires the following steps:

1.       For an existing application, identify and prioritize your customers’ top failure cases. These indicate red or yellow states that should be expressed in your model.

2.       Enumerate all management instrumentation your application exposes. This will help you identify additional health states and transitions, align your instrumentation with the model, and identify where additional instrumentation is necessary.

3.       Analyze this instrumentation to create a spreadsheet that documents your health states, detection signatures, verification steps, diagnosis steps, and recovery actions.

4.       Analyze your service architecture for potential failure modes not currently exposed by your instrumentation.

5.       Add all states that can only be detected by inspecting instrumentation or by exercising instrumentation methods.

6.       Create a diagram that shows health states and transitions between them.

7.       As your code evolves, update your model to accurately reflect the code. Add new health states and events to the model, and make sure that required instrumentation is in place.

8.       Use feedback from customers to discover unknown problem states, and update your model accordingly. Add instrumentation where required to support these new states.

 

The excerpt of the sample spreadsheet in “Appendix A – Health Model Documents” illustrates the results of this process. A spreadsheet titled Template - Instrumentation Inventory and Health State Workbook can be found at:
http://www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx
to use for developing your specific model.

Process

Step 1: Identify and prioritize customer issues.

1.       Generate a prioritized list of customer issues with your application.
A template and example can be found in “Appendix B – Customer Issues Template.” Good sources for issues for existing applications are direct customer feedback, customer support teams, and community forums. For all applications, include expected failure cases derived from the product planning team, test cases, and test experience.

2.       Prioritize your issues list based on relative customer pain.
The template in “Appendix B – Customer Issues Template” allows you to enter the frequency of failure and difficulty to diagnose and repair.

·         For existing applications, call volumes and average response time are good measures to use, and these can be multiplied to create priority.

·         For new applications and internally identified scenarios, assess the expected frequency and difficulty of each scenario, use results from your test organization, or both.

 

3.       Decide on your solution scenario for addressing your top customer issues in priority order. Possible actions are:

a)      Treat the issue simply as a bug and fix it.

b)      Re-architect so that the issue no longer surfaces.

c)       Develop diagnostic and monitoring instrumentation to ensure that this case is detectable, diagnosable, and recoverable in your model. This is the case we focus on here.

 

4.       Use this list as you progress through this process to ensure that your model is evolving so that administrators can diagnose and recover from these top issues. Ensure that failure states identified in 3c are reflected in your Health State diagram, and determine what indicates transitions into and out of those states.
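
The prioritization in Step 1 can be sketched in a few lines of Python. This is a minimal illustration of multiplying frequency by cost to resolve, as suggested above for existing applications. The issue names, the numbers, and the units (calls per month, hours per call) are invented sample data, not figures from any real product.

```python
from dataclasses import dataclass


@dataclass
class CustomerIssue:
    name: str
    call_volume: int           # support calls per month (assumed unit)
    avg_response_hours: float  # average time to resolve a call (assumed unit)

    @property
    def priority(self) -> float:
        # Frequency multiplied by cost to resolve gives relative customer pain.
        return self.call_volume * self.avg_response_hours


def prioritize(issues):
    """Return issues sorted from highest to lowest customer pain."""
    return sorted(issues, key=lambda issue: issue.priority, reverse=True)


# Hypothetical sample data for illustration only.
issues = [
    CustomerIssue("License server not found", call_volume=40, avg_response_hours=2.0),
    CustomerIssue("Service fails to start", call_volume=10, avg_response_hours=12.0),
    CustomerIssue("Slow logon", call_volume=25, avg_response_hours=1.0),
]
ranked = prioritize(issues)
```

In this sample, the less frequent startup failure ranks first because each occurrence is far more expensive to resolve.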

Step 2: Enumerate all management instrumentation your application exposes.

1.       Gather all .mc files for your application or service.

2.       Obtain the spreadsheet Template - Instrumentation Inventory and Health State Workbook from:
http://www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx

3.       Open each .mc file with a plain-text editor, and transfer the events into the spreadsheet. Include the event ID, symbolic name, facility, category, type, and message description.
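
Transferring events by hand is tedious for large message files, so a small script can help. The Python sketch below reads message definitions written in the usual key=value layout of a message compiler text file, where the message text follows the Language line and ends with a line containing only a period. It is a simplified reader under those assumptions: header directives such as SeverityNames are not interpreted, and the sample message file, its symbolic names, and the chosen output columns are hypothetical.

```python
import csv
import io

# Hypothetical message file content for illustration only.
SAMPLE_MC = """\
;// Sample message definitions
MessageId=0x1
Severity=Error
Facility=Application
SymbolicName=MSG_LICENSE_SERVER_UNAVAILABLE
Language=English
The license server %1 could not be contacted.
.

MessageId=0x2
Severity=Informational
Facility=Application
SymbolicName=MSG_LICENSE_SERVER_RESTORED
Language=English
Contact with license server %1 has been restored.
.
"""


def parse_mc(text):
    """Parse message definitions into a list of dictionaries (simplified)."""
    events, current, body = [], {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if body is not None:                  # inside the message text
            if stripped == ".":               # a lone period ends the message
                current["Message"] = " ".join(body)
                events.append(current)
                current, body = {}, None
            else:
                body.append(stripped)
        elif not stripped or stripped.startswith(";"):
            continue                          # blank line or comment
        elif "=" in stripped:
            key, value = (part.strip() for part in stripped.split("=", 1))
            if key == "Language":
                body = []                     # message text starts on the next line
            else:
                if key == "MessageId":
                    current = {}              # discard any header directives seen so far
                current[key] = value
    return events


def to_csv(events):
    """Write the parsed events as CSV text that a spreadsheet can import."""
    columns = ["MessageId", "SymbolicName", "Facility", "Severity", "Message"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(events)
    return out.getvalue()
```

The remaining workbook columns, such as State Before and State After, still have to be filled in by someone who knows the code, because the message file does not contain that information.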

Step 3: Analyze your instrumentation to create a spreadsheet that has the following information:

Go through the spreadsheet event by event, and fill in the following information for each event.

Item

Description

Event ID

Event ID as reported to Eventlog. (Obtained from .mc file.)

Symbolic Name

Symbolic name for the event. (Obtained from .mc file.)

Facility

[Optional] Facility for the event. (Obtained from .mc file.)

Category

[Optional] Category for the event. (Obtained from .mc file.)

Type

Event type as reported to Eventlog. (Obtained from .mc file.)
Revise per the guidelines on Event Types if necessary.

Level

Severity of event. Revise if necessary. These might include:

Critical:

The application has encountered a critical degradation in its health or capabilities, which prevents it from servicing any subsequent operations.

Error:

The application has encountered a partial degradation in its capabilities, but it may be able to continue to service further requests.

Warning:

The application has encountered problems that are not immediately significant but that may indicate conditions that could cause future problems. This level also applies when the application has detected problems in a different application that do not affect its own health or capabilities.

Informational:

The application has encountered a positive change in its capabilities (that is, recovered from a previous degradation). These often negate previous degradations.

Verbose:

Detailed diagnostic trace of the intermediate steps taken by the application while executing.

Message Description

Event Message Description as written to Eventlog. (Obtained from .mc file.)

Review and update as needed. Admin Event messages must have:

Explanation:

The explanation should provide a text description of what occurred and the change in the capabilities of the service that resulted from it. If the change is negative (that is, a degradation in capabilities), this description should specify the degradation that occurred. If the change is positive, this description should state what the new or restored capabilities are.

User Action/Remedy (not applicable for informational events):

The user action/remedy presents steps the user can take to fix the problem, to diagnose it further, or both. It could include running a utility or performing a different task to fix the problem, retrying an operation, or looking into another log for further information about the problem.

Legacy API?

If the event is reported through the legacy API (ReportEvent), the spreadsheet should show Yes in this column. If the event will be reported through the new API (EvtReport), the spreadsheet should show No in this column.

Tag

This column should show into which classifications the event falls. You can also add tags for event types that are specific to your service.

Install: The event indicates the installation or uninstallation of an application or service within the service raising the event.

Settings: The event indicates a settings (configuration) change in the service.

Lifecycle: The event indicates a runtime lifecycle change (that is, start, stop, pause, or maintenance) in the service.

Security: The event indicates a change that is security related.

Backup: The event indicates a change that is related to backup operations.

Restore: The event indicates a change that is related to restore operations.

Connectivity: The event indicates a change that is related to network connectivity issues.

LowResource: The event is related to or caused by low-resource issues (for example, low disk space or memory).

Archive: This event should be archived for an extra long period of time for the purpose of availability analysis. (These events must be infrequent, for example, restarting the computer.)

Insert parameters

Enter real property names for each of the Insert parameters for this event. Use commas to separate Insert parameters.

Blame Component

If the blame for this failure falls on one of your dependencies, state the dependency to blame for the failure.

State Before

Operational state of the application or service before the event.

State After

Operational state of the application or service after the event.

Desired State

Operational state in which the application or service would have been, had the event not occurred.

Event Group

Name of a group of related events all signifying a transition from one health state to another. Use a separate name for each transition line, but give the same name to all events that indicate that particular transition. See the example in “Appendix A – Health Model Documents.”

Availability

Current level of service availability in this state. Availability can be:

Red: No service/functionality is available.

Yellow: Partial service/functionality is available.

Green: All service/functionality is available.

Verification

How do you verify whether you are still in this state?

How do you verify whether you have recovered from this state (anti-events)?

Test, probe, or presence/lack of an informational event that can be used to verify whether the service is in the detected state.

Diagnosis

What should be inspected to determine the root cause of why the application is in this state?

Diagnosis typically starts by enumerating the list of "Detection" events and identifying where diagnosis should start for each one.

Events, traces, configuration settings, WMI providers, and performance counters can all be sources for diagnostic information.

Recovery

How can the application recover from this state? What actions should be taken?

Configuration settings, WMI providers, troubleshooters, and monitoring rules can all be used as potential recovery steps.

Auto-Retry

Does the application automatically attempt to recover from this state? If so, how often?

Anti-Event

Event that indicates a possible transition back to a healthy state for this event. If verified, invalidates the original transition to a bad health state.

Comments

General comments about this event, this state, or both.

Source File

Convenience column for listing the source file from which this event is logged.
This is optional but has proven useful for some teams doing their analysis.

Probability

Probability of occurrence of this event based on knowledge of the code path and experience from previous support issues. This is fairly subjective and is meant to help you prioritize which events are most important to work on. This field can have a value of:

·          Rare

·          Low

·          Medium

·          High

 

Notes:

·         Because several events with different causes can result in the same state, the verification, diagnosis, and recovery steps depend on the event that signaled the problem, not only on the state.

·         Verification, diagnosis, and recovery instructions should be precise. Ideally, they should have references to the tools or scripts that can do the work and parameters for them.

·         Verification and diagnosis should log informational events to signal successful recovery wherever possible.
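
One way to keep the workbook honest is to treat each row as a record and check it for gaps. The Python sketch below models a subset of the columns described above and reports rows for unhealthy states that lack verification, diagnosis, or recovery steps. The field names and the sample row used in the usage note are assumptions for illustration. The allowed Level and Availability values are taken from the table.

```python
from dataclasses import dataclass

LEVELS = {"Critical", "Error", "Warning", "Informational", "Verbose"}
AVAILABILITY = {"Red", "Yellow", "Green"}


@dataclass
class WorkbookRow:
    event_id: int
    symbolic_name: str
    level: str
    state_before: str
    state_after: str
    availability: str   # availability of the state after the event
    event_group: str
    verification: str = ""
    diagnosis: str = ""
    recovery: str = ""
    anti_event: str = ""


def problems(row):
    """Return a list of gaps found in one workbook row."""
    found = []
    if row.level not in LEVELS:
        found.append(f"unknown level {row.level!r}")
    if row.availability not in AVAILABILITY:
        found.append(f"unknown availability {row.availability!r}")
    if row.availability in {"Red", "Yellow"}:
        # Unhealthy states need all three sets of instructions.
        for column in ("verification", "diagnosis", "recovery"):
            if not getattr(row, column):
                found.append(f"missing {column} steps")
    return found
```

For example, a hypothetical row such as `WorkbookRow(1001, "MSG_X", "Error", "Running", "Degraded", "Yellow", "DependencySlow", verification="Probe the dependency")` would be reported as missing its diagnosis and recovery steps.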

Event Types

An application can raise different types of events, and it is very important to define and distinguish them from each other. These types of events are:

Event Type: Administrative Events
Description: Indicate a change in the health or capabilities of an application or the system itself, signaling a health-state transition.
Examples: Started; Service stopped; Database backup failure; Severely degraded performance

Event Type: Audit Events
Description: Indicate a security-related operation, including the result of an access check on a secured object.
Examples: User logon

Event Type: Operational Events
Description: Indicate state changes, such as deployment, configuration, or internal application changes. These might be interesting to an administrator for debugging, auditing, or measuring compliance with a service-level agreement (SLA).
Examples: Counters installed for application x; Thread pool increased to 50 threads

Event Type: Debug Tracing
Description: Code-level debugging statements that are comprehensible only to someone with knowledge of the source code.
Examples: Function x returned y status code

Event Type: Request Tracing
Description: Track application activity, response time, and resource usage within and between parts of an application. Activated for problem diagnosis.
Examples: HTTP web request; Search command on database servers

Step 4: Analyze your service architecture for potential failure modes.

1.       Map both your internal and external dependencies and how they can fail.

2.       Examine your code for locations where failures are encountered, recovery logic has been written, or both.

3.       Ensure that each of these locations in your code exposes the proper type of instrumentation based on the instrumentation selection guidelines later in this white paper. The instrumentation must provide the administrator or user with clear information about actions to take, the cause of the problem, the loss in functionality, and further diagnostic direction.

4.       Make sure you have instrumentation to signal transitions back from bad states to good (anti-alerts).

5.       Update the Instrumentation Inventory and Health State Workbook and state diagram with this information.
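
The anti-alert requirement in this step lends itself to an automated check. The Python sketch below lists unhealthy states that the instrumentation can signal entering but never signal leaving for a Green state. The states, availability values, and event-group names are hypothetical sample data.

```python
# Hypothetical availability of each state.
AVAILABILITY = {"Running": "Green", "Degraded": "Yellow", "Stopped": "Red"}

# Hypothetical instrumented transitions: (state before, state after, event group).
TRANSITIONS = [
    ("Running", "Degraded", "DependencySlow"),
    ("Running", "Stopped", "ServiceStopped"),
    ("Stopped", "Running", "ServiceStarted"),
]


def states_without_anti_alert(availability, transitions):
    """Return unhealthy states that are entered but never signal a return to Green."""
    entered = {after for _, after, _ in transitions if availability[after] != "Green"}
    recovered = {
        before
        for before, after, _ in transitions
        if availability[before] != "Green" and availability[after] == "Green"
    }
    return sorted(entered - recovered)


missing = states_without_anti_alert(AVAILABILITY, TRANSITIONS)
```

With this sample data the check reports the Degraded state, because nothing tells the administrator when the service has returned to normal from it.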

Step 5: Add states that can be detected only by exercising instrumentation.

Not all health state transitions can be detected, diagnosed, and verified from inside the service itself. For this reason, it is also important to document which client applications or services rely on your service, how they might be exercised to test the health of your service, and how the management instrumentation that they expose could indicate your failure to supply proper service to them.

An application might, for example, publish the average transaction time over a certain interval as a performance counter. An external service can detect a performance degradation by comparing this to historical data and generate an appropriate event. An application might also be blocked by waiting for an external application that has stopped responding.
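
The performance-counter example above can be sketched as a simple comparison against historical data. The Python function below flags a degradation when the current average transaction time exceeds the historical mean by more than a chosen number of standard deviations. The threshold of three standard deviations and the sample values are assumptions for illustration, and a real monitor would choose its own baseline and rule.

```python
from statistics import mean, stdev


def detect_degradation(history, current, threshold=3.0):
    """Return True if `current` exceeds the historical mean by more than
    `threshold` sample standard deviations. Needs at least two history points."""
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + threshold * spread


# Hypothetical average transaction times in milliseconds from past intervals.
history_ms = [100, 104, 98, 101, 97, 102, 99, 103]
```

An external monitoring service that runs such a check would then raise the appropriate event on behalf of the application, which is why the state belongs in the model even though the application cannot detect it by itself.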

Step 6: Create the Health State Diagram.

A visual representation helps illustrate how your application or service looks as a whole. A visual health state transition diagram also can pinpoint where you are missing instrumentation.

1.       Create a diagram that shows the states and the signals of transitions between those states (event groups) from your Template - Instrumentation Inventory and Health State Workbook.

2.       Look for locations where you clearly have transition or recovery paths that no instrumentation will detect.

3.       Add the proper instrumentation to your code to be able to detect these conditions, and update the spreadsheet and diagram accordingly.

4.       Add events or other instrumentation to signal transitions back from bad states to good.

A sample health state diagram, developed for Terminal Services Licensing, can be found in Appendix A – Health Model Documents.
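The review in Step 6 can also be done mechanically once the diagram is written down as a transition table. The sketch below is an illustration only: the state names, event groups, and the function states_without_recovery are hypothetical and are not taken from the Terminal Services Licensing sample. It reports unhealthy states that have no instrumented path back to a healthy state, which is where anti-alert instrumentation is missing.

```python
# Hypothetical states and event groups, for illustration only.
HEALTHY_STATES = {"Running"}
TRANSITIONS = [
    # (from_state, event_group, to_state)
    ("Running", "DatabaseUnreachable", "Degraded"),
    ("Degraded", "DatabaseRestored", "Running"),
    ("Running", "LicenseStoreCorrupt", "Failed"),
]

def states_without_recovery(transitions, healthy_states):
    """Return the states with no instrumented path back to a healthy state."""
    all_states = {s for src, _, dst in transitions for s in (src, dst)}
    # Work backwards from the healthy states along the known transitions.
    can_recover = set(healthy_states)
    changed = True
    while changed:
        changed = False
        for src, _, dst in transitions:
            if dst in can_recover and src not in can_recover:
                can_recover.add(src)
                changed = True
    return sorted(all_states - can_recover)

print(states_without_recovery(TRANSITIONS, HEALTHY_STATES))  # ['Failed']
```

In this example the Failed state has an alert leading into it but nothing that signals recovery, so it would be a candidate for new instrumentation.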

Step 7: Incorporate code changes.

Your code base is always evolving. New code is introduced, and old code is re-factored. As your code evolves, you need to keep your model up to date with your code. These modeling documents need to be treated as living specifications that must be kept in synchronization with your current architecture at all times.

Step 8: Incorporate customer feedback.

Customers, community, product support, and test resources will report problems and solutions over the lifecycle of your application.

New health states will be identified, alternate verification and diagnostic steps will be found, and quicker recovery paths will be discovered as your services are deployed and used. The Health Model is a living set of documents. It must be improved over time as your customers teach you how they manage your services in their environments and identify where you need to add management instrumentation to future releases.

Artifacts

The following artifacts will be produced by the health modeling exercise.

·         Customer issues document, which lists highest cost issues and suggested mitigation solutions for your application or service.

·         Instrumentation Inventory and Health State Workbook which enumerates all management instrumentation, health states, health-state transitions, conditions of the transitions, verification steps, diagnostic steps, and recovery actions.

·         Health State diagram, which shows all health states and transitions between them, as well as conditions of these transitions.

Implementation

Numerous services support health monitoring, diagnosis, and troubleshooting. The tools are summarized in the table below:

 

| Infrastructure | When to Use |
|---|---|
| Eventlog | Use to notify when a failure or health state transition has been detected that can be determined as internal to the application or service. |
| Trace events and ETW | Use when debug-style tracing is required in your code to understand the flow of execution for diagnostic and troubleshooting purposes. |
| WMI and WMI probes | Use to expose live state control. If you have an existing WMI provider that provides all or most of the required instrumentation, extend it. If not, it is much simpler to use the WMI V2 Managed Code support, System.Management.Instrumentation. |
| Performance counters | Use to expose metrics that measure and analyze system performance. For native code on Longhorn client, use the existing Perflib V1. For managed code, use the System.Diagnostics performance counter classes. |
| Microsoft Operations Manager | Used for problem detection and alerting, roll up health of application or service to administrator, and automated problem diagnosis and corrective actions. Requires agent installation on Microsoft Windows 2000 and Windows Server™ 2003 operating systems. |
| Windows Monitoring Service | Used for problem detection and alerting, roll up health of application or service to administrator, and automated problem diagnosis and corrective actions. New infrastructure for Longhorn. |

 

 

These infrastructure tools are available as follows:

| Infrastructure | Windows 2000 | Windows XP | Windows Server 2003 | Longhorn client | Longhorn server |
|---|---|---|---|---|---|
| Eventlog | X (ReportEvent) | X (ReportEvent) | X (ReportEvent) | X (Crimson) | X (Crimson) |
| Trace Events / ETW | X | X | X | X | X |
| WMI (CIMOM) | X | X | X | X | X |
| WMI probes | X | X | X | X | X |
| Performance counters | X | X | X | X | X |
| Windows Monitoring System |  |  |  |  | X |
| MOM | X |  | X |  | X |

Task Model

Introduction

A variety of tasks are required to operate business systems. Some are periodic, and some are event driven. Some are performed locally, and some are not. Applications and services support these tasks through the graphical user interface (GUI) tools, such as the Microsoft Management Console; the scripting interfaces; and the command-line tools they provide. However, these tools are not always well correlated with the administrative tasks. Common tasks are sometimes difficult to perform or to automate. Remote management might not be completely supported, preventing “lights out” operation or requiring help desk to visit customers. Some tasks have no tools support, only APIs, so special tools must be written or purchased to provide complete management.

 

Task Modeling provides a process to ensure that an application or service can be completely administered. It describes administration of the application or feature in terms of tasks – complete meaningful operations that have direct value to the administrator or user.

 

Tasks apply broadly to all aspects of using and managing services and applications. Although the focus is on administrative tasks, creating a complete task model for your application is recommended. This will ensure that all tasks are properly exposed, assisting you in facilitating task delegation at the right level.

 

The goals of the Task Model are:

·         Identify the complete set of tasks associated with operating an application or service.

·         Define the scope for tasks: who will perform them, when, and how.

·         Ensure that all tasks can be automated.

·         Grow the task model through customer and community input.

 

A complete task model becomes the basis for developing administration tools and interfaces. It illustrates where GUI tools, command-line tools, and scripting interfaces are required. To ensure it is complete, the model should be developed in conjunction with users and administrators who can provide outside perspective on how an application or service is used.

 

Basic concepts of task-based administration include:

·         Task Model. A complete list of tasks that users or administrators will want to perform on the application or service. It also identifies the command-line tools, and then the GUI tools, required to implement these tasks.

·         Task. An action or procedure that a system administrator will want to perform on the application or service. A task can be thought of grammatically as:

<verb> <noun> <context>

Where:
      <verb>              specifies the action that one wants to perform.
      <noun>             specifies the object of the action.
      <context>         provides the parameters of the action.

Examples:

-          <Create> <Child Domain> in <Domain>

-          <Add> <Class Definition> in <Schema>

-          <Delete> <User Account> from <Group>

-          <Create> <Web Site> on <Web Server>

Usually a task breaks down into a series of Operations. Because administrators and users consider tasks as single events, each task should operate transactionally. It should either succeed or fail as a whole. Make sure that, when one of the operations fails in the middle of executing the task, the tool that implements a task can reverse appropriate operations. For example, ‘create user account’ could result in security and support problems if it terminated with the user account created in parts of the system but not others.

 

·         Operation. An atomic action performed on a resource managed by the application or service. Examples include setting and getting individual configuration parameters and reading specific system metrics. An operation is usually, though not necessarily, a simple task on its own. Exposing operations enables tools or scripts to aggregate them into complex tasks efficiently, just as writing subroutines supports developing a major application. All operations should be exposed as tasks to support administrators in developing custom tasks and in ad-hoc resource and setting manipulation.

·         Permission. A set of access rights that a security principal has with regard to the object. Usually, a permission is one or several access control entries (ACEs) relevant to the security principal in the object’s access control list (ACL).
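The requirement that a task succeed or fail as a whole can be illustrated with a small sketch. It assumes that each operation is supplied together with an action that reverses it. The Operation class, the run_task function, and the in-memory "directory" and "mailboxes" stores are hypothetical names invented for this example, not part of any Microsoft tool.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Operation:
    name: str
    do: Callable[[], None]    # performs the atomic action
    undo: Callable[[], None]  # reverses it if a later operation fails

def run_task(operations: List[Operation]) -> bool:
    """Run operations in order; on failure, undo the completed ones in reverse order."""
    completed: List[Operation] = []
    for op in operations:
        try:
            op.do()
        except Exception:
            for done in reversed(completed):
                done.undo()
            return False
        completed.append(op)
    return True

# Hypothetical "create user account" task whose second operation fails.
directory = set()
mailboxes = set()

def fail_mailbox():
    raise RuntimeError("mail system unavailable")

create_user = [
    Operation("AddDirectoryEntry",
              lambda: directory.add("alice"),
              lambda: directory.discard("alice")),
    Operation("CreateMailbox", fail_mailbox,
              lambda: mailboxes.discard("alice")),
]

ok = run_task(create_user)
print(ok, directory)  # False set()
```

Because the directory entry is removed when mailbox creation fails, the account does not end up created in one part of the system but not the other.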

Customer Impact

By developing a deep understanding of the tasks associated with an application or service and how those tasks are generally incorporated into user or administrator work flow, you can ensure that complete task support is provided and the operation of the application is as painless as possible. The most important benefits of creating a task model and building tools and interfaces from this include:

·         Right level of abstraction. Administration is described in terms of tasks applicable on the service level and understandable by administrators and users. It avoids defining actions only at the level of resources and operations, which might be more meaningful to developers.

·         Consistency between GUI and command-line tools. Implementing all your tools on top of the same model (whether MMC, a WMI provider, or a script) drives a consistent customer experience.

·         Leverage of the knowledge. Consistency between GUI and command-line tools allows an administrator or user to start working with the system using easy-to-understand GUI tools and then directly leverage this knowledge to build automated management tools.

Understanding the Task Model

The Task Model catalogs the tasks and operations necessary to operate the system. An example is shown in Appendix C – Task Model Example. The task model has three sections: Tasks, Operations, and Verbs. Typically, you will only use the first two. The verb section is provided to allow you to expand the list of allowed verbs. Keeping consistent verb usage makes the system easier to understand and use for the administrator; the verb list should not be expanded unless absolutely necessary to add a completely new function.

The Task section enumerates the actual tasks that a user or administrator needs to perform. It consists of the following fields:

| Field | Values | Description |
|---|---|---|
| Category | Deployment, State/Data Mgmt, Config Mgmt, Health/Diagnostics, Administrative |  |
| Task | Any | The name of the task. |
| Description | Any | A description of the task as the user would understand it. |
| Operation | List of operations | The list of operations performed as steps to accomplish this task. See steps 2 and 3 below for how this evolves. |
| Permissions | Any | The permissions required to perform this task (for example, read, write, create, and so forth). Including this data helps ensure that tasks are designed to provide the appropriate security. |
| User Role(s) | Any | The roles or group memberships required to perform this task (for example, User, Domain Administrator, Backup User, and so forth). This information helps ensure that the task can be run by the applicable role with the least privilege. |
| Server Role? | Yes or No | Does this task typically apply to all servers of a particular role, such as Web servers, or only to a specific server? |
| Distributed Service? | Yes or No | Does this task apply to only one type of server or to multiple types that support a system? |
| Task Frequency | Any | How often is this task performed (for example, daily or quarterly)? If the task is performed based on a specific event (for example, Create User is performed when a person joins the company), consider how frequently that event typically occurs. |
| Skill Level | Novice, Layman, Expert | What administration skill is expected of the person executing this task? Is this person a novice (that is, a user), a layman (inexperienced administrator), or an expert administrator? Tasks that people at the novice and layman levels perform should probably have simple GUI tools, especially if the tasks should be performed frequently. Expert users will look for scripting interfaces to automate most tasks, especially ones that are frequently performed. However, even expert users may seek GUI tools for tasks rarely performed so that they do not have to learn all the nuances of such commands. |
| Comments | Any | Any additional issues, questions, or notes that you want to document for this task. |

 

The Operation section enumerates the granular operations that are required to perform the tasks. As discussed in Step 4, each operation will probably also be listed as a task, but other tasks will aggregate operations. The Operation section consists of the following fields:

| Field | Values | Description |
|---|---|---|
| Name | Any | A name for the operation. This should be expressed as you would in your code (without spaces). |
| Description | Any | A description of the operation as a user would understand it. Typically this is the same description used initially in the operation list for the tasks. See steps 2 and 3 below. |
| Verb | Defined verb list | The action for this operation. See the discussion of <verb> <noun> <context> above. |
| Noun | Any | The noun for this operation. See the discussion of <verb> <noun> <context> above. |
| Context | Any | The context for this operation. See the discussion of <verb> <noun> <context> above. |
| Comments | Any | Any additional issues, questions, or notes that you want to document for this operation. |

 

Building the Task Model

The excerpt from the sample spreadsheet in Appendix C – Task Model Example illustrates the results of this process. A spreadsheet template, Template-TaskModelWorksheet, that you can use to develop your specific model can be found at:
www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx

Step 1 – Identify tasks.

Create a complete list of management tasks for your application or service.

Sources of information include:

·         The Common Tasks list. Many activities are shared by applications. Select the activities that apply to your area. See Appendix D – Common Tasks.

·         Existing management tools and scripts for your application. You can extract the tasks that they provide.

·         Deployment and operation guides for your application. You can extract task descriptions from the processes that you recommend to your customers.

·         Customers who can help you determine tasks that you might have missed. You can ask them about special tools or scripts they have written or obtained, process documents they have compiled, or any problem areas they have uncovered.

Collect results into the “Task” tab of the template spreadsheet. Fill out the Category, Task, and Description columns for each task. This gives you the basic catalog of tasks associated with your application.

Step 2 – Develop tasks.

Each of these tasks needs to be expanded to provide the information you need to build tools and scripting interfaces for them. Fill out the rest of the columns according to the definitions above.


Fill in the Operations column with a description of each step required to perform the task. See Appendix C – Task Model Example for a sample of this. When you find multiple tasks that have common steps, try to express those using the same phrase.

Step 3 – Identify operations.

Now define the Operations required to support the tasks. Start by pasting each step from the Operations column of the Task tab into the Description field of a row on the Operations tab of the template. As you do this, watch for operations that may be the same. If you find any, combine them into a single operation row.

 

Name each operation you identified. It will be easiest to manage the cross reference between this model and your code if that name carries across, so use a name that is legal in the syntax of your language. For example, Stop Web Operations is not legal in many programming languages because of the spaces; StopWeb might be a better choice.

 

For each operation, determine what resources are accessed during the execution of the task and what action is performed on each resource. Use this to populate the Verb, Noun, and Context columns. Use a common set of verbs wherever possible. Using multiple verbs for the same action can confuse the administrator and prevent tools from leveraging operations.
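The bookkeeping in Step 3 can be sketched in a few lines. The example below assumes that the step descriptions from Step 2 are available as plain strings per task; the function names operation_name and collect_operations, and the sample tasks, are hypothetical. It derives an identifier-style name from each step and merges steps that share a name into one operation row.

```python
import re

def operation_name(description: str) -> str:
    """Turn a step description into an identifier-style name, e.g. 'stop web' -> 'StopWeb'."""
    words = re.findall(r"[A-Za-z0-9]+", description)
    return "".join(w[:1].upper() + w[1:] for w in words)

def collect_operations(task_steps):
    """Map each distinct operation name to the list of tasks that use it."""
    operations = {}
    for task, steps in task_steps.items():
        for step in steps:
            operations.setdefault(operation_name(step), []).append(task)
    return operations

# Hypothetical tasks and their steps, as they might appear in the Operations column.
task_steps = {
    "Restart Web Site": ["stop web", "start web"],
    "Update Web Content": ["stop web", "copy content", "start web"],
}

ops = collect_operations(task_steps)
for name in sorted(ops):
    print(name, "used by", ops[name])
```

Steps are merged only when their wording matches after normalization, which is one reason to express common steps with the same phrase in Step 2. A name derived this way is not guaranteed to be legal in every language (for example, if the description starts with a digit), so the result still needs review.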

Step 4 – Identify operations and tasks.

Review the operations to determine any that should be identified as tasks. Tasks must represent a meaningful and complete action that some user or administrator will need to accomplish. For example:

·         Reset Password for User is a task, because it is a meaningful and complete action.

·         Set Property of User Object is an operation but not a task, because it does not represent any specific valuable action. It is merely a step in a process.

Step 5 – Develop tools and automation.

The information in this model provides guidance on how to build the tools and scripts to help manage your application. Use it to prioritize tools, to define script interfaces, and to determine where GUI or command-line tools should be provided.

 

Step 6 – Refine task model.

As your application evolves and is used more, tasks will be discovered that were not originally considered. Problems with the scope or flexibility of defined tasks will also be found. Solicit this information from the same sources cited in Step 1. Use this information to enhance the task model and to update your tools and scripting capabilities.

Implementation

| Infrastructure | Availability | When to Use |
|---|---|---|
| WMI | Windows 2000 and beyond | In Windows 2000 and Windows Server 2003, WMI provides scripting and programmatic access to data and tasks that are specific to an application. In Longhorn, Monad provides the primary task interface. Use WMI in Longhorn when you need it to support monitoring, when you are exposing something that is already being exposed through Longhorn managed code APIs, or when simple get/set is simpler to code under the WMI managed code interface. |
| MMC | Windows 2000 and beyond | MMC provides a framework for GUI tools. |
| Monad | Longhorn | Monad is the key automation and scripting interface in Longhorn, and it will automatically expose WMI and Configuration entries. Write specific Monad Cmdlets to create complex tasks, when protocols are explicitly involved, or to orchestrate tasks over multiple applications or services. |

 

State Model

Introduction

Typically, applications have concerned themselves with settings, which are any typed data structure with persisted values that control the configuration or behavior of services or applications. These might be the registry settings that affect the operation of the application through GUI settings, Group Policy, or installation parameters. It is important to differentiate a setting from input data for an application. For example, the location of a dictionary is a setting, but the dictionary itself is input data. As an illustration, all of the registry content would be a setting except registry content with volatile keys or any COM binding data. Applications generally rely the most on configuration settings. A configuration setting is a setting that an administrator or user can manipulate. This section discusses new initiatives to expose and categorize configuration settings (which will allow them to be visible, accessible, and portable for administrators) and to create management tools that support roaming, migration, and backup.

Managing settings is not enough. A goal in Longhorn is to allow services to be highly portable, allowing them to be moved from computer to computer to achieve optimal workloads. Such portability also supports distributed storage and ‘Run from Network’ applications. To achieve this, not just settings but all state must be segregated. State includes any persisted information, including settings, files on disk, entries in WinFS, and so forth.

The State Model described below provides common guidelines for categorizing and separating state. Following this process helps you understand the State taxonomy and guides you through creating a viable state description for your application or service. The settings infrastructure supports the state description, but it is focused on the application settings. It facilitates visibility and migration of settings, allows developers to describe them, and provides a common API to management tools.

Some of the goals of this next-generation system for managing settings include:

Customer Impact

Providing visible and portable state is a huge asset to customers. In today’s environment, an administrator or user has limited information on the system state. They set configurations through GUI tools or Group Policy, but they cannot see how the system changes underneath. They have no comprehensive mechanism to compare the state of one system with another (for example, a working system with a broken one), to determine how this system has been changed from the defaults or from yesterday’s settings, or to track who changed what when. State is also hard to move. Administrators cannot easily fix a broken system by copying state from a working system, and they cannot effectively create ‘Run from Network’ implementations that would allow them to centralize management and control.

The State and Settings Model is designed to improve the situation. When state is fully and accessibly documented, administrators and users can access and manipulate the configuration of their systems in a meaningful way, and they can achieve key business scenarios such as ‘Run from Network,’ server consolidation, and service mobility.

Users and administrators both benefit from the more predictable application interaction that comes from well documented settings. This reduces application fragility, identifies and manages application compatibility problems, and generally reduces uncertainty and problems associated with rolling out software or updates. The system can identify areas of change, alert administrators to conflicts between applications, and prevent installations that would break existing applications. Users will also realize a more robust roaming experience because the system has much better knowledge of what settings need to roam for what purposes and how to abstract them.

Understanding the State Model

Components and applications are described in a manifest. The manifest is an XML and XSD-based description of all persisted settings in the component. Each setting has an associated type. The type may be a standard XSD type, a well-known type defined by Microsoft (through Configuration) or a user-defined restricted or complex type. Each setting definition also includes:

The basis for state management is understanding and expressing the persisted state of your component or application. This separation allows state to be associated, migrated, and accessed properly. State must be categorized along three primary dimensions:

State categories are mutually exclusive, and they are typically determined when the state is first created on a system. This might include when the application is installed, when the application is started for the first time, or when a user creates a file with the application for the first time.

The following table gives a more complete definition of the state categories, with some examples for each. Only Per User examples are given for User Explicit Data, because by definition this only exists in the Per User scope.

 

State Category

Description

Scope

Per User

Per-machine

Static / Program Files

Application’s static files that are delivered during installation and updated through servicing. Examples include Dynamic Link Libraries, executable files, and Help files.

·         ActiveX® controls

·         Applets or games downloaded from the Internet

·         Applications installed per-user

·         Operating-system components

·         Applications installed per-computer

·         Drivers

Registration

Data that components publish to extend the functionality of the system. This data is typically delivered during installation, much like Static state.

·         Internet Explorer protocol handlers (mailto, streaming media, and so forth)

·         File type handlers (.gif, .jpg, .txt)

·         Shell extensions

·         COM registration

·         Network stack registration

 

Configuration

Settings that control the functioning of the application and that are changed through explicit administrator or user action.

·         Internet Explorer home page settings

·         Shell start menu settings

·         Microsoft Word settings

·         Display settings

·         Internet Information Services (IIS) configuration

·         Microsoft Exchange Server configuration

·         Network configuration

·         Computer name

Operational Information

State which is used by administrators for things like troubleshooting, debugging, or auditing. It can be discarded with no loss of functionality.

·         Mail send and receive logs

·         Application crash dump files

·         Exchange Server logs

·         IIS logs

·         Installer logs

·         Windows Update logs

Temp

State that can be discarded with no loss of functionality or data or that can be recreated from a master source.

·         Temp files

·         Internet Explorer cache file

·         System Temp files

·         CSC cache

·         Content indexing files

Data

User Explicit:
Data files that the user creates and explicitly manages.

·         Contents of My Documents (including the Shared folder in the Home / FUS scenario)

·         *.doc, *.jpg, *.mp3

·         Files created through the Save dialog box (for example, MyPrez.ppt)

Application Managed:
Data that is integral to the application’s functioning and whose location the application manages. (The user may be able to configure this.)

·         Microsoft Outlook® messaging and collaboration client .pst and .ost files

·         Microsoft Money data file

·         Application Most Recently Used lists

·         Outlook nickname file

·         Office PIP files

·         Microsoft SQL Server database file

·         Exchange mailbox store

·         SQL transaction history

·         System Restore restore point file

Hardware-dependent

State that is tied to the hardware configuration of a specific computer

·         Digital Rights Management playlist licenses

·         Microsoft Office product activation information

·         Windows product activation information

 

Building the State Model

Building your state model requires the following steps:

  1. Understand what the buckets mean and their implications for application or component behavior.
  2. Analyze and bucket all of your state according to this taxonomy.

 

Step 1: Understand the buckets.

Review the table above and consider how this taxonomy applies to your settings. When thinking about this, consider the ‘Run From Network’ scenario, where the settings may be physically separate from your application and centrally managed. Do not assume that applications can write to Static/Program Files or Configuration. Assume that Temp state can be deleted at any time, and ensure that your application can still run.

All settings must conform to this taxonomy. Do not create custom data stores that contain state spanning multiple buckets, for example, a single file that contains temporary, operational, and application data. Likewise, avoid “mixing” state buckets in system data stores, such as overloading WinFS items or registry keys with multiple settings that might cross categories. Either of these approaches undermines the configuration system and prevents Microsoft from delivering a transparent, accessible system.
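A simple inventory check can help find stores that mix buckets. The sketch below is an illustration under stated assumptions: the StateCategory enumeration mirrors the categories in the table above, while the function mixed_stores, the inventory format, and the file names are hypothetical and are not part of the Longhorn settings infrastructure.

```python
from enum import Enum

class StateCategory(Enum):
    STATIC = "Static / Program Files"
    REGISTRATION = "Registration"
    CONFIGURATION = "Configuration"
    OPERATIONAL = "Operational Information"
    TEMP = "Temp"
    DATA = "Data"
    HARDWARE_DEPENDENT = "Hardware-dependent"

def mixed_stores(inventory):
    """Return the data stores whose items fall into more than one state category."""
    categories_by_store = {}
    for store, category in inventory:
        categories_by_store.setdefault(store, set()).add(category)
    return sorted(s for s, cats in categories_by_store.items() if len(cats) > 1)

# Hypothetical inventory of (data store, category of one item kept in it).
inventory = [
    ("app.log", StateCategory.OPERATIONAL),
    ("settings.xml", StateCategory.CONFIGURATION),
    ("cache.dat", StateCategory.TEMP),
    ("cache.dat", StateCategory.DATA),
]

print(mixed_stores(inventory))  # ['cache.dat']
```

Here cache.dat holds both Temp state and Data, so deleting it as temporary state would also lose application data. Splitting it into two stores would bring it in line with the taxonomy.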

Step 2: Analyze and bucket all of your state.

A spreadsheet titled Template - Configuration Modeling Workbook is available as the basis for this step and may be found at:
www.microsoft.com/windowsserver2003/technologies/management/dsi/designops.mspx
Currently this tool supports only Settings; a future release will also support State. The settings model is a complete list of all settings in the application with specific information about each one, such as type, description, default value, security, configuration setting designation, and applicable manageability scenarios.

By constructing the spreadsheet and looking at the represented settings model, you can garner key facts about the application settings. For example, what grouping of settings might be created to make the settings more understandable? Are several settings related? These are candidates to be re-factored to simplify the user experience of the application. Re-factoring a setting creates a new setting that drives the values of the other settings. The user manipulates the new setting rather than all the subordinate settings, thus creating a significantly simpler user experience.
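As a rough sketch of re-factoring, the example below introduces one new setting that drives several related settings. All names and values here (SecurityLevel, ScriptingEnabled, DownloadPrompt, CookiePolicy, and the two profiles) are hypothetical and chosen only for illustration.

```python
# Hypothetical subordinate settings driven by a single new setting, SecurityLevel.
PROFILES = {
    "low": {"ScriptingEnabled": True, "DownloadPrompt": False, "CookiePolicy": "accept"},
    "high": {"ScriptingEnabled": False, "DownloadPrompt": True, "CookiePolicy": "block"},
}

def apply_security_level(settings: dict, level: str) -> dict:
    """Return a copy of `settings` with the subordinate settings driven by `level`."""
    if level not in PROFILES:
        raise ValueError(f"unknown security level: {level}")
    updated = dict(settings)          # leave the caller's settings untouched
    updated["SecurityLevel"] = level  # the one setting the user manipulates
    updated.update(PROFILES[level])   # values derived from it
    return updated

current = {"HomePage": "about:blank", "ScriptingEnabled": True}
print(apply_security_level(current, "high"))
```

The user changes one value instead of three, and unrelated settings such as HomePage are carried over unchanged.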

Implementation

Settings Management in Longhorn provides the infrastructure to support settings. This allows state information to be included in the configuration portion of the component or application manifest. Settings Management manages the local cache and communicates with the Configuration service to load or save settings and deliver notifications. The service provides store access and assertion enforcement. The application can also continue direct access to an external store.

 

Appendix A – Health Model Documents

The following documents are the results of the health modeling for Terminal Server Licensing. They are meant as an example for the health modeling process, and they should not be presumed to be technically accurate for Terminal Server diagnosis.

 

Terminal Server System Diagram

This block diagram gives the all-up view of a typical Terminal Server deployment.

Terminal Server Licensing Spreadsheet excerpt

This shows a portion of the spreadsheet generated:

 

Terminal Server Licensing Health State Diagram

The transition labels correspond to the event groupings in the spreadsheet above.

Appendix B – Customer Issues Template

The first line provides an example. Fill this out with key customer problem scenarios.

 

| Problem/Issue | # of Calls | Avg Resp Time [2] | Pain Score [3] | Scenario | Solution |
|---|---|---|---|---|---|
| Remote Assistance does not work across some firewalls. | 7 | 10 | 70 | If user is behind a firewall that does not support Universal Plug and Play or if personal firewall is enabled, Remote Assistance does not work. | Changes are planned for Remote Assistance in a future release to address this. No ETA. |

Appendix C – Task Model Example

Sample Task Breakout

Sample Operation Breakout

Appendix D – Common Tasks

Configuration management

Description

Configuration management is responsible for identifying, recording, tracking, and reporting changes to the configuration of system and application software. The information captured and tracked often includes description, version, relationships to other configurations, location/assignment, and current status.

Categories

The basic task categories for configuration management are:

·         Discovering configuration. Finding configurations of a software application or service.

·         Controlling access to configuration. Ensuring that all additions, modifications, or removals of configurations are authorized.

·         Updating configuration. Updating configuration of software locally or in a distributed environment.

·         Auditing configuration. Recording all the changes made to configuration and ensuring that those changes can be traced.

·         Verifying configuration. Verifying consistency and data integrity of the configuration.

The scope of configuration tasks can be just one user, a single computer, or a cluster of servers in a distributed environment or data center.

Tasks/Operations

Get/Set/Add/Remove. Retrieve, update, create, or remove configuration item(s).

Test. Verify configuration consistency.

Lock. Prevent configuration from being updated.

Unlock. Allow configuration to be updated.

Compare. Compare two versions of a configuration.

Merge. Merge two versions of a configuration.

Export. Export configuration as a single entity, perhaps in a different format than the native one.

Import. Import configuration, perhaps from a different format.
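
The operations above can be sketched as a small in-memory store. This is only an illustration, not an interface defined by this paper: the class name ConfigStore, the ConfigLockedError exception, the JSON export format, and the merge policy (the other version wins on conflict) are all assumptions made for the example.

```python
import json


class ConfigLockedError(Exception):
    """Raised when an update is attempted while the configuration is locked."""


class ConfigStore:
    """Hypothetical in-memory configuration store; all names are illustrative."""

    def __init__(self, items=None):
        self._items = dict(items or {})
        self._locked = False

    def _require_unlocked(self):
        if self._locked:
            raise ConfigLockedError("configuration is locked")

    # Get/Set/Add/Remove
    def get(self, key):
        return self._items[key]

    def set(self, key, value):
        self._require_unlocked()
        if key not in self._items:
            raise KeyError(key)
        self._items[key] = value

    def add(self, key, value):
        self._require_unlocked()
        if key in self._items:
            raise KeyError(f"{key!r} already exists")
        self._items[key] = value

    def remove(self, key):
        self._require_unlocked()
        del self._items[key]

    # Test: here "consistent" simply means no required item is missing.
    def test(self, required_keys):
        return [k for k in required_keys if k not in self._items]

    # Lock/Unlock
    def lock(self):
        self._locked = True

    def unlock(self):
        self._locked = False

    # Compare: map each differing key to (our value, their value).
    # A missing item shows up as None.
    def compare(self, other):
        keys = self._items.keys() | other._items.keys()
        return {
            k: (self._items.get(k), other._items.get(k))
            for k in keys
            if self._items.get(k) != other._items.get(k)
        }

    # Merge: return a new store; the other version wins on conflict (assumed policy).
    def merge(self, other):
        return ConfigStore({**self._items, **other._items})

    # Export/Import: JSON stands in for "a different format than the native one".
    def export_json(self):
        return json.dumps(self._items, sort_keys=True)

    @classmethod
    def import_json(cls, text):
        return cls(json.loads(text))


base = ConfigStore({"port": 80, "tls": False})
edited = ConfigStore({"port": 8080, "tls": False})
print(base.compare(edited))  # {'port': (80, 8080)}
```

A real implementation would also need the access control and auditing described under Categories; the sketch leaves both out to keep the operations visible.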

Security management

Description

Security management is responsible for maintaining a safe computing environment. Security is an important part of the infrastructure of the enterprise. An information system with a weak security foundation will eventually experience a security breach.

Security can be divided into six basic requirements, or tenets, that help ensure data confidentiality, integrity, and availability. The six security tenets are:

·         Identification. This deals with user names and how users identify themselves to the system.

·         Authentication. This deals with passwords, smart cards, biometrics, and so on. Authentication is how users prove to the system that they are who they claim to be.

·         Access control (also called authorization). This deals with access and the permissions granted to users so that they can perform certain functions on the system.

·         Confidentiality. This deals with encryption. Confidentiality mechanisms ensure that only authorized people can see data stored on or traveling across the network.

·         Integrity. This deals with checksums and digital signatures. Integrity mechanisms ensure that data is not garbled, lost, or changed when traveling across the network.

·         Non-repudiation. This also deals with digital signatures. Non-repudiation provides proof of data transmission or receipt, such that an occurrence of a transaction cannot later be denied.

The primary security management goals are to ensure:

·         Data confidentiality. No one should be able to view an organization’s data without authorization. This goal includes granting permissions to and revoking them from users, as well as delegating administrative rights to other system administrators.

·         Data integrity. All authorized users should feel confident that the data presented to them is accurate and not improperly modified.

·         Security auditing. Audit logs may give the only indication that a security breach has occurred, and they may pinpoint the location and the perpetrator of the breach.

·         System safety. Functions of other roles must not compromise the overall security of the enterprise. Security of the system must be maintained despite the system administration model chosen.

Categories

·         Managing identities. Adding identities to the system, deleting them from it, and modifying the properties of security identities.

·         Managing security policies. Managing security policies for the application or service. Examples include password expiration policy, encryption on the network, authentication methods, and so forth. This category includes managing everything that affects the security of the entire application, not individual resources managed by the application.

·         Managing roles. Managing security roles for the application or service. Roles are a higher level, simplified approach to access control management. This category also includes temporarily delegating responsibilities (roles) to users.

·         Managing access control. Managing low-level access masks on resources (in other words, ACL management).

·         Monitoring security. Verifying consistency of the security policies and settings, including roles and access masks; verifying data integrity and authenticity; and detecting intrusions and attacks in real time.

·         Auditing security. Analyzing security and audit logs of the application or service.

Tasks/Operations

Identities

Create/Delete. Create or delete an identity (user, group, and so forth).

Add/Remove. Add an identity to or remove it from a container (for example, remove a user from a group).

Get/Set. Get or set an identity’s properties.

Policies

Add/Remove. Add or remove a policy.

Get/Set. Get or set a policy’s properties.

Roles

Add/Remove. Add or remove an application-specific administration role.

Get/Set. Get or set a role’s properties.

Include/Exclude/Delegate. Add or remove user(s) to or from the role, as well as temporarily delegate a role to a user.
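
The Include, Exclude, and Delegate operations can be sketched as follows. The Role class, its method names, and the choice to model temporary delegation as an expiry time are assumptions made for this example; the paper does not prescribe how delegation is stored.

```python
from datetime import datetime, timedelta, timezone


class Role:
    """Hypothetical administration role with permanent and delegated members."""

    def __init__(self, name):
        self.name = name
        self._members = set()
        self._delegations = {}  # user -> time at which the delegation expires

    def include(self, user):
        self._members.add(user)

    def exclude(self, user):
        # Excluding a user removes both membership and any delegation.
        self._members.discard(user)
        self._delegations.pop(user, None)

    def delegate(self, user, duration, now=None):
        # The delegation lasts for `duration`, counted from `now`.
        now = now or datetime.now(timezone.utc)
        self._delegations[user] = now + duration

    def has_member(self, user, now=None):
        now = now or datetime.now(timezone.utc)
        if user in self._members:
            return True
        expiry = self._delegations.get(user)
        return expiry is not None and now < expiry


operators = Role("backup-operators")
operators.include("alice")
operators.delegate("bob", timedelta(hours=8))
```

Passing `now` explicitly is a convenience for testing; in normal use the current time is taken from the system clock.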

Access Control

Allow/Deny. Allow or deny access to resources managed by the application to the specified users.

Get/Set. Get or set an entire ACL on the resource.
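
A minimal sketch of the access control operations follows. The Acl class and the evaluation rule (an explicit deny overrides an allow, and anything not allowed is refused) are assumptions chosen for the example, not a description of any particular product’s ACL semantics.

```python
class Acl:
    """Hypothetical access control list for one resource (illustration only)."""

    def __init__(self):
        self._allow = {}  # user -> set of permitted operations
        self._deny = {}   # user -> set of explicitly denied operations

    def allow(self, user, operation):
        self._allow.setdefault(user, set()).add(operation)

    def deny(self, user, operation):
        self._deny.setdefault(user, set()).add(operation)

    def get(self):
        """Return the entire ACL as plain data."""
        return {
            "allow": {u: sorted(ops) for u, ops in self._allow.items()},
            "deny": {u: sorted(ops) for u, ops in self._deny.items()},
        }

    def set(self, entries):
        """Replace the entire ACL with data in the shape returned by get()."""
        self._allow = {u: set(ops) for u, ops in entries.get("allow", {}).items()}
        self._deny = {u: set(ops) for u, ops in entries.get("deny", {}).items()}

    def is_allowed(self, user, operation):
        # Assumed rule: explicit deny wins; no entry means no access.
        if operation in self._deny.get(user, set()):
            return False
        return operation in self._allow.get(user, set())


acl = Acl()
acl.allow("alice", "read")
acl.deny("alice", "write")
print(acl.is_allowed("alice", "read"))   # True
print(acl.is_allowed("alice", "write"))  # False
```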

Monitoring

Measure. Obtain security-related state values and verify them against thresholds (for example, logons per second, failed authentications per second, and so forth).

Test. Test security aspects of the application or service, diagnose the problem, propose a recovery action, and, optionally, implement it. Recovery action usually involves updating application state and configuration and sometimes restarting the application or service.

Auditing

Trace. Turn auditing on (if it was off) and view the audit log.

Service Monitoring and Control

Description

Service monitoring allows the operations staff to observe the health of an IT service in real time. Accurate monitoring is difficult in a distributed process environment, and becomes harder still when systems are integrated with those of partners and suppliers to automate a company’s value chain. With this in mind, the following are examples of items that are typically monitored to ensure that the IT service remains available:

·         Process heartbeat

·         Job status

·         Queue status

·         Server resource loads

·         Response times

·         Transaction status and availability

However, knowing the current health of a service or determining that a service outage might occur is of little benefit unless the operations staff has the ability to do something about it (at the very least, notifying the appropriate group that a specific type of reactive or proactive action should occur). This is what the term “control” means. When monitoring and control are combined and implemented properly, this service management function provides the critical capability to ensure that service levels remain in compliance.

Categories

The basic service monitoring and control tasks are:

·         Managing lifecycle. This category involves starting and stopping instances of the service, as well as pausing and resuming them.

·         Monitoring state. This category involves observing health indicators such as process heartbeat, job status, queue status, server resource loads, response times, and transaction status and availability to ensure that the IT service remains available.

·         Controlling state. This category involves acting on what monitoring reveals: at the very least, notifying the appropriate group that a specific type of reactive or proactive action should occur, so that service levels remain in compliance.

·         Managing resources. This category involves manipulating resources that are managed by the application. For example, the Active Directory® directory service manages organizational units, users, groups, printers and so on. IIS manages sites, applications, filters, and extensions, as well as individual files and folders that compose those resources.

·         Maintaining service. This category involves updating services and moving instances across computers, as well as splitting and joining instances to and from multiple computers.

Tasks/Operations

Instances

Enable. Allow the application or service to start and run.

Disable. Do not allow the application or service to start again.

Start. Start the instance of the application or service.

Stop. Stop the instance of the application or service.

Restart. Terminate an existing instance and start over.

Pause. Stop accepting requests.

Resume. Resume accepting requests.
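
The instance operations above imply a small state machine, sketched below. The three states, the ServiceInstance class, and the decision to model Enable and Disable as a flag that only gates Start are assumptions made for the example; a real service may define its lifecycle differently.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


class InvalidTransition(Exception):
    """Raised when an operation is not valid in the current state."""


class ServiceInstance:
    """Hypothetical service instance that tracks lifecycle state only."""

    def __init__(self):
        self.enabled = True
        self.state = State.STOPPED

    def enable(self):
        self.enabled = True

    def disable(self):
        # Assumption: disabling does not stop a running instance;
        # it only prevents the instance from being started again.
        self.enabled = False

    def start(self):
        if not self.enabled:
            raise InvalidTransition("service is disabled")
        self._move(State.STOPPED, State.RUNNING)

    def stop(self):
        if self.state is State.STOPPED:
            raise InvalidTransition("service is already stopped")
        self.state = State.STOPPED

    def restart(self):
        # Check first so that a disabled service is not left stopped.
        if not self.enabled:
            raise InvalidTransition("service is disabled")
        self.stop()
        self.start()

    def pause(self):
        self._move(State.RUNNING, State.PAUSED)

    def resume(self):
        self._move(State.PAUSED, State.RUNNING)

    def _move(self, expected, target):
        if self.state is not expected:
            raise InvalidTransition(
                f"cannot go from {self.state.value} to {target.value}"
            )
        self.state = target


svc = ServiceInstance()
svc.start()
svc.pause()
svc.resume()
print(svc.state.value)  # running
```

Rejecting invalid transitions explicitly, rather than ignoring them, gives management tools a clear failure to report to the operator.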

Monitoring

Ping. Quickly test whether the application or service is running.

Measure. Obtain health-related and performance-related state values and verify them against thresholds.

Test. Test application or service health aspects, diagnose the problem, propose a recovery action, and, optionally, implement it. Recovery action usually involves updating application state and configuration and sometimes restarting the application or service.

Trace. Trace application or service activity.
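
The Measure operation can be sketched as a comparison of readings against thresholds. The function name, the metric names, the two-level (warning, critical) thresholds, and the assumption that higher values are worse are all illustrative choices, not part of the model itself.

```python
def measure(readings, thresholds):
    """Classify each reading as "ok", "warning", or "critical".

    `thresholds` maps a metric name to a (warning, critical) pair.
    Assumption: for every metric, a higher value is worse.
    """
    result = {}
    for name, (warning, critical) in thresholds.items():
        value = readings[name]
        if value >= critical:
            result[name] = "critical"
        elif value >= warning:
            result[name] = "warning"
        else:
            result[name] = "ok"
    return result


# Hypothetical thresholds for two of the indicators named in this appendix.
thresholds = {"queue_length": (100, 500), "response_ms": (200, 1000)}
readings = {"queue_length": 120, "response_ms": 150}
print(measure(readings, thresholds))
# {'queue_length': 'warning', 'response_ms': 'ok'}
```

A Test operation would go further than this: it would diagnose the cause of a "warning" or "critical" result and propose a recovery action.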

Controlling

Get. Retrieve the current state of the application or service, or a subset of that state.

Set. Update the current state of the application or service, or a subset of that state.

Resources

Get. View resource or content of the collection.

Set. Update resource or content of the collection.

Add. Add resource/sub-collection to the collection.

Clear. Clear content of the collection by removing all resources.

Remove. Remove resource or entire collection.

Rename. Rename resource or collection.

Copy. Copy resource/collection into another collection.

Move. Move resource/collection into another collection.

Lock. Lock the resource(s) and make them read-only.

Unlock. Unlock resource(s) and allow updates.

Verify. Verify consistency and integrity of the resource(s).

Export. Export resource(s), perhaps in a different format than the native one.

Import. Import resource(s), perhaps from a format other than the native one.

Backup. Archive resource(s) to another location.

Restore. Restore resource(s) from another location.

Maintenance

Move. Move existing instance from one computer to another, preserving all configuration settings and, optionally, data. For highly available applications, moving should not interrupt the service.

Update. Update application or service in place, replacing old version with the new one. For highly available applications, updating should not interrupt the service.

Split. Split the application or service into two or more instances (for example, move 200 out of 500 Microsoft SharePoint™ Team Services sites to another computer). For highly available applications, splitting should not interrupt the service.

Join. Consolidate multiple instances of the application or service (for example, move all SharePoint Team Services sites from three outdated servers onto one new server). For highly available applications, joining should not interrupt the service.

Storage Management

Description

Storage management deals with on-site and off-site data storage for the purposes of data restoration and historical archiving. The storage management team must ensure the physical security of backups and archives. The goal of storage management is to define, track, and maintain data and data resources in the production IT environment.

The storage management operational process consists of two major focus areas, each of which comprises various activities and associated tasks: (1) data backup, restore, and recovery operations, and (2) storage resource management.

Categories

These are storage management tasks:

·         Planning capacity. Calculating storage capacity required to deploy a service or application and to keep it running for a long time. This category includes planning for adding storage capacity to the existing services before they exceed current capacity.

·         Backing up data. Copying the current state of the storage to archive.

·         Restoring data. Restoring a storage state snapshot from archive.

·         Maintaining storage. Adding storage, moving to a different location, and so forth. Also includes verifying and fixing integrity of the data and storage subsystem.
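
As a rough illustration of capacity planning, the sketch below estimates how long existing storage will last and whether more should be ordered now. The function names, the linear-growth assumption, and the sample figures are all hypothetical; real planning would use measured growth data.

```python
import math


def days_until_full(capacity_gb, used_gb, growth_gb_per_day):
    """Estimate whole days until storage is exhausted, assuming linear growth.

    Returns None when usage is not growing.
    """
    if growth_gb_per_day <= 0:
        return None
    free_gb = capacity_gb - used_gb
    if free_gb <= 0:
        return 0
    return math.floor(free_gb / growth_gb_per_day)


def needs_expansion(capacity_gb, used_gb, growth_gb_per_day, lead_time_days):
    """True if storage would fill before new capacity could be added."""
    remaining = days_until_full(capacity_gb, used_gb, growth_gb_per_day)
    return remaining is not None and remaining <= lead_time_days


# Hypothetical figures: 500 GB volume, 320 GB used, growing 1.5 GB per day.
print(days_until_full(500, 320, 1.5))      # 120
print(needs_expansion(500, 320, 1.5, 30))  # False
```

The lead time parameter reflects the point made above: capacity must be added before services exceed it, so the check has to run ahead of the projected date.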

Tasks/Operations

Backup. Back up data to archive.

Restore. Restore data from archive.

Measure. Obtain storage-related state values and verify them against thresholds (for example, the amount of free storage available, fragmentation, current performance of the storage, and so forth).

Test. Verify data and storage subsystem integrity and consistency, and repair if needed. Recovery action usually involves updating application state and configuration and sometimes restarting the application or service.


 

[1] David M. Anderson, Design for Manufacturability: Optimizing Cost, Quality, and Time-to-Market, Second Edition (CIM Press, 2001), 314 pages.

[2] Average Response Time is the amount of time generally taken to troubleshoot a problem like this and get the customer back online.

[3] Pain Score is the Average Response Time multiplied by the number of calls. This helps prioritize key scenarios.