Problem Management Best Practices

Problem Management is one of the more overlooked ITIL processes. Many companies implementing ITIL focus on Incident and Change Management first. What they realize afterward is that they need a Problem Management program to improve the overall availability of IT services. A mature Problem Management program prevents recurring incidents or at least reduces their impact, which increases the uptime of your IT services. Problem Management is therefore a critical component of your overall IT Service Management program, and it cannot succeed unless the Help Desk is an active participant.

The Help Desk plays a major role in managing incidents and problems. Accurate and thorough incident ticket documentation by the Help Desk significantly helps the root cause analysis of incident-causing problems. Assigning the correct category to each incident ticket improves incident matching, ticket type trending, and the identification of problem candidates. Visit how to create Help Desk ticket categories for more ticket category information.

What is a Problem?

A problem is the underlying source of a fault in the IT infrastructure. A problem can cause one or more incidents that impact IT services, and end users experience the result as unstable, degraded, or unavailable IT services. You will hear the term root cause used to describe the underlying cause of an incident. The problem behind a set of incidents is identified through root cause analysis. A problem ticket is raised based on the incident or incidents that the fault produced, and possibly on an operational outage. Once the root cause of a problem has been identified and a workaround is in place, the problem is called a known error. Once a solution is found to permanently fix the problem, a change request is created and implemented to resolve it.

What is Problem Management?

Problem Management is the life cycle process of identifying, investigating, documenting, and permanently resolving incident-causing problems in the production environment. Incident Management is focused on restoring an IT service quickly through any means. Problem Management is focused on identifying the root cause of incidents and preventing additional service-impacting incidents from recurring. Problems are resolved by defining a solution and implementing it through a change request. In a ticketing application, a problem investigation ticket is created from an incident ticket or operational outage, which creates an association between the incident and problem tickets. In most instances, more than one incident is related to a problem. In those cases, all of the incidents should be linked to the problem ticket.
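
The association between incidents and a problem ticket can be pictured with a small data model. This is only a sketch: the ProblemTicket class, its field names, and the ticket IDs are hypothetical and do not come from any particular ticketing application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemTicket:
    """A problem investigation ticket (hypothetical model, not a vendor schema)."""
    problem_id: str
    summary: str
    incident_ids: List[str] = field(default_factory=list)
    # Filled in once a permanent fix is proposed through a change request.
    change_request_id: Optional[str] = None

    def link_incident(self, incident_id: str) -> None:
        """Associate an incident with this problem, ignoring duplicates."""
        if incident_id not in self.incident_ids:
            self.incident_ids.append(incident_id)


problem = ProblemTicket("PRB-001", "Intermittent login failures")
for incident_id in ["INC-101", "INC-102", "INC-101"]:
    problem.link_incident(incident_id)
```

Keeping every related incident on the problem ticket makes it easy to see how widespread the problem is when it is prioritized.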

Problem Management: Reactive and Proactive

Problem Management focuses on problems that have caused incidents or may cause incidents in the future. A problem investigation can therefore be initiated either reactively or proactively. Reactive Problem Management is initiated after one or more incidents have occurred. The reactive problem investigation focuses on finding the root cause of the incidents, defining a workaround, and ultimately implementing a solution through the change management process. Proactive Problem Management focuses on preventing future incidents before they occur. This is accomplished by initiating preventive problem investigations, which analyze operational data, configurations, and system performance data to look for potential problems. For example, if an application running on a server has above-average CPU utilization for an extended period of time, there may be a potential problem. Investigating the reason before a customer-impacting incident occurs is being proactive.
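
The CPU example can be sketched as a simple check over monitoring samples. The function name, the six-sample window, and the 25 percent margin over the baseline average are all assumptions made for illustration; a real monitoring tool would supply its own thresholds.

```python
from statistics import mean


def sustained_high_cpu(samples, baseline, window=6, margin=1.25):
    """Return True if the last `window` samples all exceed the baseline mean times `margin`."""
    if len(samples) < window or not baseline:
        return False
    threshold = mean(baseline) * margin
    return all(value > threshold for value in samples[-window:])


# CPU utilization percentages (made-up values for illustration).
baseline = [30, 35, 32, 28, 33, 31]
recent = [34, 52, 55, 58, 61, 57, 60]

if sustained_high_cpu(recent, baseline):
    print("Consider opening a proactive problem investigation.")
```

Requiring every sample in the window to exceed the threshold avoids flagging a single short spike as a potential problem.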

Problem Management Workaround

When a problem has been investigated and the root cause found but not yet fixed, a workaround may be developed until a permanent fix can be applied. A workaround is used when a full resolution is not yet available for an incident or problem, but something can be done to allow users to complete their tasks. At times, a permanent solution to an incident-causing problem cannot be defined. In those cases, Problem Management attempts to minimize the impact of the problem with a workaround, and the problem is identified as a known error. Known errors are published by Problem Management in a known error database until a permanent solution becomes available.

Known Error Database

Companies implementing Problem Management may realize a significant reduction in call handle time and an improvement in first contact resolution. This is achieved by implementing a known error database, which is a key component of Problem Management. A known error database is an invaluable tool for helping the Help Desk keep end users productive with IT services. When the Help Desk receives a contact about something broken, one place an agent checks is the known error database. If the reported issue has been identified as a known error, the Help Desk agent can implement the published workaround. Implementing a workaround allows users to continue working at some level while a permanent solution to the problem is being developed.
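
A known error lookup might look something like the sketch below. The KnownError structure, the sample entries, and the keyword-matching rule are assumptions for illustration only; real known error databases typically offer richer search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    """One published known error (hypothetical structure)."""
    error_id: str
    keywords: frozenset
    workaround: str


# Sample entries invented for this example.
KNOWN_ERRORS = [
    KnownError("KE-001", frozenset({"vpn", "timeout"}),
               "Reconnect using the backup VPN gateway."),
    KnownError("KE-002", frozenset({"printer", "offline"}),
               "Restart the print spooler service."),
]


def find_known_error(description, known_errors=KNOWN_ERRORS):
    """Return the first known error whose keywords all appear in the description."""
    words = set(description.lower().split())
    for entry in known_errors:
        if entry.keywords <= words:  # every keyword is present
            return entry
    return None


match = find_known_error("VPN connection timeout after login")
if match is not None:
    print(match.workaround)
```

This simple word match ignores punctuation and synonyms, so it only illustrates the idea of checking a reported issue against published workarounds.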

Help Desk

There are many benefits to the Help Desk and the customer from implementing an effective Problem Management program. The quality of IT services improves, and repeat incidents should be eliminated. The Help Desk staff play a significant role in Problem Management activities. Day in and day out, the Help Desk deals with hundreds or thousands of incident tickets. Usually, Incident Management is engaged for higher-impact incidents and for large spikes of similar issues. In those cases, the incident manager creates a problem ticket once the incident is resolved. If the incident results in a workaround, the known error database is updated with the details of the workaround. The Help Desk staff also become attuned to trends in customer break-fix issues. A Help Desk agent may be able to identify when customers are reporting similar issues that could be related to an underlying problem that needs to be addressed. When the Help Desk sees similar incidents reported over a longer period, they may create a proactive problem ticket for review. Ultimately, your organization becomes proactive in identifying and eliminating infrastructure problems.

What are the KPIs of Problem Management?

Being able to report on key performance indicators for Problem Management is important. As with other processes, there are many metrics to choose from. The most important thing to remember is that your Problem Management program should use a robust ticketing application to manage problems. The problem template should have enough fields to capture detailed data. Capturing detailed information about each problem as it moves through the problem lifecycle makes current and future reporting more robust. Since there are so many ways to pull data and report on problems, we have grouped some of the most common reports below.

The number of problems reported

These reports focus on the number of problems managed by the Problem Management program and are based on totals of problems in different states. Their benefit is insight into the trend of problem occurrence. Knowing the number of problems at the various states of the problem lifecycle allows management to manage the backlog of problems, become more efficient, and ultimately reduce the overall occurrence of problems. Each of the reports below can be further filtered by date, severity, and other attributes.

  • Number of all problems created
  • Current number of active problems
  • Resolved problems total
  • Total number of closed problems
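
As a rough sketch, the counts above could be produced from exported problem records like this. The record layout and the state names "active", "resolved", and "closed" are assumptions; your ticketing application will have its own fields and states.

```python
from collections import Counter

# Hypothetical export of problem records.
problems = [
    {"id": "PRB-001", "state": "active"},
    {"id": "PRB-002", "state": "resolved"},
    {"id": "PRB-003", "state": "closed"},
    {"id": "PRB-004", "state": "active"},
]


def problem_counts(records):
    """Count all problems created and how many are in each reported state."""
    by_state = Counter(record["state"] for record in records)
    return {
        "created": len(records),
        "active": by_state["active"],
        "resolved": by_state["resolved"],
        "closed": by_state["closed"],
    }


counts = problem_counts(problems)
print(counts)
```

Filtering the records by date or severity before calling the function gives the narrower versions of each report.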

The duration of problems

These reports focus on the duration of problems managed by the Problem Management program. They do not just measure the time from open to close but break each problem down into lifecycle segments. Their benefit is insight into how effectively the problem management team processes problems to closure. Below are the most common problem time segments in order of typical occurrence.

  • Problem detection
  • Categorization and prioritization
  • Workaround
  • Known error
  • Resolution
  • Closed
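
One way to measure these segments, assuming the ticketing application records a timestamp when a problem reaches each milestone, is sketched below. The milestone keys and the sample dates are invented for illustration.

```python
from datetime import datetime

# Milestone names in typical order (hypothetical keys for this example).
MILESTONES = ["detected", "categorized", "workaround",
              "known_error", "resolved", "closed"]


def segment_durations(timestamps):
    """Return hours elapsed between consecutive recorded milestones."""
    recorded = [(name, timestamps[name]) for name in MILESTONES if name in timestamps]
    durations = {}
    for (start_name, start), (end_name, end) in zip(recorded, recorded[1:]):
        hours = (end - start).total_seconds() / 3600
        durations[f"{start_name}->{end_name}"] = hours
    return durations


example = {
    "detected": datetime(2024, 1, 1, 9, 0),
    "categorized": datetime(2024, 1, 1, 11, 0),
    "workaround": datetime(2024, 1, 2, 11, 0),
    "resolved": datetime(2024, 1, 5, 11, 0),
}
durations = segment_durations(example)
print(durations)
```

Milestones that were never recorded are skipped, so a problem resolved without a known error entry still produces usable segment times.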

Attributes of problems

These reports focus on specific attributes or characteristics of problems managed by the Problem Management program. Their benefit is insight into specific and related areas of your Problem Management program. Below are the most common problem attributes tracked and reported on.

  • Percentage of problems where a root cause analysis has been completed
  • Problems that reoccurred
  • Problem SLA compliance
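
These attribute metrics reduce to simple ratios over the problem records. The sketch below assumes each record carries three boolean fields (root_cause_found, recurred, met_sla); those field names are hypothetical.

```python
def attribute_kpis(records):
    """Compute root cause completion, recurrence count, and SLA compliance."""
    total = len(records)
    if total == 0:
        return {"root_cause_pct": 0.0, "recurred": 0, "sla_compliance_pct": 0.0}
    with_root_cause = sum(1 for r in records if r["root_cause_found"])
    recurred = sum(1 for r in records if r["recurred"])
    within_sla = sum(1 for r in records if r["met_sla"])
    return {
        "root_cause_pct": 100.0 * with_root_cause / total,
        "recurred": recurred,
        "sla_compliance_pct": 100.0 * within_sla / total,
    }


# Made-up sample records.
records = [
    {"root_cause_found": True, "recurred": False, "met_sla": True},
    {"root_cause_found": True, "recurred": True, "met_sla": False},
    {"root_cause_found": True, "recurred": False, "met_sla": True},
    {"root_cause_found": False, "recurred": False, "met_sla": False},
]
kpis = attribute_kpis(records)
print(kpis)
```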

Problem Management Steps

Step 1 – Prioritizing problem management investigation candidates:

A problem investigation is initiated by identifying IT service issue candidates. Problem management candidates can be identified by any or all of the following methods:

  • Priority 1 incidents (mandatory) and Priority 2 incidents (highly recommended) that caused a recent service degradation or outage.
  • Technical staff, including Help Desk Agents, level 2 and level 3 resolver teams, developers, application owners, and management, nominating problem candidates.
  • Customers and business partners reporting critical service-impacting issues.
  • Proactive incident trend analysis.
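
The priority rule and the trend analysis above can be sketched as a filter over incident records. The field names, the numeric priority values, and the threshold of three incidents per category are assumptions chosen for the example.

```python
from collections import Counter


def problem_candidates(incidents, trend_threshold=3):
    """Group incidents into candidate lists (field names are illustrative)."""
    outages = [i for i in incidents if i["caused_outage"]]
    mandatory = [i["id"] for i in outages if i["priority"] == 1]
    recommended = [i["id"] for i in outages if i["priority"] == 2]
    # Trend analysis: categories with many incidents, outage or not.
    category_counts = Counter(i["category"] for i in incidents)
    trending = sorted(c for c, n in category_counts.items() if n >= trend_threshold)
    return {"mandatory": mandatory, "recommended": recommended, "trending": trending}


incidents = [
    {"id": "INC-1", "priority": 1, "caused_outage": True, "category": "network"},
    {"id": "INC-2", "priority": 2, "caused_outage": True, "category": "email"},
    {"id": "INC-3", "priority": 3, "caused_outage": False, "category": "email"},
    {"id": "INC-4", "priority": 4, "caused_outage": False, "category": "email"},
    {"id": "INC-5", "priority": 2, "caused_outage": False, "category": "printing"},
]
candidates = problem_candidates(incidents)
print(candidates)
```

Nominations from staff and reports from customers still need a manual intake path; this sketch only covers what can be derived from ticket data.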

Step 2 – Gather the data:

Once the problem management investigation candidate has been identified, it is important to gather the foundational data about the problem.

  • If a service outage occurred, develop a timeline of events before, during, and after the outage. The Help Desk can assist with a large amount of this data.
  • Gather and assess error, diagnostic, and monitoring information.
  • Gather the number of incidents related to the problem.
  • Review the frequency of this problem by searching the Help Desk ticket data.
  • Determine whether a recently implemented change could have caused this problem.
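
Checking for recently implemented changes can be as simple as filtering the change log by date. The sketch below assumes a seven-day lookback window and a change record with an implemented_at timestamp; both are illustrative choices, not a standard.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, lookback=timedelta(days=7)):
    """Return IDs of changes implemented within `lookback` before the incident."""
    window_start = incident_start - lookback
    return [
        change["id"]
        for change in changes
        if window_start <= change["implemented_at"] <= incident_start
    ]


# Made-up change log entries.
changes = [
    {"id": "CHG-201", "implemented_at": datetime(2024, 3, 1, 22, 0)},
    {"id": "CHG-202", "implemented_at": datetime(2024, 3, 9, 23, 30)},
    {"id": "CHG-203", "implemented_at": datetime(2024, 3, 11, 8, 0)},
]
suspects = recent_changes(changes, datetime(2024, 3, 10, 6, 0))
print(suspects)
```

A change that falls inside the window is only a lead for the review team to examine, not proof that it caused the problem.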

Step 3 – Form a problem review team

Identify the needed technical staff, Help Desk Agents, and customer representatives to quickly meet and review the following:

  • Review all available foundational problem data.
  • Identify likely causes.
  • Discuss any contributing factors.
  • Eliminate candidate causes until only the most probable cause remains.
  • Identify the root cause.

Step 4 – Propose and implement a solution

  • Document a Request for Change for any action you intend to take to resolve the issue.
  • Submit the Request for Change, or implement solutions that do not require Change Management.

Step 5 – Validate the fix

  • Confirm the success or failure of the approved change.
  • Have the customer validate that the problem no longer occurs.
  • Monitor Help Desk tickets for future recurrences.
