The Role of 'Root Cause Analysis' in Problem Solving

Metal Finishing's quality control guru, Leslie Flott, discusses how root cause analysis can be utilized as an effective procedural tool.

If a person keeps doing the same thing and expects different results, we call this “insanity.” It seems quite reasonable that if some undesired result occurs repeatedly, then it is incumbent upon management to determine what is really causing it and to correct that cause so as to prevent a recurrence. This procedure, “root cause analysis,” seeks the true cause of the problem and corrects it rather than continuing to deal with the symptoms.

The root cause analysis procedure investigates the failure using facts left behind from the initial event. By evaluating the evidence that remains after the fault has occurred, and by obtaining information from people associated with the incident, the analyst can distinguish the factors that contributed to the undesirable situation from those that did not.

The process begins by collecting the data, analyzing it, then developing appropriate corrective action or generating practical recommendations. Root cause analysis is a tool to explain what happened, how it happened, and why it happened.

By comprehending the facts of the failure, the process gives safety, quality, risk, and reliability people an opportunity to employ more dependable and cost-effective strategies in order to effect significant, long-term improvement. The result is an increased capability to recover from, and prevent, disasters that have both financial and safety consequences.

In addition to verifying the failure, it is important to determine the factors that explain how and why the failure occurred. Identifying the root cause of the failure event is what makes that explanation possible.

Finding the Real Problem. Most problems have multiple approaches to solution. Generally, these various approaches will have different price tags. Because of a sense of urgency in breakdown situations, the tendency is to choose the most expedient way of dealing with the situation. In general, this results in treating the symptoms rather than the root cause. Yet, in taking the most convenient approach and dealing with the symptoms rather than the cause, what usually ensues is that the situation will, in time, return and will need to be dealt with again.

Consider the specific example of expediting customer orders in an order-filling process. The organization has a well-defined process for accepting, processing, and shipping customer orders. When a customer calls and complains about not getting their order, the usual response is to expedite. This means that someone personally tracks down this customer’s order, assigns it a #1 priority, and ensures it gets shipped ahead of everything else. But what isn’t realized until sometime later, if at all, is that in expediting this order one or more other orders were delayed because the process was disrupted to get this customer’s order out the door. What it all comes down to is that expediting orders simply ensures that more orders will have to be expedited later. The appropriate response to this situation is to figure out why the order needed expediting in the first place. Yet this is seldom done because the task assigned to the expediter was, “Get the order shipped!” and that is as far as the thought process and investigation are apt to go.
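The knock-on effect of expediting can be shown with a small sketch. It is an illustration only: the order names, due days, and the one-order-per-day shipping rate are hypothetical and are not taken from any real order-filling process.

```python
def ship_days(queue):
    """Map each order to the day it ships, assuming one order ships per day."""
    return {order: day for day, order in enumerate(queue, start=1)}


def late_orders(queue, due):
    """Return the orders that ship after their due day, in alphabetical order."""
    shipped = ship_days(queue)
    return sorted(order for order in queue if shipped[order] > due[order])


def expedite(queue, order):
    """Move one order to the front; every order that was ahead of it slips a day."""
    return [order] + [o for o in queue if o != order]


# Hypothetical queue and due days (assumed for illustration).
queue = ["A", "B", "C", "D"]
due = {"A": 1, "B": 2, "C": 3, "D": 2}

late_before = late_orders(queue, due)                 # only D is late
late_after = late_orders(expedite(queue, "D"), due)   # D is on time; A, B, C are late

print("Late before expediting:", late_before)
print("Late after expediting: ", late_after)
```

In this made-up queue, rescuing one late order makes three others late, which is the pattern the expediter never sees if the assignment ends at "Get the order shipped!"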

To find root causes there is one question that’s relevant: “What can we learn from this situation?” Experts such as W. Edwards Deming suggest that unwanted situations are about 95% related to process problems and only 5% related to personnel problems. Yet most organizations spend far more time looking for “someone” to blame rather than “what” is to blame. This practice results in misdirected effort that seldom finds the basis for the unwanted situation.

Consider the following: The sales department of a company reports that certain customers are repeatedly complaining about late shipments. Moreover, these customers state that they employ just-in-time (JIT) scheduling in order to minimize their cost of inventory. They cannot tolerate late shipments.

The shipping department reports that they have been ordered by the chief financial officer (CFO) to avoid “less-than-truckload” shipments, as this practice drives up the cost of doing business. The CFO explains that corporate headquarters had issued a memo to reduce expenses at all costs. The plant manager had also indicated that the plant had to be as cost conscious as possible, or heads would roll. The outcome was that customers were unhappy when deliveries were late, and the late deliveries resulted from an attempt to save money.

To Resolve or Not To Resolve. Once the root cause is identified, then it has to be determined whether it costs more to remove the root cause or simply continue to treat the symptoms. This is often not an easy determination. Even though it may be relatively easy to estimate the cost to remove the root cause, it is often difficult to assess the cost of treating the symptom. This is because the cost of the symptom is generally involved in some number of customer and employee satisfaction factors as well as the resource costs associated with just treating the symptom.

Consider a situation in which it is determined that it could cost the company an extra $30,000 a year to ship every order on time but only 5 minutes for someone to resolve the situation when the customer calls with the problem. Initially, one might perceive that the cost of removing the root cause is far larger than the cost of treating the symptom. Yet, suppose that this symptom is such that when it arises it so infuriates the customer that they swear they will never buy another product, or service, from you, and will go out of their way for the next year to tell everyone they meet what a terrible company you are to do business with. How do you estimate the cost of the lost revenue associated with this situation?
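One rough way to frame that estimate is a break-even calculation. The sketch below uses the $30,000 annual figure and the 5-minute handling time from the example. Every other number (calls per year, hourly labor rate, annual revenue per customer) is a hypothetical assumption, and the sketch leaves out the harder-to-measure cost of bad word of mouth.

```python
ROOT_CAUSE_FIX_PER_YEAR = 30_000  # from the example: extra annual cost to ship on time
MINUTES_PER_CALL = 5              # from the example: time to resolve one complaint


def handling_cost(calls_per_year, hourly_rate):
    """Annual labor cost of answering complaint calls."""
    return calls_per_year * MINUTES_PER_CALL / 60 * hourly_rate


def symptom_cost(calls_per_year, hourly_rate, customers_lost, revenue_per_customer):
    """Annual cost of treating the symptom: labor plus revenue from lost customers."""
    lost_revenue = customers_lost * revenue_per_customer
    return handling_cost(calls_per_year, hourly_rate) + lost_revenue


def break_even_customers(calls_per_year, hourly_rate, revenue_per_customer):
    """Number of lost customers at which the symptom costs as much as the fix."""
    remaining = ROOT_CAUSE_FIX_PER_YEAR - handling_cost(calls_per_year, hourly_rate)
    return remaining / revenue_per_customer


# Hypothetical inputs (assumptions, not data from the article).
calls, rate, revenue = 240, 30.0, 10_000

labor = handling_cost(calls, rate)                    # 240 calls * 5 min at $30/h
threshold = break_even_customers(calls, rate, revenue)
with_three_lost = symptom_cost(calls, rate, 3, revenue)

print(f"Labor cost of handling calls: ${labor:,.0f}")
print(f"Break-even customers lost:    {threshold:.2f}")
print(f"Symptom cost if 3 are lost:   ${with_three_lost:,.0f}")
```

Under these assumed inputs the labor cost is trivial, but losing just three customers a year already makes treating the symptom more expensive than removing the root cause.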

Failure of a component, or a process, indicates it has become completely or partially unusable or the situation has deteriorated to the point that it is undependable or detrimental for normal sustained service.

Failure analysis is an engineering approach to determining how and why equipment, a process, or a component has failed. Some general causes for failure are structural loading, wear, corrosion, and latent defects. The goal of a failure analysis is to understand the root cause of the failure to prevent similar failures in the future. In addition to verifying the failure event, it is important to determine the factors that explain how and why the failure event occurred. Identifying the root cause of the failure event allows us to explain the how and why of failure.

The process of investigating industrial product or procedural failure may include expert witness testimony, industrial accident investigation, materials and metallurgical failure analysis, welding, manufacturing, forensic engineering, product liability, and explosion investigation.

Preventing Reoccurrence of the Failure. It is not always necessary to prevent the primary, or root, cause from happening. It is merely necessary to break the chain of events at any point and the final failure will not occur. If the investigation team identifies the root cause as an initial design problem, they may recommend a redesign. Where the root cause analysis leads back to a failure of procedures, it is necessary either to address the procedural weakness or to develop an approach that prevents the damage caused by the procedural failure.
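The idea of breaking the chain can be sketched as a simple model in which the final failure occurs only if every event in the chain occurs. The events named here are hypothetical, and real failures rarely follow a single linear chain, so this is a simplification for illustration.

```python
def final_failure_occurs(chain, blocked=frozenset()):
    """Return True only if no event in the chain has been blocked."""
    return all(event not in blocked for event in chain)


# Hypothetical chain of events leading to a failure (assumed for illustration).
chain = [
    "design weakness",      # the root cause
    "procedure skipped",
    "defect undetected",
    "defective part ships", # the final failure follows this event
]

# With nothing blocked, the failure occurs.
unbroken = final_failure_occurs(chain)

# Blocking any single link, not only the root cause, prevents the failure.
prevented_by = [e for e in chain if not final_failure_occurs(chain, {e})]

print("Failure with intact chain:", unbroken)
print("Links whose removal prevents it:", prevented_by)
```

Every link appears in the second list, which is why a team may choose the cheapest or most reliable point of intervention rather than always attacking the root cause itself.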

Corrective Action Reports (CARs). Too often a company will fix a problem in their business process, say a customer complaint or product return, after the problem has occurred. This is typically product or event focused. Then the company will look at what they have done and say, “Well if we revise our shop router or procedure, this will not happen again.” They will label this second phase as Preventive Action.

While it is a forward-thinking activity, it is still corrective action because it focuses on solving a problem that has already happened. A corrective action needs to focus on the quality system, so in this example the systemic action taken is the revision of the shop router or procedure. This is corrective action.

Preventive Action Reports (PARs). Preventive Action activities should stand alone, and not focus on past events. For example, the “company” has a preventive maintenance management program that requires manual data entry. It is working fine and there have not been any problems. However, it takes a lot of time to manage and does have a high potential for error or lost records. Management decides to purchase preventive maintenance software to manage this activity. Since the purchase is not based on problems that have happened, and is focused on process control or making an improvement to company quality, it is a preventive action.

When first establishing a quality system, it is normal to record and complete many more CARs than PARs. On an annual basis, a company may record only a few preventive action activities, while many dozens of events are recorded in the corrective action log. Later, as the quality system matures, there should be many more preventive actions than corrective actions.

Another important issue with regard to problem solutions and preventions is that there are no universal, or generic, solutions to problems, even those with the same root cause. What works well in inner city Chicago might not work at all in rural Nebraska. Each individual problem-solving team must take into account not just the basic cause but items such as the location, the workforce and available resources.

BIO Leslie W. Flott, Ph.B., CQE, ASQ Fellow, is certified as an IDEM Wastewater Treatment Operator and Indiana Wastewater Treatment Operator. He received his BS in chemistry from Northwestern University and his master’s degree in materials engineering from Notre Dame University. Most recently, Flott served as the environmental program director and instructor at Ivy Tech Community College. Prior to that, he was the health, environment, and safety manager at Wayne Metal Protection Company.