Failure Analysis 101



It seems intuitive to most people with years of troubleshooting or failure analysis experience, but it is forgotten or ignored so often that it bears repeating: a methodical process is the best way to find and solve problems. Methodical in this case does not mean slow; it means ordered and logical, which in turn leads to improved efficiency. The following four-step process is effective regardless of the scope or field of technology.
- The process starts with an initial problem description and a gathering of all available facts. The problem description or definition needs to be thoroughly reviewed to make sure you're trying to solve the real problem and not just a symptom. It's important not to skip this step, as attention to detail at this point can save many hours later. Facts and data need to be reviewed to determine how they relate to the problem at hand and to make sure they are consistent with each other and with the problem as defined. This comparison of all available information forms the basis for the next logical step. Before proceeding, it's a good idea to give the problem definition one more quick review in light of all the data to confirm you're trying to solve the right problem.
- The next step is to resolve any discrepancies in the data and facts and to fill in any missing information in order to get a complete picture. This is the heart of troubleshooting: gaining enough knowledge about the mechanisms, components, and factors involved in a situation to fully understand how the combination can create the problem or failure. This may involve running additional tests, making measurements or observations, or using other means to provide a detailed picture of what may be happening to cause the problem. The gathering of information needs to be done in a logical and orderly manner rather than by running a shotgun spread of tests and hoping something useful turns up. The desired information should be identified (what is it you want to know?) and tests should be designed to provide that information. Using a Plan, Test, Review approach gives a much better chance of getting useful data. Desired information should be prioritized so the most important and useful tests are started first, even if they're not the fastest ones to conduct. This step is complete when you have enough information to know what is causing the problem, malfunction, or failure.
- Once the root cause is known, the next step is designing a solution to the problem. In some cases this is obvious, such as replacing a worn-out part to make a machine function properly again. Other times, however, the solution is much more complex, especially if the root cause is a combination of factors that are not easy to control. Experienced engineers dig beyond failed components or systems to search for an underlying root cause. For example, replacing a torn belt may immediately fix an operational issue, but it's important to know why the belt failed. If it failed due to normal wear and tear, then the failure is expected and could be prevented by routine maintenance. If the failure was premature, however, it points to a deeper-rooted problem that needs to be found: a poor-quality component (the belt), improper installation, a more serious wear issue in the machine (such as worn ball bearings), or partial failure of another component (such as a cracked or chipped pulley). Knowing the root cause enables the engineer to provide a lasting solution that prevents recurrence of the problem or failure and heads off other possible impending failures related to the same cause.
- The final step is validation of the solution. Again, the amount of work and time required depends on the complexity of the situation, the cost of the operation, the cost of failures (including downtime), any regulatory requirements, and the novelty of the solution. Validation testing goes beyond the functional testing that confirms the system works again. Proper testing must also determine that the solution has not adversely affected other parts of the system, confirm that the solution works in all possible operating conditions, and establish the operating limits of the solution.
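
The prioritization described in the second step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PlannedTest` record, the 1-to-5 importance scale, the hour estimates, and the sample tests (loosely based on the torn-belt example) are all hypothetical and not part of any established method. The sketch shows one idea only: importance decides the order, and test duration is used solely to break ties.

```python
from dataclasses import dataclass


@dataclass
class PlannedTest:
    question: str    # what is it you want to know?
    method: str      # how the test is expected to answer the question
    importance: int  # hypothetical scale: 1 (nice to have) to 5 (critical)
    hours: float     # rough estimate of how long the test takes


def prioritize(tests):
    """Order tests by importance first; duration only breaks ties."""
    return sorted(tests, key=lambda t: (-t.importance, t.hours))


# Hypothetical plan for a belt that failed prematurely.
plan = [
    PlannedTest("Is the pulley cracked or chipped?", "visual inspection", 3, 0.5),
    PlannedTest("Are the bearings worn?", "vibration measurement", 5, 4.0),
    PlannedTest("Was the belt installed correctly?", "tension and alignment check", 5, 1.0),
]

ordered = prioritize(plan)
for test in ordered:
    print(f"{test.importance}  {test.hours:>4.1f} h  {test.method}: {test.question}")
```

Note that the quick visual inspection ends up last despite being the fastest test, because it is rated the least important. In practice the ratings themselves would come out of the Plan stage and be revisited at each Review.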
The four-step process outlined above provides a framework for solving problems. The key to applying it effectively is eliminating assumptions in each step. Every assumption needs to be either confirmed or discarded through research or testing. This ensures decisions are driven by facts, not feelings. Intuition helps a good engineer know where to look, but it's not a substitute for the hard facts gathered by going through the steps above.