I once wrote a maintenance recommendation for a DOE project with short-term (less than 1 year) equipment operational requirements. The equipment types are a moot point for this article, but if you must know, there were pumps and fans and filters and air handlers and hydraulic lifts and lift tables and electrical power transformers and electrical distribution panels and compressors and vacuum pumps and switches and relays and chillers and VFDs and PLCs and pressure switches and diesels and pipes and valves and a partridge in a pear tree... Yep, everything you would expect to see in any typical production environment, temporary or not.
The scope of the project was to remove sludge from the bottom of expended nuclear fuel storage pools; it was a nasty job to say the least. Nuclear fuel had been in the pools for over 30 years. The material used to hold the fuel rods together had corroded and some of the rods had broken apart. The remaining sludge was a mixture of basalt dust, corrosion products and nuclear material from the busted fuel rods. It was a very toxic and erosive material.
You should know that the project was SAFELY completed. However, schedule slip was measured in years and cost overruns were in the millions. Now it wasn’t that the project was poorly planned or that the engineering concepts and controls used to remove the material safely were compromised. The biggest problem, the one that the taxpayers ended up paying for, was program management’s assumption that they could complete all operations with zero equipment redundancy and zero maintenance. It was a temporary installation after all, and all work could be completed before the first oil change was required. WOW!!! What an assumption. And speaking as a Monday morning quarterback, it turned out to be a really bad assumption... and we had told them so.
Back to that maintenance recommendation I had written so long ago. Prior to getting this project underway, the contractor that DOE hired to do the job hosted a small business workshop. They advertised that they were looking for new and innovative contributions from the small business community that could support the goals of the project: safe removal and interim storage of the sludge. I attended that workshop, and since I am a maintenance ‘guy’ who ran a maintenance and reliability management company, I wanted to contribute my recommendations. Of course I also wanted to include a plug for my company’s expertise in these matters; it was a great sales opportunity. Not only that, maintenance engineers, planners and procedure developers employed by my company had been instrumental in completing the nuclear fuel removal; my company had hands-on experience managing similar and, in some instances, the same equipment. We had a developed list of maintenance lessons learned that could be leveraged into a very comprehensive maintenance strategy at minimal cost with maximum gain. WOW!!! A perfect storm; maybe.
I wish I had a copy of that recommendation now; it was actually quite good. What I do remember introducing to that client was a highlight of Preventive Maintenance (PM) and Condition Based Maintenance (CBM, or PdM) using simplified Reliability Centered Maintenance (RCM) tools, a FMEA plus our lessons learned, to establish the most economical methods of managing the short-term operational schedule without equipment redundancy. Just a refresher here: a FMEA is a Failure Mode and Effects Analysis, and based on the basic equipment types they had, it would be pretty easy to complete, really just a cursory analysis, and it would pay for itself. I based my case on the Random Nature of equipment failures and the Predictable Nature of consumable ‘end of life’ determination that are generically portrayed in the various Weibull distributions. So before the ink dried on my super-duper award-winning exposé on the magic of Managing Reliability, I set up a meeting with the client and charged off to impress upon them the importance of some form of routine maintenance, even on a project with short-term operational goals.
The actual meeting went pretty smoothly; the presentation was met with the usual chin stroking and polite nods of understanding as I hit each of the presentation points. In fact there were several comments from the project engineer, his deputy and the procurement representative that sounded quite positive; they were in, they understood. Then just as I wrapped up and asked for any final questions, the project manager looked me square in the eye and said, “The defined operational period (project scheduled) of the equipment is less than one year. Based on our own internal risk assessment we have decided that no routine maintenance or inspection surveillances will be performed as a cost saving benefit to the project.” I was stunned. The project engineer buried his face on the table, and as I left the meeting room the procurement representative said, “Thank you, great presentation.” No Sale on the front end of this project. I was disappointed to say the least.
So let’s take a look at my recommendation(s) strictly from a probability of failure (Pf) point of view, and you can decide whether some form of routine maintenance (PM, CBM, PdM) was required.
As part of the presentation, I threw the 6 basic Weibull representations of equipment reliability up on a white board. These were originally presented by Nowlan and Heap in their 1978 study of commercial aircraft reliability.
These basic reliability profiles were again verified as accurate by the US Navy in 1984(?). The only difference between the Navy study and the commercial aircraft study was the percentage of occurrences for each pattern. So what are the 6 curves and what do they represent?

We are generally most familiar with the ‘Bathtub Curve’, Pattern 1. This is a generalized Weibull representation of a piece of equipment from commissioning through its operational phase to its end of life. The ‘Ski Slope’ on the front of the curve represents a higher Pf during some run-in period, which then flattens out to a random Pf over the majority of the equipment’s life. At or near the end of the equipment’s useful life the Pf creeps up again. What you need to understand is that this curve is only representative of approximately 3 to 4% of failure profiles.
Patterns 2, 4, 5 and 6 are similar variations of Pattern 1. What should be noted in these patterns, as in Pattern 1, is that the Pf ‘flat lines’ over a majority of equipment life, i.e. Random Failure. In fact, Random Failures account for 80 to 85% of all failures. WOW!!! One other thing should be noted about Pattern 6: this is the predominant failure pattern for equipment. It is representative of 29 to 60% of all failure profiles. Also note that this profile has a ‘Ski Slope’ on the front end just like Pattern 1; the Pf is higher early in equipment life, the ‘run in’ period. We will discuss this later because ‘run in’ isn’t the real issue here. It’s actually Introduction of Error. Say What????
Now Pattern 3 is a bit different; it is representative of consumable components. As time in service increases, Pf slopes up. Rolling element bearings, oil, drive belts, IGBTs (power transistors) in VFDs, SCRs, filters, tires, gasket material in certain environments, UV exposed paints and coatings, etc. fall under this Weibull profile. If we look at an actual Weibull plot for bearings (to the right), it has a very predictable linear profile. I want to repeat myself here: Pattern 3’s focus is at the consumable component level, and it relates to equipment only in the fact that if the component fails, it drags the equipment down with it.
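For anyone who wants to see where those slopes come from, here is a minimal sketch of the standard two-parameter Weibull failure rate. The shape values (beta) and the characteristic life (eta) below are made-up numbers chosen only to show the three behaviors discussed above; they are not taken from the Nowlan and Heap data or from the project.

```python
import math


def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must all be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


def weibull_pf(t, beta, eta):
    """Cumulative probability of failure by time t: 1 - exp(-(t/eta)**beta)."""
    if t < 0 or beta <= 0 or eta <= 0:
        raise ValueError("t must be non-negative; beta and eta must be positive")
    return 1.0 - math.exp(-((t / eta) ** beta))


# Hypothetical shape values, for illustration only.
profiles = {
    "ski slope (beta < 1, early failures)": 0.5,
    "flat line (beta = 1, random failures)": 1.0,
    "wear-out (beta > 1, consumables)": 3.0,
}

if __name__ == "__main__":
    eta = 10.0  # assumed characteristic life, arbitrary time units
    for name, beta in profiles.items():
        rates = [weibull_hazard(t, beta, eta) for t in (1.0, 5.0, 9.0)]
        print(name, ["%.3f" % r for r in rates])
```

A shape value below 1 gives a failure rate that falls with time, a value of exactly 1 gives a constant rate, and a value above 1 gives a rate that climbs with time in service, which is the Pattern 3 behavior.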
Side Bar Comment: Remember the term Reductionism from a previous article, Read the Fricking Instructions? If you go back and review the definition I presented, you will see I included the following statement: We Analyze for Failure and Evaluate Service Life at this level, the component level.
Since Pattern 6 is the predominant equipment failure pattern and Pattern 3 is representative of consumable components, it only makes sense that we manage our equipment based on these representations.
First and foremost, just based on the 6 variations of failure probability, failures are going to happen. If it rolls, slides, conducts electricity, is painted, uses oil, or goes bump in the night, and you are not thinking about failure prevention (mitigation) and failure detection, you will have setbacks. In this gamble, the odds are stacked in favor of the house. And if you have a process with no redundancy built into the design, it pretty much becomes mandatory that you develop a planned preventive (PM) and condition based (PdM) maintenance strategy. This is the only way you can ensure a reasonable level of reliability for your process.
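To put a rough number on "the odds are stacked in favor of the house", here is a back-of-the-envelope sketch. It assumes every item sits in series (no redundancy, so any single failure stops the process), that failures are independent, and that each item has a constant random failure rate of 1/MTBF. The 20 items and the 10-year MTBF are hypothetical figures, not data from the project.

```python
import math


def series_pf(mtbf_years, mission_years):
    """Probability of at least one failure during the mission.

    Assumes independent items in series, each with a constant
    failure rate of 1/MTBF (the flat, random portion of the curve).
    """
    if mission_years < 0 or any(m <= 0 for m in mtbf_years):
        raise ValueError("mission must be non-negative and every MTBF positive")
    total_rate = sum(1.0 / m for m in mtbf_years)
    return 1.0 - math.exp(-total_rate * mission_years)


if __name__ == "__main__":
    # Hypothetical: 20 pieces of equipment, each with a 10-year MTBF,
    # all required to run for a 1-year project.
    equipment = [10.0] * 20
    print("Pf over one year: %.2f" % series_pf(equipment, 1.0))
```

Under those assumptions the chance of at least one process-stopping failure inside the one-year window is roughly 86%, even though each individual item looks very reliable on its own.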
Number 2: As I mentioned earlier, Patterns 1 and 6 have a ‘Ski Slope’ on the front end of the curve, an increased Pf. You generally hear that this increased Pf is caused by ‘run in’ or ‘burn in’ failures, or maybe you’ve heard of it as ‘infant mortality’. The real reason: Introduction of Error. Whenever I manufacture, install and commission a piece of equipment, I increase the potential for the Introduction of Error. Whenever I disassemble a piece of equipment to perform a maintenance action, even something as simple as removing a guard or cover to gain access, I increase the potential for the Introduction of Error.
During manufacture: I missed the occlusion in the casting, I missed the crack in the weld, I missed the flaw in the circuit board, I missed the material stratification in the transistor, I didn’t get the O-ring seated correctly...
During installation: I mounted the components to an untested structure, I left a hollow spot when grouting the base, the soil compaction was wrong for the installation, the space was not air conditioned, we did not re-align the coupling or the belts, the resilient mounts were installed incorrectly, power was wired to the common ground, we wired all of our single phase components to a single phase of the 3 phase distribution, causing a voltage unbalance...
During commissioning: I ran the compressor for several minutes without oil, I ran the unit backwards for a while until I swapped leads, we operated the diesel for several minutes without cooling water...
During maintenance: I forgot to refill it with oil, I couldn’t get all of the bolts back in, the shaft was a little scored and the bearing fit was a bit loose, I couldn’t reach some of the fasteners to tighten them...
Each and every one of these scenarios plays out as an Introduction of Error that increases the probability of failure during initial operation. And initial operation could be measured in months and years, not just hours, which adds to the random failures discussed in item 1 above.
What are your thoughts………….
So what do you think? Based solely on this discussion, should there have been a maintenance plan in place for the project, a maintenance strategy, some spare parts, maybe just a few simple PdM tasks? If you are or ever have been part of an operations or maintenance crew, I’m sure you would agree, the project manager’s assumption was flat wrong. Yes indeed, there should have been some defined routine maintenance process to support this project.
So did I profit from the several failures that plagued the project? Yes I did. After several pump failures my company had a full-time engineer out there with one of my vibration meters providing more or less manual continuous monitoring. And based on what I remember, a good portion of those pump failures were due to equipment design and commissioning (or lack thereof) issues, i.e. Introduction of Error. To the U.S. taxpayer, I apologize that I could not make a stronger case.
My recommendation to you: include a maintenance plan and a strategy, and budget for them. Failures are gonna happen.
Maintenance, what a Concept!!!!!
MMJennings