Add a "Black Box" Fault Logger to Your "Big (or Small) Box" System
Abstract
This article describes how to add black-box functionality, in the form of nonvolatile fault logging, to networking, communications, industrial, and medical equipment. It outlines the benefits of recording fault data, including faster, more definitive failure analysis.
A similar version of this article appeared June 24, 2011 on www.how2power.com.
Background
Everyone is familiar with the term "black box," referring to the device that provides clues as to why a plane crash occurred. The airplane's black box collects numerous data points about the operating condition of the aircraft, including altitude, speed, and flap and rudder positions; it also records what the pilots were doing and saying right before the accident. This running log of what transpired just before a crash can be critical in determining the root cause of the incident.
As an aside, the term "black box" is a misnomer. The equipment used in aircraft is not black; it is bright orange so that it can be easily located. The proper avionic terminology for the device is a "flight recorder"; the more general term for such a device is an "event data recorder."
Of course, the engineering community will also understand a black box as a device where the inputs and outputs are known, but the internal operation of the box is unknown. That type of black box is not the subject of this article.
Adding a data recording function (a black box) to electronic equipment other than aircraft can prove extremely valuable. In electronic equipment, this black-box functionality is provided by a device called a "complex system manager," which performs fault logging in networking, industrial control, medical, and communications equipment. The principal benefit of fault logging is quite straightforward: a faster, more definitive failure analysis. This article explains how to implement such a function, and outlines the benefits that can be realized from nonvolatile fault logging.
Power-Management Schemes
From a power-management perspective, the inner workings of most "big box" and "small box" systems look very similar. Whether the box is a router, a server, a base station, an optical multiplexer, a programmable logic controller (PLC), or a magnetic resonance imager (MRI), these systems all contain an array of switched-mode and linear power supplies, which require monitoring of voltage, current, temperature, and possibly fan speed. See Figure 1.
Figure 1. A typical power-supply arrangement.
Nonvolatile Fault Logging
In both big-box systems and smaller "pizza-box" systems, a complex system manager's primary function is to control and monitor a number of power supplies and fans. Monitoring includes looking for system fault events, such as voltages that are either too high or too low, currents that are too high, temperatures that are out of range, and fans that are not spinning at the proper speed. Checking for a fault can be as simple as testing whether a parameter has moved beyond a threshold. If real-time data is collected while the system operates and is stored to nonvolatile memory when a fault occurs, an event-data-recorder function can be created. Figure 2 shows just such a system.
Figure 2. Functional diagram of a nonvolatile fault-logging system for a number of power supplies and fans.
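The threshold check described above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of any particular device: the parameter names (`vout_3v3`, `fan1_rpm`, and so on), the limit values, and the helper functions `check_fault` and `scan` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Allowed range for one monitored parameter (hypothetical values)."""
    low: Optional[float]   # None means no lower limit
    high: Optional[float]  # None means no upper limit


def check_fault(name: str, value: float, limits: Limits) -> Optional[str]:
    """Return a fault description if value is outside limits, else None."""
    if limits.low is not None and value < limits.low:
        return f"{name} under limit: {value} < {limits.low}"
    if limits.high is not None and value > limits.high:
        return f"{name} over limit: {value} > {limits.high}"
    return None


# Illustrative limits only; real thresholds come from the system design.
LIMITS = {
    "vout_3v3": Limits(low=3.135, high=3.465),   # 3.3 V +/- 5%
    "iout_3v3": Limits(low=None, high=10.0),     # amperes, over-current only
    "temp_board": Limits(low=-10.0, high=85.0),  # degrees Celsius
    "fan1_rpm": Limits(low=2000.0, high=None),   # a fan that is too slow
}


def scan(readings: dict) -> list:
    """Check every reading against its limits and collect the faults."""
    faults = []
    for name, value in readings.items():
        fault = check_fault(name, value, LIMITS[name])
        if fault is not None:
            faults.append(fault)
    return faults
```

Calling `scan` with one set of readings returns an empty list when every parameter is in range, and one message per violated limit otherwise. A one-sided limit, such as the over-current and fan-speed entries, is expressed by leaving the other bound as `None`.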
In Figure 2, the complex system manager continuously collects data on the numerous system voltages, currents, temperatures, and fan speeds. Similar to the black box in an aircraft, the most recent parametric data (for example, the last 500 ms to 1 s of data) is continuously collected on a rolling basis. Then, when a fault occurs, a snapshot of the system at that time is permanently recorded. The record of the 500 ms to 1 s of system operation leading up to a fault is critical for understanding what caused the fault and how the system was affected. From this data, a timeline can be reconstructed and the interdependencies determined.
Ideally, the complex system manager should record multiple fault occurrences. Because the parts of a system are tightly coupled, one fault will likely trigger further faults in quick succession. To find the root cause of the failure, it is thus important that all of the data be captured. Moreover, a large amount of nonvolatile storage allows the system to store events that may not be deemed catastrophic, but merely indicate when the system is operating outside the specified range. The storage of this data can be important for enforcing warranty compliance.
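The rolling collection and snapshot-on-fault behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the `FaultLogger` class and its method names are hypothetical, a Python list stands in for the nonvolatile memory, and the policy of refusing new records once the log is full (so that the earliest faults are preserved) is an assumption for this example rather than the documented behavior of any device.

```python
import json
from collections import deque


class FaultLogger:
    """Rolling pre-fault history with snapshot-on-fault (illustrative)."""

    def __init__(self, depth, max_logs):
        # A bounded deque discards the oldest sample automatically.
        self.history = deque(maxlen=depth)
        self.max_logs = max_logs
        self.nonvolatile = []  # stand-in for flash or EEPROM log slots

    def sample(self, timestamp_ms, readings):
        """Record one set of readings in the rolling history."""
        self.history.append({"t_ms": timestamp_ms, **readings})

    def log_fault(self, timestamp_ms, reason):
        """Freeze the current history into the next free log slot."""
        if len(self.nonvolatile) >= self.max_logs:
            return False  # assumed policy: keep the earliest faults
        record = {
            "t_ms": timestamp_ms,
            "reason": reason,
            "history": list(self.history),
        }
        # Serialize so the stored record cannot change as sampling continues.
        self.nonvolatile.append(json.dumps(record))
        return True

    def read_logs(self):
        """Return stored fault records, oldest first, as a host would."""
        return [json.loads(entry) for entry in self.nonvolatile]
```

With an assumed sample interval of 10 ms, `FaultLogger(depth=100, max_logs=15)` keeps the most recent 1 s of readings. Each call to `log_fault` then stores that 1 s window alongside the fault reason, so several faults that occur in succession each get their own record.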
An Example
Consider the scenario shown in Figure 3. A power supply fails (Step 1) and the fault is detected by one of the complex system managers, which constantly monitor voltages, currents, and temperatures. That manager immediately notifies the other managers in the system so they can take action as needed (Step 2). The complex system managers then sequence off the power supplies and fans in concert as the system requires (Step 3). All of the recent data on system voltages, currents, temperatures, and fan speeds is then logged into the onboard black box in each complex system manager (Step 4). Since the data is stored in nonvolatile memory, a host can pull the data at any time in the future (even after the equipment is returned from the field) to determine what caused the failure (Step 5).
Figure 3. Black-box fault logging scenario.
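Step 5 of this scenario, in which a host pulls the logged data to determine what caused the failure, might look like the sketch below. It assumes that each manager's log has already been read out as a list of records with a timestamp (`t_ms`) and a reason, and that all managers share a common time base. The record layout, the manager names, and both functions are hypothetical.

```python
def build_timeline(logs_by_manager):
    """Merge fault records from several managers into one ordered list."""
    events = []
    for manager, records in logs_by_manager.items():
        for record in records:
            events.append((record["t_ms"], manager, record["reason"]))
    events.sort()  # tuples sort by timestamp first
    return events


def first_fault(logs_by_manager):
    """Return the earliest logged fault, or None if nothing was logged."""
    timeline = build_timeline(logs_by_manager)
    return timeline[0] if timeline else None


# Hypothetical data read back from two managers after a field return.
example_logs = {
    "manager_a": [{"t_ms": 5012, "reason": "fan1_rpm under limit"}],
    "manager_b": [
        {"t_ms": 5003, "reason": "vout_1v2 under limit"},
        {"t_ms": 5020, "reason": "temp_board over limit"},
    ],
}
```

For the example data, `first_fault(example_logs)` points to the undervoltage event on `manager_b` as the earliest entry, which makes it the natural starting point for root-cause analysis. The later fan and temperature entries are then candidates for follow-on faults.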
Benefits of Nonvolatile Fault Logging
Nonvolatile fault logging has a number of benefits. If the equipment records what transpired during a field failure, the failure-analysis team can determine the root cause of the failure quickly and accurately. Fast troubleshooting improves customer relationships, since users inevitably want to know quickly why the equipment failed. Also, the sooner a manufacturer recognizes a potential liability, the sooner it can rectify the issue and avoid the cost of future failures. Once again, this keeps customers satisfied and improves the overall reliability of their equipment. Nonvolatile fault logging can also reveal whether the customer was using the equipment outside the specified operating range, an action that can violate the product warranty. Over time, the collection of field failure data can improve future product reliability by identifying poor suppliers and weak design practices.
Complex System Managers
Maxim Integrated offers a number of complex system managers that include extensive nonvolatile fault logging for both big-box systems like servers and pizza-box designs like network switches. See Figures 4 and 5.
The MAX34440 controls and monitors up to six power supplies (Figure 4). It provides power-supply sequencing and margining, and monitors for voltage, current, and temperature faults. Multiple MAX34440 devices can be paralleled to handle all of the power supplies that exist in a system. The MAX31785 controls and monitors up to six fans. As with the MAX34440, multiple MAX31785 devices can be used to support as many fans as required.
Figure 4. A big-box system design using the MAX34440 and MAX31785.
Maxim also offers complex system managers that support smaller pizza-box designs like network switches. The MAX34441 supports up to five power supplies plus a fan (Figure 5). To maximize design flexibility, multiple MAX34441 devices can be paralleled or used in conjunction with multiple MAX34440 and MAX31785 devices.
Figure 5. A pizza-box system design using the MAX34441.
A Value Proposition
Black-box fault logging in networking, industrial control, medical, and communications equipment results in faster, more definitive failure analysis. This, in turn, yields higher customer satisfaction through faster reaction times and, in the long term, better product reliability.