Designing High Integrity Systems

Mark Harris
Created: April 2, 2021 | Updated: April 2, 2021

It was not long after the first electrical devices were conceived that they were being used in situations where their failure could have disastrous consequences. This could be delivering a fatal electric shock to the user or failing to apply the brakes on a speeding vehicle; electrical devices are used everywhere and for almost every conceivable purpose. Medical devices are now routinely embedded into a patient’s body to regulate their heartbeat. Motor vehicles now come with electrically controlled steering, braking and acceleration functions. Some modern aircraft cannot fly without computer control. All these examples have one thing in common: if the electrical devices that control them fail, there is the potential for an accident that results in loss of life. There are other examples of high integrity systems where the consequences of failure are more subtle. A failed home heating system could leave you cold in the depths of winter. Failed telecommunications equipment could leave your internet-based business disconnected from the whole world. Financial systems that manage the flow of money worldwide could potentially misplace a customer’s life savings or even bankrupt an entire nation if things went catastrophically wrong.

The design process

The design process for high integrity systems follows a sequence of steps. The first step is to look for all the credible hazards caused by the device and eliminate them. For example, suppose the plan is for the device to be connected to the mains electricity supply and to include a power supply module to generate the low voltages that the components require. However, there is a credible hazard that the device may get wet and electrocute the user. One solution is to enclose the device in a waterproof housing to keep the high voltage and water separated. Another option is to replace the mains electricity supply with a low voltage supply generated using a power adapter located at the mains socket. The hazard associated with electrocution caused by the device getting wet has now been eliminated. There are other consequences if the device gets wet, but they are outside the scope of this hypothetical example.

The second step is to take the hazards that can’t be eliminated and reduce their likelihood of occurring to an acceptable level. For example, your device may include an interlock that prevents an operator from opening an access door when it is switched on. This could be to stop the operator from being exposed to a hazardous voltage, or maybe there’s a laser operating within the device that’s sufficiently powerful to cause eye damage. Suppose the interlock relies on a relay. The relay will have a failure mode where the device is switched on, but the relay has failed, and the access door is unlocked. This failure mode will have a probability of occurrence, for example, once every ten thousand years. This may sound like it will never happen, but probabilities are not that straightforward: a rate like this is an average, so the failure could occur at any time, and it adds up across every unit in service. Typically for safety-critical applications, the rate of hazards that cause loss of life or severe injury should be in the region of once every few million years, depending on the rules and regulations that apply. Here we need to improve the reliability of the safety interlock to make the hazard acceptable. Design changes, such as building redundancy into the interlock mechanism or adding a second independent interlock, will be required. As a last resort, manual procedures can be added to ensure the access door is never opened when the device is on. Humans are by design unreliable creatures, so technical solutions should always be the first, second, and even third lines of defense.
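To put rough numbers on the benefit of a second independent interlock, here is a minimal Python sketch. It rests on simplifying assumptions that are not from the example itself: each interlock is tested and repaired once a year, the two fail independently, and there are no common-cause failures. The function name and the one-year test interval are purely illustrative.

```python
def all_failed_probability(p_single: float, n_interlocks: int) -> float:
    """Chance that n independent interlocks are all failed in one test interval."""
    return p_single ** n_interlocks


# Assumption: a single interlock fails about once every 10,000 years and is
# tested (and repaired if needed) once a year, so p is roughly 1e-4 per interval.
p_single = 1 / 10_000
p_dual = all_failed_probability(p_single, 2)
years_between_dual_failures = 1 / p_dual

print(f"single interlock: about 1 in {1 / p_single:,.0f} years")
print(f"two independent interlocks: about 1 in {years_between_dual_failures:,.0f} years")
```

Under these assumptions the pair fails together roughly once in a hundred million years. Real designs rarely achieve full independence, since a shared power rail or a shared mechanical linkage can defeat both interlocks at once, so treat the result as an upper bound on the improvement.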

The final step is to add control mechanisms that limit the impact of any failure, making the device fault-tolerant. Isolation techniques can prevent a failure in one block of components from causing knock-on failures in connected blocks of components. For example, a failure in a power supply module shouldn’t be allowed to cause the failure of all the components connected to that power supply if the fault causes the power supply output voltage to spike upwards. Such a cascade of failures downstream of the power supply is not only undesirable from a high integrity point of view, but it can also make any repair economically unviable.

Dealing with failure

A common approach is to work on the assumption that your device will, at some point, fail and so implement controls to manage that failure in the safest possible way.

Fail-Safe or Fail-Secure?

One option is a fail-safe system. If any error occurs, the device immediately enters a safe state and alerts an operator to take corrective action. Take the example of a medical device delivering a dose of drugs to a patient. The safe option is to stop providing the medicine and alert a nurse to either replace the device or manually administer the medication. The last thing you want is the device delivering too much or too little of the drug and no one noticing until it’s too late. A variation on the same theme is the fail-secure system. Take the example of an ATM; the last thing the bank wants is a fault that results in the machine issuing free money. A better option is to prevent any cash from leaving the machine and, if any users fail to get their withdrawn money, rely on them going into the bank, where the transaction can be checked and the cash issued manually.
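The fail-safe behavior of the drug delivery example can be sketched as a small latching state machine. This is an illustration only: the class, the flow-rate limits, and the use of None to represent a failed sensor are all assumptions, not a description of any real medical device.

```python
from enum import Enum
from typing import Optional


class PumpState(Enum):
    DELIVERING = "delivering"
    SAFE_STOP = "safe_stop"


class InfusionPump:
    """Hypothetical pump controller that latches into a safe state on any fault."""

    def __init__(self, max_rate_ml_h: float, tolerance: float = 0.1):
        self.max_rate_ml_h = max_rate_ml_h
        self.tolerance = tolerance  # allowed fractional deviation from target
        self.state = PumpState.DELIVERING
        self.alarm_active = False

    def update(self, target_ml_h: float, measured_ml_h: Optional[float]) -> PumpState:
        """Check one flow reading; None stands for a failed sensor."""
        if self.state is PumpState.SAFE_STOP:
            return self.state  # stay stopped until a person intervenes
        faulty = (
            measured_ml_h is None
            or measured_ml_h > self.max_rate_ml_h
            or abs(measured_ml_h - target_ml_h) > self.tolerance * target_ml_h
        )
        if faulty:
            self.state = PumpState.SAFE_STOP  # stop delivering the drug
            self.alarm_active = True          # alert the nurse
        return self.state


pump = InfusionPump(max_rate_ml_h=100.0)
pump.update(target_ml_h=50.0, measured_ml_h=51.0)  # within tolerance
pump.update(target_ml_h=50.0, measured_ml_h=80.0)  # over-delivery: fault
print(pump.state, pump.alarm_active)
```

The important design choice is the latch: once the pump has stopped, a later good reading does not restart delivery. Only a deliberate human action, which this sketch leaves out, should clear the safe state.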

Sometimes the distinction between fail-safe and fail-secure can be blurred, and the two can even be mutually exclusive states. Take the example of an electrically controlled access door to a facility. In the event of a fault, the door could either be left locked or unlocked. If there’s anything of value behind the door, then locked is the fail-secure option, but if the door provides the means of escape in the event of a fire, then unlocked is the fail-safe option. It now becomes a trade-off over which state is more critical, though we would hope that the fail-safe state would always trump the fail-secure one. More realistically, such a situation should prompt a design rethink that aligns the fail-safe and fail-secure conditions so they are the same.

Fail-Soft

An alternative to fail-safe and fail-secure is the principle of fail-soft. Here, in the event of a fault, the device is designed to continue operating in a limited capacity, providing a minimum level of functionality that is unaffected by the defect. A good example is the limp-home feature in modern cars. In the event of a fault in the engine controller or any of its many sensors, rather than just stopping the engine, the controller enters a state where the engine runs at a much-reduced power setting, meaning that the car can get you home or to the nearest garage.
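A limp-home mode can be sketched as a simple cap on the power the controller will grant. The 30% cap, the sensor names, and the function below are hypothetical values chosen for illustration; a real engine controller would apply far more detailed rules.

```python
from typing import Mapping

LIMP_HOME_FRACTION = 0.3  # assumed power cap while in limp-home mode


def throttle_command(requested: float, sensor_ok: Mapping[str, bool]) -> float:
    """Return the throttle fraction (0.0 to 1.0) actually passed to the engine."""
    # Full power only when every sensor reports healthy; otherwise cap it.
    limit = 1.0 if all(sensor_ok.values()) else LIMP_HOME_FRACTION
    return max(0.0, min(requested, limit))


healthy = {"crank_position": True, "oxygen": True}
degraded = {"crank_position": True, "oxygen": False}
print(throttle_command(0.8, healthy), throttle_command(0.8, degraded))
```

The engine keeps running in both cases. The fault only reduces what the driver can ask of it.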

The most complex option is to design a fail-operational system where a device’s failure does not stop or reduce the overall system’s operation. Take the example of an elevator. In the event of a fault, you don’t want the elevator just to stop, as any occupants would need to be extracted. It is a tricky proposition if it stops between floors and the occupant is a hospital patient on a trolley, just on their way back from the operating theatre. Designing a system that allows the elevator to reach a safe position where the doors can be opened and the occupants can leave is typically the preferred option.
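The elevator response can be sketched as a decision made at the moment a fault is detected. The landing heights, the drive_available flag, and the action names below are assumptions made for the sake of the example.

```python
from typing import Sequence, Tuple


def fault_response(position_m: float, landings_m: Sequence[float],
                   drive_available: bool) -> Tuple[str, float]:
    """Choose what the elevator car does after a fault is detected."""
    if not drive_available:
        # Cannot move safely: hold the brake where we are and raise the alarm.
        return ("hold_and_alarm", position_m)
    # Otherwise travel to the closest landing so the doors can be opened.
    target = min(landings_m, key=lambda h: abs(h - position_m))
    return ("move_and_open_doors", target)


landings = [0.0, 3.5, 7.0, 10.5]
print(fault_response(7.9, landings, drive_available=True))
```

Stopping between floors is kept as the fallback for the case where moving the car would itself be unsafe.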

Failure Causes

Faults can arise from a myriad of causes that can either be internal or external to the device. 

Component Failure

Component failures are very well understood in terms of their likelihood of occurring and the modes in which components fail. As a simple example, a discrete resistor will most probably fail open-circuit and, to a lesser extent, may fail short-circuit or drift out of tolerance. Manufacturers can provide mean time to failure figures, though the designer needs to consider the environment in which the device is operating. Extremes of temperature, exposure to moisture, vibration, and shock impacts will affect the component’s reliability and the likely failure modes. For example, in a high vibration environment, open circuit faults due to fractures in PCB traces, solder joints, or component legs are common. Conversely, in a high humidity environment, short circuit faults due to water or conducting compounds being deposited on the surface of the PCB and its components may be more likely.
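Manufacturer figures can be combined into a rough estimate for a whole circuit. The sketch below assumes constant failure rates and a series system, where any single component failure counts as a device failure. The FIT values (failures per billion hours) and the stress_factor multiplier that stands in for a harsh environment are made-up illustrative numbers, not datasheet values.

```python
from typing import Iterable


def system_mttf_hours(fit_rates: Iterable[float], stress_factor: float = 1.0) -> float:
    """MTTF of a series system from component FIT rates (failures per 1e9 hours)."""
    total_fit = sum(fit_rates) * stress_factor
    return 1e9 / total_fit


# Illustrative FIT values only, not taken from any datasheet.
fits = {"resistor": 1.0, "capacitor": 5.0, "regulator": 20.0, "mcu": 50.0}
benign = system_mttf_hours(fits.values())
harsh = system_mttf_hours(fits.values(), stress_factor=4.0)
print(f"benign: {benign / 8760:.0f} years, harsh: {harsh / 8760:.0f} years")
```

Because the rates add, the least reliable parts dominate the total, which shows where design effort is best spent.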

However, the more complex the component, the harder this task becomes. The failure modes for a voltage regulator can be more subtle and harder to plan for. They don’t just fail to deliver the correct voltage. More subtle effects such as noise or ripples on the regulated output can be harder to trace, and their effects will be seen on other components further down the line in your circuit. The result may well be the premature failure of a connected component. Replacing this failed component without realizing this was a consequence rather than the cause of the device fault would just mean that the device will fail again when the replacement component falls over.

The greatest challenge is with the most complex components of all. Processing devices such as an MCU can fail in an almost infinite number of different ways. A fault may also be introduced during manufacture or assembly that won’t materialize until a precise set of conditions occurs, which may not happen until the device has been operating successfully for months, if not years. It is not uncommon for unused pins to inadvertently be left unconnected, which will not be spotted if the pin happens to float at a voltage that matches a benign state. However, all it needs is for the voltage at that pin to drift to the opposite state over time due to external factors, and suddenly, the processor may perform some unwanted action. Debugging such a fault may prove to be an arduous task if the pin floats back to the benign state when the device is on the bench being investigated.

Electro-Static Discharge

Another common problem is the exposure of a device to Electro-Static Discharge (ESD) when the device is handled or used in an environment where high static voltages may be generated. In purely analog devices, the effect of ESD will tend to be transitory and, unless the voltages present are sufficient to damage a component, will have no lasting impact. However, if the device includes digital components such as an MCU, the effect can be much more significant. Permanent damage to digital circuits can be expected, though its effect can vary greatly. In the worst case, the device fails outright. More likely, a small part of the device will be damaged, and the impact is only seen when that part is used. In a high integrity system, you could argue that the limited failure of part of a digital component is potentially more troublesome than a complete component failure, as the effects can be more subtle and harder to counter. This is where the designer’s experience and careful analysis of the circuit will be needed to identify the possibilities and the means to control the consequences.

Electro-Magnetic Interference

Similar to the ESD issue are the effects of external EMI. The difference here is that the external environment is outside of the designer’s control. All they can do is plan for the worst-case EM levels and include protection circuitry to build immunity to external EM sources. Conventional techniques for protecting against EMI include line filtering, shielded enclosures and cables, and careful layout design. A common ingress point for EMI is via an external power connection, particularly mains power. Attention paid to including EMI protection as part of the device power supply circuit can pay dividends in reducing the device’s overall susceptibility.

The critical points for the designer to take on board include the following:

EMI protection needs to be considered part of the circuit design process and not added as an afterthought. It is improbable that any bolt-on protection will be as effective as fully integrated protection.

All protection circuitry should be added as close as possible to the points where the EMI can enter the circuit. Ideally, protection should be at every connection point of the device’s enclosure, keeping the EMI outside the box. The EMI should be redirected straight to the enclosure’s ground connection, away from any ground paths inside the device.

Any components within the device that are sensitive to EMI should be physically and electrically isolated as much as possible from those components that can be exposed to the EMI. Defense in depth can be provided by shielding sensitive components within the shielded device. Opto-isolators can also be a great way to prevent EMI entering through external connections into the inner protected sanctum. A less expensive option is to use a diode/suppressor network in line with the input. A decoupling diode in series with a low-value resistor, working with suppressor diodes, will protect against large voltages and provide a degree of protection against noise.

When designing the circuit, consider both common-mode and differential-mode EMI effects. A low-pass filter on a signal line will attenuate differential-mode noise between the signal line and the ground, but it won’t help with common-mode noise present on both the signal and the ground lines. Particularly in digital circuits, capacitors between the signal line and the ground can increase common-mode noise levels. Providing a clean ground will solve the problem; alternatively, using a common-mode choke can help.
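For the differential-mode case, the effect of a simple first-order RC low-pass filter on a signal line can be estimated with the standard formulas for cutoff frequency and attenuation. The component values below are assumptions chosen for illustration. The calculation says nothing about common-mode noise, which such a filter does not address.

```python
import math


def rc_cutoff_hz(r_ohm: float, c_farad: float) -> float:
    """Cutoff (-3 dB) frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def attenuation_db(f_hz: float, fc_hz: float) -> float:
    """Gain in dB of an ideal first-order low-pass at frequency f_hz."""
    return -10.0 * math.log10(1.0 + (f_hz / fc_hz) ** 2)


# Assumed values: 1 kilohm series resistor, 100 nF capacitor to ground.
fc = rc_cutoff_hz(1_000.0, 100e-9)
print(f"cutoff: {fc:.0f} Hz")
print(f"gain at 10x cutoff: {attenuation_db(10 * fc, fc):.1f} dB")
```

A first-order filter rolls off at about 20 dB per decade above the cutoff, so the cutoff needs to sit well below the interference band while staying above the highest frequency the signal itself needs.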

Conclusions

The design process for a high integrity device is simply an extension of the typical design process. You start with the essential functions that your device needs to perform and then brainstorm the credible hazards based on how and where it will be used. Then it’s just a case of refining the design to eliminate risks, or at least reduce them to an acceptable level. Experience with high integrity systems will shorten the time needed for the hazard identification and analysis process and bring proven solutions to the table when required. There are also established techniques and methodologies available to aid the process. Techniques such as Failure Modes, Effects and Criticality Analysis (FMECA) are invaluable for electrical circuit analysis. Other methods are available. The bottom line is recognizing how your device can fail and what the consequences would be. Being forearmed with this knowledge allows you to design resilient systems that you can depend upon in a high integrity application.
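As a flavor of what a quantitative FMECA involves, the sketch below ranks the failure modes of a single resistor by the commonly quoted mode criticality number, Cm = beta × alpha × lambda_p × t, as used for example in MIL-STD-1629A. Every number here is an assumption for illustration: the failure mode ratios, the conditional probabilities, the part failure rate, and the operating time.

```python
from typing import NamedTuple


class FailureMode(NamedTuple):
    name: str
    alpha: float  # share of the part's failures that occur in this mode
    beta: float   # conditional probability that the mode causes the hazard


def mode_criticality(mode: FailureMode, failure_rate_per_hour: float, hours: float) -> float:
    """Mode criticality number Cm = beta * alpha * lambda_p * t."""
    return mode.beta * mode.alpha * failure_rate_per_hour * hours


# Illustrative values for one resistor, not taken from any handbook.
modes = [
    FailureMode("short circuit", alpha=0.1, beta=0.5),
    FailureMode("open circuit", alpha=0.7, beta=1.0),
    FailureMode("out of tolerance", alpha=0.2, beta=0.1),
]
rate = 1e-9       # assumed part failure rate, failures per hour
mission = 50_000  # assumed operating time in hours

ranked = sorted(modes, key=lambda m: mode_criticality(m, rate, mission), reverse=True)
for m in ranked:
    print(f"{m.name}: {mode_criticality(m, rate, mission):.2e}")
```

Ranking the modes this way shows which failure deserves design attention first.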


About Author


Mark Harris is an engineer's engineer, with over 12 years of diverse experience within the electronics industry, varying from aerospace and defense contracts to small product startups, hobbies and everything in between. Before moving to the United Kingdom, Mark was employed by one of the largest research organizations in Canada; every day brought a different project or challenge involving electronics, mechanics, and software. He also publishes the most extensive open source database library of components for Altium Designer called the Celestial Database Library. Mark has an affinity for open-source hardware and software and the innovative problem-solving required for the day-to-day challenges such projects offer. Electronics are his passion; watching a product go from an idea to reality and start interacting with the world is a never-ending source of enjoyment.

You can contact Mark directly at: mark@originalcircuit.com
