Reducing Metastability in FPGA Designs
Here, we take a look at the concept of metastability with regard to digital circuits – and therefore FPGA designs – and at how its 'appearance' can be greatly reduced simply by adhering to proven design principles that mitigate its effect.
Metastability! You could be forgiven for thinking this might be related to the integrity of some futuristic containment vessel, or force field: "The metastability of the warp drive's flux triangulator and cryonic envelope is reaching critical levels, Captain!".
To those of you who live and breathe digital electronics on a daily basis, however, the term will likely be greeted with a mixture of disdain and respect.
Metastability concerns the outputs of registers (or clocked flip-flops, in old money) within digital devices, and the potential for those outputs to enter a ‘metastable state’. FPGA devices typically use D-type flip-flops. Before looking at how such a state can be entered, it is worth refreshing our memory of some key timing parameters related to the operation of a register:
‘Set-up time’ – the minimum time for which the input to the register must be stable before the arrival of the next clock edge. Typically appears as Tsu in data sheets.
‘Hold time’ – the minimum time after the arrival of the clock edge for which the input to the register must remain in the same stable state. Typically appears as Th in data sheets.
‘Clock-to-Output Delay time’ – the time, measured from the arrival of the clock edge, after which the output of the register changes. This is also referred to as the register's 'settling time' or 'propagation delay'. Can appear in data sheets as, for example, Tco, or tPHL and tPLH.
Whenever a signal travels between two asynchronous clock domains – digital sub-circuits within the overall design that run on different or unrelated clocks – there is the possibility of encountering metastability. The same is true of data transfer from an unclocked region of a design into a synchronous system, for example external signals fed into an FPGA.
The following image illustrates two examples of asynchronous signals entering a synchronous system. In the upper example, a signal travels between different clock domains. In the lower example, a signal from an unclocked system is fed into a clocked (synchronous) system.
The problem arises when a data signal from one clock domain arrives at the register logic in another clock domain. The incoming data signal from the source domain may transition at any time relative to the clock in the target domain: there is no fixed timing relationship between the two domains, so nothing guarantees when the data will change with respect to the destination clock edge. If the data signal transitions at a point that violates the required Set-up or Hold time of the destination register, the output of that register can enter a 'metastable state', a state in which the output is neither a logical Low nor a logical High, but somewhere in the unstable region between the two.
The length of time that the output remains metastable may exceed the register's specified ‘Clock-to-Output Delay time’ (settling time). In the majority of cases, the register quickly resolves this instability and its output settles to one of the two defined (and stable) states. The problem for a design, however, lies in the minority of cases, when the output does not settle quickly enough, or resolves to the incorrect logic level.
The following image illustrates the output of a register depending on when the input data signal transitions.
Considering the three inputs:
Input A: The input observes the register's Set-up and Hold times and the output is available after the device's Clock-to-Output Delay time.
Input B: The input transitions during the register's Set-up time, with the output going metastable until settling to the correct stable level beyond the Clock-to-Output Delay time.
Input C: The input transitions during the register's Hold time, with the output going metastable. Not only does the output settle to a stable state beyond the Clock-to-Output Delay time, it also settles to the wrong logic level!
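The three cases above can be expressed as a simple timing check. This is a minimal sketch: the function name and the Tsu/Th values are hypothetical, chosen only for illustration, and it only tells you whether a transition falls inside the set-up or hold window, not what the output will actually do.

```python
# Classify one data transition against the set-up/hold window of a
# capturing clock edge, mirroring Inputs A, B and C above.
# All times are in nanoseconds; the values in __main__ are hypothetical.

def classify_transition(t_data: float, t_edge: float, t_su: float, t_h: float) -> str:
    """Return 'A' if set-up and hold are respected, 'B' for a transition
    inside the set-up window, 'C' for a transition inside the hold window."""
    if t_edge - t_su < t_data < t_edge:
        return "B"  # set-up violation: the output may go metastable
    if t_edge <= t_data < t_edge + t_h:
        return "C"  # hold violation: the output may go metastable
    return "A"      # outside the window: output valid after Tco


if __name__ == "__main__":
    T_SU, T_H = 0.5, 0.2  # hypothetical Tsu and Th (ns)
    EDGE = 10.0           # capturing clock edge at t = 10 ns
    for t in (8.0, 9.8, 10.1):
        print(t, classify_transition(t, EDGE, T_SU, T_H))
```

A transition exactly Tsu before the edge still counts as case A here, since the input has then been stable for the full set-up time.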
If the output from the register feeds more than one subsequent register in parallel, those destination registers may capture the data at different logic levels, depending on whether the metastable output from the source register has settled to a stable state before each destination register is clocked. Path delays between the source and destination registers, added to the time for the metastable output to become stable, only compound the problem.
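The fan-out hazard can be illustrated with a deliberately simplified, deterministic model. This is a sketch under stated assumptions: the source register goes metastable at t = 0 and resolves after a chosen time, each destination samples on the next edge with its own (hypothetical) routing delay, and a destination whose input has not settled in time is simply marked 'X' (unknown) rather than modelled further.

```python
# Hypothetical timing values, for illustration only.
T_CLK = 5.0  # ns, destination clock period
T_SU = 0.3   # ns, destination register set-up time


def captured_values(t_resolve, resolved_value, path_delays):
    """Return what each destination register captures at the next clock edge.

    t_resolve      -- time (ns) for the metastable source output to settle
    resolved_value -- the logic level it eventually settles to
    path_delays    -- routing delay (ns) to each destination register
    """
    result = []
    for delay in path_delays:
        arrival = t_resolve + delay        # when the settled value reaches this input
        if arrival <= T_CLK - T_SU:
            result.append(resolved_value)  # settled in time: clean capture
        else:
            result.append("X")             # set-up violated: this register may go metastable too
    return result


if __name__ == "__main__":
    # One metastable source, two destinations with different routing.
    print(captured_values(t_resolve=4.0, resolved_value=1, path_delays=[0.4, 1.2]))
```

With these numbers the first destination captures a clean 1 while the second does not, so the two parallel registers disagree even though they share a source and a clock.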
In summary, metastability is a statistical, or probability-based, foe to a designer. Depending on the devices used and the circuit layout of the design, metastable output states may or may not occur. If they do occur, they may be detrimental, causing failure of the design, or luck may be on your side and the settling times of devices, clock speeds and routed paths might just make their appearance benign. The problem for a designer, however, is this: can you really afford to take that 'chance'? What if the product you are designing is part of a medical installation or a commercial jetliner? Failure of the design could be catastrophic.
Although metastability cannot be eradicated entirely – no device in the world can claim to operate absolutely free of potential metastability effects – it can be reduced to the point of becoming barely a 'blip on the radar'.
As a measure of the reliability of a design with regard to metastability-induced failure, we talk about the Mean Time Between Failures, or MTBF. With metastability left unchecked, that is, with no provisions made in a design to mitigate its effect, the MTBF could be as little as seconds. By applying tried-and-tested digital design methodologies to combat metastability, and by making careful choices of the digital devices used in a design, the MTBF can be increased considerably: to a thousand years between failures, a million years, or even a billion years when mathematically calculated and extrapolated. At these values of MTBF, such a design can be rubber-stamped as 'Highly Reliable' or virtually 'Fail-Safe' (or should that be 'Fail-Free'?), but you get the picture.
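For reference, a widely used first-order model for the MTBF of a single synchronizer is given below. FPGA vendor application notes present it in varying notation, so the symbols here are this sketch's own, and the device constants have to come from the vendor's characterization data rather than from the data sheet timing parameters alone:

```latex
\[
\mathrm{MTBF} = \frac{e^{\,t_{\mathrm{MET}}/\tau}}{T_0 \, f_{\mathrm{clk}} \, f_{\mathrm{data}}},
\qquad
\lambda = \frac{1}{\mathrm{MTBF}}
\]
```

Here t_MET is the metastability settling time available before the signal is used, τ is the register's resolution time constant, T_0 is a device-dependent constant related to the width of the metastability window, f_clk is the destination clock frequency, f_data is the toggle rate of the asynchronous input, and λ is the corresponding failure rate. The denominator estimates how often metastable events are triggered; the exponential term reflects how unlikely it is that one is still unresolved after t_MET, which is why small increases in settling time buy very large increases in MTBF.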
The following sections take a look at just how you, as a designer, can extend the MTBF, and how device technology plays its part.
Synchronizing Asynchronous Signals
Perhaps the most prevalent and widely accepted solution to the metastability problem is the addition of front-end circuitry to synchronize an incoming asynchronous signal with the clock of the target synchronous circuit. In its simplest form, this circuitry consists of one or more D-type flip-flops, chained together and clocked by the target system clock. This is referred to as a ‘synchronization register chain’, or just plain ‘synchronizer’.
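The structure of such a chain can be sketched as a cycle-level software model. This is a hypothetical illustration, not HDL: the `Synchronizer` class is invented for this example and models only the shift-register structure and its latency, not metastability itself.

```python
class Synchronizer:
    """Cycle-level model of an N-stage synchronization register chain."""

    def __init__(self, stages: int = 2, reset_value: int = 0):
        if stages < 1:
            raise ValueError("a synchronizer needs at least one register")
        # regs[0] samples the asynchronous input; regs[-1] drives the target logic.
        self.regs = [reset_value] * stages

    def clock(self, async_in: int) -> int:
        """Apply one target-domain clock edge and return the synchronized output."""
        # All registers update on the same edge: each takes the old value of
        # the register before it, and the first stage samples the input.
        self.regs = [async_in] + self.regs[:-1]
        return self.regs[-1]


if __name__ == "__main__":
    sync = Synchronizer(stages=2)
    print([sync.clock(x) for x in [1, 1, 1, 0, 0, 0]])
```

With two stages the output follows the input two clock edges later, giving `[0, 1, 1, 1, 0, 0]` for the input sequence above; that extra latency is the price paid for the settling time the chain provides.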
Each register in the chain adds one clock cycle of delay, giving the incoming signal time to resolve from any metastable state it may have entered before it is used by downstream logic. The more registers in the chain, the more delay, and therefore the more time for a metastable output to resolve. The total time available is often known as the ‘Metastability Settling Time’. Typically the synchronization circuit consists of two registers, but for critical applications, such as medical and military, three is not uncommon.
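To see why one extra register matters so much, the sketch below plugs the settling time of 1, 2 and 3-stage chains into the commonly quoted model MTBF = e^(t_MET/τ) / (T0 · f_clk · f_data). Every parameter value is hypothetical, and the settling-time estimate is a simplification (each register-to-register hop contributes a fixed slack, and the last stage is assumed to contribute none).

```python
import math

# Hypothetical parameters, chosen only to illustrate the trend; real values
# come from the FPGA vendor's metastability characterization data.
TAU = 40e-12              # resolution time constant tau, seconds
T0 = 1e-9                 # metastability window constant T0, seconds
F_CLK = 500e6             # destination clock frequency, Hz
F_DATA = 10e6             # toggle rate of the asynchronous input, Hz
SLACK_PER_STAGE = 0.8e-9  # clock period minus Tco, Tsu and routing, seconds

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def settling_time(stages: int) -> float:
    """Settling time from the register-to-register hops in the chain.
    Simplification: a chain of N registers provides N - 1 hops of slack."""
    if stages < 1:
        raise ValueError("a synchronizer needs at least one register")
    return (stages - 1) * SLACK_PER_STAGE


def mtbf_seconds(stages: int) -> float:
    """MTBF = exp(t_met / tau) / (T0 * f_clk * f_data)."""
    return math.exp(settling_time(stages) / TAU) / (T0 * F_CLK * F_DATA)


if __name__ == "__main__":
    for n in (1, 2, 3):
        s = mtbf_seconds(n)
        print(f"{n} stage(s): {s:.3g} s (~{s / SECONDS_PER_YEAR:.3g} years)")
```

With these made-up numbers, two stages give an MTBF of roughly 100 seconds while three give roughly 1,500 years: each added stage multiplies the MTBF by e^(slack/τ) rather than adding to it.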
The following image illustrates an example of adding a 2-stage synchronizer to the front-end of a synchronous system, to synchronize an incoming asynchronous signal.
Handshaking logic between circuitry in different clock domains, and/or FIFO logic, is also used in addition to front-end synchronization to ensure that correct data values are received. This is of particular importance when dealing with a bus of multiple asynchronous signals, each of which could transition at any time, independently of the others.
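The bus problem is easy to quantify. When a binary counter steps from 7 to 8, four bits change at once; a receiving register sampling mid-transition could capture any mix of old and new bits. One common mitigation, used for the pointers in asynchronous FIFO designs, is Gray coding, where consecutive values differ in exactly one bit. The sketch below is illustrative only; the helper names are hypothetical.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Inverse of to_gray: XOR together all right-shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions that differ between a and b."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # A 4-bit counter stepping from 7 (0111) to 8 (1000):
    print("binary bits changing:", bits_changed(7, 8))
    print("Gray bits changing:  ", bits_changed(to_gray(7), to_gray(8)))
```

Because only one bit is in flight at a time, a Gray-coded value sampled during a transition resolves to either the old or the new count, provided the value changes by at most one step between samples. This complements, rather than replaces, the synchronizer on each bit.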
The Weakest Link...
In a digital design, there may be multiple different clock domains and a plethora of signals passing between them. In addition, there may be a variety of external, asynchronous signals – sourced from outside (especially for a design implemented in an FPGA and utilizing external peripheral components and communications interfaces). In such cases, it is not uncommon to find many synchronizing register chains, handling the different asynchronous signal transfers within the overall system.
In terms of MTBF, each synchronizer chain will have its own value. Because the overall failure rate of a design is the sum of the failure rates of the individual synchronizer chains within it, and the failure rate is 1/MTBF, a chain with a much lower MTBF than the others drags down the MTBF of the whole design. In fact, the MTBF of the design will be dominated by the MTBF of the worst synchronizer chain, which can be disastrous if five chains have an MTBF of a million years and a sixth has an MTBF of 50 years!
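The arithmetic behind that example can be sketched in a few lines, assuming independent chains with constant failure rates (the usual assumption when failure rates are simply summed). The function name is hypothetical.

```python
def system_mtbf(chain_mtbfs):
    """System MTBF from per-chain MTBFs: failure rates (1/MTBF) add."""
    return 1.0 / sum(1.0 / m for m in chain_mtbfs)


if __name__ == "__main__":
    chains = [1e6] * 5 + [50]  # years: five good chains and one weak one
    print(f"{system_mtbf(chains):.2f} years")
```

The result is about 49.99 years: the five million-year chains together contribute a failure rate of only 5 × 10⁻⁶ per year, so the single 50-year chain sets the MTBF of the whole design.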
The solution is to add another register stage to the worst-performing synchronizer chain in the design, increasing its metastability settling time. Because MTBF grows exponentially with settling time, this improves the MTBF of that chain, and therefore of the overall design, considerably.
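The effect of that fix on the earlier six-chain example can be sketched as follows. The assumptions are labelled: the extra stage is modelled as multiplying the weak chain's MTBF by e^(slack/τ), and the 1 ns of added settling time and τ = 100 ps are hypothetical values; the helper names are also hypothetical.

```python
import math


def system_mtbf(chain_mtbfs):
    """Reciprocal of the summed failure rates (independent chains assumed)."""
    return 1.0 / sum(1.0 / m for m in chain_mtbfs)


def add_stage_to_worst(chain_mtbfs, slack, tau):
    """Return a copy in which the lowest-MTBF chain gets one extra stage,
    modelled as multiplying its MTBF by exp(slack / tau)."""
    worst = min(range(len(chain_mtbfs)), key=lambda i: chain_mtbfs[i])
    improved = list(chain_mtbfs)
    improved[worst] *= math.exp(slack / tau)
    return improved


if __name__ == "__main__":
    chains = [1e6] * 5 + [50]  # years, as in the example above
    better = add_stage_to_worst(chains, slack=1e-9, tau=100e-12)  # hypothetical
    print(f"before: {system_mtbf(chains):,.0f} years")
    print(f"after:  {system_mtbf(better):,.0f} years")
```

Here the weak chain improves by a factor of e¹⁰ (about 22,000) to roughly 1.1 million years, and the design MTBF rises from about 50 years to roughly 169,000 years; at that point the limit is set by the five chains that were already good.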
Device Technology - Faster vs Smaller
To recap, metastability (although there's nothing stable about this state!) occurs when an incoming asynchronous signal transitions in violation of a register's Set-up and/or Hold time. The combined length of time, Set-up + Hold, essentially defines the 'window' in which metastability can occur, the ‘metastability window’ if you will.
It stands to reason that the shorter a register's Set-up and Hold times, the smaller the metastability window. Indeed, faster logic families exhibit shorter times and hence decrease the probability of a metastable event. Should a metastable event occur (remember, metastability cannot be completely eradicated), faster registers also tend to recover from it more quickly. For example, a register from the 74F family would lead to a better MTBF than a device from the 74LS family, two families at very different points on the device-speed spectrum.
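The link between window size and the rate of metastable events can be sketched numerically. This assumes that each data transition is equally likely to land anywhere within a clock period, the usual assumption for a truly asynchronous input; the window widths and frequencies below are hypothetical and are not 74LS or 74F data sheet values.

```python
def violations_per_second(window_s, f_clk_hz, f_data_hz):
    """Expected set-up/hold violations per second for an asynchronous input.

    A transition hits the window with probability window / T_clk,
    i.e. window * f_clk (capped at 1 if the window spans the whole period).
    """
    p_hit = min(1.0, window_s * f_clk_hz)
    return f_data_hz * p_hit


if __name__ == "__main__":
    F_CLK, F_DATA = 100e6, 1e6  # hypothetical clock and data transition rates
    for label, window in [("slower family", 5e-9), ("faster family", 1e-9)]:
        print(label, f"{violations_per_second(window, F_CLK, F_DATA):.3g} per second")
```

Shrinking the window from 5 ns to 1 ns cuts the violation rate fivefold. Not every violation causes a failure, though; whether it does depends on how quickly the metastable output resolves relative to the settling time available.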
With FPGAs, the decrease in process geometries (from 180nm, through 90nm and onward to 65nm, 40nm and beyond) brings faster transistor switching speeds, typically improving the MTBF with respect to metastability. However, the benefits of reduced size are not without potential penalty. Shrinking geometries bring reduced supply voltages. During a metastable state, the output of a register typically sits at around half of the supply voltage. As the supply voltage gets smaller and smaller, the voltage difference between the full supply and this halfway point narrows, leading to a reduction in the gain of the circuit and longer times for registers to recover from a metastable state.
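Because MTBF depends exponentially on t_MET/τ in the commonly used model, a modest slow-down in recovery has an outsized effect. The sketch below holds everything else fixed and assumes the reduced gain shows up as a larger resolution time constant τ; the 1 ns settling time and the 40 ps and 50 ps values of τ are hypothetical.

```python
import math


def mtbf_ratio(t_met, tau_old, tau_new):
    """MTBF_new / MTBF_old when only tau changes, since MTBF ~ exp(t_met / tau)."""
    return math.exp(t_met / tau_new - t_met / tau_old)


if __name__ == "__main__":
    # Hypothetical: 1 ns of settling time, tau degrading from 40 ps to 50 ps.
    print(f"{mtbf_ratio(1e-9, 40e-12, 50e-12):.2e}")
```

A 25% increase in τ here multiplies the MTBF by e⁻⁵, a drop of roughly 150 times, which is why vendors characterize τ carefully for each new process node.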
FPGA vendors typically perform rigorous metastability analysis to ensure that physical devices built on these ever-decreasing process geometries remain robust against metastability.
Use the following links to access external documents that take a closer, more detailed look at the phenomenon of metastability and at how its effect can be rendered essentially negligible in digital electronic designs. Many of these documents present the equations used to calculate the MTBF of a flip-flop, and the resulting MTBF of an entire design, and themselves provide references to further information on the subject.