

# The Design of an Equalizer—Part Two

As explained in "The Design of an Equalizer—Part One" [1], we wish to develop an equalizer meeting the following performance:

- data format: nonreturn-to-zero
- data rate = 56 Gb/s
- channel loss at 28 GHz = 20 dB
- input differential swing $= 800 \text{ mV}_{pp}$
- bit error rate (BER) $< 10^{-12}$
- power consumption = 10 mW
- $V_{\rm DD} = 1 \, \rm V$.

The simulations are carried out with $V_{DD} = 1\ \text{V} - 5\%$ in the slow–slow corner of the process and at $T = 75^{\circ}\text{C}$.

We surmised that a continuous-time linear equalizer (CTLE) and a decision-feedback equalizer (DFE) would prove necessary for this purpose. The two-stage CTLE designed in the first part provides a boost factor of about 13 dB at the Nyquist frequency $f_{Nyq} = 28$ GHz. The circuit delivers an eye height of 220 mV and an eye width of 13.6 ps while drawing 5 mW. In this part, we develop a DFE that achieves a greater eye opening and, hence, allows robust sampling of the data with a low BER. The reader is referred to various articles written on the subject [2]–[8].

# **General Considerations**

Suppose that we apply an impulse to a lossy channel (Figure 1). Due to the medium's limited bandwidth, the output exhibits a long tail. The peak value is called the "main cursor," that at  $t = T_b$  is referred to as the "first postcursor," and so on.

Digital Object Identifier 10.1109/MSSC.2021.3126997 Date of current version: 24 January 2022

This time-domain characterization reveals that each bit generates a tail that interferes with the next bit(s). We, thus, seek a method of canceling this effect, recognizing that a fixed channel exhibits a certain ratio between the amplitudes of, for example, the first postcursor and the main cursor. Denoting this ratio by $h_1$, we articulate the necessary operation as follows: the present bit should be "memorized," scaled by a factor of $h_1$, and subtracted from the next bit so as to remove the interfering tail. This function is afforded by a DFE.

In its simplest form, a DFE senses the data, makes a binary decision, delays the result by a 1-b period $T_{\rm b}$, and feeds a scaled copy of the result back to the input. Shown in Figure 2(a) is a possible realization, where the flip-flop (FF) performs both the decision action (also referred to as "slicing") and the delay function. The return path has a scaling factor of $h_1$ and is called the "first tap" or "first coefficient." We expect that the tail cancellation accords the "summing junction" signal, $D_{sum}$, a greater eye opening than that at the DFE input.

It is instructive to examine the DFE waveforms in a favorable condition, i.e., with little channel loss. As illustrated in Figure 2(b), the first bit $B_1$ is delayed and scaled by a factor of $h_1$ as it appears in $D_F$. Upon subtraction from the next bit $B_2$, this scaled replica increases the voltage swing to $-1 - h_1$. Similarly, $B_2$ boosts the amplitude of $B_3$ to $+1 + h_1$. We conclude that a 1010 data sequence experiences greater swings as a result of the DFE action, a highly desirable effect, as such a sequence displays the most severe eye closure in a lossy medium [1].

FIGURE 1: The impulse response of a channel.

Interestingly, the DFE introduces its own eye closure in the presence of consecutive ones (or zeros), also called a "run." Consider the input waveform shown in Figure 2(c), where a zero and two ones occur between $t_1$ and $t_4$. From the first bit, the DFE generates a feedback value of $-h_1$ and subtracts it from the second bit, yielding $D_{sum} = 1 + h_1$. Next, the circuit generates $D_F = +h_1$ from $B_2$ and subtracts it from $B_3$, degrading the swing to $1 - h_1$. This is the price paid for dealing with the eye closure due to a 1010 sequence.
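The two cases above can be reproduced with a few lines of Python. This is only a behavioral sketch under idealized assumptions: the channel is lossless, bits are represented as $\pm 1$, the decisions are error-free, and the value $h_1 = 0.25$ and the function name `dfe_summing_junction` are chosen purely for illustration.

```python
def dfe_summing_junction(bits, h1):
    """Return D_sum for each bit of an ideal one-tap DFE (lossless channel).

    bits: sequence of +1/-1 symbols; h1: first-tap coefficient.
    The previous decision is scaled by h1 and subtracted from the present bit.
    """
    d_sum = []
    prev_decision = 0  # no feedback exists before the first bit arrives
    for b in bits:
        s = b - h1 * prev_decision      # summing junction
        d_sum.append(s)
        prev_decision = 1 if s >= 0 else -1  # slicer (the FF decision)
    return d_sum


h1 = 0.25  # illustrative value only
alternating = dfe_summing_junction([+1, -1, +1, -1], h1)  # 1010 sequence
run = dfe_summing_junction([-1, +1, +1], h1)              # a zero, then two ones

print(alternating)  # after the first bit, swings grow to +/-(1 + h1)
print(run)          # the second one of the run shrinks to 1 - h1
```

The first list shows the boosted swings for alternating data, and the last entry of the second list shows the penalty incurred within a run.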

The channel impulse response depicted in Figure 1 exhibits higher-order postcursors as well. We can then employ additional FFs and feedback taps to cancel those components. The total number of taps depends on the type of the channel and is typically in the range of five to 10 [2].

The principal challenge in DFE design relates to the finite delay around the loop. If excessive, this delay simply causes the circuit to fail. As explained in [7], the following timing constraint must be met:

$$T_{\rm CK-Q} + T_{\rm FB} + T_{\rm setup} \le T_{\rm b}, \tag{1}$$

where $T_{CK-Q}$, $T_{FB}$, and $T_{setup}$ denote the FF clock-to-Q delay, the feedback delay, and the FF setup time, respectively. These quantities can be reduced by means of inductive peaking at the summing junction.
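As a quick sanity check of (1), the loop delay can be tallied against the bit period. The sketch below is only illustrative: the three delay values are hypothetical placeholders, not simulated numbers from this design, and the helper name `dfe_loop_meets_timing` is ours.

```python
def dfe_loop_meets_timing(t_ck_q, t_fb, t_setup, t_b):
    """Check constraint (1): T_CK-Q + T_FB + T_setup <= T_b (all in seconds)."""
    return t_ck_q + t_fb + t_setup <= t_b


T_B = 1 / 56e9  # bit period at 56 Gb/s, about 17.9 ps

# Hypothetical delay split, for illustration only.
example = dict(t_ck_q=8e-12, t_fb=5e-12, t_setup=4e-12)
slack = T_B - sum(example.values())

print(dfe_loop_meets_timing(**example, t_b=T_B))
print(f"slack = {slack * 1e12:.2f} ps")
```

With these assumed values, less than 1 ps of margin remains, which illustrates why the loop delay dominates full-rate DFE design at this speed.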



FIGURE 2: (a) The basic DFE topology as well as its effect with a (b) 1010 sequence and (c) 11 run.



FIGURE 3: The typical waveform at the summing junction.

The overall equalizer BER is determined by the eye opening at the summing junction, node *A* in Figure 2(a). Illustrated in Figure 3 is the waveform at this node, exhibiting a worst-case peak of $\Delta V$. Sensing this voltage, the FF must make correct decisions in the presence of two nonidealities: the total root-mean-square (rms) noise $V_{n,rms}$ and the total dc offset $V_{OS}$, both referred to this interface. The noise and offset contain contributions by the CTLE, summer, and FF. For BER $\approx 10^{-12}$, we write

$$\frac{1}{2}Q\left(\frac{\Delta V - 4V_{\rm OS}}{V_{\rm n,rms}}\right) < 10^{-12}, \tag{2}$$

where $Q(\cdot)$ is the "error function" (the integral of the tail of a Gaussian distribution). Note that $4V_{OS}$ represents the $4\sigma$ value of the offset. This condition is satisfied if the argument of $Q(\cdot)$ exceeds seven, i.e., if

$$\Delta V \ge 7 V_{\rm n,rms} + 4 V_{\rm OS}. \tag{3}$$
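The step from (2) to (3) can be checked numerically. The sketch below assumes that $Q(\cdot)$ is the standard Gaussian tail probability, $Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$; the function names are ours.

```python
import math


def q_function(x):
    """Gaussian tail probability: integral of a unit Gaussian from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def ber_estimate(delta_v, v_os, v_n_rms):
    """Left-hand side of (2); all voltages in the same unit."""
    return 0.5 * q_function((delta_v - 4 * v_os) / v_n_rms)


# An argument of seven places the estimate just below 1e-12 ...
print(0.5 * q_function(7.0))
# ... whereas an argument of six falls short by roughly three orders of magnitude.
print(0.5 * q_function(6.0))
```

The steepness of the Gaussian tail is the reason that a modest increase in $\Delta V$ buys a large reduction in BER.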

The noise and offset voltages must be calculated for a particular CTLE/DFE cascade, but we typically target a $\Delta V$ of at least 150–200 mV to accommodate the FF's finite sensitivity as well [6]. We return to this point later.

Numerous DFE architectures have been reported [2]–[8], and the pros and cons of some have been described in [7]. Most do not relax the timing constraint expressed by (1). We begin with the full-rate, "direct" topology of Figure 2(a) for its simplicity and consider others if this approach does not provide satisfactory performance.

### Channel–CTLE Impulse Response

To compute the relative strengths of the DFE taps, we must examine the impulse response of the channel–CTLE cascade. We apply to the channel a differential pulse having a width of 2 ps ($\ll T_b = 17.9$ ps) and rise and fall times of 0.1 ps. Plotted in Figure 4 are the output waveforms of the channel and CTLE. The former exhibits its first, second, and third postcursors at relative levels of 60%, 41%, and 30%, respectively. The latter indicates that the CTLE reduces these to 22%, -3%, and -6%, respectively. The slightly underdamped response at the CTLE output originates from a few decibels of peaking that appear in the cascade frequency response (shown in the first article in this series [1]).
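To get a feel for what these postcursor levels imply, we can form a rough worst-case estimate of the vertical eye opening. The sketch below uses a peak-distortion bound, which is our assumption rather than a method taken from the design flow: every uncancelled postcursor is assumed to add against the present bit, and precursors as well as postcursors beyond the third are ignored. The function name and the `cancelled` argument are hypothetical.

```python
def worst_case_eye(postcursors, cancelled=()):
    """Worst-case vertical eye opening, normalized to the main cursor.

    postcursors: {tap index: level relative to the main cursor}
    cancelled:   tap indices assumed to be perfectly removed by the DFE
    """
    residual = sum(abs(h) for k, h in postcursors.items() if k not in cancelled)
    return 1.0 - residual


channel = {1: 0.60, 2: 0.41, 3: 0.30}   # channel output (Figure 4)
ctle = {1: 0.22, 2: -0.03, 3: -0.06}    # CTLE output (Figure 4)

print(worst_case_eye(channel))                  # negative: the eye is closed
print(worst_case_eye(ctle))                     # CTLE alone
print(worst_case_eye(ctle, cancelled={1}))      # CTLE plus first DFE tap
print(worst_case_eye(ctle, cancelled={1, 3}))   # CTLE plus first and third taps
```

Under this simplified bound, the first tap removes most of the residual intersymbol interference, and the second and third postcursors contribute only a few percent.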

The voltage swings in Figure 4 characterize the CTLE under small-signal conditions. This is necessary because we can predict the step or pulse response from the impulse response only if the system remains linear. Also, the worst-case scenario of a 1010 sequence does lead to relatively small swings at the channel output. Nevertheless, as the channel and CTLE "recover" from a long run, e.g., between $t_1$ and $t_2$ in Figure 3, dynamic nonlinearities in the CTLE can cause departure from our foregoing results. In other words, some iteration in the DFE tap values may be necessary if the greatest eye opening is desired.

## **FF Design**

The performance of the DFE shown in Figure 2 hinges primarily upon the FF design. While the simplicity and efficiency of the StrongARM latch make it a desirable candidate here, we recognize from (1) that the FF delay must remain below  $T_{\rm b} = 17.9 \, {\rm ps}$ , a condition that such a latch cannot fulfill in 28-nm technology. We, therefore, resort to current-mode logic (CML) and construct the latch illustrated in Figure 5 [9]. Here, the circuit senses the input when  $M_5$  is on and regenerates when  $M_6$  is on. To save voltage headroom, the bias currents of  $M_5$  and  $M_6$  are defined by a mirror arrangement rather than by a tail current source.

The latch design begins with a power budget, e.g., 1 mW. We bias $M_5$ and $M_6$ at a drain current $I_D$ of 0.5 mA, assuming that $I_{D5}$ rises to about 1 mA when *CK* is high and $M_6$ is off, and vice versa. To carry a peak tail current of 1 mA, $M_1-M_4$ must be wide enough so as to leave sufficient headroom for $M_5$ and $M_6$. We then select $(W/L)_{1-4} = 5 \,\mu m/30$ nm. The voltage swings at *X* and *Y* are typically chosen in the range of 400 to 500 mV. With complete switching of $M_1-M_4$, these swings are equal to $1 \text{ mA} \times R_D$, suggesting $R_D \approx 500 \Omega$. We must also ensure that the small-signal loop gain around $M_3$ and $M_4$ is greater than unity so that the circuit properly regenerates when *CK* falls and $\overline{CK}$ rises.

The clock path in Figure 5 merits some remarks. First, at 56 GHz, *CK* and $\overline{CK}$ are close to sinusoids, especially if they are provided by resonant buffers (stages with *LC* tank loads). Compared to square waves, sinusoidal clocks exhibit longer transition times, thus elongating the FF response. Second, if nearly rail-to-rail clock swings are available, then $C_1$ and $C_2$ need not be much greater than the input capacitance of $M_5$ and $M_6$. This is because we prefer to maintain a moderate gate-source voltage, e.g., around 700 mV, so that these transistors operate in or near saturation. Third, the current mirror as well as $C_1$ and $C_2$ can be shared among all of the DFE's latches; the capacitor values are then chosen according to the total gate capacitance that they must drive.

FIGURE 4: The impulse response observed at the channel and CTLE outputs.

FIGURE 5: The CML latch design.

When characterizing the speed of FFs for a DFE environment, we should examine their input sensitivity $V_{\text{sen}}$, defined as the minimum difference that guarantees correct decisions at the desired clock frequency. Specifically, $V_{\text{sen}}$ must be studied in the context of the waveforms delivered by the channel–CTLE cascade, e.g., a 1111 run followed by a 0101 pattern (Figure 3). The swing received by the FF after the long run is small and of opposite value, requiring that the first latch "recover" from the previous, large overdrive and still respond correctly. Illustrated in Figure 6 is the overdrive recovery, where $V_{sum}^+$ and $V_{sum}^-$ denote the summer outputs. If the first latch begins to sense $V_{sum}^+$ and $V_{sum}^-$ at $t = t_1$, then the latch output voltages in Figure 5 must cross before the circuit enters the regeneration mode. That is, we must have $t_2 - t_1 < T_{CK}/2$.
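To put a rough number on this window, we can assume that the full-rate clock runs at 56 GHz, as noted for the clock path, and that the sense phase occupies half of the clock period:

$$T_{CK} = \frac{1}{56\ \text{GHz}} \approx 17.9\ \text{ps}, \qquad t_2 - t_1 < \frac{T_{CK}}{2} \approx 8.9\ \text{ps}.$$

With sinusoidal clocks, the usable portion of this interval is likely somewhat shorter because the clocked transistors do not switch abruptly.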

FIGURE 6: The latch overdrive recovery.

We simulate the FF in such a scenario with $V_0 = 70$ mV, arriving at the results plotted in Figure 7(a). The latch output voltages barely cross before regeneration begins. Thus, in the presence of offsets due to $M_1-M_2$ or $M_3-M_4$, the circuit produces an incorrect output. Figure 7(b) depicts the outputs for $V_0 = 100$ mV, revealing more robust sampling. Thus, in addition to overcoming various offsets and noise contributions at the summing junction, the swing must include at least another 100 mV for proper FF behavior. Interestingly, this value is greater than that predicted by (2), implying that the summing junction swing is dictated by the FF overdrive recovery rather than noise and offset.

The FF offset is also of interest. We write the threshold mismatch between $M_1$ and $M_2$ in Figure 5 as $\Delta V_{\text{TH1,2}} = A_{\rm VTH}/\sqrt{WL}$ [11]. With $A_{\rm VTH} \approx 4 \, {\rm mV} \cdot \mu {\rm m}$, we have $\Delta V_{\rm TH1,2} \approx 10 \, {\rm mV}$. The threshold mismatch between $M_3$ and $M_4$ has the same standard deviation, but it is divided by $g_{m1,2}R_D - g_{m1,2}/g_{m3,4}$ [6], which is about 2 in our design. The $4\sigma$ offset of $L_1$ is, therefore, around 45 mV.
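The arithmetic behind the 45-mV figure can be retraced as follows. The sketch assumes that the two mismatch contributions are independent and therefore add in a root-sum-square fashion; the text does not state this explicitly, but the assumption reproduces the quoted result. Variable names are ours.

```python
import math

A_VTH = 4.0            # mV*um, Pelgrom coefficient quoted in the text
W, L = 5.0, 0.030      # um, dimensions of M1-M4
REGEN_DIVIDER = 2.0    # factor dividing the M3-M4 mismatch (about 2 per the text)

sigma_m12 = A_VTH / math.sqrt(W * L)             # input pair, mV
sigma_m34 = sigma_m12 / REGEN_DIVIDER            # regenerative pair, input-referred
sigma_total = math.hypot(sigma_m12, sigma_m34)   # assumes independent mismatches
offset_4sigma = 4 * sigma_total

print(f"sigma(M1,M2) = {sigma_m12:.1f} mV")
print(f"4-sigma offset = {offset_4sigma:.1f} mV")
```

The result lands within about 1 mV of the rounded value used in the text; the small difference arises from rounding $\Delta V_{\rm TH1,2}$ to 10 mV.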

The input-referred noise of the FF can be computed using the method described in [10] and is about  $1.5 \text{ mV}_{rms}$ . To this, we must add the CTLE and summer noise. The simulated noise spectrum shown in Figure 8 yields an rms value of 2.6 mV integrated up to 100 GHz. It follows that  $V_{n,rms} \approx 3 \text{ mV}$  in (3), suggesting that the differential eye height,  $2\Delta V$ , must exceed 132 mV.
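The numbers in this paragraph follow directly from (3). The short sketch below retraces them, assuming that the FF noise and the CTLE/summer noise are uncorrelated so that their powers add; variable names are ours.

```python
import math

v_n_ff = 1.5            # mV rms, input-referred FF noise
v_n_ctle_summer = 2.6   # mV rms, CTLE and summer noise integrated to 100 GHz
offset_4sigma = 45.0    # mV, 4-sigma offset of the first latch

v_n_rms = math.hypot(v_n_ff, v_n_ctle_summer)   # uncorrelated sources
delta_v_min = 7 * v_n_rms + offset_4sigma       # condition (3)
eye_height_min = 2 * delta_v_min                # differential eye height

print(f"V_n,rms = {v_n_rms:.2f} mV")
print(f"minimum differential eye height = {eye_height_min:.0f} mV")
```

Note that the offset term accounts for roughly two-thirds of the required opening, suggesting that offset cancellation would relax this bound more than noise reduction would.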

# **DFE Design**

We implement the DFE architecture of Figure 2 as shown in Figure 9. The summer senses the CTLE output voltage, converts it to current by means of $M_1$ and $M_2$, adds the result to the output of $M_3$ and $M_4$, and allows the sum to flow through the load resistors. For $M_1$ and $M_2$, we select $W = 10 \ \mu$m so as to give their tail current source a reasonable voltage headroom. The first tap is scaled down by a factor of four according to our impulse response study. We also add 5-fF capacitors at *A* and *B* as an estimate of layout parasitics.

A number of points should be borne in mind regarding DFE simulations. First, a proper phase relationship must be established between the summer output and FF clocks. As our initial try, we place the crossing points of $V_{A}$ and $V_{B}$ near those of *CK* and $\overline{CK}$ such that, when *CK* goes high, $L_1$ senses $V_A - V_B$. Second, the feedback provided by $M_3$ and $M_4$ must be negative. Third, we monitor the eye at the summing junction, but we must also ensure that the FF output indeed matches the original data applied to the channel. That is, an open eye does not necessarily imply correct operation. Fourth, the simulation must run long enough (about 20 ns in our example) so that the channel "settles," and the summer output represents the steady-state behavior. We then construct the eye diagram from the last few nanoseconds.
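The third point lends itself to a simple postprocessing step. The sketch below assumes that the transmitted pattern and the FF output have already been sampled into lists of 0s and 1s, one entry per bit, with the initial settling interval discarded; the function name and the latency search range are hypothetical choices.

```python
def count_bit_errors(tx_bits, rx_bits, max_latency=4):
    """Compare recovered bits with the transmitted pattern.

    The FF output lags the data by an unknown whole number of bits, so every
    latency from 0 to max_latency is tried over the same comparison window.
    Returns (best latency, number of bit errors at that latency).
    """
    window = min(len(tx_bits), len(rx_bits) - max_latency)
    if window <= 0:
        raise ValueError("not enough recovered bits for the latency search")
    best_latency, best_errors = 0, window + 1
    for latency in range(max_latency + 1):
        errors = sum(tx_bits[i] != rx_bits[i + latency] for i in range(window))
        if errors < best_errors:
            best_latency, best_errors = latency, errors
    return best_latency, best_errors


# Illustrative data: the "recovered" stream is the pattern delayed by two bits.
tx = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
rx = [0, 0] + tx[:-2]
print(count_bit_errors(tx, rx))
```

A nonzero error count at the best latency flags incorrect operation even if the eye at the summing junction appears open.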

Figure 10 plots the differential eye diagrams at the CTLE and DFE summer outputs. The DFE increases the vertical opening by about 250 mV.

Next, we apply resistive and capacitive degeneration to the summer input stage (Figure 11). Providing some linear equalization, this method leads to the eye shown in Figure 12 and slightly improves the eye opening.

FIGURE 7: The (a) input and output waveforms of a CML latch in an overdrive recovery test with $V_0 = 70$ mV and (b) latch output for the case of $V_0 = 100$ mV.

WINTER 2022 / IEEE SOLID-STATE CIRCUITS MAGAZINE

Authorized licensed use limited to: UCLA Library. Downloaded on February 17,2022 at 10:45:39 UTC from IEEE Xplore. Restrictions apply.

FIGURE 8: The noise spectrum observed at the DFE summing junction.

FIGURE 9: The DFE design.

In the last step of our design, we ponder higher-order taps. Our impulse response analysis has yielded  $h_2 = -3\%$  and  $h_3 = -6\%$ . Neglecting the former, we add two more FFs to the DFE and apply the third tap with positive feedback (Figure 13). The resulting eye diagram is shown in Figure 14, exhibiting vertical and horizontal eye openings equal to 500 mV and 15.2 ps, respectively. Checked against the original data, the first FF output indicates correct operation.

The DFE draws about 4.5 mW, meeting our overall equalizer power budget of 10 mW. It is possible to add components such as inductive peaking, a gain stage between the summer and FF<sub>1</sub> [6], or a 0.5-unit-interval feedback tap. For this particular equalizer design and channel response, these techniques do not significantly improve the eye opening. However, for other channels and/or with all of the layout parasitics included, these concepts and other DFE architectures can be considered.

# **Corrections to Previous Articles**

In [12], (9) should read:

$$N = C_E C_B (L^2 - M^2) s^4 + (2 C_B L + 2 C_B M - C_E M) s^2 + 1, \tag{4}$$

and (10) should be revised to

$$D = C_B C_E (L^2 - M^2) s^4 + 2C_B C_E R_L (L + M) s^3 + (2C_B L + 2C_B M + C_E L) s^2 + R_L C_E s + 1. \tag{5}$$



FIGURE 10: The output eyes at the (a) CTLE output and (b) DFE summing junction.



FIGURE 11: The summer input stage with degeneration.




FIGURE 12: The DFE summer eye diagram in the presence of degeneration.



FIGURE 13: The addition of the third tap to the DFE.



FIGURE 14: The DFE summer eye with the third tap added.

The subsequent results remain unchanged. Richard Schreier has pointed out that the T-coil design equations can also be derived by decomposing N(s) into two quadratics and using the fact that an all-pass transfer function must contain poles that mirror or cancel its zeros.

In [13], the gates of  $M_c$  and  $M_d$  in Figure 9 should be tied to the drains of  $M_a$  and  $M_b$ , respectively.



#### References

- [1] B. Razavi, "The design of an equalizer Part one," *IEEE Solid-State Circuits Mag.*, vol. 13, no. 4, pp. 7–160, Fall 2021, doi: 10.1109/MSSC.2021.3111426.
- [2] J. Im et al., "A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3486–3502, Dec. 2017, doi: 10.1109/JSSC.2017.2749432.
- [3] T. Shibasaki et al., "A 56-Gb/s receiver front-end with a CTLE and 1-tap DFE in 20-nm CMOS," in Proc. VLSI Circuits Symp. Dig., Jun. 2014, pp. 1–2, doi: 10.1109/VLSIC.2014.6858400.
- [4] A. Roshan-Zamir et al., "A 56-Gb/s PAM4 receiver with low-overhead techniques for threshold and edge-based DFE FIR- and IIR-tap adaptation in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 54, no. 3, pp. 672–684, Mar. 2019, doi: 10.1109/JSSC.2018.2881278.
- [5] A. Cevrero et al., "A 100Gb/s 1.1pJ/b PAM-4 RX with dual-mode 1-tap PAM-4 / 3-tap NRZ speculative DFE in 14nm CMOS FinFET," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2019, pp. 112–113, doi: 10.1109/ISSCC.2019.8662495.
- [6] S. Ibrahim and B. Razavi, "Low-power CMOS equalizer design for 20-Gb/s systems," *IEEE J. Solid-State Circuits*, vol. 46, no. 6, pp. 1321–1336, Jun. 2011, doi: 10.1109/JSSC.2011.2134450.
- [7] B. Razavi, "The decision-feedback equalizer [A Circuit for All Seasons]," *IEEE Solid-State Circuits Mag.*, vol. 9, no. 4, pp. 13–16, Fall 2017, doi: 10.1109/MSSC.2017.2745939.
- [8] P. Peng, J. Li, L. Chen, and J. Lee, "A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2017, pp. 110–111, doi: 10.1109/ISSCC.2017.7870285.
- [9] J. Lee and B. Razavi, "A 40-Gb/s clock and data recovery circuit in 0.18-µm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 38, pp. 2181–2190, Dec. 2003, doi: 10.1109/JSSC.2003.818566.
- [10] B. Razavi, "The design of a comparator [The Analog Mind]," *IEEE Solid-State Circuits Mag.*, vol. 12, no. 4, pp. 8–14, Fall 2020, doi: 10.1109/MSSC.2020.3021865.
- [11] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, "Matching properties of MOS transistors," *IEEE J. Solid-State Circuits*, vol. 24, no. 5, pp. 1433–1440, Oct. 1989, doi: 10.1109/JSSC.1989.572629.
- [12] B. Razavi, "The design of broadband I/O circuits [The Analog Mind]," *IEEE Solid-State Circuits Mag.*, vol. 13, no. 2, pp. 6–12, Spring 2021, doi: 10.1109/MSSC.2021.3072299.
- [13] B. Razavi, "The design of a low-voltage bandgap reference [The Analog Mind]," *IEEE Solid-State Circuits Mag.*, vol. 13, no. 3, pp. 6–13, Summer 2021, doi: 10.1109/MSSC.2021.3088963.
