# A 25 Gb/s 5.8 mW CMOS Equalizer

Jun Won Jung and Behzad Razavi, Fellow, IEEE

Abstract—Low-power equalization remains in high demand for wireline receivers operating at tens of gigabits per second in copper media. This paper presents a design incorporating a continuous-time linear equalizer and a two-tap half-rate/quarter-rate decision-feedback equalizer that exploits charge-steering techniques to reduce the power consumption. Realized in 45 nm technology, the prototype draws 5.8 mW from a 1 V supply and compensates for 24 dB of loss with BER $< 10^{-12}$.

*Index Terms*—Charge steering, decision feedback, equalizer, linear equalizer, nonlinearity.

## I. INTRODUCTION

THE demand for high-speed, low-power serial links continues unabated, motivating extensive efforts toward the generally accepted power efficiency of 1 mW/Gb/s. Recent work in the range of 20 to 30 Gb/s has demonstrated power levels around 10–20 mW for equalizers [1]–[5] and 5–100 mW for clock-and-data recovery circuits [6], [8], [9].

This paper presents the design of a 25 Gb/s equalizer that employs charge steering to reduce the power to 5.8 mW while tolerating a channel loss of 24 dB at 12.5 GHz [10]. This performance is achieved through the use of a one-stage continuous-time linear equalizer (CTLE) and a half-rate/quarter-rate decision-feedback equalizer (DFE). Realized in 45 nm digital CMOS technology, the prototype exhibits a bit error rate (BER) of less than  $10^{-12}$  for an eye opening of 0.44 UI.

Section II provides the background for this work, giving a brief overview of charge-steering techniques. Section III presents the evolution of the DFE architecture, and Section IV deals with the design of the building blocks. Section V describes the experimental results.

#### II. BACKGROUND

#### A. Equalizer Design Considerations

The performance of equalizers is typically quantified in terms of their speed and power consumption. In practice, however, two other parameters must also be considered: the loss of the channel and the robustness of the equalization in terms of the eye opening and the BER. Thus, the power efficiency by itself fails to represent the practical value of a design. It is possible to define a figure of merit (FOM) accounting for the channel loss [11], but this FOM has not been widely adopted.

Digital Object Identifier 10.1109/JSSC.2014.2364271

The development of an equalizer entails several design choices.

- 1) The number of stages in and the boost factor provided by the CTLE: the larger these parameters, the higher the CTLE power consumption and, at high speeds, the larger the number of inductors; as explained in this paper, the DFE power consumption in our work is reduced so much that it is now *less* than that of the CTLE, a point of sharp contrast to the prior art [1], [3], [5].
- 2) The choice of direct DFE versus loop unrolling: the latter replaces the summing junction settling time with a multiplexer (MUX) delay, but it does not offer an advantage in charge-steering implementations (Section III).
- 3) Full-rate or fractional-rate clock frequency: as the DFE is designed to operate with lower clock frequencies but with multiple paths, the generation and distribution of the clock phases become more complex and, more importantly, the load capacitance presented to the CTLE increases.

With charge-steering circuits, one other issue must be addressed, namely, the return-to-zero (RZ) nature of their outputs (explained below). For example, a charge-steering summer produces a valid output for only half a cycle.

#### B. Charge-Steering Circuits

The operation and properties of charge-steering circuits are described in detail in [6]. We provide a brief overview here for reference. As shown in Fig. 1, a continuous-time current-steering circuit can be transformed to a charge-steering topology by replacing the tail current source with a charge source and the load resistors with capacitors. In the reset mode, $C_T$ is discharged to ground and the output nodes are precharged to $V_{DD}$. In the evaluation mode, $C_T$ is switched to node P, and X and Y are released from $V_{DD}$. The currents drawn by $C_T$ from $M_1$ and $M_2$ carry a differential component proportional to $V_{in}$, creating a differential output voltage until $C_T$ charges and the currents cease.

The topology of Fig. 1 saves power by both discrete-time operation and moderate output swings, which are defined by  $C_T/C_D$ . However, the RZ output requires that the system architecture accommodate this type of operation. The CDR/deserializer in [6] and the DFE described here are examples of such architectures. We call the circuit in Fig. 1 an "RZ latch."
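As a rough illustration of how the ratio $C_T/C_D$ sets the output swing, the following Python sketch applies first-order charge conservation to the stage of Fig. 1. It rests on assumptions that are not stated in the text: $C_T$ starts at 0 V, node P stops charging once it reaches roughly the input common-mode level minus $V_{TH}$, and the charge is drawn equally from the two load capacitors. The function name and all numerical values are hypothetical.

```python
def output_cm_drop(c_t, c_d, v_in_cm, v_th):
    """First-order common-mode drop (V) at X and Y after evaluation.

    Assumption: C_T starts at 0 V and charges until node P reaches
    v_in_cm - v_th; each load capacitor C_D supplies half of that charge.
    """
    q_total = c_t * (v_in_cm - v_th)  # charge delivered to the tail capacitor
    return q_total / (2.0 * c_d)      # voltage lost by each output node


# Hypothetical values chosen only for illustration, not taken from the design.
drop = output_cm_drop(c_t=10e-15, c_d=10e-15, v_in_cm=1.0, v_th=0.4)
print(f"common-mode drop = {drop * 1e3:.0f} mV")
```

In this simplified view, doubling $C_T$ at a fixed $C_D$ doubles the common-mode drop, which is why the ratio rather than the absolute capacitance governs the swing.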

If a charge-steering circuit replaces the reset mode with a sense operation, then it can produce NRZ outputs. Shown in Fig. 2 is an example, where  $V_{in}$  is first sampled at X and Y and next amplified regeneratively by the cross-coupled pair [6]. We call this arrangement an "NRZ latch."

Manuscript received May 08, 2014; revised August 06, 2014, and October 02, 2014; accepted October 11, 2014. Date of publication November 05, 2014; date of current version January 26, 2015. This paper was approved by Associate Editor Woogeun Rhee. This work was supported by Texas Instruments and Realtek Semiconductor.

The authors are with the Electrical Engineering Department, University of California, Los Angeles, CA 90095-1594 USA (e-mail: razavi@ee.ucla.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

<sup>0018-9200 © 2014</sup> IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.



Fig. 1. Transformation from current-mode logic to charge-steering logic.



Fig. 2. NRZ charge-steering latch.



Fig. 3. Setup time of CML latch.

## C. Setup Time

In contrast to CML latches, the charge-steering topology of Fig. 1 requires little setup time, a byproduct of its precharge phase. As shown in the CML circuit of Fig. 3, before $\overline{CK}$ goes high to activate the regenerative pair, $V_X$ and $V_Y$ must recover from their previous values, cross, and diverge in the new direction. The setup time is given primarily by the recovery dynamics at X and Y. The charge-steering latch of Fig. 1, on the other hand, brings $V_X$ and $V_Y$ to equilibrium before the evaluation begins, avoiding most of the setup time described above. This advantage is due to the absence of the clocked regenerative pair but accrues at the cost of a lower overall gain per clock cycle. Master-slave flipflops also benefit from this advantage.

## D. Cascading Issues

In the design of flipflops and more complex circuits, we must cascade two or more latches. It is therefore helpful to determine whether and how the above RZ and NRZ latches can be cascaded. The four permutations are illustrated in Fig. 4 along with their attributes. We note that if $CK_1$ and $CK_2$ in Fig. 4(a) and (b) are simply complementary, then the master latch's output begins to vanish as the slave enters the sense mode; to avoid this "race condition," quadrature phases are necessary. The topology in Fig. 4(c) can operate with complementary clocks, and that in Fig. 4(d) proves impractical due to charge sharing.

Fig. 4. Cascading of charge-steering latches. (a), (b) $CK_1$ and $CK_2$ in quadrature. (c) No need for quadrature phases. (d) Severe charge sharing.

## E. Modified Charge-Steering Stage

In cascaded latches, a convenient common-mode level is the supply voltage, as established by the precharge operation. However, if the gate voltages in Fig. 1 remain at  $V_{DD}$  while the circuit steers charge,  $M_1$  and  $M_2$  enter the triode region as their drain voltages fall below  $V_{DD} - V_{TH}$ . To alleviate this issue, we add a cross-coupled PMOS pair to the output nodes [Fig. 5(a)] to create regenerative gain as the input transistors begin to lose transconductance. Depicted in Fig. 5(b), the simulated waveforms reveal larger output swings owing to the PMOS pair. Also applicable to the NRZ latch of Fig. 2, this method has been utilized throughout this work.

## **III. DFE ARCHITECTURE**

## A. Architecture-Level Issues

Fig. 5. (a) Charge-steering latch with cross-coupled PMOS pair. (b) Behavior without and with the pair.

Consider the direct DFE shown in Fig. 6(a). The loop timing constraint here is expressed as $t_{cq} + t_{setup} + t_{FB} < 1$ UI, where $t_{cq}$ denotes the clock-to-Q delay of the flipflop (FF), $t_{setup}$ its setup time, and $t_{FB}$ the settling time at node X, i.e., the time necessary for $V_X$ to recover from the previous bit. We wish to implement all of the building blocks using charge-steering principles. Illustrated in Fig. 6(b), our attempt replaces the load resistor at the summing junction with a precharge switch (while relying on the parasitic capacitance at this node) and applies clocks to $G_{m1}$, $G_{mf}$, and the two latches $L_1$ and $L_2$. When CK is asserted, $G_{m1}$, $G_{mf}$, and $L_1$ begin to evaluate and $L_2$ begins to reset. The circuit thus suffers from the race condition described above between $L_2$ and $G_{mf}$ (and between $L_1$ and $L_2$ in the next half cycle). But let us disregard this issue for now and determine the loop timing constraint. From the instant CK is asserted, $G_{m1}$, $G_{mf}$, and $L_1$ must produce a reasonable swing (e.g., 200 mV) at Y. We call the required time the "clock-to-Q delay" of the $G_{m1} - L_1$ (or the $G_{mf} - L_1$) combination and denote it by $t_{cq}$. We add the setup time of $L_2$ and bound the result to half a UI because node X is reset for the other half, obtaining

$$t_{cq} + t_{\text{setup}} < \frac{1}{2} \text{UI.} \tag{1}$$

This severe timing constraint makes the topology of Fig. 6(b) unattractive. We should remark that an unrolled DFE based on charge steering would face a similar limitation and is not discussed here.

#### B. Evolution of the Architecture

We next consider the half-rate architecture shown in Fig. 7(a), where the demultiplexed outputs,  $D_{odd}$  and  $D_{even}$ , are multiplexed and fed back to the summing junction. If implemented with charge steering, this arrangement too necessitates a 1/2 UI upper bound on the loop delay. On the other hand, the alternate half-rate DFE depicted in Fig. 7(b) exhibits a more favorable behavior. With current steering, we have  $t_{cq} + t_{setup} + t_{FB} < 1$  UI here. The charge-steering implementation is illustrated in Fig. 7(c), with CK denoting the half-rate clock. Disregarding the race condition again, we note that when CK is asserted,  $G_{m1}$  and  $L_1$  begin to evaluate, requiring  $t_{cq}$  seconds to produce proper swings at  $Y_1$ . During this time,  $X_2$  is precharged, and  $L_4$  and  $G_{mf2}$  also evaluate, thereby injecting a scaled copy of the previous bit into node  $X_1$ . When  $\overline{CK}$  is asserted, the same actions occur in the other signal path. It follows that the timing constraint is given by

$$t_{cq} + t_{\text{setup}} < 1 \,\text{UI},\tag{2}$$

where, as in (1),  $t_{cq}$  denotes the time, after the clock edge, that  $G_{m1}$  and  $L_1$  need to create a reasonable swing at  $Y_1$ .<sup>1</sup>

An interesting observation in the above architecture is that  $L_2$  and  $G_{mf1}$  (and  $L_4$  and  $G_{mf2}$ ) can be merged because they evaluate concurrently.<sup>2</sup> In other words, the flipflops in each path can be replaced with latches, thus saving power. This is a unique property of charge-steering circuits.

In addition to race conditions, the architecture of Fig. 7(c) entails another issue: if $D_{in}$ varies during the evaluation mode of $G_{m1}$ or $G_{m2}$, then the charge-steering action is irreversibly affected, producing intersymbol interference (ISI). To avoid this difficulty, we sample $D_{in}$ at half rate, performing analog demultiplexing as well [Fig. 8(a)]. The flipflops are now replaced with latches.

<sup>1</sup>The term $t_{FB}$ is dropped because it is included in the evaluation mode of $G_{m1}$ and $L_1$ ($t_{cq}$).

<sup>2</sup>Note that $G_{m1}$ and $L_1$ are not merged so as to allow resolving the race condition as explained later.

Fig. 6. Direct DFE using (a) current steering and (b) charge steering.

Fig. 7. (a) Half-rate multiplexed DFE. (b) Half-rate direct DFE. (c) Charge-steering implementation of (b).

In order to eliminate the race condition between $L_A$ and $G_{mf1}$ (and between $L_B$ and $G_{mf2}$), we can drive these cascades by quadrature phases of the half-rate (12.5 GHz) clock, a power-hungry solution. Alternatively, we can generate quadrature phases at 6.25 GHz with moderate power consumption and seek an architecture that lends itself to this rate. We proceed in three steps. First, we perform another 1-to-2 demultiplexing operation in each branch at a clock rate of 6.25 GHz and multiplex the results to return to the rate of 12.5 Gb/s. Illustrated conceptually in Fig. 8(b), the circuit drives the $G_m$ stages at 12.5 GHz and the latches at 6.25 GHz. Second, we recognize that the MUX and $G_{mf1}$ suffer from a race condition unless the former does not generate a *voltage* output. This can be accomplished by merging the MUX and $G_{mf1}$ as shown in Fig. 8(c). Here, the 6.25 GHz clock decides which latch output is selected, and the 12.5 GHz clock controls the charge-steering action.

Fig. 8. Evolution of DFE architecture.

In the third step, we assign proper phases of $CK_{12.5G}$ and $CK_{6.25G}$ to the stages in the chain so as to avoid race conditions. As illustrated in Fig. 8(d), after $D_{\rm in}$ is sampled, $G_{m1}$ begins to evaluate and, after one divider delay ($\Delta T_{\rm div} \approx 20$ ps), $L_1$ is clocked. This timing relationship allows (1) $G_{m1}$ to produce a voltage swing of about 200 mV before $L_1$ is enabled (at $t = t_1$), and (2) $L_1$ to sense a reasonable input swing for about 20 ps (by $t = t_2$). According to simulations, $\Delta T_{\text{div}}$ varies from 17 ps in the FF, 0°C corner to 24 ps in the SS, 80°C corner, affording a worst-case swing of 180 mV at the output of $G_{m1}$ (at $t = t_1$).

Fig. 9. Addition of second tap.

Fig. 10. (a) Current integration. (b) Charge steering.

The other signal path operates in a similar manner, but it swaps  $CK_{12.5G}$  with  $\overline{CK}_{12.5G}$  and  $CK_{6.25G,I}$  with  $CK_{6.25G,Q}$ . The divider quadrature outputs thus prove beneficial here.

It is interesting to note that the loop timing constraint in Fig. 8(d) reduces to

$$t_{cq} < 1 \,\mathrm{UI} \tag{3}$$

where  $t_{cq}$  denotes the delay from the  $G_{m1}$  clock to the Q output of  $L_1$  (or  $L_2$ ). This expression excludes a setup time because, in contrast to continuous-time CML latches, here the input data need not propagate to the (precharged) drain nodes of the merged MUX/ $G_{mf}$  circuit before this stage is clocked (Section II.C). The architecture reported in [7] also exhibits a timing constraint similar to (3) but at the cost of static power consumption in the latches.
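To make the comparison among constraints (1), (2), and (3) concrete, the following Python sketch computes the timing slack each one leaves at 25 Gb/s. The delay values passed in are hypothetical and are not taken from the design; only the unit interval of 40 ps follows from the data rate.

```python
UI_PS = 1e12 / 25e9  # unit interval at 25 Gb/s in picoseconds (40 ps)


def loop_margins_ps(t_cq, t_setup, ui=UI_PS):
    """Return the slack (ps) left by constraints (1), (2), and (3).

    A negative slack means the constraint is violated.
    """
    return {
        "(1) direct charge-steering DFE, Fig. 6(b)": 0.5 * ui - (t_cq + t_setup),
        "(2) half-rate charge-steering DFE, Fig. 7(c)": ui - (t_cq + t_setup),
        "(3) merged MUX/Gmf architecture, Fig. 8(d)": ui - t_cq,
    }


# Hypothetical delays for illustration only.
margins = loop_margins_ps(t_cq=25.0, t_setup=5.0)
for name, slack in margins.items():
    print(f"{name}: slack = {slack:+.1f} ps")
```

With these assumed delays, the topology bound by (1) fails while those bound by (2) and (3) pass, and (3) leaves the largest slack because no setup time is charged against the loop.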

## C. Addition of Second Tap

In order to realize the second tap of the DFE, we delay the 6.25 Gb/s data streams by half a period (80 ps), multiplex the results, and return a proportional amount of charge to the summing junction. Fig. 9 depicts these operations for one branch, where $L_5$ and $L_6$ serve as delay elements and are driven by quadrature phases to avoid the race condition. Note that this multiplexed value is returned to the summing junction in the same branch because the cascaded latches and the subsequent multiplexing delay the data by 2 UI (= 80 ps).
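The role of the two feedback taps can be illustrated with a purely behavioral Python model. This is a sketch of a generic two-tap DFE operating on one sample per bit, not of the interleaved charge-steering circuit; the pulse-response values, tap weights, and function names are hypothetical.

```python
def two_tap_dfe(samples, h1, h2):
    """Slice each sample after subtracting ISI from the two previous decisions."""
    decisions = []
    d1 = d2 = 0  # previous and second-previous decisions (+1/-1); 0 at start-up
    for x in samples:
        summed = x - h1 * d1 - h2 * d2  # summing junction
        d = 1 if summed >= 0 else -1    # slicer
        decisions.append(d)
        d1, d2 = d, d1
    return decisions


def channel(bits, cursor, post1, post2):
    """Apply a hypothetical pulse response: main cursor plus two postcursors."""
    sym = [1 if b else -1 for b in bits]
    out = []
    for n, s in enumerate(sym):
        p1 = sym[n - 1] if n >= 1 else 0
        p2 = sym[n - 2] if n >= 2 else 0
        out.append(cursor * s + post1 * p1 + post2 * p2)
    return out


bits = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1]
rx = channel(bits, cursor=0.3, post1=0.25, post2=0.15)

recovered = [1 if d > 0 else 0 for d in two_tap_dfe(rx, h1=0.25, h2=0.15)]
no_dfe = [1 if x >= 0 else 0 for x in rx]

print("DFE output matches input:   ", recovered == bits)
print("Plain slicer matches input: ", no_dfe == bits)
```

In this example the two postcursors together exceed the main cursor, so a plain slicer makes errors, whereas subtracting the scaled previous decisions restores the transmitted sequence.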

#### D. Charge Steering vs. Current Integration

Current-integrating summers have been reported in a number of DFE designs [4], [5] as a means of reducing the power consumption. It is important to distinguish between such summers and our proposed charge-steering topology. Fig. 10 depicts the two schemes, highlighting their single-ended and differential output waveforms. We observe that, owing to the constant tail current in Fig. 10(a), $V_X$ and $V_Y$ can fall so much as to drive the input transistors into the triode region, eventually producing a zero differential output. In other words, the output is near a peak in a narrow time window, dictating precise timing of the clock. The circuit thus faces a trade-off between the peak at $t = t_1$ (in proportion to $I_{SS}$) and the decay rate after $t = t_1$ (also in proportion to $I_{SS}$). A small $I_{SS}$ leads to a flatter response after $t_1$ but also a smaller peak for $V_X - V_Y$ at $t_1$. To avoid this trade-off, the output common-mode level can be calibrated [4], [5]. In addition, the switches tied to the drains must be wide enough to charge the capacitances at X and Y and supply $I_{SS}/2$. In the charge-steering scheme of Fig. 10(b), on the other hand, the differential output is held until the end of the evaluation cycle and does not collapse of its own accord. Moreover, the drain switches need not provide a dc current.

#### **IV. DESIGN OF BUILDING BLOCKS**

Shown in Fig. 11, the overall system incorporates a front-end CTLE with 8 dB of boost, a divide-by-2 circuit, and RZ/NRZ conversion in the path of the output data. In this design, the CTLE draws 2.4 mW, the DFE and RZ/NRZ converters 2.1 mW, and the divider 1.25 mW.

Fig. 11. Complete equalizer.

Fig. 12. (a) CTLE. (b) Simulated frequency response.

This section describes the critical building blocks of the equalizer and presents their design details. Unless otherwise stated, all drawn channel lengths are 40 nm. The design of the  $\div 2$  circuit and the RZ/NRZ converters is described in [6].

## A. CTLE

The CTLE must drive the input capacitance of the first DMUX ( $\approx$ 34 fF) while providing a boost factor of about 8 dB at 12.5 GHz. Shown in Fig. 12(a), the circuit incorporates a programmable degeneration resistor with a total bias current of 2.4 mA. The 1.5-nH inductors extend the bandwidth at the output nodes to 22 GHz. Fig. 12(b) plots the simulated frequency response for different degeneration resistor settings, revealing a boost of 8 dB with a low frequency loss of 2 dB.

It is desirable to incorporate dc offset cancellation so as to reduce the effect of imbalances that arise before, within, or even after the CTLE. The circuit of Fig. 12(a) lends itself to efficient offset cancellation if the tail current sources can be adjusted differentially. An imbalance of $\Delta I$ between the two yields an input-referred offset equal to $\Delta I R_S$. In this work, $\Delta I$ can vary from -240 $\mu$A to +240 $\mu$A in steps of 60 $\mu$A. Note that in contrast to the cancellation techniques in [12], [13], our approach does not introduce additional capacitance at the summing junction.
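The offset-cancellation range and resolution follow directly from the $\Delta I$ settings and the relation $\Delta I R_S$. The Python sketch below enumerates the nine settings and picks the one that leaves the smallest residual. The value of $R_S$, the offset to be cancelled, and the sign convention of the correction are assumptions made for illustration, as the text does not specify them.

```python
STEP_UA = 60
MAX_UA = 240
# -240 uA to +240 uA in 60 uA steps gives nine settings.
DELTA_I_SETTINGS_UA = list(range(-MAX_UA, MAX_UA + 1, STEP_UA))


def best_offset_setting(offset_mv, r_s_ohm):
    """Pick the delta-I (uA) whose delta_I * R_S best cancels an offset.

    Returns (delta_i_ua, residual_mv). The correction is assumed to subtract
    directly from the input-referred offset.
    """
    def residual(di_ua):
        correction_mv = di_ua * 1e-6 * r_s_ohm * 1e3
        return offset_mv - correction_mv

    best = min(DELTA_I_SETTINGS_UA, key=lambda di: abs(residual(di)))
    return best, residual(best)


# R_S = 100 ohms and a 10 mV offset are hypothetical values.
setting_ua, residual_mv = best_offset_setting(offset_mv=10.0, r_s_ohm=100.0)
print(f"delta I = {setting_ua} uA, residual = {residual_mv:.1f} mV")
```

Under these assumptions each step corresponds to 6 mV of input-referred correction, so the residual after cancellation is bounded by half a step, i.e., 3 mV.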

#### B. Effect of Nonlinearity

It is desirable to amplify the data before it reaches the DFE as larger input swings generally lead to a greater eye opening at the summer output. However, if the data, which is heavily dispersed by the channel, experiences excessive nonlinearity, it places additional demands upon the DFE. To appreciate this point, let us examine the impulse response of the channel as it travels through a linear amplifier with a gain of k followed by a hard limiter. As shown in Fig. 13(a) for a unity large-signal gain, the output of the cascade, $h_{out}(t)$, retains the main cursor level (at t = 0) but exhibits a *larger* value for the postcursor(s), e.g., at $t = T_b$. Thus, $h_{out}(t)$ contains heavier ISI than $h_{in}(t)$ does. From another perspective, $h_{out}(t)$ is wider than $h_{in}(t)$, implying greater dispersion.

Fig. 13. (a) Impulse response and its amplification. (b) $h_{1,\text{norm}}$ for $\beta_1/\beta_m = 0.2, 0.4$, and 0.6.

Fig. 14. (a) Passive DMUX with post-amplification. (b) Waveforms at one DFE summing junction without and with regenerative pair.

We now quantify the above effect for a differential circuit characterized by  $y = \alpha_1 x + \alpha_3 x^3$  and an input 1-dB compression point  $A_{1 dB} = \sqrt{0.145 |\alpha_1/\alpha_3|}$ . Suppose the amplitudes of the main cursor and the first postcursor in  $h_{in}(t)$  are equal to  $\beta_m A_{1 dB}$  and  $\beta_1 A_{1 dB}$ , respectively. The normalized level of the postcursor at the output is given by

$$\begin{aligned} h_{1,\text{norm}} &= \frac{\alpha_1 \beta_1 A_{1\,\text{dB}} + \alpha_3 \beta_1^3 A_{1\,\text{dB}}^3}{\alpha_1 \beta_m A_{1\,\text{dB}} + \alpha_3 \beta_m^3 A_{1\,\text{dB}}^3} \\ &= \frac{\beta_1}{\beta_m} \frac{1 - 0.145 \beta_1^2}{1 - 0.145 \beta_m^2}. \end{aligned} \tag{4}$$



Fig. 15. (a) Design of charge-steering summer. (b) DFE summing junction eye without (gray plot) and with (black plot)  $M_{p1}$  and  $M_{p2}$ .

We rewrite this expression as

$$h_{1,\text{norm}} = \left(\frac{\beta_1}{\beta_m}\right)^3 \frac{\left(\frac{\beta_m}{\beta_1}\right)^2 - 0.145\beta_m^2}{1 - 0.145\beta_m^2} \tag{5}$$

so that we can keep $\beta_1/\beta_m$ constant and raise $\beta_m$, thus approaching $A_{1\,dB}$. Fig. 13(b) plots $h_{1,\text{norm}}$ as a function of $\beta_m$ for $\beta_1/\beta_m = 0.2, 0.4$, and 0.6, revealing significant additional ISI as $\beta_m$ reaches 1.5, i.e., as the main cursor approaches $1.5A_{1\,dB}$.
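As a numerical check, the Python sketch below evaluates (4) and (5) at a constant ratio $\beta_1/\beta_m$ while $\beta_m$ is raised, and confirms that the two forms agree. The function names and the particular $\beta_m$ values sampled are choices made for this illustration.

```python
def h1_norm_eq4(beta_1, beta_m):
    """Normalized first postcursor at the output, as written in (4)."""
    return (beta_1 / beta_m) * (1 - 0.145 * beta_1**2) / (1 - 0.145 * beta_m**2)


def h1_norm_eq5(ratio, beta_m):
    """Same quantity in the form of (5), with ratio = beta_1 / beta_m."""
    return ratio**3 * ((1 / ratio) ** 2 - 0.145 * beta_m**2) / (1 - 0.145 * beta_m**2)


for ratio in (0.2, 0.4, 0.6):
    for beta_m in (0.5, 1.0, 1.5):
        a = h1_norm_eq4(ratio * beta_m, beta_m)
        b = h1_norm_eq5(ratio, beta_m)
        assert abs(a - b) < 1e-12  # the two forms must agree
        print(f"ratio = {ratio}, beta_m = {beta_m}: h1_norm = {a:.3f}")
```

For small $\beta_m$ the result stays close to $\beta_1/\beta_m$, the value for a linear circuit, and it rises above that ratio as $\beta_m$ grows, which is the additional ISI discussed above.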

It is important to recognize that, for a high-loss channel, the main cursor amplitude, $\beta_m A_{1 \text{ dB}}$, is much smaller than the full input swing. This can be seen for a simple first-order RC section receiving a narrow pulse of height $V_0$ and duration $\Delta T$. The output exhibits a height of approximately $V_0 \Delta T/(RC)$, which is much smaller than $V_0$ if the pulsewidth is much less than RC. Thus, the maximum allowable input swing is considerably larger than $1.5A_{1 \text{ dB}}$.
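The RC-section argument can be checked with a few lines of Python. The sketch compares the exact peak of a first-order RC response to a rectangular pulse, $V_0(1 - e^{-\Delta T/RC})$, with the approximation $V_0 \Delta T/(RC)$ used in the text. The pulse height and time constant are hypothetical values chosen for illustration.

```python
import math


def rc_pulse_peak(v0, delta_t, rc):
    """Exact peak of a first-order RC response to a rectangular pulse."""
    return v0 * (1.0 - math.exp(-delta_t / rc))


def rc_pulse_peak_approx(v0, delta_t, rc):
    """Narrow-pulse approximation, valid when delta_t is much less than rc."""
    return v0 * delta_t / rc


# Hypothetical example: 1 V pulse, 40 ps wide (one UI at 25 Gb/s), RC = 400 ps.
exact = rc_pulse_peak(1.0, 40e-12, 400e-12)
approx = rc_pulse_peak_approx(1.0, 40e-12, 400e-12)
print(f"exact peak = {exact * 1e3:.1f} mV, approximation = {approx * 1e3:.1f} mV")
```

With the pulsewidth equal to one tenth of RC, the output peak is roughly one tenth of $V_0$, and the approximation overestimates the exact value by only a few percent.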

The analysis described above can be extended to other postcursors as well as the precursor. If we consider the height of the impulse response in Fig. 13(a) at  $2T_b$ ,  $3T_b$  etc., or at  $-T_b$ , the analysis can be repeated.

## C. Amplification Before DFE

With the above study, we wish to amplify the data after the first (passive) DMUX in Fig. 11, seeking a charge-steering solution. Illustrated in Fig. 14(a), the NRZ latch of Fig. 2 proves useful here as an efficient amplifier. The regenerative pair is activated after the input sampling switches turn off, at the same time that the summer turns on. According to simulations, the design values shown in Fig. 14(a) yield a gain of 6 dB with a power consumption of 19  $\mu$ W and  $A_{1 dB} = 90 \text{ mV}_p$ . Fig. 14(b) plots the simulated eye diagram at one of the summing junctions within the DFE before and after the regenerative pair is added. In this simulation, the single-ended input swing is equal to 290 mV<sub>pp</sub>. We observe that the vertical opening doubles, i.e., the additional gain far outweighs the ISI contributed by the nonlinearity of the regenerative pair.

#### D. Charge-Steering Summer

Fig. 15(a) shows the summer implementation along with the feedback taps. The PMOS cross-coupled pair proves particularly helpful here because the differential pairs in the two taps tend to reduce the output CM level considerably; when $CK_{12.5G}$ is asserted, the parasitic and/or explicit capacitances in the tails draw a common-mode current from the output nodes of the summer.<sup>3</sup> Fig. 15(b) displays the simulation results without and with the PMOS pair, indicating substantial improvement in the summing junction waveform.<sup>4</sup>

Fig. 16. Equalizer die.

In order to adjust the tap coefficients, the tail capacitors in Tap 1 are decomposed into 25 1-fF units, and those in Tap 2 into 10 1-fF units. Enabled or disabled through an on-chip serial bus, each unit measures $1.2 \ \mu m \times 0.5 \ \mu m$ and consists of fingers in metal 3 to metal 7. The summer and the two taps consume 590 $\mu$W.

<sup>3</sup>The output common-mode level of charge-steering circuits is primarily determined by the ratio of the tail and load capacitances and hence a weak function of process corners. Simulations suggest a change of 14 mV from FF, 0 °C to SS, 80 °C.

<sup>4</sup>According to simulations, the total integrated noise at the summer output is equal to 0.65 mV<sub>rms</sub> and 0.92 mV<sub>rms</sub> for the minimum and maximum CTLE peaking conditions, respectively. The sensitivity of the charge-steering latch is about 30 mV<sub>pp</sub> (differential).

Fig. 17. Test setup.

Fig. 18. Measured loss profiles of two channels.
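The tap-coefficient adjustment in the charge-steering summer, with 25 unit capacitors for Tap 1 and 10 for Tap 2, amounts to a small capacitor array per tap. The Python sketch below models one tail capacitor of each tap as a count of enabled 1-fF units. The dictionary keys, function name, and range check are hypothetical, and the encoding used on the serial bus is not modeled because the text does not describe it.

```python
UNIT_CAP_FF = 1.0                         # each switchable unit is 1 fF
UNITS_PER_TAP = {"tap1": 25, "tap2": 10}  # unit counts stated in the text


def tail_capacitance_ff(tap, enabled_units):
    """Tail capacitance (fF) of one tap with the given number of units enabled."""
    max_units = UNITS_PER_TAP[tap]
    if not 0 <= enabled_units <= max_units:
        raise ValueError(f"{tap} supports 0..{max_units} units, got {enabled_units}")
    return enabled_units * UNIT_CAP_FF


for tap, units in (("tap1", 18), ("tap2", 4)):  # hypothetical settings
    print(f"{tap}: {units} units -> {tail_capacitance_ff(tap, units):.0f} fF")
```

Tap 1 therefore spans 0 to 25 fF and Tap 2 spans 0 to 10 fF, both in 1-fF steps.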

## V. EXPERIMENTAL RESULTS

The equalizer has been designed and fabricated in TSMC's 45-nm digital CMOS technology and tested with a 1-V supply. Fig. 16 shows the die photo and identifies the building blocks. The core occupies an area of about 100  $\mu$ m × 100  $\mu$ m.

The chip has been directly mounted on a printed-circuit board, with the high-speed signals traveling through probes. Fig. 17 shows the test setup. An RF generator (Agilent E8257D) drives a divide-by-two circuit and a multiplexer (MUX). The divider drives four PRBS generators (three Centellax TG2P1-A's and one Centellax TG1B1-A), whose outputs are multiplexed to generate data at 25 Gb/s. The data is applied to the device under test (DUT), which also receives a half-rate clock from another RF generator (the Agilent E8257D on the bottom left). These two generators are mutually locked. The quarter-rate output of the DUT then returns to the Centellax TG1B1-A for bit error rate measurements. The bathtub curve is measured by adjusting the internal phase of the bottom E8257D and monitoring the BER reading produced by the TG1B1-A.



Fig. 19. (a) Eye diagram at the end of the channel. (b) Eye diagram of quarter-rate output.

Fig. 18 plots the measured loss profiles of the two channels used in the characterization of our prototype. The high-loss channel is used for the measurement at 8 Gb/s and the other for the measurement at 25 Gb/s. Each channel exhibits a loss of about 24 dB at the corresponding Nyquist frequency.
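For reference, the Nyquist frequencies and clock rates implied by the two data rates follow from simple arithmetic, sketched below in Python. The function name and dictionary keys are hypothetical; the 12.5 GHz and 6.25 GHz values for 25 Gb/s agree with those quoted in the text.

```python
def link_rates(bit_rate_hz):
    """Unit interval, Nyquist frequency, and fractional-rate clocks for NRZ data."""
    return {
        "ui_ps": 1e12 / bit_rate_hz,
        "nyquist_ghz": bit_rate_hz / 2 / 1e9,
        "half_rate_clock_ghz": bit_rate_hz / 2 / 1e9,
        "quarter_rate_clock_ghz": bit_rate_hz / 4 / 1e9,
    }


for rate in (8e9, 25e9):
    r = link_rates(rate)
    print(f"{rate / 1e9:g} Gb/s: UI = {r['ui_ps']:g} ps, "
          f"Nyquist = {r['nyquist_ghz']:g} GHz, "
          f"quarter-rate clock = {r['quarter_rate_clock_ghz']:g} GHz")
```

The 24-dB loss is thus specified at 4 GHz for the 8 Gb/s measurement and at 12.5 GHz for the 25 Gb/s measurement.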

Fig. 19(a) shows the measured eye diagram at the end of the lossy channel (the single-ended swing is 300 mV<sub>pp</sub>), and Fig. 19(b) the quarter-rate output produced by the chip. Fig. 20 depicts the measured bathtub curves with 8 Gb/s and 25 Gb/s PRBS data using the two channel profiles shown in Fig. 18. The lower rate consumes 4.12 mW, demonstrating the architecture's power scalability with the speed and its robust operation despite leakage currents in nanometer devices. For BER $< 10^{-12}$, the equalizer accommodates a total clock phase margin of 0.33 UI and 0.44 UI at 8 Gb/s and 25 Gb/s, respectively.

| Reference                          | [4]                            | [5]                              | [3]                             | [2]                            | [1]                            | This Work                       |
|------------------------------------|--------------------------------|----------------------------------|---------------------------------|--------------------------------|--------------------------------|---------------------------------|
| Data Rate                          | 19 Gb/s                        | 28 Gb/s                          | 22 Gb/s                         | 27 Gb/s                        | 20 Gb/s                        | 25 Gb/s                         |
| Architecture                       | 4-tap FFE +<br>5-tap DFE       | CTLE +<br>15-tap DFE             | CTLE +<br>2-tap DFE             | 1-tap DFE                      | CTLE +<br>1-tap DFE            | CTLE +<br>2-tap DFE             |
| DFE Clocking                       | Quarter Rate                   | Half Rate                        | Quarter Rate                    | Quarter Rate                   | Half Rate                      | Half Rate                       |
| Channel Loss<br>@ Nyquist          | 25 dB                          | 35 dB                            | 16 dB                           | >10 dB                         | 26.3 dB                        | 24 dB                           |
| BER /<br>Horizontal<br>Eye Opening | < 10 <sup>-9</sup> /<br>36% UI | < 10 <sup>-9</sup> /<br>35.6% UI | < 10 <sup>-12</sup> /<br>26% UI | < 10 <sup>-9</sup> /<br>26% UI | < 10 <sup>-8</sup> /<br>26% UI | < 10 <sup>-12</sup> /<br>44% UI |
| Supply (V)                         | 1.1                            | 1.05                             | 1.15                            | 1.1                            | 1.2                            | 1                               |
| Power (mW)                         | 118*                           | 80 **                            | 20.6                            | 11.1                           | 13.2                           | 5.8                             |
| Clock Buffer<br>Power (mW)         | N/A                            | N/A                              | 8.1                             | 20.1                           | N/A                            | 2.3                             |
| Area (mm <sup>2</sup> )            | 0.07                           | 0.81***                          | 0.016                           | 0.015                          | 0.012                          | 0.01                            |
| Technology                         | 45-nm SOI<br>CMOS              | 32-nm SOI<br>CMOS                | 45-nm CMOS                      | 40-nm CMOS                     | 45-nm SOI<br>CMOS              | 45-nm CMOS                      |

TABLE I Performance Summary

\*Include clock power.

\*\*Only for odd and even DFEs. Excludes CTLE, etc.

\*\*\*Includes TX+RX+PLL/4.

Fig. 20. Measured bathtub.

The overall equalizer consumes 5.8 mW at 25 Gb/s: 2.44 mW in the CTLE, 1.25 mW in the $\div$2 circuit, and 2.11 mW in the two DFE branches (including RZ/NRZ conversion). Table I summarizes the performance of our prototype and several recent designs in the speed range of 19 to 28 Gb/s. We observe that [2] compensates for 10 dB of channel loss and achieves an eye opening of 26% UI. If the 16-dB loss compensation in [3] is considered close to ours, then our design achieves a fourfold improvement in power efficiency.
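The power breakdown and the fourfold efficiency claim can be reproduced from the numbers given here and in Table I. The Python sketch below sums the block powers and compares the resulting mW/Gb/s figure with that of [3]; the variable and function names are chosen for this illustration.

```python
# Block powers at 25 Gb/s, in mW, as listed in the text.
blocks_mw = {"CTLE": 2.44, "divide-by-2": 1.25, "DFE + RZ/NRZ": 2.11}
total_mw = sum(blocks_mw.values())


def mw_per_gbps(power_mw, rate_gbps):
    """Power efficiency in mW/Gb/s (numerically equal to pJ/bit)."""
    return power_mw / rate_gbps


this_work = mw_per_gbps(total_mw, 25.0)
ref3 = mw_per_gbps(20.6, 22.0)  # power and data rate of [3] from Table I
improvement = ref3 / this_work

print(f"total power = {total_mw:.2f} mW")
print(f"this work: {this_work:.3f} mW/Gb/s, [3]: {ref3:.3f} mW/Gb/s")
print(f"improvement = {improvement:.2f}x")
```

The comparison uses only the equalizer power entries of Table I and leaves the clock-buffer power row out of both figures.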

#### VI. CONCLUSION

This work has introduced a half-rate/quarter-rate DFE architecture that lends itself to charge-steering implementation. Using a linear 1-to-2 DMUX, regenerative amplification, and PMOS cross-coupled pairs at precharged nodes, the two-tap DFE produces an eye opening of 44% UI with BER $< 10^{-12}$ while equalizing for a loss of 24 dB at 12.5 GHz. We have also analyzed the effect of nonlinearity on the performance of equalizers.

#### ACKNOWLEDGMENT

The authors would like to thank the TSMC University Shuttle Program for chip fabrication.

#### REFERENCES

- [1] J. E. Proesel and T. O. Dickson, "A 20 Gb/s 0.66-pJ/bit serial receiver with 2-stage continuous-time linear equalizer and 1-tap decision feedback equalizer in 45 nm SOI CMOS," in *Proc. IEEE Symp. VLSI Circuits*, June 2011, pp. 206–207.
- [2] K. Kaviani et al., "A 27 Gb/s 0.41-mW/Gb/s 1-tap predictive decision feedback equalizer in 40-nm low-power CMOS," in *Proc. IEEE CICC*, Sept. 2012.
- [3] K. Jung et al., "A 0.94 mW/Gb/s 22 Gb/s 2-tap partial-response DFE receiver in 40 nm LP CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2013, pp. 42–43.
- [4] A. Agrawal et al., "A 19-Gb/s serial link receiver with both 4-tap FFE and 5-tap DFE functions in 45-nm SOI CMOS," *IEEE J. Solid-State Circuits*, vol. 47, no. 12, pp. 3220–3231, Dec. 2012.
- [5] J. Bulzacchelli et al., "A 28-Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32-nm SOI CMOS technology," *IEEE J. Solid-State Circuits*, vol. 47, no. 12, pp. 3232–3248, Dec. 2012.
- [6] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CDR/deserializer," *IEEE J. Solid-State Circuits*, vol. 48, no. 3, pp. 684–697, Mar. 2013.
- [7] Y. Lu and E. Alon, "Design techniques for a 66 Gb/s 46 mW 3-tap decision feedback equalizer in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.
- [8] C. Kromer et al., "A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects," *IEEE J. Solid-State Circuits*, vol. 41, no. 12, pp. 2921–2929, Dec. 2006.
- [9] K. Yu and J. Lee, "A 2 × 25-Gb/s receiver with 2:5 DMUX for 100-Gb/s Ethernet," *IEEE J. Solid-State Circuits*, vol. 45, no. 11, pp. 2421–2432, Nov. 2010.
- [10] J. W. Jung and B. Razavi, "A 25 Gb/s 5.8 mW CMOS equalizer," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2014, pp. 44–45.
- [11] S. Ibrahim and B. Razavi, "Low-power CMOS equalizer design for 20-Gb/s systems," *IEEE J. Solid-State Circuits*, vol. 46, no. 6, pp. 1321–1336, Jun. 2011.
- [12] T. Toifl et al., "A 22-Gb/s PAM-4 receiver in 90-nm CMOS SOI technology," *IEEE J. Solid-State Circuits*, vol. 41, no. 4, pp. 954–965, Apr. 2006.
- [13] J. Jaussi et al., "8-Gb/s source-synchronous I/O link with adaptive receiver equalization, offset cancellation, and clock de-skew," *IEEE J. Solid-State Circuits*, vol. 40, no. 1, pp. 80–88, Jan. 2005.



**Jun Won Jung** was a Postdoctoral Scholar with the Communication Circuits Laboratory, University of California, Los Angeles, CA, USA, in 2013, focusing on the design of the low-power equalizer. He is currently with the SERDES group of Broadcom Corporation.

Dr. Jung was the recipient of the Best Student Paper Award for the 2012 Symposium on VLSI Circuits.



**Behzad Razavi** (F'03) received the B.S.E.E. degree from Sharif University of Technology, Tehran, Iran, in 1985 and the M.S.E.E. and Ph.D. E.E. degrees from Stanford University, Stanford, CA, USA, in 1988 and 1992, respectively.

He was with AT&T Bell Laboratories and Hewlett-Packard Laboratories until 1996. Since 1996, he has been an Associate Professor and subsequently a Professor of electrical engineering with the University of California, Los Angeles, CA, USA. His current research includes wireless transceivers, frequency synthesizers, phase-locking and clock recovery for high-speed data communications, and data converters. He was an Adjunct Professor with Princeton University from 1992 to 1994 and with Stanford University in 1995. He is the author of *Principles of Data Conversion System Design* (IEEE, 1995), *RF Microelectronics* (Prentice Hall, 1998, 2012) (translated to Chinese, Japanese, and Korean), *Design of Analog CMOS Integrated Circuits* (McGraw-Hill, 2001) (translated to Chinese, Japanese, and Korean), *Design of Integrated Circuits for Optical Communications* (McGraw-Hill, 2003, Wiley, 2012), and *Fundamentals of Microelectronics* (Wiley, 2006) (translated to Korean and Portuguese), and the editor of *Monolithic Phase-Locking in High-Performance Systems* (IEEE, 2003).

Prof. Razavi served on the Technical Program Committees of the International Solid-State Circuits Conference (ISSCC) from 1993 to 2002 and VLSI Circuits Symposium from 1998 to 2002. He has also served as a guest editor and an associate editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and International Journal of High Speed Electronics. He was the recipient of the Beatrice Winner Award for Editorial Excellence at the 1994 ISSCC, the Best Paper Award at the 1994 European Solid-State Circuits Conference, the Best Panel Award at the 1995 and 1997 ISSCC, the TRW Innovative Teaching Award in 1997, the Best Paper Award at the IEEE Custom Integrated Circuits Conference in 1998, and the McGraw-Hill First Edition of the Year Award in 2001. He was the corecipient of both the Jack Kilby Outstanding Student Paper Award and the Beatrice Winner Award for Editorial Excellence at the 2001 ISSCC. He received the Lockheed Martin Excellence in Teaching Award in 2006, the UCLA Faculty Senate Teaching Award in 2007, and the CICC Best Invited Paper Award in 2009 and in 2012. He was the corecipient of the 2012 VLSI Circuits Symposium Best Student Paper Award and the 2013 CICC Best Paper Award. He was also recognized as one of the top ten authors in the 50-year history of ISSCC. He received the 2012 Donald Pederson Award in Solid-State Circuits. He was also the recipient of the American Society for Engineering Education PSW Teaching Award in 2014. He has served as an IEEE Distinguished Lecturer.