# A 25-Gb/s 5-mW CMOS CDR/Deserializer

Jun Won Jung and Behzad Razavi, Fellow, IEEE

*Abstract*—The demand for higher data rates in serial links has exacerbated the problem of power consumption, motivating extensive work on receiver and transmitter building blocks. This paper presents a half-rate clock and data recovery circuit and a deserializer that employ charge-steering logic to reduce the power consumption. Realized in 65-nm technology, the overall circuit draws 5 mW from a 1-V supply, producing a clock with an rms jitter of 1.5 ps and a jitter tolerance of 0.5 UI<sub>pp</sub> at 5 MHz jitter frequency.

*Index Terms*—Charge steering, clock and data recovery, deserializer, phase detector.

## I. INTRODUCTION

Recent studies indicate that the input/output (I/O) bandwidth of serial links must increase by 2 to 3 times every two years [1] so as to keep up with the demand for higher data rates. In order to manage such bandwidths with reasonable power consumption, an efficiency of around 1 mW/Gb/s for the overall transceiver is targeted [2], necessitating a much smaller value for each building block.

A few CMOS clock and data recovery (CDR) circuits have been demonstrated at a data rate of 25 Gb/s [3], [4]. The former incorporates an off-chip oscillator, a phase interpolator, and a half-rate phase detector (PD) to retime and produce half-rate data while consuming 98 mW. The latter employs a full-rate edge detector along with an LC oscillator and a 1-to-10 demultiplexer (DMUX), drawing 99 mW in the CDR circuit and 64 mW in the DMUX. Both designs are based on current-steering stages.

This paper describes the design of a 25-Gb/s clock and data recovery circuit and a deserializer that, through the use of "charge steering" and other innovations, achieve a twenty-fold reduction in the power dissipation with respect to the prior art. Realized in 65-nm CMOS technology, an experimental prototype exhibits an integrated clock jitter of 1.52 ps<sub>rms</sub> and a jitter tolerance of 0.5 unit interval (UI) at a jitter frequency of 5 MHz [5].

Section II provides the background for this work, underscoring the importance of latch design in broadband receivers. Section III describes the concept of charge steering and extends the technique to flipflops. Sections IV and V deal with the design of the phase detector and the deserializer, respectively. The overall system is presented in Section VI and the experimental results in Section VII.

The authors are with the Electrical Engineering Department, University of California, Los Angeles, CA 90095-1594 USA (e-mail: razavi@ee.ucla.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2013.2237692

Fig. 1. Generic broadband receiver.

## **II. GENERAL CONSIDERATIONS**

Fig. 1 shows a generic broadband receiver consisting of an analog front end (possibly including an equalizer), a CDR loop, and a demultiplexer (DMUX). The CDR circuit comprises a phase detector (PD), a low-pass filter (LPF), and a voltage-controlled oscillator (VCO). We observe that the PD, the DMUX, and the frequency dividers incorporate nearly a dozen latches, potentially consuming a large power. It is therefore desirable to develop high-speed low-power latches and minimize their number in a receiver.

The choice of the latch topology is governed not only by its intrinsic speed and power drain but also by its environment: (1) The received data typically does not have rail-to-rail swings and may impose severe power or intersymbol interference (ISI) penalty if it is amplified to such levels; the latches must thus operate with moderate data amplitudes (e.g., ~ 400 mV<sub>pp</sub> single-ended). The important implication here is that the data cannot easily sample the clock, dictating PD topologies in which the clock samples the data. (2) The clock can provide nearly rail-to-rail swings if the CDR circuit employs an LC oscillator, but the power consumed by clock buffers ( $\approx f CV_{DD}^2$ ) may become prohibitively large.

## **III. CHARGE STEERING**

The use of charge steering can be traced back to regenerative BiCMOS comparators introduced in the early 1990s [6], [7]. In this work, we extend the idea to non-regenerative circuits and flipflops (FFs), exploit charge steering to realize high-speed phase detectors and demultiplexers, and architect the CDR and the deserializer so as to circumvent this technique's drawbacks.

### A. Basic Idea

Consider the simple differential pair shown in Fig. 2(a), noting that this current-mode logic (CML) stage draws a constant power equal to $I_T V_{DD}$. Let us transform the circuit as follows: replace the resistors with capacitors, replace the tail current source with a charge source, and steer charge (current for a short time) rather than current. Depicted in Fig. 2(b), the resulting circuit additionally requires two switches in the tail and two at the output nodes.

Manuscript received September 13, 2012; revised December 09, 2012; accepted December 14, 2012. Date of publication January 22, 2013; date of current version February 20, 2013. This paper was approved by Associate Editor Eric A. M. Klumperink. This work was supported by National Semiconductor (now Texas Instruments) and Realtek Semiconductor.

Fig. 2. (a) Current-mode logic. (b) Charge-steering logic.

Fig. 3. Operation of charge-steering logic.

The operation of the proposed charge-steering logic (CSL) is illustrated in Fig. 3. In the reset mode, the tail capacitor,  $C_T$ , is discharged to ground and the output nodes are precharged to  $V_{DD}$ . In the evaluation mode,  $C_T$  switches to the tail node, P, drawing a current from  $M_1$  and  $M_2$ , and nodes X and Y are released. The input pair now draws a differential current from the load capacitors in proportion to the differential input voltage until  $V_P$  rises enough to turn  $M_1$  and  $M_2$  off. The CSL circuit can thus amplify and latch the input.

CSL can operate with moderate input and output data swings while drawing power for only a fraction of the clock cycle. In other words, charge steering affords a design style faster than rail-to-rail logic and less power hungry than current-mode logic. As with other dynamic circuits, the average power consumption of the circuit directly scales with frequency (Section III-C). Thus, a design targeting high speeds can also be reused in different parts of a system and at lower speeds with no loss in power efficiency.

The charge injection and clock feedthrough of the precharge switches merit attention. These effects are discussed in Section III-F.

It is important to note that the circuit of Fig. 2(b) is not a simple, complementary version of precharged logic [8]. In particular, the tail charge source plays a critical role here. If the tail node, P, were grounded,  $V_X$  and  $V_Y$  would eventually collapse to zero, providing a valid output for only a short period of time. As explained below, the tail charge source also helps define the output swing and the small-signal gain of the circuit.

### B. Gain and Swing Calculation

The design of CSL circuits demands simple, intuitive expressions quantifying the performance. To estimate the small-signal voltage gain of the CSL latch shown in Fig. 2(b), let us assume simple square-law MOS devices and, neglecting subthreshold conduction, note that $V_P$ takes infinite time to reach $V_{CM} - V_{TH}$, where $V_{CM}$ denotes the input common-mode (CM) level and $V_{TH}$ the threshold voltage of $M_1$ and $M_2$. We wish to determine the time, $\Delta T$, necessary for $V_P$ to rise to $V_{CM} - V_{TH} - \Delta V$, where $\Delta V$ is somewhat small, arbitrary and, as seen below, eventually unimportant. Merging $M_1$ and $M_2$ and viewing the composite device as a source follower, one can prove that $\Delta T$ is given by [9]:

$$\Delta T \approx \frac{C_T}{\frac{1}{2}\mu_n C_{OX} \frac{W}{L}} \frac{V_{CM} - V_{TH} - \Delta V}{(V_{CM} - V_{TH})\Delta V}. \tag{1}$$

The average current drawn by  $C_T$  during this time is equal to

$$\begin{aligned} I_{avg} &= \frac{(V_{CM} - V_{TH} - \Delta V)C_T}{\Delta T} \\ &= \frac{1}{2}\mu_n C_{OX} \frac{W}{L} (V_{CM} - V_{TH})\Delta V. \end{aligned} \tag{2}$$

Also, the overdrive voltage of  $M_1$  and  $M_2$  varies from  $V_{CM} - V_{TH}$  to  $\Delta V$ , yielding an average roughly given by  $(V_{CM} - V_{TH} + \Delta V)/2$ . The average transconductance of the input transistors thus emerges as

$$\begin{aligned} g_{m,avg} &\approx \frac{2I_D}{V_{GS} - V_{TH}} \approx \frac{I_{avg}}{(V_{GS} - V_{TH})_{avg}} \\ &\approx \frac{\mu_n C_{OX} \frac{W}{L} (V_{CM} - V_{TH}) \Delta V}{V_{CM} - V_{TH} + \Delta V}. \end{aligned} \tag{3}$$

For a small differential input, this transconductance produces a proportional differential current for  $\Delta T$  seconds [(1)], generating a differential output voltage equal to

$$\begin{aligned} V_{out} &\approx \frac{g_{m,avg}V_{in}\Delta T}{C_D} \\ &\approx 2\frac{V_{CM} - V_{TH} - \Delta V}{V_{CM} - V_{TH} + \Delta V}\frac{C_T}{C_D}V_{in}. \end{aligned} \tag{4}$$



Fig. 4. Input/output characteristics of RZ charge-steering latch.

The small-signal voltage gain<sup>1</sup> is therefore obtained for a small  $\Delta V$  as:

$$A_V \approx \frac{2C_T}{C_D},\tag{5}$$

if the circuit is allowed infinite time for charge steering.
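As a rough numerical check of (1)-(5), the sketch below integrates the square-law charging of $C_T$ and compares the elapsed time with the closed-form expression. It also evaluates the gain of (4) for a shrinking $\Delta V$. The device constant `K` (standing for $\frac{1}{2}\mu_n C_{OX} W/L$ of the composite device), the overdrive, and the capacitor values are illustrative assumptions, not values taken from the design.

```python
def csl_closed_form(k, v_ov0, delta_v, c_t, c_d):
    """Evaluate Eqs. (1)-(4); v_ov0 stands for V_CM - V_TH."""
    delta_t = (c_t / k) * (v_ov0 - delta_v) / (v_ov0 * delta_v)  # Eq. (1)
    i_avg = (v_ov0 - delta_v) * c_t / delta_t                    # Eq. (2)
    gm_avg = i_avg / ((v_ov0 + delta_v) / 2.0)                   # Eq. (3)
    gain = gm_avg * delta_t / c_d                                # Eq. (4), V_out / V_in
    return delta_t, i_avg, gm_avg, gain


def csl_time_numeric(k, v_ov0, delta_v, c_t, dt=1e-14):
    """Forward-Euler integration of C_T dV_P/dt = k (V_CM - V_TH - V_P)^2.

    Returns the time at which the overdrive has dropped to delta_v.
    """
    v_ov, t = v_ov0, 0.0
    while v_ov > delta_v:
        v_ov -= k * v_ov * v_ov / c_t * dt
        t += dt
    return t


# Assumed, illustrative values (not from the paper).
K = 1e-3        # A/V^2
V_OV0 = 0.4     # V, initial overdrive V_CM - V_TH
C_T = 10e-15    # F
C_D = 10e-15    # F

dt_formula, i_avg, _, gain_50mv = csl_closed_form(K, V_OV0, 0.05, C_T, C_D)
dt_numeric = csl_time_numeric(K, V_OV0, 0.05, C_T)
gain_small = csl_closed_form(K, V_OV0, 1e-4, C_T, C_D)[3]

print(f"Delta T: formula {dt_formula * 1e12:.1f} ps, numeric {dt_numeric * 1e12:.1f} ps")
print(f"gain for Delta V = 50 mV:  {gain_50mv:.3f}")
print(f"gain for Delta V = 0.1 mV: {gain_small:.3f} (2 C_T / C_D = {2 * C_T / C_D:.3f})")
```

Under these assumptions, the gain approaches $2C_T/C_D$ as $\Delta V$ shrinks, which is the limit stated in (5).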

The upper bound on the output swing occurs when the input differential voltage is large enough to keep one transistor off for most of the charging period, a desirable condition in latch design. In this case,  $C_T$  draws some charge from both transistors until one turns off and then continues to discharge one output capacitor, yielding in the limit a differential output voltage of approximately

$$V_{out} = \frac{\left(V_{CM} + \frac{V_{in}}{2} - V_{TH}\right)C_T}{C_D} - \frac{C_T}{C_D}V_{in}\frac{\left(V_{CM} - V_{TH} - \frac{V_{in}}{2}\right)}{\left(V_{CM} - V_{TH} + \frac{V_{in}}{2}\right)}, \tag{6}$$

where the second term accounts for the charge initially lost to the transistor that turns off first.

The foregoing derivations are verified by circuit simulations. Fig. 4 plots the output voltage as a function of the input voltage along with the predictions made by (5) and (6). Despite the oversimplified square-law model and the use of averages, we observe reasonable agreement.
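To make the two predictions concrete, the following sketch tabulates the small-signal estimate of (5) next to the large-signal expression of (6). The threshold voltage and the capacitor ratio are assumed for illustration. The 850-mV common-mode level is borrowed from the sensitivity example in Section III-F and may not correspond to the conditions of Fig. 4.

```python
def vout_small_signal(v_in, c_t, c_d):
    """Eq. (5): V_out = (2 C_T / C_D) V_in, valid for small inputs."""
    return 2.0 * c_t / c_d * v_in


def vout_large_signal(v_in, v_cm, v_th, c_t, c_d):
    """Eq. (6): upper bound on the differential output swing for large inputs."""
    v_ov = v_cm - v_th
    first_term = (v_ov + v_in / 2.0) * c_t / c_d
    # Charge initially lost to the transistor that turns off first.
    second_term = (c_t / c_d) * v_in * (v_ov - v_in / 2.0) / (v_ov + v_in / 2.0)
    return first_term - second_term


# Assumed, illustrative values.
V_CM, V_TH = 0.85, 0.40      # V
C_T, C_D = 10e-15, 10e-15    # F

for v_in in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
    small = vout_small_signal(v_in, C_T, C_D)
    large = vout_large_signal(v_in, V_CM, V_TH, C_T, C_D)
    print(f"V_in = {v_in:.2f} V: Eq. (5) -> {small:.3f} V, Eq. (6) -> {large:.3f} V")
```

Note that (6) does not vanish as $V_{in}$ approaches zero, since it is derived for inputs large enough to turn one transistor off. Which expression applies at a given input is left to the comparison in Fig. 4.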

### C. Comparison With Current Steering

We quantify the power advantage of charge steering over current steering assuming that the input swing is large enough to allow relatively complete switching. Suppose the CML stage of Fig. 2(a) sees a load capacitance of  $C_D$  at each output. To accommodate a bit rate of  $r_b$ , the output bandwidth must reach approximately  $0.7r_b$ :  $(2\pi R_D C_D)^{-1} = 0.7r_b$ . The single-ended output swing,  $\Delta V$ , is equal to  $I_{SS}R_D$ , dictating a tail current of

$$I_{SS} = 2\pi (0.7r_b)C_D \Delta V. \tag{7}$$

<sup>1</sup>In the small-signal regime, we are interested in a small voltage change across the tail capacitor and the resulting voltage change at the output. Thus, the voltage gain is relatively independent of the input CM level.

In the charge-steering counterpart of Fig. 2(b), on the other hand, only one load capacitor charges to  $V_{DD}$  and discharges to  $V_{DD} - \Delta V$  in one bit period. The circuit thus draws an average supply current of

$$I_{supp} = C_D r_b \Delta V. \tag{8}$$

In other words, for a given load capacitance, bit rate, and output swing, CSL affords a factor of  $1.4\pi \approx 4.4$  power reduction with respect to CML. This calculation ignores the power consumption necessary to drive the clocked devices in the two topologies. We return to this point in Section VI.
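The ratio follows directly from dividing (7) by (8). A minimal sketch of that arithmetic is given below, where the load capacitance and swing are assumed values chosen only for illustration.

```python
import math


def cml_tail_current(bit_rate, c_load, swing):
    """Eq. (7): I_SS = 2*pi*(0.7*r_b)*C_D*dV for a bandwidth of 0.7*r_b."""
    return 2.0 * math.pi * (0.7 * bit_rate) * c_load * swing


def csl_supply_current(bit_rate, c_load, swing):
    """Eq. (8): I_supp = C_D*r_b*dV, one load capacitor recharged per bit."""
    return c_load * bit_rate * swing


# Assumed, illustrative values: 25 Gb/s, 10 fF load, 400 mV swing.
R_B, C_D, DV = 25e9, 10e-15, 0.4

i_cml = cml_tail_current(R_B, C_D, DV)
i_csl = csl_supply_current(R_B, C_D, DV)
print(f"CML: {i_cml * 1e6:.1f} uA, CSL: {i_csl * 1e6:.1f} uA, ratio: {i_cml / i_csl:.2f}")
```

The ratio is independent of the assumed $C_D$, $r_b$, and $\Delta V$. It depends only on the $0.7r_b$ bandwidth criterion adopted for the CML stage.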

The maximum speed of charge-steering circuits is determined by two mechanisms. First, as the clock period and hence the time for charge steering decrease, the output swing eventually fails to reach the value predicted by (6). That is, at excessively high rates, the large-signal gain falls below unity. Second, the output precharge speed is limited by the on-resistance of the precharge switches and the total output node capacitance. As the clock period decreases, incomplete precharge leads to heavy ISI.

The dynamic nature of charge steering makes the circuits susceptible to device leakage currents, thus placing a lower limit on the operation speed. For the design reported here, simulations indicate that clock frequencies as low as 50 MHz can be accommodated.

### D. Design Issues

While saving considerable power, charge steering does face a number of issues that make the design challenging. First, to drive the tail and output switches in Fig. 2(b), a rail-to-rail clock is necessary, demanding that clock generation and latch design be co-optimized (Section VI). Second, a CSL stage spends about one-half of the clock period,  $T_{CK}$ , in the reset mode, producing a return-to-zero output. This attribute may be considered an advantage or a disadvantage. The reset operation actively removes ISI, a point of contrast to the "passive" continuous-time decay in CML circuits. But it also demands a dedicated fraction of the clock cycle, tightening the timing budget for amplification and latching. Moreover, the RZ<sup>2</sup> output must be converted to non-return-to-zero (NRZ) format at some point for ease of use.

The RZ output issue manifests itself when two CSL stages must be cascaded. Consider, for example, the master-slave flipflop shown in Fig. 5(a). If  $CK_1$  and  $CK_2$  are simply complementary, then the slave stage begins to sense when the master outputs begin to reset. Thus, if the reset operation happens to be faster than the sense operation (e.g., in the slow-NMOS, fast-PMOS corner of the process), then the slave may produce a small differential output.

The above difficulty can be remedied by more complex clocking. Depicted in Fig. 5(b) is an example where $CK_1$ and $CK_2$ are offset by about one-quarter of the clock period so that the master provides unreset outputs to the slave for $T_{CK}/4$ seconds. However, generation and buffering of such clock phases at high frequencies demand substantial power.

<sup>2</sup>For differential signals, the data format is in fact "bipolar RZ" [10].



Fig. 5. (a) Master-slave flipflop using charge steering, (b) required clock waveforms for robust operation.



Fig. 6. NRZ charge-steering latch.

### E. NRZ Charge-Steering Latch

It is possible to avoid the reset mode by merging it with the sense mode. This requires that the input and output nodes be the same! Depicted in Fig. 6(a), such a topology provides an NRZ output. In the sense mode, switches  $S_1$  and  $S_2$  are on, allowing X and Y to track the input, and  $S_3$  is on, discharging  $C_T$ . When  $S_1$ - $S_3$  turn off and  $S_4$  turns on, the circuit begins to regenerate, thus amplifying  $V_X - V_Y$  and holding the result.

We wish to estimate the small-signal voltage gain of this latch in the regeneration mode. Consider the simplified circuit shown in Fig. 6(b), where $R_{on}$ represents the on-resistance of $S_4$. To determine the upper bound on the gain, let us assume that (1) the latch begins with a small imbalance, $V_{XY0}$, and (2) $M_1$ and $M_2$ are so wide that their gate-source voltage varies negligibly while $C_T$ charges. The soundness of these assumptions is checked below.

We now write the tail current as

$$I_T(t) = \frac{V_{CM} - V_{GS}}{R_{on}} \exp \frac{-t}{R_{on}C_T}, \tag{9}$$

Fig. 7. Input/output characteristics of NRZ charge-steering latch.

where $V_{CM}$ denotes the input CM level. In the design used here, the transistors mostly operate in the subthreshold region<sup>3</sup>, exhibiting a transconductance of $g_m \approx I_D/(\zeta V_T)$, where $V_T = kT/q$ is the thermal voltage and $\zeta$ is related to the subthreshold slope and given by $1 + C_d/C_{ox}$ ($C_d$ is the depletion region capacitance under the channel). Since $I_{D1} \approx I_{D2} \approx I_T/2$, the time-variant transconductance of each transistor is estimated as

$$g_m(t) = \frac{1}{2\zeta V_T} \frac{V_{CM} - V_{GS}}{R_{on}} \exp \frac{-t}{R_{on}C_T}. \tag{10}$$

We also express the regeneration action by the following equations:

$$-C_D \frac{dV_X}{dt} = g_{m1} V_Y \tag{11}$$

$$-C_D \frac{dV_Y}{dt} = g_{m2} V_X, \tag{12}$$

and hence

$$C_D \frac{dV_{XY}}{dt} = g_m(t) V_{XY}, \tag{13}$$

where  $g_{m1} = g_{m2} = g_m(t)$ . It follows from (10) and (13) that

$$C_D \frac{dV_{XY}}{V_{XY}} = \frac{V_{CM} - V_{GS}}{2\zeta V_T R_{on}} \exp \frac{-t}{R_{on} C_T} dt. \tag{14}$$

Integration of both sides for t = 0 to  $t = \infty$  yields

$$C_D \ln \frac{V_{XY\infty}}{V_{XY0}} = \frac{V_{CM} - V_{GS}}{2\zeta V_T} C_T, \tag{15}$$

and, therefore,

$$\frac{V_{XY\infty}}{V_{XY0}} = \exp\left(\frac{C_T}{C_D}\frac{V_{CM} - V_{GS}}{2\zeta V_T}\right). \tag{16}$$

<sup>3</sup>The overdrive voltage of the cross-coupled devices varies from 60 mV at the beginning of the cycle to -80 mV when most of the charge has been steered. The $f_T$ of these transistors varies from 130 GHz at the beginning to about 31 GHz near the end.



Fig. 8. (a) Master-slave FF using cascaded NRZ latches, (b) simulated waveforms. (The switches are ideal so as to clearly show the charge sharing effect.)

The maximum output swing occurs if  $V_{XY0}$  is large enough to keep one transistor off. In this case, no regeneration takes place and the output swing is given by (6). As with the RZ latch of Fig. 2(b), this circuit begins to experience ISI as the clock period becomes too small to allow full tracking of the input.

Fig. 7 plots the simulated output voltage of the circuit as a function of the initial imbalance. The result predicted by (16) is also plotted. We note a reasonable agreement.

### F. Charge-Steering Flipflop

In view of the cascading issues illustrated in Fig. 5, we may contemplate a flipflop employing the above NRZ latch instead. As shown in Fig. 8(a), such a master-slave topology could, in principle, operate with only complementary clocks because it does not require a dedicated reset time. Unfortunately, this approach suffers from severe charge sharing between the master and slave nodes, introducing substantial ISI in random data. We recognize that for a random input sequence, the previous state at  $X_2$  may be the opposite of the present state at  $X_1$ , causing a twofold reduction in the signal amplitude if the capacitances at these nodes are equal. Fig. 8(b) shows the simulated waveforms at the four nodes, revealing severe corruption.

The foregoing studies lead to the proposed charge-steering FF shown in Fig. 9(a) as a viable candidate. Here, the master is realized as the NRZ latch, thus avoiding the reset phase, and the slave as the original RZ latch, thus avoiding charge sharing. The circuit can therefore operate with complementary clocks. The circuit diagram also shows the transistor widths and capacitance values as a design example for an input data rate of 25 Gb/s and a clock frequency of 12.5 GHz. (The channel length is 60 nm.)

Fig. 9(b) plots the circuit's simulated waveforms with sinusoidal clock waveforms. With a single-ended input swing of $300 \text{ mV}_{pp}$, the master produces a swing of about $340 \text{ mV}_{pp}$ and the slave, about $500 \text{ mV}_{pp}$. The FF consumes 158 $\mu$W from a 1-V supply at this rate. It is possible to reduce the power by "linear" scaling of all of the devices [11], but at the cost of a higher offset. According to simulations, the above design exhibits an input-referred offset of about 8 mV, a comfortable value for input swings of a few hundred millivolts. The charge-steering circuits in this work employ PMOS transistors for input sampling and output precharge. The charge injection and clock feedthrough of these devices mostly introduce a common-mode jump of roughly 100 mV at their respective nodes. Fortunately, this jump is *upward*, facilitating the operation by providing a greater voltage headroom for the regenerative or differential pair branches.

The charge-steering circuit of Fig. 9(a) has some sensitivity to the common-mode voltage. For example, according to simulations, if the input CM level falls by 50 mV from 850 mV, the voltage swing at  $X_2$  (or  $Y_2$ ) decreases by 8 mV. Similarly, if the supply voltage drops by 50 mV, the swing at  $X_2$  shrinks by 9 mV.

The proposed FF topology proves useful in the design of phase detectors and (de)multiplexers. However, it still produces RZ data, requiring additional techniques at the architecture level.

### G. Comparison by Simulations

As suggested by (7) and (8), charge steering affords approximately a fourfold reduction in power compared to CML. In this section, we perform a more detailed comparison with both CML and rail-to-rail implementations. For a fair comparison, we design each circuit for a set of specifications and examine the output eye diagram.<sup>4</sup>

Fig. 9. (a) Charge-steering flipflop. (b) Simulated waveforms.

Fig. 10. (a) CML flipflop example, and (b) its simulated eye diagram with rail-to-rail clocks.

Let us repeat the CSL design of Fig. 9(a) in CML with the same power consumption (160 $\mu$W), supply voltage (1 V), output swing ($\approx 400 \text{ mV}_{pp}$ single-ended), and input offset (8 mV) [Fig. 10(a)]. Each latch therefore has a current budget of 80 $\mu$A, requiring a load resistor of 5 k$\Omega$. Choosing a width of 2 $\mu$m for the transistors in the signal path and 0.23 $\mu$m for the clocked devices, we obtain the eye diagram shown in Fig. 10(b).
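The current and resistor values quoted above follow from simple budgeting, sketched below. The last function goes one step further and applies the $0.7r_b$ bandwidth criterion of Section III-C to estimate how much load capacitance such a resistor could tolerate. Applying that criterion to this flipflop at the 25-Gb/s input rate is an assumption of the sketch, not a statement from the comparison itself.

```python
import math


def cml_latch_budget(total_power, v_dd, n_latches, swing):
    """Split the supply current equally among the latches and size the load
    resistor so that the full tail current develops the target swing."""
    i_tail = total_power / v_dd / n_latches
    r_load = swing / i_tail
    return i_tail, r_load


def max_load_capacitance(r_load, bit_rate):
    """Largest C_D satisfying (2*pi*R_D*C_D)^-1 = 0.7*r_b (assumed criterion)."""
    return 1.0 / (2.0 * math.pi * r_load * 0.7 * bit_rate)


# Values from the text: 160 uW, 1 V, two latches (master and slave), 400 mV swing.
i_tail, r_load = cml_latch_budget(160e-6, 1.0, 2, 0.4)
c_max = max_load_capacitance(r_load, 25e9)

print(f"tail current per latch: {i_tail * 1e6:.0f} uA")
print(f"load resistor: {r_load / 1e3:.1f} kOhm")
print(f"load capacitance allowed by the 0.7*r_b criterion: {c_max * 1e15:.2f} fF")
```

Under this assumed criterion, the tolerable load capacitance is on the order of 2 fF, which illustrates why a CML stage at this power level struggles at these rates.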

<sup>4</sup>All channel lengths are 60 nm.

Repeating the design in rail-to-rail logic is more difficult as the data swings at these rates are typically a few hundred millivolts and require amplification and hence hundreds of microwatts of additional power. Nonetheless, we neglect this power and consider the flipflop example in Fig. 11(a). For a total power consumption of 175  $\mu$ W, we select  $W_1 = W_2 = 1 \mu m$ ,  $W_3 = W_4 = 1 \mu m$ , and  $W_5 = 5 \mu m$ . The resulting eye diagram is shown in Fig. 11(b).

## **IV. PHASE DETECTOR**

In order to alleviate speed issues and ease the distribution of the recovered clock, this work employs a half-rate CDR architecture. Also, to avoid quadrature clock phases, the half-rate PD in [12] is used here [Fig. 12(a)]. The circuit incorporates four latches to sample the data on both the rising and falling edges of the half-rate clock, CK. Node $X_1$ carries a pulsewidth equal to $T_{CK}/2 + \Delta T$, where $T_{CK}$ denotes the clock period and $\Delta T$ the phase error, and node $X_2$ a pulsewidth equal to $T_{CK}/2 - \Delta T$. Thus, $V_{ERR} = X_1 \oplus X_2$ exhibits a pulse of width $\Delta T$ for each data transition. However, as in a Hogge PD [13], the average value of these "proportional" pulses is a function of data transition density, failing to uniquely represent the phase difference for various data patterns [14]. To avoid this ambiguity, latches $L_3$ and $L_4$ resample the data, generating at $V_{REF}$ a "reference" pulsewidth equal to $T_{CK}/2$. The average value of $V_{ERR} - V_{REF}$ now has a one-to-one correspondence with the phase error. Satisfying our previously stated condition that the clock sample the data, the PD also provides the half-rate retimed data at $D_{out1}$ and $D_{out2}$.

Fig. 11. (a) Rail-to-rail flipflop example, and (b) its simulated eye diagram.

Fig. 12. (a) Half-rate phase detector. (b) Reference generation with RZ data.

Even if using the master-slave topology of Fig. 9, the PD of Fig. 12(a) does not readily lend itself to a charge-steering implementation: since  $D_{out1}$  and  $D_{out2}$  carry RZ data [as do  $X_2$  and  $Y_2$  in Fig. 9(a)],  $D_{out1} \oplus D_{out2}$  does not yield correct information. As illustrated by the differential waveforms of  $D_{out1}$  and  $D_{out2}$  in Fig. 12(b), in each half cycle, the reset phase of one output is XORed with (and hence multiplied by) the data on the other output, thus producing a zero  $V_{REF}$ .
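The zero-reference problem can be reproduced with a very coarse behavioral model. In the sketch below, each clock period is split into two half cycles, a differential RZ signal is represented as +1 or -1 while its latch evaluates and 0 while it is reset, and the XOR is modeled as a multiplier, as the text suggests. The two-sample time grid, the signal levels, and the function names are simplifying assumptions of this illustration.

```python
def rz_stream(bits, valid_first_half):
    """Differential RZ waveform, two samples per clock period.

    The latch output is +1/-1 during its evaluation half cycle and 0 during reset.
    """
    samples = []
    for bit in bits:
        level = 1 if bit else -1
        samples += [level, 0] if valid_first_half else [0, level]
    return samples


def xor_as_product(a, b):
    """Differential XOR modeled as a sample-by-sample multiplication."""
    return [x * y for x, y in zip(a, b)]


# Two retimed RZ outputs that evaluate on opposite clock phases.
d_out1 = rz_stream([1, 0, 1, 1], valid_first_half=True)
d_out2 = rz_stream([0, 1, 1, 0], valid_first_half=False)
print("opposite phases:", xor_as_product(d_out1, d_out2))

# For contrast: a second signal that is valid during the same half cycle.
same_phase = rz_stream([0, 1, 1, 0], valid_first_half=True)
print("same phase:     ", xor_as_product(d_out1, same_phase))
```

In this model, the product of two signals that evaluate on opposite phases is identically zero, whereas signals that are valid during the same half cycle produce nonzero pulses.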

This issue is resolved by adding two more RZ latches as shown in Fig. 13(a). Even though the inputs and outputs of $L_5$ and $L_6$ are still of RZ form, $Y_1 \oplus Z_2$ and $Y_2 \oplus Z_1$ yield constant pulsewidths equal to $T_{CK}/2$ for each data transition [Fig. 13(b)]. Thus, $V_{REF1} + V_{REF2}$ serves as a proper reference to be subtracted from $V_{ERR}$. The timing issue illustrated in Fig. 13(a) persists for the $L_3$-$L_5$ and $L_4$-$L_6$ cascades to some extent, but the amplitude of the retimed data at $Y_1$ and $Y_2$ is large enough to impress correct levels onto $L_5$ and $L_6$. As a precaution, nonetheless, a "half-gate" delay realized by a pass PMOS transistor is inserted in series with the D inputs of these two latches to allow an additional 5 to 10 ps for sensing $Y_1$ and $Y_2$ before they are reset. It is important to note that the path consisting of $L_5$, $L_6$, and their XOR gates merely generates a reference pulse for phase error computation and hence is fairly insensitive to this delay. Circuit simulations suggest that the static phase error of the CDR rises from 3 ps to 6 ps if the width of this transistor is doubled, indicating the low sensitivity.

Each XOR gate is implemented as shown in Fig. 14 [15] so as to maintain systematic symmetry between the two inputs and avoid the need for rail-to-rail swings. The gain of the XOR, defined as the output voltage change divided by the input phase difference change, depends on the tail current to some extent. This current is chosen equal to 40 $\mu$A before the gain reaches diminishing returns. Note that the XOR output bandwidth is unimportant<sup>5</sup> because the subsequent voltage-to-current (V/I) converter only senses the dc content of this output.

Fig. 13. (a) Modified half-rate phase detector ($L_1$ and $L_2$ are NRZ latches and $L_3$-$L_6$, RZ latches), and (b) internal waveforms and reference generation.

Fig. 14. One half of the XOR gate with symmetric inputs.

## V. DESERIALIZER

While the half-rate PD performs one level of demultiplexing, it is typically necessary to further deserialize the data for ease of use by the subsequent processor. Moreover, the data retimed by the PD must be converted to the NRZ format at some point. These two functions are now described.

### A. Demultiplexer

We wish to demultiplex the 12.5-Gb/s data at $Y_1$ and $Y_2$ in Fig. 13(a) by means of charge-steering latches driven by a quarter-rate clock. We also prefer to avoid the cascading issue described in Fig. 5(a) to maintain the integrity of the data. Illustrated in Fig. 15(a), the idea is to exploit the quadrature outputs of a divide-by-two circuit to drive the latches. Fig. 15(b) shows the timing relationship between the clocks applied to the latches. We observe that, when CK and $CK_{1/2,I}$ go high, $L_3$ and $L_7$ enter the evaluation mode, behaving like the master-slave configuration of Fig. 9(a), even though each is realized as the RZ latch of Fig. 9(a). Similarly, when CK goes low and $CK_{1/2,Q}$ goes high, $L_4$ and $L_9$ begin to evaluate.

<sup>5</sup>The pole at the XOR output in this design is around 6 GHz, negligibly impacting the loop stability.
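At the bit level, the two demultiplexing stages amount to successive de-interleaving: the half-rate PD splits the 25-Gb/s stream into two 12.5-Gb/s streams, and the quarter-rate latches split each of those into two 6.25-Gb/s streams. The sketch below shows only this bookkeeping. The assignment of bits to particular outputs is an assumption, since the actual ordering depends on the clock phases of Fig. 15(b).

```python
def demux_1_to_2(bits):
    """Split a stream into its even-indexed and odd-indexed bits."""
    return bits[0::2], bits[1::2]


def deserialize_1_to_4(bits):
    """Two cascaded 1-to-2 stages: half-rate PD, then quarter-rate latches."""
    y1, y2 = demux_1_to_2(bits)      # two 12.5-Gb/s streams (PD outputs)
    a, c = demux_1_to_2(y1)          # 6.25-Gb/s streams derived from Y1
    b, d = demux_1_to_2(y2)          # 6.25-Gb/s streams derived from Y2
    return [a, b, c, d]


def stream_rates(input_rate):
    """Bit rate after each 1-to-2 stage."""
    return input_rate / 2.0, input_rate / 4.0


# Label the input bits 0..7 to make the routing visible.
outputs = deserialize_1_to_4(list(range(8)))
half_rate, quarter_rate = stream_rates(25e9)
print("outputs:", outputs)
print(f"rates: {half_rate / 1e9} Gb/s after the PD, {quarter_rate / 1e9} Gb/s after the DMUX")
```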

Each latch employs a width of 0.8  $\mu$ m for the tail reset switch and 4  $\mu$ m for all others along with a tail capacitance of 10 fF. The four latches consume a total of 183  $\mu$ W at a clock frequency of 6.25 GHz.

### B. Frequency Divider

The divide-by-two circuit must operate with an input frequency of 12.5 GHz and drive four inverter buffers, each having an NMOS width of 1.2  $\mu$ m and a PMOS width of 2.4  $\mu$ m. To generate quadrature outputs, the circuit must incorporate two identical stages in a feedback loop, e.g., two latches of the form shown in Fig. 16(a).<sup>6</sup> However, according to simulations, such a divider fails around 12 GHz.
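The reason two latches in a feedback loop yield quadrature outputs can be seen from an idealized behavioral model. In the sketch below, the first latch is transparent while CK is high and is driven by the inverted output of the second. The second latch is transparent while CK is low. Zero delay, ideal logic levels, and the names `i_out` and `q_out` are assumptions of this illustration, which says nothing about the speed limit discussed here.

```python
def divide_by_two(n_half_cycles):
    """Idealized two-latch divider, stepped once per half cycle of the input clock.

    Returns the two latch outputs, sampled at the end of each half cycle.
    """
    a = b = 0
    i_out, q_out = [], []
    for n in range(n_half_cycles):
        ck_high = (n % 2 == 0)
        if ck_high:
            a = 1 - b   # first latch transparent, fed by the inverted second output
        else:
            b = a       # second latch transparent, fed by the first output
        i_out.append(a)
        q_out.append(b)
    return i_out, q_out


i_out, q_out = divide_by_two(8)
print("I:", i_out)
print("Q:", q_out)
```

Each output repeats every four half cycles, i.e., every two input periods, and the second output trails the first by one half cycle of the input, which is one quarter of the output period.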

To examine the above latch's failure mechanism, consider the state depicted in Fig. 16(b), where $V_X = 0$, $V_Y = V_{DD}$, $D_{in} = 0$, $\overline{D}_{in} = V_{DD}$, and CK goes high. Two transitions must occur: $V_Y$ must fall to zero and, as a result, $V_X$ must rise to $V_{DD}$. Note that the rise in $V_X$ is critical as it provides the overdrive for the input transistor of the other latch in the loop. The fall in $V_Y$ is less important because it simply turns off one input transistor of the next latch. We observe from Fig. 16(b) that during this operation, (1) $M_4$ fights the series combination of $M_{CK}$ and $M_2$, and (2) $V_X$ rises little before $V_Y$ reaches zero. Thus, $M_3$ must be, on the one hand, strong enough to rapidly charge the capacitance at X, and, on the other hand, weak enough not to vehemently fight the series combination of $M_{CK}$ and $M_1$ (in the next half cycle). This trade-off limits the maximum toggling speed of the divider, causing failure if $V_X$ does not rise enough in $T_{CK}/2$ seconds.

The foregoing study suggests that the speed can be improved if the rise in  $V_X$  and  $V_Y$  is somehow augmented. This can be accomplished by means of NMOS source followers [Fig. 16(c)].



Fig. 15. (a) Deserializer, (b) their timing diagram.



Fig. 16. (a) Rail-to-rail latch, (b) operation of the latch, (c) new latch, (d) simulated speed of the dividers, (e) simulated power consumption of the dividers.

While increasing the latch input capacitance to some extent, each follower actively pulls up the corresponding output node, relaxing the above trade-off. In addition, the source followers provide an unclocked feedforward path, impressing the next state at X (or Y) before the clock rises and the main path is activated. This feedforward action further improves the maximum speed, but at the cost of a lower bound on the toggle rate. Fig. 16(d) plots the simulated output frequency as a function of the input frequency for the conventional and the proposed divide-by-two circuits. We note that the source followers raise the maximum speed to 14.5 GHz (while limiting the lower end to 0.4 GHz).

Another remarkable attribute of the proposed divider is that it consumes *less* power than the conventional topology does [Fig. 16(e)]. Since the source followers reduce the rise and fall times at the output nodes, the crowbar current flowing from  $V_{DD}$  to ground during transitions decreases, thereby lowering the power consumption by about 20% at 12 GHz.

### C. RZ-to-NRZ Conversion

With the data rate brought down by the deserializer to 6.25 Gb/s, the task of RZ/NRZ conversion becomes simpler. The conversion can be performed by applying the RZ data to a simple RS latch: when both inputs are, e.g., zero, the latch maintains the previous state, and when one input goes high, the state can change. However, a rail-to-rail RS latch requires that the moderate output swings of $L_7$-$L_{10}$ in Fig. 15(a) be amplified and their common-mode level be shifted so that the R and S inputs are not activated simultaneously when the differential RZ data collapses to zero.

Fig. 17. (a) RZ-to-NRZ conversion, (b) proposed comparator.

Fig. 18. (a) Overall CDR/deserializer architecture, (b) simulated transient behavior, (c) simulated recovered half-rate differential data.
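The hold-or-update behavior of the RS latch can be captured in a few lines. The sketch below treats the RZ data as a sequence of (S, R) pairs in which both inputs are low during reset. The logic levels are idealized, and the handling of the forbidden S = R = 1 case (holding the state) is an assumption of this model.

```python
def rz_to_nrz(sr_pairs, initial_state=0):
    """Behavioral RS latch: S sets, R resets, both low holds the previous state."""
    state = initial_state
    nrz = []
    for s, r in sr_pairs:
        if s and not r:
            state = 1
        elif r and not s:
            state = 0
        # s == r == 0 (RZ reset interval): keep the previous state.
        # s == r == 1 is not expected; this model also holds in that case.
        nrz.append(state)
    return nrz


# RZ encoding of the bits 1, 0, 0, 1: each bit is followed by a reset interval.
rz_input = [(1, 0), (0, 0), (0, 1), (0, 0), (0, 1), (0, 0), (1, 0), (0, 0)]
nrz_output = rz_to_nrz(rz_input)
print(nrz_output)
```

Each bit value persists through the following reset interval, which is exactly the NRZ format.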

More efficient amplification can be realized by means of (clocked) comparators. Illustrated in Fig. 17(a), the idea is to utilize the quadrature phases of the 6.25-GHz clock to drive  $L_7$  and a comparator in a master-slave fashion. When  $CK_{1/2,I}$  rises,  $L_7$  enters the evaluation mode; 40 ps later,  $CK_{1/2,Q}$  rises, allowing the comparator to regenerate to the rails.

Owing to its low power consumption, the StrongARM comparator [16] or its modified version [17]<sup>7</sup> is attractive here, but, in 65-nm technology, it does not robustly operate at 6.25 GHz. Fig. 17(b) shows the modified, faster design: the cross-coupled PMOS devices are removed, thus reducing the capacitance at the output node and improving the speed by about 25%. In the absence of these devices, the high level at the output degrades if the input differential voltage is not large enough to keep $M_1$ or $M_2$ off. This issue is not problematic here because $L_7$ in Fig. 17(a) produces a swing of more than 400 mV. According to simulations, the comparator and the RS latch in Fig. 17(a) draw a total of 130 $\mu$W at 6.25 Gb/s.

<sup>7</sup>The modified version adds reset switches to the drains of the input transistors, suppressing dynamic offsets.

Fig. 19. Locked phase noise profile for jitter calculations.

#### VI. OVERALL SYSTEM

Fig. 18(a) shows the overall CDR/deserializer architecture along with the simulated power dissipation of the blocks. The CDR loop consists of the PD described in Section IV, a V/I converter, a loop filter, and an LC VCO. For  $R_1 = 500 \Omega$ ,  $C_1 = 80 \text{ pF}$ ,  $C_2 = 8 \text{ pF}$ , and  $K_{VCO} = 1 \text{ GHz/V}$ , the loop exhibits the simulated transient behavior shown in Fig. 18(b), locking in about 50 ns. The retimed half-rate differential data at  $Y_1$  and  $Y_2$  is plotted in Fig. 18(c). The loop bandwidth is approximately 6 MHz.<sup>8</sup>

## A. VCO Interface

The VCO can draw considerable power and must therefore be designed with three considerations in mind: (1) the amount of random jitter that it introduces in the locked state, (2) the amount of load capacitance that it must drive, and (3) whether it must drive the load capacitance directly (with rail-to-rail swings) or through buffers. The relative severity of these issues depends on the frequency of operation, the jitter target, the PD clock swing and drive requirements, and the routing capacitances.

Let us begin with the first issue. Suppose the locked VCO exhibits the phase noise profile shown in Fig. 19, where  $f_{BW}$  denotes the CDR loop bandwidth. To obtain the rms jitter,  $\Delta T_j$ , we integrate the area under this plot and normalize the result to the VCO period,  $T_{CK}$ . If the declining phase noise beyond an offset of  $\pm f_{BW}$  can be approximated by  $1/f^2$ , then the total area is equal to  $4f_{BW}S_0$ . Thus,

$$\Delta T_j = \frac{\sqrt{4f_{BW}S_0}}{2\pi} T_{CK}.$$
(17)

For example, to target $\Delta T_j = 1$ ps,rms with $f_{BW} = 6$ MHz, we require $S_0$ to be less than -96 dBc/Hz. That is, the free-running VCO must provide a phase noise of less than -96 dBc/Hz at 6-MHz offset.
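The -96-dBc/Hz figure can be checked by inverting (17) for $S_0$. The sketch below does so under the assumption that $T_{CK} = 80$ ps, i.e., a 12.5-GHz half-rate clock for 25-Gb/s data; the function and argument names are illustrative and do not appear in the original.

```python
import math

def required_s0_dbc_hz(jitter_rms_s, f_bw_hz, t_ck_s):
    """Invert Eq. (17): S0 = (2*pi*dT_j/T_CK)^2 / (4*f_BW), returned in dBc/Hz."""
    phase_rms_rad = 2 * math.pi * jitter_rms_s / t_ck_s
    s0_linear = phase_rms_rad ** 2 / (4 * f_bw_hz)
    return 10 * math.log10(s0_linear)

# Assumed: T_CK = 80 ps (12.5-GHz clock). Jitter target and bandwidth are from the text.
s0 = required_s0_dbc_hz(jitter_rms_s=1e-12, f_bw_hz=6e6, t_ck_s=80e-12)
print(f"S0 = {s0:.1f} dBc/Hz")
```

Under these assumptions the result is within a fraction of a dB of the -96 dBc/Hz quoted above.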

Fig. 20. Two scenarios of driving the load by the VCO.

Fig. 21. Die photograph.

It is instructive to estimate the minimum VCO supply current, $I_{SS}$, that yields the requisite phase noise. From [18], [19], we express the free-running phase noise of an LC VCO with one (NMOS or PMOS) cross-coupled pair as

$$S(\Delta\omega) = \frac{\pi^2}{R_P} \frac{kT}{I_{SS}^2} \left(\frac{3}{8}\gamma + 1\right) \frac{\omega_0^2}{4Q^2 \Delta \omega^2}$$
(18)

where $R_P$ denotes the equivalent parallel resistance of the differential tank at resonance, and $\gamma$ the noise coefficient of MOSFETs. For a peak-to-peak single-ended swing, $2R_P I_{SS}/\pi$, of 1 V, $\gamma = 1$, and Q = 8, we obtain $I_{SS} \approx 6.3 \,\mu\text{A}$, $R_P \approx 250 \,\text{k}\Omega$, and hence a tank inductance of nearly 400 nH! In other words, the phase noise specification is much more relaxed than the other two issues mentioned above.
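The tank values that accompany the quoted bias current can be reproduced from the swing expression $2R_P I_{SS}/\pi$ and the relation $R_P = QL\omega$. The sketch below takes $I_{SS} = 6.3\ \mu$A as given (it does not re-derive it from (18)) and assumes an oscillation frequency of 12.5 GHz; the function name is hypothetical.

```python
import math

def tank_from_bias(i_ss_a, swing_pp_v, q, f0_hz):
    """Return (R_P, L) implied by a bias current, a single-ended swing, and Q."""
    r_p = math.pi * swing_pp_v / (2 * i_ss_a)    # from swing = 2*R_P*I_SS/pi
    l_tank = r_p / (q * 2 * math.pi * f0_hz)     # from R_P = Q*L*omega
    return r_p, l_tank

# I_SS, swing, and Q are from the text; f0 = 12.5 GHz is an assumption.
r_p, l_tank = tank_from_bias(i_ss_a=6.3e-6, swing_pp_v=1.0, q=8, f0_hz=12.5e9)
print(f"R_P = {r_p / 1e3:.0f} kOhm, L = {l_tank * 1e9:.0f} nH")
```

An inductance of this size is impractical on chip, which is why the load-driving considerations, rather than phase noise, set the VCO current.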

To address the second and the third issues, we note that the clock in Fig. 18(a) must drive six latches, the divider, and about 45  $\mu$ m of interconnects in the layout – a total of approximately 270 fF. We consider the two scenarios depicted in Fig. 20. The minimum power that two buffers (for CK and  $\overline{CK}$ ) would consume to drive the 270-fF capacitance is equal to  $2f_{CK}CV_{DD}^2 =$ 6.75 mW. (An LC buffer would draw about 2 mW but at the cost of larger area.) It is therefore highly desirable to avoid these buffers and absorb the capacitance into the VCO tank. Allowing another 50 fF for the VCO and inductor capacitances, we choose a differential inductance of 1 nH, obtaining  $R_P = QL\omega \approx$ 630  $\Omega$  and hence  $I_{SS} = 2.5$  mA for a 1-V<sub>pp</sub> single-ended swing. The use of both PMOS and NMOS cross-coupled pairs in the VCO permits a twofold reduction in this current, leading to a power consumption of 1.25 mW. The actual design draws 1.4 mA and employs MOS varactors along with a two-bit capacitor bank for tuning. Despite the constant capacitance loading the VCO, the measured tuning range is from 12.25 GHz to 13.59 GHz.
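The numbers in this comparison can be retraced with a few lines of arithmetic. The sketch below assumes a clock frequency of 12.5 GHz and carries over Q = 8 from the preceding phase noise example; all variable names are illustrative.

```python
import math

F_CK = 12.5e9      # assumed half-rate clock frequency, Hz
VDD = 1.0          # supply voltage, V
C_LOAD = 270e-15   # total clock load from the text, F
Q = 8              # tank quality factor, carried over from the earlier example
L_DIFF = 1e-9      # chosen differential inductance, H
SWING_PP = 1.0     # single-ended peak-to-peak swing, V

# Scenario with buffers: two rail-to-rail buffers (CK and its complement).
buffer_power_w = 2 * F_CK * C_LOAD * VDD ** 2

# Scenario without buffers: the load is absorbed into the tank.
r_p = Q * L_DIFF * 2 * math.pi * F_CK        # R_P = Q*L*omega
i_ss = math.pi * SWING_PP / (2 * r_p)        # from swing = 2*R_P*I_SS/pi
i_ss_cmos = i_ss / 2                         # NMOS + PMOS pairs halve the current
vco_power_w = i_ss_cmos * VDD

print(f"buffers: {buffer_power_w * 1e3:.2f} mW")
print(f"R_P = {r_p:.0f} Ohm, I_SS = {i_ss * 1e3:.2f} mA")
print(f"VCO with both pairs: {vco_power_w * 1e3:.2f} mW")
```

Under these assumptions the buffered scenario costs several times the power of the entire buffer-less VCO, which is the basis for absorbing the load into the tank.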

<sup>8</sup>The periodic behavior of $V_{cont}$ is due to the PRBS length of $2^7 - 1$.



Fig. 22. Test setup.

The key idea proposed here is that it is generally advantageous to omit the buffers and utilize their power consumption in the VCO itself. However, the absence of buffers after the VCO raises two concerns: (1) The VCO may experience coupling from the input data through the PD latches [14]. Fortunately, the large capacitance seen at each output node of the VCO suppresses this effect, yielding a (simulated) peak-to-peak jitter of 300 fs due to this coupling. (2) The interconnect resistance and the MOS gate resistance may degrade the tank Q. According to simulations, this effect raises the VCO phase noise by 0.07 dB, pointing to the direct VCO/PD interface as the preferable approach in CDR design.

It is worth noting that, if two buffers consuming a total of 8 mW were to follow the VCO, the power consumption of this CDR circuit would still be substantially less than that of prior current-steering designs. Charge steering therefore maintains its advantage even in this case.

#### VII. EXPERIMENTAL RESULTS

The CDR/deserializer prototype has been fabricated in TSMC's 65-nm digital CMOS technology and characterized with a 1-V supply. Fig. 21 shows a photograph of the circuit's core. The prototype (excluding the output 50- $\Omega$  buffers) draws 4.97 mW, of which 1.4 mW is consumed by the VCO, 1.3 mW by the PD, 1.24 mW by the divider, and 0.43 mW by the V/I converter.

The chip has been directly mounted on a printed-circuit board, with the input and output connections provided by high-speed probes. A Centellax bit error rate (BER) tester drives the circuit with a single-ended swing of 300 mV<sub>pp</sub> and captures its outputs. Fig. 22 shows the test setup: four Centellax PRBS generators are multiplexed to generate a 25-Gb/s stream, which is then applied to the device under test (DUT). The demultiplexed output of the DUT is fed back to the Centellax TG1B1-A to measure the BER.

Fig. 23(a) shows the recovered half-rate clock spectrum, revealing a loop bandwidth of about 6 MHz. Fig. 23(b) shows the measured phase noise with a PRBS length of  $2^7 - 1$  and  $2^{15} - 1$ . The area under the profiles from 100-Hz to 1-GHz offset yields an rms jitter of 1.13 ps and 1.52 ps, respectively. The BER is less than  $10^{-12}$  for a PRBS length of  $2^{15} - 1$ . Fig. 23(c) shows the measured quarter-rate recovered data.

The jitter transfer and tolerance of the prototype have also been measured and plotted in Fig. 24. The former indicates a loop bandwidth of about 6 MHz and the latter a tolerance of 0.5 $UI_{pp}$ at jitter frequencies as high as 5 MHz. To study the robustness of the circuit, the jitter tolerance is also measured with a 1.1-V supply, yielding similar results.

Fig. 23. (a) Recovered clock spectrum, (b) measured phase noise, (c) recovered data eye diagram [vertical scale: 100 mV/div, horizontal scale: 50 ps/div].

Fig. 24. Measured jitter transfer and jitter tolerance.

TABLE I Performance Summary

| | [3] | CDR in [4] | This Work |
|---|---|---|---|
| Data Rate | 25 Gb/s | 25 Gb/s | 25 Gb/s |
| DEMUX Ratio | 1:2 | 1:1 | 1:4 |
| Power Consumption | 98 mW (excludes off-chip VCO) | 99 mW (excludes 2:5 DMUX) | 4.97 mW |
| RMS Jitter of Recovered Clock | N/A | 1.02 ps | 1.52 ps |
| Jitter Tolerance | 0.35 UI<sub>PP</sub> at 100 MHz | 0.4 UI<sub>PP</sub> at 10 MHz | 0.5 UI<sub>PP</sub> at 5 MHz |
| Technology | 90 nm | 65 nm | 65 nm |
| Supply Voltage | 1.1 V | 1.2 V | 1 V |

Table I compares the performance of this work with that of two other CMOS examples from prior art.
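The power figures in Table I can be normalized to the data rate to obtain an energy efficiency in mW/Gb/s, the metric used in the Introduction. The sketch below is a simple illustration with hypothetical variable names; note that the prior-art figures exclude an off-chip VCO in [3] and the DMUX in [4], and the demultiplexing ratios differ, so the comparison is not strictly like for like.

```python
DATA_RATE_GBPS = 25.0

# Power consumption in mW, as listed in Table I.
power_mw = {"[3]": 98.0, "CDR in [4]": 99.0, "This work": 4.97}

# Efficiency in mW/Gb/s and power ratio relative to this work.
efficiency = {name: p / DATA_RATE_GBPS for name, p in power_mw.items()}
ratio = {name: p / power_mw["This work"] for name, p in power_mw.items()}

for name in power_mw:
    print(f"{name:>10}: {efficiency[name]:.2f} mW/Gb/s, {ratio[name]:.1f}x this work")
```

The ratios of roughly 20 are consistent with the twenty-fold power reduction claimed for the prototype.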

#### VIII. CONCLUSION

The use of charge steering can dramatically reduce the power consumption of high-speed circuits, affording a design style faster than rail-to-rail logic and less power-hungry than current steering. This paper describes a CDR/deserializer that incorporates charge steering in phase detection and demultiplexing along with a new frequency divider and comparator. The circuit and architecture techniques culminate in a prototype that consumes about 20 times less power than prior art.

## ACKNOWLEDGMENT

The authors gratefully thank the TSMC University Shuttle Program for chip fabrication.



#### REFERENCES

- [1] F. O'Mahony et al., "The future of electrical I/O for microprocessors," in Proc. Int. Symp. VLSI-DAT, Apr. 2009, pp. 31–34.
- [2] K. Fukuda et al., "A 12.3-mW 12.5-Gb/s complete transceiver in 65-nm CMOS process," *IEEE J. Solid-State Circuits*, vol. 45, no. 12, pp. 2838–2849, Dec. 2010.
- [3] C. Kromer et al., "A 25-Gb/s CDR in 90-nm CMOS for high-density interconnects," *IEEE J. Solid-State Circuits*, vol. 41, no. 12, pp. 2921–2929, Dec. 2006.
- [4] K. Yu and J. Lee, "A 2 × 25-Gb/s receiver with 2:5 DMUX for 100-Gb/s Ethernet," *IEEE J. Solid-State Circuits*, vol. 45, no. 11, pp. 2421–2432, Nov. 2010.
- [5] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CMOS CDR/deserializer," in Symp. VLSI Circuits Dig. Tech. Papers, 2012, pp. 113–114.
- [6] P. J. Lim and B. A. Wooley, "An 8-bit 200-MHz BiCMOS comparator," *IEEE J. Solid-State Circuits*, vol. 25, no. 2, pp. 192–199, Feb. 1990.
- [7] B. Razavi and B. A. Wooley, "Design techniques for high-speed, high-resolution comparators," *IEEE J. Solid-State Circuits*, vol. 27, no. 12, pp. 1916–1926, Dec. 1992.
- [8] J. A. Pretorius *et al.*, "Analysis and design optimization of Domino CMOS logic with application to standard cells," *IEEE J. Solid-State Circuits*, vol. 20, no. 4, pp. 523–530, Apr. 1985.
- [9] B. Razavi, Design of Analog CMOS Integrated Circuits. New York, NY, USA: McGraw-Hill, 2001.
- [10] L. W. Couch, II, Digital and Analog Communication Systems. Englewood Cliffs, NJ, USA: Prentice-Hall, 1997.
- [11] S. Ibrahim and B. Razavi, "Low-power CMOS equalizer design for 20-Gb/s systems," *IEEE J. Solid-State Circuits*, vol. 46, no. 6, pp. 1321–1336, Jun. 2011.
- [12] J. Savoj and B. Razavi, "A 10-Gb/s CMOS clock and data recovery circuit with a half-rate linear phase detector," *IEEE J. Solid-State Circuits*, vol. 36, no. 5, pp. 761–768, May 2001.
- [13] C. R. Hogge, "A self-correcting clock recovery circuit," J. Lightw. Technol., vol. LT-3, pp. 1312–1314, Dec. 1985.
- [14] B. Razavi, Design of Integrated Circuits for Optical Communications. New York, NY, USA: McGraw-Hill, 2003.
- [15] B. Razavi et al., "Design techniques for low-voltage high-speed digital bipolar circuits," *IEEE J. Solid-State Circuits*, vol. 29, no. 3, pp. 332–339, Mar. 1994.
- [16] D. W. Dobberpuhl, "Circuit and technology for Digital's StrongARM and ALPHA microprocessors [CMOS technology]," in *Proc. 17th IEEE Conf. Advanced Research in VLSI*, 1997, pp. 2–11.
- [17] Y. T. Wang and B. Razavi, "An 8-bit 150-MHz CMOS A/D converter," *IEEE J. Solid-State Circuits*, vol. 35, no. 3, pp. 308–317, Mar. 2000.
- [18] P. Andreani et al., "A study of phase noise in Colpitts and LC-tank oscillators," *IEEE J. Solid-State Circuits*, vol. 40, no. 5, pp. 1107–1118, May 2005.
- [19] A. Mazzanti and P. Andreani, "Class-C harmonic CMOS VCOs, with a general result on phase noise," *IEEE J. Solid-State Circuits*, vol. 43, no. 12, pp. 2716–2729, Dec. 2008.



**Jun Won Jung** received the B.S. degree in electrical engineering from Seoul National University, Seoul, Korea, in 2004, and the M.S. and Ph.D. degrees in electrical engineering from the University of California, Los Angeles (UCLA), CA, USA, in 2008 and 2012, respectively.

Since September 2012, he has been pursuing research as a postdoc at the Communication Circuits Laboratory, UCLA. His research interests include analog/mixed-signal IC design and high-speed wireline transceivers.

Dr. Jung received the Best Student Paper Award for the 2012 Symposium on VLSI Circuits.



**Behzad Razavi** (F'03) received the B.S.E.E. degree from Sharif University of Technology, Tehran, Iran, in 1985, and the M.S.E.E. and Ph.D.E.E. degrees from Stanford University, Stanford, CA, USA, in 1988 and 1992, respectively.

He was with AT&T Bell Laboratories and Hewlett-Packard Laboratories until 1996. Since 1996, he has been Associate Professor and subsequently Professor of electrical engineering at the University of California, Los Angeles, CA, USA. His current research includes wireless transceivers, frequency synthesizers, phase-locking and clock recovery for high-speed data communications, and data converters. He was an Adjunct Professor at Princeton University from 1992 to 1994, and at Stanford University in 1995.

Prof. Razavi served on the Technical Program Committees of the International Solid-State Circuits Conference (ISSCC) from 1993 to 2002 and VLSI Circuits Symposium from 1998 to 2002. He has also served as Guest Editor and Associate Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and International Journal of High Speed Electronics. He received the Beatrice Winner Award for Editorial Excellence at the 1994 ISSCC, the best paper award at the 1994 European Solid-State Circuits Conference, the best panel award at the 1995 and 1997 ISSCC, the TRW Innovative Teaching Award in 1997, the best paper award at the IEEE Custom Integrated Circuits Conference in 1998, and the McGraw-Hill First Edition of the Year Award in 2001. He was the co-recipient of both the Jack Kilby Outstanding Student Paper Award and the Beatrice Winner Award for Editorial Excellence at the 2001 ISSCC. He received the Lockheed Martin Excellence in Teaching Award in 2006, the UCLA Faculty Senate Teaching Award in 2007, and the CICC Best Invited Paper Award in 2009 and 2012. He was the co-recipient of the 2012 VLSI Circuits Symposium Best Student Paper Award. He was also recognized as one of the top 10 authors in the 50-year history of ISSCC. Professor Razavi received the IEEE Donald Pederson Award in Solid-State Circuits in 2012.

Prof. Razavi is a Fellow of IEEE, has served as an IEEE Distinguished Lecturer, and is the author of *Principles of Data Conversion System Design* (IEEE Press, 1995), *RF Microelectronics* (Prentice Hall, 1998, 2012) (translated to Chinese, Japanese, and Korean), *Design of Analog CMOS Integrated Circuits* (McGraw-Hill, 2001) (translated to Chinese, Japanese, and Korean), *Design of Integrated Circuits for Optical Communications* (McGraw-Hill, 2003), and *Fundamentals of Microelectronics* (Wiley, 2006) (translated to Korean and Portuguese). He is also the editor of *Monolithic Phase-Locked Loops and Clock Recovery Circuits* (IEEE Press, 1996), and *Phase-Locking in High-Performance Systems* (IEEE Press, 2003).