# A 56-Gb/s 8-mW PAM4 CDR/DMUX With High Jitter Tolerance

Guanrong Hou and Behzad Razavi, *Fellow, IEEE*

Abstract—The demand for low-power wireline circuits has motivated extensive work on novel circuit solutions. This article describes a one-eighth-rate clock and data recovery (CDR) circuit and a demultiplexer (DMUX) for processing four-level pulse-amplitude modulation (PAM4) signals in receivers (RXs). Detecting both major and minor data transitions, the proposed architecture can achieve a wider loop bandwidth (BW), suppressing oscillator phase noise and improving the jitter tolerance. Fabricated in 28-nm CMOS technology, the prototype provides a jitter transfer BW of 160 MHz with a tolerance of 1 UI at 10 MHz.

*Index Terms*—Background offset cancellation, bang-bang phase detector (BBPD), clock and data recovery (CDR), four-level pulse-amplitude modulation (PAM4), jitter tolerance.

### I. INTRODUCTION

HIGH-SPEED wireline transceivers have embraced four-level pulse-amplitude modulation (PAM4) communication for its bandwidth (BW) efficiency [1]–[4]. While affording a longer symbol period than non-return-to-zero (NRZ) data, PAM4 signaling nonetheless presents a multitude of circuit design challenges, especially in the receiver (RX). For this reason, typical RXs have opted for front-end analog-to-digital converters (ADCs) and a significant amount of signal processing in the digital domain [1]–[4]. As explained in Section II, such ADC-based solutions face their own issues.

Manuscript received 20 October 2021; revised 27 December 2021 and 3 February 2022; accepted 17 February 2022. Date of publication 10 March 2022; date of current version 26 August 2022. This article was approved by Associate Editor Jaeha Kim. This work was supported in part by Realtek Semiconductor Corporation and in part by the Taiwan Semiconductor Manufacturing Company (TSMC) University Shuttle Program. (Corresponding author: Guanrong Hou.)

Guanrong Hou was with the Department of Electrical and Computer Engineering, University of California at Los Angeles, Los Angeles, CA 90095 USA. He is now with MaxLinear, Inc., Carlsbad, CA 92008 USA (e-mail: guanronghou@ucla.edu).

Behzad Razavi is with the Department of Electrical and Computer Engineering, University of California at Los Angeles (UCLA), Los Angeles, CA 90095 USA (e-mail: razavi@ee.ucla.edu).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2022.3153695.

Digital Object Identifier 10.1109/JSSC.2022.3153695

Fig. 1. (a) Module-to-module link. (b) Chip-to-chip link. (c) Die-to-die link.

An alternative possibility is an "analog" PAM4 RX, where the three principal functions, namely linear equalization, clock and data recovery (CDR), and decision-feedback equalizer (DFE), are realized in the analog domain. Motivated by the potentially lower power consumption and complexity of such an approach, this article deals with the CDR circuit. In such a scenario, the continuous-time linear equalizer (CTLE) and the DFE compensate for the channel imperfections, presenting a moderately open eye to the CDR. We propose a CDR/demultiplexer (DMUX) architecture incorporating a PAM4 phase detector (PD), a low-power comparator, and a background offset cancellation technique [5]. Fabricated in 28-nm CMOS technology, an experimental prototype provides a threefold improvement in power efficiency and a fivefold increase in the jitter tolerance BW.

Sections II and III provide the background for this work. Section IV examines the effect of data transition density on CDR behavior, and Section V describes the proposed CDR/DMUX architecture. Sections VI–X deal with the design of the building blocks. Section XI presents the experimental results.

#### **II. GENERAL CONSIDERATIONS**

Wireline systems incorporate various types of communication links. Shown in Fig. 1 are three examples: 1) module-to-module links, also called "long reach" (LR); 2) chip-to-chip links, also called "medium reach" (MR); and 3) die-to-die links, also called "ultrashort reach" (USR) [6]. The Nyquist-frequency loss in these media varies from roughly 30 dB for LR to 2 dB for USR. Today's commonly accepted practice is to employ ADC-based RXs for LR.

0018-9200 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

However, we recognize that the communication hierarchy depicted in Fig. 1 consists of many more USR links than LR links, by, for example, a factor of 10 [7]. Thus, the power consumption of USR transceivers must be proportionally lower. For example, with a typical 56-Gb/s ADC-based RX drawing 300 mW [8], we expect that a USR link should burn less than 30 mW, a formidable challenge.

In addition to consuming high power, ADC-based solutions entail two other issues as well. First, to maintain a reasonable effective number of bits (ENOB), the ADC requires a low clock jitter. For example, a 7-bit converter sampling at 56 GHz incurs 3 dB of ENOB degradation if its jitter exceeds 40 fs<sub>rms</sub> [9]. By comparison, a 56-Gb/s analog CDR circuit provides acceptable performance with clock jitters as high as 1 ps<sub>rms</sub>.
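The jitter requirement quoted above can be sanity-checked with the standard aperture-jitter limit, SNR = -20 log10(2π f_in σ_j). The Python sketch below is only an illustration under assumptions that the text does not state: a full-scale sinusoidal input at the 28-GHz Nyquist frequency and ideal quantization noise of 6.02N + 1.76 dB. All function names are hypothetical.

```python
import math

def jitter_limited_snr_db(f_in_hz, sigma_j_s):
    """SNR (dB) of a full-scale sinusoid at f_in sampled with rms clock jitter sigma_j."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * sigma_j_s)

def quantization_snr_db(n_bits):
    """Ideal quantization-limited SNR (dB) of an n-bit converter."""
    return 6.02 * n_bits + 1.76

def combined_snr_db(*snrs_db):
    """Combine independent noise contributions given as SNRs in dB."""
    noise_power = sum(10.0 ** (-s / 10.0) for s in snrs_db)
    return -10.0 * math.log10(noise_power)

# Assumed scenario: 7-bit ADC, 28-GHz input tone, 40-fs rms jitter.
snr_q = quantization_snr_db(7)
snr_j = jitter_limited_snr_db(28e9, 40e-15)
degradation_db = snr_q - combined_snr_db(snr_q, snr_j)
print(f"quantization SNR = {snr_q:.1f} dB, jitter SNR = {snr_j:.1f} dB")
print(f"SNR degradation  = {degradation_db:.1f} dB")
```

Under these assumptions the jitter noise is comparable to the quantization noise, so the overall SNR drops by roughly 3 dB, in line with the figure quoted from [9].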

Second, the ADC and the digital signal processor suffer from a long latency, dictating a narrow CDR BW so as to avoid jitter peaking. As an example, suppose a BW of 100 MHz is desired. Denoting the CDR loop latency by $\Delta T$, we note that the loop transmission is multiplied by $\exp(-s\Delta T) \approx 1 - s\Delta T$. The resulting right-half-plane zero, $f_z$, degrades the phase margin and must remain about one decade beyond the BW

$$f_z \approx \frac{1}{2\pi \,\Delta T} > 1 \text{ GHz.} \tag{1}$$

It follows that $\Delta T < 160$ ps, a difficult constraint to meet. Simulations of a CDR loop using a Hogge PD and a charge pump (CP) indicate that $\Delta T$ values of 160 and 320 ps yield jitter peaking of 1.4 and 1.8 dB, respectively, for a nominal BW of 100 MHz.
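The latency bound in (1) follows directly from placing the right-half-plane zero one decade above the loop BW. A minimal Python sketch of this arithmetic is given below; the function name and the `zero_to_bw_ratio` parameter are illustrative choices, not part of the original design flow.

```python
import math

def max_loop_latency_s(loop_bw_hz, zero_to_bw_ratio=10.0):
    """Largest latency that keeps f_z = 1/(2*pi*dT) at least
    zero_to_bw_ratio times the loop bandwidth."""
    f_z_min_hz = zero_to_bw_ratio * loop_bw_hz
    return 1.0 / (2.0 * math.pi * f_z_min_hz)

dt_max = max_loop_latency_s(100e6)  # 100-MHz BW, zero at 1 GHz or beyond
print(f"maximum loop latency = {dt_max * 1e12:.0f} ps")
```

For a 100-MHz BW the bound evaluates to about 159 ps, which the text rounds to 160 ps.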

### **III. CDR TRADEOFFS**

Typical CDR circuits must deal with two tradeoffs. Considering a phase-locked system and noting that the loop BW is proportional to  $I_P K_{VCO}$ , we observe that the CP imperfections perturb the voltage-controlled oscillator (VCO) in every phase comparison instant, creating pattern-dependent jitter. As the CP current,  $I_P$ , or the VCO gain,  $K_{VCO}$ , is raised so as to enlarge the BW, these imperfections lead to a greater jitter. The first tradeoff is thus between VCO phase noise suppression and pattern-dependent jitter.

The second tradeoff is that between the jitter tolerance and pattern-dependent jitter, again through the BW. As we increase  $I_P K_{VCO}$ , both of these quantities rise.

From these thoughts, we are motivated to increase the loop gain by raising the inherent gain of the PD,  $K_{PD}$ , so as to obtain greater VCO noise suppression and jitter tolerance without exacerbating the pattern-dependent jitter. Explored in Section IV, this idea forms the foundation of our proposed PD topology.

# IV. EFFECT OF DATA TRANSITION DENSITY ON CDR BEHAVIOR

Most critical aspects of CDR circuits depend on the input data transition density,  $\alpha_d$ . It is readily observed that the PD gain is proportional to  $\alpha_d$ : the more transitions the PD receives in unit time, the greater is the rate of updates that it provides to the succeeding stages.

Fig. 2. PAM4 major transitions (in black) and minor transitions (in gray).

As we strive for a higher $\alpha_d$, the performance of the CDR circuit improves along several dimensions. First, a greater PD gain allows a wider loop BW. To appreciate this point, let us write the transfer function of a type-II CDR as

$$H(s) = \frac{2\zeta\omega_n s + \omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2} \tag{2}$$

where  $\zeta$  and  $\omega_n$  are the damping factor and the natural frequency, respectively. If  $\zeta^2 \gg 1$ , then the zero and the first pole approximately coincide, yielding a first-order response

$$H(s) \approx \frac{2\zeta \omega_n s}{s^2 + 2\zeta \omega_n s}. \tag{3}$$

In a first-order feedback system, the closed-loop BW is proportional to the loop gain. That is

$$2\zeta \omega_n \propto K_{\rm PD}. \tag{4}$$

It is important to recognize that, in a CP phase-locked environment, increasing the PD gain is more advantageous than raising the CP current or the VCO gain. As explained in Section III, the pattern-dependent jitter is proportional to the CP imperfections and  $K_{\rm VCO}$  but not  $K_{\rm PD}$ . The higher loop gain afforded by a greater  $K_{\rm PD}$  eases both of the tradeoffs described in Section III.

The second improvement arising from a higher  $\alpha_d$  concerns the "droop" in the loop filter's voltage in the absence of data transitions. Another source of deterministic jitter, this droop is less pronounced if more transitions are detected in unit time.

The foregoing observations suggest that it is desirable for a PAM4 PD to respond to all of the data transitions. Prior work, however, has relied on only a few [10]–[14]. Depicted in Fig. 2, PAM4 data exhibit two major transitions (in black) and ten minor ones (in gray). We expect a PD that processes all of the edges to exhibit a sixfold increase in gain compared to one that senses only the major transitions. This trend is readily verified by behavioral CDR simulations. As illustrated in Fig. 3, both the jitter transfer and the jitter tolerance display proportionally wider BWs as the PD is so modified as to detect all transitions.
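The transition counts above can be reproduced by enumerating all ordered pairs of distinct PAM4 levels. The Python sketch below assumes that a "major" transition is one between the two outermost levels and that all symbols are equiprobable, so that the PD gain scales with the number of detected transition types; both points are interpretations of Fig. 2 rather than statements from the text.

```python
from itertools import product

LEVELS = (0, 1, 2, 3)  # PAM4 level indices, lowest to highest

def classify_transitions(levels=LEVELS):
    """Count (major, minor) transitions among ordered pairs of distinct levels.

    A transition is counted as major only if it spans the two outermost levels.
    """
    outer = {min(levels), max(levels)}
    major = minor = 0
    for a, b in product(levels, repeat=2):
        if a == b:
            continue  # no transition between identical symbols
        if {a, b} == outer:
            major += 1
        else:
            minor += 1
    return major, minor

major, minor = classify_transitions()
gain_ratio = (major + minor) / major
print(major, minor, gain_ratio)
```

The enumeration yields 2 major and 10 minor transitions, giving the sixfold gain ratio mentioned above.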

Nevertheless, such an improvement entails two issues: 1) the PD signal path must be sufficiently linear so as to extract meaningful information from the PAM4 signal and 2) the PD can be potentially more complex and power-hungry. We address these points in Sections V and VI.

## V. PROPOSED CDR/DMUX ARCHITECTURE

Shown in Fig. 4, the proposed CDR/DMUX architecture employs a PAM4 PD, up/down logic, a CP, a loop filter, a 28-GHz LC VCO, a phase generation block, and a data extraction circuit. The PD consists of eight transition detector units, which collectively compute the phase error between $D_{in}$ and the recovered clock. The input data also drives eight data extraction units, yielding demultiplexed NRZ outputs at 3.5 Gb/s. The loop filter and the CP are programmable.

Fig. 3. (a) Simulated jitter transfer. (b) Simulated jitter tolerance.

Fig. 4. Proposed CDR/DMUX architecture.



Fig. 5. Phase detection for (a) NRZ data and (b) PAM4 data.

The proposed architecture merits a few remarks. First, the eight transition detector units within the PD can draw high power and present a high capacitance to  $D_{in}$ . It is therefore desirable to use small transistors in the PD circuit (see Section VI). Second, the data extraction unit also provides information to the up/down logic to avoid a certain ambiguity that otherwise causes failure (see Section VIII). Third, the PD and up/down logic provide bang-bang operation, introducing jitter peaking for wide loop BWs [15]. Fourth, by virtue of the wide loop BW (at least 25 MHz), the VCO phase noise is greatly suppressed, allowing the oscillator to draw only 0.9 mW. Fifth, the phase generation circuit must drive eight PAM4 transition detector units and eight data extraction units, potentially consuming significant power (see Section X).

# VI. PAM4 PHASE DETECTOR

## A. Basic Idea

We wish to develop a bang-bang PD (BBPD) that responds to all of the PAM4 transitions. Let us first consider the NRZ counterpart, i.e., the Alexander PD [16]. Illustrated conceptually in Fig. 5(a), the circuit takes three consecutive samples of $D_{in}$ and decides the clock is late if $V_A \neq V_E = V_B$ or early if $V_A = V_E \neq V_B$. More generally, we can say that the PD must determine whether $V_E$ is closer to $V_A$ or to $V_B$; an "analog" PD must then form $V_E - V_A$ and $V_B - V_E$ and compare these differences.

We surmise that this idea can be extended to PAM4 signals as well. As shown in Fig. 5(b), the early-late phase relationship between $D_{in}$ and CK is still available from the "Euclidean distance" between the PAM4 constellation levels; we simply need to determine whether $V_E - V_A < V_B - V_E$ or $V_E - V_A > V_B - V_E$, or, equivalently, whether $V_A + V_B - 2V_E$ is positive or negative. To this end, we require four operations: 1) sample $V_A$, $V_E$, and $V_B$; 2) align the samples in time; 3) form $V_{sum} = V_A + V_B - 2V_E$; and 4) detect the polarity of $V_{sum}$.
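The four operations can be illustrated with a toy behavioral model. The Python sketch below assumes a linear ramp between two PAM4 levels over one UI and a timing error `tau_ui` expressed as a fraction of the UI (positive meaning a late clock); the ramp model, the level values, and the function names are illustrative assumptions and not the circuit realization.

```python
def edge_sample(v_a, v_b, tau_ui):
    """Edge sample V_E on a linear ramp from v_a to v_b.

    tau_ui is the clock timing error in UI, limited to +/-0.5 UI;
    tau_ui = 0 places V_E exactly midway between v_a and v_b.
    """
    tau = max(-0.5, min(0.5, tau_ui))
    return 0.5 * (v_a + v_b) + (v_b - v_a) * tau

def v_sum(v_a, v_e, v_b):
    """PD sum whose polarity carries the early/late information."""
    return v_a + v_b - 2.0 * v_e

# Rising minor transition between two assumed PAM4 levels (-1 -> +3).
v_a, v_b = -1.0, 3.0
late = v_sum(v_a, edge_sample(v_a, v_b, +0.1), v_b)
early = v_sum(v_a, edge_sample(v_a, v_b, -0.1), v_b)
aligned = v_sum(v_a, edge_sample(v_a, v_b, 0.0), v_b)
print(late, early, aligned)
```

In this model the sum is zero when the edge sample falls midway between the two data samples, and its sign flips with the direction of the timing error; the example covers a rising edge only.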



Fig. 6. (a) High-level realization of the PD core and (b) its waveforms (SS: sample-and-sum circuit).



Fig. 7. Sample-and-sum stage.

While intuitively appealing, the proposed PAM4 PD concept entails three issues: namely, nonlinearity, early-late ambiguity, and dc offsets. We deal with these effects in Section VI-B.

### B. PD Circuit Realization

For a symbol rate of 28 GBd, it is difficult to perform the above four operations at full speed in 28-nm technology. We instead turn to interleaving and sample $D_{in}$ by 16 clock phases, $CK_1-CK_{16}$, each running at 3.5 GHz. Fig. 6 depicts the high-level realization and the waveforms. Each "sample-and-sum" stage, $SS_j$, is driven by three overlapping clock phases to capture $V_A$, $V_E$, and $V_B$, and produce $V_{sumj}$. This timing scheme offers two benefits: 1) it naturally aligns the three samples from $t_2$ to $t_3$ and 2) it allows sufficient time, $\Delta T = 107$ ps, for forming the sum. Every input transition therefore generates a nonzero $V_{sumj}$ if the data and the clocks are not properly aligned. These results must be sliced and logically combined before reaching the CP.

We should point out that the alignment and sum formation time interval,  $\Delta T$ , is the principal bottleneck here. If, for example, quarter-rate sampling is used,  $\Delta T$  drops to 35.7 ps, leaving little margin for these operations. On the other hand, the parallelism cannot be indefinitely increased due to concerns including capacitive loading at  $D_{in}$ . In this design, the capacitance seen by  $D_{in}$  is about 65 fF.
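Both values of $\Delta T$ quoted above are consistent with a simple timing model, sketched below in Python. The model is an assumption rather than something stated in the text: each clock phase is taken to have a 50% duty cycle, so each sample is held for half of the clock period, and the three samples $V_A$, $V_E$, and $V_B$ are taken over a span of 1 UI, leaving the remainder of the hold time as the common alignment window.

```python
def alignment_window_s(baud_hz, rate_divisor, duty=0.5):
    """Overlap of the three hold intervals for 1/rate_divisor-rate sampling.

    Assumed model: hold time = duty * clock period, and the three samples
    span one UI, so the common window is hold time minus one UI.
    """
    ui = 1.0 / baud_hz
    hold = duty * rate_divisor * ui
    return hold - ui

eighth_rate = alignment_window_s(28e9, 8)   # this design
quarter_rate = alignment_window_s(28e9, 4)  # hypothetical alternative
print(f"1/8 rate: {eighth_rate * 1e12:.1f} ps, 1/4 rate: {quarter_rate * 1e12:.1f} ps")
```

With a 28-GBd symbol rate, the model gives 3 UI (about 107 ps) for one-eighth-rate sampling and 1 UI (about 35.7 ps) for quarter-rate sampling.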

Fig. 8. Effect of $G_m$ stage compression.

Fig. 9. Simulated PD characteristics with a behavioral nonlinear $G_m$ model.

Fig. 7 shows the single-ended view of the sample-and-sum stage. The circuit provides $V_X = G_{m1}(V_A + V_B - 2V_E) + G_{m2}V_{cal}$, where $V_{cal}$ calibrates the offset (see Section IX). Passive samplers afford high linearity for the PAM4 input, but the $G_m$ cells must also be sufficiently linear. Specifically, when $D_{\rm in}$ reaches its maximum single-ended swing of 300 mV, these cells must not compress significantly. In order to compute the maximum tolerable nonlinearity, we first determine the effect of compression on the CDR performance. As depicted in Fig. 8, suppose the peak input swing experiences a compression of $\Delta V$ with respect to its ideal value. The sum, $V_{\rm sum} = V_A + V_B - \Delta V - 2V_E$, therefore incurs an error of $-\Delta V$. We assume $V_A$ and $V_E$ incur negligible compression. The CDR loop attempts to force $V_{sum}$ to zero by adjusting the clock phase and hence $V_E$. For $V_E$ to change by $\Delta V/2$, its sample time must shift by $(\Delta V/2)/r$, where $r$ denotes the slew rate. Compression then leads to a phase offset. For example, let us assume $D_{\rm in}$ has a peak-to-peak differential swing of 600 mV and a slew rate of 20 mV/ps. If the circuit experiences 1 dB of compression, i.e., if $\Delta V = 30$ mV, then the sample time of $V_E$ shifts by 0.75 ps in Fig. 8, a negligible phase offset.
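The numbers in this example can be reproduced with a few lines of Python. The sketch below is a plain restatement of the arithmetic in the paragraph; the helper names are hypothetical, and the 300-mV peak is the single-ended swing quoted in the text.

```python
def compression_mv(peak_mv, compression_db):
    """Amplitude lost at the signal peak for a given gain compression in dB."""
    return peak_mv * (1.0 - 10.0 ** (-compression_db / 20.0))

def phase_offset_ps(delta_v_mv, slew_mv_per_ps):
    """Sampling-time shift needed for V_E to absorb half of the error delta_v."""
    return (delta_v_mv / 2.0) / slew_mv_per_ps

delta_v_exact = compression_mv(300.0, 1.0)   # about 33 mV, rounded to 30 mV in the text
offset = phase_offset_ps(30.0, 20.0)         # 30 mV error, 20 mV/ps slew rate
print(f"delta V = {delta_v_exact:.1f} mV, phase offset = {offset:.2f} ps")
```

A 1-dB compression of a 300-mV peak corresponds to roughly 33 mV, which the text rounds to 30 mV, and the resulting 0.75-ps shift is about 2% of the 35.7-ps UI.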

In reality, the  $G_m$  stages can tolerate even greater nonlinearity because the CDR loop acts on the average phase error obtained from all of the 12 transitions. Fig. 9 plots the simulated PD characteristics for a behavioral nonlinear  $G_m$ model, suggesting that even 2 dB of compression has little effect. In this work, the differential pairs realizing the  $G_m$ stages in Fig. 7 incorporate resistive degeneration so as to compress by only 0.3 dB at the peaks of the input PAM4 swings.

We should remark that since the PD path does not perform data recovery, signal compression here negligibly compromises the detection carried out by the data extraction block in Fig. 4.

The linearity provided by the $G_m$ cells also makes them tolerant of their own (input-referred) offsets. According to Monte Carlo simulations, the $3\sigma$ value of this offset is 38 mV for one $G_m$ cell, causing a phase offset of 1.9 ps for a $D_{in}$ slew rate of 20 mV/ps. The cancellation method described in Section IX removes the effect of this offset.

Fig. 10. (a) Values of $V_{\text{sum}}$ when the clock is late. (b) Values of $V_{\text{sum}}$ when the clock is early. (c) Circuit for removing the ambiguity.

The next PD issue relates to an ambiguity that manifests itself between rising and falling data transitions. As illustrated in Fig. 10(a), if the clock is late,  $V_{sum}$  is negative for rising edges but positive for falling edges. The average  $V_{sum}$  is thus zero if the PAM4 symbols occur with equal probabilities, thereby prohibiting the CDR loop from locking. Similarly, for the case of early clock [see Fig. 10(b)], the average  $V_{sum}$  is zero. To resolve this issue, we must "rectify" the  $V_{sum}$  information such that both  $V_{sum}$  values are negative in Fig. 10(a) and positive in Fig. 10(b) (or vice versa).

Articulated differently, this point means that the sign of  $V_B - V_A$  determines whether a given transition's sum result must be negated or not. That is, an output of the form

$$A_{\rm out} = V_{\rm sum} \times (V_B - V_A) \tag{5}$$

uniquely and properly carries the early or late phase information. To obtain $V_B - V_A$, we incorporate a replica of the sample-and-sum circuit shown in Fig. 7, but with the $V_E$ path removed. The multiplication operation indicated in (5) is implemented logically after $V_{\text{sum}}$ and $V_B - V_A$ are sliced [see Fig. 10(c)], yielding $A_{\text{out}}$. The PD in Fig. 4 delivers eight such outputs.
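The effect of (5) can be checked with a small behavioral model. In the Python sketch below, the two quantities are sliced to one bit each and combined with an XNOR, which is one plausible logical equivalent of multiplying two signs; the gate choice, the linear-ramp edge model, and all names are assumptions made for illustration and are not taken from Fig. 10(c).

```python
def edge_sample(v_a, v_b, tau_ui):
    """Edge sample on a linear ramp; tau_ui > 0 models a late clock."""
    return 0.5 * (v_a + v_b) + (v_b - v_a) * tau_ui

def slice_bit(v):
    """One-bit slicer: 1 for a positive input, 0 otherwise."""
    return 1 if v > 0 else 0

def a_out(v_a, v_e, v_b):
    """Logical counterpart of V_sum * (V_B - V_A): XNOR of the sliced signs."""
    s_sum = slice_bit(v_a + v_b - 2.0 * v_e)
    s_dir = slice_bit(v_b - v_a)
    return 1 if s_sum == s_dir else 0

def pd_output(v_a, v_b, tau_ui):
    return a_out(v_a, edge_sample(v_a, v_b, tau_ui), v_b)

# A rising and a falling transition between assumed levels -1 and +3.
late_outputs = [pd_output(-1.0, 3.0, 0.1), pd_output(3.0, -1.0, 0.1)]
early_outputs = [pd_output(-1.0, 3.0, -0.1), pd_output(3.0, -1.0, -0.1)]
print(late_outputs, early_outputs)
```

Rising and falling edges now produce the same output for a given clock error, so the average no longer cancels.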

The third issue stems from dc offsets. Simulation results show that the output of the sample-and-sum circuit, $V_{sum}$, has a $3\sigma$ offset of 60 mV. Fig. 11 shows the simulated PD characteristics if $V_{sum}$ in Fig. 7 contains an offset, $V_{OS}$, of zero or 60 mV. The dead zone observed here can be explained as follows. Suppose $V_{OS} > 0$, making $V_{sum}$ more *positive* than expected. As a result, the low-to-high transition in Fig. 10(a) may yield $V_{sum} > 0$ and $A_{out} > 0$. On the other hand, the high-to-low transition still produces $V_{sum} > 0$ and $A_{out} < 0$. We observe that these opposite logical outputs tend to cancel each other, leading to a dead zone for a small phase difference between the data and the recovered clock. According to simulations, the $3\sigma$ offset must remain less than 15 mV for the dead zone to be negligible. We conclude that the signal path must incorporate offset cancellation, preferably in the background (see Section IX).

Fig. 11. Simulated PD characteristics with different offsets in $V_{sum}$.

# C. Effect of Channel Loss

As mentioned in Section I, we assume that a CTLE and a DFE open the eye applied to the CDR. Nevertheless, it is instructive to examine the PD performance in the presence of channel loss. We note that a given analog sample at the channel output can be written as $h_0 a_j + h_1 a_{j-1} + h_2 a_{j-2} + \cdots$, where the $h$ coefficients denote the channel's postcursors and the $a$ samples are taken one unit interval (UI) apart.

Bearing in mind that the PD computes  $V_{\text{sum}} = V_A + V_B - 2V_E$ , and assuming a high loop gain so that the loop locks with  $V_{\text{sum}} \approx 0$ , we express  $V_A$ ,  $V_E$ , and  $V_B$  in the abbreviated form  $d_j$ ,  $e_j$ , and  $d_{j+1}$ , respectively. Owing to channel loss, these samples emerge as

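$$d'_{j} = h_{0}d_{j} + h_{1}d_{j-1} + h_{2}d_{j-2} + \cdots \tag{6}$$

$$e'_{j} = h_{0}e_{j} + h_{1}e_{j-1} + h_{2}e_{j-2} + \cdots \tag{7}$$

The PD summing action thus yields

$$\begin{aligned} V_{\text{sum}} &= d'_j + d'_{j+1} - 2e'_j \\ &= h_0(d_j + d_{j+1} - 2e_j) + h_1(d_{j-1} + d_j - 2e_{j-1}) + \cdots \\ &= 0 \end{aligned} \tag{8}$$

because each parenthesized term is the lock condition $d_k + d_{k+1} - 2e_k = 0$ evaluated at an earlier symbol.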

That is, the loop still locks to the same point.
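The result in (8) can be checked numerically. The Python sketch below assumes that, in lock and without channel loss, every edge sample sits midway between its neighboring data samples, and that the channel acts on the symbol-spaced sequences as a causal FIR filter as in (6) and (7). The tap values and the random data are hypothetical.

```python
import random

def apply_channel(x, h):
    """Causal FIR: y[j] = sum_k h[k] * x[j - k], with x[j] = 0 for j < 0."""
    return [
        sum(hk * x[j - k] for k, hk in enumerate(h) if j - k >= 0)
        for j in range(len(x))
    ]

random.seed(0)
levels = (-3.0, -1.0, 1.0, 3.0)
d = [random.choice(levels) for _ in range(200)]             # data samples
e = [0.5 * (d[j] + d[j + 1]) for j in range(len(d) - 1)]    # ideal edge samples in lock

h = [1.0, 0.35, 0.1]  # hypothetical main cursor and two postcursors
d_p = apply_channel(d, h)
e_p = apply_channel(e, h)

# Skip the first len(h) symbols, where the filter start-up transient applies.
v_sum = [d_p[j] + d_p[j + 1] - 2.0 * e_p[j] for j in range(len(h), len(e))]
max_abs_v_sum = max(abs(v) for v in v_sum)
print(max_abs_v_sum)
```

Within floating-point error the sum remains zero for every symbol, so under these assumptions the postcursors do not move the lock point.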

The PD gain, however, is affected by the loss, as is the case in other topologies as well [17], [18]. Fig. 12(a) plots the simulated PD characteristics with 0- and 6-dB channel losses at 14 GHz, revealing a 47% gain reduction. Fig. 12(b) shows the eye diagram at the output of the channel with 6-dB loss. With a 6-dB channel loss, the overall phase offset may reach 0.17 UI. If such an offset proves undesirable, the CDR circuit cannot directly sense the channel output and must be fed from the CTLE or the DFE.

Fig. 12. (a) Simulated PD characteristics. (b) Eye diagram (6-dB loss at 14 GHz).

The PD operation assumes some waveform symmetry around  $V_E$  in Fig. 6 so that positive or negative time displacements yield approximately equal and opposite changes in  $V_E$ . This property generally holds if the stages preceding the PD contain multiple poles even if the channel has no loss.<sup>1</sup> For a hypothetical single-pole response, the asymmetry around  $V_E$ moves the sampling point by 0.11 UI if the time constant is equal to 1 UI. This error is tolerable as the eye is open in such a case.

# VII. DATA EXTRACTION

The eight interleaved data extraction units in Fig. 4 must sample the PAM4 input and convert the result to 2 bits. This is accomplished by means of a passive sampler and comparators [see Fig. 13(a)]. In unit number $j$, the thermometer code $T_{3,j}T_{2,j}T_{1,j}$ acts as the demultiplexed output. Each comparator consists of two differential pairs and a StrongArm latch [see Fig. 13(b)]. With a voltage gain of 1.5 provided by these pairs and a slicer input-referred offset of 29 mV, the digitizer can operate with a minimum eye opening of 100 mV while ensuring bit error rate (BER) $< 10^{-12}$. The simulated input-referred noise is 1.9 mV. Therefore, the BER for this CDR is dominated by slicer offsets. In practice, forward error correction in the digital domain allows the BER of PAM4 signals to be as high as $10^{-5}$ [6].

<sup>1</sup>As the number of poles increases, the impulse response approaches a Gaussian function.

Fig. 13. (a) Data extraction unit number $j$. (b) Realization of the slicer.

The thermometer code in Fig. 13(a) also drives two sets of three retimers realized by StrongArm latches. Aligned in time, these signals travel to the CP and disable it in the absence of data transitions (see Section VIII). Each unit consumes 125  $\mu$ W at 3.5 GHz.

# VIII. UP/DOWN LOGIC AND CP

# A. PD Output Combining

The Alexander PD of Fig. 5(a) employs two XOR gates for its up/down logic, yielding two outputs that can directly drive the CP. On the other hand, the PD in Fig. 4 generates eight outputs, $A_{outj}$, $j = 1, ..., 8$, which must be combined logically so as to provide the up and down commands. These outputs arrive at a rate of 3.5 Gb/s with a phase difference of 45°. One can simply multiplex the eight signals, but the resulting 28-Gb/s waveform would require a wide BW at the input of the CP. We instead decompose the CP into eight equal slices, each driven by its own up and down signals, $\mathrm{Up}_j$ and $\mathrm{Down}_j$ (see Fig. 14). We have

$$\mathrm{Up}_{j} = \overline{A_{\mathrm{out}j}} \tag{9}$$

$$\mathrm{Down}_j = A_{\mathrm{out}j}. \tag{10}$$



Fig. 14. Parallel CP scheme.

In essence, the multiplexing action is deferred to the CP output, where a narrow BW can be tolerated.

#### B. Problem of Long Runs

In the absence of data transitions, a BBPD such as the Alexander topology in Fig. 5(a) generates $V_A = V_E = V_B$ and hence $V_A \oplus V_E = 0$ and $V_E \oplus V_B = 0$. This naturally turns off both the up current and the down current. The proposed PD and CP, however, operate according to (9) and (10), always enabling either the up or the down current source even in the absence of data transitions. Consequently, the VCO control is greatly disturbed and the output incurs significant pattern-dependent jitter.

We therefore seek additional information from the data extraction unit so as to detect the absence of transitions and accordingly disable the CP. The basic idea is to determine when two consecutive data samples are equal. Let us return to Fig. 13(a) and assume samples $V_A$ and $V_B$ are taken by units number 1 and 2, respectively. The corresponding thermometer codes are $T_{3,1}T_{2,1}T_{1,1}$ for unit 1 and $T_{3,2}T_{2,2}T_{1,2}$ for unit 2. A transition occurs between $V_A$ and $V_B$ if these codes are unequal or, equivalently, $T_{3,1} \neq T_{3,2}$, $T_{2,1} \neq T_{2,2}$, or $T_{1,1} \neq T_{1,2}$.

The up and down controls are therefore modified to

$$\mathrm{Up}_{j} = \overline{A_{\mathrm{out}j}} \cdot D_{j} \tag{11}$$

$$\mathrm{Down}_j = A_{\mathrm{out}j} \cdot D_j \tag{12}$$

$$D_j = (T_{3,j} \oplus T_{3,j+1}) + (T_{2,j} \oplus T_{2,j+1}) + (T_{1,j} \oplus T_{1,j+1}) \tag{13}$$

where $+$ denotes the logical OR. If there is a data transition from $V_A$ to $V_B$, $D_j = 1$. Otherwise, $D_j = 0$.
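The gating described by (11)-(13) can be expressed as a short truth-table model. The Python sketch below is behavioral only; the mapping from a PAM4 level index to the thermometer code and the function names are assumptions for illustration, and the sketch does not model the delay-matching concern that motivates the CP-based realization.

```python
def thermometer(level_index):
    """Assumed mapping of PAM4 level 0..3 to the code (T3, T2, T1)."""
    return (int(level_index > 2), int(level_index > 1), int(level_index > 0))

def transition_flag(t_j, t_next):
    """D_j in (13): OR of the bitwise XORs of two consecutive codes."""
    return int(any(a ^ b for a, b in zip(t_j, t_next)))

def up_down(a_out, t_j, t_next):
    """(Up_j, Down_j) per (11) and (12) for a one-bit PD output a_out."""
    d_j = transition_flag(t_j, t_next)
    up = (1 - a_out) & d_j
    down = a_out & d_j
    return up, down

no_transition = up_down(1, thermometer(2), thermometer(2))
with_transition_down = up_down(1, thermometer(0), thermometer(2))
with_transition_up = up_down(0, thermometer(3), thermometer(1))
print(no_transition, with_transition_down, with_transition_up)
```

When two consecutive symbols are equal, both commands are zero regardless of $A_{\text{out}j}$, which is the behavior needed during long runs.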

Based on (11)-(13), a natural approach is to keep Fig. 14 unchanged and simply modify the logic gates leading to $\mathrm{Up}_j$ and $\mathrm{Down}_j$. Unfortunately, this would introduce unequal delays in the up and down paths, disturbing the VCO control.

This difficulty is overcome by removing the OR functions from (11) and (12) and realizing them by the CP itself. Illustrated in Fig. 15, one single CP slice in Fig. 14 is replaced with three new CP slices, all driven by the same  $A_{\text{out}}$ . Receiving all of the thermometer codes shown in (13), the three slices realize each XOR term and OR the results at the output.

This approach provides nominally equal delays in the up and down paths. The total power consumption of the logic in the 24 slices amounts to 0.4 mW at 3.5 GHz.



Fig. 15. CP corresponding to one $A_{\text{out}j}$.

## IX. COMPARATOR CORE DESIGN

The proposed PAM4 PD described in Section VI generates relatively small amplitudes for  $V_X$  in Fig. 7. Before arriving at the CP, these signals must reach rail-to-rail swings, calling for efficient low-offset amplification. Recall from Section VI that the sample-and-sum circuit contributes 60 mV of  $3\sigma$  offset to  $V_X$ . It is preferable to perform offset cancellation in the background.

Consider the arrangement shown in Fig. 16(a), where the offset at node X is sensed by $A_1$ and injected into X by $G_{m2}$. We expect the offset to be reduced by a factor of $1 + A_1 G_{m2} R_D$. This method faces two drawbacks. First, $A_1$ itself must exhibit a low offset, inevitably presenting a high capacitance at X and degrading the signal path. Specifically, as Fig. 6(b) shows, $V_X$ must settle in less than 107 ps, indicating the total capacitance at node X cannot be arbitrarily large. Second, this technique does not correct the offset of the comparator. The comparator itself can incorporate offset cancellation [19], [20], but such an approach does not lend itself to background operation.

We introduce a new topology that naturally affords background offset cancellation for the chain consisting of the sum generator and the comparator. Depicted in Fig. 16(b), the structure employs gain stage  $A_1$ , positive-feedback switch  $S_3$ , and offset cancellation devices  $S_5$ ,  $S_7$ ,  $C_1$ ,  $C_3$ , and  $G_{m2}$ . The circuit operates in three phases [see Fig. 16(c)] with 3.5-GHz clocks.

First, when  $CK_A$  is high, from t = 36 ps to t = 108 ps,  $A_1$  senses  $V_X$  and applies the result to  $C_1$ , storing the offset (and the data) on  $C_1$ . Next,  $C_1$  is disengaged from  $A_1$  at t = 108 ps and switched to  $C_3$  at t = 180 ps so as to share its charge. In this mode,  $A_1$  finds sufficient time to settle and generate  $A_1V_X$ . In the meantime,  $G_{m2}$  injects a current into X, opposing the offset (and the data). We expect that this feedback component converges toward only the offset as the data polarity changes in different cycles and the data component on  $C_3$  averages to zero.

In the third phase, from t = 144 ps to t = 288 ps, switch  $S_3$  turns on, placing  $A_1$  in a positive-feedback loop and amplifying  $V_X$  to nearly rail-to-rail swings. The choice of 25% duty cycle for  $CK_B$  and  $CK_C$  ensures that  $A_1$  is not loaded by  $C_2$ .



Fig. 16. (a) Conventional offset cancellation. (b) Proposed offset-cancellation topology. (c) Three phases of operations.

The differential, transistor-level realization of the circuit is shown in Fig. 17(a). When  $S_1$ ,  $S_2$ ,  $S_5$ , and  $S_6$  are on,  $V_X$  is amplified by  $M_1$  and  $M_2$  and stored on  $C_1$  and  $C_2$ . Next,  $M_1$ and  $M_2$  are configured as a regenerative pair [see Fig. 17(b)] and the tail is tied to the ground so as to increase the tail current. As a result,  $V_{out}$  approaches a valid logical value.

Fig. 17(c) plots the simulated differential output waveforms in response to a random  $V_X$  waveform. The waveform denoted by  $V_{cap}$  represents the differential voltage across capacitors  $C_1$ and  $C_2$ . The circuit senses  $V_X$  and stores it on these capacitors from t = 0 ns to t = 0.12 ns, and then regenerates from t = 0.12 ns to t = 0.28 ns, delivering an output swing of nearly 0.8 V.

Fig. 17. (a) Circuit realization of the proposed comparator, (b) regeneration mode, and (c) waveforms of $V_{cap}$, $V_{out}$, and $V_X$.

The proposed comparator topology merits several remarks. First, we have $(W/L)_{1,2} = 1\ \mu\text{m}/30\ \text{nm}$, $R_D = 7.5\ \text{k}\Omega$, $C_1 = C_2 = 0.2\ \text{fF}$, and $C_3 = C_4 = 3\ \text{pF}$. Capacitances $C_1$ and $C_2$ are device and layout parasitics, chosen small enough to have a proper low-pass filter corner frequency (as explained below). Second, before cancellation, the $3\sigma$ offset referred to node X in Fig. 16(b) contains a component equal to 60 mV from the sample-and-sum circuit and another equal to 30 mV from the comparator. The combined $3\sigma$ offset is thus about 67 mV. Third, the cancellation loop lowers this offset value by a factor of $A_1G_{m2}R_D \approx 45$. Fourth, with input transistor dimensions of $W/L = 20\ \mu\text{m}/200\ \text{nm}$, $G_{m2}$ contributes only a $3\sigma$ offset of 0.9 mV while negligibly affecting the speed at node X. We note that capacitance mismatch at the drains of $M_1$ and $M_2$ can translate to offset during regeneration. With a $3\sigma$ transistor capacitance mismatch and 20% layout parasitic mismatch, the simulated $3\sigma$ input-referred offset is 1.3 mV. Even though this offset cannot be canceled by the proposed topology, it negligibly affects the circuit operation.
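The offset budget above can be reproduced with a short Python sketch. It assumes that the two offset components are statistically independent, so that they combine as a root sum of squares, and it compares the residual after cancellation with the 15-mV bound quoted in Section VI-B; the helper name is hypothetical.

```python
import math

def rss(*components):
    """Root-sum-square combination of independent offset contributions."""
    return math.sqrt(sum(c * c for c in components))

combined_mv = rss(60.0, 30.0)      # sample-and-sum circuit and comparator
residual_mv = combined_mv / 45.0   # after the cancellation loop (factor of about 45)
print(f"combined = {combined_mv:.1f} mV, residual = {residual_mv:.2f} mV")
```

The combined value of about 67 mV matches the text, and the residual of roughly 1.5 mV lies well below the 15-mV limit for a negligible dead zone.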

Fig. 18. (a) VCO, (b) phase generator, and (c) latch used in the phase generator.

In Fig. 17(a) and (b), the charge transfer between the primary capacitors, $C_1$ and $C_2$, and the secondary capacitors, $C_3$ and $C_4$, can be approximated by a continuous-time resistance equal to $1/(C_{1,2}f_{CK}) \approx 1.4\ \text{M}\Omega$. This resistance and $C_{3,4}$ define a low-pass filter having a corner frequency equal to 38 kHz. With a loop gain of 45, we then obtain a time constant of 0.6 $\mu$s and hence a convergence time of roughly 2.4 $\mu$s for the offset-cancellation loop. The large loop time constant ensures that long runs in the data negligibly corrupt the stored offset.
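The equivalent resistance and the filter corner can be verified with the standard switched-capacitor approximation $R \approx 1/(Cf_{CK})$. The Python sketch below uses the component values quoted in the text and covers only these two quantities; the function names are hypothetical.

```python
import math

def switched_cap_resistance_ohm(c_farad, f_ck_hz):
    """Continuous-time equivalent resistance of a switched capacitor."""
    return 1.0 / (c_farad * f_ck_hz)

def rc_corner_hz(r_ohm, c_farad):
    """-3-dB corner frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)

r_eq = switched_cap_resistance_ohm(0.2e-15, 3.5e9)  # C1,2 = 0.2 fF at 3.5 GHz
f_corner = rc_corner_hz(r_eq, 3e-12)                # with C3,4 = 3 pF
print(f"R_eq = {r_eq / 1e6:.2f} MOhm, corner = {f_corner / 1e3:.1f} kHz")
```

The unrounded resistance is about 1.43 MΩ, giving a corner near 37 kHz; using the rounded 1.4 MΩ yields the 38 kHz quoted above.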

## X. VCO AND PHASE GENERATOR

As shown in Fig. 18(a), the 28-GHz VCO is based on a conventional programmable topology, employing  $W/L = 2.5 \ \mu m/30$  nm for all four transistors and a differential inductance of 0.95 nH. The VCO simulated phase noise is -90 dBc/Hz at 1-MHz offset with a power consumption of 0.9 mW. In a CDR loop BW of 160 MHz, the VCO contributes about 118 fs to the overall jitter.

Fig. 19. Simulated jitter transfer versus latency.

Fig. 20. Die photograph.

TABLE I Power Breakdown

| Block                    | Power Consumption (mW) |
|--------------------------|------------------------|
| PAM4 Transition Detector | 2.6                    |
| Data Extraction Unit     | 1.1                    |
| Up/Down Logic & CP       | 0.4                    |
| VCO                      | 0.9                    |
| Phase Generation         | 2.8                    |
| Total                    | 7.8                    |

The 16 phases necessary for the PD and the data extraction units are generated by applying the VCO output to a $\div 2$ stage and using the resulting quadrature phases to drive two cascaded ring counters [see Fig. 18(b)] [21]. To operate at 28 GHz with low power consumption, the $\div 2$ circuit incorporates the latch shown in Fig. 18(c) [22], where source followers $M_3$ and $M_4$ improve the speed by providing feedforward paths to the output. Each latch draws 0.1 mW, and the overall phase generator draws 2.8 mW.
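The clock rates implied by the phase-generation scheme can be tabulated as follows. This is a rough sketch rather than a description of the actual implementation: it assumes that the 16 phases run at one-eighth of the symbol rate and are equally spaced, and all variable names are hypothetical.

```python
data_rate = 56e9        # b/s (from the text)
bits_per_symbol = 2     # PAM4 carries two bits per symbol
vco_freq = 28e9         # Hz (from the text)
num_phases = 16         # from the text

symbol_rate = data_rate / bits_per_symbol   # symbols per second
ui = 1.0 / symbol_rate                      # unit interval in seconds
div2_freq = vco_freq / 2                    # frequency of the quadrature phases
eighth_rate = symbol_rate / 8               # ASSUMED frequency of the 16 output phases

# ASSUMPTION: the 16 phases are equally spaced over one eighth-rate period.
phase_step = 1.0 / (eighth_rate * num_phases)

print(f"unit interval:      {ui * 1e12:.1f} ps")
print(f"divide-by-2 output: {div2_freq / 1e9:.1f} GHz")
print(f"eighth-rate clock:  {eighth_rate / 1e9:.2f} GHz")
print(f"phase spacing:      {phase_step * 1e12:.2f} ps ({phase_step / ui:.2f} UI)")
```

Under these assumptions, adjacent phases are separated by half a unit interval, so eight symbols fit within one period of the eighth-rate clock.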

Fig. 21. Input eye diagram.

Fig. 22. (a) Recovered clock spectrum. (b) Recovered clock phase noise.

The cascade of blocks in Fig. 18(b) raises some concern regarding the latency around the CDR loop and hence the possibility of peaking in its frequency response. Specifically, if a VCO output transition is displaced at t = 0, it takes at most $\Delta T = 0.6$ ns to reach the PD. Modeling this effect by $\exp(-s\Delta T) \approx 1 - s\Delta T$, we note that the right-half-plane zero affects the CDR phase margin if $1/\Delta T$ is less than ten times the loop BW. Fig. 19 plots the simulated jitter transfer for two values of $\Delta T$, predicting a worst-case peaking of 2 dB. This is observed in measurements as well (see Section XI).
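To get a feel for the effect of the latency, the extra phase lag at the edge of the loop BW can be estimated numerically. The sketch below is illustrative only: it treats the 160-MHz loop BW as the frequency at which phase margin is evaluated, which is an assumption, and the variable names are hypothetical.

```python
import math

delta_t = 0.6e-9    # worst-case loop latency in seconds (from the text)
loop_bw = 160e6     # widest loop BW in Hz (from the text)

omega_bw = 2.0 * math.pi * loop_bw

# A pure delay exp(-s*dT) has unity magnitude and a phase of -omega*dT.
delay_lag_deg = math.degrees(omega_bw * delta_t)

# The first-order model 1 - s*dT contributes a lag of atan(omega*dT).
zero_lag_deg = math.degrees(math.atan(omega_bw * delta_t))

# The right-half-plane zero of the first-order model sits at omega = 1/dT.
zero_freq = 1.0 / (2.0 * math.pi * delta_t)
ratio = zero_freq / loop_bw

print(f"phase lag of the delay at the loop BW:  {delay_lag_deg:.1f} deg")
print(f"phase lag of the first-order model:     {zero_lag_deg:.1f} deg")
print(f"zero frequency: {zero_freq / 1e6:.0f} MHz ({ratio:.2f} x loop BW)")
```

Since the zero lies well within a decade of the loop BW for the widest setting, a loss of a few tens of degrees in phase margin, and hence some peaking, is plausible.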

## XI. EXPERIMENTAL RESULTS

The CDR/deserializer circuit has been fabricated in TSMC's 28-nm CMOS technology. Shown in Fig. 20 is the die active area, which measures $250 \ \mu m \times 350 \ \mu m$. The chip is mounted on a printed-circuit board and the high-speed signals travel through probes. Table I shows the breakdown of the power consumption.

Fig. 23. (a) Measured jitter transfer. (b) Measured jitter tolerance.

TABLE II Performance Summary and Comparison With Prior Art

|                           | This Work          | [10]               | [11]                 | [12]               | [13]              |
|---------------------------|--------------------|--------------------|----------------------|--------------------|-------------------|
| Data Rate (Gb/s)          | 56                 | 56                 | 28                   | 32                 | 29.1              |
| Power (mW)                | 8                  | 49.2\*             | 47\*                 | 14.7               | 19.16             |
| 1-UI Jtol. Freq. (MHz)    | 10                 | 0.5                | 0.6                  | 2                  | 1.8               |
| Loop BW (MHz)             | 160                | 10                 | 11                   | 10                 | 12                |
| CK Jitter (fs)            | 574                | N/A                | 513                  | 352                | 487               |
| Input PRBS Length         | 2<sup>15</sup> − 1 | 2<sup>15</sup> − 1 | 2<sup>7</sup> − 1    | 2<sup>9</sup> − 1  | 2<sup>7</sup> − 1 |
| BER                       | 10<sup>-12</sup> † | 10<sup>-12</sup> † | 10<sup>-8</sup> \*\* | 10<sup>-12</sup> ‡ | 10<sup>-12</sup>  |
| Technology (nm)           | 28                 | 65                 | 65                   | 40                 | 28                |
| Power Efficiency (pJ/bit) | 0.14               | 0.88               | 1.68                 | 0.46               | 0.66              |

\*Only including CDR portion for fair comparison.

†1/8-rate BER, \*\*estimated BER, ‡1/4-rate LSB BER.

The chip is tested with Keysight's M8040A BER tester (BERT). The preemphasis coefficient setup is (-0.05, 0.85, -0.1). The total channel loss from the output of the BERT to the output of the probe is 3.9 dB at 14 GHz. Fig. 21 shows the eye diagram of a single-ended output. The recovered clock is measured by a spectrum analyzer. One of the eighth-rate data streams is sent back to the error detector for BER measurements. The CP and loop filter are programmable, allowing characterization with different loop BWs. All measurements are conducted with a pseudorandom binary sequence of length $2^{15} - 1$.

Fig. 22(a) plots the recovered clock spectrum for a loop BW of 160 MHz. The VCO phase noise is shaped with a few dB of peaking at the edge of the BW, an effect attributed to the loop latency (see Section X). Shown in Fig. 22(b) is the recovered clock phase noise for offsets from 100 Hz to 100 MHz. The corresponding integrated jitter is 360 fs<sub>rms</sub>. The phase noise has also been measured directly from the spectrum at offsets up to 14 GHz. The total integrated jitter rises to 574 fs<sub>rms</sub>.
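The two integrated-jitter figures can be related by noting that phase noise in disjoint offset bands adds in power. The sketch below is only a rough estimate: it assumes that both numbers share the same lower integration limit and that the two measurement methods are directly comparable, neither of which is stated in the text. The variable names are hypothetical.

```python
import math

jitter_to_100mhz_fs = 360.0   # integrated from 100 Hz to 100 MHz (from the text)
jitter_total_fs = 574.0       # integrated up to 14 GHz (from the text)

# ASSUMPTION: contributions from disjoint offset bands are uncorrelated,
# so their rms values add in root-sum-square fashion.
jitter_beyond_fs = math.sqrt(jitter_total_fs**2 - jitter_to_100mhz_fs**2)

print(f"estimated jitter from offsets above 100 MHz: {jitter_beyond_fs:.0f} fs rms")
```

Under these assumptions, the offsets beyond 100 MHz account for the larger share of the total jitter.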

Fig. 23(a) plots the CDR jitter transfer for loop BWs from 25 to 160 MHz, also exhibiting a few dB of peaking for the widest BW. Fig. 23(b) depicts the jitter tolerance, revealing a value of 1 UI at 10 MHz for BWs of 120 MHz and above. The BER threshold in this measurement is 10<sup>-12</sup>.

Table II summarizes the measured performance of the proposed CDR and compares it to that of the prior art. We observe that the frequency at which the jitter tolerance reaches 1 UI has been raised by a factor of 5 and the power efficiency has been improved by a factor of 3.
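The efficiency row of Table II and the two improvement factors follow directly from the tabulated data rates, power values, and 1-UI jitter-tolerance frequencies. A minimal sketch of this arithmetic is given below; the dictionary layout and names are hypothetical, and the comparison is made against the best prior-art entry in each category.

```python
# (data rate in Gb/s, power in mW), taken from Table II
designs = {
    "This work": (56.0, 8.0),
    "[10]": (56.0, 49.2),
    "[11]": (28.0, 47.0),
    "[12]": (32.0, 14.7),
    "[13]": (29.1, 19.16),
}

# Frequency at which the jitter tolerance reaches 1 UI, in MHz (from Table II)
jtol_mhz = {"This work": 10.0, "[10]": 0.5, "[11]": 0.6, "[12]": 2.0, "[13]": 1.8}

# mW divided by Gb/s is numerically equal to pJ/bit.
efficiency = {name: power / rate for name, (rate, power) in designs.items()}

prior = [name for name in designs if name != "This work"]
best_prior_eff = min(efficiency[name] for name in prior)
best_prior_jtol = max(jtol_mhz[name] for name in prior)

eff_gain = best_prior_eff / efficiency["This work"]
jtol_gain = jtol_mhz["This work"] / best_prior_jtol

for name, value in efficiency.items():
    print(f"{name:>9}: {value:.2f} pJ/bit")
print(f"efficiency improvement over best prior art: {eff_gain:.1f}x")
print(f"1-UI jitter-tolerance frequency improvement: {jtol_gain:.1f}x")
```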

## XII. CONCLUSION

This work demonstrates that detection of all PAM4 data transitions can improve the performance of the CDR circuits considerably. Also proposed is a new comparator that calibrates its offset in every clock cycle with minimal loading of the signal path. The CDR/deserializer draws 8 mW at 56 Gb/s in 28-nm CMOS technology.

## ACKNOWLEDGMENT

The authors thank the TSMC University Shuttle Program for chip fabrication.

#### References

- [1] J. Im *et al.*, "A 112-Gb/s PAM-4 long-reach wireline transceiver using a 36-way time-interleaved SAR ADC and inverter-based RX analog front-end in 7-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 56, no. 1, pp. 7–18, Jan. 2021, doi: 10.1109/JSSC.2020.3024261.
- [2] M. Pisati *et al.*, "A 243-mW 1.25–56-Gb/s continuous range PAM-4 42.5-dB IL ADC/DAC-based transceiver in 7-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 55, no. 1, pp. 6–18, Jan. 2020, doi: 10.1109/JSSC.2019.2936307.
- [3] T. Ali *et al.*, "6.2 A 460 mW 112Gb/s DSP-based transceiver with 38dB loss compensation for next-generation data centers in 7nm FinFET technology," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2020, pp. 118–120, doi: 10.1109/ISSCC19947.2020.9062925.
- [4] B.-J. Yoo *et al.*, "6.4 A 56Gb/s 7.7 mW/Gb/s PAM-4 wireline transceiver in 10nm FinFET using MM-CDR-based ADC timing skew control and low-power DSP with approximate multiplier," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2020, pp. 122–124, doi: 10.1109/ISSCC19947.2020.9062964.
- [5] G. Hou and B. Razavi, "A 56-Gb/s 8-mW PAM4 CDR/DMUX with high jitter tolerance," in *Proc. Symp. VLSI Circuits*, Jun. 2021, pp. 1–2, doi: 10.23919/VLSICircuits52068.2021.9492414.
- [6] (Dec. 29, 2017). Common Electrical I/O (CEI) Electrical and Jitter Interoperability Agreements for 6G+ bps, 11G+ bps, 25G+ bps I/O and 56G+ bps. Accessed: Sep. 1, 2021. [Online]. Available: https://www.oiforum.com/wp-content/uploads/2019/01/OIF-CEI-04.0.pdf

- [7] (Sep. 2016). Arista 7500 Scale-Out Cloud Network Designs. Accessed: Sep. 1, 2021. [Online]. Available: https://www.arista.com/assets/data/pdf/Whitepapers/Arista\_7500\_Scale\_Out\_Designs.pdf
- [8] Y. Frans *et al.*, "A 56-Gb/s PAM4 wireline transceiver using a 32-way time-interleaved SAR ADC in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 4, pp. 1101–1110, Apr. 2017, doi: 10.1109/JSSC.2016.2632300.
- [9] B. Razavi, "Jitter-power trade-offs in PLLs," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 68, no. 4, pp. 1381–1387, Apr. 2021, doi: 10.1109/TCSI.2021.3057580.
- [10] A. Roshan-Zamir *et al.*, "A 56-Gb/s PAM4 receiver with low-overhead techniques for threshold and edge-based DFE FIR- and IIR-tap adaptation in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 54, no. 3, pp. 672–684, Mar. 2019, doi: 10.1109/JSSC.2018.2881278.
- [11] A. K. M. D. Hossain, M. Mohammad, and M. Hossain, "Channel-adaptive ADC and TDC for 28 Gb/s PAM-4 digital receiver," *IEEE J. Solid-State Circuits*, vol. 53, no. 3, pp. 772–788, Mar. 2018, doi: 10.1109/JSSC.2017.2777099.
- [12] Z. Zhang, G. Zhu, C. Wang, L. Wang, and C. P. Yue, "A 32-Gb/s 0.46-pJ/bit PAM4 CDR using a quarter-rate linear phase detector and a self-biased PLL-based multiphase clock generator," *IEEE J. Solid-State Circuits*, vol. 55, no. 10, pp. 2734–2746, Oct. 2020, doi: 10.1109/JSSC.2020.3005780.
- [13] X. Zhao, Y. Chen, P.-I. Mak, and R. P. Martins, "A 0.0285 mm<sup>2</sup> 0.68pJ/bit single-loop full-rate bang-bang CDR without reference and separate frequency detector achieving an 8.2(Gb/s)/µs acquisition speed of PAM-4 data in 28nm CMOS," in *Proc. IEEE Custom Integr. Circuits Conf. (CICC)*, Mar. 2020, pp. 1–4, doi: 10.1109/CICC48029.2020.9075885.
- [14] N. Qi *et al.*, "A 51Gb/s, 320 mW, PAM4 CDR with baud-rate sampling for high-speed optical interconnects," in *Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC)*, Nov. 2017, pp. 89–92, doi: 10.1109/ASSCC.2017.8240223.
- [15] J. Lee, K. S. Kundert, and B. Razavi, "Analysis and modeling of bang-bang clock and data recovery circuits," *IEEE J. Solid-State Circuits*, vol. 39, no. 9, pp. 1571–1580, Sep. 2004, doi: 10.1109/JSSC.2004.831600.
- [16] J. D. H. Alexander, "Clock recovery from random binary signals," *Electron. Lett.*, vol. 11, no. 22, pp. 541–542, Oct. 1975, doi: 10.1049/el:19750415.
- [17] M. Hossain and N. Nguyen, "DDJ-adaptive SAR TDC-based timing recovery for multilevel signaling," *IEEE J. Solid-State Circuits*, vol. 54, no. 10, pp. 2833–2844, Oct. 2019, doi: 10.1109/JSSC.2019.2923586.
- [18] K. Park, M. Shim, H.-G. Ko, and D.-K. Jeong, "6.5 A 6.4-to-32Gb/s 0.96pJ/b referenceless CDR employing ML-inspired stochastic phase-frequency detection technique in 40nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2020, pp. 124–126, doi: 10.1109/ISSCC19947.2020.9063010.
- [19] P. Nuzzo, F. De Bernardinis, P. Terreni, and G. Van der Plas, "Noise analysis of regenerative comparators for reconfigurable ADC architectures," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 55, no. 6, pp. 1441–1454, Jul. 2008, doi: 10.1109/TCSI.2008.917991.
- [20] M. J. E. Lee, W. J. Dally, and P. Chiang, "Low-power area-efficient high-speed I/O circuit techniques," *IEEE J. Solid-State Circuits*, vol. 35, no. 11, pp. 1591–1599, Nov. 2000, doi: 10.1109/4.881204.
- [21] J. W. Park and B. Razavi, "Channel selection at RF using Miller bandpass filters," *IEEE J. Solid-State Circuits*, vol. 49, no. 12, pp. 3063–3078, Dec. 2014, doi: 10.1109/JSSC.2014.2362843.
- [22] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CMOS CDR/deserializer," *IEEE J. Solid-State Circuits*, vol. 48, no. 3, pp. 684–697, Mar. 2013, doi: 10.1109/JSSC.2013.2237692.



**Guanrong Hou** received the B.S. degree in electrical engineering from Peking University, Beijing, China, in 2014, and the M.S. and Ph.D. degrees in electrical engineering from the University of California at Los Angeles (UCLA), Los Angeles, CA, USA, in 2015 and 2021, respectively.

He is currently with MaxLinear, Inc., Carlsbad, CA, USA.

**Behzad Razavi** (Fellow, IEEE) received the B.S.E.E. degree from the Sharif University of Technology, Tehran, Iran, in 1985, and the M.S.E.E. and Ph.D.E.E. degrees from Stanford University, Stanford, CA, USA, in 1988 and 1992, respectively.

He was an Adjunct Professor at Princeton University, Princeton, NJ, USA, from 1992 to 1994, and at Stanford University in 1995. He was with AT&T Bell Laboratories, Holmdel, NJ, USA, and Hewlett-Packard Laboratories, Palo Alto, CA, USA, until 1996. Since 1996, he has been an Associate Professor and subsequently a Professor of electrical engineering at the University of California at Los Angeles (UCLA), Los Angeles, CA, USA. He has authored *Principles of Data Conversion System Design* (IEEE Press, 1995), *RF Microelectronics* (Prentice Hall, 1998 and 2012) (translated to Chinese, Japanese, and Korean), *Design of Analog CMOS Integrated Circuits* (McGraw-Hill, 2001 and 2016) (translated to Chinese, Japanese, and Korean), *Design of Integrated Circuits for Optical Communications* (McGraw-Hill, 2003 and Wiley, 2012), *Design of CMOS Phase-Locked Loops* (Cambridge University Press, 2020), and *Fundamentals of Microelectronics* (Wiley, 2006, 2014, and 2021) (translated to Korean, Portuguese, and Turkish) and is an Editor of *Monolithic Phase-Locked Loops and Clock Recovery Circuits* (IEEE Press, 1996) and *Phase-Locking in High-Performance Systems* (IEEE Press, 2003). His current research interests include wireless and wireline transceivers and data converters.

Dr. Razavi is a member of the U.S. National Academy of Engineering and a fellow of the U.S. National Academy of Inventors. He has served as an IEEE Distinguished Lecturer. He received the Beatrice Winner Award for Editorial Excellence at the 1994 International Solid-State Circuits Conference (ISSCC), the Best Paper Award at the 1994 European Solid-State Circuits Conference, the Best Panel Award at the 1995 and 1997 ISSCC, the TRW Innovative Teaching Award in 1997, the Best Paper Award at the IEEE Custom Integrated Circuits Conference in 1998, and the McGraw-Hill First Edition of the Year Award in 2001. He was a co-recipient of both the Jack Kilby Outstanding Student Paper Award and the Beatrice Winner Award for Editorial Excellence at the 2001 ISSCC. He received the Lockheed Martin Excellence in Teaching Award in 2006, the UCLA Faculty Senate Teaching Award in 2007, and the CICC Best Invited Paper Award in 2009 and 2012. He was a co-recipient of the 2012 and 2015 VLSI Circuits Symposium Best Student Paper Awards and the 2013 CICC Best Paper Award. He was also recognized as one of the top ten authors in the 50-year history of ISSCC. He received the 2012 Donald Pederson Award in Solid-State Circuits. He was also a recipient of the American Society for Engineering Education PSW Teaching Award in 2014 and the 2017 IEEE CAS John Choma Education Award. He served on the Technical Program Committees of the ISSCC from 1993 to 2002 and the VLSI Circuits Symposium from 1998 to 2002. He has also served as a Guest Editor and an Associate Editor for the IEEE JOURNAL OF SOLID-STATE CIRCUITS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and International Journal of High Speed Electronics. He served as the founding Editor-in-Chief of the IEEE SOLID-STATE CIRCUITS LETTERS.