## A 56-Gb/s 8-mW PAM4 CDR/DMUX with High Jitter Tolerance

Guanrong Hou and Behzad Razavi

Electrical and Computer Engineering Department, University of California, Los Angeles, CA 90095, USA

guanronghou@ucla.edu

**Abstract** — An analog one-eighth-rate CDR circuit detects both major and minor transitions in PAM4 data by calculating the Euclidean distances between the sampled points. Realized in 28-nm CMOS technology, the prototype exhibits a jitter transfer bandwidth of 160 MHz and a jitter tolerance of 1 UI at 10 MHz.

Keywords: PAM4, CDR, automatic dc offset cancellation

The use of PAM4 signaling alleviates channel loss but introduces other difficulties in transceiver design. Specifically, the linearity and resolution issues have generally called for ADC-based PAM4 receivers [1, 2]. However, such architectures entail three drawbacks. First, the ADC and the subsequent digital processing consume high power, a particularly serious challenge in multi-lane applications. Second, the ADC clock must exhibit a very small jitter, e.g., less than 40 fs for a 7-bit converter sampling at 56 GHz (if the SNR penalty due to jitter must not exceed 3 dB). Third, the latency in the ADC and the digital processor limits the overall clock recovery loop bandwidth and jitter tolerance. An "analog" receiver, on the other hand, can greatly relax all three issues, but it requires both a CDR circuit and a DFE that robustly process PAM4 signals. This paper demonstrates the former by introducing three new ideas: a PAM4 phase detector (PD), a low-power comparator, and a background offset cancellation technique. Compared with the prior art summarized in Table I, the prototype provides a 3X improvement in power efficiency and a 5X increase in the frequency at which the jitter tolerance reaches 1 UI.
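The 40-fs figure can be checked to first order with the textbook expressions for quantization-limited and jitter-limited SNR. The sketch below is only a sanity check. It assumes a full-scale sinusoidal input at the Nyquist frequency (28 GHz for 56-GHz sampling) and equates the jitter noise to the quantization noise, which corresponds to a 3-dB SNR penalty. The function name and these assumptions are illustrative and are not stated in the text.

```python
import math

def max_rms_jitter(bits: int, f_in_hz: float) -> float:
    """RMS jitter at which jitter noise equals quantization noise (3-dB SNR penalty)."""
    snr_q_db = 6.02 * bits + 1.76  # ideal quantization SNR for a full-scale sine
    # Jitter-limited SNR is -20*log10(2*pi*f_in*sigma); set it equal to snr_q_db
    # and solve for sigma.
    return 10 ** (-snr_q_db / 20) / (2 * math.pi * f_in_hz)

# Assumed worst-case input: Nyquist frequency of a 56-GHz sampler
sigma = max_rms_jitter(bits=7, f_in_hz=28e9)
print(f"{sigma * 1e15:.1f} fs")
```

Under these assumptions the bound comes out slightly below 40 fs, consistent with the requirement quoted above.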

Fig. 1 shows the CDR/DMUX architecture. It consists of 16 PAM4 transition detectors, which collectively compute the phase error between  $D_{in}$  and the recovered clock, a charge pump (CP), a loop filter, a 28-GHz LC VCO, and a phase generation network delivering 16 clocks. The input data also drives 16 data extraction units, which provide demultiplexed NRZ outputs at 3.5 Gb/s. The loop filter and the CP current are programmable.

The most challenging aspect of PAM4 CDR design is the phase detector. While it is possible for the PD to sense only the major transitions of the PAM4 signal [1]-[4], such an approach leads to a narrow loop bandwidth as it discards at least half of the edges. To benefit from both major and minor transitions, we first examine the Alexander PD's operation [Fig. 2(a)]: we recognize that the XOR gates, in essence, measure the "Euclidean distance" between  $V_E$  and  $V_A$  and between  $V_B$  and  $V_E$  to decide if the clock is early or late. That is, a more general approach would be to determine whether  $V_E$  is closer to  $V_A$  or to  $V_B$ . Moreover, the PD must sample and align these values in time before the comparisons are made.
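For binary data, the distance interpretation can be made concrete: the XOR of two bits equals the absolute difference of their values, so comparing the two XOR outputs amounts to asking which data sample the edge sample is closer to. The following sketch assumes the sampling order $V_A$, $V_E$, $V_B$ (data, edge, next data) and labels the clock "late" when the edge sample already equals the new bit. The function name and the string labels are illustrative.

```python
def alexander_pd(a: int, e: int, b: int) -> str:
    """Binary Alexander PD decision from data bits a, b and edge sample e."""
    d_ae = a ^ e  # equals |a - e| for bits: distance from the edge sample to V_A
    d_eb = e ^ b  # equals |e - b| for bits: distance from the edge sample to V_B
    if d_ae == d_eb:
        return "hold"  # no transition, or an inconsistent sample set
    # Edge sample closer to V_B: it was taken after the data edge, so the clock is late
    return "late" if d_ae > d_eb else "early"

for a, e, b in [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1)]:
    print(a, e, b, alexander_pd(a, e, b))
```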

In our proposed PD, we extend these notions to PAM4 signals. As illustrated in Fig. 2(b), every major or minor transition can be characterized by three samples, $V_A$, $V_B$, and $V_E$, while the Euclidean distances $V_B - V_E$ and $V_E - V_A$ provide the phase information. The PD must therefore perform two functions: (a) sample and align these three values, and (b) compute the sign of $(V_{\rm B} - V_{\rm E}) - (V_{\rm E} - V_{\rm A})$.

At a symbol rate of 28 Gbaud, it is difficult to align the samples in the time domain and perform the subsequent operations related to the sign of $(V_{\rm B} - V_{\rm E}) - (V_{\rm E} - V_{\rm A}) = V_{\rm A} + V_{\rm B} - 2V_{\rm E}$. We thus employ a "divide-and-conquer" approach and sample the input with 16 clock phases, $CK_1$ to $CK_{16}$, each at 3.5 GHz. Owing to their 50% duty cycle, these phases provide an overlap time of $\Delta T = 107$ ps, during which the subtraction of the samples can occur [Fig. 2(c)].
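The timing numbers above follow from simple arithmetic. The sketch below reproduces them under one assumption that the text does not state explicitly: the three samples $V_A$, $V_E$, and $V_B$ are taken on three consecutive clock phases, so the window common to all three is the clock high time minus two phase spacings. The variable names are illustrative.

```python
BAUD = 28e9    # symbol rate, symbols/s
N_PHASES = 16  # clock phases CK_1 to CK_16
F_CK = 3.5e9   # frequency of each phase, Hz

ui = 1 / BAUD              # unit interval
t_ck = 1 / F_CK            # period of each clock phase
spacing = t_ck / N_PHASES  # delay between adjacent phases (half a UI)
high_time = 0.5 * t_ck     # 50% duty cycle

# Assumption: V_A, V_E, V_B use three consecutive phases, so the overlap
# shrinks by two phase spacings relative to the high time.
overlap = high_time - 2 * spacing

print(f"UI = {ui * 1e12:.1f} ps, spacing = {spacing * 1e12:.1f} ps, "
      f"overlap = {overlap * 1e12:.1f} ps")
```

With these values the phase spacing equals half a unit interval, and the overlap evaluates to approximately 107 ps, matching $\Delta T$.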

Fig. 3 shows the single-ended view of one input sampling and combining slice. The circuit generates  $V_{\rm X} = G_{\rm m1}(V_{\rm A} + V_{\rm B} - 2V_{\rm E}) + G_{\rm m2}V_{\rm cal}$ , where  $V_{\rm cal}$  serves to calibrate the offset. The  $G_{\rm m1}$  stages must accommodate the input PAM4 waveform with no compression and hence utilize resistive degeneration with a voltage drop of 250 mV. The comparator detects the sign of  $V_{\rm X}$ , delivering nearly rail-to-rail swings for further processing.

The linearity required of the $G_{m1}$ stages in Fig. 3 inevitably leads to a low voltage gain, accentuating the comparator offset. To calibrate all the offsets in this chain, we propose the comparator topology in Fig. 4. The circuit senses and amplifies $V_X$ while $S_1$ and $S_2$ are on. In this mode, $V_{out}$ is also stored on $C_1$ and $C_2$, requiring that the output pole be higher than 3.5 GHz. Next, $S_1$ and $S_2$ turn off and $S_3$, $S_4$, and $S_T$ turn on, allowing $M_1$ and $M_2$ to amplify the output regeneratively and produce nearly rail-to-rail swings. At the same time, the charge stored on $C_1$ and $C_2$ is shared with $C_3$ and $C_4$, and the resulting voltage is amplified by $A_1$. Viewing $C_1$, $S_5$, and $S_7$ as a resistor, we observe that each branch leading to $A_1$ low-pass filters $V_{out}$, thereby extracting the dc offset contributed by the $G_m$ stages and $M_1$ and $M_2$. The output of $A_1$ drives the input of $G_{m2}$ in Fig. 3.
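The low-pass action of the switched-capacitor branch can be illustrated with the charge-sharing recurrence. The sketch below is a behavioral model only: the capacitor ratio, the offset, and the alternating test pattern are hypothetical values chosen for illustration, and neither the amplifier $A_1$ nor the feedback through $G_{m2}$ is modeled.

```python
def sc_lowpass(vout_samples, c1=1.0, c3=999.0, v3_init=0.0):
    """Repeated charge sharing between a sampling cap c1 and a holding cap c3."""
    v3 = v3_init
    for v1 in vout_samples:
        # c1 holds v1 and c3 holds v3; after sharing, both settle to the
        # charge-weighted mean of the two voltages.
        v3 = (c1 * v1 + c3 * v3) / (c1 + c3)
    return v3

offset = 0.02  # hypothetical dc offset at V_out (arbitrary units)
swing = 1.0    # hypothetical data-dependent swing (arbitrary units)
samples = [offset + swing * (-1) ** n for n in range(20000)]
print(f"extracted dc = {sc_lowpass(samples):.4f}")
```

When the capacitors are switched at a rate $f$, the sampling capacitor behaves approximately as a resistor of value $1/(fC_1)$, which is the equivalence invoked in the text.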

Even though small, $C_1$ and $C_2$ in Fig. 4 can limit the bandwidth at the output nodes. This issue is resolved using clocks with a 25% duty cycle to control $S_5$ to $S_8$ so that the output data has sufficient time to settle after $S_5$ and $S_6$ turn off and before regeneration begins. These clocks are generated within the slice from $CK_1$ to $CK_{16}$.

The slice shown in Fig. 3 forms the core of each transition detector in Fig. 1. But the sign of  $V_A + V_B - 2V_E$  does not, by itself, provide complete phase information. This is because a rising, late data transition yields the same result as a falling, early edge. This ambiguity is removed by forming  $(V_A + V_B - 2V_E) \times (V_A - V_B)$  and implementing this function in the Up/Down logic that drives the CP. In the absence of data transitions, the logic disables the CP, thus minimizing pattern-dependent jitter.
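The polarity correction can be sketched behaviorally as follows. This is a simplified model, not the circuit's logic: it assumes normalized PAM4 levels of -3, -1, +1, and +3, detects the absence of a transition with an analog threshold (`min_step`, a hypothetical parameter) rather than with thermometer codes, and uses "late" to mean that the edge sample was taken after the midpoint of the transition.

```python
def pam4_pd(va: float, ve: float, vb: float, min_step: float = 1.0) -> str:
    """Early/late decision from two data samples (va, vb) and the edge sample ve."""
    polarity = va - vb  # negative for a rising transition, positive for a falling one
    if abs(polarity) < min_step:
        return "hold"   # no transition: the charge pump is disabled
    metric = (va + vb - 2 * ve) * polarity  # polarity-corrected phase error
    if metric == 0:
        return "hold"   # edge sample exactly midway between va and vb
    return "late" if metric > 0 else "early"

print(pam4_pd(-1, 0.4, 1))   # rising minor transition, edge sample past the midpoint
print(pam4_pd(1, 0.4, -1))   # falling minor transition, edge sample before the midpoint
print(pam4_pd(-3, 1.2, 3))   # rising major transition, edge sample past the midpoint
print(pam4_pd(1, 1, 1))      # no transition
```

The first two calls share the same value of $V_A + V_B - 2V_E$ yet correspond to opposite phase errors. Multiplying by $V_A - V_B$ is what separates them.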

Each data extraction unit in Fig. 1 incorporates a passive sampler followed by three StrongArm comparators (Fig. 5). The clock, $CK_j$, is the same as that in Fig. 3 to guarantee that the sampling occurs in the middle of each PAM4 eye. The thermometer code thus generated also drives the Up/Down logic to disable the CP in the absence of data transitions. The retimers in Fig. 5 ensure that this result and the retimer's output in Fig. 3 reach the Up/Down logic at the same time.

The phase generation block in Fig. 1 consists of a tree of $\div$2 stages. For a given power budget, the LC VCO and this tree exhibit less jitter than a ring oscillator followed by a phase interpolator.
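The slicing performed by each data extraction unit (Fig. 5) can be sketched as three threshold comparisons that produce a thermometer code. The thresholds below assume normalized PAM4 levels of -3, -1, +1, and +3, and the plain binary bit mapping is an assumption, since the text does not state whether binary or Gray coding is used. All function names are illustrative.

```python
def slice_pam4(v: float, thresholds=(-2.0, 0.0, 2.0)) -> tuple:
    """Three comparators produce a thermometer code, e.g. (1, 1, 0) for the level +1."""
    return tuple(int(v > t) for t in thresholds)

def thermometer_to_bits(code: tuple) -> tuple:
    """Thermometer code to (msb, lsb); plain binary mapping assumed for illustration."""
    level = sum(code)  # number of comparators that fired, 0 to 3
    return (level >> 1, level & 1)

def has_transition(prev_code: tuple, curr_code: tuple) -> bool:
    """The charge pump is enabled only when consecutive symbols differ."""
    return prev_code != curr_code

prev, curr = slice_pam4(-1.0), slice_pam4(1.0)
print(prev, curr, thermometer_to_bits(curr), has_transition(prev, curr))
```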

The CDR/DMUX circuit has been fabricated in 28-nm CMOS technology. Fig. 7(a) shows the die photograph and Fig. 7(b) shows the measured recovered clock phase noise up to 100 MHz for a loop bandwidth of 160 MHz. The phase noise at higher offsets is measured manually from the spectrum analyzer. Integrated up to 14 GHz, the rms jitter amounts to 574 fs. Fig. 6 plots the measured jitter transfer and tolerance, the latter reaching 1 $\mathrm{UI_{pp}}$ at 10 MHz with PRBS15. Relative to the best prior values in Table I, the prototype achieves a 5X increase in the 1-UI jitter tolerance frequency and a 3X improvement in power efficiency.
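The two improvement factors can be cross-checked directly from Table I. The short script below copies the data rate, power, and 1-UI jitter tolerance frequency of each entry from the table and recomputes the power efficiency and the ratios to the best prior value. The variable and function names are illustrative.

```python
# (data rate in Gb/s, power in mW, 1-UI jitter tolerance frequency in MHz), from Table I
table = {
    "This Work": (56, 8, 10),
    "[1]": (56, 49.2, 0.5),
    "[2]": (28, 47, 0.6),
    "[3]": (32, 14.7, 2),
    "[4]": (29.1, 19.16, 1.8),
    "[5]": (32, 32, 1),
}

def pj_per_bit(rate_gbps: float, power_mw: float) -> float:
    return power_mw / rate_gbps  # mW divided by Gb/s gives pJ/bit

eff = {k: pj_per_bit(r, p) for k, (r, p, _) in table.items()}
prior = [k for k in table if k != "This Work"]

# Ratios to the best prior value in each category
eff_gain = min(eff[k] for k in prior) / eff["This Work"]
tol_gain = table["This Work"][2] / max(table[k][2] for k in prior)
print(f"efficiency gain = {eff_gain:.1f}X, tolerance gain = {tol_gain:.1f}X")
```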

Acknowledgements: The authors thank the TSMC University Shuttle Program for chip fabrication. This research was supported by Realtek Semiconductor.

## References

- [1] A. Roshan-Zamir et al., IEEE JSSC, Mar. 2019, pp. 672-684.
- [2] Aurangozeb et al., IEEE JSSC, Mar. 2018, pp. 772-788.
- [3] Z. Zhang et al., IEEE JSSC, Oct. 2020, pp. 2734-2746.
- [4] X. Zhao et al., IEEE CICC, 2020, pp. 1-4.
- [5] D.-H. Kwon et al., IEEE TCS-II, Mar. 2019, pp. 362-366.



Fig. 1. Proposed CDR/DMUX architecture.



Fig. 2. (a) Alexander PD, (b) PAM4 PD operation, (c) proposed PD timing.



Fig. 3. Phase detector slice.



Fig. 4. Proposed comparator.



Fig. 5. Data extraction unit.



Fig. 6. Jitter transfer and tolerance for different loop settings.



Fig. 7. (a) Die photograph, (b) phase noise (100 Hz to 100 MHz).

TABLE I. Performance Summary.

|                              | This Work | [1]   | [2]   | [3]   | [4]   | [5] |
|------------------------------|-----------|-------|-------|-------|-------|-----|
| Data Rate (Gb/s)             | 56        | 56    | 28    | 32    | 29.1  | 32  |
| Power (mW)                   | 8         | 49.2* | 47*   | 14.7  | 19.16 | 32  |
| 1-UI Jitter Tol. Freq. (MHz) | 10        | 0.5   | 0.6   | 2     | 1.8   | 1   |
| Loop BW (MHz)                | 160       | 10    | 11    | 10    | 12    | 10  |
| CK Jitter (ps)               | 0.574     | N/A   | 0.513 | 0.352 | 0.487 | 3.8 |
| Technology (nm)              | 28        | 65    | 65    | 40    | 28    | 28  |
| Power Efficiency (pJ/bit)    | 0.14      | 0.88  | 1.68  | 0.46  | 0.66  | 1   |

\*Includes only the CDR portion for a fair comparison.