## A 56-Gb/s 17-mW NRZ Receiver in 0.018 mm<sup>2</sup>

Kshitiz Tyagi and Behzad Razavi

Electrical and Computer Engineering Department, University of California, Los Angeles, CA 90095, USA

ktyagi30@ucla.edu

Abstract - An NRZ receiver incorporates new architecture and circuit techniques to achieve low power consumption and a small area. Realized in 28-nm CMOS technology, the full-rate RX operates at 56 Gb/s with a bit error rate below 10<sup>-12</sup> for a channel loss of over 25 dB at 28 GHz.

Keywords: NRZ, wireline, receiver, low-power.

The large number of short-reach and medium-reach links in a typical wireline system requires that both their power consumption and their footprint be minimized. This paper introduces several architecture and circuit techniques that lead to low-power, compact receivers in 28-nm CMOS technology. It also demonstrates that NRZ signaling achieves a BER below 10<sup>-12</sup> for channel losses up to 25 dB at 28 GHz, thereby obviating the need for power-hungry, complex PAM4 implementations [1],[2]. Following a "minimalist" approach, the design adopts full-rate operation, which in fact consumes less power than half-rate or quarter-rate architectures: the full-rate DFE imposes less loading on the CTLE, and the CDR presents a smaller input capacitance to the DFE. The compact floorplan affords short interconnects, obviating the need for buffers and further lowering the power.

Figure 1(a) shows the RX architecture at a high level. While it appears fairly standard, we explain below how it departs from conventional designs so as to achieve low power consumption. Depicted in Fig. 1(b), the CTLE consists of two stages. To reduce the CTLE area, we use active inductors as its loads. It was recognized in [3] that the structure depicted in Fig. 1(c) consumes less voltage headroom than a diode-connected device and exhibits an inductive impedance, $Z_X$. This concept was later modified and used in driver design in [4]. Figure 1(d) presents our realization of the first stage, where V<sub>1</sub> adjusts the resistance in series with the source follower and hence the inductance value. We have introduced capacitors C<sub>1</sub> and C<sub>2</sub> to provide feedforward paths to the gates of M<sub>3</sub> and M<sub>4</sub>, thus raising the gain at high frequencies by 2.4 dB. The CTLE's second stage is similar, except that it employs negative Miller capacitors instead of feedforward paths. The CTLE provides a maximum boost of 14 dB at 28 GHz while drawing 4.8 mW.
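As a rough aid to intuition, the sketch below uses the textbook first-order model of a source follower whose gate is returned to ac ground through a resistor $R$. Looking into the source, $Z = (1 + sRC_{GS})/(g_m + sC_{GS})$, which equals $1/g_m$ at dc, approaches $R$ at high frequencies, and is inductive when $R > 1/g_m$. The equivalent network is $R_1 + (R_2 \parallel L)$ with $R_1 = 1/g_m$, $R_2 = R - 1/g_m$, and $L = C_{GS}(R - 1/g_m)/g_m$, so raising $R$ raises $L$, in the spirit of the V<sub>1</sub> control in Fig. 1(d). This is an idealized assumption (no $g_{ds}$, $C_{GD}$, or body effect), the device values are hypothetical, and the actual circuits of Figs. 1(c) and 1(d) may differ in detail.

```python
import math


def z_exact(f_hz, gm, r, c_gs):
    """Impedance looking into the source of a follower whose gate is
    returned to ac ground through r (idealized: no g_ds, C_GD, body effect).
    Z = (1 + s*r*C_GS) / (gm + s*C_GS)."""
    s = 2j * math.pi * f_hz
    return (1 + s * r * c_gs) / (gm + s * c_gs)


def rl_equivalent(gm, r, c_gs):
    """Element values of the equivalent network R1 + (R2 || L)."""
    if r <= 1 / gm:
        raise ValueError("the impedance is inductive only when r > 1/gm")
    r1 = 1 / gm
    r2 = r - 1 / gm
    ind = c_gs * r2 / gm
    return r1, r2, ind


def z_equivalent(f_hz, gm, r, c_gs):
    """Impedance of R1 + (R2 || L); algebraically equal to z_exact."""
    r1, r2, ind = rl_equivalent(gm, r, c_gs)
    s = 2j * math.pi * f_hz
    return r1 + (r2 * s * ind) / (r2 + s * ind)


# Hypothetical device values, for illustration only (not from the paper).
GM, C_GS = 20e-3, 20e-15
for r in (300.0, 600.0):  # larger r -> larger equivalent L
    _, _, ind = rl_equivalent(GM, r, C_GS)
    z_low = abs(z_exact(1e6, GM, r, C_GS))
    z_28g = abs(z_exact(28e9, GM, r, C_GS))
    print(f"r = {r:.0f} ohm: L = {ind * 1e9:.2f} nH, "
          f"|Z| = {z_low:.1f} ohm at 1 MHz, {z_28g:.1f} ohm at 28 GHz")
```

With these values, $|Z|$ starts near $1/g_m = 50\ \Omega$ at low frequencies and rises toward $R$, which is the high-frequency peaking that an inductive CTLE load provides.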

Shown in Fig. 2(a), the CML DFE incorporates four latches, $L_1$-$L_4$, along with two taps. We propose two techniques that considerably improve its performance. First, we apply the concept of "reverse scaling": the latches for the second tap need not be as large as those for the first. We scale $L_3$ and $L_4$ down by a factor of 2, thus lowering the load seen by $L_2$ and saving power. The CML latches employ shunt peaking by means of stacked square spiral inductors, allowing 56-Gb/s operation despite the speed limitations of 28-nm CMOS technology.
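The tap arithmetic of a two-tap DFE can be sketched behaviorally: the summing node subtracts $c_1 d[n-1] + c_2 d[n-2]$, where the previous decisions stand in for the outputs of the first-tap ($L_1$, $L_2$) and second-tap ($L_3$, $L_4$) flipflops. The pulse-response cursors and tap weights below are hypothetical, and the model ignores latch timing, sizing, and the reverse scaling itself; it only shows how the fed-back decisions cancel post-cursor ISI.

```python
import random


def channel(bits, h):
    """Discrete-time channel: y[n] = sum_k h[k] * b[n-k] (b[n] = +/-1).
    Symbols before n = 0 are taken as 0."""
    return [sum(h[k] * bits[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(bits))]


def dfe_2tap(y, c1, c2):
    """Two-tap DFE: v[n] = y[n] - c1*d[n-1] - c2*d[n-2], then slice.
    d1 and d2 stand in for the first- and second-tap flipflop outputs."""
    d1 = d2 = 0
    decisions, summing = [], []
    for sample in y:
        v = sample - c1 * d1 - c2 * d2   # summing-node voltage
        d = 1 if v >= 0 else -1          # slicer decision
        summing.append(v)
        decisions.append(d)
        d1, d2 = d, d1                   # shift the decision history
    return decisions, summing


random.seed(1)
bits = [random.choice((-1, 1)) for _ in range(2000)]
H = [1.0, 0.7, 0.4]  # hypothetical main cursor and two post-cursors
y = channel(bits, H)

raw, _ = dfe_2tap(y, 0.0, 0.0)        # slicer alone
eq, v = dfe_2tap(y, H[1], H[2])       # taps matched to the post-cursors
errs_raw = sum(a != b for a, b in zip(raw, bits))
errs_eq = sum(a != b for a, b in zip(eq, bits))
print(f"errors without DFE: {errs_raw}, with 2-tap DFE: {errs_eq}")
```

Because the assumed post-cursors sum to more than the main cursor, the bare slicer makes errors on patterns such as $+1$ preceded by two $-1$ symbols, whereas matched taps restore the full summing-node margin.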

Nonetheless, the loop timing for the first tap remains challenging, as a flipflop (FF) built from such latches exhibits an input sensitivity of 250 mV at 56 Gb/s. The FF output eye is therefore nearly closed, and low-BER operation would require an inordinately large eye opening at the summing node.

Our second proposed technique is a new flipflop with discrete-time negative feedback. We recognize that the limited bandwidth at the output of $L_1$ in Fig. 2(a) can be viewed as a "lossy channel" that produces a long tail in its pulse response. We exploit the fact that $L_2$ holds the *previous* bit while $L_1$ is in the sense mode: we remove the ISI at the output of $L_1$ by feeding back a fraction of the output of $L_2$. The circuit implementation is shown in Fig. 2(b), where XC denotes the clocked cross-coupled pair. The proposed feedback improves the sensitivity to 30 mV, allows the FF to regenerate much faster, and opens the eye at the FF output, as shown by the eye diagrams in Figs. 2(c) and (d). Note that the feedback path is active only when $L_1$ is in the sense mode and does not interfere with the regeneration of XC when $L_1$ enters the hold mode.
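The effect of this feedback can be summarized with a peak-distortion estimate. As a hypothetical abstraction of the "lossy channel" view, model the $L_1$ output as a single-pole node that retains a fraction $p$ of its previous value each bit period, so its sampled pulse response is $h_k = (1-p)p^k$. Feeding back a fraction $\alpha$ of the bit $b[n-1]$ held by $L_2$ cancels the first tail term when $\alpha = h_1$. The sketch computes the worst-case distance from the decision threshold for symbols $\pm 1$; it does not model regeneration speed or the actual circuit of Fig. 2(b).

```python
def first_order_tail(p, n_taps=20):
    """Sampled pulse response of a hypothetical single-pole node that keeps
    a fraction p of its previous value each bit period: h_k = (1-p) p^k."""
    return [(1 - p) * p ** k for k in range(n_taps)]


def worst_case_opening(h, alpha=0.0):
    """Worst-case distance from the threshold (symbols +/-1) at the L1
    output. alpha is the fraction of the previous bit b[n-1], held by L2,
    that is subtracted to cancel the first tail term h[1]."""
    residual = [h[1] - alpha] + list(h[2:])
    return h[0] - sum(abs(x) for x in residual)


h = first_order_tail(p=0.5)               # h = [0.5, 0.25, 0.125, ...]
no_fb = worst_case_opening(h)             # tail nearly equals the main cursor
with_fb = worst_case_opening(h, alpha=h[1])
print(f"without feedback: {no_fb:.4f}, with feedback: {with_fb:.4f}")
```

For $p = 0.5$, the tail nearly cancels the main cursor and the opening is essentially zero; removing only the first post-cursor restores an opening of about 0.25, which mirrors, qualitatively, the nearly closed versus open eyes of Figs. 2(c) and (d).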

In conventional receivers, the CDR input is tied to a node in the CTLE path or to the DFE summing junction. The latter provides a reasonable eye opening but is sensitive to the CDR's large input capacitance. In this work, we move this interface further down the chain and feed the CDR from the first DFE flipflop [Fig. 3(a)].

This approach offers four advantages. First, it avoids loading node X, whose eye opening is the most important. Second, it senses threefold larger voltage swings at B owing to the flipflop regeneration. Third, the much smaller ISI jitter at this port maintains a high gain for the CDR's phase detector (PD), guaranteeing a wide lock range and jitter tolerance even in the presence of high-loss channels. Fourth, with such large swings and with the aid of self-biased inverters, we can employ a CDR architecture with a "zero-power" PD.

Depicted in Fig. 3(b), the CDR is based on [5] but includes a variable phase shift in the loop for fine alignment of CK with the data. This topology avoids regenerative flipflops and can thus draw less power than other structures, but the switches in the PD require large data swings at their gates. This interface issue is resolved in our architecture because the data is sensed at the output of $FF_1$.

The variable phase shift in Fig. 3(b) operates as follows. Programmable current sources $I_1$ and $I_2$ can create a positive or negative shift in the inverter output. Since the VCO requires a certain dc level at node N, such a shift must be compensated by a phase shift at the VCO output. With a range of $\pm$5 ps ($\pm$0.3 UI), this circuit places the clock edge in the middle of the data eye at the DFE summing junction. The 56-GHz LC VCO drives the DFE without buffers and draws 5.2 mW.
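For reference, the unit interval at 56 Gb/s and the phase-shift range convert as follows. The $\pm$5 ps range corresponds to 0.28 UI (rounded to 0.3 UI in the text) and, because the full-rate clock period equals 1 UI, to roughly $\pm$101° of the 56-GHz clock. The helper names are illustrative.

```python
def unit_interval_ps(bit_rate_bps):
    """Duration of one bit (1 UI) in picoseconds."""
    return 1e12 / bit_rate_bps


def ps_to_ui(t_ps, bit_rate_bps):
    """Express a time offset in unit intervals."""
    return t_ps / unit_interval_ps(bit_rate_bps)


def ps_to_degrees(t_ps, clock_hz):
    """Express a time offset as clock phase in degrees."""
    return 360.0 * t_ps * 1e-12 * clock_hz


RATE = 56e9        # full-rate NRZ, so the clock also runs at 56 GHz
SHIFT_PS = 5.0     # one-sided range of the variable phase shift
print(f"1 UI = {unit_interval_ps(RATE):.2f} ps")
print(f"+/-{SHIFT_PS} ps = +/-{ps_to_ui(SHIFT_PS, RATE):.2f} UI "
      f"= +/-{ps_to_degrees(SHIFT_PS, RATE):.0f} deg")
```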

For bathtub measurements, the CDR loop must be disabled and an external clock applied. To this end, we have provided an injection port in Fig. 3(b) that receives a clock and locks the VCO to it. This method obviates the need for a high-speed multiplexer in the clock path. Moreover, it does not require differential phases for the external injection.
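Injection locking holds only while the injected clock lies within the VCO's lock range. A common first-order estimate is Adler's weak-injection result, $\omega_L \approx (\omega_0/2Q)(I_{inj}/I_{osc})$. The sketch below assumes injection at the 56-GHz fundamental and uses a hypothetical tank $Q$ and injection ratio; the paper does not report these values.

```python
def adler_lock_range_hz(f0_hz, q_tank, inj_ratio):
    """One-sided lock range from Adler's weak-injection approximation:
    f_L ~= f0 / (2 Q) * (I_inj / I_osc), valid for I_inj << I_osc."""
    return f0_hz / (2.0 * q_tank) * inj_ratio


def locks(f_inj_hz, f0_hz, q_tank, inj_ratio):
    """True if the injected frequency lies inside the estimated lock range."""
    return abs(f_inj_hz - f0_hz) < adler_lock_range_hz(f0_hz, q_tank, inj_ratio)


# Hypothetical values (not reported in the paper): free-running VCO at
# 56 GHz, tank Q of 10, injection current 10% of the oscillator current.
F0, Q, RATIO = 56e9, 10.0, 0.1
f_lock = adler_lock_range_hz(F0, Q, RATIO)
print(f"lock range: +/-{f_lock / 1e6:.0f} MHz")
print(locks(56.1e9, F0, Q, RATIO), locks(56.5e9, F0, Q, RATIO))
```

Under these assumptions, an external clock 100 MHz away from the free-running frequency would lock, while one 500 MHz away would not.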

The receiver has been fabricated in 28-nm CMOS technology [Fig. 4(a)]. Unless otherwise stated, all results are reported for a channel loss of 25.5 dB at 28 GHz. Figure 4(b) plots the eye diagram of the 56-Gb/s chip output data. Plotted in Fig. 5(a) are two measured bathtub curves: one for a loss of 25.5 dB, and the other for 30 dB but with an FFE function of the form $-0.2 + 0.8z^{-1}$ implemented within the pattern generator.
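Reading $z^{-1}$ as a one-UI delay, the FFE $-0.2 + 0.8z^{-1}$ weights the delayed symbol by 0.8 and the following symbol by $-0.2$. A minimal FIR sketch, assuming NRZ symbols $\pm 1$ and treating symbols before the start as zero:

```python
def apply_ffe(symbols, taps):
    """FIR filter y[n] = sum_k taps[k] * x[n-k]; symbols before n = 0 are 0.
    taps = (-0.2, 0.8) realizes -0.2 + 0.8 z^-1."""
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, c in enumerate(taps):
            if n - k >= 0:
                acc += c * symbols[n - k]
        out.append(acc)
    return out


TAPS = (-0.2, 0.8)
x = [1, 1, 1, -1, -1, 1, -1]
y = apply_ffe(x, TAPS)
print([round(v, 2) for v in y])
```

Apart from the first sample, the output takes the levels $\pm 0.6$ when a symbol repeats and $\pm 1.0$ when the next symbol differs, so the filter emphasizes transitions.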

To characterize the complete RX (including the CDR), we measure the recovered clock shown in Fig. 5(b), which exhibits an rms jitter of 338 fs. The jitter tolerance [Fig. 5(c)] remains greater than 0.7 UI<sub>pp</sub> at 5 MHz, and the jitter transfer [Fig. 5(d)] reveals a CDR bandwidth of 80 MHz. Table I summarizes the RX performance and compares it with the prior art. Our prototype demonstrates a 2.8-fold improvement in both power efficiency and area.

**Acknowledgements** We thank the TSMC University Shuttle program for chip fabrication.

## References

- [1] E. Depaoli et al., IEEE JSSC, pp. 6-17, Jan. 2019.
- [2] A. Roshan-Zamir et al., IEEE JSSC, pp. 672-684, Mar. 2019.
- [3] B. Razavi, *Design of Analog CMOS Integrated Circuits*, McGraw-Hill, 2001.
- [4] Y. S. M. Lee et al., IEEE ASSCC, Nov. 2008.
- [5] L. Kong et al., IEEE JSSC, pp. 2857-2866, Oct. 2019.



Fig. 1. (a) Receiver architecture, (b) CTLE block diagram, (c) active-inductor topology, and (d) proposed CTLE implementation.



Fig. 2. (a) DFE architecture, (b) proposed FF, FF output eye (c) without, and (d) with discrete-time feedback.



Fig. 3. (a) CDR-DFE interface, and (b) CDR design details, along with external injection path facilitating bathtub measurements.



Fig. 4. (a) Die micrograph, and (b) output data eye diagram.



Fig. 5. Measured (a) bathtub curves, (b) recovered clock jitter, (c) jitter tolerance, and (d) jitter transfer.

TABLE I. Performance Summary.

| Reference                    | Shibasaki<br>ISSCC 2016        | Depaoli<br>JSSC 2019              | Cevrero<br>ISSCC 2019           | Atharav<br>ISSCC 2021          | This<br>Work                    |
|------------------------------|--------------------------------|-----------------------------------|---------------------------------|--------------------------------|---------------------------------|
| Data Rate (Gb/s)             | 56                             | 64                                | 50                              | 56                             | 56                              |
| Modulation                   | NRZ                            | PAM4                              | NRZ                             | NRZ                            | NRZ                             |
| Architecture                 | CTLE<br>1-tap DFE              | 2-tap RX FFE<br>CTLE<br>3-tap DFE | CTLE<br>3-tap DFE               | CTLE + DTLE<br>2-tap DFE       | CTLE +<br>2-tap DFE             |
| Channel Loss<br>@ Nyquist    | 18.4 dB<sup>\*</sup>           | 16.8 dB<sup>\*\*</sup>            | 37.8 dB<sup>\*</sup>            | 25 dB                          | 25.5 dB                         |
| BER/<br>Eye Opening          | <10<sup>-9</sup>/<br>0.28 UI   | <10<sup>-6</sup>/<br>0.19 UI      | <10<sup>-12</sup>/<br>0.44 UI   | <10<sup>-12</sup>/<br>0.4 UI   | <10<sup>-12</sup>/<br>0.49 UI   |
| Power (mW)                   | 141.7                          | 180<sup>\$</sup>                  | 112<sup>\$</sup>                | 49.56                          | 16.84                           |
| Power Efficiency<br>(pJ/bit) | 2.53                           | 2.81                              | 2                               | 0.88                           | 0.32                            |
| Area (mm<sup>2</sup>)        | 1.4<sup>#</sup>                | 0.32                              | 0.053                           | 0.102                          | 0.018                           |
| Technology                   | 28-nm<br>CMOS                  | 28-nm<br>FDSOI                    | 14-nm<br>FinFET                 | 28-nm<br>CMOS                  | 28-nm<br>CMOS                   |

<sup>\*</sup>Includes 2-tap TX FFE; <sup>\*\*</sup>Includes 4-tap TX FFE; <sup>\$</sup>Excludes clock generation; <sup>#</sup>Includes TX area
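The power-efficiency row of Table I follows from dividing power by data rate, since 1 mW / (1 Gb/s) = 1 pJ/bit. A minimal check using only the power and data-rate entries listed in the table:

```python
# Power (mW) and data rate (Gb/s) as listed in Table I.
TABLE_I = {
    "Shibasaki ISSCC 2016": (141.7, 56),
    "Depaoli JSSC 2019": (180.0, 64),
    "Cevrero ISSCC 2019": (112.0, 50),
    "Atharav ISSCC 2021": (49.56, 56),
    "This work": (16.84, 56),
}


def pj_per_bit(power_mw, rate_gbps):
    """Energy per bit: (1e-3 J/s) / (1e9 b/s) = 1e-12 J/b, so mW / (Gb/s)
    is numerically equal to pJ/bit."""
    return power_mw / rate_gbps


for name, (p_mw, r_gbps) in TABLE_I.items():
    print(f"{name:<22} {pj_per_bit(p_mw, r_gbps):.3f} pJ/bit")
```

The ratio reproduces the tabulated efficiencies of Shibasaki, Depaoli, and Atharav to within rounding. For Cevrero (112 mW at 50 Gb/s) and for this work (16.84 mW at 56 Gb/s), it evaluates to about 2.24 and 0.30 pJ/bit, compared with the listed 2 and 0.32 pJ/bit.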