# Efficient Systolic Architectures for 1-D and 2-D DLMS Adaptive Digital Filters

Lan-Da Van and Wu-Shiung Feng\*

Lab 331, Department of Electrical Engineering, National Taiwan University, Taipei, China \*Department of Electronics Engineering, Chang Gung University, Taoyuan, China

E-mail: d86029@cad.ee.ntu.edu.tw

#### Abstract

In this paper, we propose two efficient systolic architectures for 1-D and 2-D Delay Least-Mean-Square (DLMS) adaptive digital filters. Using our developed architectures, higher convergence rate and Signal-to-Noise Ratio (SNR) than those of the conventional DLMS structure can be obtained without sacrificing the properties of the systolic architecture. Furthermore, the adaptive digital filters operate at the highest throughout due to the new tree-systolic processing element. Besides, based on our proposed optimized rule, one can easily design N th tap and window size N × N systolic adaptive digital filters with the compromise of minimum delay and high regularity under the constraint of the maximum number of tap-connections of the feedback signal.

## I. Introduction

Adaptive digital filters have a wide range of communication and DSP applications such as adaptive equalizer, image restoration, and so on. The most widely used algorithm for adaptive digital filters is the Least-Mean-Square (LMS) algorithm due to its superior performance and simple calculation. However, owing to the maximum connections of the feedback error signal and the requirement of the local connection in hardware, it is difficult to directly implement the LMS algorithm via VLSI techniques without considering delays. Thus, Long et al. [1] developed the DLMS algorithm such that VLSI design of an approximating LMS adaptive filter could be possible. Therefore, some researches [2, 3] have been conducted on the systolic-array architecture of the DLMS adaptive filter. In [1-3], they either require shorter delay without considerations of systolic array or longer delay via entire taps of the systolic array. Recently, Douglas et al. [4] and Matsubara et al. [5] proposed new structures that can approach well to the convergence of LMS algorithm. Nevertheless, it additionally requires a larger area to calculate the new derived feedback error and has low throughput rate if delay D decreases. On the other hand, 2-D signal processing requires a much larger amount of computations than the 1-D adaptive digital filter, so high throughpupt and realizable features are desirable. It is our motivation to propose two efficient systolic architectures of 1-D and 2-D DLMS algorithm for DSP applications.

#### 11. Efficient 1-D and 2-D Systolic Architectures

The LMS adaptive algorithm minimizes the mean-square error by recursively altering the weights at

each sampling instance. The 1-D LMS adaptive digital filter can be described as

$$y(i) = \sum_{k=0}^{N-1} w_k(i) x(i-k),$$
 (1)

$$e(i) = d(i) - y(i), \qquad (2)$$

$$w_k(i+1) = w_k(i) + \mu e(i)x(i-k),$$
 (3)

where  $w_k(i)$  and x(i) denote the k th tap weight and the i th input sample for an N th-tap adaptive digital filter, respectively. On the other hand, d(i) and y(i) represent the desired signal and output signal, respectively. The step-size  $\mu$  is used for adaptation of the weights, and e(i) is the feedback error. The coefficient updates using the DLMS algorithm can be represented as:

$$w_k(i+1) = w_k(i) + \mu e(i-D)x(i-k-D)$$
, (4)  
where  $D$  is a delay value in weight adaptation.

Here, we apply tree concept to devise a new tree-systolic PE such like PE2 as shown in Fig. 1, where unit delay elements painted with cross section are inserted to obtain the lowest critical period (i.e., highest throughput rate). Note that the symbol, **\bigcup\_{\text{n}}**, denotes the unit delay element. The critical period is defined as the minimum operation cycle-time for correct responses. The subscript 2 of  $PE_2$  is the number of the tree level, that is, PE2 allows us to concurrently adjust 4 taps per each clock and other  $PE_p$ , for  $p \in$  nonnegative integers, can be obtained by the similar technique. Note that  $p \le p_{\text{max}}$ . The maximum value of  $p_{\text{max}}$  is based on practical maximum number of tap-connections of the feedback signal. Hence, the new PE with finite and local connections leads to higher degree of realization unlike other structures in [1-5]. Although we had proposed similar PE [6], the ever proposed PE cannot be exactly mapped to weight update equation, Eq. (4), so we modified it to Fig. 1. Thus, the generalized 1-D systolic architecture of the adaptive filter is depicted in Fig. 2, where  $z_2^{-1}$ equals unit delay. It is observed that when p is equal to 0, this architecture can be reduced to fully pipelined architecture. According to the architecture as shown in Fig. 2, we induce the following rule:

Rule:

$$S_p = N \mod 2^p$$

$$k = (p-1): -1: 0$$

$$S_k = S_{k+1} \mod 2^k$$
end
$$D = p+1 + \left\lfloor \frac{N}{2^p} \right\rfloor + \sum_{k=0}^{p-1} \left\lfloor \frac{S_{k+1}}{2^k} \right\rfloor,$$
(5)

where  $S_k$  is the k th residue. The notation of  $\lfloor \bullet \rfloor$  is the maximum integer value less than or equal to  $\bullet$ . Therefore, the rule shows a trade-off between the minimum delay and high regularity while varying p in the constraint of the maximum number of tap-connections. As an example, when N is equal to 62 and the maximum number of tap-connections is 32 taps at each clock, we find that the tree level could be equal to 4 or 5 for the same minimum delay. With respect to the characteristic of the regularity, we observe that p=4 tree level implies that fewer kinds of PE's are required than that of p=5.

With the sprit of constructing an 1-D systolic adaptive digital filter and rule, we further make an effort to design a 2-D systolic adaptive digital filter. Since the window size  $N \times N$  2-D adaptive digital filter has been widely applied to image application, one of the possible methods to realize 2-D DLMS algorithm [7, 8] processing an image size  $M \times M$  can be briefly described as:

$$y(m,n) = \sum_{l=0}^{N-1} \sum_{k=0}^{N-1} w_{l,k}(j) x(m-l,n-k),$$
 (6)

$$e(j) = d(m,n) - y(m,n),$$
 (7)

$$\omega_{lk}(j+1) = \omega_{lk}(j) + \mu e(j-D_{2l})x(\widetilde{m}-l,\widetilde{n}-k)$$
, (8)

$$\widetilde{m} = \left\lfloor \frac{j - D_{2D}}{M} \right\rfloor, \quad \widetilde{n} = (j - D_{2D}) \mod M, \quad (9)$$

where j=mM+n and l as well as k are in the range of  $\{0 \le l \le N-1, \ 0 \le k \le N-1\}$ . d(m,n), x(m,n), y(m,n) and e(j) represent desired, input, output, and error feedback signals. It is known that given values,  $p_{\max}$ , are the same for 1-D and 2-D adaptive systems, so the tree-level relationship can be expressed as

$$q_H + q_{1'} \le p_{\max} , \qquad (10)$$

where  $q_H$  and  $q_V$  denote the horizontal and vertical tree levels, respectively. For simplification, while  $q_V = 1$ , the novel 2-D systolic architecture as shown in Fig. 3 in which  $z_1^{-1} = z_2^{-M}$  remain the similar characteristics of 1-D systolic architecture. Similarly, the total delay of the 2-D adaptive digital filter can be formulated as

$$D_{2D} = q_H + q_{V} + 1 + \left\lfloor \frac{N}{2^{q_H}} \right\rfloor + \left\lfloor \frac{N}{2^{q_V}} \right\rfloor + \left\lfloor \frac{N}{2^{q_V}} \right\rfloor + \sum_{k=1}^{q_H} \left\lfloor \frac{S_{k_1}}{2^{k_1 - 1}} \right\rfloor + \sum_{k=1}^{q_V} \left\lfloor \frac{S_{k_2}}{2^{k_2 - 1}} \right\rfloor.$$
 (11)

### III. Comparison Results

In view of hardware characteristics, we provide the comprehensive comparison results of the different N th-tap adaptive filter structures in Table 1. Since D-Z-Ss' structure [4] and M-N-Ks' structure [5] independently utilized the same conversion of DLMS into LMS, the proposed structures owned the similar characteristics. Hence, we only tabulate the features of M-N-Ks' structure regardless of D-Z-Ss' structure. Let  $T_m$  and  $T_a$  denote the operating time for one multiplier as well as one adder, respectively. In [5], since  $\delta$  and r have been defined, we omit here.

In terms of critical period, the proposed architecture has the lowest period independent of other control parameters among five structures in Table I. Before comparing the convergence performance, we define the convergence in [2, 3] as "normal". Of course, the convergence of the original LMS algorithm is our aim, so we address that the convergence of the LMS is denoted as "best". For convenient comparisons of convergence of M-N-Ks' structure with our work, we briefly show the constraints of M-N-Ks' pipeline design as:

$$\delta \le D - D_B, \quad D_B = \left[ \frac{T_m + \lceil \log_2 \delta \rceil T_a}{T_m + \lceil \log_2 (r+1) \rceil T_a} \right]. \quad (12)$$

Using less r, larger  $D_B$  is obtained. This condition limits  $\delta$  to be smaller than D and results in the approximation of the conventional DLMS, that is, the approximation of normal convergence. On the contrary, less  $D_B$  is achieved while r increases. The latter condition causes an approximation of the LMS at the expense of the number of multipliers and critical period compared with our work due to the larger  $\delta$  and r, respectively. In [5], the structure generating the new feedback error does not consider the finite and local connections for input, x(i), so this structure belongs to a semi-systolic structure. In Table 1, using our efficient systolic architecture suitable to a single chip realization, lowest period and better convergence characteristics can be achieved without sacrificing other features.

## IV. Applications

In this section, we simulate two examples to verify the validity of our proposed architectures. Since the M-N-Ks' structure whose convergence performance depends on the selection of  $\delta$ . It can approximate well either to the convergence of the LMS algorithm or DLMS algorithm. For the convenience of comparison, we omit M-N-Ks' convergence simulation and only simulate the LMS, efficient systolic DLMS using optimized rule, and conventional DLMS.

In the first application, we study the proposed 1-D efficient systolic architecture for adaptive equalization [9] with a linear dispersive channel that produces unknown distortion. The random sequence 1 applied to the channel

input consists of a *Bernoulli* sequence with zero mean and unit variance. Another random-sequence 2 serves as the source of additive white noise with zero mean and variance 0.001 that corrupts the channel output. These two random sequences are independent of each other. The impulse response of the channel is described as:

$$h_{i} = \begin{cases} \frac{1}{2} \left[ 1 + \cos(\frac{2\pi}{A}(i-2)) \right], & i = 1, 2, 3\\ 0, & otherwise \end{cases}$$
 (13)

where the parameter A controls the degree of amplitude distortion produced by the channel. Herein, we choose A=3.1,  $p_{\rm max}=3$  for N=12. The simulation result with ensemble-average 150 runs is shown in Fig. 4 for  $\mu=0.033$ . Next, for image restoration application, we apply  $2.56\times2.56$  noisy Lena image as shown in Fig. 5(a) with  $SNR=3.09\,{\rm dB}$  to 2-D LMS, efficient systolic architecture with  $q_H=1$ ,  $q_V=1$ , and conventional DLMS architecture for  $p_{\rm max}=3$  and N=4. Figs. 5(b), (c), and (d) show the separately SNR values corresponding to above structures at  $\mu=3\times10^{-7}$ . Figs. 4 and 5 reveal that efficient systolic DLMS architectures have better results than the conventional DLMS architectures in these two experiments.

## V. Conclusions

Two efficient N th-tap 1-D and window size  $N \times N$  2-D systolic adaptive digital filters utilizing the tree-systolic PE has been presented in this paper. Under considering maximum number of tap-connections of the feedback error signal, the practical rule to decide the optimized tree level without sacrificing the systolic characteristics is provided. At last, we verify our 1-D and 2-D efficient systolic architectures via applications of adaptive equalizer and image restoration, respectively. From the comparisons of table and simulation, we guarantee the proposed architectures furnish better results than the conventional systolic-array architectures [2-3]. Furthermore, these efficient systolic architectures amenable to VLSI implementation achieve lowest critical period



Fig. 1. The proposed tree-systolic  $PE_2$ 

among proposed structures [2-5].

## References

- G. Long, F. Ling, and J. G. Proakis, "The LMS algorithm with delayed coefficient adaptation," *IEEE Trans. Acoust., Speech, Signal Processing*, vol. ASSP-37, no. 9, pp. 1397-1405, Sept. 1989.
- [2] H. Herzberg, R. Haimi-Cohen, and Y. Be'ery, "A systolic array realization of an LMS adaptive filter and the effects of delayed adaptation," *IEEE Trans. Signal Processing*, vol. 40, no. 11, pp. 2799-2803, Nov. 1992.
- [3] M. D. Meyer and D. P. Agrawal, "A high sampling rate delayed LMS filter architecture," *IEEE Trans. Circuits Syst.*, *II*, vol. CAS-40, no. 11, pp. 727-729, Nov. 1993.
- [4] S. C. Douglas, Q. Zhu, and K. F. Smith, "A pipelined LMS adaptive FIR filter architecture without adaptation delay," *IEEE Trans. Signal Processing*, vol. 46, no. 3, pp. 775-779, Mar. 1998.
- [5] K. Matsubara, K. Nishikawa, and H. Kiya, "Pipelined LMS adaptive filter using a new look-ahead transformation," *IEEE Trans. Circuits Syst.*, II, vol. CAS-46, no. 1, pp. 51-55, Jan. 1999.
- [6] L. D. Van, S. Tenqchen, C. H. Chang and W. S. Feng, "A tree-systolic array of DLMS adaptive filter," in Proc. IEEE Int. Conf. on Accoustics, Speech and Signal Processing, Mar. 1999, vol. 3, pp. 1253-1256, Phoenix, Arizona.
- [7] M. M. Hadhoud and D. W. Thomas, "The two-dimensional adaptive LMS (TDLMS) algorithm," *IEEE Trans. Circuits Syst.*, vol. CAS-35, no. 5, pp. 485-494, May 1988.
- [8] K. Matsubara, K. Nishikawa, and H. Kiya, "2-D pipelined adaptive filters based on 2-D delayed LMS algorithm," *IEICE Trans.*, vol. E80-A, no. 6, pp. 1009-1014, June 1997.
- [9] S. Haykin, Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice-Hall, 1996, chap 9.



Fig. 2. The proposed 1-D systolic adaptive digital filter.



Fig. 3. The proposed 2-D systolic adaptive digital filter.



Fig. 4. Comparison results of the adaptive equalizer at  $\mu = 0.033$  with (a) LMS, (b) efficient systolic DLMS resulting in p = 2, D = 6, and (c) conventional DLMS.



Fig. 5. (a) noisy "Lena" image, and comparison results using (b) LMS with  $SNR = 9.10 \, dB$ , (c) efficient systolic DLMS with  $SNR = 8.23 \, dB$ , and (d) conventional DLMS with  $SNR = 8.07 \, dB$ .

Table 1: Comparison Results of the different N th-Tap Adaptive Digital Filter Architectures

|                               | <del>,</del>            |               |               |                                       |             |
|-------------------------------|-------------------------|---------------|---------------|---------------------------------------|-------------|
|                               | LMS                     | H-H-Bs'       | M-As'         | M-N-Ks' Structure                     | This work   |
|                               |                         | Structure [2] | Structure [3] | [4-5]                                 |             |
| Architecture<br>Type          | Non-systolic            | Systolic      | Systolic      | Semi-systolic                         | Systolic    |
| Convergence<br>Characteristic | Best                    | Normal        | Normal        | Depend on $\delta$                    | Depend on p |
| Value of D                    | . 0                     | N             | N .           | $\lceil N/r \rceil + 1$               | Eq. (5)     |
| Critical Period               | $3T_m + (2N+1)T_\alpha$ | $2T_m + 2T_a$ | $2T_m + 2T_a$ | $T_m + \lceil \log_2(r+1) \rceil T_a$ | $T_m + T_a$ |
| No. of the<br>Multipliers     | 2N+1                    | 2 <i>N</i> +1 | 3 <i>N</i>    | $2N+1+2\delta$                        | 2N +1       |