## A New VLSI Architecture without Global Broadcast for 2-D Digital Filters

Lan-Da Van, Chih-Chun Tang, Shing Tengchen\*, and Wu-Shiung Feng\*\*

Lab 353, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

\*Chunghwa Telecom Telecommunication Labs., 12, Lane 551, Sec. 5,

Min-Tsu Rd., Yang-Mei Zien, Tao-Yuan County, Taiwan 326, R.O.C.

\*\*Department of Electronics Engineering, Chang Gung University, Taoyuan, Taiwan, R.O.C.

E-mail: d86029@cad.ee.ntu.edu.tw and stc@ms.chttl.com.tw

#### **ABSTRACT**

In this paper, we propose the new two-dimensional (2-D) systolic-array structures of IIR/FIR digital filters without global broadcast by the different derivation and another systolic transformation. For more practical considerations, we further provide a detailed block diagram of a 2-D FIR filter using recently proposed multiplier to reduce the roundoff quantization error in the logic-gate level. These proposed systolic structures amenable to VLSI implementation permit the 2-D input sequence to be scanned in row-wise mode and locally broadcast one value each clock per delay element.

#### 1. Introduction

2-D IIR/FIR digital filters have been widely applied to image/video processing applications such as image enhancement and noise reduction. The architectures of 2-D IIR/FIR digital filters are mainly derived from either the discrete-time difference equation or transfer function [1-3]. However, we can obviously find that the input and output signals globally broadcast in these structures [1-3]. This is our motivation to improve the structures to be more suitable to VLSI design. In this paper, we propose the improved 2-D systolic structures by the different derivation of the transfer function and first by using the different systolic transformation. The paper is organized as follows. The new systolic architectures for 2-D digital filters are presented in Section 2. In Section 3, comparison results are tabulated in terms of whether the structures eliminate the global broadcast, the critical cycle period, the number of multipliers and delay elements. In the logic-gate level, the block diagram of an FIR digital filter consisting of a lower-error fixed-width multiplier [4] is depicted in Section 4. Thus, the lower quantization error expression due to the finite precision multiplication in digital filters can be addressed in Section 5. At last, the concise statements conclude this paper.

 This work was supported by National Science Council under NSC: 88-2216-E-002-018.

#### 2. Systolic Architectures without Global Broadcast

A 2-D IIR filter has the following transfer function

$$H(z_1, z_2) = \frac{\sum_{i=0}^{N_1} \sum_{j=0}^{N_2} a_{ij} z_1^{-i} z_2^{-j}}{1 - \sum_{i=0}^{N_1} \sum_{j=0}^{N_2} b_{ij} z_1^{-i} z_2^{-j}},$$
(1)

where  $b_{00} = 0$ ,  $a_{ij}$  and  $b_{ij}$  are the coefficients of the IIR filter, and  $N_1 \times N_2$  is the order of the IIR digital filter. We deduce the structure under the assumption of  $N_1 = N_2 = N = 2$ , and then we pay attention to derive 2-D IIR filter structure. For the sake of reducing the operation time, we first apply another systolic transformation as shown in Fig. 1(a), which is different from Fig. 1(b), to replace the cell as shown in Fig. 1(c). Using this systolic transformation as well as tree structure for addition, we can obtain less critical-cycle digital filter in feedback path than [3]. Next, it is commented that there exists the problem of input and output global broadcast of the above temporary filter structure using this systolic transformation and of the previous proposals. Since the global broadcast results in critical path in [1] and another critical path in [3] and this temporary filter structure, we take the different reordering of delay elements and summations of the transfer function to solve the existing defect. Given the input  $X \equiv X(z_1, z_2)$  and output  $Y \equiv Y(z_1, z_2)$  in z transform domain, we can rewrite Eq. (1) as follows:

$$Y = \sum_{i=0}^{N} \sum_{j=0}^{N} a_{ij} X z_1^{-i} z_2^{-j} + \sum_{i=0}^{N} \sum_{j=0}^{N} b_{ij} Y z_1^{-i} z_2^{-j}$$

$$= \sum_{j=0}^{N} a_{0j} z_2^{-j} X + \sum_{j=0}^{N} b_{0j} z_2^{-j} Y$$

$$+ \sum_{i=1}^{N} z_1^{-i} z_2^{i} \left[ (\sum_{j=0}^{N} a_{ij} z_2^{-j}) (z_2^{-i} X) + (\sum_{j=0}^{N} b_{ij} z_2^{-j}) (z_2^{-i} Y) \right]. \tag{2}$$

Note that  $b_{00} = 0$  and, furthermore, we define some terms to simplify the representation of Eq. (2) as follows:

$$F(i) \equiv \sum_{i=0}^{N} a_{ij} z_2^{-j}$$
, for  $i = 0, 1, ..., N$ , (3)

$$G(i) \equiv \sum_{i=0}^{N} b_{ij} z_2^{-j}$$
, for  $i = 0, 1, ..., N$ . (4)

Substituting Eqs. (3) and (4) into Eq. (2), we can rewrite Eq. (2) as follows:

$$Y = [F(0)X + G(0)Y] + \sum_{i=1}^{N} z_1^{-i} z_2^{i} [F(i)(z_2^{-i}X) + G(i)(z_2^{-i}Y)]$$

$$= [F(0)X + G(0)Y] + z_1^{-1}z_2^1([F(1)\hat{X}_1 + G(1)\hat{Y}_1]$$

$$+z_1^{-1}z_2^1([F(2)\hat{X}_2+G(2)\hat{Y}_2]+\cdots$$

$$+z_1^{-1}z_2^1([F(N)\hat{X}_N + G(N)\hat{Y}_N])\cdots)),$$
 (5)

where  $\hat{X}_k \equiv z_2^{-1} \hat{X}_{k-1}$ ,  $\hat{Y}_k \equiv z_2^{-1} \hat{Y}_{k-1}$  for k = 2, 3, ..., N, and  $\hat{X}_1 \equiv z_2^{-1} X$ ,  $\hat{Y}_1 \equiv z_2^{-1} Y$ . In Eq. (5), the summation terms in the square brackets characterized in different number of delay elements can be directly realized by the new systolic transformation appending corresponding delay elements in the input paths. Since the image is scanned in row-wise mode of this paper, the delay  $z_2^{-1} = z^{-1}$  and the delay  $z_1^{-1} = z^{-M}$  can be realized by a unit delay element and a shift-register (SR) with the size of M, where M is the width of image, respectively. Thus, the mapping of Eq. (5) onto a systolic architecture is depicted in Fig. 2. More importantly, this systolic structure has no global broadcast signals anywhere even though we implement the higher order 2-D IIR digital filters. In addition, one merely sets  $b_{ij}$  to zero such that a new systolic-array structure of an FIR digital filter can be obtained.

#### 3. Comparison Results

Comparison results of IIR digital filters are tabulated in Table I. It is in terms of the critical cycle period, latency, the number of multipliers, delay elements, and, importantly, whether the input and output signals globally broadcast in these structures. In Table I,  $\lceil \bullet \rceil$  denotes a minimum integer that greater than or equal to  $\bullet$ , and  $T_m$  and  $T_a$  represent the operation time required for the multiplier and adder, respectively. So as to minimize periods in the critical paths as shown in [1-3] and Fig. 2, we properly apply tree method to those structures and then separately evaluate the critical periods. In our work, the resulting period is less than that in [1] for  $N \ge 2$  and [3]. For practical case, since M is much larger than N, the number of delay elements is dominated by the product-term MN. Hence, the number of delay elements in this

work is almost equal to that of [2] as well as [3], and less than that of [1]. Analogously, we compare the performance of these FIR filters as listed in Table II after setting  $b_{ij}$  to zero. From Tables I and II, we guarantee that the proposed structures have no global broadcast and maintain the lowest latency under the accepted critical cycle period.

# 4. An FIR Filter Combined with the Lower-Error Fixed-Width Multiplier

We have successfully proposed the new IIR and FIR digital filters in the word level. For more practical considerations, we should take them down to the logicgate level and apply the recently developed lower-error fixed-width multiplier [4] as shown in Fig. 3. In Fig. 3,  $a_{ij}^{t}$  and  $x_{ij}^{t}$  are the t-th bits of the coefficient,  $a_{ij}$ , and input signal,  $x_{ij}$ , respectively.  $P_t$  denotes the t-th bit of the product P of the multiplier. For simplicity, we only present 4-bit operation of a new 2-D FIR digital filter while receiving a  $512 \times 512$  image for N = 2 and M = 512. That is, we require the 4×4 fixed-width multiplier, 4-bit adder, 4-bit D-flip-flop and shift-register with the size of (M-1) to implement a 2-D FIR filter as shown in Fig. 4. For the sake of the prevention of overflow, we normalize the value of input sequence and confirm that the sum of the absolute values of all coefficients is less than one. On the other hand, the output bits after the last stage adder must shift one-bit left in the scaling box to obtain correct binary point position as the next subsystem input bits. Thus, we only need 4 bits for the output bits. In addition, we need a counter, which counts the number from 0 to 511, to produce the signal RST2. Then, RST1 is obtained through the OR gate whose inputs are RST2 and RST. Some specified D flip-flops in Fig. 4 are connected to the signal RST1 and other D flip-flops are connected to the signal RST for the purpose of synchronization. This logic-gate-level structure is more suitable to the implementation via VLSI technology or well-developed ICs such as TTL ICs or CMOS ICs.

#### 5. Error Analysis in New Digital Filters

In this section, we additionally reveal the brief results for the error analysis due to finite precision arithmetic in 2-D IIR digital filters. Here, we adopt the same notations as in [5], and we directly extend the representation in [3, 5]. Since we normalize the value of input sequence and confirm that the sum of the absolute values of all coefficients is less than one, the number  $(N+1)\times(N+1)$  storage errors can be easily eliminated and only N storage errors are residual due to addition of each 1-D filter output. However, this concept increases coefficients quantization; that is, it will be a tradeoff between coefficients quantization error and storage error. Thus, the total combined errors are written as

$$f(n,m) = \sum_{i=0}^{N} \sum_{j=0}^{N} (p_{ij} + r_{ij}) + \sum_{i=1}^{N} s_{i}$$

$$+ \sum_{i=0}^{N} \sum_{j=0}^{N} (\mathbf{a}_{ij} x(n-i, m-j) + \mathbf{b}_{ij} y(n-i, m-j))$$

$$+ \sum_{i=0}^{N} \sum_{j=0}^{N} (a_{ij} e_{X} (n-i, m-j) + b_{ij} e_{Y} (n-i, m-j)) . (6)$$

In Eq. (6), alternately, the roundoff errors due to finite precision multiplication can be significantly reduced while applying a fixed-width multiplier with lower average error and lower variance compared to other multipliers [4]. Other terms will be identical to those in other proposed architectures while restricting the sum of absolute values of all coefficients is less one.

### 6. Conclusions

The new systolic-array architectures for the implementation of 2-D IIR and FIR filters have been accomplished by reordering the delay elements as well as summations and by applying another systolic transformation. The realization yields locally broadcast and lowest latency without sacrificing the number of

multipliers and delay elements under the accepted critical cycle period. In addition, we provide a detailed logic-gate level block diagram of a 2-D FIR filter, which is suitable to image/video processing applications due to the lower quantization effect in finite precision multiplication.

#### References

- [1] M. A. Sid-Ahmed, "A systolic realization for 2-D digital filters," *IEEE Trans. Acoust., Speech, Signal Processing*, vol. ASSP-37, pp. 560-565, Apr. 1989.
- [2] S. Sunder, F. El-Guibaly, and A. Antoniou, "Systolic implementations of two-dimensional recursive digital filters," in *Proc. IEEE Int. Symp. Circuits Syst.*, May 1990, pp. 1034-1037.
- [3] N. R. Shanbhag, "An improved systolic architecture for 2-D digital filters," *IEEE Trans. Signal Processing*, vol. 39, no. 5, pp. 1195-1202, May 1991.
- [4] L. D. Van, S. S. Wang, S. Tenqchen, W. S. Fegn, and B. S. Jeng, "Design of a lower error fixed-width multiplier for speech processing application," in *Proc. IEEE Int. Symp. Circuits Syst.*, May 1999, vol. 3, pp. 130-133, Orlando, Florida.
- [5] D. Raghuramireddy and R. Unbehauen, "A systolic structure for complex digital filters," in *Proc. IEEE Int. Symp. Circuits Syst.*, May 1990, pp. 1252-1255.

|                          | •                                           | 0                  |                          |                              |
|--------------------------|---------------------------------------------|--------------------|--------------------------|------------------------------|
| Parameters               | Sid-Ahmed [1]                               | SCH3 of S-G-A [2]  | Shanbhag [3]             | This Work                    |
| Critical Cycle Period    | $T_m + (2 + \lceil \log_2(N+1) \rceil) T_a$ | $T_m + 2T_a$       | $2T_m + 2T_a$            | $T_m + 3T_a$                 |
|                          |                                             |                    |                          |                              |
| Global Broadcast         | Input and Output                            | Input and Output   | Input and Output         | No                           |
| Latency                  | 1                                           | 1                  | 0                        | 0                            |
| No. of Multipliers       | $2(N+1)^2-1$                                | $2(N+1)^2-1$       | $2(N+1)^2-1$             | $2(N+1)^2-1$                 |
| No. of<br>Delay Elements | $(N+1)^2 + 2MN$                             | $(N+1)^2 + (M+1)N$ | $\frac{3N(N+1)}{2} + MN$ | $\frac{3N(N+1)}{2} + (M+1)N$ |

Table I Comparison Results among the IIR Architectures

Table II Comparison Results among the FIR Architectures

| Parameters            | Sid-Ahmed [1]                     | SCH3 of S-G-A [2] | Shanbhag [3] | This Work    |
|-----------------------|-----------------------------------|-------------------|--------------|--------------|
| Critical Cycle Period | $\max\{(T_m + T_a),$              | $T_m + T_a$       | $T_m + 2T_a$ | $T_m + 2T_a$ |
|                       | $(\lceil \log_2(N+1) \rceil)T_a $ |                   |              |              |
| Global Broadcast      | Input                             | Input             | Input        | No           |
| Latency               | 1                                 | 1                 | 0            | 0            |
| No. of Multipliers    | $(N+1)^2$                         | $(N+1)^2$         | $(N+1)^2$    | $(N+1)^2$    |
| No. of                | $(N+1)^2 + MN$                    | $(N+1)^2 + MN$    | N(N+1) + MN  | N(N+1) + MN  |
| Delay Elements        |                                   | , ,               |              |              |



Fig. 1. Systolic transformation, (a) applied by this paper, (b) applied by [3], and (c) original second-order relation.



Fig. 3. (a) A lower-error fixed-width multiplier [4] for  $4 \times 4$  multiplication and (b) a symbol view



Fig. 2. A new proposed IIR digital filter without the global broadcast for N=2.

Fig. 4. A logic-gate-level FIR digital filter consisting of the fixed-width multiplier for N = 2.