# Multiple Stopping Criteria and High-Precision EMD Architecture Implementation for Hilbert-Huang Transform

Tsung-Che Lu, Pei-Yu Chen, Shih-Wei Yeh, and Lan-Da Van Department of Computer Science, National Chiao Tung University Hsinchu, Taiwan, R.O.C.

Abstract—In this work, a multiple stopping criteria and highprecision empirical mode decomposition (EMD) hardware architecture implementation is proposed for Hilbert-Huang transform (HHT) in biomedical signal processing. The proposed architecture can support multiple stopping criteria including the constant criteria, the SD criteria and the ratio criteria. The 38-bit floating point precision is adopted in this work to support 10 IMF components with enough accuracy. The off-chip memory architecture is adopted to increase the processing capacity. By the pipelined cubic spline coefficient unit (PCSCU), the computation time can be reduced. The proposed EMD hardware architecture is implemented in TSMC 90 nm CMOS process with the core area of 4.47 mm<sup>2</sup> at the operating frequency of 40 MHz. The post-layout simulation result shows that our work with the constant criterion can speed up the performance 50.4 times compared to the software computation on a single core of ARM11 for 2K data size breathing signals.

#### I. INTRODUCTION

The analytic method of Hilbert-Huang transform (HHT) is firstly proposed by Huang *et al.* [1] for the non-linear and non-stationary signal processing. Recently, HHT has been further involved in the biomedical signal processing [2]-[9] for EEG analysis [2], [4], [9], EKG analysis [3], [5]-[6], breathing signal analysis [6], biological sound analysis [7] and ultrasound signal analysis [8]. Since most biomedical signals have non-linear and non-stationary property, HHT becomes favorable in biomedical signal processing and can provide different information compared to the conventional analysis methods including fast Fourier transform (FFT) or discrete wavelet transform (DWT).

The HHT analysis includes two parts: empirical mode decomposition (EMD) and Hilbert transform (HT). EMD is used to decompose the original signals into several components with individual range of frequency called intrinsic mode function (IMF) and the residue component. Although HHT is useful in many aspects, the EMD processing in HHT has high computational complexity and therefore leads to long computation time. In [3]-[9], several EMD implementations with the field programmable gate array (FPGA) approach or the application specific integrated circuit (ASIC) approach have been proposed and show better EMD acceleration performance. However, these implementations do not provide

multiple stopping criteria for different applications. Besides, the processing capability up to 10 IMF components is not provided for the state-of-the-art EMD implementations with the ASIC approach. Thus, we are motivated to propose an ASIC-based EMD implementation with multiple stopping criteria and 10 IMF components in this work [10].

The rest of the paper is organized as follows. The EMD algorithm is briefly reviewed in Section II. Section III describes the proposed EMD architecture. In Section IV, we show the simulation and comparison results. Last, the conclusion is remarked in Section V.

## I. Brief Review of EMD Algorithm

## A. Empirical Mode Decomposition

The EMD proposed by Huang *et al.* [1] aims to analyze and explore the non-linear and non-stationary signals in the HHT. Through the EMD algorithm, the raw data x(t) can be decomposed into the IMF components and the residue component. The steps of the EMD algorithm are summarized as follows.

Step 1: Let input(t) = x(t).

Step 2: Find local extrema of input(t) and use the cubic spline method [11] to generate the upper envelope U(t) and the lower envelope L(t) with the local extrema.

Step 3: Compute the mean value m(t) of U(t) and L(t).

$$m(t) = [U(t) + L(t)]/2$$
 (1)

Step 4: Subtract m(t) from  $h_{i,(k-1)}(t)$  to generate the IMF candidate  $h_{i,k}(t)$ .

$$h_{i,k}(t) = h_{i,(k-1)}(t) - m(t)$$
, where  $h_{i,1}(t) = x(t)$ . (2)

Step 5: If the stopping criterion is not satisfied, let  $input(t) = h_{i,k}(t)$  and repeat Step 2~Step 5. Otherwise, go to Step 6.

Step 6: Let IMF  $c_i(t) = h_{i,k}(t)$  and compute the residue  $r_i(t)$  from  $r_{i-1}(t)$  and  $c_i(t)$ .

$$r_i(t) = r_{i-1}(t) - c_i(t)$$
, where  $r_0(t) = x(t)$ . (3)

Step 7: Let  $input(t) = r_i(t)$  and repeat Step 2~Step 7 to find the next IMF component.

# B. Cubic Spline Theory

In this subsection, the cubic spline theory is briefly reviewed, where the concept and notations are based on [11]. In the EMD algorithm, the third order polynomial s(t) has to be

T. C. Lu, P. Y. Chen, S. W. Yeh and L. D. Van are with the Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan (e-mail: ldvan@cs.nctu.edu.tw). This work was supported in part by the Ministry of Science and Technology Grant MOST 103-2221-E-009-099-MY3 NSC 102-2219-E-009-008, and NSC 102-2220-E-009-039.

calculated through cubic spline method to obtain the upper envelope of a set of local maxima,  $y_0$ ,  $y_1$ , ...,  $y_{n-1}$ , with the corresponding time indices  $t_0$ ,  $t_1$ , ...,  $t_{n-1}$ . The time intervals between n maxima are interpolated. In the same way, the lower envelope of the local minima is obtained by a set of local minima. The cubic polynomial equation s(t) is represented as

$$s(t) = s_i(t)$$
 for  $i = 0, 1, ..., n-2$  and  $t_i \le t \le t_{i+1}$  (4)

where  $s_i(t)$  denotes the piecewise cubic polynomial equation and is represented as

$$s_i(t) = a_i + b_i(t - t_i) + c_i(t - t_i)^2 + d_i(t - t_i)^3$$
(5)

where  $a_i$ ,  $b_i$ ,  $c_i$  and  $d_i$  denote the unknown coefficients. In addition, the following equations should be satisfied to ensure the continuities of s, s' and s''.

$$\begin{cases} s_i(t_i) = y_i \\ s_i(t_{i+1}) = y_{i+1} \end{cases}$$
 for  $i = 0, 1, ..., n-2$  (6)

$$\begin{cases} s_i'(t_{i+1}) = s_{i+1}'(t_{i+1}) \\ s_i''(t_{i+1}) = s_{i+1}''(t_{i+1}) \end{cases} \text{ for } i = 0, 1, ..., n-3$$
 (7)

Beside the above equations, the endpoint condition is required to solve the unknown coefficients. In this work, the not-a-knot condition is adopted as the endpoint condition.

$$\begin{cases} s_0'''(t_1) = s_1'''(t_1) \\ s_{n-3}'''(t_{n-2}) = s_{n-2}'''(t_{n-2}) \end{cases}$$
(8)

Thus, all equations can be represented by the linear system in (9), where  $h_i = t_{i+1} - t_i$  and  $m_i = 2c_i$ .

$$\begin{bmatrix} -h_1 & h_0 + h_1 & -h_0 & \cdots & \cdots & 0 \\ h_0 & 2(h_0 + h_1) & h_1 & 0 & \cdots & 0 \\ 0 & h_1 & 2(h_1 + h_2) & h_2 & 0 & \vdots \\ \vdots & 0 & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & h_{n-3} & 2(h_{n-3} + h_{n-2}) & h_{n-2} \\ 0 & \cdots & 0 & -h_{n-2} & h_{n-3} + h_{n-2} & -h_{n-3} \end{bmatrix} \begin{bmatrix} m_0 \\ m_1 \\ \vdots \\ m_{n-3} \\ m_{n-1} \end{bmatrix} = \begin{bmatrix} 0 \\ \underbrace{y_2 - y_1}_{1} - \underbrace{y_1 - y_0}_{h_0} \\ \vdots \\ \underbrace{y_{n-2} - y_{n-3}}_{h_n - 3} - \underbrace{y_{n-3} - y_{n-4}}_{h_n - 4} \\ \underbrace{y_{n-1} - y_{n-2}}_{h_{n-2}} - \underbrace{y_{n-2} - y_{n-3}}_{h_{n-3}} \\ \end{bmatrix} (9)$$

Consequently, the unknown coefficients of the linear system can be solved by Gaussian elimination.

# C. Stopping Criteria of EMD

The EMD algorithm is based on the sifting process to decompose the signals into IMF components. Therefore, the stopping criterion is crucial since the physical meaning of the component should be retained to avoid over-decomposing in the sifting process. In general, different applications require different stopping criteria due to the signal disparity. Therefore, the following three stopping criteria are adopted in this work for flexibility. First, the constant criterion [12]-[13] sets a constant iteration number for the sifting process. Second, the SD criterion in [14] using the SD value defined in (10).

$$SD = \frac{\sum_{t=0}^{n} \left| h_{i,(k-1)}(t) - h_{i,k}(t) \right|^{2}}{\sum_{t=0}^{n} h_{i,(k-1)}^{2}(t)}$$
(10)

If the SD value is smaller than the pre-defined parameter, the iteration will be stopped. Finally, the ratio criterion [15] aims



Fig. 1. Proposed EMD architecture.

to guarantee global small amplitude of the mean value. The ratio function ratio(t) is defined in (11).

$$ratio(t) = \frac{U(t) + L(t)}{U(t) - L(t)}$$
(11)

The amounts of the ratio(t) is used to control the iteration in the EMD algorithm with  $ratio(t) < \theta_1$  for the specific percentage p of all the data set and the other  $ratio(t) < \theta_2$  for the remaining 1-p of the dataset.

# III. EMD HARDWARE ARCHITECTURE

In this section, the proposed EMD hardware architecture as shown in Fig. 1 with the multiple stopping criteria judgment unit (MSCJU), 38-bit floating-point format, off-chip memory [6] and pipelined cubic spline coefficient unit (PCSCU) is presented. The units of the proposed EMD architecture includes the upper envelope unit (UEU), the lower envelope unit (LEU), the mean and residue unit (MRU), MSCJU, the off-chip memory data controller (OCMDC), the data synchronization controller (DSC), and the FIFOs.

The extrema FIFOs store the extrema indices and values. The coefficient FIFOs are used to keep the cubic spline coefficients for the cubic spline interpolation. If the PCSCU is idle and the extrema FIFO is not empty, the PCSCU will be driven to generate the available cubic spline coefficients for the interpolation procedure. Due to the asynchronous processing of UEU and LEU, the data inputs of UEU and LEU are controlled by DSC to ensure that the mean value is computed properly and all internal FIFOs has no overflow and underflow. The outputs of UEU and LEU are saved in the envelope data FIFO. Then, the mean value will be calculated by MRU and saved at the off-chip memory until the envelope data FIFO is empty. After the computation of mean value, the IMF candidate  $h_{i,k}(t)$  will be calculated by MRU using the mean value and the previous IMF candidate  $h_{i,(k-1)}(t)$  stored in the off-chip memory.

The overlapped window concept [2], [4] is adopted to decrease the window length such that the computation effort in hardware can be reduced. The proposed EMD architecture supports multiple stopping criteria including the constant criterion, the SD criterion and the ratio criterion. During the iteration of the EMD computation, the MSCJU as shown in Fig. 2 will determine whether the iteration should be terminated according to stopping criterion selected by the user. For the constant criteria, if *constant\_result* reaches ten, the criteria matching unit will stop the iteration. For the SD criteria, if



Fig. 2. Block diagram of the multiple stopping criteria judgment unit (MSCJU).



Fig. 3. Block diagram of the pipelined cubic spline coefficient unit (PCSCU).

SD\_result satisfies the user defined parameter, the criteria matching unit will stop the iteration. For the ratio criteria, if ratio\_result1 is satisfied for the percentage p of all the data set and the other ratio\_result2 for the remaining 1-p of the dataset, the criteria matching unit will stop the iteration. The divider in Fig. 2 can be shared by the SD criterion and the ratio criterion. In [7], the 40-bit free-form fixed-point binary word is used to preserve the precision. In this work, in order to support 10 IMF components for the EMD computation, the self-defined 38-bit floating-point format with the sign bit, the 7-bit exponent, and the 30-bit fraction is utilized in the EMD architecture to ensure the computation accuracy.

In this work, the zero-bus turnaround (ZBT) SRAMs [16] with single port controller is used for the off-chip memory. By adopting the off-chip memory architecture for EMD, the restriction of on-chip memory size can be avoided and the processing capacity can be improved. Note that the off-chip memory architecture is usually supported in the FPGA platform. That means the proposed EMD architecture can be compatible with the FPGA approach.

To improve the speed, the upper and lower envelopes are computed in parallel by adopting dual envelope units [4], [8]-[9] to further speed up the EMD computation. In the envelope units, the PCSCU in Fig. 3 is utilized to accelerate the calculation of the cubic spline coefficients. PCSCU contains four stages: input processing stage (IPS), forward operation stage (FOS), backward operation stage (BOS), and coefficient output stage (COS). Each stage requires 45 cycles to finish the corresponding computation for (9).

# IV. COMPARISON RESULTS AND CHIP IMPLEMENTATION

## A. Functional Simulation Results

In this subsection, the register-transfer level (RTL) and software simulations are performed using the real 60-second breathing signal with the sampling rate of 1000 Hz. The input data size is 1000×60=60000=60K. Through the simulation, the function and precision of the proposed hardware EMD can be



Fig. 4. NRMSD results of the IMF components with different stopping criteria. Fig. 6. Proposed EMD implementation layout without I/O pad.



Fig. 5. Comparison results of the IMF components with the constant criterion.

validated by comparing the RTL and software simulation results in terms of the normalized root mean squared deviation (NRMSD) defined in [2]. We refer to the MATLAB code [13] to write C code for the software simulation. In the simulation, the parameters of the stopping criteria are listed as follows. The iteration number is set to 10 for the constant criterion. The threshold is set to 0.0003 for the SD criterion. For the ratio criterion,  $\theta_1 \approx 0.05$ ,  $\theta_2 \approx 10\theta_1$ , and  $p \approx 95\%$  are set. Note that the software simulation is executed on single core with double precision of the ARM11 MPcore platform, where the default CPU clock frequency is 210 MHz. Fig. 4 shows the NRMSD results with all three stopping criteria. The maximum NRMSD values with the constant criterion, the SD criterion, and the ratio criterion are 2.83%, 5.42%, and 0.6%, respectively. Fig. 5 shows the comparison results of 10 IMF components and the residue component for the breathing signal with constant criterion. According to the simulation, the proposed EMD design can achieve acceptable NRMSD performance among 10 IMF components with different stopping criteria. Therefore, the function and precision of the hardware EMD can be validated.

# B. Chip Implementation

The cell-based design flow in TSMC 90 nm CMOS process is adopted for the chip implementation, where the power supply voltage is 1.0 V. Synopsys Design Compiler is employed to synthesize the RTL design with the constraint of 25 ns. Cadence SOC Encounter is used for the place and route procedure. The layout of the proposed EMD architecture is shown in Fig. 6, where the core area is 4.47 mm<sup>2</sup>.

TABLE I: COMPARISON RESULTS AMONG THE STATE-OF-THE-ART EMD IMPLEMENTATIONS

| Reference                                | BMEI'10 [3]             | BioCAS'11 [4]                                                             | TIE'11 [5]           | ASP-DAC'12 [6]          | BioCAS'12 [7]    | TIM'12 [8]           | TCASII'13 [9]                                         | This Work                    |
|------------------------------------------|-------------------------|---------------------------------------------------------------------------|----------------------|-------------------------|------------------|----------------------|-------------------------------------------------------|------------------------------|
| Application                              | EKG                     | EEG                                                                       | EKG                  | EKG/breathing<br>signal | Biological sound | Ultrasonic<br>signal | EEG                                                   | Breathing signal             |
| Sampling Rate                            | N/A                     | 256 Hz                                                                    | 360 Hz               | 250 Hz/100 Hz           | 11.025 KHz       | 12.5 MHz             | 256 Hz                                                | 1000 Hz                      |
| Operation Frequency                      | 100 MHz                 | 522.24 KHz                                                                | 50 MHz               | N/A                     | 62.5 MHz         | 12.5 MHz             | 50 MHz                                                | 40 MHz                       |
| Latency/Computation<br>Time              | 63.833 ms<br>(512 data) | 700 ms for IMF1,<br>2000 ms for<br>IMF1~IMF2,<br>4700 ms for<br>IMF1~IMF3 | 2780 ms<br>(1K data) | N/A                     | N/A              | 0.1 ms<br>(1K data)  | 0.16 ms for<br>IMF1~IMF3,<br>0.27 ms for<br>IMF1~IMF5 | 11.494 ms<br>(2K data)       |
| Number of IMF components                 | N/A                     | 5                                                                         | N/A                  | 4 (at least)            | 8                | 2                    | 5                                                     | 10                           |
| Support of Multiple<br>Stopping Criteria | No                      | No                                                                        | No                   | No                      | No               | No                   | No                                                    | Yes (3 Stopping<br>Criteria) |
| Implementation<br>Approach               | FPGA                    | ASIC                                                                      | DSP+FPGA             | ARM+FPGA                | FPGA             | FPGA                 | ASIC                                                  | ASIC                         |
| Process Technology                       | N/A                     | UMC 90 nm<br>CMOS                                                         | N/A                  | N/A                     | N/A              | N/A                  | TSMC 90 nm<br>CMOS                                    | TSMC 90 nm<br>CMOS           |
| Core Area                                | N/A                     | 1.02 mm <sup>2</sup>                                                      | N/A                  | N/A                     | N/A              | N/A                  | $0.47 \text{ mm}^2$                                   | 4.47 mm <sup>2</sup>         |
| Power Dissipation                        | N/A                     | 57.3 μW                                                                   | N/A                  | N/A                     | N/A              | N/A                  | 13.6 mW                                               | 61.6 mW                      |

## C. Comparison Results

Since the post-layout simulation is too time consuming for 60K data size, the chip implementation described in the previous subsection is validated by the post-layout simulation for 2K data size breathing signal with the constant criterion. The power consumption is 61.6 mW. According to the postlayout simulation result, the computation time of the proposed EMD implementation is 11.494 ms for IMF1~IMF10. The proposed EMD hardware implementation can speed up the EMD computation 50.4 times with 2K data size compared to the software approach on single core of the ARM11 MPcore platform. Table I summarizes the comparison results among the state-of-the-art EMD implementations [3]-[9] in terms of the algorithm, application, sampling rate, operation frequency, computation time, number of IMF components, support of multiple stopping criteria, implementation approach, process technology, core area and power dissipation. In summary, compared to this work, none of the state-of-the-art EMD architecture implementations provide multiple stopping criteria as well as the processing capability for 10 IMF components with the ASIC approach. Although this work has the largest core area and power dissipation among the EMD implementations due to high precision, this EMD architecture can provide multiple stopping criteria and 10 IMF components for more flexibility and capability, respectively, compared to other EMD implementations in Table I. On the other hand, the off-chip memory architecture enables this work to process large data size on the hardware.

# V. CONCLUSION

In this work, the proposed EMD architecture with multiple stopping criteria and 38-bit floating-point precision can support the flexibility and 10 IMF components, respectively. In addition, the off-chip memory architecture utilized in this work enables the large data processing capability. By adopting PCSCU of the dual envelope units, the computation time can be reduced. From the RTL simulation results for 60K breathing signals with multiple stopping criteria and the post-layout simulation result for 2K data size breathing signals with the constant criterion, the proposed EMD architecture can achieve what we expect.

#### REFERENCES

- [1] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. C. Yen, C. C. Tung, and H. H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis", *Proceedings of The Royal Society of London Series A*, vol. 454, pp. 903-995, Mar. 1998.
- [2] N. F. Chang, C. Y. Chiang, T. C. Chen and L. G. Chen, "Cubic spline interpolation with overlapped window and data reuse for on-line Hilbert Huang transform biomedical microprocessor," in *Proc. IEEE EMBC*, 2011, pp. 7091–7094.
- [3] L. Wang, M. I. Vai, P. U. Mak, and C. I. Ieong, "Hardware-accelerated implementation of EMD," in *Proc. IEEE BMEI*, 2010, pp. 912–915.
- [4] N. F. Chang, T. C. Chen, C. Y. Chiang and L. G. Chen, "On-line empirical mode decomposition biomedical microprocessor for Hilbert Huang transform," in *Proc. IEEE BioCAS*, 2011, pp. 420–423.
- [5] M. H. Lee, K. K. Shyu, P. L. Lee, C. M. Huang, and Y. J. Chiu, "Hardware implementation of EMD using DSP and FPGA for online signal processing," *IEEE Trans. Ind. Electron.*, vol. 58, no. 6, pp. 2473-2481, Jun. 2011.
- [6] S. C. Chen, C. C. Chen, W. C. Guo, T. J. Lin, and C. W. Yeh, "Complexity-effective Hilbert-Huang transform (HHT) IP for embedded real-time applications," in *Proc. ASP-DAC*, 2012, pp. 473-474.
- [7] W. C. Fang, C. C. Chou, T. H. Hung, K. C. Lin, A. H. T. Li, Y. C. Chang, B. K. Hwang, and Y. W. Shau, "An efficient and accurate empirical mode decomposition of the technical design and methods for biological sound," in *Proc. IEEE BioCAS*, 2012, pp. 320–323.
- [8] Y. Y. Hong and Y. Q. Bao, "FPGA implementation for real-time empirical mode decomposition," *IEEE Trans. Instrum. Meas.*, vol. 61, no.12, pp. 3175–3184, Dec. 2012.
- [9] W. C. Shen, H. I. Jen, and A. Y. Wu, "New Ping-Pong scheduling for low-latency EMD engine design in Hilbert–Huang Transform," *IEEE IEEE Trans. Circuits Syst. II*, vol. 60, no. 8, pp. 532-536, Aug. 2013.
- [10] S. W. Yeh, "Design and implementation of multiple stopping criteria EMD for Hilbert-Huang transform," M.S. thesis, Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, 2012.
- [11] R. L. Burden and J. D. Faires, *Numerical Analysis*, 8th ed. Belmont, CA: Books/Cole, 2005.
- [12] G. Wang, X. Y. Chen, F. L. Qiao, Z. H. Wu, and N. E. Huang, "On intrinsic mode function," *Advances in Adaptive Data Analysis*, vol. 2, no. 3, pp. 277-293, 2010.
- [13] The EMD program for MATLAB [Online]. Available: http://rcada.ncu.edu.tw/research1\_clip\_program.htm
- [14] N. E. Huang and Samuel Shen, The Hilbert-Huang Transform and Its Applications. Singapore: World Scientific, 2005.
- [15] G. Rilling, P. Flandrin, and P. Gonçalvès, "On empirical mode decomposition and its algorithms," in *Proc. IEEE-EURASIP Workshop* on *Nonlinear Signal and Image Processing*, 2003.
- [16] "NoBL<sup>TM</sup>: The Fast SRAM Architecture-AN1090," Application Note, Cypress Semiconductor, San Jose, CA 95134-1709, pp. 1-10, Mar. 2012