ISSN (Print): 0974-6846 ISSN (Online): 0974-5645

# Modified Architecture for Distributed Arithmetic with Optimized Delay using Parallel Processing

S. Subathradevi<sup>1\*</sup> and C. Vennila<sup>2</sup>

<sup>1</sup>A.P. Department of ECE, Anna University, BIT Campus Tiruchirappalli, Tamil Nadu, India; saminathan.sruthi@gmail.com <sup>2</sup>Department of ECE, Saranathan College of Engineering, Tiruchirappalli, Tamil Nadu, India; vennila.nitt@gmail.com

#### **Abstract**

In VLSI system design, speed is a vital parameter, and it is determined by the delay of the design. This delay is made up of the logic (design) delay and the routing, i.e. path, delay. In earlier technologies the logic delay dominated; today, because of miniaturization, the routing delay dominates. With scaling, it is therefore essential to concentrate on the routing delay in order to optimize the overall delay, and in turn the speed, in trade-off with area. In the proposed work, different Distributed Arithmetic architectures were reviewed and constructed from different basic modules in order to achieve an optimized delay, and they were realized using Verilog HDL.

Keywords: Architecture, Distributed Arithmetic, FIR Filter, Parallel Processing, Routing Delay, VLSI Design

# 1. Introduction

Arithmetic units are essential to any successful system design. Adders play a particularly important role, since subtraction and multiplication are both computed through addition. Every design needs arithmetic circuits, and no other design can be built without them. Minimizing the delay of arithmetic circuits such as distributed arithmetic therefore has a large impact on the delay of the whole design: if the delay of the distributed arithmetic architecture is reduced, the speed of the system improves. Distributed Arithmetic computation is based mainly on the Look-Up-Table (LUT).

In a Look-Up-Table, pre-computed values are stored and then used for computation. Distributed Arithmetic using look-up-tables is a widely used computation method in FPGA based implementations.

Therefore, various distributed arithmetic architectures were surveyed and realized, and their delays were compared<sup>9</sup>.

The paper is organized as follows. Section 1 gives the introduction. Section 2 describes Distributed Arithmetic. Section 3 presents the literature survey and the existing architectures. Section 4 deals with the proposed architecture. Section 5 presents the results and their comparison, and Section 6 concludes the work.

# 2. Distributed Arithmetic Architecture

Distributed arithmetic is a multiply-accumulate technique that operates on a bit-level rearrangement of the multiplication. It is an important method for computing multiply-accumulate in hardware with parallel computation, and it is easily implemented in FPGA based blocks or system designs. It can be extended to other sum functions such as complex multipliers<sup>1</sup>.

<sup>\*</sup>Author for correspondence

DA is an important parallel-computation method for calculating sums of products, i.e. for performing Multiplication and Accumulation (MAC). MAC is very common in Digital Signal Processing (DSP) algorithms<sup>1</sup>.

The MAC operation is given by the expression:

$$y = A_1 \times x_1 + A_2 \times x_2 + ... + A_K \times x_K$$

In Distributed Arithmetic, the intermediate products of this expression are computed from a bit-level rearrangement of the multiply-accumulate<sup>1</sup>. In compact form:

$$y = \sum_{k=1}^{K} A_k x_k$$

Here the $A_k$ are fixed coefficients and the $x_k$ are the input data words. If each $x_k$ is a 2's-complement binary number scaled such that $|x_k| < 1$, then:

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$

where the $b_{kn}$ are the bits (0 or 1), $b_{k0}$ is the sign bit, and $b_{k,N-1}$ is the Least Significant Bit (LSB). To express $y$ in terms of the bits:
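The bit weighting above can be checked with a minimal Python sketch. The function name `twos_complement_fraction` and the example bit patterns are illustrative assumptions, not taken from the paper.

```python
def twos_complement_fraction(bits):
    """Value of a fractional 2's-complement word [b0, b1, ..., b_{N-1}].

    b0 is the sign bit (weight -1); bit n (n >= 1) has weight 2**-n.
    """
    sign_bit, *fraction_bits = bits
    value = -float(sign_bit)
    for n, bit in enumerate(fraction_bits, start=1):
        value += bit * 2.0 ** -n
    return value


print(twos_complement_fraction([0, 1, 1, 0]))  # 0.75
print(twos_complement_fraction([1, 1, 0, 0]))  # -0.5
```

With the sign bit set, the word represents a negative value, which is why the sign-bit term is subtracted in the expressions that follow.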

$$y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right]$$

Mechanized directly, this equation is a "lumped" arithmetic computation. Interchanging the order of the summations instead results in:

$$y = -\sum_{k=1}^{K} (b_{k0} \bullet A_k) + \sum_{k=1}^{K} \sum_{n=1}^{N-1} (A_k \bullet b_{kn}) 2^{-n}$$
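As a small worked check of the interchange, assume (for illustration only; these values are not from the paper) $K = 2$, $N = 3$, $A_1 = 3$, $A_2 = 5$, $x_1 = 0.75$ with bits $(b_{10}, b_{11}, b_{12}) = (0, 1, 1)$, and $x_2 = -0.5$ with bits $(b_{20}, b_{21}, b_{22}) = (1, 1, 0)$. The lumped form gives:

$$y = 3 \times 0.75 + 5 \times (-0.5) = -0.25$$

The interchanged form gives the same value, one bit position at a time:

$$y = -(0 \bullet 3 + 1 \bullet 5) + (3 \bullet 1 + 5 \bullet 1)\, 2^{-1} + (3 \bullet 1 + 5 \bullet 0)\, 2^{-2} = -5 + 4 + 0.75 = -0.25$$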

Since each $b_{kn}$ takes only the values 0 and 1, each sum $\sum_{k=1}^{K} A_k b_{kn}$ can have only $2^K$ possible values. Instead of computing these values directly, they may be precomputed and stored in a ROM, so that the multiplications are replaced by look-ups<sup>1</sup>.
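The precompute-and-look-up idea can be sketched in Python as follows. This is a minimal software model under stated assumptions, not the authors' hardware: the names `build_da_lut` and `da_mac`, the coefficient values and the bit patterns are all hypothetical.

```python
def build_da_lut(coeffs):
    """Precompute sum(A_k * b_k) for every K-bit pattern (2**K entries)."""
    K = len(coeffs)
    return [
        sum(a for k, a in enumerate(coeffs) if (address >> k) & 1)
        for address in range(2 ** K)
    ]


def da_mac(coeffs, words):
    """Evaluate y = sum(A_k * x_k) using only look-ups, scaling and adds.

    words[k] is the bit list [b_k0, ..., b_k,N-1] of x_k (b_k0 = sign bit).
    """
    lut = build_da_lut(coeffs)
    N = len(words[0])
    y = 0.0
    for n in range(N):
        # Bit k of the ROM address is b_kn, the n-th bit of input word k.
        address = sum(word[n] << k for k, word in enumerate(words))
        term = lut[address] * 2.0 ** -n
        y += -term if n == 0 else term  # the sign-bit term is subtracted
    return y


# Assumed example: A = (3, 5), x_1 = 0.75 -> bits 0,1,1; x_2 = -0.5 -> bits 1,1,0
print(da_mac([3, 5], [[0, 1, 1], [1, 1, 0]]))  # -0.25
```

The loop over `n` corresponds to the outer summation over bit positions; no multiplication by a coefficient occurs at run time, only a table read and a power-of-two scaling.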

# 3. Literature Survey and Existing Architecture

Paper<sup>1</sup> gives techniques for improving the area efficiency of FIR filters implemented using Distributed Arithmetic (DA). These techniques exploit the flexibility in partitioning the filter coefficients for a two Look-Up-Table (LUT) based DA implementation. The first technique targets a ROM based implementation of the LUTs and aims at minimizing the number of columns in the ROMs<sup>1</sup>.

The area complexity of a Finite Impulse Response (FIR) filter is caused mainly by its multipliers. Among the multiplier-less FIR filter techniques, Distributed Arithmetic is the most preferred area-efficient one. The partial products of the filter coefficients are pre-computed and stored in a LUT, and the filtering is done by shift-and-accumulate operations on these partial products. Figure 1 shows this ROM-LUT with pre-computed values<sup>1</sup>.
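A shift-and-accumulate FIR filter of this kind can be modelled in a few lines of Python. This is a behavioural sketch only: it assumes signed integer samples of `width` bits, and the function name `da_fir` and the four coefficient values are hypothetical. The result is compared against a direct multiply-accumulate filter.

```python
def da_fir(coeffs, samples, width=8):
    """K-tap FIR filter computed by distributed arithmetic on signed integers."""
    K = len(coeffs)
    # ROM-LUT: pre-computed sums of coefficients for all 2**K address patterns.
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << K)]
    mask = (1 << width) - 1
    delay_line = [0] * K                 # x[n], x[n-1], ..., x[n-K+1]
    outputs = []
    for sample in samples:
        delay_line = [sample] + delay_line[:-1]
        acc = 0
        for j in range(width):           # one bit plane per step, LSB first
            addr = sum((((x & mask) >> j) & 1) << k
                       for k, x in enumerate(delay_line))
            partial = lut[addr] << j     # shift ...
            # ... and accumulate; the MSB of a 2's-complement word is negative.
            acc += -partial if j == width - 1 else partial
        outputs.append(acc)
    return outputs


coeffs = [1, 2, 3, 4]                    # assumed four-tap coefficients
samples = [5, -3, 7, 0, -128, 127]
direct = [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
          for n in range(len(samples))]
print(da_fir(coeffs, samples) == direct)  # True
```

For four taps the LUT has 16 entries, and each output sample needs `width` look-ups instead of four multiplications.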

The look-up-table size increases exponentially with the number of coefficients. A small number of coefficients is easily realized with a small LUT. A larger number of coefficients takes up a lot of LUT storage in an FPGA implementation and also reduces the calculation speed. The paper presents an improvement of the DA algorithm that reduces the LUT size<sup>2</sup>.
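The exponential growth, and the saving from splitting the coefficients over more than one LUT, can be illustrated with a short Python sketch. The helper `lut_words` is a hypothetical name; it assumes one LUT word per address pattern and an even split of the coefficients.

```python
def lut_words(num_coeffs, num_partitions=1):
    """Total LUT words when K coefficients are split evenly over several LUTs."""
    base, extra = divmod(num_coeffs, num_partitions)
    sizes = [base + 1] * extra + [base] * (num_partitions - extra)
    return sum(2 ** size for size in sizes)


for K in (4, 8, 16):
    print(K, lut_words(K), lut_words(K, 2))
# 4 16 8
# 8 256 32
# 16 65536 512
```

The partitioned form needs an extra adder to combine the outputs of the smaller LUTs, so the storage saving is bought with additional logic.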

Paper<sup>2</sup> describes the implementation of serial and parallel distributed arithmetic algorithms, without multipliers, for the design of FIR filters, as shown in Figure 2. The 16:1 multiplexer is replaced by an 8:1 and a 2:1 MUX. The traditional adder-based shift-and-accumulate stage, which computes the intermediate products in Distributed Arithmetic, is replaced by accumulation<sup>1,2</sup>.

In paper<sup>1</sup>, a highly efficient multiplier-less FIR filter design was presented. Distributed Arithmetic (DA)



Figure 1. Distributed Arithmetic (ROM with 16).



Figure 2. Distributed Arithmetic (ROM with shared ROM)<sup>1</sup>.



Figure 3. ROM and adder-based shift accumulation for DA<sup>1</sup>.



**Figure 4.** DA with tree adder.



**Figure 5.** DA with adder and Accumulator.



**Figure 6.** DA with array adder.



Figure 7. Simulation result.



| Parameter    | Architecture with Acc. Alg. | Architecture with Shared LUT & Acc. Alg. | Architecture with Add/Sub with Acc. | Architecture with Adder Tree |
|--------------|-----------------------------|------------------------------------------|-------------------------------------|------------------------------|
| Slices       | 14                          | 14                                       | 14                                  | 14                           |
| 4-input LUTs | 8                           | 23                                       | 23                                  | 23                           |
| Delay (ns)   | 6.007                       | 6.009                                    | 5.407                               | 6.009                        |
| Power (W)    | 0.081                       | 0.081                                    | 0.081                               | 0.081                        |

Figure 8. Result comparison table.

has been used to implement the FIR filter, taking advantage of the look-up-table based structure of FPGAs.

# 4. Proposed Architecture

The proposed work is a modified architecture for Distributed Arithmetic that uses parallel processing of ROM-LUTs with an adder tree array. Its advantages are reduced delay and quick, flexible prototyping. Its limitation is the trade-off with area.
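The paper gives no structural detail beyond this description, so the following Python sketch is only one plausible reading of "parallel processing of ROM-LUTs with an adder tree": every bit plane gets its own LUT read, which in hardware could happen at the same time, and the shifted partial sums are then reduced pairwise. The names `adder_tree` and `parallel_da`, the coefficients and the 8-bit width are assumptions.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, and report the number of levels."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # an odd element passes to the next level
            pairs.append(level[-1])
        level = pairs
        depth += 1
    return level[0], depth


def parallel_da(coeffs, inputs, width=8):
    """Sum of products via one LUT read per bit plane plus an adder tree."""
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << len(coeffs))]
    mask = (1 << width) - 1
    partials = []
    for j in range(width):               # in hardware: `width` ROM copies, read at once
        addr = sum((((x & mask) >> j) & 1) << k for k, x in enumerate(inputs))
        partial = lut[addr] << j
        partials.append(-partial if j == width - 1 else partial)
    return adder_tree(partials)


print(parallel_da([1, 2, 3, 4], [5, -3, 7, -128]))  # (-492, 3)
```

In this model the eight partial sums are combined in three adder levels rather than eight sequential accumulate steps, at the price of replicating the LUT once per bit plane, which matches the stated delay-for-area trade-off.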

# 5. Results

The modified architecture was synthesized with Xilinx ISE 14.3, with a Spartan-3E FPGA as the target device.

# 6. Conclusion

The presented work gives a modified architecture for Distributed Arithmetic with parallel processing, implemented with a Spartan-3E as the target device. It provides a smaller combinational path delay than the other architectures listed.

Work in this direction leads to a reduced path delay compared with the existing Distributed Arithmetic architectures. This is useful in high speed system design and in on-chip designs such as System on Chip and Network on Chip with a DSP processor IC, in digital filter ICs, and in other high speed ICs.

Among the architectures built from the various sub-modules, the Distributed Arithmetic architecture using parallel processing with ROM and MUX provides 10.1% less delay than the architectures built with the other listed sub-modules. A Distributed Arithmetic architecture for a four-tap FIR filter was implemented, verified and compared.

# 7. Future Work

Although this work focuses on optimizing the path delay of the architecture, the same goal can also be pursued by implementing the design with retiming, in which sub-modules or components may be reused. Beyond minimizing the delay, the power dissipation can also be reduced, in trade-off with area, by properly placing the architecture in place of the add/shift and accumulate block.

# 8. References

1. Sinha A, Mehendale M. Improving area efficiency of FIR filters implemented using distributed arithmetic. 1998. p. 104–9.
2. Kalaiyarasi D, Kalpalatha Reddy T. Area efficient implementation of FIR filter using distributed arithmetic with offset binary coding. Journal of VLSI and Signal Processing. 2014; 4(3):1–9.
3. Singh Pal N, Pal Singh H, Sarin RK. Implementation of high speed FIR filter using serial and parallel distributed arithmetic algorithm.
4. Kumar V, Muthyala L, Chitra E. Design of a high speed FIR filter on FPGA by using DA-OBC algorithm. International Journal of Engineering Research and General Science. 2014; 2(4):510–7.
5. Usha M, Ramadoss R. An efficient adaptive FIR filter based on distributed arithmetic. International Journal of Engineering Science Invention. 2014 Apr; 3(4):15–20.
6. Sravanthi P, Srinivasa Rao CH, Madhava Rao S. A novel approach of area-efficient FIR filter design using distributed arithmetic with decomposed LUT. Journal of Electronics and Communication Engineering. 2013; 7(2):13–8.
7. Ramesh R, Nathiya. Realization of FIR filter using modified Distributed Arithmetic Architecture. Signal and Image Processing: An International Journal. 2012 Feb; 3(1):83–94.
8. Iyer A, Krishnapriya PN. Power and area efficient implementation for parallel FIR filters using FFAs and DA. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering. 2013 Dec; 2(Spl issue 1):552–60.
9. Available from: http://www.Wikipedia.com