DSP Algorithms Optimization

May 21, 2009
at Faculty of Electrical Engineering, CTU Prague
Ing. Robert Krejčí, Department of Circuit Theory, Czech Technical University, Technická 2, 166 27 Praha, Czech Republic
Abstract. This paper presents the basic possibilities for optimizing digital signal processing algorithms. Real-time tasks running in real environments on embedded systems require a special approach to programming because of limited hardware resources; computationally intensive tasks demand efficient code. Some special properties of digital signal processors are described, and some aspects of selecting a suitable hardware platform are mentioned, with emphasis on the capabilities of digital signal processors. As the main part of this paper, fundamental methods of software optimization are introduced with several simple examples in C.
Keywords
Digital signal processing, DSP, optimization, SIMD, VLIW, floating-point, fixed-point.
1. Introduction
In our research we deal with the optimization of speech recognition algorithms for selected hardware platforms. Speech recognition is in general a computationally intensive task and includes many digital signal processing algorithms. Speech recognizer applications running in real time and in real environments often have to use embedded, resource-limited hardware. Less memory, lower clock frequency, smaller size and lower cost, compared to the common PC architecture (x86), must be balanced by more efficient computation. Thus, the first step of the optimization is the selection of a suitable hardware platform. The next step is optimizing the algorithms for that platform, exploiting as many possibilities of the selected hardware architecture as possible.
2. Hardware platform selection
Normally, and particularly in our research, we develop digital signal processing algorithms on the PC platform. We certainly know that the heart of every computer is a microprocessor, most commonly an Intel IA-32 (x86) or compatible one. When using an algorithm, e.g. a predictive filter for processing speech signals, we assume that it is somehow calculated. We simply write a function call in Matlab (or Octave):
A=lpc(x,N);
... and we are not at all interested in how sophisticated a calculation is hidden in the function. Working on an abstract level, we assume that the authors of this function have optimized it for the x86 platform, and therefore that the calculation is as fast as possible. And even if it were not, we would not mind, because in Matlab on the Intel x86 platform we usually work outside real time, so we simply wait a while for the result.
Having a quad-core processor running at a clock frequency of 3 GHz with 8 GB of RAM at one's disposal is certainly convenient. There are, however, tasks in which we cannot afford such equipment, for example for the following reasons:
- There is a requirement for a low-power device.
- There is a requirement for small size and device portability.
- There is a small budget, or a need to minimize the cost of the device.
Therefore, it is necessary to choose another hardware platform, i.e. one other than the x86 microprocessor, and consequently also other tools for program development. For real-time signal processing, it is usually appropriate to choose a digital signal processor.
Digital signal processors are special types of processors that differ from general-purpose ones, in particular in the following points:
- The instruction set architecture (ISA) is specialized for signal processing:
  - A category of "MAC" instructions is often available: multiplication and accumulation in one instruction cycle.
  - Arithmetic instructions compute operations that would otherwise have to be written as several instructions, such as absolute value, saturation of the result, or rounding.
- There are more data and address buses, in order to work quickly with more data at once.
- Special memory addressing modes allow data to be transferred from memory effectively at the hardware level: for example, circular addressing, or bit-reversed addressing, which is used when computing the FFT. (A software emulation of circular addressing is sketched below.)
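On processors without such hardware support, circular addressing must be emulated in software. The following is a minimal C sketch of a circular (delay-line) buffer with a power-of-two length, using hypothetical names; a DSP with hardware circular addressing performs the wrap-around for free on every access:

#define BUF_LEN 64                /* hypothetical length: a power of two */
static short delay_line[BUF_LEN]; /* circular buffer of recent samples */
static unsigned head = 0;         /* index where the next sample is written */

void push_sample(short x)
{
    delay_line[head] = x;
    head = (head + 1) & (BUF_LEN - 1); /* wrap with a mask, no costly modulo */
}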
To properly select a suitable architecture for DSP and speech recognition systems, it is necessary to survey the available offerings well and to become familiar with the hardware capabilities of the "candidates". The decision must take into account some basic features in which processors from different manufacturers differ. Among others, there are the following aspects:
2.1 Fixed Point vs. Floating Point
One of the aspects distinguishing digital signal processors is the way they deal with real numbers. One class of processors works with real numbers stored in fixed-point format.
Fig. 1: Fixed-point bit format (a sign bit and a binary point followed by 31 fractional bits Z)
The advantage of such processors is potentially greater speed, lower cost and smaller dimensions. Conversely, a disadvantage is the need to modify the algorithms so as to prevent overflow or underflow of the results. This is usually done by shifting the result right by an appropriate number of bits after each arithmetic operation, and it is called scaling.
A more convenient way of dealing with real numbers is offered by floating-point processors.
Fig. 2: Floating-point bit format (a sign bit, exponent bits E, and mantissa bits Z)
This class of processors additionally includes a so-called Floating-Point Unit (FPU), a hardware unit for computing with numbers expressed by a mantissa (Z) and an exponent (E). The advantage of a processor with an FPU is that the programmer has a huge dynamic range of numbers available, so saturation and scaling basically do not need to be dealt with. The disadvantages, however, are lower processor speed and usually higher cost.
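For illustration, the following minimal C sketch decodes the fields of a floating-point number, assuming the common IEEE 754 single-precision layout (the concrete format depends on the processor and is an assumption here):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the bits safely */

    uint32_t sign     = bits >> 31;          /* 1 sign bit */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 exponent bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 mantissa bits, implicit leading 1 */

    printf("sign=%u exponent=%d mantissa=0x%06X\n",
           (unsigned)sign, (int)exponent - 127, (unsigned)mantissa);
    return 0;
}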
2.2 RISC vs. CISC
Another aspect is the complexity of the processor's instruction set. From this point of view, processors are divided into CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer).
Typical representatives of CISC are platforms like Intel x86, Motorola 68k and 8051. Their main features are:
- Each instruction can perform a more complex operation.
- The processor offers complex, effective addressing modes.
- Each instruction starts a microprogram; execution takes a different amount of time for different instructions.
- The compiler (the program translating a higher programming language into machine code) is more complex.
Typical representatives of RISC processors are the ARM platform, MIPS, PowerPC, SPARC, Microchip PIC/dsPIC, Atmel AVR, and others. Their main features are:
- The instruction set is limited to a small number of simple instructions, but it is ensured that the execution of each instruction always takes the same time, usually one instruction cycle.
- Addressing is limited to a few fundamental modes.
- Simple instructions and a simple way of addressing make it possible to create a simple and efficient compiler for a higher programming language.
- Simpler logic enables clocking the processor at higher frequencies.
2.3 Pipelining
Another feature that appears in current processors is called pipelining. It is a chained processing of instructions with overlapping phases: each instruction is executed over several instruction cycles, but one instruction finishes in every cycle.
When optimizing algorithms, it is necessary to keep in mind that some results become available only after several instruction cycles, not immediately. If one instruction waits for the result of a previous instruction, a pipeline stall occurs.
2.4 SIMD
The principle of SIMD (Single Instruction, Multiple Data) means that one instruction executes a computational operation on several data elements at the same time. For example, the result of a complex-number addition, i.e. two additions executed at once, can be available after a single instruction cycle. It is usually necessary to reduce the bit width of the operands accordingly.
2.5 VLIW
The VLIW (Very Long Instruction Word) architecture means that instructions are processed in parallel. They are grouped into an instruction packet, forming a very long instruction word, e.g. 256 bits. Parallel processing allows several different operations to be executed at the same time: for example, arithmetic operations, memory transfers, etc. Moreover, particular instructions within the packet may be of the SIMD type, so they can work with several data elements simultaneously. Such an architecture can help to create a very effective program.
2.6 Cache
In simple microcontrollers, all circuits, including memory, are on one chip in a single package. In more powerful processors it would be technically impossible to place large amounts of memory on a single chip. Therefore external memory is used, as known from the classical PC. However, a problem arises: processors are much faster than memories. A memory able to communicate directly with the processor would have to be very fast, and therefore disproportionately expensive. With a relatively slow memory, the processor would have to wait during every write to and read from memory. Thus processors with a (usually two-level) cache are used.
A cache is a very fast memory located directly on the chip near the core of the microprocessor, and it is intended to mirror a part of the data or program memory, depending on which parts are currently in use. When the data is present in the fast cache, the calculation is carried out without delay. When it is not, the processor must wait until the data arrives from the main memory, which takes time.
Some platforms, such as x86, work with the cache transparently, i.e. at the hardware level; the programmer does not have to deal with moving data between memory and cache. This is called a coherent cache. Other platforms, such as MIPS, leave cache control to the programmer, or to the operating system. For optimal execution of programs, it is then necessary to bring the data into the cache in advance so that it is available when needed. [1]
2.7 Criteria for optimal platform choice
At the beginning of each optimization process stands an important decision: the choice of the hardware platform on which the digital signal processing system will run. It is necessary to understand the hardware aspects in order to implement effective, optimized algorithms. The hardware aspects above imply several criteria for choosing an appropriate platform:
- It is preferable to choose a signal processor rather than a general-purpose processor.
- If it is not economical to use a signal processor, it is appropriate to use at least a processor whose instruction set is extended with DSP and SIMD instructions.
- The decisive factor may not be the processor frequency, but its effectiveness; as a rough comparison:
2400 MIPS ≈ a 300 MHz VLIW processor ≈ a 2400 MHz RISC processor ≈ an n × 2400 MHz CISC processor
3. Software optimizations
There are many methods for improving the efficiency of programs. Some are quite general, others are appropriate for the area of digital signal processing, and others are effective only for a specific processor architecture (and, conversely, may reduce efficiency on another one). Some of them are mentioned in the following text.
Depending on the programming tools or development environment used, some optimizations are executed automatically by the compiler and linker, or on request at a given optimization level. But this is not a general rule. Some compiler manuals also operate with the following terms, but without further explanation. It is therefore useful to describe some of the methods with simple examples.
3.1 Fixed-point / floating-point
If a 32-bit fixed-point processor is used, computing in floating point would be inefficient. The solution is to rewrite the program into fixed-point arithmetic. This procedure is illustrated by the following fragment of code in the C language.
Calculation in floating-point arithmetic:
#define N 16
int i;
float a[N], b[N];
float s = 0.0f;
for (i = 0; i < N; i++) { s += a[i] * b[i]; }
Transcription to fixed-point arithmetic:
#define SCALE 5
int i;
short a[N], b[N];       // data arrays in Q15 format
long long s = 0;        // 64-bit accumulator
short r;                // 16-bit result
for (i = 0; i < N; i++) {
    s += a[i] * b[i];   // accumulate Q1.30 products
}
s = s >> SCALE;         // scaling
if (s > 0x7FFF)         // positive saturation?
    r = 0x7FFF;         // clamp the result to +1 (Q15 full scale)
else if (s < -0x8000)   // negative saturation?
    r = -0x8000;        // clamp the result to -1
else                    // no saturation
    r = (short)(s & 0xFFFF); // discard the redundant sign bits
As can be seen, it is also necessary to ensure saturation of the results. The optimization could continue down to the assembler level, using DSP instructions that perform proper and effective saturation of the result.
3.2 Code Scheduling
Optimization is not just rewriting the code into instructions that correspond to the processor architecture. When using a processor with pipelining, the execution of the code must be planned so that each instruction, if possible, works with already-computed operands; i.e. the instructions have to be rearranged in a way that maximizes performance by reducing pipeline stalls. [2] The idea is sketched below.
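As an illustration, consider the following hypothetical C fragment (the names are invented, and whether such reordering happens automatically depends on the compiler and optimization level): in the naive order, the addition stalls waiting for the multiply; interleaving an independent operation fills that latency.

static void scheduled(const int *in, int *out)
{
    int x, y, u, v;

    /* Naive order (commented out): y = x + in[2] would stall,
       because it immediately needs the result of the multiply. */
    /* x = in[0] * in[1];  y = x + in[2];
       u = in[3] * in[4];  v = u + in[5]; */

    /* Rescheduled order: the independent multiply executes
       while the first multiply is still in the pipeline. */
    x = in[0] * in[1];
    u = in[3] * in[4];  /* independent of x */
    y = x + in[2];      /* x has had time to complete */
    v = u + in[5];

    out[0] = y;
    out[1] = v;
}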
3.3 Loop Unrolling
This method is used when the loop-management overhead is greater than the execution of the loop body itself. Consider the following code:
#define N 3
short i;
short a[N], b[N];
long long s = 0; // 64-bit accumulator
for(i=0; i < N; i++) { s += a[i] * b[i]; }
Since N is known at compile time to be a constant, and it is a small number, it is more efficient to rewrite the loop as follows:
s += a[0] * b[0];
s += a[1] * b[1];
s += a[2] * b[2];
The result is the same, but with no loop overhead. [2]
3.4 Software pipelining
This method is used when the result of each operation feeds the following one, but at the moment the second operation starts processing, the result of the previous operation is not yet known (because of the multi-stage pipeline). The processor would have to wait until the first operation finishes before it could begin handling the second one. Such a case is presented in symbolic code:
// Original loop
for (i = 0; i < N; i++) {
    a = A(i);
    b = B(a);
    c = C(b);
}
The solution is to divide the operations by hand so that dependent operations follow each other at a greater distance:
// Software-pipelined loop
a0 = A(0);     // prologue: start the first two iterations
b1 = B(a0);
a1 = A(1);
for (i = 2; i < N; i++) {
    a = A(i);  // stage A of iteration i
    b = B(a1); // stage B of iteration i-1
    c = C(b1); // stage C of iteration i-2
    a1 = a;
    b1 = b;
}
b = B(a1);     // epilogue: finish the last two iterations
c = C(b1);
c = C(b);
Understanding this example takes a moment of thought: the loop body works on three consecutive iterations at once, while the prologue and epilogue handle the first and last iterations. [2]
3.5 Caching
Some smaller microcontrollers allow mathematical operations to be executed directly on memory cells. As already mentioned, in faster processors this option is not available, and the cache serves for more efficient computing with data in memory. If the transfers between the main memory and the cache are not provided automatically by hardware, they must be a part of the optimized algorithm. [1]
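A minimal C sketch of the idea (the block size is hypothetical; a real value would be derived from the cache parameters of the target processor): a long signal is processed in blocks small enough to stay resident in the cache, so that the second pass over each block hits fast memory instead of streaming the whole signal twice.

#include <stddef.h>

#define BLOCK 1024  /* hypothetical: chosen to fit the data cache */

void process(float *x, size_t n)
{
    size_t i, j;
    for (i = 0; i < n; i += BLOCK) {
        size_t len = (n - i < BLOCK) ? (n - i) : BLOCK;
        for (j = 0; j < len; j++)   /* pass 1: scale */
            x[i + j] *= 0.5f;
        for (j = 0; j < len; j++)   /* pass 2: clamp; the block is now cached */
            if (x[i + j] > 1.0f)
                x[i + j] = 1.0f;
    }
}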
3.6 SIMD and bit reduction
If a result of limited accuracy is sufficient, for example making do with 16-bit instead of 32-bit operands, the class of SIMD instructions can be used, which compute several results in a single instruction cycle. On a 32-bit processor, e.g. two 16-bit results, or four 8-bit ones, can be obtained at once, as the sketch below illustrates.
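The following portable C sketch only emulates the principle (it is not a real SIMD instruction): two 16-bit values packed into one 32-bit word are added in a single 32-bit operation, with a mask preventing the carry from the lower half from spilling into the upper half. Real SIMD hardware separates the lanes for free.

#include <stdint.h>

/* Add two pairs of packed 16-bit values, lane by lane. */
static uint32_t add2(uint32_t x, uint32_t y)
{
    uint32_t sum  = (x & 0x7FFF7FFFu) + (y & 0x7FFF7FFFu); /* add low 15 bits of each lane */
    uint32_t msbs = (x ^ y) & 0x80008000u;                 /* lane sign bits handled separately */
    return sum ^ msbs;                                     /* merge; no carry crosses the lanes */
}

On a SIMD-capable DSP the same operation is a single instruction, typically reachable from C through an intrinsic operator, as described in the next section.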
3.7 Intrinsic operators
Some compilers (e.g. GCC or the Texas Instruments Code Composer Studio compiler) allow almost direct access to specific assembler instructions from C code through so-called intrinsic operators. Intrinsics allow the user to express the meaning of certain assembly statements that would otherwise be cumbersome or inexpressible in C/C++. Intrinsics are used like functions: the user can pass variables to them, just as with any normal function. For example:
int x1, x2, y;
y = _sadd(x1, x2);
This code executes an addition of two operands and simultaneously saturates the result. [3]
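For illustration, a plain-C sketch of what such a saturating addition does (a hypothetical reference implementation, not the vendor's definition):

#include <limits.h>

/* Saturating 32-bit addition: on overflow the result sticks
   at INT_MAX or INT_MIN instead of wrapping around. */
static int sadd(int x1, int x2)
{
    long long s = (long long)x1 + x2; /* widen to avoid wraparound */
    if (s > INT_MAX) return INT_MAX;
    if (s < INT_MIN) return INT_MIN;
    return (int)s;
}

On a DSP, the intrinsic compiles to a single saturating instruction instead of this sequence of tests.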
3.8 Look-up tables
In some cases, results can be computed at compile time and stored in memory as constants. This is quite a common practice for digital filter coefficients. The method can also be used, e.g., for trigonometric functions with reduced accuracy, instead of periodically computing the sine function, as sketched below.
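A minimal C sketch of the idea (the table length and the initialization at start-up are assumptions; the table could equally be generated offline and compiled in as constants):

#include <math.h>

#define LUT_SIZE 256  /* hypothetical table length, a power of two */
static float sin_lut[LUT_SIZE];

/* Fill the table once at start-up. */
void lut_init(void)
{
    const double pi = 3.14159265358979;
    int i;
    for (i = 0; i < LUT_SIZE; i++)
        sin_lut[i] = (float)sin(2.0 * pi * i / LUT_SIZE);
}

/* Reduced-accuracy sine: the phase in [0, 1) selects one table entry. */
float lut_sin(float phase)
{
    int idx = (int)(phase * LUT_SIZE) & (LUT_SIZE - 1); /* wrap with a mask */
    return sin_lut[idx];
}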
4. Conclusion
Some basic terms of digital signal processing were mentioned, and the hardware and software aspects of algorithm optimization were explained with simple fragments of C code.
Acknowledgements
This work has been supported by the Grant Agency of the Czech Republic under project GD102/08/H008 “Analysis and Modelling of Biomedical and Speech Signals” and project GA102/08/0707 “Spoken speech recognition in real environment”.
References
[1] SWEETMAN, D.: See MIPS Run Linux. 2nd ed. San Francisco: Morgan Kaufmann Publishers, 2007.
[2] MIPS TECHNOLOGIES INC.: Effective Programming of the 24KE™ and the 34K™ Core Families for DSP Code. Mountain View (California), 2007.
[3] TEXAS INSTRUMENTS INC.: TMS320C6000 Optimizing Compiler v6.1. Dallas (Texas), 2008.