## Introduction

Digital Filters are one of the fundamental blocks for digital signal processing, like the analog filters are for analog signal conditioning. There are many different types of filters but the fundamental ones are the FIR and IIR filters. In general, we first simulate and tune the frequency response of the desired filter on the PC using tools like SciLab (or Matlab) or online design tools like **MicroModeler**.

After tuning the filter to get the required characteristics, the filter needs to be implemented in C to run on an MCU. The best and most efferent way of implementing a digital filter in an embedded system based on an ARM Cortex-M processor is using the DSP library provided by ARM, the CMSIS-DSP library. It provides functions for both FIR and IIR filters that are highly optimized. Some newer MCUs feature dedicated hardware accelerated filter calculation units that can be used to offload some of these digital filters, e.g. the FMAC on some STM32 MCUs.

There is also the question of which number format to use, fixed-point or floating point. This depends mostly on the application and the chosen MCU. Fixed-point math is often preferred in embedded systems as it is faster to compute, when no floating point arithmetics unit (FPU) is present, and doesn’t require conversions as most sensors and ADCs/DACs use integers/fixed-point notations already.

## FIR Filter

Finite Input Response (FIR) Filter is a filter with a finite impulse response, it settles to zero after a finite, N + 1 samples, amount of time. It is constructed without feedback, N amounts of previous inputs are multiplied and summed up to give the filter output. These are called Taps. The simplest form of a FIR filter is the moving average, or rolling average, which is often used without even seeing it as a “real” filter. For it all coefficients are the same and equal to $ 1 \over {N_{Taps}} $. More complex and different frequency responses are accomplished by changing the values of the coefficients.

Bellow is a block diagram of a 8 Tap FIR filter, $ Z^{-1} $ is a delay of 1 sample and b[N] are the coefficients for each Tap:

A FIR filter can easily be designed with SciLab by using the **ffilt(ft,n,fl,fh)** function. It uses 4 parameters:

- ft: Filter type: ‘lp’ for low-pass, ‘hp’ for high-pass, ‘bp’ for band-pass, ‘sb’ for stop-band
- n: Number of Taps
- fl: First cut-off frequency
- fh: Second cut-off frequency (not used for ‘lp’ or ‘hp’).

The function then returns an array with the coefficients for the FIR filter.

To get/draw the filter response the function **frmag(sys,npts)** is used where the first parameter, **sys**, is the coefficients array and the second, **npts**, the number of points to calculate for the frequency response. Bellow is an example of using all these to draw the frequency response of a FIR low-pass filter with 8 Taps and a cut-off frequency of $ 0.1 * F_{S} $:

```
hn = ffilt('lp', 8, 0.1, 0); //Generate FIR Filter Coefficients
[hzm, fr] = frmag(hn, 256); //Generate Filter Response Curve
plot(fr', hzm'); //Draw Filter Response Curve
```

Bellow is a figure that shows the change of the frequency responses with the increase of the number of Taps. Both with linear magnitude and with logarithmic magnitude, $ 20 * log(mag) $. In grey dashed is a 4th Order IIR filter for reference:

### Implementation on ARM MCU (CMSIS-DSP)

The CMSIS-DSP library provides functions for the implementation of the FIR filter in fixed point format, both Q15 and Q31, as well as in float format in F32. A good resource to see how to use each function can be found on this page.

The first thing to do is to initialize the filter structure by using the **arm_fir_init_q15(S, numTaps, pCoeffs, pState, blockSize)** or **arm_fir_init_q31(S, numTaps, pCoeffs, pState, blockSize)** or **arm_fir_init_float(S, numTaps, pCoeffs, pState, blockSize)** functions. These functions expect 5 parameters:

- S: Pointer to the filter structure (arm_fir_instance_q15, arm_fir_instance_q31, arm_fir_instance_f32)
- numTaps: Number of Taps of the FIR filter
- pCoeffs: Pointer to the filter coefficient array
- pState: Pointer to a state buffer array (used internally as the delay line under others)
- blockSize: Block size aka how many samples are processed with each filter function call, often just 1 sample at a time.

For the fixed-point version, the first thing to do is to convert the coefficients calculated by SciLab into the appropriate fixed point format. For Q15 this is done by multiplying the coefficients by $ 2^{15} $ and for Q31 by $ 2^{31} $, and then rounding them to an integer.

The coefficients have to be in reversed order and in case of the Q15 version the number of Taps has to be greater then 4 and even. If an odd number of Taps is to be used the last coefficient is set to 0. Also, the state buffer has to be of length $ nTaps + blockSize - 1 $ for both Q31 and Float version, and $ nTaps + blockSize$ for the Q15 version.

Bellow is example code used to initialize each version of the FIR filter, using the same filter coefficients calculated with the SciLab code above:

```
//Q15 FIR Filter
uint16_t blockLen = 1;
int16_t firStateQ15[8 + blockLen];
// b[tabs-1], b[tabs-2], ..., b[1], b[0]
int16_t firCoeffQ15[8] = { 2411, 4172, 5626, 6446, 6446, 5626, 4172, 2411};
arm_fir_instance_q15 armFIRInstanceQ15;
arm_fir_init_q15(&armFIRInstanceQ15, 8, firCoeffQ15, firStateQ15, blockLen);
//Q31 FIR Filter
uint16_t blockLen = 1;
int32_t firStateQ31[8 + blockLen - 1];
// b[tabs-1], b[tabs-2], ..., b[1], b[0]
int32_t firCoeffQ31[8] = { 158004550, 273426110, 368677283, 422466574, 422466574, 368677283, 273426110, 158004550 };
arm_fir_instance_q31 armFIRInstanceQ31;
arm_fir_init_q31(&armFIRInstanceQ31, 8, firCoeffQ31, firStateQ31, blockLen);
//Float FIR Filter
uint16_t blockLen = 1;
float firStateF32[8 + blockLen - 1];
// b[tabs-1], b[tabs-2], ..., b[1], b[0]
float firCoeffF32[8] = { 0.0735766, 0.127324, 0.1716787, 0.1967263, 0.1967263, 0.1716787, 0.127324, 0.0735766 };
arm_fir_instance_f32 armFIRInstanceF32;
arm_fir_init_f32(&armFIRInstanceF32, 8, firCoeffF32, firStateF32, blockLen);
```

After the initialization the filter is now ready to be used. To apply the filter to a new input sample the **arm_fir_q15(S, pSrc, pDst, blockSize)** or **arm_fir_q31(S, pSrc, pDst, blockSize)** or **arm_fir_f32(S, pSrc, pDst, blockSize)** function is called, which expects 4 parameters:

- S: Pointer to the filter structure
- pSrc: Pointer to the input samples (an array when block size > 1)
- pDst: Pointer to where write the output filtered samples (an array when block size > 1)
- blockSize: Block length, how many samples to process in one call

Bellow is example code on how to use each of these filter functions. In case of the fixed point implementations, there is a *_fast version* for each of them that is slightly faster but does not do any overflow protection and so the input has to be scaled to prevent overflow, by log2(number of Taps) bits. In general, the performance gain is so minimal that it is not worth the complications that can happen, as can be seen in the benchmark section.

```
//Q15 FIR Filter
uint16_t blockLen = 1;
int16_t inputSample;
int16_t filteredSample;
arm_fir_q15(&armFIRInstanceQ15, &inputSample, &filteredSample, blockLen);
//Slightly faster but no overflow protection
//arm_fir_fast_q15(&armFIRInstanceQ15, &inputSample, &filteredSample, blockLen);
//Q31 FIR Filter
uint16_t blockLen = 1;
int32_t inputSample;
int32_t filteredSample;
arm_fir_q31(&armFIRInstanceQ31, &inputSample, &filteredSample, blockLen);
//Slightly faster but no overflow protection
//arm_fir_fast_q31(&armFIRInstanceQ31, &inputSample, &filteredSample, blockLen);
//Float FIR Filter
uint16_t blockLen = 1;
float inputSample;
float filteredSample;
arm_fir_f32(&armFIRInstanceF32, &inputSample, &filteredSample, blockLen);
```

## IIR Filter

Infinite Input Response (IIR) Filter is a filter with a infinite impulse response, it never reaches zero but only tends towards it indefinitely. In practice, they normally reach zero or at least close enough to zero after some time. IIR filters are the digital “equivalent” to analog filters. They are constructed with feedback, N amounts of previous inputs and previous outputs are multiplied and summed up to give the filter output. Normally IIR filters, instead of being represented as a long chain, are split into blocks with each two delay blocks for the input and output, a second order IIR filter. These blocks/sections are called biquads. To get higher order filters they are cascaded aka chained in series.

Bellow is a single biquad section, 2nd order IIR filter, where $ Z^{-1} $ is a delay of 1 sample and b[N] are the coefficients for each Tap on the input side (feed-forward) and a[N] are the coefficients for each Tap on the output side (feed-back):

A IIR filter can be designed with SciLab by using the **iir(n,ftype,fdesign,frq,delta)** function. It uses 5 parameters:

- n: Filter Order
- ftype: Filter type: ‘lp’ for low-pass, ‘hp’ for high-pass, ‘bp’ for band-pass, ‘sb’ for stop-band
- fdesign: The analog filter design used as basis: ‘butt’ for Butterworth, ‘cheb1’ and ‘cheb2’ for Chebyshev, ‘ellip’ for Ellipsis
- frq: Vector of two frequencies for the first and second cut-off frequency (‘lp’ and ‘hp’ only use the first one)
- delta: Vector of two error values for ‘cheb1’, ‘cheb2’ and ‘ellip’ filter.

The function returns the transfer function for the IIR filter.

To get/draw the filter response the function **frmag(sys,npts)** is used where the first parameter, **sys**, is the transfer function and the second, **npts**, the number of points to calculate for the frequency response. Bellow is an example of using all these to draw the frequency response of a IIR low-pass filter of order 2 and a cut-off frequency of $ 0.1 * F_{S} $:

```
hz = iir(4, 'lp', 'butt', 0.1, []); //Generate IIR Filter Transfer Function
[hzm, fr] = frmag(hz, 256); //Generate Filter Response Curve
plot(fr', hzm'); //Draw Filter Response Curve
```

Bellow is a figure that shows the change of the frequency responses with the increase of the filter order. Both in linear magnitude and in logarithmic magnitude, $ 20 * log(mag) $, form:

To get the actual coefficients of the IIR filter, the transfer function returned by **iir(n,ftype,fdesign,frq,delta)** has to be evaluated for the delay operator, $ z^-1 $. This can be done by the following SciLab code:

```
hz = iir(4, 'lp', 'butt', 0.1, []); //Generate IIR Filter Transfer Function
q = poly(0, 'q'); //To express the result in terms of the delay operator q=z^-1
hzd = horner(hz, 1/q); //Evaluates the polynomial by substituting the variable z in hz by 1/q
coeffsA = coeff(hzd.den); //Get a[N] feed-back coefficients
coeffsB = coeff(hzd.num); //Get b[N] feed-forward coefficients
```

This gives the IIR feed-back (a[N]) and feed-forward (b[N]) coefficients for the Direct Form 1, in the extended unfolded form. Because many implementations of the IIR filter, e.g. the CMSIS-DSP library, use the cascaded biquad form, the transfer function has to first be converted/unfolded into the cascaded biquad form. This is, to factorize the obtained transfer function into 2nd order sections, 2nd order polynomials.

This can be done with the **factors(rat)** function in Scilab. The function takes the polynomial as input and returns a list of polynomials of order 1 or 2 for both the numerator (lnum) and denominator (lden) as well as the “gain” (g). With this the biquad sections can now be designed by paring numerator and denominator polynomials, distributing the gain and to normalize the sections so that the denominators are in the form $ ax^2 + bx + 1 $

`[lnum, lden, g] = factors(hzd,'d');`

### Implementation on ARM MCU (CMSIS-DSP)

The CMSIS-DSP library provides functions for the implementation of the Biquad Cascade Direct Form 1 IIR filter in fixed point format, both Q15 and Q31, as well as in float format in F32 (because of the error accumulation problems in the Direct Form 2, the library only provides functions for floats and doubles). A good resource to see how to use each function can be found on this page.

The first thing to do is to initialize the filter structure by using the **arm_biquad_cascade_df1_init_q15(S, numStages, pCoeffs, pState, postShift)** or **arm_biquad_cascade_df1_init_q31(S, numStages, pCoeffs, pState, postShift)** or **arm_biquad_cascade_df1_init_f32(S, numStages, pCoeffs, pState)** functions. These functions expect 5, 4 in case of float, parameters:

- S: Pointer to the filter structure (arm_biquad_casd_df1_inst_q15, arm_biquad_casd_df1_inst_q31, arm_biquad_casd_df1_inst_f32)
- numStages: Number of Taps of Cascaded Biquad Stages
- pCoeffs: Pointer to the filter coefficient array
- pState: Pointer to a state buffer array (used internally as the delay line under others)
- postShift: Shift to be applied to the accumulated result for when the coefficients are not in the Q15/Q31 format

After that, for the fixed-point version, the coefficients have to be convert into the appropriate fixed point format. For Q15 this is done by multiplying the coefficients by $ 2^{15} $ and for Q31 by $ 2^{31} $, and then rounding them to an integer. In case at least one coefficients is over 1, don’t fit in the Q15/Q31 format, all coefficients have to be converted into the format where the largest coefficient fit. For example, when a[1] is 1.049 all coefficients have to be converted into Q14/Q30 by multiplying them by $ 2^{14} $ or $ 2^{30} $. For this case the **postShift** parameter has to be set to 1 to get the correct filter result.

The coefficients have to be arranged in the following order: b[0], b[1], b[2], a[1], a[2] and then again b[0], b[1], b[2], a[1], a[2] for the next stages. In the Q15 case, there is an additional 0 “coefficient” in-between b[0] and b[1] (b[0], 0, b[1], b[2], a[1], a[2]) used for optimized ALU usage internally. Care has to be taken with the sign of the a[n] coefficients, the CMSIS-DSP implementation uses the coefficients directly with there sign in the multiplication, and often design tools expect them to be used with the “-” sign, as is shown in the block diagram. The coefficients calculated with Scilab is one of those cases, which means that the a[n] coefficients have to be negated to be used in the CMSIS-DSP functions. The state buffer has to be of length $ 4*numStages $ for all versions.

Bellow is example code used to initialize each version of the IIR filter, for a 4th Order Low Pass aka two cascaded biquad sections. The coefficient values used are obtained with the above expressions and with a gain splitting of 50/50:

```
//Q15 IIR Filter
int16_t irrStateQ15[8];
int16_t iirCoeffQ15[12] = { 779, 0, 1557, 779, 17180, -4852, //Stage 1: b[0], 0, b[1], b[2], a[1], a[2]
1663, 0, 3326, 1663, 21642, -10367 }; //Stage 2: b[0], 0, b[1], b[2], a[1], a[2]
arm_biquad_casd_df1_inst_q15 armIIRInstanceQ15;
arm_biquad_cascade_df1_init_q15(&armIIRInstanceQ15, 2, iirCoeffQ15, irrStateQ15, 1);
//Q31 IIR Filter
int32_t irrStateQ31[8];
int32_t iirCoeffQ31[10] = { 51021742, 102043591, 51021742, 1125925247, -317978333, //Stage 1: b[0], b[1], b[2], a[1], a[2]
109014001, 218028109, 109014001, 1418319963, -679398113 }; //Stage 2: b[0], b[1], b[2], a[1], a[2]
arm_biquad_casd_df1_inst_q31 armIIRInstanceQ31;
arm_biquad_cascade_df1_init_q31(&armIIRInstanceQ31, 2, iirCoeffQ31, irrStateQ31, 1);
//Float IIR Filter
float irrStateF32[8];
float iirCoeffF32[10] = { 0.0475f, 0.095f, 0.0475f, 1.049f, -0.296f, //Stage 1: b[0], b[1], b[2], a[1], a[2]
0.1015f, 0.203f, 0.1015f, 1.321f, -0.633f }; //Stage 2: b[0], b[1], b[2], a[1], a[2]
arm_biquad_casd_df1_inst_f32 armIIRInstanceF32;
arm_biquad_cascade_df1_init_f32(&armIIRInstanceF32, 2, iirCoeffF32, irrStateF32);
```

After the initialization the filter is now ready to be used. To apply the filter to a new input sample the **arm_biquad_cascade_df1_q15(S, pSrc, pDst, blockSize)** or **arm_biquad_cascade_df1_q31(S, pSrc, pDst, blockSize)** or **arm_biquad_cascade_df1_f32(S, pSrc, pDst, blockSize)** function is called, which expects 4 parameters:

- S: Pointer to the filter structure
- pSrc: Pointer to the input samples (an array when block size > 1)
- pDst: Pointer to where write the output filtered samples (an array when block size > 1)
- blockSize: Block length, how many samples to process in one call

Bellow is example code on how to use each of these filter functions. In case of the fixed point implementations, there is a *_fast* version for each of them that is slightly faster but does not do any overflow protection and so the input has to be scaled to prevent overflow, by 2 bits into the [-0.25; 0.25] range. In general, the performance gain is so minimal that it is not worth the complications that can happen, as can be seen in the benchmark section.

```
//Q15 FIR Filter
uint32_t blockLen = 1;
int16_t inputSample;
int16_t filteredSample;
arm_biquad_cascade_df1_q15(&armIIRInstanceQ15, &inputSample, &filteredSample, blockLen);
//Slightly faster but no overflow protection
//arm_biquad_cascade_df1_fast_q15(&armIIRInstanceQ15, &inputSample, &filteredSample, blockLen);
//Q31 FIR Filter
uint32_t blockLen = 1;
int32_t inputSample;
int32_t filteredSample;
arm_biquad_cascade_df1_q31(&armIIRInstanceQ31, &inputSample, &filteredSample, blockLen);
//Slightly faster but no overflow protection
//arm_biquad_cascade_df1_fast_q31(&armIIRInstanceQ31, &inputSample, &filteredSample, blockLen);
//Float FIR Filter
uint32_t blockLen = 1;
float inputSample;
float filteredSample;
arm_biquad_cascade_df1_f32(&armIIRInstanceF32, &inputSample, &filteredSample, blockLen);
```

## CIC Filter

Cascaded Integrator-Comb (CIC) filter is an optimized version of the moving average FIR filter. It is based on the recursive implementation of it where instead of summing N samples together the newest sample ($ x[n] $) is added to the previous output ($ y[n - 1]) and the oldest sample ($ x[n - M] $) is subtracted. This is, with the gain scaling factor omitted (division by M):

$$ y[n] = \sum_{k=0}^{M-1} x[n - k] = y[n - 1] + x[n] - x[n - M] $$

The CIC filter implements this with two sections: the comb section which implements the delayed subtraction ($ c[n] = x[n] - x[n - M] $), and the integrator section which implements the feedback part ($ y[n] = y[n - 1] + c[n] $). The basic CIC filter is a integrator section followed by a comb section where the delay line length (M) of the comb section defines the cut-off frequency, equivalent to the number of samples summed in the moving average filter. The block diagram for the basic CIC filter can be seen in the figure bellow:

Rewriting the above block diagram by using the z-transform transfer functions for the comb and integrator section gives the following block diagram, with the corresponding time-domain difference equation for each block written bellow them:

The basic CIC filter, with the architecture shown in the block diagrams above, has the same frequency response as a moving average filter and is given by the equation bellow:

$$ H({f \over f_s}) = \left | {{1 \over M} {{\sin (2 \pi f {M \over 2})} \over {\sin (2 \pi {f \over 2})} }} \right | $$

This gives a response shown in the figures bellow, for different delay line lengths and in both linear and logarithmic gain scale.

Higher orders CIC filters are achieved by cascading multiple CIC filters, with the comb and integrator sections rearranged so that they are grouped together. This is possible because both the comb and integrator sections are linear time-invariant (LTI) systems which allows the rearranging without modifying the result. It is important to note that with multiple stages the gain scaling factor gets very large quick as now it not only depends on the number of samples in the delay line (M) but also on the number of stages (S), and exponentially with those! A simple multi-stage CIC filter is shown in the block diagram bellow:

The figures bellow show the frequency response of a 8 samples CIC filter with different number of stages:

An important note is that, in contrast to other digital filter like FIR or IIR, the CIC filter can only be implemented with fixed-point notation (no floats!) because it relays on the overflow characteristics of fixed-point math. Also, a great article on CIC filters and simple compensation with some more in-depth explanations can be found here.

### CIC Decimator

CIC filters are specially interesting when used in combination with decimation (or interpolation). Because the decimation operation is also a LTI operation, it can be moved into the CIC filter and fitted between the integrator and comb sections. With this arrangement, the size of the delay line of the comb section is reduced by the decimation ratio. On the limit, when the decimation ratio is the same as the comb section delay line length (M), to only one sample. The block diagram for a CIC decimator with the same decimation ratio as the original CIC filter delay line length is shown in the figure bellow:

This represents the most commonly used implementation, the CIC filter bandwidth and the decimation ratio coincide and the resulting decimation filter is extremely simple. It only uses two sums, no multiplications, and when the decimation factor is a power of 2 the gain scaling can be achieved with a simple right shift. This makes the CIC decimation architecture the preferred choice for systems that require decimation, specially in FPGAs and integrated circuits e.g. in high resolution ADCs or compact digital transceivers.

### CIC Interpolator

Another popular application for CIC filters is in interpolation. As with decimation, the interpolation stage is also integrated into the CIC filter but with the order of the comb and integrator stages swapped. Again this is made possible because of the LTI nature of interpolation. By arranging the sections in this way the delay line can be reduced by the interpolation ratio. On the limit, when the interpolation ratio is the same as the comb section delay line length (M), to only one sample. The interpolation stage performs zero-stuffing, adding L zeros between two consecutive samples. The block diagram for a CIC interpolator with the same interpolation ratio as the original CIC filter delay line length is shown in the figure bellow:

### Compensation Filter

Because CIC filters don’t have a very flat frequency response in the pass-band they are often followed by a compensation filter. This compensation filter is normally implemented as a FIR filter and is designed to have a frequency response with the inverse characteristics of the CIC filter, flattening the overall response. This makes for a great overall decimation, or interpolation, filter and because the compensation filter is put either after the decimation or before the interpolation it always runs on the lower sampling rate, making it still more efficient then doing the decimation/interpolation with a FIR filter directly.

There are many different techniques to design the FIR compensation filter, with the easiest and most straight forward being using frequency sampling. Using frequency sampling allows to design the compensation FIR filter by sampling the exact response that is desired, normally the inverse of the CIC filter with a low-pass “window” added to improve the attenuation outside the required bandwidth. In Matlab this can be done with the **fir2** function, the approach used in a application note from Intel/Altera: Understanding CIC Compensation Filters. But there is no equivalent in Scilab, no frequency response sampling method that returns the FIR filter coefficients…

To overcome this, a custom frequency sampling function was implemented (good article about frequency sampling can be found here), as well as a function that returns the inverse CIC response and low-pass “window” function. A wrapper function, **CICCompFilter(m,d,fc,bp,n)**, is also implemented so that the compensation filter can easily be adjusted to fit the application by setting the following 5 parameters:

- m: CIC filter delay line length
- d: Decimation ratio
- fc: Low-pass cut-off frequency
- bp: Low-pass transition bandwidth
- n: Number of Taps (should be odd)

The function returns the FIR filter coefficients and shows a plot of the designed filter with plots showing the design parameters. The source code file can be downloaded from the link bellow:

Scilab code file: CICCompFilter.sce

As an example, a compensation filter for a CIC with a delay line length of 8, decimation factor of 8, a cutoff frequency of 0.25 and a transition band of 0.1 with 11 taps is designed:

`hn = CICCompFilter(8, 8, 0.25, 0.1, 11); //Generate FIR Filter Coefficients`

The figure bellow shows the obtained filter response (red) together with the CIC filter (purple) response and inverse of it (purple dashed), the low-pass “window” (white dashed), the ideal/desired filter response (green) and the global response (blue) of the CIC followed by the compensation filter:

### Implementation on ARM MCU

The CMSIS-DSP library doesn’t provide functions for any CIC filters so a costume implementation has to be realized. Bellow is a working implementation for all three CIC filters: the basic filter, the decimation filter and the interpolation filter. The API is based on the CMSIS-DSP library API and should be easy to understand and use. These implementations are not highly optimized but they work and serve as a basis and example for a implementation in C for embedded systems.

Source file: dspCICFilter.c

Header file: dspCICFilter.h

(NOTE: Currently only the decimation filter supports multiple stages)

## ARM CMSIS-DSP Benchmark

The ARM CMSIS-DSP library filters were tested on different MCUs, with different ARM Cortex-M cores, to evaluate there performance. The base test is running the filter over a 1024 samples sized array and measure the time it takes to complete. This is then repeated 10 times and average is calculated.

The MCUs tested are:

**STM32F103:**An ARM Cortex-M3 with no FPU and no cache, 20 kB RAM and clocked at 72MHz**STM32F446:**An ARM Cortex-M4f with FPU and no cache, 128 kB RAM and clocked at 180MHz**STM32H723:**An ARM Cortex-M7 with FPU and 32/32 kB D-/I-Cache, 564 kB RAM and clocked at 550MHz

Bellow are the results for the different MCUs, both in $ \mu s $ and normalized for the clock frequency (x1/MHz):

### More detailed tests

With the STM32H723, more detailed performance tests where realized:

With different Taps counts for FIR filter:

With different filter orders for the IIR filter, number of biquad sections:

To see the impact of the processing block size and the *_fast* filter functions for the fixed-point versions (“xx_fast_q15” and “xx_fast_q31”), the 4th Order IIR Filter is used. The results for these can be seen in the figure bellow:

As the STM32H723 has large cache of 32 KB of both I and D cache, the performance impact of using or not the cache was also tested and can be seen in the figure bellow:

## Hardware Accelerated Filters (FMAC)

Some newer MCUs have dedicated hardware peripherals to offload filter calculations from the MCUs Core. An example of this is the Filter Math Accelerator (FMAC) unit in the newer STM32 MCUs, e.g. in the STM32G4 and some STM32H7 series MCUs. The FMAC unit performs multiply and accumulate (MAC) operations over a vector/array of 16-bit fixed point values using its local 256 word (16-bit) memory. This local memory is split into three areas: two for input data ($ X_1 $ and $ X_2 $) and one for output data ($ Y $).

Bellow is a block diagram of the FMAC unit and how the different memory areas are connected to the MAC. The configuration of the MUX is what defines what type of filter is implemented:

### FIR Filter

For FIR filters, the FMAC local memory areas are used as follows: $ X_1 $ for input data, $ X_2 $ for the filter coefficients and $ Y $ for the output data. Because for a FIR filter N previous input data are multiplied and summed together, feed-forward, the input buffer has to be of length equal or greater then the number of filter coefficients. The figure bellow shows an example on how to organize the local memory:

To set-up the local memory like in the figure above, the code bellow is used:

```
//Set the coefficient buffer
LL_FMAC_SetX2Base(FMAC, 0);
LL_FMAC_SetX2BufferSize(FMAC, n);
//Set the input buffer
LL_FMAC_SetX1Base(FMAC, n);
LL_FMAC_SetX1BufferSize(FMAC, s);
LL_FMAC_SetX1FullWatermark(FMAC, LL_FMAC_WM_0_THRESHOLD_1); //X1 area full watermark, when the X1FULL flag is set
//Set the output buffer
LL_FMAC_SetYBase(FMAC, n + s);
LL_FMAC_SetYBufferSize(FMAC, e);
LL_FMAC_SetYEmptyWatermark(FMAC, LL_FMAC_WM_0_THRESHOLD_1); //Y area empty watermark, when the YEMPTY flag is set
```

After setting up the local memory areas the coefficient area ($ X_2 $) has to be initialized, filled with the filter coefficients. The input data area can be initialized to so that the filter outputs data faster, the FMAC only starts running when the input area has at least N elements (number of tabs) in it. The code bellow initializes the coefficients area and also pre-fills the input area with N (number of tabs) zero elements (0x00):

```
//Load X2 buffer (Coefficient buffer)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_X2);
LL_FMAC_SetParamP(FMAC, n); //Number of values to be loaded into X2 buffer @ X2_BASE
LL_FMAC_SetParamQ(FMAC, 0); //Number of values to be loaded into X2 buffer @ X2_BASE + N (only used for IIR filter)
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
for(i = 0; i < n; i++) {
LL_FMAC_WriteData(FMAC, firCoeffQ15[i]);
}
//Load empty values to input buffer (X1)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_X1);
LL_FMAC_SetParamP(FMAC, n); //Number of values to be loaded into X1 buffer
LL_FMAC_SetParamQ(FMAC, 0); //Not used
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
for(i = 0; i < n; i++) {
LL_FMAC_WriteData(FMAC, 0x00);
}
```

Finally the FMAC unit has to be configured to run in the FIR filter mode, as well as setting the necessary interrupts and DMA requests. The code bellow is an example of how to set up the FMAC in polling mode, with no interrupts or DMA requests.

```
//Configure FMAC Unit
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_CONVO_FIR);
LL_FMAC_SetParamP(FMAC, n); //Number of coefficients (N+1)
LL_FMAC_SetParamQ(FMAC, 0); //Not Used
LL_FMAC_SetParamR(FMAC, 0); //Gain applied to the accumulator output
//Enable Clipping: Values saturate and do not wrap (optional)
LL_FMAC_EnableClipping(FMAC);
//Configure the read/write method
//Write: Polling; Read: Polling
LL_FMAC_DisableIT_WR(FMAC);
LL_FMAC_DisableIT_RD(FMAC);
LL_FMAC_DisableDMAReq_WRITE(FMAC);
LL_FMAC_DisableDMAReq_READ(FMAC);
//Start FMAC processing
LL_FMAC_EnableStart(FMAC);
```

After this, input data can be loaded into the FMAC and the filter results can be read out. In polling mode, the X1FULL and YEMPTY flags are monitored and data is written/read from the FMAC unit accordingly, as shown in the example code bellow:

```
while(LL_FMAC_IsActiveFlag_X1FULL(FMAC) == 0x01);
LL_FMAC_WriteData(FMAC, inputData[i++]);
while(LL_FMAC_IsActiveFlag_YEMPTY(FMAC) == 0x01);
outputData[i++] = LL_FMAC_ReadData(FMAC);
```

This method is, of course, a very inefficient way of using the FMAC as the CPU is always waiting for the FMAC to finish and therefore can’t be used for other things, which defeats the purpose of the FMAC unit. A slightly more efficient way of using the FMAC is to use larger input and output buffers so that an array of samples can be written/read at once to/from the FMAC. Bellow is an example of setting up the buffer areas for a 56 Tab FIR filter with 16 words input and output data buffers and their respective watermarks area increased to 8 words.

```
uint8_t tabs = 14;
uint8_t bufferSize = 16;
//Set the coefficient buffer
LL_FMAC_SetX2Base(FMAC, 0);
LL_FMAC_SetX2BufferSize(FMAC, tabs);
//Set the input buffer
LL_FMAC_SetX1Base(FMAC, tabs);
LL_FMAC_SetX1BufferSize(FMAC, tabs + bufferSize);
LL_FMAC_SetX1FullWatermark(FMAC, LL_FMAC_WM_3_THRESHOLD_8); //X1 area full watermark, when the X1FULL flag is set
//Set the output buffer
LL_FMAC_SetYBase(FMAC, 2*tabs + bufferSize);
LL_FMAC_SetYBufferSize(FMAC, bufferSize);
LL_FMAC_SetYEmptyWatermark(FMAC, LL_FMAC_WM_3_THRESHOLD_8); //Y area empty watermark, when the YEMPTY flag is set
```

With this, it is possible to write a block of 16 input samples to the FMAC and then wait for at least 8 samples to be processed (the watermark) until new samples have to be written and read from it, freeing up more time for the MCU to do some processing in the meantime. The efficiency can be further increased by using the DMA to perform both the writes and reads from the FMAC, freeing up the MCU completely. For this, both the input and output area watermarks should be set to one sample (LL_FMAC_WM_0_THRESHOLD_1) and the buffer area can be set to only a few samples, on the limit to only one sample but to account for possible DMA delays/jitter, and not loosing performance, they should be set to something higher. Also, when the DMA is used the DMA requests have to be enabled.

Bellow is a complete code example of how to set up the FMAC for a 14-tap FIR with the DMA writing/reading to/from the FMAC unit:

```
//Peripheral clock enable
LL_AHB1_GRP1_EnableClock(LL_AHB1_GRP1_PERIPH_FMAC);
//Start with resetting the FMAC Unit
LL_FMAC_EnableReset(FMAC);
uint8_t tabs = 14;
uint8_t bufferSize = 1;
int16_t firCoeffQ15[14] = { 238, -1405, -2523, -331, 2361, 6841, 9722,
9722, 6841, 2361, -331, -2523, -1405, 238 }; // b[tabs-1], b[tabs-2], ..., b[1], b[0]
//Configure Buffer memory base and lengths
//Set the coefficient buffer
LL_FMAC_SetX2Base(FMAC, 0);
LL_FMAC_SetX2BufferSize(FMAC, tabs);
//Set the input buffer
LL_FMAC_SetX1Base(FMAC, tabs);
LL_FMAC_SetX1BufferSize(FMAC, tabs + bufferSize);
LL_FMAC_SetX1FullWatermark(FMAC, LL_FMAC_WM_0_THRESHOLD_1);
//Set the output buffer
LL_FMAC_SetYBase(FMAC, (2 * tabs) + bufferSize);
LL_FMAC_SetYBufferSize(FMAC, bufferSize);
LL_FMAC_SetYEmptyWatermark(FMAC, LL_FMAC_WM_0_THRESHOLD_1);
//Load empty values to input buffer (X1)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_X1);
LL_FMAC_SetParamP(FMAC, tabs); //Number of values to be loaded into X1 buffer
LL_FMAC_SetParamQ(FMAC, 0); //Not used
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
uint8_t i;
for(i = 0; i < tabs; i++) {
LL_FMAC_WriteData(FMAC, 0x00);
}
//Load X2 buffer (Coefficient buffer)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_X2);
LL_FMAC_SetParamP(FMAC, tabs); //Number of values to be loaded into X2 buffer @ X2_BASE
LL_FMAC_SetParamQ(FMAC, 0); //Number of values to be loaded into X2 buffer @ X2_BASE + N
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
for(i = 0; i < tabs; i++) {
LL_FMAC_WriteData(FMAC, firCoeffQ15[i]);
}
//Configure the read/write method
//Write: Polling; Read: Polling
LL_FMAC_DisableIT_WR(FMAC);
LL_FMAC_DisableIT_RD(FMAC);
LL_FMAC_EnableDMAReq_WRITE(FMAC);
LL_FMAC_EnableDMAReq_READ(FMAC);
//Enable Clipping: Values saturate and do not wrap
LL_FMAC_EnableClipping(FMAC);
//Configure FMAC Unit
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_CONVO_FIR);
LL_FMAC_SetParamP(FMAC, tabs); //Number of coefficients (N+1)
LL_FMAC_SetParamQ(FMAC, 0); //Not Used
LL_FMAC_SetParamR(FMAC, 0); //Gain applied to the accumulator output
```

Of course the DMA also has to be configured, to both write and read data from the FMAC. This has to be done before the FMAC unit is started, before calling * LL_FMAC_EnableStart(FMAC)*. The code bellow shows how to set-up the DMA1 to write to the FMAC using Channel 1 and reading from it using Channel 2:

```
//Input and output arrays, from where to read and write data
uint16_t inputSampleArrayLength = 1024;
int16_t inputSampleArray[1024];
uint16_t outputSampleArrayLength = 1024;
int16_t outputSampleArray[1024];
//DMA1 Channel 1 to DAC1 Channel 1
LL_DMA_SetPeriphRequest(DMA1, LL_DMA_CHANNEL_1, LL_DMAMUX_REQ_FMAC_WRITE);
LL_DMA_SetDataTransferDirection(DMA1, LL_DMA_CHANNEL_1, LL_DMA_DIRECTION_MEMORY_TO_PERIPH);
LL_DMA_SetChannelPriorityLevel(DMA1, LL_DMA_CHANNEL_1, LL_DMA_PRIORITY_HIGH);
LL_DMA_SetMode(DMA1, LL_DMA_CHANNEL_1, LL_DMA_MODE_NORMAL);
LL_DMA_SetPeriphIncMode(DMA1, LL_DMA_CHANNEL_1, LL_DMA_PERIPH_NOINCREMENT);
LL_DMA_SetMemoryIncMode(DMA1, LL_DMA_CHANNEL_1, LL_DMA_MEMORY_INCREMENT);
LL_DMA_SetPeriphSize(DMA1, LL_DMA_CHANNEL_1, LL_DMA_PDATAALIGN_WORD);
LL_DMA_SetMemorySize(DMA1, LL_DMA_CHANNEL_1, LL_DMA_MDATAALIGN_HALFWORD);
LL_DMA_ConfigAddresses(DMA1, LL_DMA_CHANNEL_1, (uint32_t)&inputSampleArray, (uint32_t)&(FMAC->WDATA), LL_DMA_DIRECTION_MEMORY_TO_PERIPH);
LL_DMA_SetDataLength(DMA1, LL_DMA_CHANNEL_1, inputSampleArrayLength);
LL_DMA_EnableChannel(DMA1, LL_DMA_CHANNEL_1);
//DMA1 Channel 2 to DAC1 Channel 2
LL_DMA_SetPeriphRequest(DMA1, LL_DMA_CHANNEL_2, LL_DMAMUX_REQ_FMAC_READ);
LL_DMA_SetDataTransferDirection(DMA1, LL_DMA_CHANNEL_2, LL_DMA_DIRECTION_PERIPH_TO_MEMORY);
LL_DMA_SetChannelPriorityLevel(DMA1, LL_DMA_CHANNEL_2, LL_DMA_PRIORITY_HIGH);
LL_DMA_SetMode(DMA1, LL_DMA_CHANNEL_2, LL_DMA_MODE_NORMAL);
LL_DMA_SetPeriphIncMode(DMA1, LL_DMA_CHANNEL_2, LL_DMA_PERIPH_NOINCREMENT);
LL_DMA_SetMemoryIncMode(DMA1, LL_DMA_CHANNEL_2, LL_DMA_MEMORY_INCREMENT);
LL_DMA_SetPeriphSize(DMA1, LL_DMA_CHANNEL_2, LL_DMA_PDATAALIGN_WORD);
LL_DMA_SetMemorySize(DMA1, LL_DMA_CHANNEL_2, LL_DMA_MDATAALIGN_HALFWORD);
LL_DMA_ConfigAddresses(DMA1, LL_DMA_CHANNEL_2, (uint32_t)&(FMAC->RDATA), (uint32_t)&outputSampleArray, LL_DMA_DIRECTION_PERIPH_TO_MEMORY);
LL_DMA_SetDataLength(DMA1, LL_DMA_CHANNEL_2, outputSampleArrayLength);
LL_DMA_EnableChannel(DMA1, LL_DMA_CHANNEL_2);
```

After this the FMAC can be started and the whole input buffer will be processed and the result written to the output buffer, all without any intervention of the MCU, freeing it up to perform other tasks. With the * LL_DMA_IsActiveFlag_TC2(DMA1)* function the end of the operation can be checked, the TC2 flag is set when the DMA1 Channel 2 has finished writing all samples to the output array.

### IIR Filter

The FMAC can also be used to implement a IIR filter. In that case, the local memory areas are used as follows: $ X_1 $ for input data, $ X_2 $ for the filter coefficients and $ Y $ for the output data. The FMAC IIR Filter is implemented as the Direct Form 1 in the extended unfolded form, not in the concatenated/folded biquad sections form like in the CMSIS-DSP implementation.

As the IIR filter has feedback, N previous inputs and outputs are multiplied together and summed, the filter coefficient memory area ($ X_2 $) is split into two, one for the feed-forward coefficients (b[N]) and one for the feed-back coefficients (a[M]). Also, as here both previous input and output samples are required for the filter, both the input buffer as well as the output buffer has to be of length equal or greater to the number of filter coefficients. The figure bellow shows an example on how to organize the local memory:

To set-up of the FMAC for a IIR filter is almost identical to the FIR filter case. The only big difference is the way the coefficient memory area is set up, which now has two sections. The coefficients for the feed-forward and the feed-back part are concatenated and written to the $ X_2 $ memory area as follows:

```
//Load X2 buffer (Coefficient buffer)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_X2);
LL_FMAC_SetParamP(FMAC, n); //Number of values to be loaded into X2 buffer @ X2_BASE: Feed-forward coefficients (b[N])
LL_FMAC_SetParamQ(FMAC, m); //Number of values to be loaded into X2 buffer @ X2_BASE + N: Feed-back coefficients (a[N])
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
//First the feed-forward coefficients (b)
for(i = 0; i < n; i++) {
LL_FMAC_WriteData(FMAC, iirBCoeffQ15[i]); //b[N]
}
//Second the feed-back coefficients (a)
for(i = 0; i < m; i++) {
LL_FMAC_WriteData(FMAC, iirACoeffQ15[i]); //a[M]
}
```

As is discussed in the FIR case, the most efficient way of using the FMAC is with the DMA so only that example will be shown here. The code bellow is for how to set-up the FMAC for a 4th Order IIR with the DMA writing/reading to/from the FMAC unit:

```
//Peripheral clock enable
LL_AHB1_GRP1_EnableClock(LL_AHB1_GRP1_PERIPH_FMAC);
//Start with resetting the FMAC Unit
LL_FMAC_EnableReset(FMAC);
uint8_t order = 4;
uint8_t bufferSize = 1;
uint8_t n = order + 1;
uint8_t m = order;
uint8_t postShift = 2; //Coefficients are in Q2.13
int16_t iirBCoeffQ15[5] = { 40, 159, 238, 159, 40 }; //b[0], b[1], ..., b[N]
int16_t iirACoeffQ15[4] = { -19411, 18957, -8639, 1536 }; //a[1], a[2], ..., a[M]
//Configure Buffer memory base and lengths
//Set the coefficient buffer
LL_FMAC_SetX2Base(FMAC, 0);
LL_FMAC_SetX2BufferSize(FMAC, n + m);
//Set the input buffer
LL_FMAC_SetX1Base(FMAC, (n + m));
LL_FMAC_SetX1BufferSize(FMAC, (n + bufferSize));
LL_FMAC_SetX1FullWatermark(FMAC, LL_FMAC_WM_0_THRESHOLD_1);
//Set the output buffer
LL_FMAC_SetYBase(FMAC, (2 * n + m + bufferSize));
LL_FMAC_SetYBufferSize(FMAC, (m + bufferSize));
LL_FMAC_SetYEmptyWatermark(FMAC, LL_FMAC_WM_0_THRESHOLD_1);
//Load empty values to input buffer (X1)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_X1);
LL_FMAC_SetParamP(FMAC, n); //Number of values to be loaded into X1 buffer
LL_FMAC_SetParamQ(FMAC, 0); //Not used
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
uint8_t i;
for(i = 0; i < n; i++) {
LL_FMAC_WriteData(FMAC, 0x00);
}
//Load empty values to input buffer (Y)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_Y);
LL_FMAC_SetParamP(FMAC, m); //Number of values to be loaded into Y buffer
LL_FMAC_SetParamQ(FMAC, 0); //Not used
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
uint8_t i;
for(i = 0; i < m; i++) {
LL_FMAC_WriteData(FMAC, 0x00);
}
//Load X2 buffer (Coefficient buffer)
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_LOAD_X2);
LL_FMAC_SetParamP(FMAC, n); //Number of values to be loaded into X2 buffer @ X2_BASE: Feed-forward coefficients (b[N])
LL_FMAC_SetParamQ(FMAC, m); //Number of values to be loaded into X2 buffer @ X2_BASE + N: Feed-back coefficients (a[N])
LL_FMAC_SetParamR(FMAC, 0); //Not used
LL_FMAC_EnableStart(FMAC); //Start FMAC processing
for(i = 0; i < n; i++) {
LL_FMAC_WriteData(FMAC, iirBCoeffQ15[i]);
}
for(i = 0; i < m; i++) {
LL_FMAC_WriteData(FMAC, iirACoeffQ15[i]);
}
//Configure the read/write method
//Write: Polling; Read: Polling
LL_FMAC_DisableIT_WR(FMAC);
LL_FMAC_DisableIT_RD(FMAC);
LL_FMAC_EnableDMAReq_WRITE(FMAC);
LL_FMAC_EnableDMAReq_READ(FMAC);
//Enable Clipping: Values saturate and do not wrap
LL_FMAC_EnableClipping(FMAC);
//Configure FMAC Unit
LL_FMAC_SetFunction(FMAC, LL_FMAC_FUNC_IIR_DIRECT_FORM_1);
LL_FMAC_SetParamP(FMAC, n); //Number of feed-forward (b) coefficients
LL_FMAC_SetParamQ(FMAC, m); //Number of feed-back (a) coefficients
LL_FMAC_SetParamR(FMAC, postShift); //Gain applied to the accumulator output
```

The DMA configuration is the same as in the FIR case, as is the way of monitoring the operation completion.

### Benchmark

This section shows some performance evaluation of the FMAC unit, comparing Polling vs DMA implementation and also comparing it to the ARM CMSIS-DSP Library implementation running on the MCU core. The tests consists on running the filter over a 1024 samples sized array and measure the time it takes to complete. The MCU used is the STM32G473 running at 170 MHz, with the FMAC also running at 170 MHz.

Bellow is a graph showing some performance numbers of the FMAC for different FIR and IIR filters, both in polling mode with different input and output buffer area sizes as well as when using the DMA. It is very clear that for small buffer sizes the processing time is dominated by read/writing and waiting operations and only when the buffer area is at least 16 samples long the FMAC processing dominates.

For comparing the FMAC performance to running the filters on the MCU Core the ARM CMSIS-DSP library functions are used. The FMAC is running in DMA mode while the ARM CMSIS-DSP functions are using a block size equal to the number of samples (1024) so that both are running with their ideal conditions. Both the normal and the *_fast* version on the filter functions are used. The results can be seen in the figures bellow:

It shows that the FMAC performs on paar with the MCU Core when the equivalent filter function is used, the *_fast* version which doesn’t have overflow protection. This is expected for this MCU as both the Core and the FMAC run at the same clock rate, but the results can vary a lot when they don’t, which can be the case with higher clocked MCUs like the STM32H723.