Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable ComputingReconfigurable Computing

ApplicationsApplications

Chapter 9Chapter 9

Prof. Dr.Prof. Dr.--Ing. Jürgen TeichIng. Jürgen TeichLehrstuhl für HardwareLehrstuhl für Hardware--SoftwareSoftware--CoCo--DesignDesign

Reconfigurable Computing

Overview Overview

Reconfigurable Computing 2

FPGAs have been used in the past mostly in

Rapid prototyping

Non-frequent reconfigurable systems

Hardware implementation, sometimes specific for the FPGA architectureThe most important application areas are:

Searching (text, genetic database, etc.)

Image processing

Mechanical control

Etc.

Searching Searching –– pattern matchingpattern matching


Pattern matching is the basis of search engines

The purpose is to find and (count) the occurrence of a given pattern in a given text

Useful in:

DictionariesDocument collection indexingDocument filtering and classificationSpam avoidanceContent surveillance

Searching Searching –– pattern matching pattern matching –– sliding windowssliding windows


Sliding windows (Cockscot & Foulk )

Keywords are kept in register. One character / ByteA set of comparators are used. One comparator / ByteHit signal is set whenever the text-segment matches the corresponding wordAdvantage:

Easy to replace old patternsDrawbacks:

Not flexible: Fixed length of registersRedundancy: more comparators than necessary for word with same prefix

Searching Searching –– pattern matching pattern matching -- sliding windowssliding windows


Avoid redundancyUse only one comparator for common characters in different words

Data folding (Foulk)

Fold the data in the circuitConsider the bit-representation of each characterGenerate a comparator circuit for each character in the words to be searched for

Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit

8-bit comparator

Bit Bit Bit Bit Bit Bit Bit Bit

01001110-Comparator

Searching Searching –– pattern matching pattern matching -- FSMFSM--BasedBased


FSM-Based pattern matcherEach regular grammar can be recognized by an FSMIn pattern matching, the target words define the regular grammarThe target words are compiled in the automatonEach word defines a unique path from the start state to an end stateWhen scanning a text, the automaton changes its state with the appearance of charactersReaching a final state correspondsto the appearance of a wordRedundancy is avoided by implementing common prefix

FSM-Recognizer and correspondingstate transition table for the word conte



FSM-Based pattern matcherRAM-implementation

One RAM or ROM for storing the state transition tableOne state registerOne character registerA hit detectorThe Input character and the state register are used to determine the nextstateThe hit detector checks if the current state is equal to a hit state and sets a hit for the corresponding wordAdvantage:

Simple to implementDrawback:

Expensive in terms of flip flops

Char reg

RAMROM

State Reg Nextstate

Hit detect

Characterstream

RAM/ROM implementation of the word recognizer



FSM-Based pattern matcherOne-hot implementation

Each state is coded in one flip flopThe D-input of the flip flop is obtained by an AND of the output of the previous flip flop with the result of the comparatorThe comparator is character-specificOnly n FF are used to implement a word of length nAdvantage:

Low costReflects the structure of the grammar

Drawback: Not easy to buildRedundancy in the comparators

o n t e

c

Character-specific comparators



FSM-Based pattern matcherExploiting common prefix

For words with common prefix, only one common starting path corresponding to the length of the common prefix is used.Redundancy of comparators can be avoided by implementing only one comparator for each character.The result of the comparison will then be provided to all gates using them Words with common prefix

and the corresponding FSM



FSM-Based pattern matcherOptimized architecture

Implement the common prefix Redundancy of comparators is removed: Each character in the set is implemented in a position vector: pos(i) = 1 iff character i is detected Block diagram of the optimal

pattern matcher

Detailed structure of the optimal pattern matcher


Searching Searching –– pattern matching pattern matching –– use of use of reconfigurationreconfiguration

Bit Bit Bit Bit Bit Bit Bit Bit

FSM-Based pattern matcherUse of reconfiguration

Replace the character comparatorsReplace the FSM for a set of words

Rec

onfig

urat

ion

New character comparator

New set of wordsReconfiguration


Signal processing Signal processing –– distributed arithmetic distributed arithmetic --MotivationMotivation

Signal processing applications (FFT, Convolution, Filter algorithms) are characterized by MAC-intensive computations Signal processing functions are usually implemented on special processors

DSPsASICs

FPGAs provide the advantage of reconfigurability, but MAC-intensive applications are expensiveHowever, for MAC computations involving one constant vector, FPGAs present one of the best alternatives to DSPs

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- BasicsBasics


iji XA ∗∑Because the Ai are constant, there exist 2n possible values for

We can pre-compute the possible values and store them in a LUT (DALUT) and retrieve them on demand at run-time

( ) ( )ijijj

iji XA2=2XA=XA=Z ∗∗∗∗∗ ∑∑∑∑

With the binary representation for Xi: j

iji 2X=X ∑

( )∑ ∗∗ ii XA=XA=ZA constant row vector, X column vectorSolution of the following equation:

( )ijij XA2=Z ∗∗∑∑ is the classical form of distributed arithmetic

FPGA Advantage: Computation is memory-based (use of LUTs)

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- BasicsBasics


To better understand, we spread the DA equation

Z=[ ] 02( ) ( ) nn01n01n AX+AX ∗∗ −−.........................220110 AX+AX ∗∗

+ [ ] 12( ) ( ) nn11n11n AX+AX ∗∗ −−221111 AX+AX ∗∗ .........................

2W21W1 AX+AX ∗∗ ......................... ( ) ( ) nnW1nW1n AX+AX ∗∗ −− ] W2

. .........

.........

........

+ [

The bits of the variables will be used to address the memory andretrieve the required values in a bit-serial way.

The DA-datapath implementation is straightforward

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- DatapathDatapath


( )1W1W1 XX − ................

......

1011XX( )1W2W2 XX − ................ 2021XX

( )1WnnWXX − ................ n0n1XX ...

1A0

2A

21 A+A3A

13 A+A23 A+A

123 A+A+A4A

Z+/-

DA-LUT AddressDA-LUT

Parallel bit-serial input j-shift

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- DatapathDatapath


k-parallel( )1W1W1 XX − ................

......

1011 XX

( )1W2W2 XX − ................ 2021 XX

( )1WnnW XX − ................ n0n1 XX

DA-LUT 1 DA-LUT 2 DA-LUT k

ACC 1 ACC 2 ACCk

Z Adder tree

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- ExampleExample


Recursive convolution of time domain simulation of opticalmultimode intra/system interconnects

( ) ( ) 35322415041n0n xf+xf+xfxf+tyf=ty ∗∗∗−∗∗ −

Recursive formula to be implemented on 3 intervals

Comparison of different implementations Virtex 2000E implementation on

the Celoxica RC1000-PP board

Signal processing Signal processing –– Fast Fourier Transform (FFT)Fast Fourier Transform (FFT)


Fourier Series developed by the French Mathematician Joseph Fourier (1807)

Application of the initial idea in the field of heat diffusion

The advent of digital computation and “discovery” of Fast Fourier Transform (FFT) in the 50's revolutionized the field of signal processing

Practical processing and meaningful interpretation of signals with exceptional importance for human and industry

Medical monitors and scanners

Modern electronic communications

Image processing

etc.

Signal processing Signal processing –– FFT FFT -- BasicsBasics


Fourier transform F(u) of a continuous function f(x):dxe)x(f)u(F )ux2j( π−

∞

∞−∫=

Given F(u), we can obtain f(x) by means of inverse Fourier transform

Extension in 2-D, F(u,v)

Corresponding inverse in 2-D

due)u(F)x(f )ux2j( π−∞

∞−∫=

due)v,u(F)y,x(f ))vyux(2j( +−∞

∞−∫= π

dxe)y,x(f)v,u(F ))vyux(2j( +−∞

∞−

∞

∞−∫∫= π



Fourier transform of a discrete function of one variable F(u)

Given F(u), we can obtain, f(x) by means of inverse Fourier transform

1M,...,0u,1M,...,0x,e)x(fM1)u(F )M/ux2j(

1M

0x−=−== −

−

=∑ π

1M−1M,...,0u,1M,...,0x,e)u(F)x(f )M/ux2j(

0x−=−== −

=∑ π

The Brute force computation of the Fourier transform requires M2 multiplications and additions

The performance can be improved to (M log M) using the successive doubling method



For notational convenience, we replace the previous equation with:

)M/2j(M

)ux(M

1M

0xeW,W)x(f

M1)u(F π−

−

=== ∑

We assume M to be of the form

With n and K being positive integers each, we have:

K22M n ==

]WW)1x2(fK1W)x2(f

K1[

21)u(F )u(

K2)ux(

K

1K

0x

))x2(u(K2

1K

0x++= ∑∑

−

=

−

=

)ux(K2

1M

0xW)x(f

K21)u(F ∑

−

==

Fodd(u)Feven(u)



Since and , we have

]W)u(F)u(F[21)u(F u

K2oddeven +=

The FFT is a fast implementation of the Discrete Fourier Transform (DFT).

Based on a divide-and-conquer model,M log2M computations

]W)u(F)u(F[21)Ku(F u

K2oddeven −=+

uM

)Mu(M WW =+ u

M2)Mu(

M2 WW −=+

Signal processing Signal processing –– FFT FFT -- AlgorithmAlgorithm


The N-point DFT computation can be divided into two N/2-point DFT computation. These N/2-point DFT computations can be divided into two N/4-point DFT computations, and so on.

There are log2(N) stages

After the division and the DFT computation, a merging process is performed, in which the transforms are reassembled. Merging process

Signal processing Signal processing –– FFT FFT –– Algorithm Algorithm –– butterfly unitbutterfly unit


The butterfly unitThe reassembling is done by using the complex elements say, g and h from the previous stage. Current element = g - h*WN

k and g + h*WNk

The twiddle factorThe twiddle factor terms (of the WN

k = exp(j 2* π*k/N) must be availableThe real and complex parts of these factors are stored in a ROM.The factors correspond to a sine (imaginary part) and cosine (real part) functions, in a set of N/2 equally-spaced angles in an interval from 0 to (N-2)* π/N. Therefore, only a N/2-position memory is needed.

g

h * Wk

N-

+

Signal processing Signal processing –– FFT FFT –– FPGA implementationFPGA implementation


16 points FFTPipelined READ, EXECUTE and OUTPUT stage

Read one complex input (32-bits) in every cycle16 point input is read in 16 cycles

Output one complex result (32-bits) in every cycle

EXECUTE stage also takes 16 cycles

Performance Latency: 17 cyclesThroughput: 1 transform per 16 cycles

READ EXEC OUTPUT

READ EXEC OUTPUT



Completing the execution stage in 16 cycles results in large fan-outs for the internal registers: very poor timing characteristics

By trial and error, divide the execution stage into 4 pipelined sub-stagesThat necessitates 5x internal storage, but 2560 bits of storage is no big deal for XC2V2000.Final performance :

Latency : 65 cyclesThroughput : 1 transform per 16 cycles (same as before)

R E1 E2 E3 E4 O

R E1 E2 E3 E4 O



Resource requirements:

1432 out of 21504 flip-flops (6%)1037 out of 21504 LUT (4%)65 out of 624 I/O pins (10%)

TimingMinimum clock period: 5.899ns (Maximum frequency: 169.520MHz)

Power estimation 440 mW

Image processing Image processing –– Basic operatorsBasic operators


Image processing algorithms usually process an image (set of points with a given characteristics such as color, gray level, luminance, etc.) point by point. The resulting pixels depend only on the pixels in the original picture. A sequential processor needs quadratic run-time to process a complete image. By using parallelism, each pixel can be computed independently.Many image processing system are based on the following operators:

Median filteringBasic Morphological operationsConvolutionEdge detection

Algorithms are often based on the moving window operator

Image processing Image processing –– Moving windowMoving window


The moving or sliding window algorithm usually processes one pixel of the image at a timeThe value of the pixel is changed by a function of a local pixel region covered by the windowThe operator moves over the image to cover all pixelsFor a pipelined implementation, all the pixel of the windows must be accessed at the same time for each clockFPGA implementation uses FIFO buffers

FIFO1

FIFO2

RAM

W11 W12 W13

W21 W22 W23

W31 W32 W33

Disposed

FIFO1

FIFO2

W11 W12 W13

W21 W22 W23

W31 W32 W33

DisposedW14 W15

W24 W25

W34 W35

W41 W42 W43

W51 W52 W53

W44 W45

W54 W55RAM

FIFO3

FIFO4

3x3 and 5x5 moving windows

Image processing Image processing –– Moving window Moving window -- FPGAFPGA


FIFO ImplementationFIFOs are implemented using circular buffers constructed from Multi-ported RAMs (available in, e.g., Virtex FPGA)Indexes keep track of the front and tail items in the bufferBLOCK RAMs are readable and writable in one clock-cycle. This allows a throughput of one pixel per cycle.

3x3 windows2 buffers of size W-3 (W = image width) are usedThe two FIFO buffers must be full to access all the window pixels in one cycleIn each clock cycle, a pixel is read from the memory and placed into the bottom left cornerThe content of the window is shifted to the right with the right most member being added to the tail of the FIFOThe top right pixel is disposed after computation, since it is not used in the future computation

Image processing Image processing –– Median filteringMedian filtering


BasicsAn impulse noise (or salt and pepper noise) in an image has a gray level with higher low different from the neighbor point.Linear filters have no ability to remove this type of noiseMedian filters share remarkable advantages on removing this type of noiseOften used in digital signal and image/video applications

ImplementationUse a sliding window of odd size (e.g., 3x3) over an imageAt each window position, the median of the sample values is taken to replace the value at the center of the windowHigh computational cost O(N log N) even using most efficient sorting algorithmsGeneral purpose processors are not a good solutions for real time implementation. This justifies the use of FPGAs.

Image processing Image processing –– Median filteringMedian filtering


Sequential implementation(pseudo code)For x=1 to # rows

For y = 1 to # colsBuild Windows arraypixel(x,y) = Median(window array)

EndEnd

Complexity O(#rows X #cols x NlogN) (N=3)

Hardware sorting implementation

10 5 2014 3 1115 25 2

11

2 3 5 10 11 14 20 25

Median

Image processing Image processing –– Median filtering Median filtering -- resultresult


Original image Filtered image

Image processing Image processing –– Basic Morphological OperatorsBasic Morphological Operators


Morphology in image processing studies the appearance of objects. Useful for example in:

Skeletonization Edge detectionRestoration

ProcessingThe image is processed pixel-by-pixel using a structuring element (the sliding windows)The window may fit or not to the image

Most basic building blocks:Erosion (shrinks or erodes an object in the image)Dilation (grows the image)Operations like opening and closing of an image can be derived by performing erosion and dilation in different order

Image processing Image processing –– Basic Morphological OperatorsBasic Morphological Operators


Erosion Replaces the center pixel in the sliding window by the smallest pixel value in the window arrayThe bright area of the image shrinks, or erodes

Dilation Replaces the center pixel in the sliding window by the greatest pixel value in the window arrayThe bright area of the image grows

AlgorithmSame as the medianInstead of selecting the median element,the minimum is selected for erosion and the maximum is selected for dilation

Image processing Image processing –– Median filtering Median filtering -- resultresult


DilationOriginal image Erosion

Image processing Image processing –– Convolution Convolution -- BasicsBasics


Convolution multiplies two arrays of numbers with different sizes and produces a third array of numbersIn image processing, convolution implements operators whose output pixels are computed as linear combinations of certain input pixels values.1-D Convolution

Formally convolution takes two input functions f(x) and g(x) andgenerates h(x) = f(x)*g(x) where g(x) is referred to as the filter:

2-D ConvolutionMost important in modern image processingA finite size window (convolution mask) is scanned over the imageThe output pixel value is the weighted sum of the input pixels within the windowThe weight is the value of the filter assigned to each pixel in the window

( ) ( ) ( )dττxgτf=xh −∫∞

∞−

Image processing Image processing –– Convolution Convolution -- BasicsBasics


2-D Convolution

Mathematically represented by the following equation:

where x is the input image, h is the filter and y is the output image

Supports a virtual infinite variety of masks, each with its own feature

3x3 convolutions are most commonly used and operate only on a pixel and its directly adjacent neighbours

( ) ( ) ( )jni,mxji,h=nm,yheightimg

0=i

widthimg

0=j−−∑ ∑

− −

P1 P2 P3

P4 P5 P6

P7 P8 P9

W1 W2 W3

W4 W5 W6

W7 W8 W9

P*∑

∑9

1=ii

i9

1=ii

W

PW=P

Image processing Image processing –– Convolution Convolution –– Gaussian filtersGaussian filters


Gaussian convolution filters1-D 2-D

The idea is to use the 2-D distribution as a point spread function. This is achieved by convolutionA discrete approximation of the Gauss function is required to perform the convolutionIn theory, Gauss distribution is zero anywhere. Therefore an infinite large convolution kernel may be requiredBut in practice, the convolution kernel is truncated as shown in the pictures

( ) 2

2

σ2x

eπσ21=xG

−

( )( )

2

22

σ2y+x

eπσ21=yx,G

−

21 31 21

31 48 31

21 31 21

1256

2 4 5

4 9 12

5 12 15

4 2

9 4

12 5

4 9 12

2 4 5

9 4

4 2

1115

3x3 Gaussian smooth filter 5x5 Gaussian smooth filter

1.4



ConvolutionOriginal image

Image processing Image processing –– Edge detection Edge detection -- BasicsBasics


Edges Placed in image with strong intensity contrastOften occurs at image location representing boundaries

Edge detectionExtensively used in image segmentation, i.e., dividing an image into areas corresponding to different objectsRepresenting an image by its edges significantly reduces the amount of dataSince edges correspond to strong illumination gradients, the derivatives of the image are used to compute the edgesOperators often used are

Laplace operatorSoebel operatorCanny edge detection algorithm

Image processing Image processing –– Edge detection Edge detection -- operatorsoperators


LaplaceGradient operatorIntensity difference are enhanced. The edges are more pronouncedFor each pixel, the gray value of its four neighbours (top, left, bottom, right) pixel value are subtracted from its own value

Soebel operatorCombination of two 1-D operators

One for detecting horizontal edgesOne for detecting vertical edges

-1

-1 4 -1

-1

Laplace operator

-1 1

-2 2

-1 1

Soebel-x

1 2 1

-1 -2 -1

Soebel-y



Edge detectionOriginal image

Image processing Image processing –– Use of reconfigurationUse of reconfiguration


Intelligent image processing systemAccording to input image and other conditions,

Some operations are done to improve the image

Filtering (the correct filter is chosen)Smoothing Segmentation (Edge detection)Skeletonization

Some adjustments are done on the image input hardware

CalibrationFocussing

Everything is done while the system keeps runningFixed parts of the system will run continuouslyReconfigurable must be replaced at run-time

Mechanical Control Mechanical Control –– BasicsBasics


Controller task is to influence the dynamic behavior of a plantInputs values for the plant depends on plant's outputs (Feedbacks)A plant is modeled as a linear time invariant (LTI)-SystemController is modeled as LTI-System Time discretization

Scaling to fix-pointk, k+1, k+2 …sample pointsT… sample periodtc… calculation time of controller

Plant

Controller

D/A-C

D/A-C

referencevalue

TTkk kk+1+1 kk+2+2

tt

ttcc

Mechanical Control Mechanical Control –– BasicsBasics


kur

kyr

kkk

kk1kuDxCyuBxAxrrr

rrr

+=

+=+

⎥⎥⎥

⎦

⎤

⎢⎢⎢

⎣

⎡==

...vMvM

vMz row2

row1v

v

rrcombine A,B,C,D to M,combine A,B,C,D to M,represent computation asrepresent computation asset of scalar products ofset of scalar products ofeach row of M witheach row of M with

u: Controller input, x: State, y: OutputA, B, C, D: Constant coefficient matrices

vr

Mechanical Control Mechanical Control –– DADA--ImplementationImplementation


∑=

==n

1iiiaxxaz (1)

rr

∑−

=

−=1w

0j

jiji 2xx (2)

∑ ∑

∑ ∑

−

= =

−

=

−

=

−

=

=

1w

0j

n

1iiij

j

n

1i

1w

0j

jiji

ax2

2xaz (3)

Scalar product:const. vector and var. vector

xi as w-bit fix-point(here x just unsigned in [0,2[ )

replace (2) in (1),swap the sums

since xij is in {0,1}right sum can have just 2n values

pre-compute it and store it in a 2n x w ROM as LUT

20 2-1 2-2 2-3 2-(w-1)…

ar x

v

Mechanical Control Mechanical Control –– DADA--Implementation (Parallel)Implementation (Parallel)


Coeffic ient Look - Up T able(DA LUT )

Inpu t R eg is te r

x 1> > c

x 2> > c

DA L UT1x [1..n],w-1

DA L UT2x[1..n],w-2

Data P ath ( DP )( c + 1 Input A dder )

Res ult ( z )

> > c

DA L UTcx n> > c x[1..n],w-c

.. .

...

......

w w w

w

n

n

> > 0> > c -1 > > c -2∑−

=

−=1w

0jj]n..1[

j )x(DALUT2z

c-Bit at a time Architecture

Mechanical Control Mechanical Control –– Multi controller systemMulti controller system


Many controller modules optimized for different operating regimes

Controllers have different structures not only different coefficients

Supervisor observes plant and determines best controller module

Multiplexer switches controller outputs

Mechanical Control Mechanical Control –– Multi controller systemMulti controller system


Periodic execution of task graphConditional branching to controller modules (CM)CMs implement various area/time trade-offs possible

tex

area


Mechanical Control Mechanical Control –– Multi controller system Multi controller system --Use of reconfigurationUse of reconfiguration

One slot solution



Two slots solution



)(:timeexecution

CMCMex Aft =

)2/()(22

)2()2(min

1)2(

AftTtfAA

CMex

CMex

CM

==== −

)()()(

)1()1()1(min

1)1(

AgAfttTtfAA

CMrec

CMex

CMex

CM

+=+=== −

CM

CMCMreconfig

ArAgt

×≈= )(

: timereconfig.


Mechanical Control Mechanical Control –– Multi controller system Multi controller system --FPGA Implementation FPGA Implementation -- inverse penduluminverse pendulum

Synthesis results of controller:

FPGA: Virtex 800Area: 1003 slices (ca. 10%) Clock Rate: > 70 Mhz

Computation Time: tc < 1 µs

The Raptor 2000 Board

Host Simulates PlantHost configurations

RAPTOR 2000Comm. ResourcesConfiguration Manager

FPGAController ModuleSupervisorCommunication

Controller characteristics:Dimensions: p=3, n=2, q=3Word-width: Input: 16 Bit, Intern: 32 Bit


Mechanical Control Mechanical Control –– Multi controller system Multi controller system --FPGAFPGA--Implementation Implementation -- ArchitectureArchitecture

Two-slot implementationOne slot implementation

CM1

reconfiguring…

CM1

CM1

CM2

CM2

reconfiguring…

CM2

CM3


Mechanical Control Mechanical Control –– Multi controller system Multi controller system --FPGAFPGA--Implementation Implementation -- ArchitectureArchitecture

ReconfigurableModule Slot

(Controller Module)

Fix Module Slot(Supervisor)

Bus

Mac

ros

Documents

Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift