24
FPGA implementation of a near-ML sphere detector for 802.16e broadband wireless systems chris dick: Xilinx Milos Trajkovic, Slobodan Denic, Dragan Vuletic: Signum Concepts Raghu Rao: Xilinx fred harris: Signum Concepts & San Diego State University Kiarash Amiri: Rice University

FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

FPGA implementation of a near-ML sphere detector for 802.16e broadband wireless systems

chris dick: XilinxMilos Trajkovic, Slobodan Denic, Dragan Vuletic: Signum ConceptsRaghu Rao: Xilinxfred harris: Signum Concepts & San Diego State UniversityKiarash Amiri: Rice University

Page 2: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 2

Agenda

Spatial multiplexing MIMO Quasi-ML SM decoders and sphere detection Useful approximations for efficient hardware implementation Guided tour of sphere detector implementation for 802.16e Implementation results

Page 3: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 3

802.16e Sphere Detector Implementation

Specifications– 5MHz WiMAX– 360 data sub-carriers– antenna ordering to be updated for every sub-carrier for every OFDM symbol

• extremely rapid update; could probably be relaxed

– 2,3 or four antennas programmable dynamically at run-time– QPSK, 16-QAM and 64-QAM configurable on a per user basis

Mod

ulat

ion

and

Map

ping

TX

... ...

Sig

nal P

roce

ssin

g

RX

MIMO channel

Ser

ial t

o P

aral

lel

Par

alle

l to

Ser

ial

... ...

In OutDACRF

ADCRF... ...

11 12 13

21 22 23

31 32 33

h h hh h hh h h

H

TX ant num

RX a

nt n

um

Page 4: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 4

FPGA Sphere Detector Effective FPGA implementation is intersection of:

Algorithm optimizations• K-best, depth-first search, sort-free,…

Architecture• clock rates• hardware folding

Micro-architecture optimizations that are possible with the FPGA• bit-width optimizations• math optimizations: norms to simplify hardware footprint for

sphere detector for example

1 2vs vs

Page 5: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 5

Sphere Detection – QRD Pre-processing

Maximum Likelihood detection

Key element of complexity reduction is suitable factoring of channel matrix H

MR x MT matrix Q has orthonormal columns and R is upper triangular Distance metric can be written as

H QR

2 ZFˆ ˆ( ) with Hd s c y y Rs y Q Rs 'y R s 2||||

2ˆ arg minMTs

s

y Hs

2ˆ y R s

Page 6: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 6

Example: Solving for Antenna 2, Real Component of Symbol

Solve for symbol transmitted on antenna 1 by performing hypothesis test– recall that for QPSK the coordinates for the constellation points are = {±3,±1}

Form all of the possible required products Rs and differences as shown

Now make a decision for the real component of the symbol sent on antenna 1

1,1 1,2 1,31

2,

0,

2

3

2

0,0

3

0,10

2 2,32

3,

0

3

1

3

,2

2

3

0( )

0 00 ˆ0

ˆ

ˆ0

T

T

T

M

M

MR R RyR Ry

Rys

R

sRy

RR

d

s

s

1 3 1 1 3ˆ min , , ,TMs d s d s d s d s

1,1 1,2 1,31

2,

0,

2

3

2

0,0

1

0,10

2 2,32

3,

0

3

1

3

,2

2

1

0( )

0 00 ˆ0

ˆ

ˆ0

T

T

T

M

M

MR R RyR Ry

Rys

R

sRy

RR

d

s

s

1,1 1,2 1,31

2,

0,

2

3

2

0,0

1

0,10

2 2,32

3,

0

3

1

3

,2

2

1

0( )

0 00 ˆ0

ˆ

ˆ0

T

T

T

M

M

MR R RyR Ry

Rys

R

sRy

RR

d

s

s

1,1 1,2 1,31

2,

0,

2

3

2

0,0

3

0,10

2 2,32

3,

0

3

1

3

,2

2

3

0( )

0 00 ˆ0

ˆ

ˆ0

T

T

T

M

M

MR R RyR Ry

Rys

R

sRy

RR

d

s

s

Page 7: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 7

Process of Sphere Decoding

The decoding process can be mapped to finding a shortest path (with minimum Euclidean distance) in a tree topology

decodingsequence

Computational complexity for one solution Adds: M(M+1)/2 + 2M − 1 Mults: M(M+1)/2 complex mults 2M square operators Mem for H: M2 complex values

E.g. 16×16 systems Nadd = 167 Nmult = 136 complex mults 32 square operators Mem for H: 256 complex values

2

1 1,1 1 1,2 2 1,

2 2 2,2 2 2,

,

21,1 1

2,2 2

1

2

,

ˆ

ˆˆ

ˆ

T

T

T T

T T TT

T

M M

M M

M

M M M M

M M M

y R s R s R s

y R s R s

y R s

R sbb

b

R s

R s

y Rs

ant-1ant-2

ant-MT

common term for each row

Page 8: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 8

Mapping Distance Computation to FPGA

The actual system we are considering is 4x4 and the matrix equation looks like

0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,70

1,1 1,2 1,3 1,4 1,5 1,6 1,71

2,2 2,3 2,4 2,5 2,6 2,72

3,3 3,4 3,5 3,6 3,73

4,4 4,5 4,6 4,74

5,5 5,6 5,75

6,6 6,76

7,77

R R R R R R R RyR R R R R R Ry

R R R R R RyR R R R Ry

R R R RyR R Ry

R RyRy

2

0

0

1

1

2

2

3

3

ssssssss

The element wise product and subtraction maps well to the DSP48 element in the FPGA

Leverages the dedicated routing between cascaded DSP48’s to minimize the critical path and hence maximize the FPGA clock frequency

DSP48 tile contains a multiplier for computing the products and add/sub for accumulating the partial-products

C

A

B

PCOUT

DSP48

PCIN

A

B

PCOUT

DSP48

iy

Symbol

( )R

Symbol

( )R

Dedicated high-speed interconnect in Xilinx FPGA

PCIN

A

B

PCOUT

DSP48

( )R

SymbolPath MetricUpdate

Sphere detector essentially a streaming dataflow computation Parallel processing pipeline for computing the incremental distance metric

Page 9: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 9

i=5

i=4

i=3

i=2

i=1

Sphere Detection Tree Example: MT =4 antennas, K=3, 16-QAM

K survivors at each level requires sort

→ high-latency process

25 5,7 7 5,6 6 5,5 5y R s R s R s i=6

26 6,7 7 6,6 6y R s R s i=7

27 7,7 7y R s

i=8

MIN value: Detected Vector

4ˆRe s 4ˆIm s

3Re s

3Im s

2ˆIm s

2ˆRe s

1Im s

1Re s

Page 10: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 10

FlexSphere Detection Tree4x4 64-QAM

Two levels in tree correspond to one antenna First 2 levels fully expanded Expansion of the minimum node rather sorting whole level

– approximate sort avoids requirement for high-latency min-search at each level

Page 11: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 11

Micro-Architecture Optimizations to Reduce Cost: Simplified Norms

Motivation for using simplified norm is to reduce complexity of implementation Minimize FPGA resource footprint Mechanism that permits tradeoff

between FPGA architectural elements, e.g. DSP48’s for LUTs

2

221

Let partial sphere constraint is

i i

i i i

T X

X X e r

2

1

1

1

approximate the norm with a different norm

or

max ,

i i i

i i i

i i i

X f X e

X X e

X X e

1

In the complex-valued case, corresponding approximations for are

norm: or

norm: max ,

i

i i i

i i i

e

e e e

e e e

Page 12: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 12

L-1, L-2 Norm Comparison

Negligible impact using simpler L-1 norm versus L-2 norm

K-Best sphere detector is quasi MLD solution Already some BER degradation introduced with K-Best

Eliminates MPYs in implementation which may be beneficial

0 5 10 15 20 25 3010

-5

10-4

10-3

10-2

10-1

100

EbNo [dB]

BE

R

16 QAM, K-best, 2X2

K-best (K=4), L-2 normK-best (K=4), L-1 normK-best (K=6), L-2 normK-best (K=6), L-1 normML

Floating-point simulation

Page 13: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 13

802.16e Sphere DetectorAntenna Ordering

Figure below shows the antenna ordering pipeline

1† * *

2 2

1 1 ,[ ]

, 1,

arg for 1 arg for 1

,

max min

j

j j j j j T

j j j jk kk k

j j k

j M

k j k j

G H H H H

G G

H H H H

2, 2G i

2,1G i

2,3G i

2, 4G i

norm calculations

Page 14: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 14

802.16e Sphere DetectorTop-Level Block Diagram

Key to achieving good BER performance is around how the antenna ordering is performed– the ‘V-BLAST channel reordering’ component above

Assume we have the channel estimate

H sortedH

yy

H R

y

Page 15: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 15

802.16e Sphere DetectorAntenna Ordering Processing Pipeline

Linear processing pipeline of MT-1 stages employed Each stage: QRD matrix inversion, norm, updated H Buffer memories store intermediate results and support the

TDM pipeline

Antenna ordering: Top Each stage of the pipeline

realized using a QRD-style systolic array

G matrix calc

Norm search

Update Hsorted

G matrix calc

Norm search

Update Hsorted

G matrix calc

Norm search

Update Hsorted

360*32x18b

360*32x18b

5*16*2x18b

5*16*2x18b

5*16*2x18b

5*16*2x18b

5*16*2x18b

5*16*2x18b

4X4 matrix calculations

3X3 matrix calculations

2X2 matrix calculations

Page 16: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 16

Matrix Inversion using complex QRDPipeline Processing

QRD using Givens rotations

Angle estimation and vector rotation performed using complex multiplication and approximation

High-speed pipelined architecture using Xilinx DSP tiles (DSP48)

Heavy pipelining and time division multiplexing of hardware employed to meet latency/throughput requirements of 802.16e

Channel Matrix H 4x4 Configuration

3,0h 3,1h

*3,0

3,1 3,12multication3,0performed ininner cell

hh h

h

I jQ

2 2I QCmplx.Conj.

22 2

I jQ

I Q

To Inner Cells

1( ) ( ) ( );

( ) and ( ) stored in FPGA BRAMaddition and ( ) realized using DSP48

f x x f x x f x f xx

f x f xx f x

3,1h 3,1h

1/ x

1/ x

2

2

Cmplx.Conj.

SaveCurrent

ReadPrevious

To Inner Cells

First Set of Rotations

Perform rotations to successively null matrix entries

22 2

I jQ

I Q

Page 17: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 17

Approximating 1/sqrt(x)f(x+x) = f(x) + x f’(x); f(x) = 1/sqrt(x)

0 0.5 1 1.5 2 2.5 3x 105

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5Floating-point Matlab and Floating-point Approximation

x

1/sq

rt(x)

Floating point MatlabFloating point approximation

0 0.5 1 1.5 2 2.5 3x 105

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5Floating-Point Matlab and Fixed-Point Approximation

x

1/sq

rt(x)

Floating point MatlabFixed point approximation

0.5 1 1.5 2-4

-2

0

2

4

6

8x 10-6 Difference Function - Floating-Point

x

App

roxi

mat

ion(

1/sq

rt(x)

) - 1

/sqr

t(x)

0.5 1 1.5 2-1.5

-1

-0.5

0

0.5

1

1.5x 10-5 (Apprx Fixed-Point) - Fixed-point

x

App

roxi

mat

ion(

1/sq

rt(x)

) - 1

/sqr

t(x)

Page 18: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 18

FlexSphere BERUsing Novel Interference Cancellation

Key to performance is on determining suitable order in which to process the antennas

Innovative element of this design is on the use of interference cancellation in channel matrix pre-processing and novel real-valued ordering of the sphere detector tree

BER plot for fixed-point design generated from simulation of System Generator model

Floating-point matlab model overlay also shown

0 5 10 15 20 2510

-5

10-4

10-3

10-2

10-1

100

EbNo [dB]

BE

R

64 QAM, 4X4

Fixed-Point FPGA Sphere Det./preprocessingFloating-point Matlab Sphere Det./preprocessingMaximum Likelihood

Page 19: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 19

802.16e Sphere DetectorTree Search

Once the tree ordering is determined a detection tree is assembled using real-valued decomposition and approximate sorting strategy to minimize detection latency

Sphere Detector Processing Pipeline

8,:R

(8)y

7,:R (7)y 6,:R (6)y

8i 7i 6i 1i

1,:R (1)y

( )iiT s

2( ) ( 1) ( )1

i i ii i iT T e

s s s

( 1)1

1

ˆTM

ii i ij j

j i

b y R s

s

1 2 8 57 58 64

1 8

1

Detected Vector

Min-Search

i=2

i=1

i=3

i=4

i=5

i=6

i=7

i=8

Page 20: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 20

Development / Debugging Environment (Host PC)Deployment Environment

802.16e Sphere DetectorDevelopment/Deployment

LocalLink Interface

Gigabit Ethernet Switch

SysGen HW Co-sim APISysGen HW Co-sim APIMicroprocessorMicroprocessor DSP

ProcessorDSP

Processor

StandaloneC/C++ Program

StandaloneC/C++ Program

MATLAB/SimulinkMATLAB/Simulink

Ethernet Co-simulation Protocol

Virtex-5 FPGA

Virtex-5 EMACLocal-Link Wrapper

SGMII

ClockGenerator

ClockGenerator

MemoryMap

MemoryMap

EthernetCo-simulation

Processor

EthernetCo-simulation

Processor

PHYPHY

I/OPortsI/O

Ports

SharedRegistersShared

Registers

SharedRAMs

SharedRAMs

SysGen DSP Design

(HW CosimDesign)

Antenna layerprocessing, QRD, Sphere detector

SysGen DSP Design

(HW CosimDesign)

Antenna layerprocessing, QRD, Sphere detector

EMAC0

EMAC1

RX FIFO

TX FIFO

RX FIFO

TX FIFO

UserFunction 2

UserFunction 2

Virtex-6 LX240T: 241,152 LCs, 768 MPYs USB, ethernet Hardware cosim support using Xilinx DSP

tools

Page 21: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 21

802.16e Sphere DetectorBERT using HW Cosim

System Generator HW Cosim module from previous slide

PN seq generator

Channel Preproc.y

Flex-Sphere Detector

6b QAM mapper

DeM

ux

s~

Gaussian noise gen

H~

Gaussian noise gen

DeM

ux

n~

~

H~PEDmin

sout 64QAM de-mapper

~ Self-synchronizing

BER tester

Eb/Nodata register

EbNo ctrl

Calc scale factor

BER_ready

BER_out

MIMO-OFDM Spatial Multiplexing Detector

Modulation

Page 22: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 22

802.16e Sphere DetectorBERT using HW Cosim: Demo UI

0 5 10 15 20 2510-5

10-4

10-3

10-2

10-1

Page 23: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 23

802.16e Sphere DetectorFPGA Resource Utilization

V-BLAST channel ordering is key Each sub-carrier for each OFDM symbol

– could probably be relaxed to reduce FPGA area

Main compute complexity is in the antenna ordering 802.16e 5 MHz bandwidth, 360 data sub-carriers, 4x4 antenna

configuration, 64-QAM Real-time operation at 102.9 microsecond frame rate Compute performance

– ~53E9 MPYs/second, ~21E6 matrix inversions/second, 225E6 path metric computations/second

Function Slices LUT/FFT Pairs Block Memory (18k)

DSP Slices

Channel Pre-processor 9,999 20,339/29,954 105 159 RVD QRD 1,715 4,418/5,556 27 30 Sphere Detector 2,445 3,113/6,525 12 48 Total 14,159 27,870/42,035 144 237

Page 24: FPGA implementation of a near-ML sphere detector for 802 ...china.xilinx.com/publications/prod_mktg/sdr_fpga_implementation.pdf · BER plot for fixed-point design generated from simulation

Page 24

Conclusion

Key to performance is on determining suitable order in which to process the antennas

Innovative element of this design is on the use of interference cancellation in channel matrix pre-processing and novel real-valued ordering of the sphere detector tree

Might not want to lock this function down in ASIC/ASSP– desire to have flexibility to tailor ML-decoder to air-interface requirements– upgrade with new algorithms

Algorithmic, architecture and micro-architecture optimizations required for good implementation

Architectural transparency provided by System Generator design flow allows for production of FPGA optimized implementation