Xilinx First 7nm Device: Versal AI Core (VC1902)
Sagheer Ahmad, Sridhar Subramanian, Vamsi Boppana, Shankar Lakka, Fu-Hing Ho, Tomai Knopp, Juanjo Noguera, Gaurav Singh, Ralph Wittig (Xilinx Inc.)
Hot Chips 31, Aug 20, 2019
© Copyright 2019 Xilinx


Page 1: Xilinx First 7nm Device: Versal AI Core (VC1902)

© Copyright 2019 Xilinx

Hot Chips 31, Aug 20, 2019

Xilinx First 7nm Device: Versal AI Core (VC1902)

Sagheer Ahmad, Sridhar Subramanian, Vamsi Boppana, Shankar Lakka, Fu-Hing Ho, Tomai Knopp, Juanjo Noguera, Gaurav Singh, Ralph Wittig

@Xilinx Inc

Page 2: Xilinx First 7nm Device: Versal AI Core (VC1902)

Agenda

˃ Versal Overview
  • What is Versal
  • Versal series overview
  • First Versal device

˃ Key Blocks & Features
  • NoC, Memory, Interfaces, IOs, and SerDes
  • PS/PMC, Security, Config and Debug
  • Programmable Logic

˃ AI Engine
  • Array, Core
  • Compute, memory, and throughput
  • Benchmarks and use-case performance

Page 3: Xilinx First 7nm Device: Versal AI Core (VC1902)

Xilinx Device Categories

Device category   Featured products
FPGA              Spartan, Artix, Kintex, Virtex
SoC               Zynq-7000, Zynq UltraScale+ MPSoC, Zynq UltraScale+ RFSoC
ACAP              Versal

Versal = first ACAP device series
ACAP = Adaptive Compute Acceleration Platform

Page 4: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal Series Overview

[Block diagram: Scalar Engines (Arm dual-core Cortex-A72 application processor, Arm dual-core Cortex-R5 real-time processor); Adaptable Engines (programmable logic); Intelligent Engines (AI Engines, DSP Engines); Processing System; Platform Management Controller; Network on Chip; Block RAM, UltraRAM, Accelerator RAM; PCIe & CCIX (w/ DMA); DDR; HBM; multirate Ethernet; 600G cores; 28G, 58G, and 112G SerDes; Direct RF; MIPI, LVDS, GPIO]

˃ Compute Engines
  • Scalar processors in every device
  • Enhanced programmable logic
  • New AI and enhanced DSP Engines
˃ NoC and Memory
  • High-bandwidth Network-on-Chip
  • Hardened [LP]DDR4/5, and HBM
˃ High-Speed Interfaces
  • PCIe & CCIX up to Gen5
  • Ethernet MAC up to 600Gbps
˃ SerDes and RF
  • SerDes up to 112G PAM4
  • Integrated ADC/DAC

Page 5: Xilinx First 7nm Device: Versal AI Core (VC1902)

First Versal AI Core (VC1902)

Process technology    TSMC 7nm FF
# Transistors         37B
On-die memory         855Mb
# AI Engine cores     400
# IOs                 785
# SerDes              44

Shipping to early customers

[Die floorplan: AI Engines; PS & PMC; SerDes; VNoC columns; HNoC; DSP columns; PL; DDR MC, PHY & IOs; PCIe & CCIX; Ethernet and PCIe]

Page 6: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal NoC (Network-on-Chip)

Data movement efficiency is critical for compute acceleration

˃ Packetized High-speed NoC
  • All SoC building blocks & PL connected via NoC
  • Packetized w/ virtual channels (VCs) & end-to-end ECC protection
˃ Highly Configurable & Scalable
  • Configurable topology, ports, routing, and QoS
  • Compiler to generate use-case specific routing, QoS, ...
  • NoC extends for die-to-die connectivity
˃ Clocking & Power Management
  • Clock forwarding to minimize clock jitter & power
  • Aggressive clock-gating & data bypass
˃ Vertical NoC
  • 2 physical channels, each w/ 8 VCs
  • 7 NMU and 7 NSU per column
  • >0.5Tbps of bidirectional bandwidth per column
˃ Horizontal NoC
  • 4 physical channels, each w/ 8 VCs
  • 4 NSU ports per DDR controller
  • >1Tbps bidirectional bandwidth per row

[Figure: NoC (conceptual). NMU at ingress and NSU at egress, each with AW, AR, WR, B, and RESP channels and REQ; high-speed transport; one switch]

Page 7: Xilinx First 7nm Device: Versal AI Core (VC1902)

Memory Subsystem and IO

˃ Unified Memory Subsystem
  • Unified memory subsystem, but can be customized
  • Transaction reordering & QoS for multiple traffic types
˃ DDR Memory Controller
  • 256b DDR w/ 4x 64b or 8x 32b channels
  • Optimized for 64b or 32b memory channels
  • 32b granularity more efficient for some use-cases
  • DDR4 up to 3200 and LPDDR4 up to 4266 Mbps
˃ Parallel IOs
  • 644x high-performance XPIOs for DDR, MIPI, …
  • 137x high-density multiprotocol IOs for up to 3.3V

[Floorplan highlights: DDR & XPIOs, HD IOs, MIOs]
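The DDR figures on this slide imply a peak interface bandwidth that can be checked with simple arithmetic. The sketch below assumes the quoted 3200 and 4266 Mbps are per-pin data rates and that all 256 data bits are active; the function name `peak_bandwidth_gbytes` is illustrative only. Under those assumptions the DDR4 result lines up with the 102 GB/s figure that appears on the later memory hierarchy slide.

```python
def peak_bandwidth_gbytes(bus_width_bits: int, rate_mbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a bus of the given width and per-pin rate."""
    bits_per_second = bus_width_bits * rate_mbps_per_pin * 1e6
    return bits_per_second / 8 / 1e9


# 256b interface, organised as 4x 64b or 8x 32b channels (same total width).
ddr4_gbytes = peak_bandwidth_gbytes(256, 3200)
lpddr4_gbytes = peak_bandwidth_gbytes(256, 4266)

# Per-channel view: one 64b DDR4 channel, one 32b LPDDR4 channel.
ddr4_per_64b_channel = peak_bandwidth_gbytes(64, 3200)
lpddr4_per_32b_channel = peak_bandwidth_gbytes(32, 4266)

print(f"DDR4-3200,   256b: {ddr4_gbytes:.1f} GB/s")
print(f"LPDDR4-4266, 256b: {lpddr4_gbytes:.1f} GB/s")
print(f"DDR4 per 64b channel:   {ddr4_per_64b_channel:.1f} GB/s")
print(f"LPDDR4 per 32b channel: {lpddr4_per_32b_channel:.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth depends on traffic pattern, refresh, and controller scheduling, none of which the slide quantifies.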

Page 8: Xilinx First 7nm Device: Versal AI Core (VC1902)

PCIe, Ethernet, and SerDes

˃ 4x 100G Multi-rate Ethernet
  • Multi-rate (100/50/40/25/10Gbps) Ethernet
  • MACs with RS-FEC and 1588 support
˃ 6x PCIe Gen4
  • Up to Gen4 x16 with End-point & Root-port
  • Smart storage or IO-Hub accelerator
˃ SerDes
  • 44x 32Gbps multi-protocol transceivers
  • Supports 100+ protocol/rate combinations

[Floorplan highlights: SerDes, PCIe, GbEs]

Page 9: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal CCIX

˃ CCIX and PCIe (CPM)
  • 2nd generation of CCIX coherent accelerator link
˃ Coherent Home-node and L2 cache
  • Home-node for coherent peer processing
  • L2 for caching capability for PL accelerator kernels
˃ CCIX ESM (Extended Speed Mode)
  • Supports PCIe Gen4 x16 for CCIX & PCIe
  • Supports up to CCIX 25Gbps 2x8

Coherent load/store memory semantics

[Block diagram: CPM containing two PCIe w/ CCIX controllers (GTs, XPIPE, physical layer, link layer, PCIe T layer, CCIX T layer), DMA & bridge, two CCIX-to-CHI bridges, a cache coherent mesh with two L2 caches, and the CPM interconnect to PS/NoC. On the PL side: clock, reset, debug; AXI4 and AXIS interfaces; ATCs; user kernels with local caches attached over CHI]

Page 10: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal Processor System (PS)

˃ PS in all Versal devices
  • 3rd generation of PS integration
  • First generation with PS in all devices
  • Host for embedded, control for acceleration
˃ Dual-core A72 APU
  • 2x cores with 1MB L2 cache with ECC
  • Coherency and virtualization support
˃ Dual-core R5 RPU w/ lockstep
  • 2x cores with 256KB TCM & 256KB OCM
  • ASIL-C(D) capable functional safety

[Floorplan highlight: Processor System]

Page 11: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal PMC and Security

˃ Platform Management Controller (PMC)
  • Gateway for boot/config, security, power mgmt, …
  • Dual-core triple-redundant MicroBlaze subsystem
  • Crypto accelerator engines (RSA, ECDSA, AES, SHA)
˃ Security & Monitors
  • Hardware RoT with authentication and encryption
  • Key storage & management including PUF support
  • Distributed voltage & temperature monitors
˃ 50Gbps Configuration Interface
  • Typical PL kernel configuration in sub 10msec
  • 8x faster PL configuration time per config-bit
˃ 10Gbps Debug & Trace Interface
  • New HSDP (High-Speed Debug Port) serial interface
  • 100x faster than JTAG for debug & trace

[Block diagram: PMC with triple-redundant MicroBlaze, boot ROM, RAM, user boot/flash interfaces, AES-GCM decryption, SHA3-384 and RSA/ECDSA authentication, TRNG, AES key loading from device eFuse, BBRAM, and PUF, voltage & temperature monitor, and internal clock generator; connected to PL, PS, NoC, DDR, ME, ...]
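The configuration and debug figures on this slide can be combined into two rough bounds. The sketch below is back-of-envelope arithmetic only: it assumes the 50Gbps interface is fully utilised for the whole 10 ms window, and it reads "100x faster than JTAG" as a ratio of raw link rates. Neither assumption is stated on the slide, and the variable names are illustrative.

```python
CONFIG_RATE_GBPS = 50   # configuration interface rate from the slide
CONFIG_TIME_MS = 10     # "sub 10msec" upper bound from the slide
HSDP_RATE_GBPS = 10     # debug & trace interface rate from the slide
HSDP_VS_JTAG = 100      # "100x faster than JTAG"

# Upper bound on configuration data that fits in the window
# (integer arithmetic to avoid floating-point rounding).
max_config_bits = CONFIG_RATE_GBPS * 1_000_000_000 * CONFIG_TIME_MS // 1000
max_config_mbytes = max_config_bits / 8 / 1e6

# JTAG rate implied by the 100x claim, if it compares raw link rates.
implied_jtag_mbps = HSDP_RATE_GBPS * 1000 / HSDP_VS_JTAG

print(f"At most {max_config_bits / 1e6:.0f} Mb ({max_config_mbytes:.1f} MB) in {CONFIG_TIME_MS} ms")
print(f"Implied JTAG rate: {implied_jtag_mbps:.0f} Mb/s")
```

The first number is a ceiling on what a "typical PL kernel" image could be under the stated assumption, not a measured image size.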

Page 12: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal Programmable Logic (PL)

˃ 900K LUTs (2M LC) and 1.8M Flops
  • 4x larger CLB (8 LUTs → 32 LUTs, 16 flip-flops → 64 flip-flops)
  • Increased local routing (→ lower global routing)
  • Imux registers for pipelining & time borrowing
˃ 158Mb of URAM & BRAM
  • Distributed URAM and BRAM columns
  • Customizable memory hierarchy
  • 50% lower power than previous gen

[Figure: Versal CLB with new CLB interconnect]

Page 13: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal Programmable Logic DSP

˃ Versal DSP58
  • FP32/16 floating point
  • INT8/16/24 and CMPLX18 fixed point
˃ 1968x DSP58
  • Distributed DSP columns
  • 2.8TFLOP/s FP32 peak
  • 11.8TOP/s (INT8) peak

[Floorplan highlight: DSP columns]
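Dividing the quoted peaks by the DSP58 count gives the throughput each block would have to sustain. This is a minimal sketch using only the three numbers on the slide; the clock frequency and the number of operations a DSP58 issues per cycle are not given in the deck, so the code stops at the per-block rate rather than guessing either one.

```python
DSP58_COUNT = 1968
PEAK_INT8_TOPS = 11.8
PEAK_FP32_TFLOPS = 2.8

# Per-block throughput implied by the device-level peaks.
int8_gops_per_dsp = PEAK_INT8_TOPS * 1e3 / DSP58_COUNT
fp32_gflops_per_dsp = PEAK_FP32_TFLOPS * 1e3 / DSP58_COUNT

# INT8 to FP32 ratio at device level.
int8_to_fp32_ratio = PEAK_INT8_TOPS / PEAK_FP32_TFLOPS

print(f"INT8: {int8_gops_per_dsp:.2f} GOP/s per DSP58")
print(f"FP32: {fp32_gflops_per_dsp:.2f} GFLOP/s per DSP58")
print(f"INT8 / FP32 peak ratio: {int8_to_fp32_ratio:.2f}")
```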

Page 14: Xilinx First 7nm Device: Versal AI Core (VC1902)

Versal Silicon

Interfaces & IP
˃ 100G Multi-Rate MAC
  • Passed internal loopback without GTY at full speed
˃ 32Gb/s Transceivers
  • Passed backplane PRBS31
  • 7.16ps total & 180fs random jitter
˃ PCIe & CCIX
  • Passed Gen3 compliance
  • Clean link at Gen4 x4
˃ DDR Memory
  • DDR4 running at 3200Mb/s
  • LPDDR4 running at 4266Mb/s

Engines
˃ Programmable Logic
  • PL state machines
  • Data across NoC & AI Engines
˃ Scalar Engines
  • Boots 64-bit Linux
  • A72, R5, PMC all running
˃ AI Engines
  • All 400 AI Engine tiles functional
˃ Network-on-Chip (NoC)
  • Running error-free @ 3200 Mb/s
  • Arbitration across engines

Page 15: Xilinx First 7nm Device: Versal AI Core (VC1902)

AI Engine: Array

˃ 400 AI Engine Tiles
  • 133 TOPs (INT8) peak
˃ Non-blocking Interconnect Mesh
  • 20Tbps row cross-sectional bandwidth
  • 10 32-bit channels per column and 8 per row
˃ Distributed Memory Hierarchy
  • 12.5MB distributed L1 memory
  • Multi-bank local memory shared w/ neighboring tiles
  • Distributed DMA

Page 16: Xilinx First 7nm Device: Versal AI Core (VC1902)

AI Engine: Core

˃ Local, Shareable Memory
  • 32KB local, 128KB addressable
˃ 32b Scalar RISC Processor
  • 2 scalar ops / stream access
˃ Vector Processor
  • 512-bit SIMD datapath
  • 2 vector loads / 1 mult / 1 store
  • vec128int8
  • vec8fp32
˃ 7+ ops per cycle VLIW

[Block diagram: memory interface; scalar unit (scalar register file, scalar ALU, non-linear functions); vector unit (vector register file, fixed-point vector unit, floating-point vector unit); instruction fetch & decode unit; three AGUs; load unit A, load unit B, store unit; stream interface]

Page 17: Xilinx First 7nm Device: Versal AI Core (VC1902)

Multi-Precision Support

AI data types
  Operand precision       MACs / cycle (per core)
  32x32 SPFP              8
  32x32 real              8
  32x16 real              16
  16x16 real              32
  16x8 real               64
  8x8 real                128

Signal processing data types
  Operand precision       MACs / cycle (per core)
  32x32 complex           2
  32x16 complex           4
  16x16 complex           8
  16 complex x 16 real    16
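The per-core MAC rate on this slide and the array-level peak quoted earlier (400 tiles, 133 TOPs INT8) together imply a clock frequency, which the deck does not state. The sketch below assumes one MAC counts as two operations (multiply plus accumulate) and that the peak is simply tiles x MACs per cycle x ops per MAC x clock; the resulting frequency is an inference from those assumptions, not a published specification.

```python
TILES = 400                 # AI Engine tiles (array slide)
INT8_MACS_PER_CYCLE = 128   # 8x8 real MACs per cycle per core (this slide)
OPS_PER_MAC = 2             # assumption: multiply + accumulate = 2 ops
QUOTED_PEAK_TOPS = 133      # INT8 peak (array slide)

# Operations the whole array can issue in one clock cycle.
ops_per_cycle = TILES * INT8_MACS_PER_CYCLE * OPS_PER_MAC

# Clock frequency that would make the quoted peak come out exactly.
implied_clock_ghz = QUOTED_PEAK_TOPS * 1e12 / ops_per_cycle / 1e9


def peak_tops(macs_per_cycle: int, clock_ghz: float) -> float:
    """Array-level peak in TOPs for a given per-core MAC rate and clock."""
    return TILES * macs_per_cycle * OPS_PER_MAC * clock_ghz * 1e9 / 1e12


print(f"Array ops per cycle: {ops_per_cycle}")
print(f"Implied clock: {implied_clock_ghz:.3f} GHz")
print(f"16x16 real peak at that clock: {peak_tops(32, implied_clock_ghz):.2f} TOPs")
```

The same helper can be applied to any row of the tables above to scale the per-core rates to the full array.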

Page 18: Xilinx First 7nm Device: Versal AI Core (VC1902)

AI Engine Memory Hierarchy

[Figure: DRAM, L2 SRAM, and L1 SRAM levels, joined by a flexible XBAR and an adaptable L1 NoC with DMA]

Page 19: Xilinx First 7nm Device: Versal AI Core (VC1902)

AI Engine Memory Hierarchy

[Figure: DRAM, L2 SRAM, and L1 SRAM levels, with multicast / broadcast between levels]

Page 20: Xilinx First 7nm Device: Versal AI Core (VC1902)

AI Engine Memory Hierarchy

Level      Capacity                                    Bandwidth
DRAM       > 64 GByte                                  102 GB/s
L2 SRAM    16.3 MByte                                  1.6 TB/s
L1 SRAM    12.5 MByte (128 kByte per 4-core cluster)   38 TB/s
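The L1 capacity on this slide follows from the per-core figures given earlier (400 tiles, 32KB local memory each). The sketch below checks that arithmetic and derives two ratios. It assumes 1 MByte = 1024 kByte, and it assumes the three bandwidth labels pair with the levels as DRAM 102 GB/s, L2 1.6 TB/s, and L1 38 TB/s, which is a reading of the figure rather than something the extracted text states outright.

```python
TILES = 400
L1_PER_TILE_KB = 32
CLUSTER_CORES = 4

L1_BW_TBYTES = 38.0     # assumed to belong to L1 SRAM
L2_BW_TBYTES = 1.6      # assumed to belong to L2 SRAM
DRAM_BW_GBYTES = 102.0  # assumed to belong to DRAM

# Capacity checks.
l1_total_mbytes = TILES * L1_PER_TILE_KB / 1024
cluster_kbytes = CLUSTER_CORES * L1_PER_TILE_KB

# Bandwidth per tile and relative to the outer levels.
l1_bw_per_tile_gbytes = L1_BW_TBYTES * 1000 / TILES
l1_vs_l2 = L1_BW_TBYTES / L2_BW_TBYTES
l1_vs_dram = L1_BW_TBYTES * 1000 / DRAM_BW_GBYTES

print(f"L1 total: {l1_total_mbytes} MByte, cluster of 4: {cluster_kbytes} kByte")
print(f"L1 bandwidth per tile: {l1_bw_per_tile_gbytes:.0f} GB/s")
print(f"L1 is {l1_vs_l2:.1f}x L2 and {l1_vs_dram:.0f}x DRAM bandwidth")
```

The steep bandwidth ratio between levels is the reason the deck stresses keeping working sets in distributed L1 memory.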

Page 21: Xilinx First 7nm Device: Versal AI Core (VC1902)

AI Engine: Compute Efficiency

Vector processor efficiency (kernel performance relative to theoretical peak)
  Workload          Kernel                                                 Efficiency
  ML convolutions   Block-based matrix multiplication (32×64) × (64×32)   95%
  FFT               1024-pt FFT/iFFT                                       80%
  DPD               Volterra-based forward-path DPD                        98%

˃ Adaptable, non-blocking interconnect
  • Flexible data movement architecture
  • Avoids interconnect "bottlenecks"
˃ Adaptable memory hierarchy
  • Local, distributed, shareable = extreme bandwidth
  • No cache misses or data replication
  • Extend to PL memory (BRAM, URAM)
˃ Distributed DMA for overlapping compute and communication

[Figure: timeline showing communication phases overlapped with compute phases]
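The benefit of overlapping compute and communication can be illustrated with a simple timing model. This is a minimal sketch assuming double (ping-pong) buffering, where the DMA fetches block i+1 while the core processes block i. The function names and the example timings are hypothetical and are not taken from the deck.

```python
def serial_time(n_blocks: int, t_compute: float, t_comm: float) -> float:
    """Total time when each transfer must finish before its compute starts."""
    return n_blocks * (t_compute + t_comm)


def overlapped_time(n_blocks: int, t_compute: float, t_comm: float) -> float:
    """Total time with double buffering.

    The first transfer cannot be hidden. After that, each step costs the
    slower of compute and transfer, and the last block still needs its compute.
    """
    if n_blocks == 0:
        return 0.0
    return t_comm + (n_blocks - 1) * max(t_compute, t_comm) + t_compute


# Hypothetical example: 8 blocks, 10 time units of compute, 6 of transfer each.
n, tc, tm = 8, 10.0, 6.0
t_serial = serial_time(n, tc, tm)
t_overlap = overlapped_time(n, tc, tm)
utilisation = n * tc / t_overlap  # share of time the core spends computing

print(f"serial: {t_serial}, overlapped: {t_overlap}")
print(f"core utilisation with overlap: {utilisation:.1%}")
```

When transfer time is below compute time, as in this example, almost all communication is hidden and utilisation approaches 100% as the number of blocks grows.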

Page 22: Xilinx First 7nm Device: Versal AI Core (VC1902)

AI Engine: Performance Benchmark

GoogleNet inference performance (sub 2ms latency), images/second
  Alveo U250 xDNN (UltraScale+ series)    4087
  Versal AI Core Series*                  29250

ResNet-50 inference performance, images/second
  Alveo U250 xDNN (UltraScale+ series)    2043
  Versal AI Core Series*                  11812

*Versal AI Core (VC1902) projected performance
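The relative gain in each benchmark follows directly from the four throughput numbers on this slide. The short sketch below computes the ratios; the dictionary layout and the `speedup` helper are illustrative, and the Versal figures remain projections, as the slide's footnote states.

```python
# Images/second from the slide; Versal values are projected.
benchmarks = {
    "GoogleNet": {"alveo_u250": 4087, "versal_ai_core": 29250},
    "ResNet-50": {"alveo_u250": 2043, "versal_ai_core": 11812},
}


def speedup(new: float, baseline: float) -> float:
    """Throughput ratio of the new device over the baseline."""
    return new / baseline


speedups = {
    name: speedup(v["versal_ai_core"], v["alveo_u250"])
    for name, v in benchmarks.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```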

Page 23: Xilinx First 7nm Device: Versal AI Core (VC1902)

Accelerating AI Applications on Versal

Heterogeneous acceleration from data center to the edge

[Block diagram: Scalar (Arm dual-core Cortex-A72, Arm dual-core Cortex-R5), Adaptable, and Intelligent (AI Engines) engines, joined by the Network-on-Chip and I/O]

Callouts
  • Scalar, sequential & complex compute
  • Flexible parallel compute, data manipulation
  • ML & signal processing: vector, compute intensive
  • Any-to-any connectivity
  • TB/s of bandwidth, PL-to-AI Engine
  • Custom memory hierarchy: 463 x 32KB + 967 x 4KB of RAM
  • Deterministic performance & low latency

Use cases: Video + AI, Genomics + AI, Risk Modeling + AI, Database + AI, Network IPS + AI, Storage + AI

Page 24: Xilinx First 7nm Device: Versal AI Core (VC1902)

Accelerating 5G Wireless on Versal

5G wireless infrastructure chain: packet processing and wired backhaul; higher layer processing; switching; baseband processing; beamforming & MIMO + some baseband transforms; digital radio; ADC / DAC; analogue radio; antenna array

Digital radio w/ ADC/DAC
  • Blocks: CPRI, DUC, CFR, DPD, DPD update, ADC/DAC
  • DUC: Digital Up Converter
  • DPD: Digital Pre-Distortion
  • CPRI: Common Public Radio Interface

[Figure: digital radio stages (shaping, up-sample, heterodyne, crest factor reduction, digital pre-distortion) partitioned across the AI Engine array, Programmable Logic (PL), and Processor System (PS) APU on Versal, connected through AIE interfaces (AIE I/F)]
  • Per-carrier paths (x4): LTE20 channel filter, HB1 x2, HB2 x2; DDS and mixing
  • HB3 x2; PC-CFR stages with peak detect and scale find, and delay; HB4 x2; HB5 5/4
  • DPD: 9x9 DPD kernel, DPD filters 1/4 to 4/4, CABS, DPD LUTs, gain, coefficients memory (active/shadow)
  • DPD update: frequency domain measurements, power spectrum estimate, coefficient to LUT conversion, DPD output
  • Labeled filter parameters: NC = 89, NHB1 = 23, NHB2 = 11, NHB3 = 43, NHB4 = 27, NHB5 = 41
  • Labeled sample rates: 30.72 MHz, 61.44 MHz, 122.88 MHz, 491.52 MHz, 614.4 MHz
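The sample rates labeled in the figure are consistent with a chain of rate changes starting from the 30.72 MHz LTE20 rate. The sketch below assumes that half-band stages HB1 to HB4 each interpolate by 2, that HB5 resamples by 5/4, and that the stages run in numerical order; the figure labels suggest this, but the stage ordering cannot be fully recovered from the extracted text. Exact rational arithmetic avoids rounding noise in the MHz values.

```python
from fractions import Fraction

BASE_RATE_MHZ = Fraction(3072, 100)  # 30.72 MHz LTE20 sample rate

# Assumed stage order and rate-change factors (see lead-in).
STAGES = [
    ("HB1", Fraction(2)),
    ("HB2", Fraction(2)),
    ("HB3", Fraction(2)),
    ("HB4", Fraction(2)),
    ("HB5", Fraction(5, 4)),
]

rate = BASE_RATE_MHZ
rates_after = {}
for name, factor in STAGES:
    rate *= factor
    rates_after[name] = rate

for name, value in rates_after.items():
    print(f"after {name}: {float(value):.2f} MHz")
```

Under these assumptions the chain reaches 61.44, 122.88, 491.52, and 614.4 MHz, all of which are labeled in the figure. The intermediate 245.76 MHz after HB3 does not appear among the extracted labels.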

Page 25: Xilinx First 7nm Device: Versal AI Core (VC1902)

Software Programmable

[Figure: design, compile, run flow]
  • Inputs: frameworks; 4G/5G/radar library, AI library, vision library; C/C++
  • AI Engine Compiler
  • Architecture overlay

Programming abstraction levels
  • Data flow w/ Xilinx libraries
  • Data flow w/ user-defined libraries
  • Kernel program

Page 26: Xilinx First 7nm Device: Versal AI Core (VC1902)

Unified Tool Chain for Programming

[Figure: tool flow across PS, AI Engine array, and PL]
  • Inputs: application; performance & partitioning constraints; base platform
  • Xilinx SDK: Eclipse GUI
  • User-directed system partitioner
  • Compilers and implementation: ARM C compiler, AI Engine compiler, Vivado (HLS, RTL, IP)
  • SystemC virtual simulation platform: QEMU, core ISS
  • System-level performance analysis (using core profiling)
  • System-level debugger (using core debugger)
  • Outputs: binaries & bitstream
  • Targets: SDK, Versal device

Page 27: Xilinx First 7nm Device: Versal AI Core (VC1902)

Summary

˃ Versal is the first generation of ACAP devices
  • ACAP is a new class of device from Xilinx
˃ Versal employs an adaptable heterogeneous system architecture
  • New SW-programmable AI Engine for diverse compute acceleration workloads
  • New high-bandwidth Network-on-Chip integrated w/ hardened DDR subsystem
  • Processor System in all Versal devices
  • Re-architected Programmable Logic
˃ Xilinx first 7nm device: Versal AI Core VC1902
  • 133TOPs AI Engines, 12TOPs DSP Engines, and 900K LUTs
  • 256b DDR4/LPDDR4, PCIe Gen4 & CCIX up to 25Gbps
  • For more details, refer to: www.xilinx.com/versal

Page 28: Xilinx First 7nm Device: Versal AI Core (VC1902)

Adaptable. Intelligent.