PCI Express DMA Engine for the Active Buffer Project in the CBM Experiment
Wenxue Gao, Andreas Kugel, Reinhard Männer, Holger Singpiel, Andreas Wurz
Univ. Mannheim
DPG Tagung, Gießen
14 March 2007
Contents
• Introduction
• Block diagram
• Implementation
• Performance
Introduction – CBM Experiment
CBM TSR, Jan. 2006
Introduction – PCI Express
• 2.5 Gbps per link
• Point-to-point
• TLP (Transaction Layer Packet)
– Posted: MWr (Memory Write Request), …
– Non-posted: MRd (Memory Read Request), …
– Completion: CplD, Cpl, …
– Message: Msg
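The four TLP categories listed above can be written down as a small lookup table. This is only an illustrative sketch: the names `Category`, `TLP_CATEGORY` and `expects_completion` are hypothetical and are not taken from the design. The table mirrors the grouping used on the slide.

```python
from enum import Enum


class Category(Enum):
    POSTED = "posted"
    NON_POSTED = "non-posted"
    COMPLETION = "completion"
    MESSAGE = "message"


# TLP kinds named on the slide, grouped as on the slide.
TLP_CATEGORY = {
    "MWr": Category.POSTED,
    "MRd": Category.NON_POSTED,
    "CplD": Category.COMPLETION,
    "Cpl": Category.COMPLETION,
    "Msg": Category.MESSAGE,
}


def expects_completion(kind: str) -> bool:
    """Only non-posted requests are answered by a completion."""
    return TLP_CATEGORY[kind] is Category.NON_POSTED
```

For example, `expects_completion("MRd")` is true while `expects_completion("MWr")` is false, which is why a memory write can be sent without waiting for an answer.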
PCI Express – Posted TLP (MWr, …)
[Animation: Host and End-Point, each with Rx, Tx and transaction layer (Trn.). MWr1, MWr2 and MWr3 are sent one after another; no completion is returned.]
PCI Express – Non-posted TLP (MRd, …)
[Animation: End-Point and Host, each with Rx, Tx and transaction layer (Trn.). MRd1 and MRd2 are sent; the completions CplD1 and CplD2 come back and are assigned to MRd1 and MRd2 via Tag[7:0].]
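The tag matching shown in the animation can be sketched as follows. This is an assumption-laden illustration, not the design's Tag RAM: the class `TagTable` and its methods are hypothetical, and the sketch assumes exactly one CplD per MRd (a real completer may split one read into several completions).

```python
class TagTable:
    """Tracks outstanding read requests by their 8-bit tag (Tag[7:0])."""

    def __init__(self):
        # tag -> local address where the returned data belongs
        self.outstanding = {}

    def send_mrd(self, tag: int, local_addr: int) -> None:
        """Remember the local address for a read request that is being sent."""
        if not 0 <= tag <= 0xFF:
            raise ValueError("tag must fit in 8 bits")
        if tag in self.outstanding:
            raise ValueError("tag is still in flight")
        self.outstanding[tag] = local_addr

    def receive_cpld(self, tag: int) -> int:
        """Return the local address stored for this tag and free the tag."""
        return self.outstanding.pop(tag)
```

Because the lookup is keyed by tag, the completions may arrive in any order: after `send_mrd(1, 0x100)` and `send_mrd(2, 0x200)`, calling `receive_cpld(2)` before `receive_cpld(1)` still yields the right addresses.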
Introduction – SG DMA
• SG (Scatter/Gather)
– Multiple-descriptor chain
• Full duplex
– Downstream: Host → Endpoint
– Upstream: Endpoint → Host
• "Done" state
– Status register
– Interrupt
[Figure: Host and Endpoint connected by a downstream and an upstream path.]
Block diagram
[Figure: block diagram. PCIe Transaction Layer Interface (Rx, Tx); Rx Resolution; Upstream DMA Channel, Downstream DMA Channel and PIO Channel; Channel Buffer; Tag RAM; Tx Arbitrator; Memory (BRAM + FIFO + Registers).]
Channel Buffer
• TLP channel FIFO
– Width = 128
– Depth = 15
• TLP without payload
– Everything in one word
• TLP with payload
– Local address
– Additional information
[Figure: 128-bit word layouts for Rx and Tx, with bit ticks at 127, 95, 63, 31 and 0: "LAdr Hdr2 Hdr1 Hdr0" and "xxxx Hdr2 Hdr1 Hdr0".]
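The 128-bit channel buffer word can be modelled as four 32-bit fields. The sketch below assumes the field order suggested by the bit ticks in the figure (LAdr in bits 127..96, Hdr2 in 95..64, Hdr1 in 63..32, Hdr0 in 31..0); the function names are hypothetical, and the unused "xxxx" field is simply modelled as zero.

```python
MASK32 = 0xFFFFFFFF


def pack_word(hdr0: int, hdr1: int, hdr2: int, ladr: int = 0) -> int:
    """Pack three header words and a local address into one 128-bit word."""
    for field in (hdr0, hdr1, hdr2, ladr):
        if not 0 <= field <= MASK32:
            raise ValueError("each field must fit in 32 bits")
    return (ladr << 96) | (hdr2 << 64) | (hdr1 << 32) | hdr0


def unpack_word(word: int):
    """Return (hdr0, hdr1, hdr2, ladr) from a 128-bit word."""
    return tuple((word >> shift) & MASK32 for shift in (0, 32, 64, 96))
```

A round trip such as `unpack_word(pack_word(1, 2, 3, 4))` returns `(1, 2, 3, 4)`.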
Implementation – Splitting DMA transfers
• Crossing a 4 KB boundary is forbidden
• Address/length combination
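The splitting rule can be sketched as a function that cuts a transfer into pieces, none of which crosses a 4 KB address boundary. This is a minimal sketch and not the design's logic: `split_dma` is a hypothetical name, and `max_size` is an assumed per-request size limit standing in for whatever maximum payload or read request size the link allows.

```python
def split_dma(addr: int, length: int, max_size: int = 4096):
    """Split (addr, length) into pieces that never cross a 4 KB boundary.

    Returns a list of (start_address, byte_count) tuples.
    """
    boundary = 4096
    if length < 0 or max_size <= 0:
        raise ValueError("length must be >= 0 and max_size > 0")
    pieces = []
    while length > 0:
        # Bytes left before the next 4 KB boundary.
        to_boundary = boundary - (addr % boundary)
        chunk = min(length, to_boundary, max_size)
        pieces.append((addr, chunk))
        addr += chunk
        length -= chunk
    return pieces
```

For instance, `split_dma(0xFFC, 8)` gives `[(0xFFC, 4), (0x1000, 4)]`: the first piece stops exactly at the boundary and the second starts on it.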
Implementation – Confirming "Done"
• When is the DMA finished?
– "Done" state required
• CplDs for different MRds do not arrive in order
– Possible solutions
  • Read the Tag RAM
  • Count CplDs
  • Channel buffer empty
  • Trigger on the last tag (x)
• Fill a bitmap
– 128-bit register for 7-bit tags
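The bitmap idea can be sketched with one bit per 7-bit tag. The slide does not say how the bits are set or compared, so the polarity here is an assumption: one mask records which tags were issued, a second one is filled as completions arrive, and the transfer counts as done when both match. The class name and methods are hypothetical, and one completion per tag is assumed.

```python
class TagBitmap:
    """Two 128-bit masks with one bit per 7-bit tag (0..127)."""

    def __init__(self):
        self.expected = 0  # bits of tags whose MRd has been sent
        self.filled = 0    # bits of tags whose data has arrived

    @staticmethod
    def _check(tag: int) -> None:
        if not 0 <= tag < 128:
            raise ValueError("tag must fit in 7 bits")

    def issue(self, tag: int) -> None:
        """Record that an MRd with this tag has been sent."""
        self._check(tag)
        self.expected |= 1 << tag

    def complete(self, tag: int) -> None:
        """Fill the bit for a tag whose completion has arrived."""
        self._check(tag)
        self.filled |= 1 << tag

    def done(self) -> bool:
        """True once every issued tag has been filled in."""
        return self.filled == self.expected
```

The check is independent of arrival order, which is the point of the bitmap: `done()` only becomes true after the last missing bit is filled, whichever tag that happens to be. It is meaningful only after all read requests of the transfer have been issued.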
Performance parameters
• Target device
– Virtex4 XC4VFX60-11ff672
• FFs
– 9 834 out of 50 560 (19 %)
• LUT4s
– 11 464 out of 50 560 (22 %)
• RAMB16
– 58 out of 232 (25 %)
• Slices
– 9 426 out of 25 280 (37 %)
• Frequency (trn_clk)
– 250 MHz
• Latency (transaction layer)
– PIO: 52 ns (MRd to CplD)
– DMA: 80 ns (DMA "Start" to Tx TLP)
• Theoretical bandwidth
– 2 Gbps x4 = 8 Gbps, bi-directional
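The step from the 2.5 Gbps line rate to 2 Gbps per lane follows from the 8b/10b line coding used on these links, which carries 8 data bits in every 10 transmitted bits. A small worked calculation (the function name is hypothetical):

```python
def effective_gbps(lanes: int, line_rate_gbps: float = 2.5) -> float:
    """Data rate per direction after 8b/10b coding (8 data bits per 10 line bits)."""
    return lanes * line_rate_gbps * 8 / 10
```

With one lane this gives 2 Gbps and with four lanes 8 Gbps per direction, matching the figure on the slide. Protocol overhead from TLP headers and flow control is not included.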
4-Lane Tests
[Figure: bandwidth (Mbps, axis 0 to 7000) versus packet length (bytes, 4096 to 524288) for PIO Write, DMA Write, PIO Read and DMA Read.]
Open questions
• Smaller channel buffer
– 64 bits are usually sufficient instead of 128 bits
• Better error handling
– Partially incomplete
– Avoid overwriting of CplDs
– Time-out
• Tag recycling
• Higher bandwidth for downstream DMA
Summary
• PCI Express advantages
– Parallelism
– Scalability
• Virtual channels
– 2 DMA channels
– 1 PIO channel
• Xilinx solution
– 62.5 MHz for x1
– 250 MHz for x4
x4-ABB
Design Summary
--------------
Logic Utilization:
  Number of Slice Flip Flops: 9,834 out of 50,560 19%
  Number of 4 input LUTs: 11,464 out of 50,560 22%
Logic Distribution:
  Number of occupied Slices: 9,426 out of 25,280 37%
  Total Number 4 input LUTs: 12,993 out of 50,560 25%
    Number used as logic: 11,464
    Number used as a route-thru: 643
    Number used for Dual Port RAMs: 202
    Number used as Shift registers: 684
  Number of bonded IPADs: 18 out of 62 29%
  Number of bonded OPADs: 16 out of 24 66%
  Number of bonded IOBs: 1 out of 352 1%
  Number of BUFG/BUFGCTRLs: 5 out of 32 15%
    Number used as BUFGs: 4
    Number used as BUFGCTRLs: 1
  Number of FIFO16/RAMB16s: 58 out of 232 25%
    Number used as FIFO16s: 0
    Number used as RAMB16s: 58
  Number of DSP48s: 2 out of 128 1%
  Number of DCM_ADVs: 1 out of 12 8%
  Number of GT11s: 8 out of 16 50%
  Number of GT11CLKs: 1 out of 8 12%
X4 Test
DMA process
• Buffer descriptor
– SA (Source Address)
– DA (Destination Address)
– NXA (Next Descriptor Address)
– Length (length in bytes)
– Control (control register)
• Start/Stop command
– Upstream: MWr + MRd (dex)
– Downstream: MRd
• Detecting Busy/Done states
– Status register
– Interrupt (Msg)
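How a chain of buffer descriptors might be followed can be sketched as below. The five fields are the ones listed above; everything else is an assumption made for illustration: the `LAST` control bit is hypothetical (the slide does not define the Control bits), and descriptor memory is modelled as a plain dictionary keyed by descriptor address.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    sa: int       # source address
    da: int       # destination address
    nxa: int      # address of the next descriptor
    length: int   # length in bytes
    control: int  # control register value


LAST = 0x1  # hypothetical control bit marking the final descriptor


def walk_chain(descriptors: dict, first_addr: int):
    """Yield (sa, da, length) per descriptor until one has LAST set."""
    addr = first_addr
    seen = set()
    while True:
        if addr in seen:
            raise ValueError("descriptor chain loops")
        seen.add(addr)
        d = descriptors[addr]
        yield d.sa, d.da, d.length
        if d.control & LAST:
            return
        addr = d.nxa
```

Each yielded tuple corresponds to one scatter/gather segment; the engine would signal "Done" through the status register or an interrupt once the last segment has finished.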
Block diagram
[Figure: detailed block diagram. Rx and Tx with Tx Arbitrator; Rx Resolution; DMA Upstream Engine and DMA Downstream Engine, each with registers; Tag RAM; Memory (BRAM + Registers + FIFO) with Rd/Wr ports. Labelled TLP paths: MWr_usp, MRd_usd, MRd_dsd, MRd_dsp, MWr, MRd, Cpl/D, CplD, Msg.]
Verification
• PIO + DMA ($random)
– Transaction length
– Address pair
– Chain length (DMA)
– Descriptor address (DMA)
– Flow control: *_rdy_n
• Output checking
– tsof/teof
– Data
– Descriptor splitting
[Figure: Root and Endpoint; Downstream (Write) and Upstream (Read), steps 1 and 2.]
Memory space
• BRAM
– 16 KB
• FIFO
– 32 x 32
– Loop-back
• Registers
– Write / Read
– Control / Status
• Possible extensions
– DDR (similar to BRAM)
– GbE (similar to FIFO)
[Figure: BRAM, Registers and a loop-back FIFO pair (IFIFO, OFIFO), each with Wr and Rd ports.]