Efficient Asynchronous Message Passing via SCI with Zero-Copying

Joachim Worringen*, Friedrich Seifert+, Thomas Bemmerl*

* Lehrstuhl für Betriebssysteme, RWTH Aachen
+ Lehrstuhl für Rechnerarchitektur, TU Chemnitz

SCI Europe 2001 – Trinity College Dublin


Agenda

• What is Zero-Copying? What is it good for?
  - Zero-Copying with SCI

• Support through SMI Library
  - Shared Memory Interface

• Zero-Copy Protocols in SCI-MPICH
  - Memory Allocation Setups
  - Performance Optimizations

• Performance Evaluation
  - Point-to-Point
  - Application Kernel
  - Asynchronous Communication


Zero-Copying

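• Transfer of data between two user-level accessible memory buffers with N explicit intermediate copies: N-way copying
  No intermediate copy: Zero-Copying

• Effective bandwidth and efficiency (B_i: bandwidth of the i-th intermediate copy):

$$ B_{eff} = \left( \frac{1}{B_{peak}} + \sum_{i=1}^{n} \frac{1}{B_i} \right)^{-1} $$

$$ \eta = \frac{B_{eff}}{B_{peak}} $$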
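The effective-bandwidth relation can be tried out with a few lines of Python. This is a minimal sketch, assuming each intermediate copy adds its own transfer time, so that the inverse bandwidths add up: 1/B_eff = 1/B_peak + sum of 1/B_i. The function names and the bandwidth figures are hypothetical and only illustrate the arithmetic; they are not measurements from the talk.

```python
def effective_bandwidth(b_peak, copy_bandwidths=()):
    """Effective bandwidth of a transfer at b_peak plus intermediate copies.

    Assumption: every intermediate copy i runs at bandwidth B_i and its
    time simply adds to the transfer time (inverse bandwidths add up).
    """
    inverse = 1.0 / b_peak + sum(1.0 / b for b in copy_bandwidths)
    return 1.0 / inverse


def bandwidth_efficiency(b_peak, copy_bandwidths=()):
    """Ratio of effective bandwidth to peak bandwidth."""
    return effective_bandwidth(b_peak, copy_bandwidths) / b_peak


# Hypothetical figures in MiB/s: a 200 MiB/s link, memory copies at 400 MiB/s.
zero_copy = bandwidth_efficiency(200.0)                  # no intermediate copy
one_copy = bandwidth_efficiency(200.0, [400.0])          # one intermediate copy
two_copy = bandwidth_efficiency(200.0, [400.0, 400.0])   # two intermediate copies
print(zero_copy, one_copy, two_copy)
```

With no intermediate copy the efficiency is 1. Under this model every additional copy lowers it, and the loss grows as the network bandwidth approaches the memory-copy bandwidth.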


Efficiency Comparison

[Figure: efficiency comparison for FastEthernet, GigaEthernet and SCI DMA]


Zero-Copying with SCI

SCI does zero-copy by nature.

But: SCI via IO-Bus is limited:
• No SMP-style shared memory
• Specially allocated memory regions were required
  → No general zero-copy possible

New possibility:
• Using user-allocated buffers for SCI communication
  → Allows general zero-copy!

Connection setup is always required.


SMI Library: Shared Memory Interface

High-level SCI support library for parallel applications or libraries
• Application startup
• Synchronization & basic communication
• Shared-memory setup:
  - Collective regions
  - Point-2-point regions
  - Individual regions
• Dynamic memory management
• Data transfer


Data Moving (I)

Shared Memory Paradigm:
• Import remote memory in local address space
• Perform memcpy() or maybe DMA
• SMI support:
  - Region type REMOTE
  - Synchronous (PIO): SMI_Memcpy()
  - Asynchronous (DMA if possible): SMI_Imemcpy() followed by SMI_Mem_wait()

Problems:
• High mapping overhead
• Resource usage (ATT entries on PCI-SCI adapter)


Mapping Overhead

Not suitable for dynamic memory setups!


Data Moving (II)

Connection Paradigm:
• Connect to remote memory location
• No representation in local address space
  → only DMA possible
• SMI support:
  - Region type RDMA
  - Synchronous / asynchronous DMA: SMI_Put/SMI_Iput, SMI_Get/SMI_Iget, SMI_Memwait

Problems:
• Alignment restrictions
• Source needs to be pinned down


Setup Acceleration

Memory buffer setup costs time!
→ Reduce number of operations to increase performance

Desirable: only one operation per buffer
• Problem: limited resources
• Solution: caching of SCI segment states by lazy-release
  - Leave buffers registered, remote segments connected or mapped
  - Release unneeded resources if setup of new resource fails
  - Different replacement strategies possible: LRU, LFU, best-fit, random, immediate
  - Attention: remote segment deallocation!
    → Callback on connection event to release local connection
• MPI persistent communication operations:
  - Pre-register user buffer & higher "hold" priority
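The lazy-release idea can be illustrated with a small cache model. This is a sketch under stated assumptions, not the SMI or SCI-MPICH implementation: the class `SegmentCache`, its `capacity` limit (a stand-in for limited adapter resources) and the string handles are hypothetical, and LRU is just one of the replacement strategies listed above.

```python
from collections import OrderedDict


class SegmentCache:
    """Lazy-release cache: keep resources set up until a new setup fails."""

    def __init__(self, capacity):
        self.capacity = capacity      # hypothetical stand-in for limited resources
        self.entries = OrderedDict()  # buffer id -> handle, least recently used first
        self.setups = 0               # number of expensive setup operations performed

    def _setup(self, buf_id):
        # Models registering / connecting / mapping; fails when resources run out.
        if len(self.entries) >= self.capacity:
            raise MemoryError("no free resources")
        self.setups += 1
        return f"handle-{buf_id}"

    def acquire(self, buf_id):
        if buf_id in self.entries:            # cache hit: no setup cost
            self.entries.move_to_end(buf_id)
            return self.entries[buf_id]
        while True:
            try:
                handle = self._setup(buf_id)
                break
            except MemoryError:
                if not self.entries:          # nothing left to release
                    raise
                self.entries.popitem(last=False)  # release least recently used
        self.entries[buf_id] = handle
        return handle

    def invalidate(self, buf_id):
        """Callback for remote segment deallocation: drop the local entry."""
        self.entries.pop(buf_id, None)


cache = SegmentCache(capacity=2)
for buf in ["a", "b", "a", "c"]:
    cache.acquire(buf)
print(list(cache.entries), cache.setups)
```

Repeated transfers from the same buffer cost only one setup, and resources are released only when a new setup would otherwise fail.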


Memory Allocation

Allocate "good" memory:
• MPI_Alloc_mem() / MPI_Free_mem()
• Part of MPI-2 (mostly for one-sided operations)
• SCI-MPICH defines attributes:
  - type: shared, private or default
    → Shared memory performs best.
  - alignment: none, specified or default
    → Non-shared memory should be page-aligned
• "Good" memory should only be enforced for communication buffers!


Zero-Copy Protocols

• Applicable for hand-shake based rendez-vous protocol
• Requirements:
  - registered user-allocated buffers
    or
  - regular SCI segments ("good" memory via MPI_Alloc_mem())
• State of memory range must be known
  → SMI provides query functionality
• Registering / Connection / Mapping may fail
• Several different setups possible
  → Fallback mechanism required
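Since any of the setup steps may fail, a fallback mechanism could try the candidate setups in order of preference. The following is only a sketch of that idea; `select_setup`, the setup names and the two example setup functions are hypothetical and do not reflect the actual SCI-MPICH code.

```python
def select_setup(candidates):
    """Try (name, setup) pairs in order; return the first one that succeeds.

    Each setup is a callable that returns a transfer handle or raises
    RuntimeError when registering, connecting or mapping fails.
    """
    for name, setup in candidates:
        try:
            return name, setup()
        except RuntimeError:
            continue  # fall back to the next, less preferred setup
    raise RuntimeError("no transfer setup available")


# Hypothetical setups for illustration only.
def register_user_buffer():
    raise RuntimeError("registration failed")  # e.g. resources exhausted


def use_preallocated_buffer():
    return "one-copy via preallocated buffer"


name, handle = select_setup([
    ("zero-copy, registered user buffer", register_user_buffer),
    ("one-copy fallback", use_preallocated_buffer),
])
print(name, "->", handle)
```

Ordering the candidates from fastest to most general means a message is always deliverable, with zero-copy used whenever its setup succeeds.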


Data Transfer

[Figure: Asynchronous Rendez-Vous. Sender and receiver each have an application thread and a device thread. Labels: Isend, Irecv, Wait, Continue, Done; control messages "Ask to send" and "OK to send".]


Test Setup

Systems used for performance evaluation:
• Pentium-III @ 800 MHz
• 512 MB RAM @ 133 MHz
• 64-bit / 66 MHz PCI (ServerWorks ServerSet III LE)
• Dolphin D330 (single ring topology)
• Linux 2.4.4-bigphysarea
• Modified SCI driver (user memory for SCI)


Bandwidth Comparison


Application Kernel: NPB IS

• Parallel bucket sort
• Keys are integer numbers
• Dominant communication: MPI_Alltoallv for distributed key array:

Class   Array size [MiB]   Procs   Msg size [kiB]   Alltoallv [ms]   % of execution time
A       1                  4       256              16.363           34.6
W       8                  4       2048             123.921          36.2


MPI_Alltoallv Performance

• MPI_Alltoallv is translated into point-to-point operations: MPI_Isend / MPI_Irecv / MPI_Waitall
• Improved performance with asynchronous DMA operations
• Application speedup deduced

Class   Procs   regular [ms]   speedup   user [ms]   speedup
A       4       7.578          1.22      9.617       1.16
W       4       52.415         1.26      63.957      1.21
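One plausible way to deduce the application speedup from the MPI_Alltoallv times is sketched below. The slides do not state the exact method, so this is an assumption: the original Alltoallv time and its share of the execution time (16.363 ms at 34.6 % for class A, 123.921 ms at 36.2 % for class W) give the total run time, and only the Alltoallv part is replaced by the new measurement. The function name is hypothetical.

```python
def application_speedup(t_comm_old, comm_fraction, t_comm_new):
    """Deduce application speedup when only the communication time changes.

    t_comm_old    : original Alltoallv time [ms]
    comm_fraction : its share of the total execution time (0..1)
    t_comm_new    : improved Alltoallv time [ms]
    """
    t_total_old = t_comm_old / comm_fraction
    t_total_new = t_total_old - t_comm_old + t_comm_new
    return t_total_old / t_total_new


# Values taken from the two tables of the talk.
for label, old, share, new in [
    ("A regular", 16.363, 0.346, 7.578),
    ("A user", 16.363, 0.346, 9.617),
    ("W regular", 123.921, 0.362, 52.415),
    ("W user", 123.921, 0.362, 63.957),
]:
    print(label, application_speedup(old, share, new))
```

Under this assumption the results agree with the speedups in the table to within 0.01.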


Asynchronous Communication

Goal: Overlap Computation & Communication
• How to quantify the efficiency for this?

Typical overlapping effect:

[Figure: total time plotted against computation time, with curves labeled Computation, Synchronous and Asynchronous]


Saturation and Efficiency (I)

Two parameters are required:
1. Saturation s
   • Duration of computation period required to make total time (communication & computation) increase
2. Efficiency ε
   • Relation of overhead to message latency


Saturation and Efficiency (II)

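[Figure: total time t_total plotted against busy time t_busy, with curves labeled Computation, Synchronous and Asynchronous; marked quantities: t_msg_s, t_msg_a, t_total - t_busy]

Efficiency ε:

$$ \varepsilon = 1 - \frac{t_{total} - t_{busy}}{t_{msg}} $$

Saturation s:

$$ s = t_{msg} - (t_{total} - t_{busy}) $$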
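The two parameters can be computed directly from measured times. This is a minimal sketch, assuming the definitions s = t_msg - (t_total - t_busy) and ε = 1 - (t_total - t_busy) / t_msg, where t_total - t_busy is the communication overhead left visible to the application. The function names and the t_busy / t_total values in the example are hypothetical; only t_msg = 0.490 ms comes from the results of the talk.

```python
def saturation(t_msg, t_total, t_busy):
    """Computation time that can be hidden behind one message [same unit as input]."""
    return t_msg - (t_total - t_busy)


def overlap_efficiency(t_msg, t_total, t_busy):
    """Fraction of the message latency that is overlapped with computation."""
    return 1.0 - (t_total - t_busy) / t_msg


# Hypothetical measurement: 2.0 ms of computation, 2.205 ms in total,
# for a message with a latency of 0.490 ms.
s = saturation(0.490, 2.205, 2.0)
eps = overlap_efficiency(0.490, 2.205, 2.0)
print(s, eps)
```

With these definitions ε equals s / t_msg, so an efficiency of 1 means the whole message latency is hidden, and a negative value means the overhead exceeds the latency of the message itself.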


Experimental Setup: Overlap

Micro-Benchmark to quantify overlapping:

latency = MPI_Wtime()
if (sender)
    MPI_Isend(msg, msgsize)
    while (elapsed_time < spinning_duration)
        spin (with multiple threads)
    MPI_Wait()
else
    MPI_Recv()
latency = MPI_Wtime() - latency


Experimental Setup: Spinning

Different ways of keeping CPU busy:
• FIXED
  Spin on single variable for a given amount of CPU time
  → No memory stress
• DAXPY
  Perform a given number of DAXPY operations on vectors (sizes of vectors x, y equivalent to message size)
  → Stress memory system

$$ y_j = A \cdot x_j + y_j $$
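For reference, the DAXPY operation y_j = A * x_j + y_j can be written out as follows. This is a plain-Python sketch for illustration only; a benchmark would use a compiled loop or a BLAS routine, and the function names here are hypothetical.

```python
def daxpy(a, x, y):
    """In-place DAXPY: y[j] = a * x[j] + y[j] for all j."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for j in range(len(y)):
        y[j] = a * x[j] + y[j]
    return y


def spin_daxpy(repetitions, a, x, y):
    """Keep the CPU and memory system busy with a given number of DAXPY passes."""
    for _ in range(repetitions):
        daxpy(a, x, y)
    return y


print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Because every pass reads both vectors and writes one of them, this kind of spinning competes with the message transfer for memory bandwidth, unlike spinning on a single variable.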


DAXPY – 64kiB Message


DAXPY – 256kiB Message


FIXED – 64kiB Message


Asynchronous Performance

Saturation and Efficiency derived from experiments:

Experiment       Protocol    t_msg [ms]   s [ms]   Efficiency ε
64 kiB DAXPY     a-DMA-0-R   0.490        0.285    0.581
                 a-DMA-0-U   0.735        0.473    0.643
                 s-PIO-1     0.572        0.056    0.043
256 kiB DAXPY    a-DMA-0-R   1.300        1.099    0.845
                 a-DMA-0-U   1.506        1.148    0.762
                 s-PIO-1     1.895        -0.030   -0.015
64 kiB FIXED     a-DMA-0-R   0.493        0.446    0.904
                 a-DMA-0-U   0.738        0.691    0.936
                 s-PIO-1     0.567        0.016    0.028


Summary & Outlook

• Efficient utilization of new SCI driver functionality for MPI communication:
  - Max. bandwidth of 230 MiB/s (regular), 190 MiB/s (user)
• Connection overhead hidden by segment caching
  → Asynchronous communication pays off much earlier than before
• New (?) quantification scheme for efficiency of asynchronous communication
• Flexible MPI memory allocation supports MPI application writer
• Connection-oriented DMA transfers reduce resource utilization
• DMA alignment problems
• Segment callback required for improved connection caching
