Parallel Algorithms for Matrix Multiplication

Matthias Dohm

Seminar: Parallel Programming and Parallel Algorithms

Agenda

Introduction

Algorithms for square matrices and square processor grids

Algorithms for non-square matrices and non-square processor grids

Conclusion and outlook

Introduction

$c_{i,j} = \sum_{k=0}^{n-1} a_{i,k} \, b_{k,j}$

public static int[][] MatrixMult(int[][] a, int[][] b) {
    int m = a.length;    // A is an m x n matrix
    int n = b.length;    // B is an n x o matrix
    int o = b[0].length; // the result is an m x o matrix

    int[][] c = new int[m][o];

    // Calculation
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < o; j++) {
            c[i][j] = 0; // initialize c_i,j
            for (int k = 0; k < n; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return c;
}

Runtime: Θ(n³) for n × n matrices (Θ(m · n · o) in general)

Block-oriented algorithm

Runtime: Θ(n³)

$C = \begin{pmatrix} A_{00}B_{00} + A_{01}B_{10} & A_{00}B_{01} + A_{01}B_{11} \\ A_{10}B_{00} + A_{11}B_{10} & A_{10}B_{01} + A_{11}B_{11} \end{pmatrix}$

Figure: two 6 × 6 matrices with elements a_{0,0} to a_{5,5} and b_{0,0} to b_{5,5}, each partitioned into four blocks A_{0,0}, A_{0,1}, A_{1,0}, A_{1,1} and B_{0,0}, B_{0,1}, B_{1,0}, B_{1,1}.
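The block formula can be tried out sequentially before any processes are involved. The following is a minimal sketch, not code from the presentation: the class name `BlockedMatMul` and the parameter `blockSize` are assumptions chosen for illustration, and the matrices are assumed to be square.

```java
import java.util.Arrays;

final class BlockedMatMul {
    /** Multiplies two n x n matrices block by block; blockSize is an assumed tuning parameter. */
    static int[][] multiply(int[][] a, int[][] b, int blockSize) {
        int n = a.length;
        int[][] c = new int[n][n]; // Java initializes the result with zeros
        for (int bi = 0; bi < n; bi += blockSize) {
            for (int bj = 0; bj < n; bj += blockSize) {
                for (int bk = 0; bk < n; bk += blockSize) {
                    // One block step: C[bi,bj] += A[bi,bk] * B[bk,bj]
                    int iMax = Math.min(bi + blockSize, n);
                    int jMax = Math.min(bj + blockSize, n);
                    int kMax = Math.min(bk + blockSize, n);
                    for (int i = bi; i < iMax; i++) {
                        for (int j = bj; j < jMax; j++) {
                            int sum = c[i][j];
                            for (int k = bk; k < kMax; k++) {
                                sum += a[i][k] * b[k][j];
                            }
                            c[i][j] = sum;
                        }
                    }
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b, 1)));
    }
}
```

The number of multiply-add operations is the same as in the element-wise version, which is why the runtime stays Θ(n³); only the order of the operations changes.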

Architecture: distributed memory

Communication: a process can send and receive at the same time

No parallel sending to several receivers and no parallel receiving from several senders

Algorithms of Cannon and Fox: block-oriented algorithms

Processors are organized as a 2D grid (which must be square)

Matrices must be square

Li's algorithm: variant of Cannon's algorithm for non-square matrices and processor grids

MM3, MM4, MM5

Variants of Fox's algorithm for non-square matrices and processor grids

Algorithms for Square Matrices and Square Processor Grids

Cannon's Algorithm

Matrices A and B are distributed across the processes

The corresponding part of C is initialized

Problem: only the processes along the main diagonal hold matching submatrices (A_{i,k}, B_{k,j})

Therefore: realignment

A_{i,j}: i columns to the left

B_{i,j}: j rows upward

Initial distribution (each cell is one process; M i,j denotes a submatrix of M):

A 0,0 B 0,0 | A 0,1 B 0,1 | A 0,2 B 0,2 | A 0,3 B 0,3
A 1,0 B 1,0 | A 1,1 B 1,1 | A 1,2 B 1,2 | A 1,3 B 1,3
A 2,0 B 2,0 | A 2,1 B 2,1 | A 2,2 B 2,2 | A 2,3 B 2,3
A 3,0 B 3,0 | A 3,1 B 3,1 | A 3,2 B 3,2 | A 3,3 B 3,3

After the realignment:

A 0,0 B 0,0 | A 0,1 B 1,1 | A 0,2 B 2,2 | A 0,3 B 3,3
A 1,1 B 1,0 | A 1,2 B 2,1 | A 1,3 B 3,2 | A 1,0 B 0,3
A 2,2 B 2,0 | A 2,3 B 3,1 | A 2,0 B 0,2 | A 2,1 B 1,3
A 3,3 B 3,0 | A 3,0 B 0,1 | A 3,1 B 1,2 | A 3,2 B 2,3

Iteration (repeated once per process row)

All processes perform a (sequential) matrix multiplication

Communication step: each process sends its block of A to the left and receives a block of A from the right

Each process sends its block of B upward and receives a block of B from below

After the last iteration, the realignment is undone

After the first communication step:

A 0,1 B 1,0 | A 0,2 B 2,1 | A 0,3 B 3,2 | A 0,0 B 0,3
A 1,2 B 2,0 | A 1,3 B 3,1 | A 1,0 B 0,2 | A 1,1 B 1,3
A 2,3 B 3,0 | A 2,0 B 0,1 | A 2,1 B 1,2 | A 2,2 B 2,3
A 3,0 B 0,0 | A 3,1 B 1,1 | A 3,2 B 2,2 | A 3,3 B 3,3
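The realignment and the iteration can be simulated sequentially. This is a sketch under simplifying assumptions, not the presenter's code: each process holds a single element as its "block" (so the matrices are p × p), the name `CannonSimulation` is hypothetical, and the message passing is replaced by cyclic array shifts.

```java
import java.util.Arrays;

final class CannonSimulation {
    /** Sequentially simulates Cannon's algorithm on a p x p grid with one element per process. */
    static long[][] multiply(long[][] a, long[][] b) {
        int p = a.length; // grid and matrices are both p x p in this sketch
        long[][] la = new long[p][p]; // la[i][j]: block of A currently held by process (i, j)
        long[][] lb = new long[p][p]; // lb[i][j]: block of B currently held by process (i, j)
        long[][] c = new long[p][p];
        // Realignment: row i of A moves i positions left, column j of B moves j positions up.
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                la[i][j] = a[i][(j + i) % p];
                lb[i][j] = b[(i + j) % p][j];
            }
        }
        for (int step = 0; step < p; step++) {
            // Local multiplication (a full block product in the real algorithm).
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    c[i][j] += la[i][j] * lb[i][j];
                }
            }
            la = shift(la, 0, 1); // blocks of A move one process to the left
            lb = shift(lb, 1, 0); // blocks of B move one process upward
        }
        // After p shifts the blocks are back in the realigned position; undoing it is omitted here.
        return c;
    }

    /** Cyclic shift: process (i, j) receives the block of process (i + di, j + dj). */
    static long[][] shift(long[][] m, int di, int dj) {
        int p = m.length;
        long[][] r = new long[p][p];
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < p; j++) {
                r[i][j] = m[(i + di) % p][(j + dj) % p];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        long[][] a = {{1, 2}, {3, 4}};
        long[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

At step s, process (i, j) holds A_{i,(i+j+s) mod p} and B_{(i+j+s) mod p, j}, so after p steps every index k has contributed exactly once to c_{i,j}.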

Cannon's Algorithm: Complexity

Size of the process grid: p × p

Size of the matrices: n × n

Time to multiply two elements and add the product to the result: χ

Time to set up a communication: λ

Time to transfer one matrix element: 1 / β

Computation per iteration

Total computation

Transfer of one submatrix

Total communication
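The four quantities listed above can be derived from the cost parameters χ, λ and β. The following is a sketch and not the original slide content. It assumes that p divides n, so that each of the p² processes holds blocks of size (n/p) × (n/p), that the transfers of A and B within one iteration cannot overlap, and that the realignment and its reversal each cost one block transfer per matrix.

```latex
\begin{align*}
T_{\text{comp,iter}} &= \chi \left(\frac{n}{p}\right)^{3} \\
T_{\text{comp}} &= p \cdot \chi \left(\frac{n}{p}\right)^{3} = \chi \, \frac{n^{3}}{p^{2}} \\
T_{\text{block}} &= \lambda + \frac{n^{2}}{p^{2}\beta} \\
T_{\text{comm}} &= \underbrace{2p \, T_{\text{block}}}_{\text{iterations}}
  + \underbrace{4 \, T_{\text{block}}}_{\text{realignment and reversal}}
  = 2(p+2)\left(\lambda + \frac{n^{2}}{p^{2}\beta}\right)
\end{align*}
```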

Fox's Algorithm

Matrices A and B are distributed across the processes

The corresponding part of C is initialized

No realignment is required

Submatrices of A are broadcast along the process row

Initial distribution (each cell is one process; M i,j denotes a submatrix of M):

A 0,0 B 0,0 | A 0,1 B 0,1 | A 0,2 B 0,2 | A 0,3 B 0,3
A 1,0 B 1,0 | A 1,1 B 1,1 | A 1,2 B 1,2 | A 1,3 B 1,3
A 2,0 B 2,0 | A 2,1 B 2,1 | A 2,2 B 2,2 | A 2,3 B 2,3
A 3,0 B 3,0 | A 3,1 B 3,1 | A 3,2 B 3,2 | A 3,3 B 3,3

After the first broadcast of A along the rows:

A 0,0 B 0,0 | A 0,0 B 0,1 | A 0,0 B 0,2 | A 0,0 B 0,3
A 1,1 B 1,0 | A 1,1 B 1,1 | A 1,1 B 1,2 | A 1,1 B 1,3
A 2,2 B 2,0 | A 2,2 B 2,1 | A 2,2 B 2,2 | A 2,2 B 2,3
A 3,3 B 3,0 | A 3,3 B 3,1 | A 3,3 B 3,2 | A 3,3 B 3,3

Iteration (repeated once per process row)

All processes perform a (sequential) matrix multiplication

Communication step: each process sends its block of B upward and receives a block of B from below

In each row, one block of A is broadcast along the row (omitted in the last iteration)

After the first shift of B (each process is shown with its own block of A again, before the next broadcast):

A 0,0 B 1,0 | A 0,1 B 1,1 | A 0,2 B 1,2 | A 0,3 B 1,3
A 1,0 B 2,0 | A 1,1 B 2,1 | A 1,2 B 2,2 | A 1,3 B 2,3
A 2,0 B 3,0 | A 2,1 B 3,1 | A 2,2 B 3,2 | A 2,3 B 3,3
A 3,0 B 0,0 | A 3,1 B 0,1 | A 3,2 B 0,2 | A 3,3 B 0,3
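Fox's algorithm can be simulated in the same sequential style. Again this is only a sketch under assumptions: one element per process stands in for a block, the class name `FoxSimulation` is hypothetical, and the row broadcast is modelled by reading the broadcast block directly from A.

```java
import java.util.Arrays;

final class FoxSimulation {
    /** Sequentially simulates Fox's algorithm on a p x p grid with one element per process. */
    static long[][] multiply(long[][] a, long[][] b) {
        int p = a.length;
        long[][] lb = new long[p][];
        for (int i = 0; i < p; i++) {
            lb[i] = b[i].clone(); // process (i, j) starts with its own block B[i][j]
        }
        long[][] c = new long[p][p];
        for (int step = 0; step < p; step++) {
            for (int i = 0; i < p; i++) {
                // Process (i, (i + step) mod p) broadcasts its block of A along process row i.
                long broadcast = a[i][(i + step) % p];
                for (int j = 0; j < p; j++) {
                    c[i][j] += broadcast * lb[i][j];
                }
            }
            // Every process sends its block of B upward and receives one from below.
            long[][] next = new long[p][p];
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    next[i][j] = lb[(i + 1) % p][j];
                }
            }
            lb = next;
        }
        return c;
    }

    public static void main(String[] args) {
        long[][] a = {{1, 2}, {3, 4}};
        long[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

In step s, process (i, j) combines A_{i,(i+s) mod p} with B_{(i+s) mod p, j}, which matches the diagonal broadcast in the first step (s = 0).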

Fox's Algorithm: Complexity

Total computation (same as for Cannon's algorithm)

Transfer of a submatrix of B

Broadcast of a submatrix of A

Runtime Comparison: Cannon's and Fox's Algorithms

Cannon's algorithm is faster when:

$2(p + 2) < p + p \log p$

$\Leftrightarrow \; p + 4 < p \log p$
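The break-even point of such a comparison can be found by counting block transfers for growing grid sizes. The counts used here are assumptions, not confirmed slide content: 2(p + 2) transfers for Cannon (2p in the iterations plus 4 for the realignment and its reversal) and p + p⌈log₂ p⌉ for Fox (p shifts of B plus p tree broadcasts of A). The class name `CannonVsFox` is hypothetical.

```java
final class CannonVsFox {
    /** Smallest integer k with 2^k >= p, for p >= 1. */
    static int ceilLog2(int p) {
        return 32 - Integer.numberOfLeadingZeros(p - 1);
    }

    /** Assumed count: 2p transfers in the iterations plus 4 for realignment and its reversal. */
    static int cannonTransfers(int p) {
        return 2 * (p + 2);
    }

    /** Assumed count: p shifts of B plus p tree broadcasts of A with ceil(log2 p) steps each. */
    static int foxTransfers(int p) {
        return p + p * ceilLog2(p);
    }

    /** Returns the smallest p in [2, maxP] for which Cannon needs fewer transfers, or -1. */
    static int smallestGridWhereCannonWins(int maxP) {
        for (int p = 2; p <= maxP; p++) {
            if (cannonTransfers(p) < foxTransfers(p)) {
                return p;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int p = 2; p <= 8; p++) {
            System.out.println(p + ": Cannon " + cannonTransfers(p) + ", Fox " + foxTransfers(p));
        }
    }
}
```

Under these assumed counts both algorithms tie at p = 4 with 12 transfers each, and Cannon needs fewer transfers from p = 5 on.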

Algorithms for Non-Square Matrices and Non-Square Processor Grids

Figure: example matrices, a 5 × 7 matrix A with elements a_{0,0} to a_{4,6} and a 7 × 8 matrix B with elements b_{0,0} to b_{6,7}.

Li's Algorithm (C stationary)

The realignment consists of two phases:

Blocks are shifted

Rows/columns are shifted

For each process row: the column index of the first column of A should equal the row index of the first row of B

For each process column: the row index of the first row of B should equal the column index of the first column of A

Figure: realignment of the example matrices. Rows 0 and 1 of A and columns 0 and 1 of B stay in place throughout.

After phase 1 (blocks shifted): rows 2 to 4 of A are rotated two columns to the left (row 2 starts with a_{2,2}); columns 5 to 7 of B are rotated three rows upward (they start with b_{3,5}).

After phase 2 (rows/columns shifted): rows 2 to 4 of A start with a_{2,3}; columns 2 to 4 of B start with b_{2,2} and columns 5 to 7 of B start with b_{4,5}.
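The target positions of the realignment can be computed from the block boundaries. This sketch rests on assumptions that are inferred from the example figure rather than stated in the text: a 2 × 3 process grid (p = 2, q = 3), and a block distribution in which block i of n items starts at index ⌊i · n / parts⌋. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class LiAlignment {
    /** First index of block i when n items are split into `parts` blocks (assumed rule). */
    static int blockStart(int i, int n, int parts) {
        return i * n / parts;
    }

    /**
     * For each process row r: the column of A that must come first after realignment,
     * i.e. the first row of the r-th row block of B (the n rows of B are split over p rows).
     */
    static int[] firstColumnOfAPerProcessRow(int n, int p) {
        int[] first = new int[p];
        for (int r = 0; r < p; r++) {
            first[r] = blockStart(r, n, p);
        }
        return first;
    }

    /**
     * For each process column c: the row of B that must come first after realignment,
     * i.e. the first column of the c-th column block of A (the n columns of A are split over q columns).
     */
    static int[] firstRowOfBPerProcessColumn(int n, int q) {
        int[] first = new int[q];
        for (int c = 0; c < q; c++) {
            first[c] = blockStart(c, n, q);
        }
        return first;
    }

    public static void main(String[] args) {
        // Example: A is 5 x 7, B is 7 x 8, so the shared dimension is n = 7.
        System.out.println(Arrays.toString(firstColumnOfAPerProcessRow(7, 2)));
        System.out.println(Arrays.toString(firstRowOfBPerProcessColumn(7, 3)));
    }
}
```

For n = 7, p = 2 and q = 3 this yields the start columns 0 and 3 for A and the start rows 0, 2 and 4 for B, which agrees with the realigned state described in the figure (a_{2,3}, b_{2,2} and b_{4,5} come first).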

Iteration (until this state is reached again)

All processes perform a (sequential) matrix multiplication (as far as possible)

Communication step: each process sends its block of A to the left and receives a block of A from the right (once all elements of A have been used)

Each process sends its block of B upward and receives a block of B from below (once all elements of B have been used)

Figure: A after one communication step. Every row has been rotated two further columns to the left: rows 0 and 1 now start with a_{0,2}, rows 2 to 4 start with a_{2,5}.

Li's Algorithm: Complexity

Size of the process grid: p × q

Size of the matrices:

A: m × n

B: n × o

C: m × o

Time to multiply two elements and add the product to the result: χ

Time to set up a communication: λ

Time to transfer one matrix element: 1 / β

Total computation

Transfer of a submatrix of A

Realignment step

[Formulas not legible in this copy of the slides; the surviving fragments contain terms in m, n, o, p and q.]

MM3: Row Version

Step 1: for each process row

The column of A is located whose column index equals the row index of the first row of B

The part of the submatrix starting at this column is broadcast along the row

The elements of the submatrices that remain unused are transferred and used in the last step

Figure: state after the first broadcast. Every process in the upper process row holds columns 0 and 1 of A (a_{0,0}, a_{0,1}, a_{1,0}, a_{1,1}); every process in the lower process row holds only column 3 (a_{2,3}, a_{3,3}, a_{4,3}). B is unchanged.
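The first broadcast of MM3 can be expressed as a small index computation. The sketch below assumes, as inferred from the example figure and not stated in the text, a 2 × 3 process grid and a block distribution in which block i of n items starts at ⌊i · n / parts⌋. `Mm3FirstBroadcast` and its methods are hypothetical names.

```java
import java.util.Arrays;

final class Mm3FirstBroadcast {
    /** First index of block i when n items are split into `parts` blocks (assumed rule). */
    static int blockStart(int i, int n, int parts) {
        return i * n / parts;
    }

    /**
     * For process row r, returns {owner, firstColumn, lastColumn}: the process column that owns
     * the target column of A and the range of columns it broadcasts along the row.
     * n is the shared dimension, p the number of process rows, q the number of process columns.
     */
    static int[] firstBroadcast(int r, int n, int p, int q) {
        int target = blockStart(r, n, p); // first row of B held in process row r
        int owner = 0;
        while (owner + 1 < q && blockStart(owner + 1, n, q) <= target) {
            owner++; // advance to the column block of A that contains the target column
        }
        int last = blockStart(owner + 1, n, q) - 1; // last column of the owner's block
        return new int[] {owner, target, last};
    }

    public static void main(String[] args) {
        // Example: A is 5 x 7 and B is 7 x 8 on an assumed 2 x 3 grid.
        System.out.println(Arrays.toString(firstBroadcast(0, 7, 2, 3)));
        System.out.println(Arrays.toString(firstBroadcast(1, 7, 2, 3)));
    }
}
```

In the example the upper process row broadcasts columns 0 to 1 and the lower process row only column 3, which illustrates the uneven amount of work per step.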

Iteration (until this state is reached again)

Communication step: the next process broadcasts its block of A (once all elements of A in this row have been used)

MM3: Complexity

Total computation

First + last broadcast step

[Formulas not legible in this copy of the slides; the surviving fragments contain terms in m, n, o, p, q and log q.]

Disadvantages of MM3

Poor load balance

Additional broadcast step

MM4 removes these problems with a realignment step

The columns of A are shifted to the left so that the column whose index equals the row index of the first row of B becomes the first column of its block

In each row, one block of A is broadcast along the row

Figure: A after the shift. Rows 0 and 1 are unchanged; rows 2 to 4 are rotated one column to the left (row 2 starts with a_{2,1}). B is unchanged.

Figure: state after the broadcast. Every process in the upper process row holds columns 0 and 1 of A (a_{0,0}, a_{0,1}, a_{1,0}, a_{1,1}); every process in the lower process row holds columns 3 and 4 (a_{2,3}, a_{2,4} down to a_{4,3}, a_{4,4}). B is unchanged.
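The shift used by MM4 follows from the same block boundaries. As before, this is a sketch under assumptions inferred from the example figure: a 2 × 3 process grid and a block distribution in which block i of n items starts at ⌊i · n / parts⌋. `Mm4Realignment` and its methods are hypothetical names.

```java
import java.util.Arrays;

final class Mm4Realignment {
    /** First index of block i when n items are split into `parts` blocks (assumed rule). */
    static int blockStart(int i, int n, int parts) {
        return i * n / parts;
    }

    /**
     * Number of columns by which A is rotated left in process row r so that the target
     * column (first row index of B in that process row) becomes the first column of its block.
     */
    static int leftShift(int r, int n, int p, int q) {
        int target = blockStart(r, n, p);
        int owner = 0;
        while (owner + 1 < q && blockStart(owner + 1, n, q) <= target) {
            owner++; // column block of A that contains the target column
        }
        return target - blockStart(owner, n, q);
    }

    /** Rotates one row (given here as column indices) to the left by `shift` positions. */
    static int[] rotateLeft(int[] row, int shift) {
        int[] r = new int[row.length];
        for (int k = 0; k < row.length; k++) {
            r[k] = row[(k + shift) % row.length];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] columns = {0, 1, 2, 3, 4, 5, 6};
        for (int r = 0; r < 2; r++) {
            int shift = leftShift(r, 7, 2, 3);
            System.out.println("process row " + r + ": " + Arrays.toString(rotateLeft(columns, shift)));
        }
    }
}
```

For the lower process row the shift is 1, so its second block then holds columns 3 and 4 of A. In contrast to MM3, a complete block is broadcast in every step.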

MM4: Row Version

Iteration (until this state is reached again)

The realignment is undone

MM4: Complexity

Total computation

Realignment step

[Formulas not legible in this copy of the slides; the surviving fragments contain terms in m, n, o, p, q and log q.]

Remarks

None of the presented algorithms is optimal in all cases

Li's algorithm: the matrix with the most elements should be the stationary one

Advantages on large processor grids

MM3, MM4, MM5:

The row version is advantageous when the blocks of A are small or when the number of process columns is small

Conclusion and Outlook

Matrix multiplication parallelizes well

No algorithm is optimal in all cases

The most suitable algorithm can be chosen for each case

The algorithms presented are only a small selection:

Further variants of Cannon's and Fox's algorithms

Broadcast-broadcast algorithm

Algorithms on hypercubes

References

John Gunnels, Calvin Lin, Greg Morrow, Robert van de Geijn: Analysis of a Class of Parallel Matrix Multiplication Algorithms, Proc. Int'l Parallel Processing Symp., 1998.

Jin Li: A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies, Mississippi, 1996.

Michael J. Quinn: Parallel Programming in C with MPI and OpenMP, Boston, Mass., McGraw-Hill, 2004.

Questions

For MM3 and MM4, the row and column versions can be used regardless of the shape of the matrices or of the process grid

For MM5:

Row version only if q ≤ p

Column version only if p ≤ q

Hence in this example: column version

Iteration (until this state is reached again)

For each process column: all rows of B that can be used for computation are broadcast along the column (possibly in several steps)

Each process sends its block of A to the left and receives a block of A from the right

Figure: rows of B that are broadcast along the process columns, for example rows 0 and 1 for columns 0 to 1, rows 2 and 3 for columns 2 to 4, and rows 4 to 6 for columns 5 to 7.

MM5: Complexity

Total computation

The number of broadcast steps per iteration is hard to predict

If q mod p = 0 (column version), one broadcast step always suffices

Then: total communication

[Formulas not legible in this copy of the slides; the surviving fragments contain terms in m, n, o, p, q, log p and log q.]
