fakultät für informatik, informatik 12
technische universität dortmund

Memory-architecture aware compilation
- Sessions 15-17 -

Peter Marwedel, TU Dortmund, Informatik 12, Germany

Slides use Microsoft cliparts. All Microsoft restrictions apply.
- 2 - technische universität dortmund, fakultät für informatik, p. marwedel, informatik 12, 2008, TU Dortmund
Schedule of the course
Time        | Monday                          | Tuesday                         | Wednesday                       | Thursday                  | Friday
09:30-11:00 | 1: Orientation, introduction; 2: Models of computation + specs | 5: Models of computation + specs | 9: Mapping of applications to platforms | 13: Memory-aware compilation | 17: Memory-aware compilation
11:00       | Brief break
11:15-12:30 |                                 | 6: Lab*: Ptolemy                | 10: Lab*: Scheduling            | 14: Lab*: Mem. opt.       | 18: Lab*: Mem. opt.
12:30       | Lunch
14:00-15:20 | 3: Models of computation + specs | 7: Mapping of applications to platforms | 11: High-level optimizations* | 15: Memory-aware compilation | 19: WCET & compilers*
15:20       | Break
15:40-17:00 | 4: Lab*: Kahn process networks  | 8: Mapping of applications to platforms | 12: High-level optimizations* | 16: Memory-aware compilation | 20: Wrap-up

* Dr. Heiko Falk
- 3 -
Using these ideas with a gcc-based tool flow

The application source (.c files), together with profile information (.txt), is split into two different files by a specially developed memory optimizer tool*: a main-memory source and an SPM source. The optimizer also generates a linker script (.ld). Both sources are compiled with the ARM-GCC compiler, and the linker produces the executable (.exe).

* Built with the tool design suite ICD-C, available from ICD (see www.icd.de/es)
- 4 -
Partitioning

[Figure: address space starting at 0, partitioned into scratch pad 0 (256 entries), scratch pad 1 (2 k entries), scratch pad 2 (16 k entries), and background memory]

Small is beautiful: one small SPM is beautiful. Maybe several smaller SPMs are even more beautiful?
- 5 -
Considered partitions

Example of all considered memory partitions for a total capacity of 4096 bytes (each column gives the number of partitions of that size):

# of partitions | 4k | 2k | 1k | 512 | 256 | 128 | 64
       7        |  0 |  1 |  1 |  1  |  1  |  1  |  2
       6        |  0 |  1 |  1 |  1  |  1  |  2  |  0
       5        |  0 |  1 |  1 |  1  |  2  |  0  |  0
       4        |  0 |  1 |  1 |  2  |  0  |  0  |  0
       3        |  0 |  1 |  2 |  0  |  0  |  0  |  0
       2        |  0 |  2 |  0 |  0  |  0  |  0  |  0
       1        |  1 |  0 |  0 |  0  |  0  |  0  |  0
- 6 -
Optimization for multiple memories

Minimize C = Σ_j Σ_i e_j · n_i · x_{j,i}

with
  e_j: energy required per access to memory j,
  x_{j,i} = 1 if object i is mapped to memory j, = 0 otherwise,
  n_i: number of accesses to memory object i,

subject to the constraints:

  ∀j: Σ_i S_i · x_{j,i} ≤ SSP_j     (capacity of each memory)
  ∀i: Σ_j x_{j,i} = 1               (each object mapped exactly once)

with S_i: size of memory object i, SSP_j: size of memory j.

Main memory is included as a special case of j.
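For the special case of a single SPM next to main memory, the model above reduces to a 0/1 knapsack problem: maximize the saved energy (e_main - e_spm) · n_i over the objects placed into the SPM, subject to Σ S_i ≤ |SPM|. A minimal C sketch, solved by dynamic programming rather than a general ILP solver; the object table and energy numbers are invented for illustration:

```c
#include <assert.h>

/* Hypothetical object table: sizes S_i (bytes) and access counts n_i. */
#define N_OBJ 4
static const int obj_size[N_OBJ]     = {4, 3, 2, 5};
static const int obj_accesses[N_OBJ] = {10, 40, 30, 50};

/* 0/1 knapsack by dynamic programming: maximize the total saving
 * (e_main - e_spm) * n_i of the objects placed in the SPM,
 * subject to the capacity constraint sum S_i <= spm_size. */
int max_saving(int spm_size, int e_main, int e_spm) {
    int best[64] = {0};          /* best[c]: max saving with capacity c */
    assert(spm_size < 64);
    for (int i = 0; i < N_OBJ; i++)
        for (int c = spm_size; c >= obj_size[i]; c--) {
            int cand = best[c - obj_size[i]]
                     + (e_main - e_spm) * obj_accesses[i];
            if (cand > best[c]) best[c] = cand;
        }
    return best[spm_size];
}
```

With the numbers above and a 6-byte SPM, the optimum places the objects of sizes 3 and 2, saving (5-1)·(40+30) = 280 energy units.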
- 7 -
Results for parts of GSM coder/decoder
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current working set.
- 8 -
Energy optimization in horizontally partitioned caches

Energy savings are due to two effects:
• Reduction in the miss rate: aligned with performance; exploited by performance-improvement techniques.
• Less energy per access in the mini cache, since AccessEnergy(mini cache) < AccessEnergy(main cache): inverse to performance; energy can decrease even if there are more misses, the opposite of performance-optimization techniques!

[Figure: processor pipeline connected to a main cache and a mini cache, both backed by memory]

[A. Shrivastava, I. Issenin, N. Dutt: Compilation Techniques for Energy Reduction in Horizontally Partitioned Cache Architectures, Intern. Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2005, pp. 90-96]
- 9 -
Simple energy-oriented heuristics are good

[Figure: memory subsystem energy savings (0%-70%) for susan, adpcm_enc, adpcm_dec, gsm_enc, gsm_dec, jpeg, lame, h263, blowfish_enc, blowfish_dec, dijkstra, and their average, comparing OPT, OM2N, and OMN against the base configuration]

Memory subsystem energy savings achieved by OMN (greedy). Base configuration: all pages are mapped to the main cache. OMN achieves 50% memory energy savings.
- 10 -
Horizontally partitioned caches: results

[Figure: two plots over the benchmarks susan, adpcm_enc, adpcm_dec, gsm_enc, gsm_dec, jpeg, lame, h263, blowfish_enc, blowfish_dec, dijkstra, and their average: the % energy difference (0%-140%) and the % difference in memory latency (0%-7%)]

The best-performing partition consumes 58% more energy:
  %EnergyDifference = (E_best-runtime - E_best-energy) / E_best-energy

The best energy partition loses only 2% performance:
  %LatencyDifference = (R_best-energy - R_best-runtime) / R_best-runtime
- 11 -
Stack allocation

1. Run a stack size analyzer; use the stack as one large array of the returned size. The array is mapped to the SPM if it is small and frequently accessed. [Approach by Steinke et al. @ Dortmund]

2. Potentially map frequently and less frequently used elements to different memories. [Approach by Barua et al.] Considers global & local variables (no instructions).

[O. Avissar, R. Barua, D. Stewart: An Optimal Memory Allocation Scheme for Scratch-Pad-Based Embedded Systems, ACM Transactions on Embedded Computing Systems, 2002]
- 12 -
Split stack

Two stack pointers (overhead):
1. A single stack frame per procedure (fast procedures).
2. Split frames (time-consuming procedures).

[Figure: foo() { int a; float b; … } with variable a on the SRAM stack (SP1) and variable b on the DRAM stack (SP2); both stacks grow independently]
- 13 -
IP model for split stack

Cost function (linear): run-time = access time for global variables + access time for stack variables.

Constraints (linear):
• Each variable has to be mapped to exactly one memory.
• For all paths in the call graph: the total size of all objects mapped to the SPM is ≤ |SPM|.

The model actually considers an arbitrary number of memories, each with its own size and access time(s) ("partitioning").

[Figure: example call graph with functions main, foo, bar, and rome]
- 14 -
Results

[Figure: run-times for the configurations: entire stack fits into SPM; |SPM| = 20% of |data| with split allocation; smallest variables to SPM with split allocation; global vars to SPM with stack in DRAM; entire stack in DRAM; all data]
- 15 -
Heap allocation

1. Run a heap size analyzer; use the heap as one large array of the returned size.
2. Potentially map heap fragments to the SPM.

Problems:
• Object sizes are frequently not known at compile time.
• Illegal references must be avoided.
• Approaches using a level of indirection suffer from additional overhead.
- 16 -
Heap allocation by Barua

Copying between memories is done at region boundaries (before & after loops, at procedure entry and exit). At each copy point, a bin is allocated in the SPM for heap data. Objects always have the same address within the bin.

A modified malloc first checks whether space is available in the bin, and otherwise allocates in main RAM. A modified free deallocates space in the bin.

[A. Dominguez, S. Udayakumaran, R. Barua: Heap Data Allocation to Scratch-Pad Memory in Embedded Systems, Journal of Embedded Computing, 2005]

[Figure: bin contents (objects A-E) at memory offsets 0-1024 across regions 1-4]
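The modified malloc described above can be sketched as a bump-pointer allocator over a fixed bin, falling back to the ordinary heap when the bin is full. Bin size, names, and the bump-pointer policy (no per-object free list, alignment ignored) are our own simplifications of the actual allocator:

```c
#include <assert.h>
#include <stdlib.h>

#define BIN_SIZE 64
static unsigned char spm_bin[BIN_SIZE]; /* stands in for an SPM region */
static size_t bin_used = 0;

/* Allocate from the SPM bin if space remains, else from main RAM. */
void *bin_malloc(size_t n) {
    if (bin_used + n <= BIN_SIZE) {
        void *p = &spm_bin[bin_used];
        bin_used += n;
        return p;
    }
    return malloc(n);               /* fall back to main memory */
}

/* Check whether a pointer lies inside the SPM bin. */
int in_bin(const void *p) {
    const unsigned char *c = p;
    return c >= spm_bin && c < spm_bin + BIN_SIZE;
}
```

A first 48-byte request fits into the 64-byte bin; a following 32-byte request does not and falls back to the main heap.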
- 17 -
Other work

R. McIlroy, P. Dickmann, J. Sventek: Efficient Dynamic Heap Allocation of Scratch-Pad Memory, 7th International Symposium on Memory Management, June 2008. A malloc replacement that:
• allocates blocks in the SPM, with data structures and algorithms for allocating blocks and fractions of blocks in the SPM,
• does not use any profiling information,
• includes no demonstration of speedup or energy efficiency.
- 18 -
Considering SPM size at link time

Goal: avoid executables generated for a specific SPM size.
• The profiler stores variables, sorted by their "frequency per byte" (FPB), in the executable.
• The compiler identifies accesses to variables with unknown memory allocation by using global symbols.
• After loading the program, a custom installer
  - reads the SPM size of the current architecture,
  - decides (using FPBs) which variables to put into the SPM,
  - patches the addresses of variables (incl. stack variables),
  - tries to use SPM space not used on a calling path,
  - also moves code into the SPM.

[N. Nguyen, A. Dominguez, R. Barua: Memory Allocation for Embedded Systems with a Compile-Time-Unknown Scratch-Pad Size, Intern. Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2005, pp. 115-125]
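The installer's FPB-based decision can be sketched as a greedy pass over the variables, sorted by accesses per byte; the variable table in demo() is invented for illustration:

```c
#include <assert.h>

/* Hypothetical variable table; sizes and access counts are made up. */
typedef struct { const char *name; int size; int accesses; } Var;

/* Greedy FPB allocation: visit variables in decreasing
 * accesses-per-byte order; place each into the SPM while it fits.
 * Writes 1/0 flags into in_spm[] and returns the number placed. */
int fpb_allocate(Var v[], int n, int spm_size, int in_spm[]) {
    for (int i = 0; i < n; i++) {      /* sort by n_i/S_i, descending */
        int best = i;
        for (int j = i + 1; j < n; j++)
            if ((long)v[j].accesses * v[best].size >
                (long)v[best].accesses * v[j].size)
                best = j;
        Var t = v[i]; v[i] = v[best]; v[best] = t;
    }
    int used = 0, placed = 0;
    for (int i = 0; i < n; i++) {
        in_spm[i] = (used + v[i].size <= spm_size);
        if (in_spm[i]) { used += v[i].size; placed++; }
    }
    return placed;
}

/* demo: three variables, a 60-byte SPM; b (FPB 4.0) and c (FPB 1.0)
 * are placed, a (FPB 0.5) no longer fits. */
int demo(void) {
    Var v[3] = {{"a", 100, 50}, {"b", 10, 40}, {"c", 50, 50}};
    int flags[3];
    return fpb_allocate(v, 3, 60, flags);
}
```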
- 19 -
Comparison with code optimized for a specific SPM size
© ACM
- 20 -
Scratch-pad/tightly coupled memory based predictability

"Pre-run-time scheduling is often the only practical means of providing predictability in a complex system." [Xu, Parnas]

Time-triggered, statically scheduled operating systems exist; let's do the same for the memory system.

Are SPMs really more timing predictable? Analysis using the aiT timing analyzer:

[Figure: a C program and the SPM size are fed to a memory-aware compiler; the resulting executable is both run on the ARMulator (actual performance) and analyzed with aiT (worst-case execution time)]
- 21 -
Architectures considered

ARM7TDMI with 3 different memory architectures:
• Main memory: LDR cycles (CPU, IF, DF) = (3, 2, 2); STR cycles = (2, 2, 2); * = (1, 2, 0)
• Main memory + unified cache: LDR cycles (CPU, IF, DF) = (3, 12, 6); STR cycles = (2, 12, 3); * = (1, 12, 0)
• Main memory + scratch pad: LDR cycles (CPU, IF, DF) = (3, 0, 2); STR cycles = (2, 0, 0); * = (1, 0, 0)
- 22 -
Results for G.721

[L. Wehmeyer, P. Marwedel: Influence of Onchip Scratchpad Memories on WCET, 4th Intl. Workshop on Worst-Case Execution Time Analysis (WCET), 2004]
[L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design, Automation and Test in Europe (DATE), 2005]

[Figure: WCET vs. actual execution time, using the scratchpad and using the unified cache]

Yes, they are clearly more timing predictable!
- 23 -
Overlaying allocation

Non-overlaying allocation is problematic for multiple hot spots; overlaying allocation:
• effectively results in a kind of compiler-controlled overlays for the SPM,
• requires address assignment within the SPM.

[Figure: CPU with SPM and two memories]
- 24 -
Overlaying allocation by Barua et al. (1)

Potential copying @ control points (major locality changes):
• Beginning and end of procedures
• Before and after loops
• Before and after if statements
• Beginning and end of then and else parts
• Before and after switch statements and their cases

Number the control points!

Example program:

main()   { if(…) {proc_D()} else {…} proc_A(); proc_B(); }
proc_A() { proc_C(); }
proc_B() { proc_C(); while(…) {Y=…;} }
proc_C() { X=…; }
proc_D() { … }

[Figure: region structure of the example (main, the if header, then/else parts, proc_A, proc_B, proc_C, the loop, and the accesses to X and Y) with numbered control points]

[S. Udayakumaran, A. Dominguez, R. Barua: Dynamic Allocation for Scratch-Pad Memory Using Compile-Time Decisions, ACM Transactions on Embedded Computing Systems, Vol. 5, 2006, pp. 472-511]
- 25 -
Overlaying allocation by Barua et al. (2)

At the first control point: initialize the SPM with the variables accessed in the first region, in decreasing order of frequency-per-byte.

Then traverse the control points according to their control point number. At each control point:
• compute the variables that potentially should be swapped in, based on the number of accesses per byte,
• compute the variables that should be swapped out,
• compute the variables that should remain,
• compute the addresses of the variables to be moved in,
• generate swap code.
- 26 -
Runtime reduction with respect to non-overlaying ("static") allocation
© ACM, 2006
- 27 -
Overlaying allocation by Verma et al. (1)

Based on the control flow graph.

[Figure: control flow graph with basic blocks B1-B8, annotated with DEF A, USE A, MOD A, and USE C]

[M. Verma, P. Marwedel: Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004]
- 28 -
Overlaying allocation by Verma et al. (2)

[Figure: the control flow graph, extended to blocks B1-B10, with spill code (SPILL_STORE(A), SPILL_LOAD(A), SPILL_LOAD(C)) inserted at potential copy points]

A global set of ILP equations reflects the cost/benefit relations of the potential copy points. Code is handled like data.
- 29 -
Runtime/energy reduction with respect to non-overlaying ("static") allocation
- 30 -
Overlaying allocation within scratch pad: energy, execution time, code size

[Figure: total energy, execution time, and code size, normalized (0.0-2.0) to Steinke's non-overlaying allocation, for the benchmarks edge_detection, mpeg, adpcm, histogram, multisort, and their average]
- 31 -
Hardware support for block copying

The DMA unit was modeled in VHDL, simulated, and synthesized. The unit makes up only 4% of the processor chip and can be put to sleep when it is unused.

Code size reductions of up to 23% for a 256-byte SPM were determined using the DMA unit instead of the overlaying allocation that uses processor instructions for copying.

[Figure: DMA unit connecting memory, scratch-pad, and processor]

[Lars Wehmeyer, Peter Marwedel: Fast, Efficient and Predictable Memory Accesses, Springer, 2006]
- 32 -
References to large arrays (1): regular accesses

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++)
      U[i][j] = U[i][j] + V[i][k] * W[k][j];

Tiling transforms this into (Sb: tile size):

for (it=0; it<n; it=it+Sb) {
  read_tile V[it:it+Sb-1, 1:n];
  for (jt=0; jt<n; jt=jt+Sb) {
    read_tile U[it:it+Sb-1, jt:jt+Sb-1];
    read_tile W[1:n, jt:jt+Sb-1];
    U[it:it+Sb-1, jt:jt+Sb-1] = U[it:it+Sb-1, jt:jt+Sb-1]
                              + V[it:it+Sb-1, 1:n] * W[1:n, jt:jt+Sb-1];
    write_tile U[it:it+Sb-1, jt:jt+Sb-1];
  }
}

[M. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan, I. Kadayif, A. Parikh: Dynamic Management of Scratch-Pad Memory Space, DAC, 2001, pp. 690-695]
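The structure of the tiled loop nest can be checked with a plain-C version; the read_tile/write_tile copies into the SPM are elided and only the reordered iteration space is shown (matrix size and tile size SB are chosen arbitrarily):

```c
#include <assert.h>
#include <string.h>

#define N  8
#define SB 4   /* tile size; must divide N in this sketch */

/* Tiled matrix multiply U += V*W: one U-tile stays "resident"
 * while the k dimension is streamed over. */
void matmul_tiled(int U[N][N], int V[N][N], int W[N][N]) {
    for (int it = 0; it < N; it += SB)
        for (int jt = 0; jt < N; jt += SB)
            for (int k = 0; k < N; k++)
                for (int i = it; i < it + SB; i++)
                    for (int j = jt; j < jt + SB; j++)
                        U[i][j] += V[i][k] * W[k][j];
}

/* Sanity check: the tiled version must match the naive triple loop. */
int tiled_matches_naive(void) {
    int U[N][N] = {0}, U2[N][N] = {0}, V[N][N], W[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { V[i][j] = i + j; W[i][j] = i - j; }
    matmul_tiled(U, V, W);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                U2[i][j] += V[i][k] * W[k][j];
    return memcmp(U, U2, sizeof U) == 0;
}
```

The tiling only reorders the (i, j, k) iteration space; every product is still accumulated exactly once, which the check above confirms.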
- 33 -
References to large arrays: irregular accesses

for each loop nest L in program P {
  apply loop tiling to L based on the access patterns of regular array references;
  for each assignment to index array X
    update the block minimum and maximum values of X;
  compute the set of array elements that are irregularly referenced
    in the current inter-tile iteration;
  compare the memory access costs for using and not using the SPM;
  if (using the SPM is beneficial)
    execute the intra-tile loop iterations using the SPM
  else
    execute the intra-tile loop iterations without using the SPM
}

[G. Chen, O. Ozturk, M. Kandemir, M. Karakoy: Dynamic Scratch-Pad Memory Management for Irregular Array Access Patterns, DATE, 2006]
- 34 -
Results for irregular approach

[Figure: comparison of a cache, the approach of Kandemir@DAC01, and the irregular-access approach of Kandemir@DATE06]
- 35 -
Hierarchical memories: memory hierarchy layer assignment (MHLA) (IMEC)

n layers with "partitions" consisting of modules …

[Figure: processor connected to layer 1 with partition 1.1 (SPM modules 1.1.1 and 1.1.2) and partition 1.2 (cache modules 1.2.1 and 1.2.2), layer 2 with partitions 2.1 and 2.2, up to partition n.1]

[E. Brockmeyer et al.: Layer Assignment Techniques for Low Energy in Multi-Layered Memory Organisations, Design, Automation and Test in Europe (DATE), 2003]
- 36 -
Memory hierarchy layer assignment (MHLA): copy candidates

int A[250];
A'[0..99] = A[0..99];
for (i=0; i<10; i++)
  for (j=0; j<10; j++)
    for (k=0; k<10; k++)
      for (l=0; l<10; l++)
        f(A'[j*10+l]);
/* size=100; reads(A)=100 */

int A[250];
for (i=0; i<10; i++) {
  A'[0..99] = A[0..99];
  for (j=0; j<10; j++)
    for (k=0; k<10; k++)
      for (l=0; l<10; l++)
        f(A'[j*10+l]);
}
/* size=100; reads(A)=1000 */

int A[250];
for (i=0; i<10; i++)
  for (j=0; j<10; j++) {
    A"[0..9] = A[j*10..j*10+9];
    for (k=0; k<10; k++)
      for (l=0; l<10; l++)
        f(A"[l]);
  }
/* size=10; reads(A)=1000 */

int A[250];
for (i=0; i<10; i++)
  for (j=0; j<10; j++)
    for (k=0; k<10; k++)
      for (l=0; l<10; l++)
        f(A[j*10+l]);
/* size=0; reads(A)=10000 */

Each copy candidate trades SPM size against the number of reads of A from the large memory (A' and A" reside in the small memory):

size:     100   100   10    0
reads(A): 100   1000  1000  10000
- 37 -
Memory hierarchy layer assignment (MHLA): goal

Goal: for each variable, find the permanent layer, partition, and module, and select copy candidates such that energy is minimized. Conflicts between variables must be taken into account.
[E. Brockmeyer et al.: Layer Assignment Techniques for Low Energy in Multi-Layered Memory Organisations, Design, Automation and Test in Europe (DATE), 2003.]
- 38 -
Memory hierarchy layer assignment (MHLA): approach

More general hardware architecture than the Dortmund approach, but no global optimization.

Approach:
• start with an initial variable allocation,
• incrementally improve the initial solution such that total energy is minimized.

[Figure: current and next assignment of array A (250 elements) and its copy candidates A' and A" to platform layers L0-L3; assigned and not-yet-assigned copy candidates are shown together with the resulting access counts per layer and the improvement steps]
- 39 -
Summary

Non-overlaying allocation
• Stack allocation
• Heap allocation
• Timing predictability

Overlaying SPM allocation (single process)
• Barua's call graph approach
• Verma's CDFG-based approach
• Kandemir's tiling approach
• IMEC's multiple levels
- 40 -
Coffee/tea break (if on schedule)
Q&A?
- 41 -
Schedule of the course

(Schedule table repeated from the beginning of the session; see slide 2.)
- 42 -
Static set of multiple applications: saving/restoring at context switch

Saving context switch:
• Utilizes the SPM as a common region shared by all processes.
• Process contents are copied on/off the SPM at each context switch.
• Good for small scratchpads.

[Figure: processes P1-P3 share the whole scratchpad; contents are saved/restored at each context switch]
- 43 -
Non-saving context switch

Non-saving context switch:
• Partitions the SPM into disjoint regions at compile time.
• Each process is assigned an SPM region.
• Contents are copied in during initialization.
• Good for large scratchpads.

[Figure: scratchpad statically partitioned into regions for processes P1, P2, and P3]
- 44 -
Hybrid context switch

Hybrid context switch (Hybrid):
• Disjoint + shared SPM regions.
• Good for all scratchpad sizes.
• Compile-time analysis is similar to the non-saving approach.

[Figure: scratchpad with a private region per process plus a shared region whose contents are saved/restored at context switch]
- 45 -
Multi-process scratchpad allocation: results

• For small SPMs (64 B-512 B), Saving is better.
• For large SPMs (1 kB-4 kB), Non-Saving is better.
• Hybrid is the best for all SPM sizes.
• The energy reduction at a 4 kB SPM is 27% for the hybrid approach.

[Figure: energy consumption (mJ) vs. scratchpad size (64-4096 bytes) for SPA (single process approach), Non-Saving, Saving with its copy energy, and Hybrid with its copy energy; benchmarks: edge detection, adpcm, g721, mpeg]
- 46 -
Dynamic set of multiple applications

With a dynamic set of applications, compile-time partitioning of the SPM is no longer feasible. Therefore, an SPM manager is introduced: runtime decisions, but compile-time supported.

[Figure: applications 1..n access the SPM and main memory through an SPM manager; over time, the SPM address-space region is occupied by App. 1, App. 2, and App. 3 in turn]

[R. Pyka, Ch. Faßbach, M. Verma, H. Falk, P. Marwedel: Operating system integrated energy aware scratchpad allocation strategies for multi-process applications, SCOPES, 2007]
- 47 -
Approach overview

Two steps: compile-time analysis & runtime decisions.
• No need to know all applications at compile time.
• Capable of managing runtime-allocated memory objects.
• Integrates into an embedded operating system.
• Evaluated using the MPARM simulator from U. Bologna.

[Figure: applications 1..n are processed by compile-time transformations in a standard compiler (GCC); the resulting profit values / allocation hints are used by the allocation manager inside the operating system]
- 48 -
Results

MEDIA+ energy (baseline: main memory only): best: Static for 16k, 58%; overall best: Chunk, 49%.

MEDIA+ cycles (baseline: main memory only): best: Static for 16k, 65%; overall best: Chunk, 61%.
- 49 -
Comparison of SPMM to caches for SORT

• Baseline: main memory only.
• SPMM peak energy reduction: 83% at a 4 kB scratchpad.
• Cache peak: 75% at a 2 kB 2-way cache.
• SPMM is capable of outperforming caches.
• OS and libraries are not considered yet.

Chunk allocation results:

SPM size | Δ 4-way
1024     | 74.81%
2048     | 65.35%
4096     | 64.39%
8192     | 65.64%
16384    | 63.73%
- 50 -
SPM+MMU (1)

How can an SPM be used in a system with virtual addressing?
• A virtual SPM typically accesses the MMU and the SPM in parallel: not energy efficient.
• A real SPM suffers from a potentially long VA translation.

Egger, Lee, Shin (Seoul Nat. U.): introduction of a small µTLB that translates recent addresses fast.

[B. Egger, J. Lee, H. Shin: Scratchpad memory management for portable systems with a memory management unit, CASES, 2006, pp. 321-330 (best paper)]

[Figure: processor with cache, SPM, and MMU]
- 51 -
SPM+MMU (2)

• The µTLB generates the physical address in one cycle.
• If the address corresponds to the SPM, the SPM is used; otherwise, the mini-cache is accessed.
• The mini-cache provides reasonable performance for non-optimized code.
• A µTLB miss triggers the main TLB/MMU.
• The SPM is used only for instructions; instructions are stored in pages; pages are classified as cacheable, non-cacheable, and "pageable" (= suitable for the SPM).

[Figure: the CPU core issues an instruction VA to the µTLB; the resulting PA is compared against the SPM base register and routed either to the SPM or to the mini-cache (tag RAM + data RAM); misses go to the MMU with its unified TLB]
- 52 -
SPM+MMU (3)

• Application binaries are modified: frequently executed code is put into pageable pages.
• Initially, the page-table entries for pageable code are marked invalid.
• If an invalid page is accessed, a page-table exception invokes the SPM manager (SPMM).
• The SPMM allocates space in the SPM and sets the page-table entry.
• If the SPMM detects more requests than fit into the SPM, SPM eviction is started.
• The compiler does not need to know the SPM size (as with virtual memory).

[Figure: virtual address space (stack/heap, pageable, cached, and uncached regions) mapped onto physical memory: the pageable region onto the SPM, the remaining regions onto main memory]
- 53 -
Extension to SNACK-pop (post-pass optimization)

[H. Cho, B. Egger, J. Lee, H. Shin: Dynamic Data Scratchpad Memory Management for a Memory Subsystem with an MMU, LCTES, 2007]

[Figure: post-pass tool flow: object files and libraries are disassembled; profile code generation yields a profiling image that runs on an architecture simulator with input data; the resulting profile data drives the construction of a dynamic call graph, the cloning of functions, and the insertion of SPM-manager calls (supported by an ILP solver); finally, the executable image is generated]
- 54 -
Cloning of functions

[Figure: call graph before and after cloning; function g, reachable from both m and n, is duplicated into g and g' so that each calling context can be handled separately]

• Computation of which block should be moved in and out for a certain edge.
• Generation of an ILP: decisions about copy operations are made at compile time.
- 55 -
Results for SNACK-pop (1)
© ACM, 2007
- 56 -
Results for SNACK-pop (2)
© ACM, 2007
- 57 -
Multi-processor ARM (MPARM) framework

• Homogeneous SMP, ~ CELL processor.
• Processing unit: ARM7T processor.
• Shared coherent main memory.
• Private memory: scratchpad memory.

[Figure: four ARM cores, each with its own SPM, connected via an interconnect (AMBA or STBus) to the shared main memory, an interrupt device, and a semaphore device]
- 58 -
Application example: multi-processor edge detection

• Source, sink, and n compute processors.
• Each image is processed by an independent compute processor.
• Communication overhead is minimized.

[Figure: pipeline of a source processor, the compute processors, and a sink processor]
- 59 -
Results: scratchpad overlay for edge detection

[Figure: total energy consumption (mJ) vs. scratchpad size per processor (128 bytes to 16 kB) for scratchpad overlay (SO) with 1, 2, 3, and 4 compute processors (CPs)]

Two CPs are better than one CP; beyond that, the energy consumption stabilizes. Best scratchpad size: 4 kB (1 CP & 2 CP), 8 kB (3 CP & 4 CP).
- 60 -
Results: DES encryption

DES encryption on 4 processors: 2 controllers + 2 compute engines. Energy values are from ST Microelectronics. Result of an ongoing cooperation between U. Bologna and U. Dortmund, supported by the ARTIST2 network of excellence.

[Figure: energy consumption (µJ) vs. scratchpad size (0 to 16 kB)]
- 61 -
MPSoC with shared SPMs
[M. Kandemir, I. Kadayif, A. Choudhary, J. Ramanujam, I. Kolcu: Compiler-Directed Scratch Pad Memory Optimization for Embedded Multiprocessors, IEEE Trans. on VLSI, Vol. 12, 2004, pp. 281-286]
© IEEE, 2004
- 62 -
Energy benefits despite large latencies for remote SPMs
© IEEE, 2004
DRAM: 80 cycles
- 63 -
Extensions

• Using DRAM.
• Applications to flash memory (copy code or execute in place): according to our own experiments, very much parameter dependent.
• Trying to imitate the advantages of SPMs with caches: partitioned caches, etc.

PhD thesis of Lars Wehmeyer.
- 64 -
Improving predictability for caches

• Loop caches
• Mapping code to less-used part(s) of the index space
• Cache locking/freezing
• Mapping pieces of software to specific ways. Methods:
  - generating the appropriate way in software,
  - allocating certain parts of the address space to a specific way,
  - including way identifiers in the virtual-to-real address translation.

"Caches behave almost like a scratch pad."
- 65 -
Way prediction/selective direct mapping
[M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, K. Roy: Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct-Mapping, MICRO-34, 2001]
© ACM
- 66 -
Hardware organization for way prediction
© ACM
- 67 -
Results for the paper on way prediction (1)
Cache energy and prediction overhead:

Energy component                               Relative energy
Parallel access cache read (4 ways read)       1.00
1 way read                                     0.21
Cache write                                    0.24
Tag array energy (incl. in the above numbers)  0.06
1024x4bit prediction table read/write          0.007
© ACM
System configuration parameters:

Instruction issue & decode bandwidth: 8 issues per cycle
L1 I-Cache: 16K, 4-way, 1 cycle
Base L1 D-Cache: 16K, 4-way, 1 or 2 cycles, 2 ports
L2 cache: 1M, 8-way, 12 cycle latency
Memory access latency: 80 cycles + 4 cycles per 8 bytes
Reorder buffer size: 64
LSQ size: 32
Branch predictor: 2-level hybrid
Results for the paper on way prediction (2)
© ACM
Results for the paper on way prediction (3)
© ACM
Locality of Reference
Spatial locality: the concept that the likelihood of referencing an item is higher if an item close to it was just referenced.
Temporal locality: the concept that an item referenced at one point in time will be referenced again in the near future.

for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    for (k=0; k<N; k++) {
      c[i][j] += a[i][k] * b[k][j]; }}}
Locality: Matrix Multiplication
Between iterations I1 = (i,j,k) and I2 = (i,j,k+1), the access c[i][j] exhibits temporal locality.

for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    for (k=0; k<N; k++) {
      c[i][j] += a[i][k] * b[k][j]; }}}
Row major order vs. column major order allocation

For an array p[j][k]: row major order (C) stores the elements of a row contiguously, so p[j][k] and p[j][k+1] are neighbors in memory. Column major order (FORTRAN) stores the elements of a column contiguously, so p[j][k] and p[j+1][k] are neighbors.
Locality: Matrix Multiplication (2)
Between iterations I1 = (i,j,k) and I2 = (i,j,k+1):
- c[i][j]: temporal locality (layout independent)
- a[i][k]: spatial locality (row major order)
- b[k][j]: spatial locality (column major order)
Data Layout Transformations - Example
Change layout of arrays according to the access pattern
Poor cache behavior:

int c[4][4];
for (i=0; i<4; i++) {
  for (j=0; j<2; j++) {
    c[i][j] = ...; }}

Good cache behavior:

int c[8];
for (i=0; i<4; i++) {
  for (j=0; j<2; j++) {
    c[2*i+j] = ...; }}
Increases the spatial locality of array accesses
[V. Loechner et al.: Precise Data Locality Optimization of Nested Loops, The Journal of Supercomputing, Vol 21, 2002]
Cache Aware Compiler Optimizations
Performance of a cache depends upon locality of references.
Data Cache: Loop Restructuring Transformations Temporal Locality Data Layout Transformations Spatial Locality
Instruction Cache: Code Layout Transformations Spatial Locality
Loop Restructuring Transformations- Loop Interchange -
Poor cache behavior:

for (i=0; i<M; i++) {
  for (j=0; j<N; j++) {
    c[j][i] += ... }}

Good cache behavior:

for (j=0; j<N; j++) {
  for (i=0; i<M; i++) {
    c[j][i] += ... }}
Assumption: row major order
Best performance if innermost loop corresponds to rightmost array index
[D. F. Bacon et al.: Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 1994]
Loop interchange: Example
…
#define iter 400000
int a[20][20][20];

void computeijk() { int i,j,k;
  for (i = 0; i < 20; i++)
    for (j = 0; j < 20; j++)
      for (k = 0; k < 20; k++)
        a[i][j][k] += a[i][j][k];
}

void computeikj() { int i,j,k;
  for (i = 0; i < 20; i++)
    for (j = 0; j < 20; j++)
      for (k = 0; k < 20; k++)
        a[i][k][j] += a[i][k][j];
}
…
start = time(&start);
for (z = 0; z < iter; z++) computeijk();
end = time(&end);
printf("ijk=%16.9f\n", 1.0*difftime(end,start));

(SUIF interchanges array indexes instead of loops; low resolution time calls)
Improved locality
Results: strong influence of the memory architecture

Runtime [s] for the loop structure i j k, reduction to [%] per processor: Ti C6xx: ~57%; Intel Pentium: 3.2%; Sun SPARC: 35%.
Dramatic impact of locality
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
Loop tiling/loop blocking - original version
for (i=1; i<=N; i++)
  for (k=1; k<=N; k++) {
    r = X[i][k]; /* to be allocated to a register */
    for (j=1; j<=N; j++)
      Z[i][j] += r * Y[k][j];
  }

Never reusing information in the cache for Y and Z if N is large or the cache is small (O(N³) references for Z and Y).
Loop tiling/loop blocking - tiled version
for (kk=1; kk<=N; kk+=B)
  for (jj=1; jj<=N; jj+=B)
    for (i=1; i<=N; i++)
      for (k=kk; k<=min(kk+B-1,N); k++) {
        r = X[i][k]; /* to be allocated to a register */
        for (j=jj; j<=min(jj+B-1,N); j++)
          Z[i][j] += r * Y[k][j];
      }

Reuse factor of B for Z and N for Y; O(N³/B) accesses to main memory for Z. The same elements are reused in the next iteration of i.
Example
In practice, results by Buchwald are disappointing. One of the few cases where an improvement was achieved: source similar to matrix multiplication (runtime plotted over the tiling factor, on SPARC and Pentium).
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
Summary
Overlaying SPM allocation
• Multiple processes
  - Dortmund approach: saving, non-saving, hybrid
  - Dynamic set of processes
  - MMU-based approach (SNU)
• Multiple processors
  - Verma's approach (MPARM)
  - Kandemir's MPSoC approach
Caches
• Locked caches (presentation by Heiko Falk)
• Locality of reference
• Loop transformations, tiling
Questions (if on schedule) ?
Q&A?
Loop fusion (merging), loop fission
Fission (two separate loops):
for (j=0; j<=n; j++) p[j] = ...;
for (j=0; j<=n; j++) p[j] = p[j] + ...;

Fusion (merged loop):
for (j=0; j<=n; j++) { p[j] = ...; p[j] = p[j] + ...; }

Fission: loops small enough to allow zero-overhead loops; better chances for parallel execution.
Fusion: better locality for the accesses to p.

Which of the two versions is best? An architecture-aware compiler should select the best version.
Example: simple loops
#define size 30
#define iter 40000
int a[size][size];
float b[size][size];

void ss1() { int i,j;   /* two separate loop nests */
  for (i=0; i<size; i++) { for (j=0; j<size; j++) { a[i][j] += 17; }}
  for (i=0; i<size; i++) { for (j=0; j<size; j++) { b[i][j] -= 13; }}}

void ms1() { int i,j;   /* outer loops merged */
  for (i=0; i<size; i++) {
    for (j=0; j<size; j++) { a[i][j] += 17; }
    for (j=0; j<size; j++) { b[i][j] -= 13; }}}

void mm1() { int i,j;   /* fully merged */
  for (i=0; i<size; i++) { for (j=0; j<size; j++) {
    a[i][j] += 17; b[i][j] -= 13; }}}
Results: simple loops
Runtime [% of maximum] for ss1, ms1, mm1 on four platforms (x86 gcc 3.2 -O3, x86 gcc 2.95 -O3, Sparc gcc 3.x -O1, Sparc gcc 3.x -O3):
Merged loops are superior, except on Sparc with -O3.
Loop unrolling
Original:
for (j=0; j<=n; j++) p[j] = ...;

Unrolled (factor = 2):
for (j=0; j<=n; j+=2) { p[j] = ...; p[j+1] = ...; }
Better locality for access to p.
Fewer branches per execution of the loop. More opportunities for optimizations.
Tradeoff between code size and improvement.
Extreme case: completely unrolled loop (no branch).
Example: matrixmult
#define s 30
#define iter 4000
int a[s][s], b[s][s], c[s][s];

void compute() { int i,j,k;
  for (i=0; i<s; i++) {
    for (j=0; j<s; j++) {
      for (k=0; k<s; k++) {
        c[i][k] += a[i][j]*b[j][k]; }}}}

/* after unrolling by SUIF (factor 2): */
extern void compute2() { int i, j, k;
  for (i = 0; i < 30; i++) {
    for (j = 0; j < 30; j++) {
      for (k = 0; k <= 28; k += 2) {
        { int *suif_tmp;
          suif_tmp = &c[i][k];
          *suif_tmp = *suif_tmp + a[i][j]*b[j][k]; }
        { int *suif_tmp;
          suif_tmp = &c[i][k+1];
          *suif_tmp = *suif_tmp + a[i][j]*b[j][k+1]; }}}}
  return; }
Results
Runtime over the unroll factor for Ti C6xx, Intel Pentium, and Sun SPARC: benefits are quite small; penalties may be large.
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
Results: benefits for loop dependences
#define s 50
#define iter 150000
int a[s][s], b[s][s];

void compute() { int i,k;
  for (i = 0; i < s; i++) {
    for (k = 1; k < s; k++) {
      a[i][k] = b[i][k];
      b[i][k] = a[i][k-1]; }}}

Small benefits over the unroll factor (Ti C6xx).
[Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]
Prefetching
Prefetch instructions load values into the cache; the pipeline is not stalled for prefetching.
Prefetching instructions were introduced in ~1985-1995. Potentially, all miss latencies can be avoided. Disadvantages:
• Increased # of instructions
• Potential premature eviction of cache lines
• Potentially pre-loads lines that are never used
Steps:
• Determination of references requiring prefetches
• Insertion of prefetches (early enough!)
[R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures, Morgan-Kaufman, 2002]
Results for prefetching
[Mowry, as cited by R. Allen & K. Kennedy]
© Morgan-Kaufman, 2002
Not very impressive!
Optimization for exploiting processor-memory interface: Problem Definition (1)
[A. Shrivastava, E. Earlie, N. Dutt, A. Nicolau: Aggregating processor free time for energy reduction,Intern. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 154-159]
XScale is stalled for 30% of the time, but each stall duration is small:
• Average stall duration = 4 cycles
• Longest stall duration < 100 cycles
Break-even stall duration for profitable switching: 360 cycles; maximum processor stall: < 100 cycles.
It is therefore NOT possible to switch the processor to IDLE mode.
Based on slide byA. Shrivastava
Optimization for exploiting processor-memory interface: Problem Definition (2)
CT (Computation Time): Time to execute an iteration of the loop, assuming all data is present in the cache
DT (Data Transfer Time): Time to transfer data required by an iteration of a loop between cache and memory
Consider the execution of a memory-bound loop (DT > CT): the processor has to stall.
Over time, both the processor activity and the memory bus activity are discontinuous.
for (int i=0; i<1000; i++)
c[i] = a[i] + b[i];
Based on slide by A. Shrivastava
Optimization for exploiting processor-memory interface: Prefetching Solution
for (int i=0; i<1000; i++) {
  prefetch a[i+4];
  prefetch b[i+4];
  prefetch c[i+4];
  c[i] = a[i] + b[i];
}

Each processor activity period increases; the memory activity becomes continuous; the total execution time is reduced.
Based on slide by A. Shrivastava
Code Layout Transformations (1)
Execution counts based approach: sort the functions according to their execution counts: f4 > f1 > f2 > f5 > f3. Place functions in decreasing order of execution counts.
Execution counts: f1 (1100), f2 (900), f3 (400), f4 (2000), f5 (700).
[S. McFarling: Program optimization for instruction caches, 3rd International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), 1989]
Code Layout Transformations (2)
Execution counts based approach: sort the functions according to their execution counts (f4 > f1 > f2 > f5 > f3) and place them in decreasing order: f4, f1, f2, f5, f3.
The transformation increases spatial locality, but it does not take the calling order into account.
Code Layout Transformations (3)
Call-graph based algorithm: create a weighted call-graph and place functions according to a weighted depth-first traversal: f4 > f2 > f1 > f3 > f5.
The traversal starts at f4 (weight 2000). Increases spatial locality.
[W. W. Hwu et al.: Achieving high instruction cache performance with an optimizing compiler, 16th Annual International Symposium on Computer Architecture, 1989]
Code Layout Transformations (6)
Call-graph based algorithm: create a weighted call-graph and place functions according to a weighted depth-first traversal: f4 > f2 > f1 > f3 > f5.
Combined with placing frequently executed traces at the top of the code space for functions.
Increases spatial locality. Resulting placement: f4, f2, f1, f3, f5.
Memory allocation for arrays
Initial arrays Unfolded allocation
Inter-array folding and Intra-array folding
[E. de Greef, F. Catthoor, H. De Man: Array Placement for Storage Size Reduction in Embedded Multimedia Systems, Intern. Conf. on Application-Specific Systems, Architectures and Processors, 1997]
Requires optimization of address computations
Array folding is implemented in IMEC's DTSE optimization. It leads to costly div and mod operations; the ADOPT address optimizations remove these operations, e.g. mod → increment-and-reset on pointers (indexes).
An architecture-aware compiler should find the best transformation.

for (i=0; i<20; i++)
  B[i % 4];

tmp = 0;
for (i=0; i<20; i++) {
  if (tmp >= 4) tmp -= 4;
  B[tmp];
  tmp++;
}
Results (Mcycles for cavity benchmark)
ADOPT&DTSE required to achieve real benefit
[C.Ghez et al.: Systematic high-level Address Code Transformations for Piece-wise Linear Indexing: Illustration on a Medical Imaging Algorithm, IEEE WS on Signal Processing System: design & implementation, 2000, pp. 623-632]
Comparison Flash/Microdrive
Property                        Sandisk Type I Flash  Sandisk Type II Flash  IBM Microdrive DSCM-10340
Capacity [MB]                   64                    300                    340
Power [W] (standby/operating)   0.15/0.66             0.15/0.66              0.07/0.83
Write cycles                    300,000               300,000                unlimited
Mean time between failures [h]  >1,000,000            >1,000,000             service life = min(5 years, 8800 h operating)
Error rates, uncorrectable      <1 per 10^14          <1 per 10^14           <1 per 10^13
Max. power-ons                  unlimited             unlimited              300,000
Shock tolerance                 2000 G; 2000 G        2000 G                 175 G; 1500 G

Source: Hennessy/Patterson, Computer Architecture, 2002
NOR- and NAND-Flash
NOR: one transistor between bit line and ground.
NAND: several transistors in series between bit line and ground.
[www.samsung.com/Products/Semiconductor/Flash/FlashNews/FlashStructure.htm] (2007)
Properties of NOR- and NAND-Flash memories

Type/Property     NOR                                    NAND
Random access     Yes                                    No
Erase block       Slow                                   Fast
Size of cell      Larger                                 Small
Reliability       Larger                                 Smaller
Execute in place  Yes                                    No
Applications      Code storage, boot flash, set top box  Data storage, USB sticks, memory cards

[www.samsung.com/Products/Semiconductor/Flash/FlashNews/FlashStructure.htm]
Characteristics of NAND Flash memory
Memory is partitioned into blocks (typ. 16-256 KB); blocks are partitioned into pages (typ. 0.5-5 KB). Read/write operations are performed in page units.

Operation      Single Level Cell (SLC)   Multi Level Cell (MLC)
Read (page)    25 µs                     >> 25 µs
Write (page)   300 µs                    >> 300 µs
Erase (block)  2 ms                      1.5 ms

J. Lee, S. Kim, H. Kwin, C. Hyun, S. Ahn, J. Choi, D. Lee, S. Noh: Block Recycling Schemes and Their Cost-based Optimization in NAND Flash Memory Based Storage System, EMSOFT'07, Sept. 2007
Page/sector mapping flash translation layer (FTL)
An inverted page table is stored in the flash memory (extra bits); the "normal" page table is constructed during initialization. The page mapping table translates a logical sector number into a page (plus extra bits). The page table may become large. Used in low-capacity NOR flash memories.
Exploiting regularity
Usually, there are long sequences of sequential writes.
Block mapping flash translation layer (FTL)
Mapping tables are smaller than for page-based FTLs; used in high-capacity NAND flash memories. The block mapping table translates only the block part of the logical sector number; the physical sector number is the concatenation of the mapped block number and the page offset within the block. The overall operation is simple, but successive writes require copying into a new block, giving degraded performance for random and repeated writes. Hybrid schemes address this.
Wear-leveling
Example (Lofgren et al., 2000, 2003):
• Each erase unit carries an erase counter
• One erase unit is set aside as a spare
• When one of the most worn-out units is reclaimed, its counter is compared to that of the least worn-out unit. If the difference is large:
  - content of the least worn-out unit (≈ constants) → spare
  - content of the most worn-out unit → least worn-out unit
  - the most worn-out unit becomes the new spare
A counter increment may be lost if power is lost between the erase and the counter update. There are attempts to avoid storing the erase counter in the same erase unit.
Source: Gal, Toledo, ACM Computing Surveys, June 2005
Flash-specific file systems
The two-layer approach can be inefficient:
• The FTL emulates flash as a magnetic disc
• The standard file system assumes a magnetic disc
Example: deleted sectors are not marked and hence never reclaimed.
Log-structured file systems just append new information:
• For disc-based file systems: fast writes, slow reads (head movement for gather operations)
• Ideal for flash-based file systems: writes go to new sectors; reads are not slow (no head movement)
Specific log-based flash file systems: JFFS2 (NOR), YAFFS (NAND).
Source: Gal, Toledo, ACM Computing Surveys, June 2005
Flash-aware application data structures
Direct use of flash-specific properties in applications
• Typically requires partitioning of the flash memory and possibly wasted space within partitions
Execute-in-place
• Used with NOR-flash, directly addressable by processor
• Problematic in systems without MMU (no FTL feasible!): - instructions must be stored contiguously in flash
- instructions cannot move
• Code needed during erase cannot be stored in flash, unless suspended writing or erasing feasible
Source: Gal, Toledo, ACM Computing Surveys, June 2005
Flash memory as main memory
One approach has been published (Wu, Zwaenepoel, 1994): it uses an MMU; RAM + flash are mapped into the memory map. Reads from flash fetch single words; writes copy a block of data into RAM, and all updates are done in RAM. If the RAM is full, a block is copied back to flash. The crucial issue is the speed of writes:
the proposal is based on a wide bus between flash and RAM, so that writes are sufficiently fast; larger erase units and increased wear-out become feasible.
M. Wu, W. Zwaenepoel: eNVy: A nonvolatile, main memory storage system. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. 1994, p. 86–97.
Memory hierarchies beyond main memory
Massive datasets are being collected everywhere Storage management software is billion-$ industry
Examples (2002):
• Phone: AT&T 20 TB phone call database, wireless tracking
• Consumer: WalMart 70 TB database, buying patterns
• Web: web crawl of 200M pages and 2000M links; Akamai stores 7 billion clicks per day
• Geography: NASA satellites generate 1.2 TB per day
[© Lars Arge, I/O-Algorithms, http://www.daimi.au.dk/~large/ioS07/]
More New Information Over Next 2 Years Than in All Previous History
Example: LIDAR Terrain Data
~1.2 km swath; ~280 km/h at 1500-2000 m altitude; ~1.5 m between measurements.
COWI A/S (and others) is currently scanning Denmark.
[© Lars Arge, I/O-Algorithms, http://www.daimi.au.dk/~large/ioS07/]
Application Example: Flooding Prediction
+1 meter
+2 meter
[© Lars Arge, I/O-Algorithms, http://www.daimi.au.dk/~large/ioS07/]
External Memory Model

N = # of items in the problem instance
B = # of items per disk block
M = # of items that fit in main memory
T = # of items in the output
I/O: moving one block between memory and disk.
We assume (for convenience) that M > B².
[© Lars Arge, I/O-Algorithms, http://www.daimi.au.dk/~large/ioS07/]
Scalability Problems: Block Access Matters
Example: reading an array from disk
• Array size N = 10 elements
• Disk block size B = 2 elements
• Main memory size M = 4 elements (2 blocks)
Algorithm 1 (scattered layout): N = 10 I/Os. Algorithm 2 (blocked layout): N/B = 5 I/Os.
The difference between N and N/B is large since the block size is large.
• Example: N = 256 x 10^6, B = 8000, 1 ms disk access time:
  N I/Os take 256 x 10^3 sec = 4266 min = 71 hr; N/B I/Os take 256/8 sec = 32 sec.
[© Lars Arge, I/O-Algorithms, http://www.daimi.au.dk/~large/ioS07/]
Re-writing algorithms for memory hierarchies
Analysis of algorithm complexity mostly uses the RAM (random access machine; constant memory access times) model. This model is outdated: memory hierarchies should be taken into account explicitly. Example: usually, divide-&-conquer algorithms are good.
"Cache"-oblivious algorithms are good for any size of the faster memory and any block size, assuming:
• Optimal replacement (Belady's algorithm; unlikely to ever be automatic)
• 2 memory levels considered (there can be more)
• Full associativity
• Automatic replacement
[Piyush Kumar: Cache Oblivious Algorithms, in: U. Meyer et al. (eds.): Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, Volume 2625, 2003, pp. 193-212]
[Naila Rahman: Algorithms for Hardware Caches and TLB, in: U. Meyer et al. (eds.): Algorithms for Memory Hierarchies, Lecture Notes in Computer Science, Volume 2625, 2003, pp. 171-192]
Fundamental Bounds
Operation    Internal    External
Scanning:    N           N/B
Sorting:     N log2 N    (N/B) log_{M/B}(N/B)
Permuting:   N           min{N, (N/B) log_{M/B}(N/B)}
Searching:   log2 N      log_B N

Note:
• Linear I/O: O(N/B)
• Permuting is not linear
• Permuting and sorting bounds are equal in all practical cases
• The B factor is VERY important
Which results apply to flash memory?
[© Lars Arge, I/O-Algorithms, http://www.daimi.au.dk/~large/ioS07/]
Summary
Exploitation of caches
• Simple loop transformations
• Prefetching
• Code layout transformations
Exploitation of the main (primary) memory
• Array folding, DTSE (IMEC)
“Secondary” memory
• Exploitation of flash memory as secondary memory
• Algorithms exploiting secondary memory
Brief break (if on schedule)
Q&A?