Upload
erin-wilkinson
View
213
Download
0
Embed Size (px)
Citation preview
fakultät für informatikinformatik 12
technische universität dortmund
Optimizing embedded softwarefor timing-predictabilityand memory-awareness
Peter Marwedel
TU Dortmund, Informatik 12Informatik Centrum Dortmund (ICD)
Dortmund, Germany
- 2 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
What is an Embedded System?
- 3 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Embedded Systems
“Dortmund“ Definition [Peter Marwedel]:
Information processing systems embedded into a larger product
Berkeley Modell [Ed Lee]:
Embedded software is software integrated with physical* processes. The technical problem is managing time and concurrency in computational systems.
Main reason for buying is not information processing
* “cyber-physical systems”
- 4 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
The Problem with Timing in Current IT
Difficulty in expressing timing in specs PCs are designed for good average case timing Sources of unpredictable or difficult-to-predict timing
• Caches• Virtual memory• Many communication systems• Speculation• Multiprocessing• Shared resources
Real timing abstracted in O-notation “Computability” ill-defined for embedded software
- 5 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Problems with Classical CS
“The lack of timing in the core abstraction is a flaw, from the perspective of embedded software, …”
“What is needed is nearly a reinvention of computer science “Ed Lee: Absolutely Positively on Time, IEEE Computer, July, 2005
Introduction of the term “Cyber-physical system” (CPS)Initially another term for ES; Emphasizing physics:
• time• energy• space
- 6 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Motivation
Worst-case execution time aware compilation
Memory-architecture aware compilation
Summary
Outline
- 7 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Worst Case Execution Time Analysis
© AbsInt
Real-timeconstraint
WCETEST
- 8 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Current Trial-and-Error Based Development
1. Specification of ES software
2. Generation of Code (ANSI-C or similar)
3. Compilation for given target processor
4. Execution and/or simulation of machine code,using a (e.g. random) set of input data
5. Measurement-based computation of “estimated worst case execution time” (WCETmeas)
6. Adding safety margin (e.g. 20%) on top of WCETmeas and call this “WCEThypo”
7. If “WCEThypo” > real-time constraint: change some detail, go back to 1 or 2.
- 9 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Problems with this Approach
Dependability Computed “WCEThypo” not a safe approximation
Time constraint may be violated
Design time How to find necessary changes?
How many iterations until successful?
“Make the common case fast” a wrong approach for RT-systems Computer architecture and compiler techniques focus on average speed
Circuit designers know it’s wrong
Compiler designers (typically) don’t
“Optimizing” compilers unaware of costfunctions other than code size
period
Common case fast
- 10 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Contributions towards Solving the Problem
Predictable Hardware Architectures (currently an exception) Predictable Operating Systems Modified Algorithms: same execution time for all data (too far) Design of formal WCETEST analysis tools
Design of a Compiler which• considers WCET as cost function• Approach:
- Integration with WCETEST analysis
- Using timing model for specific optimizations
- 11 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
A WCET-Aware C-Compiler (WCC)
ICD-CParser
ANSI C ICD-CIR
CodeSelector
LLIR
RegisterAllocator
LLIR CRL2 CRL2
aiT WCET
Analysis
CRL2 +
WCETEST
CRL2 LLIR
WCET-Opt. ASM
Analyses,
Optimi-
zations
Target Processor:Infineon TriCoreTC1796
[H. Falk et al.]
- 12 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Challenges for WCETEST-Minimization
Worst-Case Execution Path (WCEP)
WCETEST of a program = Length of longest execution path (WCEP) in that program
WCETEST-Minimization: Reduction of the longest path
Other optimizations do not result in a reduction of WCETEST
Optimizations need to know the WCEP
- 13 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
WCET-oriented optimizations
Extended loop analysis (CGO 09) Instruction cache locking (CODES/ISSS 07) Cache partitioning (WCET-Workshop 09) Procedure cloning (WCET-Workshop 07, CODES/ISSS 07,
SCOPES 08) Procedure positioning (ECRTS 08) Function inlining (SMART 09) Loop unswitching/invariant paths (SCOPES 09) Loop unrolling (ECRTS 09) Register allocation (DAC 09) Scratchpad optimization (DAC 09) Extension towards multi-objective optimization (RTSS 08)
- 14 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Loop Unrolling as an Example
Unrolling replaces the original loop with several instances of the loop body Positive Effects
• Reduced overhead for loop control• Enables instruction level parallelism (ILP)• Offers potential for following optimizations
Unroll early in optimization chain Negative Effects
• Aggressive unrolling leads to I-cache overflows• Additional spill code instructions • Control code may cancel positive effects
Consequences of transformation hardly knownConsequences of transformation hardly known
- 15 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
WCET-aware Loop Unrolling via Backannotation
WCET-information available at assembly level
Unrolling to be applied at internal representation of source code
Solution: Back-annotation: experimental worst-case execution time aware compiler WCC allows feeding information from assembly code back to source code
• WCET data• Assembly code size• Amount of spill code
Memory architecture info available
High-Level IR(Source Code)
Low-Level IR(Assembly Code)
Back-Annotation
Mem. Spec.
- 16 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Results – WCET
100%: Avg. WCET for all benchmarks with –O3 & no unrolling WCET reduction between 10.2% and 15.4%WCET-driven Unrolling outperforms std. unrolling by 13.7%
Rel
ativ
e W
CE
T [
%]
ST
AN
DA
RD
- 17 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Register Allocation: Results
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
110%
Re
lati
ve
WC
ET E
ST
[%]
(Op
tim
iza
tio
n L
ev
el -
O3
)
100% = WCETEST using Standard Graph Coloring (highest degree)
93%
24%
69%
[H. Falk]
- 18 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Relative WCETEST with I-Cache Locking5 Benchmarks/ARM920T/Postpass-Optimization
(ARM920T)
[S. Plazar et al.]
- 19 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Motivation
Worst-case execution time aware compilation
Memory-architecture aware compilation
Summary
Outline
- 20 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Importance of Energy Efficiency
Cou
rtes
y: P
hilip
s
© Hug
o D
e M
an,
IME
C,
200
7
Efficient software design needed, otherwise, the price for software flexibility cannot be paid.Efficient software design needed, otherwise, the price for software flexibility cannot be paid.
poor design techniques
IPE=Inherent power efficiencyAmI=Ambient Intelligence
GO
Ps/
J
- 21 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Energy Consumption in Mobile Devices
[O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005;] Thanks to Thorsten Koch (Nokia/ Univ. Dortmund) for providing this source.
- 22 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Trends for the Speeds
Speed gap between processor and main DRAM increases
[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
2
4
8
2 4 5
Speed
years
CPU Per
form
ance
(1.5
-2 p
.a.)
DRAM (1.07 p.a.)
31
2x every 2 years
10
Similar problems also for embedded systems & MPSoCs
In the future:Memory access times >> processor cycle times
“Memory wall” problem
- 23 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Hierarchical Memoriesusing Scratch Pad Memories (SPM)
Address space
Fast
Energy-efficient
Usually timing- predictable
scratch pad memory
0
FFF..
main
SPM
processor
HierarchyHierarchy
ExampleExample
no tag memory
SPM
selectSelection is by an appropriate address decoder (simple!)
SPM is a small, physically separate memory mapped into the address space
- 24 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Migration of Data and InstructionsGlobal Optimization Model
Which object (array, loop, etc.) to be stored in SPM?
Non-overlaying memory allocation:
Gain gk & size sk for each object k.
Maximise gain G = gk, respecting size of SPM SSP sk.
Solution: knapsack algorithm.
Overlaying allocation:
Moving objects back and forth between hierarchy levels
Processor
Scratch pad memory,capacity SSP
main memory
?
For i .{ }
for j ..{ }
while ...
Repeat
function ...
Array ...
Int ...
Array
Example:
- 25 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Scratchpad optimizations
Energy model (PATMOS, 2001) Comparison with cache (TR 762, 2001; CODES 2002) Non-overlaying allocation (TR 756, 2001; DATE 2002) Overlaying allocation (ISSS 2002; PACS 2003; CODES 2004;
GI 2005; TVLSI 2006; SAMOS 2006; Springer 2007) Partitioning (ASPDAC 2003; WMPI 2004) Predictability (ASPDAC 2004; WCET 2004; DATE 2005;
Springer 2006) Cooperation with cache (DATE 2004; TCAD 2006) Multi-processes (ESTIMEDIA 2005) Allocation at run-time (SCOPES 2007)
- 26 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Dynamic Set of Multiple Applications
MEMMEM
CPU
SPM
SPMManager
SPMManager
App. 2App. 2
App. 1App. 1
App. nApp. n
SP
M
App. 3
App. 2
App. 1
t
Address space:
SPM
??
Compile-time partitioning of SPM no longer feasible
Introduction of SPM-manager
Runtime decisions, butcompile-time supported
[R. Pyka, Ch. Faßbach, M. Verma, H. Falk, P. Marwedel: Operating system integrated energy aware scratchpad allocation strategies for multi-process applications, SCOPES, 2007]
- 27 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Comparison of SPMM to Caches for SORT
Baseline: Main memory only SPMM peak energy reduction by 83%
at 4k Bytes scratchpad Cache peak: 75% at 2k 2-way cache
SPMM capable ofoutperforming caches
OS and libraries are not considered yet
Chunk allocation results:
SPM Size Δ 4-way
1024 74,81%
2048 65,35%
4096 64,39%
8192 65,64%
16384 63,73%
- 28 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Motivation
Worst-case execution time aware compilation
Memory-architecture aware compilation
Summary
Outline
- 29 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
Conclusion
Embedded systems are tightly integrated with physics
Timing is currently poorly handled
Integration of WCET cost model into compiler WCC allows for a systematic reduction of the WCET in the PREDATOR project
• Optimizations can be reconsidered for WCET reduction
Memory architecture aware compilation is the consequence of non-uniform memory accesses
• Scratch pads combine energy- and timing-efficiency