29
fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter Marwedel TU Dortmund, Informatik 12 Informatik Centrum Dortmund (ICD) Dortmund, Germany

Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

Embed Size (px)

Citation preview

Page 1: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

fakultät für informatikinformatik 12

technische universität dortmund

Optimizing embedded softwarefor timing-predictabilityand memory-awareness

Peter Marwedel

TU Dortmund, Informatik 12Informatik Centrum Dortmund (ICD)

Dortmund, Germany

Page 2: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 2 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

What is an Embedded System?

Page 3: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 3 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Embedded Systems

“Dortmund“ Definition [Peter Marwedel]:

Information processing systems embedded into a larger product

Berkeley Modell [Ed Lee]:

Embedded software is software integrated with physical* processes. The technical problem is managing time and concurrency in computational systems.

Main reason for buying is not information processing

* “cyber-physical systems”

Page 4: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 4 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

The Problem with Timing in Current IT

Difficulty in expressing timing in specs PCs are designed for good average case timing Sources of unpredictable or difficult-to-predict timing

• Caches• Virtual memory• Many communication systems• Speculation• Multiprocessing• Shared resources

Real timing abstracted in O-notation “Computability” ill-defined for embedded software

Page 5: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 5 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Problems with Classical CS

“The lack of timing in the core abstraction is a flaw, from the perspective of embedded software, …”

“What is needed is nearly a reinvention of computer science “Ed Lee: Absolutely Positively on Time, IEEE Computer, July, 2005

Introduction of the term “Cyber-physical system” (CPS)Initially another term for ES; Emphasizing physics:

• time• energy• space

Page 6: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 6 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Motivation

Worst-case execution time aware compilation

Memory-architecture aware compilation

Summary

Outline

Page 7: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 7 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Worst Case Execution Time Analysis

© AbsInt

Real-timeconstraint

WCETEST

Page 8: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 8 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Current Trial-and-Error Based Development

1. Specification of ES software

2. Generation of Code (ANSI-C or similar)

3. Compilation for given target processor

4. Execution and/or simulation of machine code,using a (e.g. random) set of input data

5. Measurement-based computation of “estimated worst case execution time” (WCETmeas)

6. Adding safety margin (e.g. 20%) on top of WCETmeas and call this “WCEThypo”

7. If “WCEThypo” > real-time constraint: change some detail, go back to 1 or 2.

Page 9: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 9 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Problems with this Approach

Dependability Computed “WCEThypo” not a safe approximation

Time constraint may be violated

Design time How to find necessary changes?

How many iterations until successful?

“Make the common case fast” a wrong approach for RT-systems Computer architecture and compiler techniques focus on average speed

Circuit designers know it’s wrong

Compiler designers (typically) don’t

“Optimizing” compilers unaware of costfunctions other than code size

period

Common case fast

Page 10: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 10 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Contributions towards Solving the Problem

Predictable Hardware Architectures (currently an exception) Predictable Operating Systems Modified Algorithms: same execution time for all data (too far) Design of formal WCETEST analysis tools

Design of a Compiler which• considers WCET as cost function• Approach:

- Integration with WCETEST analysis

- Using timing model for specific optimizations

Page 11: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 11 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

A WCET-Aware C-Compiler (WCC)

ICD-CParser

ANSI C ICD-CIR

CodeSelector

LLIR

RegisterAllocator

LLIR CRL2 CRL2

aiT WCET

Analysis

CRL2 +

WCETEST

CRL2 LLIR

WCET-Opt. ASM

Analyses,

Optimi-

zations

Target Processor:Infineon TriCoreTC1796

[H. Falk et al.]

Page 12: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 12 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Challenges for WCETEST-Minimization

Worst-Case Execution Path (WCEP)

WCETEST of a program = Length of longest execution path (WCEP) in that program

WCETEST-Minimization: Reduction of the longest path

Other optimizations do not result in a reduction of WCETEST

Optimizations need to know the WCEP

Page 13: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 13 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

WCET-oriented optimizations

Extended loop analysis (CGO 09) Instruction cache locking (CODES/ISSS 07) Cache partitioning (WCET-Workshop 09) Procedure cloning (WCET-Workshop 07, CODES/ISSS 07,

SCOPES 08) Procedure positioning (ECRTS 08) Function inlining (SMART 09) Loop unswitching/invariant paths (SCOPES 09) Loop unrolling (ECRTS 09) Register allocation (DAC 09) Scratchpad optimization (DAC 09) Extension towards multi-objective optimization (RTSS 08)

Page 14: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 14 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Loop Unrolling as an Example

Unrolling replaces the original loop with several instances of the loop body Positive Effects

• Reduced overhead for loop control• Enables instruction level parallelism (ILP)• Offers potential for following optimizations

Unroll early in optimization chain Negative Effects

• Aggressive unrolling leads to I-cache overflows• Additional spill code instructions • Control code may cancel positive effects

Consequences of transformation hardly knownConsequences of transformation hardly known

Page 15: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 15 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

WCET-aware Loop Unrolling via Backannotation

WCET-information available at assembly level

Unrolling to be applied at internal representation of source code

Solution: Back-annotation: experimental worst-case execution time aware compiler WCC allows feeding information from assembly code back to source code

• WCET data• Assembly code size• Amount of spill code

Memory architecture info available

High-Level IR(Source Code)

Low-Level IR(Assembly Code)

Back-Annotation

Mem. Spec.

Page 16: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 16 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Results – WCET

100%: Avg. WCET for all benchmarks with –O3 & no unrolling WCET reduction between 10.2% and 15.4%WCET-driven Unrolling outperforms std. unrolling by 13.7%

Rel

ativ

e W

CE

T [

%]

ST

AN

DA

RD

Page 17: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 17 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Register Allocation: Results

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

110%

Re

lati

ve

WC

ET E

ST

[%]

(Op

tim

iza

tio

n L

ev

el -

O3

)

100% = WCETEST using Standard Graph Coloring (highest degree)

93%

24%

69%

[H. Falk]

Page 18: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 18 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Relative WCETEST with I-Cache Locking5 Benchmarks/ARM920T/Postpass-Optimization

(ARM920T)

[S. Plazar et al.]

Page 19: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 19 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Motivation

Worst-case execution time aware compilation

Memory-architecture aware compilation

Summary

Outline

Page 20: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 20 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Importance of Energy Efficiency

Cou

rtes

y: P

hilip

s

© Hug

o D

e M

an,

IME

C,

200

7

Efficient software design needed, otherwise, the price for software flexibility cannot be paid.Efficient software design needed, otherwise, the price for software flexibility cannot be paid.

poor design techniques

IPE=Inherent power efficiencyAmI=Ambient Intelligence

GO

Ps/

J

Page 21: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 21 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Energy Consumption in Mobile Devices

[O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005;] Thanks to Thorsten Koch (Nokia/ Univ. Dortmund) for providing this source.

Page 22: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 22 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Trends for the Speeds

Speed gap between processor and main DRAM increases

[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]

2

4

8

2 4 5

Speed

years

CPU Per

form

ance

(1.5

-2 p

.a.)

DRAM (1.07 p.a.)

31

2x every 2 years

10

Similar problems also for embedded systems & MPSoCs

In the future:Memory access times >> processor cycle times

“Memory wall” problem

Page 23: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 23 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Hierarchical Memoriesusing Scratch Pad Memories (SPM)

Address space

Fast

Energy-efficient

Usually timing- predictable

scratch pad memory

0

FFF..

main

SPM

processor

HierarchyHierarchy

ExampleExample

no tag memory

SPM

selectSelection is by an appropriate address decoder (simple!)

SPM is a small, physically separate memory mapped into the address space

Page 24: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 24 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Migration of Data and InstructionsGlobal Optimization Model

Which object (array, loop, etc.) to be stored in SPM?

Non-overlaying memory allocation:

Gain gk & size sk for each object k.

Maximise gain G = gk, respecting size of SPM SSP sk.

Solution: knapsack algorithm.

Overlaying allocation:

Moving objects back and forth between hierarchy levels

Processor

Scratch pad memory,capacity SSP

main memory

?

For i .{ }

for j ..{ }

while ...

Repeat

function ...

Array ...

Int ...

Array

Example:

Page 25: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 25 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Scratchpad optimizations

Energy model (PATMOS, 2001) Comparison with cache (TR 762, 2001; CODES 2002) Non-overlaying allocation (TR 756, 2001; DATE 2002) Overlaying allocation (ISSS 2002; PACS 2003; CODES 2004;

GI 2005; TVLSI 2006; SAMOS 2006; Springer 2007) Partitioning (ASPDAC 2003; WMPI 2004) Predictability (ASPDAC 2004; WCET 2004; DATE 2005;

Springer 2006) Cooperation with cache (DATE 2004; TCAD 2006) Multi-processes (ESTIMEDIA 2005) Allocation at run-time (SCOPES 2007)

Page 26: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 26 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Dynamic Set of Multiple Applications

MEMMEM

CPU

SPM

SPMManager

SPMManager

App. 2App. 2

App. 1App. 1

App. nApp. n

SP

M

App. 3

App. 2

App. 1

t

Address space:

SPM

??

Compile-time partitioning of SPM no longer feasible

Introduction of SPM-manager

Runtime decisions, butcompile-time supported

[R. Pyka, Ch. Faßbach, M. Verma, H. Falk, P. Marwedel: Operating system integrated energy aware scratchpad allocation strategies for multi-process applications, SCOPES, 2007]

Page 27: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 27 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Comparison of SPMM to Caches for SORT

Baseline: Main memory only SPMM peak energy reduction by 83%

at 4k Bytes scratchpad Cache peak: 75% at 2k 2-way cache

SPMM capable ofoutperforming caches

OS and libraries are not considered yet

Chunk allocation results:

SPM Size Δ 4-way

1024 74,81%

2048 65,35%

4096 64,39%

8192 65,64%

16384 63,73%

Page 28: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 28 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Motivation

Worst-case execution time aware compilation

Memory-architecture aware compilation

Summary

Outline

Page 29: Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter

- 29 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2009

Conclusion

Embedded systems are tightly integrated with physics

Timing is currently poorly handled

Integration of WCET cost model into compiler WCC allows for a systematic reduction of the WCET in the PREDATOR project

• Optimizations can be reconsidered for WCET reduction

Memory architecture aware compilation is the consequence of non-uniform memory accesses

• Scratch pads combine energy- and timing-efficiency