267
A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg zur Erlangung des Doktorgrades Dr.-Ing. vorgelegt von Joachim Falk aus Erlangen

A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

  • Upload
    others

  • View
    18

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

A Clustering-Based MPSoC DesignFlow for Data Flow-Oriented

Applications

Der Technischen Fakultätder Friedrich-Alexander-Universität

Erlangen-Nürnbergzur

Erlangung des Doktorgrades Dr.-Ing.

vorgelegt von

Joachim Falk

aus Erlangen

Page 2: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Als Dissertation genehmigtvon der Technischen Fakultätder Friedrich-Alexander-Universität Erlangen-NürnbergTag der mündlichen Prüfung: 21.11.2014

Vorsitzende des Promotionsorgans: Prof. Dr.-Ing. habil. Marion Merklein

Gutachter: Prof. Dr.-Ing. Jürgen TeichProf. Dr.-Ing. habil. Christian HaubeltProf. Dr. Ir. Twan Basten

Page 3: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Acknowledgments

First, I would like to express my sincere gratitude to Prof. Dr.-Ing. Jürgen Teichfor his supervision and useful critiques of this research work. Moreover, I wantto give my appreciation to Prof. Dr.-Ing. Christian Haubelt and Prof. Dr. Ir.Twan Basten for agreeing to be the co-examiners of this thesis. I also want tothank the other members of the Ph.D. committee, Prof. Dr.-Ing. Michael Glaßand Prof. Dr.-Ing. Sebastian Sattler. I am particularly grateful for the fruitfuldiscussions with my colleagues from the University of Erlangen-Nuremberg andthe University of Rostock. In particular, I want to thank Prof. Dr.-Ing. ChristianHaubelt, Christian Zebelein, Jens Gladigau, and Martin Streubühr for theirassistance in developing solutions to the various challenges encountered duringmy work on this thesis. My special thanks are extended to Prof. Dr.-Ing. MichaelGlaß, Tobias Schwarzer, Stefan Wildermann, and Kerstin Falk for proofreadingparts of this thesis. I would also like to recognize Rainer Dorsch for providingthe opportunity to validate parts of the presented design flow in an industry-relevant case study.Last but not least, I wish to acknowledge the invaluable support given by my

family while undertaking the presented research work.

iii

Page 4: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 5: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Contents

1 Introduction 1

2 Data Flow Fundamentals 72.1 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2 Static Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 Homogeneous Data Flow . . . . . . . . . . . . . . . . . . 102.2.2 Synchronous Data Flow . . . . . . . . . . . . . . . . . . 122.2.3 Cyclo-Static Data Flow . . . . . . . . . . . . . . . . . . . 15

2.3 Dynamic Data Flow . . . . . . . . . . . . . . . . . . . . . . . . 172.3.1 Kahn Process Networks . . . . . . . . . . . . . . . . . . 192.3.2 Boolean Data Flow . . . . . . . . . . . . . . . . . . . . . 212.3.3 Dennis Data Flow . . . . . . . . . . . . . . . . . . . . . . 222.3.4 FunState . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.4 Hierarchical Integration of Finite State Machines and Data Flow 272.4.1 Refining Data Flow Actors via Finite State Machines . . 282.4.2 Refining Finite State Machine States via Data Flow . . . 31

3 SysteMoC 333.1 Electronic System Level Design and SystemC . . . . . . . . . . 343.2 Integration of SysteMoC and Design Space Exploration . . . . . 373.3 The Model of Computation of SysteMoC . . . . . . . . . . . . . 44

3.3.1 SysteMoC Actors . . . . . . . . . . . . . . . . . . . . . . 443.3.2 SysteMoC Network Graphs . . . . . . . . . . . . . . . . 483.3.3 SysteMoC FIFO Channels . . . . . . . . . . . . . . . . . 50

3.4 The SysteMoC Language . . . . . . . . . . . . . . . . . . . . . . 503.4.1 Specification of Actors . . . . . . . . . . . . . . . . . . . 513.4.2 Specification of the Communication Behavior . . . . . . 523.4.3 Specification of the Network Graph . . . . . . . . . . . . 56

3.5 Hierarchical SysteMoC Models . . . . . . . . . . . . . . . . . . . 573.5.1 Hierarchical Finite State Machines . . . . . . . . . . . . 573.5.2 Hierarchical Network Graphs . . . . . . . . . . . . . . . 61

v

Page 6: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Contents

3.5.3 The Cluster Operation . . . . . . . . . . . . . . . . . . . 623.6 From SystemC to SysteMoC . . . . . . . . . . . . . . . . . . . . 653.7 Automatic Model Extraction . . . . . . . . . . . . . . . . . . . . 703.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4 Clustering 834.1 Clustering and Scheduling Overhead . . . . . . . . . . . . . . . 854.2 Clustering as an Operation of Refinement . . . . . . . . . . . . . 974.3 Cluster Refinement in Static Data Flow Graphs . . . . . . . . . 100

4.3.1 Refining a Cluster to an SDF Actor . . . . . . . . . . . . 1014.3.2 Refining a Cluster to a CSDF Actor . . . . . . . . . . . . 110

4.4 Cluster Refinement in Dynamic Data Flow Graphs . . . . . . . 1204.4.1 Clustering Condition . . . . . . . . . . . . . . . . . . . . 1224.4.2 Automata-Based Clustering . . . . . . . . . . . . . . . . 1294.4.3 Rule-Based Clustering . . . . . . . . . . . . . . . . . . . 1454.4.4 Scheduling Static Sequences . . . . . . . . . . . . . . . . 1584.4.5 Proving Semantic Equivalence . . . . . . . . . . . . . . . 1604.4.6 FIFO Channel Capacity Adjustments . . . . . . . . . . . 168

4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1864.5.1 Speedup from Overhead Reduction . . . . . . . . . . . . 1874.5.2 Design Space for Clustering . . . . . . . . . . . . . . . . 195

4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

5 Synthesis 2035.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2045.2 Synthesis in the SystemCoDesigner Design Flow . . . . . . . 2055.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075.4 Synthesis of Actor-Oriented Designs . . . . . . . . . . . . . . . . 209

5.4.1 Actor Transformation . . . . . . . . . . . . . . . . . . . . 2095.4.2 Generating the Architecture . . . . . . . . . . . . . . . . 2115.4.3 Communication Resources . . . . . . . . . . . . . . . . . 212

5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2135.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

6 Outlook and Conclusions 2176.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2176.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . 220

German Part 223

Bibliography 229

Author’s Own Publications 243

vi

Page 7: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Contents

List of Symbols 247

Acronyms 253

Index 255

vii

Page 8: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 9: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

1Introduction

Due to the rising hardware capabilities of embedded systems, their offered func-tionality has increased tremendously. However, the comparatively slow improve-ments in battery technology in the same time frame has caused only a slightrelaxation of power and energy constraints for portable devices. Therefore,power-efficient implementations of the requested functionality is strongly de-sired, leading to heterogeneous implementations using power-efficient dedicatedaccelerators and multicore processors. Other embedded systems like automo-tive controller networks are naturally heterogeneous distributed systems. Theabove mentioned reasons are contributing factors to the situation that, in alllikelihood, future embedded systems of medium complexity will no longer con-sist of a single computation resource, but multiple connected and heterogeneousresources. Hence, languages used for implementing and modeling applicationsfor these future embedded systems need the ability to exploit the parallelisminherent in such an architecture.

“Although threads seem to be a small step from sequential compu-tation, in fact, they represent a huge step. They discard the mostessential and appealing properties of sequential computation: un-derstandability, predictability, and determinism.”— From “The Problem with Threads,” by Edward A. Lee [Lee06]

This fact causes problems for the traditional implementation of embeddedsystems via sequential programming languages, as these languages handle con-currency by using threads. Data flow, in contrast, is a modeling paradigm well-suited for the modeling of concurrent systems by concurrently executing actors.Thus, data flow is a natural fit for the distributed nature encountered in futureembedded systems. This thesis is concerned with providing improvements fordesign flows targeting such systems by: (1) providing a modeling language withformal underpinnings in data flow modeling, thus, enabling a trade-off betweenanalyzability and expressiveness of the described application; (2) contributinga methodology called clustering that exploits these possible trade-offs to gen-erate efficient schedules for applications containing actors of high analyzability

1

Page 10: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

1. Introduction

and low expressiveness; and (3) enabling the synthesis of the modeled appli-cations via a synthesis back-end that supports multiple targets such as singlecore software implementations, virtual prototypes of distributed Multi-ProcessorSystem-on-Chip (MPSoC) systems, and dedicated hardware implementations.In the following, the key contributions as well as the limitations of the presentedwork are outlined. Then, an outline of the structure of this work is given.

Contributions and LimitationsIn this thesis, a design flow for data flow-oriented applications from an exe-cutable specification down to a virtual prototype [GHP+09] will be presented.The scope of the work are executable specifications that can be modeled as dataflow graphs like brake-by-wire [GTZ12] or torque vectoring [ZGBT13] from thesignal processing domain, models from the image processing domain such as, forinstance, a Motion-JPEG [KSS+09*] decoder or a pedestrian detection system[KFL+13], as well as applications from the packet processing domain like anInfiniBand network controller [FZH+10*, ZFH+10*].The first key contribution of this work is the SystemC Models of Computation

(SysteMoC) modeling language [FHZT13*, FHT06*] for the Electronic SystemLevel (ESL) and its integration into the SystemCoDesigner Design SpaceExploration (DSE) design flow [KSS+09*]. This language has strong formal un-derpinnings in data flow modeling with the distinction that the expressivenessof the data flow model used by an actor is not chosen in advance but determinedfrom the implementation of the actor [FHZT13*, ZFHT08*]. The second keycontribution of this work is the clustering methodology [FKH+08*, FZHT13*]that exploits the presence of highly analyzable static actors in the data flowgraph even in the presence of dynamic actors. Actors are called static if theyconform to the known data flow models Homogeneous (Synchronous) DataFlow (HSDF) [CHEP71], Synchronous Data Flow (SDF) [LM87b] or Cyclo-Static Data Flow (CSDF) [BELP96] as introduced in Chapter 2. The clusteringmethodology is neither limited to data flow graphs consisting only of staticactors, nor does it require a special treatment of these static actors by the de-veloper of the SysteMoC application. Finally, a correctness proof [FZHT13*]of the presented clustering methodology is given. The third key contributionof this work is the synthesis back-end [ZHFT12*, ZFHT12*, ZFS+14*] for theSysteMoC language. The automatic generation of virtual prototypes is requiredfor the evaluation of the clustering methodology by generating implementationsand measuring the resulting throughput of the generated implementations.In order to determine the data flow model used by a SysteMoC actor, the

SysteMoC language, in contrast to SystemC, enforces a distinction betweencommunication and computation of an actor. While modeling of the executablespecification in SysteMoC is the preferred design entry to specify this data flow

2

Page 11: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

graph, the executable specification can in theory [FZK+11*] also be specified viapure SystemC if certain coding standards are enforced. These coding standardswould replace the enforcement by the SysteMoC language of a clear distinc-tion between communication and computation for each actor. Furthermore, thepure SystemC-based design entry must also be modeled as a data flow graphin contrast to the architecture centric models used in industry. These architec-ture centric models closely correspond to virtual prototypes containing alreadymemories, buses, and other MPSoC architecture components.

This thesis shows how this distinction between communication and computa-tion can be exploited in order to classify a SysteMoC actor into one of the staticdata flow models (shaded in Figure 1.1) in the hierarchy of data flow modelexpressiveness. The ability of classification [FHZT13*, ZFHT08*] serves as adistinction between SysteMoC and Ptolemy [EJL+03, HLL+04, Pto14] as well asPtolemy Classics [BHLM94] where each actor is associated with a director whichexplicitly gives the Model of Computation (MoC) under which the actor is exe-cuted. This is important as the data flow model of SysteMoC is chosen in orderto provide great expressiveness. Hence, the analyzability, e.g., with respect todeadlock freeness or the ability to be executed in bounded memory, of a generalSysteMoC model is correspondingly low. However, if this expressiveness is notused by a SysteMoC actor, then the classification of an actor might detect thissituation and classify the actor into a data flow model of lower expressivenessbut higher analyzability. The classification provides only a sufficient criterion ifa general SysteMoC actor conforms to one of the static data flow models. Thislimitation stems from the fact that the problem in general is undecidable.

For modeling at the ESL, the presented work contributes [FHT06*] an effi-cient representation and simulation of data flow model-based executable specifi-cations. To demonstrate the usage of SysteMoC for real-world applications, thelanguage has also been used [FZH+10*] in an industrial cooperation to modelthe executable specification of an InfiniBand network controller, which has beenused to show the productivity gain that can be achieved by integrating theSysteMoC ESL model into the design flow used by the industry partner. Thisproductivity gain has been achieved by integrating the verification flow for thehardware design of the InfiniBand network controller with the verification flowfor the software development of the firmware responsible to control the networkcontroller via usage of the developed SysteMoC ESL model. In order to facil-itate modeling of real-world network applications like the InfiniBand networkcontroller, the SysteMoC language must support the NDF model [Kos78].

A key point of the SysteMoC language in contrast to other data flow modelingframeworks like Ptolemy Classics, Ptolemy, and YAPI [dKSvdW+00] is the abil-ity of executable specifications modeled in SysteMoC to act as an input modelfor DSE [HFK+07*, KSS+09*] as well as to serve as an evaluator [SFH+06*] for

3

Page 12: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

1. Introduction

incr

easi

ng

ex

pre

ssiv

enes

s

incr

easi

ng

an

aly

zab

ilit

y

HSDF

SDF

CSDF

BDF

DDF

KPN

NDF

Figure 1.1: Hierarchy of various data flow models from the least expressive butmost analyzable HSDF model to the most expressive but least ana-lyzable NDF model.

the DSE to evaluate timing and power characteristics of the executable specifi-cation for the different design decisions that are to be explored.The identification of static actors by means of classification enables the second

key contribution, the clustering methodology presented in Chapter 4, to refinethe data flow graph by replacing islands of static actors by composite actors. Is-lands of static actors are connected subgraphs of these static actors in the wholedata flow graph described by an executable specification modeled in SysteMoC.Later, in the implementation generated by the synthesis back-end, the remain-ing actors which are not replaced by a composite actor as well as all compositeactors are scheduled by a dynamic scheduler at run time. Thus, by combin-ing multiple static actors into one composite actor, the clustering methodology[FKH+08*, FZK+11*, FHZT13*] reduces the number of actors which have tobe scheduled by the dynamic scheduler. This reduction is of benefit as it hasbeen shown [BBHL95, BL93, BML97] that compile-time scheduling of static ac-tors [LM87b] produces more efficient implementations in terms of latency andthroughput than run-time scheduling of the same actors by a dynamic scheduler.Furthermore, an improved rule-based representation [FZHT11*, FZHT13*]

of the schedule contained in the composite actor is also discussed. The rule-based representation trades latency and throughput in order to achieve a morecompact code size of the schedule contained in a composite actor as comparedto the automata-based representation used in [FKH+08*, FZK+11*, FHZT13*].

4

Page 13: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Finally, the synthesis back-end [ZHFT12*, ZFHT12*] generates an imple-mentation by transforming the executable specification modeled in SysteMoCaccording to a set of design decisions for the implementation. Examples for de-sign decisions are allocation and binding decisions of the DSE tool as well as theschedules generated by the clustering methodology. The allocation and bind-ing decisions define the set of hardware resources which are used in the virtualprototype and the information which SysteMoC actor is implemented on whichresource of the virtual prototype. The schedules generated by the clusteringmethodology are used by the synthesis back-end to reduce the scheduling over-head of the generated software. The presented synthesis back-end contributesto the realization of the promised productivity gain of ESL modeling by provid-ing an automatic way to generate virtual prototypes for SysteMoC applications.Furthermore, the synthesis back-end is used for the realization of synthesis-in-the-loop by the DSE design flow. Synthesis-in-the-loop enables the DSE toevaluate its design decisions based on real implementations.

OutlineThe structure of this thesis follows the design flow proposed in this work. Thefundamentals of data flow modeling are introduced in Chapter 2. This chapteris also used to introduce the basic notation which is used throughout this the-sis. Furthermore, the trade-off between the expressiveness and the analyzabilityof different data flow models is discussed. Based on these discussions, the un-derlying data flow model of the input language of the proposed design flow ischosen.The input language—an extension of SystemC—called SysteMoC is intro-

duced in Chapter 3. The SysteMoC language is used to model the executablespecification which is the input for the proposed design flow.In Chapter 4, a methodology called clustering is introduced. The clustering

methodology exploits the derived classification of SysteMoC actors to gener-ate optimized schedules for islands of static actors. The goal of the clusteringmethodology is the replacement of the static island by a single actor, imple-menting a schedule for the static actors in the island in such a way that theperformance of the whole graph in terms of latency and throughput is opti-mized.Subsequently, in Chapter 5, a synthesis back-end will be presented that sup-

ports multiple targets such as single core software implementations, virtual pro-totypes of distributed MPSoC systems, and dedicated hardware implementa-tions.Finally, a conclusion of the presented work and an outlook on possible future

work is given in Chapter 6.

5

Page 14: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 15: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2Data Flow Fundamentals

Data flow is a modeling paradigm well-suited for the modeling of concurrentsystems by concurrently executing actors. Thus, data flow is a natural fit forthe distributed nature encountered in future embedded systems. Furthermore,the design entry for the presented design flow for these future embedded systemsis given by an executable specification modeled in the SystemC Models of Com-putation (SysteMoC) language. This language has strong formal underpinningsin data flow modeling.In this chapter, a general introduction to data flow is given in Section 2.1.

Furthermore, the trade-off between analyzability and expressiveness of theseModels of Computation (MoCs) will be discussed. Data flow models can beclassified into static data flow models, known from literature for their highanalyzability but low expressiveness, and dynamic data flow models, known fortheir high expressiveness but low analyzability.First, three important static data flow models Homogeneous (Synchronous)

Data Flow (HSDF), Synchronous Data Flow (SDF), and Cyclo-Static DataFlow (CSDF) known in literature will be recapitulated in Section 2.2. Next, inSection 2.3, the dynamic data flow models Kahn Process Network (KPN) as wellas its usual realization as Dennis Data Flow (DDF), Boolean Data Flow (BDF),and Functions Driven by State Machines (FunState)—a realization of the Non-Determinate Data Flow (NDF) model—are discussed. Finally, ways to integrateFinite State Machines (FSMs) into data flow models are reviewed in Section 2.4.Note that parts of this chapter are derived from [FHZT13*].

2.1 Data Flow

A Data Flow Graph (DFG) [Kah74] is a graph consisting of vertices called ac-tors and directed edges called channels. Whereas the actors are used to modelfunctionality, thus computations to be executed, the channels represent datacommunication and storage requiring memory to be realized. Channel mem-ory is conceptionally considered to be unbounded, thus representing a possiblyinfinite amount of storage. Moreover, it is customary [Den74] to separate the

7

Page 16: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

Channel 𝑐1

snk(𝑐1)src(𝑐1)

𝑎1 𝑎2𝑐1

(a) Initial state of the DFG 𝑔

Actor 𝑎1

Token carrying value 𝜈1 hasbeen produced by 𝑎1

𝜈1𝑎1 𝑎2

(b) After production of a token by actor 𝑎1

Actor 𝑎1

Token carrying value 𝜈2 hasbeen produced by 𝑎1

𝜈2 𝜈1𝑎1 𝑎2

(c) After production of a second token byactor 𝑎1

Actor 𝑎2

Token carrying value 𝜈1

will be consumed by 𝑎2

𝜈1𝜈2𝑎1 𝑎2

(d) Consumption of the first produced to-ken by 𝑎2

Figure 2.1: A very simple DFG consisting of a channel communicating data pro-duced by actor 𝑎1 and consumed by actor 𝑎2

computation of an actor into distinct steps. These steps are called actor firings.An actor firing is an atomic computation step that consumes a number of datacalled tokens from each incoming channel and produces a number of tokens oneach outgoing channel. More formally, a DFG is defined as follows:

Definition 2.1 (Data Flow Graph). A DFG is a directed graph 𝑔 = (𝐴,𝐶),where the set of vertices 𝐴 represents the actors and the set of edges 𝐶 ⊆ 𝐴×𝐴represents the channels. Additionally, a delay function delay : 𝐶 → 𝒱* isgiven.1 It assigns to each channel (𝑎src, 𝑎snk) = 𝑐 ∈ 𝐶 a (possibly empty)sequence of initial tokens.2 The set 𝒱 is the set of data values which can becarried by a token.

An example of a very simple DFG according to Definition 2.1 is depicted inFigure 2.1. It consists of two actors 𝑎1 and 𝑎2 which are connected by a channel𝑐1. A channel has, if not otherwise noted, infinite channel capacity, i.e., it canstore an infinite number of tokens that are in transit between the two actorsconnected by the channel. For notational convenience, two functions are usedto refer to the source actor and the sink actor of a channel.

1The ’*’-operator is used to denote Kleene closure of a value set. It generates the set ofall possible finite sequences of values from a value set, that is 𝑋* = ∪𝑛 ∈ N0 : 𝑋𝑛. N0

denotes the set of non-negative integers, that is { 0, 1, 2, . . . }.2In some data flow models that abstract from token values, the delay function may only

return a non-negative integer that denotes the number of initial tokens on the channel. Insuch a context, the number of initial tokens may also be called the delay of a channel.

8

Page 17: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.2 Static Data Flow

∙ The src : 𝐶 → 𝐴 function denotes the source actor 𝑎 = src(𝑐) produc-ing tokens onto a channel 𝑐, e.g., src(𝑐1) = 𝑎1 for the DFG depicted inFigure 2.1.

∙ The snk : 𝐶 → 𝐴 function denotes the sink actor 𝑎 = snk(𝑐) consumingtokens from a channel 𝑐, e.g., snk(𝑐1) = 𝑎2 for the DFG depicted inFigure 2.1.

The channel 𝑐1 enables a transmission of data values 𝜈1, 𝜈2, . . . from 𝑎1 to 𝑎2.Each data value is carried by a token. For data flow models, a token representsthe atomic unit of data production and consumption. Tokens are queued (asexemplified in Figures 2.1b to 2.1c) and de-queued (Figure 2.1d) on a channelin First In First Out (FIFO) order. Between Figure 2.1a and Figure 2.1b aso-called actor firing of actor 𝑎1 has occurred, producing a token with value 𝜈1.Another actor firing, producing an additional token with value 𝜈2, has occurredin Figure 2.1c. While Figures 2.1c and 2.1d depict the state of the DFG afterthe first actor firing of actor 𝑎1 has produced the token with value 𝜈1 but beforeit has been consumed by the first firing of actor 𝑎2. The state of a DFG is givenby the number and values of tokens on each channel as well as, possibly, theinternal states of all actors. For analysis purposes of static data flow models,the values of these tokens do not need to be considered. A resulting next stateof the DFG after firing of actor 𝑎2 is not depicted in Figure 2.1, but consists ofthe DFG where only the token with value 𝜈2 remains on the channel.An actor is fireable (also called enabled) if and only if all the tokens the next

actor firing would consume are present on the input channels of the actor, e.g.,the actor 𝑎2 is enabled starting from Figure 2.1b as the token with value 𝜈1,which is the only token that will be consumed by firing of 𝑎2, is present on thechannel.Data flow is a very general MoC. Decades of research [CHEP71, Den74,

Kah74, LM87b, LM87a, Buc93, Par95, BELP96, TSZ+99, GLL99, BB01, HB07,PSKB08, GJRB11] into its applications have lead to a multitude of different dataflow models. All of them make different trade-offs between their expressivenessand their analyzability, e.g., with respect to deadlock freeness, the ability tobe executed in bounded memory, or the possibility to be scheduled at compiletime. In the following, the most important data flow MoCs are briefly reviewed,starting with those data flow models with the least expressiveness.

2.2 Static Data Flow

Of primary interest for the expressiveness of a data flow model are so-calledconsumption and production rates. The consumption rate (cons(𝑐)) of an actorfiring from a connected channel 𝑐 is the number of tokens which are consumed

9

Page 18: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

by that actor from the channel while firing. To exemplify, consider Figure 2.1dwhere the first firing of actor 𝑎2 consumes one token from channel 𝑐1. Therefore,the consumption rate of the first firing of actor 𝑎2 from channel 𝑐1 is one. Theproduction rate (prod(𝑐)) is defined analogously. That is, the production rateof an actor firing to a connected channel is the number of tokens which areproduced by that actor on the channel while firing. To exemplify, considerFigure 2.1b where the first firing of actor 𝑎1 produces one token onto channel𝑐1. Therefore, the production rate of the first firing of actor 𝑎1 to channel 𝑐1 isone.An actor is called a static data flow actor if its consumption and production

rates neither depend on the values of the tokens which are consumed by the actornor on the points in time at which tokens arrive on the channels connected to theactor.3 Hence, the consumption and production rates depend on some internalstate of the actor which is not influence by the values of the tokens which areconsumed by the actor. It must be assured that both the transitions between thisinternal states as well as the derivation of the consumption and production ratesfrom this internal state are deterministic. Thus, the communication behaviorof static actor can be fully predicted at compile time. The consumption andproduction rates of an actor are even further constrained in well-known staticdata flow models, which are presented in the following.

2.2.1 Homogeneous Data Flow

The simplest data flow model is Homogeneous (Synchronous) Data Flow (HSDF).Data flow graphs corresponding to this data flow model are also known as markedgraphs [CHEP71] in literature. The communication behavior of HSDF actorsis constraint in such a way that each actor firing must produce and consumeexactly one token on, respectively, the outgoing and incoming channels, i.e.,∀𝑐 ∈ 𝐶 : cons(𝑐) = prod(𝑐) = 1.Due to the low modeling power of the HSDF model, it is also highly analyz-

able, e.g., with respect to the ability to be executed in bounded memory. Toexemplify, the existence of an infinite sequence of actor firings for the actorsof an HSDF graph such that the number of queued tokens on any channel ofthe HSDF graph does not exceed a finite bound is considered. The existence ofsuch an infinite sequence for an HSDF graph guarantees that the graph can beexecuted in bounded memory. It is proven in [CHEP71] that an HSDF graphhas such an infinite sequence of actor firings if and only if each directed cycleof the graph contains at least one initial token. An HSDF graph satisfying thiscriterion is depicted in Figure 2.2a. The graph contains three directed cycles,

3In principle, the requirement of independence of the token arrival times can be neglected ifsequence determinacy [LSV98] is still assured.

10

Page 19: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.2 Static Data Flow

Initial token

𝑎2

𝑎3

𝑎4

𝑎5

𝑎6𝑎1

𝑐2 𝑐3

𝑐6

𝑐 4

𝑐8 𝑐5𝑐1

𝑐 7

(a) Initial state of an HSDF graph

𝑎2

𝑎3

𝑎4

𝑎5

𝑎6𝑎1

𝑐2 𝑐3

𝑐6𝑐 7

𝑐 4

𝑐8 𝑐5𝑐1

(b) First of three directed cycles in theHSDF graph

𝑎2

𝑎3

𝑎4

𝑎5

𝑎6𝑎1

𝑐2 𝑐3

𝑐6𝑐 7

𝑐 4

𝑐8 𝑐5𝑐1

(c) Second of three directed cycles in theHSDF graph

𝑎2

𝑎3

𝑎4

𝑎5

𝑎6𝑎1

𝑐2 𝑐3

𝑐6𝑐 7

𝑐 4

𝑐8 𝑐5𝑐1

(d) Third of three directed cycles in theHSDF graph

Figure 2.2: An HSDF graph containing three directed cycles

which are depicted in Figure 2.2b - Figure 2.2d, each containing the channel 𝑐2carrying the required initial token.If each directed cycle of the HSDF graph contains at least one initial token,

then there exists a finite sequence of actor firings 𝑎 that brings the HSDF graphto the same state as it started from, e.g., as depicted in Figure 2.3 for the graphfrom Figure 2.2a. The sequence of actor firings 𝑎 is also called an iteration of theDFG. The length of an iteration is the number of its actor firings, e.g., six actorfirings for the iteration 𝑎 = ⟨𝑎1, 𝑎5, 𝑎4, 𝑎6, 𝑎3, 𝑎2⟩.4 For HSDF, this correspondsexactly to the number of actors in the graph. An infinite repetition of suchan iteration, e.g., ⟨𝑎1, 𝑎5, 𝑎4, 𝑎6, 𝑎3, 𝑎2⟩*, is also called a Periodic Static-OrderSchedule (PSOS) [DSB+13] of the graph.On the other hand, if the initial token is removed from channel 𝑐2, the actors

𝑎2 to 𝑎6 are deadlocked. This is due to the fact that each of the five actors isnow part of a so-called delay-less cycle. A cycle is called delay-less if none of itsedges are carrying an initial token. Note that the resulting HSDF graph is not

4The notation ⟨𝑥1, 𝑥2, 𝑥3⟩ is used to denote sequences of values 𝑥𝑖 which can be manipulatedvia sequence head and tail operations, concatenations, and similar operations.

11

Page 20: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

𝑐1 𝑐8 𝑐5

𝑐 7

𝑐2 𝑐3 𝑐 4

𝑐6

𝑎3

𝑎4 𝑎6𝑎2𝑎1

𝑎5

(a) Initial state of an HSDFgraph

𝑐1 𝑐8 𝑐5

𝑐 7 𝑐6

𝑐2 𝑐3 𝑐 4

𝑎3

𝑎4 𝑎6𝑎2𝑎1

𝑎5

(b) After firing of actor 𝑎1

𝑐1 𝑐8 𝑐5

𝑐 7 𝑐6

𝑐2 𝑐3 𝑐 4

𝑎3

𝑎4 𝑎6𝑎2𝑎1

𝑎5

(c) After firing of actor 𝑎5

𝑐1 𝑐8 𝑐5

𝑐 7 𝑐6

𝑐2 𝑐3 𝑐 4

𝑎3

𝑎4 𝑎6𝑎2𝑎1

𝑎5

(d) After firing of actor 𝑎4

𝑐1 𝑐8 𝑐5

𝑐 7 𝑐6

𝑐2 𝑐3 𝑐 4

𝑎3

𝑎4 𝑎6𝑎2𝑎1

𝑎5

(e) After firing of actor 𝑎6

𝑐1 𝑐8 𝑐5

𝑐 7 𝑐6

𝑐2 𝑐3 𝑐 4

𝑎3

𝑎4 𝑎6𝑎2𝑎1

𝑎5

(f) After firing of actor 𝑎3

𝑐1 𝑐8 𝑐5

𝑐 7 𝑐6

𝑐2 𝑐3 𝑐 4

𝑎3

𝑎4 𝑎6𝑎2𝑎1

𝑎5

(g) After firing of actor 𝑎2

Figure 2.3: Example of an iteration for the DFG from Figure 2.2

deadlocked, but an infinite accumulation of tokens on channel 𝑐1 is occurringdue to the infinite firing of the only fireable actor 𝑎1.

2.2.2 Synchronous Data Flow

In the Synchronous Data Flow (SDF) [LM87a] MoC the communication be-havior of the actors is constrained to have consumption and production ratesbeing constant for all different firings of an actor. Moreover, consumption andproduction rates for different connected channels are assumed to be arbitrarypositive integer constants. Therefore, in SDF, the consumption and productionrates can be expressed only by two functions cons and prod, respectively.

∙ The cons : 𝐶 → N function specifies for each channel 𝑐 ∈ 𝐶 the numberof consumed tokens from the channel 𝑐 by an actor firing of the actorsnk(𝑐).5

5The symbol N is used to denote the set of natural numbers, that is the set { 1, 2, 3, . . . }.

12

Page 21: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.2 Static Data Flow

2 3

2 3

3

prod(𝑐1) = 3

cons(𝑐1) = 1

𝑎1 𝑎3𝑎2𝑐1

𝑐2

𝑐3

Figure 2.4: Example of a Synchronous Data Flow graph

∙ The prod : 𝐶 → N function specifies for each channel 𝑐 ∈ 𝐶 the number ofproduced tokens onto the channel 𝑐 by an actor firing of the actor src(𝑐).

It is customary to annotate the consumption and production rates at thebeginnings and endings of the channel edges in visual representations of SDFgraphs. An example of an SDF graph is depicted in Figure 2.4. Consumptionand production rates of one, e.g., cons(𝑐1) = 1, are traditionally not annotatedto reduce clutter.The question if a DFG can be executed in bounded memory is now also

considered for SDF graphs. The answer to this question lies in the calculationof an iteration for the SDF graph. It is proven in [LM87b] that such an iterationis unambiguously defined by the consumption and production rates of the actorsin the SDF graph.The iteration is determined by so-called balance equations. Each balance

equation corresponds to one channel in the SDF graph. The balance equationfor a channel 𝑐 is given below:

𝜂src(𝑐) · prod(𝑐) = 𝜂snk(𝑐) · cons(𝑐)

In the above equation, the variable 𝜂src(𝑐) denotes the number of actor firings ofthe actor producing tokens onto channel 𝑐 while 𝜂snk(𝑐) denotes the number ofactor firings of the actor consuming tokens from channel 𝑐. With prod(𝑐) andcons(𝑐) the left and right side of the equation denote the number of tokens thathave been produced and consumed by the 𝜂src(𝑐) source actor and 𝜂snk(𝑐) sinkactor firings, respectively. For an iteration, both sides of the equation must beequal, otherwise the number of consumed and produced tokens do not balance,thus an iteration would not bring the graph into the same state as it has startedfrom.To exemplify, the balance equations for the SDF graph depicted in Figure 2.4

are given below:

𝜂1 · 3 = 𝜂2 · 1 (2.1)𝜂2 · 2 = 𝜂3 · 3 (2.2)𝜂3 · 3 = 𝜂2 · 2 (2.3)

13

Page 22: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

Each of the three balance equations (cf. Equations (2.1) to (2.3)) correspond toone of the channels 𝑐1 to 𝑐3, respectively. The number of firings for actor 𝑎1 to𝑎3 are given by 𝜂1 to 𝜂3.6 The above balance equations have a solution with𝜂1 = 1 ·𝑛, 𝜂2 = 3 ·𝑛, and 𝜂3 = 2 ·𝑛, where 𝑛 is any positive integer. It is provenin [LM87b] that if there is no solution besides the trivial zero vector, that is𝜂1 = 𝜂2 = 𝜂3 = 0, then the SDF graph either deadlocks or cannot be executedin bounded memory. On the other hand, if there exists such a solution, then theSDF graph is called consistent. Then, the SDF graph may still either deadlockor can be executed in bounded memory.However, firings can only be non-negative integers. Therefore, if there is a

solution besides the trivial zero solution, then an 𝑛 must be selected that resultsin non-negative integer values for the variables 𝜂1 . . . 𝜂3. Using the conventionestablished in [BML97], the minimum positive integer solution to the set ofbalance equations, e.g., here with 𝑛 = 1, is called the repetition vector 𝜂rep ofan SDF graph.The SDF graph depicted in Figure 2.4 has a repetition vector of 𝜂rep1 =

1, 𝜂rep2 = 3, 𝜂rep3 = 2 (or 𝜂rep = (1, 3, 2) for short).Moreover, the existence of a repetition vector does not guarantee that the SDF

graph can be executed iteratively in bounded memory, that is the existence of arepetition vector is a necessary criterion, but not a sufficient one. To exemplify,the initial tokens from the SDF graph depicted in Figure 2.4 are removed. Thisdoes not change the calculation or existence of the repetition vector of the SDFgraph. However, without any initial tokens, neither actor 𝑎2 nor actor 𝑎3 canever be fired. Therefore, an infinite accumulation of tokens on channel 𝑐1 mayoccur assuming an infinite firing of the only fireable actor 𝑎1.In contrast, with the initial tokens as depicted in Figure 2.4, an iteration

can be realized by the sequence ⟨𝑎3, 𝑎1, 𝑎2, 𝑎2, 𝑎3, 𝑎2⟩ of actor firings as shownin Figure 2.5. The length of this iteration is determined by summing all theentries of the repetition vector.7 ∑

𝑛∈ℐ(𝜂rep)

𝜂rep𝑛

In general, the test for checking whether a computed repetition vector may alsoexecute an iterative deadlock-free schedule is to fire each fireable actor as manytimes as implied by the repetition vector until the iteration is finished or adeadlock has occurred. Note that in the general case, the length of the iterationmay be exponential in the size of the SDF graph. Hence, in contrast to HSDF

6To avoid the usage of double subscripts, the technically correct notation 𝜂𝑎𝑛will not be

used.7In the following, the function ℐ is used to denote the set of indices of a vector, e.g., with𝜂rep = (𝜂rep1 , 𝜂rep2 , . . . 𝜂rep𝑚 ) the index set of 𝜂rep is ℐ(𝜂rep) = { 1, 2, . . .𝑚 }.

14

Page 23: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.2 Static Data Flow

3

2 3

2 3

𝑐2

𝑐3

𝑐1𝑎1 𝑎3𝑎2

(a) Initial state of the SDFgraph

3

2 3

2 3

𝑐2

𝑐3

𝑐1𝑎1 𝑎3𝑎2

(b) After firing of actor 𝑎3

3

2 3

2 3

𝑐2

𝑐3

𝑐1𝑎1 𝑎3𝑎2

(c) After firing of actor 𝑎1

3

2 3

2 3

𝑐2

𝑐3

𝑐1𝑎1 𝑎3𝑎2

(d) After firing of actor 𝑎2

3

2 3

2 3

𝑐2

𝑐3

𝑐1𝑎1 𝑎3𝑎2

(e) After firing of actor 𝑎2

3

2 3

2 3

𝑐2

𝑐3

𝑐1𝑎1 𝑎3𝑎2

(f) After firing of actor 𝑎3

3

2 3

2 3

𝑐2

𝑐3

𝑐1𝑎1 𝑎3𝑎2

(g) After firing of actor 𝑎2

Figure 2.5: Example of an iteration for the SDF graph from Figure 2.4

models, where the question of an infinite execution sequence in bounded memorycan be answered in polynomial time [Gou80], the problem is only solvable inexponential time for SDF graphs [PBL95].

2.2.3 Cyclo-Static Data Flow

An extension of the SDF model is Cyclo-Static Data Flow (CSDF). In the CSDF[BELP96] MoC, the communication behavior of an actor is constrained to havecyclically varying consumption and production rates between actor firings. Thelength of this cycle is known as the number 𝜏 of phases of a CSDF actor. Anactor firing of a CSDF actor is also known as a CSDF phase. To accommodatethe cyclical varying consumption and production rates, the definition of thecons and prod function have to be extended to return vectors of consumptionand production rates, respectively.

∙ The cons : 𝐶 → N𝜏0 function specifies for each channel 𝑐 ∈ 𝐶 a vector

(𝑛0, 𝑛1, . . . , 𝑛𝜏−1) of length 𝜏 (the number of phases of the CSDF actorsnk(𝑐)). The vector entries correspond to the consumption rates of theactor snk(𝑐) in its different phases, e.g., in its first phase, this actor con-sumes 𝑛0 tokens from the channel 𝑐.

∙ The prod : 𝐶 → N𝜏0 function specifies for each channel 𝑐 ∈ 𝐶 a vector

(𝑛0, 𝑛1, . . . , 𝑛𝜏−1) of length 𝜏 (the number of phases of the CSDF actorsrc(𝑐)). The vector entries correspond to the production rates of the

15

Page 24: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

actor src(𝑐) in its different phases, e.g., in its second phase, this actorproduces 𝑛1 tokens onto the channel 𝑐.

An example of a CSDF graph is depicted in Figure 2.6. One can see theannotated vectors for the consumption and production rates. Only actor 𝑎1 isa degenerated case with only one CSDF phase. This degenerated case is equiv-alent to an SDF actor with a production of 6 tokens per firing on channel 𝑐1.The two remaining actors 𝑎2 and 𝑎3 are true CSDF actors with 2 and 4 phases,respectively. As can be seen, the vectors of consumption and production ratesannotated to an actor have to be of the same length. This length exactly corre-sponds to the number of CSDF phases 𝜏 of the actor. Furthermore, in contrastto SDF, it is permissible for some entries of the production or consumption ratevector to be zero. This is used to denote that the actor will not produce orconsume any tokens on the channel for phases where the corresponding entryin the vector of production or consumption rates is zero, e.g., the actor 𝑎3 doesnot produce any tokens on channel 𝑐3 in its fourth CSDF phase.

(1,2,6,1)(2,3)

(7,1,2,0)(3,2)

(1,2)6

cons(𝑐1) = (1, 2)

prod(𝑐1) = (6)

𝑎1 𝑎3𝑎2

𝑐2

𝑐3

𝑐1

Figure 2.6: Example of a Cyclo-Static Data Flow graph

The question whether there exists an infinite sequence of actor firings for theactors of a CSDF graph such that the number of queued tokens on any channel ofthe graph does not exceed a finite bound is now considered. Again, an iterationhas to be determined for the CSDF graph first. The number of actor firings inan iteration may again be expressed via the repetition vector. For the purposeof calculating the repetition vector, all CSDF actor can be replaced by SDFactors with consumption and production rates derived by summing all rates inthe corresponding vectors of consumption and production rates of the CSDFactors. Therefore, the resulting balance equation for a channel 𝑐 is as follows.8

𝜂src(𝑐) ·∑

𝑛∈ℐ(prod(𝑐))

prod(𝑐)(𝑛) = 𝜂snk(𝑐) ·∑

𝑛∈ℐ(cons(𝑐))

cons(𝑐)(𝑛)

8In the following, a vector 𝑥 = (𝑥1, 𝑥2, . . . 𝑥𝑛) is also considered to be a function from itsindex set ℐ(𝑥) = { 1, 2, . . . 𝑛 } to its entries, that is 𝑥(𝑖) = 𝑥𝑖.

16

Page 25: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.3 Dynamic Data Flow

To exemplify, the obtained balance equations for the CSDF graph depicted inFigure 2.6 are as follows:

𝜂1 · 6 = 𝜂2 · 3 (2.4)𝜂2 · 5 = 𝜂3 · 10 (2.5)𝜂3 · 10 = 𝜂2 · 5 (2.6)

The Equation (2.4) represents channel 𝑐1, Equation (2.5) corresponds to 𝑐2,and finally Equation (2.6) for 𝑐3. The variables 𝜂1, 𝜂2, and 𝜂3 represent thenumber repetitions of all phases of a CSDF actor. Therefore, the number ofindividual firings 𝜂rep𝑛 of a CSDF actor 𝑎𝑛 can be derived by multiplying thenumber of CSDF phases 𝜏𝑛 of the actor 𝑎𝑛 with the number of phase repetitions𝜂𝑛. As is the case for SDF models, the smallest solution of the balance equationsuniquely determines the (minimal) repetition vector of the CSDF graph, e.g., forthe CSDF graph depicted in Figure 2.6, the repetition vector can be determinedas 𝜂rep = (𝜂rep1 , 𝜂rep2 , 𝜂rep3 ) = (𝜂1 · 𝜏1, 𝜂2 · 𝜏2, 𝜂3 · 𝜏3) = (1 · 1, 2 · 2, 1 · 4) = (1, 4, 4).Furthermore, the existence of a valid iteration, e.g., as depicted in Figure 2.7,corresponding to the repetition vector needs to be verified as, like in SDF, theexistence of the repetition vector is only a necessary but not sufficient criterionfor the existence of the iteration.

2.3 Dynamic Data Flow

In contrast to static data flow models, where the consumption and productionrates cannot be influenced by the values of the consumed tokens, dynamic dataflow actors can vary their consumption and production rates depending on thehistory of the consumed tokens and on the token to be consumed. Dynamic dataflow is the first introduced data flow model where consumption and productionrates depend on the data values being consumed and produced. Hence, somemore mathematical notations are required. First, signals are introduced in orderto represent the history of tokens being transported over a channel:

Definition 2.2 (Signal). A signal 𝑠 ∈ 𝑆 is (a possibly infinite) sequence ofvalues carried by the tokens being transported over a channel. The set of allsignals 𝑆 is the set of all possible finite and infinite sequences 𝒱** of values fromthe set of universal values 𝒱 .9

The number of values in a signal will be given by #𝑠, e.g., #⟨5, 8, 7⟩ = 3.This number will also be called the length of a signal. Furthermore, to access a

9The ’**’-operator is used to denote infinite Kleene closure of a value set. It generatesthe set of all possible finite and infinite sequences of values from the value set, that is𝑋** = ∪𝑛 ∈ N0,∞ : 𝑋𝑛. Where N0,∞ denotes the set of non-negative integers includinginfinity, that is { 0, 1, 2, . . .∞}.

17

Page 26: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

6 (1,2)

(3,2)

(2,3)

(1,2,6,1)

(7,1,2,0)

𝑐3

𝑐2

𝑐1𝑎1 𝑎2 𝑎3

(a) Initial state of the CSDFgraph

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(b) After firing of actor 𝑎3

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(c) After firing of actor 𝑎1

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(d) After firing of actor 𝑎2

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(e) After firing of actor 𝑎3

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(f) After firing of actor 𝑎2

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(g) After firing of actor 𝑎2

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(h) After firing of actor 𝑎3

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(i) After firing of actor 𝑎2

6 (1,2)(2,3)

(3,2)

(1,2,6,1)

(7,1,2,0)𝑐1𝑐3

𝑐2

𝑎1 𝑎2 𝑎3

(j) After firing of actor 𝑎3

Figure 2.7: Example of an iteration for the CSDF graph from Figure 2.6

certain value for any signal, the signal is considered to be a function 𝑠 : ℐ(𝑠)→ 𝒱from its index set to the set of values. If not further specified, the index set of asignal will always be starting from zero and continue in increments of one up tothe index #𝑠− 1 of the last defined value in the signal, e.g., given 𝑠 = ⟨5, 8, 7⟩,then 𝑠(0) = 5, 𝑠(1) = 8, and 𝑠(2) = 7. Moreover, some notation is required tomanipulate signals themselves:

∙ The tail𝑛 : 𝑆 → 𝑆 function discards the first 𝑛 elements of the signal andreturns the tail of the signal, e.g., tail2(⟨5, 8, 7, . . .⟩) = ⟨7, . . .⟩.

∙ The ’a’-operator is used to concatenate two signals, e.g., ⟨3, 4⟩a⟨8, . . .⟩ =⟨3, 4, 8, . . .⟩.

For convenience reasons, a set of signals can be bundled into a vector 𝑠 ofsignals, e.g., (𝑠1, 𝑠2, . . . 𝑠𝑛) = 𝑠 ∈ S. Likewise, the ’a’-operator is applied tovectors of signals by pointwise extension, that is (𝑠1, 𝑠2, . . . 𝑠𝑛)a(𝑠′1, 𝑠′2, . . . 𝑠′𝑛) =(𝑠1

a𝑠′1, 𝑠2a𝑠′2, . . . 𝑠𝑛

a𝑠′𝑛). The set S itself is used to denote the set of all suchvectors of bundled signals.

18

Page 27: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.3 Dynamic Data Flow

2.3.1 Kahn Process Networks

Kahn Process Networks (KPNs) are one of the oldest data flow MoCs. Theoriginal paper [Kah74] of Kahn used a denotational semantics to describe thebehavior of a KPN actor. In this denotational semantics, an actor 𝑎 is describedby a monotonic function 𝒦𝑎.10

𝒦𝑎 : 𝑆𝑛 → 𝑆𝑚

This function, also called a Kahn function, is mapping (possibly infinite) se-quences of tokens 𝑠𝑖 ∈ 𝑆 on the 𝑛 actor input ports to (possibly infinite) se-quences of tokens 𝑠𝑜 ∈ 𝑆 on the 𝑚 actor output ports.The behavior of a KPN is described by an equation system that represents the

connections of these actors to each other. To exemplify, the DFG in Figure 2.8is considered. The corresponding equation systems is given below:11

(𝑠1, 𝑠2) = 𝒦1(𝑠1) (2.7)

(𝑠3) = 𝒦2(⟨𝜈initial⟩a𝑠3, 𝑠2) (2.8)

The Equation (2.7) models the self-loop on actor 𝑎1 while Equation (2.8) modelsthe self-loop on actor 𝑎2 including the initial token with value 𝜈initial. Bothequations are coupled via signal 𝑠2 that models the channel connecting actor 𝑎1and 𝑎2.

Channel 𝑐2 carrying signal 𝑠2

Actor input port

Channel 𝑐1 carrying signal 𝑠1Actor output port

𝑎1 𝑎2𝑐2

𝑐1 𝑐3 𝜈 initial

Figure 2.8: Example of a Kahn Process Network

10In fact Kahn functions are required to be Scott-continuous, which is a slightly strongerrequirement than pure monotonicity. However, in practice every computer implementablefunction working on concrete signals is Scott-continuous if it is monotonic. For an in-depthdiscussion of the subtleties involved, see [LSV98].

11To avoid the usage of double subscripts, the technically correct notation 𝒦𝑎𝑛 will not beused.

19

Page 28: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

The definition of the Kahn functions 𝒦1 and 𝒦2 are given next. Here, thenotation 𝜆 denotes the signal without any values (𝜆 = ⟨⟩), that is #𝜆 = 0.

𝒦1(𝑠) = (𝑠, 𝑠) (2.9)

𝒦2(𝑠𝑎, 𝑠𝑏) =

{⟨𝑠𝑎(0) + 𝑠𝑏(0)⟩a𝒦2(tail1(𝑠𝑎), tail1(𝑠𝑏)) if #𝑠𝑎 ≥ 1 ∧#𝑠𝑏 ≥ 1𝜆 otherwise

(2.10)

As can be seen, the function 𝒦1 simply replicates its input signal to both of itsoutputs. In contrast to this, the function 𝒦2 describes a pointwise addition ofits input signals 𝑠𝑎 and 𝑠𝑏.However, if Equations (2.7) to (2.10) are considered without further con-

straints, then multiple solutions are possible. This is due to the fact that thesignal 𝑠1 is not constraint by Equation (2.7). Hence, the signal 𝑠1 can take onany value from the set of all signals 𝑆. This is an unsatisfactory situation fora denotational description of the KPN from Figure 2.8. From an operationalpoint of view, one would expect the signal 𝑠1 (corresponding to channel 𝑐1) tohave a single solution, the empty signal 𝜆, that is the signal of length zero. Thisis due to the deadlock implied by the delay-less self-loop of actor 𝑎1 via channel𝑐1.It is, however, possible to select from the set of solutions a unique solution

that corresponds to the operational point of view. This solution is called theleast fixed point of the set of equations. To formally define a least fixed point, adefinition of the prefix order ⊑ for signals is required. The prefix order 𝑠1 ⊑ 𝑠2 isa partial order that is used to denote that a signal 𝑠1 is a prefix of another signal𝑠2. More formally, 𝑠1 ⊑ 𝑠2 if and only if ℐ(𝑠1) ⊆ ℐ(𝑠2) and ∀𝑖∈ℐ(𝑠1) : 𝑠1(𝑖) = 𝑠2(𝑖).To exemplify, ⟨3, 1⟩ ⊑ ⟨3, 1⟩ ⊑ ⟨3, 1, 2⟩ ⊑ ⟨3, 1, 2, . . .⟩, but ⟨3, 1⟩ ⊑ ⟨4, 1, 2⟩ aswell as ⟨3, 1⟩ ⊑ ⟨3⟩.For convenience reasons, the prefix order is extended to vectors 𝑠1, 𝑠2 ∈ S of

signals. A vector 𝑠1 is a prefix of another vector 𝑠2 if all corresponding vectorelements are prefixes of each other, that is 𝑠1 ⊑ 𝑠2 if and only if ℐ(𝑠1) = ℐ(𝑠2)and ∀𝑖∈ℐ(𝑠1) : 𝑠1(𝑖) ⊑ 𝑠2(𝑖).Assuming that Ssolution is the set of all solutions for the Equations (2.7)

to (2.10), then the least fixed point Slfp is the least element in the set of so-lutions, that is Slfp ∈ { 𝑠 ∈ Ssolution | ∀𝑠′∈Ssolution∖{ 𝑠 } : 𝑠′ ⊑ 𝑠 }. However, thequestion remains if the least fixed point for such a set of equations is unique.Fortunately, it has been proven that, assuming all functions in the set of equa-tions are monotonic, then the least fixed point of such a set of equations eitherexists and is unique or does not exist at all [DP90]. The monotonicity require-ment for the functions corresponds to the customary definition of monotonicitywith the prefix order ⊑ as its ordering relation, that is a function 𝒦𝑎 : 𝑆

𝑛 → 𝑆𝑚

is monotonous if and only if ∀𝑠1, 𝑠2 ∈ 𝑆𝑛 : (𝑠1 ⊑ 𝑠2) =⇒ 𝒦𝑎(𝑠1) ⊑ 𝒦𝑎(𝑠2).

20

Page 29: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.3 Dynamic Data Flow

A constructive way to find the least fixed point starts with the empty signal𝜆 for all signals. To exemplify, the KPN depicted in Figure 2.8 and its cor-responding set of Equations (2.7) to (2.10) are considered. The three signals𝑠1, 𝑠2, and 𝑠3 that correspond to the three channels depicted in Figure 2.8 areinitialized to the empty signal 𝜆.

𝑠0 = (𝑠01, 𝑠02, 𝑠

03) = (𝜆, 𝜆, 𝜆) (2.11)

The empty signal denotes that at the beginning of the fixed-point computationno token values have been produced onto a channel. Note that the initial token𝜈initial depicted on channel 𝑐3 is already considered by Equation (2.8).The set of equations describing a KPN is used to calculate an updated vector

of signals 𝑠𝑛 = (𝑠𝑛0 , 𝑠𝑛1 , . . . 𝑠

𝑛𝑚) from the signals 𝑠𝑛−1 in the previous step. For the

KPN depicted in Figure 2.8, the vector of signals 𝑠𝑛 = (𝑠𝑛1 , 𝑠𝑛2 , 𝑠

𝑛3 ) is computed

by recursively applying Equations (2.7) to (2.8) to 𝑠𝑛−1 starting from the initialvector of signals given in Equation (2.11).

(𝑠𝑛1 , 𝑠𝑛2 ) = 𝒦1(𝑠

𝑛−11 ) (𝑠𝑛3 ) = 𝒦2(⟨𝜈initial⟩a𝑠𝑛−1

3 , 𝑠𝑛−12 )

Due to the monotonicity of the functions in the set of equations, the signals in theprevious step must always be a prefix of the updated set of signals, that is 𝑠𝑛−1 ⊑𝑠𝑛. This fact prevents the existence of a cycle in the sequence 𝑠0, 𝑠1, . . . 𝑠𝑛 ofupdated signal vectors, thus guaranteeing the existence of a unique least fixedpoint if no contradictions are encountered in the set of equations. This step isrepeated until a fixed point has been reached. As can be seen, for the examplefrom Figure 2.8 the initial state with all signals empty is already the least fixedpoint of the Equations (2.7) to (2.8).

𝑠01 = 𝑠11 = 𝜆 𝑠02 = 𝑠12 = 𝜆 𝑠02 = 𝑠12 = 𝜆

The question whether there exists an infinite sequence of actor firings for theactors of a KPN such that the number of queued tokens on any channel of thegraph does not exceed a finite bound is now considered. Unsurprisingly, thisproblem is undecidable. This is due to the fact that Kahn functions themselvesare general recursive functions, and therefore are Turing equivalent [Ros65].

2.3.2 Boolean Data Flow

More surprisingly, a KPN model remains Turing equivalent [Buc93], hence theproblem of execution in bounded memory undecidable, if the Kahn functions arerestrictions to perform simple arithmetic operations as well as the two primitiveswitch (𝒦swt(𝑠1, 𝑠2) = (𝑠out,1, 𝑠out,2)) and select (𝒦sel(𝑠1, 𝑠2, 𝑠3) = 𝑠out) func-tions. A KPN model with these restrictions is also known as a Boolean Data

21

Page 30: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

Flow (BDF) graph. The BDF MoC realizes the switch and select functions viadedicated actors. The simple arithmetic operations are supported by BDF viaits ability to model static actors, e.g., like an SDF actor implementing an addi-tion. The switch actor, depending on the truth value of a control token from itsinput signal 𝑠1, forwards a token from its input signal 𝑠2 to its first or secondoutput. The select actor acts in reverse. Depending on the truth value of acontrol token from signal 𝑠1, it either forwards a token from input signal 𝑠2 or𝑠3 to its output.The usage of these two functions together with the arithmetic primitives en-

able the construction of arbitrary control flow logic. The channels (of infinitecapacity) can be used to represent an infinite memory. Together with the controlflow logic, a Turing machine can be implemented by the BDF model [Buc93].

2.3.3 Dennis Data Flow

While the denotational description of an actor is clear from its Kahn function𝒦𝑎, an implementation of this functionality is usually not based on a func-tion mapping infinite sequences to infinite sequences, but uses the operationsemantics of Dennis Data Flow (DDF) [Den74].12 To exemplify, the operationalimplementation (cf. Example 2.1) of the Kahn actor 𝑎2 is considered. As theactor is considered in isolation, the notion of channels is no longer appropriateto identify the source and destination of tokens. Hence, the notion of ports isused to identify the source and destination of consumed and produced tokens.A depiction of the actor with its input and output ports has already been givenin Figure 2.8, see also Figure 2.9.In the following, the notion of ports will be used interchangeably with the

channels/signals connected to these ports. Hence, expressions like cons(𝑖),prod(𝑜), and #𝑖 are used and refer to the consumption rate, production rate,and number of tokens on the channel connected to the respective input or outputport.

Output port of actor 𝑎2

Input ports of actor 𝑎2

𝑎2𝑜1

𝑖1

𝑖2

Figure 2.9: The KPN actor 𝑎2 from Figure 2.8 considered in isolation

12In fact, this implementation preference is so strong, that KPN actors are usually understoodto be constrained to the subset that can be implemented by the operational semantics ofDDF.

22

Page 31: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.3 Dynamic Data Flow

For the KPN in Figure 2.8, the input port 𝑖2 simply corresponds to the signal𝑠2. In the operational view, this is represented by the channel 𝑐2. This channelis queuing the tokens produced by actor 𝑎1 in FIFO order. The self-loop fromoutput port 𝑜1 to input port 𝑖1 is more complex. In the operational view, it isrepresented by channel 𝑐3. The channel is a FIFO containing the initial token𝜈initial. This is represented in the denotational view (cf. Equation (2.8)) by thetemporary signal 𝑠′3 = ⟨𝜈initial⟩

a𝑠3.The Kahn function𝒦2 from Equation (2.10) is implemented by a DDF [Den74]

actor as depicted in Example 2.1. The DDF model uses (blocking) read (?) andwrite (!) primitives for its implementation. As is the case for 𝒦2, the operationalimplementation needs one token from input port 𝑖1 and one token from inputport 𝑖2 to produce a token on the output port 𝑜1. As the read primitive isblocking, the read operation on port 𝑖2 will only be executed after a token fromport 𝑖1 was available. In the following, t and f are used to denote Boolean truthvalues.

Example 2.1 Operational implementation of the Kahn function 𝒦2

1: procedure 𝒦2(𝑖1, 𝑖2, 𝑜1)2: while t do ◁ Infinite iteration3: 𝑖1 ? 𝜈1 ◁ Read a value from the channel attached to port 𝑖14: 𝑖2 ? 𝜈2 ◁ Read a value from the channel attached to port 𝑖25: 𝑜1 ! 𝜈1 + 𝜈2 ◁ Write the sum to the channel attached to port 𝑜1

However, the blocking semantics of the DDF operational paradigm can onlyimplement sequential functions [KP78]. While any sequential function is a Kahnfunction [Lee97], the reverse is not true [Win93]. One could argue that inthe real world, the restriction of KPN to the operational semantics of DDF ispermissible. However, this raises the problem of compositionality. A data flowmodel is compositional if it is possible to take an arbitrary connected subgraphand represent it as an actor in the data flow model. If a data flow modelis compositional, then the highly desirable operation of abstraction becomesseamlessly possible. An abstraction operation can be performed by hiding theimplementation complexity of an arbitrary connected subgraph of the modelbehind the interface of an actor. If the data flow model is compositional, thenthis composite actor, which represents the subgraph, can be handled like anyother actor in the system. Otherwise, the composite actor is always an exceptionand needs special treatment.To exemplify, consider the data flow subgraph depicted in Figure 2.10. In

this subgraph, the actors 𝑎3 and 𝑎4 represent HSDF actors. All HSDF actorscan also be represented by DDF actors. The corresponding Kahn function𝒦3 ≡ 𝒦4 ≡ 𝒦3,4 is given in Equation (2.12). The topology of the subgraph canbe represented by the Equations (2.13) and (2.14).

23

Page 32: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

𝑖1

𝑖2

𝑜1

𝑜2

𝜈 ini1

𝜈 ini2

𝑎3

𝑎4

𝑠2𝑠1

𝑔𝛾

Figure 2.10: A data flow subgraph consisting of two HSDF actors

𝒦3,4(𝑠𝑎, 𝑠𝑏) =

{(⟨𝑠𝑎(0)⟩, ⟨𝑠𝑏(0)⟩)a𝒦3,4(tail1(𝑠𝑎), tail1(𝑠𝑏)) if #𝑠𝑎≥1 ∧#𝑠𝑏≥1(𝜆, 𝜆) otherwise

(2.12)

(𝑜1, 𝑠1) = 𝒦3(𝑖1, ⟨𝜈ini2⟩a𝑠2) (2.13)

(𝑜2, 𝑠2) = 𝒦4(𝑖2, ⟨𝜈ini1⟩a𝑠1) (2.14)

It turns out that such a system has a least fixed point for any input signal𝑖1 and 𝑖2 [LSV98], and this fixed point again represents a Kahn function. Inthat sense, the KPN model is compositional. Another important characteristicsemerges from this fact. The behavior of a KPN model, be it a complete graphor a subgraph, is independent from the scheduling of actors, which is givenby the operational implementation. Such a data flow model is called sequencedeterminate [LSV98]. However, the resulting Kahn function is not generallyrepresentable via the operational semantics of DDF. Therefore, DDF is non-compositional. To exemplify, the Kahn function 𝒦𝛾 for the least fixed point ofthe equation system Equation (2.12) to Equation (2.14) is given below.

𝒦𝛾(𝑖1, 𝑖2) =

⎧⎨⎩(⟨𝑖1(0)⟩, 𝜆)a𝒦h1(tail1(𝑖1), 𝑖2) if #𝑖1 ≥ 1(𝜆, ⟨𝑖2(0)⟩)a𝒦h2(𝑖1, tail1(𝑖2)) if #𝑖2 ≥ 1(𝜆, 𝜆) otherwise

(2.15)

where 𝒦h1(𝑖1, 𝑖2) =

{(𝜆, ⟨𝑖2(0)⟩)a𝒦𝛾(𝑖1, tail1(𝑖2)) if #𝑖2 ≥ 1(𝜆, 𝜆) otherwise

(2.16)

where 𝒦h2(𝑖1, 𝑖2) =

{(⟨𝑖1(0)⟩, 𝜆)a𝒦𝛾(tail1(𝑖1), 𝑖2) if #𝑖1 ≥ 1(𝜆, 𝜆) otherwise

(2.17)

The problem is caused by the requirement to either serve input 𝑖1 or input 𝑖2(cf. Equation (2.15)). Obviously, another operational semantics for KPN modelsis required. In the following, the FunState model is introduced, which, amongstothers, can implement 𝒦𝛾.

24

Page 33: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.3 Dynamic Data Flow

2.3.4 FunState

In contrast to the DFG defined in Definition 2.1, Functions Driven by StateMachines (FunState)13 [TSZ+99] are bipartite graphs of functions and channels.This enables the FunState model to have potentially multiple readers and writersfor a channel. The activation of functions is controlled by an FSM. Functionscan be annotated with execution times. Note that the FunState FSM onlytriggers an execution of a function, but does not wait for its completion. Hence,input tokens are consumed by the function when it is triggered by the FSMbut output tokens are not guaranteed to have been produced after the FSMtransition has been taken. More formally, the definition of a FunState networkis given below:

Definition 2.3 (FunState Network [TTN+98]). A network 𝑔 = (ℱ , 𝐶, 𝐸) (of abasic FunState model) contains a set of storage units 𝐶, a set of functions ℱ ,and a set of directed edges 𝐸 ⊆ (𝐶 × ℱ) ∪ (ℱ × 𝐶). A FunState network 𝑔is a directed bipartite graph. Data flow is represented by moving tokens. Themarking 𝑚 is the distribution of tokens to storage units. The marking 𝑚0 is theinitial marking.

The FunState model does not only support FIFO channels for its storageunits, but also registers. The problem of an operational implementation of 𝒦𝛾

by the DDF model is bypassed by the ability of the FunState FSM to wait forconditions on multiple input channels. However, due to this relaxation com-pared to the DDF model, FunState models in general are no longer determinate[LSV98], that is the output of a FunState model may depend on the schedulingstrategy employed to execute it. Hence, FunState is a representative of the Non-Determinate Data Flow (NDF) MoC [Kos78]. More formally, the definition ofa FunState FSM is given below:

Definition 2.4 (FunState FSM [TSZ+99]). A FunState FSM ℛ = (𝑄, 𝑇, 𝑞0)(of a basic FunState model) consists of a set of named states 𝑄 and a set oftransitions 𝑇 = 𝑄 × 𝑄 × ℱcondition × ℱ) and a unique initial state 𝑞0 ∈ 𝑄.ℱcondition is the set of possible state transition conditions. In the basic model, acondition 𝑘 ∈ ℱcondition is a logical propositions over (in)equalities of the form:<query of storage unit> <relational operator> <constant>For example, #𝑐 ≥ 1 evaluates to true if the channel 𝑐 contains at least onetoken.

However, the FunState model, as introduced above, is not compositional, asit only possesses a central FSM. To obtain compositionality, a hierarchical13The model was initially published [TTN+98] as State Machine Controlled Flow Diagrams

(SCF), but in later literature it is referred to as FunState, which is a short for FunctionsDriven by State Machines.

25

Page 34: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

#𝑖2 ≥ 1∧#𝑐1 ≥ 1/𝑎4

#𝑖2 ≥ 1∧#𝑐1 ≥ 1/𝑎4

#𝑐2 ≥ 1/𝑎3

#𝑖1 ≥ 1∧#𝑐2 ≥ 1/𝑎3

#𝑖1 ≥ 1∧

#𝑖1 ≥ 1∧#𝑖2 ≥ 1/𝑓 1

#𝑖1 ≥ 1∧#𝑖2 ≥ 1/𝑓 2

𝑞0

𝑞0

𝑞0

𝑞1

𝑞2

𝑜2

𝑜2𝑖2

𝑖2

𝑖1 𝑜1

𝑖1𝑜1

𝑐1 𝑐2

𝑓 1

𝑓 2

𝑜1𝑜2𝑖2

𝜈 ini1

𝜈 ini2

𝑖1

𝑎4

𝑎3

𝑎𝛾

Figure 2.11: Operational implementation of the two HSDF actors from Fig-ure 2.10 by a subgraph of a FunState network

FunState model is used. In this model, an FSM can be associated with a sub-graph and the subgraph is represented by a vertex in its containing graph. Toallow the subgraphs to be self-contained, a notion of ports is again used. Theseports are used interchangeably with the channels to which they are connected.To exemplify, the FunState network from Figure 2.11 is considered. This net-work corresponds to the KPN subgraph from Figure 2.10. Both actors 𝑎3 and𝑎4 from Figure 2.10 are represented by subgraphs in Figure 2.11, but these sub-graphs themselves act as functions in the containing graph 𝑎𝛾 . The FSM ofactor 𝑎3 activates the function 𝑓1 if the channels connected to the ports 𝑖1 and𝑖2 of actor 𝑎3 have at least one token available. The activation of the FSM ofactor 𝑎3 and 𝑎4 is itself controlled by the FSM of the composite actor 𝑎𝛾 . Bydisregarding the FunState network inside of actor 𝑎𝛾 and only considering theFSM of 𝑎𝛾 , the following three operations are possible:

∙ If the FSM of 𝑎𝛾 is an equivalent to the behavior of the contained FunStatenetwork of 𝑎𝛾 , then a composition operation has been performed. The im-plementation complexity of the contained subgraph of 𝑎𝛾 is hidden behindthe actor interface of 𝑎𝛾 .

∙ If the FSM of 𝑎𝛾 is simplified in such a way that it implements a superset ofthe behavior of the contained FunState network of 𝑎𝛾 , then an abstraction

26

Page 35: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.4 Hierarchical Integration of Finite State Machines and Data Flow

step has been performed. The behavior has to be a superset, as lateranalysis steps will enforce a property for all possible behaviors of a DFG.This analysis will use the FSM of 𝑎𝛾 as the description for the containedsubgraph. Therefore, enforcement of a property for all possible behaviors,even artificial behaviors not occurring in the real implementation, is asafe operation as it implies enforcement of the property for the behaviorsoccurring in real implementation of the system.

∙ If the FSM of 𝑎𝛾 is a subset to the behavior of the contained FunState net-work of 𝑎𝛾 , then a refinement step has been performed. Refinements stepshave to be reflected in the real implementation of the systems, otherwiseanalysis steps on the refined FSM will not consider all possible behaviorsof the system.

For a formal definition of the compositional FunState model, see [TSZ+99,TTN+98].

2.4 Hierarchical Integration of Finite State Machinesand Data Flow

One of the first modeling approaches integrating Finite State Machines (FSMs)with data flow models is *charts (pronounced ”star charts”) [GLL99, Pto14]. Theconcurrency model in *charts is not restricted to be data flow: Other choices arediscrete event models or synchronous/reactive models. However, in the scope ofthis section, the discussion has been limited to the FSM/data flow integration.First, a formal definition of a deterministic FSM will be given:

Definition 2.5 (FSM). A deterministic FSM is a five tuple (𝑄,Σ,Δ, 𝜎, 𝑞0),where 𝑄 is a finite set of states, Σ is a set of symbols denoting the possibleinputs, Δ is a set of symbols denoting possible outputs, 𝜎 : 𝑄× Σ→ 𝑄×Δ isthe transition mapping function, and 𝑞0 ∈ 𝑄 is the initial state.

In one reaction, an FSM maps its current state 𝑞src ∈ 𝑄 and an input symbol𝑎 ∈ Σ to a next state 𝑞dst ∈ 𝑄 and an output symbol 𝑏 ∈ Δ, where (𝑞dst, 𝑏) =𝜎(𝑞src, 𝑎). Given a sequence of input symbols, an FSM performs a sequence ofreactions starting in its initial state 𝑞0. Thereby, a sequence of output symbolsin Δ is produced.FSMs are often represented by state transition diagrams. In state transition

diagrams, vertices correspond to states and edges model state transitions, seeFigure 2.12. Edges are labeled by 𝑔𝑢𝑎𝑟𝑑/𝑎𝑐𝑡𝑖𝑜𝑛 pairs, where 𝑔𝑢𝑎𝑟𝑑 ∈ Σ specifiesthe input symbol triggering the corresponding state transition and 𝑎𝑐𝑡𝑖𝑜𝑛 ∈ Δ

27

Page 36: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

𝑞1𝑞0𝑎/𝑢

𝑏/𝑣

Figure 2.12: A state transition graph representing an FSM

specifies the output symbol to be produced by the FSM reaction. The edgewithout source state points to the initial state.One reaction of the FSM is given by triggering a single enabled state tran-

sition. An enabled state transition is an outgoing transition from the currentstate where the guard matches the current input symbol. Thereby, the FSMchanges to the destination state of the transition and produces the output sym-bol specified by the action. To simplify the notation, each state is assumed tohave implicit self transitions for input symbols not used as guards for any ex-plicit transition. The action in such cases is empty or the output symbol 𝜖 ∈ Δindicating the absence of output.FSMs as presented in Definition 2.5 can be easily extended to multiple input

signals and multiple output signals. In such a scenario, input and output sym-bols are defined by subdomains, i.e., Σ = Σ1×· · ·×Σ𝑛 and Δ = Δ1×· · ·×Δ𝑚.The *charts approach allows for nesting of DFGs and FSMs by allowing the

refinement of a data flow actor via FSMs and allowing an FSM state to berefined via a DFG. First, the refinement of a data flow actor by an FSM willbe discussed.

2.4.1 Refining Data Flow Actors via Finite State Machines

To use an FSM as a refinement of a data flow actor, the consumption andproduction rates have to be determined from the FSM for the refined actor.However, the derivation of the consumption and production rates depends onthe MoC of the DFG containing the actor. The tokens on the channels connectedto the refined data flow actor are mapped to the input alphabet Σ of the refiningFSM. In the guard of the FSM, the notation 𝑖[𝑛] refers to the 𝑛th token on thechannel connected to the input port 𝑖. In the action of the FSM, the notation𝑜[𝑛] refers to the 𝑛th token to produce on the channel connected to the outputport 𝑜. Each unique 𝑖[𝑛] has a symbol subdomain Σ𝑚 associated with it and thecross product of these subdomains forms the input alphabet to the FSM. Thesame is true for 𝑜[𝑛] and the output alphabet Δ, thus perfectly matching theFSM model (see Definition 2.5). In each reaction of the FSM, the correspondingaction emits a symbol from the output alphabet of the FSM, which is used toderive a token value to be produced on each outgoing edge of the actor. However,the semantics of the FSM changes depending on the MoC of the DFG the FSMis embedded into.

28

Page 37: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.4 Hierarchical Integration of Finite State Machines and Data Flow

∙ Finite State Machine in Kahn Process Network:

If an FSM refines a DDF actor, then each state of the FSM is associatedwith a number of tokens that will be consumed by the next firing of theactor. The semantics is as follows: For a given state 𝑞 of the FSM, thenumber of consumed tokens is determined by examining each outgoingtransition of the state. For each input port 𝑖 of the actor, the maximumindex 𝑛 is determined with which the input port 𝑖 occurs in any of thetransitions leaving the state 𝑞. If the input port 𝑖 does not occur inany transition, an index of −1 is assumed. Then, for the given state𝑞, the consumption rate for each input port 𝑖 is the maximum index 𝑛incremented by one. Note that this derivation of the consumption ratesimplicitly produces actors which are continuous in Kahn’s sense [Kah74]and can be implemented by a DDF actor.

To illustrate the above concept, consider state 𝑞0 in the FSM depicted inFigure 2.13 refining the actor 𝑎2. In this case, regardless if the transitionto 𝑞1 or 𝑞2 is taken, the actor will consume exactly two tokens from 𝑖1 andone token from 𝑖2.

𝑖1[0]/𝑜1[0]

𝑖2[0]/𝑜1[0]

𝑖1[0] ∧ 𝑖1[1]

𝑖2[0]

𝑞1𝑞0

𝑞2

𝑖2

𝑖1 𝑜1𝑎2𝑎1 𝑎3

KPN graph

Figure 2.13: Refinement of DDF actor 𝑎2 by an FSM in *charts

∙ Finite State Machine in Synchronous Data Flow:

When an FSM refines an SDF actor, it must externally obey SDF seman-tics, i.e., it must consume and produce a fixed number of tokens in eachfiring. Consider Figure 2.14 taken from [GLL99] as an example. Let usassume that the depicted data flow graph contains only HSDF actors, thenthe minimal repetition vector is 𝜂rep1 = 𝜂rep2 = 1. Since the channel be-tween the actors 𝑎1 and 𝑎2 contains no initial tokens, actor 𝑎1 fires beforeactor 𝑎2 in each iteration. In the first iteration, actor 𝑎1 fires if at leastone token is available on each input edge. Afterwards, HSDF actor 𝑎2 isenabled in the first iteration but also HSDF actor 𝑎1 might be enabledin the second iteration assuming sufficient input data is available. Hence,both FSMs can execute concurrently.

29

Page 38: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

In the following, the refinement of SDF actors by FSMs will be considered.The refined SDF actor must externally obey SDF semantics. Hence, theconsumption and production rates for each port of the refined SDF actormust be determined from the maximum of the consumption and produc-tion rates that is exhibited in all transitions of the FSM refining the SDFactor. In the simplest case, it holds that for every port the consumptionor production rate of that port is identical in all transitions of the FSM.An example of such an FSM can be seen for the refinement of the SDFactor 𝑎2 in Figure 2.14. The FSM neatly corresponds to the SDF modelconsuming one token and producing one token in each reaction.

1 11

1

1

𝑖1[0]/𝑜1[0]𝑖1[0]/𝑜1[0]

𝑖2[0]𝑞1𝑞0 𝑞0

𝑖2

𝑖1 𝑜1 𝑖1 𝑜1

SDF graph𝑎2𝑎1

Figure 2.14: Refinement of SDF actors by FSMs in *charts

However, the FSM for actor 𝑎1 from Figure 2.14 does not correspond to thesimple case. In this case, the consumption or production rate of a portof the refined SDF actor is determined from the maximum of the con-sumption or production rate of this port for every transition of the FSM.Consumed tokens which are not used be the FSM are simply discarded. Ifthe values for the tokens that must be produced according to the produc-tion rates of the SDF actor are not provided by the FSM transition, thenthe so-called default value 𝜖 will be used for the produced token. Thissolution is a little unsatisfactory and indeed there is the HeterochronousData Flow (HDF) approach [GLL99] that trades compositionality for theability to have state-dependent consumption and production rates.

∙ Finite State Machine in Heterochronous Data Flow: The idea of HDF issimilar to parameterized data flow modeling [BB01] and scenario-awaredata flow modeling [TGB+06] where changes of consumption and produc-tion rates cannot be performed at arbitrary points in time, but only afteran iteration of the DFG has finished. Here, an iteration represents a sce-nario, and after a scenario has been finished, the DFG can switch to a newscenario with possibly changed consumption and production rates. Param-eterized data flow models and HDF models differ in the ability of param-eterized data flows to express an infinite number of scenarios while HDFmodels can only express a finite number. For the finite case, a compile-timeanalysis, e.g., with respect to throughput [TGB+06, SGTB11][GFH+14*],required buffer capacities, deadlock freeness, etc., can be performed.

30

Page 39: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2.4 Hierarchical Integration of Finite State Machines and Data Flow

In HDF [GLL99], all (actor) FSMs in the HDF graph are forced to onlychange state once the HDF graph has executed a full iteration. After theiteration is finished, the HDF actors are free to update their state leadingto new consumption and production rates for the HDF actors in the sys-tem. With these new consumption and production rates, a new balanceequation may be solved and a new repetition vector calculated which isexecuted in the next iteration. For the duration of this next iteration, allHDF actors have to keep their consumption and production rates unmod-ified. Thus, a product FSM can be computed for a HDF graph where eachstate represent a scenario and the transitions correspond to the possibleswitches from one scenario to another one. This product FSM correspondsto the FSM of the FSM scenario-aware data flow model [SGTB11] that isused to represent scenarios (states) and scenario switches (transitions).

Let us consider again Figure 2.14. But now instead of an SDF domain, anHDF domain is assumed. In the case that actor 𝑎1 is in state 𝑞0, to executea full iteration of the HDF graph, the actors 𝑎1 and 𝑎2 are executed exactlyonce. Note that while actor 𝑎1 is executed, it remains in state 𝑞0 regardlessof the value 𝑖1[0] of the first token. After the full iteration of the HDFgraph (𝜂rep1 = 𝜂rep2 = 1) has finished, the FSM of actor 𝑎1 may change itsstate to 𝑞1 if the value 𝑖1[0] of the first token on input port 𝑖1 evaluatesto true. In the case that actor 𝑎1 is now in state 𝑞1, a full iteration ofthe HDF graph corresponds to the sole execution of the actor 𝑎1. Afterthe full iteration of the HDF graph has finished, the FSM of actor 𝑎1 maychange its state to 𝑞0 depending on the value 𝑖2[0] of the first token oninput port 𝑖2.

2.4.2 Refining Finite State Machine States via Data Flow

Previously, it has been seen how an FSM can be used to refine a data flow actor.On the opposite side, an FSM can be used to coordinate between multipleDFGs. This coordination is achieved by refining FSM states by DFGs. TheDFG is composed into a single actor which is executed if the FSM is in therefined state. To refine a state by a DFG, a notion of iteration is necessaryas the execution of one reaction of the FSM has to terminate. An iterationhas been chosen as a natural boundary to stop the execution of the embeddedDFG. However, the existence of a finite iteration is undecidable for generalDFGs. Hence, the application of refinements of states to DFGs is restricted in*charts to static data flow models, i.e., HSDF, SDF, and CSDF models, whichprovide such a notion of iteration naturally. Moreover, combining the actorsin a subgraph of a DFG into a single composite actor, which will execute aniteration for the subgraph, is not always possible.

31

Page 40: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

2. Data Flow Fundamentals

2

1

1 2 2

𝑖1[0] ∧ 𝑖1[1] ∧ 𝑖1[2] ∧ 𝑖1[3]/𝑜1[0]

𝑖2[0] ∧ 𝑖2[1]𝑞1𝑞0

𝑜1𝑖1𝑖2

𝑖1𝑖2

𝑜1

𝑎3𝑎2

SDF graph𝑔𝛾

𝑎1

Figure 2.15: An example actor refining an FSM state via an SDF graph

As an example, consider Figure 2.15, which is taken from [GLL99]. Thetop level FSM of the actor 𝑎1 consists of two states 𝑞0 and 𝑞1. If the currentstate of the FSM is not refined, e.g., 𝑞0, then the FSM reacts like described inSection 2.4.1. If the current state of the FSM is refined by an SDF graph, e.g.,𝑞1, then for a reaction of the FSM exactly as many tokens will be required asfor the execution of the iteration of the DFG refining the state of the FSM.Assuming that the FSM of actor 𝑎1 is in state 𝑞1, then each activation of 𝑎1

will execute the embedded SDF graph for one iteration. If at least one tokenof the tokens 𝑖2[0] and 𝑖2[1] evaluates to false, then the embedded SDF graphwill execute one iteration, but the FSM will remain in state 𝑞1. If the FSM isnot embedded in an HDF graph and the tokens 𝑖2[0] and 𝑖2[1] evaluate both totrue, then a state transition to state 𝑞0 will be taken after the embedded SDFhas finished its iteration.The exact behavior of the FSM of actor 𝑎1 depends on the MoC of the DFG

the actor is embedded in. If the FSM is embedded in the DDF or SDF domain,then the FSM executes a transition after the corresponding DFG has finishedits iteration. If the FSM is embedded in the HDF domain, then this transitionis delayed until all parent graphs have finished their iteration.In the next chapter, the SysteMoC language, the input for the proposed

clustering-based design flow, is introduced and its MoC is classified in termsof the hierarchy of expressiveness as discussed in this chapter.

32

Page 41: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3SysteMoC

In this chapter, the first key contribution of this work, the SystemC Modelsof Computation (SysteMoC) modeling language [FHZT13*, FHT06*] for theElectronic System Level (ESL) will be presented. This language has strongformal underpinnings in data flow modeling with the distinction that the ex-pressiveness of the data flow model used by an actor is not chosen in ad-vance but determined from the implementation of the actor via classification[FHZT13*, ZFHT08*]. In order to determine the data flow model used by aSysteMoC actor by means of classification, the SysteMoC language, in contrastto SystemC, enforces a distinction between communication and computation ofan actor. A further key point of the SysteMoC language is the ability of exe-cutable specifications modeled in SysteMoC to act as an input model for DesignSpace Exploration (DSE) as well as to serve as an evaluator [SFH+06*] duringDSE to evaluate timing and power characteristics of the executable specificationfor the different design decisions that are to be explored. Note that parts of thischapter are derived from [FHT06*, FHT07*, ZFHT08*, FZK+11*, FHZT13*].In the following, an outline for the rest of this chapter is given: In Section 3.1,

the importance of SystemC for modeling at the ESL will be discussed. Then, theintegration of an executable specification modeled in SysteMoC into the Sys-temCoDesigner [HFK+07*, KSS+09*] DSE methodology will be summarizedin Section 3.2.Next, the SysteMoC modeling language will be introduced, first by presenting

the underlying Model of Computation (MoC) of SysteMoC in Section 3.3, andthen by presenting the syntax of SysteMoC in Section 3.4. Subsequently, inSection 3.5, SysteMoC models are extended by a notion of hierarchy.A data flow graph that is conforming to the MoC of SysteMoC can in theory

also be extracted from a pure SystemC model if certain coding standards areenforced. In Section 3.6, an example of such an extraction will be given.After a data flow graph conforming to the MoC of SysteMoC is available, a

classification algorithm introduced in Section 3.7 is used to determine for eachactor of the graph whether the actor belongs to one of the three static data flowmodels of computation, i.e., Homogeneous (Synchronous) Data Flow (HSDF),Synchronous Data Flow (SDF), and Cyclo-Static Data Flow (CSDF). The

33

Page 42: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

classification algorithm provides only a sufficient criterion if a general SysteMoCactor conforms to one of the static data flow models of computation. Thislimitation stems from the fact that the problem in general is undecidable.Finally, the related work will be discussed in Section 3.8.

3.1 Electronic System Level Design and SystemC

The research area called Electronic System Level (ESL) design tries to copewith the problem of rising hardware and software design complexity. Industryis increasingly adopting new design methodologies that employ abstraction lev-els beyond Register Transfer Level (RTL). The rising level of abstraction onthe hardware design side gives rise to the opportunity to unify hardware andsoftware modeling by a common modeling language. One of such approaches isthe ESL ecosystem around the SystemC [GLMS02, Bai05] modeling language.The SystemC framework is an open source C++ class library. It was initially

promoted by the Open SystemC Initiative (OSCI) to facilitate an industry-widestandard for the interchange of high-level models. The Application Program-ming Interface (API) of this library has been standardized [Bai05] by the IEEE.This has enabled independent support of this API be various Electronic DesignAutomation (EDA) vendors like Mentor Graphics [Men13], Synopsys [Syn13],Imperas [Imp13], Forte Design Systems [For13], and many more.Furthermore, existing high-level synthesis tools like Forte Cynthesizer [For13],

Catapult C from Calypto Design Systems [Cal13], or Vivado from Xilinx [XIL14]allow the translation of behavioral SystemC models to the RTL level. Hence,these tools bridge the gap from a high-level behavioral implementation of analgorithm in SystemC to an implementation of the algorithm at RTL, which isrequired to realize the algorithm as an Application-Specific Integrated Circuit(ASIC) or in a Field Programmable Gate Array (FPGA). As SystemC itselfis a C++-based class library, a transformation of SystemC into a pure C++software implementation is also possible if the SystemC application is modeledat an appropriate level of abstraction and certain constraints are satisfied. Thus,SystemC is uniquely positioned as a design language for the ESL.In [Tei12], Teich presents the so-called double roof model of hardware/soft-

ware codesign. This model illustrates (cf. Figure 3.1) the usage of successiverefinements in order to derive implementations containing hardware and soft-ware subsystems from a functional specification at ESL. The double roof modelcan be understood as an extension of the Y-chart approach [KDWV02] by ex-plicitly separating the development chains for hardware and software.In contrast to dedicated software languages like C, C++, and Java or dedi-

cated hardware description languages like Very High Speed Integrated CircuitHardware Description Language (VHDL) and Verilog Hardware Description

34

Page 43: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.1 Electronic System Level Design and SystemC

Module (Task)

Block

Architecture

Logic

Behavioral View

Structural View

Software Hardware

System

. . . . . .

Figure 3.1: The double roof model of hardware/software codesign from [Tei12]shows the system level (shaded) at the top connecting the software(left) and hardware development chains (right) through successivesynthesis refinements (vertical arrows). Each synthesis step maps afunctional specification onto a structural implementation on the nextlower level of abstraction.

Language (Verilog), the SystemC language has broad applicability to repre-sent the functionality at various levels of abstraction depicted in the doubleroof model. SystemC was initially developed by OSCI to represent virtual pro-totypes and to serve as an input language for behavioral synthesis tools likeForte Cynthesizer [For13], Catapult C from Calypto Design Systems [Cal13],and Vivado from Xilinx [XIL14]. Virtual prototypes are situated in the doubleroof model [GHP+09] at the structural view of the system level. The behavioralview of the system level in the double roof model corresponds to the executablespecification at ESL.On the hardware development chain side of the double roof model, SystemC

is used as an input and output language for behavioral synthesis tools. Thesebehavioral synthesis tools translate an algorithmic problem description to theRTL level. This translation corresponds to a refinement at the architecturelevel of the double roof model from the behavioral view to the structural view.SystemC can also be used to represent the functionality at the logic level ofabstraction. However, it is not well-suited for modeling at this level. Dueto its heritage from C++, SystemC can also be used adequately to representfunctionality at the module and block level of the software development chainside of the double roof model. In the following, SystemC is examined in moredetail:

35

Page 44: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

∙ SystemC is a C++ class library. Thus, it can easily integrate modelswritten in C, C++, or any language that supports C linkage.14

∙ SystemC provides data types for hardware modeling, e.g., arbitrary lengthbit vectors, four valued logic types, corresponding arbitrary length logicvectors, arbitrary precision integers, fixed-point data types with variousrounding and saturation modes, and many more. Furthermore, for mod-eling purposes the user can define his own data type by usage of commonC++ constructs.

∙ At the core of a SystemC simulator resides a discrete event simulationkernel. Events can be generated and notified at user-specified times inthe future. SystemC provides a mechanism to associate actions with anevent. If the simulation time reaches the specified notification time of anevent, the associated actions are executed. Based on the discrete eventsimulation kernel, SystemC provides a cooperative multitasking environ-ment. The SystemC processes provided by the cooperative multitaskingenvironment can be enabled by the notification of an event or suspendthemselves waiting for the notification of an event.

∙ The structure of a SystemC design is borrowed from the world of hardwaredescription languages. Hence, it corresponds to the design structure asseen in the VHDL and the Verilog. Therefore, SystemC designs representthe architecture of a system. The structure consists of concurrently exe-cuting modules. The modules communicate via ports. These ports shouldbe the only way for modules to communicate with their environment. Thechannels implement a certain interface and are themselves connected to aport, which requires this interface. In the parlance of data flow, a modulecorresponds to an actor. However, SystemC channels are not restricted toFirst In First Out (FIFO) communication, but may implement any kindof communication fabric.

∙ Finally, SystemC provides a number of standard channels. These prede-fined channel types range from signals, as known from VHDL and Verilog,to registers, and FIFO channels.

The rising complexity of current designs stems not only from the increase indesign functionality, but also from the potentially exponential increase in thenumber of ways this design functionality can be implemented in an embeddedsystem. At the root of the double roof model (shaded in Figure 3.1) is a DSE

14Due to the pervasiveness of C/C++-based software, every programming language is almostguaranteed to support C linkage.

36

Page 45: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.2 Integration of SysteMoC and Design Space Exploration

methodology [BTT98]. This methodology tackles the problem of the exponen-tial explosion of the number of ways in which the desired functionality can beimplemented in an embedded system by performing an automatic optimizationof the functionality’s implementation. In the next section, the integration of anexecutable specification modeled in SysteMoC into this DSE methodology willbe discussed.

3.2 Integration of SysteMoC and Design SpaceExploration

A DSE methodology, e.g., like the ESL design flow (cf. Figure 3.2) used by Sys-temCoDesigner [HFK+07*, KSS+09*], typically starts with an architectureindependent description of the functionality a system shall implement. Thisdescription, given by a so-called problem graph, encodes the functionality thatwill be implemented on a still to be determined architecture of the system. Thedata flow domain has proven to be an apt representation for the descriptionof this functionality. In this case, the actors of the data flow graph representthe functionality. The channels connecting the actors specify the dependen-cies and the communication of the actors amongst each other. The possiblevariety of architectures is modeled by a so-called architecture graph. The ar-chitecture graph is composed of resources and edges connecting two resourcesthat are able to communicate with each other. Finally, mapping edges encodefor each actor of the problem graph the set of resources of the architecturegraph onto which the actor can be bound. A mapping edge from an actoror channel of the problem graph to a resource of the architecture graph de-notes that the actor or channel can be realized by an implementation via usageof this resource. The whole graph structure depicted in Figure 3.3 is calleda specification graph [BTT98]. Next, the set of non-dominated individuals isexplored by SystemCoDesigner using Multi-Objective Evolutionary Algo-rithms (MOEAs) [SD94, ZLT02] together with symbolic Boolean Satisfiabil-ity (SAT) based optimization techniques [HST06, LSG+09]. Note that eachindividual of the MOEA is given by an allocation of a certain number of ver-tices in the architecture graph and a feasible binding of vertices of the problemgraph onto architecture graph vertices. In each iteration step in the DSE, thepopulation of individuals is updated by the MOEA by means of crossover andmutation. In order to select parent individuals to perform crossover, the objec-tives for comparing individuals are estimated within the DSE.To exemplify, consider the problem graph depicted in Figure 3.3. This prob-

lem graph corresponds to a data flow graph that implements Newton’s iterativealgorithm for calculating the square root. The actor 𝑎1 is the source produc-

37

Page 46: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

𝑟4

𝑟2 𝑟1𝑟5

𝑟3𝑟6

𝑎1 𝑎2

𝑎4𝑎3

𝑎5

Write

Executable specification

written in SysteMoC

Automatic extractionof the problem graph

Reuse executablespecification

Specify architecture graphSpecify mappings

Executable specification

written in SysteMoC

Back annotation ofdesign decisions

Select implementation

DSE

Evolutionaryalgorithm

Analyticalevaluation

individualsNon-dominated

Figure 3.2: Overview of an ESL design flow using the SystemCoDesignerDSE methodology

ing the numbers for which the square root should be calculated. The actors𝑎2, 𝑎3, and 𝑎4 correspond to Newton’s iterative algorithm for calculating thesquare root of a number emitted by actor 𝑎1. Finally, 𝑎5 models the output ofthe calculated square root. The approximation step of the iterative square rootcalculation is realized by actor 𝑎3, while checking the error bound in order toterminate the approximation is performed by actor 𝑎2. Hence, actor 𝑎3 requiresfloating point division, and actor 𝑎2 requires floating point multiplication. Con-sidering the architecture graph, two Central Processing Units (CPUs) 𝑟1 and 𝑟2,a dedicated hardware accelerator 𝑟3, a memory 𝑟4, and two buses 𝑟5 and 𝑟6 canbe identified. While CPU 𝑟1 possesses a hardware floating point unit, no suchhardware support for floating point calculations is present in CPU 𝑟2. Hence,floating point calculations must be emulated in software for CPU 𝑟2. Thus,

38

Page 47: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.2 Integration of SysteMoC and Design Space Exploration

Problem graph Mapping edge Architecture graph

𝑟6|P2P 𝑟2|CPU2

𝑟3|Approx

𝑟4|MEM 𝑟5|BUS

𝑟1|CPU1

𝑐5𝑐6

𝑎5|Sink

𝑎1|Src

𝑐1 𝑐2 𝑎3|Approx

𝑎4|Dup

𝑐4𝑎2|SqrLoop 𝑐3

Figure 3.3: Show is a specification graph consisting of a problem graph describ-ing Newton’s iterative algorithm for calculating the square root, anarchitecture graph encoding the resources available to implementthis algorithm, and the mapping edges specifying for each actor andeach channel of the problem graph the set of resources usable forimplementing the actor or channel.

while all actors can be mapped to both CPUs 𝑟1 and 𝑟2, the actors 𝑎2 and 𝑎3will perform significantly worse on CPU 𝑟2 than on CPU 𝑟1. To improve thissituation, the hardware accelerator 𝑟3 for actor 𝑎3 is connected to CPU 𝑟2 viathe point-to-point link 𝑟6. The memory 𝑟4, which is connected by the bus 𝑟5 tothe two CPUs 𝑟1 and 𝑟2, is used to provide the program memory required bythe CPUs to implement the actors bound to these CPUs as well as the memoryrequired to implement the channels 𝑐1− 𝑐6. If the actor 𝑎3 is bound to resource𝑟3, then the channels 𝑐2 − 𝑐3 are also bound to 𝑟3, using the internal memoryprovided by this resource.The problem graph of a specification graph can now be implemented in various

ways (cf. Figure 3.4) by the provided architecture graph. Such an implemen-tation is represented by a so-called implementation graph. An implementation

39

Page 48: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

graph is a specification graph that has exactly one mapping edge activated foreach actor and each channel. The activated mapping edge of an actor or chan-nel decides onto which resource of the architecture graph the actor or channel isbound. Resources of the architecture graph not being a target of a mapping edgeare eliminated. The DSE methodology tackles the problem of the exponentialexplosion of the number of ways the desired functionality can be implementedin an embedded system by performing an automatic optimization of the imple-mentation of the functionality. The output of the automatic optimization is aset of non-dominated implementations of the functionality with respect to theobjectives under evaluation by the DSE methodology (cf. Figure 3.2).However, the optimization needs a way to compare the implementations with

each other. Here, two kinds of evaluators are used: (1) analytical evaluators,e.g, the cost of the system, which is derived from adding all the costs of theresources used, and (2) simulative evaluators, e.g., the bandwidth of the im-plementation derived by a simulation of an executable specification for a giveninput stimulus. A simulative evaluator [SFH+06*] is realized by parameterizing[SGHT10, XRW+12] a given executable specification, for example as part ofSystemCoDesigner, in such a way that it conforms to the design decisionstaken by the DSE for the implementation under evaluation. Hence, simulativeevaluators can only be used if an executable specification of the system is avail-able. Furthermore, the result of a simulative evaluator generally depends on theinput stimulus given to the executable specification.To exemplify, consider the four implementation graphs depicted in Figure 3.4.

As can be seen, exactly one mapping edge is activated for each actor and eachchannel. The target of the mapping edge selects the resource onto which theactor or channel is bound. Each actor or channel will be realized by using theresource onto which it was bound to implement the functionality of the actoror to provide the storage for the FIFO channel. Shown are two implementa-tions using a CPU without (cf. Figure 3.4a) and with (cf. Figure 3.4c) hardwarefloating point unit, an implementation (cf. Figure 3.4b) using a CPU withouthardware floating point unit, but with an additional hardware accelerator thatprovides hardware support for the floating point division required by actor 𝑎3,and an implementation (cf. Figure 3.4d) using both CPUs and the accelerator.The memory is always present to provide the storage for the FIFO channels andthe program code for the actors. The buses are present as required in order toprovide communication between the resources used by an implementation. It isevident that the non-functional design properties, e.g., the timing and hence thebandwidth, of an implementation will differ depending on the mapping of the ac-tors and channels to the resources. Furthermore, the cost of an implementationwill differ depending on the resources used by the implementation.For DSE, SystemC may be a good choice when used as a front end language.

However, some shortcomings of the language have to be removed. The under-

40

Page 49: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.2 Integration of SysteMoC and Design Space Exploration

𝑟2|CPU2

𝑟4|MEM 𝑟5|BUS

𝑎5|Sink

𝑐1 𝑐2

𝑐3 𝑐4

𝑐6 𝑐5

𝑎1|Src

𝑎3|Approx

𝑎2|SqrLoop

𝑎4|Dup

(a) Implementation graph of a solution usingCPU 𝑟2 requiring software emulation tosupport floating point operations

𝑟2|CPU2𝑟6|P2P

𝑟3|Approx

𝑟4|MEM 𝑟5|BUS

𝑎5|Sink

𝑐1 𝑐2

𝑐3 𝑐4

𝑐6 𝑐5

𝑎1|Src

𝑎3|Approx

𝑎2|SqrLoop

𝑎4|Dup

(b) Implementation graph of a solution whereCPU 𝑟2 is supported by an additionalhardware accelerator 𝑟3

𝑟1|CPU1

𝑟4|MEM 𝑟5|BUS

𝑎5|Sink

𝑐1 𝑐2

𝑐3 𝑐4

𝑐6 𝑐5

𝑎1|Src

𝑎3|Approx

𝑎2|SqrLoop

𝑎4|Dup

(c) Implementation graph of a solution us-ing CPU 𝑟1 possessing a hardware floatingpoint unit

𝑟1|CPU1

𝑟2|CPU2𝑟6|P2P

𝑟3|Approx

𝑟4|MEM 𝑟5|BUS

𝑎5|Sink

𝑐1 𝑐2

𝑐3 𝑐4

𝑐6 𝑐5

𝑎1|Src

𝑎3|Approx

𝑎2|SqrLoop

𝑎4|Dup

(d) Implementation graph of a solution usingall resources that have initially been spec-ified in the architecture graph

Figure 3.4: Depicted above are four possible implementations of the desired func-tionality of the system.

lying problem for these shortcomings of SystemC is the focus of SystemC onarchitecture centric descriptions. Moreover, the following issues prevented theeasy usage of general SystemC as a front end language for the SystemCoDe-signer DSE methodology:

∙ General SystemC models may implement any kind of communication fab-ric for the communication of SystemC modules amongst each other.

41

Page 50: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

∙ General SystemC models mix the functionality and the architecture usedfor implementing this functionality.

∙ General SystemC models do not separate computation and communicationbehavior of an actor.

The first issue can be solved by restricting SystemC applications to use onlychannel types exhibiting FIFO semantics like sc_fifo.The second issue can be removed by insisting that the SystemC application

is a purely functional model omitting annotations of timing values or any othernon-functional properties like values for power consumption required to executea part of the functionality. However, when insisting on this constraint, a Sys-temC application used as design entry becomes unusable for simulation-basedevaluation during exploration. In contrast to the architecture independent andpurely functional description that is used as a design entry, simulative evalu-ators require the presence of architecture dependent non-functional propertieslike power consumption values and execution times for all parts of the func-tionality in order to evaluate an implementation. The concrete values for thesearchitecture dependent non-functional properties will, of course, be dependenton the design decisions taken by the DSE for the implementation under evalua-tion. Hence, the concrete values of these properties will vary amongst differentsimulations.The third issue prevents the extraction of a certain MoC from each actor

of a general SystemC application. This is important as the knowledge that aset of actors conforms to one of the static data flow models introduced in Sec-tion 2.2 enables the analysis of this set of actors. Examples of such analysesare deadlock detection or scheduling analysis information enabling the genera-tion of more efficient schedules for these actors than would be possible withoutthis information. With the algorithms presented in Sections 3.6 and 3.7, it is,however, possible to formulate sufficient conditions for a constrained subset ofSystemC to conform to certain MoCs.15 Fortunately, these constraints turn outto be very similar to the synthesizable subset of SystemC.In summary, general SystemC cannot be used as a front end language for

DSE at the ESL, and the constrained subset as given above is unusable forsimulation-based performance evaluation. Two approaches can be taken to re-solve this problem. In the first approach, the SystemC model used as a designentry will be automatically annotated with the architecture dependent non-functional properties corresponding to the design decisions taken by the DSE,and then this annotated SystemC model must be compiled in order to be used asan evaluator by the DSE. In the second approach, the SystemC model used as a15Only a sufficient condition can be given as the determination of a MoC from a SystemC

specification is undecidable in general.

42

Page 51: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.2 Integration of SysteMoC and Design Space Exploration

design entry must support the ability to be run-time configurable based on thedesign decisions taken by the DSE for the implementation under evaluation withthe varying concrete values for the architecture dependent non-functional prop-erties. The first approach can produce simulators with higher performance thanthe second approach of run-time configurability. However, the first approachalways incurs the overhead of compilation before the generated simulator canbe executed. Thus, if the overall execution time required by a single simulationrun is small when compared to the required time for compilation, as is typicallythe case, then the second approach will still be the faster one.Consequently, SysteMoC—an extension of SystemC—uses the second ap-

proach of run-time configurability in order to tackle the problem. SysteMoCenables SystemC-based modeling to be used by SystemCoDesigner as (1) afront end language, as (2) a simulative evaluator [SFH+06*][SGHT09, XRW+12]for the design decisions taken by the DSE for the implementation under eval-uation, and as (3) a golden model from which virtual prototypes for selectedimplementations can be derived via usage of the synthesis back-end presentedin Chapter 5. In particular, SysteMoC provides the following abilities on top ofSystemC:

∙ SysteMoC provides the ability to automatically extract the problem graphfrom an executable specification modeled in this language.

∙ SysteMoC provides the ability to automatically extract a Finite StateMachine (FSM) modeling the communication behavior of an actor foreach actor of the problem graph.

∙ SysteMoC provides the ability to back annotate design decisions taken bythe DSE via run-time configuration into the executable specification, e.g.,timing and power behavior depending on the architecture chosen by theDSE.

The model extracted from the SysteMoC application via usage of the firsttwo abilities will be presented in the next section. The missing informationto complete the specification of the DSE problem, i.e., possible architecturesand mappings, has to be added (cf. Figure 3.2) manually by the designer. Theresulting specification graph can then be used by the SystemCoDesignerDSE methodology to perform an automatic optimization of the implementationof the desired functionality encoded in the executable specification written inSysteMoC.

43

Page 52: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

3.3 The Model of Computation of SysteMoC

SysteMoC extends SystemC modules to communicate over particular FIFOchannels and enforces the usage of an FSM inside each module to express theircommunication behavior. Such an extended SystemC module will be called aSysteMoC actor. In actor-oriented design, actors are objects which executeconcurrently and can only communicate with each other via channels instead ofmethod calls as known in object-oriented design. An actor 𝑎 may only commu-nicate with other actors through its dedicated actor input ports 𝑎.𝐼 and actoroutput ports 𝑎.𝑂.16 These ports are connected to channels. While SysteMoCsupports various channel types, e.g., signals used to describe the synchronousreactive MoC, in this thesis the considered channels are constrained to haveFIFO semantics. The basic entity of data transfer for this kind of channels isregulated by the notion of tokens that are transmitted via these channels.

3.3.1 SysteMoC Actors

A SysteMoC actor consists of three parts:

∙ The actor input ports and output ports for communication.

∙ The actor functionality for transforming tokens.

∙ The actor FSM that encodes the communication behavior of the actor.

These three parts are visualized in Figure 3.5 for the SqrLoop actor 𝑎2 fromFigure 3.3.As can be seen, a SysteMoC actor is very similar to a hierarchical Functions

Driven by State Machines (FunState) model [TSZ+99], e.g., as depicted in Fig-ure 2.11 on page 26. The actor FSM corresponds to the FSM of the compositeactor 𝑎𝛾 . However, internal FIFOs and FSMs for the individual functions arediscarded in SysteMoC actors. Moreover, the guards on the transitions of theactor FSM must also contain checks for the availability of space on the actoroutput ports as SysteMoC FIFOs are of limited capacity. Finally, in contrast toFunState, a transition of an actor FSM will not only trigger a function, called anaction in SysteMoC, but completes only after the action has finished execution.More formally, the following definition of a SysteMoC actor can be derived:

Definition 3.1 (Actor [FHT06*]). An actor is a tuple 𝑎 = (𝒫 ,ℱ ,ℛ) containinga set of actor ports 𝒫 = 𝐼 ∪ 𝑂 partitioned into actor input ports 𝐼 and actoroutput ports 𝑂, the actor functionality ℱ = ℱ𝑎𝑐𝑡𝑖𝑜𝑛 ∪ ℱ𝑔𝑢𝑎𝑟𝑑 partitioned into aset of actions and a set of guards, as well as the actor FSM ℛ.16The ‘.’-operator, e.g., 𝑎.𝐼, is used to denote member access of tuples whose members have

been explicitly named in their definition, e.g., member 𝐼 of the actor 𝑎 from Definition 3.1.

44

Page 53: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.3 The Model of Computation of SysteMoC

#𝑖2 ≥ 1 ∧ ¬𝑓 check ∧#𝑜1 ≥ 1/𝑓 copyInput

#𝑖2 ≥ 1 ∧ 𝑓 check ∧#𝑜2 ≥ 1/𝑓 copyApprox

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓 copyStore

𝑓 copyApprox

𝑓 copyStore

𝑓 check

𝑓 copyInput

𝑡3

𝑡1

𝑡2𝑖2

𝑖1

Inpu

tpo

rts𝑎2.𝐼

={𝑖

1,𝑖

2}

Outpu

tpo

rts𝑎2.𝑂

={𝑜

1,𝑜

2}

𝑜1

𝑜2

Functionality 𝑎2.ℱ

Actor FSM 𝑎2.ℛ

𝑎2|SqrLoop

𝑞start 𝑞loop

Figure 3.5: Shown is the visual representation of the SqrLoop actor 𝑎2 fromFigure 3.3. The SqrLoop actor is composed of a set of input ports𝐼 and a set of output ports 𝑂, its functionality ℱ , and its FSM ℛthat is determining the communication behavior of the actor.

As can be seen, SysteMoC in contrast to Kahn Process Network (KPN) andFunState distinguishes its functions further into actions 𝑓𝑎𝑐𝑡𝑖𝑜𝑛 ∈ ℱ𝑎𝑐𝑡𝑖𝑜𝑛 andguards 𝑓𝑔𝑢𝑎𝑟𝑑 ∈ ℱ𝑔𝑢𝑎𝑟𝑑. To exemplify, the SqrLoop actor depicted in Figure 3.5is considered. This actor has three actions { 𝑓copyStore, 𝑓copyInput, 𝑓copyApprox } =ℱ𝑎𝑐𝑡𝑖𝑜𝑛 and one guard function { 𝑓check } = ℱ𝑔𝑢𝑎𝑟𝑑. Naturally, the guard 𝑓check isonly used in a transition guard of the FSM 𝑎2.ℛ while the actions of the FSM aredrawn from the set of actions ℱ𝑎𝑐𝑡𝑖𝑜𝑛. Furthermore, there is a notion of a func-tionality state 𝑞func ∈ 𝑄func of an actor. The functionality state is an abstractrepresentation of the C++ variables in the real implementation of a SysteMoCactor. This functionality state is required for notational completeness, that isfor the formal definitions of action and guard functions.17

∙ 𝑓𝑎𝑐𝑡𝑖𝑜𝑛 : 𝑄func × 𝑆|𝐼| → 𝑄func × 𝑆|𝑂|

∙ 𝑓𝑔𝑢𝑎𝑟𝑑 : 𝑄func × 𝑆|𝐼| → { t, f }18

Both types of functions depend on the functionality state of their actor. Further-more, an action may update this state, while a guard function may not. Hence,all SysteMoC actor firings are sequentialized over this functionality state, andtherefore multiple actor firings of the same actor cannot be executed in parallel.That is, in data flow parlance, all SysteMoC actors have a virtual self-loop with17In the definition below, the customary notation |𝑋| is used to denote the cardinality of a

set 𝑋.18In the following, t and f are used to denote Boolean truth values.

45

Page 54: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

one initial token that corresponds to 𝑞func ∈ 𝑄func. However, as the functionalitystate is not used for any optimization and analysis algorithms presented in thisdissertation, it has been excluded from the definition of an actor. Hence, foranalysis purposes the evaluation of a guard function is assumed to always beable to return both t (true) and f (false).Furthermore, the cons and prod notation of FunState is reused to specify the

length of token sequences that may be accessed by the guard functions 𝑓𝑔𝑢𝑎𝑟𝑑 ∈ℱ𝑔𝑢𝑎𝑟𝑑 as well as accessed and written by the action functions 𝑓𝑎𝑐𝑡𝑖𝑜𝑛 ∈ ℱ𝑎𝑐𝑡𝑖𝑜𝑛.

∙ The cons : (𝐼 × ℱ) → N function specifies for each input port/channelthe number of tokens required on this channel to compute the action orguard 𝑓 ∈ ℱ .19 Note that, in contrast to actions, the computation of aguard function does not consume any tokens required for its evaluation.

∙ The prod : (ℱ𝑎𝑐𝑡𝑖𝑜𝑛×𝑂)→ N function specifies for each output port/chan-nel the number of tokens produced by the action 𝑓𝑎𝑐𝑡𝑖𝑜𝑛 ∈ ℱ𝑎𝑐𝑡𝑖𝑜𝑛 for thischannel.

An actor FSM resembles the FSM definition from FunState, but with slightlydifferent definitions for actions and guards. More formally, the definition is givenbelow:

Definition 3.2 (Actor FSM [FHT06*]). The FSM ℛ of an actor 𝑎 is a tuple(𝑄, 𝑞0, 𝑇 ) containing a finite set of states 𝑄, an initial state 𝑞0 ∈ 𝑄, and a finiteset of transitions 𝑇 . A transition 𝑡 ∈ 𝑇 itself is a tuple (𝑞src, 𝑘, 𝑓𝑎𝑐𝑡𝑖𝑜𝑛, 𝑞dst)containing the source state 𝑞src ∈ 𝑄, from where the transition is originating,and the destination state 𝑞dst ∈ 𝑄, which will become the next current state afterthe execution of the transition starting from the current state 𝑞src. Furthermore,if the transition 𝑡 is taken, then an action 𝑓𝑎𝑐𝑡𝑖𝑜𝑛 from the set of functions ofthe actor functionality 𝑎.ℱ𝑎𝑐𝑡𝑖𝑜𝑛 will be executed. Finally, the execution of thetransition 𝑡 itself is guarded by the guard 𝑘.

A transition 𝑡 will be called an outgoing transition of a state 𝑞 if and only if thestate 𝑞 is the source state 𝑡.𝑞src of the transition. Correspondingly, a transitionwill be called an incoming transition of a state 𝑞 if and only if the state 𝑞 is thedestination state 𝑡.𝑞dst of the transition. Furthermore, a transition is enabledif and only if its guard 𝑘 evaluates to t and it is an outgoing transition of thecurrent state of the actor.An actor is enabled if and only if it has at least one enabled transition. The

firing of an actor corresponds to a non-deterministic selection and execution of19Remember that in this thesis, the notion of ports are used interchangeably with the channels

connected to these ports. Thus, both cons as well as prod can also be used with channelsinstead of ports.

46

Page 55: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.3 The Model of Computation of SysteMoC

one transition out of the set of enabled transitions of the actor. In general,if multiple actors have enabled transitions, then the transition, and hence itscorresponding actor, is chosen non-deterministically by the SysteMoC run-timesystem out of the set of all enabled transitions in all actors. In summary, theexecution of a SysteMoC model can be divided into three phases:

∙ Determine the set of enabled transitions by checking each transition ineach actor of the model. If this set is empty, then the simulation of themodel will terminate.

∙ Select a transition 𝑡 from the set of enabled transitions, and fire the corre-sponding actor by executing the selected transition 𝑡, thus computing theassociated action 𝑓𝑎𝑐𝑡𝑖𝑜𝑛.

∙ Consume and produce tokens as encoded in the selected transition 𝑡. Thismight enable new transitions. Go back to the first step.

In contrast to FunState, a guard 𝑘 is more structured. Moreover, it is parti-tioned into the following three components:

∙ The input guard which encodes a conjunction of input predicates on thenumber of available tokens on the input ports, e.g., #𝑖1 ≥ 1 denotes aninput predicate that tests if at least one token is available at the actorinput port 𝑖1.

∙ The output guard which encodes a conjunction of output predicates onthe number of free places on the output ports, e.g., #𝑜1 ≥ 1 denotes anoutput predicate that tests if at least one free place is available at theactor output port 𝑜1.

∙ The functionality guard which encodes a logical composition of guard func-tions of the actor, e.g., ¬𝑓check. Hence, the functionality guard dependson the functionality state and the token values on the input ports.

Therefore, the guard 𝑘 is encoded by a Boolean function depending on thenumber of available tokens (#𝑖1,#𝑖2 . . . ,#𝑖|𝐼|) ∈ N|𝐼|

0 and the number of freeplaces (#𝑜1,#𝑜2 . . . ,#𝑜|𝑂|) ∈ N|𝑂|

0 on the actor input and output ports, re-spectively. Furthermore, due to the guard function 𝑓𝑔𝑢𝑎𝑟𝑑 ∈ ℱ𝑔𝑢𝑎𝑟𝑑 that can beused inside a guard 𝑘, a guard typically requires the actor functionality state𝑞func ∈ 𝑄func and the token sequences (𝑠1, 𝑠2 . . . , 𝑠|𝐼|) ∈ 𝑆|𝐼| available on theinput ports 𝐼 in order to determine whether it is enabled or not.

𝑘 : N|𝐼|0 × N

|𝑂|0 ×𝑄func × 𝑆|𝐼| → { t, f }

47

Page 56: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

Input and output predicates are only allowed to be composed via logical con-junction. Thus, there exists a unique least vector of available tokens and freeplaces 𝑛 = (𝑛𝑖1 , 𝑛𝑖2 . . . , 𝑛𝑖|𝐼| , 𝑛𝑜1 , 𝑛𝑜2 . . . , 𝑛𝑜|𝑂|) that are required to enable theguard 𝑘. More formally, each transition 𝑡 ∈ 𝑇 satisfies the below given conditiondefining its unique least vector 𝑛.

∃𝑛 ∈ N|𝒫|0 : ∀𝑠 ∈ 𝑆|𝐼|, 𝑞func ∈ 𝑄func : 𝑘(𝑛, 𝑞func, 𝑠) =⇒ ∀𝑛′ ≥ 𝑛 : ¬𝑘(𝑛′, 𝑞func, 𝑠)

Furthermore, the cons and prod notation will be used to refer to the entriesof the above specified unique least vector 𝑛.

∙ The cons : (𝑇 × 𝐼) → N function specifies for each transition and inputport/channel combination the number of tokens 𝑛𝑖 = cons(𝑡, 𝑖) consumedif the transition is taken.

∙ The prod : (𝑇 ×𝑂)→ N function specifies for each transition and outputport/channel combination the number of tokens 𝑛𝑜 = prod(𝑡, 𝑜) producedif the transition is taken.

In other words, this unique least vector is required to enable the transition 𝑡.Note that it is also the number of tokens that are consumed/produced on theinput/output ports if the transition is taken. Hence, for a SysteMoC model tobe well formed, the number of tokens accessed on the different input and outputports by an action 𝑡.𝑓𝑎𝑐𝑡𝑖𝑜𝑛 associated with the transition 𝑡 as well as the guardfunctions 𝑓𝑔𝑢𝑎𝑟𝑑 used in the transition guard 𝑡.𝑘 has to conform to the consump-tion and production rates of the transition. That is, ∀𝑖 ∈ 𝐼 : cons(𝑖, 𝑡.𝑓𝑎𝑐𝑡𝑖𝑜𝑛) =cons(𝑡, 𝑖) ∧ ∀𝑓𝑔𝑢𝑎𝑟𝑑 contained in 𝑡.𝑘 : cons(𝑖, 𝑓𝑔𝑢𝑎𝑟𝑑) ≤ cons(𝑡, 𝑖) and ∀𝑜 ∈ 𝑂 :prod(𝑡.𝑓𝑎𝑐𝑡𝑖𝑜𝑛, 𝑜) = prod(𝑡, 𝑜).

3.3.2 SysteMoC Network Graphs

In contrast to the data flow models introduced in Chapter 2, FIFO channels oflimited capacity are used within SysteMoC models. Note that the determinationof limited channel capacities that do not introduce deadlocks into a model is ingeneral undecidable [Par95] for dynamic data flow graphs. The introduction oflimited channel capacities into a SysteMoC model is necessary, however, whenrefining a model for subsequent implementation in hardware or software. Here,synthesis back-ends only support limited channels.Actor-oriented designs are often represented by bipartite graphs consisting

of channels 𝑐 ∈ 𝐶 and actors 𝑎 ∈ 𝐴, which are connected via point-to-pointconnections from an actor output port 𝑜 to a channel and from a channel to anactor input port 𝑖. In the following, such a representation is called a networkgraph, an example of which can be seen in Figure 3.6. This graph corresponds

48

Page 57: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.3 The Model of Computation of SysteMoC

ApproximationLoop Check

ApproximationLoop Body

𝑎2|SqrLoop𝑜 2 𝑖 2

𝑜 1𝑖 1

𝑐1 𝑐2

𝑐5𝑐6𝑎5|Sink

𝑜1𝑎1|Src

𝑖1

𝑜 1

𝑜 1 𝑖 1

𝑖 2

𝑐4

Newton square root approximation

Actor output port 𝑜2 ∈ 𝑎4.𝑂Actor input port 𝑖1 ∈ 𝑎5.𝐼

Cha

nnel

𝑐 6∈𝑔 s

qr,flat.𝐶

𝑐3

𝑃(𝑐

3)=

(∞,(𝜈))

Network graph 𝑔sqr,flat

𝑔sqr,flat|SqrRoot

Actor

𝑎3∈𝑔 s

qr,flat.𝐴

oftype

“App

rox”

𝑖1

𝑜2

𝑎3|Approx

𝑎4|Dup

Figure 3.6: The network graph 𝑔sqr,flat displayed above implements Newton’s it-erative algorithm for calculating the square roots of an infinite inputsequence generated by the Src actor 𝑎1. The square root values aregenerated by Newton’s iterative algorithm SqrLoop actor 𝑎2 forthe error bound checking and 𝑎3 - 𝑎4 to perform an approximationstep. After satisfying the error bound, the result is transported tothe Sink actor 𝑎5.

exactly to the problem graph contained in the specification graph depicted inFigure 3.3.In fact, in the SystemCoDesigner DSE methodology, the names problem

graph and network graph denote the same graph structure. However, for histor-ical reasons, the name problem graph [BTT98] is commonly used if this graphis considered from the DSE perspective. Furthermore, from a broader DSEperspective, a problem graph may not necessarily exhibit data flow semanticsand only encode some kind of data communication. In contrast to this, thename network graph is used if data flow aspects are in focus. More formally, thefollowing definition of a non-hierarchical network graph can be derived:

Definition 3.3 (Non-Hierarchical Network Graph [FHT06*, FHZT13*]). Anon-hierarchical network graph is a directed bipartite graph 𝑔n = (𝑉,𝐸) con-taining a set of vertices 𝑉 = 𝐶 ∪𝐴 partitioned into channels 𝐶 and actors 𝐴, aset of directed edges 𝑒 = (𝑣src, 𝑣snk) ∈ 𝐸 ⊆ (𝐶×𝐴.𝐼)∪ (𝐴.𝑂×𝐶) from channels

49

Page 58: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

𝑐 ∈ 𝐶 to actor input ports 𝑖 ∈ 𝐴.𝐼 as well as from actor output ports 𝑜 ∈ 𝐴.𝑂to channels.20

Finally, the delay function delay (reused from Definition 2.1) and the channelcapacity function size are used to denote a finite sequence of initial tokens aswell as the channel capacity denoting the maximal number of tokens a channelcan store, respectively.

delay : 𝐶 → 𝒱*

size : 𝐶 → N0

3.3.3 SysteMoC FIFO Channels

It can be seen (cf. Figure 3.7) that the actor FSM controls which action toexecute, decides how many tokens to consume and produce on each port byadvancing the read or write pointer of the corresponding FIFO, and controls thesize of a random access region (the two big black boxes in Figure 3.7) of eachFIFO. Hence, as only tokens inside the random access region can be modified bythe actions or read by the guards, the actor FSM controls the communicationbehavior of the actor. A random access region has been introduced here, tosatisfy the need of general SysteMoC applications for random access and formultiple access to the same token value. For example, in all action and guardfunctions, the syntax portvariable[<n>] is used for accessing the nth tokenin the random access region. This region is relative to the read or write pointerof the connected FIFO. An implementation based on conventional FIFOs andinternal buffering of the tokens inside the actor itself, would have been alsofeasible, but would mix implementation decisions, i.e., using internal buffering,with model requirements.

3.4 The SysteMoC Language

In this section, the SysteMoC language, a class library based on SystemC, andits syntax will be presented. Newton’s iterative square root algorithm (cf. Fig-ure 3.6) will be used as a running example to give a short introduction into thesyntax of the SysteMoC language.

20 The ‘.’-operator has a trivial extension to sets of tuples, e.g., 𝐴.𝐼 =⋃

𝑎∈𝐴 𝑎.𝐼, which isused throughout this document.

50

Page 59: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.4 The SysteMoC Language

Advances Advances

size ofExecutesControls Controls

size of Executes

Write pointerRead pointer

𝑐1𝑖1 𝑜1

𝑎1.ℛ 𝑎2.ℛ

Containing6 tokens

SysteMoC FIFO of size 11

+𝑖1[0];

𝑓 sink ∈ 𝑎1.ℱ𝑎𝑐𝑡𝑖𝑜𝑛 𝑓 src ∈ 𝑎2.ℱ𝑎𝑐𝑡𝑖𝑜𝑛

int i = 𝑖1[1]*𝑖1[2]

𝑜1[1] = i++;𝑜1[0] = j++;

𝑎2|ASrc𝑎1|ASink

Figure 3.7: A simple source-sink example to explain the semantics of a SysteMoCFIFO associated with a channel

3.4.1 Specification of Actors

Each actor in SysteMoC is represented by a C++ class, e.g., the class SqrLoopas defined in the following Example 3.1 is representing the actor 𝑎2 shown inFigure 3.6.

Example 3.1 SysteMoC implementation of the SqrLoop actor 𝑎2

1 class SqrLoop: public smoc_actor {2 public:3 smoc_port_in<double> i1, i2;4 smoc_port_out<double> o1, o2;5 private:6 double v;78 void copyStore() {o1[0] = v = i1[0];}9 void copyInput() {o1[0] = v; }

10 void copyApprox() {o2[0] = i2[0]; }11 bool check() const {return fabs(v-i2[0]*i2[0])<0.01;}1213 smoc_firing_state start, loop;14 public:15 SqrLoop(sc_module_name name);16 };

Each SysteMoC actor is derived from the smoc_actor base class (cf. Ex-ample 3.1 Line 1) provided by the SysteMoC library. The input and output

51

Page 60: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

ports of an actor are specified by member variables of type smoc_port_inand smoc_port_out as exemplified in Example 3.1 Lines 3 and 4, respec-tively.21 Furthermore, actors can have member variables, e.g., the variable vdeclared in Line 6, that realize the functionality state 𝑞func ∈ 𝑄func defined inDefinition 3.1. The actions and guards of an actor are represented by methodsof the class, e.g., the methods copyStore, copyInput and copyApprox(cf. Lines 8 to 10) correspond to the actions 𝑓copyStore, 𝑓copyInput, and 𝑓copyAppory

of the SqrLoop actor, while the method check represents the guard function𝑓check. The copyStore method is responsible for forwarding the input tokenvalue on input port 𝑖1 to the output port 𝑜1 and storing the value into thefunctionality state represented by the member variable v for later replicationon the output port 𝑜1 by the method copyInput. This replication is requiredif the achieved accuracy by one square root iteration step by actor 𝑎3 is belowthe bound determined by the guard check. On the other hand, if the approxi-mation is within the error bound, then the action copyApprox is executed toforward this approximation to the output port 𝑜2.Note that actions themselves do not control token production or consumption,

but only read the values of input tokens for data processing and generation ofvalues of output tokens. The communication behavior, i.e., token productionand consumption, is controlled solely by the actor FSM (cf. Definition 3.2) aswill be detailed next.

3.4.2 Specification of the Communication Behavior

The communication behavior of an actor is an abstraction from the functionalbehavior of the actor. The abstraction of the communication behavior con-centrates on the tokens which are consumed and produced by an actor. InSysteMoC, the abstraction is done by representation of the communication be-havior by an FSM (cf. Definition 3.2). The actions and guards of the actorfunctionality can only read and write values of tokens contained in the randomaccess region defined in Section 3.3.3. This random access region is controlledby the actor FSM in order to enforce the specified communication behavior onthe actor functionality. Note that if no interaction between the actor FSM andthe functionality state 𝑞func of an actor would be permissible, then SysteMoCactors could not implement general Kahn actors. Hence, SysteMoC uses thenotion of guard functions to implement this interaction. To exemplify, the actorFSM of the SqrLoop actor is depicted in Figure 3.8. This FSM is constructedby the code shown in Example 3.2 Lines 10 to 23.

21Standard SystemC FIFO ports could not be used as the semantics of SysteMoC FIFOsextends standard FIFO semantics by the random access region defined in Section 3.3.3.

52

Page 61: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.4 The SysteMoC Language

#𝑖2 ≥ 1 ∧ 𝑓 check ∧#𝑜2 ≥ 1/𝑓 copyApprox

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓 copyStore

#𝑖2≥ 1 ∧

¬𝑓 check∧#

𝑜1≥ 1/𝑓

copyInput

𝑡3

𝑡2

𝑡1

Initial state 𝑎2.ℛ.𝑞0 Actor FSM 𝑎2.ℛ

𝑞start 𝑞loop

Figure 3.8: The actor FSM of the SqrLoop actor 𝑎2

Example 3.2 SysteMoC implementation of the SqrLoop actor 𝑎2

1 class SqrLoop: public smoc_actor {2 ...3 smoc_firing_state start, loop;4 public:5 SqrLoop(sc_module_name name)6 : smoc_actor(name, start),7 i1("i1"), i2("i2"), o1("o1"), o2("o2"),8 start("start"), loop("loop")9 {

10 start =11 i1(1) >>12 o1(1) >>13 CALL(SqrLoop::copyStore) >> loop14 ;15 loop =16 (i2(1) && !GUARD(SqrLoop::check)) >>17 o1(1) >>18 CALL(SqrLoop::copyInput) >> loop19 |20 (i2(1) && GUARD(SqrLoop::check)) >>21 o2(1) >>22 CALL(SqrLoop::copyApprox) >> start23 ;24 }25 };

The states 𝑄 of the actor FSM themselves are represented by member vari-ables of type smoc_firing_state, e.g., the state start and loop in Line 3,that is 𝑄 = { 𝑞start, 𝑞loop }. The initial state of the actor FSM is determined by

53

Page 62: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

providing the base class smoc_actor with the corresponding state. For theSqrLoop actor, the initial state 𝑞0 is set by Line 6 to 𝑞start (start).The syntax for naming SystemC and also SysteMoC entities is demonstrated

in Lines 7 and 8, where the actor input and output ports as well as the states ofthe actor FSM are named, respectively.22 However, in the interest of conciseness,it will be assumed that all SystemC/SysteMoC entities are named like theirdeclarations in the source code shown in the following examples, but this namingwill not be shown explicitly.The transition 𝑡1 from 𝑞start to 𝑞loop is given in Lines 11 to 13. Transition

𝑡2 and 𝑡3 are defined accordingly in Lines 16 to 18 and Lines 20 to 22. Ascan be seen from Example 3.2, a transition has the following syntax: (inputguard && functionality guard) » output guard » CALL( action function )» destination state. Hence, a transition in SysteMoC syntax is divided intofour components joined via the ’»’-operator. To be more precise, only the lastpart, the destination state, is mandatory, while the first three components areoptional. For example, if no outputs are produced, then the ’output guard »’part can be dropped from the transition definition. Likewise, if the transitiondoes not depend on the functionality state, then the ’&& functionality guard ’part can be dropped from the transition definition. In the following, the partsof a transition definition are considered in more detail:

∙ The input guard is encoded by a logical conjunction of input predicateson the number of available tokens on the input ports. The notation ofan input predicate #𝑖 ≥ 𝑛 denoting the test for at least 𝑛 available to-kens on input port 𝑖 is reflected in SysteMoC syntax by the expressioninput_port_variable(n), e.g., the first part of transition 𝑡1 fromLine 11 consists only of an input guard with the single input predicatei1(1), which denotes that at least one token on input port i1 is needed.Input predicates can only be composed by a conjunction that is repre-sented by the C++ logical and operator ’&&’ in SysteMoC syntax, e.g.,(i1(1) && i2(1)) is permissible while (i1(1) || i2(1)) or anyother composition is forbidden.

∙ The functionality guard is encoded by a logical composition of guard func-tions of the actor. The guard functions are given via the GUARD syntax.

22 The need to name entities stems from the fact that SystemC, and therefore also SysteMoC,is a C++ class library based approach. The inherent deficiencies [GD09] of C++ withregard to program code introspection prevent the SystemC run-time system from acquiringthe names under which the SystemC and SysteMoC entities are defined in the sourcecode. Furthermore, without proper names for the states, ports, channels, actors, etc.,the SystemC/SysteMoC run-time system cannot generate proper diagnostics messages.Hence, almost every SystemC/SysteMoC entity can be named by providing a string as itsfirst parameter.

54

Page 63: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.4 The SysteMoC Language

The mathematical definition of a guard function 𝑓𝑔𝑢𝑎𝑟𝑑 : 𝑄func × 𝑆|𝐼| →{ t, f } does not return a functionality state. To enforce this property, it isrequired that guard functions are implemented by const member func-tions of the actor. Example 3.1 Line 11 demonstrate this declaration ofguard functions. The C++ compiler enforces that const member func-tions, and hence guard functions, can not modify the actor variables orcall other non-const member functions that might modify those variables.Guard functions can, by definition, be composed via logical composition.Therefore, SysteMoC allows the composition of guard functions via theC++ ’!’, ’||’, ’&&’, ’==’, and ’!=’ operators.

If both input guard and functionality guard are present, then they arejoined by a conjunction, e.g., Lines 16 and 20 of Example 3.2 which com-bine the input predicate i2(1) with the negated guard function ¬𝑓checkand the guard function 𝑓check, respectively.

∙ The output guard is encoded by a logical conjunction of output predicateson the number of free places required on the output ports. The SysteMoCsyntax for output guards is analogous to the syntax for input guards. TheLines 12 and 17 of Example 3.2 are demonstrating the usage of an out-put guard that consists of the single output predicate o1(1) denoting arequest for at least one free place on the channel connected to the outputport o1, that is #𝑜1 ≥ 1.

∙ The action function is given via the CALL syntax. To exemplify, the ac-tions of the transitions 𝑡1, 𝑡2, and 𝑡3 are specified, respectively, in Lines 13,18, and 22 to call the three member functions copyStore, copyInput,and copyApprox. These three member functions correspond to the ac-tion functions 𝑓copyStore, 𝑓copyInput, and 𝑓copyApprox of the mathematicalmodel.

∙ Finally, the destination state of the transition is specified last. In contrastto the previous four parts, which are optional, the destination state is amandatory part of each transition specification.

In SysteMoC, each FSM state is defined explicitly by an assignment of a listof outgoing transitions. To exemplify, the state 𝑞start is defined in Example 3.2Line 10 by assigning transition 𝑡1 (Lines 11 to 13) to it. If a state has multipleoutgoing transitions, e.g., like state 𝑞loop with the transitions 𝑡2 (Lines 16 to 18)and 𝑡3 (Lines 20 to 22), then the outgoing transitions are joined via the ’|’-operator (Line 19) and assigned to the state variable (Line 15).

55

Page 64: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

3.4.3 Specification of the Network Graph

Similar to actors, network graphs are represented as C++ classes. All actorsand channels of a network graph must be instantiated in the constructor ofthe class representing the network graph. To exemplify, the SqrRoot classlisted in Example 3.3 corresponding to the network graph shown in Figure 3.6is considered. As can be seen in Line 1, the class representing the networkgraph is derived from the smoc_graph base class provided by the SysteMoClibrary. The actors 𝑎1, 𝑎2, . . . 𝑎5 of the network graph are declared in the Lines 3to 7 and instantiated in the constructor in Line 11. Finally, the FIFO channels𝑐1, 𝑐2 . . . 𝑐6 are specified in the Lines 12 to 20. The SqrRoot example uses theconnectNodePorts method of SysteMoC to connect input to output ports.In the simplest case (Lines 12 to 15), the connectNodePorts methods takestwo arguments the output port 𝑜, from where the channel starts, and the inputport 𝑖, which is the destination of the channel.

Example 3.3 Specification of the network graph for the SqrRoot algorithm

1 class SqrRoot: public smoc_graph {2 protected:3 Src a1;4 SqrLoop a2;5 Approx a3;6 Dup a4;7 Sink a5;8 public:9 SqrRoot(sc_module_name name)

10 : smoc_graph(name),11 a1("a1"), a2("a2"), a3("a3"), a4("a4"), a5("a5"){12 connectNodePorts(a1.o1, a2.i1); // 𝑐113 connectNodePorts(a2.o1, a3.i1); // 𝑐214 connectNodePorts(a2.o2, a5.i1); // 𝑐615 connectNodePorts(a4.o2, a2.i2); // 𝑐516 connectNodePorts(a3.o1, a4.i1, // 𝑐417 smoc_fifo<double>(2)); // size(𝑐4) = 218 connectNodePorts(a4.o1, a3.i2, // 𝑐319 smoc_fifo<double>(3) // size(𝑐3) = 320 << 2); // delay(𝑐3) = ⟨2⟩21 }22 };

56

Page 65: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.5 Hierarchical SysteMoC Models

However, it is also possible to explicitly parameterize the created channel𝑐 with its channel capacity of size(𝑐) tokens and a sequence of initial tokensdelay(𝑐). To exemplify, consider Lines 17 and 19, where a third parameter,the channel initializer smoc_fifo<T>(n) « initial tokens. . ., is givento the method connectNodePorts. If no channel initializer is given, thena FIFO channel with a channel capacity of one token and without any initialtokens is created between the output port 𝑜 and the input port 𝑖. If a channelinitializer is given, then it must be parameterized with the data type T carriedby the channel, the channel capacity n in units of tokens, and a possibly emptysequence of initial tokens provided via usage of the ’«’-operator. To exemplify,consider Line 17 of Example 3.3, where a FIFO channel with a channel capacityof two tokens is created between the ports 𝑎3.𝑜1 and 𝑎4.𝑖1. In this case, thechannel initializer is parameterized with the C++ double data type, denotingthat the token values carried by the channel will be of the C++ double datatype. An example for providing an initial token is given in Line 20, where the’«’-operator is used to provide the double value 2 as an initial token for thecreated channel between 𝑎4.𝑜1 and 𝑎3.𝑖2.

3.5 Hierarchical SysteMoC Models

For the specification and modeling of complex systems, a notion of hierarchy isvery helpful. Furthermore, hierarchy is used in Chapter 4 to describe algorithmsfor model refinements of SysteMoC models. In particular, hierarchy will beused to group an island of static actors into a cluster for later refinement by theclustering methodology in order to optimize the generated schedule for the staticactors contained in the island in such a way that the performance of the wholegraph in terms of latency and throughput is optimized. Hence, hierarchy willbe introduced into SysteMoC models by allowing a hierarchical specification ofboth network graphs and FSMs.

3.5.1 Hierarchical Finite State Machines

While FSM is a formal model for reasoning about control flow, the modelingof complex control flow via non-hierarchical FSM specifications has turned outto be inadequate [Har87]. Hence, SysteMoC supports hierarchical FSM speci-fications [ZFH+10*] similar to the statecharts [Har87] notion as introduced byHarel. In the following, the semantics of these hierarchical FSM specificationswill be explained and the choice for interleaving semantics instead of concur-rency for these hierarchical FSM specifications will be motivated. The notationof hierarchical FSM specifications allows to concisely specify complex controlflow. The two main features responsible for this advantage are as follows:

57

Page 66: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

∙ Introduction of hierarchy by using so-called xor states (cf. Figure 3.9a).This allows the grouping of states implementing a certain functionalityinto one xor state, e.g., usually used to specify subsystems in a more com-plex specification. Furthermore, this enables one specified transition torepresent an outgoing transition for each state contained inside an xorstate, e.g., some form of reset transition to abort the execution of a sub-system or switch modes from one subsystem to another. The name xorstate is derived from the invariant that if the FSM is in an xor state, thenit is in exactly one of the states contained in the xor state.

∙ Representation of concurrency/interleaving by using so-called and states(cf. Figure 3.10). In the non-hierarchical case, the modeling of the controlflow of multiple concurrent or interleaved executions will require a form ofproduct FSM, and hence the number of states needed to be created by handwill exhibit an exponential explosion. In contrast to this, and states docreate these product FSMs implicitly and no longer require hand modeling.Hence, the modeling of much larger specifications becomes feasible in acompact way.

Leaf states 𝑞𝐴, 𝑞𝐵, and 𝑞𝐶

Initial state 𝑞𝐴 of xor state 𝑞𝐷

Initial state 𝑞𝐷 of the FSM

xor state 𝑞𝐷

#𝑖1 ≥ 1

#𝑖3 ≥ 1

#𝑖2 ≥ 1

#𝑖4 ≥ 1

𝑞𝐷

𝑞𝐴

𝑞𝐶

𝑞𝐵

(a) Hierarchical version of the FSM

Only leaf states remaining

#𝑖1 ≥ 1

#𝑖3 ≥ 1

#𝑖4 ≥ 1

#𝑖2 ≥ 1

#𝑖2 ≥ 1

𝑞𝐴

𝑞𝐶

𝑞𝐵

(b) Corresponding non-hierarchical version

Figure 3.9: Example of expressing hierarchy in SysteMoC FSMs via usage of xorstates

To exemplify xor states, consider the hierarchical FSM depicted in Figure 3.9a.It consists of one xor state 𝑞𝐷 containing the two leaf states 𝑞𝐴 and 𝑞𝐶 as wellas a leaf state 𝑞𝐵 on the same hierarchy level as 𝑞𝐷. The corresponding non-hierarchical FSM is depicted in Figure 3.9b. Transitions between leaf states havethe same semantics as in the non-hierarchical FSM model, e.g., the transition

58

Page 67: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.5 Hierarchical SysteMoC Models

from 𝑞𝐴 to 𝑞𝐶 . These transitions may even connect leaf states inside differentxor states, e.g., the transition from 𝑞𝐵 to 𝑞𝐶 . Transitions with an xor state assource state are attached as outgoing transition to each state contained in thexor state, e.g., the transition from 𝑞𝐷 to 𝑞𝐵 in Figure 3.9a which is replicatedinto two transitions 𝑞𝐴 to 𝑞𝐵 and 𝑞𝐶 to 𝑞𝐵 in Figure 3.9b. Transitions with anxor state as destination state will be mapped in the non-hierarchical FSM to atransition where the destination state is determined by the initial state of thetarget xor state, e.g., the transition from 𝑞𝐵 to 𝑞𝐷 corresponds to the transitionfrom 𝑞𝐵 to 𝑞𝐴 in Figure 3.9b.

and state 𝑞𝑌

xor states 𝑞𝐴 and 𝑞𝐷

𝛼/𝑓

3;emit(𝜉)

𝛾/𝑓

4

𝛼/𝑓

1;emit(𝜉)

𝛽/𝑓

2

𝑞𝑌𝑞𝐷𝑞𝐴

𝑞𝐸

𝑞𝐹

𝑞𝐵

𝑞𝐶

Figure 3.10: Example of a hierarchical FSM with an and state

To exemplify and states, consider the (hierarchical) statechart FSM depictedin Figure 3.10. One central decision to determine the semantics of and states isthe question if and states model concurrency or interleaving. As the depictedFSM belongs to the synchronous reactive domain [BCE+03], it has input events(𝛼, 𝛽, 𝛾) instead of input predicates and emits output events (emit(𝜉)) insteadof output predicates. In the synchronous reactive domain, transitions are as-sumed to be taken instantaneously23 and events may vanish between successivetransitions. Hence, to prevent events to be lost, an and state models concur-rency in the synchronous reactive domain, that is all transitions enabled by theset of input events are taken concurrently. Therefore, in statecharts, a firingof the hierarchical FSM can change the state in multiple xor states containedin an and state, e.g., as can be seen in the four transitions from 𝑞𝐵,𝐶 to 𝑞𝐶,𝐹

in Figure 3.11a. The actions of these four transitions are derived by parallelcomposition (as denoted by the ‖ operator) from the actions 𝑓1, 𝑓2, 𝑓3, and 𝑓4in the individual xor states 𝑞𝐴 and 𝑞𝐷 contained in the and state.23In practice, this means faster than the worst case inter arrival time of events from the

environment of the synchronous reactive system.

59

Page 68: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

𝛽∧ ¬

𝛾∧ ¬

𝛼/𝑓 2

𝛾 ∧ ¬𝛽 ∧ ¬𝛼/𝑓4

𝛾/𝑓4 𝛽/

𝑓 2

𝛼/𝑓 1‖em

it(𝜉)

𝛼/𝑓3 ‖em

it(𝜉)

𝛽∧𝛾/𝑓

2‖𝑓

4

𝛼/𝑓

1‖𝑓

3‖emit(𝜉)

𝛼∧𝛾/𝑓

1‖𝑓

4‖emit(𝜉)

𝛼∧𝛽/𝑓

2‖𝑓

3‖emit(𝜉)

𝑞𝐶,𝐸 𝑞𝐵,𝐹

𝑞𝐶,𝐹

𝑞𝐵,𝐸

(a) Corresponding non-hierarchicalFSM derived from the hierarchi-cal FSM depicted in Figure 3.10in case of concurrency semanticsfor this hierarchical FSM

𝛽/𝑓 2

𝛼/𝑓 1‖em

it(𝜉)

𝛼/𝑓3 ‖em

it(𝜉)

𝛾/𝑓4

𝛽/𝑓 2

𝛾/𝑓4

𝛼/𝑓 1‖em

it(𝜉)

𝛼/𝑓3 ‖em

it(𝜉)

𝑞𝐶,𝐸 𝑞𝐵,𝐹

𝑞𝐶,𝐹

𝑞𝐵,𝐸

(b) Corresponding non-hierarchicalFSM derived from the hierarchi-cal FSM depicted in Figure 3.10in case of interleaving semanticsfor this hierarchical FSM

Figure 3.11: Interpretation of an and state as modeling either concurrency orinterleaving

However, the concurrency interpretation has the following implications:

∙ The parallel composition of two actions, e.g., 𝑓1‖𝑓2, must be a well definedoperation. However, when the actions 𝑓1, 𝑓2 : 𝑄func → 𝑄func manipulatea shared functionality state 𝑞func ∈ 𝑄func

24, then this is not generally thecase. For parallel composition to be a well defined operation, the nextfunctionality state resulting from applying both 𝑓1 and 𝑓2 must be in-dependent of the sequence of application of the actions to the currentfunctionality state, that is 𝑓1 ∘ 𝑓2 ≡ 𝑓2 ∘ 𝑓1.25 Note that in general, actionsencoded in SysteMoC do not have this property.

∙ It must be possible to create a union of output events generated by theset of enabled transitions. This is trivially possible if transitions generateevents, then the union corresponds to the set union. To exemplify, considerthe transitions with the actions 𝑓1 and 𝑓3 from Figure 3.10. Both tran-sitions generate the output event 𝜉, thus the transition with the parallelcomposition 𝑓1‖𝑓3 from Figure 3.11a will also only generate 𝜉. However,

24As defined above, actions not only manipulate the functionality state, but are also giventhe values of the input tokens and calculate the values of the output tokens. However, theproblem described here is independent of this fact and only based on the manipulation ofthe functionality state.

25The ‘∘’-operator, e.g., 𝑓 ∘ 𝑔, is used to denote function composition, i.e., 𝑓 ∘ 𝑔(𝑥) ≡ 𝑓(𝑔(𝑥)).

60

Page 69: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.5 Hierarchical SysteMoC Models

in SysteMoC, the output event 𝜉 will correspond to an output predicateon the same output port 𝑜. Thus, two token values will be generated by𝑓1 and 𝑓3 for one output token.

Due to the above given implications, interleaving semantics has been chosenfor hierarchical SysteMoC FSMs. Hence, actions of taken transitions are exe-cuted sequentially, while concurrency is provided due to the data flow nature ofSysteMoC.In general, a hierarchical FSM can be constructed by an arbitrary nesting

of xor states and and states inside each other. This nesting is terminatedby leaf states which do not contain further states. In SysteMoC, xor statesand and states are represented by the C++ classes smoc_xor_state andsmoc_and_state, respectively. Leaf states are represented by the C++ classsmoc_firing_state, as known from the specification of non-hierarchicalFSMs in SysteMoC.

3.5.2 Hierarchical Network Graphs

Next, the notion of clusters will be introduced. Clusters are used to introducehierarchy into the non-hierarchical network graph model. A cluster is basicallya data flow graph 𝑔 with input and output ports. Such a graph will be called anopen system as it exchanges data with its cluster environment via its cluster in-put and output ports. Furthermore, the availability of tokens on the input portsof the cluster will be controlled from outside the cluster by the cluster environ-ment. This is in contrast to the closed system nature of the traditional definitionof a data flow graph as given in Definition 2.1. Traditional data flow graphsmight communicate with an outside environment by using dedicated source andsink actors representing interfaces to this outside environment. However, theconsumption and production of tokens by these dedicated source and sink ac-tors is controlled by the scheduler of the data flow graph itself and not by theenvironment.Hierarchy is introduced into a cluster or network graph by allowing it to

contain subclusters, i.e., the vertices 𝑔.𝑉 of a cluster are partitioned into actors𝑎 ∈ 𝑔.𝐴, channels 𝑐 ∈ 𝑔.𝐶, and additionally also subclusters 𝑔𝛾 ∈ 𝑔.𝐺. Moreformally, a cluster can be defined as follows:

Definition 3.4 (Cluster [FHZT13*]). A cluster is a directed bipartite graph𝑔 = (𝒫 , 𝑉, 𝐸) containing a set of cluster ports 𝒫 = 𝐼 ∪ 𝑂 partitioned intocluster input ports 𝐼 and cluster output ports 𝑂, a set of vertices 𝑉 = 𝐶 ∪𝐴∪𝐺partitioned into channels 𝐶, actors 𝐴, and subclusters 𝐺, a set of directed edges𝑒 = (𝑣src, 𝑣snk) ∈ 𝐸 ⊆ ((𝐶 ∪ 𝐼)× (𝐴.𝐼 ∪𝐺.𝐼)) ∪ ((𝐴.𝑂 ∪𝐺.𝑂)× (𝐶 ∪ 𝑂)) fromchannels 𝑐 ∈ 𝐶 or cluster input ports 𝑖′ ∈ 𝐼 to actor or subcluster input ports𝑖 ∈ 𝐴.𝐼 ∪𝐺.𝐼 as well as from actor or subcluster output ports 𝑜 ∈ 𝐴.𝑂∪𝐺.𝑂 to

61

Page 70: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

channels or cluster output ports 𝑜′ ∈ 𝐺.𝑂. These edges are further constrainedsuch that exactly one edge is incident to each actor, subcluster, or cluster portand the in-degree and out-degree of each channel in the graph is exactly one.Moreover, the delay function delay and the channel capacity function size arereused from Definition 3.3 for non-hierarchical network graphs.

In the following, an actor 𝑎in ∈ 𝑔.𝐴𝐼 of a cluster 𝑔 will be called an input actorif it consumes tokens from channels of its cluster environment. Analogously, anactor 𝑎out ∈ 𝑔.𝐴𝑂 of a cluster 𝑔 will be called an output actor if it producestokens on channels of its cluster environment. Actors which are neither inputnor output actor are called intermediate actors. More formally, the set of inputactors 𝐴𝐼 of a cluster 𝑔 is defined as 𝑔.𝐴𝐼 = { 𝑎 ∈ 𝑔.𝐴 | ∃(𝑣src, 𝑣snk) ∈ 𝑔.𝐸 :𝑣src ∈ 𝑔.𝐼∧𝑣snk ∈ 𝑎.𝐼 } while the set of output actors 𝐴𝑂 of a cluster 𝑔 is definedanalogously as 𝑔.𝐴𝑂 = { 𝑎 ∈ 𝑔.𝐴 | ∃(𝑣src, 𝑣snk) ∈ 𝑔.𝐸 : 𝑣snk ∈ 𝑔.𝑂 ∧ 𝑣src ∈ 𝑎.𝑂 }.Note that 𝐴𝐼 and 𝐴𝑂 may not be disjoint sets.To distinguish between open and closed systems, the notion of hierarchical

network graphs is introduced. These graphs are simply clusters with an emptyset of cluster input and output ports.

Definition 3.5 (Hierarchical Network Graph [FHZT13*]). A hierarchical net-work graph 𝑔n ∈ 𝐺𝑛 = { 𝑔 | 𝑔 ∈ 𝐺 ∧ 𝑔.𝒫 = ∅ } is a cluster without clusterports.26

As can be seen from Definitions 3.4 and 3.5, a non-hierarchical network graph𝑔n, as given in Definition 3.3, is simply a hierarchical network graph withoutany subclusters, i.e., 𝑔n.𝐺 = ∅. Furthermore, a cluster 𝑔 will be called non-hierarchical if it does not contain a subcluster, i.e., 𝑔.𝐺 = ∅, and hierarchicalotherwise. In the following, it will be assumed that a network graph or clustermay contain subclusters if they are not explicitly stated to be non-hierarchical.

3.5.3 The Cluster Operation

The transformation of a possibly non-hierarchical original cluster into a hier-archical cluster will be denoted by usage of the Γ : 𝐺 × 2𝑉 → 𝐺 × 𝐺 clusteroperator. This operation takes an original cluster and a set of vertices 𝑉𝛾 togenerate a clustered network graph containing a subcluster which replaces theset of clustered vertices 𝑉𝛾 . An example of such a cluster operation can be seen

26 In contrast to the usage of 𝑔.𝐺, which denotes the set of subclusters inside the cluster 𝑔, orthe usage of 𝐺 where it is apparent from the context to refer to some cluster 𝑔, the usageof 𝐺 alone and without any context is used to denote the set of all possible clusters. Thisconvention is additionally used accordingly for the set of all possible actors 𝐴, channels𝐶, edges 𝐸, ports 𝒫, and vertices 𝑉 = 𝐴 ∪ 𝐶 ∪𝐺.

62

Page 71: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.5 Hierarchical SysteMoC Models

in Figure 3.6 depicting the original network graph 𝑔sqr,flat and Figure 3.12 dis-playing the clustered network graph 𝑔sqr containing the subcluster 𝑔𝛾appx

whichreplaces the clustered actors and channels 𝑉 sqr

𝛾 = { 𝑎3, 𝑎4, 𝑐3, 𝑐4 } ⊆ 𝑔sqr,flat.𝑉 ,i.e., Γ(𝑔sqr,flat, 𝑉 sqr

𝛾 ) = (𝑔sqr, 𝑔𝛾appx).

𝑎2|SqrLoop

𝑜 2 𝑖 2𝑜 1𝑖 1

𝑐1 𝑐2

𝑐5𝑐6

𝑜1

𝑖1

𝑜 1

𝑜 1 𝑖 1

𝑖 2

𝑐3 𝑐4

Network graph 𝑔sqr and its subcluster 𝑔𝛾appx∈ 𝑔sqr.𝐺

𝑔sqr|SqrRoot

Cluster input port 𝑖1 ∈ 𝑔𝛾appx.𝐼

Cluster output port 𝑜1 ∈ 𝑔𝛾appx.𝑂 Channel 𝑐4 ∈ 𝑔𝛾appx

.𝐶

Actor

𝑎3∈𝑔 𝛾

appx.𝐴

ofsubc

luster

𝑔 𝛾appx

𝑔𝛾appx|ApproxBody

𝑎1|Src

𝑎5|Sink

𝑖1

𝑜2

𝑖1

𝑜1

𝑎3|Approx

𝑎4|Dup

Figure 3.12: The hierarchical network graph 𝑔sqr displayed above is derived fromthe non-hierarchical network graph 𝑔sqr,flat depicted in Figure 3.6on page 49 by clustering its actors and channels { 𝑎3, 𝑎4, 𝑐3, 𝑐4 } ⊂𝑔sqr,flat.𝑉 into a subcluster 𝑔𝛾appx

∈ 𝑔sqr.𝐺. Furthermore, the sub-cluster 𝑔𝛾appx

has the following sets of input and output actors𝐴𝐼 = { 𝑎3 } and 𝐴𝑂 = { 𝑎4 }, respectively.

Without loss of generality, because only actor ports, which are also providedby a subcluster, are referenced inside the following definition, all containedsubclusters inside the original graph 𝑔orig are treated like actors. Should anactor be indeed a cluster, then the actor ports will refer to the equivalent clusterports.

Definition 3.6 (Cluster Operation [FHT07*]). The cluster operation Γ : 𝐺 ×2𝑉 → 𝐺 × 𝐺 is a partial function which maps Γ(𝑔orig, 𝑉𝛾 ) = (𝑔, 𝑔𝛾 ) an originalgraph 𝑔orig ∈ 𝐺 and a set of channels and actors 𝑉𝛾 ⊆ 𝑔orig.𝑉 contained in thisoriginal graph into a clustered graph 𝑔 ∈ 𝐺 and a subcluster 𝑔𝛾 ∈ 𝑔.𝐺 containedin the clustered graph. The cluster operation requires that all channels to cluster𝑐 ∈ 𝐶𝛾 = 𝑉𝛾 ∩ 𝑔orig.𝐶 are connected to actors to cluster 𝑎 ∈ 𝐴𝛾 = 𝑉𝛾 ∩ 𝑔orig.𝐴,i.e., ∀𝑐 ∈ 𝐶𝛾 : ∃𝑖 ∈ 𝐴𝛾 .𝐼, 𝑜 ∈ 𝐴𝛾 .𝑂 : (𝑐, 𝑖), (𝑜, 𝑐) ∈ 𝑔orig.𝐸. The clustered graphand its subcluster are derived from the original graph 𝑔orig and the set 𝑉𝛾 ofchannels and actors to cluster in the following way:

63

Page 72: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

∙ The vertices 𝑉𝛾 are replaced in the clustered graph 𝑔 by the subcluster𝑔𝛾 ∈ 𝑔.𝐺 which contains the replaced vertices 𝑉𝛾 , i.e., 𝑔𝛾 .𝑉 = 𝑉𝛾 , 𝑔.𝑉 =(𝑔orig.𝑉 ∖ 𝑉𝛾 ) ∪ { 𝑔𝛾 }.

∙ The ports 𝑔.𝒫 of the clustered graph 𝑔 are taken from the original graphwithout any modifications, i.e., 𝑔.𝒫 = 𝑔orig.𝒫 .

∙ The input ports 𝑔𝛾 .𝐼 of the subcluster 𝑔𝛾 are derived via an one-to-onecorrespondence from the edges 𝑒in ∈ 𝐸in directed from vertices of theclustered graph to vertices of the subcluster, i.e., ∃𝑓in : 𝐸in → 𝑔𝛾 .𝐼 : 𝑓in isa bijection where 𝐸in = { 𝑒 ∈ 𝑔orig.𝐸 | 𝑒.𝑣src ∈ 𝑔.𝐶∪𝑔.𝐼, 𝑒.𝑣snk ∈ 𝑔𝛾 .𝐴.𝐼 }.27

∙ The output ports 𝑔𝛾 .𝑂 of the subcluster 𝑔𝛾 are derived according to thesame scheme as the input ports, i.e., ∃𝑓out : 𝐸out → 𝑔𝛾 .𝑂 : 𝑓out is a bijectionwhere 𝐸out = { 𝑒 ∈ 𝑔orig.𝐸 | 𝑒.𝑣src ∈ 𝑔𝛾 .𝐴.𝑂, 𝑒.𝑣snk ∈ 𝑔.𝐶 ∪ 𝑔.𝑂 }.

∙ After defining the vertices and ports of the clustered graph and its subclus-ter, the edges connecting these vertices have to be defined. These edgesare derived from the edges 𝑒 ∈ 𝑔orig.𝐸 of the original graph. Two caseshave to be distinguished: (1) Edges 𝑒in ∈ 𝐸in or 𝑒out ∈ 𝐸out which crossthe boundary between the clustered graph and its subcluster and have tobe transformed into two edges, one in the clustered graph to or from a sub-cluster port and one edge in the subcluster from or to a subcluster port,respectively. (2) The remaining edges 𝑒 ∈ 𝑔orig.𝐸∖(𝐸in∪𝐸out) which can betaken verbatim into their corresponding clustered graph and its subcluster.Thus, more formally the edges of the clustered graph are defined as follows:𝑔.𝐸 = { 𝑒 ∈ 𝑔orig.𝐸 | 𝑒.𝑣src, 𝑒.𝑣snk ∈ 𝑔.𝒫∪𝑔.𝐶∪𝑔.𝐴.𝒫 }∪{ (𝑒in.𝑣src, 𝑓in(𝑒in)) |𝑒in ∈ 𝐸in } ∪ { (𝑓out(𝑒out), 𝑒out.𝑣snk) | 𝑒out ∈ 𝐸out } while the edges of thesubcluster are defined accordingly 𝑔𝛾 .𝐸 = { 𝑒 ∈ 𝑔orig.𝐸 | 𝑒.𝑣src, 𝑒.𝑣snk ∈𝑔𝛾 .𝐶 ∪ 𝑔𝛾 .𝐴.𝒫 } ∪ { (𝑓in(𝑒in), 𝑒in.𝑣snk) | 𝑒in ∈ 𝐸in } ∪ { (𝑒out.𝑣src, 𝑓out(𝑒out)) |𝑒out ∈ 𝐸out }.

For a better understanding of the preceding definition, the clustering opera-tion Γ(𝑔sqr,flat, { 𝑎3, 𝑎4, 𝑐3, 𝑐4 }) = (𝑔sqr, 𝑔𝛾appx

) depicted in Figure 3.6 on page 49and Figure 3.12 is considered again. For this example, the edges crossing theboundary between the clustered graph and its subcluster are 𝑒in = (𝑐2, 𝑎3.𝑖1)and 𝑒out = (𝑎4.𝑜2, 𝑐5) which are mapped via the bijections 𝑓in and 𝑓out tothe subcluster ports 𝑔𝛾appx

.𝒫 = { 𝑖1, 𝑜1 } as follows: 𝑓in(𝑒in) = 𝑔𝛾appx.𝑖1 and

𝑓out(𝑒out) = 𝑔𝛾appx.𝑜1. Furthermore, these two edges are translated into two edges

of the clustered graph (𝑒in → (𝑐2, 𝑔𝛾appx.𝑖1) ∈ 𝑔sqr.𝐸 and 𝑒out → (𝑔𝛾appx

.𝑜1, 𝑐5) ∈

27Here the ‘.’-operator is again used to refer to the source and sink of an edge 𝑒, respectively,with the denotation 𝑒.𝑣src and 𝑒.𝑣snk, as given in Definition 3.3 and Definition 3.4.

64

Page 73: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.6 From SystemC to SysteMoC

𝑔sqr.𝐸) and two edges of the subcluster (𝑒in → (𝑔𝛾appx.𝑖1, 𝑎3.𝑖1) ∈ 𝑔𝛾appx

.𝐸 and𝑒out → (𝑎4.𝑜2, 𝑔𝛾appx

.𝑜1) ∈ 𝑔𝛾appx.𝐸).

The semantics of a hierarchical network graph 𝑔n is defined by the semanticsof its corresponding non-hierarchical network graph 𝑔′n. This non-hierarchicalnetwork graph is obtained by recursively applying the dissolve operation Γ−1 un-til the resulting network graph no longer contains any subclusters. The dissolveoperation Γ−1 is the inverse of the cluster operation. In Chapter 4, the structuralhierarchy introduced above will be used to specify refinements by performingquasi-static scheduling for the data flow graph contained in a cluster.

3.6 From SystemC to SysteMoC

While modeling languages with explicit assertions about the used MoC are the-oretically very appealing, it is a fact that most models are written in somekind of C or C++-based legacy languages. One of these languages is SystemC[GLMS02, Bai05] which is very popular and used at the ESL as real-world systemdescription language for industrial projects. However, as SystemC is a C++-based class library, it inherits all the problems of C++, e.g., the possibilityof unstructured communication over shared memory and Turing completeness.The question answered in this section is whether such legacy code might betransferred into MoCs such as SysteMoC. It is clear that due to Turing com-pleteness, such a transformation can only provide a sufficient criterion to detecta MoC for a given SystemC code. The first step is called communication extrac-tion. The second step to bridge the gap between legacy code and the MoC worldwill be presented in Section 3.7. Communication extraction, as published alsoin [FZK+11*], is an algorithm to derive an actor FSM from a SystemC modulewritten in a constrained subset of SystemC.Fortunately, the synthesizable subset of SystemC [MSS+09] is a good basis

for such a constrained subset of SystemC analyzable by the presented method-ology. Furthermore, SystemC strongly suggests and the synthesizable subsetof SystemC requires the representation of a design in an actor-oriented way[ET06, JBP06]. Designs represented in an actor-oriented way consist of actorswhich can only communicate via channels connecting the actors via ports, hencealleviating the unstructured communication problem inherited from C++. Onekey insight is that the number of possible states for the control flow exhibited bya module written in the synthesizable subset of SystemC must be finite. This re-quirement stems from the observation that behavioral synthesis tools like ForteCynthesizer [For13], Catapult C from Calypto Design Systems [Cal13], and Vi-vado from Xilinx [XIL14] represent the control flow of the original behavioralSystemC module as an FSM in the synthesized RTL SystemC module.

65

Page 74: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

However, the synthesizable subset of SystemC is not constrained to use chan-nels exhibiting FIFO semantics. Therefore, synthesizable SystemC applicationsdo not purely consist of actors communicating in a data flow manner in general.Hence, the further constraint that communication must be conducted via FIFOchannels is required in order to extract a SysteMoC model from a SystemCapplication. In practice, SystemC sc_fifo primitives are used to implementchannels of the desired FIFO semantics and limited channel capacity. An ex-ample of a SystemC actor satisfying this constraint is depicted in Example 3.4.

Example 3.4 SystemC module implementing the 𝑎2|SqrLoop actor from Fig-ure 3.6

1 // Definition of the SqrLoop actor class2 class SqrLoop : public sc_core : : sc_module {3 public :4 // Declaration of fifo input and output ports5 sc_core : : sc_f i fo_in<double> i1 , i 2 ;6 sc_core : : sc_fi fo_out<double> o1 , o2 ;7 private :8 // SystemC thread process9 void loop ( ) {

10 BB1: double tmp , r e s ;11 BB2: while ( true ) {12 BB3: tmp = i1 . read ( ) ; // get token from Src13 do { /* Do one approximation s t ep */14 BB4: o1 . wr i t e (tmp ) ;15 BB5: r e s = i 2 . read ( ) ;16 } while ( /* Good enough? */17 std : : f abs (tmp−r e s * r e s ) >=0.01);18 BB6: o2 . wr i t e ( r e s ) ; // write token to Sink19 }20 BB7: return ;21 }22 public :23 SC_HAS_PROCESS( SqrLoop ) ;24 SqrLoop ( sc_core : : sc_module_name name)25 : sc_core : : sc_module (name)26 { SC_THREAD( loop ) ; }27 } ;

Communication extraction [FZK+11*] uses the Control Flow Graph (CFG)of a SystemC module which is extracted via the Abstract Syntax Tree (AST)

66

Page 75: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.6 From SystemC to SysteMoC

generated by the C Language Family Frontend of the LLVM Compiler Infras-tructure (clang) [Lat11] parser. The CFG used is very similar to the ones knownfrom compiler design, i.e., it is a directed graph where a vertex represents a BasicBlock (BB) and an edge represents a possible successor BB after the BB at thesource of the edge has finished execution. A BB itself represents a maximal pieceof sequential code which neither contains any jump targets except at its startnor contains any control flow except the possible multiple outgoing edges at itsend. In case of multiple outgoing edges of a BB, the edges must be annotatedwith mutually exclusive Boolean conditions to ensure a deterministic executionsemantics.

The communication extraction analysis requires that the SystemC modulecontains exactly one SystemC thread, i.e., exactly one SC_THREAD(method)is defined in the module’s constructor and assumes that only functions in theSystemC module itself issue read or write operations on the ports of the mod-ule. All other code, e.g., standard library functions, global helper code, andcode reachable via function pointers or virtual methods are assumed to haveno access to the ports of the SystemC module which calls these functions. TheC++ type of a SystemC module is extracted via the usage of C++ Run TimeType Information (RTTI) at run time after the elaboration phase of SystemC.Furthermore, CFGs from compiler design have call and return vertices whichenable them to represent infinite recursion. This would prevent the constructionof a finite CFG by inlining the CFGs of the called functions. Therefore, anyrecursion within a SystemC module is also forbidden. The CFG of a given Sys-temC module is constructed by starting from the SC_THREAD marked methodand is extended by recursively inlining all CFGs of called methods.

Fortunately, the above limitations correspond well to the requirements ofthe synthesizable subset of behavioral compilers like Forte Cynthesizer [For13]or Catapult C from Calypto Design Systems [Cal13]. This coincidence is notsurprising as behavioral compilers also represent the control flow by an FSM.

To exemplify, the SystemC module depicted in Example 3.4 is considered.The module implements the 𝑎2|SqrLoop actor from Figure 3.6. The CFGcorresponding to the loopmethod is depicted in Figure 3.13. The CFG deviatesfrom a conventional CFG as known from compiler design, since BBs are splitsuch that all read statements in a BB precede all write statements in theBB. An example of such a split are the BBs BB4 and BB5 which would be asingle BB in a traditional CFG. The requirement to split this block stems fromthe fact that the actions 𝑓𝑎𝑐𝑡𝑖𝑜𝑛 ∈ ℱ𝑎𝑐𝑡𝑖𝑜𝑛 will be assembled from the code inthe BBs and token consumption and production requirements of an action arechecked by the input/output guard at the start of the action. If a BB wouldcontain a read statement after a write statement, an erroneous dependencyfrom the token consumed by the read statement to the token produced by the

67

Page 76: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

BB1start

BB2

BB7

BB3 BB4

BB6 BB5

#𝑖1 ≥ 1

f

#𝑜1 ≥ 1

#𝑖2 ≥ 1#𝑜1 ≥ 1∧|tmp− res2| ≥ 0.01

#𝑜2 ≥ 1 ∧ |tmp− res2| < 0.01

Figure 3.13: CFG of the loop method from Example 3.4

write statement would be generated by the input/output guard at the startof the action containing this BB.It should be noted that the CFG depicted in Figure 3.13 is already annotated

with input/output guards for each BB, e.g., the input guard #𝑖1 ≥ 1 on theedge between BB2 and BB3 denotes that at least one token must be available onthe FIFO channel connected to the input port 𝑖1 to execute BB3. The CFG isconverted into the corresponding actor FSM by converting each BB into a state,e.g., BB1–BB7 is converted to the states 𝑞1–𝑞7, and adding an initial state 𝑞0. Atransition from the initial state 𝑞0 to the state corresponding to the initial BB,e.g., state 𝑞1 derived from BB1, is added. The action for each transition is theBB from which the destination state of the transition is derived. The derivedinitial actor FSM can be seen in Figure 3.14.To reduce the number of states and assemble basic blocks into longer actions,

a number of post-processing rules may be applied to the initial FSM. Theserules will be applied iteratively until no more rules match. In order to expressthese post-processing rules more succinctly, two transitions 𝑡𝑎 and 𝑡𝑏 are calledsequentially mergeable if the destination state of transition 𝑡𝑎 is the source stateof transition 𝑡𝑏 and 𝑡𝑏 does not consume any tokens if tokens have been pro-duced by 𝑡𝑎. Furthermore, two transitions 𝑡𝑎 and 𝑡𝑏 are called parallel mergeableif they have the same source and destination state as well as the same input/out-put guard. More explicitly, the following post-processing rules [FZK+11*] areapplied to the initial actor FSM:

Rule 1 Elimination of sequentially mergeable transitions for split/join states.States with at least one incoming transition and exactly one outgoingtransition will be called join states. States with at least one outgoing

68

Page 77: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.6 From SystemC to SysteMoC

𝑞0start 𝑞1

𝑞2

𝑞7

𝑞3 𝑞4

𝑞6 𝑞5

t/BB1

t/BB2 #𝑖1 ≥ 1/BB3

f/BB7

#𝑜1 ≥ 1/BB4

#𝑖2 ≥ 1/BB5#𝑜1 ≥ 1∧|tmp− res2|≥ 0.01/BB4

#𝑜2 ≥ 1 ∧ |tmp− res2| < 0.01/BB6

t/BB2

Figure 3.14: Transformation of the CFG from Figure 3.13 into an FSM

𝑞0start

𝑞2

𝑞7

𝑞4

𝑞6

t/BB1; BB2 #𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/BB3; BB4

f/BB7

#𝑖2 ≥ 1 ∧#𝑜1 ≥ 1∧|tmp− res2| ≥ 0.01/BB5; BB4

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1∧|tmp− res2| < 0.01/BB5; BB6

t/BB2

Figure 3.15: The actor FSM after applying rule 1 to eliminate the states 𝑞1, 𝑞3,and 𝑞5 from Figure 3.14

transition and exactly one incoming transition will be called split states.These states can be eliminated if each incoming transition is sequen-tially mergeable with each outgoing transition. The state is eliminatedby creating a transition 𝑡 for each combination of incoming transition𝑡𝑎 and outgoing transition 𝑡𝑏. The action of the created transition 𝑡is the concatenation of the actions of 𝑡𝑎 and 𝑡𝑏 while the guard is theconjunction of the guards of 𝑡𝑎 and 𝑡𝑏. The source state of a transition𝑡 is the source state of 𝑡𝑎 and the destination state of 𝑡 is the desti-nation state of 𝑡𝑏. Examples of such states are 𝑞1, 𝑞3, 𝑞5, and 𝑞6 from

69

Page 78: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

Figure 3.14. The resulting FSM after applying this rule to the states𝑞1, 𝑞3, and 𝑞5 can be seen in Figure 3.15.

Rule 2 Merging of parallel mergeable transitions. The resulting transition 𝑡from the parallel transitions 𝑡𝑎 and 𝑡𝑏 has the same input/output guardas the two original transitions had. The guard function for the result-ing merged transition 𝑡 consists of the (possibly simplified) disjunctionof the original guard functions. The resulting action of the mergedtransition 𝑡 will consist of an appropriate if construct containingthe actions of the original transitions.28

Rule 3 Elimination of self-loops. Self loops which contain no input/outputguards, i.e., the action on the transition of the self-loop neither con-sumes nor produces tokens, can be eliminated by concatenating thecontained action with an appropriate looping construct for theactions of transitions entering the state with the self-loop under elimi-nation.

Finally, state machine minimization is applied on the resulting transformedactor FSM and subsequently loop unrolling for loops with known boundaries.Note that these are only loops with loop bodies containing read or write op-erations, due to the elimination of all other loops by the post-processing rules.The resulting FSM after applying the above steps is depicted in Figure 3.16.It can be seen that the resulting FSM strongly resembles the actor FSM fromFigure 3.5. The only difference is the presence of the transition from 𝑞0 to 𝑞2which is responsible for initializing the variables tmp and res.

3.7 Automatic Model Extraction

In order to be able to apply MoC specific analysis methods such as static sched-ule generation or deadlock detection, it is important to recognize actors belong-ing to well-known data flow models of computation such as HSDF, SDF andCSDF. It will be shown that this can be accomplished by inspection and analysisof the actor FSM of a SysteMoC actor. The methodology for actor classifica-tion has been published in [FHZT13*, ZFHT08*]. Note that automatic modelextraction by means of classification has mainly been researched by Zebelein.However, this step is required for the design flow proposed in this thesis and,hence, will be briefly summarized below.

28Note that in the SqrLoop example, this and the following rule will never apply since there areno parallel mergeable transitions or self-loops without communication. A more complexexample requiring the application of these rules can be found in [FZK+11*].

70

Page 79: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.7 Automatic Model Extraction

𝑞0start

𝑞2 𝑞4

t/BB1; BB2 #𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/BB3; BB4

#𝑖2 ≥ 1 ∧#𝑜1 ≥ 1∧|tmp− res2| ≥ 0.01/ BB5; BB4

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1∧|tmp− res2| < 0.01/BB5; BB6; BB2

Figure 3.16: The actor FSM after applying rule 1 to eliminate the states 𝑞6 fromFigure 3.15

The most basic representation of static actors encoded as actor FSMs canbe seen in Figure 3.17. To exemplify, the SDF actor depicted in Figure 3.17acorresponds to the Dup actor 𝑎4 as displayed in Figure 3.6. The depicted actorFSM contains only one transition, which consumes one token from the actorinput port 𝑖1 and duplicates this token on the output ports 𝑜1 and 𝑜2. Clearly,the actor exhibits a static communication behavior corresponding to the SDFMoC.

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1 ∧#𝑜2 ≥ 1/𝑓dup

𝑞0

(a) Basic representation of an SDF actor viaan actor FSM

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1

𝑞0 𝑞1

(b) Basic representation of a CSDF actorvia an actor FSM

Figure 3.17: Basic representations of an SDF and a CSDF actors via FSMs

However, in order for the classification algorithm to ignore the guard functionsused by an actor FSM, certain assumptions have to be made. It will be assumedthat each actor state has at least one enabled outgoing transition if sufficienttokens are present on the actor input ports. This is required in order to be ableto activate the equivalent SDF or CSDF actor an infinite number of times. Itwill also be assumed that an actor will consume or produce tokens every nowand then. These assumptions are not erroneous as, otherwise, there exists aninfinite loop in the execution of the actor where a cycle of the FSM is traversedand each transition in this cycle neither consumes nor produces any tokens. Thiscan clearly be identified as a bug in the implementation of the actor, similar

71

Page 80: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

to an action of a transition that never terminates. For the exact mathematicaldefinition of the above given requirements, see [ZFHT08*].The idea of the classification algorithm is to check if the communication be-

havior of a given actor can be reduced to a basic representation, which can beeasily classified into the SDF or CSDF MoC. Note that the basic representationsfor both SDF and CSDF models are FSMs with a single cycle, e.g., as depictedfor the CSDF actor in Figure 3.17b.Not all actor FSMs exhibit such a regular pattern (cf. Figure 3.18). However,

some of them, e.g., Figures 3.18c and 3.18d, still represent static actors. It canbe distinguished if the analysis of the actor functionality state (cf. Definition 3.1)is required to decide whether an actor is a static actor, e.g., Figure 3.18c, ornot, e.g., Figure 3.18d. In the following, the actor functionality state will not beconsidered. Therefore, the presented classification algorithm will fail to classifyFigure 3.18c as a static actor. In that sense, the algorithm only provides asufficient criterion for the problem of static actor detection. A sufficient andnecessary criterion cannot be given as the problem is undecidable in the generalcase.The algorithm starts by deriving a set of classification candidates solely on the

basis of the specified actor FSM. Each candidate is later checked for consistencywith the entire FSM state space via Algorithm 1. If one of the classificationcandidates is accepted by Algorithm 1, then the actor is recognized as a CSDFactor where the CSDF phases are given by the accepted classification candidate.

Definition 3.7 (Classification Candidate [ZFHT08*]). A possible CSDF behav-ior of a given actor is captured by a classification candidate 𝜇 = ⟨𝜇0, 𝜇1, . . . 𝜇𝜏−1⟩where each 𝜇 = (cons, prod) represents a phase of the CSDF behavior and 𝜏 isthe number of phases.

To exemplify, Figure 3.17b is considered. In this case, two phases are present,i.e., 𝜏 = 2, where the first phase (𝜇0) consumes one token from port 𝑖1 andproduces one token on port 𝑜1 while the second phase (𝜇1) consumes one tokenfrom port 𝑖2 and produces one token on port 𝑜2.It can be observed that all paths starting from the initial actor FSM state

𝑞0 must comply with the classification candidate. Furthermore, as CSDF ex-hibits cyclic behavior, the paths must also contain a cycle. The classificationalgorithm will search for such a path 𝑝 = ⟨𝑡1, 𝑡2, . . . , 𝑡𝑛⟩ of transitions 𝑡𝑖 in theactor FSM starting from the initial state 𝑞0. The path can be decomposedinto an acyclic prefix path 𝑝𝑎 and a cyclic path 𝑝𝑐 such that 𝑝 = 𝑝a𝑎 𝑝𝑐, i.e.,𝑝𝑎 = ⟨𝑡0, 𝑡1, . . . 𝑡𝑙−1⟩ being the prefix and 𝑝𝑐 = ⟨𝑡𝑙, 𝑡𝑙+1, . . . 𝑡𝑛−1⟩ being the cyclicpart, that is 𝑡𝑙.𝑞src = 𝑡𝑛−1.𝑞dst. After such a path 𝑝 has been found, a setof classification candidates can be derived from the set of all non-empty pre-fixes {𝑝′ ⊑ 𝑝 | #𝑝′ ∈ { 1, 2, . . . 𝑛 }} of the path 𝑝. A classification candidate𝜇 = ⟨𝜇0, 𝜇1, . . . 𝜇𝜏−1⟩ is derived from a non-empty prefix 𝑝′ by (1) unifying

72

Page 81: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.7 Automatic Model Extraction

#𝑖2 ≥ 1 ∧ 𝑓 check ∧#𝑜2 ≥ 1/𝑓 copyApprox

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓 copyStore

#𝑖2 ≥ 1 ∧ ¬𝑓 check ∧#𝑜1 ≥ 1/𝑓 copyInput

𝑡3

𝑡1

𝑡2

𝑞start 𝑞loop

(a) Depicted above is the actor FSM of the SqrLoopactor 𝑎2 that is obviously not a static actor.

#𝑖2 ≥ 1

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1

#𝑜2 ≥ 1

𝑡1

𝑡2

𝑡4 𝑡3𝑞2

𝑞0 𝑞1

(b) Visualized above is an actorFSM that cannot be convertedto a basic representation forstatic actors.

#𝑖2 ≥ 1 ∧#𝑜1 ≥ 1 ∧ ¬𝑏/𝑏 = t

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑏 = f

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1 ∧ 𝑏

𝑡3

𝑡1

𝑡2

𝑞0 𝑞1

(c) Shown is an actor FSM that seems to belong tothe dynamic domain, but exhibits CSDF behaviordue to the manipulation of the Boolean variable𝑏. Hence, leading to a cyclic execution of thetransitions 𝑡1, 𝑡3, and 𝑡2.

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1

#𝑜2 ≥ 1 #𝑖2 ≥ 1

𝑡1

𝑡4 𝑡3

𝑡2

𝑞2

𝑞0 𝑞1

(d) Shown is an actor FSM thatcan be converted to a basicrepresentation for static ac-tors.

Figure 3.18: Various actor FSMs that do not fit the basic representation forstatic actors as exemplified in Figure 3.17

adjacent transitions 𝑡𝑗, 𝑡𝑗+1 according to the below given transition contrac-tion condition until no more contractions can be performed and (2) driving theCSDF phases 𝜇𝑗 of the classification candidate from the transitions 𝑡𝑗 of thecontracted path computed in step (1).

Definition 3.8 (Transition Contraction Condition [FHZT13*]). Two transi-tions 𝑡𝑗 and 𝑡𝑗+1 of a prefix path 𝑝′ can be contracted if 𝑡𝑗 only consumes tokens,i.e., prod(𝑡𝑗, 𝑂) = 0, or 𝑡𝑗+1 only produces tokens, i.e., cons(𝑡𝑗+1, 𝐼) = 0.29 Theresulting transition 𝑡′ has the combined consumption and production rates givenas cons(𝑡′, 𝐼) = cons(𝑡𝑗, 𝐼) + cons(𝑡𝑗+1, 𝐼) and prod(𝑡′, 𝑂) = prod(𝑡𝑗, 𝑂) +prod(𝑡𝑗+1, 𝑂).

For clarification, Figures 3.18b and 3.18d is considered and it is assumedthat the path 𝑝 = ⟨𝑡1, 𝑡3, 𝑡4⟩ has been found by the classification algorithm.29For notational brevity, the construct cons(𝑡, 𝐼) = (𝑛𝑖1 , 𝑛𝑖2 . . . , 𝑛𝑖|𝐼|) = 𝑛cons is used to

denote the vector 𝑛cons of numbers of tokens consumed by taking the transition. Theequivalent notation prod(𝑡, 𝑂) is also used for produced tokens.

73

Page 82: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

Hence, the set of all non-empty prefixes is { ⟨𝑡1⟩, ⟨𝑡1, 𝑡3⟩, ⟨𝑡1, 𝑡3, 𝑡4⟩ }. In bothdepicted FSMs, the transition 𝑡2 consumes and produces exactly as many tokensas the transition sequence ⟨𝑡3, 𝑡4⟩. However, the transition sequence ⟨𝑡3, 𝑡4⟩ inFigure 3.18d can be contracted as 𝑡3 only consumes tokens while transitionsequence ⟨𝑡3, 𝑡4⟩ in Figure 3.18b cannot be contracted. Hence, for Figure 3.18d,the classification candidate 𝜇 = ⟨𝜇0, 𝜇1⟩ derived from the prefix 𝑝′ = ⟨𝑡1, 𝑡3, 𝑡4⟩is as follows:

𝜇0.cons = cons(𝑡1, 𝐼) = (1, 0)𝜇0.prod = prod(𝑡1, 𝑂) = (1, 0)𝜇1.cons = cons(𝑡3, 𝐼) + cons(𝑡4, 𝐼) = (0, 1)𝜇1.prod = prod(𝑡3, 𝑂) + prod(𝑡4, 𝑂) = (0, 1)

On the other hand, for Figure 3.18b, the classification candidate 𝜇 = ⟨𝜇0, 𝜇1, 𝜇2⟩derived from the prefix 𝑝′ = ⟨𝑡1, 𝑡3, 𝑡4⟩ does not exhibit any contractions and isdepicted below as:

𝜇0.cons = (1, 0) 𝜇0.prod = (1, 0)𝜇1.cons = (0, 0) 𝜇1.prod = (0, 1)𝜇2.cons = (0, 1) 𝜇2.prod = (0, 0)

The underlying reasons for the transition contraction condition from Defini-tion 3.8 is discussed in the following. To illustrate, the data flow graph depictedin Figure 3.19a which uses two actors 𝑎2 and 𝑎3 containing the FSMs fromFigures 3.18b and 3.18d is considered. The resulting dependencies between thetransitions in a transition trace of the actors 𝑎2 and 𝑎3 are shown in Figure 3.19b.As can be seen in Figure 3.19c if the transition sequence ⟨𝑡3, 𝑡4⟩ of actor 𝑎3

is contracted into a single transition 𝑡𝑐, then the resulting dependencies of 𝑡𝑐are exactly the same as for transition 𝑡2. This is the reason why the FSM fromFigure 3.18d can be classified into a CSDF actor with an actor FSM as depictedin Figure 3.17b. Furthermore, the contraction is a valid transformation as itdoes not change the data dependencies in the data flow graph. This can be seenby comparing the transition dependencies depicted in Figures 3.19b and 3.19c.Compacting the transition sequence ⟨𝑡3, 𝑡4⟩ of actor 𝑎3 generates the transition𝑡𝑐, which has a data dependency from the consumption of a token on 𝑖2 to theproduction of a token on 𝑜2. However, the previous transition sequence ⟨𝑡3, 𝑡4⟩also induces this data dependency as 𝑡4 can only be taken after 𝑡3.In contrast to this, the contraction of the transition sequence ⟨𝑡3, 𝑡4⟩ of actor

𝑎2 into a transition 𝑡𝑑 does introduce a new erroneous data dependency fromthe consumption of a token on 𝑖2 to the production of a token on 𝑜2. Thedata dependency is erroneous as the original transition sequence ⟨𝑡3, 𝑡4⟩ has nosuch dependency as it first produces the token on 𝑜2 before trying to consumea token on 𝑖2. Indeed, an erroneous contraction might introduce a deadlock

74

Page 83: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.7 Automatic Model Extraction

#𝑖2 ≥ 1 #𝑜2 ≥ 1

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1 #𝑖2 ≥ 1 ∧#𝑜2 ≥ 1

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1

#𝑜2 ≥ 1 #𝑖2 ≥ 1

𝑡4 𝑡3

𝑡1

𝑡2

𝑡1

𝑡4 𝑡3

𝑡2

𝑞2

𝑞0 𝑞1

𝑞2

𝑞0 𝑞1

𝑖2

𝑖1

𝑜2

𝑜1

𝑎3

𝑎1|Src 𝑎4|Sink

𝑖2

𝑖1

𝑜2

𝑜1

𝑎2

(a) Shown is a DFG containing FSMs from Figure 3.18b (in actor 𝑎2) and Figure 3.18d (inactor 𝑎3). Due to data dependencies, actor 𝑎2 will never execute transition 𝑡2. On theother hand, actor 𝑎3 is free to choose either the transition sequence ⟨𝑡3, 𝑡4⟩ or the transition𝑡2 in its execution trace.

𝑡1 𝑡2 𝑡1

𝑡3 𝑡4

𝑡1 𝑡2

𝑡4

𝑡1

𝑡3

𝑎3

𝑎2

(b) Shown are the dependen-cies in the transition traceof actor 𝑎2 and 𝑎3. Dashedlines represent data depen-dencies. Solid lines repre-sent dependencies inducedby the sequential nature ofthe FSM.

𝑡1 𝑡2

𝑡3 𝑡4

𝑡1

𝑡1 𝑡2 𝑡1

𝑡4𝑡𝑐𝑡3

𝑎2

𝑎3

(c) Shown are the dependen-cies for the contractionof the transition sequence⟨𝑡3, 𝑡4⟩ of actor 𝑎3 into thetransition 𝑡𝑐. Note thattransition 𝑡2 and the con-tracted transition 𝑡𝑐 of ac-tor 𝑎3 have the same datadependencies.

𝑡1 𝑡2 𝑡1

𝑡3 𝑡4

𝑡1 𝑡2

𝑡3 𝑡4

𝑡1

𝑡𝑑𝑎2

𝑎3

(d) Shown are the dependen-cies for the contractionof the transition sequence⟨𝑡3, 𝑡4⟩ of actor 𝑎2 into thetransition 𝑡𝑑. Thus, in-ducing two erroneous de-pendency cycles 𝑎2.𝑡𝑑 →𝑎3.𝑡2 → 𝑎2.𝑡𝑑 and 𝑎2.𝑡𝑑 →𝑎3.𝑡3 → 𝑎3.𝑡4 → 𝑎2.𝑡𝑑.

Figure 3.19: Depicted above is a DFG containing two actors with FSMs shown inFigures 3.18b and 3.18d as well as the dependencies in their transi-tion traces after various transition sequences have been contracted.

into the system as can be seen in Figure 3.19d where the transition 𝑡𝑑 is partof two dependency cycles 𝑎2.𝑡𝑑 → 𝑎3.𝑡2 → 𝑎2.𝑡𝑑 and 𝑎2.𝑡𝑑 → 𝑎3.𝑡3 → 𝑎3.𝑡4 →𝑎2.𝑡𝑑 which are not present in the original dependency structure depicted inFigure 3.19b.After the set of classification candidates has been derived from the actor FSM,

each of the candidates are checked for validity by Algorithm 1. The checks areperformed sequentially starting from the classification candidate derived fromthe shortest prefix 𝑝′ to the candidate derived from the full path 𝑝. If a candidateis accepted by Algorithm 1, then the actor is a CSDF actor with phases as givenby the accepted classification candidate 𝜇.

75

Page 84: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

Algorithm 1 Validation of a classification candidate 𝜇 for the actor FSM ℛ1: function validateClassCand(𝜇,ℛ)2: 𝜏 = #𝜇 ◁ Number of CSDF phases 𝜏3: ⟨𝜇0, 𝜇1, . . . 𝜇𝜏−1⟩ = 𝜇 ◁ CSDF phases 𝜇0, 𝜇1, . . . 𝜇𝜏−1

4: 𝜔0 = (0, 𝜇0.cons, 𝜇0.prod) ◁ Initial annotation tuple5: queue ← ⟨⟩ ◁ Set up the empty queue for breadth-first search6: ann ← ∅ ◁ Set up the empty set of annotations7: queue ← queuea(ℛ.𝑞0) ◁ Enqueue ℛ.𝑞08: ann ← ann∪{ (ℛ.𝑞0, 𝜔0) } ◁ Annotate the initial tuple9: while #queue > 0 do ◁ While the queue is not empty

10: 𝑞src = queue(0) ◁ Get head of queue11: 𝜔src = ann(𝑞src) ◁ Get annotation tuple for state 𝑞src12: queue ← tail1(queue) ◁ Dequeue head from queue13: for all 𝑡 ∈ ℛ.𝑇 where 𝑡.𝑞src = 𝑞src do14: if 𝜔src.cons ≥ cons(𝑡, 𝐼) ∨ 𝜔src.prod ≥ prod(𝑡, 𝑂) then15: return f ◁ Reject 𝜇 due to failed transition criterion I16: if 𝜔src.cons = cons(𝑡, 𝐼) ∧ prod(𝑡, 𝑂) = 0 then17: return f ◁ Reject 𝜇 due to failed transition criterion II18: 𝜔dst ← derive from 𝜔src and 𝑡 as given by Equation (3.1)19: if ∃𝜔′

dst : (𝑡.𝑞dst, 𝜔′dst) ∈ ann then ◁ Annotated tuple present?

20: if 𝜔′dst = 𝜔dst then ◁ Check annotated tuple for consistency

21: return f ◁ Reject classification candidate 𝜇

22: else ◁ No tuple annotate to state 𝑡.𝑞dst23: ann ← ann∪{ (𝑡.𝑞dst, 𝜔dst) } ◁ Annotate tuple 𝜔dst

24: queue ← queuea(𝑡.𝑞dst) ◁ Enqueue 𝑡.𝑞dst

25: return t ◁ Accept classification candidate 𝜇

The main idea of the validation algorithm is to check the classification can-didate against all possible transition sequences reachable from the initial stateof the actor FSM. However, due to the existence of contracted transitions inthe classification candidate as well as in the actor FSM, a simple matching ofa phase 𝜇𝑗 to an FSM transition is infeasible. Instead, a CSDF phase 𝜇𝑗 ismatched by a transition sequence. To keep track of the number of tokens al-ready produced and consumed for a CSDF phase, each FSM state 𝑞𝑛 will beannotated with a tuple 𝜔𝑛 = (𝑗, cons, prod) where cons and prod are the vectorsof number of tokens which are still to be consumed and produced to completethe consumption and production of the CSDF phase 𝜇𝑗 and 𝑗 denotes the CSDFphase which should be matched.The validation algorithm starts by deriving the tuple 𝜔0 (cf. Line 4) for the

initial state 𝑞0 from the first CSDF phase 𝜇0 of the classification candidate 𝜇.

76

Page 85: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.7 Automatic Model Extraction

The algorithm uses the set ann as a function 𝜔𝑛 = ann(𝑞𝑛) from an FSM stateto its annotated tuple 𝜔𝑛. Initially, the ann set is empty (cf. Line 6) denotingthat all function values ann(𝑞𝑛) are undefined, and hence no tuples have beenannotated. The annotation of tuples starts with the initial tuple 𝜔0 = ann(𝑞0)for the initial state 𝑞0 by adding the corresponding association to the ann setin Line 8.The algorithm proceeds by performing a breadth-first search (cf. Lines 5, 9,

12, and 24) of all states 𝑞𝑛 of the FSM ℛ starting from the initial state 𝑞0(cf. Line 7). For each visited state 𝑞src (cf. Line 10) its annotated tuple 𝜔src

will be derived from the set ann (cf. Line 11) and the corresponding outgoingtransitions 𝑡 of the state 𝑞src (cf. Line 13) are checked against the annotatedtuple 𝜔src (cf. Lines 14 to 17) via the transition criteria as given below:

Definition 3.9 (Transition Criterion I [ZFHT08*]). Each outgoing transition𝑡 ∈ 𝑇src = {𝑡 ∈ 𝑇 | 𝑡.𝑞src = 𝑞src} of a visited FSM state 𝑞src ∈ 𝑄 must consumeand produce less or equal tokens than specified by the annotated tuple 𝜔src,i.e., ∀𝑡 ∈ 𝑇src : 𝜔src.cons ≥ cons(𝑡, 𝐼) ∧ 𝜔src.prod ≥ prod(𝑡, 𝑂). Otherwise, theannotated tuple 𝜔src is invalid and the classification candidate 𝜇 will be rejected.

Definition 3.10 (Transition Criterion II [ZFHT08*]). Each outgoing transition𝑡 ∈ 𝑇src = {𝑡 ∈ 𝑇 | 𝑡.𝑞src = 𝑞src} of a visited FSM state 𝑞src ∈ 𝑄must not producetokens if there are still tokens to be consumed in that phase after the transition𝑡 has been taken, i.e., ∀𝑡 ∈ 𝑇src : 𝜔src.cons = cons(𝑡, 𝐼) ∨ prod(𝑡, 𝑂) = 0.Otherwise, the annotated tuple 𝜔src is invalid and the classification candidate 𝜇will be rejected.

Transition criterion I ensures that a matched transition sequence consumesand produces exactly the number of tokens as specified by the CSDF phase𝜇 of the classification candidate. Transition criterion II ensures that a tran-sition sequence induces the same data dependencies as the CSDF phase 𝜇 ofthe classification candidate. If the transition does not conform to the abovetransition criteria, then the classification candidate is invalid and will be re-jected (cf. Lines 15 and 17). Otherwise, that is conforming to both criteria, thematched transition sequence can be condensed via Definition 3.8 to the matchedCSDF phase.After the transition has been checked, the tuple 𝜔dst for annotation at the

transition destination state 𝑡.𝑞dst is computed in Line 18 according to the equa-tion given below:

consleft = 𝜔src.cons− cons(𝑡, 𝐼)prodleft = 𝜔src.prod− prod(𝑡, 𝑂)

𝑗′ = (𝜔src.𝑗 + 1) mod 𝜏

𝜔dst =

{(𝜔src.𝑗, consleft, prodleft) if consleft = 0 ∧ prodleft = 0(𝑗′, 𝜇𝑗′ .cons, 𝜇𝑗′ .prod) otherwise

(3.1)

77

Page 86: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

In the above equation, consleft and prodleft denote the consumption and pro-duction vectors remaining to match the CSDF phase 𝜇𝑗. If these remainingconsumption and production vectors are both the zero vector, then the match-ing of the CSDF phase 𝜇𝑗 to a transition sequence has been completed. Hence,the tuple 𝜔dst will be computed to match the next CSDF phase 𝜇(𝑗+1) mod 𝜏 .30

Otherwise, the tuple 𝜔dst will use the updated remaining consumption and pro-duction vectors as given by consleft and prodleft.Finally, if no tuple is annotated to the destination state 𝑡.𝑞dst, then the com-

puted tuple 𝜔dst is annotated to the destination state in Line 23 and the destina-tion state is appended to the queue for the breadth-first search in Line 24. Other-wise, an annotated tuple is already present for the destination state (cf. Line 19).If this annotated tuple 𝜔′

dst and the computed tuple 𝜔dst are inconsistent, thenthe classification candidate will be rejected (cf. Lines 19 to 21).

3.8 Related WorkThe POLIS [BCG+97] approach is focused on control dominated applications.The internal representation of POLIS is the Codesign Finite State Machine(CFSM). The POLIS approach supports the import of these CFSMs from inputlanguages that are based on an inherent FSM semantics like Esterel [BCE+03] orthe synthesizable subsets of VHDL and Verilog. A POLIS application is assem-bled from these CFSMs by connecting them in an asynchronous way, whereaseach CFSM is executed in a synchronous manner. Hence, POLIS has been de-signed to model Globally Asynchronous Locally Synchronous (GALS) systems.The asynchronous communication between different CFSMs is not constrainedto use data flow semantics, but is implemented via finite buffers which can beoverwritten. Hence, no guarantee of a successful message delivery can be given.The POLIS approach supports the generation of C and VHDL code from theCFSMs and their connection topology, however, the mapping decision for eachCFSM to either software or hardware has to be provided manually.YAPI [dKSvdW+00] is a C++ class library that implements an API for simu-

lation of Dennis Data Flow (DDF) data flow graphs. Hence, YAPI applicationsare not compositional and cannot be used to model applications from the packetprocessing domain. Applications of the packet processing domain require non-determinacy, which is missing from the DDF model employed by YAPI, in orderto react to their environment. Hence, YAPI is aimed at modeling applicationsof the signal processing domain. The YAPI approach does not provide anysynthesis or code transformation back-ends for applications modeled with theYAPI API.30The notation 𝑛 = 𝑙 mod 𝑚 for values 𝑙 ∈ Z = { . . . ,−2,−1, 0, 1, 2, . . . },𝑚 ∈ N = { 1, 2, . . . }

is used to denote the common residue 𝑛 ∈ { 0, 1, 2, . . .𝑚 − 1 } which is a non-negativeinteger smaller than 𝑚 such that 𝑙 ≡ 𝑛 (mod 𝑚).

78

Page 87: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.8 Related Work

Ptolemy II [EJL+03, Pto14], the successor of Ptolemy Classics [BHLM94], isa modeling framework that focuses on the integration of various MoCs. ThePtolemy approach achieves this integration by allowing the refinement of oneactor contained in a certain MoC by a graph of actors under a different MoC.Hence, Ptolemy can be thought of a continuation and refinement of the *charts(pronounced ”star charts”) [GLL99, Pto14] approach presented in Section 2.4.Thus, in Ptolemy each actor is associated with a MoC by a director whichexplicitly gives the MoC under which the actor is executed. Furthermore, thebehavior of an actor might change depending on the director under which theactor is executed. As far as code generation is concerned, Ptolemy II up toversions 8.0 supported the generation of C and Java code for non-hierarchicalSDF models by the Copernicus code generator. However, this generator hasbeen removed in recent Ptolemy II releases and code generation is currently nota focus of the Ptolemy research effort.While Ptolemy itself is a pure modeling environment without support for

modeling of possible architectures and their influence on the execution of thefunctionality, an extension of Ptolemy called PeaCE [HKL+07] (Ptolemy ex-tension as Codesign Environment) provides this support. The PeaCE codesignenvironment is based on the Synchronous Piggybacked Data Flow (SPDF) MoC[PCH99], an extension of SDF enabling the efficient distribution of control in-formation by piggybacking this information onto data tokens. PeaCE itself hasno dedicated input language, but specifies its models by files containing thetopology of the SPDF data flow graph and a library based approach to specifythe actors.Metropolis [BWH+03] and Metropolis II [DDG+13] are representatives of the

modeling environment using platform-based design. Platform-based design ischaracterized by a two-level approach consisting of an architecture providingservices to a functionality. The platform is defined by the set of services providedby the architecture. The functionality, which is modeled as concurrent processes,is not mapped directly to the architecture, but is adapted to the set of servicesprovided by the platform. Hence, a different architecture implementing the sameplatform will also be able to run the functionality. The modeling in Metropolisis separated into processes representing the functionality, media representing theservices provided by the platform, and so-called quantity managers to arbitrateaccess to the media by the functionality. Hence, different realizations of theplatform via different architectures are modeled by metropolis by using differentquantity managers. Quantity managers not only arbitrate the access of themedia, but also annotate each access of the media by the functionality with thecorresponding resource costs like execution time or power consumption.HetSC (Heterogeneous SystemC) [HV08, HVG+08] is a C++ class library

extending SystemC to support the specification of domains realizing differentMoCs. HetSC supports the specification of data flow MoCs like KPN [Kah74],

79

Page 88: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3. SysteMoC

KPN with bounded FIFO capacities, and SDF [LM87a] as well as none data flowMoCs like Synchronous Reactive (SR) [BCE+03, BB91] and CommunicatingSequential Processes (CSP) [Hoa78, Hoa85]. However, HetSC does not provideany support to extract a mathematical model of a functionality given by anexecutable specification modeled in HetSC. To specify a domain realizing acertain MoC, the HetSC methodology defines a set of coding standards foreach supported MoC. A domain in HetSC consists of a set of actors and theirconnecting channels that are constrained to have semantics corresponding tothe MoC of the domain. To implement a certain MoC, the HetSC approachmaps the required kind of event [LSV98] to implement the MoC into the strict-time Discrete Event (DE) [Yat93, Lee99] domain supported by the SystemCsimulation kernel. Hence, the SystemC simulation kernel does not need to beextended with additional solvers for these MoCs in order for the HetSC libraryto provide support for these MoCs. Thus, the reference implementation ofSystemC provided by OSCI can be used unmodified by the HetSC library.In contrast to this, SystemC-H [PS04] provides support for the SDF, CSP, and

FSM MoCs by extending the SystemC simulation kernel itself with dedicatedsolvers for each of the three MoCs. This enables SystemC-H to provide higherperformance simulation engines for its supported MoCs as compared to HetSC.However, modifying the SystemC library itself is detrimental to compatibilityas various EDA vendors require the usage of their own SystemC library in orderto guarantee a working tool flow.HetMoC [ZSJ10] is concerned with the integration of the Continuous Time

(CT) MoC into SystemC. HetMoC also supports integration of SR, SDF, and theDE models. However, the main focus of HetMoC is an improved integration ofthe CT domain into SystemC, thus, providing better solvers for analog systemsmodeled in SystemC as compared to the SystemC-AMS [BEG+13] extension.

Summary

In this chapter, the first key contribution of this work, the SysteMoC modelinglanguage for the ESL was presented. This language has the ability to modelexecutable specifications which can act as an input model for DSE as well asserve as an evaluator for DSE tools to evaluate timing and power characteristicsof the executable specification for the different design decisions that are eval-uated by the DSE. The SysteMoC language has strong formal underpinningsin data flow modeling with the distinction that the expressiveness of the dataflow model used by an actor is not chosen in advance but determined from theimplementation of the actor via classification. In order for the classificationto determine the data flow model used by a SysteMoC actor, the SysteMoC

80

Page 89: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

3.8 Related Work

language enforces a distinction between communication and computation of anactor by using an FSM to specify the communication behavior of an actor.The identification of static actors by means of classification enables the sec-

ond key contribution of this thesis, the clustering methodology which will bepresented in the following. Clustering is a refinement of a SysteMoC data flowgraph by replacing islands of static actors in the data flow graph by compos-ite actors. Later, in the software implementation generated by the synthesisback-end, the remaining actors which are not replaced by a composite actor aswell as all composite actors are scheduled by a dynamic scheduler at run time.Thus, by combining multiple static actors into one composite actor, the cluster-ing methodology reduces the number of actors which have to be scheduled bythe dynamic scheduler. This reduction is of benefit as it has been shown thatcompile-time scheduling of static actors produces more efficient implementationsin terms of latency and throughput than run-time scheduling of the same actorsby a dynamic scheduler.

81

Page 90: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 91: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4Clustering

In this chapter, the second key contribution of this work, a clustering method-ology [FKH+08*, FZK+11*, FHZT13*] that exploits the presence of highly an-alyzable static actors in a Data Flow Graph (DFG) even in the presence ofdynamic actors, will be presented. This contribution enables an automatic andcorrect clustering of SystemC Models of Computation (SysteMoC) applicationsto reduce the number of dynamic scheduling decisions and, thus, enables thegeneration of more efficient code for Multi-Processor System-on-Chip (MPSoC)targets.In detail, actors are called static if they conform to one of the following three

data flow models Homogeneous (Synchronous) Data Flow (HSDF) [CHEP71],Synchronous Data Flow (SDF) [LM87b], or Cyclo-Static Data Flow (CSDF)[BELP96] introduced in Chapter 2. The clustering methodology is neither lim-ited to DFGs consisting only of static actors, nor does it require a special treat-ment of these static actors by the developer of the SysteMoC application. In-deed, static actors are identified in an executable specification modeled in theSysteMoC language by means of classification as presented in the previous chap-ter. This classification enables the clustering methodology to refine the DFGby replacing islands of static actors by composite actors. Islands of static ac-tors, also called static clusters, are connected subgraphs of static actors in thewhole DFG described by an executable specification modeled in SysteMoC. Acomposite actor implements a Quasi-Static Schedule (QSS) for the actors con-tained in its corresponding static cluster. Intuitively, a QSS is a schedule inwhich a relatively large proportion of scheduling decisions have been made atcompile time. A QSS executes sequences of statically scheduled actor firings atrun time. These sequences have been determined at compile time by only post-poning required scheduling decisions to run time. Advantages of QSSs includelower scheduling overhead and improved predictability compared to dynamicscheduling at run time. Later, in the implementation generated by the synthe-sis back-end, each computation resource, e.g., a Central Processing Unit (CPU)or a hardware accelerator resource that supports the mapping of multiple actorsto itself [ZFHT12*], possesses its own dynamic scheduler that schedules the ac-tors that are mapped to its resource. After clustering and refinement has been

83

Page 92: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

performed, the entities to be scheduled at run time by a dynamic scheduler of aresource are the actors bound to this resource that have not been replaced by acomposite actor as well as the composite actors themselves that have been boundto this resource. Thus, by combining multiple static actors into one compositeactor, the clustering methodology reduces the number of actors which have tobe scheduled by the dynamic schedulers. This reduction is of benefit as it hasbeen shown [BBHL95, BL93, BML97] that compile-time scheduling of static ac-tors [LM87b] produces more efficient implementations in terms of latency andthroughput than scheduling of the same actors by a dynamic scheduler at runtime.In the following, the structure and contributions of this chapter are outlined:

In Section 4.1, dynamic scheduling and the causes of scheduling overhead intro-duced by this scheduling strategy will be discussed. Next, the required condi-tions for a refinement replacing a static cluster by a composite actor to be safewill be discussed in Section 4.2. In contrast to this, unsafe refinements mightintroduce deadlocks into the refined DFGs that result from replacing static clus-ters by composite actors. Subsequently, in Section 4.3, refinements of a DFGthat are constrained to only replace static clusters by composite actors exhibit-ing SDF or CSDF semantics will be considered. A refinement to a compositeactor exhibiting SDF semantics is also called monolithic clustering [TBRL09].In contrast to this, the second key contribution—the clustering methodol-

ogy introduced in Section 4.4—goes significantly beyond previous approaches[BL93, PBL95] as it is not confined to monolithic clustering, but produces aquasi-static scheduling interface between the generated composite actor and thecluster environment. This enhances the power of previous clustering techniquesin terms of (1) deadlock avoidance, (2) increasing the design space for cluster re-finements, and (3) providing the integration of a cluster with its cluster environ-ment. A precise definition of this design space as well as a heuristic to find staticclusters amenable to refinement by the introduced clustering methodology willbe given in Section 4.4.1. Next, an algorithm [FKH+08*, FZK+11*, FHZT13*]to compute a Finite State Machine (FSM) based representation of a QSS for astatic cluster will be given in Section 4.4.2. In contrast to the QSSs computed bythe algorithm presented by Tripakis et al. [TBRL09, TBG+13], the algorithmsemployed by the introduced clustering methodology will even generate retimingsequences. Retiming sequences are used to increase the length of the sequencesof statically scheduled actor firings executed by a QSS at run time. Intuitively,the larger the length of a sequence of statically scheduled actor firings is, thelower the scheduling overhead will be at run time. Moreover, an improvedrule-based representation of QSSs [FZHT11*, FZHT13*] will be introduced inSection 4.4.3. The rule-based representation trades latency and throughputin order to achieve a more compact code size of a QSS as compared to theautomata-based representation. Both algorithms introduced in Sections 4.4.2

84

Page 93: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.1 Clustering and Scheduling Overhead

and 4.4.3 encode actor firings that should be statically scheduled as partial rep-etition vectors. Subsequently, in Section 4.4.4, the partial repetition vectorswill be statically scheduled by transforming each partial repetition vector intoa looped schedule [HB07]. In contrast to a partial repetition vector, a loopedschedule has a direct correspondence to the sequential code that is emitted bythe synthesis back-end presented in Chapter 5 in order to realize the execution ofa sequence of statically scheduled actor firings. To conclude the theoretical as-pects of the introduced clustering methodology, a correctness proof [FZHT13*]will be given in Section 4.4.5. In order to extend the design space for refinementof static clusters into composite actors, a First In First Out (FIFO) channelcapacity adjustment algorithm is presented in Section 4.4.6. This algorithmadjusts the channel capacities of the FIFO channels used to connect a staticcluster to its cluster environment. This enables—in contrast to the approachpresented by Tripakis et al. [TBG+13]—the introduced clustering methodol-ogy to be applicable to dynamic DFGs with finite channel capacities instead ofchannel capacities that are conceptionally considered to be unbounded.In Section 4.5, the benefits obtained by applying the introduced clustering

methodology will be shown. Finally, the related work will be discussed in Sec-tion 4.6. Note that parts of this chapter are derived from [FHT07*, FKH+08*,FZK+11*, FZHT11*, FZHT13*].

4.1 Clustering and Scheduling Overhead

In this section, scheduling in the context of SystemCoDesigner will be brieflyintroduced. Then, the relevant reasons for scheduling overhead will be outlined,and it will be discussed how clustering is used to reduce this overhead. Thus,the notion of cluster FSM will be introduced to represent the QSSs that aregenerated by the proposed clustering methodology.Scheduling is the problem of determining what job to execute on what resource

at which point in time in order to meet demands in a cost-effective and timelymanner. However, in case of a design flow using the SystemCoDesigner[HFK+07*, KSS+09*] methodology, the binding of actors to resources (cf. Fig-ure 4.1) has already been determined via Design Space Exploration (DSE).Thus, scheduling in this context is concerned with determining for each resourcea sequence of actor firings of actors bound onto this resource. Furthermore, itis assumed that an actor firing once started on a resource, cannot be preemptedby another actor firing on the same resource.Strategies for determining a sequence of actor firings on a resource can be dis-

tinguished into (1) dynamic scheduling strategies that postpone all schedulingdecisions to a dynamic scheduler to be executed at run time and (2) staticscheduling strategies that compute all scheduling decisions at compile time.

85

Page 94: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

Mapping edge denoting that actor 𝑎1 will be implemented by resource 𝑟1

Architecture graph of the implementation

Network graph 𝑔sqr,flat

𝑟4|MEM 𝑟5|BUS

𝑟1|CPU1

𝑐5𝑐6

𝑎5|Sink

𝑎1|Src

𝑐1 𝑐2 𝑎3|Approx

𝑎4|Dup

𝑐4𝑎2|SqrLoop 𝑐3

𝑔sqr,flat

Figure 4.1: Depicted above is an implementation graph of an implementationof Newton’s iterative algorithm for calculating the square root usingonly a dedicated CPU to provide all the computation needs of thealgorithm. The network graph 𝑔sqr,flat consists of static (shaded) anddynamic (white) actors. The depicted implementation graph is areplication of a solution already shown and described in the previouschapter in Figure 3.4c on page 41.

However, static scheduling can only be applied if the DFG consists solely ofstatic actors. Otherwise, the presence of a dynamic actor introduces behaviorinto a DFG that cannot be predicted at compile time. Hence, not all schedulingdecisions can be predetermined at compile time. On the other hand, dynamicscheduling might introduce a significant scheduling overhead. Scheduling over-head results from the requirement of the dynamic scheduler to determine atrun time a sequence of actor firings for a resource that conforms to the datadependencies between these actors. In contrast to this, if a static schedule forthese actors is available, then this static schedule can simply be executed at run

86

Page 95: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.1 Clustering and Scheduling Overhead

time. An example of a dynamic scheduler implementing a dynamic schedulingstrategy for all actors of a DFG is given in Algorithm 2.31

Algorithm 2 Dynamic scheduler for a DFG 𝑔

1: function dynamicScheduler(𝑔)2: progress = t ◁ Mark DFG as active3: while progress do ◁ While the DFG is active4: progress ← f ◁ Reset the active state5: for all 𝑎 ∈ 𝑔.𝐴 do ◁ Iterate over all actors of the graph 𝑔6: if actor 𝑎 is enabled then ◁ Check if actor 𝑎 can be fired7: progress ← t ◁ Mark DFG as active8: fire actor 𝑎 ◁ Fire the actor 𝑎

The dynamic scheduling overhead is induced by (1) checks for the ability ofan actor to be fired (cf. Line 6) and (2) deficiencies in code generation for theactions of the actors that are executed (cf. Line 8) by the dynamic scheduler.These deficiencies result from the inability of the compiler to determine thattokens produced by a source actor could immediately be consumed by anothersink actor. Hence, token values might not need to be stored in a FIFO channel,but could be kept in a register. Indeed, these token values might not even needto be computed by the source actor if they would be discarded by the sink actor,e.g., a down-sample actor known from the signal processing domain.To exemplify, consider the implementation of Newton’s iterative algorithm for

calculating the square root using only a dedicated CPU as depicted in Figure 4.1.To coordinate the executions of the actors 𝑎1 − 𝑎5 on the CPU 𝑟1, a schedulingstrategy has to be selected for the CPU. Assuming that the dynamic schedulerdepicted in Algorithm 2 is used to schedule the actors of the network graph𝑔sqr,flat, then a possible schedule of these actors is visualized by the Gantt chartshown in Figure 4.2. This Gantt chart only shows a possible schedule up tothe point in time at which the first token arrives at the Sink actor 𝑎5. Shadedboxes in the Gantt chart correspond to execution time (cf. Line 8) spent tocompute Newton’s iterative algorithm for calculating the square root, while thewhite boxes correspond to execution time (cf. Line 6) spent by the dynamicscheduler in order to coordinate the firings of the actors 𝑎1− 𝑎5 on the CPU 𝑟1.Each box in the Gantt chart is annotated with the action or guard that

spends the execution time represented by the box. For the SqrLoop ac-tor 𝑎2, the execution of actions and functionality guards has been made ex-plicit by annotating the corresponding function from the actor functionality𝑎2.ℱ = { 𝑓copyStore, 𝑓copyInput, 𝑓copyApprox, 𝑓check } (cf. Figure 3.5 on page 45). Forthe other actors of the network graph 𝑔sqr,flat, the notation 𝑓⟨𝑎𝑛⟩ will be used to

31In the following, t and f are used to denote Boolean truth values.

87

Page 96: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

denote that the actor 𝑎𝑛 will be executed once by selecting an appropriate ac-tion from the actor functionality of the actor 𝑎𝑛. Correspondingly, the notation𝑎𝑛.𝑘

io#𝑐𝑖≥1∧#𝑐𝑜≥1 will be used to denote that the dynamic scheduler checks the

availability of tokens and free places at the channels 𝑐𝑖 and 𝑐𝑜 connected to theinput and output ports of the actor 𝑎𝑛 in order to determine (cf. Line 6) if theactor can be fired. More precisely, a white box annotated with 𝑘io

... denotes theexecution time spent for the evaluation of the input/output part of a transitionguard 𝑘 (cf. Definition 3.2 on page 46).Each schedule round depicted in Figure 4.2 corresponds to one iteration of the

loop body shown in Lines 4 to 8. Failed checks for the ability of an actor to befired, i.e., consecutive white boxes in the Gantt chart, are especially detrimentalto scheduling efficiency.

𝑟1

Time axis [abstract time units]130𝑡

Schedule round IV

159

𝑎1.𝑘

io #𝑐 1≥1

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 6≥1

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 2≥1

𝑎5.𝑘

io #𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑎3.𝑘

io #𝑐 2≥1∧

#𝑐 3≥1∧

#𝑐 4≥1

𝑎4.𝑘

io #𝑐 4≥1∧

#𝑐 3≥1∧

#𝑐 5≥1

𝑎2.𝑓

copyApprox

𝑎2.𝑓

check

𝑟1

86 130Time axis [abstract time units]𝑡

III

𝑎1.𝑘

io #𝑐 1≥1

𝑎5.𝑘

io #𝑐 6≥1

𝑓 ⟨𝑎4⟩

𝑎3.𝑘

io #𝑐 2≥1∧

#𝑐 3≥1∧

#𝑐 4≥1

𝑎4.𝑘

io #𝑐 4≥1∧

#𝑐 3≥1∧

#𝑐 5≥1

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 2≥1

𝑎2.𝑓

check

𝑎2.𝑓

copyInput

𝑓 ⟨𝑎3⟩

𝑟1

40 86Time axis [abstract time units] 𝑡

II

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎4⟩

𝑎1.𝑘

io #𝑐 1≥1

𝑎5.𝑘

io #𝑐 6≥1

𝑎3.𝑘

io #𝑐 2≥1∧

#𝑐 3≥1∧

#𝑐 4≥1

𝑎4.𝑘

io #𝑐 4≥1∧

#𝑐 3≥1∧

#𝑐 5≥1

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 2≥1

𝑎2.𝑓

copyInput

𝑎2.𝑓

check

𝑓 ⟨𝑎3⟩

𝑟1

0 40

Schedule round I

𝑡Time axis [abstract time units]

𝑓 ⟨𝑎3⟩

𝑓 ⟨𝑎4⟩

𝑓 ⟨𝑎1⟩

𝑎5.𝑘

io #𝑐 6≥1

𝑎1.𝑘

io #𝑐 1≥1

𝑎3.𝑘

io #𝑐 2≥1∧

#𝑐 3≥1∧

#𝑐 4≥1

𝑎4.𝑘

io #𝑐 4≥1∧

#𝑐 3≥1∧

#𝑐 5≥1

𝑎2.𝑘

io #𝑐 1≥1∧

#𝑐 2≥1

𝑎2.𝑓

copyStore

Figure 4.2: Gantt chart of a possible schedule of the network graph 𝑔sqr,flat bythe dynamic scheduler depicted in Algorithm 2 on the CPU 𝑟1

More precisely, the scheduling overhead contained in a possible schedule canbe defined as follows:

𝑡𝑆𝐴𝑡𝑆𝐴 + 𝑡𝑇𝐴

In this equation, 𝑡𝑆𝐴 denotes the overall scheduling time and 𝑡𝑇𝐴 denotes theoverall computation time. The overall scheduling time is the sum of all execution

88

Page 97: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.1 Clustering and Scheduling Overhead

times spent by the dynamic scheduler to select transitions of actors for execution.In contrast to this, the overall computation time is the sum of all executiontimes spent by actors to execute their actions 𝑓𝑎𝑐𝑡𝑖𝑜𝑛 ∈ ℱ𝑎𝑐𝑡𝑖𝑜𝑛 and functionalityguards 𝑓𝑔𝑢𝑎𝑟𝑑 ∈ ℱ𝑔𝑢𝑎𝑟𝑑, as well as a fixed overhead 𝑡𝑇𝐸 spent by the dynamicscheduler per execution of one transition. The exact definitions of these timesand methods to determine their values can be found in [KSH+11, KSH+12].Note that the execution times of the functionality guards have been charged

to 𝑡𝑇𝐴 because a functionality guard of an actor only needs to be evaluated atmost once per execution of an action of the actor. Considering the definition(cf. Definition 3.1 on page 44) 𝑓𝑔𝑢𝑎𝑟𝑑 : 𝑄func × 𝑆|𝐼| → { t, f } of a functionalityguard, it can be determined that a functionality guard only depends on thevalues of the tokens present at the actor input ports and on the functionalitystate. If no transition of an actor is taken, then neither can values of tokenspresent at the actor input port change, nor can the functionality state of theactor be modified. Hence, the valuation of the functionality guard will remainunchanged and the result can be cached until a transition of the actor is taken.To exemplify, consider the Gantt chart depicted in Figure 4.2. If the fixed

overhead 𝑡𝑇𝐸 is neglected, then the overall scheduling time 𝑡𝑆𝐴 can be com-puted as the sum of execution times spent by all white boxes and the overallcomputation time 𝑡𝑇𝐴 as the sum of execution times spent by all shaded boxes.Moreover, the overall execution time spent by the CPU 𝑟1 to execute the DFGtill the first token is processed by actor 𝑎5 is 𝑡𝐸𝑋𝐸𝐶 = 𝑡𝑆𝐴+ 𝑡𝑇𝐴 = 61+98 = 159abstract time units. Hence, for the schedule depicted in Figure 4.2, the followingscheduling overhead can be derived:

𝑡dyn𝑆𝐴

𝑡dyn𝐸𝑋𝐸𝐶

=𝑡dyn𝑆𝐴

𝑡dyn𝑆𝐴 + 𝑡dyn𝑇𝐴

=61

61 + 98=

61

159≈ 0.38

Which means that 38% of the overall execution time is wasted in the sched-uler. The scheduling overhead problem can be mended by coding the actors ofthe application at an appropriate level of granularity, that is, combining so muchfunctionality into one actor that the overall computation time 𝑡𝑇𝐴 dominatesthe overall scheduling time 𝑡𝑆𝐴. Note that combining functionality from actorsof very fine granularity might also reduce the overall computation time 𝑡𝑇𝐴 re-quired for a given piece of computational work. This is due to the fact thatthe compiler might be able to generate more efficient code if the functionalityis combined in one function in contrast to a distributed implementation of thefunctionality amongst many fine-grained actors. Hence, there is a penalty for dy-namic scheduling of fine-grained actors not only from the scheduling overheadbut also from code generation deficiencies for the implemented functionality.Thus, the scheduling overhead alone might understate the problem.To exemplify, consider the two variants of the network graph 𝑔sqr,flat (cf. Fig-

ure 4.1) shwon in Figures 4.3 and 4.4. Both network graphs implement Newton’s

89

Page 98: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

iterative algorithm for calculating the square root. However, they differ in thenumber of actors that must be scheduled by a dynamic scheduler. In both vari-ants, static actors have been shaded slightly gray. The Src, Sink, Approx,and Dup actors are marked graph actors consuming and producing one tokeneach on their input and output ports in each of their firings. The communicationbehavior of the SqrLoop actor is given by the FSM depicted in Figure 3.5 onpage 45. Moreover, in the network graph 𝑔sqr shown in Figure 4.4, the subcluster𝑔𝛾appx

(cf. Figure 4.3) containing the actors 𝑎3 and 𝑎4 has been replaced by thecomposite actor 𝑎𝛾appx

. This composite actor implements a QSS for the actors𝑎3 and 𝑎4.

𝑎2|SqrLoop

𝑜 2 𝑖 2𝑜 1𝑖 1

𝑐1 𝑐2

𝑐5𝑐6

𝑜1

𝑖1

𝑜 1

𝑜 1 𝑖 1

𝑖 2

𝑐3 𝑐4

𝑔sqr|SqrRoot

Cluster input port 𝑖1 ∈ 𝑔𝛾appx.𝐼

Cluster output port 𝑜1 ∈ 𝑔𝛾appx.𝑂 Channel 𝑐4 ∈ 𝑔𝛾appx

.𝐶

Actor

𝑎3∈𝑔 𝛾

appx.𝐴

ofsubc

luster

𝑔 𝛾appx

𝑔𝛾appx|ApproxBody

𝑎1|Src

𝑎5|Sink

𝑖1

𝑜1

𝑖1

𝑜2

Network graph 𝑔sqr and its subcluster 𝑔𝛾appx∈ 𝑔sqr.𝐺

𝑎3|Approx

𝑎4|Dup

Figure 4.3: Hierarchical network graph containing the subcluster 𝑔𝛾appx

In order to represent a QSS, the notion of a so-called cluster FSM is used.A cluster FSM resembles an actor FSM from Definition 3.2 on page 46 withthe exception that the actions of the transitions execute sequences of staticallyscheduled actor firings. Intuitively, a composite actor is a cluster with a clusterFSM implementing a QSS for the actors of the cluster. In case of the compositeactor 𝑎𝛾appx

, the action 𝑓⟨𝑎3,𝑎4⟩ ∈ 𝑎𝛾appx.ℱ𝑎𝑐𝑡𝑖𝑜𝑛 of the transition 𝑡1 implements the

sequence ⟨𝑎3, 𝑎4⟩ of statically scheduled actor firings. More formally, a clusterFSM is defined as follows:

Definition 4.1 (Cluster FSM [FKH+08*]). The cluster FSM ℛ of a cluster𝑔𝛾/composite actor 𝑎𝛾 is a tuple (𝑄, 𝑞0, 𝑇 ) containing a finite set of clusterstates 𝑄, an initial cluster state 𝑞0 ∈ 𝑄, and a finite set of transitions 𝑇 . Atransition 𝑡 ∈ 𝑇 itself is a tuple (𝑞src, 𝑘

io, 𝑓⟨...⟩, 𝑞dst) containing the source state

90

Page 99: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.1 Clustering and Scheduling Overhead

𝑎4|Dup

𝑎3|Approx

𝑜 1𝑖 1

𝑖 2

𝑐3 𝑐4

𝑜 1

𝑖1

𝑜2

𝑎2|SqrLoop

𝑎1|Src

𝑎5|Sink

𝑞0 𝑡1

#𝑖1 ≥ 1∧#𝑜1 ≥ 1/𝑓⟨𝑎3,𝑎4⟩

𝑜 2 𝑖 2𝑜 1𝑖 1

𝑐1 𝑐2

𝑐5𝑐6

𝑜1

𝑖1 𝑜1

𝑖1

𝑔sqr|SqrRootActor input port 𝑖1 ∈ 𝑎𝛾appx

.𝐼 𝑎𝛾appx|ApproxBody

Actor output port 𝑜1 ∈ 𝑎𝛾appx.𝑂 Guard 𝑘io

#𝑐2≥1∧#𝑐5≥1

Network graph 𝑔sqr and its composite actor 𝑎𝛾appx∈ 𝑔sqr.𝐴

Figure 4.4: Refined network graph derived from replacing the 𝑔𝛾appxsubcluster

by a corresponding composite actor 𝑎𝛾appx

𝑞src ∈ 𝑄, from where the transition is originating, and the destination state𝑞dst ∈ 𝑄, which will become the next current state after the execution of thetransition starting from the current state 𝑞src. Furthermore, if a transition 𝑡 istaken, then its action 𝑓⟨...⟩ ∈ 𝑎𝛾 .ℱ𝑎𝑐𝑡𝑖𝑜𝑛 implementing a sequence ⟨. . .⟩ ∈ 𝑔𝛾 .𝐴

*

of statically scheduled actor firings will be executed. Finally, the execution of atransition 𝑡 itself is guarded by an input/output guard 𝑘io.

The replacement of the static cluster 𝑔𝛾appxby the composite actor 𝑎𝛾appx

may reduce the overall scheduling time 𝑡𝑆𝐴 as it condenses the execution timesof the input/output guards 𝑘io

#𝑐2≥1∧#𝑐3≥1∧#𝑐4≥1 and 𝑘io#𝑐4≥1∧#𝑐3≥1∧#𝑐5≥1 into one

input/output guard 𝑘io#𝑐2≥1∧#𝑐5≥1. This input/output guard requires less ex-

ecution time than the sum of execution times required by the input/outputguards 𝑘io

#𝑐2≥1∧#𝑐3≥1∧#𝑐4≥1 and 𝑘io#𝑐4≥1∧#𝑐3≥1∧#𝑐5≥1. The input/output guard

𝑘io#𝑐2≥1∧#𝑐5≥1 is used by the composite actor 𝑎𝛾appx

to guard the transition 𝑡1.If this transition is taken, the action 𝑓⟨𝑎3,𝑎4⟩, which implements the sequence⟨𝑎3, 𝑎4⟩ of statically scheduled actor firings, is executed. A possible scheduleof the modified network graph 𝑔sqr is visualized by the Gantt chart shown inFigure 4.5. The scheduling overhead of this schedule can be computed as:

𝑡qss𝑆𝐴

𝑡qss𝐸𝑋𝐸𝐶

=𝑡qss𝑆𝐴

𝑡qss𝑆𝐴 + 𝑡qss𝑇𝐴

=41

41 + 77=

41

118≈ 0.35

Moreover, the overall computation time 𝑡qss𝑇𝐴 has also been reduced from 98 to77 abstract time units. This reduction can be explained by considering the more

91

Page 100: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑟1𝑡

Time axis [abstract time units]94 118

Schedule round IV

𝑎1.𝑘

io #𝑐 1≥1

𝑎5.𝑘

io #𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑎𝛾appx.𝑘

io #𝑐 2≥1∧

#𝑐 5≥1

𝑎2.𝑓

copyApprox

𝑎2.𝑓

check

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 2≥1

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 6≥1

𝑟1

𝑡62 94Time axis [abstract time units]

Schedule round III

𝑎1.𝑘

io #𝑐 1≥1

𝑎5.𝑘

io #𝑐 6≥1

𝑎𝛾appx.𝑘

io #𝑐 2≥1∧

#𝑐 5≥1

𝑎𝛾appx.𝑓⟨𝑎3,𝑎4⟩

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 2≥1

𝑎2.𝑓

copyInput

𝑎2.𝑓

check

𝑟1

𝑡28 62Time axis [abstract time units]

Schedule round II𝑓 ⟨𝑎1⟩

𝑎1.𝑘

io #𝑐 1≥1

𝑎5.𝑘

io #𝑐 6≥1

𝑎𝛾appx.𝑘

io #𝑐 2≥1∧

#𝑐 5≥1

𝑎𝛾appx.𝑓⟨𝑎3,𝑎4⟩

𝑎2.𝑘

io #𝑐 5≥1∧

#𝑐 2≥1

𝑎2.𝑓

check

𝑎2.𝑓

copyInput

𝑟1

0

Schedule round I

𝑡28Time axis [abstract time units]

𝑓 ⟨𝑎1⟩

𝑎1.𝑘

io #𝑐 1≥1

𝑎5.𝑘

io #𝑐 6≥1

𝑎𝛾appx.𝑘

io #𝑐 2≥1∧

#𝑐 5≥1

𝑎𝛾appx.𝑓⟨𝑎3,𝑎4⟩

𝑎2.𝑘

io #𝑐 1≥1∧

#𝑐 2≥1

𝑎2.𝑓

copyStore

Figure 4.5: Gantt chart of a possible schedule of the network graph 𝑔sqr by thedynamic scheduler depicted in Algorithm 2 on the CPU 𝑟1

efficient code generation and compiler optimizations possible for the sequence⟨𝑎3, 𝑎4⟩ of statically scheduled actor firings as compared to compiler optimiza-tions possible for the isolated firings of actor 𝑎3 and actor 𝑎4. To exemplify, inthe actor functionality 𝑓⟨𝑎3,𝑎4⟩, the channel 𝑐4 can be replaced by a local vari-able, which could potentially be allocated into a CPU register by the compiler,while the channel 𝑐3 can be replaced by a global variable. Furthermore, readand write access to these two channels by the actor functionality of the actors𝑎3 and 𝑎4 will be simplified accordingly.

Note that the timings used in the Gantt charts in this section have been de-rived from real measurements on the Xilinx Zynq-7000 platform. This platformcontains an ARMr CortexTM-A9 dual-core CPU running at 667 MHz. First,the pure C++ software implementations have been generated by applying thesynthesis back-end presented in Chapter 5 to the SysteMoC application imple-menting the square root DFG shown in Figure 4.3. For the refined DFG 𝑔sqr,the clustering methodology from Section 4.4.2 has been applied before codegeneration. Next, the executables have been generated by using version 4.7 ofthe gcc compiler suite with the -O3 option for the ARM R○ CortexTM-A9 target

92

Page 101: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.1 Clustering and Scheduling Overhead

architecture. Finally, the measures have been taken on the Xilinx ZedBoard byusing the measuring infrastructure included with the Linaro Linux-distribution.To enable a better layout of these Gantt charts, the measurements have been

converted to abstract time units by means of normalization and slight libertieshave been taken with the individual execution times of the actions and guards.To exemplify, the real timings for a dynamically scheduled (cf. Figure 4.3) com-putation of 1,000,000 integer-based square root calculations are given in Equa-tion (4.1). The corresponding timings for the refined DFG (cf. Figure 4.4) aregiven in Equation (4.2).

𝑡dyn𝑆𝐴

𝑡dyn𝐸𝑋𝐸𝐶

=𝑡dyn𝑆𝐴

𝑡dyn𝑆𝐴 + 𝑡dyn𝑇𝐴

=2.30 sec

2.30 sec + 3.23 sec=

2.30 sec

5.53 sec≈ 0.42 (4.1)

𝑡qss𝑆𝐴

𝑡qss𝐸𝑋𝐸𝐶

=𝑡qss𝑆𝐴

𝑡qss𝑆𝐴 + 𝑡qss𝑇𝐴

=1.71 sec

1.71 sec + 2.58 sec=

1.71 sec

4.29 sec≈ 0.40 (4.2)

The speedup 𝑠qss between the dynamically scheduled implementation and therefined implementation is as follows:

𝑡dyn𝐸𝑋𝐸𝐶

𝑡qss𝐸𝑋𝐸𝐶

=5.53 sec

4.29 sec= 𝑠qss ≈ 1.29

Note that this speedup cannot be explained alone from the schedule overheadreduction from 𝑡dyn𝑆𝐴 = 2.30 sec to 𝑡qss𝑆𝐴 = 1.71 sec, but stems also from the im-provement in code generation for the implementation of the functionality. Thatis to say, the overall computation times 𝑡dyn𝑇𝐴 and 𝑡qss𝑇𝐴 differ, yet they performthe same amount of useful work, i.e., 1,000,000 integer-based square root cal-culations. Hence, the scheduling overhead from Equation (4.1) understates theproblem. To exemplify, a hand-optimized implementation of Newton’s iterativealgorithm for calculating the square root is measured as 𝑡hand𝐸𝑋𝐸𝐶 = 2.66 sec for1,000,000 integer-based square root calculations. Thus, the overall overhead forthe two implementations derives from scheduling overhead and code generationdeficiencies and can be calculated as follows:

𝑡dyn𝐸𝑋𝐸𝐶 − 𝑡hand𝐸𝑋𝐸𝐶

𝑡dyn𝐸𝑋𝐸𝐶

=5.53 sec− 2.66 sec

5.53 sec≈ 0.52

𝑡qss𝐸𝑋𝐸𝐶 − 𝑡hand𝐸𝑋𝐸𝐶

𝑡qss𝐸𝑋𝐸𝐶

=4.29 sec− 2.66 sec

4.29 sec≈ 0.38

Note that this overall overhead is mathematically equivalent to the speedupif only singlecore implementations are considered. To exemplify, the overhead𝑠overhead is defined as the factor by which the hand-optimized implementation isfaster than the fully dynamically scheduled implementation:

𝑠overhead =𝑡dyn𝐸𝑋𝐸𝐶

𝑡hand𝐸𝑋𝐸𝐶

=5.53 sec

2.66 sec≈ 2.08

93

Page 102: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

Then, the overall overhead for each implementation can be computed from thespeedup 𝑠 of the implementation and the overhead 𝑠overhead as follows:

1− 𝑠

𝑠overhead= 1−

𝑡dyn𝐸𝑋𝐸𝐶

𝑡impl𝐸𝑋𝐸𝐶

𝑡dyn𝐸𝑋𝐸𝐶

𝑡hand𝐸𝑋𝐸𝐶

= 1− 𝑡hand𝐸𝑋𝐸𝐶

𝑡impl𝐸𝑋𝐸𝐶

=𝑡impl𝐸𝑋𝐸𝐶 − 𝑡hand𝐸𝑋𝐸𝐶

𝑡impl𝐸𝑋𝐸𝐶

Hence, in Section 4.5, which presents results for singlecore implementations,only the speedup needs to be considered.However, combining functionality of multiple actors into one actor can also

mean that more data has to be produced and consumed atomically which re-quires larger channel capacities and may delay computation unnecessarily, there-fore degrading the latency if multiple computation resources are available forexecution of the application. Hence, selecting a level of granularity for a func-tionality by the designer is a manual trade-off between schedule efficiency toimprove throughput and latency on a dedicated CPU solution against resourceconsumption, e.g., larger channel capacities, and scheduling flexibility requiredto improve throughput and latency on a multicore solution. Moreover, if latencyminimization is a secondary goal as compared for example to energy minimiza-tion, then clustering might improve energy efficiency due to overhead minimiza-tion even if the overall latency is not improved. Thus, if the mapping of actorsto resources itself is part of the synthesis step, e.g., as is the case in the System-CoDesigner [HFK+07*, KSS+09*] DSE methodology, an appropriate level ofgranularity can no longer be chosen in advance by the designer as it depends onthe mapping itself.An example of such a situation is the network graph 𝑔ex1 and its two possible

implementations shown in Figure 4.6. Dynamic actors are represented by whiterounded boxes, e.g., 𝑎5, while static actors are represented by shaded roundedboxes, e.g., 𝑎1 − 𝑎4, 𝑎6.All static actors except actor 𝑎4 are HSDF actors. In contrast, the actor 𝑎4

consumes three tokens from channel 𝑐4 and produces one token on channel 𝑐5when fired. The dynamic actor 𝑎5 determines how often a data value shouldbe processed by the actors 𝑎2 − 𝑎4. The two Gantt charts shown in Figure 4.7depict possible schedules of the unmodified network graph 𝑔ex1 for the two im-plementations from Figure 4.6. Each Gantt chart only shows a possible scheduleof the graph up to the point in time at which the first result token arrives atactor 𝑎6. The schedules depicted in the Gantt charts assume that each resourcehas its own dedicated dynamic scheduler as given in Algorithm 2.As can be seen, the dual CPU implementation finishes the processing of the

first result token by actor 𝑎6 at 103 abstract time units, while the single CPUimplementation requires 135 abstract time units. In order to improve this la-tency, QSS is now applied. However, as can be seen from Figures 4.8 and 4.9,

94

Page 103: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.1 Clustering and Scheduling Overhead

Architecture graph of a single CPU implementationNetwork graph 𝑔ex1

𝑔ex1

𝑟1|CPU1

𝑟3|MEM 𝑟4|BUS

𝑎1

𝑎2𝑐2

𝑐1𝑐3

𝑐6𝑐4

𝑎3𝑎5

𝑎4𝑐5

𝑎6

(a) Single CPU implementation

Network graph 𝑔ex1

Architecture graph of a dual CPU implementation

𝑔ex1

𝑟1|CPU1

𝑟2|CPU2

𝑟3|MEM 𝑟4|BUS

𝑎1

𝑎2𝑐2

𝑐1𝑐3

𝑐6𝑐4

𝑎3𝑎5

𝑎4𝑐5

𝑎6

(b) Dual CPU implementation

Figure 4.6: Two possible implementations for the network graph 𝑔ex1

different QSSs are required in order to obtain a smaller latency for the differentimplementations. Two clustering alternatives are considered: (1) clustering andrefining the actors 𝑎2 and 𝑎3 into a composite actor 𝑎𝛾2,3

(cf. Figure 4.9), and(2) further enlarging the cluster by adding the actor 𝑎4 to derive the compositeactor 𝑎𝛾2,3,4

(cf. Figure 4.8). It turns out that the second alternative with thecomposite actor 𝑎𝛾2,3,4

produces a smaller latency (91 instead of 114 abstracttime units) for the single CPU implementation.However, the sequence ⟨𝑎2, 𝑎3, 𝑎2, 𝑎3, 𝑎2, 𝑎3, 𝑎4⟩ of statically scheduled actor

firings realized by the action 𝑓⟨𝑎2,𝑎3,𝑎2,𝑎3,𝑎2,𝑎3,𝑎4⟩ of the composite actor 𝑎𝛾2,3,4can

only be executed when at least three tokens are available in the channel 𝑐2.Hence, for the second alternative, the channel capacity size(𝑐2) of the channel

𝑐2 must be able to accommodate at least three tokens. On the other hand, forthe first alternative and the unclustered case, a channel capacity of one token issufficient. Indeed, if no information is known about actor 𝑎5, then the channelcapacity has to be increased from one to five tokens in order to compensate forthe inaccessible channel capacities of size(𝑐3) = 1 and size(𝑐4) = 3 tokens of thechannels 𝑐3 and 𝑐4. A detailed treatment of the problem is given in Section 4.4.6.Furthermore, the requirement for at least three tokens in order to execute

𝑓⟨𝑎2,𝑎3,𝑎2,𝑎3,𝑎2,𝑎3,𝑎4⟩ will delay the computation of the first actor firing of actor𝑎2 compared to the schedules depicted in Figure 4.7. This does not harm thelatency in the single CPU implementation (cf. Figure 4.8a) because the CPU𝑟1 can perform other useful work until the action 𝑓⟨𝑎2,𝑎3,𝑎2,𝑎3,𝑎2,𝑎3,𝑎4⟩ can be ex-ecuted. However, in the dual CPU implementation (cf. Figure 4.8b), the onlyactor bound onto the second CPU 𝑟2 is the composite actor 𝑎𝛾2,3,4

. Hence, thefirst 13 schedule rounds 𝐼−𝑋𝐼𝐼𝐼 of 𝑟2 only consist of scheduling overhead. The

95

Page 104: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

109 135𝑡

Time axis [abstract time units]𝑟1

Schedule round IV

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 5≥1∧

#𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎6⟩

𝑓 ⟨𝑎4⟩

55 109𝑡

Time axis [abstract time units]𝑟1

Schedule round III

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑓 ⟨𝑎2⟩

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑓 ⟨𝑎3⟩

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑓 ⟨𝑎2⟩

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑓 ⟨𝑎3⟩

0 55𝑡

Time axis [abstract time units]𝑟1

Schedule round I Schedule round II𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑓 ⟨𝑎2⟩

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑓 ⟨𝑎3⟩

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

(a) Gantt chart of a possible schedule for a single CPU implementation of the graph 𝑔ex1

55 103𝑡

Time axis [abstract time units]

𝑟1

𝑟2

Schedule round XI of 𝑟1

VISchedule round V of 𝑟2

VI VII VIII IX X

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑓 ⟨𝑎4⟩

𝑓 ⟨𝑎3⟩

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑓 ⟨𝑎2⟩

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 5≥1∧

#𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎6⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

0 55Time axis [abstract time units]𝑡

I

𝑟1

𝑟2

II

IV

Schedule round III of 𝑟2 Schedule round IV of 𝑟2

Schedule round I of 𝑟1 VSchedule round II of 𝑟1 Schedule round III of 𝑟1

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑓 ⟨𝑎2⟩

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑓 ⟨𝑎3⟩

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑓 ⟨𝑎2⟩

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑓 ⟨𝑎3⟩

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

(b) Gantt chart of a possible schedule for a dual CPU implementation of the graph 𝑔ex1

Figure 4.7: Gantt charts for the unmodified network graph 𝑔ex1

first clustering alternative with the composite actor 𝑎𝛾2,3does not have this

problem. Hence, for the dual CPU implementation, this alternative (cf. Fig-ure 4.9b) produces a smaller latency (75 instead of 82 abstract time units).

To overcome these challenges, the presented clustering-based design flow al-lows the DSE to decide the set of actors that should be clustered. In the fol-lowing, however, a heuristics will be used that maximizes the size of the createdclusters.

96

Page 105: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.2 Clustering as an Operation of Refinement

𝑡48 91Time axis [abstract time units]

𝑟1

Schedule round IV

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑎𝛾2,3,4.𝑓⟨𝑎2,𝑎3,𝑎2,𝑎3,𝑎2,𝑎3,𝑎4⟩

𝑘io #𝑐 5≥1∧

#𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎6⟩

0𝑡

48Time axis [abstract time units]𝑟1

Schedule round I Schedule round II Schedule round III𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

(a) Gantt chart of a possible schedule for a single CPU implementation of the clustered graph

48 82Time axis [abstract time units]𝑡

𝑟1

𝑟2

XVISchedule round XV of 𝑟2

V VI VII Schedule round VIII of 𝑟1

XVII XVIII XIX XX

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 5≥1∧

#𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎6⟩

0 48

I II III IV V VI VII VIIITime axis [abstract time units]

𝑡

𝑟1

𝑟2

IVSchedule round I of 𝑟1 Schedule round II of 𝑟1 Schedule round III of 𝑟1

IX X XI XII XIII XIV

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑎𝛾2,3,4.𝑓⟨𝑎2,𝑎3,𝑎2,𝑎3,𝑎2,𝑎3,𝑎4⟩

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

(b) Gantt chart of a possible schedule for a dual CPU implementation of the clustered graph

Figure 4.8: Gantt charts for the clustered network graph that results from clus-tering and refining the actors 𝑎2, 𝑎3, and 𝑎4 of 𝑔ex1 into a compositeactor 𝑎𝛾2,3,4

4.2 Clustering as an Operation of Refinement

Cluster refinement is the conversion of a cluster 𝑔𝛾 into a single composite actor𝑎𝛾 of equivalent behavior. Hence, a notion for equivalent behavior is required.A static cluster has sequence determinate behavior [LSV98], i.e., the behaviorof the cluster is independent of the scheduling of the actors inside the cluster.Thus, it can be modeled by a Kahn function via the relation as presented in[Lee97] between Kahn’s denotational semantics [Kah74] and the semantics ofdata flow models with the notion of firing.32 However, to use this denotational

32The requirement for sequence determinate behavior is also satisfied if the cluster containsKahn actors. Therefore, the equivalence condition as introduced in Definition 4.2 couldbe reused in the context of a more general refinement of clusters containing Kahn actors.

97

Page 106: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

55 114𝑡

Time axis [abstract time units]𝑟1

Schedule round III Schedule round IV

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑎𝛾2,3.𝑓⟨𝑎2,𝑎3⟩

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑎𝛾2,3.𝑓⟨𝑎2,𝑎3⟩

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑓 ⟨𝑎4⟩

𝑘io #𝑐 5≥1∧

#𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎6⟩

0 55𝑡

Time axis [abstract time units]𝑟1

Schedule round I Schedule round II𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 2≥1∧

#𝑐 3≥1

𝑘io #𝑐 3≥1∧

#𝑐 4≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑎𝛾2,3.𝑓⟨𝑎2,𝑎3⟩

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 1≥1

(a) Gantt chart of a possible schedule for a single CPU implementation of the clustered graph

48 75Time axis [abstract time units]𝑡

𝑟1

𝑟2

Schedule round V of 𝑟2

Schedule round VII of 𝑟1V VI

VI VII

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑓 ⟨𝑎4⟩

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 5≥1∧

#𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎6⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

0 48Time axis [abstract time units]𝑡

𝑟1

𝑟2

I II

IV

Schedule round III of 𝑟2 Schedule round IV of 𝑟2

Schedule round I of 𝑟1 Schedule round II of 𝑟1 Schedule round III of 𝑟1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑎𝛾2,3.𝑓⟨𝑎2,𝑎3⟩

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑎𝛾2,3.𝑓⟨𝑎2,𝑎3⟩

𝑘io #𝑐 4≥3∧

#𝑐 5≥1

𝑘io #𝑐 2≥1∧

#𝑐 4≥1

𝑎𝛾2,3.𝑓⟨𝑎2,𝑎3⟩

𝑘io #𝑐 1≥1

𝑓 ⟨𝑎1⟩

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

𝑓 ⟨𝑎5⟩

𝑘io #𝑐 1≥1

𝑘io #𝑐 1≥1∧

#𝑐 2≥1

𝑘io #𝑐 6≥1

𝑓 ⟨𝑎1⟩

(b) Gantt chart of a possible schedule for a dual CPU implementation of the clustered graph

Figure 4.9: Gantt charts for the clustered network graph that results from clus-tering and refining the actors 𝑎2 and 𝑎3 into a composite actor 𝑎𝛾2,3

semantics, infinite channel sizes for the channels connected to the compositeactor ports have to be assumed. In case of cluster refinement in static DFGsas presented in Section 4.3, this restriction does not matter as determining thechannel capacities at compile time in order to satisfy various criteria are decid-able problems [SGB06a, SGB08, BMMKU10, BMMKM10, WBJS06, WBS07]for static DFGs.However, for dynamic DFGs, even the question if the graph can be executed

in bounded memory is undecidable [Buc93]. Hence, in the following, the re-finement algorithms presented in the Sections 4.3.1, 4.3.2, 4.4.2, and 4.4.3 aswell as the correctness proof in Section 4.4.5 will assume infinite channel ca-pacities. The problem of finite channel capacities is discussed in Section 4.4.6.With the infinite channel capacities simplification, the denotational semanticsof Kahn Process Networks (KPNs) can be used to compare the cluster and its

98

Page 107: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.2 Clustering as an Operation of Refinement

corresponding composite actor for equivalence. More formally, the equivalencebetween a cluster and its corresponding composite actor is defined as follows:

Definition 4.2 (Equivalence Condition for Sequence Determinate Clusters).Given a sequence determinate cluster 𝑔𝛾 and a composite actor 𝑎𝛾 as well astheir corresponding Kahn functions 𝒦𝑔𝛾 and 𝒦𝑎𝛾 , then the cluster 𝑔𝛾 and thecomposite actor 𝑎𝛾 have equivalent behavior if and only if 𝒦𝑔𝛾 ≡ 𝒦𝑎𝛾 and thecluster/composite actor is connected to its cluster/composite actor environmentwith channels of infinite size.

The equivalence condition ensures two different things: (1) It requires thecluster and the composite actor to produce the same sequences of values onoutput ports when the cluster and the composite actor are given the sameinfinite sequences of input values on their input ports, and (2) it requires thecluster and the composite actor to have the same data dependencies betweenthe consumed tokens on the input ports and the produced tokens on the outputports.The satisfaction of requirement (1) by a cluster refinement operation is rela-

tively straightforward. It merely requires the composite actor and the cluster toperform the same operations on the token values. Thus, in the following only thesatisfaction of requirement (2) is considered. Requirement (2) prevents the pos-sible introduction of an artificial deadlock by the cluster refinement operationdue to missing tokens.33

The key idea of Definition 4.2 is to satisfy the so-called worst case environ-ment of the cluster since satisfying this environment will also preserve deadlockfreedom for all other possible cluster environments. The worst case environmentcontains for each output port 𝑜 ∈ 𝑔𝛾 .𝑂 and each input port 𝑖 ∈ 𝑔𝛾 .𝐼 a feedbackloop where any produced token on the output port 𝑜 is needed for the activationof an actor 𝑎 ∈ 𝑔𝛾 .𝐴 connected to the input port 𝑖. To exemplify, the DFG 𝑔sqrand its subcluster 𝑔𝛾appx

depicted in Figure 4.3 on page 90 is considered. In thiscase, each token produced on the output port 𝑜1 ∈ 𝑔𝛾appx

.𝑂 of the subcluster isrequired by the SqrLoop actor 𝑎2 to produce the next token for consumptionon the input port 𝑖1 ∈ 𝑔𝛾appx

.𝐼 of the subcluster.In particular, postponing the production of an output token would result

in a deadlock of the entire system. To satisfy the equivalence condition, thecomposite actor produced by a cluster refinement must, thus, always produce themaximum number of output tokens possible from the consumption of a minimumnumber of input tokens.A cluster refinement operator 𝜉 is used to denote the refinement of a cluster

into a composite actor. This operator transforms an original graph 𝑔 ∈ 𝐺 and33The deadlock is artificial in the sense that it was not present in the original DFG which

was refined.

99

Page 108: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

its subcluster 𝑔𝛾 ∈ 𝑔.𝐺 into a refined graph 𝑔 containing the actor 𝑎𝛾 ∈ 𝑔.𝐴replacing the subcluster while keeping the behavior of the original graph. Thisoperation is exemplified in Figure 4.3 and Figure 4.4: Depicted is the originalnetwork graph 𝑔sqr and its static subcluster 𝑔𝛾appx

which is transformed by thecluster refinement operation 𝜉(𝑔sqr, 𝑔𝛾appx

) = (𝑔sqr, 𝑎𝛾appx) into the resulting re-

fined network graph 𝑔sqr and the corresponding composite actor 𝑎𝛾appxderived

from 𝑔𝛾appx. More formally, the cluster refinement operation is defined as follows:

Definition 4.3 (Cluster Refinement Operation). Given a graph 𝑔 and its se-quence determinate subcluster 𝑔𝛾 . The cluster refinement operator 𝜉 : 𝐺×𝐺→𝐺 × 𝐴 replaces 𝑔𝛾 by a single Kahn actor 𝑎𝛾 , called composite actor, resultingin a refined graph 𝑔, i.e., 𝜉(𝑔, 𝑔𝛾 ) = (𝑔, 𝑎𝛾 ) where 𝑔𝛾 ∈ 𝑔.𝐺 and 𝑔 = 𝑔 exceptthat 𝑔.𝑉 = 𝑔.𝑉 + { 𝑎𝛾 } − { 𝑔𝛾 } as well as 𝑎𝛾 .𝒫 = 𝑔𝛾 .𝒫 .

If the subcluster 𝑔𝛾 and the produced composite actor 𝑎𝛾 satisfy the equiv-alence condition from Definition 4.2, then the cluster refinement operation willin the following be called a safe cluster refinement operation. Otherwise, theoperation will be called unsafe. Furthermore, if the refined DFG is free of dead-locks, then the refinement will be called a valid cluster refinement operation.Otherwise, the refinement operation will be called invalid.Note that safety is a local property that is decidable from the static cluster

and the produced composite actor, while validity is a global property that isdecidable from a composite actor and the static DFG containing this compositeactor. Moreover, validity in general is undecidable if the composite actor iscontained in a dynamic DFG. In contrast, if the original (dynamic) DFG is freeof deadlocks and the cluster refinement operation is a safe cluster refinement,then the cluster refinement operation is also valid and will produce a refinedDFG that is also free of deadlocks. The above given fact is a corollary of processcomposition as defined by Lee et al. in [LSV98] and the equivalence of 𝒦𝑔𝛾 and𝒦𝑎𝛾 as required by Definition 4.2 and proven in Section 4.4.5. Thus, the safetyproperty of a refinement operation can be used as a fallback for refinements ofstatic clusters in dynamic DFGs.

4.3 Cluster Refinement in Static Data Flow Graphs

In the following, a refinement of a static subcluster 𝑔𝛾 ∈ 𝑔n.𝐺 of a network graph𝑔n only containing actors exhibiting SDF or CSDF semantics is considered. Itwill be assumed that the network graph 𝑔n is a consistent and deadlock-free SDFor CSDF graph (cf. Section 2.2 on page 9).Various forms of representing the composite actor 𝑎𝛾 are compared for their

feasibility to represent the result of a safe cluster refinement operation. Con-

100

Page 109: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

(8,0,0,0,

(0,0,8,0,

(0,0,0,0,

(0,0,0,0,

(0,8,0,0,

(0,0,0,0,

(0,0,0,8,

(0,0,0,0,

1,1,1,1)

1,1,1,1)

1,1,1,1)

1,1,1,1)

1,1,1,1)

1,1,1,1)

1,1,1,1)

1,1,1,1)

64

8

8

8

8

8

8

8

81

1

1

1

1

1

1164

18

8 1

18

8 1

8 1

18

18

8 1

0,0,0,0)

0,0,0,0)

0,0,8,0)

8,0,0,0)

0,0,0,0)

0,0,0,8)

0,0,0,0)

0,8,0,0)

(1,1,1,1,

(1,1,1,1,

(1,1,1,1,

(1,1,1,1,

(1,1,1,1,

(1,1,1,1,

(1,1,1,1,

(1,1,1,1,

𝑜1 𝑖1

𝑔𝛾 idct| IDCT

Figure 4.10: Subcluster 𝑔𝛾idctimplementing an IDCT

ditions will be given when an SDF or CSDF composite actor is sufficient torepresent the result of a safe cluster refinement operation for a cluster 𝑔𝛾 .As the notion of actor FSM is cumbersome for mere SDF or CSDF actors, the

usual notations to describe consumption and production rates of actors is used.Additionally, to conform to the usual visual representation of SDF or CSDFgraphs, no explicit depictions of channels are used. As is customary, the constantconsumption and production rates of these SDF and CSDF actors are annotatedon the edges, i.e., a number annotated on the head of an edge corresponds tothe consumption rate of the connected actor and a number annotated on thetail corresponds to the production rate of the connected actor. Missing numberannotations denote a consumption or production of one token. However, it is stillassumed that these graphs are specified in the notation given in Definition 3.5.

4.3.1 Refining a Cluster to an SDF Actor

Refining a cluster to an SDF composite actor, which implements a static sched-ule executing one iteration of the SDF or CSDF graph contained in the cluster,is called monolithic clustering. This approach excels at schedule overhead reduc-tion.34 An example of such a best case scenario is shown in Figure 4.10. In thisscenario, the static subcluster for the 2-dimensional IDCT can be replaced by anSDF composite actor. The SDF composite actor implements a static scheduleof the contained IDCT actors, i.e., only the input/output guard 𝑘io

#𝑖1≥64∧#𝑜1≥64

needs to be satisfied, before the composite actor can fire, thus, executing thecalculated static schedule. However, as will be later shown, this approach mightbe infeasible in the general case due to the cluster environment.More formally, the 𝜉𝑆𝐷𝐹 operator will be used to denote the refinement oper-

ator producing SDF composite actors. To exemplify, the SDF graph depicted in34In general, the longer the static scheduling sequence which can be executed by checking a

single input/output guard, the less schedule overhead is imposed by this schedule.

101

Page 110: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

Figure 4.11a is considered. A firing of the resulting SDF composite actor 𝑎𝛾1,2

depicted in Figure 4.11b executes the sequence ⟨𝑎1, 𝑎2, 𝑎2⟩ of statically scheduledactor firings. For code generation, efficiently looped versions of this sequence ofactor firings, e.g., fire actor 𝑎1 once then loop actor 𝑎2 twice, can be determinedby applying the scheduling algorithms presented in [BL93, BBHL95, BML97]to the repetition vector [BML97] 𝜂rep

𝛾1,2= (𝜂rep1 , 𝜂rep2 ) = (1, 2) for the subcluster

𝑔𝛾1,2derived via the cluster operation Γ(𝑔ex2 , { 𝑎1, 𝑎2 }) = (𝑔′ex2 , 𝑔𝛾1,2

).35

2 2

3 3

𝑎1 𝑎2

𝑎env

(a) A simple SDF graph 𝑔ex2containing the

two actors 𝑎1 and 𝑎2

3 3

2 2𝑎𝛾1,2

𝑎env

(b) Clustered and refined DFG 𝑔ex2contain-

ing the SDF composite actor 𝑎𝛾1,2

Figure 4.11: Example for a safe and valid refinement of the two SDF actors 𝑎1and 𝑎2 into a composite SDF actor 𝑎𝛾1,2

However, while the cluster refinement 𝜉𝑆𝐷𝐹 (𝑔′ex2

, 𝑔𝛾1,2) = (𝑔ex2 , 𝑎𝛾1,2

) depictedin Figure 4.11 is a safe cluster refinement operation, i.e., it will not introducedeadlock into the system regardless of the cluster environment connected tothe subcluster, the 𝜉𝑆𝐷𝐹 operator performs an unsafe cluster refinement oper-ation in the general case. Hence, for some combinations of cluster and clusterenvironment, the refinement via the 𝜉𝑆𝐷𝐹 operator might introduce a deadlock.To exemplify, the cluster refinement for a cluster containing the two actors 𝑎3

and 𝑎4 is performed for the two different DFGs 𝑔ex3a and 𝑔ex3b that are shownin Figures 4.12 and 4.13, respectively. As can be seen in Figure 4.12, the 𝜉𝑆𝐷𝐹

operation in general is unsafe due to the introduction of additional constraintsinto the scheduling of the resulting DFG. These constraints may introduce anartificial deadlock. To exemplify, the composite actor 𝑎𝛾3,4

produced by the clus-ter refinement operation imposes the constraint that the actor 𝑎4 must fire firstfollowed by three firings of the actor 𝑎3. Hence, neither the actor 𝑎3 containedin the composite actor 𝑎𝛾3,4

, nor the composite actor 𝑎𝛾3,4itself, nor the sole

actor 𝑎env in the cluster environment can fire. Thus, the refinement operationdoes produce a deadlock for the resulting DFG 𝑔ex3a depicted in Figure 4.12b.This is due to the fact that the two initial tokens on the channel between

actor 𝑎4 and actor 𝑎3 in the original graph depicted in Figure 4.12a are nolonger accessible in the refined graph 𝑔ex3a . Adding further initial tokens to thecluster environment, as has been done in the DFG 𝑔ex3b depicted in Figure 4.13,

35In general, the notation 𝜂rep𝛾 is used to denote the repetition vector of a subcluster 𝑔𝛾 .

102

Page 111: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

2 2

3 3𝑎3 𝑎4

𝑎env

(a) Original DFG 𝑔ex3acontaining the two

SDF actors 𝑎3 and 𝑎4

3 3

2 2𝑎env

𝑎𝛾3,4

(b) Resulting deadlocked DFG 𝑔ex3acon-

taining the SDF composite actor 𝑎𝛾3,4

Figure 4.12: Example for an unsafe and invalid refinement of the two SDF actors𝑎3 and 𝑎4 by the 𝜉𝑆𝐷𝐹 operator

resolves this problem. Hence, the refined DFG 𝑔ex3b depicted in Figure 4.13b isfree of deadlocks.

3 3

22

𝑎3 𝑎4

𝑎env

(a) DFG 𝑔ex3bcontaining additional initial

tokens compared to DFG 𝑔ex3a

2 2

3 3

𝑎env

𝑎𝛾3,4

(b) Resulting deadlock-free DFG 𝑔ex3bcon-

taining the SDF composite actor 𝑎𝛾3,4

Figure 4.13: Example for an unsafe but valid refinement of the two SDF actors𝑎3 and 𝑎4 by the 𝜉𝑆𝐷𝐹 operator

These situations can be analyzed in detail (cf. Figure 4.14) with the help ofa so-called Acyclic Precedence Graph (APG) [LM87a, SB00]. The APG canbe derived from the unfolding of the SDF or CSDF graph into an equivalentmarked graph and the subsequent deletion of all edges of the marked graphwhich are carrying initial tokens. In the following figures, marked graphs aredepicted in order to also show the edges with the initial tokens which would havebeen deleted if the corresponding APG was shown. The unfolding operationgenerates one marked graph actor, a vertex in the unfolded graph, for each firingof the corresponding SDF actor. To exemplify, the SDF graph 𝑔ex2 depictedin Figure 4.11a and its marked graph unfolding as shown in Figure 4.14a areconsidered. The graph 𝑔ex2 has a repetition vector of 𝜂rep

ex2= (𝜂rep1 , 𝜂rep2 , 𝜂repenv) =

(3, 6, 2). Hence, the marked graph derived by unfolding 𝑔ex2 has three vertices 𝑎11,𝑎21, and 𝑎31 corresponding to the three firings of the SDF actor 𝑎1. Therefore, theunfolding operation presented in [LM87a, SB00] are in general of exponentialcomplexity in the size of the problem description of the original SDF graph

103

Page 112: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑎12

𝑎32

𝑎52

𝑎22

𝑎42

𝑎62𝑎31

𝑎21

𝑎11

𝑎1env

𝑎2env

(a) Marked graph expansion of the DFG 𝑔ex2

𝑎3𝛾1,2

𝑎2𝛾1,2

𝑎1𝛾1,2𝑎12

𝑎32

𝑎52

𝑎22

𝑎42

𝑎62𝑎31

𝑎21

𝑎11

𝑎1env

𝑎2env

(b) Coalesced vertices

Figure 4.14: Depicted above are the marked graph expansion of the DFG 𝑔ex2depicted in Figure 4.11a and the corresponding coalesced verticesdepicted as dark shaded areas for the refinement of the cluster 𝑔𝛾1,2

into the composite actor 𝑎𝛾1,2.

[PBL95].36 The channels, the edges in the unfolded graph, are generated insuch a way by the unfolding operation that the data dependencies between theactor firings in the original SDF graph are preserved. The interested reader isreferred to the original literature [LM87a, SB00] for a detailed description of theunfolding operation.As presented in Section 2.2.1 on page 10, a marked graph has a deadlock if

and only if there exists a directed cycle in the marked graph where none of thechannels, the edges in the marked graph, in the cycle carries any initial tokens.Hence, a directed cycle in the APG signifies a deadlock in the underlying SDF ormarked graph. Now, refinements can be represented in the APG by coalescingactor firings (vertices) in the APG and only refinements of subclusters intocomposite actors which do not generate cycles by coalescing the correspondingactor firings of the refinement in the APG may be shown to produce validtransformations.To exemplify, the cluster refinement (𝑔ex2 , 𝑎𝛾1,2

) = 𝜉𝑆𝐷𝐹 (𝑔′ex2

, 𝑔𝛾1,2) is consid-

ered. To recapitulate, the repetition vector of the SDF graph 𝑔ex2 is 𝜂repex2

=(𝜂rep1 , 𝜂rep2 , 𝜂repenv) = (3, 6, 2) while the subcluster 𝑔𝛾1,2

has a repetition vector of

36While more advanced techniques for unfolding exist [Gei09], they generate a marked graphwhich is conservative for the purpose of throughput analysis and in general contains ad-ditional dependencies. Hence, the generated marked graph is unusable for the purpose ofdetermining the validity of a cluster refinement.

104

Page 113: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

𝑎2𝛾3,4

𝑎1𝛾3,4𝑎13

𝑎33

𝑎53

𝑎23

𝑎43

𝑎63

𝑎24

𝑎14

𝑎3env

𝑎2env

𝑎1env

(a) Marked graph for 𝑔ex3a

𝑎1𝛾3,4

𝑎2𝛾3,4

𝑎13

𝑎33

𝑎53

𝑎23

𝑎43

𝑎63

𝑎24

𝑎14

𝑎3env

𝑎2env

𝑎1env

(b) Marked graph for 𝑔ex3b

Figure 4.15: Shown above are the marked graph expansions as well as the coa-lesced vertices depicted as dark shaded areas for the DFGs 𝑔ex3a and𝑔ex3b . Edges which are part of dependency cycles are highlighted inbold.

𝜂rep𝛾1,2

= (𝜂rep1

′, 𝜂rep2

′) = (1, 2). Hence, in the marked graph unfolding of 𝑔ex2 , the

refinement operation will lead to a coalescing of one actor firing, as 𝜂rep1 = 1, ofthe SDF actor 𝑎1 and two actor firings, as 𝜂rep2 = 2, of the SDF actor 𝑎2 into anew marked graph actor firing. In total, there will be three such coalescings inthe marked graph as determined from the equation below:

𝜂rep1

𝜂rep1

′ =𝜂rep2

𝜂rep2

′ = 3

This can be observed by considering Figure 4.14b where the actor firing 𝑎11, 𝑎12,and 𝑎22 have been coalesced into the new composite actor firing 𝑎1𝛾1,2

depicted asthe first coalescing (dark shaded area) of the three coalescings in Figure 4.14b.In the following, the vertices that are coalesced, e.g., the vertices 𝑎11, 𝑎12, and 𝑎22,and the edges connecting these vertices, e.g., the edges 𝑎11 → 𝑎12 and 𝑎11 → 𝑎22,will be called the coalescing. The composite actor firings 𝑎2𝛾1,2

and 𝑎3𝛾1,2have

been built accordingly from the constituent actor firings of the actors 𝑎1 and 𝑎2.The three situations encountered in the cluster refinements depicted in Fig-

ures 4.11 to 4.13 are now analyzed in detail via usage of the APGs depicted inFigures 4.14 and 4.15. These situations correspond to the three cases as givenbelow:

105

Page 114: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

∙ The refinement is a safe cluster refinement. This case corresponds to thesubcluster refinement (𝑔ex2 , 𝑎𝛾1,2

) = 𝜉𝑆𝐷𝐹 ∘ Γ(𝑔ex2 , { 𝑎1, 𝑎2 }) depicted inFigure 4.11 with the corresponding APG shown in Figure 4.14b.37

Note that a safe cluster refinement operation, that is an operation con-forming to the equivalence condition given in Definition 4.2, does not guar-antee that the resulting refined DFG is free of deadlocks. It only ensuresthat no additional artificial deadlocks are introduced by the refinementoperation. To exemplify, if the DFG 𝑔ex2 is modified by removing the twoinitial tokens from the channel from 𝑎env to 𝑎1, then the cluster refinementoperation (𝑔noiniex2

, 𝑎𝛾1,2) = 𝜉𝑆𝐷𝐹 ∘ Γ(𝑔noiniex2

, { 𝑎1, 𝑎2 }) is still a safe clusterrefinement operation. However, both the original DFG 𝑔noiniex2

as well asthe resulting refined DFG 𝑔noiniex2

exhibit a deadlock.

In the APG, the equivalence condition translates to ensuring that thecoalescing operations will not introduce additional cycles into the APG byreplacing each coalescing and its contained actor firings with a compositeactor firing.38 This can be ensured by checking the condition given below:

Theorem 4.1 (Safe SDF Cluster Refinement). Given a consistent anddeadlock-free static subcluster 𝑔𝛾 of a DFG 𝑔 and the APG expansion ofthe subcluster 𝑔𝛾 , that is the coalescing of actor firings corresponding tothis refinement operation. Then, the refinement operation 𝜉𝑆𝐷𝐹 (𝑔, 𝑔𝛾 ) isa safe cluster refinement operation if and only if for each pair (𝑎𝑥in, 𝑎

𝑦out) of

input/output actor firings of the coalescing, there exists a directed pathin the coalescing from the input actor firing 𝑎𝑥in to the output actor firing𝑎𝑦out.39

Proof. The key idea of Theorem 4.1 is that the SDF refinement operationmust not introduce dependencies on tokens consumed by the compositeactor for tokens produced by the composite actor that were not alreadypresent in the original cluster. The SDF refinement operation introducesfor all output actor firings in the APG expansion of the subcluster depen-dencies on all input actor firings in the APG expansion of the subcluster.Hence, these dependencies must already be present in the APG expansionof the subcluster itself. Thus, there must exist at least one directed path

37The ‘∘’-operator, e.g., 𝜉𝑆𝐷𝐹 ∘ Γ, is used to denote function composition, i.e., (𝜉𝑆𝐷𝐹 ∘Γ)(𝑔, { 𝑎𝑛 . . . 𝑎𝑚 }) ≡ 𝜉𝑆𝐷𝐹 (Γ(𝑔, { 𝑎𝑛 . . . 𝑎𝑚 })).

38There may already be cycles present in the APG. This is the case when the original DFG,e.g., 𝑔noiniex2

, has a deadlock.39The directed path must consist of APG edges contained in the coalescing. Hence, as the

figures are all showing the marked graph expansion only edges carrying no initial tokensmust be used to form the directed path.

106

Page 115: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

in the original APG expansion of the subcluster 𝑔𝛾 for each combinationof input actor firing and output actor firing.

To exemplify, the APG expansion of the subcluster 𝑔𝛾1,2of the DFG 𝑔ex2

shown in Figure 4.11a appears three times in the APG expansion (cf. Fig-ure 4.14b) of the DFG 𝑔ex2 itself. These three occurrences (shaded areas)correspond to the three firings of the composite actor 𝑎𝛾1,2

. Only one ofthe three coalescings needs to be checked as they are all equivalent. Here,the coalescing 𝑎1𝛾1,2

is considered. The set of input actors of the subcluster𝑔𝛾1,2

is 𝐴𝐼 = { 𝑎1 } while the set of output actors is 𝐴𝑂 = { 𝑎2 }. Hence,the set of pairs of input/output actor firings is { (𝑎11, 𝑎12), (𝑎11, 𝑎22) } and thedirected path for each pair from the input actor firing to the output actorfiring consists of the sole edge connecting both actors in the pair.

∙ The refinement is an unsafe cluster refinement and introduces an artificialdeadlock. This case corresponds to the subcluster refinement (𝑔ex3a , 𝑎𝛾3,4

) =𝜉𝑆𝐷𝐹 ∘ Γ(𝑔ex3a , { 𝑎3, 𝑎4 }) depicted in Figure 4.12 with the correspondingAPG shown in Figure 4.15a.

One can see that the cluster refinement introduces a deadlock by observingthe dependency cycles 𝑎1𝛾3,4

→ 𝑎1env → 𝑎1𝛾3,4and 𝑎2𝛾3,4

→ 𝑎2env → 𝑎2𝛾3,4in

the APG depicted in Figure 4.15a. The first cycle 𝑎1𝛾3,4→ 𝑎1env → 𝑎1𝛾3,4

is introduced because the coalescing 𝑎1𝛾3,4does not have an initial token

free directed path from the input actor firing 𝑎14 to the output actor firings𝑎13 and 𝑎23. The second cycle 𝑎2𝛾3,4

→ 𝑎2env → 𝑎2𝛾3,4is introduced because

the coalescing 𝑎2𝛾3,4does not have an initial token free directed path from

the input actor firing 𝑎24 to the output actor firings 𝑎43 and 𝑎53. Hence, thecluster refinement 𝜉𝑆𝐷𝐹 (𝑔ex3a , { 𝑎3, 𝑎4 }) adds new constraints in the formof artificial data dependencies between the pairs (𝑎14, 𝑎13), (𝑎14, 𝑎23), (𝑎24, 𝑎43),and (𝑎24, 𝑎

53) of input/output actor firings.

These artificial data dependencies together with the data dependencies inthe cluster environment form the two cycles in the APG which signify theintroduced artificial deadlock. However, as can be seen in the next case,if the cluster environment is benign, then even the introduction of theseartificial data dependencies may not introduce a deadlock.

∙ The refinement is an unsafe cluster refinement but results in no deadlock.This case corresponds to the subcluster refinement (𝑔ex3b , 𝑎𝛾3,4

) = 𝜉𝑆𝐷𝐹 ∘Γ(𝑔ex3b , { 𝑎3, 𝑎4 }) depicted in Figure 4.13 with the corresponding APGshown in Figure 4.15b.

Compared to the previous case, the cluster environment is more forgiving.Hence, the introduction of the artificial data dependencies between the

107

Page 116: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

pairs (𝑎14, 𝑎13) and (𝑎14, 𝑎

23) do not cause a cycle in the APG as the edges

𝑎13 → 𝑎1env and 𝑎23 → 𝑎1env are missing in this environment. This globalanalysis approach to clustering is employed in [BL93].

In some cases, retiming [LLW+07] can be used to enable a safe SDF refinementoperation. In brief, retiming is an operation firing actors of the DFG in orderto achieve a desired redistribution of the initial tokens.

2 2

3 3𝑎3 𝑎4

𝑎env

(a) The retimed DFG 𝑔rex3aderived from fir-

ing the actor 𝑎3 from 𝑔ex3atwice

33

2 2𝑎env

𝑎r𝛾3,4

(b) Resulting deadlock-free DFG 𝑔rex3acon-

taining the SDF composite actor 𝑎r𝛾3,4

𝑎𝑟,1𝛾3,4

𝑎𝑟,2𝛾3,4

𝑎13

𝑎33

𝑎53

𝑎23

𝑎43

𝑎63

𝑎24

𝑎14

𝑎3env

𝑎2env

𝑎1env

(c) Marked graph for 𝑔rex3a

33

2 2

𝑎3 𝑎4

𝑞1𝑞0

#𝑜1 ≥ 2/𝑓⟨𝑎3,𝑎3⟩𝑡1

𝑡2

#𝑖1 ≥ 3 ∧#𝑜1 ≥ 3/𝑓⟨𝑎4,𝑎3,𝑎3,𝑎3⟩

𝑖1𝑜1

𝑎env

(d) Composite actor with included retiming

Figure 4.16: Example to demonstrate that retiming can be used in some casesto make 𝜉𝑆𝐷𝐹 a safe cluster refinement operation

To exemplify, the retimed DFG depicted in Figure 4.16 derived from theoriginal DFG shown in Figure 4.12 is considered. The retiming is achieved byfiring actor 𝑎3 twice, thus, moving the two initial tokens on the channel betweenactor 𝑎4 and 𝑎3 in the original graph depicted in Figure 4.12a outside of thesubcluster. Hence, the problem of hiding initial tokens of the subcluster by therefinement operation 𝜉𝑆𝐷𝐹 is averted.As can be seen in the APG depicted in Figure 4.16c, retiming translates the

unsafe cluster refinement operation 𝜉𝑆𝐷𝐹 ∘Γ(𝑔ex3a , { 𝑎3, 𝑎4 }) into the safe cluster

108

Page 117: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

refinement operation 𝜉𝑆𝐷𝐹 ∘ Γ(𝑔rex3a , { 𝑎3, 𝑎4 }). In detail, it can be seen in theAPG that for both coalescings and for all their pairs of input/output actorfirings { (𝑎14, 𝑎13), (𝑎14, 𝑎23), (𝑎14, 𝑎33) } and { (𝑎24, 𝑎43), (𝑎24, 𝑎53), (𝑎24, 𝑎63) }, there exists adirected path from the input actor firing to the output actor firing containingonly edges of the coalescing carrying no initial tokens.However, retiming at compile time can only be employed if the behavior of the

actors which are fired during the retiming operation are exactly known by thecompiler and are free of side effects on parts of the system not under the controlof the compiler. Otherwise, the behavior of the actors fired during the retimingoperation cannot be replicated at compile time and can only be executed atrun time. This compile-time retiming problem can be solved by generating acomposite actor executing the retiming operation at run time. To exemplify,the composite actor as depicted in Figure 4.16d is considered. The retimingoperation is encoded in the transition 𝑡1 while the normal SDF composite actormode corresponds to the self-loop via transition 𝑡2.Generally, however, a refinement of a subcluster to a composite SDF actor

is an unsafe cluster refinement operation. Hence, it is may introduce artificialdeadlocks in the general case. This is true even in the presence of retimingtransformations. To exemplify, the SDF graph 𝑔ex4 and its retimed variant 𝑔rex4depicted in Figures 4.17 and 4.18 are considered.

2 3

2 3

𝑖1

𝑜1

𝑜2𝑎env1𝑎5

𝑔𝛾5,6

𝑎6 𝑎env2

(a) SDF graph 𝑔ex4with subcluster 𝑔𝛾5,6

3

3

2

2

2

𝑎env1

𝑎env2𝑎𝛾5,6

(b) Resulting deadlocked DFG 𝑔ex4

Figure 4.17: Example for an invalid refinement of the subcluster 𝑔𝛾5,6into the

SDF composite actor 𝑎𝛾5,6

While retiming can remove the hidden token problem of the two initial tokenscarried on the edge from actor 𝑎6 to actor 𝑎5, the deadlock induced by the SDFrefinement constraint that two firings (𝜂rep

𝛾5,6(𝑎6) = 2)40 of actor 𝑎6 have to be

40Please remember, a vector, e.g., 𝜂rep𝛾5,6

= (𝜂rep𝑎5, 𝜂rep𝑎6

), is always also considered to be afunction from its index set, e.g., here ℐ(𝜂rep

𝛾5,6) = { 𝑎5, 𝑎6 }, to its entries, e.g., 𝜂rep

𝛾5,6(𝑎6) =

𝜂rep𝑎6. Furthermore, to avoid the usage of double or triple subscripts, the notation 𝜂rep6

instead of 𝜂rep𝑎6and similar constructs is given preference.

109

Page 118: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

2

2 3

3

𝑖1

𝑜1

𝑜2𝑎5

𝑎6 𝑎env2

𝑎env1

𝑔r𝛾5,6

(a) SDF graph 𝑔rex4derived via retiming

𝑔ex4 by firing actor 𝑎5 once

3

3

2

2

2

𝑎env1

𝑎env2𝑎r𝛾5,6

(b) Refined DFG 𝑔rex4containing the SDF

composite actor 𝑎r𝛾5,6introducing a

deadlock

Figure 4.18: Example to demonstrate that retiming can not always be used tomake 𝜉𝑆𝐷𝐹 a safe cluster refinement operation

performed atomically cannot be solved via retiming. To solve this problem, arefinement to a CSDF composite actor is considered next.

4.3.2 Refining a Cluster to a CSDF Actor

In the following, the possibility of refining a subcluster into a CSDF compositeactor is explored. More formally, the 𝜉𝐶𝑆𝐷𝐹 operator will be used to denote arefinement operator producing a CSDF composite actor. Note that in contrastto the 𝜉𝑆𝐷𝐹 operator, there may exist multiple operators producing differentCSDF composite actors from one subcluster. Here, CSDF composite actors willbe called different if they differ in their communication behavior, that is takinga black box view they differ in the observable behavior from the outside.41

It will be shown that a refinement into a CSDF composite actor is sufficient toguarantee a valid cluster refinement operation for all possible subclusters if thecluster environment is known, consists only of SDF and CSDF actors, and theoriginal DFG is consistent and free of deadlocks. However, a refinement intoa CSDF composite actor is insufficient to guarantee a safe cluster refinementoperation for an arbitrary static cluster. That is to say, a refinement into aCSDF composite actor is insufficient to prevent deadlock for all possible clus-ter environments of the arbitrary static cluster even if the cluster environmentcontains only SDF or CSDF actors.42

To exemplify, refinements of the subclusters 𝑔𝛾5,6and 𝑔r𝛾5,6

of the SDF graphsdepicted in Figures 4.17 and 4.18 into a CSDF composite actor each are now41If a white box view is taken, then there may even exists multiple versions of SDF refinement

operators, each of which produce a different static schedule of the SDF actors containedin the composite actor.

42This case corresponds to the modular code generation scenario explored in [TBG+13].

110

Page 119: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

considered. The CSDF refinement enables the resolution of the deadlock sit-uation caused by the atomic firings of the sequence ⟨𝑎6, 𝑎6, 𝑎5⟩ required by arefinement to an SDF composite actor.The situation for the SDF graph 𝑔ex4 can be analyzed in detail via usage of

the marked graph expansion as depicted in Figure 4.19. The artificial deadlockintroduced by the refinement operation (𝑔ex4 , 𝑎𝛾5,6

) = 𝜉𝑆𝐷𝐹 (𝑔ex4 , 𝑔𝛾5,6) can arise

due to the violation of the safe SDF cluster refinement condition from Theo-rem 4.1 for the coalescing of actor firings from Figure 4.19b. From the four pairs(𝑎16, 𝑎

15), (𝑎26, 𝑎15), (𝑎16, 𝑎26), and (𝑎26, 𝑎

16) of input/output actor firings, only the pair

(𝑎16, 𝑎26) does not violate the condition. To avoid these violations, the coalescing

of actor firings can be split as depicted in Figure 4.19c. However, this changesthe semantics of the composite actor 𝑎′

𝛾5,6to be CSDF instead of SDF.

3

3

𝑖1

𝑜1

𝑜2

𝑎26

𝑎15 𝑎env1

𝑎env2𝑎16

𝑔𝛾5,6

(a) Marked graph expan-sion of the subcluster𝑔𝛾5,6

3

3

𝑖1

𝑜1

𝑜2𝑎env1

𝑎env2

𝑎15

𝑎16

𝑎26

𝑎𝛾5,6

(b) Coalescing of actor fir-ings corresponding tothe SDF composite ac-tor 𝑎𝛾5,6

3

3

𝑖1

𝑜1

𝑜2𝑎1

′𝛾5,6

𝑎2′

𝛾5,6

𝑎3′

𝛾5,6

𝑎env2

𝑎15

𝑎16

𝑎26

𝑎env1

𝑎′𝛾5,6

(c) Grouping of actor firingscorresponding to the CSDFcomposite actor 𝑎

𝛾5,6

Figure 4.19: Depicted is a marked graph expansion of the subcluster 𝑔𝛾5,6as

well as a coalescing and grouping of actor firings representing re-finements of the subcluster to the SDF composite actor 𝑎𝛾5,6

andthe CSDF composite actor 𝑎′

𝛾5,6, respectively.

Each coalescing, depicted as a dark shaded area, of a grouping of actor firingscorresponds to a CSDF phase. There are three coalescings corresponding to thethree CSDF phases of the composite actor 𝑎′

𝛾5,6depicted in Figure 4.20.

The consumption and production rates of the CSDF phases are annotated tothe visual depiction of a CSDF actor as introduced in Section 2.2.3. Further-more, due to lack of explicit channel annotations in the figures, the notationcons(𝑎1, 𝑎2)/prod(𝑎1, 𝑎2) is used to denote the SDF or CSDF consumption/pro-duction rate(s) of the actor 𝑎2/𝑎1 on the channel between the actors 𝑎1 and 𝑎2.

111

Page 120: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

3

3

(0,1,1)

(0,1,1)

(2,0,0)

𝑎env1

𝑎env2𝑎′𝛾5,6

Figure 4.20: Refinement (𝑔′ex4

, 𝑎′𝛾5,6

) = 𝜉′𝐶𝑆𝐷𝐹 (𝑔ex4 , 𝑔𝛾5,6

) of the subcluster 𝑔𝛾5,6

of the graph 𝑔ex4 into the CSDF composite actor 𝑎′𝛾5,6

More formally, the consumption and production rates of the three CSDF phasesof the composite actor 𝑎′

𝛾5,6are given as follows:

prod(𝑎′𝛾5,6

, 𝑎env1) = (2, 0, 0)

cons(𝑎env2 , 𝑎′𝛾5,6

) = (0, 1, 1)

prod(𝑎′𝛾5,6

, 𝑎env2) = (0, 1, 1)

The consumption and production rate of the first CSDF phase is derivedfrom the first coalescing 𝑎1

′𝛾5,6

by noting that the firing 𝑎1′

𝛾5,6produces two tokens

(prod(𝑎′𝛾5,6

, 𝑎env1)(1) = 2) on the channel from actor 𝑎′𝛾5,6

to actor 𝑎env1 and con-sumes and produces no tokens (cons(𝑎env2 , 𝑎

′𝛾5,6

)(1) = prod(𝑎′𝛾5,6

, 𝑎env2)(1) =

0) on the other channels connected to the composite actor.43 Consumption andproduced rates of the two remaining CSDF phases are derived accordingly.Additionally, in contrast to an SDF refinement, dependencies have to be added

to the marked graph to force execution of the CSDF phases in sequentiallyascending order, e.g., the dotted dashed arrows in Figure 4.19c from 𝑎1

′𝛾5,6→

𝑎2′

𝛾5,6, 𝑎2′𝛾5,6

→ 𝑎3′

𝛾5,6, and 𝑎3

′𝛾5,6→ 𝑎1

′𝛾5,6

with an initial token. Hence, the conditionfor a safe CSDF cluster refinement has to be generalized from Theorem 4.1 asfollows:

Theorem 4.2 (Safe CSDF Cluster Refinement). Given (1) a consistent anddeadlock-free static subcluster 𝑔𝛾 of a DFG 𝑔, (2) the marked graph expansionof the subcluster 𝑔𝛾 , and (3) a grouping of actor firings consisting of an orderedset { 𝑎1𝛾 , 𝑎2𝛾 , . . . 𝑎𝑛𝛾 } of 𝑛 coalescings of actor firings, then each coalescing of actorfirings corresponds to one of the 𝑛 CSDF phases of the generated compositeactor. Furthermore, in order to force a sequentially ascending execution orderfor these 𝑛 CSDF phases, the directed edges 𝑎1𝛾 → 𝑎2𝛾 , 𝑎

2𝛾 → 𝑎3𝛾 , . . . 𝑎

𝑛−1𝛾 → 𝑎𝑛𝛾

43In the following, a vector 𝑥 = (𝑥1, 𝑥2, . . . 𝑥𝑛) is also considered to be a function from itsindex set ℐ(𝑥) = { 1, 2, . . . 𝑛 } to its entries, that is 𝑥(𝑖) = 𝑥𝑖.

112

Page 121: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

without any initial tokens as well as the edge 𝑎𝑛𝛾 → 𝑎1𝛾 with one initial tokenwill be added to the marked graph expansion. Then, the refinement operation𝜉𝐶𝑆𝐷𝐹 (𝑔, 𝑔𝛾 ) for the given grouping of actor firings is a safe cluster refinementoperation if and only if the following three conditions hold:

∙ Condition I: The grouping of actor firings does not introduce any delay-less cycles in the modified marked graph expansion.

∙ Condition II: For each pair (𝑎𝑥in, 𝑎𝑦out) ∈ ℰforward of input/output actor

firings, there exists a delay-less directed path from the input actor firing𝑎𝑥in to the output actor firing 𝑎𝑦out in the original marked graph expansionof the subcluster 𝑔𝛾 .44

∙ Condition III: For each pair (𝑎𝑥in, 𝑎𝑦out) ∈ ℰbackward of input/output actor

firings, there exists a directed path with exactly one initial token from theinput actor firing 𝑎𝑥in to the output actor firing 𝑎𝑦out in the original markedgraph expansion of the subcluster 𝑔𝛾 .

The set ℰforward is constrained such that it only contains pairs (𝑎𝑥in, 𝑎𝑦out) of in-

put/output actor firings where

∙ 𝑎in ∈ 𝐴𝐼 is an input actor and 𝑎out ∈ 𝐴𝑂 is an output actor

∙ 𝑎𝑦out is coalesced into some coalescing 𝑎𝑚𝛾 ∈ { 𝑎1𝛾 , 𝑎2𝛾 , . . . 𝑎𝑛𝛾 }, 1 ≤ 𝑚 ≤ 𝑛

∙ 𝑎𝑥in is coalesced into some coalescing 𝑎𝑙𝛾 ∈ { 𝑎1𝛾 , 𝑎2𝛾 , . . . 𝑎𝑚𝛾 }, 1 ≤ 𝑙 ≤ 𝑚 and𝑎𝑥in = 𝑎𝑦out

The set ℰbackward is constrained such that it only contains pairs (𝑎𝑥in, 𝑎𝑦out) of

input/output actor firings where

∙ 𝑎in ∈ 𝐴𝐼 is an input actor and 𝑎out ∈ 𝐴𝑂 is an output actor

∙ 𝑎𝑦out is coalesced into some coalescing 𝑎𝑚𝛾 ∈ { 𝑎1𝛾 , 𝑎2𝛾 , . . . 𝑎𝑛𝛾 }, 1 ≤ 𝑚 ≤ 𝑛

∙ 𝑎𝑥in is coalesced into some coalescing 𝑎𝑙𝛾 ∈ { 𝑎𝑚+1𝛾 , 𝑎𝑚+2

𝛾 , . . . 𝑎𝑛𝛾 },𝑚 < 𝑙 ≤ 𝑛

Proof. The key idea of Theorem 4.2 is that the CSDF refinement operation mustnot introduce any dependencies on tokens consumed by the composite actor fortokens produced by the composite actor that were not already present in theoriginal cluster. The CSDF refinement operation introduces for all output ac-tor firings two types of dependencies: (1) delay-less dependencies on all inputactor firings in the same or sequentially previous (1 ≤ 𝑙 ≤ 𝑚) CSDF phases44A delay-less directed path is a directed path where none of the edges of the path carries

any initial tokens.

113

Page 122: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

(cf. ℰforward), and (2) dependencies with exactly one initial token on all input ac-tor firings in sequentially later (𝑚 < 𝑙 ≤ 𝑛) CSDF phases (cf. ℰbackward). Hence,the second and the third condition in Theorem 4.2 ensures that equivalent de-pendencies are already present in the original marked graph expansion of thesubcluster 𝑔𝛾 . The first condition in Theorem 4.2 simply ensures that a validschedule for all actor firings in the CSDF phases can be generated.

To exemplify, the CSDF cluster refinement (𝑔′ex4

, 𝑎′𝛾5,6

) = 𝜉′𝐶𝑆𝐷𝐹 (𝑔ex4 , 𝑔𝛾5,6

)shown in Figure 4.19c is a safe cluster refinement. For the subcluster 𝑔𝛾5,6

,the set of input actors is 𝐴𝐼 = { 𝑎6 } and the set of output actors is 𝐴𝑂 ={ 𝑎5, 𝑎6 }. Hence, for the grouping of actor firings as depicted in Figure 4.19c,delay-less dependencies are introduced for the pair of input/output actor fir-ing (𝑎16, 𝑎

26) ∈ ℰforward = { (𝑎16, 𝑎26) } and dependencies with exactly one initial

token are introduced for the pairs of input/output actor firings ℰbackward ={ (𝑎16, 𝑎15), (𝑎26, 𝑎15), (𝑎26, 𝑎16) }. As can be seen by considering Figure 4.19a, forthe pair (𝑎16, 𝑎

26) ∈ ℰforward of input/output actor firing, the edge 𝑎16 → 𝑎26 sat-

isfies the existence of a delay-less path in the original marked graph expansionof the subcluster 𝑔𝛾 . Furthermore, for all pairs of input/output actor firingsℰbackward, there are equivalent edges in the original marked graph expansion ofthe subcluster 𝑔𝛾 carrying exactly one initial token.However, the generated CSDF composite actor 𝑎′

𝛾5,6has only one actor firing

in each of its CSDF phases. This is a suboptimal situation from the perspectiveof overhead reduction. To optimize the reduction of the overall overhead, thelength of the executed sequences of statically scheduled actor firings in eachCSDF phase should be maximized. Fortunately, retiming can be employed toimprove the situation. By firing actor 𝑎5 once, the DFG depicted in Figure 4.18ais obtained. Instead of a refinement to an SDF composite actor (cf. Figures 4.18band 4.21b), the CSDF composite actor 𝑎r

′𝛾5,6

is now generated. To generatethis CSDF composite actor, the grouping of the actor firings as depicted inFigure 4.21c is used. Hence, the CSDF composite actor 𝑎r′𝛾5,6

requires only thetwo CSDF phases as given below:

prod(𝑎r′𝛾5,6

, 𝑎env1) = (0, 2)

cons(𝑎env2 , 𝑎r′𝛾5,6

) = (1, 1)

prod(𝑎r′𝛾5,6

, 𝑎env2) = (1, 1)

The first CSDF phase is derived from the coalescing 𝑎r,1′

𝛾5,6which fires actor 𝑎6

once, thus consuming one token (cons(𝑎env2 , 𝑎r′𝛾5,6

)(1) = 1) on the channel fromactor 𝑎env2 to actor 𝑎r′𝛾5,6

as well as producing one token (prod(𝑎r′𝛾5,6, 𝑎env2)(1) =

1) on the channel from actor 𝑎r′𝛾5,6

to actor 𝑎env2 . The second CSDF phase isderived from the coalescing 𝑎r,2

′𝛾5,6

which fires actor 𝑎6 once and, thus, leading

114

Page 123: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

3

3

𝑖1

𝑜1

𝑜2

𝑎env2

𝑎15

𝑎16

𝑎26

𝑎env1

𝑔r𝛾5,6

(a) Marked graph expan-sion of the subcluster𝑔r𝛾5,6

3

3

𝑖1

𝑜1

𝑜2𝑎env1

𝑎env2

𝑎15

𝑎16

𝑎26

𝑎r𝛾5,6

(b) Coalescing of actor fir-ings corresponding tothe SDF composite ac-tor 𝑎r𝛾5,6

3

3

𝑖1

𝑜1

𝑜2𝑎r,2

′𝛾5,6

𝑎r,1′

𝛾5,6

𝑎env1

𝑎env2

𝑎15

𝑎16

𝑎26

𝑎r′𝛾5,6

(c) Grouping of actor firingscorresponding to the CSDFcomposite actor 𝑎r

𝛾5,6

Figure 4.21: Depicted is a marked graph expansion of the subcluster 𝑔r𝛾5,6as

well as a coalescing and grouping of actor firings representing re-finements of the subcluster to the SDF composite actor 𝑎r𝛾5,6

andthe CSDF composite actor 𝑎r′𝛾5,6

, respectively.

to cons(𝑎env2 , 𝑎r′𝛾5,6

)(2) = 1 and prod(𝑎r′𝛾5,6

, 𝑎env2)(2) = 1, while actor 𝑎5 is alsofired once, thus generating two tokens (prod(𝑎r′𝛾5,6

, 𝑎env1)(2) = 2) on the channelfrom actor 𝑎r′𝛾5,6

to actor 𝑎env1 .As previously, a composite actor for the original DFG 𝑔ex4 including the re-

timing sequence can be generated. The resulting composite actor is depicted inFigure 4.22b. The transition 𝑡1 implements the retiming operation while 𝑡2 and𝑡3 correspond to the first and second CSDF phase, respectively. The compositeactor 𝑎r′𝛾5,6

as well as the composite actor depicted in Figure 4.22b are generatedby safe cluster refinement operations. Thus, these composite actors can be usedto replace the cluster 𝑔𝛾5,6

(𝑔r𝛾5,6for 𝑎r

′𝛾5,6

) for arbitrary cluster environmentswithout introducing any artificial deadlocks.However, in the general case, cluster refinement into a CSDF composite actor

are unsafe. To exemplify, consider the cluster 𝑔𝛾1,2,6in the two cluster environ-

ments 𝑔ex5a and 𝑔ex5b depicted in Figure 4.23. The problem stems from the factthat the actor 𝑎6 has to fire first for the cluster environments 𝑔ex5a while thesequence ⟨𝑎1, 𝑎2, 𝑎2⟩ has to fire first for cluster environments 𝑔ex5b . Therefore,the decision which schedule sequence to choose depends on token availability onthe input ports 𝑖1 and 𝑖2 of the cluster 𝑔𝛾1,2,6

.However, such flexibility cannot be represented via a CSDF composite actor.

Furthermore, this situation cannot be salvaged via retiming as the four initial

115

Page 124: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

3

3

(1,1)

(1,1)

(0,2)

𝑎env1

𝑎env2𝑎r′𝛾5,6

(a) Composite actor 𝑎r′

𝛾5,6

2

2

3

3

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓⟨𝑎6⟩

#𝑜 2≥

2/𝑓 ⟨

𝑎5⟩

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1 ∧#𝑜2 ≥ 2/𝑓⟨𝑎6,𝑎5⟩

𝑡2

𝑡1

𝑡3𝑜2

𝑜1

𝑖1

𝑞1 𝑞2

𝑞0𝑎env2𝑎6

𝑎5 𝑎env1

(b) Composite actor with included retiming for 𝑔ex4

Figure 4.22: Composite actors derived by usage of retiming

2

2

2

2 4

2 4

𝑎env2𝑎env1

𝑜2 𝑖2

𝑜1𝑖1𝑎2𝑎1

𝑎6

𝑔𝛾1,2,6

(a) Cluster 𝑔𝛾1,2,6embedded in the

cluster environment 𝑔ex5a

2 2

2

2

4

4

2

𝑎env2𝑎env1

𝑖2𝑜2

𝑜1𝑖1𝑎2𝑎1

𝑎6

𝑔𝛾1,2,6

(b) Cluster 𝑔𝛾1,2,6embedded in the

cluster environment 𝑔ex5b

Figure 4.23: Example of a cluster with no safe cluster refinement operation to aCSDF composite actor.

tokens, which are the reason for the flexibility to either fire 𝑎6 or sequence⟨𝑎1, 𝑎2, 𝑎2⟩ first, in the cycle 𝑎1 → 𝑎6 → 𝑎1 cannot be removed from the cluster𝑔𝛾1,2,6

via retiming. Instead, a QSS is required, which statically schedules thesequences ⟨𝑎1, 𝑎2, 𝑎2⟩ and ⟨𝑎6⟩ but postpones the decision which sequence toexecute first to run time.In the following, the situation of the cluster 𝑔𝛾1,2,6

is considered in detail viaits marked graph expansion depicted in Figure 4.24. Note that all coalescingsare safe according to Theorem 4.1. Hence, splitting these coalescings will notenable the creation of a grouping of actor firings representing a safe clusterrefinement operation to a CSDF actor according to Theorem 4.2 if this was notpreviously possible with the coalescings as given in Figure 4.24.Thus, the remaining flexibility for building different CSDF composite actors

is in the ordering of the CSDF phases ⟨𝑎11, 𝑎12, 𝑎22⟩, ⟨𝑎16⟩, ⟨𝑎26⟩. In fact, the threeCSDF composite actors 𝑎

′𝛾1,2,6

, 𝑎′′𝛾1,2,6

, and 𝑎′′′𝛾1,2,6

as given in Figures 4.25ato 4.25c represent the three possible valid orderings of the three CSDF phases⟨𝑎11, 𝑎12, 𝑎22⟩, ⟨𝑎16⟩, ⟨𝑎26⟩ to derive a CSDF composite actor for the cluster 𝑔𝛾1,2,6

.In contrast to this, the grouping of actor firings as depicted in Figure 4.24a

has an invalid ordering of the CSDF phases as indicated by the delay-less cycle

116

Page 125: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

𝑎1𝛾1,2,6

𝑎2𝛾1,2,6

𝑎3𝛾1,2,6

𝑖1 𝑜1

𝑖2𝑜2

𝑎11

𝑎12

𝑎22

𝑎16

𝑎26

𝑎𝛾1,2,6

(a) Example of an invalidgrouping of actor firings

𝑎3′

𝛾1,2,6

𝑎2′

𝛾1,2,6

𝑎1′

𝛾1,2,6

𝑖1 𝑜1

𝑖2𝑜2

𝑎11

𝑎12

𝑎22

𝑎16

𝑎26

𝑎′𝛾1,2,6

(b) Grouping of actor fir-ings corresponding tothe CSDF compositeactor 𝑎

𝛾1,2,6

𝑎2′′

𝛾1,2,6

𝑎3′′

𝛾1,2,6

𝑎1′′

𝛾1,2,6

𝑖1 𝑜1

𝑖2𝑜2

𝑎11

𝑎12

𝑎22

𝑎16

𝑎26

𝑎′′𝛾1,2,6

(c) Grouping of actor fir-ings corresponding tothe CSDF compositeactor 𝑎

′′

𝛾1,2,6

𝑎1′′′

𝛾1,2,6

𝑎3′′′

𝛾1,2,6

𝑎2′′′

𝛾1,2,6

𝑖1 𝑜1

𝑖2𝑜2

𝑎11

𝑎12

𝑎22

𝑎16

𝑎26

𝑎′′′𝛾1,2,6

(d) Grouping of actor fir-ings corresponding tothe CSDF compositeactor 𝑎

′′′

𝛾1,2,6

Figure 4.24: Marked graph expansion of the cluster 𝑔𝛾1,2,6and groupings of actor

firings representing refinements of the cluster to different CSDFcomposite actors

(edges of the marked graph that are part of cycle are bolded) in violation of thefirst condition of Theorem 4.2. However, the groupings of actor firings depictedin Figures 4.24b to 4.24d are valid but do not correspond to a safe clusterrefinement operation. This can be observed by noting that the groupings ofactor firings for the three composite actors 𝑎′

𝛾1,2,6, 𝑎′′

𝛾1,2,6, and 𝑎

′′′𝛾1,2,6

each haveat least one violation of the second condition of Theorem 4.2.

117

Page 126: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

2 2

2

𝑓⟨𝑎6 ⟩

#𝑖2 ≥

1 ∧#𝑜2 ≥

1/

𝑓 ⟨𝑎6⟩

#𝑖 2≥1∧#

𝑜 2≥1/

#𝑖 1≥

2∧#𝑜 1≥

2/𝑓 ⟨

𝑎1,𝑎

2,𝑎

2⟩

𝑐′

𝑖1

𝑜2

𝑜1

𝑖2

𝑞1

𝑞0

𝑞2

𝑎1

𝑎6

𝑎2

𝑎′

𝛾1,2,6

(a) CSDF compositeactor 𝑎

𝛾1,2,6us-

ing an FSM rep-resentation

2 2

2

#𝑖2 ≥

1 ∧#𝑜2 ≥

1/

𝑓⟨𝑎6 ⟩

#𝑖 2≥1∧#

𝑜 2≥1/

𝑓 ⟨𝑎6⟩

#𝑖 1≥

2∧#𝑜 1≥

2/𝑓 ⟨

𝑎1,𝑎

2,𝑎

2⟩

𝑐′′

𝑖1

𝑜2

𝑜1

𝑖2

𝑞1

𝑞0

𝑞3

𝑎2𝑎1

𝑎6

𝑎′′

𝛾1,2,6

(b) CSDF compositeactor 𝑎

′′

𝛾1,2,6us-

ing an FSM rep-resentation

2 2

2

#𝑖2 ≥

1 ∧#𝑜2 ≥

1/

𝑓⟨𝑎6 ⟩

𝑓 ⟨𝑎6⟩

#𝑖 2≥1∧#

𝑜 1≥1/

#𝑖 1≥

2∧#𝑜 1≥

2/𝑓 ⟨

𝑎1,𝑎

2,𝑎

2⟩

𝑐′′′

𝑖1

𝑜2

𝑜1

𝑖2

𝑞0

𝑞3

𝑞4

𝑎2𝑎1

𝑎6

𝑎′′′

𝛾1,2,6

(c) CSDF compositeactor 𝑎

′′′

𝛾1,2,6us-

ing an FSM rep-resentation

2

2 2

𝑓⟨𝑎6 ⟩

𝑓⟨𝑎6 ⟩

𝑓 ⟨𝑎6⟩

𝑓 ⟨𝑎6⟩

𝑓 ⟨𝑎1,𝑎

2,𝑎

2⟩ 𝑓 ⟨𝑎1,𝑎

2,𝑎

2⟩𝑓 ⟨𝑎1,𝑎

2,𝑎

2⟩

𝑐′

𝑐′′

𝑐′′′

𝑖1

𝑜2

𝑜1

𝑖2

𝑞1

𝑞0

𝑞2

𝑞3

𝑞4

𝑎2𝑎1

𝑎6

(d) Composite actorimplementing anFSM-based QSS

Figure 4.25: Depicted are various CSDF composite actors for the cluster 𝑔𝛾1,2,6

as well as an FSM-based composite actor derived via a safe clus-ter refinement operation for the cluster 𝑔𝛾1,2,6

. As can be seen, theFSM-based QSS subsumes all three orderings of the CSDF phases⟨𝑎11, 𝑎12, 𝑎22⟩, ⟨𝑎16⟩, and ⟨𝑎26⟩ represented by the CSDF composite ac-tors 𝑎′

𝛾1,2,6, 𝑎′′

𝛾1,2,6and 𝑎

′′′𝛾1,2,6

.

To exemplify, for the CSDF composite actors 𝑎′𝛾1,2,6

and 𝑎′′𝛾1,2,6

, the orderingof CSDF phase ⟨𝑎16⟩ before phase ⟨𝑎11, 𝑎12, 𝑎22⟩ requires according to the secondcondition of Theorem 4.2 a delay-less directed path from the input actor firing𝑎16 to the output actor firing 𝑎12 in the original marked graph expansion of thesubcluster 𝑔𝛾1,2,6

. However, such a path is missing. Furthermore, for the CSDFcomposite actor 𝑎′′′

𝛾1,2,6, the ordering of CSDF phase ⟨𝑎16⟩ after phase ⟨𝑎11, 𝑎12, 𝑎22⟩

requires according to the second condition of Theorem 4.2 a delay-less directedpath from the input actor firing 𝑎11 to the output actor firing 𝑎16 in the originalmarked graph expansion of the subcluster 𝑔𝛾1,2,6

. But this path is also missing.Hence, all possible CSDF composite actors with the three given phases are

unsafe. And, thus, none of the three CSDF composite actors can be used in bothcluster environments 𝑔ex5a and 𝑔ex5b without introducing at least in one of thecluster environments an artificial deadlock. For the cluster environments 𝑔ex5a ,the CSDF composite actor 𝑎′

𝛾1,2,6can be used for a deadlock-free execution.

118

Page 127: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.3 Cluster Refinement in Static Data Flow Graphs

However, for the cluster environments 𝑔ex5b , all of the three CSDF compositeactors 𝑎

′𝛾1,2,6

, 𝑎′′𝛾1,2,6

, or 𝑎′′′𝛾1,2,6

introduce an artificial deadlock. This seems tocontradict the following theorem, which was also stated at the beginning of thissubsection.

Theorem 4.3 (CSDF Composite Actor Sufficiency For Known CSDF ClusterEnvironments). If a cluster environment 𝑔n containing only SDF or CSDF actorsof a cluster 𝑔𝛾 is known in advance, then a refinement into a CSDF compositeactor 𝑎𝛾 is sufficient to avoid the introduction of an artificial deadlock by therefinement operation ( 𝑔n, 𝑎𝛾 ) = 𝜉𝐶𝑆𝐷𝐹 (𝑔n, 𝑔𝛾 ) for the cluster environment 𝑔n.

The contradiction can be resolved by considering CSDF composite actorscontaining more actor firings in all their CSDF phases than is strictly requiredby the repetition vector of the cluster from which the composite actor wasderived. Such a CSDF composite actor can always be derived from an iterationof the CSDF graph 𝑔n.

Proof. Given a consistent and deadlock-free SDF or CSDF graph 𝑔n and itssubcluster 𝑔𝛾 , then a Periodic Static-Order Schedule (PSOS) 𝑎* for the graph𝑔n can be calculated [BELP96, LM87b]. Furthermore, the sequence of ac-tor firings 𝑎 of an iteration of the graph 𝑔n can be partitioned, that is 𝑎 =𝑎1𝑔n

a𝑎1𝑔𝛾

a𝑎2𝑔n

a𝑎2𝑔𝛾

a. . .𝑎𝑛

𝑔na𝑎𝑛

𝑔𝛾a𝑎𝑛+1

𝑔n , into (possibly empty) subsequences 𝑎𝑗𝑔n of

actor firings containing only actor firings of the actors in the cluster environmentand subsequences 𝑎𝑘

𝑔𝛾 of actor firings containing only actor firings of actors inthe subcluster 𝑔𝛾 . Given such a partitioning, a CSDF composite actor with 𝑛CSDF phases can be constructed by deriving each phase 𝑎𝑘𝛾 from the subse-quence 𝑎𝑘

𝑔𝛾 . Then, a PSOS for the refined graph 𝑔n is given by the followingperiodic sequence of actor firings: (𝑎1

𝑔n

a⟨𝑎1𝛾 ⟩a𝑎2𝑔n

a⟨𝑎2𝛾 ⟩a. . .𝑎𝑛

𝑔na⟨𝑎𝑛𝛾 ⟩

a𝑎𝑛+1𝑔n )*

To exemplify, for the SDF graph 𝑔ex5b shown in Figure 4.23b, the followingPSOS 𝑎* can be derived via methods given in [LM87b] and its iteration par-titioned into alternating subsequences containing only actor firings from thecluster environment and subsequences containing only actor firings from thecluster.

𝑎* = ⟨𝑎1, 𝑎2, 𝑎2, 𝑎env2 , 𝑎6, 𝑎6, 𝑎6, 𝑎6, 𝑎env1 , 𝑎env1 , 𝑎1, 𝑎2, 𝑎2⟩*

𝑎 = 𝑎1𝑔ex5b

a𝑎1𝑔𝛾1,2,6

a𝑎2𝑔ex5b

a𝑎2𝑔𝛾1,2,6

a𝑎3𝑔ex5b

a𝑎3𝑔𝛾1,2,6

a𝑎4𝑔ex5b

𝑎1𝑔ex5b

= ⟨⟩ 𝑎1𝑔𝛾1,2,6

= ⟨𝑎1, 𝑎2, 𝑎2⟩𝑎2𝑔ex5b

= ⟨𝑎env2⟩ 𝑎2𝑔𝛾1,2,6

= ⟨𝑎6, 𝑎6, 𝑎6, 𝑎6⟩𝑎3𝑔ex5b

= ⟨𝑎env1 , 𝑎env1⟩ 𝑎3𝑔𝛾1,2,6

= ⟨𝑎1, 𝑎2, 𝑎2⟩𝑎4𝑔ex5b

= ⟨⟩

119

Page 128: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

Hence, the CSDF composite actor for the cluster environment 𝑔ex5b has thethree CSDF phases ⟨𝑎11, 𝑎12, 𝑎22⟩, ⟨𝑎16, 𝑎26, 𝑎36, 𝑎46⟩, and ⟨𝑎21, 𝑎32, 𝑎42⟩. Furthermore, dueto the construction methodology employed in the proof, it will not introduceany deadlocks in the refined graph 𝑔ex5b . Note, however, that the resultingCSDF composite actor contains twice the number of actor firings as requiredby the repetition vector 𝜂rep

𝛾1,2,6= (𝜂rep1 , 𝜂rep2 , 𝜂rep6 ) = (1, 2, 2) of the cluster 𝑔𝛾1,2,6

.Note also that the CSDF composite actor for the cluster environment 𝑔ex5bcan be assembled from two cycles 𝑐

′ and 𝑐′′′ as implemented in the composite

actors 𝑎′𝛾1,2,6

and 𝑎′′′𝛾1,2,6

as given in Figure 4.25, respectively. Finally, one canobserve that the three CSDF composite actors 𝑎′

𝛾1,2,6, 𝑎′′

𝛾1,2,6, and 𝑎

′′′𝛾1,2,6

can becombined into the composite actor depicted in Figure 4.25d implementing anautomata-based QSS supporting all possible combinations of the cycles in theCSDF composite actors 𝑎′

𝛾1,2,6, 𝑎′′

𝛾1,2,6, and 𝑎

′′′𝛾1,2,6

.On the other hand, as demonstrated by the cluster 𝑔𝛾1,2,6

and the two clusterenvironments 𝑔ex5a and 𝑔ex5b shown in Figure 4.23, a refinement into a CSDFcomposite actor is insufficient to guarantee a safe cluster refinement operation foran arbitrary static cluster. That is to say, a refinement into a CSDF compositeactor is insufficient to prevent deadlock for all possible cluster environments ofthe arbitrary static cluster even if the cluster environment contains only SDFor CSDF actors.

4.4 Cluster Refinement in Dynamic Data Flow Graphs

In this section, clustering of a DFG that consists of a mix of dynamic andstatic actors is considered. Such a DFG can be obtained via the classificationmethodology presented in Section 3.7. This methodology presents a sufficientcondition to identify static data flow actors in a network graph. An example ofsuch a classified network graph is given in Figure 4.26. It contains both staticdata flow actors (shaded vertices), i.e., HSDF, SDF, and CSDF actors withconstant consumption and production rates, as well as dynamic data flow actorslike the Parser and the ImageSink which are modeled by Kahn processes[Kah74].The shaded actors, JPEG Source, IDCT, and Inverse ZigZag represent

static actors (or subclusters containing only static actors, e.g., the IDCT sub-cluster from Figure 4.10) while the remaining actors could not be classified asstatic actors. In the following, safe cluster refinements according to Defini-tion 4.3 are considered for static subclusters of a DFG consisting of a mix ofdynamic and static actors. Each cluster refinement of a static subcluster resultsin a composite actor implementing a QSS for the actors in the subcluster.

120

Page 129: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

Source ZRLInverse Inverse

Quant.

Image Frame Inverse

Decoder

Sink Shuffler ZigZagIDCT

JPEG Parser Huff.

Figure 4.26: Network graph of a Motion-JPEG decoder including dynamic(white) as well as static data flow actors (shaded)

As shown in Section 4.3, only safe cluster refinement operations can be con-sidered for refining a static subcluster in an unknown cluster environment ora dynamic data flow cluster environment. Otherwise, the refinement operationmay introduce artificial deadlocks into the resulting refined DFG.45

Here, the main goal of the refinement operation is the improvement of theperformance of the whole graph in terms of latency and throughput. This goalis achieved via the more efficient QSS scheme implemented by the compositeactor in contrast to the overhead (cf. Section 4.1) induced by the dynamicscheduling of the static actors inside the cluster.Of course, a safe cluster refinement to SDF or CSDF actors as presented in

Sections 4.3.1 and 4.3.2 also reduces the overhead by implementing a PSOS forthe contained actors. However, the design space for safe cluster refinements toSDF or CSDF actors is limited by the Theorems 4.1 and 4.2. In contrast tothis, the design space for the safe cluster refinement operations presented inSections 4.4.2 and 4.4.3 is significantly larger. Indeed, the only requirement forthese safe cluster refinement operations is that the satisfaction of Definition 4.2does not require an accumulation of an infinite amount of tokens in at leastone channel contained in the refined cluster. This condition will be called theclustering condition in the following.Nonetheless, the cluster refinement operations presented in Sections 4.4.2

and 4.4.3 cannot refine arbitrary static clusters. Hence, the problem of par-titioning the set of all static actors in the network graph into a set of staticclusters conforming to the clustering condition has to be solved first. To ex-emplify, the actors IDCT and Inverse ZigZag of the Motion-JPEG decodernetwork graph form a static cluster conforming to the clustering condition whilethe static JPEG Source actor alone is a degenerated static cluster containingonly one actor.

45The original DFG is assumed to be deadlock-free. This, however, cannot be detected dueto the undecidability [Buc93] of deadlock freedom for dynamic DFGs.

121

Page 130: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

However, in the general case, partitioning the set of all static actors into con-forming static subclusters is a complex task of which the details are presented inthe next section. After such a partitioning has been computed, the refinementoperation presented in Sections 4.4.2 and 4.4.3 can be applied for each staticcluster in the partitioning. The scheduling step presented in Section 4.4.4 trans-lates the partial repetition vectors contained in the QSSs that are computed bythe algorithms presented in Sections 4.4.2 and 4.4.3 into sequences of staticallyscheduled actor firings.Definition 4.2 requires that the cluster is connected via channels of infinite

size to its cluster environment. If this no longer holds, the channel sizes of thechannels connecting the cluster and its environment have to be adjusted. Thisadjustment is presented in Section 4.4.6. Finally, after the model refinementpresented in this chapter have been performed, the methods for code generationas presented in Chapter 5 can be applied.

4.4.1 Clustering Condition

The cluster refinement operations presented in Sections 4.4.2 and 4.4.3 are basedon the generation of a finite state space representation of a QSS for a cluster.Hence, the number of states a cluster is required to exhibit in order to conform toDefinition 4.2 must be finite. To ensure such a finite state space, the clusteringcondition given below must be satisfied for any cluster.

Definition 4.4 (Clustering Condition [FKH+08*]). For a cluster 𝑔𝛾 and for foreach pair (𝑎in, 𝑎out) of input/output actors, there must exist a directed path 𝑝from the input actor 𝑎in to the output actor 𝑎out in the cluster 𝑔𝛾 . In otherwords, ∀(𝑎in, 𝑎out) ∈ 𝑔𝛾 .𝐴𝐼 × 𝑔𝛾 .𝐴𝑂 : ∃𝑝 = (𝑎1, 𝑎2, . . . 𝑎𝑛) ∈ 𝑔𝛾 .𝐴

* such that𝑎1 = 𝑎in, 𝑎𝑛 = 𝑎out, and ∀𝑘, 1 ≤ 𝑘 < 𝑛 : ∃𝑐 ∈ 𝑔𝛾 .𝐶 : (𝑎𝑘.𝑂 × { 𝑐 }) ∩ 𝑔𝛾 .𝐸 =∅ ∧ ({ 𝑐 } × 𝑎𝑘+1.𝐼) ∩ 𝑔𝛾 .𝐸 = ∅.

To exemplify, the cluster 𝑔𝛾5,6,7depicted in Figure 4.27 is considered.46 More

formally, the cluster 𝑔𝛾5,6,7= (𝒫 , 𝑉, 𝐸) is defined as follows:

𝒫 = 𝐼 ∪𝑂 = { 𝑖1 } ∪ { 𝑜1, 𝑜2 }𝑉 = 𝐴 ∪ 𝐶 = { 𝑎5, 𝑎6, 𝑎7 } ∪ { 𝑐7→6, 𝑐6→5 }𝐸 = { (𝑖1, 𝑎6.𝑖1), (𝑎6.𝑜1, 𝑜1), (𝑎5.𝑜1, 𝑜2) }∪

{ (𝑎7.𝑜1, 𝑐7→6), (𝑐7→6, 𝑎6.𝑖2), (𝑎6.𝑜2, 𝑐6→5), (𝑐6→5, 𝑎5.𝑖1) }

Now, to check if the cluster 𝑔𝛾5,6,7conforms to the clustering condition, its

set of input actors 𝐴𝐼 = { 𝑎6 } and its set of outputs actors 𝐴𝑂 = { 𝑎5, 𝑎6 }46The delay function has initially been introduced (cf. Definition 2.1 on page 8) as a function

defining a sequence of initial tokens for each channel. Hence, to be technically correct, thelength of the sequence has to be computed via the ’#’-operator.

122

Page 131: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

2

2

2

Production of two tokens prod(𝑎7.𝑜1) = 2cons(𝑎6.𝑖2) = 1

Output port 𝑜1 ∈ 𝑔𝛾5,6,7.𝑂 of cluster 𝑔𝛾5,6,7

Actor 𝑎5 ∈ 𝑔𝛾5,6,7.𝐴 of cluster 𝑔𝛾5,6,7

Consumption of two tokens cons(𝑎5.𝑖1) = 2on input port 𝑖1 of actor 𝑎5

Input port 𝑖1 ∈ 𝑎5.𝐼 of actor 𝑎5Channel 𝑐6→5 ∈ 𝑔𝛾5,6,7

.𝐶 of cluster 𝑔𝛾5,6,7

Two initial tokens #delay(𝑐6→5) = 2

𝑖1

𝑖 1𝑜 2

𝑜1

𝑜 1

𝑜1

𝑖1𝑖2𝑜1

𝑔𝛾5,6,7

𝑜2𝑎5

𝑎7 𝑎6

𝑐7→6

𝑐6→5

Figure 4.27: Cluster 𝑔𝛾5,6,7derived from adding an intermediate actor 𝑎7 to the

cluster 𝑔𝛾5,6depicted in Figure 4.22 on page 116

have to be determined. Then for each of the two pairs (𝑎6, 𝑎6) and (𝑎6, 𝑎5) ofinput/output actors, the existence of a directed path has to be verified. Forthe first pair (𝑎6, 𝑎6), this is trivially true. For the second pair (𝑎6, 𝑎5), thedirected path from the input actor 𝑎6 to the output actor 𝑎5 is 𝑝 = (𝑎6, 𝑎5) viathe channel 𝑐6→5. Hence, the subcluster 𝑔𝛾5,6,7

does conform to the clusteringcondition.In contract to this, the cluster 𝑔𝛾8,9

depicted in Figure 4.28 does not conformto the clustering condition. More formally, the cluster 𝑔𝛾8,9

= (𝒫 , 𝑉, 𝐸) isdefined as follows:

𝒫 = 𝐼 ∪𝑂 = { 𝑖1, 𝑖2 } ∪ { 𝑜1 }𝑉 = 𝐴 ∪ 𝐶 = { 𝑎8, 𝑎9 } ∪ { 𝑐8→9 }𝐸 = { (𝑖1, 𝑎8.𝑖1), (𝑎8.𝑜1, 𝑜1), (𝑎8.𝑜2, 𝑐8→9), (𝑐8→9, 𝑎9.𝑖1), (𝑖2, 𝑎9.𝑖2) }

The set of input and output actors of the cluster 𝑔𝛾8,9can be determined as

𝐴𝐼 = { 𝑎8, 𝑎9 } and 𝐴𝑂 = { 𝑎8 }, respectively. Hence, the existence of a directedpath for the two pairs (𝑎8, 𝑎8) and (𝑎9, 𝑎8) of input/output actors has to beverified. For the first pair (𝑎8, 𝑎8), this is trivially true. However, for the secondpair (𝑎9, 𝑎8), no such directed path from the input actor 𝑎9 to the output actor 𝑎8exists. Thus, the subcluster 𝑔𝛾8,9

does not conform to the clustering condition.For a potential cluster 𝑔𝛾 that does not satisfy the clustering condition, the

set of static actors 𝑔𝛾 .𝐴 can always be partitioned into subsets by consecutivelyremoving channels with the unbounded token accumulation problem from thecluster 𝑔𝛾 . This operation will result in a decomposition of the potential clusterto a set of clusters conforming to the clustering condition.To exemplify, the DFG 𝑔ex6 depicted in Figure 4.29 is considered. Consump-

tion and production rates of the static actors as well as initial tokens are not

123

Page 132: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑖2

Channel 𝑐8→9 ∈ 𝑔𝛾8,9.𝐶

𝑖1

𝑖2

𝑖 1𝑜 2𝑜 1

𝑔𝛾8,9Actor 𝑎8 ∈ 𝑔𝛾8,9

.𝐴 of subcluster 𝑔𝛾8,9

Cluster input port 𝑖1 ∈ 𝑔𝛾8,9.𝐼

𝑖1

Cluster output port 𝑜1 ∈ 𝑔𝛾8,9.𝑂

𝑜1

𝑎9

𝑎8

𝑐8→9

Figure 4.28: Example cluster 𝑔𝛾8,9for non-conformance to the clustering condi-

tion

shown in Figure 4.29 as this information is not required for the cluster parti-tioning heuristic presented in this section. Note also that the clusters depictedin Figures 4.11, 4.12, 4.17, 4.23, 4.27, and 4.28 are all subgraphs of 𝑔ex6 and areall—apart from 𝑔𝛾8,9

—valid clusters conforming to the clustering condition.

𝑎1

𝑎env2

𝑎2

𝑎6 𝑎5

𝑎3 𝑎4𝑎7 𝑎env4

𝑎8

𝑎9

𝑎13

𝑎10

𝑎11

𝑎12

𝑎env3

𝑎env1

Figure 4.29: Example DFG 𝑔ex6 consisting of a mix of dynamic and static(shaded vertices) data flow actors

The goal of the presented heuristic is the formation of as large as possible clus-ters conforming to the clustering condition given in Definition 4.4. The heuristicstarts by partitioning the set of static actors 𝑔.𝐴𝑆 ⊆ 𝑔.𝐴 of a DFG 𝑔 into clus-ter candidates. This is done by removing all dynamic actors from the DFG andfinding connected components in the resulting graph. These connected compo-nents form the cluster candidates. In the general case, however, these clustercandidates will not satisfy the clustering condition. Hence, they have to be par-

124

Page 133: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

titioned into clusters satisfying the clustering condition by removing edges fromthe cluster candidates. Removal of edges will lead to smaller connected compo-nents forming smaller cluster candidates. The heuristic will terminate when allcluster candidates satisfy the clustering condition. Note that the heuristic willalways terminate as the removal of all edges will result in the situation that eachstatic actor is contained in its own unique cluster conforming to the clusteringcondition. This would even work if the edges to remove are chosen randomly.In the following, a greedy heuristic is presented that minimizes the number ofremoved edges:

∙ The heuristic starts by first replacing all dynamic actors in the originaldynamic DFG by generic 𝑎src and 𝑎snk placeholders. To exemplify, theDFG 𝑔ex6 is considered. The resulting graph after placeholder substitutionis depicted in Figure 4.30.

𝑎1 𝑎2

𝑎6 𝑎5 𝑎8

𝑎13

𝑎10

𝑎11

𝑎12

𝑎snk

𝑎src

𝑎3𝑎4 𝑎7 𝑎9

Figure 4.30: Modified graph 𝑔ex6 after replacement of all dynamic actors in withthe 𝑎src and 𝑎snk placeholders

∙ The heuristic proceeds with the identification of cluster candidates. Thesewill be determined by removing the 𝑎src and 𝑎snk placeholder actors andfinding connected components in the resulting graph. These connectedcomponents form the cluster candidates. To exemplify, the graph depictedin Figure 4.30 is again considered. After removal of 𝑎src and 𝑎snk, the graph

125

Page 134: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

decays into the two connected components 𝑔𝛾1−7,10−13and 𝑔𝛾8,9

as depictedin Figure 4.31.

𝑔𝛾8,9

𝑔𝛾1−7,10−13

𝑎1 𝑎2

𝑎6 𝑎5 𝑎8

𝑎13

𝑎10

𝑎11

𝑎12

𝑎snk

𝑎src

𝑎3𝑎4 𝑎7 𝑎9

Figure 4.31: Initial cluster candidates for 𝑔ex6

∙ After the cluster candidates have been identified, the set of input andoutput actors for each cluster candidate is computed. To exemplify, forthe cluster candidates 𝑔𝛾1−7,10−13

and 𝑔𝛾8,9depicted in Figure 4.31, the set

of input and output actors is as follows:

𝑔𝛾1−7,10−13.𝐴𝐼 = { 𝑎1, 𝑎4, 𝑎11 }

𝑔𝛾1−7,10−13.𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

𝑔𝛾8,9.𝐴𝐼 = { 𝑎8, 𝑎9 }

𝑔𝛾8,9.𝐴𝑂 = { 𝑎8 }

As can be seen, input actors are connected directly to the 𝑎src placeholderactor while output actors are connected directly to the 𝑎snk placeholderactor. Furthermore, the set of input and output actors of a cluster are notnecessarily disjoint sets, e.g., the actors 𝑎8 and 𝑎11 are contained in bothinput and output actor sets of their respective clusters.

∙ After the sets of cluster input and output actors have been determinedfor each cluster candidate, each actor will be annotated with the a set

126

Page 135: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

of cluster input actors 𝐴𝐼 and a set of cluster output actors 𝐴𝑂. Toexemplify, the annotated graph depicted in Figure 4.32 is considered. Thesets of cluster input and output actors are annotated as 𝐴𝐼 = { . . . } and𝐴𝑂 = { . . . } at the corresponding actors of the cluster candidates, e.g.,𝐴𝐼 = { 𝑎8 } and 𝐴𝑂 = { 𝑎8 } at actor 𝑎8 of the cluster candidate 𝑔𝛾8,9

. Inthe following, 𝑎.𝐴𝐼 and 𝑎.𝐴𝑂 will be used to refer to the two sets of clusterinput and output actors which are annotated at actor 𝑎.

𝑔𝛾8,9

𝑔𝛾1−7,10−13

𝑎11

𝑎13

𝑎12

𝑎src

𝑎5

𝑎8

𝑎9

𝑎2𝑎1

𝑎10𝑎6

𝑎snk

𝑎3

𝑎7𝑎4

𝐴𝐼 = { 𝑎8, 𝑎9 }𝐴𝑂 = ∅

𝐴𝐼 = { 𝑎8 }𝐴𝑂 = { 𝑎8 }

𝐴𝐼 = { 𝑎1, 𝑎4, 𝑎11 }𝐴𝑂 = { 𝑎11, 𝑎12 }

𝐴𝐼 = { 𝑎1, 𝑎4, 𝑎11 }𝐴𝑂 = { 𝑎11, 𝑎12 }

𝐴𝐼 = { 𝑎1, 𝑎4, 𝑎11 }𝐴𝑂 = { 𝑎11, 𝑎12 }

𝐴𝐼 = { 𝑎1, 𝑎4 }𝐴𝑂 = { 𝑎11, 𝑎12 }

𝐴𝐼 = { 𝑎1, 𝑎4 }𝐴𝑂 = { 𝑎11, 𝑎12 }

𝐴𝐼 = { 𝑎1, 𝑎4 }𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

𝐴𝐼 = { 𝑎1, 𝑎4 }𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }𝐴𝐼 = { 𝑎4 }

𝐴𝐼 = { 𝑎1, 𝑎4 }𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

𝐴𝐼 = ∅𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }𝐴𝐼 = { 𝑎4 }

Figure 4.32: DFG 𝑔ex6 with annotated predecessor cluster input actors 𝐴𝐼 andsuccessor cluster output actors 𝐴𝑂

The annotated sets 𝑎.𝐴𝐼 and 𝑎.𝐴𝑂 correspond to the set of cluster inputactors that are reachable by traversing the graph in backward directionstarting from actor 𝑎 as well as the set of cluster output actors that arereachable by traversing the graph in forward direction starting from actor𝑎. To exemplify, for actor 𝑎9 no output actors are reachable by a forwardtraversal of the graph as the actor has no outgoing edges. However, theinput actor 𝑎8 is reachable by a backward traversal of the graph via theedge 𝑎8 → 𝑎9. Hence, the annotated sets are 𝑎9.𝐴𝑂 = ∅ and 𝑎9.𝐴𝐼 = { 𝑎8 }.An actor is always reachable from itself. Thus, the cluster input andoutput actor 𝑎8 is itself contained in both its annotated sets. However,the cluster input actor 𝑎9 is not reachable by a backward traversal ofthe graph as no directed path from 𝑎9 to 𝑎8 is contained in the clustercandidate 𝑔𝛾8,9

. Hence, the annotated sets of actor 𝑎8 are 𝑎8.𝐴𝑂 = { 𝑎8 }and 𝑎8.𝐴𝐼 = { 𝑎8 }.

127

Page 136: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

∙ After all actors of the cluster candidates are annotated with their sets ofcluster input and output actors, equivalence classes are computed for eachcluster candidate. An equivalence class 𝐴EC𝑛 is a set of actors of a clustercandidate with identical annotated sets 𝐴𝐼 and 𝐴𝑂. More formally, theequivalence class 𝐴EC𝑛 of the cluster candidate 𝑔𝛾 containing actor 𝑎′ is𝐴EC𝑛 = { 𝑎 ∈ 𝑔𝛾 .𝐴 | 𝑎.𝐴𝐼 = 𝑎′.𝐴𝐼 ∧ 𝑎.𝐴𝑂 = 𝑎′.𝐴𝑂 }. An equivalence classhas the property that removal of edges connecting actors of the equivalenceclass will never be required in order to partition the static actors intoclusters satisfying the clustering condition.47 Hence, these edges can beremoved from consideration by coalescing all actors of an equivalence classinto a single vertex.To exemplify, Figure 4.33 depicting the equivalence classes derived fromthe annotated DFG from Figure 4.32 is considered. As can be seen inFigure 4.32, the actors 𝑎1, 𝑎2, and 𝑎6 have identical annotated sets 𝐴𝐼 ={ 𝑎1, 𝑎4 } and 𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }. Hence, all three actors are representedby the single equivalence class 𝐴EC3 in Figure 4.33.

𝑔𝛾8,9

𝑔𝛾1−7,10−13

𝑎src

𝑎snk

𝐴EC6 = { 𝑎8 }𝐴𝐼 = { 𝑎8 }𝐴𝑂 = { 𝑎8 }

𝐴EC7 = { 𝑎9 }𝐴𝐼 = { 𝑎8, 𝑎9 }𝐴𝑂 = ∅

𝐴EC4 = { 𝑎5, 𝑎10 }𝐴𝐼 = { 𝑎1, 𝑎4 }𝐴𝑂 = { 𝑎11, 𝑎12 }

𝐴EC5 = { 𝑎11, 𝑎12, 𝑎13 }𝐴𝐼 = { 𝑎1, 𝑎4, 𝑎11 }𝐴𝑂 = { 𝑎11, 𝑎12 }

𝐴EC3 = { 𝑎1, 𝑎2, 𝑎6 }𝐴𝐼 = { 𝑎1, 𝑎4 }𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

𝐴EC2 = { 𝑎7 }𝐴𝐼 = ∅𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

𝐴EC1 = { 𝑎3, 𝑎4 }𝐴𝐼 = { 𝑎4 }𝐴𝑂 = { 𝑎6, 𝑎11, 𝑎12 }

Figure 4.33: Equivalence classes for the DFG 𝑔ex6

∙ After the computation of the set of equivalence classes, clashes betweendifferent equivalence classes are determined. If two equivalence classes𝐴EC𝑛 and 𝐴EC𝑚 have a clash, then all paths connecting 𝐴EC𝑛 and 𝐴EC𝑚

have to be cut. The clash relation is a symmetric relation that evaluatesto true if the following three conditions are satisfied:

47However, equivalence classes may change if edges are removed as this will change the set ofcluster input and output actors of the updated cluster candidates as well as the annotatedsets.

128

Page 137: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

– 𝐴EC𝑛 and 𝐴EC𝑚 have a directed path w.l.o.g. from 𝐴EC𝑛 to 𝐴EC𝑚

– 𝐴EC𝑛.𝐴𝐼 = 𝐴EC𝑚.𝐴𝐼 ∧ 𝐴EC𝑛.𝐴𝑂 = 𝐴EC𝑚.𝐴𝑂

– ∃𝑎out ∈ 𝐴EC𝑛.𝐴𝑂 ∖ 𝐴EC𝑚.𝐴𝑂 : 𝐴EC𝑚.𝐴𝐼 ⊆ 𝑎out.𝐴𝐼

In other words, two equivalence classes 𝐴EC𝑛 and 𝐴EC𝑚 do not clash ifthere is no directed path from 𝐴EC𝑛 to 𝐴EC𝑚 or from 𝐴EC𝑚 to 𝐴EC𝑛. Inthe following, it will be assumed that the directed path is from 𝐴EC𝑛 to𝐴EC𝑚.48 Moreover, if the equivalence classes 𝐴EC𝑛 and 𝐴EC𝑚 have iden-tical sets of predecessor input actors (𝐴EC𝑛.𝐴𝐼 = 𝐴EC𝑚.𝐴𝐼) or identicalsets of successor output actors (𝐴EC𝑛.𝐴𝑂 = 𝐴EC𝑚.𝐴𝑂), then they do notclash. Finally, a clash is present if there exists a cluster output actor𝑎out ∈ 𝐴EC𝑛.𝐴𝑂 ∖𝐴EC𝑚.𝐴𝑂 reachable from equivalence class 𝐴EC𝑛 but notfrom equivalence class 𝐴EC𝑚 that does not require a cluster input actor𝑎in ∈ 𝐴EC𝑚.𝐴𝐼 ∖ 𝑎out.𝐴𝐼 which is required by equivalence class 𝐴EC𝑚.

To exemplify, the cluster candidate 𝑔𝛾8,9and its two equivalence classes

𝐴EC6 and 𝐴EC7 are considered. This cluster has been used before (cf. Fig-ure 4.28) as an example of non-conformance to the clustering condition.Hence, as expected, these two equivalence classes clash. The equivalenceclasses 𝐴EC6 and 𝐴EC7 are connected by a directed path from 𝐴EC6 to𝐴EC7 and neither the set of cluster input actors nor the set of cluster out-put actors of 𝐴EC6 and 𝐴EC7 are identical. Furthermore, a cluster outputactor 𝑎8 reachable from 𝐴EC6 but not from 𝐴EC7 can be identified whichdoes not require the cluster input actor 𝑎9 which is required by equiva-lence class 𝐴EC7. To exemplify, the clash relation clash for all equivalenceclasses of 𝑔ex6 is given in Figure 4.34.

∙ After the clashes between different equivalence classes have been deter-mined, they must be resolved. Resolution of clashes is performed by cut-ting all paths between clashing equivalence classes. To remove the leastnumber of edges possible, a min-cut algorithm [SW97] is used to cut thesepaths.

4.4.2 Automata-Based Clustering

Now that the static clusters of the DFG are identified, these clusters will bereplaced by their corresponding composite actor implementing a QSS for thestatic actors contained in the cluster. In this section, an algorithm is presented48Note that the graph induced by the equivalence classes, e.g., 𝐹𝑖𝑔𝑢𝑟𝑒 4.33 for the cluster

candidate under evaluation, must be an acyclic graph since all actors of a strongly con-nected component in the DFG of the cluster candidate must be in the same equivalenceclass. Thus, the equivalence classes 𝐴EC𝑛 and 𝐴EC𝑚 can simply be swapped if the directedpath is from 𝐴EC𝑚 to 𝐴EC𝑛.

129

Page 138: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝐴EC𝑛clash(𝐴EC𝑛, 𝐴EC𝑚); 𝐴EC𝑚 =

𝐴EC1 𝐴EC2 𝐴EC3 𝐴EC4 𝐴EC5

𝐴EC1 f f f f t𝐴EC2 f f f f t𝐴EC3 f f f f t𝐴EC4 f f f f f𝐴EC5 t t t f f

(a) The clash relation for the equivalence classes of thecluster candidate 𝑔𝛾1−7,10−13

𝐴EC𝑛𝐴EC𝑚 =

𝐴EC6 𝐴EC7

𝐴EC6 f t𝐴EC7 t f

(b) The clash relation for theequivalence classes of the clus-ter candidate 𝑔𝛾8,9

Figure 4.34: The clash relation for all equivalence classes of 𝑔ex6

which computes for a static cluster conforming to the clustering condition acluster FSM representing a QSS. The actions on the transitions of the clusterFSM are sequences of statically scheduled actor firings that represent the staticpart of a QSS schedule while the run-time selection of the transition to executeis representing the dynamic part of the QSS schedule.As mentioned previously, the cluster refinement operations presented in this

and the next section are based on a finite state space representation of thecluster. Thus, the state of a cluster and its state space are considered in moredetail. The set of states a cluster is required to exhibit in order to conformto Definition 4.2 will be called its cluster state space. For the cluster FSMrepresentation of the QSS, the cluster state space corresponds to the state spaceof the cluster FSM.For the purpose of a refinement of a static cluster, the state of a cluster can

be abstracted from the actor functionality state (cf. Definition 3.1 on page 44)of the actors contained in the cluster because the actor functionality state ofa static actor does not influence its communication behavior. Moreover, dueto the independence of the communication behavior of the static actors in thecluster from the actual values which are consumed and produced by these actors,the state of a cluster can be abstracted from the sequences of token valuescontained in the channels of the cluster to the number of tokens contained inthe channels. Thus, the state of a cluster can be given by the number of tokens inall the channels contained in the cluster and the actor FSM state of each actorcontained in the cluster. The state space of each actor FSM is by definitionfinite. Hence, the only possibility of an infinite cluster state space is a possibleunbounded accumulation of tokens on a channel contained in the cluster.However, an alternative and more concise way to specify the state of a cluster

is used. More formally, the following definition of a cluster state can be given:

Definition 4.5 (Cluster State [FZHT13*]). The cluster state of a cluster 𝑔𝛾 isencoded by a vector 𝑞 = (𝜂in1 , 𝜂in2 , . . . 𝜂in|𝐴𝐼 |

, 𝜂out1 , 𝜂out2 , . . . 𝜂out|𝐴𝑂 |) ∈ N|𝐴𝐼∪𝐴𝑂|0

130

Page 139: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

containing the number of accumulated actor firings 𝜂in1 , 𝜂in2 , . . . 𝜂in|𝐴𝐼 |of all in-

put actors 𝐴𝐼 = { 𝑎in1 , 𝑎in2 , . . . 𝑎in|𝐴𝐼 |} of the cluster and the number of accu-

mulated actor firings 𝜂out1 , 𝜂out2 , . . . 𝜂out|𝐴𝑂 | of all output actors of the cluster{ 𝑎out1 , 𝑎out2 , . . . 𝑎out|𝐴𝑂 | }. Note that some of the actor firings 𝜂in and 𝜂out mayoverlap as the set of input actors and the set of output actors are in general notdisjoint sets.

Hence, the initial cluster state 𝑞0 of a cluster is encoded by the all zero vector0. Indeed, this representation of a cluster state does abstract from even moreinformation compared to a representation based on the state of each actor FSMand the number of tokens in all channels contained in the cluster. The missinginformation is the number of actor firings of all the intermediate actors of thecluster. Intermediate actors of a cluster are those actors of the cluster that areneither input nor output actors.Fortunately, this information is largely redundant for the cluster refinement

operations in Sections 4.4.2 and 4.4.3 as only the interaction with the clusterenvironment through the cluster’s input and output actors can be used to ad-equately capture the state of the cluster. However, in Section 4.4.4, where thesequences of statically scheduled actor firings of the QSS are determined, thenumber of firings 𝜂 ∈ N|𝐴|

0 of all actors in the cluster 𝑔𝛾 corresponding to acluster state 𝑞 ∈ N|𝐴𝐼∪𝐴𝑂|

0 is required.This information will be given by the function firings : N|𝐴𝐼∪𝐴𝑂|

0 → N|𝐴|0 . The

value 𝜂 = firings(𝑞) must, of course, contain the same number of firings for allinput and output actors as contained in the cluster state 𝑞 from which it wasderived, that is 𝜋𝐴𝐼∪𝐴𝑂

(firings(𝑞)) = 𝑞.49

To exemplify, the cluster 𝑔𝛾5,6,7depicted in Figure 4.27 is considered again.

This cluster is derived from the cluster 𝑔𝛾5,6(cf. Figure 4.22 on page 116)

by adding an intermediate actor 𝑎7. Due to identical input and output ac-tors, an analogous topology connecting them, as well as identical repetitioncounts 𝜂rep5 and 𝜂rep6 in the repetition vectors 𝜂rep

𝛾5,6= (𝜂rep5 , 𝜂rep6 ) = (1, 2) and

𝜂rep𝛾5,6,7

= (𝜂rep5 , 𝜂rep6 , 𝜂rep7 ) = (1, 2, 1), the clusters 𝑔𝛾5,6,7and 𝑔𝛾5,6

will have thesame cluster state representation 𝑞 = (𝜂5, 𝜂6) and the same three cluster states{ 𝑞0 = (0, 0), 𝑞1 = (1, 0), 𝑞2 = (1, 1) } (cf. Figure 4.22b on page 116).In contrast to cluster 𝑔𝛾5,6

, which is lacking any intermediate actors, thecluster 𝑔𝛾5,6,7

has multiple interpretations, that is multiple valid function valuesfor firings(𝑞1) and firings(𝑞2). An example of this is depicted in Figure 4.35.As can be seen, the cycle 𝑞1 → 𝑞2 → 𝑞1 must always contain one firing of the49The notation 𝜋𝑁 (𝑥) will be used to denote a projection of a vector 𝑥 = (𝑥1, 𝑥2, . . . 𝑥𝑛)

to a subset of its entries given by an (ordered) subset 𝑁 ⊆ ℐ(𝑥) of its index setℐ(𝑥) = { 1, 2, . . . 𝑛 }. The order of the index in the subset 𝑁 determines the or-der in which the entries 𝑥𝑘 appear in the resulting projected vector 𝜋𝑁 (𝑥), e.g.,𝜋{ 3,5,4 }(𝑥1, 𝑥2, 𝑥3, 𝑥4, 𝑥5, 𝑥6) = (𝑥3, 𝑥5, 𝑥4).𝜋𝑁 (𝑥)

131

Page 140: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

intermediate actor 𝑎7 as required by the repetition count 𝜂rep7 = 1. However,the transition where this firing is include can be freely chosen.

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓⟨𝑎6⟩

#𝑜 2≥

2/𝑓 ⟨

𝑎5⟩

𝑓⟨𝑎6,𝑎5,𝑎7⟩#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1 ∧#𝑜2 ≥ 2/

𝑞1 𝑞2

𝑞0

(a) firings(𝑞1) = (1, 0, 0);firings(𝑞2) = (1, 1, 0)

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓⟨𝑎6⟩

#𝑜 2≥

2/𝑓 ⟨

𝑎5,𝑎

7⟩

𝑓⟨𝑎6,𝑎5,𝑎7⟩#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1 ∧#𝑜2 ≥ 2/

𝑞0

𝑞1 𝑞2

(b) firings(𝑞1) = (1, 0, 1);firings(𝑞2) = (1, 1, 1)

#𝑜 2≥

2/𝑓 ⟨

𝑎5⟩

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓⟨𝑎6,𝑎7⟩

𝑓⟨𝑎6,𝑎5⟩#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1 ∧#𝑜2 ≥ 2/

𝑞0

𝑞1 𝑞2

(c) firings(𝑞1) = (1, 0, 0);firings(𝑞2) = (1, 1, 1)

Figure 4.35: Shown are three different interpretations for the cluster states 𝑞1

and 𝑞2 as well as the cluster FSMs corresponding to these interpre-tations. Note that the cluster state 𝑞0 will always be firings(𝑞0) =(𝜂5, 𝜂6, 𝜂7) = (0, 0, 0).

Furthermore, if the vector 𝜂src→dst is used to denote the number of actor firingsexecuted by the transition 𝑞src → 𝑞dst, then below given equation holds:50

(firings(𝑞src) + 𝜂src→dst) mod 𝜂rep𝛾 = firings(𝑞dst) (4.3)

To exemplify, the FSM depicted in Figure 4.35b is considered. For the transi-tion 𝑞0 → 𝑞1, the actors 𝑎5 and 𝑎7 are fired once. Hence, 𝜂0→1 = (𝜂5, 𝜂6, 𝜂7) =(1, 0, 1) and substituting into Equation (4.3) results in the below given calcula-tions:

(firings(𝑞0) + 𝜂0→1) mod 𝜂rep𝛾5,6,7

= firings(𝑞1)

((0, 0, 0) + (1, 0, 1)) mod (1, 2, 1) = firings(𝑞1)(1, 0, 1) mod (1, 2, 1) = firings(𝑞1)

(1, 0, 1) = firings(𝑞1)

To demonstrate the wrap around in the cluster state space, as induced by thecyclic nature of SDF or CSDF graphs, the transition 𝑞2 → 𝑞1 is considered. Thistransition fires each actor 𝑎5, 𝑎6, and 𝑎7 exactly once, that is 𝜂2→1 = (1, 1, 1).

50The notation 𝑞 mod 𝜂 is used to denote a modulo operation in a 𝑑-dimensional vectorspace. The modulo operation mod is defined for vectors 𝑞 ∈ Z𝑑 and 𝜂 ∈ N𝑑. It computesthe smallest vector 𝑞′ = 𝑞 mod 𝜂 such that 𝑞′ ≥ 0 and 𝑞 ≡ 𝑞′ (mod 𝜂). More formally,𝑞′ ∈ N𝑑

0 such that ∃𝑘 ∈ Z : 𝑞′ = 𝑞−𝑘 ·𝜂∧@𝑘′ > 𝑘 : 𝑞−𝑘′ ·𝜂 ≥ 0. Moreover, the ≥-relationbetween two vectors 𝑎 and 𝑏 with the same index set ℐ(𝑎) = ℐ(𝑏) denotes a partial order,i.e., 𝑎 ≥ 𝑏 ⇐⇒ ∀𝑘 ∈ ℐ(𝑎) : 𝑎(𝑘) ≥ 𝑏(𝑘).

132

Page 141: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

Substituting the relevant vectors into Equation (4.3) results in the below givencalculations:

(firings(𝑞2) + 𝜂2→1) mod 𝜂rep𝛾5,6,7

= firings(𝑞1)

((1, 1, 1) + (1, 1, 1)) mod (1, 2, 1) = firings(𝑞1)(2, 2, 2) mod (1, 2, 1) = firings(𝑞1)

(1, 0, 1) = firings(𝑞1)

Without considering the cyclic nature of the cluster state space, the number offirings of all actors in the cluster for the destination state 𝑞1 would be calculatedas firings(𝑞2) + 𝜂2→1 = (1, 1, 1) + (1, 1, 1) = (2, 2, 2).However, if each actor in the cluster 𝑔𝛾 has been fired exactly as many times

as specified by its repetition vector 𝜂rep𝛾 , then the cluster must be in the same

state as before. Hence, the cluster states after firing the actors in the cluster 𝜂+𝑛 · 𝜂rep

𝛾 ∀𝑛 ∈ Z times are all equivalent, e.g., (2, 2, 2) ≡ (1, 0, 1) (mod 𝜂rep𝛾5,6,7

).To compute canonical representations for a number of actor firings 𝜂 and acluster state 𝑞 the notations 𝜂 mod 𝜂rep

𝛾 and 𝑞 mod 𝜋𝐴𝐼∪𝐴𝑂(𝜂rep

𝛾 ) are used,respectively.

Cluster State Space

In the following, the state space of a cluster 𝑄𝛾 ⊂ N|𝐴𝐼∪𝐴𝑂|0 will be defined

precisely. The relevant point has been raised already in Section 4.2:

“The key idea for preserving deadlock freedom is to satisfy the worstcase environment of the cluster. This environment contains for eachoutput port 𝑜 ∈ 𝑔𝛾 .𝑂 and each input port 𝑖 ∈ 𝑔𝛾 .𝐼 a feedback loopwhere any produced token on the output port 𝑜 is needed for theactivation of an actor 𝑎 ∈ 𝑔𝛾 .𝐴 connected to the input port 𝑖.

In particular, postponing the production of an output token resultsin a deadlock of the entire system. Hence, to satisfy the equivalencecondition from Definition 4.2, the composite actor produced by acluster refinement must always produce the maximum number ofoutput tokens possible from the consumption of a minimum numberof input tokens.“

The above situation can be defined using the corresponding Kahn function𝑠out = 𝒦𝑔𝛾 (𝑠in) of the cluster 𝑔𝛾 . The Kahn function by definition producesmaximal sequences 𝑠out = (𝑠𝑜1 , 𝑠𝑜2 , . . . 𝑠𝑜|𝑂|) of output tokens on the cluster out-put ports { 𝑜1, 𝑜2, . . . 𝑜|𝑂| } ∈ 𝑂 from the given sequences 𝑠in = (𝑠𝑖1 , 𝑠𝑖2 , . . . 𝑠𝑖|𝐼|)of input tokens on the cluster input ports { 𝑖1, 𝑖2, . . . 𝑖|𝐼| } ∈ 𝐼. Hence, only

133

Page 142: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

the sequences 𝑠in have to be minimized in order to define the tuple of sig-nals 𝑠 = (𝑠in, 𝑠out) where the maximum number of output tokens are pro-duced from the consumption of a minimum number of input tokens, that is{ (𝑠in,𝒦𝑔𝛾 (𝑠in)) | 𝑠in ∈ 𝑆|𝐼| ∧ @𝑠′

in @ 𝑠in : 𝒦𝑔𝛾 (𝑠′in) = 𝒦𝑔𝛾 (𝑠in) }.

Finally, using the notation state : 𝑆|𝑃 | → 𝑄 to derive a cluster state 𝑞 =state(𝑠) from the tuple of signals 𝑠 = (𝑠in, 𝑠out), the following definition of thecluster state space can be derived:

Definition 4.6 (Cluster State Space). Given a deadlock-free and consistentstatic subcluster 𝑔𝛾 and its corresponding Kahn function 𝒦𝑔𝛾 , then its clusterstate space is 𝑄𝛾 = { state(𝑠in,𝒦𝑔𝛾 (𝑠in))mod𝜋𝐴𝐼∪𝐴𝑂

(𝜂rep𝑔𝛾 ) | 𝑠in ∈ 𝑆|𝐼|∧@𝑠′

in @

𝑠in : 𝒦𝑔𝛾 (𝑠′in) = 𝒦𝑔𝛾 (𝑠in) } ∪ { 𝑞0 } ⊂ N

|𝐴𝐼∪𝐴𝑂|0 . The initial cluster state 𝑞0 is

then trivially the all zero vector 0 representing the fact that, in the beginning,no actor of the cluster has fired.

Each cluster state 𝑞 ∈ 𝑄𝛾 represents a state of the cluster that has con-sumed a minimum number of input tokens. Furthermore, each cluster state𝑞 ∈ 𝑄𝛾 ∖ { 𝑞0 } has also produced a maximum number of output tokens fromthe consumption of a minimum number of input tokens. There may be one ex-ception to the assertion that a cluster state always contains the maximal numberof output actor firings, namely the initial cluster state 0 = 𝑞0 ∈ 𝑄𝛾 (the allzero vector). This exception is due to the fact that excess initial tokens may bestored on the internal channels of the cluster in the initial state to permit thefiring of cluster output actors without firing any cluster input actors. However,this exception can occur at most once per cluster as subsequent sequences ofstatically scheduled actor firings always produce the maximal number of outputactor firings from a minimal number of input actor firings, and thus, tokens willnot accumulate on internal channels. An example of such a situation are thethree transitions 𝑞0 → 𝑞1 shown in Figure 4.35.In general, the state transition graph of a cluster can be decomposed into (1)

the initial state, (2) an acyclic part (except the initial state), and (3) a stronglyconnected component. The acyclic part represents possible retiming sequencesto reach the cyclic behavior of the static DFG contained in the cluster. Thisbehavior is represented by the strongly connected component. In detail, (1) theinitial state may not have produced the maximum number of output tokens,whereas (2) each state of the acyclic part has produced the maximum numberof output tokens possible, but may still contain some excess tokens on internalchannels from the initial state, while (3) each state of the strongly connectedcomponent has both produced the maximum number of output tokens and allexcess tokens from the initial state have been used up.

134

Page 143: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

After having precisely defined the cluster state space, the connection betweenthe cluster state space and the clustering condition can now be formally statedas follows:

Theorem 4.4 (Finite Cluster State Space). A connected, consistent, and dead-lock-free static cluster 𝑔𝛾 has a finite cluster state space if and only if the clus-tering condition given in Definition 4.4 holds for the cluster.

The proof of the above theorem is separated into two parts. First, it is proventhat the clustering condition is a sufficient condition to guarantee a finite clusterstate space.

Proof. Let 𝑔𝛾 be a connected, consistent, and deadlock-free static cluster con-forming to the clustering condition given in Definition 4.4. Thus, there existsa directed path in the cluster from each input actor of the cluster to each out-put actor of the cluster. Hence, for each output actor 𝑎𝑛 ∈ 𝐴𝑂, there exists afiring count 𝜂bound,𝑛 of this actor which requires all input actors 𝐴𝐼 to performat least as many actor firings as encoded in the repetition vector 𝜂rep

𝛾 of thecluster. That is 𝑞(𝑎𝑛) ≥ 𝜂bound,𝑛 =⇒ 𝜋𝐴𝐼

(𝑞) ≥ 𝜋𝐴𝐼(𝜂rep

𝛾 ).51 Moreover, due tothe requirement to produce a maximum number of output tokens possible fromthe consumption of a minimum number of input tokens, if all input actors ofthe cluster have fired at least as many actor firings as encoded in the repetitionvector 𝜂rep

𝛾 of the cluster, i.e., 𝜋𝐴𝐼(𝑞) ≥ 𝜋𝐴𝐼

(𝜂rep𝛾 ), then all output actors of the

cluster will have done the same, i.e., 𝜋𝐴𝐼(𝑞) ≥ 𝜋𝐴𝐼

(𝜂rep𝛾 ) =⇒ 𝑞 ≥ 𝜋𝐴𝐼∪𝐴𝑂

(𝜂rep𝛾 ).

Hence, for all cluster states 𝑞 ∈ 𝑄𝛾 and for all output actors 𝑎𝑛 ∈ 𝐴𝑂, the num-ber of output actor firings 𝑞(𝑎𝑛) is bounded by 𝜂bound,𝑛. Otherwise, the followingcontradiction will result:

𝑞′ ∈ 𝑄𝛾 =⇒ 𝑞′ = 𝑞′ mod 𝜋𝐴𝐼∪𝐴𝑂(𝜂rep

𝛾 )

∃𝑎𝑛 ∈ 𝐴𝑂 : 𝑞′(𝑎𝑛) ≥ 𝜂bound,𝑛 =⇒ 𝜋𝐴𝐼(𝑞′) ≥ 𝜋𝐴𝐼

(𝜂rep𝛾 )

=⇒ 𝑞′ ≥ 𝜋𝐴𝐼∪𝐴𝑂(𝜂rep

𝛾 )

=⇒ 𝑞′ = 𝑞′ mod 𝜋𝐴𝐼∪𝐴𝑂(𝜂rep

𝛾 )

The existence of a bound for the number of input actor firings in a cluster statecan immediately be derived from the existence of the bound on the numberof output actor firings in the cluster state by defining the bound on the inputactor firings for each input actor to be the minimal number of input actor firingsrequired to execute 𝜂bound,𝑛 output actor firings for each output actor. Hence, thenumber of input and output actor firings for an arbitrary cluster state 𝑞 ∈ 𝑄𝛾

are all bounded. Thus, the cluster state space 𝑄𝛾 must be finite.

51Please remember, a vector, e.g., 𝑞 = (𝜂𝑎1 , 𝜂𝑎2 , . . . 𝜂𝑎𝑛 , . . . 𝜂𝑎|𝐴𝐼∪𝐴𝑂|), is always also consid-ered to be a function from its index set, e.g., here ℐ(𝑞) = 𝐴𝐼 ∪ 𝐴𝑂, to its entries, e.g.,𝑞(𝑎𝑛) = 𝜂𝑎𝑛 . Moreover, the ≥-relation between two vectors 𝑎 and 𝑏 with the same indexset ℐ(𝑎) = ℐ(𝑏) denotes a partial order, i.e., 𝑎 ≥ 𝑏 ⇐⇒ ∀𝑘 ∈ ℐ(𝑎) : 𝑎(𝑘) ≥ 𝑏(𝑘).

135

Page 144: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

To complete the proof of Theorem 4.4, it is now shown that the clusteringcondition is a necessary condition to guarantee a finite cluster state space.

Proof. Let 𝑔𝛾 be a connected, consistent, and deadlock-free static cluster notconforming to the clustering condition given in Definition 4.4. Due to the vi-olation of the clustering condition, there exists at least one pair (𝑎in, 𝑎out) ofinput/output actors without a directed path from the input actor 𝑎in to theoutput actor 𝑎out in the cluster 𝑔𝛾 . Hence, firings of the output actor 𝑎out areindependent of firings of the input actor 𝑎in. Thus, the state set of the clus-ter described by Definition 4.6 will contain an infinite amount of cluster states𝑞 ∈ 𝑄𝛾 where 𝑞(𝑎out) ≥ 𝑛, 𝑛 ∈ N but 𝑞(𝑎in) = 0. To exemplify, without loss ofgenerality, let the cluster input port 𝑖1 be connected to this input actor 𝑎in andno tokens are ever provided on this input port, but sufficient tokens are providedon all other input ports 𝑖2, . . . 𝑖|𝐼| for the output actor 𝑎out to fire 𝑛 times. Fur-thermore, due to the prerequisite that 𝑔𝛾 is a connected graph, the two actors𝑎in and 𝑎out are connected via a path 𝑝. Hence, at least one of the channelscontained in the path 𝑝 connecting the actors 𝑎in and 𝑎out will accumulate aninfinite number of tokens as 𝑛 approaches infinity.

Subsequently, the automata-based clustering algorithm that is presented in[FKH+08*] will be reviewed. This algorithm computes the cluster FSM 𝑎𝛾 .ℛ(cf. Definition 4.1 on page 90) of a composite actor 𝑎𝛾 ∈ 𝑔.𝐴 that is derived viathe cluster refinement operator 𝜉𝐹𝑆𝑀 , i.e., 𝜉𝐹𝑆𝑀(𝑔, 𝑔𝛾 ) = (𝑔, 𝑎𝛾 ). This clusterFSM implements an automata-based representation of a QSS for the actors 𝑔𝛾 .𝐴contained in the static subcluster 𝑔𝛾 of the original DFG 𝑔.The automata-based clustering algorithm works in three steps: (Step 1) com-

putes the set of input/output dependency function values by considering eachoutput actor in isolation to obtain the minimal number of input actor firingsrequired for 𝑛 firings of this output actor, (Step 2) derives the cluster statespace ℛ.𝑄 ≡ 𝑄𝛾 (cf. Definition 4.6) of the cluster FSM 𝑎𝛾 .ℛ from the setof input/output dependency function values, and (Step 3) obtains the set oftransitions ℛ.𝑇 of the cluster FSM 𝑎𝛾 .ℛ from the cluster state space 𝑄𝛾 .In order to avoid deadlocks, the worst case is assumed, i.e., each produced

output token causes the activation of an actor in the subcluster via a feedbackloop. Hence, it is required that the resulting QSS always produces a maximumnumber of output tokens from a minimum number of input tokens. The stateof the cluster 𝑔𝛾 after such a production is given by a cluster state 𝑞 ∈ 𝑄𝛾 .As an exhaustive evaluation (cf. Definition 4.6) is prohibitive, the cluster statespace will be derived one output actor at a time by computing (Step 1) a setof input/output dependency function values for each output actor in isolation(cf. Definition 4.8). Afterwards, in (Step 2), the cluster state space is generatedby a fixed-point operation (cf. Definition 4.10) that computes the pointwise

136

Page 145: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

maximum for all pairs of input/output dependency function values until no morenew values are generated. This fixed-point operation generates the necessaryinterleavings of output actor firings that have been previously neglected by onlyconsidering output actor firings in isolation. A cluster state is derived from aninput/output dependency function value by determining (cf. Definition 4.9) themaximal number of output actor firings possible with the minimal number ofinput actor firings given by an input/output dependency function value. Finally,in (Step 3), the transition set ℛ.𝑇 of the cluster FSM is derived by partiallyordering the cluster state set using the >-relation between two cluster states.A transition 𝑞src → 𝑞dst exists between two cluster states 𝑞dst > 𝑞src if there isno further state 𝑞′ between, i.e., @𝑞′ : 𝑞dst > 𝑞′ > 𝑞src, these two states. Moreprecisely, the cluster FSM for a consistent and deadlock-free subcluster 𝑔𝛾 isderived by following the three steps [FKH+08*] given below:

Step 1: Computation of Input/Output Dependency Function Values

First, the repetition vector 𝜂rep𝛾 for the subcluster 𝑔𝛾 is computed in order to

provide a termination criterion (cf. Equation (4.4) on page 139) for the compu-tation of the set of input/output dependency function values in Definition 4.8.To exemplify, for the subcluster 𝑔𝛾1,2,3

depicted in Figure 4.36 from [FZHT13*],the repetition vector is 𝜂rep

𝛾1,2,3= (𝜂rep1 , 𝜂rep2 , 𝜂rep3 ) = (1, 1, 1), i.e., fire all three

actors 𝑎1, 𝑎2, and 𝑎3 once.

𝑜1𝑖1

𝑖2 𝑜3

𝑜2𝑐2→1

𝑐3→2

𝑐2→3

𝑐1→2𝑎1

𝑎3

𝑎2 𝑎env

𝑔𝛾1,2,3

Figure 4.36: Example network graph 𝑔ex7 used for illustration of the automata-based clustering algorithm

As token productions may not be postponed, the proposed algorithm uses theso-called input/output dependency function, which encodes the minimal numberof input actor firings required for a given output actor 𝑎out and number ofoutput actor firings 𝑛. With this information, it is known how many tokensare required on the input channels in order to enable the number of requested

137

Page 146: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

output actor firings. More formally, the input/output dependency function isdefined as follows:

Definition 4.7 (Input/Output Dependency Function [FKH+08*]52). Given astatic cluster 𝑔𝛾 , its input/output dependency function dep : 𝐴𝑂 × N0 → N|𝐴𝐼 |

0

is a vector-valued function that associates with each output actor 𝑎out ∈ 𝐴𝑂

and number 𝑛 ∈ N0 of requested output actor firings, the minimal number ofinput actor firings (𝜂in1 , 𝜂in2 , . . . 𝜂in|𝐴𝐼 |

) ∈ N|𝐴𝐼 |0 required to produce the requested

number 𝑛 of firings of the output actor 𝑎out.

To exemplify, Table 4.1 lists values for the input/output dependency functiondep(𝑎out, 𝑛) of the static subcluster 𝑔𝛾1,2,3

depicted in Figure 4.36. In this case,the sets of input and output actors are 𝐴𝐼 = { 𝑎1, 𝑎3 } and 𝐴𝑂 = { 𝑎1, 𝑎2, 𝑎3 },respectively. As can be seen from column 𝑎out = 𝑎2, no input actor firings arerequired ((0, 0) = dep(𝑎2, 1)) in order to fire the output actor 𝑎2 once. However,for the second firing of the output actor 𝑎2, both input actors 𝑎1 and 𝑎3 haveto be fired once, i.e., (1, 1) = dep(𝑎2, 2). The columns 𝑎out = 𝑎1 and 𝑎out = 𝑎3represent similar information for the output actors 𝑎1 and 𝑎3.

dep(𝑎out, 𝑛)𝑛 𝑎out = 𝑎1 𝑎out = 𝑎2 𝑎out = 𝑎3

1 (1, 0) (0, 0) (0, 1)2 (2, 0) (1, 1) (0, 2)3 (3, 1) (1, 3)

Table 4.1: Input/output dependency function values (𝜂1, 𝜂3) = dep(𝑎out, 𝑛) forsubcluster 𝑔𝛾1,2,3

from Figure 4.36

Note that the number of input/output dependency function values that needto be considered is bounded by the repetition vector 𝜂rep

𝛾 (cf. Equation (4.4)).In principle, only values dep(𝑎out, 𝑛) ≥ 𝜋𝐴𝐼

(𝜂rep𝛾 ) need to be considered since

values dep(𝑎out, 𝑛) ≥ 𝜋𝐴𝐼(𝜂rep

𝛾 ) are redundant with respect to (mod 𝜂rep𝛾 ).

However, this would complicate the computation of the set of transitions ℛ.𝑇of the cluster FSM in (Step 3). Hence, for each output actor, the minimal in-put/output dependency function value that is greater than the repetition vector52In [FKH+08*], the input/output dependency function as well as the cluster state space

are defined in terms of minimum number of consumed tokens and maximum number ofproduced tokens on the cluster input and output ports. However, in order to have aconsistent definition of cluster states and cluster state space for both the automata-basedand the rule-based clustering algorithms, the automata-based clustering will be presentedhere using a cluster state space defined in terms of cluster input and output actor firingsas used for the rule-based representation published in [FZHT11*]. Both representationsare equivalent and can be transformed into each other by using the consumption andproduction rates of the input and output actors.

138

Page 147: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

for the cluster input actors, e.g., the input/output dependency function values(3, 1), (1, 1), and (1, 3) in Table 4.1, will also be included. More formally, the setof input/output dependency function values for each output actor of a clusteris defined as follows:

Definition 4.8 (Set of Input/Output Dependency Function Values). Given astatic cluster 𝑔𝛾 , its repetition vector 𝜂rep

𝛾 , and its input/output dependency func-tion dep. Then, the set of input/output dependency function values 𝐻dep(𝑎out)for a cluster output actor 𝑎out is given by:53

𝐻dep(𝑎out) ={dep(𝑎out, 𝑛) | 𝑛 ∈ { 1, 2, . . .min{𝑛′ ∈ N0 | dep(𝑎out, 𝑛′) ≥ 𝜋𝐴𝐼

(𝜂rep𝛾 ) } (4.4)

} }

Considering the DFG shown in Figure 4.36, these sets of input/output depen-dency function values for subcluster 𝑔𝛾1,2,3

are: dep(𝑎1) = { (1, 0), (2, 0), (3, 1) }for output actor 𝑎1, dep(𝑎2) = { (0, 0), (1, 1) } for output actor 𝑎2, and dep(𝑎3) ={ (0, 1), (0, 2), (1, 3) } for output actor 𝑎3.

Step 2: Derivation of the Cluster FSM State Set

Next, the cluster state space 𝑄𝛾 is derived from the set of input/output depen-dency function values. Each cluster state 𝑞 ∈ 𝑄𝛾 encodes the minimal numberof input actor firings 𝜋𝐴𝐼

(𝑞) required for the output actor firings 𝜋𝐴𝑂(𝑞) given

by the cluster state. By definition, an input/output dependency function value𝜂 is already a minimal vector of input actor firings (cf. Equation (4.5)).However, each cluster state 𝑞 ∈ 𝑄𝛾 ∖{ 𝑞0 }54 also encodes the maximal number

of output actor firings 𝜋𝐴𝑂(𝑞) that are possible with the given vector 𝜋𝐴𝐼

(𝑞) = 𝜂of input actor firings. Therefore, in order to derive a cluster state 𝑞 from a vector𝜂 of input actor firings, the maximal number of output actor firings (cf. Equa-tion (4.6)) have to be determined for all output actors (cf. Equation (4.7)). Thiscan be done efficiently, as the vector values of dep for a given output actor 𝑎outare totally ordered under ≤, i.e., dep(𝑎out, 𝑛) ≤ dep(𝑎out, 𝑛+1). More formally,a cluster state is derived from an input/output dependency function value asfollows:53The notation min𝑋{ . . . } is used to select the minimal value of a set { . . . }.54 Remember that there may be one exception to the assertion that a cluster state always

contains the maximal number of output actor firings, namely the initial cluster state0 = 𝑞0 ∈ 𝑄𝛾 (the all zero vector). This state is explicitly added (cf. Definition 4.6 onpage 134) to the cluster state space. This exception is due to the fact that sufficient initialtokens may be stored on the internal channels of the cluster, e.g., channels 𝑐1→2 and 𝑐3→2

in the DFG 𝑔ex7depicted in Figure 4.36, to permit the firing of cluster output actors

without firing any cluster input actors, e.g., transition 𝑞0 → 𝑞1 of the composite actordepicted in Figure 4.38.

139

Page 148: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

Definition 4.9 (Output Maximization Function [FKH+08*]55). Given a staticcluster 𝑔𝛾 and its input/output dependency function dep, then the output max-imization function statedep : N|𝐴𝐼 |

0 → N|𝐴𝐼∪𝐴𝑂|0 is a vector-valued function that

associates with a vector 𝜂 of input actor firings a cluster state 𝑞 containing themaximal number of output actor firings 𝜋𝐴𝑂

(𝑞) enabled by the given vector 𝜂of input actor firings. This can be defined precisely as follows:56

statedep(𝜂) = 𝑞 where𝜋𝐴𝐼

(𝑞) = 𝜂 (4.5)𝜋{ 𝑎out }(𝑞) = max{𝑛 ∈ N0 | 𝜂 ≥ dep(𝑎out, 𝑛) } (4.6)

∀𝑎out ∈ 𝐴𝑂 (4.7)

To exemplify, in case of subcluster 𝑔𝛾1,2,3shown in Figure 4.36, the set of initial

input/output dependency states 𝑄ini = { statedep(𝜂) | 𝜂 ∈ 𝐻dep(𝑎out), 𝑎out ∈𝐴𝑂 } ∪ { 𝑞0 } is listed in the right half of Table 4.2. This set is directly derivedfrom the set of input/output dependency function values shown in the left halfof Table 4.2. As can be seen from column 𝑎out = 𝑎2 on the right half of Table 4.2,no input actor firings are required ((0, 0) = 𝜋{ 𝑎1,𝑎3 }(0, 1, 0)) in order to fire theoutput actor 𝑎2 once (1 = 𝜋{ 𝑎2 }(0, 1, 0)). However, for the second firing ofthe output actor 𝑎2, both input actors 𝑎1 and 𝑎3 have to be fired once, i.e.,(1, 2, 1) ∈ 𝑄ini.

dep(𝑎out, 𝑛) statedep(dep(𝑎out, 𝑛))𝑛 𝑎out = 𝑎1 𝑎out = 𝑎2 𝑎out = 𝑎3 𝑎out = 𝑎1 𝑎out = 𝑎2 𝑎out = 𝑎3

𝑄ini = {𝑞0 = (0, 0, 0)1 (1, 0) (0, 0) (0, 1) (1, 1, 0) (0, 1, 0) (0, 1, 1)2 (2, 0) (1, 1) (0, 2) (2, 1, 0) (1, 2, 1) (0, 1, 2)3 (3, 1) (1, 3) (3, 2, 1) (1, 2, 3)}

Table 4.2: Input/output dependency function values (𝜂1, 𝜂3) = dep(𝑎out, 𝑛)from Table 4.1 and their corresponding cluster states (𝜂1, 𝜂2, 𝜂3) =statedep(dep(𝑎out, 𝑛)) ∈ 𝑄ini

However, as the calculation of the set of input/output dependency functionvalues considers each output actor in isolation, the derived set 𝑄ini does notcontain cluster states resulting from different interleavings of output actor fir-ings. In detail, computing a pointwise maximum 𝑞new = max(𝑞𝑛, 𝑞𝑚) of clusterstates 𝑞𝑛, 𝑞𝑚 ∈ 𝑄ini derived for different output actors 𝑎𝑛 = 𝑎𝑚 and 𝑎𝑛, 𝑎𝑚 ∈ 𝐴𝑂

might introduce a new cluster state 𝑞new ∈ 𝑄ini representing a state of the clus-ter that is reachable by an interleaving of the sequence of actor firings used to55This function is not explicitly defined in [FKH+08*], but it is present implicitly as part of

the definition of the input/output state set in [FKH+08*].56The notation max𝑋 is used to select the maximal value 𝑥 ∈ 𝑋 of a set 𝑋.

140

Page 149: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

reach the original cluster states 𝑞𝑛 and 𝑞𝑚. These interleavings are captured bythe interleaved cluster state space 𝑄lfp, which is defined as follows:

Definition 4.10 (Interleaved Cluster State Space). The interleaved cluster statespace 𝑄lfp is derived by (1) computing the least fixed point (cf. Equation (4.9))of the interleave function and (2) converting each value of the resulting set rep-resenting the least fixed point into a cluster state by determining the maximalnumber of output actor firings (cf. Equation (4.8)) for this value. As always,the initial state 𝑞0 is an exception that has to be explicitly added (cf. Equa-tion (4.10)). The interleave function is a monotonically increasing function thatenlarges (cf. Equation (4.11)) the set𝐻 starting from the set of all input/outputdependency function values (cf. Equation (4.12)) by computing the pointwisemaximum (cf. Equations (4.13) to (4.14)) for each pair of input/output depen-dency function values. This can be defined precisely as follows:57

𝑄lfp ={ statedep(𝜂) | (4.8)𝜂 ∈ lfp(𝐻 = interleave(𝐻)) }∪ (4.9)

{ 𝑞0 } (4.10)interleave(𝐻) =𝐻∪ (4.11)

{𝜂 ∈𝐻dep(𝑎out) | 𝑎out ∈ 𝐴𝑂 }∪ (4.12){max(𝜂1,𝜂2) | 𝜂1,𝜂2 ∈𝐻 }∪ (4.13){max(𝜂1,𝜂2) mod 𝜋𝐴𝐼

(𝜂rep𝑔𝛾 ) | 𝜂1,𝜂2 ∈𝐻 } (4.14)

To exemplify Definition 4.10, consider Figure 4.37 that depicts Hasse dia-grams for the cluster state sets 𝑄ini and 𝑄lfp. As the interleave function is amonotonically increasing function, its least fixed point can be calculated by it-eratively applying the function to its result until the fixed point is reached, e.g.,in case of subcluster 𝑔𝛾1,2,3

shown in Figure 4.36, interleave(interleave(∅))computes the same set as interleave(interleave(interleave(∅))). The result-ing states 𝑄ini = { statedep(𝜂) | 𝜂 ∈ interleave(∅) } for one iteration of theinterleave function are depicted in Figure 4.37a, while the final interleaved clus-ter states 𝑄lfp = { statedep(𝜂) | 𝜂 ∈ interleave(interleave(∅)) } are shown inFigure 4.37b.Comparing 𝑄ini and 𝑄lfp, it can be seen that the interleaving adds the states

𝑞9, 𝑞10, . . . 𝑞14 to 𝑄lfp by pointwise maximum operations and computation ofthe maximal number of output actor firings for a cluster state, e.g., the state𝑞9 ∈ 𝑄lfp is derived from the states 𝑞7 and 𝑞4 from 𝑄ini.58 The cluster statespace can now be derived from the interleaved cluster state space as follows:57The notation lfp(𝑥 = 𝑓(𝑥)) is used to select the minimal solution 𝑥 of the equation 𝑥 = 𝑓(𝑥).58For the subcluster 𝑔𝛾1,2,3

, the generated states are all greater than or equal to the repetitionvector, i.e., ∀𝑞 ∈ { 𝑞9, 𝑞10, . . . 𝑞14 } : 𝑞 ≥ 𝜋𝐴𝐼∪𝐴𝑂

(𝜂rep𝛾1,2,3

), however, as demonstrated in[FKH+08*], cluster states ≥ 𝜋𝐴𝐼∪𝐴𝑂

(𝜂rep𝛾 ) can also be generated.

141

Page 150: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑞0 = (0, 0, 0)

𝑞1 = (0, 1, 0)

𝑞2 = (1, 1, 0)𝑞3 = (0, 1, 1)

𝑞4 = (2, 1, 0)𝑞5 = (0, 1, 2) 𝑞7 = (1, 2, 1)

𝑞6 = (3, 2, 1)𝑞8 = (1, 2, 3)

(a) Input/output dependency cluster state set𝑄ini

𝑞0 = (0, 0, 0)

𝑞1 = (0, 1, 0)

𝑞2 = (1, 1, 0)𝑞3 = (0, 1, 1)

𝑞4 = (2, 1, 0)𝑞5 = (0, 1, 2) 𝑞7 = (1, 2, 1)

𝑞6 = (3, 2, 1)𝑞8 = (1, 2, 3)

𝑞9 = (2, 2, 1)𝑞10 = (1, 2, 2)

𝑞11 = (2, 3, 2)

𝑞12 = (3, 3, 2)𝑞13 = (2, 3, 3)

𝑞14 = (3, 4, 3)

(b) Interleaved cluster state space 𝑄lfp accord-ing to Definition 4.10

Figure 4.37: Hasse diagrams for various cluster state sets of the subcluster 𝑔𝛾1,2,3

shown in Figure 4.36

Definition 4.11 (Cluster FSM State Set [FKH+08*]59). Given the interleavedcluster state space 𝑄lfp, the cluster FSM state set ℛ.𝑄 is derived by only con-sidering cluster states not greater or equal to the repetition vector of the cluster,i.e., ℛ.𝑄 = { 𝑞 ∈ 𝑄lfp | 𝑞 ≥ 𝜋𝐴𝐼∪𝐴𝑂

(𝜂rep𝛾 ) }. The initial state of the cluster FSM

is the initial cluster state ℛ.𝑞0 = 𝑞0 = 0 (the all zero vector).

In Figure 4.37, the cluster FSM state set ℛ.𝑄 has been marked by dashedgroupings. These states 𝑞0, 𝑞1, . . . 𝑞5 are also present in the cluster FSM of thecomposite actor 𝑎𝛾1,2,3

depicted in Figure 4.38.

𝑎3

𝑎2

𝑎1

#𝑖2 ≥ 1 ∧#𝑜3 ≥ 1/𝑓⟨𝑎3⟩

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1∧#𝑜2 ≥ 1/𝑓⟨𝑎1,𝑎2⟩

#𝑜3 ≥ 1/𝑓⟨𝑎3,𝑎2⟩#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓⟨𝑎1⟩

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1∧

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1 ∧#𝑜2 ≥ 1/𝑓⟨𝑎1,𝑎2⟩

#𝑖2 ≥ 1 ∧#𝑜3 ≥ 1/𝑓⟨𝑎3⟩

#𝑖2 ≥ 1 ∧#𝑜2 ≥ 1 ∧#𝑜3 ≥ 1/𝑓⟨𝑎3,𝑎2⟩

#𝑖1 ≥ 1 ∧#𝑜1 ≥ 1/𝑓⟨𝑎1⟩#𝑜2 ≥ 1

/𝑓⟨𝑎2⟩𝑞1

𝑞2

𝑞3

𝑞0

𝑞4

𝑞5

𝑖1

𝑖2

𝑜1

𝑜3

𝑜2

𝑔𝛾1,2,3

Figure 4.38: The composite actor 𝑎𝛾1,2,3replacing subcluster 𝑔𝛾1,2,3

depicted inFigure 4.36

59There is no explicit definition in [FKH+08*], but it is contained in [FKH+08*] Step 3.2where the cluster FSM state space is determined.

142

Page 151: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

However, the question arises as to whether the cluster state sets given byDefinitions 4.6 and 4.11 are equivalent. The following considerations can beused to answer this question.

Proof. The question of equivalence can be reduced to 𝑄𝛾 ⊆ ℛ.𝑄 and ℛ.𝑄 ⊆𝑄𝛾 . Fortunately, 𝑄𝛾 ⊆ ℛ.𝑄 is a corollary to the semantic equivalence asgiven by Definition 4.2 and proven in [FZHT13*]. A replication of this proofis presented in Theorem 4.5 on page 167. In the proof of Theorem 4.5, it isshown that each cluster state 𝑞ok ∈ 𝑄𝛾 is reachable from all cluster FSM states𝑞def ∈ ℛ.𝑄 where 𝑞ok ≥ 𝑞def . What is not proven is the absence of states𝑞def ∈ ℛ.𝑄∖𝑄𝛾 . However, each state 𝑞def ∈ ℛ.𝑄∖{ 𝑞0 } is not greater than therepetition vector (cf. Definition 4.11), i.e., 𝑞def = 𝑞def mod 𝜋𝐴𝐼∪𝐴𝑂

(𝜂rep𝛾 ), and

has a maximal number of output actor firings (cf. Equation (4.8)). To provethat 𝑞def ∈ 𝑄𝛾 , it remains to be shown that 𝑞def has a minimal number ofinput actor firings. For the states 𝑄ini generated by Equation (4.12), this istrivially true. Hence, 𝜂1 and 𝜂2 in Equations (4.13) to (4.14) can be assumedto be minimal. Furthermore, it is self-evident that statedep(max(𝜂1,𝜂2)) ≥statedep(𝜂1) ∧ statedep(max(𝜂1,𝜂2)) ≥ statedep(𝜂2). Thus, a lower bound forthe minimal number of input actor firings for statedep(max(𝜂1,𝜂2)) is min{𝜂 ∈N|𝐴𝐼∪𝐴𝑂|

0 | 𝜂 ≥ 𝜂1 ∧ 𝜂 ≥ 𝜂2 }. However, this is exactly the definition ofmax(𝜂1,𝜂2). Therefore, ℛ.𝑄 ⊆ 𝑄𝛾 and, thus, 𝑄𝛾 ≡ ℛ.𝑄.

Step 3: Derivation of the Cluster FSM Transition Set

Finally, the transition setℛ.𝑇 of the cluster FSM is derived by partially orderingthe interleaved cluster state space 𝑄lfp using the ≥-relation between two clusterstates. The resulting partial order can be visualized by a Hasse diagram, e.g.,Figure 4.37b for the cluster 𝑔𝛾1,2,3

depicted in Figure 4.36. The edges in thisdiagram will be used to derive the set of transitions for the cluster FSM, e.g.,as shown in Figure 4.38. Each edge in a Hasse diagram for a cluster state set 𝑄corresponds to a so-called tightly ordered pair (𝑞src, 𝑞dst) ∈ 𝑄2 where 𝑞dst > 𝑞src

and no other state 𝑞′ ∈ 𝑄 is between (cf. Equation (4.20)) these two states.An example of this is the tightly ordered pair (𝑞5, 𝑞8) depicted in Figure 4.37a,which is missing in Figure 4.37b due to the cluster state 𝑞10 between the states𝑞5 and 𝑞8. More formally, the cluster FSM transition set is derived from theinterleaved cluster state space 𝑄lfp as follows:

Definition 4.12 (Cluster FSM Transition Set [FKH+08*]60). Given the in-terleaved cluster state space 𝑄lfp, its set of tightly ordered pairs 𝒯 𝒪lfp can be

60There is no explicit definition in [FKH+08*], but it is contained in [FKH+08*] Step 3.3where the cluster FSM transition set is determined.

143

Page 152: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

derived (cf. Equation (4.19)).61 Subsequently, a transition 𝑡 is created (cf. Equa-tion (4.15)) for each tightly ordered pair that starts (cf. Equation (4.16)) from astate in the cluster FSM state set ℛ.𝑄. Thus, the source state 𝑡.𝑞src of the tran-sition can simply be set to 𝑞src (cf. Equation (4.17)). However, the destinationstate 𝑞dst of the tightly ordered pair (𝑞src, 𝑞dst) is not necessarily contained inthe cluster FSM state set ℛ.𝑄. Hence, an equivalent state (mod 𝜂rep

𝛾 ) has tobe computed (cf. Equation (4.18)). More precisely, the cluster FSM transitionset is defined as follows:

ℛ.𝑇 ={ 𝑡 | (𝑞src, 𝑞dst) ∈ 𝒯 𝒪lfp (4.15)𝑞src ∈ ℛ.𝑄 (4.16)𝑡.𝑞src = 𝑞src (4.17)𝑡.𝑞dst = 𝑞dst mod 𝜋𝐴𝐼∪𝐴𝑂

(𝜂rep𝛾 ) } (4.18)

𝒯 𝒪lfp ={ (𝑞src, 𝑞dst) ∈ 𝑄2lfp | (4.19)

𝑞dst > 𝑞src ∧ @𝑞′ ∈ 𝑄lfp : 𝑞dst > 𝑞′ > 𝑞src } (4.20)

As can be seen from Definition 4.11, only a subset ℛ.𝑄 ⊂ 𝑄lfp, which isdepicted as dashed grouping in Figures 4.37 and 4.39, is used for the clusterFSM state set. Thus, not all tightly ordered pairs of 𝒯 𝒪lfp are required toconstruct the cluster FSM. To exemplify, the tightly ordered pairs of interesthave been depicted in Figure 4.39a. These are the tightly ordered pairs thatstart from a state in the dashed grouping.

𝑞0 = (0, 0, 0)

𝑞1 = (0, 1, 0)

𝑞2 = (1, 1, 0)𝑞3 = (0, 1, 1)

𝑞4 = (2, 1, 0)𝑞5 = (0, 1, 2) 𝑞7 = (1, 2, 1)≡

𝑞1 = (0, 1, 0)

≡𝑞9 = (2, 2, 1)

𝑞2 = (1, 1, 0)

𝑞10 = (1, 2, 2)

𝑞3 = (0, 1, 1)≡

(a) Hasse diagram with the relevant tightly or-dered pairs for derivation of the clusterFSM transition set

𝑞0 = (0, 0, 0)

𝑞1 = (0, 1, 0)

𝑞2 = (1, 1, 0)𝑞3 = (0, 1, 1)

𝑞4 = (2, 1, 0)𝑞5 = (0, 1, 2)

(b) Cluster FSM state (boxes) and transition(directed edges) sets for subcluster 𝑔𝛾1,2,3

Figure 4.39: Example for the derivation of the cluster FSM shown in Figure 4.38from the interleaved cluster state space 𝑄lfp of the subcluster 𝑔𝛾1,2,3

depicted in Figure 4.36

61The notation 𝒯 𝒪𝑥 is used to denote the set of tightly ordered pairs of a set of states 𝑄𝑥.

144

Page 153: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

However, not all tightly ordered pairs starting in the dashed grouping alsoend in it, e.g., the tightly ordered pairs (𝑞5, 𝑞10), (𝑞3, 𝑞7), (𝑞2, 𝑞7), and (𝑞4, 𝑞9)shown in Figure 4.39a. These cases correspond to those transitions that crossthe boundary from one iteration of the cluster to the next one. Hence, anequivalent state contained in the cluster FSM state set has to be determined(cf. Equation (4.18)). For the destination states 𝑞10, 𝑞7, and 𝑞9, it can beobserved that subtracting the repetition vector 𝜂rep

𝛾1,2,3= (1, 1, 1) will derive

the equivalent states 𝑞3, 𝑞1, and 𝑞2, respectively. These states will be present inthe cluster FSM state set due to Equation (4.14). The presence of the originalstates 𝑞dst ∈ 𝑄lfp ∖ ℛ.𝑄 required for the existence of the tightly ordered pairs(𝑞5, 𝑞10), (𝑞3, 𝑞7), (𝑞2, 𝑞7), and (𝑞4, 𝑞9) is ensured by Equations (4.4) and (4.13).The resulting transitions 𝑞5 → 𝑞3, 𝑞3 → 𝑞1, 𝑞2 → 𝑞1, and 𝑞4 → 𝑞2 for thetightly ordered pairs (𝑞5, 𝑞10), (𝑞3, 𝑞7), (𝑞2, 𝑞7), and (𝑞4, 𝑞9) can be observed inFigure 4.39b.To finish the construction of the cluster FSM, the actions 𝑓⟨...⟩ and input/out-

put guards 𝑘io have to be determined for each transition. This step, presentedin Section 4.4.4, is independent of the clustering approach taken. Hence, it canbe used both for the automata-based clustering approach detailed in this sectionor the rule-based approach given in the next section. The resulting cluster FSMfor the composite actor 𝑎𝛾1,2,3

is depicted in Figure 4.38.

4.4.3 Rule-Based Clustering

The automata-based clustering methodology presented in the previous sectionmay suffer from the state space explosions problem. This state space explosioncan have two different underlying reasons: (1) the set of initial input/outputdependency states 𝑄ini is small, but computation of all possible interleavingsby Definition 4.10 results in state space explosion, and (2) the state space ex-plosion problem that is inherent in the marked graph expansion62 of a staticDFG resulting in a possibly large initial set 𝑄ini. Note that the set of initialinput/output dependency states might not suffer state space explosion even ifthe repetition vector sum is large. This is the case when the sequences of stat-ically scheduled actor firings executed by the transitions of the cluster FSMare relatively large and, hence, only a small number of transitions are requiredto cover all actor firings of the repetition vector, even if the repetition vectorsum is large. However, this does not mean that the computation of all possibleinterleavings does not produce a state space explosion. To ameliorate this rea-son for state space explosion, the rule-based approach published in [FZHT13*]will be presented in this section. This approach implicitly constructs the state

62The marked graph expansion contains as many vertices as the repetition vector sum∑𝑛∈ℐ(𝜂rep

𝛾 ) : 𝜂rep𝛾 (𝑛) of all the entries of the repetition vector 𝜂rep

𝛾 of the cluster 𝑔𝛾 .

145

Page 154: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

space 𝑄𝛾 of a cluster 𝑔𝛾 using so-called rules. Note that the rule-based rep-resentation might achieve a more compact code size of a QSS as compared tothe automata-based representation. However, if code size is not a concern, thenthe automata-based representation can achieve better overall overhead reduc-tion due to its more efficient ability to select an enabled transition for executionas compared to selecting an enabled rule.

Implicit Cluster State Space Representation

In order to avoid the explicit enumeration of the cluster state space𝑄𝛾 , rules areproposed to define valid state transitions implicitly. At a glance, a rule 𝑟 ∈ 𝑅maps a subspace of the vector space 𝑄rules = N

|𝐴|0,∞ to a vector of actor firings 𝜂

(in the following called a partial repetition vector) that can be performed if thecurrent state is contained in this subspace. More formally, a rule 𝑟 is defined asfollows:

Definition 4.13 (Rule [FZHT13*]). A rule is a tuple 𝑟 = (𝑙,𝑢,𝜂). The intervalboundary vectors 𝑙,𝑢 ∈ 𝑄rules define the state set 𝑄𝑟 where the rule is active,i.e., 𝑄𝑟 = { 𝑞 ∈ 𝑄rules | 𝑙 ≤ 𝑞 ≤ 𝑢 }. The partial repetition vector 𝜂 ∈ N|𝐴|

0

specifies how many firings to perform for each actor of the cluster when the rule𝑟 is applied in a state 𝑞cur ∈ 𝑄𝑟.

Note that two different definitions are used for𝑄rules.63 Initially, the algorithmworks with lower and upper bounds 𝑙,𝑢 ∈ 𝑄rules = N|𝐴|

0,∞. Later, after rulemerging has been applied, these bounds can be defined in terms of input actorfirings, i.e., 𝑙,𝑢 ∈ 𝑄rules = N

|𝐴𝐼 |0,∞ ∪ { 𝑞0 }. A detailed discussion of this is given

in the rule merging subsection.If a rule 𝑟 is applied in a current state 𝑞cur = 𝑞src, then the next state 𝑞next

is calculated in two steps. First, the actor firings specified by 𝜂 are added tothe source state (cf. Equation (4.21)) to compute a destination state. Then,the next state 𝑞next is the smallest equivalent state (cf. Equation (4.22)) of thedestination state:

𝑞dst = 𝑞src + 𝜋ℐ(𝑞src)(𝜂) (4.21)𝑞next = 𝑞dst mod 𝜋ℐ(𝑞dst)(𝜂

rep𝛾 ) (4.22)

In the following, the algorithm that computes the set of rules 𝑅fin for thecomposite actor 𝑎𝛾 ∈ 𝑔.𝐴 that is derived via the cluster refinement operator𝜉𝑟𝑢𝑙𝑒𝑠, i.e., 𝜉𝑟𝑢𝑙𝑒𝑠(𝑔, 𝑔𝛾 ) = (𝑔, 𝑎𝛾 ), will be reviewed. This set of rules implementsa QSS for the actors 𝑔𝛾 .𝐴 contained in the static subcluster 𝑔𝛾 of the originalDFG 𝑔.63Incidentally, this requires the projection 𝜋ℐ(𝑞)(. . .) in the Equations (4.21) and (4.22) to

the index set ℐ(𝑞) currently used by the state space 𝑄rules ∋ 𝑞.

146

Page 155: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

The rule-based clustering algorithm works in three steps: (Step 1) computesthe set of initial input/output dependency rules from the set of input/output de-pendency function values. (Step 2) derives the set of conflict-resolved rules thatcovers all necessary states and contains all necessary transitions of the clusterstate space from the set of initial input/output dependency rules.64 (Step 3)obtains the set of final rules 𝑅fin where every rule65 fires at least one input actorfrom the set of conflict-resolved rules.66

To avoid deadlocks, the worst case cluster environment has to be assumed and,thus, the resulting QSS always has to produce a maximum number of output to-kens from a minimum number of input tokens. This criterion is satisfied for eachoutput actor in isolation by constructing (Step 1) the set of initial input/outputdependency rules 𝑅ini (cf. Definition 4.14) from the set of input/output depen-dency function values of the given output actor. The state space of the set ofinitial rules 𝑅ini, i.e., the set of possible states for 𝑞cur, in general, is only asubset of the cluster state space 𝑄𝛾 . Furthermore, execution of the initial rulesmay perform transitions, i.e., state transitions as defined by Equations (4.21)to (4.22), that do not correspond to tightly ordered pairs in the state space ofthe rules 𝑅ini.To solve these irregularities, conflict resolution (cf. Definition 4.18) is per-

formed in (Step 2) for each pair (cf. Definition 4.17) of conflicting (cf. Defini-tion 4.16) rules. Conflict resolution will add new states to the state space of theset of rules until the state space of the resulting set of conflict-resolved rules isa superset of the cluster state space. Furthermore, conflict resolution will addadditional rules that replace redundant rules (cf. Definition 4.19). Redundantrules are rules that do not correspond to tightly ordered pairs in the state spaceof the set of rules. These redundant rules can be discarded in the next step.Finally, rule merging is performed in (Step 3). In general, the state space of

the set of rules computed at the end of the previous step is a superset of thecluster state space. This is the case due to the presence of rules that do not fireany input actors. Hence, in general, to perform the equivalent of a transition asdefined in Definition 4.12, a sequence of rule applications 𝜌 = ⟨𝑟1, 𝑟2, 𝑟3, . . . 𝑟𝑛⟩has to be taken. The first rule 𝑟1 in this sequence will fire at least one inputactor, while the rest of the sequence of rule applications ⟨𝑟2, 𝑟3, . . . 𝑟𝑛⟩ will notcontain any input actor firings. For each such sequence 𝜌, rule merging addsa rule that atomically fires all actor firings performed by this sequence of rule

64In general, the set of initial input/output dependency rules only covers a subset of thecluster state space 𝑄𝛾 ≡ ℛ.𝑄 and is also missing rules to cover all transitions ℛ.𝑇 .

65Of course, the exception of the rule that is active in the initial state 𝑞0 still applies. Re-member that the initial state may not necessarily have maximized output actor firings. Inthis case, exactly one rule with no input actor firings is present in 𝑅fin.

66In contrast, the set of conflict-resolved rules may contain rules that fire only output orintermediate actors.

147

Page 156: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

applications. After this operation, the state space of the set of final rules 𝑅fin

derived via the rule merging operation corresponds to the cluster state space asgiven in Definition 4.6. More precisely, the set of final rules 𝑅fin for a consistentand deadlock-free subcluster 𝑔𝛾 is derived by the three steps given below:

Step 1: Derivation of Initial Rules

A rule 𝑟 can be constructed from any two states 𝑞src, 𝑞dst ∈ 𝑄rules = N|𝐴|0,∞ with

𝑞dst > 𝑞src. Then, the rule 𝑟 generated by these two states has to perform𝑟.𝜂 = 𝑞dst − 𝑞src actor firings (cf. Equation (4.24)). The lower bound 𝑟.𝑙 isequal to 𝑞src (cf. Equation (4.25)), representing the minimum number of actorfirings which have to be performed in order to enable the rule. If an actor𝑎 is fired by 𝑟, i.e., 𝜂(𝑎) > 0, the upper bound is set to the lower bound(cf. Equation (4.26)). This constrains the number of firings of the actor 𝑎 toan interval which only contains the single value 𝑙(𝑎). This requirement is dueto the fact that only sufficient tokens can be assumed to execute 𝜂(𝑎) firingsstarting from the 𝑙(𝑎)’th firing of the actor 𝑎. If an actor 𝑎 is not fired by 𝑟, onlya minimum number of actor firings are required, i.e., 𝑢(𝑎) = ∞. This ensuresthat at least the minimum number of tokens have been produced by the actor𝑎. Note that if more firings of the actor 𝑎 have already been performed thanindicated by 𝑙(𝑎), then the actor 𝑎 must also have produced more tokens thanrequired, not less.The first step of the rule-based clustering approach is to find a set of initial

input/output dependency rules 𝑅ini based on the set of input/output dependencyfunction values. More formally, the set of initial input/output dependency rules𝑅ini can be defined as given below:

Definition 4.14 (The Set of Initial Rules [FZHT13*]). Given a static cluster𝑔𝛾 and its set of input/output dependency function values 𝐻dep(𝑎out), then itsset of initial input/output dependency rules 𝑅ini can be derived as follows:

𝑅ini =⋃

𝑎out∈𝐴𝑂

{ 𝑟 | (𝑞src, 𝑞dst) ∈ 𝒯 𝒪ini(𝑎out) (4.23)

𝑟.𝜂 = 𝑞dst − 𝑞src (4.24)𝑟.𝑙 = 𝑞src (4.25)

𝑟.𝑢(𝑎) =

{𝑟.𝑙(𝑎) if 𝑟.𝜂(𝑎) > 0∞ otherwise ∀𝑎 ∈ 𝐴 (4.26)

}𝒯 𝒪ini(𝑎out) ={ (𝑞src, 𝑞dst) ∈ 𝑄ini(𝑎out)

2 | (4.27)𝑞dst > 𝑞src ∧ @𝑞′ ∈ 𝑄ini(𝑎out) : 𝑞dst > 𝑞′ > 𝑞src }

𝑄ini(𝑎out) ={firings ∘ statedep(𝜂) | 𝜂 ∈𝐻dep(𝑎out) } ∪ { 𝑞0 } (4.28)

148

Page 157: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

The construction of the set of initial input/output dependency rules pro-ceeds by iterating over each output actor 𝑎out ∈ 𝐴𝑂 (cf. Equation (4.23)) andcomputing a set of initial input/output dependency states 𝑄ini(𝑎out) (cf. Equa-tion (4.28)) from the set of input/output dependency function values 𝐻dep(𝑎out)of the selected output actor.67 From this set 𝑄ini(𝑎out), a set of tightly orderedpairs 𝒯 𝒪ini(𝑎out) can be derived (cf. Equation (4.27)). Finally, for each tightlyordered pair (𝑞src, 𝑞dst) ∈ 𝒯 𝒪ini(𝑎out) (cf. Equation (4.23)), a rule is constructed(cf. Equations (4.24) to (4.26)) as described above and added to the set of initialinput/output dependency rules 𝑅ini. Note that each rule conforms to the ruleproperty as given below:

Definition 4.15 (Rule Property [FZHT13*]). A rule 𝑟 obeys the rule propertyif and only if 𝑟.𝜂(𝑎) = 0, then 𝑟.𝑢(𝑎) =∞, and otherwise 𝑟.𝑙(𝑎) = 𝑟.𝑢(𝑎).

In other words, for a rule 𝑟 to be well formed, actors which are not firedby the rule 𝑟 may have an arbitrary amount of firings, as long as the lowerbound 𝑟.𝑙(𝑎) is satisfied, while actors which are fired by rule 𝑟 must have aconcrete number of firings 𝑟.𝑙(𝑎) = 𝑟.𝑢(𝑎). This is trivially true for the initialinput/output dependency rules as described in Section 4.4.3 as well as rulesderived by conflict resolution in Definition 4.17.Furthermore, according to Lemma 4.3 on page 165, any set of rules that has

the property that all initial input/output dependency states are reachable fromthe initial state 𝑞0 via application of these rules is a valid set of initial input/out-put dependency rules. Moreover, each set of initial input/output dependencystates 𝑄ini(𝑎out) represents a chain, i.e., a totally ordered subset, of the partiallyordered state space 𝑄rules. Hence, all initial input/output dependency states arereachable via application of rules from 𝑅ini starting from 𝑞0, i.e., 𝑅ini is a validset of initial input/output dependency rules.To exemplify, consider the DFG given in Figure 4.36 on page 137, with the

subcluster 𝑔𝛾1,2,3induced by the set of static actors 𝐴𝑆 = { 𝑎1, 𝑎2, 𝑎3 }. The

chains used to cover the set of initial input/output dependency states of 𝑔𝛾1,2,3

(cf. Table 4.2 on page 140) is shown in Figure 4.40a. The initial rule 𝑟1 corre-sponding to the edge 𝑞0 → 𝑞2 is then derived as follows: 𝜂 = 𝑞2−𝑞0 = (1, 1, 0),𝑙 = 𝑞0 = (0, 0, 0), and 𝑢 = (0, 0,∞). Therefore, in order to enable the rule,actors 𝑎1 and 𝑎2 must not have been fired before. Further initial input/outputdependency rules are tabulated in Figure 4.40b.Simulating the initial input/output dependency rules 𝑅ini as given in Fig-

ure 4.40b results in the FSM depicted in Figure 4.41a. It can be observed thatall states of the FSM shown in Figure 4.38 on page 142 are present. However,the transitions are different, and not all states have outgoing transitions, there-fore causing a deadlock if the FSM enters one of these states. These deadlocks67The ‘∘’-operator, e.g., firings ∘ statedep, is used to denote function composition, i.e.,

(firings ∘ statedep)(𝜂) ≡ firings(statedep(𝜂)).

149

Page 158: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑞5 = (0, 1, 2)

𝑞1 = (0, 1, 0)

𝑞2 = (1, 1, 0) 𝑞4 = (2, 1, 0)

𝑞3 = (0, 1, 1)𝑞0=

(0,0,0)

𝑟1

𝑟2

𝑟5𝑟3

𝑟4

(a) Input/output dependency state chains

𝑟 𝑙 𝑢 𝜂𝑟1 (0, 0, 0) ( 0, 0,∞) (1, 1, 0)𝑟2 (0, 0, 0) (∞, 0,∞) (0, 1, 0)𝑟3 (0, 0, 0) (∞, 0, 0) (0, 1, 1)𝑟4 (1, 1, 0) ( 1,∞,∞) (1, 0, 0)𝑟5 (0, 1, 1) (∞,∞, 1) (0, 0, 1)

(b) The set of initial input/output depen-dency rules 𝑅ini

Figure 4.40: Depicted are the initial input/output dependency states for thecluster 𝑔𝛾1,2,3

. Three chains (cf. Table 4.2 on page 140) are used tocover all states. Each tightly ordered pair of a chain corresponds toan edge. The derived rules for the tightly ordered pairs are anno-tated to the edges. The corresponding set of initial input/outputdependency rules is given in the table to the right.

𝑓⟨𝑎1⟩

𝑓⟨𝑎3⟩

𝑟4

𝑟5

𝑓 ⟨𝑎2,𝑎1

𝑟 1

𝑓⟨𝑎2⟩𝑟2𝑓⟨𝑎2 ,𝑎

3 ⟩𝑟3

𝑞0 𝑞1

𝑞2 𝑞4

𝑞5𝑞3

(a) FSM derived from simulating rules𝑅ini

𝑟2 = ((0, 0, 0), (∞, 0,∞), (0, 1, 0))

𝑟3 = ((0, 0, 0), (∞, 0, 0), (0, 1, 1))

𝑟4 = ((1, 1, 0), (1,∞,∞), (1, 0, 0))

𝑟5 = ((0, 1, 1), (∞,∞, 1), (0, 0, 1))

𝑟1 = ((0, 0, 0), (0, 0,∞), (1, 1, 0))

(b) Conflict graph for initial input/out-put dependency rules 𝑅ini

Figure 4.41: Shown are the conflict graph of the initial input/output dependencyrules from Figure 4.40b and an FSM derived from simulating theserules. Applied rules as well as the corresponding sequence of stati-cally scheduled actor firings are annotated to the transitions of theFSM. If two rules conflict, they are connected with an edge in theconflict graph shown to the right.

are caused by so-called conflict rules. More formally, the conflict relation fortwo rules is defined below:

Definition 4.16 (Rule Conflict [FZHT13*]). Given two rules 𝑟1 and 𝑟2, thenthese rules conflict, i.e., conflict(𝑟1, 𝑟2), if and only if these rules have a commonstate 𝑞cur ∈ 𝑄1 ∩𝑄2 (cf. Equation (4.29)), and they fire at least one commonactor (cf. Equation (4.30)).

150

Page 159: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

conflict(𝑟1, 𝑟2) = 𝑄1 ∩𝑄2 = ∅ (4.29)∧min(𝑟1.𝜂, 𝑟2.𝜂) = 0 (4.30)

If two rules 𝑟1 and 𝑟2 conflict and are both enabled in a state 𝑞cur ∈ 𝑄1∩𝑄2,then applying one of them will disable the other rule.

Proof. [FZHT13*] Let actor 𝑎 be such a common actor which is fired by bothrules. Then, by construction (cf. Equation (4.26)), the lower and upper boundsof the corresponding intervals are as follows: 𝑟1.𝑙(𝑎) = 𝑟1.𝑢(𝑎) = 𝑛 and𝑟2.𝑙(𝑎) = 𝑟2.𝑢(𝑎) = 𝑚. By the rule conflict definition, both rules have atleast one common state. Hence, the intersection of their intervals is not empty.However, as both intervals contain only a single value, this can only be thecase if 𝑛 = 𝑚. Let 𝑞cur(𝑎) = 𝑛 in order to enable both 𝑟1 and 𝑟2. With-out loss of generality, assume that 𝑟1 is applied first. Then, for the next state𝑞next(𝑎) = 𝑛+ 𝑟1.𝜂(𝑎) must hold. Note that 𝑟1.𝜂(𝑎) > 0, as actor 𝑎 is fired by𝑟1. However, as 𝑞next(𝑎) = 𝑚 the rule 𝑟2 is disabled. Analogously, if 𝑟2 is firedfirst, 𝑟1 will be disabled.

Note that if two rules 𝑟1 and 𝑟2 have no common states, they cannot conflict,as they can never be enabled at the same time. Also, if two rules have indeed acommon state 𝑞cur, but do not fire common actors, they cannot conflict either.

Proof. [FZHT13*] Without loss of generality, let actor 𝑎 be such an actor whichis fired by 𝑟1 but not by 𝑟2. Then, by construction, the lower and upperbounds of the corresponding intervals are as follows: 𝑟1.𝑙(𝑎) = 𝑟1.𝑢(𝑎) = 𝑛and 𝑟2.𝑙(𝑎) = 𝑚, 𝑟2.𝑢(𝑎) = ∞. By the rule conflict definition, both ruleshave at least one common state. Hence, the intersection of their intervals isnot empty. This can only be the case if 𝑛 ≥ 𝑚. Let 𝑞cur(𝑎) = 𝑛 in order toenable both 𝑟1 and 𝑟2. If 𝑟1 is fired first, it follows for the next state that𝑞next(𝑎) = 𝑛 + 𝑟1.𝜂(𝑎) > 𝑛 ≥ 𝑚, i.e., 𝑟2 is still enabled. If, on the other hand,𝑟2 is applied first, it follows for the next state that 𝑞next(𝑎) = 𝑛+ 𝑟2.𝜂(𝑎) = 𝑛,i.e., 𝑟1 remains enabled.

An example of conflict rules are 𝑟1 and 𝑟3 from the set of initial input/outputdependency rules 𝑅ini given in Figure 4.40b. Both rules 𝑟1 and 𝑟3 are enabledin state 𝑞0. However, if 𝑟1 is fired first, 𝑟3 cannot be fired anymore. On theother hand, if 𝑟3 is fired first, 𝑟1 cannot be fired anymore. These conflicts willbe resolved next.

Step 2: Resolving Conflict Rules

In order to resolve these conflicts, new rules will be added to the set of initialinput/output dependency rules 𝑅ini by a least fixed-point operation to form theset of conflict-resolved rules 𝑅lfp. For each pair 𝑟1 and 𝑟2 of conflicting rules,

151

Page 160: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

common actor firings are extracted from both rules, resulting, in the generalcase, in two additional rules 𝑟1

1 and 𝑟12. Then, 𝑟1

2 can be applied after 𝑟1, andanalogously, 𝑟1

1 can be executed after 𝑟2 (For a detailed proof cf. Section 4.4.5Lemma 4.2 on page 165). More formally, the conflict resolution operator isdefined as follows:

Definition 4.17 (Conflict Resolution Operator [FZHT13*]). Given two con-flicting rules 𝑟1 and 𝑟2, the conflict resolving operation (𝑟1

1, 𝑟12) = 𝑟1 ◇𝑟2 gener-

ates two new rules 𝑟11 and 𝑟1

2 such that 𝑟12 can be executed after 𝑟1, and 𝑟1

1 canbe executed after 𝑟2. Let 𝜂𝑐 = min(𝑟1.𝜂, 𝑟2.𝜂) be the pointwise minimum of thepartial repetition vectors 𝑟1.𝜂 and 𝑟2.𝜂. Note that 𝜂𝑐 represents the commonactor firings of 𝑟1 and 𝑟2. Then, the lower bounds and partial repetition vectorsof 𝑟1

1 and 𝑟12 can be calculated as follows:

𝑟11.𝑙 = 𝑟1.𝑙 + 𝜂𝑐 𝑟1

1.𝜂 = 𝑟1.𝜂 − 𝜂𝑐

𝑟12.𝑙 = 𝑟2.𝑙 + 𝜂𝑐 𝑟1

2.𝜂 = 𝑟2.𝜂 − 𝜂𝑐

The upper bounds 𝑟11.𝑢 and 𝑟1

2.𝑢 are then calculated from these lower boundsand partial repetition vectors as described by Equation (4.26).

An upper bound for the possible state space of 𝑞cur can be determined fromEquation (4.22) as 𝑞cur ∈ 𝑄cur = { 𝑞 ∈ 𝑄rules | 0 ≤ 𝑞 � 𝜂rep

𝛾 }. Note that,in the general case, the conflict resolution operator ‘◇’ might produce rules 𝑟1

1

and 𝑟12 with lower bounds greater than or equal to the repetition vector of the

cluster. However, such rules can never be executed since the current state 𝑞cur

will never be greater than or equal to the repetition vector. Note that suchrules must not be discarded, but must be shifted, i.e., 𝑟′ = 𝑟 + 𝑘 · 𝜂rep

𝛾 , to theircanonical position in the state space.68

The canonical position for an arbitrary state 𝑞 ∈ 𝑄rules is 𝑞′ = 𝑞 mod 𝜂rep𝛾 ,

i.e., the state 𝑞′ = 𝑞 + 𝑘 · 𝜂rep𝛾 , 𝑘 ∈ Z such that 0 ≤ 𝑞′ � 𝜂rep

𝛾 . Equivalently,the notation 𝑟 mod 𝜂 is used to denote a modulo operation for rules.69 Themodulo operation 𝑟′ = 𝑟 mod 𝜂rep

𝛾 shifts the state space 𝑄𝑟 where the rule 𝑟 isapplicable into the new state space 𝑄𝑟′ where the new rule 𝑟′ is applicable byadding an integer multiple of the repetition vector to each state in 𝑄𝑟 in sucha way that the intersection 𝑄cur ∩𝑄𝑟′ is not empty.

68The notation 𝑟′ = 𝑟 + 𝜂shift is used to denote that the rule 𝑟′ is a copy of 𝑟 where thestate space 𝑄𝑟′ has been derived from shifting the state space 𝑄𝑟 by adding the partialrepetition vector 𝜂shift to each state in 𝑄𝑟, i.e., 𝑟′.𝑙 = 𝑟.𝑙+𝜂shift, 𝑟′.𝑢 = 𝑟.𝑢+𝜂shift, and𝑟′.𝜂 = 𝑟.𝜂.

69The notation 𝑟′ = 𝑟 mod 𝜂 is used to denote a modulo operation for rules. The modulooperation mod is defined for rules 𝑟.𝑙, 𝑟.𝑢 ∈ Z𝑑 and 𝜂 ∈ N𝑑. It computes the shifted rule𝑟′ = 𝑟 + 𝑘 · 𝜂, 𝑘 ∈ Z such that 𝑟′.𝑙 is the smallest vector with 𝑟′.𝑙 ≥ 0.

152

Page 161: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

Moreover, since rules may now be shifted (cf. Equation (4.35)), the conflictcheck (cf. Equation (4.38)) must now also detect conflicts between shifted ver-sions (cf. Equation (4.37)) of rules. More formally, the set of conflict-resolvedrules 𝑅lfp for a cluster can be derived as follows:

Definition 4.18 (The Set of Conflict-Resolved Rules [FZHT13*]). Given theset of initial input/output dependency rules 𝑅ini, then the set of conflict-resolvedrules 𝑅lfp is derived by computing the least fixed point (cf. Equation (4.31)) ofthe resolve function. The resolve function is a monotonically increasing func-tion that enlarges (cf. Equation (4.33)) the set of rules 𝑅 starting (cf. Equa-tion (4.34)) from the set of initial input/output dependency rules 𝑅ini by adding(cf. Equation (4.35)) the canonical version of all rules 𝑟1

1 and 𝑟12 generated by

the conflict resolution operator ‘◇’ (cf. Equation (4.39)) for all pairs (cf. Equa-tion (4.36)) of possibly shifted (cf. Equation (4.37)) conflicting rules (cf. Equa-tion (4.38)). This can be defined precisely as follows:

𝑅lfp = lfp(𝑅 ={ 𝑟 ∈ resolve(𝑅) | (4.31)𝑟.𝜂 > 0 }) (4.32)

resolve(𝑅) =𝑅∪ (4.33)𝑅ini∪ (4.34){ 𝑟1

1 mod 𝜂rep𝛾 , 𝑟1

2 mod 𝜂rep𝛾 | (4.35)

∃𝑟1, 𝑟2 ∈ 𝑅, (4.36)𝑛 ∈ Z, 𝑟′

2 = 𝑟2 + 𝑛 · 𝜂rep𝛾 : (4.37)

conflict(𝑟1, 𝑟′2) (4.38)

(𝑟11, 𝑟

12) = 𝑟1 ◇ 𝑟′

2 } (4.39)

Note that the conflict resolution operator ‘◇’ might also produce rules with noactor firings. These rules can be discarded (cf. Equation (4.32)). However, theproduction of such rules hints at the presence of redundant rules in the set ofconflict-resolved rules 𝑅lfp. A special case arises that leads to redundant rulesduring conflict resolution if for a pair of conflicting rules 𝑟other and 𝑟red the rule𝑟other is more general than the rule 𝑟red. More formally, the set of redundantrules 𝑅red can be derived as follows:

Definition 4.19 (Rule Redundancy [FZHT13*]). Given a set of rules 𝑅lfp,then the function redundant(𝑅lfp) = 𝑅red selects all rules in 𝑅lfp that areredundant due to the presence of another rule in 𝑅lfp. In detail, a rule 𝑟red isredundant if and only if another rule 𝑟other exists (cf. Equation (4.40)) in 𝑅lfp

that can be applied in a superset 𝑄other ⊇ 𝑄red of states where the redundant

153

Page 162: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

rule 𝑟red can be applied (cf. Equation (4.41)) and the other rule fires a subsetof the actor firings of the redundant rule (cf. Equation (4.42)).

redundant(𝑅lfp) ={ 𝑟red ∈ 𝑅lfp | ∃𝑟other ∈ 𝑅lfp ∖ { 𝑟red } : (4.40)𝑟other.𝑙 ≤ 𝑟red.𝑙 ∧ 𝑟other.𝑢 ≥ 𝑟red.𝑢∧ (4.41)𝑟red.𝜂 ≥ 𝑟other } (4.42)

Conflict resolution of 𝑟red and 𝑟other will have generated rules 𝑟1other and 𝑟1

red.In this case, the common actor firings are 𝜂𝑐 = min(𝑟other.𝜂, 𝑟red.𝜂) = 𝑟other.𝜂,and thus, 𝑟1

other.𝜂 = 0. This means that 𝑟1other can be discarded (cf. Equa-

tion (4.32)) because it does not fire any actors. Also, 𝑟red can be eliminated asa redundant rule.

Proof. [FZHT13*] Lemma 4.2 on page 165 shows that 𝑟1red can be applied after

𝑟other and will lead to the same state as firing 𝑟red directly. Consequently, in allstates where 𝑟red can be applied, the sequence of rule applications ⟨𝑟other, 𝑟

1red⟩

can also be taken, and will result in the same destination state. Hence, rule 𝑟red

is redundant.

To exemplify, consider the set of initial input/output dependency rules 𝑅ini ={𝑟1, 𝑟2, . . . 𝑟5} from Figure 4.40b. The corresponding conflict graph is shown inFigure 4.41b. As can be seen in the conflict graph, the rules 𝑟1 and 𝑟2, 𝑟1 and𝑟3, as well as 𝑟2 and 𝑟3 conflict.In the first step, the conflict between 𝑟1 and 𝑟2 will be resolved. A com-

mon state of these rules is 𝑞0. The vector of common actor firings is 𝜂𝑐 =(0, 1, 0). Hence, the conflict resolving operation 𝑟1◇𝑟2 results in two rules: 𝑟1

1 =((0, 1, 0), (0,∞,∞), (1, 0, 0)), and 𝑟1

2 = ((0, 1, 0), (∞,∞,∞), (0, 0, 0)) (whichcan be discarded due to the zero partial repetition vector). Note that 𝑟2 ismore general than 𝑟1. Therefore, 𝑟1 can be eliminated as a redundant rule. Itshould be noted that this conflict resolving operation also resolves the conflictbetween 𝑟1 and 𝑟3. The set of rules after this first conflict resolving step is𝑅1

ini = {𝑟11, 𝑟2, 𝑟3, 𝑟4, 𝑟5}. The resulting FSM and conflict graph are depicted

in Figure 4.42.Analogously, the second step resolves the conflict between 𝑟2 and 𝑟3. The

set of conflict-resolved rules after this second (and last) conflict resolving stepis 𝑅lfp ∖𝑅red = {𝑟1

1, 𝑟2, 𝑟13, 𝑟4, 𝑟5}. The resulting FSM and conflict graph are

depicted in Figure 4.43.Comparing the FSM in Figure 4.38 on page 142 to the FSM obtained by

simulation in Figure 4.43a, the former features a transition from 𝑞4 to 𝑞2 firingboth 𝑎2 and 𝑎3, while in the latter, this transition is split into two transitions:One transition from 𝑞4 to 𝑞7 firing 𝑎3 only, and a second transition from 𝑞7 to𝑞2 firing 𝑎2. The reason for this can be found in the conflict resolution step:

154

Page 163: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

𝑟11𝑓⟨𝑎1⟩

𝑓⟨𝑎1⟩

𝑓⟨𝑎3⟩

𝑟4

𝑟5

𝑟11

𝑓⟨𝑎2⟩ 𝑟2

𝑓⟨𝑎1⟩

𝑓⟨𝑎2 ,𝑎

3 ⟩𝑟3

𝑟2

𝑓⟨𝑎2⟩

𝑓⟨𝑎1⟩𝑟11

𝑞0 𝑞1

𝑞2 𝑞4

𝑞5𝑞3

𝑞6

(a) Simulated FSM after first conflict re-solving step

𝑟2 = ((0, 0, 0), (∞, 0,∞), (0, 1, 0))

𝑟3 = ((0, 0, 0), (∞, 0, 0), (0, 1, 1))

𝑟4 = ((1, 1, 0), (1,∞,∞), (1, 0, 0))

𝑟5 = ((0, 1, 1), (∞,∞, 1), (0, 0, 1))

𝑟11 = ((0, 1, 0), (0,∞,∞), (1, 0, 0))

(b) Conflict graph for rules after firstconflict resolving step

Figure 4.42: Listed to the right is the set of rules 𝑅1ini derived from one step

of conflict resolution of the set of initial input/output dependencyrules 𝑅ini. The redundant rule 𝑟1 has already been discarded.

Although it generates rules which guarantee a deadlock-free execution of thecomposite actor, it may also generate rules which do not fire any input actors.While this is perfectly valid if counter variables for all actors are maintained

during the execution of the composite actor, it becomes problematic if onlycounter variables for input actors are used, which is desirable, as it means lesschecks on rule conditions, and thus, less scheduling overhead for the compositeactor. This is possible, as the state of the cluster can always be determinedsolely based on the number of input actor firings.70

For example, consider rule 𝑟2 in Figure 4.43b, which is only applicable if thecounter for output actor 𝑎2 is zero. However, eliminating the condition for 𝑎2results in a rule which is always applicable, as the firing intervals correspondingto 𝑎1 and 𝑎3 are not constrained.In order to only use counter variables for input actors, all rules that do not

fire any input actors must be eliminated. This is discussed next.

Step 3: Rule Merging

Let 𝑟in be a rule which fires at least one input actor (and possibly some outputactors), and 𝑟ni be a rule which does not fire any input actors. After the rule𝑟in has been applied, the set of possible states can be computed as 𝑄′

in =𝑄in + 𝑟in.𝜂.71 In principle, 𝑟ni can be applied after 𝑟in if the set of states70As always, the initial cluster state 𝑞0 is the exception.71The notation 𝑄′ = 𝑄+ 𝜂shift is used to denote that the state set 𝑄′ is derived by shifting

all states of 𝑄 by adding the partial repetition vector 𝜂shift to each state in 𝑄, i.e.,

155

Page 164: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑟11

𝑓⟨𝑎3⟩ 𝑟13

𝑓⟨𝑎1⟩

𝑓⟨𝑎1⟩

𝑓⟨𝑎3⟩

𝑟4

𝑟5

𝑟11

𝑓⟨𝑎2⟩ 𝑟2

𝑓⟨𝑎3⟩ 𝑟13

𝑓⟨𝑎1⟩

𝑓⟨𝑎2⟩ 𝑟2

𝑓⟨𝑎2⟩

𝑓⟨𝑎1⟩𝑟11

𝑟2

𝑓⟨𝑎3⟩

𝑞0 𝑞1

𝑞2 𝑞4

𝑞5𝑞3

𝑞6

𝑞7

𝑟13

(a) Simulated FSM after second conflictresolving step

𝑟2 = ((0, 0, 0), (∞, 0,∞), (0, 1, 0))

𝑟13 = ((0, 1, 0), (∞,∞, 0), (0, 0, 1))

𝑟4 = ((1, 1, 0), (1,∞,∞), (1, 0, 0))

𝑟5 = ((0, 1, 1), (∞,∞, 1), (0, 0, 1))

𝑟11 = ((0, 1, 0), (0,∞,∞), (1, 0, 0))

(b) Conflict graph for rules after sec-ond conflict resolving step

Figure 4.43: Shown to the right is the set of conflict-resolved rules after removalof the redundant rules, i.e., 𝑅lfp ∖ 𝑅red. The FSM derived fromsimulating these rules is depicted to the left.

𝑄𝑐 = 𝑄′in ∩ 𝑄ni is non-empty (cf. Figure 4.44a).72 Then, 𝑟in can be merged

with 𝑟ni, resulting in a new merged rule 𝑟m, which combines the actor firingsperformed separately by 𝑟in and 𝑟ni, i.e., 𝑟m.𝜂 = 𝑟in.𝜂+𝑟ni.𝜂 (cf. Figure 4.44b).The state set 𝑄m of this merged rule can simply be calculated by shifting backthe state set 𝑄𝑐, i.e, 𝑄m = 𝑄𝑐 − 𝑟in.𝜂 ⊆ 𝑄in.However, inserting such a merged rule 𝑟m into the conflict-free set of rules

would destroy this property since 𝑟in conflicts with 𝑟m due to common actorfirings, i.e, 𝑟in.𝜂 ≤ 𝑟m.𝜂, and common states, i.e., 𝑄m ⊆ 𝑄in. Therefore, themerge operation must also restrict the state set 𝑄in of the original rule 𝑟in suchthat it can no longer be applied to states in 𝑄m (where 𝑟m can be applied).To this end, the set ��in = 𝑄in ∖𝑄m has to be calculated in terms of intervalboundary vectors. Without going into detail, this operation results in a set ofrules ��in = {��in,0, ��in,1, . . . , ��in,𝑛} necessary to cover all states in ��in. The rulesin ��in may not be conflict-free, but as they are derived from the same rule,

𝑄′ = { 𝑞+𝜂shift | 𝑞 ∈ 𝑄 }. This operation can be performed by shifting the correspondinginterval boundary vectors 𝑙 and 𝑢 of the set 𝑄 by adding the partial repetition vector𝜂shift to them.

72Note that in this case, the interval boundary vectors 𝑙𝑐 and 𝑢𝑐 corresponding to 𝑄𝑐 canbe calculated by intersecting the interval boundary vectors of 𝑄ni and 𝑄′

in, i.e., 𝑙𝑐 =max(𝑟ni.𝑙, 𝑟in.𝑙+ 𝑟in.𝜂), and 𝑢𝑐 = min(𝑟ni.𝑢, 𝑟in.𝑢+ 𝑟in.𝜂).

156

Page 165: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

𝑟in

𝑟in

𝑟ni

𝑟ni

𝑄′in

𝑄in

𝑄𝑐

𝑄ni 𝑄′ni

(a) Rules 𝑟in and 𝑟ni before merging

𝑟m

𝑟ni𝑄′

ni

𝑄m

��in,𝑘��

′in

𝑄ni

��in

(b) Resulting merged rule 𝑟m and a set of rules,which replaces 𝑟in, containing at least onerule 𝑟in,𝑘

Figure 4.44: Graphical depiction of a rule merging situation. Shown on the leftside is the original situation. Depicted on the right side is theresulting situation after merging has been performed.

namely 𝑟in, they can be safely added to the set of final rules 𝑅fin. Note that as𝑟ni will be eliminated later, it does not have to be restricted like 𝑟in.An important point which has to be considered during the rule merging step

is that if 𝑟ni is eliminated after the rule merging step, the merge operation hasto be an over-approximation. Otherwise, some states in the FSM may not haveoutgoing transitions, therefore causing deadlocks during the execution of thecomposite actor. Given two rules 𝑟in and 𝑟ni, 𝑄𝑐 may be empty, in which casethey will not be merged. As a result, when merging rules, equivalent rules of𝑟ni, i.e., rules 𝑟ni + 𝑘 · 𝜂rep

𝛾 , 𝑘 ∈ N0 must also be considered. If, however, due to𝑘 > 0, the merged rule 𝑟m has 𝑟m.𝑙 ≥ 𝜂rep

𝛾 , it can safely be discarded, as thecounter variables maintained during simulation of the composite actor cannotreach the values of such lower bounds of the condition intervals.Finally, a special rule may exist which fires only output actors, but must not

be eliminated after the rule merging step: This is the case if sufficient initialtokens, e.g., channels 𝑐1→2 and 𝑐3→2 in the DFG 𝑔ex7 depicted in Figure 4.36on page 137, are contained in the cluster such that output actors can be firedwithout firing any input actors. Obviously, eliminating this special rule wouldprevent the composite actor from running at all as the initial state 𝑞0 wouldhave no outgoing transitions. As only one such rule can exist, it can easily betagged as special and, thus, can be retained after the merging step.To exemplify, the conflict-free set of rules 𝑅lfp ∖ 𝑅red = {𝑟1

1, 𝑟2, 𝑟13, 𝑟4, 𝑟5}

from Figure 4.43b is considered. Rule 𝑟2 fires only output actors, namely𝑎2. First, this rule is merged with rule 𝑟1

1, which fires 𝑎1, an input/out-put actor. For 𝑘 = 0, 𝑙𝑐 = max((0, 0, 0), (0, 1, 0) + (1, 0, 0)) = (1, 1, 0), and𝑢𝑐 = min((∞, 0,∞), (0,∞,∞) + (1, 0, 0)) = (0, 1,∞). In this case, 𝑟1

1 and 𝑟2,0

cannot be merged, as 𝑙𝑐 � 𝑢𝑐.

157

Page 166: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑟1,2𝑓⟨𝑎1,𝑎2⟩

𝑓⟨𝑎3,𝑎2⟩ 𝑟3,2 𝑟4

𝑓⟨𝑎1⟩

𝑟5

𝑓⟨𝑎3,𝑎2⟩

𝑟3,2

𝑓⟨𝑎3⟩

𝑓⟨𝑎1⟩ ��11

𝑓⟨𝑎3⟩ ��13

𝑓⟨𝑎1,𝑎2⟩𝑟1,2

𝑓⟨𝑎2⟩𝑟2

𝑞0 𝑞1

𝑞2 𝑞4

𝑞5𝑞3

(a) Simulated final rules 𝑅fin

𝑟 𝑙 𝑢 𝜂

��11,0 (0, 1, 0) ( 0,∞, 0) (1, 0, 0)

𝑟1,2 (0, 1, 1) ( 0, 1,∞) (1, 1, 0)𝑟2 (0, 0, 0) ( 1, 0, 1) (0, 1, 0)��13,0 (0, 1, 0) ( 0,∞, 0) (0, 0, 1)

𝑟3,2 (1, 1, 0) (∞, 1, 0) (0, 1, 1)𝑟4 (1, 1, 0) ( 1,∞,∞) (1, 0, 0)𝑟5 (0, 1, 1) (∞,∞, 1) (0, 0, 1)

(b) Final rules 𝑅fin

Figure 4.45: Depicted is the set of final rules 𝑅fin for subcluster 𝑔𝛾1,2,3from

Figure 4.36. This set of rules has been derived from the set ofconflict-free rules shown in Figure 4.43b via rule merging. Theresulting FSM derived from state space expansion of the set of finalrules 𝑅fin is also shown to the right. Applied rules are annotatedto the transitions.

However, for 𝑘 = 1, 𝑙𝑐 = max((0, 0, 0) + (1, 1, 1), (0, 1, 0) + (1, 0, 0)) = (1, 1, 1)and 𝑢𝑐 = min((∞, 0,∞) + (1, 1, 1), (0,∞,∞) + (1, 0, 0)) = (1, 1,∞). Now,𝑙𝑐 ≤ 𝑢𝑐, and 𝑟1

1 and 𝑟2,1 can be merged. The merged rule is calculated as 𝑟1,2 =((1, 1, 1)− (1, 0, 0), (1, 1,∞)− (1, 0, 0), (1, 0, 0) + (0, 1, 0)) = ((0, 1, 1), (0, 1,∞),

(1, 1, 0)). The set of restricted rules ��1

1 contains only a single rule (after pruninganother rule which could never be applied), namely ��1

1,0 = ((0, 1, 0), (0,∞, 0),(1, 0, 0)).It can be observed that 𝑟1,2 can be applied in state 𝑞5 = (0, 1, 2) (cf. Fig-

ure 4.45a), creating the missing (direct) transition from 𝑞5 to 𝑞3 (contrast Fig-ure 4.43a to Figure 4.38 on page 142). Moreover, the restricted rule 𝑟1

1 in theform of ��1

1,0 can no longer be applied in 𝑞5, thus eliminating state 𝑞6.Merging 𝑟2 with the remaining states 𝑟1

3, 𝑟4 and 𝑟5 results in the rules 𝑅fin

given in Figure 4.45b. Note that 𝑟2 has been retained as discussed above.The final step of the rule-based clustering approach is to schedule the partial

repetition vector of each rule 𝑟 ∈ 𝑅fin, the set of rules after the merging step.

4.4.4 Scheduling Static Sequences

In this section, the input/output guard 𝑘io and the action 𝑓⟨...⟩ are derived foreach transition of the cluster FSM or each rule of the set of final rules thatimplement the QSS for the composite actor 𝑎𝛾 derived via the 𝜉𝐹𝑆𝑀 or 𝜉𝑟𝑢𝑙𝑒𝑠cluster refinement operator, respectively. The actions execute sequences of stat-

158

Page 167: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

ically scheduled actor firings for the actors 𝑎 ∈ 𝑔𝛾 .𝐴 contained in the originalcluster 𝑔𝛾 . To generate these sequences of statically scheduled actor firings, apartial repetition vector 𝜂 ∈ N|𝐴|

0 is required. For the rule-based approach, thepartial repetition vector 𝑟.𝜂 of a rule 𝑟 is readily available for all rules after rulemerging has been performed, but before the state space has been reduced toonly use input actors. For the automata-based approach, the partial repetitionvector of a transition can be computed from the firings function as well as thesource state 𝑞src and the destination state 𝑞dst of the transition. More precisely,the partial repetition vector 𝑡.𝜂 of a transition 𝑡 can be defined as follows:

𝑡.𝜂 =

{𝜂rep𝛾 if 𝑡.𝑞src = 𝑡.𝑞dst

(firings(𝑡.𝑞dst)− firings(𝑡.𝑞src)) mod 𝜂rep𝛾 otherwise (4.43)

The 𝑡.𝑞src = 𝑡.𝑞dst case in Equation (4.43) covers self-loops. A self-loop repre-sents monolithic clustering, which executes a whole iteration of the static DFGcontained in the cluster. Monolithic clustering is performed by the automata-based clustering algorithm if the 𝜉𝑆𝐷𝐹 cluster refinement operation is safe. Foran exact definition of this safety property, refer to Theorem 4.1 on page 106.Note that the cluster FSMmay also include a retiming sequence (cf. Figure 4.16don page 108) before entering this self-loop state.In order to generate the input/output guard 𝑘io for a transition or rule, knowl-

edge of the cluster state in which the actor firings of the partial repetition vectorwill be performed is required. This is due to the possible presence of CSDF ac-tors in the cluster 𝑔𝛾 . The communication behavior of a CSDF actor depends onthe current phase of the actor. To determine which phases of the CSDF actorsare executed by the partial repetition vector, the starting state is required. Forthe automata-based approach, the source state 𝑡.𝑞src is used to determine thestarting state. For the rule-based approach, the lower bound 𝑟.𝑙 is used. Withthis information, the number of consumed and produced tokens on the clusterinput and output ports by the execution of the sequences of statically sched-uled actor firings derived from the partial repetition vector can be determined.Thus, the input/output guard 𝑘io for each transition or rule can be computed.More formally, using the function io : 𝑄𝛾 → Z|𝐼∪𝑂| to map from a cluster stateto the number of consumed (negative) and produced (positive) tokens on thecluster input and outputs ports, the input/output guards for all transitions canbe defined as follows:73

cons(𝑡, 𝐼) = − 𝜋𝐼(io(𝑡.𝑞src + 𝜋𝐴𝐼∪𝐴𝑂(𝑡.𝜂))− io(𝑡.𝑞src)) ∀𝑡 ∈ 𝑇

prod(𝑡, 𝑂) = 𝜋𝑂(io(𝑡.𝑞src + 𝜋𝐴𝐼∪𝐴𝑂(𝑡.𝜂))− io(𝑡.𝑞src)) ∀𝑡 ∈ 𝑇

73Here, the input/output guard is defined indirectly via the notation cons(𝑡, 𝐼) andprod(𝑡, 𝑂) that denotes the number of consumed and produced tokens when the tran-sition is taken.

159

Page 168: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

Equivalently, one can define the input/output guards for all rules:74

cons(𝑟, 𝐼) = − 𝜋𝐼(io(𝜋𝐴𝐼∪𝐴𝑂(𝑟.𝑙 + 𝑟.𝜂))− io(𝜋𝐴𝐼∪𝐴𝑂

(𝑟.𝑙))) ∀𝑟 ∈ 𝑅

prod(𝑟, 𝑂) = 𝜋𝑂(io(𝜋𝐴𝐼∪𝐴𝑂(𝑟.𝑙 + 𝑟.𝜂))− io(𝜋𝐴𝐼∪𝐴𝑂

(𝑟.𝑙))) ∀𝑟 ∈ 𝑅

Finally, the partial repetition vectors 𝑡.𝜂 or 𝑟.𝜂 for all transitions or rules,respectively, are scheduled by a modified version of the cycle-breaking algorithmpresented in [HB07], which, for pure SDF graphs, always finds an Single Ap-pearance Schedule (SAS) if existent. In an SAS, each actor occurs only once,therefore minimizing code memory size when inlining the actor firings.Basically, an SAS corresponds to a topological sorting of the actors contained

in the static cluster. This is achieved by topologically sorting the StronglyConnected Components (SCCs) of the cluster. Subsequently, for each non-trivialSCC, the algorithm removes the edges with enough initial tokens to perform awhole iteration of the subgraph of the cluster induced by the SCC. If the SCCis still strongly connected after removing such edges, the subgraph is said to betightly connected and is scheduled by simulation. Otherwise, the SCC is looselyconnected, and the algorithm is applied recursively to this SCC.For partial repetition vectors, however, it may not always be possible to find

an SAS even if one exists. This is due to the fact that the partial repetition vectorin question may not be a multiple of the repetition vector of the subgraph ofthe cluster induced by the SCC. The generated looped schedules implementingthe sequences of statically scheduled actor firings are used as actions 𝑓⟨...⟩ by thesoftware synthesis to generate C++ source code, which is itself compiled by gccto generate the executable.

4.4.5 Proving Semantic Equivalence

Before presenting experimental results showing the benefits of the presentedclustering algorithms, a formal proof of the correctness of the approach is given.The following proofs are based on infinite version 𝑄∞

ini of the set of initial in-put/output dependency states 𝑄ini.75

The semantic equivalence (cf. Definition 4.2 on page 99) of the composite actor𝑎𝛾 and the static DFG 𝑔𝛾 is proven in Theorem 4.5 by showing the equivalenceof the denotational KPN descriptions [Kah74] 𝒦𝑎𝛾 and 𝒦𝑔𝛾 for both 𝑎𝛾 and 𝑔𝛾 .To exemplify the denotational description, the DFG from Figure 4.36 is used.For the sake of convenience, this DFG is replicated here as Figure 4.46. Its

74Likewise, the input/output guard of a rule is defined indirectly via the notation cons(𝑟, 𝐼)and prod(𝑟, 𝑂) that denotes the number of consumed and produced tokens when the ruleis applied.

75For a given state space 𝑄, the infinite version is defined as 𝑄∞ = { 𝑞+𝑛 · 𝜋ℐ(𝑞)(𝜂rep𝛾 ) | 𝑞 ∈

𝑄, 𝑛 ∈ N0 }.

160

Page 169: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

denotational description 𝒦𝑔𝛾 maps tuples 𝑠in = (𝑠𝑖1 , 𝑠𝑖2) of sequences 𝑠 ∈ 𝒱**

of tokens ∙ on its input ports 𝑖1 and 𝑖2 to tuples 𝒦𝑔𝛾 (𝑠in) = 𝑠out = (𝑠𝑜1 , 𝑠𝑜2 , 𝑠𝑜3)of sequences of tokens on its output ports 𝑜1, 𝑜2, and 𝑜3. As can be seen fromthe example values of 𝒦𝑔𝛾 given below, the denotational description naturallyexpresses the maximum output from input property, which should be preservedby the clustering algorithm.

𝑜1𝑖1

𝑖2 𝑜3

𝑜2𝑐2→1

𝑐3→2

𝑐2→3

𝑐1→2𝑎1

𝑎3

𝑎2 𝑎env

𝑔𝛾1,2,3

Figure 4.46: The DFG from Figure 4.36 replicated for the convenience of thereader

𝒦𝑔𝛾 (𝜆, 𝜆) = (𝜆, ⟨∙⟩, 𝜆)

𝒦𝑔𝛾 (⟨∙⟩, 𝜆) = (⟨∙⟩, ⟨∙⟩, 𝜆)𝒦𝑔𝛾 (⟨∙, ∙⟩, 𝜆) = (⟨∙, ∙⟩, ⟨∙⟩, 𝜆)

𝒦𝑔𝛾 (⟨∙, ∙, . . .⟩, 𝜆) = (⟨∙, ∙⟩, ⟨∙⟩, 𝜆)

𝒦𝑔𝛾 (𝜆, ⟨∙⟩) = (𝜆, ⟨∙⟩, ⟨∙⟩)𝒦𝑔𝛾 (𝜆, ⟨∙, ∙⟩) = (𝜆, ⟨∙⟩, ⟨∙, ∙⟩)

𝒦𝑔𝛾 (𝜆, ⟨∙, ∙, . . .⟩) = (𝜆, ⟨∙⟩, ⟨∙, ∙⟩)

𝒦𝑔𝛾 (⟨∙⟩, ⟨∙⟩) = (⟨∙⟩, ⟨∙, ∙⟩, ⟨∙⟩)

𝒦𝑔𝛾 (⟨∙, ∙⟩, ⟨∙⟩) = (⟨∙, ∙⟩, ⟨∙, ∙⟩, ⟨∙⟩)𝒦𝑔𝛾 (⟨∙, ∙, ∙⟩, ⟨∙⟩) = (⟨∙, ∙, ∙⟩, ⟨∙, ∙⟩, ⟨∙⟩)

𝒦𝑔𝛾 (⟨∙, ∙, ∙, . . .⟩, ⟨∙⟩) = (⟨∙, ∙, ∙⟩, ⟨∙, ∙⟩, ⟨∙⟩)

161

Page 170: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝒦𝑔𝛾 (⟨∙⟩, ⟨∙, ∙⟩) = (⟨∙⟩, ⟨∙, ∙⟩, ⟨∙, ∙⟩)𝒦𝑔𝛾 (⟨∙⟩, ⟨∙, ∙, ∙⟩) = (⟨∙⟩, ⟨∙, ∙⟩, ⟨∙, ∙, ∙⟩)

𝒦𝑔𝛾 (⟨∙⟩, ⟨∙, ∙, ∙, . . .⟩) = (⟨∙⟩, ⟨∙, ∙⟩, ⟨∙, ∙, ∙⟩)· · ·

Note that there are an infinite number of tuples 𝑠in and 𝑠out representingthe infinite stream of tokens that is processed by a cluster. Hence, later on, amathematical proof for the equivalence of 𝒦𝑔𝛾 and 𝒦𝑎𝛾 will be given. For thisproof, the token values can be abstracted away by using ∙ as token symbol.This can be done because clustering can also be thought of as KPN processcomposition. Therefore, the same operations will always be performed on thesame values. However, it is not trivially obvious if the set of rules generated bythe clustering algorithm will always produce the maximum number of outputtokens from the given input tokens. For example, a fully static scheduling of𝑔𝛾 by the composite actor 𝑎𝛾 would produce the following KPN function 𝒦𝑎𝛾 ,which is not equivalent to 𝒦𝑔𝛾 :

𝒦𝑎𝛾 (𝜆, 𝜆) = (𝜆, 𝜆, 𝜆)

𝒦𝑎𝛾 (⟨∙⟩, 𝜆) = (𝜆, 𝜆, 𝜆)

𝒦𝑎𝛾 (⟨∙, . . .⟩, 𝜆) = (𝜆, 𝜆, 𝜆)

𝒦𝑎𝛾 (𝜆, ⟨∙⟩) = (𝜆, 𝜆, 𝜆)

𝒦𝑎𝛾 (𝜆, ⟨∙, . . .⟩) = (𝜆, 𝜆, 𝜆)

𝒦𝑎𝛾 (⟨∙⟩, ⟨∙⟩) = (⟨∙⟩, ⟨∙⟩, ⟨∙⟩)𝒦𝑎𝛾 (⟨∙, . . .⟩, ⟨∙⟩) = (⟨∙⟩, ⟨∙⟩, ⟨∙⟩)𝒦𝑎𝛾 (⟨∙⟩, ⟨∙, . . .⟩) = (⟨∙⟩, ⟨∙⟩, ⟨∙⟩)

· · ·

Therefore, by proving semantic equivalence, the correctness of the clusteringalgorithms will also be proven. The opposite, that a correct (in the sense ofalways producing maximum output from minimal input) clustering algorithmwill produce a semantically equivalent 𝒦𝑎𝛾 from 𝒦𝑔𝛾 is trivially true.Before tackling the correctness proof for the clustering algorithm, some prepa-

ration is required. First, the semantic equivalence proof requires that for each

162

Page 171: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

pair of reachable states 𝑞𝑥, 𝑞𝑦 ∈ 𝑄 with 𝑞𝑦 ≥ 𝑞𝑥, a sequence of statically sched-uled actor firings exists that can be executed in the cluster state 𝑞𝑥 resulting inthe destination state 𝑞𝑦. The proof for the existence of 𝑎 is derived from themax state reachability lemma given below.

Max State Reachability

The notation 𝑞𝑥𝑎−→ 𝑞𝑦 is used to denote the existence of a sequence of statically

scheduled actor firings 𝑎, which can be executed in state 𝑞𝑥 leading to state 𝑞𝑦.Furthermore, a state 𝑞 is called reachable if and only if 𝑞0

𝑎−→ 𝑞, and a state𝑞𝑦 is called reachable from 𝑞𝑥 if and only if 𝑞𝑥

𝑎−→ 𝑞𝑦. In the following, thereachability of state 𝑞𝑚 = max(𝑞𝑥, 𝑞𝑦) from both states 𝑞𝑥 and 𝑞𝑦 is provenunder the precondition that these two states are reachable themselves.

Lemma 4.1 (Max State Reachability [FZHT13*]). If two states 𝑞𝑥, 𝑞𝑦 ∈ 𝑄are reachable, the pointwise maximum of these two states 𝑞𝑚 = max(𝑞𝑥, 𝑞𝑦) isreachable from both 𝑞𝑥 and 𝑞𝑦.

Proof. [FZHT13*] The proof is based on the well-known freedom of conflictproperty of SDF systems. The property can be derived by considering SDFas a subclass of place/transition Petri nets (p/t-nets, for short), where actorscorrespond to transitions and channels to places. In general p/t-nets, firing atransition can disable another transition which was previously enabled. Thissituation is referred to as a conflict. However, for the subclass of p/t-netscorresponding to SDF, this can never happen [Mur89]. Therefore, for SDFsystems, an actor 𝑎, once enabled, cannot be disabled by firing another actor𝑎′ ∈ 𝐴 ∖ { 𝑎 }.Without loss of generality, only the existence of 𝑞𝑥

𝑎𝑥,𝑚−→ 𝑞𝑚 is proven. Giventwo reachable states 𝑞𝑥 (𝑞0

𝑎𝑥−→ 𝑞𝑥) and 𝑞𝑦 (𝑞0

𝑎𝑦−→ 𝑞𝑦) as well as define the set𝐴′ to be the set of actors which have more occurrences in 𝑎𝑦 than in 𝑎𝑥, i.e.,𝐴′ = {𝑎 ∈ 𝐴 | #𝑎𝑎𝑦 > #𝑎𝑎𝑥}76, then the proposition 𝑞𝑥

𝑎𝑥,𝑚−→ 𝑞𝑚 is triviallytrue if 𝐴′ is empty since 𝑞𝑚 = 𝑞𝑥. Otherwise, let 𝑎′

𝑦 be the longest prefix of𝑎𝑦 with the property that each actor 𝑎 ∈ 𝐴′ occurs less than or equal to #𝑎𝑎𝑥.Furthermore, due to ∀𝑎 ∈ 𝐴 ∖ 𝐴′ : #𝑎𝑎

′𝑦 ≤ #𝑎𝑎𝑦 ≤ #𝑎𝑎𝑥, all actors 𝑎 ∈ 𝐴 occur

less than or equal to #𝑎𝑎𝑥 in the sequence of statically scheduled actor firings𝑎′𝑦. Moreover, after executing the sequence 𝑎′

𝑦, at least one enabled actor 𝑎′ ∈ 𝐴′

exists. Therefore, due to the SDF freedom of conflict property, this actor 𝑎′ mustalso be enabled after executing sequence 𝑎𝑥. Firing 𝑎′ in state 𝑞𝑥 will lead to anew state 𝑞′

𝑥 with property 𝑞𝑚 = max(𝑞′𝑥, 𝑞𝑦). Furthermore, the existence of a

76The notation #𝑥𝑥 is used to denote the number of occurrences of a value 𝑥 in a sequence 𝑥,e.g, #𝑎𝑥⟨𝑎𝑥, 𝑎𝑦, 𝑎𝑥⟩ = 2 denotes that the actor 𝑎𝑥 occurs twice in the sequence of staticallyscheduled actor firings 𝑎 = ⟨𝑎𝑥, 𝑎𝑦, 𝑎𝑥⟩.

163

Page 172: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝑞′𝑥

𝑎𝑥′,𝑚−→ 𝑞𝑚 is proven by induction where 𝐴′ = ∅ is the induction start. Finally,𝑎𝑥,𝑚 is given by concatenating the sequences of statically scheduled actor firings⟨𝑎′⟩ and 𝑎𝑥′,𝑚.

Next, the proof of the existence of a sequence of statically scheduled actor fir-ings 𝑞𝑥

𝑎−→ 𝑞𝑦 for all 𝑞𝑦 ≥ 𝑞𝑥 with 𝑞𝑥, 𝑞𝑥 ∈ 𝑄∞𝛾 is sketched using the max state

reachability lemma. This lemma requires the reachability of the states 𝑞𝑥 and𝑞𝑦. First, note that all states 𝑞 ∈ 𝑄∞

ini are trivially reachable by construction,i.e., the definition of the input/output dependency function (cf. Definition 4.7 onpage 138) implies the existence of a sequence 𝑞0

𝑎−→ 𝑞 for all states 𝑞 ∈ 𝑄∞ini.

Furthermore, the set of initial input/output dependency states are derived fromthe set of input/output dependency function values by determining the maxi-mal number (cf. Definition 4.9 on page 140) of output actor firings possible withthe input actor firings given by an input/output dependency function value.The remaining states 𝑄∞

𝛾 ∖ 𝑄∞ini are derived by pairwise maximum operations

from 𝑄∞ini via a least fixed-point operation (cf. Definition 4.10 on page 141). Let

𝑞𝑚 = max(𝑞𝑥, 𝑞𝑦) be a state, which is added by a step of the lfp operation, then

𝑞0

𝑎a𝑥 𝑎𝑥,𝑚−→ 𝑞𝑚 is a sequence of statically scheduled actor firings that reaches 𝑞𝑚.

With the above reasoning, it is proven that all states 𝑞 ∈ 𝑄∞𝛾 are reachable

and that ∀𝑞𝑥, 𝑞𝑦 ∈ 𝑄∞𝛾 : 𝑞𝑦 ≥ 𝑞𝑥 =⇒ 𝑞𝑥

𝑎𝑥,𝑦−→ 𝑞𝑦 exists. However, the exis-tence of a sequence 𝜌𝑥,𝑦 of rule applications realizing the sequence of staticallyscheduled actor firings 𝑎𝑥,𝑦 also needs to be proven.

Max State Reachability via Rules

First, some more notation is required: 𝑞𝑥𝑟−→ 𝑞𝑦 denotes the existence of a rule

𝑟 which can be applied in state 𝑞𝑥 leading to a new state 𝑞𝑦, while 𝑞𝑥

𝜌𝑥,𝑦−→ 𝑞𝑦

denotes the existence of a sequence 𝜌𝑥,𝑦 of rule applications from state 𝑞𝑥 tostate 𝑞𝑦, i.e., 𝑞𝑥 = 𝑞𝑥,0

𝑟𝑥,1−→ 𝑞𝑥,1

𝑟𝑥,2−→ . . .𝑟𝑥,𝑖−→ 𝑞𝑥,𝑖 = 𝑞𝑦. Furthermore, a state

𝑞 is called rule reachable if and only if 𝑞0

𝜌𝑥−→ 𝑞, and a state 𝑞𝑦 is called rule

reachable from 𝑞𝑥 if and only if 𝑞𝑥

𝜌𝑥,𝑦−→ 𝑞𝑦. Moreover, rules 𝑟 are not of anarbitrary form, but obey the rule property (cf. Definition 4.15 on page 149).In the following, the rule reachability of 𝑞𝑚 = max(𝑞𝑥, 𝑞𝑦) from both states

𝑞𝑥 and 𝑞𝑦 is proven, assuming all rules obey the rule property. First, a specialcase is proven in Lemma 4.2 for rules 𝑟𝑥 and 𝑟𝑦 that are enabled in a commonstate 𝑞. For this limited case, the sequences consist both of a single rule, i.e.,

𝜌𝑥,𝑚 = 𝑞𝑥

𝑟1𝑦−→ 𝑞𝑚 and 𝜌𝑦,𝑚 = 𝑞𝑦

𝑟1𝑥−→ 𝑞𝑚, which execute the sequence 𝑎𝑥,𝑚 and

𝑎𝑦,𝑚, respectively. Later on, this result will be extended by Lemma 4.3 to allrule reachable states 𝑞𝑥, 𝑞𝑦 ∈ 𝑄. From this result, a proof is sketched that allstates 𝑞 ∈ 𝑄 are rule reachable. More formally, the limited case is given below:

164

Page 173: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

Lemma 4.2 (Single-Step Max State Rule Reachability [FZHT13*]). Given tworules 𝑟𝑥, 𝑟𝑦 ∈ 𝑅 that can both be applied in state 𝑞, i.e., 𝑟𝑥.𝑙 ≤ 𝑞 ≤ 𝑟𝑥.𝑢and 𝑟𝑦.𝑙 ≤ 𝑞 ≤ 𝑟𝑦.𝑢, then the pointwise maximum 𝑞𝑚 = max(𝑞𝑥, 𝑞𝑦) is alsoreachable from both states 𝑞𝑥 and 𝑞𝑦 via the two rules 𝑟1

𝑦 and 𝑟1𝑥 given in

Definition 4.17 on page 152, i.e., 𝑞𝑥

𝑟1𝑦−→ 𝑞𝑚 and 𝑞𝑦

𝑟1𝑥−→ 𝑞𝑚.

Proof. [FZHT13*] Without loss of generality, 𝑟𝑥 is selected to be applied first.Therefore, rule application of 𝑟𝑥 will lead to the state 𝑞

𝑟𝑥−→ 𝑞 + 𝑟𝑥.𝜂 = 𝑞𝑥.Hence, the rule 𝑟1

𝑦 is selected for application.77 For 𝑟1𝑦 to execute, its precondi-

tion 𝑟1𝑦.𝑙 ≤ 𝑞𝑥 ≤ 𝑟1

𝑦.𝑢 has to be satisfied. First, the lower bound 𝑟1𝑦.𝑙 ≤ 𝑞𝑥 is

considered. With (Definition 4.17) 𝑟1𝑦.𝑙 = 𝑟𝑦.𝑙+min(𝑟𝑦.𝜂, 𝑟𝑥.𝜂) ≤ 𝑟𝑦.𝑙+𝑟𝑥.𝜂 ≤

𝑞 + 𝑟𝑥.𝜂 = 𝑞𝑥, the lower bound is handled. For the upper bound 𝑞𝑥 ≤ 𝑟1𝑦.𝑢,

two cases can occur for an actor 𝑎: First, 𝑟𝑦.𝜂(𝑎) ≤ 𝑟𝑥.𝜂(𝑎). In this case,𝑟1𝑦.𝜂(𝑎) = 𝑟𝑦.𝜂(𝑎) − min(𝑟𝑦.𝜂(𝑎), 𝑟𝑥.𝜂(𝑎)) = 0, and, by construction 𝑞𝑥(𝑎) ≤

𝑟1𝑦.𝑢(𝑎) = ∞. Otherwise, 𝑟𝑦.𝜂(𝑎) > 𝑟𝑥.𝜂(𝑎), and, by construction 𝑟1

𝑦.𝑢(𝑎) =𝑟𝑦.𝑢(𝑎) +min(𝑟𝑦.𝜂(𝑎), 𝑟𝑥.𝜂(𝑎)) = 𝑟𝑦.𝑢(𝑎) + 𝑟𝑥.𝜂(𝑎) ≥ 𝑞(𝑎) + 𝑟𝑥.𝜂(𝑎) = 𝑞𝑥(𝑎).After proving the precondition, 𝑟1

𝑦 is executed and the resulting state 𝑞′ is calcu-lated to be 𝑞′ = 𝑞𝑥+𝑟𝑦.𝜂−min(𝑟𝑦.𝜂, 𝑟𝑥.𝜂) = 𝑞+𝑟𝑥.𝜂+𝑟𝑦.𝜂−min(𝑟𝑦.𝜂, 𝑟𝑥.𝜂) =𝑞 +max(𝑟𝑦.𝜂, 𝑟𝑥.𝜂) = max(𝑞 + 𝑟𝑦.𝜂, 𝑞 + 𝑟𝑥.𝜂) = max(𝑞𝑦, 𝑞𝑥) = 𝑞𝑚.

In the following, the general case with arbitrary 𝑞𝑥, 𝑞𝑦 ∈ 𝑄 is discussed.The lemma only requires that there is a common state 𝑞0 from which both 𝑞𝑥

and 𝑞𝑦 are rule reachable. Note that this is trivially true for the set of initialinput/output dependency states 𝑞 ∈ 𝑄∞

ini as the derivation of the set of initialinput/output dependency rules in Section 4.4.3 ensures that for each 𝑞 ∈ 𝑄∞

ini asequence of rule applications from 𝑞0 exists. An equivalent argument to the onegiven at the end of Section 4.4.5 for the extension of the reachability of all states𝑞 ∈ 𝑄 from the reachability of the set of initial input/output dependency states𝑞 ∈ 𝑄∞

ini can be applied to extend the rule reachability from the set of initialinput/output dependency states to all states. However, this argument requiresa rule equivalent for Lemma 4.1, which is given below:

Lemma 4.3 (Multi-Step Max State Rule Reachability [FZHT13*]). For everytwo states 𝑞𝑥, 𝑞𝑦 ∈ 𝑄 that are rule reachable from a common state 𝑞0, i.e.,

𝑞0

𝜌𝑥−→ 𝑞𝑥 and 𝑞0

𝜌𝑦−→ 𝑞𝑦, the state 𝑞𝑚 = max(𝑞𝑥, 𝑞𝑦) is rule reachable from

both 𝑞𝑥 and 𝑞𝑦 via sequences of rule applications 𝜌𝑥,𝑚 and 𝜌𝑦,𝑚, i.e., 𝑞𝑥

𝜌𝑥,𝑚−→ 𝑞𝑚

and 𝑞𝑦

𝜌𝑦,𝑚−→ 𝑞𝑚.

Proof. [FZHT13*] The lemma is proven via induction. Given that 𝑞𝑥 = 𝑞𝑥,𝑖 and𝑞𝑦 = 𝑞𝑦,𝑗 (cf. Figure 4.47) are reachable from a state 𝑞0 = 𝑞𝑥,0 = 𝑞𝑦,0 via two

77The existence of a sequence of statically scheduled actor firings 𝑎𝑥,𝑚 for 𝑟1𝑦 to execute hasbeen proven by Lemma 4.1.

165

Page 174: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

𝜌𝑦

𝜌𝑥

𝜌1𝑥

𝜌1𝑦

· · ·

· · ·

· · ·

· · ·

𝑟2𝑦,1

𝑟𝑦,2

𝑟𝑦,1

𝑟1𝑦,1

𝑟𝑥,1

𝑟𝑥,2

𝑟2𝑥,1

𝑟𝑥,3

𝑟𝑦,3

𝑟𝑥,𝑖

𝑟𝑖𝑦,1𝑟𝑖−1

𝑦,1

𝑟𝑦,𝑗

𝑟𝑗𝑥,1𝑟𝑗−1

𝑥,1

𝑟1𝑥,3

𝑟1𝑥,2

𝑟1𝑥,1 𝑟1

𝑦,2

𝑟1𝑥,𝑖

𝑟1𝑦,3 𝑟1

𝑦,𝑗

𝑞0

𝑞𝑥,1

𝑞𝑥,2

𝑞𝑦,2

𝑞𝑦,1

𝑞𝑥,𝑖

𝑞11

𝑞𝑥,𝑖−1

𝑞𝑦,𝑗𝑞𝑦,𝑗−1

𝑞1𝑦,2 𝑞1

𝑦,𝑗−1 𝑞1𝑦,𝑗

𝑞1𝑥,2 𝑞1

𝑥,𝑖−1 𝑞1𝑥,𝑖

Figure 4.47: Rule application sequences used in the proof of Lemma 4.3

sequences 𝜌𝑥 (𝑞𝑥,0

𝑟𝑥,1−→ 𝑞𝑥,1

𝑟𝑥,2−→ . . .𝑟𝑥,𝑖−→ 𝑞𝑥,𝑖) and 𝜌𝑦 (𝑞𝑦,0

𝑟𝑦,1−→ 𝑞𝑦,1

𝑟𝑦,2−→ . . .𝑟𝑦,𝑗−→

𝑞𝑦,𝑗). The maximum indices 𝑖 and 𝑗 denote the length, i.e., the number of ruleapplications, for the sequences 𝜌𝑥 and 𝜌𝑦, respectively.

First, note that for the case of 𝑖 being zero, the lemma is trivially true, i.e.,if 𝑖 is zero then 𝑞𝑥 = 𝑞0 and 𝑞𝑦 ≥ 𝑞𝑥, therefore 𝑞𝑚 = 𝑞𝑦 which is reachablefrom 𝑞0 = 𝑞𝑥 via the sequence 𝜌𝑦. For 𝑗 being zero, the symmetrical argumentapplies.

Hence, it can be assumed that 𝑖, 𝑗 ≥ 1 and that the lemma is true for sequences

𝜌1𝑥 (𝑞1

1 = 𝑞1𝑥,1

𝑟1𝑥,2−→ 𝑞1

𝑥,2

𝑟1𝑥,3−→ . . .

𝑟1𝑥,𝑖−→ 𝑞1

𝑥,𝑖 = 𝑞1𝑥) of length 𝑖 − 1 and 𝜌1

𝑦 (𝑞11 =

𝑞1𝑦,1

𝑟1𝑦,2−→ 𝑞1

𝑦,2

𝑟1𝑦,3−→ . . .

𝑟1𝑦,𝑗−→ 𝑞1

𝑦,𝑗 = 𝑞1𝑦) of length 𝑗 − 1. The sequences 𝜌1

𝑥 and 𝜌1𝑦

are used as induction hypothesis for the proof.

The induction proceeds by proving the existence of two rules 𝑞𝑥

𝑟𝑖𝑦,1−→ 𝑞1

𝑥 and

𝑞𝑦

𝑟𝑗𝑥,1−→ 𝑞1

𝑦 as well as two sequences 𝜌1𝑥 and 𝜌1

𝑦 such that 𝑞𝑚 = max(𝑞1𝑥, 𝑞

1𝑦).

With a successful proof of the existence of 𝜌1𝑥, 𝑟

𝑖𝑦,1 and 𝜌1

𝑦, 𝑟𝑗𝑥,1, the induction

hypothesis can be applied and, hence, the proof of Lemma 4.3 finished.

First, max(𝑞𝑥,1, 𝑞𝑦,1) is chosen for 𝑞11. Without loss of generality, only the

existence of 𝜌1𝑥 is proven. Note that Lemma 4.2 applies to the rules 𝑟𝑥,1 and

𝑟𝑦,1, therefore the presence of the rule 𝑟1𝑦,1 is ensured. Now, assuming the

presence of the rule 𝑞𝑥,𝑛−1

𝑟𝑛−1𝑦,1−→ 𝑞1

𝑥,𝑛−2 note that Lemma 4.2 applies to the rules𝑟𝑛−1𝑦,1 and 𝑟𝑥,𝑛 which results in the rules 𝑟𝑛

𝑦,1 and 𝑟1𝑥,𝑛. Furthermore, 𝑞1

𝑥 =max(𝑞𝑦,1, 𝑞𝑥,1, 𝑞𝑥,2, . . . 𝑞𝑥,𝑖) = max(𝑞𝑦,1, 𝑞𝑥,𝑖) = max(𝑞𝑦,1, 𝑞𝑥). The symmetricalargument applies for 𝜌1

𝑦 with 𝑞1𝑦 = max(𝑞𝑥,1, 𝑞𝑦). Finally, note that max(𝑞1

𝑥, 𝑞1𝑦)

= max(𝑞𝑦,1, 𝑞𝑥, 𝑞𝑥,1, 𝑞𝑦) = max(𝑞𝑥, 𝑞𝑦) = 𝑞𝑚.

166

Page 175: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

Semantic Equivalence

In this section, the correctness of the presented clustering algorithms is finallyproven by showing the equivalence of the semantics of a static cluster 𝑔𝛾 andits corresponding composite actor 𝑎𝛾 . This is done by defining the semantics ofboth, 𝑔𝛾 and 𝑎𝛾 , with the denotational semantics of KPN and abstracting fromdata values. The actual proof is done by induction.

Theorem 4.5 (Semantic Equivalence [FZHT13*]). Given a static cluster 𝑔𝛾of a DFG 𝑔 satisfying the clustering condition in Definition 4.4 and its corre-sponding composite actor 𝑎𝛾 constructed by either the 𝜉𝐹𝑆𝑀 or the 𝜉𝑟𝑢𝑙𝑒𝑠 clusterrefinement operator. Then, it holds that the behaviors of 𝑔 and the refined DFG𝑔 resulting from replacing 𝑔𝛾 with 𝑎𝛾 are sequence equivalent.

Proof. [FZHT13*] Theorem 4.5 is proven by showing that the Kahn descriptions𝒦𝑔𝛾 and 𝒦𝑎𝛾 of 𝑔𝛾 and 𝑎𝛾 are equivalent, i.e., 𝒦𝑔𝛾 ≡ 𝒦𝑎𝛾 .78 These functionsmap tuples of sequences of tokens on the input ports to tuples of sequences oftokens on the output ports.However, due to the data independent nature of SDF and CSDF actors, the

sequences can be abstracted to their length. The production of the correct tokenvalues is guaranteed by the firing of the contained actors on the same tokens asin the original subcluster. Finally, the sequence length produced by an input oroutput actor can be represented by the number of input or output actor firings.With these abstractions, the functions 𝒦′

𝑔𝛾 ,𝒦′𝑎𝛾 : N|𝐴𝐼 |

0 → N|𝐴𝑂|0 are used, which

map the number of input actor firings into the number of output actor firings.Thus, the equivalence of the denotational Kahn functions is reduced to theequivalence of their corresponding abstracted functions, i.e., 𝒦𝑔𝛾 ≡ 𝒦𝑎𝛾 ⇐⇒𝒦′

𝑔𝛾 ≡ 𝒦′𝑎𝛾 .

Note that for all inputs 𝒦′𝑎𝛾 can only be less than 𝒦′

𝑔𝛾 as Lemma 4.1 ensuresthe existence of sequences of statically scheduled actor firings for all rules, i.e.,the non-equivalence of 𝒦′

𝑔𝛾 and 𝒦′𝑎𝛾 could only be caused by missing output

actor firings of 𝒦′𝑎𝛾 . The proof proceeds by showing a contradiction. Assume

that for some 𝜂 ∈ N|𝐴𝐼 |0 the composite actor lacks output actor firings, i.e.,

𝒦′𝑎𝛾 (𝜂) < 𝒦′

𝑔𝛾 (𝜂), and that no sequence 𝜌 exists adding these missing out-put actor firings. Let the state 𝑞def = (𝜂,𝒦′

𝑎𝛾 (𝜂)) be this defective state and𝑞ok = (𝜂,𝒦′

𝑔𝛾 (𝜂)) the correct state. Then, there exists at least one output actor𝑎out with missing firings, i.e., 𝑞def(𝑎out) < 𝑞ok(𝑎out), and a corresponding initialinput/output dependency cluster state 𝑞ini ∈ 𝑄∞

ini where this output is maxi-mized, i.e., 𝑞ini(𝑎out) = 𝑞ok(𝑎out). Then, by the least fixed-point operation fromDefinition 4.10, the state 𝑞corr = max(𝑞ini, 𝑞def) must also exist. Furthermore,

78The relation between Kahn’s denotational semantics and the semantics of data flow modelswith the notion of firing is presented in [Lee97].

167

Page 176: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

33

𝑖1 𝑜1𝑎env

𝑎2𝑎1

size(𝑐out) = 3𝑐out

𝑐1→2

size(𝑐1→2) = 4

𝑐insize(𝑐in) = 4

𝑔𝛾1,2

(a) Unrefined DFG 𝑔ex8with

its cluster 𝑔𝛾1,2

33𝑎2𝑎1

𝑐1→2

𝑖1 𝑜1𝑎env

𝑐outsize(𝑐out) = 5

𝑐insize(𝑐in) = 7

𝑞0 𝑡1

#𝑖1 ≥ 3∧#𝑜1 ≥ 3/𝑓⟨𝑎1,𝑎2,𝑎2,𝑎2⟩

𝑎𝛾1,2

(b) Refined DFG 𝑔ex8with the

SDF composite actor 𝑎𝛾1,2

Figure 4.48: Depicted above is a DFG 𝑔ex8 containing a dynamic actor 𝑎env(white) and a cluster 𝑔𝛾1,2

containing the static actors 𝑎1 and 𝑎2(shaded).

by Lemma 4.3, there exists a sequence 𝑞def𝑠−→ 𝑞corr. If there are still output

actor firings missing (from output actors other than 𝑎out), then the previoussteps can be repeated until 𝑞ok is reached.

4.4.6 FIFO Channel Capacity Adjustments

In contrast to the data flow models that are introduced in literature, DFGsused for refinement of applications into implementations on multi-core plat-forms including both hardware and software-programmable resources requireFIFO channels with limited capacities. Especially, existing synthesis back-endsmight only support the synthesis of FIFOs with fixed capacities that are nolonger changeable at run time. But, the most severe problem is that QSSs,in general, reduce the accessible capacities of channels contained in a cluster,e.g., channel 𝑐1→2 contained in the cluster 𝑔𝛾1,2

depicted in Figure 4.48. KnownQSS, e.g., as introduced in Sections 4.4.2 and 4.4.3 as well as similar methodsknown from literature [PBL95, TBG+13], might—as will be shown—introduceartificial deadlocks due to these inaccessible capacities. The channel capacitiesof the FIFO channels contained in the cluster can be easily determined in thescheduling phase (cf. Section 4.4.4) that translates the partial repetition vectorsof the transitions or rules into sequences of statically scheduled actor firings.However, the channel capacities of the FIFO channels connected to the clusterinput and output ports as specified in the original unrefined DFG may induceartificial deadlocks for the refined DFG.To exemplify, consider the DFG 𝑔ex8 depicted in Figure 4.48. Each initial

token of a channel is depicted as a black dot inside the circle representingthe channel. The channel capacities are specified via the size function. Thetransformation of the DFG 𝑔ex8 into the DFG 𝑔ex8 illustrates the refinement ofthe (static) cluster 𝑔𝛾1,2

into the composite actor 𝑎𝛾1,2. Moreover, as will be

168

Page 177: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

explained later, to prevent any deadlock, the channel capacities of the FIFOchannels 𝑐in and 𝑐out have to be adjusted from capacities to hold at most 4 and3 tokens to capacities accommodating 7 and 5 tokens, respectively.The composite actor 𝑎𝛾1,2

from Figure 4.48 has a cluster FSM with one state𝑞0 and one self-loop transition 𝑡1 that, when taken, calls the action 𝑓⟨𝑎1,𝑎2,𝑎2,𝑎2⟩.This action executes the sequence ⟨𝑎1, 𝑎2, 𝑎2, 𝑎2⟩ of statically scheduled actorfirings, i.e., fire actor 𝑎1 once followed by three consecutive firings of actor𝑎2. Moreover, the guard for the transition 𝑡1 is given via the notation #𝑖1 ≥3 ∧ #𝑜1 ≥ 3 that encodes the requirement for the availability of at least threetokens and at least three free places in the channels 𝑐in and 𝑐out connected tothe input port 𝑖1 and the output port 𝑜1 of the composite actor 𝑎𝛾1,2

. However,as the capacity of channel 𝑐out is limited to three tokens—two of them alreadyoccupied—, the transition cannot be taken. As a consequence, the compositeactor 𝑎𝛾1,2

depicted in Figure 4.48b would deadlock without a proper channelcapacity adjustment.In the following, the problem this section is dedicated to will be given more

formally. Subsequently, the reasons for the introduction of artificial deadlocksdue to limited channel capacities will be detailed and a FIFO channel capac-ity adjustment algorithm will be given to prevent the introduction of artificialdeadlocks caused by these reasons.

Channel Capacity Adjustment Problem

Note that the determination of limited channel capacities that do not introduceany deadlock for a KPN is in general undecidable [Par95]. Moreover, the chan-nel capacities for the FIFO channels connected to the cluster input and outputports of a cluster cannot be used unmodified from the original unrefined DFGfor the refined DFG containing composite actors that implement a QSS. Thus,the channel capacities of the cluster input and output channels cannot be re-computed from scratch, but must be derived by computing an adjustment forthe worst case that enlarges the channel capacities of the cluster input and out-put channels of the original unrefined DFG, e.g., the channel capacities size(𝑐in)and size(𝑐out) have to be increased in order to make the refinement of the DFG𝑔ex8 into the DFG 𝑔ex8 via application of a QSS for the actors 𝑎1 and 𝑎2 feasible.More formally, given a cluster 𝑔𝛾 and a QSS 𝑔𝛾 .ℛ for it, the problem is to find avector 𝑔𝛾 .adj ∈ N

|𝑔𝛾 .𝐼∪𝑔𝛾 .𝑂|0 , such that given any deadlock free DFG 𝑔 exhibiting

KPN semantics and containing the cluster 𝑔𝛾 , then replacing the cluster with itscorresponding composite actor 𝑎𝛾 and increasing the channel capacities of thechannels connected to this composite actor by the amount given by the 𝑔𝛾 .adjvector results in a refined DFG 𝑔 that is also free of any deadlock.For the example shown in Figure 4.48, the adjustment vector computed by

the algorithm introduced in the next section is 𝑔𝛾1,2.adj = (𝑛𝑐in , 𝑛𝑐out) = (3, 2).

169

Page 178: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

Channel Capacity Adjustment Algorithm

In the following, the reasons for the introduction of artificial deadlocks due tolimited channel capacities will be detailed and our novel FIFO channel capacityadjustment algorithm will be introduced. This adjustment algorithm ensuresthat cluster refinement will not introduce an artificial deadlock into the refinedDFG if the original unrefined DFG is a KPN. Note that the presented algorithmdoes not require any knownledge—indeed, the key aspect of the problem at handis exactly the lack of such knowledge—of the cluster environment. Nevertheless,for illustration purposes, cluster environments will be modeled by SDF or CSDFactors in the examples presented in this section.In general terms, a QSS ensures that there are no artificial deadlocks caused

by missing tokens on feedback loops from the cluster output ports to the clus-ter input ports. If the channel capacities are limited, however, clustering mightintroduce artificial deadlocks due to back pressure. Back pressure induced dead-lock denotes the fact that there exists a (reverse) feedback loop of channels andactors where all actors in the loop cannot fire because free places are missing ontheir output ports connected to the channels contained in the loop. Five rea-sons can be identified for the introduction of artificial deadlocks due to limitedcapacities. These reasons are caused by back pressure from

(A) atomic token consumption on a sole cluster input port,

(B) atomic token production on a sole cluster output port,

(C) missing free places on feedback loops from the cluster input ports to thecluster output ports, i.e., input-to-output back pressure,

(D) missing free places on feedback loops from one cluster input port to anothercluster input port, i.e., input-to-input back pressure, and

(E) missing free places on feedback loops from one cluster output port toanother cluster output port, i.e., output-to-output back pressure.

Indeed, reason (A) is just a special case of reason (D) when the cluster hasonly one input port and reason (B) is the special case of reason (E) when thecluster has only one output port. That is to say, reason (A) can be thought ofmissing free places on a feedback loop of the input port with itself. Note thatthe consumption of tokens does also produce free places and, thus, can formpart of a feedback loop with itself. Thus, reason (B) can be thought of missingfree places on a feedback loop of the output port with itself. Nonetheless, thedetailed explanation of the adjustment algorithm will begin with reason (A) and(B) due to the relative simple nature of the problem. Next, the channel capacityadjustments adji2o required to solve reason (C) concerned with deadlocks due to

170

Page 179: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

back pressure from a cluster input port to a cluster output port will be discussed.Subsequently, the solutions given by the channel capacity adjustments adji2i andadjo2o to the general cases of reasons (A) and (B) embodied in the reasons (D)and (E) covering the situations that the cluster has multiple input and outputports are discussed.Note that the cluster state space may contain more than one state and that

the cluster environment is not aware of the current state of a cluster. In theexamples used in this paper, the cluster environment is represented by a sin-gle actor named 𝑎env. In reality, however, it represents the arbitrary complexsubgraph of the DFG 𝑔 that represents the DFG without its contained clus-ter. Moreover, a deadlock-free execution of the DFG must be guaranteed foreach state. Thus, the following analyses of the required channel capacity ad-justments given by the three functions adji2o, adji2i and adjo2o are performedfor each state of the cluster state space. With these three adjustment functionsgiven in Definitions 4.21 to 4.23, the FIFO channel capacity adjustment vector𝑔𝛾 .adj can be defined as follows:

Definition 4.20. (FIFO Channel Capacity Adjustment Vector) The FIFO chan-nel capacity adjustment vector adj ∈ N|𝐼∪𝑂|

0 specifies a sufficient increase inchannel capacities for each cluster input and output channel such that givenany deadlock free DFG 𝑔 exhibiting KPN semantics and containing the cluster𝑔𝛾 , then replacing the cluster with its corresponding composite actor 𝑎𝛾 andincreasing the channel capacities of the channels connected to this compositeactor by the amount given by the adjustment vector results in a refined DFG𝑔 that is also free of any deadlock. In the following, the notation max𝑋 isused to denote the least upper bound of a set of partially ordered vectors, e.g.,max{ (1, 0), (0, 1) } = (1, 1). With this notation, the adjustment vector is de-rived from the three adjustment functions adji2o, adji2i and adjo2o as follows:

𝜋𝑂(adj) = max{ adjo2o(𝑞) | 𝑞 ∈ 𝑄𝛾 } (4.44)

𝑛i2i = max{ adji2i(𝑞) | 𝑞 ∈ 𝑄𝛾 } (4.45)

𝑛i2o = max{ adji2o(𝑞, 𝜋𝑂(adj)) | 𝑞 ∈ 𝑄𝛾 } (4.46)

𝜋𝐼(adj) =

{max{𝑛i2i,𝑛i2o } if 𝑂 = ∅𝑛i2i otherwise (4.47)

As shown in Equations (4.44) to (4.46), the channel capacity adjustment vec-tor 𝑔𝛾 .adj is derived from the pointwise maximum of the adjustments requiredby the individual states, thus assuming a worst case cluster environment withrespect to back pressure. First, if the cluster has any output ports, then theFIFO channel capacity adjustment vector 𝜋𝑂(adj) for the cluster output chan-nels is determined by computing the pointwise maximum (see Equation (4.44))of the adjustments required for the output-to-output back pressure problem for

171

Page 180: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

each individual cluster state.79 Next, if the cluster has any input ports, thena vector 𝑛i2i of input channel capacity adjustments is determined by comput-ing the pointwise maximum (see Equation (4.45)) of the adjustments requiredfor the input-to-input back pressure problem for each individual cluster state.Subsequently, if the cluster has both input as well as output ports, then a vec-tor 𝑛i2o of input channel capacity adjustments is determined by computing thepointwise maximum (see Equation (4.46)) of the adjustments required for theinput-to-output back pressure problem for each individual cluster state. Here,the fact that the cluster output channel capacities will be enlarged by the vectorspecified by 𝜋𝑂(adj) can be used to minimize the required adjustments on thecluster input channels. Otherwise, if the cluster has no output ports, then thevector 𝑛i2o is assumed to be the all zeros vector 0. Finally, if the cluster hasany input ports, then the FIFO channel capacity adjustment vector 𝜋𝐼(adj) forthe cluster input channels is determined by computing the pointwise maximum(see Equation (4.47)) of the adjustment requirements of the input-to-input aswell as input-to-output back pressure problems.To exemplify the computation of the adj vector for the example given in Fig-

ure 4.48, its three constituting parts are in Table 4.3. Later on, it will be shownhow to compute these function values. The QSS for cluster 𝑔𝛾1,2

has only a sin-gle state. Hence, inserting the values from Table 4.3 into Definition 4.20 resultsin the following computation of the FIFO channel capacity adjustment vectoradj:

𝜋𝑂(adj) = adjo2o(𝑞0) = (2)

𝑛i2i = adji2i(𝑞0) = (0)

𝑛i2o = adji2o(𝑞0, 𝜋𝑂(adj)) = (3)

𝜋𝐼(adj) = max{𝑛i2i,𝑛i2o } = (3)

79In the following, the ‘𝑔𝛾 .’-prefix will be dropped from various notations when the cluster isclear from context, e.g., 𝑂 instead of 𝑔𝛾 .𝑂.

Table 4.3: The three channel capacity adjustments parts for the cluster 𝑔𝛾1,2

adjo2o(𝑞) adji2i(𝑞) adji2o(𝑞, (2))(𝑛𝑐out) (𝑛𝑐in) (𝑛𝑐in)

𝑞 = 𝑞0 (2) (0) (3)

max{ . . . | 𝑞 ∈ 𝑄𝛾 } (2) (0) (3)

172

Page 181: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

Using this adjustment vector (adj = (𝑛𝑐in , 𝑛𝑐out) = (3, 2)), the adjusted chan-nel capacities for the refined DFG 𝑔ex8 depicted in Figure 4.48b are derived asfollows: 𝑔ex8 . size(𝑐in) = 𝑔ex8 . size(𝑐in) + 𝑛𝑐in = 4 + 3 = 7𝑔ex8 . size(𝑐out) = 𝑔ex8 . size(𝑐out) + 𝑛𝑐out = 3 + 2 = 5

After having introduced the algorithm to compute the FIFO channel capac-ity adjustment vector, it will be discussed how to derive its constituting partsembodied by the three adjustment functions adji2o, adji2i and adjo2o. First, theinput-to-input back pressure and output-to-output back pressure problems willbe considered in isolation for the special cases that a cluster has only a singleinput or output port, that is reasons (A) and (B), respectively. Subsequently, acluster that has both a single input and a single output port will be discussed.Hence, capacity adjustments for this cluster will also require the solution of thedeadlock problem due to reason (C) concerned with deadlocks due to back pres-sure from a cluster input port to a cluster output port. Next, the general casesof the input-to-input back pressure and output-to-output back pressure problems(reasons (D) and (E)) will be considered in isolation by analyzing two exam-ples, one having two inputs and the other one two output ports, respectively.Finally, the computation of the FIFO channel capacity adjustment vector willbe demonstrated for the general case resulting from the interplay of all reasons(A) to (E) embodied by an example having multiple cluster states as well asmultiple input and output ports.

Atomic Consumption/Production on a Single Cluster Port

Reasons (A) and (B) stem from the fact that multiple firings of the same inputor output actor may be performed atomically by a transition. As previouslydefined, an input actor denotes an actor of a cluster that consumes tokens froma channel outside the cluster. Conversely, an output actor denotes an actorof a cluster that produces tokens onto a channel outside the cluster. ConsiderFigure 4.49 as an example for reason (A). In the unrefined DFG 𝑔ex9 , the inputactor 𝑎3 can fire independently of actor 𝑎4. Hence, the two tokens in the clusterinput channel 𝑐in can be consumed by two firings of actor 𝑎3, thus allowingactor 𝑎env to produce three new tokens onto channel 𝑐in. In contrast to this, thecomposite actor 𝑎𝛾3,4

of the refined DFG 𝑔ex9 can only consume three tokensatomically. Hence, an artificial deadlock would result without adjustment of thechannel capacity of the channel 𝑐in from three to five tokens, i.e., the channelcapacity size(𝑐in) is increased by two tokens. To compute the required increaseof the channel capacity of 𝑐in, the consumption rate of the transition 𝑡1 of thecomposite actor 𝑎𝛾3,4

is first determined to be three tokens, i.e., cons(𝑡1) =(𝑛𝑐in) = (3). In contrast, an isolated firing of the actor 𝑎3 requires one tokenon the cluster input channel 𝑐in, i.e., cons(𝑐in) = 1. Hence, the capacity of the

173

Page 182: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

3

3

cons(𝑐3→4) = 3cons(𝑐in) = 1

𝑖1𝑎env

𝑎4𝑎3𝑐3→4

size(𝑐3→4) = 3

𝑐insize(𝑐in) = 3

𝑔𝛾3,4

(a) Unrefined DFG 𝑔ex9with

its cluster 𝑔𝛾3,4

3

3

𝑎4𝑎3𝑐3→4

cons(𝑡1) = (3)

𝑖1𝑎env

𝑐insize(𝑐in) = 5

𝑞0 𝑡1

#𝑖1 ≥ 3/𝑓⟨𝑎3,𝑎3,𝑎3,𝑎4⟩𝑎𝛾3,4

(b) Refined DFG 𝑔ex9with

the SDF composite actor𝑎𝛾3,4

Figure 4.49: Example DFG 𝑔ex9 illustrates that the refinement of a cluster into acomposite actor without channel capacity adjustment can introducean artificial deadlock into a DFG even if it is only acting as a sinkactor.

3

3

prod(𝑐out) = 1prod(𝑐4→5) = 3

𝑜1𝑎env

𝑎5𝑎4

𝑐outsize(𝑐out) = 3

𝑐4→5

size(𝑐4→5) = 7

𝑔𝛾4,5

(a) Unrefined DFG 𝑔ex10

with its cluster 𝑔𝛾4,5

3

3

𝑎5𝑎4𝑐4→5

prod(𝑡1) = (3)

𝑜1𝑎env

𝑐outsize(𝑐out) = 5

𝑞0

#𝑜1 ≥ 3/𝑓⟨𝑎4,𝑎5,𝑎5,𝑎5⟩

𝑡1

𝑎𝛾4,5

(b) Refined DFG 𝑔ex10with

the SDF composite actor𝑎𝛾4,5

Figure 4.50: Example DFG 𝑔ex10 illustrates that the refinement of a cluster into acomposite actor without channel capacity adjustment can introducean artificial deadlock into a DFG even if it is only acting as a sourceactor.

cluster input channel 𝑐in must be enlarged by two tokens in order to compensatefor the difference between the two consumption rates. Reason (B) as shown inFigure 4.50 can be handled analogously for the cluster output channel 𝑐out.More formally, the functions adji2i : 𝑄𝛾 → N|𝐼|

0 and adjo2o : 𝑄𝛾 → N|𝑂|0

are used to denote the required adjustments of the input respectively outputchannel capacities due to reason (A) or reason (B). Of course, the value forthe function adji2i can only be computed if the cluster has at least one input

174

Page 183: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

port. Conversely, the value for the function adjo2o can only be computed if thecluster has at least one output port. To exemplify, the adjustment values forthe three clusters 𝑔𝛾1,2

, 𝑔𝛾3,4, and 𝑔𝛾4,5

depicted in Figures 4.49 to 4.51 are givenin Table 4.4. In the simple cases—only one cluster state and only a single inputor output port—represented by the clusters 𝑔𝛾3,4

and 𝑔𝛾4,5, the FIFO channel

capacity adjustment vector can be directly derived from Table 4.4.In case of cluster 𝑔𝛾3,4

(see Figure 4.49), the adjustment vector is solely rep-resented by a capacity adjustment for the channel 𝑐in connected to the singlecluster input port 𝑖1. Next, Equations (4.45) and (4.47) from Definition 4.20 areapplied, thus, resulting in Equation (4.48). Finally, the cluster state space ofcluster 𝑔𝛾3,4

has only the single state 𝑞0 and, thus, the result is given by Equa-tion (4.49).

𝜋𝐼(𝑔𝛾3,4.adj) = max{ 𝑔𝛾3,4

.adji2i(𝑞) | 𝑞 ∈ 𝑄𝛾 = { 𝑞0 } } (4.48)

= 𝑔𝛾3,4.adji2i(𝑞0) = (2) (4.49)

Conversely, in case of cluster 𝑔𝛾4,5(see Figure 4.50), the adjustment vector

is solely represented by an adjustment for the channel capacity for the channel𝑐out connected to the single cluster output port 𝑜1. To proceed, Equation (4.44)from Definition 4.20 is applied to transform the problem into Equation (4.50).To conclude, only the single state 𝑞0 is present in the cluster state space, thus,giving the result present in Equation (4.51).

𝜋𝑂(𝑔𝛾4,5.adj) = max{ 𝑔𝛾4,5

.adjo2o(𝑞) | 𝑞 ∈ 𝑄𝛾 = { 𝑞0 } } (4.50)

= 𝑔𝛾4,5.adjo2o(𝑞0) = (2) (4.51)

However, if a cluster has both input and output ports, then the input-to-output back pressure problem must also be solved as discussed next.

Table 4.4: Adjustment values due to reasons (A) and (B)

adji2i(𝑞0)

𝑔𝛾3,4(see Figure 4.49) cons(𝑡1)− cons(𝑐in) = 3− 1 = 2

𝑔𝛾4,5(see Figure 4.50) —

𝑔𝛾1,2(see Figure 4.51) cons(𝑡1)− cons(𝑐in) = 3− 3 = 0

adjo2o(𝑞0)

𝑔𝛾3,4(see Figure 4.49) —

𝑔𝛾4,5(see Figure 4.50) prod(𝑡1)− prod(𝑐out) = 3− 1 = 2

𝑔𝛾1,2(see Figure 4.51) prod(𝑡1)− prod(𝑐out) = 3− 1 = 2

175

Page 184: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

33

(3,3,0,0,3)

(1,0,2,3,3)

prod(𝑐out) = 1cons(𝑐in) = 3

𝑖1 𝑜1𝑎env

𝑎2𝑎1

𝑐outsize(𝑐out) = 3

𝑐1→2

size(𝑐1→2) = 4

𝑐insize(𝑐in) = 4

𝑔𝛾1,2

(a) Unrefined DFG 𝑔ex8 withits cluster 𝑔𝛾1,2

33

(3,3,0,0,3)

(1,0,2,3,3)

𝑎2𝑎1𝑐1→2

prod(𝑡1) = (3)cons(𝑡1) = (3)

𝑖1 𝑜1𝑎env

𝑐outsize(𝑐out) = 5

𝑐insize(𝑐in) = 7

𝑞0 𝑡1

𝑓⟨𝑎1,𝑎2,𝑎2,𝑎2⟩#𝑖1 ≥ 3 ∧#𝑜1 ≥ 3/𝑎𝛾1,2

(b) Refined DFG 𝑔ex8 with theSDF composite actor 𝑎𝛾1,2

Figure 4.51: Shown above is the replication of the DFG 𝑔ex8 from Figure 4.48where the cluster environment 𝑎env is given by a CSDF actor.

Input-to-Output Back Pressure

While reasons (A) and (B) can occur even if the composite actor is only a sinkor source actor, reason (C) can only occur if the cluster has both inputs andoutputs. To exemplify, the cluster 𝑔𝛾1,2

shown in Figure 4.48 and replicatedhere for the convenience of the reader as Figure 4.51 is again considered. TheDFG 𝑔ex8 containing this cluster has the following FIFO channel capacities:

size(𝑐in) = 4

size(𝑐out) = 3

size(𝑐1→2) = 4

In the following, a scenario that leads to deadlock due to reasons (B) and(C) will be presented. In this scenario, two firings of the cluster environment𝑎env will consume one token from the cluster output channel 𝑐out and push sixtokens onto the cluster input channel 𝑐in by twice producing three tokens ontothe channel. This scenario is also performed by the first two firings of the CSDFactor 𝑎env that is depicted in Figure 4.51.Obviously, when the unrefined DFG 𝑔ex8 is dynamically scheduled, the speci-

fied FIFO channel capacities are sufficient: Fire actor 𝑎1 once, wait for the firstfiring of the cluster environment modeled by 𝑎env to result in two free placeson the cluster output channel 𝑐out and further three tokens on the cluster inputchannel 𝑐in, fire actor 𝑎2 twice and actor 𝑎1 once, wait for the second firingof the environment to produce three more tokens on the cluster input channel𝑐in. After this scenario has finished, all channels 𝑐in, 𝑐out, and 𝑐1→2 are filled tocapacity.On the other hand, after 𝑔𝛾1,2

has been refined into the composite actor 𝑎𝛾1,2

(see Figure 4.51b), the action 𝑓⟨𝑎1,𝑎2,𝑎2,𝑎2⟩ of transition 𝑡1 requires three tokens on

176

Page 185: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

the cluster input port 𝑖1 and three free places on the cluster output port 𝑜1 beforeit can be executed. However, assuming that the channel capacity of channel 𝑐outis still three tokens, then the transition cannot be enabled due to the two initialtoken in channel 𝑐out. Hence, no tokens are consumed from the cluster inputchannel 𝑐in and, thus, the cluster environment cannot even produce the firstbatch of three tokens onto the cluster input channel 𝑐in. Therefore, the refinedgraph 𝑔ex8 results in a deadlock. This artificial deadlock is caused by reason (B).To compensate, the channel capacities must be adjusted (see Table 4.4 cluster𝑔𝛾1,2

) as follows:

size′(𝑐in) = size(𝑐in) + adji2i(𝑞0)= 4 + 0 = 4

size′(𝑐out) = size(𝑐out) + adjo2o(𝑞0)= 3 + 2 = 5

As can be seen, the channel capacity of the cluster output channel 𝑐out hasbeen increased by two tokens (adjo2o(𝑞0) = 2) to handle reason (B). This in-crease stems from the difference of the atomic production of three tokens by thetransition 𝑡1 (see Figure 4.51b) as compared to the production of one token by asingle firing of actor 𝑎2. In contrast to this, reason (A) does not pose a problem(adji2i(𝑞0) = 0) for this example. However, these adjustments will prove to bestill insufficient to prevent any artificial deadlock due to reason (C).After the adjustment given above, the composite actor can fire once, thus

allowing the cluster environment to produce the first batch of three tokens outof the six tokens. Subsequently, the cluster environment 𝑎env will fire onceresulting in a state where the cluster input and output channels are both storingfour tokens. Hence, neither can the composite actor fire a second time, nor canthe cluster environment produce the second batch of three tokens, thus resultingin another artificial deadlock. This deadlock is caused by the inability of thecluster environment to access the FIFO channel capacity of four tokens availableon the FIFO channel 𝑐1→2 contained in the composite actor 𝑎𝛾1,2

. That is, theartificial deadlock is due to reason (C).To compensate, the capacity of either the cluster input channel or the cluster

output channel has to be increased. In the following, the function adji2o(𝑞,𝑛)will be used to denote a sufficient increase in the channel capacities of thecluster input channels in order to facilitate the compensation for the inaccessiblechannel capacities inside the cluster. This compensation depends on the givencluster state 𝑞 and the adjustment 𝑛 ∈ N|𝑂|

0 that will be used to increase thechannel capacities of the cluster output channels.To compute adji2o, a cluster state 𝑞dyn ∈ 𝑄dyn

𝛾 that could have been encoun-tered during dynamic scheduling of the cluster is searched for. This cluster state𝑞dyn must have the property that it stores a local maximum of tokens inside thecluster. The number of tokens stored inside a cluster is maximized by starting

177

Page 186: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

3 3

𝑖1 𝑜1𝑎env

𝑎2𝑎1

𝑐out

𝑐1→2

size(𝑐1→2) = 4

𝑐in

𝑔𝛾1,2

(a) Original situation in clus-ter state 𝑞0

3 3

𝑖1 𝑜1𝑎env

𝑎2𝑎1

𝑐out

𝑐1→2

size(𝑐1→2) = 4

𝑐in

𝑔𝛾1,2

(b) Dynamic cluster state𝑞dyn,1 reachable by firingactor 𝑎1 once and undo-ing one firing of actor 𝑎2

3 3

𝑖1 𝑜1𝑎env

𝑎2𝑎1

𝑐out

𝑐1→2

size(𝑐1→2) = 4

𝑐in

𝑔𝛾1,2

(c) Dynamic cluster state𝑞dyn,2 reachable by undo-ing four firings of the ac-tor 𝑎2

Figure 4.52: Here, two dynamic cluster states storing a maximum number oftokens inside the cluster 𝑔𝛾1,2

are illustrated.

from the cluster state 𝑞 ∈ 𝑄𝛾 consuming tokens from the cluster inputs andundoing80 the production of produced tokens on the cluster output ports.In case of cluster 𝑔𝛾1,2

, the state where a local—even a global—maximum oftokens is stored inside the cluster is given when the channel 𝑐1→2 is filled tocapacity. As can be seen in Figure 4.52, two dynamic cluster states 𝑞dyn,1 and𝑞dyn,2 of maximum token storage can be derived from the original state 𝑞0.The first dynamic cluster state 𝑞dyn,1 ∈ 𝑄dyn

𝛾 (see Figure 4.52b) is reachedfrom the original state 𝑞0 by firing actor 𝑎1 once and undoing one firing of actor𝑎2, thus consuming three tokens from the input port 𝑖1 and consuming one tokenfrom the output port 𝑜1. Using the input-to-output back pressure adjustmentfunction adji2o, this can also be expressed as adji2o(𝑞0, (1)) = (3) and shouldbe read as: Given an increase of the capacity of 𝑐out by at least one token andthe case that the cluster is in state 𝑞0, then an increase of three tokens forthe capacity of the channel 𝑐in is sufficient to solve reason (C) concerned withartificial deadlock due to input-to-output back pressure. However, there is stillreason (B) that requires an increase of at least two tokens of the capacity ofthe channel 𝑐out. Hence, the question arises how many additional free placeson the cluster input channel are required to solve the input-to-output backpressure problem if the capacity on the cluster output channel has already beenincreased by two free places, i.e., adji2o(𝑞0, (2)). Still, three more free places arerequired for the channel 𝑐in to solve the input-to-output back pressure problem,i.e., adji2o(𝑞0, (2)) = (3), since undoing two firings of actor 𝑎2 does not fillchannel 𝑐1→2 to capacity. Thus, with the function values for the the cluster

80In reality, the cluster will not undo any actor firing and this should be thought of as delayedproduction of tokens by a dynamically scheduled cluster in contrast to a composite actorimplementing a QSS. A QSS is required to always produce the maximal number of outputtokens from a minimal number of input tokens and, hence, will never delay the productionof tokens on its output ports.

178

Page 187: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

𝑔𝛾1,2in Table 4.4 the values in Table 4.3 are explained, and the updated channel

capacities given below can be computed as given in Definition 4.20.

𝑔ex8 . size(𝑐in) = 4 + 3 = 7𝑔ex8 . size(𝑐out) = 3 + 2 = 5

The second dynamic cluster state 𝑞dyn,2 ∈ 𝑄dyn𝛾 (see Figure 4.52c) is reached

from the original state 𝑞0 by undoing four firings of actor 𝑎2, thus consumingfour token from the output port 𝑜1, i.e., adji2o(𝑞0, 4) = 0. Hence, instead ofincreasing the channel capacity of the cluster input channel 𝑐in, it would alsobe possible to increase the channel capacity of the cluster output channel 𝑐outby four tokens to handle reason (C). Thus, leading to the following channelcapacities of the cluster input and output channels:

𝑔ex8 . size(𝑐in) = 4 + 0 = 4𝑔ex8 . size(𝑐out) = 3 + 4 = 7

More formally, the function adji2o can be computed as follows:

Definition 4.21. (Input-to-Output Back Pressure Adjustment) The input-to-output back pressure adjustment is encoded by the vector-valued function adji2o :

𝑄𝛾 × N|𝑂|0 → N|𝐼|

0 . Given a cluster state 𝑞 and a vector 𝑛 ∈ N|𝑂|0 that denotes

an increase in the output channel capacities of the cluster, then this functioncomputes the required increase of the cluster input channel capacities in orderto solve the input-to-output back pressure problem for the given state. The ad-justment is computed for each input port 𝑖 in isolation. A dynamic cluster state𝑞dyn (see Equation (4.52)) that could have been encountered during dynamicscheduling of the cluster is determined. In general terms, the dynamic clusterstate 𝑞dyn is encountered during dynamic scheduling when the cluster has al-ready consumed some input tokens, but the production of output tokens hasbeen delayed. Thus, compared to the cluster state 𝑞 of the composite actor 𝑎𝛾 ,the dynamic cluster state consumes 𝑛𝑖 more tokens (see Equation (4.53)) fromthe cluster input port 𝑖 but produces 𝑛𝑂 less tokens (see Equation (4.54)) on thecluster output ports 𝑂. However, the generated composite actor 𝑎𝛾 will alwaysproduce the maximal number of output tokens, thus always producing all tokensremaining in the cluster to the cluster output channels. Hence, only dynamiccluster states are considered where these productions of output tokens by thecomposite actor can be accommodated (see Equation (4.54)) by the increase inthe channel capacities of the cluster output channels denoted by the vector 𝑛.Moreover, only dynamic cluster states 𝑞dyn ∈ 𝑄dyn

𝛾 (see Equation (4.55)) thatexhibit a local maximum of tokens stored inside the cluster are considered. Thatis to say, the number of tokens stored inside the cluster cannot be increased byconsuming more tokens from the cluster input port 𝑖 (see Equation (4.56)) or

179

Page 188: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

(3,3,0,0)

(0,0,3,3)

3

3

3𝑖2

𝑖1

𝑎env

𝑎4

𝑎1

𝑎3

𝑐3→4

size(𝑐3→4) = 3

𝑐1→3

size(𝑐1→3) = 7

size(𝑐in,2) = 4𝑐in,2

𝑐in,1size(𝑐in,1) = 3

𝑔𝛾1,3,4

(a) Original unrefined DFG 𝑔ex11 withits cluster 𝑔𝛾1,3,4

3

3

3

(3,3,0,0)

(0,0,3,3)

𝑎4

𝑎1

𝑎3

𝑐3→4

𝑐1→3

𝑖2

𝑖1

𝑎env

size(𝑐in,2) = 10𝑐in,2

𝑐in,1size(𝑐in,1) = 5

𝑞0 𝑡1

#𝑖1 ≥ 3 ∧#𝑖2 ≥ 3/𝑓⟨𝑎3,𝑎3,𝑎1,𝑎3,𝑎4⟩

𝑎𝛾1,3,4

(b) Refined DFG 𝑔ex11 containing thecomposite actor 𝑎𝛾1,3,4

Figure 4.53: Here, the DFG 𝑔ex11 is used to generalize the input-to-input backpressure problem to multiple input ports.

delaying the production of further output tokens (see Equation (4.57)). Finally,to determine the required channel capacity adjustment for the cluster inputchannel connected to the cluster input port 𝑖, the dynamic cluster state 𝑞dyn ischosen where 𝑛𝑖 is minimal (see Equation (4.52)). This is the state where thealready present channel capacity increases 𝑛 on the cluster output channels arebest taken advantage of.

𝜋{ 𝑖 }(adji2o(𝑞,𝑛)) = min{𝑛𝑖 | ∃𝑞dyn ∈ 𝑄dyn𝛾 : (4.52)

𝑛𝑖 = − 𝜋{ 𝑖 }(io(𝑞dyn)− io(𝑞)) ≥ 0 ∧ (4.53)𝑛𝑂 = − 𝜋𝑂(io(𝑞dyn)− io(𝑞)) ≤ 𝑛 ∧ (4.54)

@𝑞′

dyn ∈ 𝑄dyn𝛾 ∖ { 𝑞dyn } : (4.55)

𝑛𝑖 ≤ −𝜋{ 𝑖 }(io(𝑞′

dyn)− io(𝑞)) ∧ (4.56)

𝑛𝑂 ≤ −𝜋𝑂(io(𝑞′

dyn)− io(𝑞)) } (4.57)

Input-to-Input Back Pressure

Previously, the input-to-input and output-to-output back pressure problemshave only been discussed for the special cases that the cluster has only a singleinput or a single output port, respectively. In contrast to this, the input-to-inputback pressure problem if a cluster has multiple inputs will be discussed in thissection. To put focus on this, the DFG 𝑔ex11 shown in Figure 4.53 that exhibitsthe input-to-input back pressure problem in isolation will be used to analyze theproblem. That is to say, input-to-output and output-to-output back pressure donot pose a problem for this example as the cluster 𝑔𝛾1,3,4

has no output ports.In the following, a critical scenario that leads to an artificial deadlock due

to reason (D) will be presented. As can be seen in Figure 4.53a, the cluster

180

Page 189: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

environment 𝑎env tries to produce six tokens onto the cluster input channel 𝑐in,2by producing two batches of three tokens each. In the unrefined case, there aresufficient channel capacities inside the cluster to facilitate the consumption ofsix tokens from the cluster input channel 𝑐in,2, i.e., fire actor 𝑎1 once, consumingthe first batch of three tokens from cluster input channel 𝑐in,2, fire actor 𝑎3 once,consuming one of the initial tokens provided in the cluster input channel 𝑐in,1,and fire actor 𝑎1 once more, consuming the second batch of three tokens fromcluster input channel 𝑐in,2.On the other hand, after 𝑔𝛾1,3,4

has been refined into the composite actor 𝑎𝛾1,3,4

(see Figure 4.53b), the transition 𝑡1 requires three tokens on the cluster inputports 𝑖1 and 𝑖2 before it can be taken. Moreover, as the cluster environment𝑎env does not provide any additional tokens on the cluster input channel 𝑐in,1,the transition 𝑡1 cannot be taken. Hence, assuming that the channel capacity ofthe cluster input channel 𝑐in,2 is still four tokens, an artificial deadlock resultsdue to reason (D). To compensate, the channel capacities must be adjusted asfollows: 𝑔ex11 . size(𝑐in,1) = 𝑔ex11 . size(𝑐in,1) + 𝑛𝑐in,1 = 3 + 2 = 5𝑔ex11 . size(𝑐in,2) = 𝑔ex11 . size(𝑐in,2) + 𝑛𝑐in,2 = 4 + 6 = 10

These adjustments correspond to the number of tokens that can be consumedby the cluster on its two different input ports while still consuming less tokenson both input ports than would be necessary to activate the transition 𝑡1. Fora better depiction of the number of consumed tokens, the situation as shownin Figure 4.54a is assumed as start point. As can be seen in Figure 4.54c, thecluster 𝑔𝛾1,3,4

can consume up to six tokens on its input port 𝑖2 (fire actor 𝑎3once and actor 𝑎1 twice) without enabling the transition 𝑡1. Thus, the channelcapacity of the cluster input channel 𝑐in,2 must be increased by six tokens, i.e.,𝑛𝑐in,2 = 6. Consumption of more than six tokens would require nine tokens oninput port 𝑖2 and four tokens on input port 𝑖1. Hence, enabling the input/outputguard 𝑘io = #𝑖1 ≥ 3 ∧#𝑖2 ≥ 3 of the transition 𝑡1. An equivalent observation(see Figure 4.54b) can be made for the input port 𝑖1, which can consume upto two tokens without enabling the transition 𝑡1. More formally, the functionadji2i is used to encode the adjustment requirements:

Definition 4.22. (Input-to-Input Back Pressure Adjustment) The input-to-input back pressure adjustment is given via the vector-valued function adji2i :

𝑄𝛾 → N|𝐼|0 that associates with a cluster state 𝑞 the required increase of the

cluster input channel capacities in order to solve the input-to-input back pressureproblem for the given cluster state. The adjustment is computed for each inputport 𝑖 by determining the maximal (see Equation (4.58)) number 𝜋{ 𝑖 }(𝑛𝐼) oftokens that can be consumed (see Equation (4.59)) by the cluster on the inputport 𝑖, but without enabling (see Equation (4.61)) any transition (see Equa-tion (4.60)) leaving the state 𝑞. Here, the vector 𝑛𝐼 represents the number of

181

Page 190: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

3

3

3

𝑖1

𝑖2

𝑔env

𝑎4

𝑎1

𝑎3𝑐3→4

𝑐1→3

size(𝑐3→4) = 3

size(𝑐1→3) = 7

𝑔𝛾1,3,4

(a) Original situation incluster state 𝑞0

3

3

3

𝑖1

𝑖2

𝑔env

𝑎4

𝑎1

𝑎3𝑐3→4

𝑐1→3

size(𝑐3→4) = 3

size(𝑐1→3) = 7

𝑔𝛾1,3,4

(b) Dynamic cluster state𝑞dyn,1

3

3

3

𝑖1

𝑖2

𝑔env

𝑎4

𝑎1

𝑎3𝑐3→4

𝑐1→3

size(𝑐3→4) = 3

size(𝑐1→3) = 7

𝑔𝛾1,3,4

(c) Dynamic cluster state𝑞dyn,2

Figure 4.54: In the above given example, a cluster environment 𝑔env that pro-vides an infinite number of tokens is assumed. For depiction pur-pose, seven tokens are chosen for each cluster input port. Threesituations distinguished by their corresponding state of the clusterare shown. The original situation (𝑞0) is given in (a). The situation(𝑞dyn,1) where the cluster consumes the maximal number of tokens(here two tokens) from input port 𝑖1 while still not activating tran-sition 𝑡1 is given in (b). Conversely, the situation (𝑞dyn,2) where thecluster consumes the maximal number of tokens (here six tokens)from input port 𝑖2 while still not activating transition 𝑡1 is shown(c).

tokens that would be consumed on the cluster input ports during a sequenceof actor firings transitioning the cluster from the current cluster state 𝑞 to an-other state 𝑞dyn ∈ 𝑄dyn

𝛾 that could be encountered by dynamic scheduling ofthe cluster.

𝜋{ 𝑖 }(adji2i(𝑞)) = max{ 𝜋{ 𝑖 }(𝑛𝐼) | ∃𝑞dyn ∈ 𝑄dyn𝛾 : (4.58)

𝑛𝐼 = − 𝜋𝐼(io(𝑞dyn)− io(𝑞)) ≥ 0 ∧ (4.59)@𝑡 ∈ 𝑇 ∧ 𝑡.𝑞src = 𝑞 : (4.60)𝑛𝐼 ≥ cons(𝑡) } (4.61)

Output-to-Output Back Pressure

Finally, to conclude the discussion of the reasons of artificial deadlock due toback pressure, the solution for output-to-output back pressure will be general-ized to clusters having more than one output port. The problem of output-to-output back pressure can even occur if the cluster is a pure source for the DFG.We will use this to consider the problem in isolation by analyzing the output-to-output back pressure problem for the cluster 𝑔𝛾2,4,5

shown in Figure 4.55. As

182

Page 191: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

(3,3,0,0)

(0,0,3,3)

3 𝑜1

𝑜2

𝑎env

𝑎5

𝑎2

𝑎4

𝑐out,1size(𝑐out,1) = 3

𝑐out,2size(𝑐out,2) = 3

𝑐5→2

size(𝑐5→2) = 3

𝑐4→5

size(𝑐4→5) = 7𝑔𝛾2,4,5

(a) Original unrefined DFG 𝑔ex12 withits cluster 𝑔𝛾2,4,5

3

(3,3,0,0)

(0,0,3,3)

𝑎5

𝑎2

𝑎4

𝑐4→5

𝑐5→2

𝑜1

𝑜2

𝑎env

𝑐out,1size(𝑐out,1) = 5

𝑐out,2size(𝑐out,2) = 8

𝑞0 𝑡1

#𝑜1 ≥ 3 ∧#𝑜2 ≥ 3/𝑓⟨𝑎4,𝑎5,𝑎2,𝑎5,𝑎2,𝑎5,𝑎2⟩

𝑎𝛾2,4,5

(b) Refined DFG 𝑔ex12 containing thecomposite actor 𝑎𝛾2,4,5

Figure 4.55: For illustration of the generalized solution of the output-to-outputpack pressure problem from one to multiple output ports the abovegiven DFG 𝑔ex12 is employed.

this cluster has no input ports, input-to-output and input-to-input back pressuredoes not occur.As always, the description of a critical scenario for the output-to-output back

pressure problem in the DFG 𝑔ex12 is given. In this scenario, the cluster en-vironment 𝑎env pulls six tokens from the cluster output channel 𝑐out,1 by twiceconsuming three tokens from the channel. This can be accommodated by theunrefined DFG 𝑔ex12 by firing both actors 𝑎5 and 𝑎2 once, a consumption of thefirst batch of three tokens by the cluster environment 𝑎env, three more firings ofactor 𝑎5, and the consumption of the second batch of three tokens by the clusterenvironment 𝑎env. After this sequence of actor firings, the channel 𝑐5→2 is filledto capacity.On the other hand, the composite actor 𝑎𝛾2,4,5

is always producing the max-imum number of output tokens, thus always producing all tokens remaining inchannel 𝑐5→2 to the cluster output channel 𝑐out,2. Hence, assuming that thechannel capacity of the cluster output channel 𝑐in,2 is still three tokens, an arti-ficial deadlock results due to reason (E). To compensate, the channel capacitiesmust be adjusted as follows:𝑔ex12 . size(𝑐out,1) = 𝑔ex12 . size(𝑐out,1) + 𝑛𝑐out,1 = 3 + 2 = 5𝑔ex12 . size(𝑐out,2) = 𝑔ex12 . size(𝑐out,2) + 𝑛𝑐out,2 = 3 + 5 = 8

To compute these adjustment values for the transition 𝑡1 and each outputport, the maximal number of token productions that can be undone on thisport while still not having undone all produced tokens of transition 𝑡1 will bedetermined. To enable a graphical illustration of undoing token productions,the original situation given in Figure 4.56a will be assumed. In this situation,the cluster environment has consumed seven tokens from each cluster outputport.

183

Page 192: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

3 𝑜1

𝑜2

𝑔env

𝑎5

𝑎2

𝑎4𝑐4→5

𝑐5→2

size(𝑐4→5) = 7

size(𝑐5→2) = 3

𝑔𝛾2,4,5

(a) Situation after transi-tion 𝑡1 has been taken

3 𝑜1

𝑜2

𝑔env

𝑎5

𝑎2

𝑎4𝑐4→5

size(𝑐5→2) = 3

size(𝑐4→5) = 7

𝑐5→2

𝑔𝛾2,4,5

(b) Dynamic cluster state𝑞dyn,1

3 𝑜1

𝑜2

𝑔env

𝑎5

𝑎2

𝑎4𝑐4→5

𝑐5→2

size(𝑐4→5) = 7

size(𝑐5→2) = 3

𝑔𝛾2,4,5

(c) Dynamic cluster state𝑞dyn,2

Figure 4.56: To illustrate the undoing of actor firings of the cluster 𝑔𝛾2,4,5, the

situation in (a) is assumed. Here, the cluster environment 𝑔envhas consumed seven tokens provided on each cluster output port.Moreover, two situations resulting from undoing actor firings start-ing from (a) are shown in (b) and (c). These situations correspondto the dynamic cluster states 𝑞dyn,1 and 𝑞dyn,2, respectively. In par-ticular, state 𝑞dyn,1 corresponds to the situation where the clusterhas undone the maximal number of token productions (here twotokens) on the output port 𝑜1 while still not having undone all pro-duced tokens of transition 𝑡1. Likewise, state 𝑞dyn,2 has undonethe maximal number of token productions (here five tokens) on theoutput port 𝑜2.

As can be seen in Figure 4.56c, five token productions can be undone onthe cluster output port 𝑜2 while still allowing the production of at least onetoken that would also have been produced by the transition 𝑡1, e.g., the tokenproduced by the first firing of actor 𝑎5 of the transition 𝑡1. Hence, the channelcapacity of the cluster output channel 𝑐out,2 has been increased by five tokens,i.e., 𝑛𝑐out,2 = 5. An equivalent observation (see Figure 4.56b) can be made forthe output port 𝑜1. In this case, two firings of the actor 𝑎5 can be undone whilestill producing at least one token, i.e., the token produced by the first firing ofactor 𝑎5. Thus, the channel capacity of the cluster output channel 𝑐out,1 hasbeen increased by two tokens, i.e., 𝑛𝑐out,1 = 2. If the first firing of actor 𝑎5 is alsoundone, then the three firings of actor 𝑎2 have also to be reversed. Hence, all ofthe tokens produced by the transition 𝑡1 have been undone. More formally, thefunction adjo2o is used to encode the adjustment requirements:

Definition 4.23. (Output-to-Output Back Pressure Adjustment) The solutionfor the output-to-output back pressure problem for a given cluster state is com-puted by the vector-valued output-to-output back pressure adjustment functionadjo2o : 𝑄𝛾 → N|𝑂|

0 that associates with the given cluster state 𝑞 the required

184

Page 193: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.4 Cluster Refinement in Dynamic Data Flow Graphs

increase of the cluster output channel capacities. The adjustment is computedfor each output port 𝑜 by determining the maximal (see Equation (4.62)) num-ber 𝜋{ 𝑜 }(𝑛𝑂) of token productions that can be undone (see Equation (4.64)) onan output port 𝑜 of the cluster starting from the destination state of a transition𝑡 (see Equation (4.63)) such that dynamic scheduling of the cluster would stillbe able to produce at least one token (see Equation (4.65)) that would havebeen produced by the transition 𝑡. Here, the vector 𝑛𝑂 represents the numberof token productions that could be undone starting from the destination state𝑡.𝑞dst of the transition 𝑡 such that at least one token produced by the transitioncould still be produced via dynamic scheduling of the cluster.

𝜋{ 𝑜 }(adjo2o(𝑞)) = max{ 𝜋{ 𝑜 }(𝑛𝑂) | ∃𝑞dyn ∈ 𝑄dyn𝛾 : (4.62)

∃𝑡 ∈ 𝑇 ∧ 𝑡.𝑞src = 𝑞 : (4.63)𝑛𝑂 = − 𝜋𝑂(io(𝑞dyn)− io(𝑡.𝑞dst)) ≥ 0 ∧ (4.64)𝑛𝑂 ≥ prod(𝑡) } (4.65)

General Example Requiring Solutions for All Reasons

Finally, as an example requiring all three adjustment parts, consider the DFG𝑔𝛾1,2,...6

depicted in Figure 4.57. The adjustment requirements to handle theoutput-to-output, input-to-input, and input-to-output and back pressure prob-lems are listed in Table 4.5. First, the output-to-output back pressure problemis handled. According to Table 4.5, an increase of at least eight tokens is re-quired for the cluster output channel 𝑐out,1 and an increase of at least four tokensfor the cluster output channel 𝑐out,2, i.e., 𝜋𝑂(adj) = (8, 4). These adjustmentsare reused for the input-to-output back pressure problem, i.e., adji2o(𝑞, (8, 4)).Finally, the required channel capacity adjustments for the cluster input chan-nels are determined from the values of the adji2i(𝑞) and adji2o(𝑞, (8, 4)) func-tions. Here, an increase of at least seven tokens is required for the clusterinput channel 𝑐in,1 and at least six tokens for the cluster input channel 𝑐in,2, i.e.,𝜋𝐼(adj) = max{ (3, 3), (7, 6) } = (7, 6).

Table 4.5: The three channel capacity adjustments parts for the cluster 𝑔𝛾1,2,...6

adjo2o(𝑞) adji2i(𝑞) adji2o(𝑞, (8, 4))(𝑛out,1, 𝑛out,2) (𝑛in,1, 𝑛in,2) (𝑛in,1, 𝑛in,2)

𝑞 = 𝑞0 (2, 4) (2, 3) (7, 0)𝑞 = 𝑞1 (5, 2) (2, 0) (4, 3)𝑞 = 𝑞2 (8, 2) (3, 0) (1, 6)

max{ . . . | 𝑞 ∈ 𝑄𝛾 } (8, 4) (3, 3) (7, 6)

185

Page 194: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

3

3 3

3

𝑜2

𝑜1𝑖1

𝑖2

𝑎env2𝑎env1

𝑎5

𝑎2

𝑎6𝑎4

𝑎1

𝑎3

𝑐out,2size(𝑐out,2) = 3

𝑐out,1size(𝑐out,1) = 3

size(𝑐6→2) = 5

𝑐6→2

𝑐6→5

size(𝑐6→5) = 3

𝑐4→6

𝑐1→2

size(𝑐1→2) = 4

size(𝑐4→6) = 4𝑐1→4

size(𝑐1→4) = 4

𝑐3→4

size(𝑐3→4) = 3𝑐in,1

size(𝑐in,1) = 3

size(𝑐in,2) = 4𝑐in,2

𝑔𝛾1,2,...6

(a) Original unrefined DFG 𝑔ex13with its cluster 𝑔𝛾1,2,...6

3

3 3

3𝑎5

𝑎2

𝑎6𝑎4

𝑎1

𝑎3

𝑐6→2

𝑐6→5

𝑐4→6

𝑐1→2

𝑐3→4

𝑐1→4

𝑖2

𝑜1𝑖1

𝑜2

𝑎env2𝑎env1

𝑐out,1size(𝑐out,1) = 11

size(𝑐out,2) = 7𝑐out,2

size(𝑐in,2) = 10𝑐in,2

size(𝑐in,1) = 10𝑐in,1

𝑞0 𝑞1 𝑞2

#𝑖1 ≥ 3 ∧#𝑜1 ≥ 3/𝑓⟨𝑎3,𝑎3,𝑎3,𝑎4,𝑎6,𝑎5,𝑎6,𝑎5,𝑎6,𝑎5⟩

#𝑖1 ≥ 3 ∧#𝑜1 ≥ 2/𝑓⟨𝑎3,𝑎3,𝑎3,𝑎4,𝑎6,𝑎5,𝑎6,𝑎5⟩

#𝑖2 ≥ 3 ∧#𝑜1 ≥ 1∧#𝑜2 ≥ 3/𝑓⟨𝑎1,𝑎2,𝑎2,𝑎2,𝑎6,𝑎5⟩

#𝑖2 ≥ 3 ∧#𝑜2 ≥ 3/𝑓⟨𝑎1,𝑎2,𝑎2,𝑎2⟩

𝑡1 𝑡2

𝑡4 𝑡3

𝑎𝛾1,2,...6

(b) Refined DFG 𝑔ex13containing the composite actor 𝑎𝛾1,2,...6

Figure 4.57: As last example, the DFG 𝑎𝛾1,2,...6having multiple input and out-

put ports as well as a corresponding QSS with multiple states isconsidered to demonstrate the interplay of all reasons (A) to (E).

4.5 Results

In this section, the benefits of the proposed clustering methodology are pre-sented. First, in Section 4.5.1, it is shown that clustering can obtain a speedupof 10x and beyond by reducing the overall overhead in the execution of fine-grained data flow graphs that are dominated by static actors. Synthetic andreal-word examples are used to illustrate this advantage. Then, in Section 4.5.2,it is shown that the proposed clustering methodology enables a much larger de-sign space for cluster refinements as compared to monolithic clusterings, whichare representable by SDF composite actors. Thus, the second key contributionof this work can exploit the presence of static actors in fine-grained data flowgraphs in a larger set of situations than would be possible with the traditionalmonolithic clustering approaches known from literature.

186

Page 195: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.5 Results

4.5.1 Speedup from Overhead Reduction

In the following, results from [FZHT13*] are presented. Binary executableshave been generated by (M1) fully dynamically scheduling the graphs (for refer-ence purpose), (M2) the rule-based approach from Section 4.4.3, and (M3) theautomata-based approach from Section 4.4.2. The measurements for the resultswhere conducted on an Intelr CoreTM i7-2600 CPU with 3.40GHz. All exe-cutables where generated with the -Os compile option using version 4.5 of thegcc compiler suite. For code size measurements, the generated executable havebeen stripped of debugging information. Furthermore, the binary code sizesproduced by synthesizing the clustering with the rule-based approach (M2) andthe automata-based approach (M3) were also required to not exceed 133% ofthe code size of the dynamic case (M1). The generated binary executables areused for speedup and code size measurements to evaluate the approaches againsteach other.First, the dynamic scheduling algorithm is presented which is used as a base-

line for all comparisons. This scheduling algorithm is a variant of the simpleround-robin scheduler with the modification that (1) each actor is fired in a loopuntil the number of tokens on its inputs are insufficient for further firings and(2) all static actors in a cluster 𝑔𝛾 are scheduled by a round-robin subschedulerwhich is executed by the main round-robin scheduler (cf. Lines 17 to 26 fromAlgorithm 3).Modification (2) enables the replacement of the round-robin subscheduler with

the automata-based approach or the rule-based approach without any modifi-cations to the check and execute sequence of the dynamic actors by the mainround-robin scheduler. This is important as permutations in the check and exe-cute sequence can lead to noticeable performance fluctuations unrelated to theQSS under consideration. The replacement subscheduler to execute the rulesderived in Section 4.4.3 is given in Algorithm 4. This subscheduler is used toreplace Lines 17 to 26 from Algorithm 3. The resulting scheduler is used to eval-uate the performance and code size of the rule-based static data flow clusteringalgorithm (M2) presented in Section 4.4.3. For evaluation of the automata-based approach (M3), Lines 17 to 26 from Algorithm 3 are replaced with theappropriate subscheduler to execute the cluster FSM derived via the algorithmgiven in Section 4.4.2.In order to illustrate the benefits of the clustering algorithm developed in Sec-

tion 4.4.2, it is applied to both real-world applications (cf. Figures 4.58 and 4.59)as well as synthetic data flow graphs. Shaded vertices correspond to static ac-tors. White vertices are dynamic actors. The first real-world application isan MP3 decoder (cf. Figure 4.58) working on the MP3 granule level (i.e., 576frequency/time domain samples). The MP3 decoder has five static subclusters(two trivial subclusters consist of a single actor only and one subclusters only

187

Page 196: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

containing two actors). The remaining two subclusters 𝑔𝛾aand 𝑔𝛾b

are pro-cessed by the presented automata-based clustering algorithm to compute theirQSSs. For these two subclusters, QSSs have been computed using the presentedapproach. To evaluate the scheduling overhead reduction in isolation, as muchfunctionality as possible is removed from all actors. The dynamic scheduler forthis MP3 decoder has an overall execution time of about 184ms for a givenMP3 input stream (approximately 20MB). This decreases to 57ms when usingthe previously computed QSSs for the two subgraphs, i.e., an improvement ofapproximately 69%. For a real-world test, the unmodified MP3 decoder is used.

Algorithm 3 Fully dynamic scheduler

1 VAR: 𝑓𝑖𝑟𝑒𝑑, 𝑓𝑖𝑟𝑒𝑑𝛾 ∈ {true, false}2 VAR:𝑞𝛾1

, 𝑞𝛾2, . . . 𝑞𝛾n

∈ 𝑄 Current cluster state for the 𝑛 clusters3 VAR: 𝑞a1 , 𝑞a2 , . . . 𝑞am ∈ 𝑄 Current state for all dynamically scheduled actors4 IN : The DFG 𝑔 and its set of static actors 𝐴𝑆 ⊆ 𝑔.𝐴5 BEGIN6 LET 𝑞𝛾 ← 𝑞0 ∀𝑔𝛾 ∈ 𝑔.𝐺𝛾

7 LET 𝑞a ← 𝑞0 ∀𝑎 ∈ 𝑔.𝐴8 DO LET 𝑓𝑖𝑟𝑒𝑑← false9 FOREACH 𝑎 ∈ 𝑔.𝐴− 𝐴𝑆 DO

10 WHILE ∃𝑡 ∈ 𝑎.𝑇 : 𝑡.𝑞src = 𝑞𝑎 ∧ 𝑡.𝑘 DO11 CALL 𝑡.𝑓𝑎𝑐𝑡𝑖𝑜𝑛12 LET 𝑞𝑎 ← 𝑡.𝑞dst13 LET 𝑓𝑖𝑟𝑒𝑑← true14 ENDIF15 ENDFOR16 FOREACH 𝑔𝛾 ∈ 𝑔.𝐺𝛾 DO17 DO LET 𝑓𝑖𝑟𝑒𝑑𝛾 ← false18 FOREACH 𝑎 ∈ 𝑔𝛾 .𝐴 DO19 WHILE ∃𝑡 ∈ 𝑎.𝑇 : 𝑡.𝑞src = 𝑞𝑎 ∧ 𝑡.𝑘 DO20 CALL 𝑡.𝑓𝑎𝑐𝑡𝑖𝑜𝑛21 LET 𝑞𝑎 ← 𝑡.𝑞dst22 LET 𝑓𝑖𝑟𝑒𝑑𝛾 ← true23 ENDIF24 ENDFOR25 LET 𝑓𝑖𝑟𝑒𝑑← 𝑓𝑖𝑟𝑒𝑑𝛾 ∨ 𝑓𝑖𝑟𝑒𝑑26 WHILE 𝑓𝑖𝑟𝑒𝑑𝛾27 ENDFOR28 WHILE 𝑓𝑖𝑟𝑒𝑑29 END

188

Page 197: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.5 Results

This resulted in an overall execution time of about 2, 230ms for the same inputstream using the dynamic scheduler, and 2, 100ms when using the QSSs. Thiscorresponds to an improvement of approximately 6%. However, for data flowgraphs containing static actors of fine granularity, the achievable reduction inoverall overhead is expected to be higher.

SynthReorder

Huff.

isation

Requant-isation

IMDCT𝑆

Alias

Alias

IMDCT𝑆 SYNCSYNC

IMDCT𝐿

IMDCT𝐿

SYNCSYNC inversion

Frequency

Frequency

Frequency

Play

Frequency

SynthReorder

MP3

Requant-

𝑔𝛾b

Source Decoder

reduction inversion

reduction inversion

inversion

𝑔𝛾a

Stereodecode

Figure 4.58: DFG of an MP3 decoder

The Motion-JPEG decoder (cf. Figure 4.59) has two static subclusters (thesource actor being a trivial cluster containing only a single actor). As can beseen, the IDCT and the inverse ZigZag transformation correspond to the largest

Algorithm 4 Rule subscheduler for clustered actors

13 . . .14 DO LET 𝑓𝑖𝑟𝑒𝑑𝛾 ← false15 FOREACH 𝑟 ∈ 𝑅 DO16 IF 𝑟.𝑙 ≤ 𝑞𝛾 ≤ 𝑟.𝑢 ∧ 𝑟.𝑘io THEN17 CALL 𝑟.𝑓⟨...⟩18 LET 𝑞𝛾 ← 𝑞𝛾 + 𝑟.𝑠

19 LET 𝑓𝑖𝑟𝑒𝑑𝛾 ← true20 ENDIF21 ENDFOR22 IF 𝑞𝛾 ≥ 𝜋ℐ(𝑞𝛾 )(𝜂

rep𝛾 ) THEN

23 LET 𝑞𝛾 ← 𝑞𝛾 − 𝜋ℐ(𝑞𝛾 )(𝜂rep𝛾 )

24 ENDIF25 LET 𝑓𝑖𝑟𝑒𝑑← 𝑓𝑖𝑟𝑒𝑑𝛾 ∨ 𝑓𝑖𝑟𝑒𝑑26 WHILE 𝑓𝑖𝑟𝑒𝑑𝛾27 . . .

189

Page 198: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

static subcluster. Each butterfly operation of the 8 × 8 inverse discrete cosinetransform (IDCT ) is modeled as an SDF actor (cf. Figure 4.10 on page 101).Such fine-grained graphs can result from the usage of tools like [VBCG04] whichreplace coarse-grained actors in a DFG by fine-grained subgraphs correspondingto DFGs derived from the functions of the actors. This expanded graph can thenbe clustered again to better utilize the resources of an MPSoC environment thanwould be feasible by only clustering actors atomically.

(0,0,0,8,0,0,0,0)

(0,0,0,0,0,8,0,0)

(0,0,0,0,0,0,0,8)

(0,8,0,0,0,0,0,0)

(0,0,0,0,8,0,0,0)

(0,0,0,0,0,0,8,0)

(0,0,8,0,0,0,0,0)

(8,0,0,0,0,0,0,0)

6481

1 8

8

8

11

81

1 8

81

1 8(1,1,1,1,1,1,1,1)

(1,1,1,1,1,1,1,1)

(1,1,1,1,1,1,1,1)

(1,1,1,1,1,1,1,1)

(1,1,1,1,1,1,1,1)

(1,1,1,1,1,1,1,1)

(1,1,1,1,1,1,1,1)

(1,1,1,1,1,1,1,1)

18

8 1

18

8 1

18

8 1

18

8 1

64

JPEG Inverse Inverse

Inverse

Huff.

Frame

Decoder

Shuffler

ZRL Quant.

ZigZag

Source Parser

IDCTSinkImage

𝑔𝛾idct| IDCT

Figure 4.59: DFG of a Motion-JPEG decoder

For test purposes, the standard Lena image was used and the decoding wasrepeated 25 times. For this test case, a reduction of 89% in the overall execu-tion time of the decoder could be observed where the dynamic scheduler (M1)required 329ms whereas after applying the presented rule-based QSS scheduling(M2) only 35ms where needed. With the functionality restored, a reduction of40% in the overall execution time of the decoder could still be observed wherethe dynamic scheduler (M1) had an overall execution time of 823ms while only492ms where required by the rule-based QSS (M2).In order to evaluate the presented clustering algorithm more thoroughly, it was

applied to 48,000 randomly generated data flow graphs: The generated graphshave the same properties, except (1) the average input and output degree of thedata flow graph vertices and (2) the number of actor firings necessary to reachthe initial state of the SDF graph, i.e., the sum of the repetition vector scalars.The average input and output degree was varied from 2 to 7 in incrementsof 1 and the repetition vector sum from 100 to 1,000 in increments of 300.Using the SDF3 tool [SGB06b], six different SDF graphs for each pair of average

190

Page 199: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.5 Results

input/output degree and repetition vector sum were generated, thus, creating120 initial data flow graphs. The generated graphs consist of 60 actors andare cyclic, i.e., they contain strongly connected components, with random butconsistent SDF rates and sufficient initial tokens to guarantee a deadlock-freeself-scheduled execution. For each random graph 𝑔, various clusterings weregenerated. A clustering is a set of clusters 𝐺𝛾 into which the static actors 𝐴𝑆

of the graph have been partitioned in such a way that each cluster conforms tothe clustering condition given in Definition 4.4. To execute such clusterings, thedynamic scheduler from Algorithm 3 is simply equipped with the correspondingnumber of subschedulers (one for each cluster in the clustering). Furthermore,the number of dynamically scheduled entities 𝑠𝑐ℎ𝑒𝑑dyn can be derived for eachclustering as follows:

𝑠𝑐ℎ𝑒𝑑dyn = |𝐴− 𝐴𝑆|+ |𝐺𝛾 | (4.66)

As one can see from the definition above, dynamically scheduled entities aredynamic actors and the composite actors (the |𝐺𝛾 | part) derived for all clus-ters. In fact, this exactly corresponds to the subschedulers in the dynamicscheduler and is also the number which is depicted on the x-axis of the speedupcomparisons in Figure 4.61. For each of the six different initial SDF graphs cor-responding to a pair of average input/output degree and repetition vector sum,a best clustering is selected for a given number of dynamically scheduled entities𝑠𝑐ℎ𝑒𝑑dyn. The speedup value (on the y-axis) for a given average input/outputdegree 𝑑 (z-axis), number of dynamically scheduled entities 𝑠𝑐ℎ𝑒𝑑dyn (x-axis)and repetition vector sum rvs (cf. Figures 4.61a to 4.61d) is the average of thesix speedup values achieved by the best clusterings. The speedup is computedby dividing the throughput of the clustered DFGs by the throughput of thecorresponding fully dynamic scheduled DFG. The throughput of a DFG is de-rived by dividing the number of iterations computed by the DFG by the overallexecution time 𝑡𝐸𝑋𝐸𝐶 required for the given number of iterations. In order toreduce the measuring inaccuracy of each measure, the number of iterations fora measure is adjusted in such a way that the overall execution time is betweenthree and five seconds.Frequently, the best clustering of a static region 𝐴𝑆 is not a monolithic com-

posite actor 𝑎𝛾 , if such a cluster is feasible at all due to the clustering condition,but a partitioning of the static region into a set of smaller composite actorsall having less complex input/output behavior. An example of such cases canbe seen in Figure 4.60. Also note that the binary code size constraints of theautomata-based approach (M3) forces the usage of more clusters in the clus-tering to satisfy the constraint of at most 133% increase in binary code sizeas compared to the fully dynamic scheduled system. This often yields a betterspeedup compared to a big composite actor.

191

Page 200: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 2 3 4 5 6

Nr. o

f clu

ster

s in

clu

ster

ing

Nr. of dyn. scheduled entitiesAvg.

degree

Rules LimitedFSM Limited

Nr. o

f clu

ster

s in

clu

ster

ing

Figure 4.60: Example of the average clusters in the fastest clustering under bi-nary code size constraints for random graphs with a repetition vec-tor sum of 100

To evaluate the improvement potential which can be exploited by the pre-sented rule-based clustering approach, the overall overhead of the round-robinscheduler from Algorithm 3 was measured. Each initial data flow graph has perdefinition no dynamic actors. Hence, each initial data flow graph can be fullystatically scheduled. The overhead 𝑠overhead is defined as the factor by which thefully statically scheduled initial data flow graph is faster than the same graphscheduled fully dynamically executing the same amount of useful work. Typi-cally, thousands of iterations of the graph are used for these measurements. Asclustering reduces this overall overhead, this represents an upper bound for thespeedup that can be obtained by clustering.However, if the clustering is between fully static and fully dynamic, the re-

maining overhead in the system can only be estimated. The estimation assumesthat the overhead is uniformly spread over all actors:

𝑠estimation =𝑠overhead · 59

𝑠overhead · (𝑠𝑐ℎ𝑒𝑑dyn − 1) + 60− 𝑠𝑐ℎ𝑒𝑑dyn

Note that this is only a coarse approximation as the actors may have differentrepetition counts. As can be seen in Figure 4.61, for higher repetition vectorsums, the approximation gets worse. This is expected as the speedup of the rule-based approach (M2) and the automata-based approach (M3) is the average ofthe best clustering for each random test graph. And the best clustering tends toselect the actors for clustering that eliminate the most overall overhead from thesystem and not the average case assumed by the uniform spread of the overalloverhead over all actors.The available amount of memory is typically constrained in embedded sys-

tems. Hence, clusterings which, when synthesized by the rule-based approach(M2) or the automata-based approach (M3), exceed a binary code of of 133% ofthe code size of the dynamic case (M1) where excluded from consideration for

192

Page 201: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.5 Results

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 3 5 7 9

11

Spee

dup

Nr. of dyn. scheduled entitiesAvg.

degree

Rules LimitedFSM Limited

Estimation

Spee

dup

(a) Average speedup of the best clustering forsynthetic SDF graphs with a repetitionvector sum of 100

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 3 5 7 9

11

Spee

dup

Nr. of dyn. scheduled entitiesAvg.

degree

Rules LimitedFSM Limited

Estimation

Spee

dup

(b) Average speedup of the best clustering forsynthetic SDF graphs with a repetitionvector sum of 400

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 3 5 7 9

11

Spee

dup

Nr. of dyn. scheduled entitiesAvg.

degree

Rules LimitedFSM Limited

Estimation

Spee

dup

(c) Average speedup of the best clustering forsynthetic SDF graphs with a repetitionvector sum of 700

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 3 5 7 9

11Sp

eedu

p

Nr. of dyn. scheduled entitiesAvg.

degree

Rules LimitedFSM Limited

EstimationSp

eedu

p

(d) Average speedup of the best clustering forsynthetic SDF graphs with a repetitionvector sum of 1,000

Figure 4.61: Given above is the average speedup of the best clustering normal-ized to the throughput of the fully dynamically scheduled systemfor each of the six random test graphs with the given average in-put/output degree, repetition vector sum, and number of dynami-cally scheduled entities in the clustering. Furthermore, the binarycode sizes produced by the synthesizing the clustering with the rule-based approach (M2) and the automata-based approach (M3) werealso required to not exceed 133% of the code size of the dynamiccase (M1).

the best clustering for the speedups as depicted in Figure 4.61. In case no suchexclusion is enforced, the relative binary code size of the rule-based approach(M2) or the automata-based approach (M3) is depicted in Figure 4.62.All executables where generated with the -Os compile option of gcc and then

stripped. After that, the remaining executable sizes where measured. The ex-ecutable size for the fully dynamically scheduled system (M1) is the reference

193

Page 202: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 2 3 4 5

Rel

. bin

ary

code

siz

e

Nr. of dyn. scheduled entitiesAvg.

degree

RulesFSM

Rel

. bin

ary

code

siz

e

(a) Average normalized binary code size of thebest clustering for synthetic SDF graphswith a repetition vector sum of 100

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 2 3 4 5

Rel

. bin

ary

code

siz

e

Nr. of dyn. scheduled entitiesAvg.

degree

RulesFSM

Rel

. bin

ary

code

siz

e

(b) Average normalized binary code size ofthe best clustering for synthetic SDFgraphs with a repetition vector sum of 400

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 2 3 4 5

Rel

. bin

ary

code

siz

e

Nr. of dyn. scheduled entitiesAvg.

degree

RulesFSM

Rel

. bin

ary

code

siz

e

(c) Average normalized binary code size of thebest clustering for synthetic SDF graphswith a repetition vector sum of 700

1 7 13 19 25 31 37 43 49 55 2 3

4 5 6

7 1 2 3 4 5

Rel

. bin

ary

code

siz

e

Nr. of dyn. scheduled entitiesAvg.

degree

RulesFSM

Rel

. bin

ary

code

siz

e

(d) Average normalized binary code size ofthe best clustering for synthetic SDFgraphs with a repetition vector sum of1,000

Figure 4.62: Shown above is the average binary code size normalized to the codesize of the fully dynamically scheduled system of the best cluster-ing for each of the six random test graphs with the given averageinput/output degree, repetition vector sum, and number of dynam-ically scheduled entities in the clustering. No code size limitationswhere applied to find the clusterings with the best speedup.

and corresponds to 1. One can see a correlation between Figure 4.62 and Fig-ure 4.61. In cases where the binary code size of the automata-based approach(M3) far exceeds the binary code size of the dynamically scheduled system (M1),the speedup of the automata-based approach (M3) under code size limitationsalso falls noticeably behind the speedup of the rule-based approach (M2). Thisblowup stems from the fact that the representation of QSSs via FSMs typicallyrequires huge code sizes for complex schedules.

194

Page 203: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.5 Results

4.5.2 Design Space for Clustering

In the following, the potential of the presented clustering methodology and ofmonolithic clustering to reduce the number of dynamically scheduled entitiesis evaluated. While the best clustering in terms of speedup of a set of staticactors may not be the clustering with the least number of dynamically scheduledentities, the design space for clustering supported by an approach comprisesthe span from the minimal number of dynamically scheduled entities given inFigure 4.63 to the fully dynamic case. Hence, the lower the obtainable numberof dynamically scheduled entities for an approach is, the larger is the designspace that is covered by this approach. A large design space is desirable tocover the conflicting requirements (cf. Section 4.1) of computational efficiency,e.g., from schedule overhead reduction, minimal latency, or overall throughputinduced by the various architectural choices for embedded systems.The test scenario consists of 240 SDF graphs that have been generated using

the SDF3 tool [SGB06b] to obtain graphs varying in the average actor input/out-put degree from 2 to 7 in increments of 1 and in the repetition vector sum from100 to 1,000 in increments of 300. All of these graphs have exactly 100 actors.For each pair of average input/output degree and repetition vector sum, 10 dif-ferent SDF graphs have been generated. Of course, all of these SDF graphs canbe fully statically scheduled. Hence, if none of the actors in these graphs aremarked as dynamic, then both monolithic clustering as well as the automata-based approach can reduce the number of dynamically scheduled entities to one(cf. Figure 4.63a).The compared approaches are monolithic clustering, which generates SDF

composite actors and is labeled as SDF in Figure 4.63, and the presented clus-tering methodology in three different variants that are labeled as FSM-500,FSM-10,000, and heuristic. These three variants differ in the maximum permis-sible complexity of the QSSs realized by them: The variant labeled as heuristiccorresponds to the number of dynamically scheduled entities obtained by theclusterings derived by the heuristic introduced in Section 4.4.1. Hence, the onlyconstraint on clusters contained in these clusterings are a finite cluster statespace. However, these clusterings might, in fact, not be implementable due tothe huge binary code size requirements of complex QSSs. Thus, the variantslabeled FSM-500 and FSM-10,000 are further constrained to only contain clus-ters with a cluster state space of cardinality less than 500 respectively 10,000states.81 Hence, the minimal number of dynamically scheduled entities obtain-able by the different approaches will always increase starting from the heuristic,to the FSM-10,000, FSM-500, and SDF approach.

81Monolithic clustering can also be thought of as clustering with the constraint that eachcluster has a cluster state space containing exactly one cluster state.

195

Page 204: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

SDFFSM-500

FSM-10,000heuristic

Nr.

of d

yn. s

ched

uled

ent

itie

s

(a) Average nr. of dyn. scheduled entities for0% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

SDFFSM-500

FSM-10,000heuristic

Nr.

of d

yn. s

ched

uled

ent

itie

s

(b) Average nr. of dyn. scheduled entities for1% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

SDFFSM-500

FSM-10,000heuristic

Nr.

of d

yn. s

ched

uled

ent

itie

s

(c) Average nr. of dyn. scheduled entities for3% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100N

r. o

f dyn

. sch

edul

ed e

ntit

ies

Repetition vector sumAvg.

degree

SDFFSM-500

FSM-10,000heuristic

Nr.

of d

yn. s

ched

uled

ent

itie

s

(d) Average nr. of dyn. scheduled entities for5% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

SDFFSM-500

FSM-10,000heuristic

Nr.

of d

yn. s

ched

uled

ent

itie

s

(e) Average nr. of dyn. scheduled entities for7% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

SDFFSM-500

FSM-10,000heuristic

Nr.

of d

yn. s

ched

uled

ent

itie

s

(f) Average nr. of dyn. scheduled entities for9% of dyn. actors in the DFG

Figure 4.63: Average number of dynamically scheduled entities obtainable by thefour approaches for different DFGs

To evaluate the approaches in the presence of dynamic actors, a percentageof randomly selected actors have been marked as dynamic and, thus, unavail-able for clustering. In detail, for each of the percentages 1%, 3%, 5%, 7%,and 9% (cf. Figures 4.63b to 4.63f), 10 variants have been derived from the240 original SDF graphs by marking a random selection of the correspondingpercentage of actors as dynamic. Hence, for each pair of average input/output

196

Page 205: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.5 Results

degree, repetition vector sum, and percentage of dynamic actors, 100 graphs areused to average the number of dynamically scheduled entities obtainable by thefour approaches. The number of dynamically scheduled entities for the heuristicapproach are derived by applying the heuristic from Section 4.4.1 to each of the100 graphs, computing the number of dynamically scheduled entities (cf. Equa-tion (4.66)) for the resulting clusterings, and averaging the 100 results. Forthe approaches SDF, FSM-500, and FSM-10,000, the clusters in the clusteringderived from the heuristic are further divided until the corresponding constrainton the maximal cardinality of the cluster state space is satisfied.When considering Figure 4.63 as a whole, one can observe that even the pres-

ence of one dynamic actor (cf. Figure 4.63b) in the DFG drastically limits thepotential of monolithic clustering. For graphs with an average actor input andoutput degree of two, already around fifty entities must be dynamically sched-uled, whereas the proposed clustering methodology only requires around twodynamically scheduled entities. Furthermore, with increasing average input andoutput degree, the complexity of the required QSSs for the clusters increases.Thus, smaller clusters must be used to conform to the cluster state space car-dinality constraint and, hence, the number of required dynamically scheduledentities increases. A further contributing factor to complexity is the repetitionvector sum of the graph. An increased repetition vector sum will increase theset of input/output dependency function values (cf. Definition 4.8 on page 139)from which the cluster FSM is derived. This complexity factor is dominant forlarge clusters with few FIFO channels connecting them to their cluster environ-ment. These types of clusters are present if few dynamic actors are contained ina DFG (cf. Figures 4.63a to 4.63c). In contrast to this, if the number of dynamicactors and the average actor input and output degree increases (cf. Figures 4.63dto 4.63f), then the number of FIFO channels connecting clusters to their clus-ter environment will also increase. Then the dominant complexity factor is therequired interleavings of output actor firings (cf. Definition 4.10 on page 141).As can be seen in Figures 4.63d to 4.63f, clusters containing marked graphs82

are especially susceptible to this phenomenon.To exemplify, the automata- and rule-based approaches have been applied to

the 100 actor test graphs used in this section. The considered clusterings havebeen constrained by the requirement that their implementations have a binarycode size of at most 133% of the fully dynamically scheduled implementation(M1) of the corresponding graph. As can be seen in Figure 4.64, the rule-basedapproach (M2) has a higher potential to reduce the number of dynamicallyscheduled entities in comparison to the automata-based approach (M3). This82The SDF graphs used in this section have exactly 100 actors. Hence, if the repetition

vector sum of a graph is also 100, then each actor must fire exactly once in an iterationof the graph. Thus, the graph is a marked graph or a graph where the consumption andproduction rates for each edge of the graph are identical.

197

Page 206: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

is especially evident in the situations (cf. Figures 4.64d to 4.64f) where thecomplexity of the QSSs stems from the required interleavings of output actorfirings.

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

FSM LimitedRules Limited

Nr.

of d

yn. s

ched

uled

ent

itie

s

(a) Average nr. of dyn. scheduled entities for2% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

FSM LimitedRules Limited

Nr.

of d

yn. s

ched

uled

ent

itie

s(b) Average nr. of dyn. scheduled entities for

3% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

FSM LimitedRules Limited

Nr.

of d

yn. s

ched

uled

ent

itie

s

(c) Average nr. of dyn. scheduled entities for5% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

FSM LimitedRules Limited

Nr.

of d

yn. s

ched

uled

ent

itie

s

(d) Average nr. of dyn. scheduled entities for7% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

FSM LimitedRules Limited

Nr.

of d

yn. s

ched

uled

ent

itie

s

(e) Average nr. of dyn. scheduled entities for8% of dyn. actors in the DFG

100 400 700 1000 2 3

4 5 6

7 10 20 30 40 50 60 70 80 90

100

Nr.

of d

yn. s

ched

uled

ent

itie

s

Repetition vector sumAvg.

degree

FSM LimitedRules Limited

Nr.

of d

yn. s

ched

uled

ent

itie

s

(f) Average nr. of dyn. scheduled entities for9% of dyn. actors in the DFG

Figure 4.64: Average number of dynamically scheduled entities obtainable un-der 133% code size limitations by the automata- and rule-basedapproaches

198

Page 207: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.6 Related Work

4.6 Related Work

The considered refinement algorithms can be distinguished into (1) algorithmsfor clustering SDF graphs into monolithic clusters, that is each monolithic clus-ter can again be represented by an SDF composite actor, and (2) algorithmsfor clustering an SDF or CSDF graph into composite actors each of which canbe represented not by an SDF or CSDF actor, but by a composite actor imple-menting a QSS. The algorithms in (1) are applicable to closed systems, that isworking on whole SDF graphs. The algorithms in (2) are developed for opensystems, that is working on clusters potentially having inputs and outputs toan unknown cluster environment.The first set of clustering algorithms for generating monolithic clusters has

mainly been researched by Bhattacharyya et. al. [BL93, PBL95]. However,these algorithms severely constrain the possibilities of clustering an SDF graph.Furthermore, the Pairwise Grouping of Adjacent Nodes (PGAN) algorithm pre-sented in [BL93] does only work for SDF graphs and has at its main focus thereduction of code and data memory for uniprocessor systems. The heuristicpresented in [PBL95] also works for SDF subgraphs, that is static clusters, em-bedded in dynamic DFGs when both constant consumption and production ratesare known for each edge connecting the dynamic DFG to the SDF subgraph.However, the heuristic is still limited to generate only monolithic clusters, hence,it is severely constrained in the clustering possibilities.The second set of refinement algorithms for generating composite actors im-

plementing QSSs has mainly been researched in this thesis [FKH+08*, FZK+11*,FZHT11*, FZHT13*] and by Tripakis et al. [TBG+13]. The algorithms pre-sented in [FZHT11*, FZHT13*, FZK+11*] generate basically the same QSS fora static cluster. The only difference between [FZK+11*] (cf. Section 4.4.2) and[FZHT11*, FZHT13*] (cf. Section 4.4.3) is the representation of the QSS.In contrast to this, Tripakis et al. [TBG+13] considers cluster refinement

under the aspect of modular code generation for SDF graphs. Both approacheslack information about the cluster environment of a cluster. In the case ofthe cluster environment being a dynamic DFG, this lack of information is dueto the lack of analyzability of dynamic DFGs in the general case. In case ofmodular code generation, the cluster environment is not known at the timeof cluster refinement. Note, however, that modular code generation only usesQSSs to preserve scheduling flexibility for the intermediate representations usedby successive compilation steps that successively assemble actors and clustersinto a whole SDF graph. The final whole SDF graph will be scheduled by atraditional PSOS that no longer contains any actors implementing a QSS.More precisely, the approach of Tripakis et al. uses so-called Deterministic

SDF with shared FIFOs (DSSF) graphs as input and output for the presentedmodular code generation algorithm. The DSSF model extends SDF in such

199

Page 208: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4. Clustering

a way that even when shared FIFOs are used in DSSF, determinism can beensured. Since DSSF is only an extension of SDF, it enables reusage of manytechniques developed for standard SDF. More importantly, the QSS represen-tation chosen in [TBG+13] is a DSSF subgraph that is called an SDF profile.Modular code generation is concerned with the generation of an SDF profile froman (original) DSSF subgraph. This process can reduce the number of actors inthe DSSF subgraph representing the SDF profile compared to the original DSSFsubgraph. The reduction is facilitated by combining the firings of multiple ac-tors of the original DSSF subgraph into one or a lesser number of actors of theSDF profile. The ability of an SDF profile to partition the firings of one actor ofthe original DSSF subgraph required for one iteration of the original DSSF sub-graph amongst multiple actors of the SDF profile enables a larger design spacefor cluster refinements as compared to monolithic clusterings representable byan SDF subgraph. However, the data flow characteristics of the DSSF modelconstrain the QSSs representable by SDF profiles to always be cyclic. While theDSSF model can represent QSSs with multiple cycles, QSSs containing initialretiming sequences cannot be represented by SDF profiles. Moreover, Tripakiset al. does not need to tackle the FIFO size adaption (cf. Section 4.4.6) problemsince only SDF cluster environments are considered by the approach. Hence,determining the channel capacities in order to prevent deadlocks or satisfy var-ious other criteria is a decidable problem [BMMKU10] that can be performedat compile time while the final whole SDF graph is assembled.In contrast to [Buc93, SLWSV99], which represent a monolithic approach

to schedule dynamic DFGs, the presented clustering methodology is inherentlyworking on static islands which represent data flow subgraphs. Hence, thealgorithms from [Buc93, SLWSV99] have an upper problem size to be computa-tionally feasible, while clustering can handle data flow graphs of arbitrary sizeby only considering static subgraphs of sizes that are computationally feasiblefor the presented clustering methodology. Of course, optimization potential islost if individual parts of the static island in the DFG are scheduled in isolationby the clustering methodology.

Summary

In this chapter, the second key contribution of this work, a clustering method-ology [FKH+08*, FZK+11*, FHZT13*] that exploits islands of static actors inmore general DFGs, has been presented. Clustering replaces islands of staticactors in the data flow graph by composite actors. Two variants of clustering, anautomata-based and a rule-base approach, are contributed. The rule-based rep-resentation trades latency and throughput in order to achieve a more compactcode size of a QSS as compared to the automata-based representation. However,

200

Page 209: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

4.6 Related Work

under code size constraints, the rule-base approach might offer a better latencyand throughput due to its ability to represent clusters containing a larger num-ber of static actors. Later, in the implementation generated by the synthesisback-end, the remaining actors which are not replaced by a composite actor aswell as all composite actors are scheduled by a dynamic scheduler at run time.Thus, by combining multiple static actors into one composite actor, the cluster-ing methodology reduces the number of actors which have to be scheduled bythe dynamic scheduler. This reduction is of benefit as it has been shown thatcompile-time scheduling of static actors produces more efficient implementationsin terms of latency and throughput than run-time scheduling of the same actorsby a dynamic scheduler.Furthermore, clustering enables the generation of QSSs that employ sequences

of statically scheduled actor firings. Hence, clustering not only reduces thescheduling overhead imposed by the run-time system to find a schedule of allactors of the DFG at run time, but clustering also enables efficient code gen-eration for sequences of statically scheduled actor firings as compared to codegeneration optimizations possible if all actors of a sequence are considered inisolation by a compiler.In contrast to known clustering methods from literature [KB06, CH97, BML97,

PBL95], the presented methodology is not limited to clustering only SDF graphs,and neither limited to produce only SDF actors from static islands. Hence, com-pared to methods from literature [KB06, CH97, BML97, PBL95], the presentedapproach drastically enlarges the possible applicability to heterogeneous DFGswith static and dynamic actors. Thus, the ability to generate QSSs instead ofonly SDF or CSDF actors enables the generation of optimized schedules for amuch larger variety of static islands.

201

Page 210: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 211: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5Synthesis

In this chapter, the third key contribution of this work, a synthesis back-end forthe SystemC Models of Computation (SysteMoC) language, will be presented.The synthesis back-end contributes to the realization of the promised produc-tivity gain [FZH+10*] of Electronic System Level (ESL) modeling by providingan automatic way [ZFS+14*] to generate virtual prototypes from SysteMoCapplications. Here, the synthesis back-end supports multiple different targetslike pure single threaded C++ software implementations, multithreaded C++software implementations, software for the Central Processing Units (CPUs)employed in an Multi-Processor System-on-Chip (MPSoC) system, as well asSystemC code suitable as input for commercial behavioral synthesis tools forthe generation of hardware accelerators. The synthesis back-end employs tar-get dependent source-to-source transformations that can transform a SysteMoCactor into an appropriate source format for these different targets.In general, the synthesis back-end [ZHFT12*, ZFHT12*, ZFS+14*] generates

an implementation by transforming a SysteMoC application according to a setof design decisions taken by the SystemCoDesigner [HFK+07*, KSS+09*]Design Space Exploration (DSE) methodology. Examples for design decisionsare allocation and binding decisions, as well as the Quasi-Static Schedules(QSSs) generated by the clustering methodology. The allocation defines theset of hardware resources that are used in an implementation, and the bindingprovides the information which SysteMoC actor is implemented on which re-source. Note that the selection of a concrete realization for each communicationchannel is also performed during DSE. The QSSs generated by the clusteringmethodology are used by the synthesis back-end to reduce the scheduling over-head of the generated software implementations running on the CPU resourcesof the overall MPSoC system.First, the synthesis back-end is used to derive the characteristics, e.g., ex-

ecution delay, binary code size, etc., of a single actor when implemented onthe various possible targets. These characteristics are required by the System-CoDesigner DSE methodology to evaluate the candidate implementations atESL. Moreover, the synthesis back-end is used to realize synthesis-in-the-loopfor the SystemCoDesigner DSE methodology: Synthesis-in-the-loop enables

203

Page 212: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5. Synthesis

the DSE to evaluate its design decisions based on real implementations, in con-trast to the approximately-timed performance simulation carried out via theVirtual Processing Components (VPC) [SFH+06*] approach at ESL. As a com-promise between the evaluation accuracy obtainable by the synthesis-in-the-loopapproach and the evaluation speed obtained via the VPC approach, synthesis-in-the-loop could also be used to consolidate the characteristics for possible im-plementations of whole groups of actors on a resource in contrast to the modularanalysis from the individual characteristics of the constituting actors performedat ESL.The rest of this chapter is structured as follows: First, the presented approach

is motivated in Section 5.1. Then, the integration of the synthesis back-end intothe SystemCoDesigner design flow is shown in Section 5.2. In Section 5.3, theMotion-JPEG decoder introduced in Chapter 4 is briefly recapitulated. In orderto derive an implementation at a lower level of abstraction from a SysteMoCapplication at the ESL, source code transformations have to be performed. Ex-amples for these transformations are demonstrated in Section 5.4. In Section 5.5,results from applying the presented synthesis back-end to the Motion-JPEG de-coder example are given. Finally, the related work is reviewed in Section 5.6.Note that parts of this chapter are derived from [HFK+07*, KSS+09*].

5.1 Motivation

Several commercial C/C++/SystemC-based behavioral synthesis (also known ashigh-level synthesis) tools exist, e.g., Forte Cynthesizer [For13], Catapult C fromCalypto Design Systems [Cal13], or Vivado from Xilinx [XIL14]. These toolsallow the automatic translation of behavioral SystemC models to the RegisterTransfer Level (RTL). However, synthesis of an entire system using behav-ioral synthesis tools is either not possible at all or prohibitive due to the sheercomplexity of these systems.To overcome these limitations, the proposed synthesis back-end integrates be-

havioral synthesis into the SystemCoDesigner DSE methodology. That way,it becomes possible to automatically (1) optimize MPSoC implementations and(2) generate the resulting MPSoC implementation. The synthesis back-end sup-ports the generation of pure software, as shown in Section 4.5 for the resultsof the single processor evaluation of the clustering methodology that have beengenerated by measuring the resulting throughput of the implementations gen-erated by the presented synthesis back-end. Moreover, the generation of virtualprototypes for hardware/software implementations [ZFS+14*] is also possible.However, the focus of this chapter is on the generation of pure hardware imple-mentations, where an automatic behavioral synthesis step is used to generateRTL code for single actors (SystemC modules).

204

Page 213: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5.2 Synthesis in the SystemCoDesigner Design Flow

In particular, the integration of Forte Cynthesizer into the SystemCoDe-signer DSE environment will be highlighted. The target architecture is a XilinxVirtex-II Field Programmable Gate Array (FPGA). Here, Cynthesizer is used toautomatically generate synthesizable RTL code from SystemC modules. The re-sulting actor implementations are characterized regarding their size (number ofLook-Up Tables (LUTs), number of Block Random Access Memorys (BRAMs),etc.) and their performance (delay). Note that it is possible to generate differ-ent implementations for each actor with respect to size and performance. Theattributes of the actor implementations are used during DSE to select an imple-mentation for each actor, resulting in a set of non-dominated solutions. Fromthis set, the designer can select an implementation of the entire system andsynthesize it using Xilinx Embedded Development Kit (EDK) [XIL05]. For thissynthesis, a library providing implementations for the communication primitivesis supported. In order to facilitate this integration, the synthesis back-end isused to translate each SysteMoC actor into a SystemC module.

5.2 Synthesis in the SYSTEMCODESIGNER Design Flow

The overall design flow of the SystemCoDesigner DSE methodology is basedon (1) actor-oriented modeling in SysteMoC, as presented in Chapter 3, (2)hardware generation for each actor using translation of SysteMoC actors toSystemC modules and subsequent behavioral synthesis, (3) DSE to find thebest candidate implementations, and (4) automatic generation of the MPSoCby the presented synthesis back-end. With the help of Figure 5.1, the requiredsteps to integrate the synthesis back-end into the SystemCoDesigner designflow are briefly summarized.Note that the parts of the design flow contained in the dashed box have

already been presented in Chapter 3. As introduced, the design flow startsby modeling an application as an executable specification using the SysteMoClibrary. Note that the SysteMoC application can be functionally simulated fortesting purposes.Next, the SysteMoC infrastructure is used to extract the network graph of

the application. An exact definition of a network graph is given in Definition 3.3on page 49. Note that for historical reasons, the name problem graph [BTT98]is commonly used if this graph is considered from the DSE perspective.83 Then,the synthesis back-end is used to transform each SysteMoC actor into a SystemCmodule. A brief outline of this transformation is given in in Section 5.4. Theresulting SystemC module is used as input for the behavioral synthesis tool.

83From a broader DSE perspective, a problem graph may not necessarily exhibit data flowsemantics and might only encode some kind of data communication. In contrast to this,the name network graph is used if data flow aspects are in focus.

205

Page 214: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5. Synthesis

𝑟4

𝑟2 𝑟1𝑟5

𝑟3𝑟6

𝑎1 𝑎2

𝑎4𝑎3

𝑎5

Executable specificationwritten in SysteMoC

DesignSpace

Exploration

Write

Select implementation

Specify architecture graphSpecify mappings

Automatic extractionof the problem graph

Communicationlibrary

Componentlibrary

Behavioralsynthesis

SoCgeneration

individualsNon-dominated

Figure 5.1: Extension of the SystemCoDesigner DSE methodology as pre-sented in Chapter 3 (dashed box) by the steps required to integratethe synthesis back-end into the design flow

Moreover, multiple possible RTL implementation can be generated from thesame SystemC module by varying the area and latency constraints given to thebehavioral synthesis tool. Each of these variants is added to the componentlibrary and their characteristics, e.g, execution delay and area consumption, isderived from the synthesis reports of the behavioral synthesis tool.In the next step, an architecture graph [BTT98] is formed by using the modules

of the component library as resources. Moreover, other modules such as CPUs,Intellectual Property (IP) cores, hand-optimized modules, and communicationchannels and buses, etc., can be used to derive the architecture graph. Moreformally, an architecture graph 𝑔a = (𝑅,𝐿) is composed of resources (CPUs,buses, etc.) modeled by vertices 𝑅 and communication links modeled as edges𝐿 between these resources. Furthermore, mapping edges 𝑀 ⊆ 𝐴∪𝐶×𝑅 will beadded for each actor of the network graph to their possible hardware implemen-tations derived via behavioral synthesis. If a CPU is present in the architecture

206

Page 215: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5.3 Case Study

graph, a mapping edge may be added for an actor to denote the possibility ofa realization of the actor as software running on the CPU. Moreover, mappingedges are added for each channel of the network graph to dedicated hardwareFirst In First Out (FIFO) implementations from the communication library.For Xilinx FPGAs, two variants of hardware FIFOs are present in the commu-nication library. A variant where the provided storage is realized using BRAMcells and a variant where distributed Random Access Memory (RAM) is used.Additionally, if memories of CPU subsystems are present, then mapping edgesmay be added that map channels to these memories.The problem graph, the architecture graph, and the mapping edges form

the specification graph that is used as an input to the automatic DSE. Fi-nally, necessary information like delays or area consumption, etc., required bythe estimation during DSE is annotated to the resources and mapping edges.For example, objectives that represent the area consumption, such as number ofLUTs, number of Flip-Flops (FFs), and number of BRAM cells are evaluated bysummarizing the module characteristics from behavioral synthesis and IP coredata sheets. Estimations for the achieved performance numbers, e.g., through-put and latency, can be evaluated using system-level performance models bymeans of VPC [SFH+06*]. For software only implementations—be they single-or multithreaded—an evaluation using synthesis-in-the-loop is also feasible dur-ing DSE.The last step of the SystemCoDesigner ESL design flow is the automatic

generation of an FPGA-based System-on-Chip (SoC) implementation. Individ-uals from the DSE represent implementations, which can be translated to anSoC realization by the proposed synthesis back-end. Interface and communica-tion modules from a communication library are used to set up inter and intrahardware/software communication. Finally, the entire SoC platform is buildand compiled into an FPGA bit stream by means of generating soft-core pro-cessor subsystems, instantiations of IP cores and communication infrastructureusing, e.g., the EDK tool chain. A more detailed description of the automaticgeneration step is given in Section 5.4.

5.3 Case Study

For automatic DSE and automatic behavioral synthesis, only a well definedsubset of SystemC is suitable. Here, SysteMoC [FHT06*] is used to cover boththe needs of DSE and automatic behavioral synthesis. Communication is re-stricted to dedicated channels. Actors contain ports, to which the channels areconnected. Here, port-to-port communication media with FIFO semantics thatconnect exactly one output port with exactly one input port are assumed.

207

Page 216: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5. Synthesis

JPEGSource ZRL

Inverse InverseQuant.

Huff.

Image Frame Inverse

Decoder

Sink Shuffler ZigZag

Parser

IDCT

Figure 5.2: Shown above is a DFG of a Motion-JPEG decoder, transforminga Motion-JPEG data stream into uncompressed images. The DFGconsists of 9 actors connected amongst each other via 12 FIFO chan-nels.

See Figure 5.2 for the simplified SysteMoC model of the Motion-JPEG de-coder, which was introduced previously in Chapter 4, that is also used as runningexample throughout this chapter. The Motion-JPEG decoder case study con-sists of 5,650 lines of code, supporting interleaved and non-interleaved baselineprofile without sub sampling. It transforms a baseline Motion-JPEG streamfrom the JPEG Source actor into a series of Portable Pixmap (PPM) imagesthat are written by the ImageSink actor. The parser extracts and propagatesthe control information like the image size and quantization steps. After en-tropy decoding (Huffman Decoder) and block reconstruction (Inverse ZRL toInverse ZigZag), an Inverse Discrete Cosine Transform (IDCT) is performed.The Frame Shuffler reorders the pixels of the resulting 8× 8 blocks into a rasterscan order which are sent to the ImageSink actor.Each of these actors is modeled in SysteMoC and is divided into three parts:

(1) the communication ports, (2) the actor functionality, and (3) the actor com-munication behavior. Actors may posses some internal state (internal variables),e.g., as demonstrated by the ImageSink SysteMoC actor shown in Figure 5.3.

𝑎sink|ImageSink

#𝑖2 ≥ 1 ∧ mp = 1/𝑓processPixel(𝑖2[0])

#𝑖1 ≥ 1/𝑓processNewFrame(𝑖1[0])

#𝑖2 ≥ 1 ∧ mp = 1/𝑓processPixel(𝑖2[0])

𝑖1 𝑖2𝑞write𝑞start

Figure 5.3: Visualized above is the ImageSink actor from Figure 5.2. The actorhas two input ports 𝑖1 and 𝑖2 as well as an actor FSM that determinesits communication behavior. The FSM consists of two states andthree transitions. Each transition is of the form guard/action.

208

Page 217: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5.4 Synthesis of Actor-Oriented Designs

Actor functionality is a collection of functions that can access data on channelsvia ports (e.g, processPixel() in Figure 5.3). These functions are only executedduring transitions of the actor FSM which implements the communication be-havior of the actor. Transitions in this FSM include a guard and an actionfunction, which is executed if the transition is taken. The guard determinesunder which conditions a transition may be taken, e.g., depending on a certainminimal amount of input tokens or values of tokens on input ports. A transitioncan only be taken, if the guard is satisfied and therewith activate the transition.For example, see the transition from state 𝑞write to state 𝑞start in Figure 5.3.Separated by a logical conjunction, there are two parts in this guard. The firstpart, #𝑖2 ≥ 1, demands at least one available token at input port 𝑖2 which isconsumed (i.e., removed from the channel) after execution of the action; thesecond part, mp = 1, asserts that some internal variable mp equals ’1’. If bothparts are satisfied, the transition is activated. Taking the transition leads tothe execution of the action function 𝑓processPixel(𝑖2[0]) (which possibly changesinternal variables), consuming and processing the first token from the channelconnected to port 𝑖2, and a state transition from state 𝑞write to 𝑞start. A precisedefinition of this notions is given in Chapter 3.

5.4 Synthesis of Actor-Oriented Designs

In this section, the automatic transformation of SysteMoC actors to pure Sys-temC modules for behavioral synthesis, as well as the automatic system gen-eration and the used communication primitives are described. The automatictransformation is performed by a source-to-source compiler contained in thesynthesis back-end. This source-to-source compiler is based on the C LanguageFamily Frontend of the LLVM Compiler Infrastructure (clang) [Lat11]. Thetransformed SystemC description is synthesizable to Verilog Hardware Descrip-tion Language (Verilog) using Forte Cynthesizer, a state-of-the-art behavioralsynthesis tool. For each synthesized Verilog RTL module, the makespan forthe execution of each action are extracted by means of analyzing the synthesisreports provided by Cynthesizer. These timing values are used for optimizingthe latency during DSE with SystemCoDesigner. In the automatic systemgeneration phase, the synthesized Verilog RTL modules are being connected andsynthesized according to the DSE results to obtain an FPGA implementation.

5.4.1 Actor Transformation

SysteMoC actors contain abstract communication ports, a dedicated actor FSM,and action and guard functions. Whereas SystemC modules are C++ classescontaining ports for hardware signals and parallel processes. To transform the

209

Page 218: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5. Synthesis

abstract SysteMoC communication ports to ports for hardware signals, the ac-cesses to these ports are replaced and an FSM process controlling the guard andaction invocations is created. The transformation steps [KSS+09*] are depictedin Figure 5.4.

while(true) {

case Q_START:

case Q_WRITE:

switch(state) {

mp = i1[0]; mp = i1.read(0);

𝑞start 𝑞write

Actor SystemC Module

Ports

SysteMoC Synthesizable

Portaccess

GuardsActions

FiringStateMachine

Figure 5.4: [KSS+09*] Transformation steps from a SysteMoC actor to a syn-thesizable SystemC module

For each SysteMoC communication port, hardware signal ports are createdto be able to connect the hardware implementation of SysteMoC FIFOs as de-scribed in Section 5.4.3. There are processes for each SysteMoC write port thathandle the hardware protocol, and, thus, allow concurrent writes to differenthardware ports. Concurrent reads from different input ports are realized bysplitting the read access into two functions: The first function starts the readaccess for a port and does not wait for a clock edge, while the second functionwaits for the result from the input port. Hence, reads from different input portscan be started concurrently by calling the first function for each input port.To transform the action and guard functions, all port accesses are replaced

by function calls that access the hardware ports as described above. The trans-formed functions are not implemented as parallel processes, but they are calledfrom the controlling FSM process. To generate hardware alternatives whichmay be different in speed and size, the designer can place directives, e.g., con-straints for the latency of a loop body, for the behavioral synthesis tool withinthe transformed SystemC code. These alternatives can be used during DSE togenerate different individuals including trade-offs, e.g., area vs. throughput.

210

Page 219: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5.4 Synthesis of Actor-Oriented Designs

In SysteMoC, there is an abstract syntax to describe the actor FSM of eachactor, which cannot be used in pure SystemC modules. To create a SystemCmodule, a controlling process containing the transformed actor FSM is automat-ically created. This process is the only process schedulable for the behavioralsynthesis tool if concurrent guard evaluation is not enabled. Otherwise, in ad-dition to the controlling process derived from the actor FSM, a guard processwill be generated for each guard. Thus, the controlling process will not evalu-ate the guard functions, but only use the results computed by the concurrentlyexecuting guard processes.The transformed actor FSM code uses a switch statement for the state vari-

able. Within each state, all possible transitions are checked sequentially bychecking the token counts in the input and output FIFOs and calling the guardfunctions. If a transition is taken, the corresponding action function is called,tokens are consumed from the input FIFOs and tokens are produces on the out-put FIFOs. Without optimizations, the possible transitions within each stateare checked sequentially, because guards access input ports and, thus, cannotbe parallelized without additional hardware resources. However, if additionalhardware registers are used to cache tokens from input ports, then parallel guardevaluation [ZHFT12*] is possible.

5.4.2 Generating the Architecture

As depicted in Figure 5.1, the result of the automatic DSE is a set of Pareto-optimal individuals. From these individuals, the designer can select one asimplementation according to strategic considerations. This process is known asdecision making in multi-objective optimization.The designer selects individuals resulting from the automatic DSE for imple-

mentation. These individuals are used as input to the synthesis back-end inorder to generate the FPGA-based designs. In the following case study, thehardware-only system generation flow for Xilinx FPGA targets is described.The architecture generation is twofold: In the first step, the synthesis back-

end automatically inserts the allocated IP cores from the component library. Inthe second step, it automatically inserts the communication resources from thecommunication library. The result of this architecture generation is a hardwaredescription file (.mhs-file in case of the EDK tool chain). In the following, therelevant details of the architecture generation process are discussed.Beside the information stored in the architecture graph, information from

the SysteMoC model must be considered during the architecture generation aswell. A vertex representing a FIFO in the problem graph contains informationabout the depth and the data type of the communication channel used in theSysteMoC model. A vertex representing an Actor contains the names of its

211

Page 220: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5. Synthesis

Operation Behaviorrd_tokens() Returns how many tokens can be read from

the SysteMoC FIFO. (available tokens)wr_tokens() Returns how many tokens can be written into

the SysteMoC FIFO. (free places)read(offset) Reads a token at a given offset relative to the

first available token. The read token is notremoved from the SysteMoC FIFO.

write(offset, value) Writes a token at a given offset relative to thefirst free place. The written token is not madeavailable.

rd_commit(count) Removes count tokens from the SysteMoCFIFO.

wr_commit(count) Makes count tokens available for reading.

Table 5.1: [HFK+07*] SysteMoC FIFO interface

SysteMoC ports; these names are used for mapping SysteMoC communicationports to hardware ports.After instantiating the IP cores, the final step is to insert the communication

resources. These communication resources are taken from the platform-specificcommunication library. In the case of hardware-only systems, the used commu-nication resource are always SysteMoC FIFOs as described in the next section.After generating the architecture, several Xilinx implementation tools like

map, par, bitgen, data2mem, etc., are used to produce the platform-specific bitfile. Finally, the bit file can be loaded on the FPGA platform and the systemcan be run.

5.4.3 Communication Resources

The SysteMoC FIFO communication resources serve three main purposes: Theystore data, transport data, and synchronize the actors via availability of tokensrespectively buffer space. For these tasks, the interface shown in Table 5.1is used. In the following, each communication resource that implements theinterface from Table 5.1 is called a SysteMoC FIFO.The implementation of SysteMoC FIFOs is not limited to be a single hardware

FIFO module. It may, e.g., consist of two hardware modules that communicateover a bus. In this case, one of the modules would implement the read interface,the other one the write interface.To be able to store data in the SysteMoC FIFO, it has to contain a buffer.

Depending on the implementation, this buffer may also be distributed over

212

Page 221: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5.5 Results

different modules. The designer specifies the buffer sizes in the SysteMoC model.Of course, it would be possible to optimize the buffer sizes for a given system.However, this is future work in SystemCoDesigner.As can be seen from Table 5.1, a SysteMoC FIFO is more complex than a

common hardware FIFO. This is due to the fact that common hardware FIFOsdo not support non-consuming read operations for guard functions and thatSysteMoC FIFOs must be able to commit more than one read or written token.The hardware implementation of SysteMoC FIFOs consist of a single module

described in Verilog. This is a generic module that is synthesizable to effi-cient hardware implementations on different FPGA platforms. For example,the buffer containing the FIFO data can either be composed of BRAM cells orprovided by distributed RAM on Xilinx FPGAs. As shown in [HFK+07*], theoverhead compared to native Xilinx COREGEN FIFOs is rather small, e.g., fora SysteMoC FIFO providing storage in 8 BRAMs for 4,096 tokens each of whichhas a data width of 32 bit, the overhead is just 12 FFs and 33 4-input LUTs.

5.5 Results

The presented design flow is used to create a hardware implementation of aMotion-JPEG decoder as depicted in Figure 5.2. The target platform is a Xil-inx Virtex-II FPGA (XC2V6000) where the designer has chosen a target clockfrequency of 50 MHz.First, an executable specification of the Motion-JPEG decoder using SysteMoC.

Then, the actor specifications are automatically transformed to synthesizableSystemC modules. These SystemC modules were synthesized to a gate-levelnetlist using Forte Design Systems Cynthesizer [For13] and Synplify Pro fromSynopsys [Syn13].The specification graph provided at least one hardware implementation alter-

native for each actor. For three actors, two hardware implementation alterna-tives were provided for each of them. Two alternatives for SysteMoC hardwareFIFOs implementations were available. One alternative uses BRAM, the otheruses distributed RAM to provide the storage for the tokens. The used LUTs,FFs, BRAMs, and Hardware Multipliers (HMULs) for each actor were auto-matically extracted from the synthesis reports. Also the latency information foreach action of each actor was automatically extracted from the high-level syn-thesis report. All this information was used during DSE. The DSE uses VPCwith action accurate simulation to get latency and throughput data [KSS+09*].Table 5.2 shows selected properties of individuals found by DSE. The per-

formance values latency and throughput are VPC simulation results for QCIF(176x144 pixel) pictures. The first individual was also synthesized for the given

213

Page 222: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5. Synthesis

Latency Throughput LUTs FFs BRAMs/HMULs10.35ms 98.54 fps 36 059 15 885 7910.32ms 98.60 fps 45 250 16 601 5310.32ms 98.61 fps 41 963 16 401 73

Table 5.2: VPC simulation results

Latency Throughput LUTs FFs BRAMs/HMULs23.45ms 43.5 fps 36 000 15 800 79

Table 5.3: Hardware synthesis results of the first design point from Table 5.2

target platform. Table 5.3 shows the corresponding properties of this hardwareimplementation derived via gate-level synthesis and simulation.The used hardware resources are very close to the estimated values, because

the estimation already uses data from the gate-level synthesis. During VPCsimulation, time passes only while executing actors, not in the controlling actorFSM or guard functions. This is the reason for the discrepancy between the VPCand RTL simulation performance results. To minimize this discrepancy, twothings can be done: (1) Parallel guard evaluation [ZHFT12*] can be performedin hardware. This will minimize the hardware overhead in the controlling FSMprocess. (2) The scheduling overhead spent by the controlling FSM process canbe distributed to the action execution times by means of calibration [KSH+11,KSH+12].

5.6 Related Work

In this section, related work in the area of DSE at the ESL as well as auto-matic synthesis of systems is discussed. Tools providing DSE at the ESL are forinstance Sesame [PEP06], MILAN [MPND02], and CHARMED [KB04]. How-ever, all these tools do not provide a synthesis path to generate an implemen-tation. On the other hand, some tools exist providing synthesis to hardware/-software implementations but do not perform an automatic DSE. For instance,ESPAM/Compaan/Laura [DTZ+08, NSD08] automatically converts a Matlabloop program into a Kahn Process Network (KPN). This process network canbe transformed into a hardware/software system by instantiating processors andIP cores and connecting them with FIFOs. Special software routines take careof the hardware/software communication. The EDK tool chain is used for finalbit stream generation.The main difference between ESPAM/Compaan/Laura and SystemCoDe-

signer lies in the supported Models of Computation (MoCs). The Com-

214

Page 223: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5.6 Related Work

paan/Laura approach [SZT+04] restricts the MoC to KPNs, while System-CoDesigner supports Non-Determinate Data Flow (NDF). Moreover, ES-PAM does not provide a way to use behavioral synthesis.The Center for Embedded Computer Systems [Cen] has introduced its Em-

bedded Systems Environment (ESE). ESE starts with a SystemC descriptionand synthesizes an SoC implementation. In a first step, the mapping of oper-ations to resources is done manually. In a second step, a SystemC transactionlevel model is generated which is used for synthesis. Here, ESE provides abehavioral synthesis approach to generate hardware implementations from Sys-temC modules. However, in contrast to SystemCoDesigner, the automaticDSE is excluded, i.e., the mapping has to be done manually.An approach supporting automatic DSE as well as synthesis is presented in

[KKO+06]. It is called Koski and, as SystemCoDesigner, it is dedicated tothe automatic SoC design. The input specification is given as a KPN mod-eled in UML. The Kahn processes are refined using Statecharts. The targetarchitecture consists of the application software, the platform-dependent andplatform-independent software, and synthesizable communication and process-ing resources. Moreover, special functions for application distribution are in-cluded, i.e., inter-process communication for multi-processor systems. DuringDSE, Koski uses simulation for performance evaluation. Although, many sim-ilarities can be identified between Koski and SystemCoDesigner, there arefundamental differences: Koski in its current version provides a sophisticatedsoftware generation targeting real-time operating systems allowing a great flexi-bility. On the other hand, SystemCoDesigner integrates behavioral synthesistools into the DSE removing any manual step from the design flow.

Summary

In this chapter, the third third key contribution of this work, a synthesis back-end for the SysteMoC language, is presented. The integration of this syn-thesis back-end with the SystemCoDesigner DSE methodology allows theautomatic optimization and generation of MPSoC systems—starting with anabstract actor-oriented model written in SysteMoC. The presented design flowallows the generation of MPSoC implementations, determined by means of DSE,in very short time. Furthermore, the presented synthesis back-end can be usedto realize synthesis-in-the-loop for the SystemCoDesigner DSE methodol-ogy. This enables SystemCoDesigner to evaluate its design decisions basedon real implementations, in contrast to the approximately-timed performancemodeling performed via the VPC [SFH+06*] approach. Additionally, the im-plementations generated via the synthesis back-end can be used to validate theannotated timing values [SGHT10] (cf. the discrepancies between the estimated

215

Page 224: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

5. Synthesis

and the measured latency in Tables 5.2 and 5.3) used for the VPC-based backannotation of design decisions into the SysteMoC application for the purpose ofsimulative evaluation at the ESL.Moreover, by automatic generation, the synthesized system avoids errors that

might be introduced by manual refinement steps. Furthermore, no manual op-timizations are made. Instead, clustering is used to optimize the schedules foractors bound onto the same resource. Finally, the efficiency of the presenteddesign flow was demonstrated using a Motion-JPEG decoder as an industrialgrade case study.

216

Page 225: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

6Outlook and Conclusions

The main contributing factor for the complexity explosion of modern embed-ded Systems is the heterogeneous architecture of Multi-Processor System-on-Chip (MPSoC) systems used to realize these embedded systems. This ris-ing complexity is increasingly threatening the feasibility of traditional embed-ded system design flows. Hence, the adoption of design methodologies at theElectronic System Level (ESL) becomes imperative. To cope with the concur-rency implied by these architectures, data flow modeling is a widely acceptedparadigm [LNW03, LN04] at the ESL. This work proposes a clustering-basedMPSoC design flow for data flow-oriented applications for MPSoCs. The in-troduced approach aims at a seamless design flow from an application modeledat the ESL down to a virtual prototype or a Field Programmable Gate Ar-ray (FPGA) based implementation. It puts special emphasis on efficient codegeneration for data flow graphs of very fine granularity. In the following, thekey contributions of this thesis are summarized. Then, an outlook on futureresearch perspectives is given.

6.1 Summary

This thesis contributes improvements for design flows targeting MPSoC systemsby: (1) providing the SystemC Models of Computation (SysteMoC) modelinglanguage with formal underpinnings in data flow modeling to enable a trade-off between analyzability and expressiveness of the described application; (2)contributing a methodology called clustering that exploits these trade-offs togenerate efficient schedules for applications containing actors of high analyz-ability and low expressiveness; and (3) enabling the synthesis of the modeledapplications via a synthesis back-end that supports multiple targets such assingle core software implementations, virtual prototypes of distributed MPSoCsystems, and dedicated hardware implementations.Every contribution of this thesis is essential for the proposed design flow. In

the following, these steps and their importance will be reiterated. The firstkey contribution of this work—the SysteMoC modeling language introduced in

217

Page 226: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

6. Outlook and Conclusions

Chapter 3—is required to serve as design entry for the presented design flow.The SysteMoC language is used for modeling of executable specifications at theESL. Other languages like Cal Actor Language (CAL) [EJ03] or pure Sys-temC [GLMS02] can also be used to model executable specifications, but lackthe integration into a Design Space Exploration (DSE) methodology. To fa-cilitate this integration, this thesis contributes both a way to extract a dataflow graph from a SysteMoC application as well as a way to enable the DSE toback annotate design decisions into the SysteMoC application for the purposeof simulative performance evaluation.If an executable specification is modeled as a coarse-grained data flow graph,

a naïve translation of one actor to one process, thread, or isolated unit of com-pilation is feasible for translation of this executable specification into an imple-mentation. However, if the data flow graph is very fine grained, the executiontime required for synchronization and handling of data dependencies betweenthese actors at run time will dominate the amount of useful computational workperformed by the actors. That is to say, for very fine-grained data flow graphsthe scheduling overhead will dominate the amount of useful computational work.This problem can be mended by coding the actors of the application at an

appropriate level of granularity, i.e., combining as much functionality into oneactor such that the computation performed by the actor dominates the execu-tion time required for scheduling. However, combining functionality of multipleactors into one actor can also mean that more data has to be produced andconsumed atomically which requires larger channel capacities and may delaycomputation unnecessarily. This degrades the latency if multiple computationresources are available for execution of the application. Hence, selecting a levelof granularity for a functionality by the designer is a manual trade-off betweenschedule efficiency to improve throughput and latency on a single core softwareimplementation against scheduling flexibility required to improve throughputand latency on an MPSoC implementation. Thus, if the mapping of actors toresources itself is part of the synthesis step, e.g., as is the case in the Sys-temCoDesigner DSE methodology, an appropriate level of granularity canno longer be chosen in advance by the designer.

Monolithic clustering approaches [BL93, PBL95] for an automatic transfor-mation of the level of granularity exist for data flow graphs consisting of purelystatic actors. Clustering replaces a connected subgraph of static actors by a com-posite actor implementing a schedule for the replaced actors derived at compiletime by combining multiple static actors into one composite actor. Thus, clus-tering reduces the number of actors which have to be scheduled by the dynamicscheduler. This transformation reduces the scheduling overhead as it has beenshown that compile-time scheduling of static actors produces more efficient im-plementations in terms of latency and throughput than run-time scheduling ofthe same actors by a dynamic scheduler. However, monolithic clustering ap-

218

Page 227: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

6.1 Summary

proaches break down if even one non-static actor is present in the data flowgraph. In contrast to this, for the first time, the second key contribution—theclustering methodology introduced in Chapter 4—can perform this transforma-tion of the level of granularity for general data flow graphs. This goes signifi-cantly beyond previous approaches as it is not confined to monolithic clustering,but produces a Quasi-Static Schedule (QSS) for the generated composite actor.This enhances the power of the contribution in terms of avoiding deadlocksand, thus, increasing the number of actors that can be clustered. Hence, theclustering methodology is neither limited to data flow graphs consisting only ofstatic actors, nor does it require a special treatment of these static actors by thedeveloper of the SysteMoC application. For sufficiently fine-grained data flowgraphs dominated by static actors, it is shown that a speedup by a factor of tenand beyond can be obtained by means of clustering.Multiple challenges have been solved as part of this clustering methodology:

(1) To identify the actors amenable to clustering, the SysteMoC language—incontrast to SystemC—enforces a distinction between communication and com-putation of an actor. This enables the design flow to determine the Modelof Computation (MoC) used by a SysteMoC actor by means of classification[FHZT13*, ZFHT08*]. Actors that are classified as static data flow actors areamenable to the contributed clustering methodology. (2) A heuristic is con-tributed that maximizes the size of the created clusters and, thus, reduces thescheduling overhead for a single core software implementation. Determinationof clusters by means of DSE for MPSoC implementations is also possible, butnot explicitly presented as part of this thesis. (3) Two alternative clusteringalgorithms—an FSM-based and a rule-based representation of QSSs—are giventhat transform the level of granularity of a data flow graph by replacing is-lands of static actors in the data flow graph by composite actors implementingQSSs for the replaced actors. The rule-based representation trades latency andthroughput in order to achieve a more compact code size for the QSS as com-pared to the Finite State Machine (FSM)-based representation. Intuitively, aQSS is a schedule in which a relatively large proportion of scheduling deci-sions have been made at compile time. A QSS executes sequences of staticallyscheduled actor firings at run time. These sequences have been determined atcompile time by only postponing required scheduling decisions to run time. Incontrast to the QSSs computed by the algorithm presented by Tripakis et al.[TBRL09, TBG+13], the algorithms employed by the contributed clustering al-gorithms will even generate retiming sequences in order to maximize the lengthof the sequences of statically scheduled actor firings and, thus, further reduce thescheduling overhead. (4) In order to extend the design space still further fromdata flow graphs with unbounded channel capacities to graphs with boundedchannel capacities, a First In First Out (FIFO) channel capacity adjustment al-gorithm is given. The introduction of limited channel capacities into SysteMoC

219

Page 228: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

6. Outlook and Conclusions

models is necessary when refining a model for a subsequent implementation inhardware or software. Here, synthesis back-ends only support limited channels.Note that the determination of limited channel capacities that do not introducedeadlocks into a model is in general undecidable [Par95] for dynamic data flowgraphs. However, the channel capacities for the FIFO channels connected tothe cluster input and output ports cannot be used unmodified. The contributedadjustment algorithm ensures that clustering will not introduce an artificialdeadlock into the clustered data flow graph if the original graph is sequencedeterminate. (5) Finally, a correctness proof is contributed for the introducedclustering methodology.To conclude the presented design flow, the third key contribution of this

work—the synthesis back-end introduced in Chapter 5—enables the transfor-mation of executable specifications modeled in SysteMoC according to a set ofdesign decisions determined via DSE into an implementation. The integrationof this synthesis back-end with the SystemCoDesigner DSE methodologyallows the automatic optimization and generation of MPSoC systems—startingwith an abstract actor-oriented model written in SysteMoC. Moreover, the pre-sented synthesis back-end can be used to realize synthesis-in-the-loop for theSystemCoDesigner DSE methodology. This enables SystemCoDesignerto evaluate its design decisions based on real implementations, in contrast tothe approximately-timed performance modeling performed via the Virtual Pro-cessing Components (VPC) [SFH+06*] approach. Additionally, the implemen-tations generated via the synthesis back-end can be employed to validate theannotated timing values [SGHT10] used for the VPC-based back annotation ofdesign decisions into the SysteMoC application for the purpose of simulativeevaluation at the ESL.Examples for design decisions are the schedules generated by the clustering

methodology, which are used by the synthesis back-end to reduce the schedul-ing overhead of the generated software, as well as the allocation and bindingdecisions. No manual refinement steps are needed, and so one major source oferrors is eliminated. No manual optimizations are made. Instead, clustering isused to optimize the schedules for actors bound onto the same resource. In thegenerated implementations, the remaining actors which are not replaced by acomposite actor as well as all composite actors are scheduled by the dynamicscheduler of each resource.

6.2 Future Research Directions

The presented work can be extended in several directions. From the DSE per-spective, the decision of which actors to cluster can be delegated from theheuristics to the SystemCoDesigner DSE methodology. In this way, both

220

Page 229: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

6.2 Future Research Directions

architecture and clustering can be optimized together, thus, resulting in betteroptimized implementations. Moreover, while synthesis-in-the-loop can be usedto evaluate these implementations, an analytic evaluation is preferable to speedup exploration. Here, simple heuristics like maximization of the length of the se-quences of statically scheduled actor firings could be one objective for the DSE.A more advanced evaluation for clustering and binding decisions for islands ofstatic actors can be performed by means of max-plus algebra to determine la-tency and throughput for the islands of static actors. In this way, adjustmentof the FIFO channel capacities in order to maximize throughput for all FIFOscontained in the islands of static actors may also be considered.From the clustering methodology perspective, an extension from clustering of

islands of static actors to islands of sequence determinate, i.e., islands conform-ing to the Kahn Process Network (KPN) MoC, is an interesting challenge thatwould tremendously enlarge the design space for clustering. Here, the equiv-alence condition for sequence determinate clusters given in Chapter 4 can bereused to define safe cluster refinement operations for islands conforming to theKPN MoC. Open questions are how QSS should be represented for these kindsof clusters. If a finite state based representation is chosen, then what conditionand assumptions for the KPN cluster are needed to guarantee the existence of afinite state based representation of its QSS. As a starting point for research inthis direction, the work of Buck [Buc93] to schedule Boolean Data Flow (BDF)graphs may give interesting insights. Moreover, to guarantee a seamless de-sign flow, classification of SysteMoC actors has to be extended to also recognizeactors conforming to the KPN MoC.From a synthesis perspective, code generation optimizations enabled by clus-

tering are of interest. Here, techniques like buffer memory sharing of all FIFOchannels inside the cluster or techniques like strength reductions of FIFO chan-nels to local variable could be pursued. Moreover, binding actors not to oneresource, but to a set of homogeneous resources might improve the possiblescheduling flexibility of the dynamic scheduler. Here, techniques like workstealing [BL99] or the techniques presented for the multi-threaded simulationscheduler [HPB11] could be employed to compensate for the reduced schedulingflexibility induced by clustering actors together.

221

Page 230: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 231: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

German Part

Eine Clustering-basierte Methode zurAbbildung von datenflussorientierten

Anwendungen aufEin-Chip-Multiprozessor-Systeme

223

Page 232: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 233: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Zusammenfassung

Durch neuartige Technologien und stetige Miniaturisierung werden zukünftigeeingebettete Systeme eine immer höhere Rechenleistung – insbesondere durchdie Integration mehrerer Rechenkerne – erzielen. Dies ermöglicht die Realisie-rung immer komplexerer Anwendungen. Gleichzeitig steigen die Anforderun-gen an die Effizienz solcher Systeme, insbesondere hinsichtlich ihres Energiever-brauchs. Hierfür ist es notwendig, eine speziell auf die Anwendung abgestimm-te Architektur aus verschiedenen Rechenkernen (CPUs, DSPs, GPGPUs) undHardware-Beschleunigern zu entwerfen bzw. Anwendungen auf gegebene Archi-tekturen effizient abzubilden. Zukünftige Ansätze zum Entwurf solcher Systememüssen deshalb die Aspekte Parallelität und Heterogenität inhärent und durch-gehend von der Idee bis zur Realisierung unterstützen.Die vorliegende Arbeit stellt einen Clustering-basierten Ansatz zur Abbildung

von datenflussorientierten Anwendungen auf Ein-Chip-Multiprozessor-Systemevor. Hierbei werden drei zentrale Aspekte des Entwurfs um neuartige Methodenerweitert und zu einem durchgehenden Ansatz verknüpft: Für die Modellierungsolcher Systeme und insbesondere der Anwendungen wurde die datenflussorien-tierte Systembeschreibungssprache SysteMoC – eine Erweiterung der SpracheSystemC – entwickelt, die es erlaubt das Verhalten durch einen Datenflussgra-phen zu beschreiben. Dieser Graph besteht aus einer Menge über FIFOs kommu-nizierender Aktoren. Neben einer effizienten Simulation werden die Synthese vonHard- und Software-Implementierungen und insbesondere auch die Generierungeines Systemmodells für die automatische Entwurfsraumexploration einer vor-zugebenden heterogenen Zielarchitektur unterstützt. Diese Zielarchitektur kanndie Abbildung der Aktoren auf Prozessoren und Hardware-Implementierungender Aktoren vorsehen. Zur effizienten Umsetzung von datenflussorientierten An-wendungen wird eine Technik namens Clustering vorgestellt. Die Idee von Clus-tering besteht darin, kommunizierende Aktoren einer gegebenen Anwendunggeschickt zu Verbundaktoren zu gruppieren und die Ablaufreihenfolge der ineiner solchen Gruppierung enthaltenen Aktoren durch Definition sog. quasi-statischer Ablaufpläne festzulegen. Für einen quasi-statischen Ablaufplan istcharakteristisch, dass möglichst viele Ablaufplanungsentscheidungen bereits zurÜbersetzungszeit getroffen werden. Hinsichtlich der anschließenden Codegene-

225

Page 234: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

German Part

rierung bzw. Synthese – und damit Implementierung – werden durch solcheAblaufpläne die Overheads in der Ablaufplanung zur Laufzeit reduziert undsomit die Effizienz des Systems signifikant gesteigert. In einem anschließendenSchritt werden gruppierte und auf die Architektur abgebildete Anwendungenmittels einer Synthese-Infrastruktur automatisch in Hard- wie Softwarekompo-nenten übersetzt und zu einem Gesamtsystem integriert. Der vorgeschlageneAnsatz (1) unterstützt durch die durchgehende Modellierung in SysteMoC so-wohl Parallelität wie Heterogenität der Ressourcen, (2) erhöht durch Clusteringdie Effizienz der Systemrealisierungen und (3) steigert durch einen hohen Gradan Automatisierung – von der Idee bis zur Realisierung – die Produktivitätdes Entwicklers und damit die Effizienz des Entwurfs selbst. Die wesentlichenBeiträge zu den drei genannten Gebieten werden im Folgenden nochmals zu-sammengefasst.

Modellierung:Im Bereich der Modellierung wird mittels der Beschreibungssprache SysteMoCein Beitrag zur effizienten Modellierung von eingebetteten Systemen geliefert.Im Allgemeinen ist die Funktionalität von Anwendungen für heutige und zukünf-tige eingebettete Systeme auf mehrere Prozessoren und Hardwarebeschleunigerverteilt. Damit wird eine direkte Unterstützung von Parallelität durch Beschrei-bungssprachen für diese Systeme zur wichtigen Voraussetzung für deren effizi-ente Modellierung. Dies steht im Gegensatz zu den Fähigkeiten von imperativenSprachen wie C und C++, welche herkömmlicherweise zur Modellierung von ein-gebetteten Systemen eingesetzt werden. Die entwickelte BeschreibungsspracheSysteMoC stellt – im Gegensatz zu traditionellen imperativen Sprachen – eineMöglichkeit der Anwendungsbeschreibung basierend auf dem Datenflussparadig-ma zur Verfügung. Dieses Paradigma hat sich als geeignetes Mittel zur Modellie-rung von Parallelität herauskristallisiert und ist somit hervorragend im Kontextheterogener eingebetteter Systeme nutzbar. Das vorgeschlagene Vorgehen hatden weiteren Vorteil, dass aus der Anwendungsbeschreibung in SysteMoC einmathematisches Modell der Anwendung automatisch extrahiert werden kann.Dieses Modell dient als Grundlage für eine Entwurfsraumexploration, welcheden Entwurfsraum der möglichen Realisierungen des eingebetteten Systems au-tomatisch unter Berücksichtigung verschiedener Zielgrößen, z. B. Kosten oderZeitverhalten, untersucht.

Clustering:Der Nachteil einer Modellierung auf Datenflussbasis ist die schwierige Gene-rierung von effizienten Software-Implementierungen. Dieser Nachteil erwächstaus dem Missverhältnis zwischen dem sequentiellen imperativen Ausführungs-modell von Prozessoren und dem datengetriebenen Ausführungsmodell des Da-tenflussparadigmas. Im Bereich der Analyse wird mittels des in dieser Arbeit

226

Page 235: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

vorgestellten Clustering-basierten Ansatzes eine effiziente Implementierung fürstatische Teilgraphen von allgemeinen Datenflussgraphen ermöglicht. StatischeTeilgraphen dürfen nur statische Aktoren enthalten. Diese Aktoren sind da-durch charakterisiert, dass deren Ausführungsverhalten durch einfache von denkommunizierten Datenwerten unabhängigen Regeln spezifiziert werden kann.Es konnte gezeigt werden, dass für Datenflussanwendungen, die vornehmlichaus statischen Aktoren bestehen, eine Reduktion der Ausführungszeit auf Ein-Prozessor-Systemen um den Faktor Zehn oder mehr erreicht werden kann. Be-stehende Ansätze erfordern, dass entweder (1) die Anwendung ausschließlich ausstatischen Aktoren besteht oder (2) der statische Teilgraph die Kommunikati-on zwischen sich und dem Rest der Anwendungen vollständige kontrolliert. Dervorgestellte Ansatz hingegen unterstützt – im Gegensatz zum Stand der Tech-nik – die automatische Identifikation von statischen Aktoren einer in SysteMoCmodellierten Datenflussanwendung und deren Gruppierung zu Verbundaktoren.

Synthese:Im Bereich der Synthese ermöglicht die vorgestellte Synthese-Infrastruktur ei-ne automatische Umsetzung einer SysteMoC-Anwendung in einen virtuellenPrototypen anhand einer von der Entwurfsraumexploration gefundenen mög-lichen Realisierung der Funktionalität der Anwendung. Des Weiteren wird dieSynthese-Infrastruktur benötigt, um Charakteristika, wie z. B. Zeitverhaltenoder Energieverbrauch, für mögliche Implementierungen der Funktionalität aufverschiedenen Ausprägungen der Zielarchitektur zu bestimmen. Um eine hoheGenauigkeit der Bewertung der von der Entwurfsraumexploration betrachtetenRealisierungsmöglichkeiten zu erreichen, kann die Synthese-Infrastruktur sogareingesetzt werden, um für jede betrachtete Realisierungsmöglichkeit eine proto-typische Implementierung zu erstellen und deren Charakteristika zu bewerten.

227

Page 236: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 237: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[Bai05] Mike Baird, editor. IEEE Standard 1666-2005 SystemC Lan-guage Reference Manual. IEEE Standards Association, New Jer-sey, USA, 2005.

[BB91] Albert Benveniste and Gérard Berry. The Synchronous Ap-proach to Reactive and Real-Time Systems. Proc. of the IEEE,79(9):1270–1282, September 1991.

[BB01] Bishnupriya Bhattacharya and Shuvra S. Bhattacharyya. Pa-rameterized Dataflow Modeling for DSP Systems. IEEE Trans-actions on Signal Processing, 49(10):2408–2421, October 2001.

[BBHL95] Shuvra S. Bhattacharyya, Joseph T. Buck, Soonhoi Ha, and Ed-ward A. Lee. Generating Compact Code from Dataflow Specifi-cations of Multirate Signal Processing Algorithms. IEEE Trans-actions on Circuits and Systems I: Fundamental Theory and Ap-plications, 42(3):138–150, March 1995.

[BCE+03] Albert Benveniste, Paul Caspi, Stephen A. Edwards, NicolasHalbwachs, Paul Le Guernic, and Robert de Simone. The Syn-chronous Languages 12 Years Later. Proc. of the IEEE, 91(1):64–83, January 2003.

[BCG+97] Felice Balarin, Massimiliano Chiodo, Paolo Giusto, Harry Hsieh,Attila Jurecska, Luciano Lavagno, Claudio Passerone, Alberto L.Sangiovanni-Vincentelli, Ellen Sentovich, Kei Suzuki, and Bas-sam Tabbara. Hardware-Software Co-Design of Embedded Sys-tems: The POLIS Approach, volume 404 of The Springer Inter-national Series in Engineering and Computer Science. KluwerAcademic Publishers, 1997.

[BEG+13] Martin Barnasconi, Karsten Einwich, Christoph Grimm, TorstenMaehne, and Alain Vachoux. Analog/Mixed-signal (AMS) Lan-guage Reference Manual. Release 2.0, SystemC AMS Working

229

Page 238: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

Group of the Accellera Systems Initiative, 1370 Trancas Street,Napa, CA 94558, U.S.A., 2013. http://www.accellera.org.

[BELP96] Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peper-straete. Cyclo-Static Dataflow. IEEE Transaction on SignalProcessing, 44(2):397–408, February 1996.

[BHLM94] Joseph T. Buck, Soonhoi Ha, Edward A. Lee, and David G.Messerschmitt. Ptolemy: A Framework for Simulating and Pro-totyping Heterogenous Systems. International Journal in Com-puter Simulation, 4(2):155–182, April 1994.

[BL93] Shuvra S. Bhattacharyya and Edward A. Lee. Scheduling Syn-chronous Dataflow Graphs for Efficient Looping. J. VLSI SignalProcess. Syst., 6:271–288, December 1993.

[BL99] Robert D. Blumofe and Charles E. Leiserson. Scheduling Multi-threaded Computations by Work Stealing. Journal of the ACM,46(5):720–748, September 1999.

[BML97] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A.Lee. APGAN and RPMC: Complementary Heuristics for Trans-lating DSP Block Diagrams into Efficient Software Implemen-tations. Journal of Design Automation for Embedded Systems,January 1997.

[BMMKM10] Mohamed Benazouz, Olivier Marchetti, Alix Munier-Kordon,and Thierry Michel. A New Method for Minimizing Buffer Sizesfor Cyclo-Static Dataflow Graphs. In Proc. 8th IEEE Workshopon Embedded Systems for Real-Time Multimedia, ESTIMedia’10,pages 11–20, October 2010.

[BMMKU10] Mohamed Benazouz, Olivier Marchetti, Alix Munier-Kordon,and Pascal Urard. A new Approach for Minimizing Buffer Capac-ities with Throughput Constraint for Embedded System Design.In Proc. 2010 IEEE/ACS International Conference on ComputerSystems and Applications, AICCSA’10, pages 1–8, May 2010.

[BTT98] Tobias Blickle, Jürgen Teich, and Lothar Thiele. System-LevelSynthesis Using Evolutionary Algorithms. Design Automationfor Embedded Systems, 3(1):23–58, 1998.

[Buc93] Joseph T. Buck. Scheduling Dynamic Dataflow Graphs withBounded Memory Using the Token Flow Model. Technical re-port, Dept. of EECS, UC Berkeley, Berkeley, CA 94720, U.S.A.,1993. Technical Report UCB/ERL 93/69, Ph.D dissertation.

230

Page 239: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[BWH+03] Felice Balarin, Yosinori Watanabe, Harry Hsieh, LucianoLavagno, Claudio Passerone, and Alberto L. Sangiovanni-Vincentelli. Metropolis: An Integrated Electronic System DesignEnvironment. Computer, 36(4):45–52, April 2003.

[Cal13] http://www.calypto.com, 2013.

[Cen] http://www.cecs.uci.edu/.

[CH97] Chabong Choi and Soonhoi Ha. Software Synthesis for DynamicData Flow Graph. In Proc. 8th IEEE International Workshopon Rapid System Prototyping, 1997. ’Shortening the Path fromSpecification to Prototype’, pages 72–79, June 1997.

[CHEP71] Frederic Commoner, Anatol W. Holt, Shimon Even, and AmirPnueli. Marked Directed Graphs. Journal of Computer andSystem Sciences, 5(5):511–523, October 1971.

[DDG+13] Abhijit Davare, Douglas Densmore, Liangpeng Guo, RobertoPasserone, Alberto L. Sangiovanni-Vincentelli, Alena Simalat-sar, and Qi Zhu. metroii: A Design Environment forCyber-Physical Systems. ACM Trans. Embed. Comput. Syst.,12(1s):49:1–49:31, March 2013.

[Den74] Jack Dennis. First Version of a Data Flow Procedure Language.In B. Robinet, editor, Programming Symposium, volume 19 ofLecture Notes in Computer Science, pages 362–376. Springer-Verlag, Berlin, Heidelberg, 1974.

[dKSvdW+00] Erwin A. de Kock, Wim J. M. Smits, Pieter van der Wolf, Jean-Yves Brunel, Wido M. Kruijtzer, Paul Lieverse, Kees A. Vissers,and Gerben Essink. YAPI: Application Modeling for Signal Pro-cessing Systems. In Proc. 37th Design Automation Conference,DAC’00, pages 402–405, New York, NY, USA, 2000. ACM.

[DP90] Brian A. Davey and Hilary A. Priestly. Introduction to Latticesand Order. Cambridge University Press, 1990.

[DSB+13] Morteza Damavandpeyma, Sander Stuijk, Twan Basten, MarcGeilen, and Henk Corporaal. Schedule-Extended SynchronousDataflow Graphs. IEEE Transactions on Computer-Aided De-sign of Integrated Circuits and Systems, 32(10):1495–1508, 2013.

[DTZ+08] Steven Derrien, Alexandru Turjan, Claudiu Zissulescu, BartKienhuis, and Ed F. Deprettere. Deriving efficient control in

231

Page 240: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

Process Networks with Compaan/Laura. International Journalof Engineering Science, 3(3):170–180, 2008.

[EJ03] Johan Eker and Jörn W. Janneck. CAL Language Report –language version 1.0. Technical report, University of Californiaat Berkeley, December 2003.

[EJL+03] Johan Eker, Jörn W. Janneck, Edward A. Lee, Jee Liu, Xiao-jun Liu, Jozsef Ludvig, Stephen Neuendorffer, Sonia Sachs, andYhong Xiong. Taming Heterogeneity – The Ptolemy Approach.Proc. of the IEEE, 91(1):127–144, 2003.

[ET06] Stephen A. Edwards and Olivier Tardieu. SHIM: A DeterministicModel for Heterogeneous Embedded Systems. IEEE Trans. VeryLarge Scale Integr. Syst., 14(8):854–867, August 2006.

[For13] http://www.forteds.com, 2013.

[GD09] Christian Genz and Rolf Drechsler. Overcoming Limitations ofthe SystemC Data Introspection. In Proc. Design, Automationand Test in Europe, DATE’09, pages 590–593. IEEE ComputerSociety, 2009.

[Gei09] Marc Geilen. Reduction Techniques for Synchronous DataflowGraphs. In Proc. Design Automation Conference, DAC’09, pages911–916, Los Alamitos, CA, USA, 2009. IEEE Computer Society.

[GHP+09] Andreas Gerstlauer, Christian Haubelt, Andy D. Pimentel,Todor P. Stefanov, Daniel D. Gajski, and Jürgen Teich. Elec-tronic System-Level Synthesis Methodologies. IEEE Transac-tions on Computer-Aided Design of Integrated Circuits and Sys-tems, 28(10):1517–1530, October 2009.

[GJRB11] Ruirui Gu, Jörn W. Janneck, Mickaël Raulet, and Shuvra S.Bhattacharyya. Exploiting statically schedulable regions indataflow programs. Signal Processing Systems, 63(1):129–142,2011.

[GLL99] Alain Girault, Bilung Lee, and E.A. Lee. Hierarchical FiniteState Machines with Multiple Concurrency Models. IEEE Trans-actions on Computer-Aided Design of Integrated Circuits andSystems, 18(6):742–760, June 1999.

[GLMS02] Thorsten Grötker, Stan Liao, Grant Martin, and Stuart Swan.System Design with SystemC. Kluwer Academic Publishers,2002.

232

Page 241: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[Gou80] Mohamed G. Gouda. Liveness of Marked Graphs and Commu-nication and VLSI Systems Represented by Them. Technicalreport, Austin, TX, USA, 1980.

[GTZ12] Michael Glaß, Jürgen Teich, and Liyuan Zhang. A Co-simulationApproach for System-Level Analysis of Embedded Control Sys-tems. In Proc. International Conference on Embedded Com-puter Systems: Architectures, Modeling, and Simulation, IC-SAMOS’12, pages 355–362, Samos, Greece, July 2012.

[Har87] David Harel. Statecharts: A Visual Formalism for Complex Sys-tems. Science of Computer Programming, pages 231–274, August1987.

[HB07] Chia-Jui Hsu and Shuvra S. Bhattacharyya. Cycle-BreakingTechniques for Scheduling Synchronous Dataflow Graphs. Tech-nical Report UMIACS-TR-2007-12, Institute for Advanced Com-puter Studies, University of Maryland at College Park, February2007.

[HKL+07] Soonhoi Ha, Sungchan Kim, Choonseung Lee, Youngmin Yi,Seongnam Kwon, and Young-Pyo Joo. PeaCE: A Hardware-Software Codesign Environment for Multimedia Embedded Sys-tems. ACM Trans. Des. Autom. Electron. Syst., 12(3):24:1–24:25, May 2007.

[HLL+04] Christopher Hylands, Edward Lee, Jie Liu, Xiaojun Liu, StephenNeuendorffer, Yuhong Xiong, Yang Zhao, and Haiyang Zheng.Overview of the Ptolemy Project, Technical Memorandum No.UCB/ERL M03/25. Technical report, Department of ElectricalEngineering and Computer Sciences, University of California,Berkeley, CA, 94720, USA, July 2004.

[Hoa78] C. A. R. Hoare. Communicating Sequential Processes. Commu-nications of the ACM, 21(8):666–677, August 1978.

[Hoa85] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1985.

[HPB11] Chia-Jui Hsu, José L. Pino, and Shuvra S. Bhattacharyya.Multithreaded Simulation for Synchronous Dataflow Graphs.ACM Transactions on Design Automation of Electronic Systems,16(3):25:1–25:23, June 2011.

233

Page 242: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[HST06] Christian Haubelt, Thomas Schlichter, and Jürgen Teich. Im-proving Automatic Design Space Exploration by IntegratingSymbolic Techniques into Multi-Objective Evolutionary Algo-rithms. International Journal of Computational Intelligence Re-search (IJCIR), Special Issue on Multiobjective Optimization andApplications, 2(3):239–254, 2006.

[HV08] Fernando Herrera and Eugenio Villar. A Framework for Hetero-geneous Specification and Design of Electronic Embedded Sys-tems in SystemC. ACM Trans. Des. Autom. Electron. Syst.,12(3):22:1–22:31, May 2008.

[HVG+08] Fernando Herrera, Eugenio Villar, Christoph Grimm, MarkusDamm, and Jan Haase. Heterogeneous Specification with HetSCand SystemC-AMS: Widening the Support of MoCs in SystemC.In Eugenio Villar, editor, Embedded Systems Specification andDesign Languages, volume 10 of Lecture Notes in Electrical En-gineering, pages 107–121. Springer-Verlag, Berlin, Heidelberg,2008.

[Imp13] http://www.imperas.com, 2013.

[JBP06] Ahmed A. Jerraya, Aimen Bouchhima, and Frédéric Pétrot. Pro-gramming Models and HW-SW Interfaces Abstraction for Multi-Processor SoC. In Proc. 43rd annual Design Automation Con-ference, DAC’06, pages 280–285, New York, NY, USA, 2006.ACM.

[Kah74] Gilles Kahn. The Semantics of a Simple Language for ParallelProgramming. In IFIP Congress, pages 471–475, 1974.

[KB04] Vida Kianzad and Shuvra S. Bhattacharyya. CHARMED: AMulti-Objective Co-Synthesis Framework for Multi-Mode Em-bedded Systems. In Proc. 15th IEEE International Conferenceon Application-Specific Systems, Architectures and Processors,ASAP’04, pages 28–40. IEEE Computer Society, 2004.

[KB06] Vida Kianzad and Shuvra S. Bhattacharyya. Efficient Tech-niques for Clustering and Scheduling onto Embedded Multipro-cessors. IEEE Transactions on Parallel and Distributed Systems,17(7):667–680, July 2006.

[KDWV02] Bart Kienhuis, Ed F. Deprettere, Pieter van der Wolf, andKees A. Vissers. A Methodology to Design Programmable Em-bedded Systems - The Y-Chart Approach. In Proc. Embedded

234

Page 243: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

Processor Design Challenges: Systems, Architectures, Modeling,and Simulation, SAMOS’02, pages 18–37, Berlin, Heidelberg,2002. Springer-Verlag.

[KFL+13] Rainer Kiesel, Stefan Freisler, Otto Löhlein, Christian Haubelt,Martin Streubühr, and Jürgen Teich. Multi-Platform Perfor-mance Evaluation of Pedestrian Detection at the Electronic Sys-tem Level. In GMM-Fachbericht – Automotive meets Electronics,AmE’13, pages 98–103, Berlin-Charlottenburg, February 2013.VDE-Verlag.

[KKO+06] Tero Kangas, Petri Kukkala, Heikki Orsila, Erno Salminen,Marko Hännikäinen, Timo D. Hämäläinen, Jouni Riihimäki, andKimmo Kuusilinna. UML-Based Multiprocessor SoC DesignFramework. ACM Transactions on Embedded Computing Sys-tems, 5(2):281–320, May 2006.

[Kos78] Paul R. Kosinski. A Straightforward Denotational Semanticsfor Non-determinate Data Flow Programs. In Proc. 5th ACMSIGACT-SIGPLAN Symposium on Principles of ProgrammingLanguages, POPL’78, pages 214–221, New York, NY, USA, 1978.ACM.

[KP78] Gilles Kahn and Gordon D. Plotkin. Domaines concrets, rapport336. Technical report, IRIA-LABORIA, 1978.

[KSH+11] Rainer Kiesel, Martin Streubühr, Christian Haubelt, Otto Löh-lein, and Jürgen Teich. Calibration and Validation of Soft-ware Performance Models for Pedestrian Detection Systems. InLuigi Carro and Andy D. Pimentel, editors, Proc. 2011 Inter-national Conference on Embedded Computer Systems: Architec-tures, Modeling, and Simulation, SAMOS XI, Samos, Greece,July 18-21, 2011, ICSAMOS’11, pages 182–189. IEEE, 2011.

[KSH+12] Rainer Kiesel, Martin Streubühr, Christian Haubelt, AnestisTerzis, and Jürgen Teich. Virtual Prototyping for Efficient Multi-Core ECU Development of Driver Assistance Systems. In Em-bedded Computer Systems (SAMOS), 2012 International Con-ference on, pages 33–40, July 2012.

[Lat11] Chris Lattner. Keynote Talk: LLVM and Clang: AdvancingCompiler Technology. In Proc. Free and Open Source DevelopersEuropean Meeting, FOSDEM’11, February 2011.

235

Page 244: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[Lee97] Edward A. Lee. A Denotational Semantics for Dataflow withFiring, Technical Memorandum No. UCB/ERL M97/3. Techni-cal report, Department of Electrical Engineering and ComputerSciences, University of California, Berkeley, CA, 94720, USA,1997.

[Lee99] Edward A. Lee. Modeling Concurrent Real-Time Processes Us-ing Discrete Events. Ann. Software Eng., 7:25–45, 1999.

[Lee06] Edward A. Lee. The Problem with Threads. Technical ReportUCB/EECS-2006-1, EECS Department, University of Califor-nia, Berkeley, January 2006. The published version of this paperis in IEEE Computer 39(5):33-42, May 2006.

[LLW+07] Nikolaos D. Liveris, Chuan Lin, Jia Wang, Hai Zhou, andPrithviraj Banerjee. Retiming for Synchronous Data FlowGraphs. In Proc. Asia and South Pacific Design AutomationConference, ASPDAC’07, pages 480–485, Los Alamitos, CA,USA, January 2007. IEEE Computer Society.

[LM87a] Edward A. Lee and David G. Messerschmitt. Synchronous DataFlow. Proc. of the IEEE, 75(9):1235–1245, September 1987.

[LM87b] Edward Ashford Lee and David G. Messerschmitt. StaticScheduling of Synchronous Data Flow Programs for Digital Sig-nal Processing. IEEE Transactions on Computers, 36(1):24–35,January 1987.

[LN04] Edward A. Lee and Stephen Neuendorffer. Formal Methodsand Models for System Design. chapter Actor-Oriented Modelsfor Codesign: Balancing Re-Use and Performance, pages 33–56.Kluwer Academic Publishers, Norwell, MA, USA, 2004.

[LNW03] Edward A. Lee, Stephen Neuendorffer, and Michael J. Wirth-lin. Actor-Oriented Design of Embedded Hardware and Soft-ware Systems. Journal of Circuits, Systems, and Computers,12(3):231–260, 2003.

[LSG+09] Martin Lukasiewycz, Martin Streubühr, Michael Glaß, ChristianHaubelt, and Jürgen Teich. Combined System Synthesis andCommunication Architecture Exploration for MPSoCs. In Proc.Design, Automation and Test in Europe, DATE’09, pages 472–477. IEEE Computer Society, April 2009.

236

Page 245: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[LSV98] Edward A. Lee and Alberto L. Sangiovanni-Vincentelli. AFramework for Comparing Models of Computation. IEEE Trans-actions on Computer-Aided Design of Integrated Circuits andSystems, 17(12):1217–1229, 1998.

[Men13] http://www.mentor.com, 2013.

[MPND02] Sumit Mohanty, Viktor K. Prasanna, Sandeep Neema, andJames R. Davis. Rapid Design Space Exploration of Hetero-geneous Embedded Systems Using Symbolic Search and Multi-Granular Simulation. In Proc. Joint Conference on Languages,Compilers, and Tools for Embedded Systems & Software andCompilers for Embedded Systems, LCTES’02-SCOPES’02, pages18–27, Berlin, Germany, June 2002. ACM.

[MSS+09] Mike Meredith, Benjamin Carrion Schafer, Alan Peisheng Su,Andres Takach, and Jos Verhaegh. SystemC Synthesizable Sub-set. 1.3 Draft, Synthesis Working Group of the Open SystemCInitiative (Accellera Systems Initiative), 1370 Trancas Street,Napa, CA 94558, U.S.A., 2009. http://www.accellera.org.

[Mur89] T. Murata. Petri Nets: Properties, Analysis and Applications.Proc. of the IEEE, 77(4):541–580, April 1989.

[NSD08] Hristo Nikolov, Todor Stefanov, and Ed F. Deprettere. Auto-mated Integration of Dedicated Hardwired IP Cores in Hetero-geneous MPSoCs Designed with ESPAM. EURASIP Journal onEmbedded Systems, 2008.

[Par95] Thomas M. Parks. Bounded Scheduling of Process Networks.Technical report, Dept. of EECS, UC Berkeley, Berkeley, CA94720, U.S.A., 1995. Technical Report UCB/ERL M95/105,Ph.D dissertation.

[PBL95] José L. Pino, Shuvra S. Bhattacharyya, and E.A. Lee. A Hierar-chical Multiprocessor Scheduling System for DSP Applications.In Proc. Asilomar Conference on Signals, Systems, and Com-puters, volume 1, pages 122–126, October 1995.

[PCH99] Chanik Park, Jaewoong Chung, and Soonhoi Ha. Extended Syn-chronous Dataflow for Efficient DSP System Prototyping. InProc. IEEE Int. Workshop on Rapid System Prototyping, pages196–201, July 1999.

237

Page 246: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[PEP06] Andy D. Pimentel, Cagkan Erbas, and Simon Polstra. A System-atic Approach to Exploring Embedded System Architectures atMultiple Abstraction Levels. IEEE Transactions on Computers,55(2):99–112, 2006.

[PS04] Hiren D. Patel and Sandeep K. Shukla. SystemC Kernel Exten-sions for Heterogeneous System Modeling — A Framework forMulti-MoC Modeling and Simulation. Kluwer, 2004.

[PSKB08] William Plishker, Nimish Sane, Mary Kiemb, and Shuvra S.Bhattacharyya. Heterogeneous Design in Functional DIF. InProc. 8th international workshop on Embedded Computer Sys-tems: Architectures, Modeling, and Simulation, SAMOS’08,pages 157–166, Berlin, Heidelberg, 2008. Springer-Verlag.

[Pto14] Claudius Ptolemaeus, editor. System Design, Modeling,and Simulation using Ptolemy II. Ptolemy.org, 2014.http://ptolemy.org/systems.

[Ros65] Barkley Rosser. An Informal Exposition of Proofs of Gödel’sTheorem and Church’s Theorem. In M. Davis, editor, The Un-decidable: Basic Papers on Undecidable Propositions, Unsolv-able Problems and Computable Functions, pages 223–230. RavenPress, Hewlett, NY, 1965.

[SB00] Sundararajan Sriram and Shuvra S. Bhattacharyya. Embed-ded Multiprocessors: Scheduling and Synchronization. MarcelDekker, Inc., New York, NY, USA, 1st edition, 2000.

[SD94] N. Srinivas and Kalyanmoy Deb. Muiltiobjective OptimizationUsing Nondominated Sorting in Genetic Algorithms. Evolution-ary Computation, 2(3):221–248, September 1994.

[SGB06a] Sander Stuijk, Marc Geilen, and Twan Basten. Exploring Trade-Offs in Buffer Requirements and Throughput Constraints forSynchronous Dataflow Graphs. In Proc. 43rd Design AutomationConference, DAC’06, pages 899–904, 2006.

[SGB06b] Sander Stuijk, Marc Geilen, and Twan Basten. SDF3: SDF ForFree. In Proc. 6th International Conference on Application ofConcurrency to System Design, ACSD’06, pages 276–278, June2006.

238

Page 247: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[SGB08] Sander Stuijk, Marc Geilen, and Twan Basten. Throughput-Buffering Trade-Off Exploration for Cyclo-Static and Syn-chronous Dataflow Graphs. IEEE Transactions on Computers,57(10):1331–1345, October 2008.

[SGHT09] Martin Streubühr, Jens Gladigau, Christian Haubelt, and Jür-gen Teich. Efficient Approximately-Timed Performance Model-ing for Architectural Exploration of MPSoCs. In Proc. Forumon Specification & Design Languages, FDL’09, pages 1–6, 2009.

[SGHT10] Martin Streubühr, Jens Gladigau, Christian Haubelt, and Jür-gen Teich. Efficient Approximately-Timed Performance Model-ing for Architectural Exploration of MPSoCs. In Dominique Bor-rione, editor, Advances in Design Methods from Modeling Lan-guages for Embedded Systems and SoC’s, volume 63 of LectureNotes in Electrical Engineering, pages 59–72. Springer-Verlag,Berlin, Heidelberg, 2010.

[SGTB11] Sander Stuijk, Marc Geilen, Bart D. Theelen, and Twan Basten.Scenario-Aware Dataflow: Modeling, Analysis and Implementa-tion of Dynamic Applications. In Proc. International Conferenceon Embedded Computer Systems: Architectures, Modeling, andSimulation, ICSAMOS’11, pages 404–411. IEEE Computer So-ciety, July 2011.

[SLWSV99] Marco Sgroi, Luciano Lavagno, Yosinori Watanabe, and Al-berto L. Sangiovanni-Vincentelli. Quasi-static scheduling of em-bedded software using equal conflict nets. In Proc. 20th Inter-national Conference on Application and Theory of Petri Nets,pages 208–227, Berlin, Heidelberg, 1999. Springer-Verlag.

[SW97] Mechthild Stoer and Frank Wagner. A Simple Min-cut Algo-rithm. Journal of the ACM, 44(4):585–591, July 1997.

[Syn13] http://www.synopsys.com, 2013.

[SZT+04] Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, BartKienhuis, and Ed F. Deprettere. System Design Using KahnProcess Networks: The Compaan/Laura Approach. In Proc. De-sign, Automation and Test in Europe, DATE’04, pages 340–345.IEEE Computer Society, February 2004.

[TBG+13] Stavros Tripakis, Dai N. Bui, Marc Geilen, Bert Rodiers, andEdward A. Lee. Compositionality in Synchronous Data Flow:

239

Page 248: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

Modular Code Generation from Hierarchical SDF Graphs. ACMTrans. Embedded Comput. Syst., 12(3):83:1–83:26, April 2013.

[TBRL09] Stavros Tripakis, Dai Bui, Bert Rodiers, and Edward A. Lee.Compositionality in Synchronous Data Flow: Modular CodeGeneration from Hierarchical SDF Graphs. Technical ReportUCB/EECS-2009-143, EECS Department, University of Cali-fornia, Berkeley, October 2009.

[Tei12] Jürgen Teich. Hardware/Software Codesign: The Past, thePresent, and Predicting the Future. Proc. IEEE, 100(SpecialCentennial Issue):1411–1430, May 2012.

[TGB+06] Bart D. Theelen, Marc Geilen, Twan Basten, Jeroen Voeten,Stefan Valentin Gheorghita, and Sander Stuijk. A Scenario-Aware Data Flow Model for Combined Long-Run Average andWorst-Case Performance Analysis. In Proc. International Con-ference on Formal Methods and Models for Co-Design, MEM-OCODE’06, pages 185–194. IEEE Computer Society, July 2006.

[TSZ+99] Lothar Thiele, Karsten Strehl, Dirk Ziegenbein, Rolf Ernst, andJürgen Teich. FunState–An Internal Design Representation forCodesign. In Jacob K. White and Ellen Sentovich, editors, IC-CAD, pages 558–565. IEEE, 1999.

[TTN+98] Lothar Thiele, Jürgen Teich, Martin Naedele, Karsten Strehl,and Dirk Ziegenbein. SCF - State Machine Controlled Flow Di-agrams. Technical report, Computer Engineering and NetworksLab (TIK), Swiss Federal Institute of Technology (ETH) Zurich,Gloriastrasse 35, CH-8092, 1998. Technical Report TIK-33.

[VBCG04] Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, andSeth C. Goldstein. C to Asynchronous Dataflow Circuits: AnEnd-to-End Toolflow. In Proc. IEEE 13th International Work-shop on Logic Synthesis, IWLS’04, pages 1–8, Temecula, CA,June 2004.

[WBJS06] Maarten Wiggers, Marco Bekooij, Pierre G. Jansen, and Ger-ard J. M. Smit. Efficient Computation of Buffer Capacitiesfor Multi-Rate Real-Time Systems with Back-Pressure. InProc. 4th International Conference on Hardware/Software Code-sign and System Synthesis, Seoul, Korea, October 22-25, 2006,CODES+ISSS’06, pages 10–15. ACM, 2006.

240

Page 249: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

[WBS07] Maarten Wiggers, Marco Bekooij, and Gerard J. M. Smit.Efficient Computation of Buffer Capacities for Cyclo-StaticDataflow Graphs. In Proc. 44th Design Automation Conference,San Diego, CA, USA, June 4-8, 2007, DAC’07, pages 658–663.IEEE, 2007.

[Win93] Glynn Winskel. The Formal Semantics of Programming Lan-guages: An Introduction. MIT Press, Cambridge, MA, USA,1993.

[XIL05] XILINX. Embedded SystemTools Reference Manual - EmbeddedDevelopment Kit EDK 8.1ia, October 2005.

[XIL14] XILINX. Vivado Design Suite Reference Manual - Vivado2014.1, May 2014.

[XRW+12] Yang Xu, Rafael Rosales, Bo Wang, Martin Streubühr, RalphHasholzner, Christian Haubelt, and Jürgen Teich. A VeryFast and Quasi-accurate Power-State-Based System-Level PowerModeling Methodology. In Andreas Herkersdorf, Kay Römer,and Uwe Brinkschulte, editors, Proc. Architecture of ComputingSystems, volume 7179 of ARCS’12, pages 37–49, Berlin, Heidel-berg, 2012. Springer-Verlag.

[Yat93] Robert Kim Yates. Networks of Real-Time Processes. In EikeBest, editor, CONCUR’93, volume 715 of Lecture Notes in Com-puter Science, pages 384–397. Springer-Verlag, Berlin, Heidel-berg, 1993.

[ZGBT13] Liyuan Zhang, Michael Glaß, Nils Ballmann, and Jürgen Teich.Bridging Algorithm and ESL Design: Matlab/Simulink ModelTransformation and Validation. In Proc. Forum on Specification& Design Languages, FDL’13, pages 1–8. IEEE, 2013.

[ZLT02] Eckart Zitzler, Marco Laumanns, and Lothar Thiele. SPEA2:Improving the Strength Pareto Evolutionary Algorithm for Mul-tiobjective Optimization. In Kyriakos C. Giannakoglou et al., ed-itors, Evolutionary Methods for Design, Optimisation and Con-trol with Application to Industrial Problems, EUROGEN’01,pages 95–100. International Center for Numerical Methods inEngineering (CIMNE), September 2002.

[ZSJ10] Jun Zhu, Ingo Sander, and Axel Jantsch. HetMoC: Heteroge-neous Modelling in SystemC. In Adam Morawiec and Jinnie

241

Page 250: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Bibliography

Hinderscheit, editors, Proc. Forum on Specification & DesignLanguages, FDL’10, pages 117–122. ECSI, Electronic Chips &Systems design Initiative, September 2010.

242

Page 251: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Author’s Own Publications

[FHT06*] Joachim Falk, Christian Haubelt, and Jürgen Teich. Efficient Rep-resentation and Simulation of Model-Based Designs in SystemC. InProc. Forum on Specification & Design Languages, FDL’06, pages129–134, September 2006.

[FHT07*] Joachim Falk, Christian Haubelt, and Jürgen Teich. Task GraphClustering with Internal State. Technical Report 01-2007, Uni-versity of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Ger-many, December 2007.

[FHZT13*] Joachim Falk, Christian Haubelt, Christian Zebelein, and Jür-gen Teich. Integrated Modeling Using Finite State Machines andDataflow Graphs. In Shuvra S. Bhattacharyya, Ed F. Deprettere,R. Leupers, and J. Takala, editors, Handbook of Signal Process-ing Systems, pages 975–1013. Springer-Verlag, Berlin, Heidelberg,2013.

[FKH+08*] Joachim Falk, Joachim Keinert, Christian Haubelt, Jürgen Teich,and Shuvra S. Bhattacharyya. A Generalized Static Data FlowClustering Algorithm for MPSoC Scheduling of Multimedia Appli-cations. In Proc. 8th ACM international conference on Embeddedsoftware, EMSOFT’08, pages 189–198, New York, NY, USA, Oc-tober 2008. ACM.

[FZH+10*] Joachim Falk, Christian Zebelein, Christian Haubelt, Jürgen Te-ich, and Rainer Dorsch. Integrating Hardware/Firmware Ver-ification Efforts Using SystemC High-Level Models. In Man-fred Dietrich, editor, Proc. Methoden und Beschreibungssprachenzur Modellierung und Verifikation von Schaltungen und Systemen,MBMV’10, pages 137–146. Fraunhofer Verlag, 2010.

243

Page 252: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Author’s Own Publications

[FZHT11*] Joachim Falk, Christian Zebelein, Christian Haubelt, and JürgenTeich. A Rule-Based Static Dataflow Clustering Algorithm for Ef-ficient Embedded Software Synthesis. In Proc. Design, Automationand Test in Europe, DATE’11, pages 521–526. IEEE, March 2011.

[FZHT13*] Joachim Falk, Christian Zebelein, Christian Haubelt, and JürgenTeich. A Rule-Based Quasi-Static Scheduling Approach for StaticIslands in Dynamic Dataflow Graphs. ACM Trans. Embedded Com-put. Syst., 12(3):74:1–74:31, April 2013.

[FZK+11*] Joachim Falk, Christian Zebelein, Joachim Keinert, ChristianHaubelt, Jürgen Teich, and Shuvra S. Bhattacharyya. Analysisof SystemC Actor Networks for Efficient Synthesis. ACM Trans.Embedded Comput. Syst., 10(2):18:1–18:34, January 2011.

[GFH+14*] Marc Geilen, Joachim Falk, Christian Haubelt, Twan Basten,Bart D. Theelen, and Sander Stuijk. Performance Analysis ofWeakly-Consistent Scenario-Aware Dataflow Graphs. In Proc.Asilomar Conference on Signals, Systems, and Computers, Novem-ber 2014.

[HFK+07*] Christian Haubelt, Joachim Falk, Joachim Keinert, ThomasSchlichter, Martin Streubühr, Andreas Deyhle, Andreas Hadert,and Jürgen Teich. A SystemC-Based Design Methodology for Dig-ital Signal Processing Systems. EURASIP Journal on EmbeddedSystems, 2007(1):1–22, January 2007.

[KFHT07*] Joachim Keinert, Joachim Falk, Christian Haubelt, and JürgenTeich. Actor-Oriented Modeling and Simulation of Sliding WindowImage Processing Algorithms. In Samarjit Chakraborty and PetruEles, editors, Proc. 2007 5th Workshop on Embedded Systems forReal-Time Multimedia, Oct. 4-5, Salzburg, Austria, conjunctionwith CODES+ISSS’07, ESTImedia’07, pages 113–118. IEEE, 2007.

[KSS+09*] Joachim Keinert, Martin Streubühr, Thomas Schlichter, JoachimFalk, Jens Gladigau, Christian Haubelt, Jürgen Teich, and MichaelMeredith. SystemCoDesigner — An Automatic ESL SynthesisApproach by Design Space Exploration and Behavioral Synthesisfor Streaming Applications. Transactions on Design Automationof Electronic Systems, 14(1):1:1–1:23, 2009.

[SFH+06*] Martin Streubühr, Joachim Falk, Christian Haubelt, Jürgen Teich,Rainer Dorsch, and Thomas Schlipf. Task-Accurate PerformanceModeling in SystemC for Real-Time Multi-Processor Architectures.

244

Page 253: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Author’s Own Publications

In Georges G. E. Gielen, editor, Proc. Design, Automation andTest in Europe, DATE’06, pages 480–481. European Design andAutomation Association, Leuven, Belgium, 2006.

[ZFH+10*] Christian Zebelein, Joachim Falk, Christian Haubelt, Jürgen Te-ich, and Rainer Dorsch. Efficient High-Level Modeling in the Net-working Domain. In Proc. Design, Automation and Test in Europe,DATE’10, pages 1189–1194, 3001 Leuven, Belgium, Belgium, 2010.European Design and Automation Association.

[ZFHT08*] Christian Zebelein, Joachim Falk, Christian Haubelt, and JürgenTeich. Classification of General Data Flow Actors into KnownModels of Computation. In Proc. 6th ACM/IEEE InternationalConference on Formal Methods and Models for Codesign, MEM-OCODE’08, pages 119–128, June 2008.

[ZFHT12*] Christian Zebelein, Joachim Falk, Christian Haubelt, and JürgenTeich. A Model-Based Inter-Process Resource Sharing Approachfor High-Level Synthesis of Dataflow Graphs. In Proc. ElectronicSystem Level Synthesis Conference, ESLSyn’12, pages 17–22, June2012.

[ZFS+14*] Liyuan Zhang, Joachim Falk, Tobias Schwarzer, Michael Glaß, andJürgen Teich. Communication-driven Automatic Virtual Prototyp-ing for Networked Embedded Systems. In Proc. Euromicro Con-ference on Digital Systems Design, DSD’14, pages 435–442, LosAlamitos, CA, USA, August 2014. IEEE Computer Society.

[ZHF+13*] Christian Zebelein, Christian Haubelt, Joachim Falk, TobiasSchwarzer, and Jürgen Teich. Representing Mapping and Schedul-ing Decisions within Dataflow Graphs. In Proc. Forum on Speci-fication & Design Languages, FDL’13, pages 1–8, Piscataway, NJ,USA, September 2013. IEEE.

[ZHFT12*] Christian Zebelein, Christian Haubelt, Joachim Falk, and JürgenTeich. Exploiting Model-Knowledge in High-Level Synthesis. InJens Brandt and Klaus Schneider, editors, Methoden und Beschrei-bungssprachen zur Modellierung und Verifikation von Schaltun-gen und Systemen, Kaiserslautern, Germany, Mar. 5-7, 2012,MBMV’12, pages 181–191. Verlag Dr. Kovac, 2012.

[ZHFT13*] Christian Zebelein, Christian Haubelt, Joachim Falk, and Jür-gen Teich. Model-Based Representation of Schedules for DataflowGraphs. In Christian Haubelt and Dirk Timmermann, editors,

245

Page 254: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Author’s Own Publications

Methoden und Beschreibungssprachen zur Modellierung und Veri-fikation von Schaltungen und Systemen, Warnemünde, Germany,Mar. 12-14, 2013, MBMV’13, pages 105–115. Institut für Ange-wandte Mikroelektronik und Datentechnik, Fakultät für Informatikund Elektrotechnik, Universität Rostock, 2013.

246

Page 255: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

List of Symbols

Γ The cluster operator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .62𝜂 Firing count of an actor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13𝜂rep Repetition count of an actor . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .14𝜂 Vector of actor firings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131𝜂rep Repetition vector of a static DFG . . . . . . . . . . . . . . . . . . . . . . . . . . .14𝜂rep𝛾 Repetition vector of a subcluster 𝑔𝛾 . . . . . . . . . . . . . . . . . . . . . . 101

𝜆 The empty signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .19𝜇 Phase of a CSDF actor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .72𝜇 Classification candidate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72𝜈 A data value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8𝒱 The universal set of data values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8𝜉 A cluster refinement operator. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .99𝜉SDF A cluster refinement operator producing SDF actors . . . . . . . 101𝜉CSDF A cluster refinement operator producing CSDF actors . . . . . 110

𝜉FSM

A cluster refinement operator producing an FSM-basedQSS representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

𝜉rules

A cluster refinement operator producing a rule-based QSSrepresentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

𝜋𝑁(𝑥)Projection of a vector 𝑥 to a subset of its entries given bya subset 𝑁 ⊆ ℐ(𝑥) of its index set . . . . . . . . . . . . . . . . . . . . . . . . . 131

𝜌 Sequence of rule applications, e.g., 𝜌 = ⟨𝑟1, 𝑟2, . . . 𝑟𝑛⟩ . . . . . . 147𝜏 Number of phases of a CSDF actor . . . . . . . . . . . . . . . . . . . . .. . . . .15𝜔 Annotation tuple used in actor classification . . . . . . . . . . . .. . . . .75∅ The empty set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .62N The set of natural numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .12N0 The set of non-negative integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8N0,∞ The set of non-negative integers including infinity . . . . . . . . . . .17Z The set of integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .77|𝑋| The set cardinality operator . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .44𝑋* Finite Kleene closure of a value set X. . . . . . . . . . . . . . . . . . .. . . . . . 8𝑋** Infinite Kleene closure of a value set X . . . . . . . . . . . . . . . . . . . . . .17

247

Page 256: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

List of Symbols

⟨𝑥1, 𝑥2, 𝑥3⟩ Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .10⊑ Prefix order for signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20# The sequence length operator. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .17#𝑥𝑥 Number of occurrences of a value 𝑥 in a sequence 𝑥 . . . . . . . 163a The sequence concatenation operator . . . . . . . . . . . . . . . . . . .. . . . .18◇ The conflict resolution operator . . . . . . . . . . . . . . . . . . . . . . . . . . . 152𝑓 ∘ 𝑔 Function composition operator, i.e., 𝑓 ∘ 𝑔(𝑥) ≡ 𝑓(𝑔(𝑥)) . . . . . .59𝑎 Actor in a DFG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8𝑎in An input actor of a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62𝑎out An output actor of a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .62𝑎𝛾 Composite actor derived from cluster 𝑔𝛾 via refinement .. . . . .88𝑎 Sequence of actor firings, e.g., 𝑎 = ⟨𝑎1, 𝑎5, 𝑎4, 𝑎6, 𝑎3, 𝑎2⟩ . . . . . .10

adji2i

Adjustment of the cluster input channel capacities to han-dle input-to-input back pressure . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

adji2o

Adjustment of the cluster input channel capacities to han-dle input-to-output back pressure. . . . . . . . . . . . . . . . . . . . . . . . . . 177

adjo2o

Adjustment of the cluster output channel capacities tohandle output-to-output back pressure. . . . . . . . . . . . . . . . . . . . . 183

𝐴 Set of actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

𝐴EC

An equivalence class, i.e., a set of actors of a cluster can-didate that share the same set of predecessor input actorsand successor output actors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

𝐴𝐼 The set of input actors of a cluster . . . . . . . . . . . . . . . . . . . . . . . . . .62𝐴𝑂 The set of output actors of a cluster . . . . . . . . . . . . . . . . . . . .. . . . .62𝐴𝑆 The set of static actors of a graph . . . . . . . . . . . . . . . . . . . . . . . . . 124𝑐 Channel in a DFG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

clash

A symmetric relation between equivalence classes thatevaluates to true if the actors of these equivalence classesmust not be contained in the same cluster . . . . . . . . . . . . . . . . . 125

conflict Conflict relation for two rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149cons(𝑐) Consumption rate on channel 𝑐 . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 9cons(𝑖) Consumption rate on input port 𝑖 . . . . . . . . . . . . . . . . . . . . . . . . . . .22cons(𝑡, 𝑖) Consumption rate of transition 𝑡 on input port/channel 𝑖. . . . .48

cons(𝑎1, 𝑎2)Consumption rate of actor 𝑎2 on the channel between theactors 𝑎1 and 𝑎2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

𝐶 Set of channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8delay The delay function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8dep The input/output dependency function . . . . . . . . . . . . . . . . . . . . 137𝑒 An edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25

𝐸A set of edges (usually partitioned into more specific sub-sets) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .25

248

Page 257: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

𝑓 A function operating on finite signals . . . . . . . . . . . . . . . . . . .. . . . .24𝑓action An action function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44𝑓guard A guard function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .44f The Boolean value false . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23

firingsMapping function from a cluster state to the number offirings of all actors in the cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . 131

ℱ The set of all functions operating on finite signals . . . . . . . . . . .25ℱaction A set of SysteMoC actions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .44ℱguard A set of SysteMoC guards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .44𝑔 A data flow graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 8𝑔a An architecture graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206𝑔 A cluster or network graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61

𝑔 A refined network graph derived from replacing a clusterby a composite actor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99

𝑔𝛾 A subcluster of the cluster or network graph 𝑔 . . . . . . . . .. . . . .61𝑔n A network graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .49𝐺 A set of data flow graphs or clusters . . . . . . . . . . . . . . . . . . . .. . . . .61𝐺n A set of data flow graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .62

𝐻dep(𝑎out)The set of input/output dependency function values foran output actor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

𝐿 A set of communication links between resources . . . . . . . . . . . 206𝑀 A set of mapping edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

interleaveFunction deriving a set of cluster states by interleavingactor firings from a given set of states . . . . . . . . . . . . . . . . . . . . . 140

io

Mapping function from a cluster state to the number ofconsumed (negative) and produced (positive) tokens onthe cluster ports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

𝐼 A set of input ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43ℐ Function returning the index set of a vector or sequence . . . . .14𝑘 A Boolean function used as a transition guard . . . . . . . . . .. . . . .25

𝑘io

A Boolean function used as part of a transition guard tocheck the availability of tokens and free places . . . . . . . . . .. . . . .87

𝒦 A Kahn function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18𝑙 Lower bound for a rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146lfp(𝑥 = 𝑓(𝑥)) The minimal value 𝑥 solving the equation 𝑥 = 𝑓(𝑥) . . . . . . . . 140max𝑋 Maximal value 𝑥 ∈ 𝑋 of a set 𝑋 . . . . . . . . . . . . . . . . . . . . . . . . . . . 139max𝑋 Least upper bound of a set 𝑋 of partially ordered vectors. . . 171min𝑋 Minimal value 𝑥 ∈ 𝑋 of a set 𝑋 . . . . . . . . . . . . . . . . . . . . . . . . . . . 139mod Modulo operation for integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .77𝑞 mod 𝜂 Modulo operation for cluster states . . . . . . . . . . . . . . . . . . . . . . . . 132𝑟 mod 𝜂 Modulo operation for rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

249

Page 258: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

List of Symbols

𝑂 A set of output ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .43prod(𝑐) Production rate on channel 𝑐 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9prod(𝑜) Production rate on output port 𝑜 . . . . . . . . . . . . . . . . . . . . . . .. . . . .22prod(𝑡, 𝑜) Production rate of transition 𝑡 on output port/channel 𝑜. . . . .48

prod(𝑎1, 𝑎2)Production rate of actor 𝑎1 on the channel between theactors 𝑎1 and 𝑎2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

𝒫A set of ports (usually partitioned into a set of input andoutput ports) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44

𝑞 A state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25𝑞0 An initial state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25𝑞src Source state of a transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25𝑞dst Destination state of a transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . .25𝑞 A state of a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130𝑞0 An initial state of a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130𝑞src Source state of a cluster FSM transition . . . . . . . . . . . . . . . . . . . 132𝑞dst Destination state of a cluster FSM transition . . . . . . . . . . . . . . 132𝑄 A set of states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25𝑄 A set of cluster states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133𝑄𝛾 State space of the cluster 𝑔𝛾 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133𝑄dyn

𝛾 State space of the cluster 𝑔𝛾 if dynamically scheduled. . . . . . 177𝑄ini The set of initial input/output dependency states. . . . . . . . . . 140𝑄lfp The set of interleaved states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140𝑟 Resource in an architecture graph . . . . . . . . . . . . . . . . . . . . . . . . . . .37𝑅 A set of resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206𝑟 A rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146redundant Function selecting all redundant rules of a set of rules . . . . . 153

resolveFunction performing one iteration of conflict resolutionfor a set of rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

ℛ A finite state machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25𝑅 A set of rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146𝑅fin The set of final rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146𝑅ini The set of initial input/output dependency rules . . . . . . . . . . . 147𝑅lfp The set of conflict-resolved rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 147𝑅red The set of redundant rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153𝑠 A signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .17𝑠 Tuple of signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18size The channel capacity function . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .49snk(𝑐) Function returning the sink actor of a channel 𝑐 . . . . . . . . . . . . . . 8src(𝑐) Function returning the source actor of a channel 𝑐 . . . . . . . . . . . . 8

stateFunction deriving the cluster state from the signals con-sumed and produced by the cluster . . . . . . . . . . . . . . . . . . . . . . . . 133

250

Page 259: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

statedep

Function deriving a cluster state from a given number ofinput actor firings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

𝑆 The set of all signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17S The set of all vectors of signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18𝑡 A transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25𝑡𝐸𝑋𝐸𝐶 Overall execution time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88𝑡𝑆𝐴 Overall scheduling time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .88𝑡𝑇𝐴 Overall computation time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88t The Boolean value true. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .23tail The tail function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18𝑇 A set of transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25𝒯 𝒪𝑥 The set of tightly ordered pairs for a set of states 𝑄𝑥 . . . . . . 143𝑢 Upper bound for a rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146𝑣 A vertex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .25

𝑉A set of vertices (usually partitioned into more specificsubsets) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .25

251

Page 260: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 261: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Acronyms

APG Acyclic Precedence GraphAPI Application Programming InterfaceASIC Application-Specific Integrated CircuitAST Abstract Syntax TreeBB Basic BlockBDF Boolean Data FlowBRAM Block Random Access MemoryCAL Cal Actor LanguageCFG Control Flow GraphCFSM Codesign Finite State MachineCSP Communicating Sequential Processesclang C Language Family Frontend of the LLVM Compiler

InfrastructureCPU Central Processing UnitCSDF Cyclo-Static Data FlowCT Continuous TimeDDF Dennis Data FlowDE Discrete EventDFG Data Flow GraphDSE Design Space ExplorationDSSF Deterministic SDF with shared FIFOsEDA Electronic Design AutomationEDK Xilinx Embedded Development KitESL Electronic System LevelFF Flip-FlopFIFO First In First OutFPGA Field Programmable Gate ArrayFSM Finite State Machine

253

Page 262: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Acronyms

FunState Functions Driven by State MachinesGALS Globally Asynchronous Locally SynchronousHDF Heterochronous Data FlowHMUL Hardware MultiplierHSDF Homogeneous (Synchronous) Data FlowIP Intellectual PropertyKPN Kahn Process NetworkLUT Look-Up TableMoC Model of ComputationMOEA Multi-Objective Evolutionary AlgorithmMPSoC Multi-Processor System-on-ChipNDF Non-Determinate Data FlowOSCI Open SystemC InitiativePGAN Pairwise Grouping of Adjacent NodesPSOS Periodic Static-Order ScheduleRAM Random Access MemorySAS Single Appearance ScheduleSAT Boolean SatisfiabilitySCC Strongly Connected ComponentSPDF Synchronous Piggybacked Data FlowSR Synchronous ReactiveQSS Quasi-Static ScheduleRTL Register Transfer LevelRTTI Run Time Type InformationSDF Synchronous Data FlowSoC System-on-ChipSysteMoC SystemC Models of ComputationVerilog Verilog Hardware Description LanguageVHDL Very High Speed Integrated Circuit Hardware Description

LanguageVPC Virtual Processing Components

254

Page 263: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Index

actor, 7actor FSM, 46acyclic precedence graph, 103and states, 57architecture graph, 37, 206artificial deadlock, 99

boolean data flow, 21bounded memory execution, 10

channel, 7channel capacity, 7, 48clash, 128classification candidate, 72closed system, 61cluster, 61cluster FSM, 89, 90cluster candidate, 124cluster environment, 62cluster input channel, 173cluster output channel, 173cluster partitioning heuristic, 123cluster state, 130cluster state space, 130cluster states, 90clustering condition, 121coalescing of actor firings, 104coalescing operation, 106communication behavior, 10, 44, 52composite actor, 23, 90conflict rules, 149conflict-resolved rules, 153

consistent Synchronous Data Flow (SDF)graph, 13

consumption rate, 9consumption rate function, 12, 15,

46, 48, 111cyclo-static data flow, 15

data flow graph, 7deadlock-free SDF graph, 13delay, 7delay-less cycle, 10delay-less directed path, 112dennis data flow, 22destination state of a transition, 46dynamic data flow, 17dynamic scheduler, 85dynamic scheduling, 85

enabled actor, 46enabled transition, 46equivalence class, 127equivalent behavior in data flow, 97executable specification, 40execution time, 25, 42, 87

finite state machine, 27fireable, 7firing, 7functionality guard, 47functionality state, 45FunState, 25

grouping of actor firings, 111

255

Page 264: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Index

heterochronous data flow, 30hierarchical network graph, 62homogeneous data flow, 10

implementation graph, 37incoming transition, 46initial cluster state, 90, 131initial input/output dependency states,

140initial token, 7input actor, 62, 63, 107input guard, 47input ports, 19input predicate, 47input/output dependency function,

137input/output dependency function value,

136intermediate actor, 62, 131interval boundary vectors, 146iteration, 11

Kahn process network, 19

leaf states, 57length of a signal, 17length of an iteration, 11

marked graph, 10

network graph, 48non-hierarchical network graph, 48

open system, 61outgoing transition, 46output actor, 62, 63, 107output guard, 47output ports, 19output predicate, 47overall computation time, 88overall execution time, 42, 89overall overhead, 93overall scheduling time, 88

parallel composition, 59partial repetition vector, 84, 146periodic static-order schedule, 10, 11ports, 19problem graph, 37, 205production rate, 9production rate function, 12, 15, 46,

48, 111

quasi-static schedule, 89

reachable cluster state, 162redundant rules, 153repetition vector sum, 145resource, 37retiming, 108rule, 146rule-based clustering, 145

safe cluster refinement, 100scheduling overhead, 88SDF to marked graph unfolding, 103sequence determinate, 24sequence of actor firings, 11set of initial input/output dependency

rules, 147set of input/output dependency func-

tion values, 136signal, 17source state of a transition, 46specification graph, 37, 207static cluster, 83static data flow, 9static data flow actor, 10subcluster, 61synchronous data flow, 12synchronous reactive domain, 59synthesis back-end, 203synthesis-in-the-loop, 203SystemC, 34SysteMoC, 44

token, 7

256

Page 265: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Index

transition contraction condition, 73

unsafe cluster refinement, 100

worst case cluster environment, 99

xor states, 57

257

Page 266: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität
Page 267: A Clustering-Based MPSoC Design Flow for Data Flow ... · A Clustering-Based MPSoC Design Flow for Data Flow-Oriented Applications Der Technischen Fakultät der Friedrich-Alexander-Universität

Curriculum Vitæ

Joachim Falk was born in 1977 and grew up around Erlangen, Germany. Heearned his university admission by visiting the technical school (Fachoberschule)in Erlangen and started his study of electrical engineering in 1997 at the GeorgSimon Ohm University of Applied Sciences in Nuremberg, Germany. There, heearned his degree in Electrical Engineering with a Specialization in Data Pro-cessing in 2002. In 2003, he joined the chair of Hardware/Software Co-Design atthe Friedrich-Alexander-Universität Erlangen-Nürnberg under the supervisionof Prof. Dr.-Ing. Jürgen Teich. From 2006 onward, he has been working as aresearcher in the project “Actor-Oriented Synthesis and Optimization of DigitalHardware/Software-Systems at the Electronic System Level,” which is funded bythe German Research Foundation (DFG). Joachim Falk has been a reviewer forseveral international journals. His main research focuses are data flow languagesand transformations for data flow graphs in the context of embedded system de-sign.