Parallel Paradigms and Run-time Management Techniques for ...home.deib.polimi.it/fornacia/lib/exe/fetch.php?... · Parallel Paradigms and Run-time Management Techniques for Many-core

1

Parallel Paradigms and Run-time Management Techniques for Many-core Architectures: The 2PARMA Approach

C. Silvano, W. Fornaciari, S. Crespi Reghizzi, G. Agosta, G. Palermo, V. Zaccaria, P. Bellasi F. Castro, S. Corbetta, A. Di Biagio, E. Speziale, M. Tartara

Dipartimento di Elettronica e Informazione - Politecnico di Milano - Italy

D. Siorpaes STMicroelectronics - Italy

H. Hübert, B. Stabernack, J. Brandenburg Fraunhofer HHI - Germany

Martin Palkovic, Praveen Raghavan, Chantal Ykman-Couvreur IMEC vzw, Belgium. Also with IBBT, Belgium

A. Bartzas, S. Xydis, D. Soudris Institute of Communication and Computer Systems, National Tech. University of Athens, Greece

T. Kempf, G. Ascheid, H. Meyr, J. Ansari, P. Mähönen RWTH – Aachen University – Germany

Bart Vanthournout CoWare – Belgium

KEYWORDS: Multi-core Architectures, Programming Models, Virtualisation, Design Space Exploration, Run-time Resource Management, Applications.

Abstract

The 2PARMA project aims at overcoming the lack of parallel programming models and run-time resource management techniques to exploit the features of many-core processor architectures. More in detail, the 2PARMA project focuses on the definition of a parallel programming model combining component-based and single-instruction multiple-thread approaches, instruction set virtualisation based on portable bytecode, run-time resource management policies and mechanisms as well as design space exploration methodologies for Many-core Computing Fabrics.

2

Parallel Paradigms and Run-time Management Techniques for Many-core Architectures: The 2PARMA Approach

Abstract

The 2PARMA project aims at overcoming the lack of parallel programming models and run-time resource management techniques to exploit the features of many-core processor architectures. More in detail, the 2PARMA project focuses on the definition of a parallel programming model combining component-based and single-instruction multiple-thread approaches, instruction set virtualisation based on portable bytecode, run-time resource management policies and mechanisms as well as design space exploration methodologies for Many-core Computing Fabrics.

1. Introduction The current trend in computing architectures is to replace complex superscalar architectures with meshes of small homogeneous processing units connected by an on-chip network. This trend is mostly dictated by inherent silicon technology frontiers, which are getting as closer as the process densities levels increase. The number of cores to be integrated in a single chip is expected to rapidly increase in the coming years, moving from multi-core to many-core architectures. This trend will require a global rethinking of software and hardware design approaches. Multi-core architectures are nowadays prevalent in general purpose computing and in high performance computing. In addition to dual- and quad-core general-purpose processors, more scalable multi-core architectures are widely adopted for high-end graphics and media processing, e.g. IBM Cell BE, NVIDIA Fermi, SUN Niagara and Tilera TILE64. The 2PARMA project focuses on the flexible family of parallel and scalable computing processors, which we call Many-core Computing Fabric (MCCF) Template, composed of many homogeneous processing cores interconnected by an on-chip mesh as shown in Figure 1.

Figure 1: 2PARMA Many-Core Computing Fabric Template

3

This class of computing systems (which we call Many-core Computing Fabric) promises to increase performance, scalability and flexibility if appropriate design and programming methodologies will be defined to exploit the high degree of parallelism exposed by the architecture. Benefits of Many-core Computing Fabric architectures include finer grained possibilities for energy efficiency paradigms, local process variations accounting, and improved silicon yield due to voltage/frequency island isolation possibilities. To exploit these potential benefits, effective run-time power and resource management techniques are needed. With respect to conventional computing architectures, Many-core Computing Fabric offers some customisation capabilities to extend and/or configure at run-time the architectural template to address a variable workload.

The 2PARMA project aims at overcoming the lack of parallel programming models and run-time resource management techniques to exploit the features of many-core processor architectures focusing on the definition of a parallel programming model combining component-based and single-instruction multiple-thread approaches, instruction set virtualisation based on portable bytecode, run-time resource management policies and mechanisms as well as design space exploration methodologies for Many-core Computing Fabrics.

The above scientific and technical objectives are intended to meet some of the main challenges in computing system research:

• To improve performance by providing software programmability techniques to exploit the hardware parallelism;

• Efficient management of power/performance trade-offs by providing runtime resource management and optimisation;

• To improve system reliability, mainly in terms of lifetime and yield of hardware components by providing transparent resource reconfiguration and instruction set virtualisation;

• To increase the productivity of the process of developing parallel software by using semi-automatic parallelism extraction techniques and extending the OpenCL programming paradigm for parallel computing systems.

2. Main Goals

The main goals of the 2PARMA project related to the analysis and development of the complete software layer able to exploit the features of future many-core processor architectures. This goal has been tackled from several standpoints as presented in the following subsections.

2.1. Programmability of Many-core Computing Fabrics

2PARMA project tackles the issue of programmability of Multi-core Computing Fabrics at both the programming language and Operating System level. On one hand, it leverages the increasingly popular Component-Based Software Engineering (CBSE) and develops parallelism extraction techniques to

4

identify opportunities for parallelisation at a high level in the design phase; 2PARMA then employs extensions of existing standards for parallel programming, such as OpenCL, to express data parallelism for Many-core Computing Fabrics. On the Operating System level, 2PARMA provides the means to define and deploy logic peripherals to the Many-core Computing Fabric, preserving isolation among them and efficient communication between host and Computing Fabric. The 2PARMA intends providing developers with comfortable tools and programming environments aiming at increasing software cycles productivity with respect to current, mainly manual, methodologies.

The proposed improvements build upon existing work on GPGPU architectures [1], since the 2PARMA platforms can be programmed using tools and languages commonly used for such architectures (e.g., OpenCL). Thus, it will benefit from techniques that automatically handle memories local to processor clusters usually available in GPGPU architectures, as well as automated lowering of OpenMP code to OpenCL. This allows the programmer to first design the application under a shared memory paradigm, and then perform fine-tuning on a view of the application where the shared memory abstraction is removed.

2.2. Virtualisation and Continuous Adaptation

Portable bytecode representations of software are employed to provide not only portability but also the ability to adapt applications at runtime to the available resources. Instruction set virtualisation provides the means to tailor the application to the subset of the computing resources determined by the co-existence of multiple applications on the computing platform, by employing dynamic compilation and optimisation techniques.

2PARMA leverages the ILDJIT dynamic compiler [2], designed for execution on parallel machines and targeting the ECMA-335 bytecode language, thus exposing a virtual homogeneous platform. Thus, it will be possible to decide at runtime which parts of the application will be executed on the host processor, and which will be accelerated by the computing fabric.

In addition to the basic virtualisation properties, ILDJIT is able to exploit free computational resources to perform aggressive code optimisation, as well as hiding the latencies of compilation through Dynamic Lookahead Compilation (DLA) [3], using accurate forecasting of the program execution trace to compile functions ahead of their actual invocation. Thus, when the execution reaches one of the pre-compiled functions, the dynamic compiler can immediately invoke it, hiding the compilation latency. Since DLA must be performed without affecting application code execution, compilation and optimisation threads are spawned when computational resources are available, and withdrawn when the resources are scarce, using an hysteretic pattern to reduce fluctuations. This approach minimizes the overhead of the DLA, while maximizing its effectiveness when resources are abundant.

2.3. Runtime Management

Given the opportunities for adaptation of the application to the available resources, 2PARMA develops intelligent policies to manage the system resources taking into account the Quality-of-Service (QoS)

5

requirements imposed by the user to each application, while optimising the resource usage for system-wide performance and energy goals. 2PARMA project aims at reducing system power consumption with respect to conventional power management strategies, supporting efficient and optimal task, data and devices managements able to dynamically adapting to the changing context.

Most of the existing approaches that are looking into runtime resource management require big design-time efforts in order to characterize the system behaviour under a range of system workloads. At design-time, profiling information are gathered and analysed in order to construct a runtime scheduler that can be lightweight. On the other side, modern applications exhibit a dynamic, often unpredictable, behaviour with respect to timing prediction and resource utilization. It is clear that there is a trade-off to be achieved between design-time effort and run-time overhead.

In 2PARMA we push the boundary of this trade-off to reduce the design time effort and move more responsibility to the runtime resource manager. We accomplish this task by providing a runtime manager (RTM) with metadata information covering both runtime and design time knowledge of both hardware and software. The runtime management performs adaptive task mapping and scheduling [4], dynamic data management [5] and system-wide power management [6].

The first role of the RTM is to monitor the dynamic applications behaviour both to control the available and required platform resources and to tune application parameters at run time while meeting user requirements. At run time, multiple tasks can compete for the limited amount of platform resources. Considering all the possible tasks that can run on the platform, the number of possible mappings grows exponentially with the cardinality of the task set. To handle this complexity, a scenario-based mapping technique needs to be developed, identifying subsets of mappings taking into account inter-task scenarios. This technique is complementary to the intra-task scenario methodology [7] as well as the task concurrency management [8]. To allow the exploitation of the proposed technique, the platform control should be distributed and supporting flexibility in several planes as explained in [9].

The second role of the runtime manager is to handle the dynamism in the control flow and data usage (dynamic (de)allocation of data), by determining a suitable allocation strategies that meet the application needs. To that end, new fit, coalescing, and split policies, taking application characteristics into account, may be needed. For example, the customized dynamic memory allocators contain a specific layer/component that offers an adaptive fit policy, as the application(s) requirements change at runtime [5]. Furthermore, the data management techniques take into considerations the constraints imposed by the adaptive task allocation and power management techniques.

Finally, the runtime manager is responsible for the adaptive power management of the many-core computing fabric architecture. This is achieved by identifying a suitable top-level modelling of the entities composing the overall systems, in terms of exchanged data, exposition of control settings and status information of the components/devices. The manager is responsible to combine the adaptive runtime task and data management schemes (component/device specific optimizations) with the adaptive power management policies, being aware of the presence of local optimization strategies exposed by the

6

rest of the system components (e.g., device drivers and in general any other resource manager). The end result is a distributed runtime QoS Constrained Power Manager working at the OS-level [6].

2.4. Design Space Exploration

Continuous adaptation and runtime management require large amount of information on the system and the applications to take effective and timely decisions. 2PARMA goes beyond traditional design space exploration (DSE) by defining a methodology to provide synthetic information about the points of operation of each application with respect to the subsets of resources available to it. Design space exploration methodologies developed in 2PARMA provide also architectural customisation to support parallel programming models, especially communication and memory mapping.

Design space exploration covers a central role in design of novel computing platforms [10, 11, 12]. In fact, for a given system specification there may be many different design alternatives that need to be evaluated and judged to understand their quality and to take a decision on which is the system alternative to implement. Design alternatives may consist of the tuning of processor micro-architectural components, different mappings of software tasks to resources, different scheduling policies implemented on shared resources as well as lower level design parameters such as clock frequency or bus/network width. Moreover, design space exploration involves the analysis of multiple criteria, since each design alternative usually represents a trade-off among different optimisation goals.

In this context, the 2PARMA project wants to provide to the designer novel and robust design space exploration methodologies to trade-offs the system cost factors during the design phase by considering the dynamic evolution of the system. In particular, this goal is obtained by extending and retargeting an open source framework for the automatic system level multi-objective exploration of multiprocessor architecture developed in the MULTICUBE project [13]. The MULTICUBE Explorer framework [14] extensions enable the robust co-exploration of static and dynamic parameters and to support run-time resource management. Response Surface Modelling techniques are also added to speed up the design space exploration for multi/many core architectures.

Focusing on the combined optimization of parallel programming models and architectural enhancements for many core platforms it is expected that conventional or state of the art profiling techniques cannot be used for the task of analyzing and profiling. Especially profiling memory accesses on a cycle accurate basis as describe in [15] is not sufficiently supported by available profiling tools and methodologies due to the fact that only shared memory architectures were modelled at the time. It is foreseen that many core platforms will be build upon completely different connection topologies, which requires new profiling techniques that take the connection topologies into account. Based on the profiling methodologies it will be possible to get an in depth view of how parallel programming models behave on many core platforms. The results will be used to co-optimize the used programming model and the architecture of the computing platform.

7

3. Platforms and Applications

The 2PARMA project will demonstrate methodologies, techniques and tools by using innovative hardware platforms provided and developed by the partners, including the “Platform 2012” an early implementation of Many-core Computing Fabric provided by STMicroelectronics and the many-core COBRA platform provided by IMEC.

3.1. STMicroelectronics Platform 2012

The P2012 program is cooperation among STMicroelectronics and CEA and aims at designing and prototyping a regular computing fabric able to improve manufacturing yield. Platform 2012 (P2012) is a high-performance programmable accelerator whose architecture meets requirements for next generation SoC products at 32nm and beyond. The goal of P2012 is twofold:

• Provide flexibility through massive programmable and scalable computing power

• Provide solid ways of dealing with increasing manufacturability issues and energy constraints

In order to achieve these two goals the P2012 program will use the two following key enablers:

• Emerging 3D stacking techniques allowing to revisit memory hierarchy organization

• Application deep know-how STMicroelectronics is having thanks to its product group organization in order to inject the right specialization level into the architecture.

The combination of these dimensions into an architecture proposal can offer a viable alternative to heterogeneous mix hardware software systems we are using today.

Organized around an efficient Network-on-Chip communication infrastructure, P2012 allows connecting a large number of decoupled STxP70 processors SMP clusters, offering flexibility, scalability and high computation density. P2012 offers a software environment that allows firmware developers to master the inherent complexity of parallel programming. It does not assume a unique programming model; instead, several will be made available to better suit programmers’ needs, including, but not limited to, open standards such as Khronos OpenCL.

Figure 2 shows the Platform 2012 computing fabric, made of a variable number of ‘tiles’ that can be easily replicated to provide scalability. Each tile includes a computing cluster with its memory hierarchy and a communication engine. Tiles are connected through a regular network on chip. The whole computing fabric operation is coordinated by a fabric controller and is connected to the SoC host subsystem through a dedicated bridge, with DMA capabilities. Clusters of the fabric can be isolated to reduce power consumption (or to switch-off a faulty element) and frequency/voltage scaling can be applied in active mode.

Also, as soon as 3D stacking is used, it is envisaged that the NoC could provide high-bandwidth access to a stacked layer of embedded memory, whose dimension will also be determined according to the SoC

8

needs. The P2012 computing fabric is connected to a host processor such as the ARM Cortex A9, via a system bridge. The fabric is in this way exposed to legacy operating systems like the GNU/Linux OS.

Many P2012 platform design choices are still open and subject to change, and the 2PARMA Consortium, which is one of the very early adopters of this technology, effectively contributes to the platform architecture specification and relevant optimizations.

Figure 2. Platform 2012 Template

9

3.2. IMEC’s ADRES-based COBRA Platform

Figure 3: IMEC's COBRA platform

The IMEC’s COBRA platform is an advanced platform template targeting 4G giga-bit per second wireless communication. One instance of the platform is shown in Figure 3. This platform can be customized to handle very high data rates as well as low throughputs in a scalable way. This platform largely consists of 4 types of cores:

1. DIFFS: This is an ASIP processor tuned towards sensing and synchronization. It is optimized so that it has a very low power for performing these tasks. It is tuned towards average duty cycle.

2. ADRES [16]: The ADRES core is a coarse-grained reconfigurable core template [17]. It consists of a number of functional units connected in a given interconnect network. The core has been tuned to be capable of doing inner modem processing of various standards efficiently.

3. FlexFEC [18]: A flexible forward error correction ASIP that is capable of doing different outer modem processing. This ASIP architecture is a SIMD engine template where the instruction set, bit width of the data-path and the number of SIMD slots can be chosen based on the set of requirements of the standard to be run.

4. ARM core for controlling the tasks on the platform (e.g. the run-time manager task).

10

The first three cores (DIFFS, ADRES, FlexFEC) cores can be instantiated for a mix of targeted standards that need to be supported. Also all parts of the platform are programmable in C (ADRES, ARM) or high-level assembly (FlexFEC, DIFFS). The communication is ensured by customized InterConnect Controller (ICC) cores that are programmable at assembly level as well.

In this platform, besides the type and the size of each core, the number of each type of core can be selected based on the different standards that need to be supported on the platform. For example for the highest throughput modes for Wireless LAN 802.11n 4x4 MIMO, the number of DIFFS cores may be 4 and two (multi-threaded) ADRES cores and two FlexFEC cores. In case the platform has to support only a low end Wireless LAN SISO standard or a basic LTE SISO reception, one DIFFS core, ADRES core and one FlexFEC would be sufficient to meet the requirements of this mode.

The core instances used in the platform template are reconfigurable (run-time) and programmable (design-time) and the mapping flow for the platform is developed [19]. However, proper run-time control mechanisms at the top of the platform are under development to steer the run-time reconfigurability of the platform.

3.3. Applications

To ensure a wide range of application scenarios comprising the typical computation-intensive workload of a general-purpose computing system, a set of industrial high performance demanding applications will be used and customized by using the techniques and methodologies developed in 2PARMA project. Applications’ architecture, development and integration will leverage in particular from the acknowledged experience of three partners from the Consortium: Fraunhofer HHI for Scalable Video Coding application, RWTH Aachen University for Cognitive Radio (MAC and Physical Layer), and IMEC for Multi View Image Processing.

3.3.1. Scalable Video Coding

SVC [20] also known as layered video coding has already been included in different video coding standards in the past. Scalability has always been a desirable feature of a media bit stream for different services and especially for best-effort networks that are not provisioned to provide suitable QoS and especially suffer from significantly varying throughput. Thus a service needs to dynamically adapt to the varying transmission conditions: A video encoder for example shall be capable of adapting the media rate of the video stream to the transmission conditions to provide at least acceptable quality at the clients, but shall also be able to explore the full benefits of available higher system resources. Within a typical multimedia session the video consumes the major part of the total available transmission rate compared to control and audio data. Therefore, an adaptation capability for the video bit rate is of primary interest in a multimedia session. Strong advantages of a video bit rate adaptation method relying on a scalable representation are drastically reduced processing requirements in network elements compared to approaches that require video re-encoding or transcoding. With this motivation in mind, H.264/AVC-based SVC is of major practical interest and it is therefore highly important to investigate implementation aspects of SVC.

11

SVC represents one of the potential killer applications. Due to its scalable nature it is very well suited to demonstrate e.g. the different concepts of runtime resource management. SVC is not only scalable in respect to its coding features. It can also be used and implemented for scalability techniques regarding processing power, power consumption or resource management. In this case e.g. the scalability of the media bitstream will be reflected by the available processing power. On the other hand this scalability in terms of needed processing power can be used to minimize power consumption by limiting the available processing power accordingly to the given power budget. Scalable multi media signal processing on scalable computing platforms will be one of the most challenging research topics in this field in the future.

3.3.2. Cognitive radio (MAC and Physical Layer)

From the domain of wireless communications, a cognitive radio application will be provided by RWTH Aachen University. This application includes both physical and MAC-layer processing. Especially, the low latency as well as high throughput and reconfiguration requirements of state-of-the-art wireless communication standards makes the cognitive radio application as a highly appropriate use case for the 2PARMA project and its parallel programming models.

The operational scenario of the physical-layer application has been defined based on cutting-edge standards, such as LTE, WiMax and WLAN. These standards have adopted OFDM transmission schemes, because of the high spectrum efficiency and low equalizer complexity. In addition, MIMO techniques are regarded as the key for improving data rates and reliability for future communication systems. The combination of MIMO and OFDM effectively enhances the achievable data rate, spectrum efficiency, and data reliability. Accordingly, the physical-layer application use case targets a MIMO-OFDM transceiver system that has been defined by a thorough investigation of latest wireless communication standards. For implementation of such transceiver, the challenging application tasks are considered to be MIMO demapping, channel estimation and decoding as well as FFT/IFFT processing.

Besides the physical-layer application, the provided cognitive MAC-layer allows re-configuration as per the changing spectrum conditions and the evolving application demands. In the component-oriented MAC design approach, special focus is set to efficiently coordinate the data and control-flow at runtime. This includes flexible and efficient resource management as well as runtime modifications and exploitation of parallelism during the execution of different MAC-layer components.

During the application development, it is planned to study and analyze the architecture and application requirements as well as the new parallel programming models, in particular, the proposed component-based programming models and tools. Therefore, these novel programming tools are currently implemented to enable efficient physical- and MAC-layer [21, 22] development in the future. In addition, these will allow fast on-the-fly reconfiguration of the physical- and MAC-layer while respecting the time-critical nature of the data and control transfer among different components.

12

3.3.3. Multi-View Video

With the current development of electronic, network, and computing technology, Multi-View Video (MVV) becomes a reality and allows countering the limitations of conventional single video. MVV refers to a set of N temporal synchronized video streams coming from cameras that capture the same scene from different viewpoints. The availability of multiple views of the scene broadens the application field: it extends the sensation of classical 2D video, it allows the user to freely choose a viewpoint (Free Viewpoint Video), and it provides a 3D depth impression of the scene (3D television).

Figure 4: Applications/Architectures Integration The application, considered in the 2PARMA project, is the cross-based stereo matching algorithm [23], where two aligned left and right cameras are assumed for the sake of clarity. Given a pair of images from two aligned cameras and from any stereo pair of pixels p in both left and right images, the depth of p is inferred from the pixel disparity, being the difference on the x coordinate of p in the stereo pair. Once the disparity is obtained for each pixel, the disparity map can be visualized in grayscale encoding (see Figure 4). The cross-based stereo matching algorithm is an area-based local method, where the disparity computation at a given pixel only depends on intensity values within a finite window. The challenge of this method is that the support window should be large enough to include enough intensity variation for reliable matching, while it should be small enough to avoid disparity variation inside the window. Therefore, to obtain accurate disparity results at reasonable costs, an appropriate support window should be selected adaptively.

13

4. Design Flow and Tools

The tool environment and design flow of the 2PARMA project is shown in Figure 5. The basic idea behind the 2PARMA project is to combine the automatic extraction of parallelism to dynamic compilation in order to exploit the management of system resources at runtime. Starting from the specifications of the industrial applications and architecture to be used for the integration and validation of the design flow, the main goal is to define a parallel compilation toolchain and Operating System support. The compilation tool chain starts with the component-based application source code (C-based) to be assembled and compiled to bytecode and further dynamically translated to machine code. Then the machine code execution and deployment will be supported by an OS layer to provide isolated logical devices efficiently communicating (device-to-device and host-to-device). The GNU/Linux operating system will be used as the software reference common ground for what the host processor is concerned. On the parallel side, another main goal consists of developing methodologies and tools to support the application/architecture co-exploration. More in detail, the project focuses on profiling the parallel applications aimed at finding the bottleneck of the target platform and on the robust design space co-exploration of static and dynamic parameters by considering dynamic workloads, while identifying hints/guidelines for dynamic resource management. Then, the Run-Time Resource Manager (RTRM) provides adaptive task and data allocation as well as scheduling of the different tasks and the accesses to the data for many-core architectures. Furthermore, the adequate power management techniques as well as the integration to the Linux OS will be provided.

14

Figure 5: 2PARMA Design Flow and Tools

5. Summary and project outcomes

To conclude, the concrete results of the project are related to the analysis and development of the complete software layer able to exploit the features of future many-core processor architectures. These outcomes can be summarized as follows.

Integrated Compiler Toolchain and OS Layer to start from a componentised application source code (C-based) to be semi-automatically parallelised into an OpenCL program and then compiled to bytecode and further dynamically translated to machine code. Then the machine code execution and deployment will be supported by an OS layer to provide isolated logical devices efficiently communicating (device-to-device and host-to-device).

Design toolset for supporting the HW/SW co-exploration considering the system dynamic. Starting from a configurable architecture and parallelised application description (coming from the compilation toolchain) the toolset will explore the HW/SW design space by generating:

• A bottleneck analysis and optimisation of the target architecture with respect to the parallel application;

15

• Robust tuning of the design-time (static) configurable parameters with respect to run-time (dynamic) system behaviour;

• A set of operating points for the run-time configurable parameters to be used for dynamic resource management.

Run-time Manager: A set of techniques to manage at run-time the system resources. Based on the set of operating points given by the DSE tool at design time and the info collected at run-time on system workload and resource utilization, the run-time management techniques will optimise data allocation and data access scheduling, task mapping and scheduling and power consumption.

6. References [1] A.DiBiagio and G.Agosta. Improved Programming of GPU Architectures through Automated Data Allocation and Loop Restructuring. In Proceedings of the 2PARMA Workshop, February 2010.

[2] S.Campanoni, G.Agosta, S.Crespi-Reghizzi and A.DiBiagio. A highly flexible, parallel virtual machine: design and experience of ILDJIT. In Software: Practice and Experience, Volume 40 Issue 2, pages 177-207. January 2010.

[3] S.Campanoni, M.Sykora, G.Agosta and S.Crespi-Reghizzi. Dynamic Lookahead Compilation. In Proceedings of the 12th Compiler Construction conference. Mar 2009.

[4] Z.Ma, P. Marchal, D. Scarpazza, P. Yang, C. Wong, I. Gomez, S. Himpe, C. Ykman, F. Catthoor. “Systematic methodology for real-time cost-effective mapping of dynamic concurrent task-based systems on heterogeneous platforms”. ISBN 978-1-4020-6328-2, Springer, Heidelberg, Germany, 2007.

[5] A. Bartzas et al., “Software metadata: Systematic characterization of the memory behaviour of dynamic applications,” Journal of Systems and Software, 2010.

[6] P. Bellasi et al., “Constrained Power Management: Application to a Multimedia Mobile Platform,” in Proc. of DATE, 2010.

[7] V.Gheorghita, M.Palkovic, J.Hamers, A.Vandecappelle, S.Mamagkakis, T.Basten, L.Eeckhout, H.Corporaal, F.Catthoor, F.Vandeputte, K.De Bosschere. “System scenario based design of dynamic embedded systems”. ACM TODAES, 14(1), January 2009.

[8] N.Miniskar, E.Hammari, S.Munaga, S.Mamagkakis, P-G.Kjeldsberg, F.Catthoor. “Scenario-based mapping of dynamic applications on MPSoC: a 3D graphics case study”. Workshop on Systems, Architectures, MOdeling, and Simulation-SAMOS, pp. 48-57, July 2009.

[9] M.Palkovic, P.Raghavan, M.Li, A.Dejonghe, L.Van der Perre, F.Catthoor, “Multicore embedded systems for future SDR platforms: architecture and mapping flow overview”, IEEE Signal Processing Magazine, Vol.27, No.2, pp.22-33, March 2010.

16

[10] H.Chang, L.Cooke, M.Hunt, G.Martin, A.McNelly, L.Todd, Surviving the SOC Revolution: a Guide to Platform-Based Design. Kluwer Academic Publishers.

[11] ARTEMIS “Strategic Research Agenda: Design Methods and Tools”, https://www.artemisia-association.org/downloads/RAPPORT_DMT.pdf

[12] “HIPEAC Roadmap” http://www.hipeac.net/roadmap

[13] www.multicube.eu

[14] http://m3explorer.sourceforge.net/

[15] H.Hübert and B.Stabernack; Profiling-Based Hardware/Software Co-Exploration for the Design of Video Coding Architectures, IEEE Transactions on Circuits and Systems for Videotechnology, Special Issue on Algorithm/Architecture Co Exploration of Visual Computing, August 2009.

[16] V.Derudder, B.Bougard, A.Couvreur, A.Dewilde, S.Dupont, L.Folens, L.Hollevoet, F.Naessens, D.Novo, P.Raghavan, T.Schuster, K.Stinkens, J.-W.Weijers, L.Van der Perre, “A 200Mbps+ 2.14nJ/b digital baseband multi processor system-on-chip for SDRs”, Proceedings of VLSI, June 2009

[17] B.Mei, S.Vernalde, D.Verkest , H.DeMan, R.Lauwereins, “ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix “,Proceedings of FPL, pages 61—70, September 2003

[18] F.Naessens, V.Derudder, H.Cappelle, L.Hollevoet, P.Raghavan, M.Desmet, A.M.AbdelHamid, I.Vos, L.Folens, S.O'Loughlin, S.Singirikonda, S.Dupont, J.-W.Weijers, A.Dejonghe, L.Van der Perre, “A 10.37 mm2 675 mW reconfigurable LDPC and Turbo encoder and decoder for 802.11n, 802.16e and 3GPP-LTE”, Proceedings of VLSI, June 2010

[19] “Parallelization exploration of wireless applications using MPA'', M.Palkovic, P.Raghavan, T.Ashby, A.Folens, H.Cappelle, M.Glassee, L.Van der Perre, F.Catthoor, Intnl. Conf. on Parallel Computing (Parco), Lyon, France, Sep. 2009.

[20] H.Schwarz, D.Marpe, T.Wiegand, Overview of the Scalable Video Coding Extension of H.264/AVC, IEEE Transactions on Circuits and Systemsfor Video Technology, Vol 17, 2007.

[21] V.Ramakrishnan, E.M.Witte, T.Kempf, D.Kammler, G.Ascheid, H.Meyr, M.Adrat, M.Antweiler. “Efficient and Portable SDR Waveform Development: The Nucleus Concept”. In IEEE Military Communications Conference (MILCOM 2009), Oct 2009.

[22] J.Ansari, X.Zhang, A.Achtzehn, M.Petrova, P.Mähönen. "Decomposable MAC Framework for Highly Flexible and Adaptable MAC Realizations". Demonstration abstract. Proc. of DySPAN, 2010, Singapore.

[23] K.Zhang, J.Lu, G.Lafruit. “Cross-based local stereo matching using orthogonal integral images”. IEEE Transactions on Circuits and Systems for Video Technology, 19(7):1073-1079, 2009.

Documents

Parallel Paradigms and Run-time Management Techniques for ...home.deib.polimi.it/fornacia/lib/exe/fetch.php?... · Parallel Paradigms and Run-time Management Techniques for Many-core