Program:

Time Topic and Speaker
08:30 — 08:45 Welcome
08:45 — 09:30 Invasive Computing - An Introduction
Technology roadmaps already foresee 1000 and more processors being integrated into a single MPSoC by the year 2020. Obviously, the control of parallel applications can no longer be organized in a fully centralized way, as it is in today's multi-core processors. Feature variations will also become a problem if algorithms are not able to cope with them. One way to live with the expected increase of defects and errors is to properly exploit the reconfigurability of processors, communication links, and memories. The only question is at what price, and to what degree, this can and should be done.
In this keynote, we present Invasive Computing, a novel paradigm for organizing the computations of large-scale MPSoCs of the future in a decentralized and resource-aware manner. This involves drastic changes in the way MPSoCs will be programmed in 2020 and drastic changes in the underlying architecture.
The main idea of Invasive Computing relies on the vision that applications will organize themselves at run-time, spreading their computational load onto processor, communication, and memory resources in phases called invasion, and later retreating from these resources depending on the available degree of parallelism, dynamically changing user objectives, or the state of the underlying hardware such as temperature profile, load, permissions, or faultiness.
We present first ideas on how to embed this novel parallel computing paradigm into existing parallel programming languages, what kind of architectural changes will be required, and finally, what applications would benefit from this kind of self-organization on invasive MPSoCs. It will be outlined how and to what degree invasive computing may improve fault resilience, scalability, efficiency, and resource utilization.
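As a rough illustration of the invasion/infection/retreat phases sketched above, the following C++ fragment shows how a resource-aware application might request cores, run its parallel kernel on whatever degree of parallelism was actually granted, and release its claim afterwards. All names (Claim, invade, infect, retreat) are illustrative stand-ins, not the project's actual API.

    #include <algorithm>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Hypothetical claim object: the set of cores an application currently "owns".
    struct Claim {
        std::vector<int> cores;   // IDs of the processing elements granted
    };

    // invade(): ask the run-time system for up to `wanted` cores; it may grant fewer
    // depending on load, temperature, or permissions (here just a stub).
    Claim invade(int wanted) {
        int granted = std::min(wanted, (int)std::thread::hardware_concurrency());
        Claim c;
        for (int i = 0; i < granted; ++i) c.cores.push_back(i);
        return c;
    }

    // infect(): run the parallel workload on whatever degree of parallelism was granted.
    void infect(const Claim& c, void (*kernel)(int, int)) {
        std::vector<std::thread> workers;
        for (int i = 0; i < (int)c.cores.size(); ++i)
            workers.emplace_back(kernel, i, (int)c.cores.size());
        for (auto& t : workers) t.join();
    }

    // retreat(): hand the resources back so other applications can invade them.
    void retreat(Claim& c) { c.cores.clear(); }

    void my_kernel(int id, int total) {
        std::printf("worker %d of %d\n", id, total);
    }

    int main() {
        Claim c = invade(8);    // request 8 cores, possibly receive fewer
        infect(c, my_kernel);   // adapt to the granted degree of parallelism
        retreat(c);             // release the resources after the parallel phase
    }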

Jürgen Teich, University of Erlangen
Juergen Teich received his master's degree (Dipl.-Ing.) in 1989 from the University of Kaiserslautern (with honours). From 1989 to 1993, he was a PhD student at Saarland University, Saarbruecken, Germany, from which he received his PhD degree (summa cum laude). In 1994, Dr. Teich joined the DSP design group of Prof. E. A. Lee and D. G. Messerschmitt in the Department of Electrical Engineering and Computer Sciences (EECS) at UC Berkeley, where he worked on the Ptolemy project as a postdoc. From 1995 to 1998, he held a position at the Institute of Computer Engineering and Communications Networks Laboratory (TIK) at ETH Zurich, Switzerland, finishing his Habilitation entitled `Synthesis and Optimization of Digital Hardware/Software Systems' in 1996. From 1998 to 2002, he was a full professor in the Electrical Engineering and Information Technology department of the University of Paderborn, holding a chair in Computer Engineering.
Since 2003, he has been a full professor in the Department of Computer Science of the Friedrich-Alexander University Erlangen-Nuremberg, holding a chair in Hardware/Software Co-Design. Dr. Teich has been a member of multiple program committees of international conferences and workshops and was program chair for CODES+ISSS 2007, FPL 2008, and ASAP 2010. From 2003 to 2009, he was coordinator of the DFG priority programme 1148 on Reconfigurable Computing. He is a Senior Member of the IEEE. Since 2004, he has also acted as a reviewer for the German Research Foundation (DFG) in the area of Computer Architecture and Embedded Systems. Since 2010, he has been the speaker of the DFG Collaborative Research Center TR89 on Invasive Computing.
09:30 — 10:00 Coffee Break
10:00 — 10:30 Probabilistic Hardware: A New Design Paradigm for System-on-a-Chip Assembly
It is well known that the latest silicon semiconductor process generations -- 32nm and 28nm -- are less predictable in terms of both technology parameters (e.g., transistor widths and doping concentrations) and gate performance (power and delay). The main cause is also reasonably well known: fewer atoms (e.g., in a transistor channel) mean that tiny variations (e.g., in dopant atom concentration) cause significant variations. In a similar vein, traditional voltage scaling has dramatically slowed down; while we went from 5V to 1V in about a decade or so, we are unlikely to go from 1V to 0.2V in another decade. One problem in this picture is thermal noise, which may have a significant impact in the future.
As a counterpoint to the traditional computing paradigm of deterministic gates, over 50 years ago John von Neumann proposed techniques for fault-tolerant computation to deal with faulty logic gates. At the software level, there is a long history of so-called "randomized algorithms", where statistical and probabilistic properties -- captured, for example, in a particular number sequence or in a logic network such as a Bayesian network -- are utilized in and intrinsic to the particular algorithm. More recently, in 2003, the first publication (as best as this writer knows) proposing to utilize the statistical behavior of faulty gates and transistors appeared*. Broadly speaking, the idea is to view nondeterminism (e.g., due to parametric variations or thermal noise) in hardware as a friend rather than a foe by utilizing the statistical properties of the hardware in the software algorithms. The Institute for Sustainable and Applied Infodynamics, a joint effort between Nanyang Technological University and Rice University, including key partners such as Georgia Tech and several other international partners, is pursuing some critical issues in this space, which can broadly be called “Probabilistic Computing”. This talk will give an overview of the current set of ideas being pursued and will present some early results.
*"Energy Aware Algorithm Design via Probabilistic Computing: From Algorithms and Models to Moore's Law and Novel (Semiconductor) Devices," Krishna V. Palem, Proceedings of the Intl. Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), October 30-November 1, 2003.

Vincent Mooney, Georgia Institute of Technology
Vincent J. Mooney III (Senior Member, IEEE, and Member, ACM) received the B.S. degree from Yale University in 1991, where he double majored in Electrical Engineering and Computer Science. He received an M.S. degree in E.E. from Stanford University in 1994, an M.A. degree in Philosophy from Stanford in 1997, and the Ph.D. degree in E.E. from Stanford in June of 1998. He has worked at Bell Labs (Lucent), Allied Signal Aerospace VLSI Design Group, Hughes Network Systems, and Redwood Design Automation (acquired by Cadence). He is currently a Visiting Associate Professor at both the School of Computer Engineering and the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. He is on sabbatical from the Georgia Institute of Technology, where he is an Associate Professor in the School of Electrical and Computer Engineering and an Adjunct Associate Professor in the College of Computing. He is a recipient of the NSF CAREER Award. He was General Chair of IFIP/IEEE VLSI-SoC 2007. Since 2006 he has been an Associate Editor of the ACM Transactions on Embedded Computing Systems, and he served as an Associate Editor of the IEEE Transactions on VLSI from 2007 to 2009. His research interests include computer-aided design of integrated circuits with a particular emphasis on hardware-software codesign, reconfigurable computing, and power-aware and probabilistic architectures and circuits.
10:30 — 11:00 Putting ESL into practice
ESL tools and design methods are increasingly common components of industrial design flows but the various ESL flows and use models (software virtual prototypes, high-level synthesis, architecture analysis and functional verification) have found adoption to different degrees and at different rates. Having spent several years investigating the strengths and opportunities for various ESL tools and techniques, in this presentation I'll offer an industry perspective on evolving the ESL value proposition into practice. I'll give special attention to the current state of the art in ESL performance analysis and detail how advances in the functional verification use case actually benefited the performance analysis use case. I'll outline our most recent experiences in creating a reusable architect's dashboard that leverages a variety of modeling environments and was used in several real-world design projects to answer critical design questions and resolve architectural tradeoffs.

Adam Donlin, Xilinx Research Labs
Adam Donlin is a senior researcher in the performance analysis and verification team of Xilinx Research, San Jose, CA. His interests include transaction level modeling and system level design with a particular focus on new flows and use-models for FPGA design environments. He has also made wide-ranging novel contributions in FPGA dynamic reconfiguration including the FPGA virtual file system and a self-modifying processor architecture. Dr Donlin received his PhD in Computer Science from Edinburgh University in 2001. He is a senior member of the IEEE, serves on the technical program and organizing committees of several international conferences, and is a program chair for the 2010 Embedded Systems Week conference CODES+ISSS.
11:00 — 11:30 The rise and rise of Scratch Pads
Scratch Pad Memories, or SPMs, have been popular in embedded systems for a long time. An SPM is typically implemented as an SRAM and is physically addressed by the processor. Unlike caches, SPMs do not automatically fetch data from the lower levels of the memory hierarchy; instead, explicit DMA instructions have to be inserted in the program to transfer data between the SPM and the lower levels of memory. Typically, SPMs are used as small and fast memories close to the processor. The SuperH cores from Hitachi, which were used in the Sega gaming consoles, could lock cache lines to use them as SPM. Gaming engines, especially from Sony, have exploited the benefits of scratch pads for a long time. Embedded processors, including DSPs, routinely deploy scratch pads. Even though using SPMs requires changing application code, this was feasible in embedded domains, since such applications are typically programmed only once.
Meanwhile, general-purpose and high-performance computing steered away from scratch pads and continued to use cache memory hierarchies, chiefly because of portability. However, medium- and high-performance computing has now hit a strong power, thermal, and scalability wall. Processors generate much more power than can be dissipated; as a consequence, even though you can build 20 cores, you cannot keep all of them “on”. Furthermore, cache coherence protocols do not scale well with the number of cores. As a result of all this, the memory transfer cost is becoming the single most important source of power inefficiency in medium- and high-performance computing. The stage is set for SPMs to make an entry into the general-purpose and high-performance computing domain.
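As a minimal sketch of the explicit data management described above (the buffer sizes, names, and the dma_copy stub are hypothetical), a program using an SPM stages each tile of its data into the on-chip buffer itself before computing on it, rather than letting a cache fetch the data implicitly:

    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    // Illustrative sketch of SPM-style programming: data is staged explicitly into a
    // small on-chip buffer before computation instead of relying on a cache.
    constexpr std::size_t SPM_SIZE = 4096;                  // bytes of scratch-pad memory
    static float spm_buffer[SPM_SIZE / sizeof(float)];      // stands in for the physical SPM

    // Stand-in for a DMA transfer between main memory and the SPM.
    void dma_copy(void* dst, const void* src, std::size_t bytes) {
        std::memcpy(dst, src, bytes);
    }

    float sum_tile(const float* big_array, std::size_t offset, std::size_t count) {
        // Explicitly pull one tile of the data into the SPM ...
        dma_copy(spm_buffer, big_array + offset, count * sizeof(float));
        // ... then compute out of the fast, predictable on-chip memory.
        float s = 0.0f;
        for (std::size_t i = 0; i < count; ++i) s += spm_buffer[i];
        return s;
    }

    int main() {
        float data[1024];
        for (std::size_t i = 0; i < 1024; ++i) data[i] = 1.0f;
        std::printf("tile sum = %f\n", sum_tile(data, 0, 512));   // one 2 KB tile fits the SPM
    }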

Aviral Shrivastava, Arizona State University
Aviral Shrivastava is an Assistant Professor in the Department of Computer Science and Engineering at Arizona State University, where he has established and heads the Compiler and Microarchitecture Labs (CML). He received his Ph.D. and Master's in Information and Computer Science from the University of California, Irvine, and his Bachelor's in Computer Science and Engineering from the Indian Institute of Technology, Delhi. His research interests lie at the intersection of compilers and computer architecture, ranging from embedded systems to high-performance computing. Dr. Shrivastava is a lifetime member of the ACM, and serves on organizing and program committees of several premier embedded system conferences, including ISLPED, CODES+ISSS, CASES and LCTES, and regularly serves on NSF and DOE review panels.
11:30 — 12:00 Efficient Utilization of Scratch-Pad Memory in Preemptive Multi-Task Systems
Use of Scratch-Pad Memory (SPM) instead of, or together with, cache memory can significantly reduce the energy consumption of embedded systems. Although a large amount of research effort on SPM for energy minimization has been made so far, few researchers have addressed how to utilize SPM in preemptive multi-task systems. Since the capacity of SPM is limited, it is crucial to share the SPM space efficiently among multiple tasks. In this talk, I will describe how to efficiently utilize SPM in fixed-priority preemptive multi-task systems. First, I will present simultaneous SPM partitioning and allocation techniques and formulate them using integer linear programming. Then, I will talk about implementation issues and runtime support from the real-time operating system and hardware.
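A minimal sketch of how such an ILP might look (the variable names are illustrative and not necessarily the talk's exact formulation): let $x_{t,o} \in \{0,1\}$ indicate that memory object $o$ of task $t$ is placed in the SPM partition of task $t$, let $s_t \ge 0$ be the partition size granted to task $t$, let $g_{t,o}$ and $w_{t,o}$ be the energy gain and size of object $o$, and let $C$ be the total SPM capacity:

\[
\max \sum_{t}\sum_{o} g_{t,o}\,x_{t,o}
\quad\text{s.t.}\quad
\sum_{o} w_{t,o}\,x_{t,o} \le s_t \;\;\forall t,
\qquad
\sum_{t} s_t \le C,
\qquad
x_{t,o} \in \{0,1\}.
\]

Solving for the partition sizes $s_t$ and the object placements $x_{t,o}$ simultaneously is what makes the partitioning and allocation decisions interact.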

Hiroyuki Tomiyama, Ritsumeikan University
Hiroyuki Tomiyama received a Ph.D. degree in computer science from Kyushu University in 1999. He is currently a full professor in the Department of VLSI System Design, Ritsumeikan University, Japan. His research interests include system-level design automation, architectures and compilers for embedded systems, and systems-on-chip. He currently serves as Associate Editor-in-Chief of IPSJ Transactions on System LSI Design Methodology and as Associate Editor of IEEE Embedded Systems Letters.
12:00 — 13:00 Lunch Break
13:00 — 13:30 Library Support in an Actor-based Parallel Programming Platform
Embedded software design for multi-core systems is very challenging since it involves parallel programming to utilize the full potential of the hardware performance under resource and timing constraints. Recently, a novel concept of a programming platform, called CIC (Common Intermediate Code), has been proposed as a new programming method for parallel embedded software. In this method, the programmer builds up an application in an actor-oriented manner so that the potential parallelism of the application is exposed. Since the actors, called CIC tasks, are written in an architecture-neutral form, the application code can be translated into target code by a target-specific CIC translator.
In most actor-oriented models, actors are self-contained, data channels are the only objects sharable between actors, and actors compose a system in a single flat layer. In practice, by contrast, it is common to use shared library functions and to construct vertically layered software for efficient and natural implementation. To fill this gap between modeling and implementation, we propose to extend the CIC actor model by introducing a special type of actor, the CIC library task. It is a sharable and mappable object that defines a set of function interfaces. A library function is called through new types of ports: the library master port and the library slave port. An N:1 master-slave connection allows multiple master tasks to share a single CIC library task. Moreover, the master-slave connection can naturally express vertically layered software and client-server applications.
In order to support CIC library tasks in our embedded software design automation environment, we develop CIC translators for the library tasks. Preliminary experiments have been performed to validate the proposed approach on two architectures: the IBM Cell Broadband Engine and an ARM-based multi-core simulator.
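A minimal sketch of the library-task idea, with all names invented for illustration (in the actual flow, the CIC translator would generate target-specific code from such a specification): one library task exposes a function interface, and several master tasks call it through library master ports bound to the same slave instance, forming an N:1 connection.

    #include <cstdio>

    // Hypothetical illustration only; the real CIC port API is not shown here.
    struct MathLibTask {                   // the shared, CIC-style library task
        int scale(int x) { return 2 * x; } // one function of its interface
    };

    struct LibraryMasterPort {             // library master port held by each CIC task
        MathLibTask* slave = nullptr;      // bound to the single slave instance (N:1)
        int call_scale(int x) { return slave->scale(x); }
    };

    struct ProducerTask {                  // an ordinary CIC task using the library
        LibraryMasterPort lib;
        void fire(int token) {
            std::printf("scaled token: %d\n", lib.call_scale(token));
        }
    };

    int main() {
        MathLibTask shared_lib;            // one library task ...
        ProducerTask a, b;                 // ... shared by two master tasks
        a.lib.slave = &shared_lib;
        b.lib.slave = &shared_lib;
        a.fire(3);
        b.fire(5);
    }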

Soonhoi Ha, Seoul National University
Soonhoi Ha is currently an associate dean of the College of Engineering and a full professor in the School of Computer Science and Engineering at Seoul National University, Korea. From 1993 to 1994, he worked for Hyundai Electronics Industries Corporation. He received his Bachelor's (1985) and Master's (1987) degrees in Electronics Engineering from Seoul National University, and his PhD (1992) in Electrical Engineering and Computer Science from the University of California, Berkeley. He has worked on the Ptolemy project and has led the PeaCE project (development of a HW/SW codesign environment) and the HOPES project (development of a parallel embedded S/W design environment). His research interests include hardware-software codesign, design methodology for embedded systems, and parallel embedded S/W. He serves as an associate editor of ACM TODAES. He is a senior member of the IEEE Computer Society.
13:30 — 14:00 Distributed Operation Layer: An Efficient and Predictable KPN-Based Design Flow
Multi-processor system-on-chip (MPSoC) is one of the best paradigms for implementing embedded systems for signal processing in communications, medical, and multimedia applications. MPSoC platforms offer high computational capabilities and allow (multiple) applications to be executed in parallel. However, these systems are heterogeneous by nature, utilizing multiple computation, communication, memory, and peripheral resources. Therefore, the question which every designer must face is whether a certain MPSoC design is efficient and, furthermore, whether it is predictable. Efficiency refers to the speed-up gained by executing parallelized applications on multiple processors, while predictability refers to the ability to analyze the performance of the system early in the design, thus enabling the optimization of the final implementation.
A promising formalism for programming multi-processor systems is the Kahn process network (KPN) model of computation. The KPN model allows designers to achieve efficient execution by explicitly expressing the data parallelism in the application, as well as the parallelism between computation and communication. Moreover, due to the well-defined semantics of the KPN model and its properties, an early (formal) performance analysis of such systems can be performed at the abstract system level.
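As a minimal, single-threaded sketch of the KPN idea (not DOL's actual API): processes communicate only through FIFO channels and, in a real KPN, block when reading from an empty channel, which makes the computed result independent of how the processes are scheduled.

    #include <cstdio>
    #include <queue>

    // Illustrative only: a producer process and a consumer process connected by a
    // FIFO channel. In a true KPN the consumer's read would block on an empty
    // channel instead of terminating, guaranteeing deterministic behavior.
    using Channel = std::queue<int>;

    void producer(Channel& out, int n) {
        for (int i = 0; i < n; ++i) out.push(i);   // write tokens to the FIFO
    }

    void consumer(Channel& in) {
        while (!in.empty()) {
            int token = in.front(); in.pop();      // read tokens in FIFO order
            std::printf("consumed %d\n", token);
        }
    }

    int main() {
        Channel ch;
        producer(ch, 4);
        consumer(ch);
    }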
Here, we propose the distributed operation layer (DOL), a platform-independent MPSoC design flow based on the KPN model of computation and targeted at real-time multimedia and (array) signal processing applications. The DOL design cycle follows the Y-chart approach, in which the application specification is platform-independent and needs to be related to a concrete architecture by means of an explicit mapping. Specifically, by using a KPN-based design flow, application developers only need to develop a high-level specification of the application while the architecture-specific implementation is automatically synthesized, leading to implementations that are correct-by-construction. Moreover, a formal performance analysis model is automatically generated from the system description, and the design space can be explored early, at the system level, for an optimal mapping implementation.
This talk will highlight specific aspects in the DOL design flow, proposing a system-level solution for MPSoC mapping optimization, built on top of formal performance analysis, fast design space exploration, and automated software synthesis tools.

Iuliana Bacivarov, Swiss Federal Institute of Technology Zurich
Iuliana Bacivarov joined ETH Zürich, Switzerland, as a post-doctoral researcher in the Computer Engineering and Networks Laboratory in 2006. Her research interests include Multi-Processor System-on-Chip (MPSoC) design methods, models, and tools, in particular system-level models and methods for the performance evaluation and optimization of mapping parallel applications to MPSoCs. Currently, her research is applied to EU-funded large-scale research projects related to MPSoC systems: SHAPES, EURETILE, PRO3D, COMBEST, and PREDATOR. She received her MS and PhD degrees in Micro- and Nano-Electronics from the National Polytechnic Institute of Grenoble, France, in 2003 and 2006, respectively. She received the Electronics Engineer degree from the Polytechnic University of Bucharest, Romania, in 2002.
14:00 — 14:30 CAM: Constraint-aware Application Mapping for Embedded Systems
The increasing demand for low-power, high-performance, and secure embedded systems has motivated the need for effective solutions to satisfy application bandwidth and latency requirements under a tight power budget. As technology scales, it is imperative that applications are optimized to take full advantage of the underlying resources and meet both power and performance requirements. Furthermore, the increasing amount of shared (memory and/or computational) resources on emerging embedded systems exacerbates security threats due to software exploits and software-based side-channel attacks. In this talk, we discuss CAM, a framework capable of discovering and enabling parallelism opportunities via code transformations, efficiently distributing the computational load across resources, minimizing unnecessary data transfers (on/off-chip), and determining an application's security requirements with respect to secure compute, memory, and communication resources. CAM decomposes the application's tasks into smaller units of computation called kernels, which are distributed and pipelined across the different processing resources. We exploit the ideas of inter/intra-kernel data reuse to minimize unnecessary data transfers, early execution edges to drive performance, kernel pipelining to increase system throughput, and custom policy generation to guarantee secure software execution on an embedded system in the presence of software-based side-channel attacks. Our experimental results on JPEG and JPEG2000 show up to 97% off-chip memory access reduction and up to 80% execution time reduction over standard mapping and task-level pipelining approaches. Our experimental results also show that CAM enables secure software execution with minimal performance overhead while reducing power consumption, since the policies are customized to efficiently utilize the available on-chip resources. For the case study of running DRM in secure mode concurrently with JPEG encoding, we observe a 61% performance improvement when compared to standard approaches. Non-secure applications were observed to resume execution up to 99% faster than with the traditional halt approach.

Luis Bathen and Nikil Dutt, University of California, Irvine
Luis Bathen is a final-year Computer Science Ph.D. candidate at the University of California, Irvine. He is associated with the UCI Center for Embedded Computer Systems. His research interests are in performance, power, reliability and security-aware embedded multiprocessor system design with focus on compiler optimizations, computer architecture, scheduling/pipelining, and memory management.
14:30 — 15:00 Control Performance-Aware Task Mapping and Schedule Synthesis for Distributed Controllers on Multiprocessor Platforms
As embedded systems become more complex and distributed – consisting of multiple processing units and communication buses – the gap between high-level control models and their actual implementations seems to widen. In this talk, we will discuss how task mapping and scheduling decisions on a multiprocessor platform influence the quality of control (e.g., the stability of the controller or its behavior during transient and steady states) of multiple distributed controllers.

Samarjit Chakraborty, Technical University of Munich
Samarjit Chakraborty is a Professor of Electrical Engineering at the Technical University of Munich, where he heads the Institute for Real-Time Computer Systems. He obtained his Ph.D. in Electrical and Computer Engineering from ETH Zurich in 2003. Prior to joining TU Munich, from 2003 to 2008 he was an Assistant Professor of Computer Science at the National University of Singapore. His research interests are primarily in system-level power/performance analysis of real-time and embedded systems.
15:00 — 15:30 Coffee Break
15:30 — 16:00 Communication Synthesis of Loop Accelerator Pipelines
Streaming applications are increasingly characterized by the presence of several communicating loop programs. That is, the data streams through several loop kernels that are arranged in a certain topology. Both task-level and data-level parallelism can be exploited by using a fixed-function accelerator pipeline to speed up these applications. Such applications can be represented by graphs, where nodes are computation kernels and edges denote communication streams. The communication synthesis for the data transfer and the synchronization between loop accelerators is a major challenge in streaming applications. The complexity of the problem arises from the fact that an optimal memory mapping and the address generation in communication subsystems for parallel data access and out-of-order communication depend on the tiling and scheduling choices. In this talk, we solve the problem of communication synthesis by transforming an intermediate representation of communicating loops in the polyhedral model into the windowed synchronous data flow (WSDF) model.
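As a rough, purely illustrative example of such communicating loop kernels with windowed, out-of-order data access in the spirit of WSDF (sizes and names are made up): one loop produces an image row by row, while the next consumes overlapping 3x3 windows, so the communication buffer must hold several rows and is read in a different order than it is written.

    #include <cstdio>

    // Illustrative only: two communicating loop kernels. The producer writes the
    // stream in row-major order; the consumer reads overlapping 3x3 windows,
    // which is the kind of access pattern a communication subsystem must support.
    constexpr int W = 8, H = 6;
    static int image[H][W];                // stands in for the inter-kernel buffer

    void produce() {                       // loop kernel 1: write the stream
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                image[y][x] = y * W + x;
    }

    void consume() {                       // loop kernel 2: 3x3 windowed reads
        for (int y = 1; y < H - 1; ++y)
            for (int x = 1; x < W - 1; ++x) {
                int sum = 0;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        sum += image[y + dy][x + dx];
                std::printf("window(%d,%d) sum=%d\n", y, x, sum);
            }
    }

    int main() { produce(); consume(); }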

Frank Hannig, University of Erlangen
Frank Hannig has led the Architecture and Compiler Design Group at the Chair of Hardware/Software Co-Design at the University of Erlangen-Nuremberg, Germany, since 2004. He received a diploma degree in an interdisciplinary course of study in Electrical Engineering and Computer Science from the University of Paderborn, Germany, in 2000 and a PhD degree (Dr.-Ing.) in Computer Science from the University of Erlangen-Nuremberg, Erlangen, Germany, in 2009. His main research interests are the design of massively parallel architectures, ranging from dedicated hardware to multi-core architectures, mapping methodologies for domain-specific computing, and architecture/compiler co-design. Dr. Hannig has authored or coauthored more than 60 peer-reviewed publications. He is a regular reviewer for multiple journals and conferences including, amongst others, IEEE SPM, IEEE TVLSI, IEEE TSP, IEEE TCAD, IEEE Design & Test, and DAC. He serves on the program committees of several international conferences (ASAP, DATE, ERSA, and DASIP). Frank Hannig is a member of the IEEE and an affiliate member of the European Network of Excellence (NoE) on High Performance and Embedded Architecture and Compilation (HiPEAC).
16:00 — 16:30 Fast and Accurate Source-Level Simulation Considering Target-Specific Compiler Optimizations
The growing complexity of embedded systems and the increasing use of multicore processors require novel solutions for a fast and accurate evaluation of non-functional properties, as existing methods are not easily applicable to these complex systems. This talk presents a new approach for fast and precise timing analysis of embedded software that takes into account the target software implementation and the influences of the underlying hardware architecture. A SystemC-based simulation approach for fast performance analysis of parallel software components is presented, which combines the advantages of formal analysis and host-compiled simulation using source-code annotation of low-level timing properties. In contrast to related source-level approaches for performance analysis, timing attributes obtained from the binary code can be annotated even if compiler optimizations are used. In order to accurately consider concurrent accesses to shared resources such as caches during source-level simulation, an extension of the SystemC TLM-2.0 standard that reduces the necessary synchronization overhead is proposed as well. This allows a fast and accurate consideration of low-level timing effects during simulation of the entire hardware platform. Since no interpretation or dynamic translation of the binary code is necessary at simulation time, the achieved simulation speed is very close to native execution on the simulation host.
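A simplified, hypothetical illustration of the annotation principle (not the presented tool's code): each source-level basic block consumes a delay estimated from the target binary, the delays accumulate in a local time counter, and the simulation synchronizes with the global model only when a time quantum is exceeded, which is what keeps host-compiled execution close to native speed.

    #include <cstdint>
    #include <cstdio>

    // Illustrative sketch: per-basic-block timing annotations with temporal
    // decoupling. All numbers and names are made up for the example.
    static uint64_t local_time_ns = 0;
    constexpr uint64_t QUANTUM_NS = 1000;

    void consume(uint64_t delay_ns) {           // annotation inserted per basic block
        local_time_ns += delay_ns;
        if (local_time_ns >= QUANTUM_NS) {      // synchronize with the global model,
            std::printf("sync: advance global time by %lu ns\n",
                        (unsigned long)local_time_ns);
            local_time_ns = 0;                  // e.g., a wait(local_time) in SystemC
        }
    }

    int dot(const int* a, const int* b, int n) {
        int s = 0;
        consume(12);                            // cost of loop setup (from the binary)
        for (int i = 0; i < n; ++i) {
            s += a[i] * b[i];
            consume(3);                         // cost of one optimized loop iteration
        }
        return s;
    }

    int main() {
        int a[256], b[256];
        for (int i = 0; i < 256; ++i) { a[i] = i; b[i] = 1; }
        std::printf("result=%d\n", dot(a, b, 256));
    }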

Oliver Bringmann, FZI
Oliver Bringmann is division manager of the research division "Intelligent Systems and Production Engineering (ISPE)" and department manager of the research group "Microelectronic System Design" at FZI Karlsruhe in Germany. He studied computer science at the University of Karlsruhe and received his doctoral degree (PhD) in computer science from the University of Tuebingen in 2001. He is responsible for the acquisition of publicly and industrially funded research projects and for the coordination of the interdisciplinary research activities at FZI. His research interests focus on ESL design and verification of systems-on-chip and distributed embedded systems.
16:30 — 17:00 Hardware Observability Framework for Nonintrusive Monitoring of Complex Embedded Systems
As the complexity of digital systems rapidly increases, designers are presented with significant challenges in monitoring, analyzing, and debugging the complex interactions of various software and hardware components. Existing hardware test and debugging methods are often intrusive, either requiring significant hardware resources or requiring the execution of the system to be halted, thus leading to system perturbations that can change the execution behavior to an extent that the erroneous behavior can no longer be observed – or lead to system failure due to missed execution deadlines. In this talk, we give an overview of our recent research in developing a framework for nonintrusive hardware observability. This framework provides designers with the ability to monitor complex application-specific hardware execution behavior at runtime without impacting the system execution.

Roman Lysecky, University of Arizona
Roman Lysecky is an Assistant Professor of Electrical and Computer Engineering at the University of Arizona. He received his B.S., M.S., and Ph.D. in Computer Science from the University of California, Riverside in 1999, 2000, and 2005, respectively. His primary research interests focus on embedded systems design, with emphasis on dynamic adaptability, hardware/software partitioning, field-programmable gate arrays (FPGAs), and low-power methodologies. He has coauthored two textbooks on hardware description languages, published dozens of research papers in top journals and conferences, and holds one US patent. He received a CAREER award from the National Science Foundation in 2009, Best Paper Awards from the International Conference on Hardware-Software Codesign and System Synthesis (CODES+ISSS) and the Design Automation and Test in Europe Conference (DATE), and an Outstanding Ph.D. Dissertation Award from the European Design and Automation Association (EDAA) in 2006.
17:00 — 17:30 Closing