This is a project to develop algorithmic innovations and optimizations for power efficiency in the emerging landscape of embedded computing platforms. Many technological advances are being proposed by the community, which must be effectively exploited at the software and application layers to achieve and sustain the power envelope of the application. These include innovative power-efficient devices and circuits, spatial and temporal heterogeneity, microarchitectural capabilities to mask run-time errors due to process variations, and mechanisms for continuous feedback to the software layers. However, leveraging such advances requires algorithmic solutions to handle the potential for run-time errors due to near-threshold computing.


We propose model-driven power optimization at the architecture-algorithm abstraction for the following reasons:

  1. Optimization at the algorithmic level has a much higher impact on total energy dissipation than optimization at the microarchitecture or circuit level. Recent studies report an impact ratio of roughly 20:2.5:1 for algorithm-, register-, and circuit-level energy optimizations.
  2. Optimizing at the architecture-algorithm level allows power to be effectively traded off with other performance parameters. For example, a design consuming 2x power but achieving 3x system throughput is actually 50% more power efficient than the original design.
  3. Technologies proposed by the broader community are expected to include more flexible clock gating, a richer set of integrated features, and user-controllable low-power modes. Most of these power-saving features can be effectively exploited only through algorithmic and architectural design optimizations.
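The arithmetic behind the second point above can be made explicit: power efficiency can be measured as throughput per unit power, so a design drawing 2x the power while delivering 3x the throughput is 3/2 = 1.5x as efficient. The helper below is an illustrative sketch only, not part of the proposed framework.

```python
def relative_efficiency(power_ratio, throughput_ratio):
    """Power efficiency (throughput per watt) relative to a baseline design."""
    return throughput_ratio / power_ratio

# A design consuming 2x power but achieving 3x system throughput:
print(relative_efficiency(2.0, 3.0))  # 1.5, i.e. 50% more power efficient
```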

Objectives

  1. Develop algorithmic optimizations for latency, throughput and energy performance,
  2. Identify opportunities to integrate these optimizations into the compilation capability and to enable overall application composition, and
  3. Demonstrate improved energy and resilience performance for signal processing kernels of interest to DARPA and DOD on next generation embedded computing platforms.


Innovation

  1. Model based exploration of algorithm design space
    1. Tunable parallelism and energy, latency, resilience tradeoffs: Specification of parallel algorithms for kernels with a small number of parameters including latency, energy, resiliency, input (problem) size, and number of processors. The framework will permit the designer to explore the algorithm space by varying the parameters.
    2. Kernel specific architecture-algorithm modeling: A high-level performance model, denoted the Integrated Computational Model (ICOM), to abstract heterogeneous multicore architectures, including their storage and communication features, and enable design-time optimizations. Explicit modeling of storage and data transfer energy costs, as well as spatial and temporal heterogeneity, will allow the space of kernel implementations to be explored.
    3. Hierarchical design space exploration: Exploration of the architecture-algorithm pair on the target hardware using ICOM. High-level performance modeling will enable rapid design space exploration, followed by detailed simulations using the simulation capability that will be provided separately in this program.
  2. Design time algorithmic optimizations for energy performance
    1. Communication energy optimization: Novel data remapping optimizations to minimize energy costs due to communication. In addition to the initial placement of the data in memory, it is also important that intermediate results be appropriately placed. Thus, the data in memory may need to be reorganized as the computations progress.
    2. Memory energy optimization: Adaptation of layout functions (e.g., Latin squares) to optimize data layout for memory system performance. ICOM will explicitly model the energy cost of storage and access, and heuristic techniques for layout and scheduling will be designed using cost functions in ICOM.
    3. Incorporating heterogeneity: Split jobs into smaller, independent, staged tasks. This decomposition allows the design space exploration phase to use ICOM to select a partitioning that matches the available resources, including both general-purpose CPUs and accelerators. It also allows tasks to be annotated with resiliency properties at design time that can be leveraged by the run-time.
  3. Algorithm adaptation at run-time
    1. Dynamic computational model (DCOM): Domain-specific instrumentation of the performance, energy and workload characteristics that will be used to optimize algorithms according to their run-time behavior. This technique is similar to profile-driven compilation, but will be at the algorithm level instead of at the instruction level.
    2. Off-line dynamic model-based optimization: We will develop algorithm optimization techniques that can be applied at design time that leverage the instrumentation and model information collected from DCOM. These techniques will be similar to those described earlier in ICOM, but the optimization decisions will utilize information collected at run-time during profile runs that use test data sets or run in situ.
    3. On-line dynamic model-based optimization: We will also develop optimization techniques that can be used on-line. Again, these optimization techniques will be similar to those described before, but the decision logic will have to be lightweight, so that the benefits of the optimizations are not offset by the cost of performing the optimization.
    4. Resiliency: We will develop generalizable and kernel-specific recovery techniques for resilient completion of the application in the presence of soft errors at run-time. We will explore model-based techniques such as identifying critical sections in different tasks of the algorithm, replicating tasks and their mapping to the architecture to survive soft errors, and characterizing error propagation pathways to limit the scope of recovery. Algorithmic approaches include lightweight validation of invariants at run-time for early error detection, and aggregative or iterative refinement techniques for hiding errors.
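As a rough illustration of the model-driven exploration in items 1.1 through 1.3, the sketch below sweeps a processor-count parameter against a toy latency/energy cost model. The cost constants and the `estimate` function are invented for illustration; they stand in for an actual ICOM instance, which would model the storage and communication costs of the target platform.

```python
# Illustrative per-operation energy costs (arbitrary units) -- assumptions,
# not values from the proposal.
E_COMPUTE = 1.0      # energy per arithmetic operation
E_TRANSFER = 5.0     # energy per word moved between cores

def estimate(n, p):
    """Toy latency/energy model for an n-point kernel on p processors.

    Assumes perfectly divisible work plus a data exchange whose volume
    grows with p -- a stand-in for explicit communication modeling.
    """
    compute_latency = n / p
    comm_volume = n * (p - 1) / p            # words exchanged
    energy = n * E_COMPUTE + comm_volume * E_TRANSFER
    latency = compute_latency + comm_volume / p
    return latency, energy

def explore(n, processors, latency_budget):
    """Return (p, latency, energy) candidates meeting the latency budget,
    sorted by energy -- the tunable parallelism tradeoff of item 1.1."""
    feasible = []
    for p in processors:
        lat, en = estimate(n, p)
        if lat <= latency_budget:
            feasible.append((p, lat, en))
    return sorted(feasible, key=lambda t: t[2])

best = explore(n=4096, processors=[1, 2, 4, 8, 16], latency_budget=2048)
```

Under this toy model, adding processors lowers latency but raises communication energy, so the lowest-energy feasible design uses the fewest processors that still meet the latency budget.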
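The Latin-square layouts mentioned in item 2.2 can be sketched with a simple bank-skewing function. The mapping below is a standard skewed layout, shown here only to illustrate how such a layout function avoids memory bank conflicts; it is not the project's actual layout heuristic.

```python
def latin_square_bank(i, j, num_banks):
    """Skewed (Latin-square) layout: element (i, j) is stored in bank
    (i + j) mod num_banks.  Any num_banks consecutive elements of a row
    or of a column then touch each bank exactly once, so both row-major
    and column-major accesses proceed without bank conflicts."""
    return (i + j) % num_banks

B = 4
row_banks = [latin_square_bank(0, j, B) for j in range(B)]   # one row
col_banks = [latin_square_bank(i, 0, B) for i in range(B)]   # one column
# both are permutations of {0, 1, 2, 3}: conflict-free either way
```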
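The lightweight run-time invariant validation of item 3.4 can be illustrated with a classical checksum invariant for matrix multiplication (algorithm-based fault tolerance): if C = A·B, then the row sums of C must equal A applied to the row-sum vector of B, so a corrupted element is detected at the cost of a few extra sums. This is a minimal sketch of the idea, not the recovery machinery the project proposes.

```python
def matmul(A, B):
    """Plain triple-loop matrix product (lists of lists)."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def row_sums(M):
    return [sum(row) for row in M]

def check_product(A, B, C, tol=1e-9):
    """Invariant: row_sums(C) == A . row_sums(B), since summing C's row i
    over j factors into sum_k A[i][k] * (sum_j B[k][j])."""
    expected = matmul(A, [[s] for s in row_sums(B)])
    actual = row_sums(C)
    return all(abs(actual[i] - expected[i][0]) <= tol for i in range(len(C)))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
ok = check_product(A, B, C)          # error-free product passes the check
C[0][0] += 1                         # inject a single soft error
detected = not check_product(A, B, C)  # the checksum invariant flags it
```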

Funding Agency and Program

  • This work is funded by the DARPA PERFECT (Power Efficiency Revolution For Embedded Computing Technologies) program.


Publications
  1. Andrea Sanny and Viktor K. Prasanna, Energy-Efficient Median Filter on FPGA, accepted in IEEE International Conference on ReConFigurable Computing and FPGAs (ReConFig '13), December 2013.
  2. Kiran Kumar Matam and Viktor K. Prasanna, Energy-Efficient Large-Scale Matrix Multiplication on FPGAs, accepted in IEEE International Conference on ReConFigurable Computing and FPGAs (ReConFig '13), December 2013.
  3. Ren Chen and Viktor K. Prasanna, Energy-Efficient Architecture for Stride Permutation on Streaming Data, accepted in IEEE International Conference on ReConFigurable Computing and FPGAs (ReConFig '13), December 2013.
  4. Kiran Kumar Matam, Hoang Le, and Viktor K. Prasanna, Evaluating Energy Efficiency of Floating Point Matrix Multiplication on FPGAs, IEEE High Performance Extreme Computing Conference (HPEC '13), September 2013.
  5. Ren Chen, Neungsoo Park, and Viktor K. Prasanna, High Throughput Energy Efficient Parallel FFT Architecture on FPGAs, IEEE High Performance Extreme Computing Conference (HPEC '13), September 2013.
  6. Kiran Kumar Matam, Hoang Le, and Viktor K. Prasanna, Energy Efficient Architecture for Matrix Multiplication on FPGAs, IEEE International Conference on Field Programmable Logic and Applications (FPL '13), August 2013.
  7. Ren Chen, Hoang Le, and Viktor K. Prasanna, Energy Efficient Parameterized FFT Architecture, IEEE International Conference on Field Programmable Logic and Applications (FPL '13), August 2013.
  8. Kiran Kumar Matam and Viktor K. Prasanna, Algorithm Design Methodology for Embedded Architectures, International Symposium on Applied Reconfigurable Computing (ARC '13), March 2013.