Aging Management at the Circuit, Architecture and Runtime Levels
The broad objective, relevant to the overall expedition goal, is to design opportunistic software practices that best exploit and adapt to variations in the underlying hardware, the environment, and application requirements. These variations are exposed to the higher software layers rather than hidden behind the conservative specifications of over-designed hardware. As a result, deliberately under-designed hardware with relaxed design and manufacturing constraints can achieve a lower overall cost.
Our activities have two broad objectives:
1. Designing an optimization framework and control policies that find the optimal self-tuning over the system lifetime, guaranteeing functional operation in the presence of circuit aging (across a broad range of circuit aging mechanisms), subject to constraints imposed by user application profiles and the underlying hardware. The self-tuning parameters, such as supply voltage, clock frequency, and cooling capacity, may be adjusted dynamically based on offline aging estimation or, for systems with built-in mechanisms for learning and forecasting real-time aging information during normal operation, adaptively based on online aging estimation. Because aging is gradual and depends on dynamic factors, such a self-tuning system can be more robust and energy-efficient.
2. Developing extremely low-cost techniques for extracting special failure signatures from the underlying hardware, enabling efficient and effective circuit failure prediction and adaptation through a wide variety of software techniques (with proper architectural support).
An optimization method is introduced that enables efficient and scalable numerical computation of the optimal self-tuning in the presence of multiple age-related wear-out mechanisms. The framework unifies the aging mechanisms (e.g., NBTI, PBTI, and TDDB) that gradually degrade inherent circuit delay and power characteristics with the aging mechanisms (e.g., HCI, EM, SM, GOI) that gradually degrade breakdown characteristics, which are typically accounted for in terms of FIT. The correlations and constraints among the various reliability mechanisms are inherently accounted for. Self-healing is also introduced as a potential self-tuning parameter. The objective of the constrained multi-objective optimization is to achieve the optimal trade-off among lifetime, overall performance, overall power, and overall reliability by finding the optimal value of one (or some) of these attributes subject to requirements on the others. This aging-aware design paradigm emphasizes capturing the long-term behavior of the system and averaging its transient behavior. Rather than targeting a single specific design point, which becomes a bottleneck, the framework adaptively trades off system properties in the multi-dimensional design space. This enables flexible design practice with lower overall costs, in which various market opportunities can potentially be met with minimal or no design changes. The framework is also expected to scale fully with user inputs (e.g., the number of degradation mechanisms, and the size and type of large-scale structures and workload benchmarks). A full report on our activities is being prepared. ARM has expressed special interest in this work, and a summer internship is planned at ARM to evaluate these techniques on ARM designs.
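The self-tuning idea can be illustrated with a deliberately small sketch. It is not the project's optimization framework: it assumes a single aging mechanism modeled as a power-law threshold-voltage shift, an alpha-power-law gate delay, a fixed clock period, and energy proportional to Vdd squared. Every constant and function name below (`VTH0`, `AGING_COEFF`, `min_vdd`, `compare`, and so on) is hypothetical and chosen only for illustration. The sketch compares a fixed guardbanded Vdd, sized for end of life, with a schedule that picks the lowest feasible Vdd each year.

```python
"""Toy comparison of a guardbanded Vdd against a self-tuned Vdd schedule.

All constants are illustrative assumptions, not fitted to any technology.
"""

VTH0 = 0.30           # fresh threshold voltage in volts (assumed)
ALPHA = 1.3           # alpha-power-law exponent (assumed)
AGING_COEFF = 0.012   # Vth shift in volts after 1 year at NOMINAL_VDD (assumed)
AGING_EXP = 1.0 / 6   # power-law time exponent (assumed)
NOMINAL_VDD = 1.0     # volts
VDD_GRID = [round(0.80 + 0.005 * i, 3) for i in range(81)]  # 0.80 V to 1.20 V


def gate_delay(vdd, dvth):
    """Alpha-power-law gate delay in arbitrary units."""
    return vdd / (vdd - VTH0 - dvth) ** ALPHA


def vth_shift(years, vdd):
    """Power-law Vth shift, scaled linearly with stress voltage (a simplification)."""
    return AGING_COEFF * (vdd / NOMINAL_VDD) * years ** AGING_EXP


def min_vdd(years, target_delay):
    """Lowest grid voltage that still meets timing at the given age.

    The shift is evaluated as if the candidate Vdd had been applied for the
    whole history, which is pessimistic for a schedule that only rises.
    """
    for vdd in VDD_GRID:
        if gate_delay(vdd, vth_shift(years, vdd)) <= target_delay:
            return vdd
    raise ValueError("no grid voltage meets the timing target")


def compare(lifetime_years=10):
    """Return (guardband Vdd, yearly self-tuned schedule, fractional energy saving)."""
    target = gate_delay(NOMINAL_VDD, 0.0)          # fresh delay at nominal Vdd
    guardband = min_vdd(lifetime_years, target)    # fixed Vdd sized for end of life
    schedule = [min_vdd(year, target) for year in range(1, lifetime_years + 1)]
    tuned_energy = sum(v * v for v in schedule)    # energy taken as proportional to Vdd^2
    fixed_energy = lifetime_years * guardband * guardband
    return guardband, schedule, 1.0 - tuned_energy / fixed_energy


if __name__ == "__main__":
    guardband, schedule, saving = compare()
    print("guardband Vdd:", guardband)
    print("self-tuned schedule:", schedule)
    print("fractional energy saving:", round(saving, 4))
```

In this toy model the self-tuned schedule ends at the guardband voltage, so any saving comes only from the earlier years. The actual framework additionally handles multiple mechanisms, their correlations, and constraints on performance, power, and reliability, none of which is modeled here.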
The goal of this project is to assess the efficacy of current aging management techniques used at the circuit, architecture, and OS levels and to propose methods to deal with aging. Our first goal has been to develop a first-principles model of aging (especially NBTI) that is flexible enough to use at the architecture level.
Publications:
• "Self-Tuning for Maximized Lifetime Energy-Efficiency in the Presence of Circuit Aging," E. Mintarno, J. Skaf, R. Zheng, J. Velamala, Y. Cao, S. Boyd, R.W. Dutton and S. Mitra. IEEE Trans. Computer-Aided Design, pp. 760-773, 05-01-11
• "On the Efficacy of NBTI Mitigation Techniques," Tuck-Boon Chan, John M. Sartori, Puneet Gupta and Rakesh Kumar. Proc., IEEE/ACM 2011 Design, Automation and Test in Europe, 03-18-11
• "Robust System Design," S. Mitra, H. Cho, T. Hong, Y. Kim, H. Lee, L. Leem, Y. Li, D. Lin, E. Mintarno, S. Park, N. Patil, H. Wei and J. Zhang. IPSJ Trans. System LSI Design Methodology 2011, 02-11-11
• "Robust System Design to Overcome CMOS Reliability Challenges," S. Mitra, K. Brelsford, Y. Kim, K. Lee and Y. Li. IEEE Journal on Emerging and Selected Topics in Circuits and Systems: Special Issue on the IEEE CAS Forum, 12-15-10
Milestones: We showed that in most cases, for long-lifetime systems (5+ years), the benefits of dynamic reliability management (dynamic voltage scaling, activity management with scheduling, power gating, etc.) are fairly limited compared with traditional guardbanding approaches.
In Stanford's previous work with 90nm test chips, we demonstrated that the existence of delay fluctuations is a new signature for detecting gate-oxide early-life failures (ELF, also known as infant mortality). During this quarter, we created a test chip design plan for a 32nm CMOS technology that will be used to answer the following questions:
• Do delay fluctuations continue to be valid signatures for gate-oxide ELF at advanced technologies using high-k/metal gate processes? Our ongoing collaboration with a large semiconductor company has already demonstrated the validity of our signature; our current effort is to validate it across multiple technologies.
• How do dynamic environmental variations (e.g., temperature drift and voltage droop) affect the effectiveness of our gate-oxide ELF detection technique?
• Do defects causing ELF in inter- or intra-layer dielectrics (ILD) exhibit similar signatures for advanced technologies using ultra low-k (ULK) processes?
Samsung is working very closely with us in this effort.
Plans/Outlook: Near-term goals include a CUDA implementation of the aging simulator, since runtime is still a major bottleneck. We also plan to study the potential benefits of fine-grained power gating for aging management, and we are investigating ways to derive an “optimal guardband” for aging. Longer term, obtaining circuit-level (as opposed to device-level) aging silicon data remains a challenge.
Figure: (Top) Guardbanded Vdd vs. dynamically scaled Vdd for reliability management; the expected energy saving is less than 7% over a 10-year lifetime. (Bottom) Power gating to manage aging: 6 years of power gating (out of a 10-year lifetime) are needed to get a 5% improvement in performance.
Category: Design Tools / Testing, Micro-Architecture / Compilers, Runtime Support
Campus: UCLA, UIUC, Stanford
People: PIs: Puneet Gupta (UCLA), Subhasish Mitra (Stanford) and Rakesh Kumar (UIUC); Graduate Students: Tuck-Boon Chan (UCLA, now at UCSD), Liangzhen Lai (UCLA), Evelyn Mintarno and Young Moon Kim (Stanford), and John Sartori (UIUC); Undergraduate Student: Johnny Yam (UCLA)
Artifacts: Release of open-source aging simulator. The simulator can take arbitrary workloads, voltage schedules, etc. Both a Matlab and a C++ version have been released, and they are available at http://nanocad.ee.ucla.edu/Main/DownloadForm.
Awards:
• Best Paper Award, "Concurrent Autonomous Self-Test for Uncore Components in SoCs," Yanjing Li, Onur Mutlu, Donald S. Gardner and Subhasish Mitra, IEEE VLSI Test Symposium, April 2010.
• ACM/SIGDA Outstanding Dissertation Award, "Design and Fabrication of Imperfection-Immune Carbon Nanotube Digital VLSI Circuits," Nishant Patil (advised by Subhasish Mitra). URL: http://c2s2.ece.cmu.edu/news/article/2011/05/04/patil-dissertation-award/