Nvidia: Graphics Chips with 20TFLOPS DP Performance Needed for ExaFLOPS Supercomputers.

Nvidia Discusses Future Graphics Chips for Extreme Graphics and Exascale Computing

by Ilya Gavrichenkov
11/24/2010 | 05:51 PM

Nvidia Corp. shed some light on its exascale development project, known as Echelon, at the SC10 supercomputing trade show last week. The company's researchers are convinced that machines capable of performing at least a quintillion double-precision floating-point operations per second (10^18 FLOPS) should be heterogeneous, i.e. employ both highly parallel and high-performance serial processors.


Even though today's graphics processing units (GPUs) are more efficient than central processing units (CPUs) in terms of raw performance per watt, even modern GPUs cannot power supercomputers that are 1000 times faster than current ones while consuming a reasonable amount of energy. As a result, graphics processors have to evolve radically and rapidly in order to enable exascale systems in the 2018-2020 timeframe. Moreover, new programming models and paradigms are required to program heterogeneous systems efficiently.

William Dally, the chief scientist of Nvidia, who also heads development of the Echelon extreme-scale computing project, partly funded by DARPA under the Ubiquitous High Performance Computing (UHPC) program, shared his thoughts about the future chips capable of powering an ExaFLOPS-class supercomputer.

Sketch of Nvidia Echelon research system

According to Steve Keckler, the director of architecture research at Nvidia, the Echelon design incorporates a large number (~1024) of stream cores and a smaller number (~8) of latency-optimized CPU-like cores on a single chip, sharing a common memory system. Just like in current architectures, eight stream cores will form a streaming multiprocessor (SM), and 128 SMs will form the large pool of throughput-optimized processing elements. Such a chip could deliver 20 teraFLOPS of double-precision performance, and a number of them will form a 2.6 petaFLOPS rack. At present, the Nvidia Fermi (GF110) chip with 512 stream processors operating at 1544MHz can deliver 0.79TFLOPS of DP compute performance. Considering the roughly 25-fold difference in performance, it is highly likely that Echelon will employ a post-Maxwell (Maxwell is due around 2013-2014) Nvidia GPU design.
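The throughput figures above can be cross-checked with simple arithmetic. The Python sketch below is only an illustration: the function and variable names are made up for this example, and the value of one double-precision FLOP per stream processor per cycle on Fermi is an assumption chosen because it reproduces the quoted 0.79TFLOPS, not a figure stated in the article.

```python
# Sanity check of the throughput figures quoted in the article.

def peak_dp_tflops(cores: int, clock_mhz: float, dp_flops_per_cycle: float) -> float:
    """Peak double-precision TFLOPS for `cores` units running at `clock_mhz`."""
    return cores * clock_mhz * 1e6 * dp_flops_per_cycle / 1e12

# ASSUMPTION (not from the article): one DP FLOP per stream processor per cycle.
FERMI_DP_FLOPS_PER_CYCLE = 1.0

fermi_tflops = peak_dp_tflops(512, 1544, FERMI_DP_FLOPS_PER_CYCLE)
echelon_tflops = 20.0   # per chip, as quoted
rack_pflops = 2.6       # per rack, as quoted

stream_cores = 8 * 128  # 8 stream cores per SM, 128 SMs per chip
speedup = echelon_tflops / fermi_tflops
chips_per_rack = rack_pflops * 1000 / echelon_tflops
racks_per_exaflops = 1e18 / (rack_pflops * 1e15)

print(f"Stream cores per chip:  {stream_cores}")
print(f"Fermi peak DP:          {fermi_tflops:.2f} TFLOPS")
print(f"Echelon vs. Fermi:      {speedup:.1f}x")
print(f"Chips per 2.6PF rack:   {chips_per_rack:.0f}")
print(f"Racks for one exaFLOPS: {racks_per_exaflops:.0f}")
```

Under these assumptions, the quoted rack performance corresponds to roughly 130 chips per rack, and an exaFLOPS machine to a few hundred such racks. Both numbers are derived here from the quoted totals and were not given by Nvidia.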

In order to keep the power consumption of such a chip relatively low, stream processors have to perform a double-precision floating-point operation using just 10 picojoules of energy, down from 200 picojoules on Nvidia's current Fermi chips, the EETimes website quoted Mr. Dally as saying. To reach the target performance within that energy budget, each of the 1024 stream processors per chip has to perform four floating-point operations per cycle.
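To see why the energy per operation matters so much, it helps to multiply it by the target throughput. The sketch below is a rough estimate only: it counts the energy of the floating-point operations themselves and ignores memory, interconnect and leakage, and the helper name `compute_power_watts` is invented for this example.

```python
# Power spent on arithmetic alone = operations per second * energy per operation.

PICOJOULE = 1e-12  # joules

def compute_power_watts(flops_per_second: float, picojoules_per_flop: float) -> float:
    """Watts consumed by the floating-point operations themselves."""
    return flops_per_second * picojoules_per_flop * PICOJOULE

chip_flops = 20e12  # 20 teraFLOPS per chip, as quoted

target_watts = compute_power_watts(chip_flops, 10)   # Echelon target: 10 pJ per FLOP
fermi_watts = compute_power_watts(chip_flops, 200)   # Fermi-class: 200 pJ per FLOP

# The same estimate for a full exaFLOPS machine at 10 pJ per FLOP.
exa_megawatts = compute_power_watts(1e18, 10) / 1e6

print(f"20 TFLOPS at 10 pJ/FLOP:  {target_watts:.0f} W")
print(f"20 TFLOPS at 200 pJ/FLOP: {fermi_watts:.0f} W")
print(f"1 exaFLOPS at 10 pJ/FLOP: {exa_megawatts:.0f} MW")
```

At Fermi-class efficiency, a 20 teraFLOPS chip would spend several kilowatts on arithmetic alone, which is why the 20-fold reduction in energy per operation is central to the design.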

To further trim power usage, Nvidia intends to integrate a large number (~1024) of configurable 256KB SRAM banks into the chip. The huge amount of on-chip memory should keep as much data on the chip as possible, and as close to the processing elements as possible, to avoid power-costly fetch operations wherever feasible. The SRAM banks should be configurable and able to act as a unified memory pool, as dedicated caches for processing elements, as explicitly managed shared memory, and so on.
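The total on-chip memory implied by those figures is easy to work out. The snippet below assumes that "256KB" means 256 x 1024 bytes and that the bank count is exactly 1024; the article gives both only as approximate values.

```python
# Total on-chip SRAM implied by ~1024 banks of 256KB each.

BANKS = 1024            # approximate count from the article
BANK_KIB = 256          # ASSUMPTION: 1KB = 1024 bytes
STREAM_CORES = 1024     # approximate count from the article

total_bytes = BANKS * BANK_KIB * 1024
total_mib = total_bytes // (1024 * 1024)
kib_per_core = BANKS * BANK_KIB // STREAM_CORES  # if split evenly across stream cores

print(f"Total on-chip SRAM: {total_mib} MiB")
print(f"SRAM per stream core (even split): {kib_per_core} KiB")
```

The even split per stream core is purely illustrative, since the banks are described as configurable rather than statically assigned.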

At present, Echelon is only a research project and not a chip on Nvidia's roadmap. In some respects, Echelon is much like Intel's Single-chip Cloud Computer (SCC), which belongs to Intel's Tera-Scale research project.