SC19 was red hot this year as the race to exascale computing moved into top gear. Not even the snow on the last afternoon dampened the collective ‘exascale enthusiasm’. SC19 is our industry’s exhibition pinnacle and, as usual, the weekend before the show opens on Monday evening is packed with training sessions, briefings and industry updates covering everything from the latest HPC and AI product releases and tools to tours of nearby supercomputing centers.
With my green Icelandic data center heritage, I was drawn to the tour of the National Renewable Energy Laboratory (NREL), about an hour west of Denver, set against the beautiful foothills of the Rocky Mountains.
Their supercomputer is located in the building on stilts on the right side of the picture. After some fascinating briefings about their wind and geothermal energy research, we got practical first-hand insight into wind turbine blade design through a virtual reality visualization of the airflow around the blades and their interaction with each other. It was captivating stuff.
Thereafter we toured the Eagle supercomputer, which is 100% water-cooled, including its NVIDIA V100 GPUs. This interested me because NVIDIA only warranties the V100 for air cooling. NREL was unfazed by this detail, commenting that as a national laboratory it needs to push the boundaries both of science and of what its computers can do. With renewable energy in its DNA, NREL also uses the heat from the supercomputer to warm the buildings on the campus, creating a nice circular heat economy:
Anyway, back to SC19, where my favorite exascale announcement of the show was the launch of the Graphcore IPU. It’s available in Dell DSS 8440 servers and in the cloud, and it has great hardware specs:
Graphcore’s new Colossus GC2 chip holds 1216 IPU-Cores™. Each IPU-Core runs at 100 GFLOPS and can run seven threads. The GC2 chip carries 300MB of in-processor memory, with an aggregate 30TB/s of memory bandwidth. Each IPU-Core supports low-precision floating-point arithmetic in fully parallel, concurrent execution. The GC2 chip has 23.6B transistors!
Each GC2 chip supports 80 IPU-Links™ to connect to other GC2 chips with 2.5Tbps of chip-to-chip bandwidth, and includes a PCIe Gen 4 x16 link (31.5GB/s) to the host processors. Additionally, it supports up to 8TB/s of on-chip IPU-Exchange™ bandwidth for communication between IPU-Cores.
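Those per-core numbers add up quickly. As a back-of-the-envelope sketch (using only the spec figures quoted above, not any measured benchmark):

```python
# Back-of-the-envelope aggregate figures for one GC2 chip, computed
# from the quoted spec numbers above. Illustrative arithmetic only.
IPU_CORES = 1216         # IPU-Cores per GC2 chip
GFLOPS_PER_CORE = 100    # peak GFLOPS per IPU-Core
THREADS_PER_CORE = 7     # hardware threads per IPU-Core

peak_tflops = IPU_CORES * GFLOPS_PER_CORE / 1000  # GFLOPS -> TFLOPS
total_threads = IPU_CORES * THREADS_PER_CORE

print(f"Peak chip throughput: {peak_tflops:.1f} TFLOPS")  # 121.6 TFLOPS
print(f"Concurrent threads:   {total_threads}")           # 8512
```

In other words, a single GC2 offers roughly 120 TFLOPS of peak low-precision compute and over eight thousand concurrent threads.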
The architectural secret sauce is having enough memory on the GC2 chip for the whole model to stay resident in chip memory; the full-mesh, high-speed connectivity helps too. This all results in fabulous benchmarks like this one for convolution training:
This device is particularly relevant to machine vision, natural language processing, autonomous vehicle and security applications, where FP64 is infrequently used. So far nobody has announced a supercomputer using IPUs, but I would expect to see that happen next year at ISC20 or SC20 – stay tuned!
As a sprog engineer, I learned about Direct Memory Access (DMA) hardware acceleration in early microprocessors. Given a source memory address, a destination memory address and the size of the transfer, special hardware moves the data without the need for a program that copies it a byte or word at a time through the CPU. Some recent evolutions of this theme have been making a big impact on HPC.
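The contrast between the two approaches can be sketched in a few lines. This is a toy illustration of the idea only; the function names are my own invention, not any real driver API:

```python
# Toy illustration of the DMA idea: the 'CPU' way copies one byte at a
# time in a loop, while the 'DMA' way hands off a single descriptor
# (source, destination, size) and the copy happens as one bulk move.
# Purely illustrative Python, not real hardware programming.

def cpu_copy(src: bytes, dst: bytearray, size: int) -> None:
    """Programmed I/O: the CPU moves every byte itself."""
    for i in range(size):
        dst[i] = src[i]

def dma_copy(src: bytes, dst: bytearray, size: int) -> None:
    """DMA-style: one descriptor, one bulk move (the hardware
    would perform this transfer without occupying the CPU)."""
    dst[:size] = src[:size]

payload = bytes(range(256)) * 4            # 1 KiB of test data
buf_a, buf_b = bytearray(1024), bytearray(1024)
cpu_copy(payload, buf_a, len(payload))
dma_copy(payload, buf_b, len(payload))
assert buf_a == buf_b == bytearray(payload)  # both paths yield the same data
```

The win, of course, is that while the descriptor-driven transfer is in flight, the CPU is free to do useful work instead of shuffling bytes.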
InfiniBand networking has long been the go-to connectivity for multi-node compute, and it uses a similar concept, Remote Direct Memory Access (RDMA), across the network. At SC19, NVIDIA CEO Jensen Huang used his annual keynote to announce Magnum IO, a suite of software optimized to eliminate storage and input/output bottlenecks using GPUDirect – a GPU DMA technology.
Magnum IO, using GPUDirect, delivers up to 20x faster data processing for multi-server, multi-GPU computing nodes when working with massive datasets to carry out complex financial analysis, climate modeling and other HPC workloads.
Yellowbrick Data, a Palo Alto-based storage company recently out of stealth mode, also leverages DMA technology, this time using an updated BIOS to move data directly from NVMe storage to the CPU cache, with amazing results: “Plow through data 10-100x faster with Yellowbrick. Ad-hoc workloads on Yellowbrick run faster than heavily tuned, indexed queries on other data warehouses.” We’re pleased to have one of the first Yellowbrick nodes in our Icelandic data center, nestled among a myriad of other HPC kit.
Flash memory and clever DMA produce outstanding storage performance
Intel has been catching up in the accelerated computing market. Its Nervana NNP-I AI chips and boards were visible everywhere, and the company announced its new Xe line of GPUs – the top product is named Ponte Vecchio and will be used in Aurora, the first planned U.S. exascale computer.
An Intel gem not yet ready for prime time was their neuromorphic accelerator, a potential next step beyond DNN techniques. This advancement may exploit spiking neural networks (SNNs), a model that arranges artificial neurons to emulate the natural neural networks found in biological brains. Each “neuron” in an SNN can fire independently of the others and, in doing so, sends pulsed signals to other neurons in the network that directly change their electrical states. By encoding information in the signals themselves and in their timing, SNNs simulate natural learning processes by dynamically remapping the synapses between artificial neurons in response to stimuli.
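To make the firing behavior concrete, here is a minimal leaky integrate-and-fire neuron, a common building block in SNN models. This is a toy sketch with made-up constants, not Loihi’s actual neuron model:

```python
# Minimal leaky integrate-and-fire (LIF) neuron: the membrane potential
# leaks over time, integrates weighted input spikes, and the neuron
# fires (then resets) when the potential crosses a threshold.
# Toy constants chosen for illustration, not Intel's Loihi model.

def lif_run(inputs, threshold=1.0, leak=0.9, weight=0.5):
    """Run one LIF neuron over a train of input spikes (0/1 per step),
    returning its output spike train."""
    potential = 0.0
    out_spikes = []
    for spike in inputs:
        potential = potential * leak + weight * spike  # leak + integrate
        if potential >= threshold:
            out_spikes.append(1)   # fire
            potential = 0.0        # reset after the spike
        else:
            out_spikes.append(0)
    return out_spikes

# A steady input train pushes the potential over threshold every 3 steps:
print(lif_run([1, 1, 1, 1, 1, 1]))  # → [0, 0, 1, 0, 0, 1]
```

Note how the output depends on the *timing* of inputs, not just their values: the same weight produces a spike only once enough recent inputs have accumulated, which is exactly the temporal coding described above.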
Loihi: A Neuromorphic Manycore Processor with On-Chip Learning
Intel Labs is making Loihi-based systems available to the global research community. If only I had more time to experiment with one of these puppies!
Irrespective of your future HPC/AI plans, with or without DMA or SNNs, Verne Global has your colocation bases covered with free-air cooling and very affordable power.