The AI Circuit – Rumours from the Trade Show Floor



I’m starting to feel like a Formula 1 racing driver: every month brings a new venue with huge crowds, but in my case it’s AI industry events. This autumn I’m alternating between events on either side of the Atlantic, which gives me a great insight into any differences in current practice between North America and Europe.

Back in October I attended the excellent World AI Summit in Amsterdam. It was a great event and had a very new-age European feel to it, making extensive use of video, virtual reality and animation with a video DJ as the master of ceremonies. It was quite the AI Grand Prix pit-party!

Despite the glitzy wrapper around the presentations, the event was all business, with many deep neural network (DNN) training practitioners walking the exhibition floor. I enjoyed working the aisles near our booth, talking to folks who had never considered Iceland as the home for their intensive, industrial-scale AI compute but who, after 30 seconds, had their lightbulb moment when they suddenly realised the economic savings and the benefit to their compute’s carbon footprint.

Many of the people who came to our booth had made the easy choice (many in the high performance computing world would say the lazy one) of just using a hyperscale cloud for their machine learning, but were already complaining of issues. These ranged from performance problems due to noisy neighbours to non-stellar technical support and, most obviously, pricing. Without degrees in hyperscale sliding pricing mechanics, many were struggling, and one guy I spoke to had spent his monthly budget in a couple of days without realising – ouch.

Since the summer, when I got a deep technical briefing on the NVIDIA T4 GPU’s mixed precision capabilities and we got the opportunity to benchmark it against the P100 and V100, I’ve been asking everyone what precision they train their DNNs at. I’ve noticed a dichotomy between subjective and objective datasets, with voice, language and vision datasets using lower precision than their scientific counterparts. The exceptions to this that I’ve encountered are usually due to historic development tools and are subject to future migration to lower precision when their training volumes escalate.

This has created a sustained interest in our bare metal cloud NVIDIA T4 GPUs, since our benchmark showed them performing machine vision DNN training at 90% of the V100’s performance with the latest NVIDIA drivers and appropriate tuning. Once you need ultra-speedy GPU-to-GPU connectivity such as NVLink, or FP64 arithmetic, the V100s have no comparison, especially when configured in an NVIDIA DGX chassis. Hence the number of DGX-1 and DGX-2 systems being used for autonomous driving development.


Clearly the T4 GPU has a different, less power-hungry floating-point architecture from the V100. I’m seeing the more sophisticated DNN systems engineers carefully examining their DNN operations in both the training and inference hardware domains to test for slight divergences in results, which may have an impact on the final application. I’m sensing a “best practice” evolving: train and run inference on the same GPU floating-point architecture. If you are training your DNN on V100s and running inference on T4s or FPGAs, be thoughtful.
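
To make the idea of testing for slight divergences concrete, here is a minimal sketch in plain Python. It does not model any real GPU: it simply emulates a dot product in which every intermediate value is rounded to half precision (FP16) or single precision (FP32) and reports how far the two results drift apart. The helper names (to_fp16, to_fp32, dot, max_divergence) and the sample values are illustrative assumptions, not part of any NVIDIA tool or of our benchmark.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(weights, inputs, rnd):
    """Dot product in which every product and partial sum is rounded by rnd.

    This emulates a reduced-precision pipeline in software; real hardware
    may accumulate at a higher precision than it multiplies (assumption:
    we ignore that here to keep the sketch short).
    """
    acc = rnd(0.0)
    for w, x in zip(weights, inputs):
        acc = rnd(acc + rnd(rnd(w) * rnd(x)))
    return acc


def max_divergence(weights, batches, rnd_a, rnd_b):
    """Largest absolute difference between two precisions over all batches."""
    return max(
        abs(dot(weights, x, rnd_a) - dot(weights, x, rnd_b)) for x in batches
    )


# Hypothetical "layer": 1000 identical weights and two made-up input vectors.
weights = [0.1] * 1000
batches = [[1.0] * 1000, [0.5] * 1000]
print("max |FP16 - FP32| =", max_divergence(weights, batches, to_fp16, to_fp32))
```

In practice the same comparison would be run on real model outputs from the training and inference hardware, with a tolerance chosen for the application in question.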

As the DNN training users on our bare metal cloud grow, we are seeing their thinking on storage evolve. Often the first prototyping or initial proof of concept (POC) compute node is a dual-CPU server with 2 or 4 GPUs and a generous internal solid-state drive (SSD). Over time, as the training dataset gets larger, the internal storage is augmented with a 20 – 50TB Network File System (NFS) node with RAID protection and a back-up/replication scheme. It’s then easy to load the appropriate training datasets into the GPU node.

Once again, over time this is augmented with an object storage solution; we provide Ceph-based ones, which are ideal for storing large datasets for later deployment directly into the GPU node or the NFS storage. This hierarchical storage solution keeps the needed data in close proximity at all times and provides the optimum compute performance.
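
The three-tier progression described above can be sketched as a toy placement rule: put a dataset on the tier closest to the GPUs that is big enough to hold it. Everything here is an illustrative assumption (the Tier class, the place_dataset function, and the capacities), except the 50TB figure, which is the upper end of the NFS range mentioned above.

```python
from dataclasses import dataclass

TB = 10**12  # decimal terabyte, in bytes


@dataclass(frozen=True)
class Tier:
    name: str
    capacity_bytes: int


# Ordered from closest to the GPUs to furthest away.
TIERS = [
    Tier("local SSD", 4 * TB),            # hypothetical internal SSD size
    Tier("NFS node", 50 * TB),            # upper end of the 20-50TB range
    Tier("Ceph object store", 10**18),    # assumed effectively unbounded
]


def place_dataset(size_bytes: int, tiers=TIERS) -> str:
    """Return the name of the nearest tier that can hold the dataset."""
    for tier in tiers:
        if size_bytes <= tier.capacity_bytes:
            return tier.name
    raise ValueError("dataset is larger than every configured tier")


for size_tb in (1, 30, 200):
    print(f"{size_tb}TB dataset -> {place_dataset(size_tb * TB)}")
```

A real deployment would also weigh throughput, sharing between nodes and back-up needs, not just capacity.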


This storage blog gives a great summary of the storage system types and their best use cases: “With block storage, files are split into evenly sized blocks of data, each with its own address but with no additional information (metadata) to provide more context for what that block of data is. ... Object storage, by contrast, doesn't split files up into raw blocks of data and does have describing metadata.”
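
The distinction in that quote can be illustrated with a small sketch. It is a simplified model, not how Ceph or any real storage system is implemented: to_blocks and StoredObject are hypothetical names, and the 4-byte block size is deliberately tiny so the result is easy to inspect.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; unrealistically small, for illustration only


def to_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Block-storage view: evenly sized blocks keyed by address, no metadata.

    The last block is zero-padded so that every block has the same size.
    """
    padded = data + b"\x00" * (-len(data) % block_size)
    return {
        addr: padded[addr:addr + block_size]
        for addr in range(0, len(padded), block_size)
    }


@dataclass
class StoredObject:
    """Object-storage view: the whole payload plus descriptive metadata."""
    key: str
    data: bytes
    metadata: dict = field(default_factory=dict)


payload = b"cat.jpg!!"

# Addresses map to raw bytes; nothing says what the bytes represent.
blocks = to_blocks(payload)

# The payload stays whole and carries its own description.
obj = StoredObject(
    key="training/images/0001",  # hypothetical key
    data=payload,
    metadata={"label": "cat", "split": "train"},
)
```

The metadata is what makes object storage convenient for large training datasets: items can be described and selected without opening the payload.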

There are a couple of common land mines poised to kill any embryonic AI product. The first is any proprietary development tool or API that would lock you into a specific and potentially expensive cloud environment. This is particularly important if the product is destined for volume usage, where the higher compute costs associated with using the APIs or toolsets in a cloud become a cost-of-goods (COG) issue with a meaningful impact on the ultimate product pricing.

The second is the choice of training and inference operating system. CentOS is by far our most popular operating system because it is open source. Migrating to it from other Linux or Unix flavors is a modest task, but moving from Windows Server is a completely different story. It requires careful planning and integration into a busy product roadmap because it delivers no user functionality or customer value, only reduced development and operations costs once in production.

As you start and progress your AI development odyssey, consider our bare metal cloud or extreme “DGX-ready” colocation, both of which come with ample industry experience, low cost Icelandic renewable energy and a campus built for the job. Steve Jobs famously said in 1996: "Picasso had a saying -- 'good artists copy; great artists steal' -- and we have always been shameless about stealing great ideas." There is no need to steal: just train your DNN with us in Iceland and we’ll provide you with a steady stream of industry best practices to consider.

My Formula 1 calendar has two more US stops this year: Supercomputing 19 in Denver, November 17th – 21st, and the AI Summit in New York City, December 11th – 12th, which is preceded by our HPC and AI meetup on December 10th. Let’s compare notes at one of these.

Bob Fletcher, VP of Artificial Intelligence, Verne Global (Email: bob.fletcher@verneglobal.com)



Bob, a veteran of the telecommunications and technology industries, is Verne Global's VP of Strategy. He has a keen interest in HPC and the continuing evolution of AI and deep neural networks (DNN). He's based in Boston, Massachusetts.
