NVIDIA Breakthrough Brings Real-Time Conversational AI Within Reach

Back in November of last year, Google open-sourced a technique for natural language processing pre-training that it called Bidirectional Encoder Representations from Transformers, or BERT. A number of large companies, including Microsoft, Facebook, Alibaba, Baidu, and Uber to name just a few, are racing toward conversational AI using similar methods, but to date BERT remains one of the most advanced AI language models in the space.

Google open-sourced BERT so that others could train their own conversational question answering systems. And today, NVIDIA announced that its AI compute platform was the first to train BERT in less than an hour and complete AI inference in just over 2 milliseconds.

“Large language models are revolutionizing AI for natural language,” said Bryan Catanzaro, vice president of Applied Deep Learning Research at NVIDIA. “They are helping us solve exceptionally difficult language problems, bringing us closer to the goal of truly conversational AI. NVIDIA’s groundbreaking work accelerating these models allows organizations to create new, state-of-the-art services that can assist and delight their customers in ways never before imagined.”

Chatbots and digital assistants have been available for quite some time now, but they are still easy to tell apart from actual humans, whether because of poor latency or the unnatural interactions that result from their inability to leverage large AI models in real time.

To accelerate training of the BERT language model, NVIDIA leveraged what it calls a "DGX SuperPOD", which consists of 92 NVIDIA DGX-2H systems running a total of 1,472 NVIDIA V100 GPUs. This configuration cut the typical training time for BERT-Large, the largest current BERT model, from several days to only 53 minutes. To illustrate how its GPUs scale for this type of AI workload, NVIDIA also ran the same training on a single NVIDIA DGX-2 system, which took about 2.8 days.
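The 2.8-day and 53-minute figures come from NVIDIA's announcement; the speedup and scaling-efficiency numbers below are my own back-of-envelope estimate, not figures NVIDIA published:

```python
# Back-of-envelope scaling check for the BERT-Large training numbers.
single_system_minutes = 2.8 * 24 * 60   # one DGX-2 system: ~2.8 days
superpod_minutes = 53                   # 92-system DGX SuperPOD: 53 minutes
num_systems = 92

speedup = single_system_minutes / superpod_minutes  # ~76x faster
scaling_efficiency = speedup / num_systems          # ~83% of ideal linear scaling

print(f"speedup: {speedup:.1f}x, efficiency: {scaling_efficiency:.0%}")
```

In other words, 92 systems deliver roughly 76 times the throughput of one, which is strong (though not perfectly linear) scaling for a distributed training workload.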

For the inference component, NVIDIA leveraged its T4 GPUs running NVIDIA TensorRT with BERT-Base fine-tuned on the SQuAD question-answering dataset. Inference took only 2.2 milliseconds. For reference, optimized CPU code needs over 40 milliseconds for a similar workload, and 10 milliseconds is the processing threshold for many real-time applications.
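A quick sketch puts those latencies in context against the 10-millisecond real-time budget (the 2.2 ms and 40 ms figures are the reported numbers; the budget comparison is my own framing):

```python
# Compare reported BERT inference latencies against a 10 ms real-time budget.
real_time_budget_ms = 10.0
latencies_ms = {"T4 GPU + TensorRT": 2.2,   # reported by NVIDIA
                "optimized CPU": 40.0}      # "over 40 ms" for a similar workload

meets_budget = {name: ms <= real_time_budget_ms
                for name, ms in latencies_ms.items()}

# The GPU fits ~4 such inferences inside one budget window; the CPU misses it.
gpu_headroom = real_time_budget_ms / latencies_ms["T4 GPU + TensorRT"]
print(meets_budget, f"headroom: {gpu_headroom:.1f}x")
```

That headroom matters in practice: a conversational pipeline typically chains several model invocations per user utterance, so each one must land well under the overall budget.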

To push the boundaries even further, NVIDIA Research also built and trained what it claims is the world's largest language model based on the Transformer, which is one of the foundational building blocks of BERT and a number of other natural language AI models. NVIDIA's custom model, dubbed "Megatron", features 8.3 billion parameters, 24 times the size of BERT-Large.
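As a rough check on that 24x comparison, a Transformer encoder's parameter count can be estimated from its layer count and hidden size. The sketch below uses BERT-Large's published hyperparameters (24 layers, hidden size 1,024, ~30K-token vocabulary), which come from the BERT paper rather than this article, and ignores small terms like biases and LayerNorm weights:

```python
# Rough parameter-count estimate for a BERT-style Transformer encoder.
layers, hidden, vocab = 24, 1024, 30522   # BERT-Large hyperparameters

attention = 4 * hidden * hidden           # Q, K, V, and output projections
feed_forward = 2 * hidden * (4 * hidden)  # up/down projections (4H intermediate)
per_layer = attention + feed_forward      # ~12 * H^2 weights per layer

embeddings = vocab * hidden               # token embedding table
total = layers * per_layer + embeddings   # ~0.33B, close to BERT-Large's ~340M

megatron_ratio = 8.3e9 / 3.4e8            # ~24x, matching the article's claim
print(f"estimated BERT-Large params: {total/1e6:.0f}M, ratio: {megatron_ratio:.1f}x")
```

The estimate lands within a few percent of BERT-Large's commonly cited ~340 million parameters, and 8.3 billion divided by that figure does indeed come out to roughly 24.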

NVIDIA has made the software optimizations and tools it used for accelerating these models available to developers via GitHub and the NVIDIA GPU Cloud (NGC).

I am a freelance writer, co-founder and Principal Analyst at HTVA, and the longtime Managing Editor at HotHardware.com. My work has been published worldwide, in a num...