
Deep Learning Has a Size Problem



Earlier this year, researchers at Nvidia announced MegatronLM, a massive transformer model with 8.3 billion parameters (24 times larger than BERT) that achieved state-of-the-art performance on a variety of language tasks. While this was an undoubtedly impressive technical achievement, I couldn’t help but ask myself: Is deep learning going in the right direction?

The parameters alone weigh in at just over 33 GB on disk. Training the final model took 512 V100 GPUs running continuously for 9.2 days. Given the power requirements per card, a back-of-the-envelope estimate put the amount of energy used to train this model at more than three times the yearly energy consumption of the average American.
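
For readers who want to check the arithmetic, here is a minimal back-of-the-envelope sketch in Python that roughly reproduces the figures above. The per-GPU power draw (about 300 W for a V100) and the average American's yearly electricity use (about 11,000 kWh) are illustrative assumptions on my part, not figures reported by Nvidia.

    # Rough sketch of the numbers above. The power draw per V100 and the
    # average American's yearly electricity use are assumptions, not measurements.
    params = 8.3e9                      # MegatronLM parameter count
    disk_gb = params * 4 / 1e9          # 4 bytes per FP32 parameter
    print(f"Weights on disk: {disk_gb:.1f} GB")          # ~33 GB

    gpus, watts_per_gpu, days = 512, 300, 9.2            # assumed ~300 W per card
    training_kwh = gpus * watts_per_gpu * days * 24 / 1000
    print(f"Training energy: {training_kwh:,.0f} kWh")   # ~34,000 kWh

    avg_american_kwh = 11_000           # assumed yearly electricity use per American
    print(f"Roughly {training_kwh / avg_american_kwh:.1f}x a year of average use")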

I don’t mean to single out this particular project. There are many examples of massive models being trained to achieve ever-so-slightly higher accuracy on various benchmarks. Despite being 24 times larger than BERT, MegatronLM is only 34% better at its language modeling task. As a one-off experiment to demonstrate the performance of new hardware, there isn’t much harm here. But, in the long term, this trend is going to cause a few problems.

First, it hinders democratization. If we believe in a world where millions of engineers are going to use deep learning to make every application and device better, we won’t get there with massive models that take large amounts of time and money to train.

Second, it restricts scale. There are probably fewer than 100 million processors in every public and private cloud in the world. But there are already 3 billion mobile phones, 12 billion Internet of Things devices, and 150 billion microcontrollers out there — more than a thousand edge devices for every cloud processor. In the long term, it's these small, low-power devices that will consume the most deep learning, and massive models simply won't be an option.

To make sure deep learning lives up to its promise, we need to reorient research away from state-of-the-art accuracy and toward state-of-the-art efficiency. We need to ask whether models enable the largest number of people to iterate as fast as possible, using the fewest resources, on the most devices.
