When Is a Neural Net Too Big for Production?

During the last couple of years, a ton of exciting developments have been made in natural-language processing (NLP). The development of deep pretrained language models has taken the field by storm. In particular, transformer architectures are everywhere and were popularised by Google’s release of the Bidirectional Encoder Representations from Transformers (BERT) model, OpenAI’s release of the GPT(-2) models, and other similar releases.

Various research teams are continuing to compete to train better language models; looking at the General Language Understanding Evaluation (GLUE) benchmark leaderboard reveals a host of other approaches (many of them, like ALBERT and RoBERTa, also named after BERT). The overarching trend in this research has been to train bigger models with more data, growing to the extent that researchers have investigated the costly carbon footprint of training these large networks.

For practitioners, the main selling point of pretrained language models is that one does not need to start from scratch when developing a new text classifier: a pretrained model can be fine-tuned and often leads to state-of-the-art results with a fraction of the work. But as these models continue to grow in size, people have started to question how useful they are in practice.

Patterns for models in production

What is “production?”

For these purposes, production is the environment into which software (and trained models) is deployed so that it can power product features without any manual intervention. By this definition, code that is used for analytics or ad-hoc purposes is excluded, even though NLP has potential applications in those domains (e.g., sentiment analysis on historical data).

This analysis assumes that this environment is designed using microservices—which just happens to be how the Monzo backend is designed.

There are three main ways that models can be used in production:

RESTful Services. This is the first (and sometimes only) thing that comes to mind when people talk about “production.” The idea is to build some kind of microservice with an API that can receive requests, do some work (i.e., get predictions from a model), and return a result. For example, when a customer types a query into the Monzo app’s help screen, a service receives that request and returns relevant help articles. (This is simplified a little: quite a few services are involved in this work, but the idea is the same.)
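
As a rough illustration of this pattern, here is a minimal sketch of such a service using only Python's standard library; `predict()` and the hard-coded articles are illustrative placeholders, not Monzo's actual implementation, and a real service would call a model instead of matching keywords:

```python
# Minimal sketch of a prediction microservice (stdlib only).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(query):
    # Placeholder: a real service would embed the query with a model
    # and rank help articles by similarity to that embedding.
    articles = ["How do I freeze my card?", "How do I top up?"]
    hits = [a for a in articles
            if any(w in a.lower() for w in query.lower().split())]
    return hits or articles  # fall back to everything if nothing matches

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the "model", return JSON.
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        payload = json.dumps({"articles": predict(body["query"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# To serve requests:
#   HTTPServer(("", 8080), PredictHandler).serve_forever()
```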

Consumer Services. The second approach is to build a service that listens for certain events and triggers some work when they occur. For example, when a customer starts chatting with the customer support team, a service listens for particular events in order to (a) generate embeddings of the chat’s first turn and (b) trigger recommendations of saved responses that may be relevant to the current query, which are shown to the agent.
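
A sketch of this event-consumer pattern, using an in-process queue as a stand-in for a real event bus; the event schema, `embed()`, and `recommend()` are illustrative assumptions, not Monzo's actual code:

```python
# Event-consumer sketch: pull events off a bus and react to them.
import queue

def embed(text):
    # Placeholder for a real sentence-embedding model.
    return [float(len(w)) for w in text.split()]

def recommend(embedding):
    # Placeholder: a real service would do a nearest-neighbour lookup
    # against embeddings of the saved agent responses.
    return ["Sorry to hear that! Let me help."]

def handle_event(event):
    # Only react to the event types this consumer cares about.
    if event["type"] == "chat.started":
        return recommend(embed(event["first_turn"]))

bus = queue.Queue()
bus.put({"type": "chat.started", "first_turn": "I lost my card"})
while not bus.empty():
    suggestions = handle_event(bus.get())
```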

Cron Jobs. These are batches of work that need to be done on a regular basis. For example, all of the help articles and agent responses are stored in a content management system, and these are regularly edited and updated with new content. Because the search and recommendation services rely on embeddings of this content, those embeddings need to be kept up to date.
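
The batch job itself can be sketched as a function that recomputes every embedding in one pass; the article shapes and `embed()` below are hypothetical placeholders for the CMS content and a real embedding model:

```python
# Cron-style batch job sketch: regenerate embeddings for all content.
def embed(text):
    # Placeholder for a real sentence-embedding model.
    return [float(len(w)) for w in text.split()]

def refresh_embeddings(articles):
    # Recompute the embedding for every article so that search and
    # recommendation services always query against fresh content.
    return {a["id"]: embed(a["body"]) for a in articles}

cms_articles = [
    {"id": "help-1", "body": "How to freeze your card"},
    {"id": "help-2", "body": "Topping up your account"},
]
index = refresh_embeddings(cms_articles)
```

A scheduler (cron, Kubernetes CronJob, etc.) would then run this on a fixed cadence and write the result wherever the serving path reads from.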

In practice, building an end-to-end system is likely to involve more than one of the above. 

When Is a Model Too Big?

There are two scenarios to consider: (1) a model that is too big to ship at all and (2) a model whose size makes it inefficient.

Too big to ship at all? Whether a model can be shipped at all comes down to reconciling the hardware on which it will run with the size of the model. In practice, current model sizes are not a big problem in cloud-based backend systems, which offer a variety of instance sizes: the hardware in the cloud can serve a model like BERT. It may eat up a lot of memory, but it will work.
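
For a rough sense of scale, the memory taken by a model's weights can be estimated from its parameter count; this back-of-the-envelope sketch uses the published parameter counts for BERT-base and BERT-large and assumes float32 weights (half-precision would halve the figures):

```python
# Back-of-the-envelope memory estimate for a transformer's weights.
def weight_memory_mb(num_params, bytes_per_param=4):
    # 4 bytes per parameter assumes float32 weights.
    return num_params * bytes_per_param / 1024**2

bert_base = weight_memory_mb(110_000_000)   # ~420 MB
bert_large = weight_memory_mb(340_000_000)  # ~1.3 GB
```

Note this counts only the weights: activations, tokenizer state, and the serving framework itself add to the real footprint.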

This could change if one wanted to ship a model somewhere other than the cloud (on a mobile device, for example).

Too big to be efficient? Models are often trained on GPUs but shipped to non-GPU instances, where inference is slower. As models get bigger, inference time grows with them, and at some point the slowdown can make a model infeasible to use. This is a very application-specific decision: a chat bot responding within a few seconds may still seem “fast” in customers’ eyes, while Google search results that took a similar time to appear would seem odd.
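
Because the threshold is application-specific, it helps to measure tail latency on the target (non-GPU) hardware before shipping; this sketch times repeated calls and reports the 95th percentile, with `model_predict` standing in for the real inference call:

```python
# Sketch: measure p95 inference latency before deciding to ship.
import statistics
import time

def model_predict(text):
    # Stand-in for real model inference; here it just simulates work.
    time.sleep(0.001)
    return 0

def p95_latency_ms(fn, inputs):
    samples = []
    for x in inputs:
        t0 = time.perf_counter()
        fn(x)
        samples.append((time.perf_counter() - t0) * 1000)
    # quantiles(n=20) yields 19 cut points; the last is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]
```

Comparing that number against the product's latency budget (seconds for a chat bot, far less for search-as-you-type) turns "too slow" from a feeling into a measurable decision.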


