Understanding GPT-3 – Future of AI Text Generation

1. Introduction

If you have been following recent developments in the NLP space, it would have been almost impossible to avoid the GPT-3 hype in the last few months. It all started with researchers at OpenAI publishing their paper “Language Models are Few-Shot Learners”, which introduced the GPT-3 family of models.

GPT-3’s size and language capabilities are breathtaking: it can create fiction, develop program code, compose thoughtful business memos, summarize text, and much more. Its possible use cases are limited only by our imagination. What makes it fascinating is that the same model can perform such a wide range of tasks. At the same time, there is widespread misunderstanding about the nature and risks of GPT-3’s abilities.

To better appreciate the powers and limitations of GPT-3, one needs some familiarity with the pre-trained NLP models that came before it. The table below compares some of the prominent pre-trained models:

Source: Humboldt-wi 

Let’s look at some of the common characteristics of the pre-trained NLP models before GPT-3:

i) NLP pre-trained models are based on the Transformer architecture

Most of the pre-trained models belong to the Transformer family and use attention mechanisms. These models can be divided into four categories:

Types of Language Models

ii) Different models for different tasks

The focus has been on creating customized models for various NLP tasks, so we have different pre-trained models for separate tasks like sentiment analysis, question answering, entity extraction, etc.

iii) Fine-tuning of pre-trained models for improved performance

For each task, the pre-trained model needs to be fine-tuned to the data at hand. Fine-tuning involves gradient updates to the pre-trained model’s weights; the updated weights are then stored and used for predictions on the respective NLP task.

iv) Dependence of fine-tuning on large datasets

Fine-tuning requires large amounts of custom labeled data. This has been a bottleneck when extending pre-trained models to new domains where labeled data is limited.

v) Focus on architectural improvements rather than size

While new pre-trained models emerged in a short span of time, the larger focus was on architectural improvements or training on different datasets to widen the net of NLP applications.

2. GPT-3: Quick Overview

Key facts about GPT-3:

  • Models: GPT-3 comes in eight model sizes, ranging from 125 million to 175 billion parameters.
  • Model Size: The largest GPT-3 model has 175 billion parameters. This is about 470 times bigger than the largest BERT model (375 million parameters).
  • Architecture: GPT-3 is an autoregressive model with a decoder-only architecture. It is trained using a next-word prediction objective.
  • Learning: GPT-3 learns through few-shot learning; there are no gradient updates while learning.
  • Training Data Needed: GPT-3 needs little task-specific training data. It can learn from very few examples, which enables its application in domains with limited data.
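The next-word prediction objective mentioned above can be made concrete with a toy sketch. The tiny corpus and bigram counting below are invented stand-ins for GPT-3’s Transformer decoder and its web-scale training data; the only point being illustrated is that an autoregressive model predicts each word from the words that came before it:

```python
from collections import Counter, defaultdict

# Toy illustration (not GPT-3 itself): an autoregressive model assigns a
# probability to each next word given the words so far. Here a simple
# bigram count model stands in for the Transformer decoder.
corpus = "the cat sat on the mat . the cat ate the food .".split()

# Count how often each word follows each preceding word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word after `word`, or None if unseen."""
    counts = next_counts[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" — it follows "the" most often here
```

GPT-3 applies the same principle at an enormous scale: generation is simply repeated next-word prediction, with each predicted word appended to the context.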

Sizes, architectures, and learning hyper-parameters of the GPT-3 models

Source: https://arxiv.org/abs/2005.14165

Key design assumptions in the GPT-3 model:

(i) Increasing model size and training on larger datasets can improve performance.

(ii) A single model can provide good performance on a host of NLP tasks.

(iii) The model can infer from new data without the need for fine-tuning.

(iv) The model can solve problems on datasets it has never been trained upon.

3. How GPT-3 learns

Traditionally, pre-trained models have learned through fine-tuning. Fine-tuning requires a lot of data for the problem being solved and also updates to the model weights. The existing fine-tuning approach is illustrated in the diagram below.

Learning process for earlier pre-trained language models: fine-tuning

GPT-3 adopts a different learning approach: there is no need for large labeled datasets to tackle new problems. Instead, it can learn from no examples (zero-shot learning), just one example (one-shot learning), or a handful of examples (few-shot learning).

Below is a representation of the different learning approaches followed by GPT-3.
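The three regimes can be sketched as plain prompt construction. The translation task and example pairs below follow the illustrative format from the GPT-3 paper; the key point is that all the “learning” happens inside the prompt text itself, with no gradient updates to the model:

```python
# Sketch of zero-/one-/few-shot prompting. The task and examples are
# illustrative; no model weights are updated — the examples live only
# in the prompt that is sent to the model.
task = "Translate English to French."

def build_prompt(examples, query):
    """Assemble a prompt from a task description, k examples, and a query."""
    lines = [task]
    for en, fr in examples:           # k = 0 -> zero-shot, k = 1 -> one-shot
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

zero_shot = build_prompt([], "cheese")
one_shot = build_prompt([("sea otter", "loutre de mer")], "cheese")
few_shot = build_prompt(
    [("sea otter", "loutre de mer"), ("plush giraffe", "girafe peluche")],
    "cheese",
)
print(few_shot)
```

The model then completes the final line, treating the in-prompt examples as its only task-specific “training data”.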

4. How is GPT-3 different from BERT?

BERT was among the earliest pre-trained models and is credited with setting the benchmarks for most NLP tasks. Below we compare GPT-3 with BERT on three dimensions:

The points that stand out from the above comparison are:

  • GPT-3’s size is its standout feature. It is almost 470 times the size of the largest BERT model.
  • On the architecture dimension, BERT still holds the edge. It is trained on objectives that better capture the latent relationships between texts in different problem contexts.
  • GPT-3’s learning approach is relatively simple and can be applied to many problems where sufficient data does not exist. Thus GPT-3 should have wider applicability than BERT.

5. Where GPT-3 has really been successful

Applications of NLP techniques have evolved with progress in learning better representations of the underlying text corpus. The chart below gives a quick overview of some of the traditional NLP application areas.

NLP Application Areas

Traditional NLP, built on the bag-of-words approach, was limited to tasks like parsing text, sentiment analysis, and topic modeling. With the emergence of word vectors and neural language models, new applications like machine translation, entity recognition, and information retrieval came into prominence.

In the last couple of years, the emergence of pre-trained models like BERT and RoBERTa, along with supporting frameworks like Hugging Face and spaCy Transformers, has made NLP tasks like reading comprehension and text summarization feasible, and these models have set new state-of-the-art benchmarks.

The frontiers where pre-trained NLP models struggled were tasks like natural language generation, natural language inference, and common-sense reasoning. There were also question marks over applying NLP in areas where limited data is available. So the question is: how much impact has GPT-3 made on these tasks?

GPT-3 has made substantive progress on two fronts: (i) text generation tasks and (ii) extending NLP’s application into domains that lack enough training data.

  • Text Generation Capabilities: GPT-3 is very powerful when it comes to generating text. In human surveys, very little separated text generated by GPT-3 from text written by humans. This is a great development for building solutions in the space of creative fiction, stories, resumes, narratives, chatbots, text summarization, etc. At the same time, the world is taking cognizance of the fact that this power can be used by unscrupulous elements to create and plant fake content on social platforms.


  • Build NLP solutions with limited data: The other area where GPT-3 has left a mark is domains where limited data is available. We have seen the open-source community use the GPT-3 API for tasks like generating UNIX shell commands, SQL queries, machine learning code, etc. All users need to provide is a task description in plain English and some examples of input/output. This has huge potential for organizations to automate routine tasks, speed up processes, and focus their talent on higher-value work.
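As a sketch of how such a solution might be framed, the function below assembles a few-shot English-to-SQL prompt of the kind the community has fed to GPT-3. The task description, example pairs, and table names are invented for illustration, and the actual call to a GPT-3 completion endpoint is omitted since access details vary:

```python
# Hypothetical prompt builder for an English-to-SQL task. The examples
# and schema below are made up; in practice the assembled prompt would
# be sent to a GPT-3 completion endpoint, which fills in the final SQL.
def make_sql_prompt(examples, request):
    """Build a few-shot English-to-SQL prompt for a text-completion model."""
    parts = ["Translate plain-English requests into SQL queries."]
    for english, sql in examples:
        parts.append(f"Request: {english}\nSQL: {sql}")
    parts.append(f"Request: {request}\nSQL:")  # model completes this line
    return "\n\n".join(parts)

examples = [
    ("show all customers", "SELECT * FROM customers;"),
    ("count orders placed in 2020",
     "SELECT COUNT(*) FROM orders WHERE year = 2020;"),
]
prompt = make_sql_prompt(examples, "list the ten most recent orders")
print(prompt)
```

Note that the “programming” here is entirely in natural language: changing the task description and examples repurposes the same model for shell commands, code snippets, or any other text-to-text task.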

6. GPT-3’s challenges and limitations

Though the buzz might suggest otherwise, GPT-3 isn’t as powerful as some of us hope or fear. Many voices (including Sam Altman, who co-founded OpenAI with Elon Musk) call for a more clear-eyed perspective on GPT-3, including its capabilities and limitations.

“The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.”
-Sam Altman, CEO of OpenAI

GPT-3 doesn’t actually understand either its input or its output, and so it can make silly-looking errors (and fails the Turing test). This becomes especially visible over longer outputs, as GPT-3 isn’t very adept at holding onto a train of thought. One could say that GPT-3 practices linguistic scrapbooking: it combines snippets, creating textual collages on demand.


One crucial challenge of any new and powerful technology is safety. GPT-3 has the potential to be used in nefarious ways, “from powering better misinformation bots to helping kids cheat on their homework”. Additionally, because it takes its cues from the entirety of the internet, GPT-3 is somewhat likely to repeat harmful biases (something that had to be fine-tuned out of GPT-2).


Other challenges include: model latency, the same model being used too widely, and selection bias towards good examples. Some questions about GPT-3 remain unanswered. We don’t know what the cost per request might be once the tool becomes commercial, or how copyrights of the output text will be handled. We also haven’t seen the SLA for the API.

7. Conclusions

Despite its limitations, GPT-3 pushes our definition of state-of-the-art language processing technology further ahead. It can generate text in various styles, and, once it becomes a commercial product, it’ll offer a number of exciting benefits to various businesses and individuals. However, it’s important to maintain a level-headed view of GPT-3’s limitations – this will allow us to truly leverage its advantages while avoiding costly mistakes.

In the end, GPT-3 is a correlative, unthinking tool. We may be one step closer to achieving artificial general intelligence, but we’re definitely not there yet. There is potential for select applications of GPT-3, but we must monitor who uses such powerful technology and how. Creators should focus on feedback from the community, and the community should observe the situation as it unfolds, addressing emerging challenges and opportunities.