Evolution of Transformers — Part 2

Sanchit Goel
7 min read · Apr 18, 2023



Introduction

If you read the first part of this three-part article series, you will have learnt about some of the most well-thought-out Transformer-based architectures. The architectures we covered in the previous article were:

  1. Transformer by Google
  2. GPT by OpenAI
  3. BERT by Google
  4. GPT-2 by OpenAI
  5. RoBERTa by Facebook

The last architecture we covered, RoBERTa, was released on 26th July, 2019. In this article, we will therefore look at some of the most ground-breaking architectures released after that date.

If you haven’t yet read the first part of this series, you can click here, read it now, and come back here to read about more architectures!

DistilBERT by HuggingFace

Release Date: 28th August, 2019


DistilBERT was proposed by HuggingFace as a model that uses far fewer parameters than other large language models. It has about 66 million parameters, roughly 40% fewer than BERT-base. It uses knowledge distillation to achieve this reduction.

Distillation is a technique where a smaller student model is trained to mimic a larger teacher model. DistilBERT has half as many layers as BERT, and its weights are initialized from BERT's. This reduces the number of parameters, and because training starts from weights that were already pre-trained on massive data, good weights are reached much sooner, making training far less lengthy than usual. This is a form of transfer learning: reusing the knowledge of one model in another model whose task is similar.
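To make the idea concrete, here is a minimal sketch of a distillation objective in PyTorch: the student is trained to match the teacher's softened output distribution in addition to the usual supervised loss. The tensors and hyperparameters below are placeholders for illustration; the actual DistilBERT training recipe also includes additional loss terms.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine a soft-target (teacher) loss with the usual hard-label loss.

    student_logits, teacher_logits: (batch, num_classes) raw scores
    labels: (batch,) ground-truth class indices
    """
    # Soften both distributions with a temperature, then match them with KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the true labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # Weighted sum of the two objectives.
    return alpha * kd_loss + (1 - alpha) * ce_loss

# Example usage with random tensors standing in for model outputs.
student_logits = torch.randn(8, 2)
teacher_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```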

GPT-3 by OpenAI

Release Date: 11th June, 2020


GPT-3 can be called the first “big boy”, with the largest variant having 175 billion parameters. It is a decoder-only transformer with a maximum input sequence length of 2,048 tokens. OpenAI pre-trained this Large Language Model (LLM) on a huge corpus of data, including Common Crawl, WebText2, Books1, Books2 and English Wikipedia, together amounting to hundreds of billions of tokens. GPT-3 can also write code in languages such as Python, CSS and JSX. It was one of the first models adopted by many companies for text generation, thanks to its sophisticated and accurate generation capabilities. It also showed the world that few-shot learning is possible: giving the model a few examples of a task directly in the prompt, at inference time, without updating its weights.
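Since GPT-3 popularised this kind of in-context few-shot learning, here is a small sketch of what a few-shot prompt looks like. The task, labels and example reviews below are made up for illustration; the key point is that the demonstrations live entirely in the input text and no fine-tuning takes place.

```python
# A few labelled demonstrations are placed directly in the prompt;
# the model then completes the label for the final, unlabelled example.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I wanted my two hours back.", "negative"),
    ("A solid cast wasted on a dull script.", "negative"),
]

query = "The soundtrack alone makes this worth watching."

prompt_lines = ["Classify the sentiment of each review as positive or negative.", ""]
for text, label in demonstrations:
    prompt_lines.append(f"Review: {text}")
    prompt_lines.append(f"Sentiment: {label}")
    prompt_lines.append("")
prompt_lines.append(f"Review: {query}")
prompt_lines.append("Sentiment:")  # the model's completion is the prediction

prompt = "\n".join(prompt_lines)
print(prompt)
# The prompt string would then be sent to the model; no gradient updates are involved.
```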

BART by Facebook (or should I say Meta)

Release Date: 8th July, 2020


BART has the same overall architecture as the original transformer, but was trained with a different approach. You can think of it as a BERT-like encoder followed by a GPT-2-like decoder. BART is pre-trained by corrupting the input text with noise and training the model to reconstruct the original text. It was found to perform especially well on text generation tasks, while also handling comprehension tasks.
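As a rough illustration of this denoising setup (not the exact noising functions or hyperparameters from the BART paper), the sketch below corrupts text with sentence shuffling and token masking; the training target is always the original, uncorrupted text.

```python
import random

random.seed(0)

def corrupt(text, mask_prob=0.15, mask_token="<mask>"):
    """Apply two simple BART-style noising steps: shuffle sentences, then mask tokens."""
    # 1. Split into (very naively detected) sentences and shuffle their order.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)

    # 2. Replace a fraction of tokens with a mask token.
    tokens = " . ".join(sentences).split()
    noisy = [mask_token if random.random() < mask_prob else tok for tok in tokens]
    return " ".join(noisy)

original = "BART is a denoising autoencoder. It corrupts text with noise. Then it learns to reconstruct it."
print("input :", corrupt(original))
print("target:", original)
# During pre-training, the encoder sees the corrupted input and the decoder
# is trained to generate the original text token by token.
```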

Really Large LMs

Further innovation by the big tech giants focused on increasing the size of these language models and their training corpora. Microsoft and NVIDIA trained their 530-billion-parameter MT-NLG model on a supercomputer, using a combination of data, pipeline and tensor parallelism. It was released on 11th October, 2021.

Google then introduced its own LLM called PaLM, with 540 billion parameters. Google wanted a model that could perform many different tasks using few-shot learning, and that is what they achieved. It was released on 4th April, 2022.

DeepMind Chinchilla

Release Date: 12th April, 2022

With GPT-3, OpenAI had popularised the idea that the larger the model, the better the results. DeepMind challenged this with Chinchilla, a model with only 70 billion parameters that outperformed much larger LLMs. They achieved this by training the model on roughly 4x more data, showing that existing large language models were significantly under-trained. Their finding was that the number of training tokens matters as much as model size: for compute-optimal training, both should be scaled up roughly in proportion.
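A commonly quoted rule of thumb derived from the Chinchilla results is roughly 20 training tokens per parameter for compute-optimal training. The sketch below applies that heuristic to a few model sizes; treat the exact ratio as an approximation for illustration, not a figure taken from the paper's tables.

```python
# Rough "Chinchilla-optimal" token budget: scale training tokens
# roughly in proportion to parameters (about 20 tokens per parameter).
TOKENS_PER_PARAM = 20  # approximate heuristic, not an exact constant

def optimal_tokens(n_params):
    return n_params * TOKENS_PER_PARAM

for name, n_params in [("Chinchilla", 70e9), ("GPT-3", 175e9), ("MT-NLG", 530e9)]:
    print(f"{name:10s} {n_params/1e9:5.0f}B params -> ~{optimal_tokens(n_params)/1e12:.1f}T tokens")

# Chinchilla (70B params, ~1.4T training tokens) sits close to this ratio,
# whereas GPT-3 (175B params, ~0.3T tokens) was trained on far less data.
```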

Before I talk about what everyone is talking about these days, I would like to mention BLOOM, the BigScience model coordinated by HuggingFace, which is the largest open-source multilingual transformer (176B parameters). It was made by the community, for the community, and you should definitely check it out. It came out on 17th June, 2022.

RLHF Models

ChatGPT was released on 30th November, 2022, and GPT-4 was released (well, not exactly released) on 14th March, 2023. InstructGPT actually came out first, but ChatGPT was the language model that got everyone on this planet talking about Artificial Intelligence. Its ability to answer almost any kind of question is what wowed people. People started testing it by making it sit competitive exams and then calculating its score. Soon after, GPT-4 was released, along with results on many of the world's most difficult competitive exams, ranging from law to medicine. OpenAI also took GPT-4 to the next level by adding image input. Much of this was made possible by RLHF (Reinforcement Learning from Human Feedback).


Reinforcement learning from human feedback (RLHF) is a way to teach large language models (LLMs) by getting feedback from humans. RLHF works by first training a reward model, which predicts how humans will rate the quality of a piece of text. The reward model is then used to train the LLM, which is iteratively updated to generate text that is more likely to be rated highly by humans. RLHF is a promising new technique for training LLMs that can generate high-quality text that aligns with human preferences. As RLHF continues to develop, it is likely to play an increasingly important role in the development of LLMs that can be used for a variety of applications.

Just to amaze you, I generated the above paragraph about RLHF using Bard, another ChatGPT-like model by Google.
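To make the reward-modelling step a little more concrete, here is a minimal sketch (in PyTorch, with made-up tensors standing in for model outputs) of the pairwise preference loss commonly used to train a reward model: given a human-preferred response and a rejected one, the model is pushed to score the preferred response higher. The subsequent policy-optimisation step (typically PPO) is not shown.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss: the chosen response should out-score the rejected one.

    reward_chosen, reward_rejected: (batch,) scalar rewards from the reward model.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Stand-in rewards for a batch of 4 comparison pairs labelled by humans.
reward_chosen = torch.tensor([1.2, 0.3, 0.9, 2.1])
reward_rejected = torch.tensor([0.4, 0.5, -0.2, 1.0])
print(preference_loss(reward_chosen, reward_rejected))
# Once trained, the reward model scores candidate responses, and the LLM is
# fine-tuned (typically with PPO) to generate responses that score highly.
```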

Conclusion

Without a doubt, I can say that the age of AI is here. Before ChatGPT, few people outside the research community followed these language models, but ChatGPT changed everything, and now every new innovation is followed by millions of people. Some are scared of AI's capabilities today, but I'm excited to see what the future holds for us. Always keep in mind that AI isn't flawless, and trust me, it never will be. I believe in a world where AI assists humans in all kinds of day-to-day tasks, rather than replacing us entirely. I also hope that the development of AGI becomes transparent, so that the whole world can be cautious about it even if the big companies aren't.

References

[1] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. Language Models are few-shot learners

[3] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

[4] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra (Additional Authors not shown). PaLM: Scaling Language Modeling with Pathways

[6] BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni (Additional Authors not shown). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

[7] OpenAI ChatGPT Blog

[8] OpenAI GPT-4 Blog


Written by Sanchit Goel

Master of Data Science student @ The University of Adelaide. Looking forward to a career in Data Science.
