Project review - Knowledge distillation from powerful LLM, Alpaca and Koala
06 May 2023
This post summarizes a talk I gave as the host of DEEPEST Season 13. This post has two main goals:
- A basic explanation of text generation models for people who have heard of Transformer and GPT but have never read the papers or looked into the details
- A proposal for a new methodology for people who want to solve a downstream task with an LLM but are stuck because there is no well-trained, publicly available model
Large Language Model (LLM)
Let’s look at Large Language Models (LLMs) in chronological order. Strictly speaking, LLMs include not only GPT but also the BERT and T5 families, but here let’s focus on text generation models and look only at the GPT models.
Transformer
The Transformer model proposed in the Attention Is All You Need (NeurIPS17’) paper made possible the parallelization that the existing Recurrent Neural Network (RNN) could not achieve.
Unlike the RNN structure, where the input must be fed in sequentially, the Transformer uses an approach that computes the attention of the input sequence simultaneously. Explaining the Transformer alone would take half a day, so I’ll skip it here.
Anyway, the benefit gained from using the Transformer is that, because parallelization is possible, it became feasible to handle longer sequences and to build larger models. Since RNNs can’t be parallelized, if you made the model big, not only training but also inference would take an enormous amount of time. Also, when people tried it, the performance turned out to be good too (after all, in deep learning all that matters is good performance). As a result, in the NLP field (and even extending to vision), Transformer-based models came to dominate.
GPT-1
Through the Improving Language Understanding by Generative Pre-Training (Arxiv18’) paper, OpenAI released the GPT-1 model. This model uses only the Decoder structure of the Transformer.
As shown in the figure above, training was conducted in two stages: Unsupervised pre-training (PT) and Supervised fine-tuning (SFT). In PT, since training simply involves feeding in a sequence of tokens and predicting the next token, no annotation work is needed. In SFT, a labeled dataset is used depending on the task.
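To make the pre-training objective concrete, here is a minimal sketch of how a token sequence becomes next-token training pairs without any annotation. The helper name `next_token_pairs` and the whitespace-level "tokens" are illustrative assumptions, not GPT-1's actual tokenizer or training code.

```python
def next_token_pairs(tokens):
    """Turn a token sequence into (context, target) training pairs.

    Every prefix of the sequence is a context, and the token that
    follows it is the label, so the raw text labels itself.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy example: word-level tokens stand in for real subword tokens.
pairs = next_token_pairs(["the", "cat", "sat", "down"])
for context, target in pairs:
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 training examples for free, which is why PT scales with raw text rather than with labeling budget.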
Actually, this structure is not much different from the approach used in Computer Vision. It directly borrows the method of freezing a pre-trained model like ResNet and attaching a linear model at the end to solve the downstream task.
What’s interesting is that something called zero-shot behavior began to emerge. Without fine-tuning, just with pre-training and some reasonable heuristic methods, it showed decent performance on NLP tasks.
Taking sentiment analysis as an example, the approach was to feed in a sentence and then judge based only on comparing the probability of the words positive/negative appearing as the next token. With just this, it showed performance approaching nearly 70%.
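The heuristic above can be sketched as follows. This is only an illustration: `next_token_probs` is a hypothetical dictionary standing in for a real model's next-token distribution, and the numbers are made up.

```python
def zero_shot_sentiment(next_token_probs):
    """Pick the label whose word is the more likely next token.

    next_token_probs: hypothetical mapping from candidate next tokens
    to probabilities, as a language model would produce after reading
    the input sentence.
    """
    positive = next_token_probs.get("positive", 0.0)
    negative = next_token_probs.get("negative", 0.0)
    return "positive" if positive >= negative else "negative"


# Made-up distribution; a real model would assign mass to its whole vocabulary.
fake_probs = {"positive": 0.12, "negative": 0.03, "the": 0.20}
label = zero_shot_sentiment(fake_probs)
print(label)
```

Only the two candidate words are compared, so the rest of the vocabulary is ignored even if other tokens are more likely.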
This shows the possibility that GPT-family language models can achieve good performance while skipping expensive annotation work.
GPT-2
The Language Models are Unsupervised Multitask Learners (OpenAI, 2019) paper shows an attempt to maximize the zero-shot behavior discovered in GPT-1. They increased the model’s capacity further and made the pre-training data larger and higher quality.
First, they scaled the model size by nearly 13x. Compared to GPT-1, which was 117M in size, GPT-2 used a 1.5B model. They also built a new dataset called WebText. Task-specific data was not needed, but the number of data points had to be large, so they obtained data through web crawling. They also constructed the dataset based on three criteria to maintain quality.
- Data that received 3 or more Karma (upvotes) on Reddit
- Wikipedia was excluded because the Training/Test overlap is severe
- Only English data
The Wikipedia point is impressive, as it apparently has too much influence on evaluation performance. This is because too many sentences in Reddit replies are quoted from Wikipedia. (In effect, the answers to the questions are all on Wikipedia…)
As expected, when they increased the model size, the zero-shot behavior also began to increase.
GPT-2 did more than improve incrementally: on some tasks it surpassed existing SOTA methods in a zero-shot manner. Of course, it wasn’t that outstanding on summarization, translation, factual QA, etc. In particular, on factual QA, only 4.1% of the answers were correct. (From this point you can start to smell hallucination…)
GPT-3
With the publication of the Language Models are Few-Shot Learners (NeurIPS20’) paper, OpenAI opened the prelude to the current era of ChatGPT. From GPT-2 they had confirmed that zero-shot behavior improves with model size and dataset quality, and they put this directly into practice to create GPT-3, combining much larger capacity, a higher-quality dataset, and the few-shot learning technique.
First, they increased the model size all the way to 175B, scaling it up by more than 100x in one step. At 175B, the model size alone is roughly 800GB, and to train it you need at least 160 A100s. Since an A100 costs about 15 million won, the GPU price alone would be around 2.4 billion won. On top of that, at that scale the electricity cost alone would be about 6 million won per day in Korea. A research lab can’t handle that.
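A back-of-the-envelope check of these figures, as a rough sketch. The 4 bytes per parameter (32-bit floats) is my assumption; the GPU count and per-GPU price are the ones quoted above.

```python
# Scale-up from GPT-2 (1.5B parameters) to GPT-3 (175B parameters).
scale_up = 175e9 / 1.5e9

# Raw weight storage, ASSUMING 4 bytes per parameter (32-bit floats).
# Optimizer state and activations during training come on top of this.
params = 175e9
bytes_per_param = 4
weights_gb = params * bytes_per_param / 1e9

# GPU purchase cost from the figures in the text.
gpus = 160
won_per_gpu = 15_000_000
gpu_cost_won = gpus * won_per_gpu

print(f"scale-up: {scale_up:.0f}x")
print(f"weights: {weights_gb:.0f} GB")
print(f"GPU cost: {gpu_cost_won / 1e9:.1f} billion won")
```

Under this assumption the raw weights come to about 700GB, the same order of magnitude as the storage figure quoted above.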
It’s the same with the data. They carefully refined 45TB of data scraped from the internet to create a 570GB high-quality dataset. In doing so, they intensively collected high-quality reference corpora such as books and Wikipedia, and removed duplicates as much as possible.
Finally, the few-shot learning technique is a method where, when solving a downstream task, instead of zero-shot, you present a few examples along with the problem and have it solve the problem.
Even when translating “cheese” into French, giving a few examples would help it understand the problem better. (At this point it’s no different from human metacognition.)
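A few-shot prompt is just text assembled from a task description, a handful of solved examples, and the unsolved query. The sketch below shows one way to build it; the function name and the `=>` separator are illustrative choices, and the resulting string would be sent to a model for completion.

```python
def few_shot_prompt(task, examples, query):
    """Build a prompt: task description, solved examples, then the query."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The last line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
```

Passing an empty example list gives the zero-shot version of the same prompt, so the two settings differ only in the text fed to the model, not in its weights.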
GPT-3 began to solve parts that existing language models had not been able to solve, parts that were thought to be the domain of humanity. It became able to do addition and subtraction, and able to solve anagrams.
Instruct-GPT
The Training language models to follow instructions with human feedback (Arxiv22’) paper takes one step further from GPT-3. GPT-3 shows amazing things, but in reality it was a bit different from what people actually want. The reason is actually that the objective is different. GPT looks at the preceding token sequence and predicts the next token that will appear, whereas the language model that people want produces answers that are helpful, harmless, and truthful to them. So OpenAI tried to create a user-aligned LLM using reinforcement learning, and as a result they proposed InstructGPT.
InstructGPT consists of three stages in total: Supervised Fine-tuning (SFT), Reward model training, and Optimizing policy. Let’s look at each in detail.
SFT
First, OpenAI decided to turn all tasks into Instructions. An Instruction has the following form.
Let me give an example of asking a language model to write an essay about school safety. In this case, the Instruction would be “writing an essay about following topic” and the instance input would be “school safety”.
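One way to picture this format is as a small record with an instruction, an instance input, and an instance output. The class and field names below are assumptions for illustration, not OpenAI's actual schema.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    """Hypothetical container for one instruction-formatted example."""
    instruction: str
    instance_input: str
    instance_output: str = ""  # filled in by a labeler (or left for the model)

    def to_prompt(self):
        # Flatten the record into the text the model would actually see.
        return (
            f"Instruction: {self.instruction}\n"
            f"Input: {self.instance_input}\n"
            f"Output:"
        )


example = InstructionExample(
    instruction="Write an essay about the following topic",
    instance_input="school safety",
)
print(example.to_prompt())
```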
OpenAI turns the scraped data into this kind of form. They hired 40 labelers full-time from Upwork and ScaleAI, and created a set of 13K Instructions. (It probably cost quite a lot.)
They fine-tuned on the created Instruction set. Apparently it was done for 16 epochs.
Reward model training
The Reward model is trained using PbRL as proposed in Deep reinforcement learning from human preferences. The approach is that human labelers rank the multiple outputs generated by the SFT-ed model, and this is used to estimate the reward for each prompt. (For details on PbRL, see here.)
For this too, 40 human labelers estimated rankings for 33K data points. It must have taken a tremendous amount of money, time, and effort, right? For the Reward model, they apparently used a 6B LLM. There was an attempt to use 175B, but apparently the reward model becomes unstable as its size grows.
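The ranking-to-reward step can be sketched with the pairwise comparison loss used for reward models of this kind: for every pair of outputs, the preferred one should receive the higher score. The function below is a minimal sketch on plain floats, assuming the scores are already listed from best to worst according to the labeler; a real reward model would produce them with a neural network.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_best_to_worst):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all pairs.

    rewards_best_to_worst: reward-model scores for the outputs of one
    prompt, ordered by the human ranking (best first).
    """
    # combinations() keeps list order, so the first element of each
    # pair is always the output the labeler ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    total = 0.0
    for r_preferred, r_rejected in pairs:
        sigmoid = 1.0 / (1.0 + math.exp(-(r_preferred - r_rejected)))
        total += -math.log(sigmoid)
    return total / len(pairs)


# Scores that agree with the human ranking give a low loss ...
print(pairwise_ranking_loss([2.0, 1.0, 0.0]))
# ... and scores that contradict it give a high one.
print(pairwise_ranking_loss([0.0, 1.0, 2.0]))
```

Ranking K outputs yields K(K-1)/2 comparisons per prompt, which is why rankings are a more label-efficient signal than single pairwise votes.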
Optimize policy
Finally, they optimize the LLM using the trained reward model and a reinforcement learning algorithm (PPO). At this point no separate labeling is needed, and they conducted training using 31K prompts obtained from users on OpenAI’s GPT-3.
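One detail not covered above: the InstructGPT paper does not maximize the reward model's score alone, but subtracts a KL-style penalty that keeps the policy close to the SFT model. A minimal sketch of that per-sample reward follows; the function name and the `beta` value are illustrative assumptions.

```python
def penalized_reward(reward, logprob_policy, logprob_sft, beta=0.02):
    """Reward-model score minus a penalty for drifting from the SFT model.

    reward:         score from the trained reward model
    logprob_policy: log-probability of the response under the current policy
    logprob_sft:    log-probability of the same response under the SFT model
    beta:           penalty strength (illustrative value, an assumption)
    """
    return reward - beta * (logprob_policy - logprob_sft)


# No drift from the SFT model: the reward passes through unchanged.
print(penalized_reward(1.0, -5.0, -5.0))
# The policy makes the response much more likely than SFT did: penalized.
print(penalized_reward(1.0, -4.0, -5.0, beta=0.5))
```

PPO then updates the policy to maximize this penalized quantity, which discourages the policy from exploiting weaknesses of the reward model.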
I’ll skip the results for InstructGPT. This is because it’s effectively only a qualitative evaluation. I’ll just note the point that InstructGPT’s results had much higher human preference compared to GPT-3.
InstructGPT is a language model that generates answers well-aligned to humans while minimizing the decrease in general performance. However, quite a lot of resources were needed in that training process. It seems like work that’s hard to do at the research-lab level.
GPT-4
And in March 2023, GPT-4 was released along with the GPT-4 Technical Report (Arxiv23’). It’s a model that can process multi-modal data with images added, that improves on the hallucination problem, and that produces higher-quality answers. Unfortunately, they did not disclose any of the model size, architecture, HW info, dataset, or training process. Speculation puts it somewhere between a 1~100T model, and apparently around 30K GPUs are being used for model serving.
Closed-source LLMs
The three LLMs currently provided as services that are known to have the best performance - OpenAI’s ChatGPT, Google’s Bard, and Anthropic’s Claude - are all closed-source. That is, they don’t release the models and only provide services in the form of API calls.
In a situation where the pre-trained weights are not released, the training dataset is not released, and SFT and RLHF are also very expensive (annotators composed of highly educated people with master’s degrees or higher), it’s difficult to do research on various downstream tasks using pre-trained LLMs.
Also, these LLMs actually use a very inefficient training method. The amount of text data a human sees in a lifetime is said to be about 0.16GB, but GPT-3 used 570GB of data. Humans, in contrast, learn through educational curricula, or learn from teachers.
Therefore, I’ll introduce two projects that are open-source LLMs and that quickly trained models using an efficient training method (knowledge distillation).
Knowledge distillation from OpenAI’s LLM
Both models are really fresh off the press. Stanford’s Alpaca was released on March 13, 2023, and Berkeley’s Koala was released on April 3, 2023.
Both models are based on LLaMA (Arxiv23’), an open-source LLM released by Meta. It’s a very fresh model released on February 24, 2023. Surprisingly, Meta provided this model for research purposes, not as a service via API. They used the expression “further democratizing access.” It feels a bit like the company doing this because things were going badly for them lately… (shh…)
LLaMA was released in 7B, 13B, 33B, and 65B models. According to the paper, it used 4x more refined data, and the 13B model showed performance similar to GPT-3 (175B) on some tasks. But it still shows lower performance when compared to GPT-3.5 or GPT-4.
Alpaca and Koala are results of applying knowledge distillation to LLaMA to train smaller models that have performance rivaling OpenAI’s models. Both models start from the hypothesis that if you have 1) a reasonably powerful open-source pre-trained LLM and 2) high-quality Instruction data, you can create a powerful open-source LLM.
Alpaca
Alpaca created its Instruction dataset using a method called self-instruct. SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions (Arxiv22’) proposes a method of generating data using an LLM.
Using the method above, they created 52K instruction data points using only 175 seed instructions. Data generation is carried out through 5 steps.
Step 1 : Seed Instruction
A human directly writes 175 seed instructions.
Step 2 : Task generation
Using 6 seed tasks and 2 generated tasks, they generate 8 new tasks. They use the Template below and use text-davinci-003, GPT-3’s completion model.
You can see that new tasks are generated well, as shown below.
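The sampling in this step can be sketched as follows: draw 6 human-written seed tasks and 2 previously generated tasks, list them as numbered examples, and leave the next slot open for the model. The function name and the exact prompt wording are assumptions for illustration; the resulting string would be sent to the completion model.

```python
import random


def build_task_prompt(seed_tasks, generated_tasks, rng):
    """Build an in-context prompt from 6 seed tasks and 2 generated tasks."""
    examples = rng.sample(seed_tasks, 6) + rng.sample(generated_tasks, 2)
    rng.shuffle(examples)  # avoid always listing human-written tasks first
    lines = ["Come up with a series of tasks:"]
    for i, task in enumerate(examples, start=1):
        lines.append(f"Task {i}: {task}")
    # Open slot: the model continues from here with new tasks.
    lines.append(f"Task {len(examples) + 1}:")
    return "\n".join(lines)


# Placeholder task pools; real pools would hold actual instructions.
seed_pool = [f"seed instruction {i}" for i in range(175)]
generated_pool = [f"model instruction {i}" for i in range(10)]
task_prompt = build_task_prompt(seed_pool, generated_pool, random.Random(0))
print(task_prompt)
```

Mixing in model-generated tasks lets the pool drift beyond the original 175 seeds, while the 6 human-written examples keep each prompt anchored to well-formed instructions.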
Step 3 : Classification task identification
Using 12 classification and 19 non-classification tasks, they determine whether the generated task is a classification task or not. They use the few-shot learning methodology (although with 31 examples it’s hardly “few-shot”). Below is an example using just about 3.
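A sketch of how such a few-shot classification check could be assembled: labeled example tasks are listed with a Yes/No answer, and the new task is appended with the answer left open. The prompt wording and function name are illustrative assumptions, and only 3 examples are used here instead of the 31 mentioned above.

```python
def build_classification_prompt(labeled_examples, new_task):
    """Few-shot prompt asking whether new_task is a classification task."""
    lines = [
        "Can the following task be regarded as a classification task "
        "with finite output labels?"
    ]
    for task, is_classification in labeled_examples:
        lines.append(f"Task: {task}")
        lines.append(f"Is it classification? {'Yes' if is_classification else 'No'}")
    # The model is expected to complete the final line with Yes or No.
    lines.append(f"Task: {new_task}")
    lines.append("Is it classification?")
    return "\n".join(lines)


classification_prompt = build_classification_prompt(
    [
        ("Classify the sentiment of the sentence.", True),
        ("Write a poem about autumn.", False),
        ("Decide whether the email is spam.", True),
    ],
    "Translate the sentence into French.",
)
print(classification_prompt)
```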
Step 4 : Instance generation
Based on the given task, they generate instance input-output. However, for non-classification problems they construct the template to generate in input-output order, and for classification problems in output-input order.
The reason is that for classification tasks, when the input is generated first, there were many cases where only one output was generated. It’s a heuristic approach.
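The ordering heuristic can be sketched as a template that swaps field order depending on the task type. The field labels and function name are assumptions for illustration.

```python
def instance_template(instruction, is_classification):
    """Return the field layout used to prompt for instances of a task.

    Classification tasks ask for the label first (output-first), so the
    model is pushed to write inputs for several different labels.
    Other tasks use the natural input-then-output order.
    """
    if is_classification:
        fields = ["Class label:", "Input:"]
    else:
        fields = ["Input:", "Output:"]
    return "\n".join([f"Task: {instruction}"] + fields)


print(instance_template("Decide whether the review is positive.", True))
print(instance_template("Write a poem about autumn.", False))
```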
Step 5 : Filtering
Finally, among the generated instructions, only those satisfying the conditions below are added to the instruction pool.
- Add the instruction to the dataset pool only when ROUGE-L < 0.7 (ROUGE-L: a metric indicating the degree of string matching)
- Exclude cases containing the keywords images, pictures, graphs (there’s a possibility it couldn’t be properly expressed in text)
- Exclude cases where the input of an instance is the same but the output is different
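The first two filters above can be sketched in plain Python. ROUGE-L is computed here as the F-measure of the longest common subsequence over whitespace tokens, which is a simplified stand-in for a full ROUGE implementation; the helper names and the keyword list are assumptions, and the third filter (same input, different output) is not covered.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Simplified ROUGE-L F-measure on lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


BLOCKED_WORDS = {"image", "images", "picture", "pictures", "graph", "graphs"}


def keep_instruction(new, pool, threshold=0.7):
    """Keep `new` only if it has no blocked word and is novel enough."""
    if BLOCKED_WORDS & set(new.lower().split()):
        return False
    return all(rouge_l(new, old) < threshold for old in pool)


pool = ["Write a poem about spring"]
print(keep_instruction("Write a poem about spring", pool))
print(keep_instruction("Translate the sentence into French", pool))
```

Comparing each candidate against the whole pool is quadratic in the pool size, which is acceptable at the 52K scale but would need an index for much larger pools.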
The 52K self-instruct dataset generated this way apparently guarantees sufficient diversity and sufficient quality.
Alpaca fine-tuned the LLaMA 7B model using self-instruct data generated with text-davinci-003 (GPT-3). Generating 52K Instructions via the self-instruct method cost about $500 in OpenAI API call costs, and fine-tuning LLaMA on eight 80GB A100s for 3 hours cost about $100 on GCS.
In their own qualitative evaluation, when they had 5 student authors do a blind test against text-davinci-003, Alpaca won (?) by 90:89. However, there were drawbacks: the answers were relatively short and hallucinations occurred frequently.
But Alpaca showed that you can own an LLM with performance rivaling GPT-3 for just $600. It’s not at all burdensome to use in a research lab, and it could solve countless downstream tasks.
Koala
Koala took an even simpler approach than Alpaca. They directly used ChatGPT distillation data on the LLaMA 13B model. Since there wasn’t even a self-instruct data generation process, they performed only 6 hours of fine-tuning using eight 80GB A100s and trained the Koala model using only about $100.
There were two kinds of ChatGPT distillation data: ShareGPT and HC3. ShareGPT is a public dataset of about 60K ChatGPT conversations that users shared publicly, and HC3 is a dataset gathering about 60K human answers and 27K ChatGPT answers for about 24K questions. The authors of the Koala project named the LLaMA model fine-tuned on the ChatGPT distillation data “Koala-distill.”
Additionally, they also tested a model that used various open-source data for the fine-tuning work. The model trained by adding the data below was named “Koala-all.”
Unlike Alpaca, Koala conducted a qualitative evaluation by forming a panel of 100 evaluators on the Amazon Mechanical Turk platform.
As a result, it showed slightly better preference than Alpaca, but still lower preference than ChatGPT. But to rationalize (?) it, you can see it as an advantage given that the model is more than 10x smaller.
The surprising point was that Koala-Distill showed higher preference than Koala-All. This shows that rather than roughly scraping together data, performing only knowledge distillation using a powerful model yields much better performance.
Conclusion
There’s a movement in academia to create powerful yet open-source text generation LLMs using LLaMA. It’s showing that methods of generating data from closed-source LLMs, such as self-instruct and ChatGPT distillation datasets, work very efficiently. In other words, if you have a pre-trained LLM (the initial condition) and data created by a good teacher, you can easily create a powerful language model.
However, one thing to be careful about is that there are license issues, so it can’t be used commercially. LLaMA’s license is non-commercial bespoke, and the ChatGPT terms of use have a condition that it can’t be used for models competing with OpenAI, so please be careful about using these in a company!