Generative Pre-Trained Transformer - 2
Launch date: 14th Feb, 2019
GPT-2, the next version of the GPT model, was introduced in 2019. It was trained on a larger dataset and given more parameters to make the model stronger. Building on GPT-1, this second version is designed to handle multiple tasks together, such as question answering, machine translation, reading comprehension, and summarization, and to bring performance on them closer to human ability. The largest GPT-2 model has more than 10x the parameters of GPT-1 (which is roughly the size of the smallest GPT-2 model).
Base Framework
The base model is similar to the initial GPT model: a transformer-based architecture with decoder blocks only. To perform a specific task, the learning objective is adjusted to P(output | input, task). This modification is known as task conditioning: the model is expected to produce different outputs for the same input depending on the task. Some models implement task conditioning at the architectural level, feeding the model both the task and the input. For language models, the task, input, and output are all sequences of natural language. As a result, task conditioning for language models is carried out by giving the model examples or instructions in natural language. Task conditioning is the foundation for the zero-shot task transfer described in the GPT-2 paper.
GPT-2's capacity for zero-shot task transfer is intriguing. Zero-shot learning is a special case of zero-shot task transfer in which no examples are given at all and the model is simply instructed to perform the task. Rather than rearranging the input sequences for fine-tuning, as was done for GPT-1, the input to GPT-2 was presented in a format from which the model was expected to understand the nature of the task and provide answers. This was done to elicit zero-shot task transfer behavior. For instance, for the English to French translation task, the model was given an English sentence, followed by the word French and a prompt. The model was expected to recognize that the task involved translation and to provide the French equivalent of the English sentence. These tasks are expected to be carried out without any supervised training on the task.
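As a rough illustration of task conditioning through text alone, the sketch below builds different prompts for the same input depending on the task. The template strings, the task names, and the `build_prompt` helper are hypothetical and chosen only for illustration; they are not the exact formats used to evaluate GPT-2.

```python
# Hypothetical prompt templates: the task is expressed purely in natural
# language, so the same model can be steered toward different outputs.
TASK_TEMPLATES = {
    "translate_to_french": "{text}\nFrench:",
    "summarize": "{text}\nTL;DR:",
    "answer": "{text}\nA:",
}


def build_prompt(task, text):
    """Return the text a language model would be asked to continue.

    The model never receives a task id; the task is implied by the prompt,
    which is what P(output | input, task) looks like for a language model.
    """
    return TASK_TEMPLATES[task].format(text=text)


sentence = "The weather is nice today."
for task in TASK_TEMPLATES:
    print(repr(build_prompt(task, sentence)))
```

The same sentence produces three different prompts, so a single model is expected to return a translation, a summary, or an answer depending only on the surrounding text.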
To create a large, high-quality dataset, the authors scraped Reddit and gathered data from the outbound links of posts that had received at least 3 karma. The resulting dataset, called WebText, contained 40GB of text from over 8 million documents. This much larger dataset was used to train GPT-2, in contrast to the Book Corpus dataset used to train GPT-1. Because Wikipedia material is common in test sets, Wikipedia content was excluded from WebText. Text is encoded with a byte-level version of byte pair encoding (BPE), which needs a base vocabulary of only 256 byte values instead of the more than 130,000 symbols a Unicode-level base vocabulary would require.
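To make the base-vocabulary figures concrete, here is a minimal sketch, assuming text is encoded as UTF-8, showing that any string can be represented with byte values in the range 0 to 255. It also shows one pair-counting step of the kind BPE merges are built from. The helper names are hypothetical, and this is not GPT-2's actual tokenizer.

```python
from collections import Counter


def byte_tokens(text):
    """Return the UTF-8 byte values of `text`; every value is in 0..255."""
    return list(text.encode("utf-8"))


def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids, or None if too short.

    BPE repeatedly merges such a pair into a new token; only the counting
    step is shown here.
    """
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


sample = "naïve café 你好"
ids = byte_tokens(sample)
print(len(sample), "characters ->", len(ids), "byte tokens")
print("all ids fit in a 256-entry base vocabulary:", all(0 <= i < 256 for i in ids))
print("most frequent byte pair in 'aaab':", most_frequent_pair(byte_tokens("aaab")))
```

Non-ASCII characters expand to several bytes, which is the price paid for never needing an unknown-token symbol.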
Model Specifications
GPT-2 has 1.5 billion parameters, roughly ten times as many as GPT-1 (117M parameters). The model shares its major elements with GPT-1, with a few significant differences:
For word embeddings, the largest GPT-2 model used 1600-dimensional vectors across 48 layers, with a larger vocabulary of 50,257 tokens.
A larger batch size of 512 was used, and the context window was increased from 512 to 1024 tokens.
Layer normalization was moved to the input of each sub-block and an additional layer normalization was added after the final self-attention block.
At initialisation, the weight of residual layers was scaled by 1/√N, where N was the number of residual layers.
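The last two changes in this list can be sketched in a few lines. This is a simplified illustration in plain Python, assuming a layer normalization without learnable gain and bias and an arbitrary `sublayer` callable standing in for attention or the feed-forward network; the function names are hypothetical and this is not the original implementation.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned params)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_residual(x, sublayer):
    """Apply layer norm at the INPUT of the sub-block, then add the residual.

    Computes x + sublayer(layer_norm(x)), instead of the post-norm form
    layer_norm(x + sublayer(x)).
    """
    out = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, out)]


def residual_init_scale(n_residual_layers):
    """Factor 1/sqrt(N) applied to residual-layer weights at initialisation."""
    return 1.0 / math.sqrt(n_residual_layers)


# Toy sub-layer that halves its input, standing in for attention or an MLP.
print(pre_ln_residual([1.0, 2.0, 3.0], lambda h: [0.5 * v for v in h]))
print(residual_init_scale(48))
```

Scaling by 1/√N keeps the sum of many residual contributions from growing as the network gets deeper.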
Four language models were trained, with roughly 117M (the same size as GPT-1), 345M, 762M, and 1.5B (GPT-2) parameters. They used 12, 24, 36, and 48 layers and 768, 1024, 1280, and 1600 dimensional layers respectively. Each successive model had lower perplexity than the one before it. This shows that, on the same dataset, the perplexity of language models decreases as the number of parameters increases. The model with the most parameters also performed best on every downstream task.
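The relationship between layer count, layer width, and parameter count can be approximated with a back-of-the-envelope formula. The sketch below assumes a standard decoder block with about 12·d² weights (4·d² for attention and 8·d² for a feed-forward network with a 4x hidden size) and ignores biases and layer-norm parameters. The `estimate_params` helper is hypothetical, and its output lands near, but not exactly on, the figures quoted above.

```python
def estimate_params(n_layers, d_model, vocab_size=50257, context=1024):
    """Rough parameter count for a decoder-only transformer.

    Assumes ~12 * d_model^2 weights per block plus token and position
    embeddings; biases and layer-norm parameters are ignored.
    """
    per_block = 12 * d_model ** 2
    embeddings = (vocab_size + context) * d_model
    return n_layers * per_block + embeddings


# (layers, width) pairs taken from the four models described above.
for n_layers, d_model in [(12, 768), (24, 1024), (36, 1280), (48, 1600)]:
    total = estimate_params(n_layers, d_model)
    print(f"{n_layers} layers, d_model={d_model}: ~{total / 1e6:.0f}M parameters")
```

Because the per-block term grows with the square of the width, widening and deepening the model together is what drives the jump from roughly a hundred million to over a billion parameters.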
Evaluation
GPT-2 was evaluated on datasets for many downstream tasks, such as reading comprehension, summarization, translation, and question answering. The model was tested against a range of objectives and datasets:
In zero-shot settings, GPT-2 improved the then-current state of the art on 7 of the 8 language modeling datasets, across domains and datasets. It still fell well short on the One Billion Word Benchmark, most likely because that benchmark is the largest dataset and has some of the most destructive pre-processing.
The Children's Book Test (CBT) assesses how well language models perform on different categories of words, including nouns, prepositions, and named entities. The task is to predict the correct omitted word from 10 possible choices. GPT-2's accuracy grew steadily with model size on both CBT-common nouns and CBT-named entities, reaching new state-of-the-art accuracies of 93.3% on common nouns and 89.1% on named entities.
The LAMBADA dataset evaluates how well models handle long-range dependencies by asking them to predict the last word of a sentence. GPT-2 raised the state-of-the-art accuracy of language models (LMs) from 19% to 52.66% and cut perplexity from 99.8 to 8.6. Its predictions were mostly valid continuations of the sentence but not valid final words. Adding a stop-word filter gave a further improvement of 4%.
The Winograd Schema Challenge gauges a system's capacity for commonsense reasoning by assessing its ability to resolve ambiguities in text. GPT-2 reached an accuracy of 70.70%, an improvement of 7%.
The CoQA dataset consists of documents from several domains paired with natural-language dialogues in which questions about the document are asked and answered. It measures reading comprehension as well as the ability to answer questions that depend on the conversation history. In the zero-shot setting, GPT-2 matched or exceeded 3 of 4 baselines on reading comprehension, even though those baselines were trained on the 127,000+ question-answer pairs of the training data.
Overall, the GPT-2 paper shows that training on a larger dataset and using more parameters improved the language model's ability to understand tasks and to surpass the state of the art on many tasks in zero-shot settings. The paper reports that performance increased in a log-linear manner as model capacity increased.
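Several of the results above are reported as perplexity. As a minimal sketch, perplexity can be computed as the exponential of the average negative log-probability that a model assigns to each correct token. The probabilities below are made-up toy values, not GPT-2 outputs, and the `perplexity` helper is hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the correct tokens.

    Computes exp(-mean(log p)); lower values mean the model was less
    surprised by the text.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Toy values: a model that is uniformly unsure among 4 options, and a model
# that is much more confident about the same tokens.
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(uncertain, confident)
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens.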
Figure 9.2: Performance of GPT-2 on the CBT dataset
[Source: GPT -2 paper]
Also, as the number of parameters increased, the drop in language model perplexity showed no sign of saturating. GPT-2 actually underfit the WebText dataset, and training for longer might have reduced perplexity further. This suggests that GPT-2 was not the limit on model size, and that a larger language model would reduce perplexity further and improve natural language understanding.
Figure 9.3: Performance of GPT-2 on the Winograd Schema Challenge
[Source: GPT -2 paper]
To learn more about the technical aspects of GPT-2, you can refer to the paper Language Models are Unsupervised Multitask Learners - https://tinyurl.com/3x7b74n9