AhbarjietMalta

Generative Pre-Trained Transformer - 1



Launch date: 11th June, 2018
In 2018, the first GPT model, GPT-1, was released. It was generatively pre-trained on a large, diverse corpus of unlabeled text and then fine-tuned, which gave it a strong natural language understanding (NLU) base.
Basic Framework
GPT-1 trains its language model with a transformer made up of 12 decoder layers that use masked self-attention. It was trained on the BookCorpus dataset, which contains over 7,000 unpublished books. Because the books are unpublished, the model is exercised on text it is unlikely to have seen elsewhere, and their long stretches of contiguous text help it learn longer-range context.
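To make the masked self-attention idea concrete, here is a minimal pure-Python sketch, not the model's actual implementation. It assumes the attention scores have already been computed; the function names `causal_attention_weights` and `attend` and the toy numbers are illustrative only. The point it shows is that each position can attend only to itself and earlier positions.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to positions j <= i.
        visible = scores[i][: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive a weight of exactly zero.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

def attend(weights, values):
    """Weighted sum of the value vectors for each position."""
    dim = len(values[0])
    return [
        [sum(w * v[k] for w, v in zip(row, values)) for k in range(dim)]
        for row in weights
    ]

# Toy example: 3 positions, 2-dimensional value vectors (made-up numbers).
scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(scores)
outputs = attend(weights, values)
```

In this toy example the large score of 5.0 in the second row has no effect, because it belongs to a future position and is masked out.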
Model Training Stages
The GPT-1 training framework has three parts:
Unsupervised pre-training: a large corpus of unlabeled text is tokenized and fed to the model, which is trained to maximize the likelihood of each token given the tokens that precede it.
Supervised fine-tuning: the pre-trained model is adapted to a discriminative task using labeled data. Each input is passed through the pre-trained model, and the activation of the final transformer block is fed into an added linear output layer to predict the label. Training maximizes this supervised objective (called L2 in the paper), which can be combined with the language-modeling objective as an auxiliary term.
Task-specific input transformations: structured inputs, such as ordered sentence pairs or triplets of document, question, and answers for tasks like question answering or textual entailment, are converted into a single ordered token sequence. Start and end tokens are added to each sequence, and delimiter tokens separate its parts so that their order is preserved.
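The input transformations can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the token spellings `<s>`, `$`, and `<e>` are hypothetical, whitespace splitting stands in for BPE tokenization, and all function names are made up. Producing both orderings for similarity tasks and one sequence per candidate answer follows the description in the GPT-1 paper.

```python
# Hypothetical spellings for the start, delimiter, and end tokens.
START, DELIM, END = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    """Ordered sentence pair: start, premise, delimiter, hypothesis, end."""
    return [START] + premise + [DELIM] + hypothesis + [END]

def similarity_inputs(text_a, text_b):
    """Similarity has no inherent order, so both orderings are produced."""
    return [entailment_input(text_a, text_b), entailment_input(text_b, text_a)]

def multiple_choice_inputs(document, question, answers):
    """One sequence per candidate answer: document and question, delimiter, answer."""
    context = document + question
    return [[START] + context + [DELIM] + answer + [END] for answer in answers]

# Whitespace tokenization is a stand-in for BPE here.
premise = "a man is sleeping".split()
hypothesis = "a person rests".split()
pair = entailment_input(premise, hypothesis)

choices = multiple_choice_inputs(
    "the sky was dark".split(),
    "what time was it ?".split(),
    ["night".split(), "noon".split()],
)
```

Each resulting sequence is processed by the same pre-trained model, so only the input formatting changes between tasks, not the architecture.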
Figure 9.1: The transformer architecture and the input transformations used to fine-tune on different tasks
[Source: GPT-1 paper]
Model Implementation Specifications
The model used 768-dimensional states for the token embeddings, 12 attention heads, and 3072-dimensional inner states in the position-wise feed-forward layers. It was optimized with Adam at a maximum learning rate of 2.5 x 10^-4. The learning rate was increased linearly from zero over the first 2,000 updates and then annealed with a cosine schedule. The vocabulary was built with byte pair encoding (BPE) using 40,000 merges. Residual, embedding, and attention dropout with a rate of 0.1 were used for regularization, and the Gaussian Error Linear Unit (GELU) was used as the activation function. The model was trained for 100 epochs on mini-batches of 64 sequences, each 512 tokens long. The model had 117M parameters in total.
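The pre-training learning-rate schedule can be sketched as a small function: a linear increase from zero over the first 2,000 updates, followed by cosine annealing (to zero, according to the GPT-1 paper). This is a minimal sketch, not the original training code. The function name `pretraining_lr` and the `total_updates` argument are assumptions, since the text gives the training length in epochs rather than in updates.

```python
import math

MAX_LR = 2.5e-4        # maximum learning rate from the text
WARMUP_UPDATES = 2000  # linear warmup length from the text

def pretraining_lr(step, total_updates):
    """Learning rate at a given update; total_updates is a hypothetical run length."""
    if step < WARMUP_UPDATES:
        # Linear warmup from zero to MAX_LR.
        return MAX_LR * step / WARMUP_UPDATES
    # Cosine annealing from MAX_LR down to zero over the remaining updates.
    progress = (step - WARMUP_UPDATES) / max(1, total_updates - WARMUP_UPDATES)
    progress = min(progress, 1.0)
    return MAX_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example with an assumed run length of 100,000 updates.
lr_mid_warmup = pretraining_lr(1000, total_updates=100_000)
lr_peak = pretraining_lr(2000, total_updates=100_000)
lr_end = pretraining_lr(100_000, total_updates=100_000)
```

With these values, the rate is half the maximum midway through warmup, reaches the maximum at update 2,000, and falls to zero by the final update.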
Fine-tuning reused the hyperparameter settings from pre-training unless noted otherwise. A dropout rate of 0.1 was applied to the classifier, with a learning rate of 6.25e-5 and a batch size of 32. The model fine-tunes quickly: 3 epochs of training were enough in most cases. The learning rate follows a linear decay schedule, with warmup over the first 0.2% of training.
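To see what these fine-tuning settings mean in practice, the sketch below works out the number of training steps and the learning rate at each step. It is an illustration only: the function name `finetune_schedule`, the linear shape of the warmup, and the dataset size of 100,000 examples are assumptions that do not come from the text.

```python
import math

def finetune_schedule(num_examples, batch_size=32, epochs=3,
                      peak_lr=6.25e-5, warmup_fraction=0.002):
    """Return total steps, warmup steps, and a function giving the rate per step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Warmup covers 0.2% of training, with at least one step.
    warmup_steps = max(1, round(total_steps * warmup_fraction))

    def lr_at(step):
        if step < warmup_steps:
            # Assumed linear warmup from zero to the peak rate.
            return peak_lr * step / warmup_steps
        # Linear decay from the peak rate to zero at the last step.
        remaining = max(0, total_steps - step)
        return peak_lr * remaining / max(1, total_steps - warmup_steps)

    return total_steps, warmup_steps, lr_at

# Hypothetical dataset of 100,000 labeled examples.
total_steps, warmup_steps, lr_at = finetune_schedule(num_examples=100_000)
```

For this assumed dataset size, training takes 9,375 steps, of which only the first 19 are warmup, so almost the whole run is spent in the linear decay phase.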
Evaluation
The study showed that pre-training improved the model's zero-shot performance on a variety of NLP tasks, including sentiment analysis, question answering, and Winograd schema resolution. The architecture enabled transfer learning and could perform a range of NLP tasks with comparatively little fine-tuning. The model demonstrated the efficacy of generative pre-training and opened the way for future models to exploit it further with larger datasets and more parameters. GPT-1 outperformed supervised state-of-the-art models built specifically for each task on 9 of the 12 tasks on which they were compared.
The authors used the then recently released RACE dataset, which consists of English passages and the associated questions from middle and high school exams. This corpus has been shown to contain more reasoning-type questions than other datasets such as CNN or SQuAD, which makes it a good testing ground for a model trained to handle long-range contexts. They also evaluated on the Story Cloze Test, which requires choosing the correct ending from two options for multi-sentence stories. GPT-1 again performed significantly better on these tasks than the previous best results, with gains of up to 8.9% on Story Cloze and 5.7% overall on RACE.
To learn more about the technical aspects of GPT-1, see the paper: Improving Language Understanding by Generative Pre-Training - https://tinyurl.com/3fu53mrd

By AhbarjietMalta

