Step-by-Step Implementation of Generative Pre-Trained Transformers (GPT)
In this post, we explore the implementation of a small GPT model using Keras and TensorFlow.
I will not delve into too many details here, as I am already planning a post that explains how GPT was created and sheds some light on the theoretical framework behind it.
If you'd like to skip straight to the implementation, here is a link to the Python notebook: https://github.com/mouadk/GPT/blob/main/GPT.ipynb.
Attention Mechanism
The core of GPT is the attention mechanism, the backbone of the Transformer model introduced in the 2017 paper "Attention Is All You Need" (https://arxiv.org/pdf/1706.03762.pdf).
Many derived models, such as BERT, have been designed since then; some use the original encoder-decoder architecture in its entirety, while others use only part of it. GPT is an example of the latter: it relies solely on the decoder part of the architecture.
Self-attention operates on your data much like an information retrieval system. You have a set of queries, keys, and values. Each query is compared against the set of keys to measure how similar each key is to that query, and a weighted average of the values is then taken (each value weighted by its similarity score, such as cosine similarity) to get a final…
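To make this concrete, here is a minimal sketch of scaled dot-product attention in TensorFlow. This is my own illustration rather than the exact code from the notebook, and the names scaled_dot_product_attention and causal_mask are mine. Each query is scored against every key, the scores are scaled and turned into weights with a softmax, and the values are averaged with those weights; the causal mask is what makes the decoder autoregressive by hiding future positions.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, depth)
    # Score every query against every key (dot-product similarity).
    scores = tf.matmul(q, k, transpose_b=True)            # (batch, heads, seq_q, seq_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = scores / tf.math.sqrt(dk)                    # scale to keep softmax gradients stable
    if mask is not None:
        scores += mask * -1e9                             # push masked (future) positions to ~zero weight
    weights = tf.nn.softmax(scores, axis=-1)              # similarity scores -> attention weights
    return tf.matmul(weights, v), weights                 # weighted average of the values

def causal_mask(seq_len):
    # 1.0 where a position would attend to a future token, 0.0 elsewhere.
    return 1.0 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

# quick check on random tensors
q = k = v = tf.random.normal((1, 1, 4, 8))
out, attn = scaled_dot_product_attention(q, k, v, mask=causal_mask(4))
print(out.shape, attn.shape)  # (1, 1, 4, 8) (1, 1, 4, 4)
```

In practice this function sits inside a multi-head attention layer, where the queries, keys, and values are learned linear projections of the same token embeddings.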