This is a very simple explanation for those who don’t want to learn hardcore math but also don’t like to accept this core technology as “magic” that just works. There’s no magic of course, and the idea is rather simple.
As you probably know, an LLM is just a statistical predictor of the next word in a sentence based on the previous ones. At least, that prediction is the training objective; the emergent capabilities extend far beyond simple token prediction.
So, for example, if an LLM sees “The dog sits on the …” it would continue with “floor”.
If you think about it, the last word in the sentence is just “the”. How did the model deduce from that lone “the” that the next word should be “floor”?
Easy: you do that all the time. Here’s an example:
It is just one frame from a movie. On the surface it’s just an image of 800×600 pixels. It shows a man dressed in black and another one hanging from a rail. Not much.
But if you know what this image is about, this one image contains in itself the whole meaning and the whole plot of the movie, right?
And just from that image you can correctly (more or less, depending on how big of a fan you are) tell (generate) the events that follow that scene. Of course, if you don’t know the movie well, you can also hallucinate.
Same for an LLM: that “the” contains all the necessary meaning of the preceding text, stored in its vector representation (or, in technical terms, it’s the full context built through the attention mechanism).
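If you prefer to see the idea in code, here is a minimal, made-up sketch of how that single “the” vector would get turned into a next-word prediction. The sizes and names are purely illustrative, not from any real model:

```python
import torch

# Toy sizes: a 50,000-word vocabulary and 768-dimensional vectors (both made up).
vocab_size, d_model = 50_000, 768

# Pretend this is the vector for that final "the", already enriched with the
# meaning of everything before it (that enrichment is what attention does).
last_token_vector = torch.randn(d_model)

# The "LM head": a single matrix that maps a vector back to a score per word.
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)

scores = lm_head(last_token_vector)      # one score for every word in the vocabulary
probs = torch.softmax(scores, dim=-1)    # scores -> probabilities
next_word_id = int(torch.argmax(probs))  # with a trained model, this would be "floor"
```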
This is the key to understanding the transformer architecture. Before Transformers came along, neural networks processed words sequentially, one after another, forgetting what came earlier in a sentence. It’s like trying to understand a movie by only remembering the scene you’re currently watching or reading a book while forgetting the previous page. Recurrent Neural Networks (RNNs) tried to fix this by carrying information forward, but they had the memory of a goldfish when sentences got too long. Transformers said, “Hold my parameters” and revolutionized everything.
Vector encoding
Picture this: words arrive at a Transformer’s doorstep but they’re from different languages.
So the first dude they meet is the input embedding layer.
The input embedding layer says, “No problem! I’ll convert everything into the universal language of vectors”. A vector is just an array of numbers, so it goes like this:
"Hello" → [0.2, -0.6, 0.1, 0.8…]
Positional encoding
Now we have a problem. Our words are represented as vectors, but they’ve lost their place in line! The sentence “Dog bites man” means something very different from “Man bites dog,” but our embedding vectors don’t know their positions.
So the next dude is “positional encoding”. He adds the token’s position into the embedding, so it becomes:
"Hello" → [0.21, -0.62, 0.13, 0.81…]
How exactly it adds the position is something you can look up; it’s just basic math using sine and cosine functions (and you thought you’d never need them after school, didn’t you?).
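For the curious, here is roughly what that sine-and-cosine recipe from the original paper looks like. This is a sketch, not production code:

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positions, as in the original 'Attention Is All You Need' paper."""
    positions = torch.arange(seq_len).unsqueeze(1)        # (seq_len, 1)
    dims = torch.arange(d_model).unsqueeze(0)             # (1, d_model)
    angle_rates = 1.0 / (10_000 ** ((2 * (dims // 2)) / d_model))
    angles = positions * angle_rates

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles[:, 0::2])              # even dimensions: sine
    pe[:, 1::2] = torch.cos(angles[:, 1::2])              # odd dimensions: cosine
    return pe

# The position info is simply added on top of the word vectors:
# vectors_with_position = word_vectors + positional_encoding(seq_len, d_model)
```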
Attention mechanism
Imagine you’re at a party trying to understand a conversation. When someone mentions “it was terrible,” you naturally look for context clues to understand what “it” refers to. Maybe they were talking about a movie earlier? Or maybe it’s about the way you just tried to hit on that girl at the bar. Now you see that the difference in what “it” means can be very important. That’s essentially what attention does: it creates relationships between words in a sentence, so they are not just bare words but, like that “the”, words carrying the meaning of their context.
Multi-head attention is like having multiple heads. Each “head” pays attention to a different aspect of the relationships between words:
- Head 1 might focus on subject-verb relationships
- Head 2 might track pronouns and their referents
- Head 3 might be all about adjective-noun pairs
And then the heads pool their findings together for a comprehensive understanding. Here’s a quite complex diagram of the above that you don’t need to understand.
The diagram shows how the multi-head attention mechanism works within the transformer architecture, allowing it to focus on different relationships between words simultaneously.
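For the brave, here is the core of the attention computation in a few lines of simplified, single-head code. The shapes are made up; real implementations add masking, learned projections, and multiple heads running in parallel:

```python
import torch
import torch.nn.functional as F

def attention(query, key, value):
    """Scaled dot-product attention: every word decides how much to look at every other word."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k**0.5   # relevance of each word to each word
    weights = F.softmax(scores, dim=-1)                  # relevance -> percentages per row
    return weights @ value                               # blend the word vectors accordingly

# Toy setup: a 5-word sentence, 64-dimensional vectors, a single head.
seq_len, d_k = 5, 64
q = torch.randn(seq_len, d_k)   # "what is each word looking for?"
k = torch.randn(seq_len, d_k)   # "what does each word offer?"
v = torch.randn(seq_len, d_k)   # "what does each word actually pass along?"
contextualized = attention(q, k, v)   # shape (5, 64): words enriched by their context
```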
Feed forward
After all that attention, our Transformer needs to process what it discovered. The feed-forward neural networks are the next step. They are like the deep thinkers who process information after gathering all the gossip.
“But why do we need this step?” you might ask. Well, attention mechanisms are great at gathering information, but they’re not so great at transforming it into meaning. Feed-forward networks give the model the ability to mull the information over and extract deeper patterns.
Here, all the discovered aspects of meaning are stored in the token embedding:
"Hello" → [0.2145, -0.6239, 0.1314, 0.8169…]
Essentially, these are coordinates in the vector space of a concept that represents the full meaning of that word in its context, including all the discovered aspects. You would think that if there are many aspects, this space should have a lot of dimensions. And you would be right. In the largest models it has on the order of 12 thousand dimensions (GPT-3, for example, uses 12,288). By the way, trying to imagine that is futile; our brains are not built to work with more than 3 dimensions.
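Code-wise, the feed-forward block itself is refreshingly boring: expand each token’s vector, apply a non-linearity, shrink it back, independently for every token. A sketch with illustrative sizes:

```python
import torch

d_model, d_hidden = 768, 3072   # made-up sizes; the hidden layer is typically ~4x wider

feed_forward = torch.nn.Sequential(
    torch.nn.Linear(d_model, d_hidden),   # expand: give the model room to "think"
    torch.nn.ReLU(),                      # non-linearity: where the deeper patterns come from
    torch.nn.Linear(d_hidden, d_model),   # compress back to the token vector size
)

token_vector = torch.randn(1, d_model)
richer_vector = feed_forward(token_vector)   # same shape, hopefully richer meaning
```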
Layer normalization: the Winamp plugin of AI
Layer normalization is like volume normalization in your Spotify playlist: different songs can have different volume levels, but you want them all to be about the same. It standardizes the outputs to prevent any single neuron in the network from shouting and thus dominating the conversation.
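In code it’s a one-liner per use; under the hood it rescales each token’s vector so its values sit around zero with roughly unit variance. A sketch:

```python
import torch

d_model = 768
layer_norm = torch.nn.LayerNorm(d_model)

noisy_vector = torch.randn(1, d_model) * 50 + 10   # values all over the place, some neurons "shouting"
calm_vector = layer_norm(noisy_vector)             # rescaled to roughly mean 0, variance 1

print(noisy_vector.mean().item(), noisy_vector.std().item())   # big numbers
print(calm_vector.mean().item(), calm_vector.std().item())     # ~0.0 and ~1.0
```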
Why is the transformer so revolutionary?
Before transformers, training neural networks for language tasks was like teaching a dog to read — slow, painful, and with limited results. Transformers changed the game for several reasons:
Parallelization: Unlike RNNs that process words one at a time, transformers can process entire sentences at once.
Long-Range Dependencies: Thanks to attention, transformers can connect words no matter how far apart they are in a sentence. “The man who wore the red hat that his sister bought from the store that closed last year is happy” — a transformer can easily connect “man” and “happy” despite the long journey between them.
Pre-training: Transformers can be pre-trained on massive amounts of text and then fine-tuned for specific tasks. It’s like training a general athlete who can then quickly specialize in any specific sport.
The original transformer was modest in size (the base model from the paper had about 65 million parameters), but models like GPT-4 have billions of parameters. That’s like going from a small family business to a multinational corporation in just a few years.
“The larger the model, the better it performs,” seems to be the mantra, though this has raised questions about computational resources and environmental impact. Training these giants requires enough electricity to power a small city, leading some researchers to joke that AI stands for “Always Inefficient.”
Further Reading: [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (the original Transformer paper).