Unpacking the Encoder, the Decoder and the building blocks of AI

In Chapter 1, we explored how Self-Attention allows Transformers to understand entire sentences simultaneously. But understanding your prompt is only half the conversation. Once AI comprehends what you’re asking, it still has to construct a coherent response from scratch.
How does ChatGPT move from understanding “The weather is beautiful” to generating “I think I’ll go for a walk”? The answer lies in two specialized components: the Encoder and the Decoder.
An Encoder is a neural network component that processes input text and creates a mathematical representation of its meaning. A Decoder is a neural network component that uses that representation to generate output text one word at a time.
Together, they form the system that allows AI models to both understand prompts and produce relevant responses.
Self-Attention solves comprehension, but generating language requires a different process.
When you read, you process everything at once. When you speak, you construct words one at a time, constantly checking that each new word fits what you’ve already said. This is why Transformer models split the work: Encoders handle comprehension, Decoders handle generation.
The Encoder is the input processing specialist. Think of the encoder as the ultimate librarian. When you type a prompt in ChatGPT, the encoder sees a string of letters and immediately goes to work tagging every word with its own specific context.
When you ask “Pick up the bat,” the Encoder looks at the surrounding words to figure out if you’re talking about baseball or a cave-dwelling animal.
It uses the Self-Attention mechanism we discussed in Chapter 1 to create a “map” of your intent. By the time it finishes, it has compressed your request into a context map: a mathematical representation capturing not just your words but the intent behind them.
This context map includes which words relate to each other using attention scores, the likely meaning of ambiguous terms and the overall structure of your request. The Encoder strips away linguistic ambiguity and creates a focused summary that passes to the Decoder.
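The attention scores behind this map can be sketched in a few lines. The tiny two-dimensional word vectors below are invented for illustration; real models learn vectors with thousands of dimensions during training.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Invented 2-dimensional "meaning" vectors for each word in "pick up the bat"
embeddings = {
    "pick": [1.0, 0.2],
    "up":   [0.9, 0.1],
    "the":  [0.1, 0.1],
    "bat":  [0.3, 1.0],
}

def attention_scores(query_word):
    """Score how strongly `query_word` relates to every word (dot products)."""
    q = embeddings[query_word]
    raw = [sum(qi * ki for qi, ki in zip(q, embeddings[w]))
           for w in embeddings]
    return dict(zip(embeddings, softmax(raw)))

print(attention_scores("bat"))
```

Each word ends up with a probability-like score against every other word; a row of these scores for every word in the sentence is, in miniature, the “map” the Encoder hands to the Decoder.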
Once the Encoder creates its context map, the Decoder generates responses word by word, ensuring each choice is grammatically correct and relevant.
The Decoder operates through three steps:
Checking the context map: Before generating each word, it examines the Encoder’s map to understand what you asked and what it has already written.
Predicting the next word: Using the context map and its own generated words, it calculates which word should come next based on probability scores from billions of training examples.
The feedback loop: After generating each word, it updates its understanding and repeats this process, ensuring it doesn’t drift off-topic.
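The three steps above can be sketched as a loop. The probability table below is invented for illustration; a real model computes these scores from billions of learned parameters, and its context map is far richer than this toy version uses.

```python
# Invented next-word probability table keyed on the last two words
NEXT_WORD_PROBS = {
    ("the", "weather"): {"is": 0.9, "was": 0.1},
    ("weather", "is"): {"beautiful": 0.7, "cold": 0.3},
}

def generate(context_map, prompt_tokens, max_words=5):
    output = list(prompt_tokens)
    for _ in range(max_words):
        # Step 1: consult the context map and the words generated so far
        # (this toy version only looks at the last two generated words)
        key = tuple(output[-2:])
        probs = NEXT_WORD_PROBS.get(key)
        if probs is None:
            break  # no known continuation
        # Step 2: predict the next word from the probability scores
        next_word = max(probs, key=probs.get)
        # Step 3: feedback loop - append the word and repeat
        output.append(next_word)
    return output

print(generate(context_map={}, prompt_tokens=["the", "weather"]))
# ['the', 'weather', 'is', 'beautiful']
```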
Imagine you are ordering a pizza. The encoder is the waiter who writes down “large Pepperoni, thin crust.” The decoder is the chef who looks at that note (Step 1), picks up the dough (Step 2) and then checks the note again to make sure they grab the pepperoni and not the olives (Step 3). This “double-checking” is why AI feels so consistent.
Before the Encoder and Decoder can operate, AI must convert text into something mathematically processable. This is where tokens, the fundamental units that language models use, come in. A token is a chunk of text that AI breaks language into for processing: common words like ‘the’ stay as one token, while complex words like ‘unhappily’ might split into three.
To understand how tokens work, think of them as the shredded pieces of a sentence. Computers don’t see “words” as abstract concepts the way we do; they see patterns of characters.
Here is how the machine turns your messy typing into something it can actually calculate:
How Tokenization Works in ChatGPT
• Common words like “the” or “and” stay whole because AI has encountered them millions of times.
• Complex words split into smaller pieces. “Unhappily” becomes three tokens: “un,” “happi,” and “ly.”
• Punctuation and spaces get their own tokens.
• Once tokenized, AI assigns each token a numerical ID. In the model’s dictionary, “un” might be 42, “happi” might be 809, “ly” might be 1,247.
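The steps above can be sketched with a greedy longest-match tokenizer. The four-entry vocabulary and its IDs are invented for illustration; real vocabularies hold tens of thousands of learned pieces, and production tokenizers are more sophisticated than this.

```python
# Invented mini-vocabulary mapping token pieces to numerical IDs
VOCAB = {"the": 5, "un": 42, "happi": 809, "ly": 1247}

def tokenize(word):
    """Greedily split a word into the longest known vocabulary pieces."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown fragment: fall back to one character
            i += 1
    return pieces

print(tokenize("unhappily"))                      # ['un', 'happi', 'ly']
print([VOCAB[p] for p in tokenize("unhappily")])  # [42, 809, 1247]
print(tokenize("unflurble"))  # the known "un" prefix is still recognized
```

The last line previews the flexibility discussed below: even for a made-up word, the tokenizer still recovers the familiar “un” piece and approximates the rest.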
This tokenization system is the secret to AI’s flexibility with language. By breaking words into reusable pieces, the model doesn’t need to memorize every possible word in every language. Instead, it learns how approximately 50,000 token “building blocks” combine to create meaning.
When you use a made-up word like “unflurble,” the AI recognizes the “un-” prefix from thousands of other words (unhappy, uncertain, unusual) and calculates that you probably mean the opposite of whatever a “flurble” is even though it has never seen “flurble” before.
This is also why AI occasionally makes strange spelling errors or struggles with very new slang. If a word is too new or too rare to be in its token vocabulary, the model has to approximate using the pieces it knows.
The complete Transformer process in systems like ChatGPT now looks like this:
Tokenization: Your prompt “Explain quantum computing” becomes a sequence of token IDs
Encoder processing: Using Self-Attention, the Encoder analyzes all tokens simultaneously and creates a context map
Decoder generation: The Decoder consults the context map and generates response tokens one at a time, checking each against the map and its own output
Detokenization: Token IDs are converted back into readable text
This pipeline is why modern AI feels conversational rather than mechanical. The Encoder ensures the AI understands the nuance of your request, while the Decoder ensures the response is coherent and relevant, both working with tokens as their fundamental units of language.
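The four stages can be tied together in one end-to-end sketch. Every table below is an invented stand-in; real systems learn these mappings from training data, and the Encoder’s context map is a set of attention-weighted vectors rather than a simple list.

```python
# Invented vocabulary and next-token table for the four-stage pipeline
VOCAB = {"explain": 7, "quantum": 12, "computing": 19,
         "qubits": 33, "use": 4, "superposition": 51}
ID_TO_TOKEN = {v: k for k, v in VOCAB.items()}

def tokenize(text):                # 1. Tokenization: text -> token IDs
    return [VOCAB[w] for w in text.lower().split()]

def encode(token_ids):             # 2. Encoder: IDs -> "context map"
    # stand-in for Self-Attention: just keep the processed prompt
    return list(token_ids)

NEXT = {19: 33, 33: 4, 4: 51}      # toy next-token lookup table

def decode(context_map, n=3):      # 3. Decoder: generate IDs one at a time
    last = context_map[-1]
    out = []
    for _ in range(n):
        last = NEXT.get(last)
        if last is None:
            break
        out.append(last)
    return out

def detokenize(ids):               # 4. Detokenization: IDs -> readable text
    return " ".join(ID_TO_TOKEN[i] for i in ids)

reply = detokenize(decode(encode(tokenize("Explain quantum computing"))))
print(reply)                       # qubits use superposition
```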
Not all AI models use both Encoders and Decoders. The architecture depends on the task:
Encoder-only models (like BERT) excel at understanding and classification tasks: sentiment analysis, content moderation or search ranking. They only need to comprehend existing text, not generate new text.
Decoder-only models (like GPT-4, Claude, ChatGPT) focus on generation. They combine both understanding and generation into a single streamlined system optimized for conversation and content creation. This is the most common architecture for conversational AI in 2026.
Encoder-Decoder models (like the original Transformer, T5) use both components separately, making them ideal for translation tasks where you need a deep understanding of one language and precise generation in another.
This is why ChatGPT feels different from Google Translate: they use different Transformer architectures optimized for different goals.
You now understand the full anatomy of how AI language models process text:
• Self-Attention allows understanding of context (Chapter 1)
• Encoders map your intent into mathematical representations
• Decoders generate responses word by word
• Tokens serve as the fundamental building blocks for all processing
But one critical question remains: How does the Decoder actually choose which word comes next? When faced with multiple plausible options, what determines whether it picks “apple” or “orange,” “happy” or “joyful”?
In Chapter 3, we’ll explore the prediction engine: how AI uses probability weights to prioritize information and how settings like “temperature” act as creativity dials, allowing the same model to be either a precise, literal assistant or a wildly imaginative storyteller.
Get ready to see how the math finally turns into a conversation. See you in the next part…