LLM Architecture
Inference
Prefill Phase
- This phase is compute-intensive because the model processes every input token in a single pass.
- Steps:
  - Tokenization: Converting the input text into tokens (the basic word or subword units the model operates on)
  - Embedding Conversion: Mapping each token to a numerical vector that captures its meaning
  - Initial Processing: Running these embeddings through the model’s layers to build a contextual representation of the whole input
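
The three prefill steps above can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: `VOCAB`, `DIM`, `tokenize`, `embed`, `causal_attention`, and `prefill` are hypothetical names, the tokenizer is a whitespace split rather than a real subword tokenizer, the embedding table is a fixed formula rather than learned weights, and the "model" is a single attention pass with no learned projections.

```python
import math

# Hypothetical toy vocabulary; real models use learned subword tokenizers.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
DIM = 4  # assumed embedding width, chosen only for illustration


def tokenize(text):
    """Map whitespace-separated words to token ids (unknown words map to <unk>)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Return one DIM-sized vector per token id from a deterministic toy table."""
    return [[math.sin(tid + 1 + d) for d in range(DIM)] for tid in token_ids]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(vectors):
    """One attention pass over the whole prompt; position i sees positions 0..i only."""
    scale = math.sqrt(DIM)
    outputs = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in visible]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(DIM)]
        )
    return outputs


def prefill(text):
    """Run tokenization, embedding, and one processing pass over the full prompt."""
    token_ids = tokenize(text)
    embeddings = embed(token_ids)
    hidden = causal_attention(embeddings)
    return token_ids, embeddings, hidden
```

Calling `prefill("The cat sat")` returns the token ids, one embedding per token, and one context-aware vector per token. The work grows with the number of prompt tokens, which is the intuition behind calling this phase compute-intensive.
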
Decode Phase
- This phase is where the actual text generation happens.
- The model generates one token at a time in an autoregressive process: each new token is conditioned on all the tokens that came before it.
- This phase is memory-intensive because the model must keep state for every earlier token, from both the prompt and the output so far, and read it back at each step to relate the new token to them.
- Steps:
  - Attention Computation: Looking back at all previous tokens to understand context
  - Probability Calculation: Determining the likelihood of each possible next token
  - Token Selection: Choosing the next token based on these probabilities
  - Continuation Check: Deciding whether to continue or stop generation, for example on an end-of-sequence token or a length limit
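
The decode steps above form a loop, which a minimal sketch in plain Python can make concrete. Everything here is an assumption for illustration: `VOCAB`, `EOS_ID`, `DIM`, and the helper names are hypothetical, the toy embedding formula stands in for learned weights, the `cache` list stands in for the per-token state a real model keeps, and token selection is greedy (highest probability) although sampling is equally common.

```python
import math

# Hypothetical toy setup; a real model has a large vocabulary and learned weights.
VOCAB = ["<eos>", "a", "b", "c"]
EOS_ID = 0
DIM = 4


def embed(token_id):
    """Deterministic toy embedding for one token id."""
    return [math.sin(token_id + 1 + d) for d in range(DIM)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, cache):
    """Attention computation: weigh every cached token vector against the query."""
    weights = softmax([dot(query, key) / math.sqrt(DIM) for key in cache])
    return [sum(w * v[d] for w, v in zip(weights, cache)) for d in range(DIM)]


def next_token_probs(context):
    """Probability calculation: score each vocabulary entry, then normalize."""
    return softmax([dot(context, embed(tid)) for tid in range(len(VOCAB))])


def decode(prompt_ids, max_new_tokens=5):
    """Generate tokens one at a time until <eos> or the length limit is reached."""
    cache = [embed(t) for t in prompt_ids]  # stands in for state built during prefill
    generated = []
    for _ in range(max_new_tokens):
        context = attend(cache[-1], cache)        # attention computation
        probs = next_token_probs(context)         # probability calculation
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy selection
        generated.append(next_id)
        if next_id == EOS_ID:                     # continuation check
            break
        cache.append(embed(next_id))              # the cache grows by one entry per step
    return generated
```

For example, `decode([1, 2], max_new_tokens=5)` returns a list of at most five token ids. Note that `cache` gains one entry per generated token and is read in full on every step, which is the intuition behind calling this phase memory-intensive.
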