LLM Architecture

Inference

Prefill Phase

  • This phase is computationally intensive because the model processes every input token in a single parallel pass.
  • Steps:
    • Tokenization: Converting the input text into tokens (think of these as the basic building blocks the model understands)
    • Embedding Conversion: Transforming these tokens into numerical representations that capture their meaning
    • Initial Processing: Running these embeddings through the model’s layers to build a contextual representation of the whole prompt
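
The three prefill steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how any particular model implements them: the vocabulary, the random embedding table, the names `tokenize`, `embed`, `causal_attention` and `prefill`, and the single attention pass standing in for the full network are all hypothetical. A real model uses a learned subword tokenizer, learned embeddings, and batched matrix operations that handle every position at once rather than a Python loop.

```python
import math
import random

# Hypothetical toy vocabulary; real models use learned subword tokenizers.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "down": 4}
DIM = 4  # embedding width, chosen arbitrarily for this sketch

random.seed(0)
# Toy embedding table: one random vector per vocabulary entry (an assumption).
EMBEDDINGS = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def tokenize(text):
    """Step 1: map whitespace-separated words to token ids."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Step 2: look up one vector per token id."""
    return [EMBEDDINGS[t] for t in token_ids]


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(vectors):
    """Step 3: each position mixes in itself and all earlier positions."""
    outputs = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal mask: no looking ahead
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(DIM)]
        )
    return outputs


def prefill(text):
    """Run all three steps over the whole prompt."""
    token_ids = tokenize(text)
    hidden = causal_attention(embed(token_ids))
    return token_ids, hidden
```

Calling `prefill("The cat sat")` returns the token ids `[1, 2, 3]` together with one context-aware vector per token. The cost grows with prompt length because every token attends to all tokens before it.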

Decode Phase

  • This phase is where the actual text generation happens.
  • The model generates one token at a time, feeding each new token back in as input for the next step (an autoregressive process).
  • This phase is memory-intensive because the model must keep state for the prompt and every previously generated token (typically cached attention keys and values) and reread it at each step.
  • Steps:
    • Attention Computation: Looking back at all previous tokens to understand context
    • Probability Calculation: Determining the likelihood of each possible next token
    • Token Selection: Choosing the next token based on these probabilities
    • Continuation Check: Deciding whether to continue or stop generation
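
The decode steps above can be sketched as a loop. This is a minimal illustration under stated assumptions: `toy_logits` is a hypothetical stand-in for the attention computation and the model's output layer, `EOS_ID` and `VOCAB_SIZE` are made-up values, and greedy selection (always taking the most probable token) is only one possible strategy; real systems often sample from the distribution instead.

```python
import math

EOS_ID = 0      # hypothetical end-of-sequence token id
VOCAB_SIZE = 5  # hypothetical vocabulary size


def toy_logits(context):
    """Stand-in for attention plus the output layer (an assumption).

    It simply gives the highest score to the id after the last token.
    """
    logits = [0.0] * VOCAB_SIZE
    logits[(context[-1] + 1) % VOCAB_SIZE] = 2.0
    return logits


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=10):
    """Autoregressive loop: one new token per iteration."""
    context = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        logits = toy_logits(context)  # attention computation (stand-in)
        probs = softmax(logits)       # probability calculation
        # Token selection: greedy, i.e. the single most probable id.
        next_id = max(range(VOCAB_SIZE), key=probs.__getitem__)
        if next_id == EOS_ID:         # continuation check: stop on EOS
            break
        context.append(next_id)       # the new token becomes input
        generated.append(next_id)
    return generated
```

With this toy scorer, `generate([1, 2])` returns `[3, 4]` and then stops because the next choice is the end-of-sequence id, while `generate([1], max_new_tokens=2)` returns `[2, 3]` and stops at the length limit. Both stopping conditions belong to the continuation check.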