LLM Settings

Sampling Parameters

  • These parameters are applied during inference and control how the LLM selects its next token
  • Temperature
    • Range: typically [0, 1] or [0, 2], depending on the provider
    • Controls randomness/creativity of the output
  • Top P
    • Range: (0, 1]
    • Cumulative probability of the top tokens to consider
  • Top K
    • Integer
    • Number of top tokens to consider
    • -1: Consider all tokens (the convention varies by library)
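
To make the effect of Temperature concrete, here is a minimal sketch of temperature scaling applied to raw token scores (logits) before they become probabilities. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not any specific library's API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature.

    Hypothetical helper for illustration; real inference stacks do this internally.
    """
    if temperature <= 0:
        # Temperature 0 is usually treated as greedy decoding (pick the argmax).
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)   # sharper: top token dominates
high = softmax_with_temperature(logits, 1.0)  # flatter: more randomness
print(low)
print(high)
```

At the lower temperature almost all probability mass sits on the highest-scoring token, which is why low values behave more deterministically.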

Core Parameters

  • Temperature
    • Increasing temperature increases randomness and yields more creative output
    • Use lower temperature for tasks like fact-based QA
    • Use higher temperature for tasks like poem generation
  • Top P
    • Also known as nucleus sampling
    • Controls how deterministic the model is
    • Top P cuts off the sample pool as soon as the cumulative probability of the top tokens reaches the value P
      • A higher Top P means more tokens are considered
    • Increase Top P for diverse responses
    • Keep Top P low for factual responses
    • General recommendation: change either Temperature or Top P, but not both
  • Top K
    • Consider only the K most probable candidate tokens
  • Max Tokens
    • Maximum number of tokens the model generates
    • Helps prevent long, irrelevant responses and controls cost
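
The Top K and Top P cutoffs described above can be sketched as two small filters over a probability distribution. This is a minimal illustration under stated assumptions: `top_k_filter` and `top_p_filter` are hypothetical names, the probabilities are made up, and `-1` meaning "keep all tokens" follows the convention noted earlier.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens (k == -1 keeps all), then renormalize."""
    indices = range(len(probs))
    if k == -1 or k >= len(probs):
        kept = set(indices)
    else:
        ranked = sorted(indices, key=lambda i: probs[i], reverse=True)
        kept = set(ranked[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in indices]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    indices = range(len(probs))
    ranked = sorted(indices, key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # pool is cut off as soon as the sum reaches p
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in indices]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
k_filtered = top_k_filter(probs, 2)    # only the two best tokens survive
p_narrow = top_p_filter(probs, 0.6)    # low Top P: small pool
p_wide = top_p_filter(probs, 0.9)      # high Top P: larger pool
print(k_filtered, p_narrow, p_wide)
```

Raising Top P from 0.6 to 0.9 grows the candidate pool from two tokens to three, which is the "more tokens considered" behaviour noted in the list.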

Penalties and Repetition Control

  • General recommendation: alter either Frequency Penalty or Presence Penalty, but not both
  • Frequency Penalty
    • Applies a penalty to the next token proportional to how many times that token has already appeared in the response and prompt
    • A higher frequency penalty reduces the chance of the model repeating the same word
  • Presence Penalty
    • Like the frequency penalty, but the penalty is the same for every repeated token, regardless of how often it appeared
    • A token that appeared twice and one that appeared 10 times are penalized equally
    • Increase the presence penalty for diverse and creative outputs
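
The difference between the two penalties is easiest to see on numbers. The sketch below assumes one common formulation, in which a token's logit is reduced by `count * frequency_penalty` plus a flat `presence_penalty` if the token has appeared at all. Providers may implement this differently; `apply_penalties` and the example tokens are hypothetical.

```python
from collections import Counter

def apply_penalties(logits, previous_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with repetition penalties subtracted.

    Assumed formulation (illustrative only):
      logit -= count * frequency_penalty   # grows with each repetition
      logit -= presence_penalty            # flat, applied once if count > 0
    """
    counts = Counter(previous_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= count * frequency_penalty
            adjusted[token] -= presence_penalty
    return adjusted

logits = {"cat": 2.0, "dog": 2.0, "fish": 2.0}  # made-up scores
history = ["cat", "cat", "cat", "dog"]          # tokens seen so far

freq = apply_penalties(logits, history, frequency_penalty=0.5)
pres = apply_penalties(logits, history, presence_penalty=0.5)
print(freq)
print(pres)
```

With the frequency penalty, "cat" (seen three times) is pushed down further than "dog" (seen once). With the presence penalty, both are lowered by the same amount, and "fish" is untouched in either case.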

Sequence Control

  • Stop Sequence
    • A string that, when generated, stops the model from generating further tokens
    • Examples: \n or User:
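
As a rough illustration of what a stop sequence does, the sketch below truncates generated text at the earliest occurrence of any stop string. Real APIs halt generation server-side rather than trimming afterwards; `truncate_at_stop` and the sample text are assumptions made for this example.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest stop sequence; the stop string itself is dropped."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw = "Paris is the capital of France.\nUser: and Germany?"
answer = truncate_at_stop(raw, ["\n", "User:"])
print(answer)
```

Here the newline appears before "User:", so the output ends right after the first sentence and the model's attempt to write the next user turn is discarded.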