LLM Settings

Sampling Parameters

These parameters are used during inference to control how the LLM selects its next token.
Temperature
- Range: typically [0, 1] or [0, 2], depending on the provider
- Controls randomness/creativity of the output
Top P
- Range: (0, 1]
- Cumulative probability threshold for the top tokens to consider
Top K
- Integer
- Number of top tokens to consider
- -1: Consider all tokens (the value that disables Top K varies by library)

Temperature
- Increasing temperature increases randomness and produces more creative output
- Use a lower temperature for tasks like fact-based QA
- Use a higher temperature for tasks like poem generation
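
A minimal sketch of how temperature is commonly applied: the logits are divided by the temperature before the softmax. The function name and the logit values are made up for illustration, and real inference libraries may handle edge cases (such as temperature 0) differently.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 is usually handled separately as greedy argmax.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=1.5)
```

With the lower temperature the top token takes a larger share of the probability mass; with the higher temperature the distribution is flatter, so less likely tokens are sampled more often.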
Top P
- Also known as nucleus sampling
- Controls how deterministic the model is by limiting the pool of candidate tokens
- Top P cuts off the sample pool as soon as the sum of the probabilities of the top tokens reaches the value P
  - A higher Top P means more tokens are considered
- Increase Top P for diverse responses
- Keep Top P low for factual responses
- Change either Temperature or Top P, but not both
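
The Top P cutoff described above can be sketched as follows. This is an illustrative version over a small dictionary of made-up probabilities; `top_p_filter` is a hypothetical helper, not a library function.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break  # the pool is cut off once the threshold is reached
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1 before sampling.
    return {token: prob / total for token, prob in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}  # made-up distribution
narrow = top_p_filter(probs, 0.5)  # keeps only "the"
wide = top_p_filter(probs, 0.9)    # keeps "the", "a" and "cat"
```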
Top K
- Considers only the K most probable candidate tokens
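
A comparable sketch for Top K, again with a hypothetical helper and made-up probabilities. It treats -1 as "consider all tokens", following the convention noted earlier; other libraries may use a different sentinel.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens; k == -1 keeps every token."""
    if k == -1 or k >= len(probs):
        return dict(probs)
    if k <= 0:
        raise ValueError("k must be a positive integer or -1")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    # Renormalize so the surviving probabilities sum to 1 before sampling.
    return {token: prob / total for token, prob in ranked}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}  # made-up distribution
top_two = top_k_filter(probs, 2)
everything = top_k_filter(probs, -1)
```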
Max Tokens
- Maximum number of tokens the model may generate
- Helps prevent long, irrelevant responses and controls cost

Alter either Frequency Penalty or Presence Penalty, but not both
Frequency Penalty
- Applies a penalty to the next token proportional to how many times that token has already appeared in the response and prompt
- A higher frequency penalty reduces the chance of the model repeating the same word
Presence Penalty
- Similar to the frequency penalty, but the penalty is the same for every repeated token regardless of how often it appeared
- A token that appeared twice and one that appeared 10 times are penalized equally
- Increase the presence penalty for more diverse and creative outputs
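
The difference between the two penalties can be sketched by subtracting them from the logits. This assumes the additive form documented for OpenAI-style APIs (count times frequency penalty, plus a flat presence penalty for any token already seen); other implementations may differ, and all names and numbers here are illustrative.

```python
from collections import Counter

def apply_penalties(logits, previous_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appeared in the text so far."""
    counts = Counter(previous_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        seen = presence_penalty if count > 0 else 0.0  # flat, independent of count
        adjusted[token] = logit - count * frequency_penalty - seen
    return adjusted

logits = {"cat": 2.0, "dog": 2.0, "bird": 2.0}  # made-up scores
history = ["cat"] * 10 + ["dog"] * 2           # "cat" seen 10 times, "dog" twice
by_frequency = apply_penalties(logits, history, frequency_penalty=0.1)
by_presence = apply_penalties(logits, history, presence_penalty=0.5)
```

Under the frequency penalty "cat" is pushed down further than "dog", while under the presence penalty both drop by the same amount and "bird", which has not appeared, is untouched in either case.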

Stop Sequence
- A string that causes the model to stop generating tokens once it is produced
- Example: \n or User:
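
A rough sketch of the effect of stop sequences, applied here as post-processing on already generated text. Real APIs usually halt generation as soon as a stop sequence appears instead of trimming afterwards; `truncate_at_stop` is a hypothetical helper and the sample text is made up.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw = "Paris is the capital of France.\nUser: And Germany?"
answer = truncate_at_stop(raw, ["\n", "User:"])
```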