LLM Settings
Sampling Parameters
- These parameters apply at inference time and control how the LLM selects its next token
- Temperature
  - Range: commonly [0, 1] or [0, 2], depending on the provider
  - Controls the randomness/creativity of the output
- Top P
  - Range: (0, 1]
  - Cumulative probability mass of the top tokens to consider
- Top K
  - Range: positive integer
  - Number of top tokens to consider
  - -1: consider all tokens (sentinel value in some libraries, e.g. vLLM)
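To make the effect of temperature concrete, here is a minimal sketch of a temperature-scaled softmax. The function name and the four logits are made up for illustration, and real inference engines work on tensors rather than Python lists.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Temperature 0 is usually handled as greedy argmax, not as a division.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
hot = softmax_with_temperature(logits, temperature=1.5)   # flatter distribution
```

With the lower temperature the most likely token takes a larger share of the probability mass. With the higher temperature the unlikely tokens gain probability, which is where the extra randomness comes from.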
Core Parameters
- Temperature
  - Increasing temperature increases randomness and yields more creative output
  - Use a lower temperature for tasks like fact-based QA
  - Use a higher temperature for tasks like poem generation
- Top P
  - Also known as nucleus sampling
  - Controls how deterministic the model is
  - Top P cuts off the sample pool as soon as the cumulative probability of the most likely tokens reaches the value P
  - A higher Top P means more tokens are considered
  - Increase Top P for diverse responses
  - Keep Top P low for factual responses
  - Change either Temperature or Top P, but not both
- Top K
  - Consider only the K most likely candidate tokens
- Max Tokens
  - Maximum number of tokens the model may generate
  - Helps prevent long, irrelevant responses and controls cost
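The Top K and Top P cutoffs described above can be sketched as a filter over an already-normalized probability list. The function name, the -1 "keep everything" convention, and the example probabilities are assumptions for illustration, not any particular library's API.

```python
def top_k_top_p_filter(probs, top_k=-1, top_p=1.0):
    """Return {token_index: renormalized_prob} for tokens that survive both filters."""
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the K most likely tokens
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # stop once the pool covers probability top_p
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token probabilities (illustrative values only).
probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_k_top_p_filter(probs, top_p=0.75)  # pool closes after tokens 0 and 1
top3 = top_k_top_p_filter(probs, top_k=3)        # keeps tokens 0, 1 and 2
```

The model then samples only from the surviving tokens, so a lower Top P or Top K shrinks the pool and makes the output more predictable.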
Penalties and Repetition Control
- Adjust either Frequency Penalty or Presence Penalty, but not both
- Frequency Penalty
  - Applies a penalty to the next token proportional to how many times that token has already appeared in the response and prompt
  - A higher frequency penalty reduces the chance of the model repeating the same word
- Presence Penalty
  - Like the frequency penalty, but the penalty is the same for every repeated token
  - A token that appeared twice and one that appeared 10 times are penalized equally
  - Increase the presence penalty for diverse and creative outputs
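The difference between the two penalties is easiest to see on numbers. The sketch below assumes one common additive formulation, in which the penalty is subtracted from the token's logit. The function name, tokens, and values are hypothetical, and actual providers may compute this differently.

```python
from collections import Counter

def apply_penalties(logits, seen_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appeared in seen_tokens."""
    counts = Counter(seen_tokens)
    adjusted = dict(logits)  # copy so the input is left untouched
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= count * frequency_penalty  # grows with every repeat
            adjusted[token] -= presence_penalty           # flat, applied once
    return adjusted

# Hypothetical logits and history: "cat" appeared 10 times, "dog" twice.
logits = {"cat": 2.0, "dog": 2.0, "fish": 2.0}
seen = ["cat"] * 10 + ["dog"] * 2
by_frequency = apply_penalties(logits, seen, frequency_penalty=0.1)
by_presence = apply_penalties(logits, seen, presence_penalty=0.5)
```

Under the frequency penalty "cat" drops further than "dog" because it was repeated more often. Under the presence penalty both drop by the same amount, and "fish" is unaffected in either case.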
Sequence Control
- Stop Sequence
  - A string that makes the model stop generating tokens as soon as it is produced
  - Example: "\n" or "User:"
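A rough way to picture a stop sequence is as truncation at the first match. The helper below is a hypothetical post-processing sketch; in practice the inference server usually halts generation itself and, in many APIs, leaves the stop sequence out of the returned text.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)  # keep the earliest match across all sequences
    return text[:cut]

# Hypothetical raw model output that runs on into the next chat turn.
raw = "Paris is the capital of France.\nUser: and Spain?"
answer = truncate_at_stop(raw, ["\n", "User:"])
```

Here the newline is the earliest match, so only the first line survives and the model does not go on to write the user's next turn.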