Based on ChatML (Chat Markup Language) template format
ChatML was developed by OpenAI
Types of Roles
system
user
assistant
tool
Chat templates are written in Jinja2
Original Message
[
  {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
  {"role": "user", "content": "Can you explain what a chat template is?"},
  {"role": "assistant", "content": "A chat template structures conversations between users and AI models..."},
  {"role": "user", "content": "How do I use it ?"}
]
Chat Template for SmolLM2
{% for message in messages %}
{% if loop.first and messages[0]['role'] != 'system' %}
<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
<|im_end|>
{% endif %}
<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}
Final Result
<|im_start|>system
You are a helpful assistant focused on technical topics.<|im_end|>
<|im_start|>user
Can you explain what a chat template is?<|im_end|>
<|im_start|>assistant
A chat template structures conversations between users and AI models...<|im_end|>
<|im_start|>user
How do I use it ?<|im_end|>
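To make the template logic concrete, here is a minimal plain-Python sketch of what the ChatML rendering does. It is not the tokenizer's actual implementation: `render_chatml` and `DEFAULT_SYSTEM` are hypothetical names, and the exact newline placement is an assumption.

```python
# Hypothetical default, copied from the SmolLM2 template text above.
DEFAULT_SYSTEM = (
    "You are a helpful AI assistant named SmolLM, trained by Hugging Face"
)


def render_chatml(messages):
    """Render a list of {'role', 'content'} dicts as one ChatML string."""
    parts = []
    # Mirror the template: inject a default system message when none is given.
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    for message in messages:
        # Assumption: a newline separates the role from the content.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
]
prompt = render_chatml(messages)
print(prompt)
```

In practice the rendering is normally done by the tokenizer that ships with the model, so a hand-written function like this is only useful for seeing what the template produces.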
Example of Llama3.2 model
User: How are you?
AI: Hello, How is your day going so far?
-------- Converted using special tokens -------
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
How are you?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Hello, How is your day going so far?
<|eot_id|>
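The same conversion can be sketched in plain Python. `render_llama3` is a hypothetical helper, and the whitespace (a blank line after each header, `<|eot_id|>` directly after the content) is an assumption about the exact format; the listing above spreads the tokens over separate lines for readability.

```python
def render_llama3(messages):
    """Wrap each message in Llama 3 style header and end-of-turn tokens."""
    text = "<|begin_of_text|>"
    for message in messages:
        # The header names the speaker; assumed to be followed by a blank line.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        # <|eot_id|> marks the end of this speaker's turn.
        text += f"{message['content']}<|eot_id|>"
    return text


conversation = [
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Hello, How is your day going so far?"},
]
llama_prompt = render_llama3(conversation)
print(llama_prompt)
```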
Decoding
The input is tokenized
The model outputs a score for every token in its vocabulary, ranking how likely each one is to be the next token
Based on these scores, there are multiple strategies for choosing the next token
Greedy decoding: always take the token with the maximum score
Beam search: explores multiple candidate sequences to find the one with the maximum total score
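The two strategies can be compared on a toy example. The sketch below assumes a made-up "model" (`TOY_MODEL`, a hypothetical lookup table of next-token log-probabilities keyed by the previous token); a real LLM scores its whole vocabulary from the full context.

```python
import math

# Hypothetical toy model: log-probability of each next token given the last one.
TOY_MODEL = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.55), "dog": math.log(0.45)},
    "a": {"cat": math.log(0.9), "dog": math.log(0.1)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}


def next_scores(sequence):
    return TOY_MODEL[sequence[-1]]


def greedy_decode(max_steps=5):
    """Always append the single highest-scoring next token."""
    sequence = ["<s>"]
    for _ in range(max_steps):
        if sequence[-1] == "</s>":
            break
        scores = next_scores(sequence)
        sequence.append(max(scores, key=scores.get))
    return sequence


def beam_search(beam_width=2, max_steps=5):
    """Keep the beam_width best partial sequences by total score."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_steps):
        candidates = []
        for total, sequence in beams:
            if sequence[-1] == "</s>":
                candidates.append((total, sequence))  # already finished
                continue
            for token, score in next_scores(sequence).items():
                candidates.append((total + score, sequence + [token]))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


print(greedy_decode())  # ['<s>', 'the', 'cat', '</s>']
print(beam_search())    # ['<s>', 'a', 'cat', '</s>']
```

Greedy decoding commits to "the" because it has the best first-step score (0.6), ending with a total probability of 0.33. Beam search keeps "a" alive as a second candidate and finds the sequence with the higher total probability (0.4 × 0.9 = 0.36).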
Attention
Attention is the process of identifying the words in the input that are most relevant for predicting the next token
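One common way to compute this relevance is scaled dot-product attention, sketched below in plain Python. The word vectors are invented for illustration, and a real model uses learned query, key, and value projections across many heads.

```python
import math


def softmax(values):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product relevance of each key vector to the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical 2-dimensional vectors for three context words.
keys = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "mat": [0.9, 0.1]}
query = [1.0, 0.0]

weights = attention_weights(query, list(keys.values()))
for word, weight in zip(keys, weights):
    print(f"{word}: {weight:.2f}")
```

Words whose vectors point in a similar direction to the query ("cat", then "mat") receive the largest weights, so they contribute most to the prediction.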
Context Length
Context length refers to the maximum number of tokens the model can consider at once when generating a response
It represents the model's maximum attention span
gemma4:26B has a context length of 256K tokens
Sliding Window Size
Maximum number of previous tokens a model's local attention layer can look back on at any given moment
gemma4:26B has a sliding window size of 1024 tokens
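The effect of a sliding window can be pictured as an attention mask. This is a sketch under one assumed convention: the window counts the current token plus the tokens immediately before it. `sliding_window_mask` is a hypothetical helper, and the tiny sizes are for display only.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        # Causal (j <= i) and within the last `window` positions.
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

Each row is one token position. From the fourth row on, the oldest tokens fall out of view even though they are still inside the overall context length.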
Vocabulary Size
Total number of unique tokens (words, word pieces, or symbols) the model knows
gemma4:26B has a vocabulary size of 262K
Embedding Size
Number of dimensions used to mathematically represent each token
embeddinggemma has an embedding length of 768
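Vocabulary size and embedding size together define the shape of the embedding table: one row per token, one column per dimension. The sketch below uses toy sizes and random values as stand-ins for learned weights; all names are hypothetical.

```python
import random

VOCAB_SIZE = 10      # toy value; real vocabularies hold hundreds of thousands of tokens
EMBEDDING_SIZE = 4   # toy value; the notes above list 768 for embeddinggemma

random.seed(0)
# One row per token ID; random numbers stand in for learned weights.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_SIZE)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 7, 3])
print(len(vectors), len(vectors[0]))  # 3 tokens, 4 dimensions each

# The table alone holds VOCAB_SIZE * EMBEDDING_SIZE parameters.
table_parameters = VOCAB_SIZE * EMBEDDING_SIZE
print(table_parameters)  # 40
```

The same token ID always maps to the same vector, which is why the first and third results are identical.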
Parameters
The total count of weights and biases in the neural network; a rough measure of the model's capacity to store what it learned during training
gemma3:4b has 4.3B parameters
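As a rough illustration of how such a count adds up, the sketch below counts weights and biases in a stack of fully connected layers. The layer sizes are made up, and a real LLM also has attention, normalization, and embedding parameters that this hypothetical helper ignores.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a chain of fully connected layers."""
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs  # one weight per input-output pair
        total += outputs           # one bias per output unit
    return total


# Hypothetical block: 768 -> 3072 -> 768
sizes = [768, 3072, 768]
print(count_dense_parameters(sizes))  # 4722432
```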
Number of Layers
A layer is a distinct structural processing block within the model's neural network
The input sequence you provide an LLM is called a prompt
When we pass information to LLMs, we structure our input in a way that guides the generation of the LLM toward the desired output. This is called prompting.