Large Language Model

A type of model that excels at understanding and generating human language
Trained on vast amounts of text data
Consists of millions to billions of parameters
Mostly based on the Transformer architecture
Typically decoder-based Transformers with billions of parameters
Examples
- DeepSeek-R1 by DeepSeek
- GPT-4 by OpenAI
- Llama 3 by Meta
- SmolLM2 by Hugging Face
- Gemma by Google
- Mistral by Mistral AI
Ref: https://huggingface.co/learn/agents-course/unit1/what-are-llms

Limitations of LLM

Hallucinations: They can generate incorrect information confidently
Lack of true understanding: They operate on statistical patterns learned from text
Bias: They may reproduce biases present in their training data or inputs.
Context windows: They have limited context windows
Computational resources: They require significant computational resources

NLP

NLP is a field of linguistics and machine learning focused on understanding everything related to human language
The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words
LLMs are a modern technique within NLP
AI → ML → NLP → LLM
NLP common tasks
- Classify sentence (sentiment, spam or not etc.)
- Classify part of speech of a word in sentence
- Speech recognition (speech-to-text)
- Text-to-speech
https://huggingface.co/learn/llm-course/chapter1/2

Transformers

Based on paper: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
Deep learning architecture based on the attention mechanism
Used in LLMs, image and video generation, and speech recognition
Transformers Arch: https://huggingface.co/learn/llm-course/chapter1/4#general-transformer-architecture

Transformer Types

Encoders
- Takes text (or other data) as input and outputs a dense representation (or embedding) of that text
- Example: BERT from Google
- Use Cases: Text classification, semantic search, Named Entity Recognition
- Typical Size: Millions of parameters
Decoders
- Focuses on generating new tokens to complete a sequence, one token at a time
- Example: Llama from Meta
- Use Cases: Text generation, chatbots, code generation
- Typical Size: Billions of parameters
Seq2Seq (Encoder–Decoder)
- Combines an encoder and a decoder
- Encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.
- Example: T5, BART, Whisper
- Use Cases: Translation, Summarization, Paraphrasing
- Typical Size: Millions of parameters

Token

The unit of information an LLM works with

Input: It ain't worth it
Tokens: It| ain|'t| worth| it

English has ~600k words
An LLM's vocabulary is much smaller, e.g. ~32k tokens for Llama 2, because tokens are often sub-word units
Each LLM has its own special tokens
- BOS: Beginning of Sequence
- EOS: End of Sequence
  - Examples:
    - <|endoftext|> for GPT4
    - <end_of_turn> for Gemma
- PAD: Padding Token
- UNK: Unknown Token
LLMs are autoregressive, meaning the output from one pass becomes part of the input to the next one
This continues until generation reaches the EOS token
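
The autoregressive loop described above can be sketched as follows. This is a toy illustration, not a real model: `toy_next_token`, its lookup table, and the `<eos>` marker are hypothetical stand-ins for a model's forward pass and its actual EOS token.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def toy_next_token(tokens: List[str]) -> str:
    """Stand-in for a real model: looks up the next token from the last one."""
    table = {"It": " ain", " ain": "'t", "'t": " worth", " worth": " it", " it": EOS}
    return table.get(tokens[-1], EOS)


def generate(prompt: List[str], next_token: Callable[[List[str]], str],
             max_new_tokens: int = 20) -> List[str]:
    """Append one predicted token per pass until EOS or the token budget."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # one pass over everything generated so far
        if token == EOS:            # stop once the model emits EOS
            break
        tokens.append(token)        # the output becomes part of the next input
    return tokens


result = generate(["It"], toy_next_token)
print("".join(result))
```

The `max_new_tokens` budget is an added safety stop for the sketch, so that generation ends even if EOS is never produced.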

Role and Chat Template Tokens

https://huggingface.co/learn/llm-course/chapter11/2
https://huggingface.co/learn/agents-course/unit1/messages-and-special-tokens
Many chat templates (such as SmolLM2's below) follow the ChatML (Chat Markup Language) format
ChatML was developed by OpenAI
Types of Roles
- system
- user
- assistant
- tool
Chat templates are written in Jinja2
Original Message

[
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
    {"role": "assistant", "content": "A chat template structures conversations between users and AI models..."},
    {"role": "user", "content": "How do I use it ?"},
]

Chat Template for SmolLM2

{% for message in messages %}
{% if loop.first and messages[0]['role'] != 'system' %}
<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
<|im_end|>
{% endif %}
<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}

Final Result

<|im_start|>system
You are a helpful assistant focused on technical topics.<|im_end|>
<|im_start|>user
Can you explain what a chat template is?<|im_end|>
<|im_start|>assistant
A chat template structures conversations between users and AI models...<|im_end|>
<|im_start|>user
How do I use it ?<|im_end|>
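
A minimal sketch of what the SmolLM2 template does, written in plain Python rather than Jinja2 so it runs without extra libraries. The function name `render_chatml` is hypothetical, and whitespace handling is simplified compared with the real template.

```python
from typing import Dict, List

DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Render role/content messages into a ChatML-style prompt string."""
    parts = []
    # Mirror the template: add a default system prompt when none is given.
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
    {"role": "assistant", "content": "A chat template structures conversations between users and AI models..."},
    {"role": "user", "content": "How do I use it ?"},
]
prompt = render_chatml(messages)
print(prompt)
```

In practice the `transformers` library applies a model's own template through `tokenizer.apply_chat_template(messages, tokenize=False)`, so hand-written rendering like this is mainly useful for understanding the format.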

Example of Llama3.2 model

User: How are you?
AI: Hello, How is your day going so far?


-------- Converted using special tokens -------

<|begin_of_text|>

<|start_header_id|>user<|end_header_id|>
How are you?
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
Hello, How is your day going so far?
<|eot_id|>

Decoding

Input is tokenized
Model outputs scores (logits) that rank the likelihood of each token in its vocabulary being the next token
Based on these scores, there are multiple strategies for picking the next token
- Greedy decoding: always take the token with the maximum score
- Beam search: explores multiple candidate sequences to find the one with the maximum total score
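
The two strategies above can be compared on a toy example. Everything in `TOY_MODEL` is a made-up probability table standing in for real model scores, and this beam search is a simplified sketch (fixed length, no EOS handling).

```python
import math
from typing import Dict, List, Tuple

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
}


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(steps: int) -> Tuple[List[str], float]:
    """Always take the single most likely next token."""
    seq, logp = ["<bos>"], 0.0
    for _ in range(steps):
        probs = TOY_MODEL[seq[-1]]
        token = max(probs, key=lambda t: probs[t])
        logp += math.log(probs[token])
        seq.append(token)
    return seq[1:], logp


def beam_search(steps: int, width: int) -> Tuple[List[str], float]:
    """Keep the `width` best partial sequences by total log-probability."""
    beams: List[Tuple[List[str], float]] = [(["<bos>"], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in TOY_MODEL[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]
    best_seq, best_logp = beams[0]
    return best_seq[1:], best_logp


print(softmax([2.0, 1.0, 0.1]))
print(greedy(2))
print(beam_search(2, width=2))
```

Here greedy decoding commits to "the" because it is the best first token, while beam search also keeps "a" alive and ends up with a sequence whose total probability is higher.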

Attention

It is the process of identifying which tokens in the context are most relevant for predicting the next token
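
As a rough illustration, the scaled dot-product attention from the "Attention Is All You Need" paper can be sketched for a single query vector. The vectors below are made-up toy values; real models use learned projections and many attention heads.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: List[float], keys: List[List[float]],
           values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query over a list of keys/values."""
    scale = math.sqrt(len(query))
    # Relevance of each context token: dot product of query and key, scaled.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output: weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy example: the query points in the same direction as the first key.
weights, output = attend(
    query=[2.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, output)
```

The first token receives the larger weight because its key matches the query, which is the sense in which attention picks out the most relevant tokens.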

Context Length

Context length refers to the maximum number of tokens the model can consider at once when generating a response
It represents the model's maximum attention span
gemma4:26B has a context length of 256K tokens

Sliding Window Size

Maximum number of historical tokens a model’s local attention layer can look back on at any given moment
gemma4:26B has sliding window size of 1024 tokens
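
A small sketch of the attention mask a sliding window implies. It assumes the window counts the current token itself, which is a convention chosen for this example; implementations differ on that detail.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Position i sees itself and the previous window - 1 positions, never the future.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a full context length of 5 and a window of 3, the last position can only look at positions 2, 3 and 4, so older tokens drop out of that layer's view.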

Vocabulary Size

Total number of unique words or tokens your model knows
gemma4:26B has vocabulary size of 262K

Embedding Size

Number of dimensions used to mathematically represent each token
embeddinggemma has 768 embedding length

Parameters

The total count of weights and biases in the neural network, representing the model's overall capacity to store what it has learned
gemma3:4b has 4.3B parameters
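
The parameter count gives a quick estimate of how much memory the weights need. The sketch below is simple arithmetic under stated assumptions: the bit widths are example values, and the estimate ignores activations, the KV cache, and any per-block metadata that quantized formats add.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# 4.3B parameters, as quoted above for gemma3:4b.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(4.3e9, bits):.2f} GB")
```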

Number of Layers

A Layer is a distinct structural processing block within the model’s neural network
Number of Layers represents depth of the model
gemma4:26B has 30 layers

LLM Settings

See LLM_Settings

Prompt

The input sequence you provide an LLM is called a prompt
When we pass information to LLMs, we structure our input in a way that guides the generation of the LLM toward the desired output. This is called prompting.
See Prompt_Engineering

Training LLM

They are trained on large datasets of text
Pre-training is self-supervised (often loosely called unsupervised): the model learns the structure and patterns of language from unlabeled text
After pre-training, they are fine-tuned for specific tasks
See __Training

Distillation

It creates a smaller version of an LLM
The smaller model has fewer parameters
A distilled LLM generates predictions faster but is usually not as good as the original
https://developers.google.com/machine-learning/crash-course/llm/tuning#distillation

Quantization

When a model is trained, its weights are typically stored as 16-bit floating-point numbers (FP16 or BF16)
Quantization reduces the precision to 8-bit (FP8) or 4-bit (FP4), which shrinks the model and speeds up generation
It introduces quantization error, which can reduce accuracy, reasoning, or response quality; the loss is usually small at 8-bit and grows at lower bit widths
Perplexity is commonly used to measure how much quantization degrades the model: the larger the increase over the unquantized model, the greater the loss
FP64 (double-precision 64-bit floating point) is rarely used by models because of its memory footprint

Format	Use case
FP64	Physics simulations
FP16	LLM training and high-precision inference
FP8	Enterprise deployment for LLM inference
FP4	Ultra-low-precision quantization for local consumer-grade devices
GGUF Quantization formats

https://medium.com/@paul.ilvez/demystifying-llm-quantization-suffixes-what-q4-k-m-q8-0-and-q6-k-really-mean-0ec2770f17d3
Q{bits}_{quantization type}_{precision size}
bits
- number of bits to save each weight
quantization type
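- K: k-quants, weights grouped into blocks that share scaling data
- 0: legacy format, one scale per block
- 1: legacy format, one scale plus one minimum (offset) per block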
precision size
- S: Small
- M: Medium
- L: Large
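
The naming scheme above can be sketched as a small parser. It covers only the `Q{bits}_{type}_{size}` pattern described here; the function name `parse_quant_name` is hypothetical, and names outside this pattern (for example IQ-prefixed formats) simply return `None`.

```python
import re
from typing import Dict, Optional

PATTERN = re.compile(r"^Q(?P<bits>\d+)_(?P<kind>K|0|1)(?:_(?P<size>S|M|L))?$")
SIZES = {"S": "Small", "M": "Medium", "L": "Large"}


def parse_quant_name(name: str) -> Optional[Dict[str, object]]:
    """Split a GGUF quantization suffix into bits, type, and size."""
    match = PATTERN.match(name)
    if match is None:
        return None
    size = match.group("size")
    return {
        "bits": int(match.group("bits")),
        "type": match.group("kind"),
        "size": SIZES[size] if size else None,
    }


for name in ("Q4_K_M", "Q5_K_S", "Q8_0"):
    print(name, parse_quant_name(name))
```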

Format	Perplexity increase (ppl)	Notes	Recommended
`Q4_0`	+0.2499	small, high quality loss
`Q4_K_M` (aka `Q4_K`)	+0.0535	medium, balanced quality	Yes
`Q5_K_S`	+0.0353	large, low quality loss	Yes
`Q5_K_M`	+0.0142	large, very low quality loss	Yes
`Q8_0`	+0.0004	very large, extremely low quality loss
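
For reference, perplexity is the exponential of the average negative log-probability the model assigns to the correct tokens. The sketch below assumes the `+` values in the table are increases relative to the unquantized model; the probability lists are made-up numbers, not measurements.

```python
import math
from typing import List


def perplexity(token_probs: List[float]) -> float:
    """exp of the mean negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical probabilities each model assigned to the true next tokens.
baseline_probs = [0.50, 0.25, 0.50, 0.25]
quantized_probs = [0.48, 0.24, 0.50, 0.25]

baseline = perplexity(baseline_probs)
quantized = perplexity(quantized_probs)
print(f"baseline ppl:  {baseline:.4f}")
print(f"quantized ppl: {quantized:.4f}")
print(f"increase:      {quantized - baseline:+.4f}")
```

Lower perplexity means the model is less "surprised" by the text, so a small positive increase after quantization indicates a small quality loss.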

LLM capabilities

Base model: trained on raw text data to predict the next token
- It can be fine-tuned to add capabilities such as instruction following and tool calling
Instruct: fine-tuned to follow instructions and engage in conversations
Tool calling: can perform function calls
Multilingual
Vision
Audio
Reasoning
Multimodal: Process Text, Image, Audio simultaneously
Mixture of Experts (MoE)
Embedding
Deep Research
Structured Output

Mixture of Experts (MoE)

https://www.reddit.com/r/LocalLLaMA/comments/174f42z/can_anyone_explain_moe_like_im_25/
The model consists of many smaller sub-networks called experts
In a standard dense model, all parameters are used
- Every weight is used for every token
In MoE, a routing mechanism activates only a fraction of these experts
- The active parameter count is much lower than the total
- Only a subset of the weights is used for each token
In gemma4:26b
- https://ollama.com/library/gemma4
- Expert Count: 8 active / 128 total and 1 shared
- Active Parameters: 3.8B
- Total Parameters: 25.2B
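
A toy sketch of top-k routing, assuming the router already produced one score per expert. Real routers compute these scores from the token's hidden state, experts are feed-forward networks rather than scalar functions, and renormalizing the gates over the selected experts is only one of several conventions.

```python
import math
from typing import Callable, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their gate weights."""
    order = sorted(range(len(router_scores)),
                   key=lambda i: router_scores[i], reverse=True)
    top = order[:k]
    gates = softmax([router_scores[i] for i in top])
    return list(zip(top, gates))


def moe_layer(x: float, experts: List[Callable[[float], float]],
              router_scores: List[float], k: int) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(gate * experts[i](x) for i, gate in route(router_scores, k))


# Four hypothetical experts; only two run for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 1.5]

selected = route(scores, k=2)
print(selected)
print(moe_layer(3.0, experts, scores, k=2))
print(f"gemma4:26b active share: {3.8 / 25.2:.1%}")
```

The last line is just the ratio of the active and total parameter counts quoted above, which shows how small the per-token compute is relative to the full model.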

LLM Model Contents

Example:
Architecture and Weights
- config.json: number of layers, attention heads, hidden size, base model type
- .safetensors: weights of the model
- .safetensors.index.json: maps each tensor to its weights file; used for big models split across multiple weights files
Tokenizer
- tokenizer.json: Vocabulary of model mapping each word/syllable/character to unique ID
- tokenizer_config.json: Tells library how to handle special tokens and rules like max sequence length etc.
- special_tokens_map.json: explicitly defines which IDs represent structural markers like the start, end, or padding of text
Chat
- chat_template.jinja: Template to structure text into raw prompt strings
Output Control
- generation_config.json: default behavior when the model generates text like temperature, top_p, top_k
Multimodal
- preprocessor_config.json: Feature extractor settings for image resizing, cropping, and normalization (e.g. mean/std arrays), or audio sample normalization.
  - contains processor class
- processor_config.json: links tokenization with image/video/audio processing
  - contains processor class
LoRA
- adapter_config.json: tells loading library how to attach the plugin to the base model
- adapter_model.safetensors: adapter weights
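
As an illustration of the index file, the sketch below groups tensors by the shard that stores them. The JSON content is a made-up miniature: the tensor names, shard names, and size are placeholders, and only the `weight_map` layout reflects how sharded checkpoints are typically indexed.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Illustrative index content; names and sizes are invented for this sketch.
INDEX_JSON = """
{
  "metadata": {"total_size": 1000},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00002.safetensors"
  }
}
"""


def tensors_by_shard(index_text: str) -> Dict[str, List[str]]:
    """Group tensor names by the weights file that contains them."""
    index = json.loads(index_text)
    shards: Dict[str, List[str]] = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


shards = tensors_by_shard(INDEX_JSON)
for shard_file, names in shards.items():
    print(shard_file, len(names))
```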
