Large Language Model

  • It is a type of model that excels at understanding and generating human language
  • Trained on vast amounts of text data
  • Consists of millions to billions of parameters
  • Mostly based on the Transformer architecture
  • Typically decoder-based Transformers with billions of parameters
  • Examples
    • DeepSeek-R1 by DeepSeek
    • GPT-4 by OpenAI
    • Llama 3 by Meta
    • SmolLM2 by Hugging Face
    • Gemma by Google
    • Mistral by Mistral AI
  • Ref: https://huggingface.co/learn/agents-course/unit1/what-are-llms

Limitations of LLM

  • Hallucinations: They can generate incorrect information confidently
  • Lack of true understanding: They operate purely on statistical patterns
  • Bias: They may reproduce biases present in their training data or inputs.
  • Context windows: They have limited context windows
  • Computational resources: They require significant computational resources

NLP

  • NLP is a field of linguistics and machine learning focused on understanding everything related to human language
  • The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words
  • LLMs are a modern technique within NLP
  • Hierarchy (broad to specific): AI > ML > NLP > LLM
  • NLP common tasks
    • Classify sentence (sentiment, spam or not etc.)
    • Classify the part of speech of each word in a sentence
    • Speech recognition (speech-to-text)
    • Text-to-speech
  • Ref: https://huggingface.co/learn/llm-course/chapter1/2

Transformers

Transformer Types

  • Encoders
    • Takes text (or other data) as input and outputs a dense representation (or embedding) of that text
    • Example: BERT from Google
    • Use Cases: Text classification, semantic search, Named Entity Recognition
    • Typical Size: Millions of parameters
  • Decoders
    • Focuses on generating new tokens to complete a sequence, one token at a time
    • Example: Llama from Meta
    • Use Cases: Text generation, chatbots, code generation
    • Typical Size: Billions of parameters
  • Seq2Seq (Encoder–Decoder)
    • Combines an encoder and a decoder
    • Encoder first processes the input sequence into a context representation, then the decoder generates an output sequence.
    • Example: T5, BART, Whisper
    • Use Cases: Translation, Summarization, Paraphrasing
    • Typical Size: Millions of parameters

Token

  • The unit of information an LLM works with
Input: It ain't worth it
Tokens: It| ain|'t| worth| it
  • English has ~600k words
  • A typical LLM vocabulary has ~32k tokens (e.g. Llama 2)
  • Each LLM has special tokens
    • BOS: Beginning of Sequence
    • EOS: End of Sequence
      • Examples:
        • <|endoftext|> for GPT4
        • <end_of_turn> for Gemma
    • PAD: Padding Token
    • UNK: Unknown Token
  • LLMs are autoregressive, meaning the output from one pass becomes the input to the next
  • This continues until the model generates the EOS token
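
The autoregressive loop described above can be sketched in a few lines of Python. This is only an illustration: the `NEXT_TOKEN` table is a hypothetical stand-in for a real model, and the `<bos>` / `<eos>` strings are placeholder special tokens, not those of any particular LLM.

```python
# Toy lookup table standing in for a real model (hypothetical data).
NEXT_TOKEN = {
    "<bos>": "It",
    "It": " ain",
    " ain": "'t",
    "'t": " worth",
    " worth": " it",
    " it": "<eos>",
}

def generate(prompt_tokens, max_new_tokens=10, eos="<eos>"):
    """Append one token per pass until EOS or the token budget is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # A real model conditions on the whole sequence; this toy uses only the last token.
        next_token = NEXT_TOKEN.get(tokens[-1], eos)
        if next_token == eos:
            break
        tokens.append(next_token)  # the output of this pass is input to the next pass
    return tokens

print(generate(["<bos>", "It"]))
# ['<bos>', 'It', ' ain', "'t", ' worth', ' it']
```

The `max_new_tokens` cap is there because a real model is not guaranteed to ever produce EOS.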

Role and Chat Template Tokens

[
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
    {"role": "assistant", "content": "A chat template structures conversations between users and AI models..."},
    {"role": "user", "content": "How do I use it ?"},
]
  • Chat Template for SmolLM2
{% for message in messages %}
{% if loop.first and messages[0]['role'] != 'system' %}
<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face
<|im_end|>
{% endif %}
<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}
  • Final Result
<|im_start|>system
You are a helpful assistant focused on technical topics.<|im_end|>
<|im_start|>user
Can you explain what a chat template is?<|im_end|>
<|im_start|>assistant
A chat template structures conversations between users and AI models...<|im_end|>
<|im_start|>user
How do I use it ?<|im_end|>
  • Example of Llama3.2 model
User: How are you?
AI: Hello, How is your day going so far?


-------- Converted using special tokens -------

<|begin_of_text|>

<|start_header_id|>user<|end_header_id|>
How are you?
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
Hello, How is your day going so far?
<|eot_id|>
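
As a rough sketch of what the SmolLM2 template above does, the following Python function builds the same kind of prompt string without a template engine. The function name `render_chatml` and the exact newline placement are assumptions made for illustration; in practice the Jinja template shipped with the model is rendered instead.

```python
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"

def render_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML-style prompt string."""
    parts = []
    # Mirror the template: add a default system message if none was supplied.
    if messages and messages[0]["role"] != "system":
        parts.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant focused on technical topics."},
    {"role": "user", "content": "Can you explain what a chat template is?"},
]
print(render_chatml(messages))
```

Each model family uses its own markers (for example `<|im_start|>` here versus `<|start_header_id|>` for Llama 3.2), which is why the template is stored with the model rather than hard-coded.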

Decoding

  • Input is tokenized
  • The model outputs scores (logits) that rank how likely each token in its vocabulary is to be the next token
  • Based on the scores, we have multiple strategies
    • Greedy decoding: always take the token with the maximum score
    • Beam search: explores multiple candidate sequences to find the one with the maximum total score
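
To make the difference between the two strategies concrete, here is a minimal sketch. The `PROBS` table is hypothetical and stands in for a model's next-token probabilities; it is chosen so that the locally best first token does not lead to the best overall sequence.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence generated so far.
PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy(steps):
    """Always take the single highest-scoring token at each step."""
    seq = ()
    for _ in range(steps):
        dist = PROBS[seq]
        seq += (max(dist, key=dist.get),)
    return seq

def beam_search(steps, beam_width):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [((), 0.0)]  # (sequence, total log-probability)
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + math.log(p))
            for seq, score in beams
            for tok, p in PROBS[seq].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

print(greedy(2))          # ('A', 'x'), total probability 0.6 * 0.55 = 0.33
print(beam_search(2, 2))  # ('B', 'x'), total probability 0.4 * 0.9 = 0.36
```

With a beam width of 1, beam search reduces to greedy decoding.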

Attention

  • Attention is the process of identifying which tokens in the input are most relevant for predicting the next token
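
One common way Transformers compute this relevance is scaled dot-product attention. The sketch below computes only the attention weights for a single query; the vectors are hypothetical 2-dimensional examples, whereas real models use learned, much larger vectors and many attention heads.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of key vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical key vectors for three earlier tokens, and a query for the current position.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])
```

Tokens whose key vector points in a similar direction to the query receive a larger weight; here the second token gets the smallest share.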

Context Length

  • Context length refers to the maximum number of tokens the model can consider at once when generating a response
  • It represents the model's maximum attention span
  • gemma4:26B has context length of 256K tokens

Sliding Window Size

  • Maximum number of historical tokens a model’s local attention layer can look back on at any given moment
  • gemma4:26B has sliding window size of 1024 tokens
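
A sliding window can be pictured as a mask over the attention matrix. The sketch below assumes the window counts the current token plus the tokens immediately before it; the exact convention varies between implementations, and the function name is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in sliding_window_mask(5, 3):
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

From the fourth row on, the oldest tokens fall out of view, which is what keeps the cost of a local attention layer bounded regardless of the full context length.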

Vocabulary Size

  • Total number of unique tokens the model knows
  • gemma4:26B has vocabulary size of 262K

Embedding Size

  • Number of dimensions used to mathematically represent each token
  • embeddinggemma has an embedding size of 768
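
Vocabulary size and embedding size together define the embedding table: one row per token ID, one column per dimension. The sketch below uses a toy 5-token vocabulary and 4 dimensions with random values; all names and numbers are illustrative, and a trained model learns the values instead.

```python
import random

vocab = {"It": 0, " ain": 1, "'t": 2, " worth": 3, " it": 4}  # toy vocabulary
embedding_size = 4  # real models use hundreds or thousands of dimensions

random.seed(0)
# One row per token ID; a trained model learns these values.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_size)]
    for _ in range(len(vocab))
]

def embed(tokens):
    """Map tokens to IDs, then IDs to their embedding vectors."""
    return [embedding_table[vocab[t]] for t in tokens]

vectors = embed(["It", " worth", " it"])
print(len(vectors), len(vectors[0]))  # 3 4
```

The table holds vocabulary size × embedding size values, so both settings contribute directly to the parameter count.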

Parameters

  • The total count of weights and biases in the neural network; roughly, the model's capacity to store what it learned in training
  • gemma3:4b has 4.3B parameters
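
The parameter count gives a quick lower bound on memory: parameters × bits per weight. The sketch below applies that arithmetic to the 4.3B figure above at the precisions covered under Quantization. It counts the weights only and ignores activations, the KV cache, and format overhead, so real usage is higher.

```python
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}

def weight_memory_gb(num_params, bits):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(4.3e9, bits):.2f} GB")
# FP16: 8.60 GB
# FP8: 4.30 GB
# FP4: 2.15 GB
```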

Number of Layers

  • A Layer is a distinct structural processing block within the model’s neural network
  • Number of Layers represents depth of the model
  • gemma4:26B has 30 layers

LLM Settings

Prompt

  • The input sequence you provide an LLM is called a prompt
  • When we pass information to LLMs, we structure our input in a way that guides the generation of the LLM toward the desired output. This is called prompting.
  • See Prompt_Engineering

Training LLM

  • LLMs are pre-trained on large datasets of text
  • Pre-training is unsupervised (more precisely, self-supervised): the model learns the structure and patterns of language from the text itself
  • After pre-training, they are fine-tuned for specific tasks
  • See __Training

Distillation

Quantization

  • Model weights are typically stored as 16-bit floating-point (FP16) numbers
  • Quantization reduces the precision to 8-bit (FP8) or 4-bit (FP4), which shrinks the model and speeds up generation
  • It introduces quantization error, which can reduce accuracy, reasoning, or response quality; the loss is usually small at 8-bit and grows as precision drops
  • Perplexity is a common way to measure how much quantization degrades a model
  • FP64 (double-precision 64-bit floating point) is rarely used by models because of its memory footprint
Format  Use case
FP64    Physics simulations
FP16    LLM training and high-precision inference
FP8     Enterprise deployment for LLM inference
FP4     Ultra-low-precision quantization for local consumer-grade devices
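
To show where quantization error comes from, here is a minimal sketch of symmetric absmax integer quantization. Note that this is a simpler scheme than the FP8 / FP4 floating-point formats named above and is used only because it is easy to follow; the weights are hypothetical and the code assumes at least one weight is non-zero.

```python
def quantize(weights, bits):
    """Symmetric absmax quantization: map floats to signed integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_weights]

def max_error(weights, bits):
    """Largest absolute difference between original and round-tripped weights."""
    q, scale = quantize(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(q, scale)))

weights = [0.12, -0.53, 0.97, -0.08, 0.31]  # hypothetical weights
print(f"8-bit max error: {max_error(weights, 8):.4f}")
print(f"4-bit max error: {max_error(weights, 4):.4f}")
```

Fewer bits mean fewer representable levels, so the rounding error per weight grows as precision drops.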

GGUF Quantization formats

Format             Perplexity increase (ppl)  Notes                                   Recommended
Q4_0               +0.2499                    small, high quality loss
Q4_K_M (aka Q4_K)  +0.0535                    medium, balanced quality                Yes
Q5_K_S             +0.0353                    large, low quality loss                 Yes
Q5_K_M             +0.0142                    large, very low quality loss            Yes
Q8_0               +0.0004                    very large, extremely low quality loss
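
For reference, perplexity is the exponential of the average negative log-probability the model assigns to the actual next tokens, so lower is better. The sketch below computes it from a list of per-token probabilities; the probability values are hypothetical and only illustrate how slightly worse predictions raise the score.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the actual tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

baseline = perplexity([0.5, 0.25, 0.5, 0.25])   # hypothetical full-precision model
quantized = perplexity([0.45, 0.25, 0.5, 0.2])  # hypothetical quantized model
print(f"increase: +{quantized - baseline:.4f} ppl")
```

A model that always assigns probability 0.25 to the correct token has a perplexity of 4, as if it were choosing uniformly among four options.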

LLM capabilities

  • Base model is trained on raw text data
    • It can be fine-tuned to add capabilities such as instruction following and tool calling
  • Instruct: To follow instructions and engage in conversations
  • Tool calling: To perform function calls
  • Multilingual
  • Vision
  • Audio
  • Reasoning
  • Multimodal: Process Text, Image, Audio simultaneously
  • Mixture of Experts (MoE)
  • Embedding
  • Deep Research
  • Structured Output

Mixture of Experts (MoE)

LLM Model Contents

  • Example:
  • Architecture and Weights
    • config.json: number of layers, attention heads, hidden size, base model type
    • .safetensors: weights of the model
    • .safetensors.index.json: maps each tensor to the weights file that contains it; used for big models split across multiple weights files
  • Tokenizer
    • tokenizer.json: the model's vocabulary, mapping each token (word, subword, or character) to a unique ID
    • tokenizer_config.json: tells the library how to handle special tokens and settings such as the maximum sequence length
    • special_tokens_map.json: explicitly defines which tokens act as structural markers like the start, end, or padding of text
  • Chat
    • chat_template.jinja: template that turns a list of messages into a raw prompt string
  • Output Control
    • generation_config.json: default generation settings such as temperature, top_p, and top_k
  • Multimodal
    • preprocessor_config.json: feature-extractor settings (sizes and normalization arrays) for image resizing, cropping, or audio sample normalization
      • contains processor class
    • processor_config.json: links tokenization with image/video/audio processing
      • contains processor class
  • LoRA
    • adapter_config.json: tells the loading library how to attach the adapter to the base model
    • adapter_model.safetensors: adapter weights
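
As an illustration of how the index file ties a sharded model together, the sketch below groups tensor names by the weights file that stores them. The JSON content is a hypothetical, heavily shortened example (tensor names, file names, and `total_size` are made up), and `tensors_by_shard` is an illustrative helper, not a library function.

```python
import json
from collections import defaultdict

# Hypothetical, heavily shortened index file content.
INDEX_JSON = """
{
  "metadata": {"total_size": 1000},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00002.safetensors"
  }
}
"""

def tensors_by_shard(index_text):
    """Group tensor names by the weights file that stores them."""
    shards = defaultdict(list)
    for tensor_name, filename in json.loads(index_text)["weight_map"].items():
        shards[filename].append(tensor_name)
    return dict(shards)

for filename, names in tensors_by_shard(INDEX_JSON).items():
    print(filename, len(names))
```

A loader can use this mapping to open only the shard that holds the tensor it needs.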