Local LLMs
LLaMA
Large Language Model Meta AI
Family of highly popular open-source AI models by Meta
First time released open-weights of the model
Model Inference engines
llama.cpp
part of ggml: https://github.com/ggml-org
developed by Georgi Gerganov
open source library to perform inference on various LLM models
introduced GGUF (GGML Universal File) format which is used to save and load model data
written in C++
vLLM
open source framework for inference and serving of LLM models
developed by University of California, Berkeley’s Sky Computing Lab
OpenAI compatible API
used for high traffic production environment
written in C++, Python, CUDA
LiteRT-LM
GGUF
Safetensors
developed by hugging face
Llamafile
developed by Mozilla
Data (weights) + Runtime (llama.cpp engine + server)
can execute models directly
LiteRT
Ollama
model inference: llama.cpp, MLX for MacOS
open source
it is slow and good for personal use
downloads models from its own repo + support for hugging face
LM Studio
model inference: llama.cpp, MLX for MacOS
closed source
downloads GGUF/MLX models from hugging face
has lm studio community GGUF models in hugging face repo
LocalAI
LLM Frontend
AI Edge Computing
Compute
https://blog.dailydoseofds.com/p/cpu-vs-gpu-vs-tpu-vs-npu-vs-lpu
https://priyankavergadia.substack.com/p/cpu-vs-gpu-vs-tpu-vs-npu-vs-dpu-vs
CPU (Central Processing Unit)
General Purpose few powerful cores
Deep Cache hierarchy (L1, L2, L3) and off-chip main memory (DRAM)
Good for OS, DB, Decision Heavy code
GPU (Graphics Processing Unit)
Spread work across thousands of smaller cores
Execute same instruction on different data parallelly
Dedicated High Bandwidth Memory (HBM) or Graphics Double Data Rate (GDDR) Memory
HBM and GDDR are type of VRAM, HBM is more premium
Execution is compiler controlled not hardware scheduled
Heavily used in AI Training
Measured in TFLOPS
Supports FP16, BF16 (modern GPUs)
TPU (Tensor Processing Unit)
Built by Google
One core is a grid of MAC (Multiply-Accumulate) units
Specialized for Matrix Multiplication of Neural Networks
Data flows in Wave like pattern
Has On-chip SRAM (Static RAM ) and Off-chip HBM
Uses BF16 (Brain Floating-Point 16-bit) developed by Google
NPU (Neural Processing Unit)
Edge Optimized Variant
Built around Neural Compute Engine
Has On-chip SRAM and Off-chip low power system memory
Supports INT8, modern NPU also supports FP16 or INT4
Measured in TOPS
Example: Apple Neural Engine or Intel’s NPU
LPU (Language Processing Unit)
developed by Groq
Execution is fully deterministic and compiler scheduled
Provides limited memory per chip
Has On-chip SRAM only
Lot of chips required to run LLM
TOPS (Tera/Trillions Of Operations Per Second)
Measures processor’s potential peak AI inference performance
Involves integer calculations
Microsoft defined minimum of 40 TOPS NPU and 16 GB RAM for AI PCs
TFLOPS (Tera/Trillions of Floating Point Operations Per Second)
Measure AI training performance or gaming performance
Involves floating point calculations
AI chip has TFLOP > TOP
GPUs are measured in TFLOP and can be found in hugging face profiles to measure how GPU rich you are