Local LLMs

LLaMA

Large Language Model Meta AI
Family of highly popular open-source AI models by Meta
First time released open-weights of the model

Model Inference engines

https://huggingface.co/learn/llm-course/chapter2/8
llama.cpp
vLLM
SGLang
TensorRT-LLM by Nvidia
mlc-llm
transformers by hugging face
MLX
LiteRT-LM

llama.cpp

part of ggml: https://github.com/ggml-org
developed by Georgi Gerganov
open source library to perform inference on various LLM models
introduced GGUF (GGML Universal File) format which is used to save and load model data
written in C++

vLLM

open source framework for inference and serving of LLM models
developed by University of California, Berkeley’s Sky Computing Lab
OpenAI compatible API
used for high traffic production environment
written in C++, Python, CUDA

LiteRT-LM

created by Google
high performance inference on edge devices
Runs on Android, iOS, Web, Desktop, IoT
Supports smaller models from Gemma, Qwen
https://developers.google.com/edge/litert-lm/overview

Model Formats

GGUF
- developed by llama.cpp
Safetensors
- developed by hugging face
Llamafile
- developed by Mozilla
- Data (weights) + Runtime (llama.cpp engine + server)
- can execute models directly
LiteRT
- developed by Google

Tools to download and run models

Ollama

model inference: llama.cpp, MLX for MacOS
open source
it is slow and good for personal use
downloads models from its own repo + support for hugging face

LM Studio

model inference: llama.cpp, MLX for MacOS
closed source
downloads GGUF/MLX models from hugging face
has lm studio community GGUF models in hugging face repo

LocalAI

uses backends which include llama.cpp, vLLM, whisper, MLX etc.
open source
https://localai.io/model-compatibility/

LLM Frontend

Open WebUI
AnythingLLM

AI Edge Computing

It refers to deployment of artificial intelligence models directly on edge devices, such as IoT sensors, smartphones, autonomous vehicles, and embedded systems
Benefits
- Realtime responsiveness
- Reduced infra overhead
- Enhanced privacy
- Scalable
- Offline Functionality
Use cases
- Automotive: Autonomous vehicles
- Healthcare: wearable devices and medical equipments
- Smart Home Devices
LiteRT-LM from Google is developed for such use cases
https://www.arm.com/glossary/edge-ai
https://developers.googleblog.com/gemma-4-12b-the-developer-guide/
https://developers.google.com/edge

Compute

https://blog.dailydoseofds.com/p/cpu-vs-gpu-vs-tpu-vs-npu-vs-lpu
https://priyankavergadia.substack.com/p/cpu-vs-gpu-vs-tpu-vs-npu-vs-dpu-vs
CPU (Central Processing Unit)
- General Purpose few powerful cores
- Deep Cache hierarchy (L1, L2, L3) and off-chip main memory (DRAM)
- Good for OS, DB, Decision Heavy code
GPU (Graphics Processing Unit)
- Spread work across thousands of smaller cores
- Execute same instruction on different data parallelly
- Dedicated High Bandwidth Memory (HBM) or Graphics Double Data Rate (GDDR) Memory
- HBM and GDDR are type of VRAM, HBM is more premium
- Execution is compiler controlled not hardware scheduled
- Heavily used in AI Training
- Measured in TFLOPS
- Supports FP16, BF16 (modern GPUs)
TPU (Tensor Processing Unit)
- Built by Google
- One core is a grid of MAC (Multiply-Accumulate) units
- Specialized for Matrix Multiplication of Neural Networks
- Data flows in Wave like pattern
- Has On-chip SRAM (Static RAM) and Off-chip HBM
- Uses BF16 (Brain Floating-Point 16-bit) developed by Google
NPU (Neural Processing Unit)
- Edge Optimized Variant
- Built around Neural Compute Engine
- Has On-chip SRAM and Off-chip low power system memory
- Supports INT8, modern NPU also supports FP16 or INT4
- Measured in TOPS
- Example: Apple Neural Engine or Intel’s NPU
LPU (Language Processing Unit)
- developed by Groq
- Execution is fully deterministic and compiler scheduled
- Provides limited memory per chip
- Has On-chip SRAM only
- Lot of chips required to run LLM
TOPS (Tera/Trillions Of Operations Per Second)
- Measures processor’s potential peak AI inference performance
- Involves integer calculations
- Microsoft defined minimum of 40 TOPS NPU and 16 GB RAM for AI PCs
TFLOPS (Tera/Trillions of Floating Point Operations Per Second)
- Measure AI training performance or gaming performance
- Involves floating point calculations
- AI chip has TFLOP > TOP
- GPUs are measured in TFLOP and can be found in hugging face profiles to measure how GPU rich you are

Experiments

Explorer

Local_LLM

Local LLMs

LLaMA

Model Inference engines

llama.cpp

vLLM

LiteRT-LM

Model Formats

Tools to download and run models

Ollama

LM Studio

LocalAI

LLM Frontend

AI Edge Computing

Compute

Table of Contents