Local LLMs

LLaMA

  • Large Language Model Meta AI
  • Family of highly popular open-source AI models by Meta
  • First time released open-weights of the model

Model Inference engines

llama.cpp

  • part of ggml: https://github.com/ggml-org
  • developed by Georgi Gerganov
  • open source library to perform inference on various LLM models
  • introduced GGUF (GGML Universal File) format which is used to save and load model data
  • written in C++

vLLM

  • open source framework for inference and serving of LLM models
  • developed by University of California, Berkeley’s Sky Computing Lab
  • OpenAI compatible API
  • used for high traffic production environment
  • written in C++, Python, CUDA

LiteRT-LM

Model Formats

  • GGUF
    • developed by llama.cpp
  • Safetensors
    • developed by hugging face
  • Llamafile
    • developed by Mozilla
    • Data (weights) + Runtime (llama.cpp engine + server)
    • can execute models directly
  • LiteRT
    • developed by Google

Tools to download and run models

Ollama

  • model inference: llama.cpp, MLX for MacOS
  • open source
  • it is slow and good for personal use
  • downloads models from its own repo + support for hugging face

LM Studio

  • model inference: llama.cpp, MLX for MacOS
  • closed source
  • downloads GGUF/MLX models from hugging face
  • has lm studio community GGUF models in hugging face repo

LocalAI

LLM Frontend

  • Open WebUI
  • AnythingLLM

AI Edge Computing

Compute

  • https://blog.dailydoseofds.com/p/cpu-vs-gpu-vs-tpu-vs-npu-vs-lpu
  • https://priyankavergadia.substack.com/p/cpu-vs-gpu-vs-tpu-vs-npu-vs-dpu-vs
  • CPU (Central Processing Unit)
    • General Purpose few powerful cores
    • Deep Cache hierarchy (L1, L2, L3) and off-chip main memory (DRAM)
    • Good for OS, DB, Decision Heavy code
  • GPU (Graphics Processing Unit)
    • Spread work across thousands of smaller cores
    • Execute same instruction on different data parallelly
    • Dedicated High Bandwidth Memory (HBM) or Graphics Double Data Rate (GDDR) Memory
    • HBM and GDDR are type of VRAM, HBM is more premium
    • Execution is compiler controlled not hardware scheduled
    • Heavily used in AI Training
    • Measured in TFLOPS
    • Supports FP16, BF16 (modern GPUs)
  • TPU (Tensor Processing Unit)
    • Built by Google
    • One core is a grid of MAC (Multiply-Accumulate) units
    • Specialized for Matrix Multiplication of Neural Networks
    • Data flows in Wave like pattern
    • Has On-chip SRAM (Static RAM) and Off-chip HBM
    • Uses BF16 (Brain Floating-Point 16-bit) developed by Google
  • NPU (Neural Processing Unit)
    • Edge Optimized Variant
    • Built around Neural Compute Engine
    • Has On-chip SRAM and Off-chip low power system memory
    • Supports INT8, modern NPU also supports FP16 or INT4
    • Measured in TOPS
    • Example: Apple Neural Engine or Intel’s NPU
  • LPU (Language Processing Unit)
    • developed by Groq
    • Execution is fully deterministic and compiler scheduled
    • Provides limited memory per chip
    • Has On-chip SRAM only
    • Lot of chips required to run LLM
  • TOPS (Tera/Trillions Of Operations Per Second)
    • Measures processor’s potential peak AI inference performance
    • Involves integer calculations
    • Microsoft defined minimum of 40 TOPS NPU and 16 GB RAM for AI PCs
  • TFLOPS (Tera/Trillions of Floating Point Operations Per Second)
    • Measure AI training performance or gaming performance
    • Involves floating point calculations
    • AI chip has TFLOP > TOP
    • GPUs are measured in TFLOP and can be found in hugging face profiles to measure how GPU rich you are