
Local LLM - 101
So, you want to run LLMs locally & set yourself up with some sort of agentic workflow. The whole local LLM scene is quite overwhelming. You can run both text & vision models locally pretty well these days. This is a note to self & maybe useful to you too.
When it comes to running LLMs locally, the first issue is hardware. LLMs require a lot of computational power. If you are using Apple Silicon, you'll be surprised by its performance (at least I was ¯\_(ツ)_/¯).
Typically, LLMs require a lot of storage, memory & compute power (lots of $$).
LLMs are huge collections of numbers, billions of them. These numbers need to be stored with incredible precision. Normally, AI models use 32-bit (4 byte) precision. Now, if we do quick math for an LLM with 7 billion parameters, we'll get
7B × 4 bytes = 28 GB!
This is where quantization steps in.
LLM quantization reduces the precision of model weights from 32-bit floating point to lower bit formats (8-bit, 4-bit, or even lower).
Wait! If quantization reduces the precision, shouldn't it affect the model quality??!
Not necessarily! Modern quantization techniques are surprisingly effective at preserving model quality while drastically reducing size.
Quantization works by mapping the original high-precision values to a smaller set of representative values. It doesn't randomly reduce precision; it carefully analyzes which weights are most important & preserves their values more accurately.
This process:
- Significantly reduces model size
- Decreases memory requirements during inference
- Speeds up inference time
- Enables deployment on resource-constrained devices
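To make that concrete, here's a toy worked example using plain symmetric 8-bit quantization (real schemes such as GGUF's K-quants are more sophisticated). Say a layer's weights all fall within ±0.5. The scale is 0.5 / 127 ≈ 0.00394, so a weight of 0.30 is stored as the integer round(0.30 / 0.00394) = 76 & reconstructed at inference time as 76 × 0.00394 ≈ 0.299. The model barely notices the difference, but each weight now takes 1 byte instead of 4.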
Now, for the same 7 billion parameter LLM at 8-bit (1 byte) precision, we'll get
7B × 1 byte = 7 GB!!
A whopping 75% reduction in size.
You can read further on how quantization affects model output.
Now, we need to talk about LLM formats. There are a lot of model formats. Briefly, some of them are:
GGUF
- What it is: Container format for quantized models optimized for CPU/consumer hardware
- Used by: llama.cpp, Ollama, LM Studio
- Key benefit: Efficient memory usage with various quantization levels (q2, q4, q8)
GPTQ
- What it is: Post-training quantization method using calibration data
- Used by: AutoGPTQ, ExLlama, Hugging Face
- Key benefit: Better quality preservation at 4-bit & 3-bit precision
AWQ (Activation-aware Weight Quantization)
- What it is: Preserves important weights based on activation values
- Used by: vLLM, Hugging Face
- Key benefit: Better performance than GPTQ at 4-bit precision
SAFETENSORS
- What it is: Safe storage format for model weights (not quantized)
- Used by: Hugging Face, most modern models
- Key benefit: Secure, memory-mapped loading without execution risks
Out of these formats, GGUF is our best buddy. It is widely used by tools like Ollama, llama.cpp, LM Studio etc.
GGUF (Georgi Gerganov Universal Format) is a special file format designed for efficiently storing & running quantized LLMs on consumer hardware. Here's what you need to know:
- GGUF packages everything needed to run an LLM into a single file (model, tokenizer, settings)
- Enables running models with much less memory than original formats (thanks to quantization!)
- Optimized for CPU & limited GPU systems like laptops
If you search for GGUF models, you'll see something like q4_k_m. Here's how to decode it:
First number (q2, q4, q5, q8): bits per weight - lower means a smaller file but potentially lower quality
- q8: 8-bit precision (largest, highest quality)
- q4: 4-bit precision (good balance)
- q2: 2-bit precision (smallest, lowest quality)
The letter K indicates "K-quantization", which prioritizes important weights. To learn more you can read this PR.
Size indicator (S, M, L):
- S (Small): smallest file size, fastest, lowest quality
- M (Medium): balanced approach (recommended for most users)
- L (Large): larger file size, higher quality
Example: a file named llama-3-8b.q4_k_m.gguf means:
- 4-bit quantization with K-quantization method
- Medium variant (balanced approach)
- ~4× smaller than the original model (most models ship in 16-bit), with good quality preservation
For most common applications, 4-bit quantized models (like Q4_K_M) provide a sweet spot: 87.5% smaller than full 32-bit precision, with only a minimal quality reduction that most users won't notice in conversational use.
The best source for LLMs is Hugging Face. There's a neat tool, HuggingFace Model Downloader, that supports resuming failed downloads & has some other cool perks.
Now that we know about quantization & model format, let's download the software to run LLMs.
There are plenty of options available. Notably,
- llama.cpp
- Ollama
- koboldcpp
- vLLM
- Text generation web UI / oobabooga
- Axolotl - Mostly used for fine-tuning
- LM Studio
and a lot more!
You can spend weeks just playing with different ways to run LLMs & frontends. I personally prefer koboldcpp to quickly test out GGUF models. It's a single binary that you just run, & it comes with a web UI that has a whole lot of functionality.
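A rough sketch of how that looks (the binary name, model path & flags here are illustrative, check koboldcpp's README for your platform):

```
# point the koboldcpp binary at a GGUF file
./koboldcpp --model ./models/llama-3-8b.q4_k_m.gguf
# the web UI comes up at http://localhost:5001 by default
```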
If I want to serve OpenAI-compatible API endpoints, my choice is Ollama, simply because it's easy to install on Linux, macOS & Windows. I happen to, unfortunately, use all of these OSes.
Don't forget to export OLLAMA_HOST=0.0.0.0 if you plan to serve API endpoints over the network!
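By default Ollama only listens on localhost (port 11434). A minimal sketch of exposing it on your LAN:

```
# bind to all interfaces instead of just 127.0.0.1, then start the server
export OLLAMA_HOST=0.0.0.0
ollama serve
```

If Ollama runs as a system service (the Linux install script sets one up), set OLLAMA_HOST in the service's environment instead of your shell.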
Let's go with Ollama. You can either install it directly or run it through the official Docker image.
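On Linux that looks something like this (the Docker command is the CPU-only variant; GPU setups need extra flags, so check the official docs):

```
# install script (Linux); macOS & Windows installers are on ollama.com
curl -fsSL https://ollama.com/install.sh | sh

# or run the official Docker image
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```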
Now, you need to choose a model to run with Ollama. Find out which model works well for your use case. There's an LLM leaderboard that you can utilize. For coding, I found DeepSeek-Coder pretty good. You can find the list of models in the Ollama library.
You should figure out a parameter count & quantization level that your system can comfortably run.
Here's a calculator to give you a rough idea.
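If you just want a back-of-the-envelope number, it's roughly parameters × bits per weight ÷ 8, plus some headroom for the context cache & runtime. A quick sketch (the 1.2 overhead factor is my own ballpark, not a precise figure):

```
params_b=7   # parameters, in billions
bits=4       # quantization level, e.g. 4 for a q4 model
echo "$params_b * $bits / 8 * 1.2" | bc -l   # ≈ 4.2 GB of memory needed
```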
Now you have Ollama installed & a model of choice with a parameter count & quantization.
Let's pull the model from the Ollama library. Be careful to specify the parameter count & quantization in the Ollama command.
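For example, DeepSeek-Coder at 6.7B parameters with Q4_K_M quantization (the exact tag name is an example, double-check it on the model's Ollama library page):

```
# pull a specific parameter count + quantization, then chat with it in the terminal
ollama pull deepseek-coder:6.7b-instruct-q4_K_M
ollama run deepseek-coder:6.7b-instruct-q4_K_M

# the same model is now also available through Ollama's OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-coder:6.7b-instruct-q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}'
```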
Resources