Choosing the right model for your hardware
Running a model locally means the model must fit (mostly) in RAM or VRAM. This guide helps you pick a quantization and size that suits your machine.
Quick reference: model size vs. RAM
| Model family | Parameters | Q4_K_M size | Minimum RAM | Recommended RAM |
|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | ~2.3 GB | 4 GB | 8 GB |
| Qwen 2.5 | 7B | ~4.5 GB | 8 GB | 16 GB |
| Llama 3.1 | 8B | ~5.0 GB | 8 GB | 16 GB |
| Mistral NeMo | 12B | ~7.5 GB | 12 GB | 24 GB |
| Llama 3.3 | 70B | ~43 GB | 64 GB | 80 GB |
Rule of thumb: model file size + ~2 GB for the KV cache and runtime overhead.
Quantization formats explained
GGUF files ship in several quantization levels. The suffix tells you the trade-off.
| Suffix | Bits/weight | Quality | File size | When to use |
|---|---|---|---|---|
Q2_K |
~2.6 | Low | Smallest | Memory-constrained edge devices |
Q4_K_S |
~4.4 | Good | Small | ~Same quality as Q4_K_M, slightly smaller |
Q4_K_M |
~4.8 | Very good | Medium | Best default for most workloads |
Q5_K_M |
~5.7 | Excellent | Larger | When you have headroom and want near-F16 |
Q8_0 |
8.0 | Near lossless | Large | Benchmarking / highest-quality local |
F16 |
16.0 | Lossless | Very large | Not recommended for inference; use for fine-tuning |
Start with Q4_K_M. It gives 95%+ of F16 quality at roughly 30% of the size.
GPU offloading (GpuLayers)
The GpuLayers option controls how many transformer layers are offloaded to the GPU. More layers = faster inference but more VRAM.
builder.Services.AddHearth(options =>
{
options.Model = "./models/qwen2.5-7b-instruct-q4_k_m.gguf";
options.GpuLayers = 35; // tune this for your VRAM budget
});
VRAM budget guide for a 7B Q4_K_M model (~4.5 GB)
| Available VRAM | Suggested GpuLayers |
Effect |
|---|---|---|
| 4 GB | 10–15 | Partial offload; mixed CPU/GPU |
| 6 GB | 20–28 | Most layers on GPU |
| 8 GB | 32–35 | Full offload; fastest |
| 12 GB+ | 999 | Offload everything including KV cache |
Use 999 to offload as many layers as the backend allows. LLamaSharp will silently cap it at the model's actual layer count.
Apple Silicon (Metal)
On a MacBook with unified memory, GPU and CPU share the same pool:
options.GpuLayers = 999; // Metal, unified memory — offload everything
Install Hearth.AI.Metal and set GpuLayers = 999. Most 7–13B models run well even on an 8 GB M-series chip because there is no VRAM/RAM split.
Context size (ContextSize)
Larger context windows cost memory quadratically (KV cache). Only raise it when you need it.
| Use case | Recommended ContextSize |
|---|---|
| Short Q&A, chat | 2048–4096 |
| Document summarization | 8192–16384 |
| Long document RAG | 32768+ |
A 7B model with ContextSize = 4096 needs roughly 512 MB KV cache. With ContextSize = 32768, that grows to ~4 GB.
Recommended starting configurations
Developer laptop (16 GB RAM, no discrete GPU)
options.Model = "./models/qwen2.5-7b-instruct-q4_k_m.gguf";
options.ContextSize = 4096;
options.GpuLayers = 0;
options.Threads = 8;
Gaming PC (16 GB RAM, 8 GB VRAM — CUDA)
options.Model = "./models/qwen2.5-7b-instruct-q4_k_m.gguf";
options.ContextSize = 8192;
options.GpuLayers = 35;
Apple Silicon MacBook Pro (24 GB unified memory)
options.Model = "./models/llama-3.1-8b-instruct-q5_k_m.gguf";
options.ContextSize = 16384;
options.GpuLayers = 999;
Server (64 GB RAM, A100 80 GB VRAM)
options.Model = "./models/llama-3.3-70b-instruct-q4_k_m.gguf";
options.ContextSize = 32768;
options.GpuLayers = 999;
Where to download models
Hearth works with any GGUF model. The most practical source is Hugging Face. Look for repos maintained by bartowski or unsloth — they publish high-quality GGUF conversions with consistent naming conventions.
You can also let Hearth download the model automatically by passing a Hugging Face repo ID:
options.Model = "bartowski/Qwen2.5-7B-Instruct-GGUF";
options.ModelFile = "Qwen2.5-7B-Instruct-Q4_K_M.gguf";
The file is cached in ~/.hearth/models and reused on subsequent runs.