Deploying Hearth to production
Hearth is a library that runs inside your ASP.NET Core process. There is no separate inference server to manage. This guide covers containerization, hardware sizing, performance tuning, and operational considerations.
Docker
The repository ships with a Dockerfile. Build and run it with your model:
# Build the image
docker build -t hearth-ai/demo .
# Run with a local model directory bind-mounted
docker run -p 5000:5000 \
-v /data/models:/app/models \
-e HEARTH_MODEL=/app/models/qwen2.5-7b-instruct-q4_k_m.gguf \
hearth-ai/demo
Then hit it with any OpenAI-compatible client:
curl http://localhost:5000/v1/models
GPU pass-through (CUDA)
docker run --gpus all -p 5000:5000 \
-v /data/models:/app/models \
-e HEARTH_MODEL=/app/models/qwen2.5-7b-instruct-q4_k_m.gguf \
hearth-ai/demo
The base image uses the CUDA runtime variant when Hearth.AI.Cuda is installed. Requires the NVIDIA Container Toolkit on the host.
Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
name: hearth
spec:
replicas: 1
selector:
matchLabels:
app: hearth
template:
metadata:
labels:
app: hearth
spec:
containers:
- name: hearth
image: hearth-ai/demo:latest
ports:
- containerPort: 5000
env:
- name: HEARTH_MODEL
value: /models/qwen2.5-7b-instruct-q4_k_m.gguf
- name: Hearth__ContextSize
value: "4096"
- name: Hearth__GpuLayers
value: "35"
volumeMounts:
- name: models
mountPath: /models
resources:
requests:
memory: "12Gi"
limits:
memory: "16Gi"
volumes:
- name: models
persistentVolumeClaim:
claimName: model-pvc
Important: set
replicas: 1unless your model fits multiple times in the node's memory. Each pod loads a full copy of the weights.
Configuration via environment variables
All HearthOptions properties can be set via environment variables using the Hearth__ prefix:
Hearth__Model=/models/qwen2.5-7b-instruct-q4_k_m.gguf
Hearth__ContextSize=8192
Hearth__GpuLayers=35
Hearth__BatchSize=512
Hearth__Threads=8
Performance tuning
CPU inference
options.Threads = Environment.ProcessorCount; // use all cores
options.BatchSize = 1024; // larger batches for throughput
Threads = -1 (the default) lets llama.cpp pick the count automatically. For dedicated inference boxes, setting it explicitly to physical core count (not logical) often improves throughput.
GPU inference
options.GpuLayers = 999; // offload as much as VRAM allows
options.BatchSize = 512; // start here; increase if VRAM allows
Monitor VRAM usage with nvidia-smi (CUDA) or sudo powermetrics --samplers gpu_power (Apple Silicon) to find the sweet spot.
Context size vs throughput
Each concurrent request allocates its own KV cache. With 5 concurrent requests at ContextSize = 4096, you need ~2.5 GB for KV caches alone (for a 7B model). Calculate accordingly:
KV cache ≈ 2 × ContextSize × Layers × Heads × HeadDim × sizeof(float16)
For a 7B model at 4096 tokens: roughly 512 MB per context.
Concurrency
Hearth creates one StatelessExecutor per inference call. Concurrent requests run in separate contexts derived from the same shared LLamaWeights. The weights are read-only; there are no thread-safety concerns at the weights level.
CPU inference throughput scales linearly with cores up to roughly 8–12 threads per request. If you have many cores, consider running multiple processes behind a load balancer rather than one process with many threads.
Memory considerations
| Component | Approximate size |
|---|---|
| Model weights (7B Q4_K_M) | 4.5 GB |
| KV cache per active context (4096 tokens) | 512 MB |
| .NET runtime + overhead | ~200 MB |
Set your container's memory limit to at minimum model_size + (max_concurrent × kv_cache_per_context) + 512 MB.
Health check
Add a simple health endpoint alongside MapHearth():
app.MapGet("/health", () => Results.Ok(new { status = "ok" }));
Or use the built-in GET /v1/models endpoint — if it returns 200, the model is loaded and inference is available.
Logging
Hearth emits structured log events via Microsoft.Extensions.Logging. In production, set the minimum level appropriately:
{
"Logging": {
"LogLevel": {
"Hearth": "Warning"
}
}
}
The Hearth category emits model-loaded and model-loading events at Information level, and all inference diagnostics at Debug level. Filter to Warning in production to reduce noise.