Retrieval-augmented generation (RAG)
Hearth.AI.Rag adds a local RAG pipeline on top of the base Hearth.AI package. It chunks and embeds documents using the same model runtime that handles chat, stores vectors in an in-memory or SQLite store, and answers questions by retrieving relevant chunks before calling the model.
Everything runs in-process. There are no external services, no API keys, and no round-trips over the network.
Install
dotnet add package Hearth.AI.Rag
Register the pipeline
Chain .AddRag() from the IHearthBuilder returned by AddHearth():
builder.Services.AddHearth(options =>
    {
        options.Model = "./models/qwen2.5-7b-q4_k_m.gguf";
    })
    .AddRag(options =>
    {
        options.VectorStore = VectorStoreType.InMemory;
        options.ChunkSize = 512;
        options.ChunkOverlap = 50;
        options.Chunker = ChunkerType.Recursive;
    });
.AddRag() registers:
- IRagPipeline: the main entry point for indexing and querying
- IVectorStore: either in-memory or SQLite, based on your options
- IDocumentChunker: splits text into overlapping chunks
- DocumentLoaderRegistry and the built-in loaders (PlainTextLoader, MarkdownLoader, HtmlLoader)
RagOptions reference
| Option | Default | Description |
|---|---|---|
| VectorStore | InMemory | InMemory or Sqlite |
| SqlitePath | "hearth-rag.db" | Database file path, used when VectorStore is Sqlite |
| ChunkSize | 512 | Maximum number of tokens per chunk |
| ChunkOverlap | 50 | Overlap between consecutive chunks |
| Chunker | Recursive | Recursive (general text) or Markdown (structured docs) |
| ContextPromptTemplate | built-in | System prompt template injected with retrieved chunks |
Basic usage
Inject IRagPipeline and index some text, then ask a question:
public sealed class DocsService(IRagPipeline rag)
{
    public async Task IndexAsync(CancellationToken cancellationToken = default)
    {
        await rag.IndexAsync(
            "Hearth is a .NET library for local LLM inference. " +
            "It uses LLamaSharp under the hood and exposes Microsoft.Extensions.AI interfaces.",
            metadata: new { Source = "readme" },
            ct: cancellationToken);
    }

    public async Task<string> AskAsync(string question, CancellationToken cancellationToken = default)
    {
        RagResult result = await rag.AskAsync(question, ct: cancellationToken);
        return result.Answer;
    }
}
IndexAsync chunks the text, generates embeddings, and writes them to the vector store. AskAsync embeds the question, retrieves the top matching chunks, and calls the model with those chunks injected into the system prompt.
Indexing from files
DocumentLoaderRegistry picks the right loader automatically based on file extension:
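public sealed class IndexingService(IRagPipeline rag, DocumentLoaderRegistry registry)
{
    public async Task IndexDirectoryAsync(string path, CancellationToken cancellationToken = default)
    {
        // Compare extensions ordinally and case-insensitively so files such as README.MD are not skipped.
        var files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
            .Where(f => f.EndsWith(".txt", StringComparison.OrdinalIgnoreCase)
                     || f.EndsWith(".md", StringComparison.OrdinalIgnoreCase)
                     || f.EndsWith(".html", StringComparison.OrdinalIgnoreCase));

        foreach (var file in files)
        {
            // The registry picks the loader from the file extension.
            IDocument document = await registry.LoadAsync(file, cancellationToken);
            await rag.IndexDocumentAsync(document, cancellationToken);
        }
    }
}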
Supported extensions and their loaders:
| Extension | Loader |
|---|---|
| .txt | PlainTextLoader |
| .md | MarkdownLoader |
| .html | HtmlLoader |
You can implement IDocumentLoader and register it with the DI container to add support for other formats. DocumentLoaderRegistry will pick it up automatically.
Vector stores
In-memory
The default. Vectors live in a List<> in memory and are gone when the process exits. Use this during development and for short-lived workloads where you re-index on startup.
.AddRag(options =>
{
    options.VectorStore = VectorStoreType.InMemory;
});
SQLite
Persists vectors to a SQLite database file. The store is loaded from disk on startup, so indexed documents survive restarts without re-indexing.
.AddRag(options =>
{
    options.VectorStore = VectorStoreType.Sqlite;
    options.SqlitePath = "hearth-rag.db";
});
Use SQLite when:
- your document set is large or slow to re-index
- the application restarts or is redeployed often (e.g. a long-running API service)
- you want to pre-index documents in a background job and share the database with the serving process
Chunking strategies
Recursive (default)
Tries paragraph breaks first, then sentence breaks, then word boundaries, and uses the coarsest boundary that keeps each chunk under ChunkSize, so sentences are cut only when nothing coarser fits. Works well for prose, READMEs, articles, and mixed content.
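The fallback order described above can be pictured with a small standalone sketch. It is not Hearth's implementation: it measures chunk size in characters rather than tokens, ignores ChunkOverlap, and the RecursiveSplitSketch class and its separator list are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a character-based stand-in for a recursive chunker.
public static class RecursiveSplitSketch
{
    // Coarsest boundary first: paragraphs, then sentences, then words.
    private static readonly string[] Separators = { "\n\n", ". ", " " };

    public static List<string> Split(string text, int maxChars, int level = 0)
    {
        var chunks = new List<string>();
        if (text.Length <= maxChars)
        {
            if (text.Length > 0) chunks.Add(text);
            return chunks;
        }
        if (level >= Separators.Length)
        {
            // No boundary left to try: hard-cut the text.
            for (int i = 0; i < text.Length; i += maxChars)
                chunks.Add(text.Substring(i, Math.Min(maxChars, text.Length - i)));
            return chunks;
        }

        string separator = Separators[level];
        string current = "";
        foreach (string piece in text.Split(separator, StringSplitOptions.RemoveEmptyEntries))
        {
            string candidate = current.Length == 0 ? piece : current + separator + piece;
            if (candidate.Length <= maxChars)
            {
                current = candidate; // still fits: keep packing pieces together
                continue;
            }
            // The separator at a cut point is dropped from the output.
            if (current.Length > 0) chunks.Add(current);
            current = "";
            if (piece.Length <= maxChars)
                current = piece;
            else
                chunks.AddRange(Split(piece, maxChars, level + 1)); // too big: try a finer boundary
        }
        if (current.Length > 0) chunks.Add(current);
        return chunks;
    }
}
```

Calling RecursiveSplitSketch.Split(text, 200) on a long article keeps whole paragraphs together where they fit and only descends to sentence or word boundaries for paragraphs longer than 200 characters.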
Markdown
Splits on Markdown heading boundaries first, so each chunk stays within a logical section. Prefer this when your documents have clear heading structure, such as documentation sites or wikis.
.AddRag(options =>
{
    options.Chunker = ChunkerType.Markdown;
    options.ChunkSize = 768; // larger chunks work well when sections are coherent
    options.ChunkOverlap = 64;
});
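To see what splitting on heading boundaries means in practice, here is a minimal standalone sketch. It is an assumption about the general technique, not the library's chunker: MarkdownSectionSketch is a hypothetical name, the sketch does not enforce ChunkSize, and it would misread a line starting with "# " inside a code block as a heading.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: groups Markdown lines into one section per heading.
public static class MarkdownSectionSketch
{
    // An ATX heading is one to six '#' characters followed by a space or end of line.
    private static bool IsHeading(string line)
    {
        int hashes = 0;
        while (hashes < line.Length && line[hashes] == '#') hashes++;
        return hashes >= 1 && hashes <= 6 && (hashes == line.Length || line[hashes] == ' ');
    }

    public static List<string> SplitByHeading(string markdown)
    {
        var sections = new List<string>();
        var current = new StringBuilder();

        void Flush()
        {
            string section = current.ToString().Trim();
            if (section.Length > 0) sections.Add(section);
            current.Clear();
        }

        foreach (string line in markdown.Split('\n'))
        {
            if (IsHeading(line)) Flush(); // a heading closes the previous section
            current.Append(line).Append('\n');
        }
        Flush();
        return sections;
    }
}
```

A real chunker would then split any section that is still larger than ChunkSize, which is where the larger ChunkSize in the configuration above helps keep sections whole.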
Tuning ChunkSize and ChunkOverlap
- Smaller chunks (256–512 tokens) improve retrieval precision but may omit surrounding context that the model needs to form a complete answer.
- Larger chunks (768–1024 tokens) give the model more context per retrieved result but may bring in irrelevant content that dilutes the answer.
- ChunkOverlap reduces the chance that an answer is split across a chunk boundary. Values between 10% and 15% of ChunkSize are a reasonable starting point.
The right values depend on your document structure and the length of the questions you expect. Index the same corpus with a few different settings and compare answer quality before settling on a configuration.
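When comparing settings, it helps to estimate how many chunks (and therefore embedding calls) each configuration produces. The sketch below assumes a plain sliding window that advances by ChunkSize minus ChunkOverlap tokens; the real chunkers respect paragraph and heading boundaries, so treat the result as a rough estimate. ChunkEstimate is a hypothetical helper, not part of the package.

```csharp
using System;

// Illustrative only: estimates chunk counts for a fixed sliding window.
public static class ChunkEstimate
{
    public static int Count(int totalTokens, int chunkSize, int chunkOverlap)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (chunkOverlap < 0 || chunkOverlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(chunkOverlap));
        if (totalTokens <= 0) return 0;
        if (totalTokens <= chunkSize) return 1;

        int stride = chunkSize - chunkOverlap;    // new tokens covered by each extra chunk
        int remaining = totalTokens - chunkSize;  // tokens not covered by the first chunk
        return 1 + (remaining + stride - 1) / stride; // integer ceiling division
    }
}
```

Under these assumptions, a 10,000-token document yields about 22 chunks at 512/50 and about 15 chunks at 768/64, so the larger setting needs roughly a third fewer embedding calls.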
Querying with RagQueryOptions
var result = await rag.AskAsync(
    "What models does Hearth support?",
    new RagQueryOptions
    {
        TopK = 8,           // retrieve more chunks when documents are long
        MinScore = 0.3f,    // discard chunks below this cosine similarity threshold
        SystemPrompt = "You are a technical support assistant for Hearth. Answer concisely.",
    });
| Option | Default | Description |
|---|---|---|
| TopK | 5 | Maximum number of chunks to retrieve and include in the prompt |
| MinScore | 0f | Minimum cosine similarity; chunks below this threshold are excluded |
| SystemPrompt | null | Override the system prompt; null uses the template from RagOptions |
| ChatOptions | null | Pass through Microsoft.Extensions.AI ChatOptions (temperature, stop sequences, etc.) |
MinScore is useful when the index is large and low-quality matches could mislead the model. A value of 0.3–0.5 is a reasonable starting point; tune it based on what retrieval results look like for your documents.
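The interaction between TopK and MinScore is easier to reason about with a concrete model of the search. The following is a sketch of a generic brute-force cosine search, not the code inside Hearth's vector stores; CosineSearchSketch and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: brute-force cosine search with a score floor and a result cap.
public static class CosineSearchSketch
{
    public static float Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0f; // treat zero vectors as unrelated
        return (float)(dot / (Math.Sqrt(normA) * Math.Sqrt(normB)));
    }

    // Scores every stored vector, drops those below minScore, then keeps the best topK.
    public static List<(string Id, float Score)> Search(
        IReadOnlyDictionary<string, float[]> index, float[] query, int topK, float minScore)
    {
        return index
            .Select(entry => (Id: entry.Key, Score: Cosine(entry.Value, query)))
            .Where(hit => hit.Score >= minScore)
            .OrderByDescending(hit => hit.Score)
            .Take(topK)
            .ToList();
    }
}
```

Because the score floor is applied before the cap, a high MinScore can return fewer than TopK chunks, or none at all, which is worth checking for when you tune the threshold.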
Inspecting sources
RagResult.Sources contains the chunks that were used to construct the answer:
RagResult result = await rag.AskAsync("How does Hearth handle GPU offloading?");
Console.WriteLine(result.Answer);
Console.WriteLine();
Console.WriteLine($"Sources ({result.Sources.Count}):");
foreach (VectorSearchResult source in result.Sources)
{
    Console.WriteLine($"  [{source.Score:F3}] {source.Text[..Math.Min(80, source.Text.Length)]}...");
}
VectorSearchResult exposes:
| Property | Type | Description |
|---|---|---|
| Id | string | Unique identifier for the chunk |
| Text | string | The chunk text that was embedded |
| Score | float | Cosine similarity to the query |
| Metadata | object? | Metadata passed to IndexAsync or carried from the document |
Use Metadata to record the source file path, document title, or any other attribution data you want to surface in your UI.
End-to-end example: Q&A bot over Markdown docs
This example shows a minimal console app that indexes a folder of Markdown files once and then answers questions interactively.
Program.cs
using Hearth.Rag;
using Hearth.Rag.Documents;
using Hearth.Rag.Pipeline;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        services.AddHearth(options =>
            {
                options.Model = "./models/qwen2.5-7b-q4_k_m.gguf";
                options.ContextSize = 8192;
            })
            .AddRag(options =>
            {
                options.VectorStore = VectorStoreType.Sqlite;
                options.SqlitePath = "docs-index.db";
                options.Chunker = ChunkerType.Markdown;
                options.ChunkSize = 768;
                options.ChunkOverlap = 64;
            });
    })
    .Build();
// Check for the index file before resolving the pipeline, in case the SQLite store
// creates the file as soon as it is constructed.
var indexExists = File.Exists("docs-index.db");

var rag = host.Services.GetRequiredService<IRagPipeline>();
var registry = host.Services.GetRequiredService<DocumentLoaderRegistry>();

// Index all Markdown files in ./docs (skip if the database already exists)
if (!indexExists)
{
    Console.WriteLine("Indexing docs...");
    foreach (var file in Directory.GetFiles("./docs", "*.md", SearchOption.AllDirectories))
    {
        var document = await registry.LoadAsync(file);
        await rag.IndexDocumentAsync(document);
    }
    Console.WriteLine("Done.");
}
// Interactive Q&A loop
Console.WriteLine("Ask a question (Ctrl+C to exit):");
while (true)
{
    Console.Write("> ");
    var question = Console.ReadLine();
    if (question is null) break; // input stream closed: exit instead of spinning
    if (string.IsNullOrWhiteSpace(question)) continue;

    var result = await rag.AskAsync(question, new RagQueryOptions
    {
        TopK = 5,
        MinScore = 0.25f,
    });

    Console.WriteLine();
    Console.WriteLine(result.Answer);
    Console.WriteLine();

    if (result.Sources.Count > 0)
    {
        Console.WriteLine("Sources:");
        foreach (var source in result.Sources)
            Console.WriteLine($"  [{source.Score:F3}] {source.Metadata}");
    }
    Console.WriteLine();
}
Pass meaningful metadata when indexing so that sources are easy to interpret:
// Use IndexAsync instead of IndexDocumentAsync when you want to control the metadata yourself
await rag.IndexAsync(
    text: await File.ReadAllTextAsync(file),
    metadata: new { File = file, Indexed = DateTime.UtcNow });
Performance notes
Embedding and inference both execute on the same model instance. A few things to keep in mind:
- Indexing is CPU-bound. Each chunk requires an embedding call. Indexing a large corpus takes time proportional to the number of chunks. Do it once and persist with SQLite rather than re-indexing on every startup.
- Retrieval is fast. The in-memory vector store uses a brute-force cosine search, which is plenty fast for corpora up to tens of thousands of chunks. The SQLite store follows the same pattern.
- Inference follows retrieval. AskAsync runs one embedding call (for the question) and one chat call (with the retrieved chunks injected). Total latency is roughly embedding_time + inference_time: a normal chat request with a longer prompt, plus one embedding call.
- Context window budget. Each retrieved chunk consumes prompt tokens. With TopK = 5 and ChunkSize = 512, you may inject up to 2,560 tokens (5 × 512) of context on top of your question and system prompt. Make sure ContextSize in HearthOptions is large enough to accommodate the full prompt.
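The context budget is simple arithmetic, sketched below as a worked example. ContextBudget is a hypothetical helper, the system prompt and question sizes are made-up figures, and the sketch assumes the context window has to hold both the prompt and the generated answer, as is usual for llama.cpp-based runtimes.

```csharp
using System;

// Illustrative only: worst-case prompt size for a RAG query.
public static class ContextBudget
{
    // Worst case: every retrieved chunk is a full ChunkSize tokens long.
    public static int MaxPromptTokens(int topK, int chunkSize, int systemPromptTokens, int questionTokens)
        => topK * chunkSize + systemPromptTokens + questionTokens;

    // Tokens left for the model's answer once the worst-case prompt is in place.
    public static int RemainingForAnswer(int contextSize, int promptTokens)
        => contextSize - promptTokens;
}
```

With TopK = 5, ChunkSize = 512, a 200-token system prompt, and a 50-token question, the worst-case prompt is 2,810 tokens, leaving 5,382 tokens for the answer at ContextSize = 8192. Raising the settings to TopK = 8 and ChunkSize = 768 pushes the prompt to 6,394 tokens and leaves only 1,798.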