A fully self-hostable RAG memory layer built in Rust. Combines local vector embeddings via llama.cpp, knowledge graphs, BM25 full-text search, and reranking into a single binary that runs on your machine.
Designed to plug into any AI agent through an OpenAI-compatible API, an MCP server for Claude Code and Cursor, or direct HTTP.
- Local embeddings using any GGUF model via llama.cpp (no cloud required)
- Vector similarity search via sqlite-vec with float32 vectors
- Knowledge graph storage using LightRAG-style entity and relationship extraction
- Hybrid search: vector + BM25 with reciprocal rank fusion
- Optional cross-encoder reranking
- OpenAI-compatible /v1/chat/completions and /v1/models endpoints
- MCP server (stdio) for Claude Code and Cursor
- CLI for ingesting files, searching, and managing models
- Model downloads from HuggingFace Hub (with or without a token)
- Single SQLite database — no external services
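As an illustration of the hybrid search item above, here is a minimal Python sketch of reciprocal rank fusion over two ranked lists of chunk IDs. It is not taken from the layer1 source: the function name, the sample IDs, and the constant k = 60 (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID scores 1 / (k + rank) in every list where it appears;
    the scores are summed across lists. k = 60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical result lists from the vector index and the BM25 index.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both lists (c1 and c3) rise above chunks found by only one retriever, which is the point of fusing by rank rather than by raw score.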
- Rust 1.75 or newer
- cmake (for llama.cpp compilation)
- On Windows: Visual C++ Build Tools (MSVC)
- On macOS: Xcode Command Line Tools (Metal acceleration automatic on Apple Silicon)
- On Linux: gcc or clang
git clone <this-repo>
cd layer1
cargo build --release
cp layer1.toml.example layer1.toml

Add the binary to your PATH or invoke it as ./target/release/layer1.

layer1 model pull nomic-ai/nomic-embed-text-v1.5-GGUF nomic-embed-text-v1.5.Q4_K_M.gguf

Copy layer1.toml.example to layer1.toml and set embedding_model to the downloaded path:
[server]
host = "127.0.0.1"
port = 3000
api_key = "your-secret-key"
[database]
path = "layer1.db"
embedding_dim = 768
[models]
embedding_model = "models/nomic-embed-text-v1.5.Q4_K_M.gguf"
n_gpu_layers = 0
models_dir = "models"
[rag]
chunk_size = 512
chunk_overlap = 64
top_k = 10
rerank_top_k = 5

layer1 serve

# From a file
layer1 ingest README.md
# From stdin
echo "The Eiffel Tower is in Paris." | layer1 ingest -
# With metadata
layer1 ingest notes.txt --metadata '{"source": "personal"}'

layer1 search "what is the Eiffel Tower"

layer1 init

This writes .claude/mcp.json and .cursor/mcp.json with the MCP server config, and .claude/skills/layer1.md for use as a Claude Code skill.
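The chunk_size and chunk_overlap settings in the [rag] section control how an ingested document is split. The following Python sketch shows one plausible sliding-window scheme with the default values. It is an assumption about the general technique, not the layer1 implementation: the function name is hypothetical, and the unit is left abstract because the README does not say whether sizes are counted in tokens, words, or characters.

```python
def chunk_with_overlap(units, chunk_size=512, chunk_overlap=64):
    """Split a sequence into windows of chunk_size that share chunk_overlap units.

    `units` can be a list of tokens or words, or a string of characters;
    which unit layer1 actually counts is not stated, so this is illustrative.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break  # the last window already reached the end of the input
    return chunks


# A hypothetical 1000-word document.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_with_overlap(words)
print([len(c) for c in chunks])  # [512, 512, 104]
```

With these defaults each window advances by 448 units, so the last 64 units of one chunk reappear at the start of the next and context is not cut at a hard boundary.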
layer1 [--config <path>] <command>
Commands:
  serve                      Start the HTTP API server
  mcp                        Start MCP server on stdio
  model pull <repo> <file>   Download a GGUF model from HuggingFace
  model list                 List downloaded models
  model remove <name>        Delete a model file
  ingest <path|->            Ingest a file or stdin into memory
  search <query>             Semantic search over stored memory
  chat                       Interactive chat with RAG context
  init                       Generate MCP config files for agent clients
All endpoints except /health require the X-API-Key header (or Authorization: Bearer <key>).
POST /api/ingest
Content-Type: application/json
X-API-Key: <key>
{
"content": "...",
"metadata": { "source": "docs" }
}
Response:
{ "document_id": "uuid", "chunk_count": 4 }

POST /api/search
Content-Type: application/json
X-API-Key: <key>
{
"query": "...",
"top_k": 5,
"include_graph": false
}
Response:
{
"results": [
{ "chunk_id": "...", "document_id": "...", "content": "...", "score": 0.91, "source": "hybrid" }
],
"graph_context": []
}

POST /v1/chat/completions
Content-Type: application/json
X-API-Key: <key>
{
"model": "layer1",
"messages": [{ "role": "user", "content": "What do you know about Paris?" }]
}
The server injects RAG context from memory into the system prompt automatically. Set generation_model in layer1.toml for AI-generated responses.
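A client-side sketch of the chat request above, using only the Python standard library. The helper names are hypothetical, the base URL and key are the example values from layer1.toml, and the reply extraction assumes the usual OpenAI chat-completion response shape, which the README implies but does not show.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, user_message):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "layer1",
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text, assuming an OpenAI-style response body."""
    return response_body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://127.0.0.1:3000", "your-secret-key", "What do you know about Paris?"
)
print(request.get_method(), request.full_url)

# With `layer1 serve` running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(extract_reply(json.loads(response.read())))
```

Because the endpoint follows the OpenAI request format, an existing OpenAI client library pointed at this base URL should also work, provided it can send the API key in a header the server accepts.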
GET /v1/models
X-API-Key: <key>
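The ingest and search calls above could be wrapped as follows. This is a sketch with hypothetical helper names, not an official client: it builds requests with either of the two documented authentication headers and sorts search results by the score field shown in the example response.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:3000"  # host and port from the [server] section


def build_post(path, body, api_key, use_bearer=False):
    """Build a JSON POST request carrying one of the two accepted auth headers."""
    headers = {"Content-Type": "application/json"}
    if use_bearer:
        headers["Authorization"] = f"Bearer {api_key}"
    else:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def top_contents(search_response):
    """Return chunk texts from a search response, highest score first."""
    ranked = sorted(search_response["results"], key=lambda r: r["score"], reverse=True)
    return [r["content"] for r in ranked]


ingest_req = build_post(
    "/api/ingest",
    {"content": "The Eiffel Tower is in Paris.", "metadata": {"source": "docs"}},
    "your-secret-key",
)
search_req = build_post(
    "/api/search",
    {"query": "what is the Eiffel Tower", "top_k": 5, "include_graph": False},
    "your-secret-key",
    use_bearer=True,
)
# With the server running: send(ingest_req), then top_contents(send(search_req))
```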
Run the MCP server for Claude Code or Cursor:
layer1 mcp

The server communicates over stdio using JSON-RPC 2.0. It exposes four tools:
| Tool | Description |
|---|---|
| store_memory | Store text with optional metadata |
| search_memory | Semantic search over stored memories |
| list_memories | List recent documents |
| delete_memory | Remove a document by ID |
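For readers curious what an MCP client sends over stdio, the sketch below builds a JSON-RPC 2.0 tools/call message for store_memory as a single newline-terminated line, the framing used by the MCP stdio transport. The argument name content is an assumption; the README does not document each tool's input schema, and a real session must first complete the MCP initialize handshake, which agent clients such as Claude Code handle for you.

```python
import json


def tools_call_message(request_id, tool_name, arguments):
    """Serialize one MCP tools/call request as a newline-delimited JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message) + "\n"


# "content" is a hypothetical argument name for the store_memory tool.
line = tools_call_message(1, "store_memory", {"content": "The Eiffel Tower is in Paris."})
print(line, end="")
```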
Add to .claude/mcp.json:
{
"mcpServers": {
"layer1": {
"command": "layer1",
"args": ["--config", "layer1.toml", "mcp"]
}
}
}

Or run layer1 init to generate all config files automatically.
Set n_gpu_layers in layer1.toml:
- 0 = CPU only (default)
- -1 = all layers on GPU
On macOS with Apple Silicon, Metal is enabled automatically. On Linux with CUDA, build with:
LLAMA_CUDA=1 cargo build --release

Due to a known issue in llama-cpp-sys-2, GGUF models larger than 4 GB may fail to load on Windows with MSVC. Use models under 4 GB (Q4_K_S or smaller) or build with the MinGW toolchain as a workaround. See utilityai/llama-cpp-rs#951 for status.
All data is stored in a single SQLite file:
| Table | Contents |
|---|---|
| documents | Full source documents with metadata |
| chunks | Text chunks with document references |
| chunk_vec | Vector embeddings for chunks (sqlite-vec) |
| chunks_fts | FTS5 full-text index for BM25 search |
| entities | Named entities extracted from documents |
| entity_vec | Vector embeddings for entities |
| relationships | Entity relationships with strength scores |
| relation_vec | Vector embeddings for relationships |
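Since everything lives in one SQLite file, the database can be inspected with standard tooling. The sketch below opens a database read-only and lists its tables using Python's built-in sqlite3 module. The function name is hypothetical, and column layouts are deliberately not assumed because the README does not list them. Reading the table list should not require the sqlite-vec extension, although querying the vector tables themselves would, and FTS5 may add internal shadow tables to the output.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, opened read-only."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# "layer1.db" is the default path from the [database] section of layer1.toml.
if os.path.exists("layer1.db"):
    print(list_tables("layer1.db"))
```

Opening with mode=ro avoids accidental writes to a database that a running server may be using.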
MIT