Vault Search Design
Date: 2026-05-10
Purpose
Build a local retrieval foundation for this Obsidian vault. The system should support personal search, AI-assisted context retrieval, and vault governance without replacing Markdown as the source of truth.
The first version will use a local SQLite database with FTS5. It will index Markdown files, frontmatter, tags, links, titles, and content. The database is a derived artifact and can be deleted and rebuilt at any time.
Goals
- Provide fast local search across notes, wiki pages, clippings, and project docs.
- Preserve the vault’s existing organization model: directories, note-type tags, tech-domain tags, status tags, Markdown links, and the Clippings/ → wiki/ LLM Wiki layer.
- Expose search through a CLI first, so Codex, scripts, and future Obsidian automation can reuse it.
- Support governance checks such as missing tags, missing frontmatter, wikilinks, and broken Markdown links.
- Leave a clean path for future semantic search, graph analysis, and a local Web UI.
Non-Goals
- Do not replace Obsidian or change how notes are authored.
- Do not modify Markdown files during indexing.
- Do not require a background service, external search server, network API, or embedding model in V1.
- Do not build an Obsidian plugin in V1.
- Do not treat Clippings/ as editable source material.
Recommended Approach
Use SQLite + FTS5 as the first retrieval backend.
SQLite is a good fit because this is a personal vault with strong local structure. It is offline, single-file, easy to rebuild, and queryable from CLI tools, scripts, future Web UI code, and AI workflows. FTS5 provides keyword full-text search while normal SQLite tables preserve structured filters and governance data.
Alternatives considered:
- JSON index: simplest, but weaker querying and harder future expansion.
- Search server such as Elasticsearch, Meilisearch, or Typesense: powerful, but too much operational overhead for a personal vault.
- Vector database first: useful later, but premature before the structural index is reliable.
- Obsidian plugin first: better in-app UX, but less reusable by CLI, Codex, and automation.
High-Level Architecture
Obsidian Vault Markdown
        |
        v
     Indexer
        |
        v
  SQLite + FTS5
        |
        +--> CLI search
        +--> Vault health checks
        +--> Future Web UI
        +--> Future AI/RAG retrieval
Markdown remains the canonical source. The indexer reads the vault and writes a derived SQLite file under tmp/.
File Scope
Index Markdown files under the vault root, excluding:
- .git/
- .obsidian/
- node_modules/
- tmp/
- .trash/
- hidden system folders
The indexer should include these first-class areas:
- next-level-door/
- IT-learning/
- basic-data/
- code-scripts/
- Clippings/
- wiki/
- archives-years/
- meta/
- root-level Markdown files
Each document should be assigned an area from its first path segment. Root-level files can use root.
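The discovery and area rules above can be sketched in Python as follows. This is a minimal illustration, not the planned implementation: the function names are hypothetical, and it assumes that exclusions apply at any depth and that "hidden system folders" means any dot-prefixed directory.

```python
from __future__ import annotations

from pathlib import Path

# Directory names taken from the exclusion list above.
EXCLUDED_DIRS = {".git", ".obsidian", "node_modules", "tmp", ".trash"}


def is_excluded(rel_path: Path) -> bool:
    """True if any directory segment is excluded or hidden (assumption: any depth)."""
    return any(
        part in EXCLUDED_DIRS or part.startswith(".")
        for part in rel_path.parts[:-1]
    )


def area_for(rel_path: Path) -> str:
    """Area is the first path segment; root-level files use 'root'."""
    return rel_path.parts[0] if len(rel_path.parts) > 1 else "root"


def discover_markdown(vault_root: Path) -> list[Path]:
    """Return sorted vault-relative paths of indexable Markdown files."""
    found = []
    for path in vault_root.rglob("*.md"):
        rel = path.relative_to(vault_root)
        if path.is_file() and not is_excluded(rel):
            found.append(rel)
    return sorted(found)
```

Returning vault-relative paths keeps the stored path and the derived area independent of where the vault lives on disk.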
Indexed Fields
For each Markdown document:
- Relative path from vault root
- Area
- Title
- Frontmatter presence
- Tags
- Aliases
- Other frontmatter as JSON values
- Markdown body text
- Extracted headings
- Markdown links
- Obsidian wikilinks
- Link target resolution status
- File modified time
- File created time when available from the filesystem
Title selection order:
- First Markdown H1
- Frontmatter title
- Filename without extension
Frontmatter parsing should support both inline arrays and YAML list syntax for tags and aliases.
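As an illustration of the title order and the two list syntaxes, here is a dependency-free sketch. The helper names are hypothetical, and the parser is deliberately naive: it only handles flat `key: value`, `key: [a, b]`, and `- item` lists, and it does not skip H1-like lines inside code fences. A real YAML library would be more robust.

```python
from __future__ import annotations

import re
from pathlib import PurePosixPath


def split_frontmatter(text: str) -> tuple[str | None, str]:
    """Split a leading '---' fenced block from the body; None if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return None, text  # unterminated block: treat as no frontmatter


def read_list_field(frontmatter: str, key: str) -> list[str]:
    """Read `key: [a, b]`, `key: a`, or a `- item` list under `key:`."""
    values: list[str] = []
    lines = frontmatter.splitlines()
    for i, line in enumerate(lines):
        match = re.match(rf"^{re.escape(key)}\s*:\s*(.*)$", line)
        if not match:
            continue
        inline = match.group(1).strip()
        if inline.startswith("[") and inline.endswith("]"):
            values = inline[1:-1].split(",")
        elif inline:
            values = [inline]
        else:
            for item in lines[i + 1:]:
                if not re.match(r"^\s*-\s+", item):
                    break
                values.append(re.sub(r"^\s*-\s+", "", item))
        break
    return [v.strip().strip("\"'") for v in values if v.strip()]


def select_title(body: str, frontmatter: str | None, rel_path: str) -> str:
    """Apply the order above: first H1, frontmatter title, then filename."""
    h1 = re.search(r"^#[ \t]+(.+?)[ \t]*$", body, flags=re.MULTILINE)
    if h1:
        return h1.group(1)
    if frontmatter:
        title = read_list_field(frontmatter, "title")
        if title:
            return title[0]
    return PurePosixPath(rel_path).stem
```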
Database Location
Use:
tmp/vault-search.sqlite
tmp/ is already ignored by git, so the index stays local and disposable.
Database Schema
documents
One row per Markdown file.
id INTEGER PRIMARY KEY
path TEXT UNIQUE NOT NULL
area TEXT NOT NULL
title TEXT NOT NULL
content TEXT NOT NULL
mtime TEXT
ctime TEXT
has_frontmatter INTEGER NOT NULL
has_tags INTEGER NOT NULL
content_hash TEXT NOT NULL
document_fts
FTS5 virtual table for full-text search.
title
content
The FTS table should map back to documents.id through rowid.
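One way to realize the rowid mapping is to insert each FTS row with an explicit rowid equal to documents.id. The sketch below assumes Python's sqlite3 module with an SQLite build that includes FTS5; the helper names `add_document` and `search` are hypothetical. It uses the default tokenizer, which does not segment Chinese text into words, so the tokenizer choice is worth testing against real queries before it is fixed.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,
    area TEXT NOT NULL,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    mtime TEXT,
    ctime TEXT,
    has_frontmatter INTEGER NOT NULL,
    has_tags INTEGER NOT NULL,
    content_hash TEXT NOT NULL
);
CREATE VIRTUAL TABLE document_fts USING fts5(title, content);
"""


def add_document(conn, path, area, title, content, content_hash):
    """Insert one document and its FTS row, sharing the same rowid."""
    cur = conn.execute(
        "INSERT INTO documents (path, area, title, content,"
        " has_frontmatter, has_tags, content_hash)"
        " VALUES (?, ?, ?, ?, 0, 0, ?)",
        (path, area, title, content, content_hash),
    )
    conn.execute(
        "INSERT INTO document_fts (rowid, title, content) VALUES (?, ?, ?)",
        (cur.lastrowid, title, content),
    )
    return cur.lastrowid


def search(conn, query):
    """Return (path, title) pairs for an FTS5 MATCH query, best rank first."""
    return conn.execute(
        "SELECT d.path, d.title FROM document_fts"
        " JOIN documents d ON d.id = document_fts.rowid"
        " WHERE document_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

Because the index is rebuilt from scratch in V1, the two tables cannot drift apart between runs, which keeps this manual pairing simple.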
tags
document_id INTEGER NOT NULL
tag TEXT NOT NULL
aliases
document_id INTEGER NOT NULL
alias TEXT NOT NULL
frontmatter
document_id INTEGER NOT NULL
key TEXT NOT NULL
value_json TEXT NOT NULL
links
id INTEGER PRIMARY KEY
source_document_id INTEGER NOT NULL
link_text TEXT
target_raw TEXT NOT NULL
target_path TEXT
link_type TEXT NOT NULL
target_exists INTEGER NOT NULL
link_type values:
- markdown
- wikilink
target_path should be the resolved relative path when resolution succeeds. For unresolved wikilinks, it can be null.
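A possible way to produce rows for the links table is sketched below. Several choices here are assumptions rather than decisions from this design: external URLs, in-page anchors, and image embeds are skipped; wikilinks are matched by filename stem only; and an ambiguous wikilink is left unresolved instead of guessed. The function name and regular expressions are illustrative.

```python
from __future__ import annotations

import posixpath
import re
from urllib.parse import unquote

MD_LINK = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]+)\)")
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")
URL_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def extract_links(source_path: str, body: str, known_paths: set[str]) -> list[dict]:
    """Return rows shaped like the links table for one document."""
    rows = []
    for text, raw in MD_LINK.findall(body):
        if URL_SCHEME.match(raw) or raw.startswith("#"):
            continue  # external URL or in-page anchor, not a local file
        file_part = unquote(raw.split("#", 1)[0])
        # Resolve relative to the directory of the source file.
        resolved = posixpath.normpath(
            posixpath.join(posixpath.dirname(source_path), file_part)
        )
        exists = resolved in known_paths
        rows.append({"link_text": text, "target_raw": raw,
                     "target_path": resolved if exists else None,
                     "link_type": "markdown", "target_exists": int(exists)})
    for name, alias in WIKILINK.findall(body):
        name = name.strip()
        matches = sorted(p for p in known_paths
                         if posixpath.splitext(posixpath.basename(p))[0] == name)
        unique = matches[0] if len(matches) == 1 else None
        rows.append({"link_text": alias or name, "target_raw": name,
                     "target_path": unique, "link_type": "wikilink",
                     "target_exists": int(unique is not None)})
    return rows
```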
headings
document_id INTEGER NOT NULL
level INTEGER NOT NULL
text TEXT NOT NULL
anchor TEXT
Indexing Flow
- Discover Markdown files under the vault root using the exclusion rules.
- Read each file as UTF-8.
- Parse frontmatter if present.
- Extract tags, aliases, headings, body text, Markdown links, and wikilinks.
- Resolve Markdown links relative to the source file.
- Resolve wikilinks by best-effort filename/title matching. Future versions should flag ambiguous matches instead of guessing silently.
- Compute a content hash.
- Rebuild the SQLite database for V1.
- Print an indexing summary: document count, tag count, link count, missing tag count, broken link count, wikilink count.
V1 can use full rebuilds because the vault is small. Incremental indexing can be added later using mtime and content_hash.
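To show how content_hash could later drive incremental indexing, here is a small sketch. SHA-256 is an assumption (this design does not name a hash function), and `plan_incremental` is a hypothetical helper that compares stored and current path-to-hash maps.

```python
from __future__ import annotations

import hashlib


def content_hash(text: str) -> str:
    """Hex digest of the UTF-8 encoded file content (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental(indexed: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored path->hash with current path->hash and list the work to do."""
    return {
        "added": sorted(p for p in current if p not in indexed),
        "changed": sorted(
            p for p in current if p in indexed and indexed[p] != current[p]
        ),
        "removed": sorted(p for p in indexed if p not in current),
    }
```

A later version could check mtime first and hash only the files whose mtime changed, which avoids reading every file on each run.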
CLI Commands
vault-index
Build or rebuild the SQLite index.
Examples:
vault-index
vault-index --db tmp/vault-search.sqlite
vault-search
Search the index.
Examples:
vault-search "java 泛型"
vault-search "网络" --area IT-learning
vault-search --tag 学习 --tag java
vault-search --modified-after 2026-04-01
vault-search "ssl 证书" --json
Supported filters in V1:
- --area
- --tag
- --path-prefix
- --modified-after
- --modified-before
- --json
- --limit
Default output should be a readable terminal table with path, title, area, tags, and a short snippet. JSON output should include stable fields for automation.
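The flags above map naturally onto argparse plus a parameterized WHERE clause. The sketch below is illustrative only. It assumes that repeated --tag flags are combined with AND, that mtime is stored as ISO 8601 text so string comparison orders correctly, and that the default limit is 20; none of these are fixed by this design.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="vault-search")
    parser.add_argument("query", nargs="?", default=None)
    parser.add_argument("--area")
    parser.add_argument("--tag", action="append", default=[])
    parser.add_argument("--path-prefix")
    parser.add_argument("--modified-after")
    parser.add_argument("--modified-before")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--limit", type=int, default=20)  # default is an assumption
    return parser


def build_filter_sql(args: argparse.Namespace) -> tuple[str, list]:
    """Return a WHERE fragment over `documents d` and its bound parameters."""
    clauses: list[str] = []
    params: list = []
    if args.area:
        clauses.append("d.area = ?")
        params.append(args.area)
    for tag in args.tag:  # every requested tag must be present (AND semantics)
        clauses.append(
            "EXISTS (SELECT 1 FROM tags t WHERE t.document_id = d.id AND t.tag = ?)"
        )
        params.append(tag)
    if args.path_prefix:
        # substr comparison avoids LIKE wildcard escaping for '%' and '_'.
        clauses.append("substr(d.path, 1, length(?)) = ?")
        params.extend([args.path_prefix, args.path_prefix])
    if args.modified_after:
        clauses.append("d.mtime >= ?")
        params.append(args.modified_after)
    if args.modified_before:
        clauses.append("d.mtime < ?")
        params.append(args.modified_before)
    return " AND ".join(clauses) or "1", params
```

Keeping all user input in bound parameters means tag names and path prefixes never need manual quoting.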
vault-health
Run governance queries over the index.
Examples:
vault-health
vault-health --broken-links
vault-health --missing-tags
vault-health --missing-frontmatter
vault-health --wikilinks
vault-health --json
Default output should group findings by severity:
- Error: broken Markdown links, README links to missing files, unresolved expected config files
- Warning: missing frontmatter, missing tags, unregistered technical domain tags, wikilinks in ordinary notes
- Info: isolated notes, old archive notes, source-only clippings
Search Ranking
V1 ranking should combine FTS rank with simple domain-specific boosts:
- Title matches rank higher than body-only matches.
- Alias and tag matches rank higher than body-only matches.
- wiki/ pages rank higher than Clippings/ for general search.
- Exact phrase matches rank higher than loose token matches when available.
- Recently modified notes get a small boost, not enough to outrank strong title matches.
The first implementation may approximate this with SQL ordering and post-processing.
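The post-processing step could look like the sketch below. The `Hit` type and all boost weights are hypothetical placeholders chosen only to respect the ordering described above. It assumes the FTS score has already been converted so that higher is better (FTS5's own rank is lower-is-better), and it leaves out the exact-phrase boost.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class Hit:
    path: str
    title: str
    area: str
    fts_score: float  # assumption: higher is better (e.g. negated FTS5 rank)
    tags: list[str] = field(default_factory=list)
    aliases: list[str] = field(default_factory=list)
    modified: date | None = None


def boosted_score(hit: Hit, query: str, today: date) -> float:
    """Add illustrative boosts on top of the FTS score."""
    q = query.casefold()
    score = hit.fts_score
    if q in hit.title.casefold():
        score += 5.0  # title match outranks body-only matches
    if any(q == t.casefold() for t in hit.tags) or any(
        q in a.casefold() for a in hit.aliases
    ):
        score += 3.0  # tag or alias match
    if hit.area == "wiki":
        score += 1.0
    elif hit.area == "Clippings":
        score -= 1.0
    if hit.modified is not None and (today - hit.modified).days <= 30:
        score += 0.5  # small recency boost, below the title boost
    return score


def rerank(hits: list[Hit], query: str, today: date) -> list[Hit]:
    """Sort hits by boosted score, best first."""
    return sorted(hits, key=lambda h: boosted_score(h, query, today), reverse=True)
```

Keeping the boosts in one function makes the weights easy to tune against real queries once the index exists.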
AI Retrieval Contract
The CLI JSON output should be stable enough for Codex or future AI scripts to consume.
Suggested result shape:
{
  "query": "ssl 证书",
  "results": [
    {
      "path": "wiki/ssl-certificate.md",
      "title": "SSL 证书",
      "area": "wiki",
      "tags": ["知识总结", "network", "状态/已完结"],
      "score": 12.4,
      "snippet": "SSL 证书用于验证服务器身份并加密通信数据..."
    }
  ]
}
This keeps AI retrieval separate from prompt construction. Another tool can decide how many files to open or summarize.
Governance Rules
V1 should encode the current vault rules as checks, not automatic fixes:
- Every ordinary note should have at least one note-type tag.
- Technical learning notes should have a technical domain tag.
- Clippings/ is source material and can be exempt from the ordinary note-type requirement if the governance config says so.
- README and instruction files can be exempt from ordinary note frontmatter requirements.
- Markdown links should resolve to existing local files when they point to local .md files.
- Ordinary notes should prefer Markdown links over wikilinks.
These rules should eventually be configurable, because the vault already has useful exceptions.
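A check-only version of these rules might look like the sketch below. The tag sets and exempt filenames are copied from the Configuration section of this document. The rest is assumption: a "technical learning note" is taken to be anything under IT-learning/, exempt files are matched by filename, Clippings/ is skipped entirely when exempt, and the finding codes are hypothetical names.

```python
from __future__ import annotations

NOTE_TYPE_TAGS = {"日记", "学习", "项目", "灵感", "计划", "复盘", "知识总结"}
DOMAIN_TAGS = {"java", "python", "english", "obsidian", "gnumake", "network", "devtools"}
FRONTMATTER_EXEMPT = {"README.md", "AGENTS.md", "CLAUDE.md"}


def check_document(path: str, tags: list[str], has_frontmatter: bool,
                   wikilink_count: int, clippings_exempt: bool = True) -> list[str]:
    """Return warning codes for one document; nothing is modified."""
    area = path.split("/", 1)[0] if "/" in path else "root"
    filename = path.rsplit("/", 1)[-1]
    findings: list[str] = []
    if filename in FRONTMATTER_EXEMPT:
        return findings  # README and instruction files are exempt
    if area == "Clippings" and clippings_exempt:
        return findings  # source material, not an ordinary note
    if not has_frontmatter:
        findings.append("missing-frontmatter")
    if not NOTE_TYPE_TAGS.intersection(tags):
        findings.append("missing-note-type-tag")
    if area == "IT-learning" and not DOMAIN_TAGS.intersection(tags):
        findings.append("missing-domain-tag")
    if wikilink_count > 0:
        findings.append("wikilinks-in-ordinary-note")
    return findings
```

Returning codes instead of printing lets the health command group findings by severity and emit them as JSON without a second pass.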
Configuration
V1 can start with defaults, but should be designed around a future config file.
Suggested future path:
meta/search-config.json
Possible fields:
{
  "exclude": [".git", ".obsidian", "node_modules", "tmp", ".trash"],
  "clippingsReadOnly": true,
  "frontmatterExemptPaths": ["README.md", "AGENTS.md", "CLAUDE.md"],
  "requiredNoteTypeTags": ["日记", "学习", "项目", "灵感", "计划", "复盘", "知识总结"],
  "registeredDomainTags": ["java", "python", "english", "obsidian", "gnumake", "network", "devtools"]
}
The first implementation may hard-code these values and add config loading later.
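When config loading is added, hard-coded defaults can remain as the fallback. The following sketch shows one way to do that; `load_config` is a hypothetical helper, and the shallow merge (a key in the file replaces the whole default value) is an assumption.

```python
from __future__ import annotations

import json
from pathlib import Path

# Defaults mirror the possible fields listed above.
DEFAULT_CONFIG = {
    "exclude": [".git", ".obsidian", "node_modules", "tmp", ".trash"],
    "clippingsReadOnly": True,
    "frontmatterExemptPaths": ["README.md", "AGENTS.md", "CLAUDE.md"],
    "requiredNoteTypeTags": ["日记", "学习", "项目", "灵感", "计划", "复盘", "知识总结"],
    "registeredDomainTags": ["java", "python", "english", "obsidian",
                             "gnumake", "network", "devtools"],
}


def load_config(vault_root: Path) -> dict:
    """Return defaults overridden by meta/search-config.json when it exists."""
    config = dict(DEFAULT_CONFIG)
    path = vault_root / "meta" / "search-config.json"
    if path.is_file():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config
```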
Error Handling
- Invalid UTF-8 files should be skipped with a warning.
- Malformed frontmatter should not stop indexing; record the document and mark frontmatter parsing as failed in the summary.
- Broken Markdown links should be indexed as findings, not fatal errors.
- Missing database should produce a clear message asking the user to run vault-index.
- Empty search queries should either show usage help or require filters.
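Two of these behaviours, skipping invalid UTF-8 with a warning and reporting a missing database, can be sketched as follows. The function names and message wording are hypothetical.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path


def read_markdown(path: Path, warnings: list[str]) -> str | None:
    """Return file text, or None after recording a warning for invalid UTF-8."""
    try:
        return path.read_bytes().decode("utf-8")
    except UnicodeDecodeError as exc:
        warnings.append(f"skipped {path}: invalid UTF-8 ({exc.reason})")
        return None


def open_index(db_path: Path) -> sqlite3.Connection:
    """Open the index, or exit with a hint when it has not been built yet."""
    if not db_path.is_file():
        raise SystemExit(f"Index not found at {db_path}. Run vault-index first.")
    return sqlite3.connect(db_path)
```

Collecting warnings in a list rather than printing them immediately lets the indexer fold them into the final summary.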
Testing Strategy
V1 should be testable without touching the real vault.
Use a small fixture vault containing:
- A valid daily note
- A valid technical note
- A note missing tags
- A note with Markdown links
- A note with wikilinks
- A broken link
- A clipping
- A wiki page
Tests should verify:
- File discovery respects exclusions.
- Frontmatter tags and aliases parse correctly.
- Markdown links resolve relative paths correctly.
- Wikilinks are recorded.
- FTS search returns expected documents.
- Filters by area, tag, and path prefix work.
- Health checks report expected findings.
- JSON output is stable.
Phased Roadmap
V1: Local Structured Search
- SQLite + FTS5 database
- Full rebuild indexer
- CLI search
- CLI health checks
- JSON output for automation
V2: Semantic Retrieval
- Add embeddings for selected areas first: wiki/, IT-learning/, next-level-door/知识总结/
- Keep SQLite metadata as the source of filters
- Add semantic search as a second retrieval path, not a replacement
V3: Knowledge Graph
- Use tags, links, aliases, directory hierarchy, and wiki-source relationships to identify:
- isolated notes
- duplicate topics
- missing wiki backlinks
- weakly connected learning notes
V4: Local Web UI
- Search box
- Area and tag filters
- Result preview
- Health dashboard
- Link graph view if V3 data exists
Open Decisions For Implementation Planning
- Choose implementation language: Python is likely best for SQLite, YAML/frontmatter parsing, and quick CLI ergonomics; Node is also viable if the vault automation stack leans JavaScript.
- Decide whether V1 stores full content in SQLite or only normalized body text plus file path. The recommendation is to store normalized body text for search snippets and AI JSON output.
- Decide whether V1 should install dependencies or use only standard library modules. The recommendation is to allow a minimal YAML/frontmatter dependency if Python is chosen, unless dependency-free setup is more important.