Vault Search Design

Date: 2026-05-10

Purpose

Build a local retrieval foundation for this Obsidian vault. The system should support personal search, AI-assisted context retrieval, and vault governance without replacing Markdown as the source of truth.

The first version will use a local SQLite database with FTS5. It will index Markdown files, frontmatter, tags, links, titles, and content. The database is a derived artifact and can be deleted and rebuilt at any time.

Goals

Provide fast local search across notes, wiki pages, clippings, and project docs.
Preserve the vault’s existing organization model: directories, note-type tags, tech-domain tags, status tags, Markdown links, and the Clippings/ → wiki/ LLM Wiki layer.
Expose search through a CLI first, so Codex, scripts, and future Obsidian automation can reuse it.
Support governance checks for missing tags, missing frontmatter, broken Markdown links, and wikilinks used where Markdown links are preferred.
Leave a clean path for future semantic search, graph analysis, and a local Web UI.

Non-Goals

Do not replace Obsidian or change how notes are authored.
Do not modify Markdown files during indexing.
Do not require a background service, external search server, network API, or embedding model in V1.
Do not build an Obsidian plugin in V1.
Do not treat Clippings/ as editable source material.

Recommended Approach

Use SQLite + FTS5 as the first retrieval backend.

SQLite is a good fit because this is a personal vault with strong local structure. It is offline, single-file, easy to rebuild, and queryable from CLI tools, scripts, future Web UI code, and AI workflows. FTS5 provides keyword full-text search while normal SQLite tables preserve structured filters and governance data.

Alternatives considered:

JSON index: simplest, but weaker querying and harder future expansion.
Search server such as Elasticsearch, Meilisearch, or Typesense: powerful, but too much operational overhead for a personal vault.
Vector database first: useful later, but premature before the structural index is reliable.
Obsidian plugin first: better in-app UX, but less reusable by CLI, Codex, and automation.

High-Level Architecture

Obsidian Vault Markdown
        |
        v
Indexer
        |
        v
SQLite + FTS5
        |
        +--> CLI search
        +--> Vault health checks
        +--> Future Web UI
        +--> Future AI/RAG retrieval

Markdown remains the canonical source. The indexer reads the vault and writes a derived SQLite file under tmp/.

File Scope

Index Markdown files under the vault root, excluding:

.git/
.obsidian/
node_modules/
tmp/
.trash/
hidden system folders

The indexer should include these first-class areas:

next-level-door/
IT-learning/
basic-data/
code-scripts/
Clippings/
wiki/
archives-years/
meta/
root-level Markdown files

Each document is assigned an area equal to its first path segment. Root-level files use the area root.
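The scope rules above can be sketched as follows. This is a minimal sketch, not the final indexer: the function names are hypothetical, and "hidden system folders" is assumed to mean any directory whose name starts with a dot.

```python
from pathlib import Path

# Directory names excluded from indexing, taken from the list above.
EXCLUDED_DIRS = {".git", ".obsidian", "node_modules", "tmp", ".trash"}


def is_excluded(relative: Path) -> bool:
    """Return True if any directory segment is excluded or hidden."""
    # Only directory segments are checked, not the filename itself.
    return any(
        part in EXCLUDED_DIRS or part.startswith(".")
        for part in relative.parts[:-1]
    )


def area_for(relative: Path) -> str:
    """Area is the first path segment; root-level files use 'root'."""
    return relative.parts[0] if len(relative.parts) > 1 else "root"


def discover_markdown(vault_root: Path) -> list[Path]:
    """Return vault-relative paths of all indexable Markdown files."""
    found = []
    for path in vault_root.rglob("*.md"):
        relative = path.relative_to(vault_root)
        if path.is_file() and not is_excluded(relative):
            found.append(relative)
    return sorted(found)
```

Working with vault-relative paths from the start keeps the stored path and area values independent of where the vault lives on disk.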

Indexed Fields

For each Markdown document:

Relative path from vault root
Area
Title
Frontmatter presence
Tags
Aliases
Other frontmatter keys, stored as JSON values
Markdown body text
Extracted headings
Markdown links
Obsidian wikilinks
Link target resolution status
File modified time
File created time when available from the filesystem

Title selection order:

First Markdown H1
Frontmatter title
Filename without extension
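A minimal sketch of this selection order, assuming the frontmatter has already been parsed into a dict. The function name is hypothetical, and the H1 pattern is deliberately simple: it does not skip `#` lines inside fenced code blocks, which a real implementation would probably need to handle.

```python
import re
from pathlib import PurePosixPath

# ATX-style H1: a single '#', whitespace, then the heading text.
H1_PATTERN = re.compile(r"^#[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def select_title(body: str, frontmatter: dict, relative_path: str) -> str:
    """Pick a title: first H1, then frontmatter title, then filename stem."""
    match = H1_PATTERN.search(body)
    if match:
        return match.group(1)
    fm_title = frontmatter.get("title")
    if isinstance(fm_title, str) and fm_title.strip():
        return fm_title.strip()
    return PurePosixPath(relative_path).stem
```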

Frontmatter parsing should support both inline arrays and YAML list syntax for tags and aliases.
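The two list syntaxes can be illustrated with a small standard-library reader. This is a sketch for illustration only and not a YAML parser: it handles top-level `key: [a, b]` and `- item` lists and nothing else, and the function names are hypothetical. A real implementation would more likely use a YAML library.

```python
import re
from typing import Optional

# Leading block delimited by lines containing only `---`.
FRONTMATTER = re.compile(r"\A---[ \t]*\n(.*?\n)?---[ \t]*(?:\n|\Z)", re.DOTALL)


def split_frontmatter(text: str) -> tuple[Optional[str], str]:
    """Return (frontmatter_block, body); the block is None when absent."""
    match = FRONTMATTER.match(text)
    if match is None:
        return None, text
    return match.group(1) or "", text[match.end():]


def _clean(item: str) -> str:
    """Strip whitespace and surrounding quotes from one list item."""
    return item.strip().strip("\"'")


def read_list(block: str, key: str) -> list[str]:
    """Read a list written as `key: [a, b]` or as `- item` lines."""
    lines = block.splitlines()
    for index, line in enumerate(lines):
        if not line.startswith(f"{key}:"):
            continue
        inline = line[len(key) + 1:].strip()
        if inline.startswith("[") and inline.endswith("]"):
            items = inline[1:-1].split(",")
        elif inline:
            items = [inline]  # single scalar, e.g. `tags: java`
        else:
            items = []
            for follow in lines[index + 1:]:
                stripped = follow.strip()
                if not stripped.startswith("- "):
                    break
                items.append(stripped[2:])
        return [_clean(item) for item in items if _clean(item)]
    return []
```

Returning None for a missing block keeps "no frontmatter" distinguishable from "empty frontmatter", which matters for the frontmatter-presence field.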

Database Location

Use:

tmp/vault-search.sqlite

tmp/ is already ignored by git, so the index stays local and disposable.

Database Schema

`documents`

One row per Markdown file.

id INTEGER PRIMARY KEY
path TEXT UNIQUE NOT NULL
area TEXT NOT NULL
title TEXT NOT NULL
content TEXT NOT NULL
mtime TEXT
ctime TEXT
has_frontmatter INTEGER NOT NULL
has_tags INTEGER NOT NULL
content_hash TEXT NOT NULL

`document_fts`

FTS5 virtual table for full-text search.

title
content

The FTS table should map back to documents.id through rowid.
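One way to realize the rowid mapping, sketched with Python's built-in sqlite3 module. It assumes the local SQLite build includes FTS5; the helper names are hypothetical, and the frontmatter flags are hard-coded to 0 to keep the example short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,
    area TEXT NOT NULL,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    mtime TEXT,
    ctime TEXT,
    has_frontmatter INTEGER NOT NULL,
    has_tags INTEGER NOT NULL,
    content_hash TEXT NOT NULL
);
CREATE VIRTUAL TABLE document_fts USING fts5(title, content);
"""


def add_document(conn, path, area, title, content, content_hash):
    """Insert a document and mirror it into FTS under the same rowid."""
    cursor = conn.execute(
        "INSERT INTO documents"
        " (path, area, title, content, has_frontmatter, has_tags, content_hash)"
        " VALUES (?, ?, ?, ?, 0, 0, ?)",
        (path, area, title, content, content_hash),
    )
    doc_id = cursor.lastrowid
    conn.execute(
        "INSERT INTO document_fts (rowid, title, content) VALUES (?, ?, ?)",
        (doc_id, title, content),
    )
    return doc_id


def search_paths(conn, fts_query):
    """Return matching document paths, best match first."""
    rows = conn.execute(
        "SELECT d.path FROM document_fts"
        " JOIN documents AS d ON d.id = document_fts.rowid"
        " WHERE document_fts MATCH ? ORDER BY rank",
        (fts_query,),
    )
    return [row[0] for row in rows]
```

This layout stores the text twice, once in documents and once in the FTS table. FTS5 also supports external-content tables that reference documents directly, which may be worth evaluating if database size becomes a concern.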

`tags`

document_id INTEGER NOT NULL
tag TEXT NOT NULL

`aliases`

document_id INTEGER NOT NULL
alias TEXT NOT NULL

`frontmatter`

document_id INTEGER NOT NULL
key TEXT NOT NULL
value_json TEXT NOT NULL

`links`

id INTEGER PRIMARY KEY
source_document_id INTEGER NOT NULL
link_text TEXT
target_raw TEXT NOT NULL
target_path TEXT
link_type TEXT NOT NULL
target_exists INTEGER NOT NULL

link_type values:

markdown
wikilink

target_path holds the resolved vault-relative path when resolution succeeds and is NULL otherwise, for example for unresolved wikilinks.

`headings`

document_id INTEGER NOT NULL
level INTEGER NOT NULL
text TEXT NOT NULL
anchor TEXT

Indexing Flow

Discover Markdown files under the vault root using the exclusion rules.
Read each file as UTF-8.
Parse frontmatter if present.
Extract tags, aliases, headings, body text, Markdown links, and wikilinks.
Resolve Markdown links relative to the source file.
Resolve wikilinks by best-effort filename/title matching. Future versions should mark ambiguous matches explicitly instead of guessing silently.
Compute a content hash.
Rebuild the SQLite database for V1.
Print an indexing summary: document count, tag count, link count, missing tag count, broken link count, wikilink count.

V1 can use full rebuilds because the vault is small. Incremental indexing can be added later using mtime and content_hash.
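The link extraction, relative resolution, and hashing steps might look like the sketch below. The regular expressions are simplified assumptions: image embeds are skipped, link targets containing spaces or parentheses are not handled, and the hash algorithm (SHA-256) is a choice the design does not specify.

```python
import hashlib
import posixpath
import re
from typing import Optional
from urllib.parse import unquote

# [text](target), but not ![alt](image).
MD_LINK = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]+)\)")
# [[Target]], [[Target#Heading]], or [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")
URL_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def extract_links(body: str) -> list[dict]:
    """Collect Markdown links and wikilinks as rows for the links table."""
    links = []
    for text, target in MD_LINK.findall(body):
        links.append({"link_type": "markdown", "link_text": text, "target_raw": target})
    for target, label in WIKILINK.findall(body):
        links.append({"link_type": "wikilink", "link_text": label or None,
                      "target_raw": target.strip()})
    return links


def resolve_markdown_link(source_path: str, target_raw: str) -> Optional[str]:
    """Resolve a local link relative to its source file.

    Returns None for external URLs, pure anchors, and paths that would
    escape the vault root.
    """
    if URL_SCHEME.match(target_raw) or target_raw.startswith("#"):
        return None
    target = unquote(target_raw.split("#", 1)[0])
    joined = posixpath.join(posixpath.dirname(source_path), target)
    resolved = posixpath.normpath(joined)
    if resolved == ".." or resolved.startswith("../"):
        return None
    return resolved


def content_hash(text: str) -> str:
    """Stable digest of the file text, usable later for incremental indexing."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The indexer would then set target_exists by checking whether the resolved path is among the discovered files.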

CLI Commands

`vault-index`

Build or rebuild the SQLite index.

Examples:

vault-index
vault-index --db tmp/vault-search.sqlite

`vault-search`

Search the index.

Examples:

vault-search "java 泛型"
vault-search "网络" --area IT-learning
vault-search --tag 学习 --tag java
vault-search --modified-after 2026-04-01
vault-search "ssl 证书" --json

Supported filters in V1:

--area
--tag
--path-prefix
--modified-after
--modified-before
--json
--limit

Default output should be a readable terminal table with path, title, area, tags, and a short snippet. JSON output should include stable fields for automation.
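A sketch of how the filters could be combined with an FTS query in one SQL statement. Several points here are assumptions rather than decisions from this design: repeated --tag values are combined with AND, mtime is stored as an ISO 8601 string so that plain string comparison works for date filters, and each search term is quoted so that punctuation in user input is not parsed as FTS5 syntax. It is also worth verifying tokenizer behavior for the Chinese queries shown above, since the default FTS5 tokenizer does not segment CJK text into words.

```python
import sqlite3


def to_fts_query(user_text: str) -> str:
    """Quote each whitespace-separated term as an FTS5 string literal."""
    terms = user_text.split()
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)


def build_search_sql(query=None, area=None, tags=(), path_prefix=None,
                     modified_after=None, modified_before=None, limit=20):
    """Return (sql, params) combining an FTS match with structured filters."""
    joins, where, params = [], [], []
    if query:
        joins.append("JOIN document_fts ON document_fts.rowid = d.id")
        where.append("document_fts MATCH ?")
        params.append(query)
    if area:
        where.append("d.area = ?")
        params.append(area)
    for tag in tags:  # assumption: every requested tag must be present
        where.append(
            "EXISTS (SELECT 1 FROM tags AS t"
            " WHERE t.document_id = d.id AND t.tag = ?)"
        )
        params.append(tag)
    if path_prefix:
        where.append("substr(d.path, 1, ?) = ?")
        params.extend([len(path_prefix), path_prefix])
    if modified_after:
        where.append("d.mtime >= ?")
        params.append(modified_after)
    if modified_before:
        where.append("d.mtime < ?")
        params.append(modified_before)

    sql = "SELECT d.path, d.title, d.area FROM documents AS d " + " ".join(joins)
    if where:
        sql += " WHERE " + " AND ".join(where)
    # Text queries sort by FTS rank; filter-only queries sort by recency.
    sql += " ORDER BY " + ("rank" if query else "d.mtime DESC") + " LIMIT ?"
    params.append(limit)
    return sql, params
```

Under these assumptions, `vault-search "网络" --area IT-learning` would map to `build_search_sql(query=to_fts_query("网络"), area="IT-learning")`.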

`vault-health`

Run governance queries over the index.

Examples:

vault-health
vault-health --broken-links
vault-health --missing-tags
vault-health --missing-frontmatter
vault-health --wikilinks
vault-health --json

Default output should group findings by severity:

Error: broken Markdown links, README links to missing files, missing expected config files
Warning: missing frontmatter, missing tags, unregistered technical domain tags, wikilinks in ordinary notes
Info: isolated notes, old archive notes, source-only clippings
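Several of these findings reduce to plain SQL over the index. The sketch below covers only four checks and uses hypothetical names; the severity assignments follow the grouping above, and exemptions such as README files are not applied here.

```python
import sqlite3

SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}

# check name -> (severity, SQL returning (path, detail) rows)
HEALTH_QUERIES = {
    "broken-links": ("error", """
        SELECT d.path, l.target_raw
        FROM links AS l JOIN documents AS d ON d.id = l.source_document_id
        WHERE l.link_type = 'markdown' AND l.target_exists = 0
    """),
    "missing-frontmatter": ("warning", """
        SELECT path, '' FROM documents WHERE has_frontmatter = 0
    """),
    "missing-tags": ("warning", """
        SELECT path, '' FROM documents WHERE has_tags = 0
    """),
    "wikilinks": ("warning", """
        SELECT d.path, l.target_raw
        FROM links AS l JOIN documents AS d ON d.id = l.source_document_id
        WHERE l.link_type = 'wikilink'
    """),
}


def run_health(conn: sqlite3.Connection, checks=None) -> list[dict]:
    """Run the selected checks (all by default), sorted by severity."""
    findings = []
    for name, (severity, sql) in HEALTH_QUERIES.items():
        if checks and name not in checks:
            continue
        for path, detail in conn.execute(sql):
            findings.append({"check": name, "severity": severity,
                             "path": path, "detail": detail})
    findings.sort(key=lambda f: (SEVERITY_ORDER[f["severity"]], f["check"], f["path"]))
    return findings
```

Keeping each check as a named query makes it straightforward to map flags such as --broken-links onto the `checks` argument.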

Search Ranking

V1 ranking should combine FTS rank with simple domain-specific boosts:

Title matches rank higher than body-only matches.
Alias and tag matches rank higher than body-only matches.
wiki/ pages rank higher than Clippings/ for general search.
Exact phrase matches rank higher than loose token matches when available.
Recently modified notes get a small boost, not enough to outrank strong title matches.

The first implementation may approximate this with SQL ordering and post-processing.
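As one possible approximation, FTS5's built-in bm25() function accepts per-column weights, which covers the title-over-body rule, and an area factor can be applied in post-processing. The weights and factors below are hypothetical placeholders, and the alias, tag, phrase, and recency boosts are not covered by this sketch.

```python
import sqlite3

# Hypothetical factors; the design only says wiki/ should outrank Clippings/.
AREA_FACTOR = {"wiki": 1.2, "Clippings": 0.8}


def ranked_search(conn: sqlite3.Connection, fts_query: str,
                  limit: int = 20) -> list[tuple[str, float]]:
    """Return (path, score) pairs, highest score first."""
    rows = conn.execute(
        """
        SELECT d.path, d.area, bm25(document_fts, 10.0, 1.0) AS score
        FROM document_fts
        JOIN documents AS d ON d.id = document_fts.rowid
        WHERE document_fts MATCH ?
        """,
        (fts_query,),
    ).fetchall()
    # bm25() returns lower (more negative) values for better matches,
    # so negate it before applying the multiplicative area factor.
    scored = [
        (path, -score * AREA_FACTOR.get(area, 1.0))
        for path, area, score in rows
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]
```

The weight arguments follow the column order of document_fts, so 10.0 applies to title and 1.0 to content.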

AI Retrieval Contract

The CLI JSON output should be stable enough for Codex or future AI scripts to consume.
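One way to keep the output stable is to serialize from a fixed record type rather than from ad hoc dicts. The sketch below reuses the fields named for the terminal table (path, title, area, tags, snippet); the envelope keys, including `schema_version`, are hypothetical and not part of this design.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SearchResult:
    path: str
    title: str
    area: str
    tags: list[str] = field(default_factory=list)
    snippet: str = ""


def to_json(results: list[SearchResult], query: str) -> str:
    """Serialize results with a fixed envelope and deterministic key order."""
    payload = {
        "schema_version": 1,  # hypothetical: bump on breaking changes
        "query": query,
        "count": len(results),
        "results": [asdict(result) for result in results],
    }
    # ensure_ascii=False keeps Chinese tags and titles readable in output.
    return json.dumps(payload, ensure_ascii=False, sort_keys=True, indent=2)
```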

Governance Rules

V1 should encode the current vault rules as checks, not automatic fixes:

Every ordinary note should have at least one note-type tag.
Technical learning notes should have a technical domain tag.
Clippings/ is source material and can be exempt from the ordinary note-type requirement if the governance config says so.
README and instruction files can be exempt from ordinary note frontmatter requirements.
Markdown links should resolve to existing local files when they point to local .md files.
Ordinary notes should prefer Markdown links over wikilinks.

These rules should eventually be configurable, because the vault already has useful exceptions.

Configuration

V1 can start with defaults, but should be designed around a future config file.

Suggested future path:

meta/search-config.json

Possible fields:

{
  "exclude": [".git", ".obsidian", "node_modules", "tmp", ".trash"],
  "clippingsReadOnly": true,
  "frontmatterExemptPaths": ["README.md", "AGENTS.md", "CLAUDE.md"],
  "requiredNoteTypeTags": ["日记", "学习", "项目", "灵感", "计划", "复盘", "知识总结"],
  "registeredDomainTags": ["java", "python", "english", "obsidian", "gnumake", "network", "devtools"]
}

The first implementation may hard-code these values and add config loading later.
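A minimal sketch of that progression: the defaults live in code, and an optional file at the suggested path overrides them. The function names are hypothetical, and rejecting unknown keys is an assumption made here to surface typos early.

```python
import json
from pathlib import Path

DEFAULT_CONFIG = {
    "exclude": [".git", ".obsidian", "node_modules", "tmp", ".trash"],
    "clippingsReadOnly": True,
    "frontmatterExemptPaths": ["README.md", "AGENTS.md", "CLAUDE.md"],
    "requiredNoteTypeTags": ["日记", "学习", "项目", "灵感", "计划", "复盘", "知识总结"],
    "registeredDomainTags": ["java", "python", "english", "obsidian",
                             "gnumake", "network", "devtools"],
}


def merge_config(overrides: dict) -> dict:
    """Overlay user values on the defaults, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(DEFAULT_CONFIG))
    if unknown:
        raise ValueError(f"unknown config keys: {', '.join(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


def load_config(vault_root: Path) -> dict:
    """Load meta/search-config.json if present, else return the defaults."""
    config_path = vault_root / "meta" / "search-config.json"
    if not config_path.is_file():
        return dict(DEFAULT_CONFIG)
    return merge_config(json.loads(config_path.read_text(encoding="utf-8")))
```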

Error Handling

Invalid UTF-8 files should be skipped with a warning.
Malformed frontmatter should not stop indexing; record the document and mark frontmatter parsing as failed in the summary.
Broken Markdown links should be indexed as findings, not fatal errors.
Missing database should produce a clear message asking the user to run vault-index.
An empty search query with no filters should show usage help. An empty query with filters should run as a filter-only search.
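Two of these cases are sketched below with hypothetical helper names. Note that sqlite3.connect() creates an empty database file when the path does not exist, so the missing-database case needs an explicit check before connecting.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def decode_markdown(raw: bytes, path: str, warnings: list[str]) -> Optional[str]:
    """Decode file bytes as UTF-8; on failure record a warning and skip."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as error:
        warnings.append(f"skipped {path}: invalid UTF-8 at byte {error.start}")
        return None


def open_index(db_path: Path) -> sqlite3.Connection:
    """Open an existing index, or exit with a hint to build it first."""
    if not db_path.is_file():
        raise SystemExit(
            f"Index not found at {db_path}. Run vault-index to build it."
        )
    return sqlite3.connect(db_path)
```

Collecting warnings in a list rather than printing immediately lets the indexer fold them into its final summary.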

Testing Strategy

V1 should be testable without touching the real vault.

Use a small fixture vault containing:

A valid daily note
A valid technical note
A note missing tags
A note with Markdown links
A note with wikilinks
A broken link
A clipping
A wiki page

Tests should verify:

File discovery respects exclusions.
Frontmatter tags and aliases parse correctly.
Markdown links resolve relative paths correctly.
Wikilinks are recorded.
FTS search returns expected documents.
Filters by area, tag, and path prefix work.
Health checks report expected findings.
JSON output is stable.
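A fixture vault along these lines can be generated in a temporary directory so that tests never touch the real vault. All file names and contents below are hypothetical examples. One file sits under .obsidian/ to exercise the exclusion rules, and one note carries both a wikilink and a broken Markdown link.

```python
import tempfile
from pathlib import Path

# Relative path -> file content for a small, self-contained fixture vault.
FIXTURE_FILES = {
    "next-level-door/2026-05-01.md":
        "---\ntags: [日记]\n---\n# 2026-05-01\nDaily note.\n",
    "IT-learning/java/generics.md":
        "---\ntags: [学习, java]\n---\n# Java generics\nSee [TCP](../network/tcp.md).\n",
    "IT-learning/network/tcp.md":
        "# TCP\nNo frontmatter and no tags.\n",
    "IT-learning/network/links.md":
        "---\ntags: [学习, network]\n---\n# Links\n[[TCP]] and [gone](./missing.md).\n",
    "Clippings/article.md":
        "# Clipped article\nSource material.\n",
    "wiki/networking.md":
        "---\ntags: [知识总结, network]\n---\n# Networking\nSummary page.\n",
    ".obsidian/ignored.md":
        "# Should never be indexed\n",
}


def make_fixture_vault() -> Path:
    """Write the fixture files into a fresh temp directory and return it."""
    root = Path(tempfile.mkdtemp(prefix="vault-fixture-"))
    for relative, text in FIXTURE_FILES.items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
    return root
```

Each test can call `make_fixture_vault()`, run the indexer against the returned path, and remove the directory afterwards.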

Phased Roadmap

V1: Local Structured Search

SQLite + FTS5 database
Full rebuild indexer
CLI search
CLI health checks
JSON output for automation

V2: Semantic Retrieval

Add embeddings for selected areas first: wiki/, IT-learning/, next-level-door/知识总结/
Keep SQLite metadata as the source of filters
Add semantic search as a second retrieval path, not a replacement

V3: Knowledge Graph

Use tags, links, aliases, directory hierarchy, and wiki-source relationships to identify:
- isolated notes
- duplicate topics
- missing wiki backlinks
- weakly connected learning notes

V4: Local Web UI

Search box
Area and tag filters
Result preview
Health dashboard
Link graph view if V3 data exists

Open Decisions For Implementation Planning

Choose implementation language: Python is likely best for SQLite, YAML/frontmatter parsing, and quick CLI ergonomics; Node is also viable if the vault automation stack leans JavaScript.
Decide whether V1 stores full content in SQLite or only normalized body text plus file path. The recommendation is to store normalized body text for search snippets and AI JSON output.
Decide whether V1 should install dependencies or use only standard library modules. The recommendation is to allow a minimal YAML/frontmatter dependency if Python is chosen, unless dependency-free setup is more important.
