A module installs by creating tables with convention prefixes. The system discovers what exists. No registration, no base class, no interface.
_raw_*      immutable content and embeddings, written by compile
_edges_*    append-only relationships, written by compile or modules
_types_*    immutable classification, written by compile
_enrich_*   mutable graph scores, written by manage
The prefix IS the lifecycle declaration. An AI seeing _enrich_source_graph knows: mutable, safe to delete, will be recomputed. An AI seeing _raw_chunks knows: immutable, never wipe.
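As an illustration of that lifecycle-from-prefix rule, here is a minimal sketch of how a tool (or an AI) might derive the rules from a table name. The rule table is illustrative, not the actual implementation:

```python
# Lifecycle rules implied by the convention prefixes (illustrative sketch).
LIFECYCLES = {
    "_raw_":    {"mutable": False, "safe_to_wipe": False},  # immutable content
    "_edges_":  {"mutable": False, "safe_to_wipe": False},  # append-only relationships
    "_types_":  {"mutable": False, "safe_to_wipe": False},  # immutable classification
    "_enrich_": {"mutable": True,  "safe_to_wipe": True},   # recomputed by manage
}

def lifecycle(table_name: str) -> dict:
    """Infer a table's lifecycle from its convention prefix."""
    for prefix, rules in LIFECYCLES.items():
        if table_name.startswith(prefix):
            return rules
    raise ValueError(f"no convention prefix: {table_name}")

print(lifecycle("_enrich_source_graph"))  # mutable, safe to wipe
print(lifecycle("_raw_chunks"))           # immutable, never wipe
```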
Views rebuild automatically when tables are added. Presets ship with the module. A cell without a module has full retrieval — those edge columns are simply absent.
Two kinds of modules:
| Type | What it does | Example |
|---|---|---|
| Source | Compiles raw artifacts into chunks | Claude Code, Cursor, Codex |
| Extension | Attaches intelligence to existing chunks | SOMA (file/repo/content/URL identity) |
A source module compiles raw artifacts into chunks. One adapter per format. Different tools, same output shape.
compile/ parse format → chunk-atom tables
manage/ offline enrichment → _enrich_* columns
stock/ presets and views shipped with the module
Write two tables. Embed. Views rebuild automatically.
# minimal adapter
_raw_chunks (id, content, embedding, timestamp)
_edges_source (chunk_id, source_id)
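The minimal adapter really is just those two tables. A hedged sketch in runnable form, using in-memory SQLite and hypothetical values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE _raw_chunks (
    id TEXT PRIMARY KEY, content TEXT, embedding BLOB, timestamp TEXT)""")
conn.execute("CREATE TABLE _edges_source (chunk_id TEXT, source_id TEXT)")

# one chunk from a parsed session (hypothetical data)
conn.execute("INSERT INTO _raw_chunks VALUES (?, ?, ?, ?)",
             ("c1", "fix auth bug", None, "2024-01-01T00:00:00Z"))
conn.execute("INSERT INTO _edges_source VALUES (?, ?)", ("c1", "session-42"))
```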
Everything else is additive. Add _types_message for classification. Add _edges_tool_ops if your format has file operations. Call soma_enrich(chunk) inline and four identity edge tables appear automatically.
Claude Code is the reference source module. It indexes your Claude Code session history on first run (Claude retains ~30 days of JSONLs), then captures everything going forward via hooks and a background worker.
| Layer | What | Table |
|---|---|---|
| Messages | Every prompt, response, tool call — full fidelity, no truncation | _raw_chunks |
| Tool operations | Every Read, Write, Edit, Bash — tool name, target file, cwd | _edges_tool_ops |
| Delegations | Parent → child agent tree with agent type — the most-used advanced feature | _edges_delegations |
| Classification | Message type (user_prompt, assistant, tool_call), role, threading | _types_message |
| File identity | SOMA UUIDs for every file touch — survives renames | _edges_file_identity |
| Repo identity | Git root commit hash — survives repo moves | _edges_repo_identity |
| Content identity | SHA-256 of file content at capture time | _edges_content_identity |
| URL identity | Stable UUID for WebFetch operations | _edges_url_identity |
Claude Code tool use
↓
[hooks] write session_id to ~/.flex/queue.db
↓
[worker] polls every 2s, reads JSONL, embeds, writes to cell
↓
[MCP] exposes cell as read-only SQL
Hooks are notification-only — they write a session ID and timestamp. The worker reads the actual JSONL for data. Crash-safe: idempotent inserts, startup backfill recovers missed sessions.
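The idempotence the crash-safety depends on can be as simple as keying inserts on the chunk id. A minimal sketch with a hypothetical schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _raw_chunks (id TEXT PRIMARY KEY, content TEXT)")

def insert_chunk(conn, chunk_id, content):
    # INSERT OR IGNORE keyed on the chunk id: replaying a JSONL after a
    # crash re-inserts the same rows without duplicating them.
    conn.execute("INSERT OR IGNORE INTO _raw_chunks VALUES (?, ?)",
                 (chunk_id, content))

for _ in range(2):  # simulate a crash plus a full re-read of the session file
    insert_chunk(conn, "c1", "hello")

count = conn.execute("SELECT COUNT(*) FROM _raw_chunks").fetchone()[0]
print(count)  # 1: the replay is a no-op
```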
The worker runs a full enrichment cycle every 30 minutes:
| Layer | What it produces |
|---|---|
| Source graph | Centrality, hub status, community membership, community labels |
| File graph | File co-edit relationships across sessions |
| Delegation graph | Parent → child agent topology with betweenness |
| Fingerprints | Navigational index per session — key decisions, tool patterns |
| Project attribution | Maps sessions to repos via 5-tier resolution |
All enrichment is in _enrich_* tables. Safe to wipe. Recomputed automatically.
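Because the lifecycle lives in the prefix, "safe to wipe" is mechanically checkable. A sketch of what a wipe might look like (illustrative, not the real command):

```python
import sqlite3

def wipe_enrichment(conn):
    """Drop every _enrich_* table; the next manage cycle recomputes them."""
    # GLOB treats underscores literally, unlike LIKE.
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name GLOB '_enrich_*'")]
    for t in tables:
        conn.execute(f"DROP TABLE {t}")
    return tables

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _raw_chunks (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE _enrich_source_graph (chunk_id TEXT PRIMARY KEY, centrality REAL)")
print(wipe_enrichment(conn))  # only the _enrich_ table is dropped; _raw_chunks is untouched
```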
Two curated views — messages (17 columns, chunk-level) and sessions (15 columns, session-level) — compose all raw tables into a flat surface. The AI queries views, never raw tables. See Views for the full breakdown.
A query can span claude_code alongside design docs from a documentation cell. The system runs them in parallel automatically.
@orient schema, views, presets, graph topology
@health pipeline freshness, queue depth, embedding coverage
@digest multi-day activity summary
@sprints work periods detected by 6h gaps
@story session narrative — timeline, artifacts, agents
@file every session that touched a file, across renames
@genealogy concept lineage — hubs, key excerpts
@delegation-tree recursive sub-agent tree
@bridges cross-community connector sessions
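The gap-based detection @sprints describes can be sketched in a few lines. This is illustrative only; the real detector may differ:

```python
from datetime import datetime, timedelta

GAP = timedelta(hours=6)

def detect_sprints(timestamps):
    """Group sorted session timestamps into work periods split at >6h gaps."""
    sprints = []
    for ts in sorted(timestamps):
        if sprints and ts - sprints[-1][-1] <= GAP:
            sprints[-1].append(ts)   # continue the current sprint
        else:
            sprints.append([ts])     # gap exceeded: start a new sprint
    return sprints

ts = [datetime(2024, 1, 1, h) for h in (9, 11, 13)] + [datetime(2024, 1, 2, 8)]
print(len(detect_sprints(ts)))  # 2: the overnight gap splits the work into two sprints
```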
An extension module attaches intelligence to existing chunks. It doesn't parse — it enriches. Install by creating tables with convention prefixes. The view generator discovers them. The AI can JOIN on them immediately.
-- create a table, it appears in views
CREATE TABLE _edges_my_module (
chunk_id TEXT,
my_field TEXT NOT NULL
);
-- drop it, it disappears
DROP TABLE _edges_my_module;
No registration. No coupling. A cell without the module has full retrieval — those columns are simply absent.
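A view generator in this spirit can discover convention tables from sqlite_master and compose the flat surface. This is an illustrative sketch, not the module's actual generator:

```python
import sqlite3

def regenerate_views(conn):
    # Sketch: LEFT JOIN every _enrich_* table (1:1 on chunk_id) into a
    # flat view over _raw_chunks, pulling in each table's extra columns.
    enrich = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name GLOB '_enrich_*'")]
    select, joins = ["c.*"], []
    for t in enrich:
        cols = [r[1] for r in conn.execute(f"PRAGMA table_info({t})")]
        select += [f"{t}.{col}" for col in cols if col != "chunk_id"]
        joins.append(f"LEFT JOIN {t} ON {t}.chunk_id = c.id")
    conn.execute("DROP VIEW IF EXISTS messages")
    conn.execute("CREATE VIEW messages AS SELECT " + ", ".join(select) +
                 " FROM _raw_chunks c " + " ".join(joins))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _raw_chunks (id TEXT PRIMARY KEY, content TEXT)")
conn.execute("CREATE TABLE _enrich_source_graph "
             "(chunk_id TEXT PRIMARY KEY, centrality REAL, is_hub INTEGER)")
regenerate_views(conn)
# the view now exposes centrality and is_hub as if they were columns of _raw_chunks
```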
Architecturally, SOMA is an extension module — it installs by creating tables, uninstalls by dropping them. But for agentic coding, it's foundational. Git is technically optional for writing code. In practice, you'd never ship without it. SOMA is the same — that's why it ships with the Claude Code module.
SOMA provides stable identity for files, repos, content, and URLs. It's a standalone system (~/.soma/) with its own databases, shared across all cells and projects on the machine. When the sessions view shows project = 'myapp', that's SOMA — a 5-tier resolution stack traced the session back through repo identity. When @file tracks a file across renames, that's SOMA — a UUID assigned once and persisted in xattr.
Paths are fragile. A session from six months ago worked in a worktree deleted the next day. The cwd is dead. The git root is dead. But the file exists in main, and the repo is still on disk. Without content-addressed identity, that session is an orphan. With SOMA, you trace every session that touched the file — across renames, repo moves, and deleted worktrees.
| Layer | Table | What it tracks | Survives |
|---|---|---|---|
| File | _edges_file_identity | Stable UUID per file | Renames, moves, repo migrations |
| Repo | _edges_repo_identity | Git root commit hash | Repo moves, worktree deletion |
| Content | _edges_content_identity | SHA-256 of file content + git blob hash | Path changes, branch switches |
| URL | _edges_url_identity | Stable UUID per URL | Normalization differences |
All four are written at compile time. Identity is resolved once, written into edge tables, and persists forever.
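Two of those identities are plain hashes anyone can recompute. The content hash is SHA-256 over the raw bytes; the git blob hash is what `git hash-object` produces, a SHA-1 over a "blob <len>\0" header plus the content:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 of raw file content; survives path changes."""
    return hashlib.sha256(data).hexdigest()

def git_blob_hash(data: bytes) -> str:
    """What `git hash-object` computes: SHA-1 over 'blob <len>\\0' + content."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

data = b"hello\n"
print(content_hash(data))
print(git_blob_hash(data))  # ce013625030ba8dba906f756967f9e9ca394464a
```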
JSONL sync (worker reads session file)
↓
soma.compile.enrich(chunk)
↓
FileIdentity.get_or_create(path) → file_uuid
git rev-parse --show-toplevel → repo_root
git hash-object {file} → blob_hash
sha256(file_content) → content_hash
URLIdentity.get_or_create(url) → url_uuid
↓
insert_edges(conn, chunk)
→ _edges_file_identity
→ _edges_repo_identity
→ _edges_content_identity
→ _edges_url_identity
Most of SOMA's value is ambient — you benefit without knowing it's there:
| You see | SOMA does |
|---|---|
| project = 'myapp' on sessions | 5-tier repo attribution from git root commit hash |
| @file path=auth.py tracks renames | Stable UUID per file, fan-out across all historical paths |
| file_uuids in messages view | 1:N identity collapsed to JSON array per chunk |
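The 1:N collapse in the last row can be reproduced with SQLite's json_group_array. A small sketch with hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE _raw_chunks (id TEXT PRIMARY KEY, content TEXT);
CREATE TABLE _edges_file_identity (chunk_id TEXT, file_uuid TEXT);
INSERT INTO _raw_chunks VALUES ('c1', 'edited two files');
INSERT INTO _edges_file_identity VALUES ('c1', 'uuid-a'), ('c1', 'uuid-b');
""")

# 1:N identity edges collapsed to one JSON array per chunk, as a view might do
row = conn.execute("""
    SELECT c.id, json_group_array(e.file_uuid) AS file_uuids
    FROM _raw_chunks c JOIN _edges_file_identity e ON e.chunk_id = c.id
    GROUP BY c.id
""").fetchone()
print(row)  # one row per chunk, file_uuids as a JSON array string
```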
The @file preset is the primary interface — it resolves identity, fans out across renames, and returns a unified history:
# every session that touched a file, across all renames
$ flex search "@file path=src/auth.py"
# or just ask Claude
"Use flex: what's the history of auth.py?"
SOMA runs a 4-pass heal cycle every 24 hours: file UUIDs, content hashes, URL UUIDs, and pre-edit blob hashes from Claude Code's ~/.claude/file-history/ backup files. The forward path captures identity at sync time. The heal pass backfills gaps from capture failures.
SOMA is optional: every SOMA call is wrapped in try/except, and a cell without it has full retrieval and graph. But the Claude Code module ships with SOMA because agentic coding without stable file identity is like coding without git — technically possible, practically unthinkable.
Modules define what goes into a cell. Views define what comes out.
Raw tables are what gets compiled — immutable facts written once and never modified. A cell has normalized relationships, identity edges, and enrichment scores spread across many tables.
Views compose those tables into a flat surface the AI queries directly. Invoking the @orient preset exposes the view-level schema, a curated surface over the data. The AI sees two views:
| View | Level | What it shows |
|---|---|---|
| messages | Chunk | Every message, tool call, and file operation — with session context, file identity, delegation edges, and full file content pre-joined into flat columns |
| sessions | Session | Every session with project, graph intelligence (centrality, hubs, communities), fingerprints, and warmup noise already filtered out |
The views handle the joins, the 1:N collapse, and the noise filtering. The AI writes WHERE is_hub = 1 like it's a column that always existed.
Without views:
-- 5 tables, 4 JOINs, noise filter, GROUP BY
SELECT src.source_id, src.project, g.centrality
FROM _raw_sources src
LEFT JOIN _types_source_warmup w ON src.source_id = w.source_id
LEFT JOIN _enrich_source_graph g ON src.source_id = g.source_id
WHERE COALESCE(w.is_warmup_only, 0) = 0
AND g.is_hub = 1
With views:
SELECT session_id, project, centrality
FROM sessions
WHERE is_hub = 1
Views are plain .sql files at ~/.flex/views/claude_code/. Edit them to change what the AI sees. Your copy takes precedence over module defaults. Run flex sync to install.
This is how presets like @orient work. The AI reads view columns to learn what to filter on. The cell describes itself through its views.
To index a new tool (Cursor, Codex, or anything that produces session artifacts):
flex/modules/your_tool/
├─ compile/
│  ├─ worker.py    parse your format → _raw_chunks + _edges_source
│  └─ skip.py      noise filtering (optional)
├─ manage/
│  └─ noise.py     graph filter config (optional)
└─ stock/
   ├─ presets/     .sql files shipped with your module
   └─ views/       curated view .sql files (optional)
The minimal implementation:
# 1. Parse your format into chunks
for chunk in parse_your_format(source_file):
insert_chunk_atom(conn, chunk_id, content, timestamp)
insert_edge(conn, chunk_id, source_id)
# 2. Embed
embedder = get_model()
for chunk in get_unembedded(conn):
embedding = embedder.encode(chunk.content)
update_embedding(conn, chunk.id, embedding)
# 3. Views rebuild automatically
regenerate_views(conn)
That's it. You now have retrieval, graph intelligence, presets, and MCP access. The rest is additive — classification, tool ops, identity edges, enrichment scripts.
To attach new intelligence to existing chunks:
# 1. Create tables with convention prefixes
conn.execute("""
CREATE TABLE IF NOT EXISTS _edges_your_module (
chunk_id TEXT,
your_field TEXT NOT NULL
)
""")
# 2. Populate from existing data
for chunk_id, value in compute_your_intelligence(conn):
conn.execute(
"INSERT INTO _edges_your_module VALUES (?, ?)",
(chunk_id, value)
)
# 3. Views rebuild automatically
regenerate_views(conn)
_edges_*    relationships (1:N, no PK on chunk_id — excluded from auto-generated views, query via explicit JOIN)
_enrich_*   mutable scores (1:1 with PK — auto-joins into views, safe to wipe)
_types_*    immutable classification
The prefix declares the lifecycle. The PK declares view inclusion.
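The PK rule is mechanically checkable with PRAGMA table_info. A sketch with hypothetical table names:

```python
import sqlite3

def auto_joins_into_views(conn, table):
    """A table auto-joins only if chunk_id is its PRIMARY KEY (1:1 shape)."""
    for cid, name, typ, notnull, dflt, pk in conn.execute(f"PRAGMA table_info({table})"):
        if name == "chunk_id":
            return pk == 1
    return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _enrich_scores (chunk_id TEXT PRIMARY KEY, score REAL)")
conn.execute("CREATE TABLE _edges_my_module (chunk_id TEXT, my_field TEXT)")
print(auto_joins_into_views(conn, "_enrich_scores"))    # True: 1:1, auto-joins
print(auto_joins_into_views(conn, "_edges_my_module"))  # False: 1:N, explicit JOIN
```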