Operations

Day-to-day tasks: debugging stuck runs, managing the database, memory reports, and recovery procedures.


Debug a stuck run

Step 1: Check active runs

bash
vlx status

Shows every active run, its current stage, and the time since its last event. A run that has not advanced for more than 30 minutes is likely stuck; at the 30-minute threshold the watchdog marks it needs_review.

Step 2: Check Telegram

The story's Telegram thread contains every status update. Look for:

  • A clarification question that has not been answered.
  • An APPROVAL request that has expired.
  • An error message from the orchestrator.

Step 3: Get the event trail

bash
# In Telegram:
/history <storyId>

Prints the last 30 events for the story's latest run — stage transitions, clarifications asked/answered, approvals, gate failures, PR events.

Step 4: Check stage checkpoints

Per-stage checkpoint manifests are stored locally in the run's worktree:

.vlx/worktrees/<run-id>/.vlx/<storyId>/checkpoints/<stage>.json

These are read-only (orchestrator-owned, gitignored). Each manifest records the completed stage, its artifacts with SHA-256 hashes, and the commit SHA. If a manifest exists for a stage, that stage is considered complete.
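To spot-check a manifest by hand, an artifact's hash can be recomputed and compared with the recorded value. The sketch below is a hypothetical helper, not part of vlx. It takes the expected digest as an argument (lowercase hex, copied out of the manifest by the operator) because the manifest's JSON layout is not described here.

```bash
# Hypothetical helper: succeed only if FILE's SHA-256 equals EXPECTED.
# EXPECTED is a lowercase hex digest supplied by the caller.
verify_artifact() {
  local file="$1" expected="$2" actual
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$file" | awk '{print $1}')
  else
    # Fallback for systems that ship shasum instead of sha256sum.
    actual=$(shasum -a 256 "$file" | awk '{print $1}')
  fi
  [ "$actual" = "$expected" ]
}
```

A call such as `verify_artifact <artifact-path> <recorded-sha> && echo match` leaves both files untouched, which fits the read-only status of the checkpoints.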

Step 5: Check DB integrity

bash
vlx db check

Runs PRAGMA integrity_check on state.db. Any output other than ok indicates corruption — see Database recovery below.

Step 6: Apply an operator command

| Symptom | Command |
| --- | --- |
| Run stuck / no progress | /requeue <storyId> (revive from the stage it died at) |
| Need to re-run from a specific stage | /restart <storyId> <stage> |
| Run is irrecoverably broken | /cancel <storyId> (kill the run) |
| Plan needs a full redo | /restart <storyId> think |

Database backup

The daemon takes a daily backup at startup. Take a manual snapshot before any risky operation:

bash
vlx db backup

Backups are stored in backups/ adjacent to state.db (or $VLX_DB_PATH).

Retention policy:

  • Last 14 daily snapshots.
  • Last 12 monthly snapshots.
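The effect of the retention policy can be illustrated with a small filter. This is only a sketch of one possible reading, not the daemon's actual pruning logic: it assumes snapshot dates are available as YYYY-MM-DD strings and that a "monthly snapshot" means the earliest snapshot taken in each calendar month. The function name `retained_snapshots` is hypothetical.

```bash
# Reads snapshot dates (YYYY-MM-DD, one per line) on stdin and prints the
# dates that would be kept.
# ASSUMPTION: "monthly snapshot" = earliest snapshot in each calendar month.
retained_snapshots() {
  local sorted
  sorted=$(sort -u)
  [ -n "$sorted" ] || return 0
  {
    # Newest 14 daily snapshots.
    printf '%s\n' "$sorted" | tail -n 14
    # Earliest snapshot of each month, limited to the newest 12 months.
    printf '%s\n' "$sorted" |
      awk '{ m = substr($0, 1, 7); if (!(m in seen)) { seen[m] = 1; print } }' |
      tail -n 12
  } | sort -u
}
```

Dates that appear in both sets are printed once, so a recent first-of-month snapshot is not counted twice.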

Database restore

Destructive. Stop the daemon first.

bash
# 1. Stop the daemon
sudo systemctl stop vlx

# 2. Inspect available backups
ls -lh backups/

# 3. Check the backup before restoring
VLX_DB_PATH=backups/state-<timestamp>.db vlx db check

# 4. Restore
vlx db restore backups/state-<timestamp>.db --yes

# 5. Restart the daemon
sudo systemctl start vlx

There is no auto-rollback. If the restore makes things worse, restore from an earlier snapshot.
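Because there is no auto-rollback, it can help to wrap the restore steps in a function that refuses to proceed when the backup itself fails the integrity check. The following is a sketch under assumptions: `restore_from_backup` is a hypothetical name, `vlx db check` is assumed to print exactly `ok` on a healthy file, and the check is run before the daemon is stopped so that a bad backup never causes downtime.

```bash
# Sketch: validate a backup, then stop the daemon, restore, and restart.
# Returns non-zero without restoring if the backup does not check out.
restore_from_backup() {
  local backup="$1" check
  [ -f "$backup" ] || { echo "no such backup: $backup" >&2; return 2; }

  # Any output other than "ok" is treated as corruption.
  check=$(VLX_DB_PATH="$backup" vlx db check)
  if [ "$check" != "ok" ]; then
    echo "backup failed integrity check; not restoring" >&2
    return 1
  fi

  sudo systemctl stop vlx || return 1
  if ! vlx db restore "$backup" --yes; then
    # Leave the daemon stopped so the operator can pick an earlier snapshot.
    echo "restore failed; daemon left stopped" >&2
    return 1
  fi
  sudo systemctl start vlx
}
```

Leaving the daemon stopped after a failed restore is a design choice of this sketch: it avoids starting the orchestrator on a database in an unknown state.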


Database archive

Archive event logs for old terminal runs to keep the events table lean:

bash
vlx db archive

Moves event rows for runs that are in a terminal state (shipped, failed, abandoned, needs_review) and older than 90 days into a separate archive file. Archived events remain queryable from that file if needed.
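The archive criterion combines a state check with an age check. As a rough illustration, assuming a run's age is measured in Unix epoch seconds from a timestamp the caller supplies (which timestamp vlx actually uses is not stated here), the rule could be written as the hypothetical function below.

```bash
# Succeeds if a run would qualify for archiving: terminal state AND
# more than 90 days old. Timestamps are Unix epoch seconds; "now" can be
# passed explicitly, otherwise the current time is used.
archive_eligible() {
  local status="$1" finished_at="$2" now="${3:-$(date +%s)}"
  case "$status" in
    shipped|failed|abandoned|needs_review) ;;
    *) return 1 ;;   # not a terminal state
  esac
  [ $(( now - finished_at )) -gt $(( 90 * 86400 )) ]
}
```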


Memory hygiene report

bash
vlx memory report

Scans the project memory at <repo>/.vlx/memory/ and reports:

  • Stale entries: files not updated in > 90 days (may contain outdated info).
  • Large files: files > 10 KB (suggesting accumulated cruft rather than useful learnings).
  • Potential duplicates: entries with high content similarity (could be merged).

No files are modified. The report is informational — act on it by editing or deleting the flagged files in a normal PR.
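Two of the three checks can be approximated with plain `find` when a quick look is needed without the CLI. This is a rough stand-in, not how `vlx memory report` is implemented: it assumes staleness is judged by file modification time and treats "10 KB" as 10240 bytes. The function names are hypothetical, and duplicate detection is not covered.

```bash
# Read-only approximations of the stale and large checks.
# ASSUMPTIONS: staleness = mtime; 10 KB = 10240 bytes.
stale_entries() {
  find "${1:-.vlx/memory}" -type f -mtime +90
}

large_files() {
  find "${1:-.vlx/memory}" -type f -size +10240c
}
```

Like the report itself, neither function modifies anything; both only print matching paths.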


Recovery model

Daemon restarts

On startup, the orchestrator scans runs for active rows. For each active run:

  1. Checks the per-stage checkpoint manifests in the worktree. A completed stage with a valid manifest is not re-run.
  2. Runs reconcileIntents, which checks the external systems (ADO, SCM) for side effects that the event log records as "intended but unconfirmed", for example after a crash between issuing an ADO update and recording its completion. Each such intent is re-recorded or retried idempotently.
  3. Resumes the run from the last incomplete stage.

If a run's worktree is missing, the run is marked needs_review. A missing worktree is a signal worth surfacing to the operator, so it is not silently re-created.
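The resume decision described above can be approximated from the command line when auditing a run. The sketch below is a hypothetical helper that checks only whether each manifest file exists (it does not validate manifest contents) and takes the ordered stage names from the caller, since the full stage list is not given here.

```bash
# Prints the first stage without a checkpoint manifest, i.e. where a
# resumed run would pick up.
# Exit status: 0 = stage printed, 1 = every stage has a manifest,
#              2 = worktree missing (the needs_review case).
resume_stage() {
  local worktree="$1" story_id="$2" stage
  shift 2
  [ -d "$worktree" ] || return 2
  for stage in "$@"; do
    if [ ! -f "$worktree/.vlx/$story_id/checkpoints/$stage.json" ]; then
      printf '%s\n' "$stage"
      return 0
    fi
  done
  return 1
}
```

Usage would look like `resume_stage .vlx/worktrees/<run-id> <storyId> <stage-1> <stage-2> ...`, with the stages listed in pipeline order.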

Corrupt event log (seq gap)

If (run_id, seq=N) exists but seq=N+1 is missing, the event log has a gap. The run is marked needs_review instead of attempting to resume on incomplete history.

Audit the gap:

bash
# Query the events table directly (uses $VLX_DB_PATH if set, else state.db)
sqlite3 "${VLX_DB_PATH:-state.db}" \
  "SELECT seq, type FROM events WHERE run_id='<run-id>' ORDER BY seq"
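For long event trails, spotting a missing seq by eye is error-prone. A small filter can report the gaps instead. This sketch assumes the rows arrive in sqlite3's default `seq|type` output format, ordered by seq; `find_seq_gaps` is a hypothetical name.

```bash
# Reads "seq|type" rows on stdin (ordered by seq) and prints each missing
# range as "first-last". Prints nothing when the sequence is contiguous.
find_seq_gaps() {
  awk -F'|' '
    NR > 1 && $1 > prev + 1 { printf "%d-%d\n", prev + 1, $1 - 1 }
    { prev = $1 }
  '
}
```

Piping the query output through `find_seq_gaps` turns a trail of seq values 1, 2, 4, 5, 8 into the two lines `3-3` and `6-7`.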

needs_review state

Runs marked needs_review are surfaced in vlx status and via a Telegram notification. They do not auto-recover — use /requeue or /restart to resume, or /cancel to abandon.

Auto-escape

Pre-Ship failures land in queue.status = failed. The auto-escape sweep (runs every few minutes) automatically requeues such stories up to 2 times per story before requiring operator action. The cap counts auto_escape_triggered events across all runs for the story. Cancelled runs and stories with an open PR are excluded.

At the cap, the run is marked needs_review and a single Telegram notification is sent.
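The sweep's decision rule can be summarised as a small function. This is a sketch derived from the description in this section, not the orchestrator's code: the inputs are supplied by the caller, and the name `auto_escape_action` and the `skip` label are hypothetical.

```bash
# Decision rule for one failed story.
#   $1 = count of auto_escape_triggered events across all runs of the story
#   $2 = "yes" if the run was cancelled
#   $3 = "yes" if the story has an open PR
auto_escape_action() {
  local triggered="$1" cancelled="$2" open_pr="$3" cap=2
  if [ "$cancelled" = "yes" ] || [ "$open_pr" = "yes" ]; then
    echo "skip"            # excluded from the sweep
  elif [ "$triggered" -lt "$cap" ]; then
    echo "requeue"         # still under the per-story cap
  else
    echo "needs_review"    # cap reached: operator action required
  fi
}
```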


Worktree management

Per-run git worktrees live at:

<runtime.worktree_root>/<run-id>/

Default: .vlx/worktrees/<run-id>/.

Worktrees are created by the orchestrator and destroyed when:

  • The run completes successfully (post-Ship cleanup).
  • The run is cancelled, for example by a /cancel command.

If a worktree is left behind (daemon crash, manual kill), remove it manually:

bash
git worktree remove .vlx/worktrees/<run-id> --force
git branch -D vlx-bot/<storyId>   # only if no longer needed

Do not delete the worktree while the daemon is running and a run is active — the orchestrator owns the worktree lifecycle.
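Before removing anything by hand, it is worth listing which worktree directories no longer correspond to an active run. The sketch below only prints candidates and deletes nothing. It assumes the operator supplies the active run IDs on stdin, one per line (for example, read off `vlx status`); `orphan_worktrees` is a hypothetical name.

```bash
# Prints the run IDs of worktree directories that are not in the active set.
# Active run IDs are read from stdin, one per line. Nothing is removed.
orphan_worktrees() {
  local root="${1:-.vlx/worktrees}" active dir id
  active=$(cat)
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue        # no worktrees at all
    id=$(basename "$dir")
    printf '%s\n' "$active" | grep -qxF -- "$id" || printf '%s\n' "$id"
  done
}
```

Each ID it prints can then be reviewed and, if truly abandoned, removed with the `git worktree remove` command shown above.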

Internal Veloxcore tool — not a public product.