Control-Plane HA Operator Runbook¶
Audience: operators managing an AutonomyOps control-plane cluster in HA mode (primary + streaming replica with Patroni or equivalent).
This runbook covers the four most common operational scenarios: leader failover, degraded audit status, legacy evidence caveats, and contention monitoring.
Health endpoint quick reference:
| Endpoint | Healthy response | When to check |
|---|---|---|
|  |  | DB reachable from this pod |
|  |  | This pod is the write authority |
| `/v1/health/quorum` |  | Full HA metrics snapshot |
| `/v1/health/audit` |  | Audit completeness status |
| `/v1/health/leader` |  | Current epoch + lock holder |
1. Leader Failover¶
What triggers a failover¶
- Patroni promotes a replica to primary (e.g. primary host crash, `patronictl switchover`).
- The old leader’s keepalive loop fails to renew the advisory lock (`pg_try_advisory_lock`).
- The old leader’s `keepaliveFailure` metric increments; `session_lock_held` drops to 0.
- A replica’s `Campaign()` call wins the advisory lock and increments the epoch.
Observing the failover¶
# Watch the current leader state from any node
watch -n 2 'curl -sf http://localhost:8888/v1/health/leader | jq .'
# Example output immediately after a failover:
# {
# "current_epoch": 4,
# "holder_id": "cp-pod-2",
# "acquired_at": "2026-03-09T14:23:01Z",
# "session_lock_held": true
# }
Key fields:

- `current_epoch` — durable monotonic counter; must be strictly greater than the previous epoch.
- `holder_id` — node ID of the new leader, written by its `Campaign()` call.
- `session_lock_held` — `true` only on the node that currently holds the advisory lock.
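These field checks can be run mechanically when watching for a transition. A minimal sketch, assuming the JSON payload shape shown in the example above; `classify_transition` is a hypothetical helper, not part of the CLI:

```python
def classify_transition(prev: dict, curr: dict) -> str:
    """Classify two successive /v1/health/leader payloads as
    'steady', 'failover', or 'anomaly'."""
    if curr["current_epoch"] == prev["current_epoch"]:
        # Within one epoch the holder must not change.
        return "steady" if curr["holder_id"] == prev["holder_id"] else "anomaly"
    if curr["current_epoch"] > prev["current_epoch"]:
        return "failover"
    # The epoch is a durable monotonic counter; a decrease is a serious problem.
    return "anomaly"

before = {"current_epoch": 3, "holder_id": "cp-pod-1"}
after = {"current_epoch": 4, "holder_id": "cp-pod-2"}
print(classify_transition(before, after))  # failover
```

A monitoring loop would feed consecutive poll results into this check and alert on `anomaly`.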
Check the epoch history to confirm clean transition:
# GET /v1/health/leader shows only the singleton row.
# For full epoch history, query the leader_epochs table directly:
psql "$POSTGRES_URL" -c "
SELECT epoch, holder_id, acquired_at, resigned_at, notes
FROM leader_epochs
ORDER BY epoch DESC
LIMIT 10;
"
A clean failover shows:

- The prior epoch row has `resigned_at` populated (if the old leader called `Resign`) or NULL (hard crash).
- A new epoch row with the new `holder_id`.
- No gap in epoch numbers.
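The clean-failover criteria can be encoded as a check over the `leader_epochs` rows returned by the query above. A sketch, assuming the rows are reduced to `(epoch, resigned_at)` tuples ordered by epoch descending:

```python
def check_epoch_history(rows):
    """rows: list of (epoch, resigned_at) tuples, ordered by epoch DESC.
    Returns (gaps, hard_crashes): gaps violate the clean-failover criteria;
    hard_crashes are superseded epochs with NULL resigned_at, which the
    runbook allows but which are worth noting."""
    gaps, hard_crashes = [], []
    for newer, older in zip(rows, rows[1:]):
        if newer[0] != older[0] + 1:
            gaps.append((older[0], newer[0]))
        if older[1] is None:
            hard_crashes.append(older[0])
    return gaps, hard_crashes

history = [(4, None), (3, "2026-03-09T14:22:58Z"), (1, "2026-03-01T00:00:00Z")]
print(check_epoch_history(history))  # ([(1, 3)], [])
```

The current epoch legitimately has NULL `resigned_at`, so only superseded epochs are inspected.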
Metrics to watch during failover¶
| Metric key | Expected during failover |
|---|---|
| `cp.leader.session_lock_held` | Drops to 0 on old leader, rises to 1 on new leader |
| `cp.leader.campaigns_total{result=success}` | Increments by 1 on the new leader |
| `cp.leader.keepalive_failures_total` | May spike on the old leader before it exits |
| `cp.leader.read_write_state_transition_total{not_leader:leader}` | +1 on new leader |
| `cp.leader.read_write_state_transition_total{leader:not_leader}` | +1 on old leader |
What to do if write-ready never recovers¶
1. Check that the new primary’s `pg_is_in_recovery()` returns `false`:

   SELECT pg_is_in_recovery(); -- must be false on the primary

2. Check that `MinSyncReplicas` is satisfied (if configured > 0):

   curl -sf http://localhost:8888/v1/health/quorum | jq .sync_replica_count

3. If the sync replica count is insufficient, check replica connectivity:

   SELECT application_name, sync_state, state FROM pg_stat_replication;

4. If the advisory lock is stuck (the old leader crashed without releasing it), it is released automatically when its PostgreSQL session terminates. Check for lingering connections:

   SELECT pid, application_name, state, query_start FROM pg_stat_activity WHERE application_name LIKE 'cp-%';
2. Degraded Audit Status¶
Background¶
The audit status reflects whether deferred promotion decisions were successfully persisted. A decision is “deferred” when the promotion evaluator cannot immediately write to the primary (e.g. during a brief leadership gap or network partition).
Statuses:
- `complete` — all decisions persisted normally.
- `degraded` — one or more deferred decisions failed to persist (INV-HA-12).
- `legacy_limited` — pre-HA event stages exist in the horizon (INV-HA-15); see §3.
Detecting degraded status¶
curl -sf http://localhost:8888/v1/health/audit | jq .
# {
# "audit_status": "degraded",
# "recorded_gaps": ["stage s-003 deferred write failed at 2026-03-09T11:00:00Z"],
# "caveat": "..."
# }
Metrics:
cp.audit.deferred_decision_write_failed — total failures since last restart
cp.audit.deferred_decision_write_retried — total retries since last restart
Understanding the gap¶
A degraded status means at least one promotion decision was written during recovery
(RecordInsufficientHistoryDecisions) but could not be persisted to promotion_decisions.
The decision itself is not lost (it remains in the WAL), but it is recorded as a gap in the audit trail.
Query the promotion decisions to find gaps:
-- Decisions written with 'deferred_insufficient_history' origin may indicate
-- stages that were promoted without full evidence during a gap.
SELECT pd.decision_id, pd.rollout_plan_id, pd.stage_id,
pd.epoch, pd.outcome, pd.promotion_origin,
pd.decided_at, pd.reason
FROM promotion_decisions pd
WHERE pd.promotion_origin = 'deferred_insufficient_history'
OR pd.promotion_origin IS NULL
ORDER BY pd.decided_at DESC
LIMIT 20;
Cross-reference with evidence snapshots:
-- Decisions that reference a snapshot have full evidence.
-- Decisions with NULL evidence_snapshot_id are legacy or gap rows.
SELECT pd.decision_id, pd.stage_id, pd.outcome,
es.snapshot_id IS NOT NULL AS has_evidence
FROM promotion_decisions pd
LEFT JOIN evidence_snapshots es ON pd.evidence_snapshot_id = es.snapshot_id
WHERE pd.rollout_plan_id = '<your-plan-id>'
ORDER BY pd.decided_at;
Resolving degraded status¶
The degraded status does NOT block stage promotion or rollout progress. It is an audit
trail completeness flag. To clear it:
1. Identify which stages are missing evidence (query above).
2. If the stage is still active and evidence can be re-gathered, re-evaluate the stage. The next successful `Promote()` call will write a fresh evidence snapshot.
3. If the stage is terminal and the gap is acceptable (e.g. during a known partition), document the gap. The `audit_status` will remain `degraded` until the process restarts and the counter resets (it is in-process, not persisted to the DB).
If degraded status persists across multiple restarts and the counter keeps rising,
check for a persistent write failure path (e.g. the promotion_decisions table is locked
or the primary is rejecting writes from the leader’s connection).
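One way to distinguish a historical gap from a persistent write-failure path is to compare two metric snapshots taken a few minutes apart. A sketch using the counter key from the appendix; the snapshot-dict shape is an assumption:

```python
FAILED = "cp.audit.deferred_decision_write_failed"

def deferred_writes_still_failing(earlier: dict, later: dict) -> bool:
    """True if the failure counter rose between snapshots, i.e. the
    write-failure path is persistent rather than a one-off during a
    past leadership gap or partition."""
    return later.get(FAILED, 0) > earlier.get(FAILED, 0)

print(deferred_writes_still_failing({FAILED: 3}, {FAILED: 3}))  # False (historical only)
print(deferred_writes_still_failing({FAILED: 3}, {FAILED: 7}))  # True (still failing)
```

A rising counter warrants checking the promotion_decisions table locks and the leader’s DB connection, as described above.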
3. Legacy Evidence Caveats¶
Background¶
legacy_limited audit status applies when the HA workplan §15.1 pre-condition is detected:
stages that were promoted before the HA schema was applied have no evidence_snapshots rows
and no promotion_decisions rows written by the epoch-fenced path.
These stages are correctly promoted in the legacy sense (the SQLite-based promoter confirmed them), but the append-only evidence trail for those decisions does not exist in PostgreSQL.
Detecting legacy stages¶
curl -sf http://localhost:8888/v1/health/audit | jq .
# {
# "audit_status": "legacy_limited",
# "caveat": "Pre-HA stages exist in the promotion horizon. Evidence integrity
# is advisory for these stages. See runbook §3."
# }
Metric:
cp.recovery.legacy_events_in_horizon — count of pre-HA events in the replay window
Query to enumerate legacy stages:
-- Stages in stage_status that have no evidence snapshot linked to them.
-- These are candidates for legacy evidence review.
SELECT ss.plan_id, ss.stage_id, ss.phase, ss.promoted_at
FROM stage_status ss
LEFT JOIN promotion_decisions pd
ON pd.rollout_plan_id = ss.plan_id AND pd.stage_id = ss.stage_id
WHERE pd.decision_id IS NULL
AND ss.phase IN ('promoted', 'halted', 'rollback')
ORDER BY ss.promoted_at;
Operator decision tree¶
Is the legacy stage in a terminal phase (promoted/halted/rollback)?
│
├─ YES: Is the stage outcome correct per external ground truth
│ (e.g. fleet telemetry, release tracking)?
│ ├─ YES: Acknowledge the gap. No corrective action needed.
│ │ The rollout is correct; audit trail is incomplete for this stage only.
│ └─ NO: Investigate using fleet telemetry and rollout event history.
│ If the promotion was erroneous, use HaltStage or RollbackStage
│ on any still-active successor stages.
│
└─ NO: The stage is still in progress. The next promotion via the epoch-fenced
path (Promote/HaltStage/RollbackStage) will create full evidence.
No corrective action needed.
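The decision tree above can be summarized as a small lookup. A sketch; only `promoted`/`halted`/`rollback` are confirmed terminal phases in this runbook, and the non-terminal phase name in the example is illustrative:

```python
TERMINAL = {"promoted", "halted", "rollback"}

def legacy_stage_action(phase, outcome_correct=None):
    """Map (stage phase, external ground-truth verdict) to the operator action
    described by the decision tree."""
    if phase not in TERMINAL:
        # Still in progress: the next epoch-fenced promotion writes full evidence.
        return "none: next epoch-fenced promotion will create full evidence"
    if outcome_correct:
        return "acknowledge gap: rollout correct, audit trail incomplete for this stage"
    # Outcome wrong (or unverified): investigate, then halt/rollback successors.
    return "investigate: consider HaltStage/RollbackStage on active successor stages"

print(legacy_stage_action("promoted", outcome_correct=True))
# acknowledge gap: rollout correct, audit trail incomplete for this stage
```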
Migrating from SQLite¶
If you are migrating an existing deployment from the SQLite backend, the migration tool
(autonomy-orchestrator migrate) copies events and plan state but cannot retroactively
create promotion_decisions or evidence_snapshots for stages promoted under SQLite.
This is expected and documented (workplan §15 migration caveats). After migration:

- Run GET /v1/health/audit — expect `legacy_limited` until the replay horizon advances past all pre-migration stages.
- The horizon is configured by `PromotionReplayHorizon` (default 2 hours). After the horizon age passes, `cp.recovery.legacy_events_in_horizon` drops to 0 and status returns to `complete`.
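To estimate when `legacy_limited` should clear after migration, add the configured replay horizon to the newest pre-migration event timestamp. A sketch assuming the 2-hour `PromotionReplayHorizon` default:

```python
from datetime import datetime, timedelta, timezone

def legacy_clears_at(newest_pre_ha_event: datetime,
                     horizon: timedelta = timedelta(hours=2)) -> datetime:
    """Earliest time at which the replay horizon no longer contains
    any pre-HA event, so audit status can return to complete."""
    return newest_pre_ha_event + horizon

t = datetime(2026, 3, 9, 12, 0, tzinfo=timezone.utc)
print(legacy_clears_at(t).isoformat())  # 2026-03-09T14:00:00+00:00
```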
5. Local Reproducibility Lab¶
When you need to re-run HA status and manual failover verification locally, use a
disposable PostgreSQL primary + standby lab instead of the demo compose stack.
The demo environment does not provide a streaming standby, so it cannot exercise
GET /v1/ha/status, GET /v1/ha/quorum, or the PR-12/PR-24 HA safety checks
honestly.
Prerequisites¶
- Docker available locally
- Go 1.25.7
- built CLI binary from this repo
- the repo-managed HA lab helper at `scripts/labs/orchestrator_ha_server.go`, which exposes:
  - GET /v1/ha/status
  - GET /v1/ha/quorum
  - POST /v1/ha/failover
  - the health endpoints from `pgstore.HealthServer`
  - `--quorum-monitor-interval` for faster local quorum-transition capture
Build the helper binaries¶
export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local
export GOCACHE=/tmp/go-build-verify
export GOTMPDIR=/tmp/go-tmp-verify
go build -o /tmp/autonomy ./cmd/autonomy
go build -o /tmp/pr12_ha_server ./scripts/labs/orchestrator_ha_server.go
Bring up a temporary primary + standby¶
docker network create pr12-ha-net
docker volume create pr12ha-primary-data
docker volume create pr12ha-standby-data
docker run -d --name pr12ha-primary \
--network pr12-ha-net \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_USER=postgres \
-e POSTGRES_DB=autonomy \
-v pr12ha-primary-data:/var/lib/postgresql/data \
postgres:16 \
-c wal_level=replica \
-c max_wal_senders=10 \
-c max_replication_slots=10 \
-c hot_standby=on
docker exec pr12ha-primary psql -U postgres -d postgres -c \
"CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replica';"
docker exec pr12ha-primary bash -lc \
"echo 'host replication replicator all scram-sha-256' >> /var/lib/postgresql/data/pg_hba.conf && psql -U postgres -d postgres -c 'SELECT pg_reload_conf();'"
docker run --rm --network pr12-ha-net \
-v pr12ha-standby-data:/var/lib/postgresql/data \
postgres:16 bash -lc '
set -euo pipefail
rm -rf /var/lib/postgresql/data/*
export PGPASSWORD=replica
pg_basebackup \
-d "host=pr12ha-primary port=5432 user=replicator password=replica dbname=postgres application_name=standby1" \
-D /var/lib/postgresql/data \
-Fp -Xs -P -R -C -S standby1_slot
'
docker run -d --name pr12ha-standby \
--network pr12-ha-net \
-e POSTGRES_PASSWORD=postgres \
-v pr12ha-standby-data:/var/lib/postgresql/data \
postgres:16 \
-c hot_standby=on
docker exec pr12ha-primary psql -U postgres -d postgres -c \
"ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';"
docker exec pr12ha-primary psql -U postgres -d postgres -c \
"ALTER SYSTEM SET synchronous_commit = 'remote_apply';"
docker exec pr12ha-primary psql -U postgres -d postgres -c \
"SELECT pg_reload_conf();"
Confirm replication is healthy:
docker exec pr12ha-primary psql -U postgres -d postgres -Atqc \
"SELECT application_name || '|' || state || '|' || sync_state FROM pg_stat_replication;"
Expected:
standby1|streaming|sync
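That expected line can be asserted programmatically, e.g. in a lab smoke-test. A sketch; the `name|state|sync_state` format matches the `-Atqc` query above:

```python
def replication_healthy(line: str, expected_name: str = "standby1") -> bool:
    """Check one pg_stat_replication row in 'name|state|sync_state' form:
    the named standby must be streaming and synchronous."""
    name, state, sync_state = line.strip().split("|")
    return name == expected_name and state == "streaming" and sync_state == "sync"

print(replication_healthy("standby1|streaming|sync"))  # True
print(replication_healthy("standby1|catchup|async"))   # False
```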
Run two local HA nodes¶
PRIMARY_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pr12ha-primary)
Terminal 1:
/tmp/pr12_ha_server \
--postgres-url "postgres://postgres:postgres@${PRIMARY_IP}:5432/autonomy?sslmode=disable" \
--holder-id cp-node-1:18088 \
--listen 127.0.0.1:18088 \
--min-sync-replicas 1
Terminal 2:
/tmp/pr12_ha_server \
--postgres-url "postgres://postgres:postgres@${PRIMARY_IP}:5432/autonomy?sslmode=disable" \
--holder-id cp-node-2:18089 \
--listen 127.0.0.1:18089 \
--min-sync-replicas 1
Exercise the operator flow¶
/tmp/autonomy ha status \
--orchestrator-url http://127.0.0.1:18088
/tmp/autonomy ha failover trigger \
--orchestrator-url http://127.0.0.1:18088 \
--operator local-verify \
--reason "PR-12 local verification"
/tmp/autonomy ha status \
--orchestrator-url http://127.0.0.1:18089
Success criteria:
- before failover, the leader reports `Write-Ready: true`
- replication reports one synchronous standby
- failover returns a successful graceful resignation
- the other node acquires the next epoch
- post-failover status remains `Write-Ready: true`
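These criteria can be checked against two status snapshots captured before and after the failover. A sketch with assumed field names, modelled on the `/v1/health/leader` payload plus a `write_ready` flag (the real CLI prints `Write-Ready:` as text):

```python
def failover_succeeded(before: dict, after: dict) -> bool:
    """Encode the success criteria: write authority before and after,
    a strictly higher epoch, and a different holder."""
    return (
        before["write_ready"] and after["write_ready"]
        and after["current_epoch"] > before["current_epoch"]
        and after["holder_id"] != before["holder_id"]
    )

print(failover_succeeded(
    {"write_ready": True, "current_epoch": 1, "holder_id": "cp-node-1:18088"},
    {"write_ready": True, "current_epoch": 2, "holder_id": "cp-node-2:18089"},
))  # True
```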
Tear down¶
docker rm -f pr12ha-primary pr12ha-standby || true
docker volume rm pr12ha-primary-data pr12ha-standby-data || true
docker network rm pr12-ha-net || true
Backup and restore extension¶
The same lab can also verify the PR-13 backup workflow honestly, because it provides a real primary plus synchronous standby instead of a single-node demo database.
Use a local helper server built from the current branch that exposes the HA health routes and enables maintenance mode for restore verification. For local verification, the evidence run used a small wrapper around:

- `pgstore.OpenPG`
- `LeaderElector`
- `HealthServer.RegisterRoutes`
- `HealthServer.WithMaintenanceMode(true)` for the restore phase
Recommended verification sequence:
Start the helper in normal mode against the primary.
Create a probe row in PostgreSQL that you can mutate and later verify after restore.
Run:
/tmp/autonomy ha backup create \
  --orchestrator-url http://127.0.0.1:18088 \
  --operator local-verify \
  --reason "PR-13 local verification" \
  --backup-id backup-pr13-local \
  --output-dir /tmp/pr13-backups
Validate the backup file:
xxd -l 5 /tmp/pr13-backups/backup-pr13-local.dump
# expected magic header: PGDMP
docker cp /tmp/pr13-backups/backup-pr13-local.dump pr12ha-primary:/tmp/backup-pr13-local.dump
docker exec pr12ha-primary pg_restore -l /tmp/backup-pr13-local.dump
docker exec pr12ha-primary rm -f /tmp/backup-pr13-local.dump
Verify inventory metadata from the running node:
/tmp/autonomy ha backup list \
  --orchestrator-url http://127.0.0.1:18088
Mutate the probe row after the backup.
Restart the helper in maintenance mode and run:
/tmp/autonomy ha backup restore \
  --orchestrator-url http://127.0.0.1:18088 \
  --backup-id backup-pr13-local \
  --operator local-verify \
  --reason "PR-13 restore verification" \
  --confirm
Query the probe row again to confirm it returned to the pre-backup value.
Notes:
- Restore now requires explicit maintenance mode. This is intentional and matches the restore safety invariant: destructive restore must run only in an operator-managed maintenance workflow.
- If `backup_inventory` appears empty after restore, check when the snapshot was taken. A backup created before its own inventory row was committed will restore the earlier inventory state even though the backup file itself is valid.
- If you want a restored snapshot to include real inventory rows, use a two-backup sequence:
  1. create a prior backup to establish an inventory row
  2. capture the state you want to preserve
  3. create the main backup you intend to restore
  4. after restore, expect the prior row to remain and the restored backup’s own row to be absent if it was recorded after the snapshot boundary
4. Contention Monitoring¶
What is lock contention?¶
In the HA control-plane, “lock contention” refers to two sources:
- Advisory lock contention: multiple CP pods attempting `Campaign()` simultaneously. Only one wins the session advisory lock; the others see `ErrLockHeldByOther` and back off.
- Row-level lock contention: `EpochFence` issues `SELECT ... FOR UPDATE` on the `leadership_state` singleton row. If multiple transactions contend for this row, they queue behind the lock holder.
Key metrics¶
| Metric | Description |
|---|---|
| `cp.leader.campaigns_total{result=success}` | Successful leadership acquisitions |
| `cp.leader.campaigns_total{result=lock_held}` | Campaign attempts that found another leader |
| `cp.leader.campaigns_total{result=failure}` | Campaign errors (DB failure, etc.) |
| `cp.authority.epoch_mismatch_total` | EpochFence rejections (stale-epoch writes blocked) |
| `cp.backend.lock_wait_us_sum` / `cp.backend.lock_wait_count` | Advisory lock wait histogram |
| `cp.audit.prior_epoch_closeout_missed` | Epochs that did not have a clean closeout written |
| `cp.audit.prior_epoch_closeout_reconciled` | Epochs reconciled during recovery |
Reading the campaign ratio¶
# High contention ratio: many lock_held vs success
campaigns_success = snapshot["cp.leader.campaigns_total{result=success}"]
campaigns_lock_held = snapshot["cp.leader.campaigns_total{result=lock_held}"]
ratio = campaigns_lock_held / (campaigns_success + campaigns_lock_held)
A ratio > 0.5 in steady state suggests the campaign interval is too short relative to
the keepalive TTL, or there are more CP pods than expected contending for leadership.
Check AUTONOMY_CP_REPLICAS and PGLeaderElectorConfig.KeepaliveInterval.
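The ratio computation above as a runnable helper, using the exact counter keys from the appendix:

```python
def campaign_contention_ratio(snapshot: dict) -> float:
    """Fraction of campaign attempts that lost to an existing leader.
    Returns 0.0 when no campaigns have been recorded."""
    success = snapshot.get("cp.leader.campaigns_total{result=success}", 0)
    lock_held = snapshot.get("cp.leader.campaigns_total{result=lock_held}", 0)
    total = success + lock_held
    return lock_held / total if total else 0.0

snap = {"cp.leader.campaigns_total{result=success}": 10,
        "cp.leader.campaigns_total{result=lock_held}": 40}
ratio = campaign_contention_ratio(snap)
print(ratio > 0.5)  # True: high contention; check campaign interval vs keepalive TTL
```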
Lock wait histogram¶
The cp.backend.lock_wait_us_sum / cp.backend.lock_wait_count pair gives the mean
advisory lock wait time per campaign:
mean_wait_us = lock_wait_us_sum / lock_wait_count
# Alert if mean_wait_us > 500_000 (500ms) sustained over 5 minutes.
High lock wait times indicate the primary is under heavy lock pressure. Check:
-- Active locks on the leadership_state table:
SELECT pid, mode, granted, query_start
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE relation = 'leadership_state'::regclass;
Epoch mismatch monitoring¶
cp.authority.epoch_mismatch_total increments each time a write transaction is rejected
because the caller’s cached epoch does not match the durable epoch in leadership_state.
Expected: 0 in steady state; ≤1 per failover event.
Alert: > 2 per minute suggests a bug in epoch cache invalidation or an operator running a tool with a stale epoch handle.
Query recent epoch transitions:
SELECT epoch, holder_id, acquired_at, resigned_at,
EXTRACT(EPOCH FROM (resigned_at - acquired_at)) AS tenure_seconds
FROM leader_epochs
ORDER BY epoch DESC
LIMIT 20;
Short tenure (< 30s) on many consecutive epochs indicates a flapping leader. Possible causes: network instability, keepalive timeout too aggressive, or the primary PostgreSQL server restarting frequently.
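Flapping can be flagged from the `tenure_seconds` column of the query above. A sketch; the 30-second threshold follows the guidance in this section, while the three-epoch run length is an assumption to tune per deployment:

```python
def is_flapping(tenures, short_s=30, min_consecutive=3):
    """True if `min_consecutive` or more consecutive epochs had tenure
    shorter than `short_s` seconds. `tenures` is ordered by epoch DESC,
    as in the query; the current epoch has NULL (None) resigned_at."""
    run = 0
    for t in tenures:
        if t is not None and t < short_s:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0
    return False

print(is_flapping([12.0, 8.5, 21.0, 3600.0]))  # True
print(is_flapping([None, 3600.0, 7200.0]))     # False
```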
Prior-epoch closeout¶
cp.audit.prior_epoch_closeout_missed — epochs where closeout INSERT was skipped
cp.audit.prior_epoch_closeout_reconciled — epochs subsequently reconciled in recovery
In normal operation, missed should be 0. A non-zero value means a leader acquired the
lock and started a new epoch before the prior epoch’s leader_epochs.resigned_at was
written (hard crash or network partition between advisory lock release and DB write).
reconciled increments when recovery detects and back-fills these rows. If missed is
growing faster than reconciled, recovery is not keeping up. Check:
curl -sf http://localhost:8888/v1/health/quorum | jq '{epoch_missed: .audit_closeout_missed, epoch_reconciled: .audit_closeout_reconciled}'
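The keeping-up check across two metric snapshots can be sketched as follows; the counter keys come from the appendix and the snapshot-dict shape is an assumption:

```python
MISSED = "cp.audit.prior_epoch_closeout_missed"
RECONCILED = "cp.audit.prior_epoch_closeout_reconciled"

def recovery_keeping_up(earlier: dict, later: dict) -> bool:
    """True if closeout reconciliation grew at least as fast as new misses
    between the two snapshots, i.e. recovery is keeping up."""
    missed_growth = later.get(MISSED, 0) - earlier.get(MISSED, 0)
    reconciled_growth = later.get(RECONCILED, 0) - earlier.get(RECONCILED, 0)
    return reconciled_growth >= missed_growth

print(recovery_keeping_up({MISSED: 2, RECONCILED: 2},
                          {MISSED: 5, RECONCILED: 3}))  # False
```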
Appendix: Metric Key Reference¶
All metrics are exposed via the AtomicMetrics.Snapshot() method. The canonical key names
are listed here for alert configuration.
Counters¶
cp.leader.campaigns_total{result=success}
cp.leader.campaigns_total{result=lock_held}
cp.leader.campaigns_total{result=failure}
cp.leader.keepalive_failures_total
cp.authority.epoch_mismatch_total
cp.audit.prior_epoch_closeout_missed
cp.audit.prior_epoch_closeout_reconciled
cp.audit.deferred_decision_write_failed
cp.audit.deferred_decision_write_retried
cp.promoter.evidence_write_failures_total
cp.promoter.decisions_total{outcome=promoted}
cp.promoter.decisions_total{outcome=blocked}
cp.promoter.decisions_total{outcome=rollback_triggered}
cp.promoter.decisions_total{outcome=deferred_insufficient_history}
cp.outbox.purge_total
cp.leader.read_write_state_transition_total{not_leader:leader}
cp.leader.read_write_state_transition_total{leader:not_leader}
Gauges¶
cp.leader.epoch — current durable epoch (0 when not leader)
cp.leader.session_lock_held — 1 = this node holds the advisory lock, 0 = does not
cp.backend.write_ready — 1 = write-authority conditions met
cp.backend.read_ready — 1 = DB reachable
cp.backend.sync_replica_count — current sync replica count
cp.backend.connected_to_primary — 1 = connected to PostgreSQL primary
cp.recovery.legacy_events_in_horizon — count of pre-HA events in replay window
cp.audit.status{status=complete} — 1 when audit is complete
cp.audit.status{status=degraded} — 1 when deferred writes failed
cp.audit.status{status=legacy_limited} — 1 when pre-HA stages exist
Histograms (sum + count pairs)¶
cp.recovery.startup_scan_duration_us_sum — total recovery scan time (µs)
cp.recovery.startup_scan_duration_count — number of recovery scans
cp.recovery.events_replayed — total events replayed across all scans
cp.outbox.dispatch_lag_us_sum — total outbox dispatch lag (µs)
cp.outbox.dispatch_lag_count — number of dispatched outbox messages
cp.backend.lock_wait_us_sum — total advisory lock wait (µs)
cp.backend.lock_wait_count — number of lock wait observations
Generated for workplan v0.7 PR-7 (Observability + Audit Trail). Last updated 2026-03-09.