Control-Plane HA Operator Runbook

Audience: operators managing an AutonomyOps control-plane cluster in HA mode (primary + streaming replica with Patroni or equivalent).

This runbook covers the five most common operational scenarios:

  1. Leader failover

  2. Degraded audit status

  3. Legacy evidence caveats

  4. Local reproducibility lab

  5. Contention monitoring

Health endpoint quick reference:

Endpoint                      Healthy response      What it indicates
GET /v1/health/read-ready     200 {"ready":true}    DB reachable from this pod
GET /v1/health/write-ready    200 {"ready":true}    This pod is the write authority
GET /v1/health/quorum         200 always            Full HA metrics snapshot
GET /v1/health/audit          200 always            Audit completeness status
GET /v1/health/leader         200 always            Current epoch + lock holder
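
To sweep all five endpoints at once on a pod, a short loop is enough (a sketch; assumes jq is installed, and uses plain -s so non-200 bodies such as write-ready on a replica still print):

# Sweep every health endpoint on the local pod.
for ep in read-ready write-ready quorum audit leader; do
  echo "== /v1/health/$ep =="
  curl -s "http://localhost:8888/v1/health/$ep" | jq .
done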


1. Leader Failover

What triggers a failover

  • Patroni promotes a replica to primary (e.g. primary host crash, patronictl switchover).

  • The old leader’s keepalive loop fails to renew the advisory lock (pg_try_advisory_lock).

  • The old leader’s keepaliveFailure metric increments; session_lock_held drops to 0.

  • A replica’s Campaign() call wins the advisory lock and increments the epoch.
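
The campaign mechanism is an ordinary PostgreSQL session advisory lock, so its behavior can be reproduced with two psql sessions. A minimal sketch, using an illustrative lock key (42); the elector derives its real key internally:

# Session A takes the lock and keeps its session open for 30 seconds.
psql "$POSTGRES_URL" -c "SELECT pg_try_advisory_lock(42) AS won; SELECT pg_sleep(30);" &
sleep 1
# Session B campaigns while A's session is alive and loses (won = f).
psql "$POSTGRES_URL" -c "SELECT pg_try_advisory_lock(42) AS won;"
# The lock releases automatically when A's session ends; this is why a crashed
# leader's lock clears once its PostgreSQL session is terminated.
wait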

Observing the failover

# Watch the current leader state from any node
watch -n 2 'curl -sf http://localhost:8888/v1/health/leader | jq .'

# Example output immediately after a failover:
# {
#   "current_epoch": 4,
#   "holder_id": "cp-pod-2",
#   "acquired_at": "2026-03-09T14:23:01Z",
#   "session_lock_held": true
# }

Key fields:

  • current_epoch — durable monotonic counter; must be strictly greater than the previous epoch.

  • holder_id — node ID of the new leader; was written by its Campaign() call.

  • session_lock_held — true only on the node that currently holds the advisory lock.

Check the epoch history to confirm clean transition:

# GET /v1/health/leader shows only the singleton row.
# For full epoch history, query the leader_epochs table directly:
psql "$POSTGRES_URL" -c "
  SELECT epoch, holder_id, acquired_at, resigned_at, notes
  FROM leader_epochs
  ORDER BY epoch DESC
  LIMIT 10;
"

A clean failover shows:

  • The prior epoch row has resigned_at populated (if the old leader called Resign), or NULL (hard crash).

  • A new epoch row with the new holder_id.

  • No gap in epoch numbers.
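
The no-gap property can be checked mechanically with a window function (a sketch, against the same leader_epochs table as above):

# delta should be 1 on every row; NULL is expected only on the very first epoch.
psql "$POSTGRES_URL" -c "
  SELECT epoch, holder_id,
         epoch - LAG(epoch) OVER (ORDER BY epoch) AS delta
  FROM leader_epochs
  ORDER BY epoch DESC
  LIMIT 10;
"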

Metrics to watch during failover

Metric key                                                       Expected during failover
cp.leader.session_lock_held                                      Drops to 0 on old leader, rises to 1 on new leader
cp.leader.epoch                                                  Increments by 1 on the new leader
cp.leader.keepalive_failures_total                               May spike on the old leader before it exits
cp.leader.read_write_state_transition_total{not_leader:leader}   +1 on new leader
cp.leader.read_write_state_transition_total{leader:not_leader}   +1 on old leader

Confirming write authority restored

# On the new leader pod — expect 200 + ready:true
curl -sf http://localhost:8888/v1/health/write-ready | jq .

# On the old leader pod — expect 503 + ready:false (use plain -s here: with
# -f, curl suppresses the body on a non-2xx response)
curl -s http://old-leader:8888/v1/health/write-ready | jq .

What to do if write-ready never recovers

  1. Check that the new primary’s pg_is_in_recovery() returns false:

    SELECT pg_is_in_recovery();  -- must be false on the primary
    
  2. Check that MinSyncReplicas is satisfied (if configured > 0):

    curl -sf http://localhost:8888/v1/health/quorum | jq .sync_replica_count
    
  3. If sync replica count is insufficient, check replica connectivity:

    SELECT application_name, sync_state, state
    FROM pg_stat_replication;
    
  4. If advisory lock is stuck (old leader crashed without releasing), it will expire automatically when its PostgreSQL session is terminated. Check for lingering connections:

    SELECT pid, application_name, state, query_start
    FROM pg_stat_activity
    WHERE application_name LIKE 'cp-%';
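
    If one of those sessions belongs to the crashed leader and refuses to die (e.g. a dead TCP peer), terminating the backend releases its advisory lock immediately. The pid below is a placeholder; take it from the query above:

    # Destructive: only terminate a backend you have confirmed is the crashed
    # leader's session. Its session advisory lock is released on termination.
    psql "$POSTGRES_URL" -c "SELECT pg_terminate_backend(<pid>);"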
    

2. Degraded Audit Status

Background

The audit status reflects whether deferred promotion decisions were successfully persisted. A decision is “deferred” when the promotion evaluator cannot immediately write to the primary (e.g. during a brief leadership gap or network partition).

Statuses:

  • complete — all decisions persisted normally.

  • degraded — one or more deferred decisions failed to persist (INV-HA-12).

  • legacy_limited — pre-HA event stages exist in the horizon (INV-HA-15); see §3.

Detecting degraded status

curl -sf http://localhost:8888/v1/health/audit | jq .
# {
#   "audit_status": "degraded",
#   "recorded_gaps": ["stage s-003 deferred write failed at 2026-03-09T11:00:00Z"],
#   "caveat": "..."
# }

Metrics:

cp.audit.deferred_decision_write_failed   — total failures since last restart
cp.audit.deferred_decision_write_retried  — total retries since last restart

Understanding the gap

A degraded status means at least one promotion decision was written during recovery (RecordInsufficientHistoryDecisions) but could not be persisted to promotion_decisions. The decision is not lost from the WAL — it is recorded as a gap in the audit trail.

Query the promotion decisions to find gaps:

-- Decisions written with 'deferred_insufficient_history' origin may indicate
-- stages that were promoted without full evidence during a gap.
SELECT pd.decision_id, pd.rollout_plan_id, pd.stage_id,
       pd.epoch, pd.outcome, pd.promotion_origin,
       pd.decided_at, pd.reason
FROM promotion_decisions pd
WHERE pd.promotion_origin = 'deferred_insufficient_history'
  OR pd.promotion_origin IS NULL
ORDER BY pd.decided_at DESC
LIMIT 20;

Cross-reference with evidence snapshots:

-- Decisions that reference a snapshot have full evidence.
-- Decisions with NULL evidence_snapshot_id are legacy or gap rows.
SELECT pd.decision_id, pd.stage_id, pd.outcome,
       es.snapshot_id IS NOT NULL AS has_evidence
FROM promotion_decisions pd
LEFT JOIN evidence_snapshots es ON pd.evidence_snapshot_id = es.snapshot_id
WHERE pd.rollout_plan_id = '<your-plan-id>'
ORDER BY pd.decided_at;

Resolving degraded status

The degraded status does NOT block stage promotion or rollout progress. It is an audit trail completeness flag. To clear it:

  1. Identify which stages are missing evidence (query above).

  2. If the stage is still active and evidence can be re-gathered, re-evaluate the stage. The next successful Promote() call will write a fresh evidence snapshot.

  3. If the stage is terminal and the gap is acceptable (e.g. during a known partition), document the gap. The audit_status will remain degraded until the process restarts and the counter resets (it is in-process; not persisted to the DB).

If degraded status persists across multiple restarts and the counter keeps rising, check for a persistent write failure path (e.g. the promotion_decisions table is locked or the primary is rejecting writes from the leader’s connection).
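
A simple poller makes a status regression visible without a dashboard (a sketch; wire the echo into whatever alerting you use):

# Emit a line whenever audit status is anything other than "complete".
while true; do
  status=$(curl -s http://localhost:8888/v1/health/audit | jq -r .audit_status)
  if [ "$status" != "complete" ]; then
    echo "$(date -u +%FT%TZ) audit_status=$status"
  fi
  sleep 60
done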


3. Legacy Evidence Caveats

Background

legacy_limited audit status applies when the HA workplan §15.1 pre-condition is detected: stages that were promoted before the HA schema was applied have no evidence_snapshots rows and no promotion_decisions rows written by the epoch-fenced path.

These stages are correctly promoted in the legacy sense (the SQLite-based promoter confirmed them), but the append-only evidence trail for those decisions does not exist in PostgreSQL.

Detecting legacy stages

curl -sf http://localhost:8888/v1/health/audit | jq .
# {
#   "audit_status": "legacy_limited",
#   "caveat": "Pre-HA stages exist in the promotion horizon. Evidence integrity
#              is advisory for these stages. See runbook §3."
# }

Metric:

cp.recovery.legacy_events_in_horizon   — count of pre-HA events in the replay window

Query to enumerate legacy stages:

-- Stages in stage_status that have no evidence snapshot linked to them.
-- These are candidates for legacy evidence review.
SELECT ss.plan_id, ss.stage_id, ss.phase, ss.promoted_at
FROM stage_status ss
LEFT JOIN promotion_decisions pd
  ON pd.rollout_plan_id = ss.plan_id AND pd.stage_id = ss.stage_id
WHERE pd.decision_id IS NULL
  AND ss.phase IN ('promoted', 'halted', 'rollback')
ORDER BY ss.promoted_at;

Operator decision tree

Is the legacy stage in a terminal phase (promoted/halted/rollback)?
│
├─ YES: Is the stage outcome correct per external ground truth
│       (e.g. fleet telemetry, release tracking)?
│       ├─ YES: Acknowledge the gap. No corrective action needed.
│       │       The rollout is correct; audit trail is incomplete for this stage only.
│       └─ NO:  Investigate using fleet telemetry and rollout event history.
│               If the promotion was erroneous, use HaltStage or RollbackStage
│               on any still-active successor stages.
│
└─ NO:  The stage is still in progress. The next promotion via the epoch-fenced
        path (Promote/HaltStage/RollbackStage) will create full evidence.
        No corrective action needed.

Migrating from SQLite

If you are migrating an existing deployment from the SQLite backend, the migration tool (autonomy-orchestrator migrate) copies events and plan state but cannot retroactively create promotion_decisions or evidence_snapshots for stages promoted under SQLite.

This is expected and documented (workplan §15 migration caveats). After migration:


4. Local Reproducibility Lab

When you need to re-run HA status and manual failover verification locally, use a disposable PostgreSQL primary + standby lab instead of the demo compose stack. The demo environment does not provide a streaming standby, so it cannot exercise GET /v1/ha/status, GET /v1/ha/quorum, or the PR-12/PR-24 HA safety checks honestly.

Prerequisites

  • Docker available locally

  • Go 1.25.7

  • the CLI binary built from this repo

  • the repo-managed HA lab helper at scripts/labs/orchestrator_ha_server.go, which exposes:

    • GET /v1/ha/status

    • GET /v1/ha/quorum

    • POST /v1/ha/failover

    • the health endpoints from pgstore.HealthServer

    • --quorum-monitor-interval for faster local quorum-transition capture

Build the helper binaries

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local
export GOCACHE=/tmp/go-build-verify
export GOTMPDIR=/tmp/go-tmp-verify

go build -o /tmp/autonomy ./cmd/autonomy
go build -o /tmp/pr12_ha_server ./scripts/labs/orchestrator_ha_server.go

Bring up a temporary primary + standby

docker network create pr12-ha-net
docker volume create pr12ha-primary-data
docker volume create pr12ha-standby-data

docker run -d --name pr12ha-primary \
  --network pr12-ha-net \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_DB=autonomy \
  -v pr12ha-primary-data:/var/lib/postgresql/data \
  postgres:16 \
  -c wal_level=replica \
  -c max_wal_senders=10 \
  -c max_replication_slots=10 \
  -c hot_standby=on

docker exec pr12ha-primary psql -U postgres -d postgres -c \
  "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replica';"

docker exec pr12ha-primary bash -lc \
  "echo 'host replication replicator all scram-sha-256' >> /var/lib/postgresql/data/pg_hba.conf && psql -U postgres -d postgres -c 'SELECT pg_reload_conf();'"

docker run --rm --network pr12-ha-net \
  -v pr12ha-standby-data:/var/lib/postgresql/data \
  postgres:16 bash -lc '
    set -euo pipefail
    rm -rf /var/lib/postgresql/data/*
    export PGPASSWORD=replica
    pg_basebackup \
      -d "host=pr12ha-primary port=5432 user=replicator password=replica dbname=postgres application_name=standby1" \
      -D /var/lib/postgresql/data \
      -Fp -Xs -P -R -C -S standby1_slot
  '

docker run -d --name pr12ha-standby \
  --network pr12-ha-net \
  -e POSTGRES_PASSWORD=postgres \
  -v pr12ha-standby-data:/var/lib/postgresql/data \
  postgres:16 \
  -c hot_standby=on

docker exec pr12ha-primary psql -U postgres -d postgres -c \
  "ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';"
docker exec pr12ha-primary psql -U postgres -d postgres -c \
  "ALTER SYSTEM SET synchronous_commit = 'remote_apply';"
docker exec pr12ha-primary psql -U postgres -d postgres -c \
  "SELECT pg_reload_conf();"

Confirm replication is healthy:

docker exec pr12ha-primary psql -U postgres -d postgres -Atqc \
  "SELECT application_name || '|' || state || '|' || sync_state FROM pg_stat_replication;"

Expected:

standby1|streaming|sync
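
If you script the lab end to end, a short wait loop avoids racing the standby's first connection:

# Block until exactly one streaming synchronous standby is attached.
until docker exec pr12ha-primary psql -U postgres -d postgres -Atqc \
  "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming' AND sync_state = 'sync';" \
  | grep -qx 1; do
  sleep 2
done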

Run two local HA nodes

PRIMARY_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pr12ha-primary)

Terminal 1:

/tmp/pr12_ha_server \
  --postgres-url "postgres://postgres:postgres@${PRIMARY_IP}:5432/autonomy?sslmode=disable" \
  --holder-id cp-node-1:18088 \
  --listen 127.0.0.1:18088 \
  --min-sync-replicas 1

Terminal 2:

/tmp/pr12_ha_server \
  --postgres-url "postgres://postgres:postgres@${PRIMARY_IP}:5432/autonomy?sslmode=disable" \
  --holder-id cp-node-2:18089 \
  --listen 127.0.0.1:18089 \
  --min-sync-replicas 1

Exercise the operator flow

/tmp/autonomy ha status \
  --orchestrator-url http://127.0.0.1:18088
/tmp/autonomy ha failover trigger \
  --orchestrator-url http://127.0.0.1:18088 \
  --operator local-verify \
  --reason "PR-12 local verification"
/tmp/autonomy ha status \
  --orchestrator-url http://127.0.0.1:18089

Success criteria:

  • before failover, the leader reports Write-Ready: true

  • replication reports one synchronous standby

  • failover returns a successful graceful resignation

  • the other node acquires the next epoch

  • post-failover status remains Write-Ready: true
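
The epoch-advancement criterion can also be asserted from a script using the documented /v1/health/leader fields (a sketch; ports match the two terminals above):

# Record the epoch before failover, trigger it, then confirm the other node
# acquired a strictly greater epoch.
before=$(curl -s http://127.0.0.1:18088/v1/health/leader | jq .current_epoch)
/tmp/autonomy ha failover trigger \
  --orchestrator-url http://127.0.0.1:18088 \
  --operator local-verify \
  --reason "scripted epoch check"
sleep 2
after=$(curl -s http://127.0.0.1:18089/v1/health/leader | jq .current_epoch)
if [ "$after" -gt "$before" ]; then
  echo "epoch advanced: $before -> $after"
else
  echo "FAIL: epoch did not advance ($before -> $after)" >&2
fi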

Tear down

docker rm -f pr12ha-primary pr12ha-standby || true
docker volume rm pr12ha-primary-data pr12ha-standby-data || true
docker network rm pr12-ha-net || true

Backup and restore extension

The same lab can also verify the PR-13 backup workflow honestly, because it provides a real primary plus synchronous standby instead of a single-node demo database.

Use a local helper server built from the current branch that exposes the HA health routes and can enable maintenance mode for restore verification. The helper used to gather local verification evidence was a small wrapper around:

  • pgstore.Open

  • PGLeaderElector

  • HealthServer.RegisterRoutes

  • HealthServer.WithMaintenanceMode(true) for the restore phase

Recommended verification sequence:

  1. Start the helper in normal mode against the primary.

  2. Create a probe row in PostgreSQL that you can mutate and later verify after restore (a sketch follows this list).

  3. Run:

    /tmp/autonomy ha backup create \
      --orchestrator-url http://127.0.0.1:18088 \
      --operator local-verify \
      --reason "PR-13 local verification" \
      --backup-id backup-pr13-local \
      --output-dir /tmp/pr13-backups
    
  4. Validate the backup file:

    xxd -l 5 /tmp/pr13-backups/backup-pr13-local.dump
    # expected magic header: PGDMP
    
    docker cp /tmp/pr13-backups/backup-pr13-local.dump pr12ha-primary:/tmp/backup-pr13-local.dump
    docker exec pr12ha-primary pg_restore -l /tmp/backup-pr13-local.dump
    docker exec pr12ha-primary rm -f /tmp/backup-pr13-local.dump
    
  5. Verify inventory metadata from the running node:

    /tmp/autonomy ha backup list \
      --orchestrator-url http://127.0.0.1:18088
    
  6. Mutate the probe row after the backup.

  7. Restart the helper in maintenance mode and run:

    /tmp/autonomy ha backup restore \
      --orchestrator-url http://127.0.0.1:18088 \
      --backup-id backup-pr13-local \
      --operator local-verify \
      --reason "PR-13 restore verification" \
      --confirm
    
  8. Query the probe row again to confirm it returned to the pre-backup value.
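
A minimal probe-row sketch for steps 2, 6, and 8 (the table name and values are illustrative; autonomy is the lab database created earlier):

# Step 2: create the probe row before taking the backup.
docker exec pr12ha-primary psql -U postgres -d autonomy -c \
  "CREATE TABLE IF NOT EXISTS restore_probe (k text PRIMARY KEY, v text);
   INSERT INTO restore_probe VALUES ('probe', 'before-backup')
   ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v;"

# Step 6: mutate it once the backup exists.
docker exec pr12ha-primary psql -U postgres -d autonomy -c \
  "UPDATE restore_probe SET v = 'after-backup' WHERE k = 'probe';"

# Step 8: after restore, expect the original value again.
docker exec pr12ha-primary psql -U postgres -d autonomy -Atc \
  "SELECT v FROM restore_probe WHERE k = 'probe';"   # expect: before-backup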

Notes:

  • Restore now requires explicit maintenance mode. This is intentional and matches the restore safety invariant: destructive restore must run only in an operator-managed maintenance workflow.

  • If backup_inventory appears empty after restore, check when the snapshot was taken. A backup created before its own inventory row was committed will restore the earlier inventory state even though the backup file itself is valid.

  • If you want a restored snapshot to include real inventory rows, use a two-backup sequence:

    1. create a prior backup to establish an inventory row

    2. capture the state you want to preserve

    3. create the main backup you intend to restore

    4. after restore, expect the prior row to remain and the restored backup’s own row to be absent if it was recorded after the snapshot boundary


5. Contention Monitoring

What is lock contention?

In the HA control-plane, “lock contention” refers to two sources:

  1. Advisory lock contention: Multiple CP pods attempting Campaign() simultaneously. Only one wins the session advisory lock. The others see ErrLockHeldByOther and back off.

  2. Row-level lock contention: EpochFence issues SELECT ... FOR UPDATE on the leadership_state singleton row. If multiple transactions contend for this row, they queue behind the lock holder.
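
A sketch of the row-fence pattern (illustrative SQL; the real logic lives in EpochFence):

# The FOR UPDATE lock on the singleton row serializes concurrent writers;
# comparing the locked epoch with the caller's cached epoch is what rejects
# stale-leader writes (counted by cp.authority.epoch_mismatch_total).
psql "$POSTGRES_URL" -c "
  BEGIN;
  SELECT epoch FROM leadership_state FOR UPDATE;
  -- the application aborts here if the locked epoch differs from its cache
  COMMIT;
"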

Key metrics

Metric                                                      Description
cp.leader.campaigns_total{result=success}                   Successful leadership acquisitions
cp.leader.campaigns_total{result=lock_held}                 Campaign attempts that found another leader
cp.leader.campaigns_total{result=failure}                   Campaign errors (DB failure, etc.)
cp.authority.epoch_mismatch_total                           EpochFence rejections (stale-epoch writes blocked)
cp.backend.lock_wait_us_sum / cp.backend.lock_wait_count    Advisory lock wait histogram
cp.audit.prior_epoch_closeout_missed                        Epochs that did not have a clean closeout written
cp.audit.prior_epoch_closeout_reconciled                    Epochs reconciled during recovery

Reading the campaign ratio

# Pseudocode over an AtomicMetrics.Snapshot() map: a high ratio means many
# lock_held results relative to success.
campaigns_success   = snapshot["cp.leader.campaigns_total{result=success}"]
campaigns_lock_held = snapshot["cp.leader.campaigns_total{result=lock_held}"]
ratio = campaigns_lock_held / (campaigns_success + campaigns_lock_held)

A ratio > 0.5 in steady state suggests the campaign interval is too short relative to the keepalive TTL, or there are more CP pods than expected contending for leadership. Check AUTONOMY_CP_REPLICAS and PGLeaderElectorConfig.KeepaliveInterval.

Lock wait histogram

The cp.backend.lock_wait_us_sum / cp.backend.lock_wait_count pair gives the mean advisory lock wait time per campaign:

mean_wait_us = lock_wait_us_sum / lock_wait_count
# Alert if mean_wait_us > 500_000 (500ms) sustained over 5 minutes.

High lock wait times indicate the primary is under heavy lock pressure. Check:

-- Active locks on the leadership_state table:
SELECT pid, mode, granted, query_start
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE relation = 'leadership_state'::regclass;

Epoch mismatch monitoring

cp.authority.epoch_mismatch_total increments each time a write transaction is rejected because the caller’s cached epoch does not match the durable epoch in leadership_state.

  • Expected: 0 in steady state; ≤1 per failover event.

  • Alert: > 2 per minute suggests a bug in epoch cache invalidation or an operator running a tool with a stale epoch handle.

Query recent epoch transitions:

SELECT epoch, holder_id, acquired_at, resigned_at,
       EXTRACT(EPOCH FROM (resigned_at - acquired_at)) AS tenure_seconds
FROM leader_epochs
ORDER BY epoch DESC
LIMIT 20;

Short tenure (< 30s) on many consecutive epochs indicates a flapping leader. Possible causes: network instability, keepalive timeout too aggressive, or the primary PostgreSQL server restarting frequently.
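
To quantify flapping, count short tenures over recent history (a sketch):

# Several short tenures within the last 20 epochs indicate a flapping leader.
psql "$POSTGRES_URL" -c "
  SELECT count(*) AS short_tenures
  FROM (SELECT resigned_at - acquired_at AS tenure
        FROM leader_epochs
        ORDER BY epoch DESC
        LIMIT 20) t
  WHERE t.tenure < interval '30 seconds';
"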

Prior-epoch closeout

cp.audit.prior_epoch_closeout_missed      — epochs where closeout INSERT was skipped
cp.audit.prior_epoch_closeout_reconciled  — epochs subsequently reconciled in recovery

In normal operation, missed should be 0. A non-zero value means a leader acquired the lock and started a new epoch before the prior epoch’s leader_epochs.resigned_at was written (hard crash or network partition between advisory lock release and DB write).

reconciled increments when recovery detects and back-fills these rows. If missed is growing faster than reconciled, recovery is not keeping up. Check:

curl -sf http://localhost:8888/v1/health/quorum | jq '{epoch_missed: .audit_closeout_missed, epoch_reconciled: .audit_closeout_reconciled}'

Appendix: Metric Key Reference

All metrics are exposed via the AtomicMetrics.Snapshot() method. The canonical key names are listed here for alert configuration.

Counters

cp.leader.campaigns_total{result=success}
cp.leader.campaigns_total{result=lock_held}
cp.leader.campaigns_total{result=failure}
cp.leader.keepalive_failures_total
cp.authority.epoch_mismatch_total
cp.audit.prior_epoch_closeout_missed
cp.audit.prior_epoch_closeout_reconciled
cp.audit.deferred_decision_write_failed
cp.audit.deferred_decision_write_retried
cp.promoter.evidence_write_failures_total
cp.promoter.decisions_total{outcome=promoted}
cp.promoter.decisions_total{outcome=blocked}
cp.promoter.decisions_total{outcome=rollback_triggered}
cp.promoter.decisions_total{outcome=deferred_insufficient_history}
cp.outbox.purge_total
cp.leader.read_write_state_transition_total{not_leader:leader}
cp.leader.read_write_state_transition_total{leader:not_leader}

Gauges

cp.leader.epoch                    — current durable epoch (0 when not leader)
cp.leader.session_lock_held        — 1 = this node holds the advisory lock, 0 = does not
cp.backend.write_ready             — 1 = write-authority conditions met
cp.backend.read_ready              — 1 = DB reachable
cp.backend.sync_replica_count      — current sync replica count
cp.backend.connected_to_primary    — 1 = connected to PostgreSQL primary
cp.recovery.legacy_events_in_horizon — count of pre-HA events in replay window
cp.audit.status{status=complete}   — 1 when audit is complete
cp.audit.status{status=degraded}   — 1 when deferred writes failed
cp.audit.status{status=legacy_limited} — 1 when pre-HA stages exist

Histograms (sum + count pairs)

cp.recovery.startup_scan_duration_us_sum    — total recovery scan time (µs)
cp.recovery.startup_scan_duration_count     — number of recovery scans
cp.recovery.events_replayed                 — total events replayed across all scans
cp.outbox.dispatch_lag_us_sum               — total outbox dispatch lag (µs)
cp.outbox.dispatch_lag_count                — number of dispatched outbox messages
cp.backend.lock_wait_us_sum                 — total advisory lock wait (µs)
cp.backend.lock_wait_count                  — number of lock wait observations

Generated for workplan v0.7 PR-7 (Observability + Audit Trail). Last updated 2026-03-09.