Operations Runbook
Last Updated: April 2026
This runbook covers common operational scenarios, emergency procedures, and troubleshooting steps for the BattlesBit backend.
Emergency Procedures
Stop All Trading (Emergency Stop)
When to use: Market anomaly, suspected exploit, critical bug in position calculation.
Steps:
- Execute the `emergencyStop` mutation (requires admin permission):

```graphql
mutation {
  emergencyStop
}
```

- This sets `trading_locked = true` on ALL open matches platform-wide.
- Existing positions remain open, but no new positions can be created.
- To resume trading, unlock each match individually:

```graphql
mutation {
  lockTrading(matchID: "...", locked: false)
}
```

Verification: Query open matches and confirm `tradingLocked` is `true` on all of them.
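After a platform-wide stop, re-enabling matches one at a time is tedious. The sketch below scripts the per-match unlock loop, assuming a standard GraphQL-over-HTTP endpoint; `GRAPHQL_URL`, `ADMIN_TOKEN`, and the use of `jq` are assumptions about your tooling, not part of the platform.

```sh
# Sketch: unlock trading on a list of match IDs read from stdin, one per line.
# GRAPHQL_URL and ADMIN_TOKEN are deployment-specific assumptions; requires jq.
build_unlock_mutation() {
  # Emit the lockTrading mutation for a single match ID.
  printf 'mutation { lockTrading(matchID: "%s", locked: false) }' "$1"
}

unlock_matches() {
  # Read match IDs from stdin and POST each unlock mutation.
  while IFS= read -r match_id; do
    jq -cn --arg q "$(build_unlock_mutation "$match_id")" '{query: $q}' |
      curl -sf "$GRAPHQL_URL" \
        -H "Authorization: Bearer $ADMIN_TOKEN" \
        -H 'Content-Type: application/json' \
        --data @-
  done
}
```

Feed it match IDs one per line, e.g. piped from a query of open matches.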
Drain NATS Consumers Before Restart
When to use: Before restarting NATS or worker pods to avoid message loss.
Steps:
- Check pending messages per consumer:

```sh
nats consumer info POSITIONS arena-worker
nats consumer info WITHDRAWALS withdraw-worker
nats consumer info ACHIEVEMENTS achievement-worker
```

- Wait for the `Unprocessed` count to reach 0 before stopping workers.
- If messages are stuck (high redelivery count), inspect the next message without acknowledging it:

```sh
nats consumer next POSITIONS arena-worker --count 1 --no-ack
```

- Once drained, gracefully restart the worker pods:

```sh
kubectl rollout restart deployment/arena-worker
```

Database Issues
Database Is Down
Symptoms:
- All GraphQL queries return 500 errors
- Service logs show `connection refused` or `too many connections`
- Health check endpoint returns unhealthy
What breaks:
- All reads and writes fail
- Positions in-flight may miss close/fill events (NATS will retry via Nak)
- WebSocket subscriptions disconnect
Recovery:
- Check PostgreSQL status:

```sh
kubectl get pods -l app=postgres
pg_isready -h $DB_HOST -p 5432
```

- If the pod is down, check events:

```sh
kubectl describe pod postgres-0
kubectl logs postgres-0 --previous
```

- Once Postgres is back, services auto-reconnect (the Ent connection pool handles retries).
- Check for stuck NATS messages that failed during the outage and let consumers reprocess them.
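The stuck-message check can be scripted. This is a sketch that parses `nats consumer info` output and waits for a consumer to drain; the `Unprocessed Messages:` line format is an assumption and may vary across nats CLI versions.

```sh
# Sketch: wait until a consumer has no unprocessed messages.
# Assumes `nats consumer info` prints a line like "Unprocessed Messages: 42";
# verify against your nats CLI version's actual output.
unprocessed_count() {
  grep -i 'Unprocessed Messages' | grep -o '[0-9][0-9]*' | head -n 1
}

wait_for_drain() {
  stream=$1 consumer=$2
  while :; do
    n=$(nats consumer info "$stream" "$consumer" | unprocessed_count)
    [ "${n:-0}" -eq 0 ] && break
    echo "waiting: $n unprocessed on $stream/$consumer"
    sleep 5
  done
}
```

Run `wait_for_drain POSITIONS arena-worker` before restarting worker pods.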
NATS Issues
NATS Disconnected
Symptoms:
- Arena worker stops processing positions (no new fills, no PnL updates)
- Achievement events not triggering
- Withdrawal worker stalled
- Logs show `nats: no servers available for connection`
Auto-reconnect: The NATS client has built-in reconnect with exponential backoff. It will automatically attempt to reconnect.
Manual recovery:
- Check NATS server health:

```sh
kubectl get pods -l app=nats
nats server info
```

- If NATS is healthy but workers are not reconnecting, restart the workers:

```sh
kubectl rollout restart deployment/arena-worker
kubectl rollout restart deployment/withdrawal-worker
```

- After reconnect, check stream and consumer status:

```sh
nats stream ls
nats consumer ls POSITIONS
```

Market Data Issues
Binance WebSocket Dropped
Symptoms:
- Stale prices (price timestamps stop advancing)
- PnL calculations use outdated prices
- Logs show `websocket: close` or `dial tcp: connect: connection refused`
- Positions may miss SL/TP triggers
Built-in recovery: The Binance client reconnects with exponential backoff (1s up to a 30s cap); the backoff resets automatically once a connection has stayed up for 1 minute.
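The reconnect schedule described above can be modeled as a quick sanity check (illustrative only; the real logic is built into the client):

```sh
# Sketch of the reconnect schedule: the delay doubles from 1s up to a 30s cap.
# The reset-after-1-minute behavior is handled by the client once a connection
# stays up; this only models the delay progression.
backoff_delay() {
  attempt=$1                        # zero-based attempt number
  delay=$(( 1 << attempt ))         # 1, 2, 4, 8, 16, 32, ...
  [ "$delay" -gt 30 ] && delay=30   # cap at 30 seconds
  echo "$delay"
}
```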
Manual intervention:
- Check market service logs for reconnect attempts:

```sh
kubectl logs -l app=web --tail=100 | grep -i binance
```

- If reconnection is failing repeatedly, check Binance API status at https://www.binance.com/en/support/announcement.
- Verify network connectivity from the pod:

```sh
kubectl exec -it deploy/web -- curl -s https://api.binance.com/api/v3/ping
```

- If Binance is down, consider triggering the `emergencyStop` mutation to prevent trades against stale prices.
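To confirm the stale-price symptom quickly, compare the last tick timestamp against a freshness threshold. A minimal sketch; the 10-second default is an assumption, tune it to your feed:

```sh
# Sketch: return success (exit 0) if the last tick is older than max_age.
# Timestamps are epoch seconds; the 10s default threshold is an assumption.
price_is_stale() {
  last_tick=$1 now=$2 max_age=${3:-10}
  [ $(( now - last_tick )) -gt "$max_age" ]
}
```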
Migration Issues
Bad Migration
Symptoms:
- Service fails to start after deployment
- Logs show `migration failed` or schema errors
- Ent queries return unexpected errors
Recovery:
- Check which migration was applied:

```sh
atlas migrate status --env production
```

- Roll back the last migration:

```sh
atlas migrate down --env production
```

- If Atlas rollback is unavailable, apply manual SQL:

```sh
psql $DATABASE_URL < manual_rollback.sql
```

- Redeploy the previous service version.
- Fix the migration, test in staging, then redeploy.
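Before rolling back, it is worth confirming the currently applied version in a script. A sketch that parses `atlas migrate status` output; the `Current Version:` line format is an assumption about your Atlas CLI version:

```sh
# Sketch: extract the current migration version from `atlas migrate status`.
# Assumes the output contains a line like "Current Version: 20240101120000";
# verify against your Atlas CLI's actual output.
current_migration_version() {
  grep -i 'Current Version' | awk '{ print $NF }'
}

assert_version_before_rollback() {
  # Refuse to proceed unless the applied version matches what we expect.
  expected=$1
  actual=$(atlas migrate status --env production | current_migration_version)
  if [ "$actual" != "$expected" ]; then
    echo "refusing rollback: current version is $actual, expected $expected" >&2
    return 1
  fi
}
```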
Debugging Scenarios
User Reports Wrong Balance
Investigation steps:
- Query the user’s wallet:
```graphql
query {
  user(id: "USER_ID") {
    balance
    wallets {
      id
      balance
      locked
      currency
    }
  }
}
```

- Check `trade_audit_logs` for recent events:

```sql
SELECT * FROM trade_audit_logs
WHERE user_id = 'USER_ID'
ORDER BY created_at DESC
LIMIT 20;
```

- Check the participant's `v_balance` in active matches:
```sql
SELECT gmp.v_balance, gmp.peak_v_balance, gmp.is_disqualified, gm.status
FROM game_match_participants gmp
JOIN game_matches gm ON gm.id = gmp.game_match_id
WHERE gmp.user_id = 'USER_ID'
ORDER BY gmp.created_at DESC
LIMIT 5;
```

- Check transaction history for deposits/withdrawals:

```sql
SELECT * FROM transactions
WHERE wallet_id IN (SELECT id FROM wallets WHERE user_id = 'USER_ID')
ORDER BY created_at DESC
LIMIT 20;
```

Achievement Not Triggering
Investigation steps:
- Check the NATS `ACHIEVEMENTS` stream for pending messages:

```sh
nats stream info ACHIEVEMENTS
nats consumer info ACHIEVEMENTS achievement-worker
```

- Look for consumer lag (unprocessed messages). High lag means the worker is behind.
- Check worker logs for processing errors:

```sh
kubectl logs -l app=achievement-worker --tail=100
```

- Verify the achievement rule exists and is active:

```sql
SELECT * FROM achievement_rules
WHERE achievement_id = 'ACHIEVEMENT_ID';
```

- Check if the user already has the achievement:

```sql
SELECT * FROM user_achievements
WHERE user_id = 'USER_ID' AND achievement_id = 'ACHIEVEMENT_ID';
```

High Latency
Investigation steps:
- Check Prometheus metrics for GraphQL operation times:

```promql
histogram_quantile(0.95, rate(graphql_operation_duration_seconds_bucket[5m]))
```

- Check Redis rate limit keys for throttling (note: `KEYS` blocks Redis and is O(N); prefer `SCAN` on large production datasets):

```sh
redis-cli KEYS "rate_limit:*" | head -20
```

- Check database connection pool saturation:

```sql
SELECT count(*) FROM pg_stat_activity WHERE datname = 'postgres';
```

- Check NATS consumer lag (slow consumers cause backpressure):

```sh
nats consumer info POSITIONS arena-worker
```

- Check for slow queries in PostgreSQL:

```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
```

- If latency is isolated to specific operations, check the corresponding service logs and consider scaling the relevant deployment.
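For the pool-saturation step, the raw counts can be turned into a warning signal. In this sketch, the active count comes from the `pg_stat_activity` query above, the maximum comes from PostgreSQL's `max_connections` setting, and the 80% warning threshold is an assumption:

```sh
# Sketch: compute connection-pool saturation from pg_stat_activity counts.
# Arguments: active connection count, max_connections. 80% threshold assumed.
pool_saturation_pct() {
  active=$1 max_conn=$2
  echo $(( active * 100 / max_conn ))
}

check_pool() {
  pct=$(pool_saturation_pct "$1" "$2")
  if [ "$pct" -ge 80 ]; then
    echo "WARN: pool ${pct}% saturated"
  else
    echo "pool ${pct}% saturated"
  fi
}
```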