
Operations Runbook

Last Updated: April 2026

This runbook covers common operational scenarios, emergency procedures, and troubleshooting steps for the BattlesBit backend.


Emergency Procedures

Stop All Trading (Emergency Stop)

When to use: Market anomaly, suspected exploit, critical bug in position calculation.

Steps:

  1. Execute the emergencyStop mutation (requires admin permission):
     mutation { emergencyStop }
  2. This sets trading_locked = true on ALL open matches platform-wide.
  3. Existing positions remain open, but no new positions can be created.
  4. To resume trading, unlock each match individually:
     mutation { lockTrading(matchID: "...", locked: false) }

Verification: Query open matches and confirm tradingLocked is true on all of them.
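The verification step can be scripted. This is a sketch only: it assumes a GraphQL endpoint at $API_URL/graphql, an admin bearer token in $ADMIN_TOKEN, and a hypothetical openMatches query returning compact JSON — adjust all of these to the real schema and deployment.

```shell
# Count open matches that are NOT locked after an emergency stop.
# Expects the server's compact JSON on stdin; counts tradingLocked:false entries.
unlocked_count() {
  grep -o '"tradingLocked": *false' | wc -l
}

# $API_URL, $ADMIN_TOKEN and the openMatches query are assumptions, not the real API.
check_emergency_stop() {
  curl -s "$API_URL/graphql" \
    -H "Authorization: Bearer $ADMIN_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{"query":"query { openMatches { id tradingLocked } }"}' |
    unlocked_count   # expect 0 after emergencyStop
}
```

A non-zero count means some open match was created or unlocked after the stop and needs investigating.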


Drain NATS Consumers Before Restart

When to use: Before restarting NATS or worker pods to avoid message loss.

Steps:

  1. Check pending messages per consumer:
     nats consumer info POSITIONS arena-worker
     nats consumer info WITHDRAWALS withdraw-worker
     nats consumer info ACHIEVEMENTS achievement-worker
  2. Wait for the Unprocessed count to reach 0 before stopping workers.
  3. If messages are stuck (high redelivery count), inspect the next message without acknowledging it:
     nats consumer next POSITIONS arena-worker --count 1 --no-ack
  4. Once drained, gracefully restart the worker pods:
     kubectl rollout restart deployment/arena-worker
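The wait in step 2 can be scripted instead of watched by hand. A sketch, polling `nats consumer info --json` (whose payload includes a num_pending field for unprocessed messages) until it reports zero:

```shell
# Extract num_pending from `nats consumer info --json` output on stdin.
pending_from_json() {
  sed -n 's/.*"num_pending": *\([0-9][0-9]*\).*/\1/p'
}

# Block until STREAM/CONSUMER reports zero unprocessed messages.
wait_drained() {  # usage: wait_drained STREAM CONSUMER
  local stream=$1 consumer=$2 n
  while :; do
    n=$(nats consumer info "$stream" "$consumer" --json | pending_from_json)
    [ "${n:-0}" -eq 0 ] && break
    echo "waiting: $n unprocessed on $stream/$consumer"
    sleep 5
  done
}
```

Usage for a full drain before restart: call `wait_drained POSITIONS arena-worker`, then the same for WITHDRAWALS and ACHIEVEMENTS, before running the kubectl restart.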

Database Issues

Database Is Down

Symptoms:

  • All GraphQL queries return 500 errors
  • Service logs show connection refused or too many connections
  • Health check endpoint returns unhealthy

What breaks:

  • All reads and writes fail
  • Positions in-flight may miss close/fill events (NATS will retry via Nak)
  • WebSocket subscriptions disconnect

Recovery:

  1. Check PostgreSQL status:
     kubectl get pods -l app=postgres
     pg_isready -h $DB_HOST -p 5432
  2. If the pod is down, check events and previous logs:
     kubectl describe pod postgres-0
     kubectl logs postgres-0 --previous
  3. Once Postgres is back, services auto-reconnect (the Ent connection pool handles retries).
  4. Check for stuck NATS messages that failed during the outage and let consumers reprocess them.
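If workers need to be restarted after the outage, a small retry helper (a sketch, not part of the existing tooling) can gate the restart on Postgres actually accepting connections again:

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, up to ATTEMPTS times,
# sleeping DELAY seconds between tries; returns non-zero if it never succeeds.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# e.g. retry 30 2 pg_isready -h "$DB_HOST" -p 5432 \
#        && kubectl rollout restart deployment/arena-worker
```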

NATS Issues

NATS Disconnected

Symptoms:

  • Arena worker stops processing positions (no new fills, no PnL updates)
  • Achievement events not triggering
  • Withdrawal worker stalled
  • Logs show nats: no servers available for connection

Auto-reconnect: The NATS client has built-in reconnect with exponential backoff and will re-establish the connection automatically.

Manual recovery:

  1. Check NATS server health:
     kubectl get pods -l app=nats
     nats server info
  2. If NATS is healthy but workers are not reconnecting, restart the workers:
     kubectl rollout restart deployment/arena-worker
     kubectl rollout restart deployment/withdrawal-worker
  3. After reconnect, check stream and consumer status:
     nats stream ls
     nats consumer ls POSITIONS
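Deciding whether a worker has really stalled (step 2) is easier with two lag samples taken a poll interval apart. A sketch of that check, with the counts read manually or scripted from `nats consumer info`:

```shell
# A consumer looks stalled when pending is non-zero and has not decreased
# between two samples. BEFORE and AFTER are unprocessed-message counts.
is_stalled() {  # usage: is_stalled BEFORE AFTER
  [ "$2" -gt 0 ] && [ "$2" -ge "$1" ]
}

# e.g. is_stalled "$lag_t0" "$lag_t1" && kubectl rollout restart deployment/arena-worker
```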

Market Data Issues

Binance WebSocket Dropped

Symptoms:

  • Stale prices (price timestamps stop advancing)
  • PnL calculations use outdated prices
  • Logs show websocket: close or dial tcp: connect: connection refused
  • Positions may miss SL/TP triggers

Built-in recovery: The Binance client implements exponential backoff reconnect (1s to 30s max), with automatic backoff reset after 1 minute of sustained connection.
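For reference, the delay schedule described above works out as follows (a sketch of the arithmetic, not the actual client code):

```shell
# Delay in seconds before reconnect attempt n (0-based): 1, 2, 4, 8, 16,
# then capped at 30. The attempt counter resets to 0 once the connection
# has stayed up for 1 minute.
backoff_delay() {
  local n=$1
  [ "$n" -ge 5 ] && { echo 30; return 0; }
  echo $(( 1 << n ))
}
```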

Manual intervention:

  1. Check market service logs for reconnect attempts:
     kubectl logs -l app=web --tail=100 | grep -i binance
  2. If reconnection is failing repeatedly, check Binance API status at https://www.binance.com/en/support/announcement.
  3. Verify network connectivity from the pod:
     kubectl exec -it deploy/web -- curl -s https://api.binance.com/api/v3/ping
  4. If Binance is down, consider running the emergencyStop mutation to prevent trades at stale prices.

Migration Issues

Bad Migration

Symptoms:

  • Service fails to start after deployment
  • Logs show migration failed or schema errors
  • Ent queries return unexpected errors

Recovery:

  1. Check which migration was applied:
     atlas migrate status --env production
  2. Roll back the last migration:
     atlas migrate down --env production
  3. If Atlas rollback is unavailable, apply a manual SQL rollback:
     psql $DATABASE_URL < manual_rollback.sql
  4. Redeploy the previous service version.
  5. Fix the migration, test it in staging, then redeploy.

Debugging Scenarios

User Reports Wrong Balance

Investigation steps:

  1. Query the user’s wallet:
     query { user(id: "USER_ID") { balance wallets { id balance locked currency } } }
  2. Check trade_audit_logs for recent events:
     SELECT * FROM trade_audit_logs WHERE user_id = 'USER_ID' ORDER BY created_at DESC LIMIT 20;
  3. Check the participant's v_balance in active matches:
     SELECT gmp.v_balance, gmp.peak_v_balance, gmp.is_disqualified, gm.status
     FROM game_match_participants gmp
     JOIN game_matches gm ON gm.id = gmp.game_match_id
     WHERE gmp.user_id = 'USER_ID'
     ORDER BY gmp.created_at DESC LIMIT 5;
  4. Check transaction history for deposits and withdrawals:
     SELECT * FROM transactions
     WHERE wallet_id IN (SELECT id FROM wallets WHERE user_id = 'USER_ID')
     ORDER BY created_at DESC LIMIT 20;
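If the numbers still don't add up, a reconciliation query can compare each wallet's stored balance with the sum of its transactions. This is a sketch only: it assumes transactions carries a signed amount column, which may not match the real schema, and it interpolates the user ID unescaped, so use it for interactive debugging only.

```shell
# Emit SQL comparing wallet.balance to the sum of that wallet's transactions.
# usage: reconcile_sql USER_ID | psql "$DATABASE_URL"
# NOTE: the `amount` column is an assumption about the schema.
reconcile_sql() {
  cat <<SQL
SELECT w.id,
       w.balance,
       COALESCE(SUM(t.amount), 0) AS tx_sum,
       w.balance - COALESCE(SUM(t.amount), 0) AS drift
FROM wallets w
LEFT JOIN transactions t ON t.wallet_id = w.id
WHERE w.user_id = '$1'
GROUP BY w.id, w.balance;
SQL
}
```

A non-zero drift points at either a missing transaction row or a balance written outside the normal transaction path.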

Achievement Not Triggering

Investigation steps:

  1. Check the NATS ACHIEVEMENTS stream for pending messages:
     nats stream info ACHIEVEMENTS
     nats consumer info ACHIEVEMENTS achievement-worker
  2. Look for consumer lag (unprocessed messages). High lag means the worker is behind.
  3. Check worker logs for processing errors:
     kubectl logs -l app=achievement-worker --tail=100
  4. Verify the achievement rule exists and is active:
     SELECT * FROM achievement_rules WHERE achievement_id = 'ACHIEVEMENT_ID';
  5. Check whether the user already has the achievement:
     SELECT * FROM user_achievements WHERE user_id = 'USER_ID' AND achievement_id = 'ACHIEVEMENT_ID';

High Latency

Investigation steps:

  1. Check Prometheus metrics for GraphQL operation times (p95):
     histogram_quantile(0.95, rate(graphql_operation_duration_seconds_bucket[5m]))
  2. Check Redis rate-limit keys for throttling (use --scan rather than KEYS, which blocks Redis):
     redis-cli --scan --pattern "rate_limit:*" | head -20
  3. Check database connection pool saturation:
     SELECT count(*) FROM pg_stat_activity WHERE datname = 'postgres';
  4. Check NATS consumer lag (slow consumers cause backpressure):
     nats consumer info POSITIONS arena-worker
  5. Check for slow queries in PostgreSQL:
     SELECT pid, now() - query_start AS duration, query
     FROM pg_stat_activity
     WHERE state = 'active' AND now() - query_start > interval '5 seconds'
     ORDER BY duration DESC;
  6. If latency is isolated to specific operations, check the corresponding service logs and consider scaling the relevant deployment.
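Step 1 can be automated against the Prometheus HTTP API instant-query endpoint. A sketch — $PROM_URL is an assumption about where Prometheus is reachable:

```shell
# Pull the scalar out of a Prometheus instant-query JSON response on stdin,
# e.g. {"status":"success","data":{"result":[{"value":[1700000000,"0.25"]}]}}
prom_value() {
  sed -n 's/.*"value":\[[0-9.]*,"\([0-9.eE+-]*\)"\].*/\1/p'
}

# $PROM_URL is an assumed environment variable pointing at Prometheus.
graphql_p95() {
  curl -s "$PROM_URL/api/v1/query" --data-urlencode \
    'query=histogram_quantile(0.95, rate(graphql_operation_duration_seconds_bucket[5m]))' |
    prom_value
}
```

This makes it easy to alert or gate a deploy on the p95 crossing a threshold.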