Operations Runbook
Last Updated: April 2026
This runbook covers common operational scenarios, emergency procedures, and troubleshooting steps for the BattlesBit backend.
Emergency Procedures
Stop All Trading (Emergency Stop)
When to use: Market anomaly, suspected exploit, critical bug in position calculation.
Steps:
- Execute the `emergencyStop` mutation (requires admin permission):

```graphql
mutation {
  emergencyStop
}
```

- This sets `trading_locked = true` on ALL open matches platform-wide.
- Existing positions remain open, but no new positions can be created.
- To resume trading, unlock each match individually:

```graphql
mutation {
  lockTrading(matchID: "...", locked: false)
}
```

Verification: Query open matches and confirm `tradingLocked` is `true` on all of them.
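After a platform-wide stop, re-enabling matches one at a time is tedious. The sketch below scripts the per-match unlock loop, assuming a standard GraphQL-over-HTTP endpoint; `GRAPHQL_URL`, `ADMIN_TOKEN`, and the use of `jq` are assumptions about your tooling, not part of the platform.

```sh
# Sketch: unlock trading on a list of match IDs read from stdin, one per line.
# GRAPHQL_URL and ADMIN_TOKEN are deployment-specific assumptions; requires jq.
build_unlock_mutation() {
  # Emit the lockTrading mutation for a single match ID.
  printf 'mutation { lockTrading(matchID: "%s", locked: false) }' "$1"
}

unlock_matches() {
  # Read match IDs from stdin and POST each unlock mutation.
  while IFS= read -r match_id; do
    jq -cn --arg q "$(build_unlock_mutation "$match_id")" '{query: $q}' |
      curl -sf "$GRAPHQL_URL" \
        -H "Authorization: Bearer $ADMIN_TOKEN" \
        -H 'Content-Type: application/json' \
        --data @-
  done
}
```

Feed it match IDs one per line, e.g. piped from a query of open matches.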
Drain NATS Consumers Before Restart
When to use: Before restarting NATS or worker pods to avoid message loss.
Steps:
- Check pending messages per consumer:

```sh
nats consumer info POSITIONS arena-worker
nats consumer info WITHDRAWALS withdraw-worker
nats consumer info ACHIEVEMENTS achievement-worker
```

- Wait for the `Unprocessed` count to reach 0 before stopping workers.
- If messages are stuck (high redelivery count), inspect the next message without acknowledging it:

```sh
nats consumer next POSITIONS arena-worker --count 1 --no-ack
```

- Once drained, gracefully restart the worker pods:

```sh
kubectl rollout restart deployment/arena-worker
```

Database Issues
Database Is Down
Symptoms:
- All GraphQL queries return 500 errors
- Service logs show `connection refused` or `too many connections`
- Health check endpoint returns unhealthy
What breaks:
- All reads and writes fail
- Positions in-flight may miss close/fill events (NATS will retry via Nak)
- WebSocket subscriptions disconnect
Recovery:
- Check PostgreSQL status:

```sh
kubectl get pods -l app=postgres
pg_isready -h $DB_HOST -p 5432
```

- If the pod is down, check events:

```sh
kubectl describe pod postgres-0
kubectl logs postgres-0 --previous
```

- Once Postgres is back, services auto-reconnect (the Ent connection pool handles retries).
- Check for stuck NATS messages that failed during the outage and let consumers reprocess them.
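The stuck-message check can be scripted. This is a sketch that parses `nats consumer info` output and waits for a consumer to drain; the `Unprocessed Messages:` line format is an assumption and may vary across nats CLI versions.

```sh
# Sketch: wait until a consumer has no unprocessed messages.
# Assumes `nats consumer info` prints a line like "Unprocessed Messages: 42";
# verify against your nats CLI version's actual output.
unprocessed_count() {
  grep -i 'Unprocessed Messages' | grep -o '[0-9][0-9]*' | head -n 1
}

wait_for_drain() {
  stream=$1 consumer=$2
  while :; do
    n=$(nats consumer info "$stream" "$consumer" | unprocessed_count)
    [ "${n:-0}" -eq 0 ] && break
    echo "waiting: $n unprocessed on $stream/$consumer"
    sleep 5
  done
}
```

Run `wait_for_drain POSITIONS arena-worker` before restarting worker pods.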
NATS Issues
NATS Disconnected
Symptoms:
- Arena worker stops processing positions (no new fills, no PnL updates)
- Achievement events not triggering
- Withdrawal worker stalled
- Logs show `nats: no servers available for connection`
Auto-reconnect: The NATS client has built-in reconnect with exponential backoff. It will automatically attempt to reconnect.
Manual recovery:
- Check NATS server health:

```sh
kubectl get pods -l app=nats
nats server info
```

- If NATS is healthy but workers are not reconnecting, restart the workers:

```sh
kubectl rollout restart deployment/arena-worker
kubectl rollout restart deployment/withdrawal-worker
```

- After reconnect, check stream and consumer status:

```sh
nats stream ls
nats consumer ls POSITIONS
```

Market Data Issues
Binance WebSocket Dropped
Symptoms:
- Stale prices (price timestamps stop advancing)
- PnL calculations use outdated prices
- Logs show `websocket: close` or `dial tcp: connect: connection refused`
- Positions may miss SL/TP triggers
Built-in recovery: The Binance client reconnects with exponential backoff (1s up to a 30s cap); the backoff resets automatically once a connection has stayed up for 1 minute.
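The reconnect schedule described above can be modeled as a quick sanity check (illustrative only; the real logic is built into the client):

```sh
# Sketch of the reconnect schedule: the delay doubles from 1s up to a 30s cap.
# The reset-after-1-minute behavior is handled by the client once a connection
# stays up; this only models the delay progression.
backoff_delay() {
  attempt=$1                        # zero-based attempt number
  delay=$(( 1 << attempt ))         # 1, 2, 4, 8, 16, 32, ...
  [ "$delay" -gt 30 ] && delay=30   # cap at 30 seconds
  echo "$delay"
}
```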
Manual intervention:
- Check market service logs for reconnect attempts:

```sh
kubectl logs -l app=web --tail=100 | grep -i binance
```

- If reconnection is failing repeatedly, check Binance API status at https://www.binance.com/en/support/announcement.
- Verify network connectivity from the pod:

```sh
kubectl exec -it deploy/web -- curl -s https://api.binance.com/api/v3/ping
```

- If Binance is down, consider triggering the `emergencyStop` mutation to prevent trades against stale prices.
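To confirm the stale-price symptom quickly, compare the last tick timestamp against a freshness threshold. A minimal sketch; the 10-second default is an assumption, tune it to your feed:

```sh
# Sketch: return success (exit 0) if the last tick is older than max_age.
# Timestamps are epoch seconds; the 10s default threshold is an assumption.
price_is_stale() {
  last_tick=$1 now=$2 max_age=${3:-10}
  [ $(( now - last_tick )) -gt "$max_age" ]
}
```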
Migration Issues
Bad Migration
Symptoms:
- Service fails to start after deployment
- Logs show `migration failed` or schema errors
- Ent queries return unexpected errors
Recovery:
- Check which migration was applied:

```sh
atlas migrate status --env production
```

- Roll back the last migration:

```sh
atlas migrate down --env production
```

- If Atlas rollback is unavailable, apply manual SQL:

```sh
psql $DATABASE_URL < manual_rollback.sql
```

- Redeploy the previous service version.
- Fix the migration, test in staging, then redeploy.
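Before rolling back, it is worth confirming the currently applied version in a script. A sketch that parses `atlas migrate status` output; the `Current Version:` line format is an assumption about your Atlas CLI version:

```sh
# Sketch: extract the current migration version from `atlas migrate status`.
# Assumes the output contains a line like "Current Version: 20240101120000";
# verify against your Atlas CLI's actual output.
current_migration_version() {
  grep -i 'Current Version' | awk '{ print $NF }'
}

assert_version_before_rollback() {
  # Refuse to proceed unless the applied version matches what we expect.
  expected=$1
  actual=$(atlas migrate status --env production | current_migration_version)
  if [ "$actual" != "$expected" ]; then
    echo "refusing rollback: current version is $actual, expected $expected" >&2
    return 1
  fi
}
```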
Debugging Scenarios
User Reports Wrong Balance
Investigation steps:
- Query the user’s wallet:
```graphql
query {
  user(id: "USER_ID") {
    balance
    wallets {
      id
      balance
      locked
      currency
    }
  }
}
```

- Check `trade_audit_logs` for recent events:

```sql
SELECT * FROM trade_audit_logs
WHERE user_id = 'USER_ID'
ORDER BY created_at DESC
LIMIT 20;
```

- Check the participant's `v_balance` in active matches:
```sql
SELECT gmp.v_balance, gmp.peak_v_balance, gmp.is_disqualified, gm.status
FROM game_match_participants gmp
JOIN game_matches gm ON gm.id = gmp.game_match_id
WHERE gmp.user_id = 'USER_ID'
ORDER BY gmp.created_at DESC
LIMIT 5;
```

- Check transaction history for deposits/withdrawals:

```sql
SELECT * FROM transactions
WHERE wallet_id IN (SELECT id FROM wallets WHERE user_id = 'USER_ID')
ORDER BY created_at DESC
LIMIT 20;
```

Achievement Not Triggering
Investigation steps:
- Check the NATS `ACHIEVEMENTS` stream for pending messages:

```sh
nats stream info ACHIEVEMENTS
nats consumer info ACHIEVEMENTS achievement-worker
```

- Look for consumer lag (unprocessed messages). High lag means the worker is behind.
- Check worker logs for processing errors:

```sh
kubectl logs -l app=achievement-worker --tail=100
```

- Verify the achievement rule exists and is active:

```sql
SELECT * FROM achievement_rules
WHERE achievement_id = 'ACHIEVEMENT_ID';
```

- Check if the user already has the achievement:

```sql
SELECT * FROM user_achievements
WHERE user_id = 'USER_ID' AND achievement_id = 'ACHIEVEMENT_ID';
```

High Latency
Investigation steps:
- Check Prometheus metrics for GraphQL operation times:

```promql
histogram_quantile(0.95, rate(graphql_operation_duration_seconds_bucket[5m]))
```

- Check Redis rate limit keys for throttling (note: `KEYS` blocks Redis and is O(N); prefer `SCAN` on large production datasets):

```sh
redis-cli KEYS "rate_limit:*" | head -20
```

- Check database connection pool saturation:

```sql
SELECT count(*) FROM pg_stat_activity WHERE datname = 'postgres';
```

- Check NATS consumer lag (slow consumers cause backpressure):

```sh
nats consumer info POSITIONS arena-worker
```

- Check for slow queries in PostgreSQL:

```sql
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
```

- If latency is isolated to specific operations, check the corresponding service logs and consider scaling the relevant deployment.
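For the pool-saturation step, the raw counts can be turned into a warning signal. In this sketch, the active count comes from the `pg_stat_activity` query above, the maximum comes from PostgreSQL's `max_connections` setting, and the 80% warning threshold is an assumption:

```sh
# Sketch: compute connection-pool saturation from pg_stat_activity counts.
# Arguments: active connection count, max_connections. 80% threshold assumed.
pool_saturation_pct() {
  active=$1 max_conn=$2
  echo $(( active * 100 / max_conn ))
}

check_pool() {
  pct=$(pool_saturation_pct "$1" "$2")
  if [ "$pct" -ge 80 ]; then
    echo "WARN: pool ${pct}% saturated"
  else
    echo "pool ${pct}% saturated"
  fi
}
```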