Two production crypto trading bots running 24/7 on Kraken + Alpaca, backed by a 3-year backtest harness and weekly walk-forward analysis.
The owner ran one trading bot, monitored manually, with ad-hoc state files and no backtest discipline. The goal was to scale to multiple bots across venues without doubling the babysitting load.
Hard constraints: zero downtime tolerance on live capital, no manual restarts, no silent failures, and live execution must reproduce backtest results within ±2%.
Started with a hard separation between strategy logic, execution, and infrastructure. Each bot is a single Python process with a typed state file in SQLite, never JSON in production. State writes are atomic (write to .tmp, then rename), so a crash never leaves a corrupted state file behind.
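As a rough illustration of the write-to-.tmp-then-rename pattern, here is a minimal sketch. The BotState fields, the table layout, and the function names are assumptions made for the example, not the production schema. It builds a fresh SQLite file next to the real one and swaps it in with os.replace, which replaces the target in a single step on the same filesystem.

```python
import os
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class BotState:
    # Hypothetical fields; the real state schema is not shown in this write-up.
    symbol: str
    position: float
    last_order_id: str
    updated_at: float


def save_state(path: str, state: BotState) -> None:
    """Write the state to path + '.tmp', then rename it over the live file."""
    tmp = path + ".tmp"
    if os.path.exists(tmp):
        os.remove(tmp)  # leftover from an earlier interrupted write
    conn = sqlite3.connect(tmp)
    try:
        conn.execute(
            "CREATE TABLE state (symbol TEXT NOT NULL, position REAL NOT NULL, "
            "last_order_id TEXT NOT NULL, updated_at REAL NOT NULL)"
        )
        conn.execute("INSERT INTO state VALUES (?, ?, ?, ?)", astuple(state))
        conn.commit()
    finally:
        conn.close()
    # Flush the finished temp file to disk before it becomes the live file.
    fd = os.open(tmp, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # Readers see either the old file or the new one, never a partial write.
    os.replace(tmp, path)


def load_state(path: str) -> BotState:
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT symbol, position, last_order_id, updated_at FROM state"
        ).fetchone()
    finally:
        conn.close()
    return BotState(*row)
```

If the process dies before os.replace runs, the previous state file is still intact and the stale .tmp is discarded on the next write.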
All bots share a common watchdog daemon that runs systemd-style health checks every 30s. If a bot misses two heartbeats, the watchdog restarts it via systemctl and sends a Telegram alert with the last 40 log lines. No silent zombies.
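The heartbeat rule can be sketched roughly as follows. This is an illustration under assumptions, not the actual daemon: "misses two heartbeats" is read here as "last heartbeat older than 2 x 30s", the bot names double as systemd unit names, and the restart and alert actions are passed in as callables so the Telegram call stays out of the example.

```python
import subprocess
from collections import deque
from typing import Callable, Dict, List

HEARTBEAT_INTERVAL_S = 30   # health-check period from the text
MISSED_BEATS_LIMIT = 2      # restart after two missed heartbeats
ALERT_LOG_LINES = 40        # log lines attached to the alert


def stale_bots(last_beat: Dict[str, float], now: float) -> List[str]:
    """Return bots whose last heartbeat is older than the allowed gap."""
    cutoff = HEARTBEAT_INTERVAL_S * MISSED_BEATS_LIMIT
    return sorted(name for name, ts in last_beat.items() if now - ts > cutoff)


def tail(log_path: str, n: int = ALERT_LOG_LINES) -> List[str]:
    """Return the last n lines of a log file for the alert body."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def systemctl_restart(unit: str) -> None:
    """Restart a systemd unit; raises CalledProcessError if systemctl fails."""
    subprocess.run(["systemctl", "restart", unit], check=True)


def check_once(
    last_beat: Dict[str, float],
    now: float,
    restart: Callable[[str], None],
    alert: Callable[[str], None],
) -> List[str]:
    """Run one watchdog pass: restart and alert for every stale bot."""
    restarted = []
    for name in stale_bots(last_beat, now):
        restart(name)
        alert(name)
        restarted.append(name)
    return restarted
```

In a live setup, restart would be systemctl_restart and alert would post the output of tail to Telegram; injecting both keeps the staleness logic testable without touching systemd or the network.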
Backtest-vs-live drift was the killer risk. Built a walk-forward analysis (WFA) harness that runs every Friday on the prior week's data — if drift exceeds 2% on any bot, the dashboard flags it and trading auto-pauses until reviewed.
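A minimal sketch of the weekly drift gate, under stated assumptions: the text does not define how drift is measured, so this example assumes it is the relative gap between live and backtested PnL for the same week. WeeklyResult, drift, and bots_to_pause are illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, List

DRIFT_LIMIT = 0.02  # the 2% threshold from the text


@dataclass(frozen=True)
class WeeklyResult:
    backtest_pnl: float  # PnL the backtest produced on last week's data
    live_pnl: float      # PnL the live bot actually realized


def drift(result: WeeklyResult) -> float:
    """Relative gap between live and backtest PnL (assumed drift metric)."""
    if result.backtest_pnl == 0:
        # No baseline to compare against: any live PnL counts as unbounded drift.
        return 0.0 if result.live_pnl == 0 else float("inf")
    return abs(result.live_pnl - result.backtest_pnl) / abs(result.backtest_pnl)


def bots_to_pause(
    results: Dict[str, WeeklyResult], limit: float = DRIFT_LIMIT
) -> List[str]:
    """Return the bots whose drift exceeds the limit and should auto-pause."""
    return sorted(name for name, r in results.items() if drift(r) > limit)
```

For example, a bot with a backtest PnL of 100.0 and a live PnL of 97.0 has 3% drift and would be paused, while one at 101.0 (1% drift) keeps trading.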
Per-coin tuned configs (step %, rungs, budget) come from WFA mode analysis, not from intuition. Each parameter has provenance — you can trace any live setting back to a specific backtest run.
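One way to make provenance concrete is to store every tuned value together with the ID of the backtest run that produced it. The sketch below is hypothetical: the class names, the run-ID format, and the numbers in the usage example are made up for illustration and are not live settings.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Param:
    value: float
    run_id: str  # ID of the backtest run that produced this value

    def __post_init__(self) -> None:
        # Refuse any parameter that cannot be traced to a backtest run.
        if not self.run_id:
            raise ValueError("every parameter needs a backtest run_id")


@dataclass(frozen=True)
class CoinConfig:
    symbol: str
    step_pct: Param
    rungs: Param
    budget: Param

    def provenance(self) -> Dict[str, str]:
        """Map each tuned parameter to the backtest run it came from."""
        return {
            name: getattr(self, name).run_id
            for name in ("step_pct", "rungs", "budget")
        }


# Illustrative values only.
example = CoinConfig(
    symbol="BTC/USD",
    step_pct=Param(1.5, "wfa-run-001"),
    rungs=Param(8, "wfa-run-001"),
    budget=Param(500.0, "wfa-run-002"),
)
```

Because Param rejects an empty run_id, a hand-typed setting with no backtest behind it fails at load time instead of reaching live trading.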
The same pattern works for any system where uptime matters and you want reproducible behavior from research to production: order processing, inventory rebalancing, lead routing, content publishing pipelines.
Trading is just the highest-stakes version. The discipline transfers.
Start with a free 30-min call — figure out fit before money changes hands.