Deployment problems often look like this:
- After an image reset / OS reinstall / container rebuild, the Control UI asks you to enter the token again.
- Your platform is stuck in STARTING with errors like: Startup probe failed: dial tcp ... connect: connection refused
- The Control UI disconnects with:
disconnected (1008): pairing required
This guide gives you a repeatable playbook: fix persistence first (so you stop “losing” tokens), then debug STARTING and probe failures from the bottom up.
Further reading on this site:
- Configuration fundamentals: /guides/openclaw-configuration
- Docker quick start: /guides/docker-deployment
- Windows runtime choice: /guides/openclaw-windows-native-vs-wsl2
- Native Windows service/runtime behavior: /guides/openclaw-native-windows-field-guide
What this guide helps you finish
By the end of this guide, you should be able to do two things safely:
- rebuild or move a deployment without “losing” gateway auth or state,
- and diagnose why a platform thinks the service is stuck in STARTING or failing probes.
That is the real operator outcome here: a deployment that survives resets and a startup path you can verify from the bottom up.
Who this is for (and not for)
Use this guide if:
- an image reset, rebuild, or redeploy keeps forcing token re-entry,
- your platform shows STARTING, connect ECONNREFUSED, or startup probe failures,
- you run OpenClaw on Docker, VMs, Kubernetes, or a PaaS with platform health checks,
- or you want one deployment playbook that works across Alibaba Cloud, Volcengine, Zeabur, Unraid, AWS, and GCP.
This is not the main page for:
- a purely local desktop install,
- app-level auth issues with no deployment changes,
- or model/provider failures unrelated to host persistence or startup health.
Before you redeploy: collect these five facts
Before making any changes, confirm:
- Where the current OpenClaw state directory lives.
- Whether that location survives image resets, container replacement, or instance recreation.
- How the running service gets OPENCLAW_GATEWAY_TOKEN.
- Whether the gateway binds to the interface your platform probes can actually reach.
- What the platform considers a healthy startup signal (TCP, HTTP, timeout window).
These five facts determine whether you are dealing with persistence loss, bind mismatch, or a real crash loop.
1) Why tokens “disappear” after resets (the real cause)
When people say “token”, they usually mean the Gateway token (used for Control UI / API authentication).
The key idea:
- OpenClaw is stateful. The state directory (commonly ~/.openclaw/) contains config, tokens, sessions, device approvals, channel credentials, and caches.
- If your platform's reset or redeploy deletes the disk or the container write layer, you did not just lose a token; you effectively provisioned a new machine.
This is why “reset image” frequently means “re-enter token”.
2) The golden rules (works on every platform)
You want two outcomes:
- Tokens are reproducible (defined by env vars / secret manager).
- The OpenClaw state directory is persistent (volume / attached disk / network filesystem).
2.1 Pin gateway auth via env vars (recommended)
In openclaw.json, prefer env substitution:
{
  gateway: {
    auth: { mode: 'token', token: '${OPENCLAW_GATEWAY_TOKEN}' },
    // Remote deployments usually need non-loopback bind
    bind: 'lan',
  },
}
Set it in the runtime environment (systemd/Docker/PaaS env settings):
export OPENCLAW_GATEWAY_TOKEN='your-long-random-token'
Why this helps:
- Browser storage can be wiped, but your gateway token stays stable.
- You can rotate tokens by changing the secret, without editing JSON in multiple places.
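One way to produce a suitably long random token is sketched below. It assumes a POSIX shell with /dev/urandom available. The function names new_gateway_token and write_token_env, and the env-file path you pass in, are illustrative and not part of OpenClaw.

```shell
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
new_gateway_token() {
  od -An -N32 -tx1 /dev/urandom | tr -d ' \t\n'
}

# Hypothetical helper: write the token to an env file.
# $1: path of the env file (an assumption; use whatever file your runtime loads).
# The umask only affects newly created files, which end up owner-readable only.
write_token_env() {
  (
    umask 077
    printf 'OPENCLAW_GATEWAY_TOKEN=%s\n' "$(new_gateway_token)" > "$1"
  )
}
```

Store the result in your platform's secret manager or env settings rather than committing the file to version control.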
2.2 Persist the state directory (this is the real fix)
The default state directory is ~/.openclaw/ (unless overridden, for example with OPENCLAW_STATE_DIR). Persist it using one of:
- Docker: bind-mount a host directory or a named volume into the container.
- VM: store state on an attached data disk (EBS / Persistent Disk) rather than the system/root disk.
- Kubernetes: store state on a PersistentVolume (EBS/EFS/CSI, etc.).
- PaaS: enable platform persistence and mount it where the OpenClaw home/state lives.
If you do only one thing in this guide, do this.
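A quick heuristic for checking where the state directory actually lives is to ask df which mount point backs it. If the answer is /, the state sits on the root filesystem (inside a container, that is usually the write layer). The function names below are illustrative, and this is only a sketch: a mount point other than / does not by itself prove that the storage survives a rebuild.

```shell
# Print the mount point of the filesystem that holds the given path.
# Limitation: mount points containing whitespace are not handled.
state_mount_point() {
  df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Succeed (exit status 0) when the path is backed by the root filesystem.
is_on_root_fs() {
  [ "$(state_mount_point "$1")" = "/" ]
}
```

For example, run is_on_root_fs "$HOME/.openclaw" && echo "state is on the root filesystem" in the same environment the gateway runs in (inside the container, if you use Docker).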
3) Recover or rotate your token (without wiping state)
If the service was not reset and only your browser forgot the token:
3.1 Docker: check .env
The Docker setup flow commonly writes the token into .env.
rg OPENCLAW_GATEWAY_TOKEN .env
3.2 Check config and the service runtime environment
- If config uses ${OPENCLAW_GATEWAY_TOKEN}, the actual value lives in systemd/Docker/PaaS env settings.
- If you hardcoded it, you will see it in openclaw.json.
Useful CLI command:
openclaw config get gateway.auth
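When config relies on env substitution, it helps to confirm that the variable is present in the environment the service actually sees, without printing the secret into logs or terminal history. A minimal sketch follows; token_status is a hypothetical helper name.

```shell
# Report whether OPENCLAW_GATEWAY_TOKEN is set, without echoing its value.
token_status() {
  if [ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ]; then
    printf 'set (%s chars)\n' "${#OPENCLAW_GATEWAY_TOKEN}"
  else
    printf 'missing\n'
  fi
}
```

Run it in the same context as the gateway (for Docker, through docker compose exec into the gateway container). A shell on the host may have a different environment than the service.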
3.3 “pairing required” (1008): approve the device
The Control UI requires one-time device approval for new browser profiles/devices.
openclaw devices list
openclaw devices approve <requestId>
If you truly lost state (fresh machine), you must repeat device approval - that is expected.
4) STARTING / startup probe failed: debug in layers
STARTING is a platform health-check label, not an OpenClaw diagnosis. Use this bottom-up checklist:
- Is the process crashing/restarting? (logs)
- Is the port listening? (platform probes check TCP/HTTP reachability)
- Is bind correct? (binding only to 127.0.0.1 makes platform probes fail)
- Does config load? (config parse, missing env vars, permissions, disk full)
- Is the platform health check too aggressive? (startup too slow, probe kills it early)
4.1 Logs first
Docker:
docker compose logs -f openclaw-gateway
Typical restart-loop causes:
- invalid config / bad include paths / missing env vars
- state directory permissions (container user cannot write)
- OOM (memory limit too low)
4.2 Verify the port is listening
On the host:
ss -ltnp | rg 18789 || true
Inside the container:
docker compose exec openclaw-gateway sh -lc 'ss -ltnp || netstat -ltnp'
If nothing is listening, probes will fail. Fix the startup error from logs.
4.3 Bind matters: remote deployments usually require lan
If you are debugging this on native Windows and actually want a more server-like service model, remember that a Scheduled Task is not the same as a true always-on Windows Service. In that case, also read:
- /guides/openclaw-windows-native-vs-wsl2
- /troubleshooting/solutions/windows-native-node-run-hangs-or-runtime-unstable
A very common failure mode:
- Gateway is running and listening, but only on 127.0.0.1.
- Platform probes come from outside the container/VM and get ECONNREFUSED.
For remote deployments, set:
{ gateway: { bind: 'lan' } }
Important: when binding beyond loopback, OpenClaw requires auth (token/password). If you see a message like “refusing to bind … without auth”, configure auth and restart.
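To tell "not listening" apart from "listening on loopback only", you can classify the output of ss -ltn for the gateway port. The sketch below assumes the usual ss column layout (local address and port in the fourth column); bind_scope is a hypothetical helper name. A "non-loopback" result only describes the bind address. Firewalls, security groups, and container port mappings can still block a probe.

```shell
# Classify how a TCP port is bound, based on `ss -ltn` output on stdin.
# $1: port number
# Prints one of: not-listening, loopback-only, non-loopback
bind_scope() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] != port) next
      # Strip the trailing ":<port>" to get the local address.
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      seen = 1
      if (addr !~ /^127\./ && addr != "[::1]") external = 1
    }
    END {
      if (!seen) print "not-listening"
      else if (external) print "non-loopback"
      else print "loopback-only"
    }
  '
}
```

Usage: ss -ltn | bind_scope 18789. If it prints loopback-only on a remote deployment, the bind setting is the first thing to check.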
4.4 Use OpenClaw health/probe commands
If you have shell access on the host/container, probe locally:
openclaw gateway health --url ws://127.0.0.1:18789 --token "$OPENCLAW_GATEWAY_TOKEN"
If local health fails, the issue is not the platform probe - it is a gateway startup/config/network issue.
4.5 Platform probes too strict: give startup more time
For Kubernetes-style platforms, add a startupProbe (TCP is usually the least surprising):
startupProbe:
  tcpSocket:
    port: 18789
  periodSeconds: 5
  failureThreshold: 60 # ~5 minutes
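To get a feel for how long the gateway really needs, you can mimic the probe loop locally: retry a check up to failureThreshold times, waiting periodSeconds between attempts. This is a sketch in plain POSIX shell; probe_until_ready and gateway_port_open are hypothetical helper names, and the variables are global because POSIX sh has no local.

```shell
# Retry a command the way a startup probe does.
# $1: maximum attempts (like failureThreshold)
# $2: seconds between attempts (like periodSeconds)
# remaining arguments: the check to run
probe_until_ready() {
  max_attempts=$1
  period=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt is still coming.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$period"
    fi
  done
  return 1
}

# Hypothetical check: succeed when something listens on TCP port 18789.
gateway_port_open() {
  ss -ltn | grep -q ':18789 '
}
```

Running probe_until_ready 60 5 gateway_port_open right after a restart waits for roughly five minutes at most, matching the YAML above. If the check only passes late in that window, the platform's default probe settings are probably too tight.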
For PaaS platforms with a health check toggle, it is often useful to temporarily disable health checks while fixing a bad config (to break the restart loop), then re-enable.
5) Platform guidance
5.1 Alibaba Cloud (Simple Application Server image)
Alibaba’s OpenClaw image FAQ highlights the key pitfall: a system reset deletes system disk data, and you must reconfigure tokens/keys afterward.
Checklist:
- Treat ~/.openclaw/ as production data; snapshot/backup before resets.
- Avoid exposing the Control UI publicly; prefer SSH tunnel/Tailscale.
- If you do expose it, ensure firewall/security group rules allow only your IPs.
5.2 Volcengine
Volcengine behaves like most cloud VM/container platforms: if you rebuild without persistence, state is gone.
Recommended:
- Put OpenClaw state on a data disk (example mount: /data/openclaw).
- Back up using disk snapshots.
- If using Docker, bind-mount the data disk path into the container.
5.3 Zeabur
Zeabur deployments usually fail for two reasons:
- Persistence is not configured, so redeploy wipes state.
- Health checks kill the container before it becomes ready.
Suggested workflow:
- Enable persistent storage and mount it to where /home/node/.openclaw lives.
- Fix bind/auth/port.
- If needed, use rescue mode and temporarily disable health checks while you fix config.
5.4 Unraid
Unraid best practice is to map application state into /mnt/user/appdata/... so rebuilding the container (or recreating docker.img) does not wipe state.
Suggested mapping:
Host: /mnt/user/appdata/openclaw
Container: /home/node/.openclaw
If state was stored in the container write layer, rebuilding/updating will cause token resets and session loss.
5.5 AWS (EC2 / ECS/Fargate / EKS)
EC2 (recommended for a stateful gateway)
- Use EBS for persistence.
- Put OpenClaw state on a dedicated EBS data volume mounted at /data/openclaw.
- Set OPENCLAW_STATE_DIR=/data/openclaw/.openclaw and back up with EBS snapshots.
- Be mindful that root volumes are often deleted on termination unless configured otherwise.
ECS/Fargate (stateless by default)
Fargate is great for stateless services. For OpenClaw, you typically need a persistent filesystem for state.
- If you deploy on Fargate, mount an EFS volume and store the OpenClaw state directory there.
EKS (Kubernetes)
- Store ~/.openclaw on a PersistentVolume (EBS/EFS CSI).
- Add a TCP startupProbe to avoid early restarts.
5.6 Google Cloud (Compute Engine / GKE / Cloud Run)
Compute Engine (recommended for a stateful gateway)
- Use a Persistent Disk mounted at /data/openclaw.
- Set OPENCLAW_STATE_DIR=/data/openclaw/.openclaw.
- Back up with disk snapshots.
Tip: use stable device naming (UUID or /dev/disk/by-id) when mounting disks.
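As an illustration of UUID-based mounting, the sketch below prints an /etc/fstab entry for a data disk. The helper name fstab_entry, the ext4 default, and the defaults,nofail options are assumptions to adapt to your setup; nofail lets the VM boot even when the disk is not attached.

```shell
# Print an /etc/fstab entry that mounts a filesystem by UUID.
# $1: filesystem UUID (for example from `blkid -s UUID -o value <device>`)
# $2: mount point
# $3: filesystem type (optional; ext4 is an assumption)
fstab_entry() {
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "${3:-ext4}"
}
```

Review the printed line before appending it to /etc/fstab, and test it with mount -a before relying on it across a reboot.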
GKE (Kubernetes)
- Mount a PersistentVolume for ~/.openclaw.
- Add a startupProbe (TCP) for port 18789.
Cloud Run (not ideal for OpenClaw)
Cloud Run is optimized for stateless services. Local disk is ephemeral and instances can be replaced at any time.
If you still want to experiment, Cloud Run supports Cloud Storage volume mounts, but this changes filesystem semantics (object storage presented as files) and may not be a drop-in replacement for all state patterns.
For production, prefer Compute Engine or GKE.
6) Backup + restore drill (do this once before you need it)
Minimal backup (VM / bare metal / Docker host):
tar -czf openclaw-state-backup.tgz ~/.openclaw
Verify you can:
- restore onto a new VM/container
- reuse the same OPENCLAW_GATEWAY_TOKEN
- keep channels/devices working (or at least understand what must be re-approved)
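The drill can be scripted so that it is easy to repeat. This is a minimal sketch with hypothetical helper names (backup_state, restore_state). It archives the directory relative to its parent, so the archive unpacks cleanly under any target directory.

```shell
# Archive the state directory relative to its parent, so the archive
# contains ".openclaw/..." rather than an absolute path.
# $1: state directory, $2: archive file to create
backup_state() {
  tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# Unpack an archive under a parent directory of your choice.
# $1: archive file, $2: parent directory to restore into
restore_state() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}
```

For a dry run, call backup_state "$HOME/.openclaw" "$HOME/openclaw-state-backup.tgz", restore into a scratch directory with restore_state, and compare the two trees with diff -r. Consider stopping the gateway first so files are not changing while the archive is written.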
Verification checklist after the recovery
Treat the deployment as healthy only when:
- the state directory is mounted on persistent storage,
- the gateway token is reproducible from env/secret configuration,
- the service listens on the expected host/port after restart,
- the platform probe passes without manual intervention,
- and a browser/device reconnect proves you did not silently create a “new machine” by accident.
If one of those checks fails, the deployment is still fragile even if it looks alive for a moment.
What to tighten first when the platform still feels brittle
Use this order:
- Fix persistence and token reproducibility.
- Fix bind/listen and platform probe assumptions.
- Fix permissions, missing env vars, and restart loops from logs.
- Only then optimize platform-specific startup timing or release workflow.
That order keeps you from tuning probes around a deployment that still cannot preserve identity.
References (official docs)
- OpenClaw Docker install: https://docs.openclaw.ai/install/docker
- OpenClaw gateway CLI docs: https://docs.openclaw.ai/cli/gateway
- OpenClaw dashboard + token: https://docs.openclaw.ai/web/dashboard
- Zeabur OpenClaw template: https://zeabur.com/templates/VTZ4FX
- Alibaba Cloud OpenClaw FAQ (reset wipes disk; token invalid): https://www.alibabacloud.com/help/en/simple-application-server/use-cases/openclaw-faq
- Unraid Docker container management (appdata mappings, docker.img): https://docs.unraid.net/unraid-os/using-unraid-to/run-docker-containers/managing-and-customizing-containers/
- Kubernetes probes (startupProbe): https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
- AWS EC2 preserve volumes on termination: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/preserving-volumes-on-termination.html
- AWS ECS EFS volumes: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/efs-volumes.html
- Google Compute Engine Persistent Disk: https://cloud.google.com/compute/docs/disks/persistent-disks
- Google Compute Engine mount disks using UUID: https://cloud.google.com/compute/docs/disks/mounting-disks#uuid
- Cloud Run Cloud Storage volume mounts: https://cloud.google.com/run/docs/configuring/services/cloud-storage-volume-mounts