Intermediate
Alibaba Cloud / Volcengine / Zeabur / Unraid / AWS / Google Cloud / Kubernetes / Docker / Self-hosted
Estimated time: 25 min

OpenClaw Deployment Troubleshooting: Token Resets, STARTING Containers, and Startup Probe Failures

A platform-aware guide for Alibaba Cloud, Volcengine, Zeabur, Unraid, AWS, and Google Cloud: why image resets force token re-entry, how to persist OpenClaw state correctly, and how to debug STARTING / startup probe failed step-by-step.

Implementation Steps

OpenClaw is stateful. If your state directory is not persisted, resets/redeploys wipe tokens, configs, sessions, and device approvals together.

Deployment problems often look like this:

  • After an image reset / OS reinstall / container rebuild, the Control UI asks you to enter the token again.
  • Your platform is stuck in STARTING with errors like:
    • Startup probe failed: dial tcp ... connect: connection refused
  • The Control UI disconnects with:
    • disconnected (1008): pairing required

This guide gives you a repeatable playbook: fix persistence first (so you stop “losing” tokens), then debug STARTING and probe failures from the bottom up.

What this guide helps you finish

By the end of this guide, you should be able to do two things safely:

  • rebuild or move a deployment without “losing” gateway auth or state,
  • and diagnose why a platform thinks the service is stuck in STARTING or failing probes.

That is the real operator outcome here: a deployment that survives resets and a startup path you can verify from the bottom up.

Who this is for (and not for)

Use this guide if:

  • an image reset, rebuild, or redeploy keeps forcing token re-entry,
  • your platform shows STARTING, connect ECONNREFUSED, or startup probe failures,
  • you run OpenClaw on Docker, VMs, Kubernetes, or a PaaS with platform health checks,
  • or you want one deployment playbook that stays sane across Alibaba Cloud, Zeabur, Unraid, AWS, and GCP.

This is not the main page for:

  • a purely local desktop install,
  • app-level auth issues with no deployment changes,
  • or model/provider failures unrelated to host persistence or startup health.

Before you redeploy: collect these five facts

Before making any changes, confirm:

  1. Where the current OpenClaw state directory lives.
  2. Whether that location survives image resets, container replacement, or instance recreation.
  3. How the running service gets OPENCLAW_GATEWAY_TOKEN.
  4. Whether the gateway binds to the interface your platform probes can actually reach.
  5. What the platform considers a healthy startup signal (TCP, HTTP, timeout window).

These five facts determine whether you are dealing with persistence loss, bind mismatch, or a real crash loop.
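A quick way to collect most of these facts is a small shell pass on the host or inside the container. This is a sketch: `OPENCLAW_STATE_DIR` and the default `~/.openclaw` path follow this guide's conventions, and the `mountpoint` heuristic is only an approximation of "survives resets".

```shell
#!/bin/sh
# Sketch: collect the five pre-redeploy facts in one pass.
# OPENCLAW_STATE_DIR and the default path are assumptions from this guide;
# adjust them to match your install.
STATE_DIR="${OPENCLAW_STATE_DIR:-$HOME/.openclaw}"
echo "1. state dir:    $STATE_DIR"
# A dedicated mountpoint usually survives resets; the root fs often does not.
if command -v mountpoint >/dev/null 2>&1 && mountpoint -q "$STATE_DIR"; then
  echo "2. persistence:  dedicated mount (likely survives resets)"
else
  echo "2. persistence:  root/container layer (verify before redeploying)"
fi
if [ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ]; then
  echo "3. token:        OPENCLAW_GATEWAY_TOKEN is set in this shell"
else
  echo "3. token:        OPENCLAW_GATEWAY_TOKEN is NOT set in this shell"
fi
echo "4. listening:"
ss -ltn 2>/dev/null | grep 18789 || echo "   nothing on 18789"
echo "5. probe:        check the platform health-check type and timeout window"
```

Facts 4 and 5 still need a look at the platform console; the script only tells you what the host itself can see.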


1) Why tokens “disappear” after resets (the real cause)

When people say “token”, they usually mean the Gateway token (used for Control UI / API authentication).

The key idea:

  • OpenClaw is stateful. The state directory (commonly ~/.openclaw/) contains config, tokens, sessions, device approvals, channel credentials, and caches.
  • If your platform reset/redeploy deletes the disk or container write layer, you did not just lose a token - you provisioned a new machine.

This is why “reset image” frequently means “re-enter token”.


2) The golden rules (works on every platform)

You want two outcomes:

  1. Tokens are reproducible (defined by env vars / secret manager).
  2. The OpenClaw state directory is persistent (volume / attached disk / network filesystem).

2.1 Make tokens reproducible (env substitution)

In openclaw.json, prefer env substitution:

{
  gateway: {
    auth: { mode: 'token', token: '${OPENCLAW_GATEWAY_TOKEN}' },
    // Remote deployments usually need non-loopback bind
    bind: 'lan',
  },
}

Set it in the runtime environment (systemd/Docker/PaaS env settings):

export OPENCLAW_GATEWAY_TOKEN='your-long-random-token'

Why this helps:

  • Browser storage can be wiped, but your gateway token stays stable.
  • You can rotate tokens by changing the secret, without editing JSON in multiple places.
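Generating the token itself is a one-liner. A sketch, assuming `openssl` is installed (any cryptographic random source works):

```shell
# Generate a strong random token once, then make the environment the
# single source of truth for it.
OPENCLAW_GATEWAY_TOKEN="$(openssl rand -hex 32)"   # 64 hex characters
export OPENCLAW_GATEWAY_TOKEN
# For Docker Compose, persist it so restarts reuse the same value:
#   echo "OPENCLAW_GATEWAY_TOKEN=$OPENCLAW_GATEWAY_TOKEN" >> .env
```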

2.2 Persist the state directory (this is the real fix)

Default state dir is ~/.openclaw/ (unless overridden). Persist it using one of:

  • Docker: bind-mount a host directory or a named volume into the container.
  • VM: store state on an attached data disk (EBS / Persistent Disk) rather than the system/root disk.
  • Kubernetes: store state on a PersistentVolume (EBS/EFS/CSI, etc.).
  • PaaS: enable platform persistence and mount it where the OpenClaw home/state lives.

If you do only one thing in this guide, do this.
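For the Docker case, a minimal compose sketch looks like this. The image name is an assumption (use whatever image you actually deploy); the container path matches the `/home/node/.openclaw` convention used later in this guide.

```yaml
# Sketch: bind-mount a host directory so the container write layer
# never holds OpenClaw state. Image name is an assumption.
services:
  openclaw-gateway:
    image: openclaw/openclaw:latest
    env_file: .env                    # provides OPENCLAW_GATEWAY_TOKEN
    ports:
      - "18789:18789"
    volumes:
      - /data/openclaw:/home/node/.openclaw
```

With this layout, `docker compose down && docker compose up -d` recreates the container without touching state.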


3) Recover or rotate your token (without wiping state)

If the service was not reset and only your browser forgot the token:

3.1 Docker: check .env

The Docker setup flow commonly writes the token into .env.

rg OPENCLAW_GATEWAY_TOKEN .env

3.2 Check config and the service runtime environment

  • If config uses ${OPENCLAW_GATEWAY_TOKEN}, the actual value lives in systemd/Docker/PaaS env settings.
  • If you hardcoded it, you will see it in openclaw.json.

Useful CLI command:

openclaw config get gateway.auth

3.3 “pairing required” (1008): approve the device

The Control UI requires one-time device approval for new browser profiles/devices.

openclaw devices list
openclaw devices approve <requestId>

If you truly lost state (fresh machine), you must repeat device approval - that is expected.


4) STARTING / startup probe failed: debug in layers

STARTING is a platform health-check label, not an OpenClaw diagnosis. Use this bottom-up checklist:

  1. Is the process crashing/restarting? (logs)
  2. Is the port listening? (platform probes check TCP/HTTP reachability)
  3. Is bind correct? (binding only to 127.0.0.1 makes platform probes fail)
  4. Does config load? (config parse, missing env vars, permissions, disk full)
  5. Is the platform health check too aggressive? (startup too slow, probe kills it early)

4.1 Logs first

Docker:

docker compose logs -f openclaw-gateway

Typical restart-loop causes:

  • invalid config / bad include paths / missing env vars
  • state directory permissions (container user cannot write)
  • OOM (memory limit too low)

4.2 Verify the port is listening

On the host:

ss -ltnp | rg 18789 || true

Inside the container:

docker compose exec openclaw-gateway sh -lc 'ss -ltnp || netstat -ltnp'

If nothing is listening, probes will fail. Fix the startup error from logs.

4.3 Bind matters: remote deployments usually require lan

A very common failure mode:

  • Gateway is running and listening, but only on 127.0.0.1.
  • Platform probes come from outside the container/VM and get ECONNREFUSED.

For remote deployments, set:

{ gateway: { bind: 'lan' } }

Important: when binding beyond loopback, OpenClaw requires auth (token/password). If you see a message like “refusing to bind … without auth”, configure auth and restart.
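Putting bind and auth together avoids that refusal. A config sketch in the same shape as the earlier example:

```json5
// Non-loopback bind plus token auth together, so the gateway
// does not refuse to start.
{
  gateway: {
    bind: 'lan',
    auth: { mode: 'token', token: '${OPENCLAW_GATEWAY_TOKEN}' },
  },
}
```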

4.4 Use OpenClaw health/probe commands

If you have shell access on the host/container, probe locally:

openclaw gateway health --url ws://127.0.0.1:18789 --token "$OPENCLAW_GATEWAY_TOKEN"

If local health fails, the issue is not the platform probe - it is a gateway startup/config/network issue.

4.5 Platform probes too strict: give startup more time

For Kubernetes-style platforms, add a startupProbe (TCP is usually the least surprising):

startupProbe:
  tcpSocket:
    port: 18789
  periodSeconds: 5
  failureThreshold: 60 # ~5 minutes

For PaaS platforms with a health check toggle, it is often useful to temporarily disable health checks while fixing a bad config (to break the restart loop), then re-enable.


5) Platform guidance

5.1 Alibaba Cloud (Simple Application Server image)

Alibaba’s OpenClaw image FAQ highlights the key pitfall: a system reset deletes system disk data, and you must reconfigure tokens/keys afterward.

Checklist:

  • Treat ~/.openclaw/ as production data; snapshot/backup before resets.
  • Avoid exposing the Control UI publicly; prefer SSH tunnel/Tailscale.
  • If you do expose it, ensure firewall/security group rules allow only your IPs.

5.2 Volcengine

Volcengine behaves like most cloud VM/container platforms: if you rebuild without persistence, state is gone.

Recommended:

  • Put OpenClaw state on a data disk (example mount: /data/openclaw).
  • Back up using disk snapshots.
  • If using Docker, bind-mount the data disk path into the container.

5.3 Zeabur

Zeabur deployments usually fail for two reasons:

  • Persistence is not configured, so redeploy wipes state.
  • Health checks kill the container before it becomes ready.

Suggested workflow:

  • Enable persistent storage and mount it to where /home/node/.openclaw lives.
  • Fix bind/auth/port.
  • If needed, use rescue mode and temporarily disable health checks while you fix config.

5.4 Unraid

Unraid best practice is to map application state into /mnt/user/appdata/... so rebuilding the container (or recreating docker.img) does not wipe state.

Suggested mapping:

Host:      /mnt/user/appdata/openclaw
Container: /home/node/.openclaw

If state was stored in the container write layer, rebuilding/updating will cause token resets and session loss.

5.5 AWS (EC2 / ECS/Fargate / EKS)

  • Use EBS for persistence.
  • Put OpenClaw state on a dedicated EBS data volume mounted at /data/openclaw.
  • Set OPENCLAW_STATE_DIR=/data/openclaw/.openclaw and back up with EBS snapshots.
  • Be mindful that root volumes are often deleted on termination unless configured otherwise.

ECS/Fargate (stateless by default)

Fargate is great for stateless services. For OpenClaw, you typically need a persistent filesystem for state.

  • If you deploy on Fargate, mount an EFS volume and store the OpenClaw state directory there.

EKS (Kubernetes)

  • Store ~/.openclaw on a PersistentVolume (EBS/EFS CSI).
  • Add a TCP startupProbe to avoid early restarts.
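Those two bullets combine into one Pod spec fragment. A sketch; the image name and PVC name are assumptions, the mount path and port follow this guide:

```yaml
# Sketch: persistent state plus a tolerant startup probe in one place.
containers:
  - name: openclaw-gateway
    image: openclaw/openclaw:latest     # assumed image name
    ports:
      - containerPort: 18789
    volumeMounts:
      - name: openclaw-state
        mountPath: /home/node/.openclaw
    startupProbe:
      tcpSocket:
        port: 18789
      periodSeconds: 5
      failureThreshold: 60              # ~5 minutes of grace
volumes:
  - name: openclaw-state
    persistentVolumeClaim:
      claimName: openclaw-state         # assumed PVC name
```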

5.6 Google Cloud (Compute Engine / GKE / Cloud Run)

  • Use a Persistent Disk mounted at /data/openclaw.
  • Set OPENCLAW_STATE_DIR=/data/openclaw/.openclaw.
  • Back up with disk snapshots.

Tip: use stable device naming (UUID or /dev/disk/by-id) when mounting disks.
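An fstab sketch using a stable id; the disk id shown is an example (list yours with `ls /dev/disk/by-id`), and `nofail` keeps the VM bootable if the disk is ever detached:

```
/dev/disk/by-id/google-openclaw-data  /data/openclaw  ext4  discard,defaults,nofail  0  2
```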

GKE (Kubernetes)

  • Mount a PersistentVolume for ~/.openclaw.
  • Add a startupProbe (TCP) for port 18789.

Cloud Run (not ideal for OpenClaw)

Cloud Run is optimized for stateless services. Local disk is ephemeral and instances can be replaced at any time.

If you still want to experiment, Cloud Run supports Cloud Storage volume mounts, but this changes filesystem semantics (object storage presented as files) and may not be a drop-in replacement for all state patterns.

For production, prefer Compute Engine or GKE.


6) Backup + restore drill (do this once before you need it)

Minimal backup (VM / bare metal / Docker host):

tar -czf openclaw-state-backup.tgz ~/.openclaw

Verify you can:

  • restore onto a new VM/container
  • reuse the same OPENCLAW_GATEWAY_TOKEN
  • keep channels/devices working (or at least understand what must be re-approved)
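You can rehearse the backup/restore mechanics against throwaway directories, so the first real restore is not also your first attempt. Paths and contents below are stand-ins:

```shell
#!/bin/sh
# Drill: back up a fake state dir, delete it (simulating a reset),
# restore from the archive, and confirm the contents survived.
set -e
SRC="$(mktemp -d)"; BK="$(mktemp -d)"
mkdir -p "$SRC/.openclaw"
echo 'demo-state' > "$SRC/.openclaw/openclaw.json"

tar -czf "$BK/openclaw-state-backup.tgz" -C "$SRC" .openclaw   # backup
rm -rf "$SRC/.openclaw"                                        # simulate the reset
tar -xzf "$BK/openclaw-state-backup.tgz" -C "$SRC"             # restore

cat "$SRC/.openclaw/openclaw.json"   # prints: demo-state
```

For the real thing, substitute `~/.openclaw` for the temp directory and keep the archive off the machine being reset.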

Verification checklist after the recovery

Treat the deployment as healthy only when:

  • the state directory is mounted on persistent storage,
  • the gateway token is reproducible from env/secret configuration,
  • the service listens on the expected host/port after restart,
  • the platform probe passes without manual intervention,
  • and a browser/device reconnect proves you did not silently create a “new machine” by accident.

If one of those checks fails, the deployment is still fragile even if it looks alive for a moment.

What to tighten first when the platform still feels brittle

Use this order:

  1. Fix persistence and token reproducibility.
  2. Fix bind/listen and platform probe assumptions.
  3. Fix permissions, missing env vars, and restart loops from logs.
  4. Only then optimize platform-specific startup timing or release workflow.

That order keeps you from tuning probes around a deployment that still cannot preserve identity.


Verification & references

  • Reviewed by: CoClaw Editorial Team
  • Last reviewed: March 14, 2026
  • Verified on: Alibaba Cloud · Volcengine · Zeabur · Unraid · AWS · Google Cloud · Kubernetes · Docker · Self-hosted

Need live assistance?

Ask in the community forum or Discord support channels.
