Why We Run OpenClaw on Kubernetes (And What It Actually Costs Us)

The technical trade-offs of hosting personal AI assistants in Kubernetes.

Dzianis Vashchuk

9 min read

Every cloud-hosted AI assistant product eventually has to answer the same question: how do you give each user their own isolated environment without making operations a full-time job?

For OpenClaw Box, our managed OpenClaw hosting service, we chose Kubernetes. Not because it is trendy, but because it solves the hardest operational problems of running personal AI assistants for hundreds of users. This post covers why that decision made sense, what real challenges we hit, and how we solved (or are solving) each one.

Why Kubernetes

Provisioning a tenant in seconds

When a user sends /create in Telegram, they expect something to happen fast. Our provisioner generates a Kustomize overlay, applies it to the cluster, and a full OpenClaw tenant -- with its own namespace, StatefulSet, persistent storage, networking, and ingress -- exists within seconds. The pod itself takes 60-120 seconds to boot (init containers seeding the workspace, installing tools, configuring the runtime), but the infrastructure is ready almost instantly.

Without Kubernetes, we would be scripting VM creation through cloud provider APIs, managing SSH keys, setting up reverse proxies, configuring DNS -- all with bespoke tooling. With Kubernetes, the entire tenant definition is a set of YAML manifests that kubectl apply handles atomically.
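
A per-tenant overlay of this kind can be sketched roughly as follows -- the file layout, namespace, and patch target are illustrative, not our actual structure:

```yaml
# overlays/alice/kustomization.yaml -- hypothetical per-tenant overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: tenant-alice        # every resource lands in the tenant's namespace
resources:
  - ../../base                 # shared StatefulSet, Service, PVC, Ingress templates
patches:
  - target:
      kind: Ingress
      name: tenant
    patch: |-
      - op: replace
        path: /spec/rules/0/host
        value: alice.oclawbox.com
```

A single kubectl apply -k overlays/alice then creates or updates the whole tenant in one call.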

Upgrades without downtime

When we ship a new OpenClaw version, upgrading every tenant means updating one image tag. Our reprovisioning flow deletes the StatefulSet but preserves the PVCs -- the user's data, workspace, installed tools, and browser profile survive intact. The new pod boots with the latest image, picks up the existing volumes, and the user is back online with zero data loss.

Compare this to managing dozens of VMs where you would need to SSH into each one, pull the new version, restart the process, and hope nothing breaks. On Kubernetes, we re-render the manifests, apply them, and wait for the readiness probe to pass. If something goes wrong, the old pod stays running until the new one is healthy.
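
With Kustomize, the version bump itself can be as small as an image transformer in the shared base (the image name and tag here are made up):

```yaml
# base/kustomization.yaml -- one line per release upgrades every tenant
images:
  - name: openclaw/gateway     # hypothetical image name
    newTag: "2026.02.1"        # the only field that changes per release
```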

True isolation between tenants

Each tenant gets its own Kubernetes namespace. That means:

  • Resource quotas per namespace prevent one user from starving others. A Pro plan tenant gets 1Gi-2Gi memory and 250m-1000m CPU. A Max plan gets 2Gi-4Gi memory and 500m-2000m CPU.
  • Network policies restrict traffic. Tenants can only reach DNS, the LiteLLM proxy, and external HTTPS. They cannot talk to each other or access internal cluster services they should not see.
  • Separate secrets per namespace -- each tenant has its own gateway token, never shared.

This is the kind of isolation that would require separate VMs or elaborate network segmentation without Kubernetes. With namespaces, it is declarative and auditable.
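
As a rough sketch, the quota and egress policy for one tenant might look like this (the namespace, resource values, and the LiteLLM namespace label are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-alice            # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: 250m
    requests.memory: 1Gi
    limits.cpu: "1"
    limits.memory: 2Gi
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-egress
  namespace: tenant-alice
spec:
  podSelector: {}                    # every pod in the tenant namespace
  policyTypes: [Egress]
  egress:
    - to:                            # DNS
        - namespaceSelector: {}
      ports:
        - { port: 53, protocol: UDP }
    - to:                            # LiteLLM proxy (assumed namespace name)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: litellm
    - ports:                         # external HTTPS
        - { port: 443, protocol: TCP }
```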

Scaling is a cluster concern, not an app concern

Adding capacity means adding nodes to the AKS cluster. We use a dedicated node pool for tenant workloads (openclaw.ai/workload: tenant), so tenant pods land on nodes sized for them while the bot and control plane run elsewhere. When we need more capacity, we scale the node pool -- no changes to application code required.
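
On the scheduling side, this is a small pod-spec fragment; the taint shown is an assumption about how the pool keeps other workloads off:

```yaml
# Tenant pod spec fragment -- pin to the dedicated node pool
spec:
  nodeSelector:
    openclaw.ai/workload: tenant
  tolerations:                       # assumes the pool is tainted so that
    - key: openclaw.ai/workload      # non-tenant pods cannot land on it
      operator: Equal
      value: tenant
      effect: NoSchedule
```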

The Real Challenges

Kubernetes solves the infrastructure orchestration problem, but it introduces its own trade-offs. Here are the ones that actually matter when you are running personal AI environments.

Challenge 1: Persistence inside containers

Containers are ephemeral by design. When a pod restarts, everything written to the container filesystem disappears. For a personal AI assistant that needs to remember context, keep installed tools, and maintain a workspace, this is a fundamental problem.

How we solved it: Every tenant gets two Persistent Volume Claims (PVCs) backed by Azure managed disks:

  • app-data (20-50Gi) -- mounted at /app, holds the OpenClaw application code. Seeded from the container image on first boot, persisted across restarts.
  • config-data (20-50Gi) -- mounted at /home/node/.openclaw, holds the workspace, configuration, runtime packages, and Chrome browser profile.

This means the user's home directory survives pod restarts. When the AI assistant installs a package via pip, npm, cargo, or go install, those binaries land in the home directory and persist. Our init container (rehydrate-runtime-apt) even re-installs system-level apt packages on boot from a saved package list, so the tenant gets back to its exact previous state.
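
In StatefulSet terms, the two volumes can be sketched like this (sizes and the storage class name are illustrative):

```yaml
# StatefulSet spec fragment -- per-tenant volumes and their mounts
spec:
  template:
    spec:
      containers:
        - name: gateway
          volumeMounts:
            - { name: app-data, mountPath: /app }
            - { name: config-data, mountPath: /home/node/.openclaw }
  volumeClaimTemplates:
    - metadata:
        name: app-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: managed-csi   # Azure Disk CSI class (assumed name)
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: config-data
      spec:
        accessModes: [ReadWriteOnce]
        storageClassName: managed-csi
        resources:
          requests:
            storage: 20Gi
```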

What it does not fully solve: system packages that install outside the home directory (/usr/bin, /usr/lib) are lost on restart. In practice, most developer tools install through language-specific package managers (pip, npm, cargo, go install) that default to user-local paths, so persistence covers the majority of real-world usage. For the rest, the rehydration init container re-runs apt install from a persisted package list on every boot, which adds 10-30 seconds but guarantees the system state converges.
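
The rehydration step could be sketched as an init container that replays a package list kept on the persistent config volume. The image, paths, and file name below are hypothetical, and how the installed files reach the main container's filesystem is elided here:

```yaml
initContainers:
  - name: rehydrate-runtime-apt
    image: openclaw/tenant:latest        # hypothetical image
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Replay system packages recorded on the persistent volume;
        # /usr/bin and /usr/lib are ephemeral, but the list file is not.
        LIST=/home/node/.openclaw/apt-packages.txt
        if [ -s "$LIST" ]; then
          apt-get update -qq
          xargs -a "$LIST" apt-get install -y -qq
        fi
    volumeMounts:
      - name: config-data
        mountPath: /home/node/.openclaw
```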

Challenge 2: Cloud browser and IP restrictions

Each OpenClaw Box tenant runs a headless Chrome instance as a sidecar container (browserless/chrome). The AI assistant can browse the web, fill forms, take screenshots, and interact with any website -- all through Chrome DevTools Protocol (CDP). We even provide a browser UI so users can see and interact with the same browser session their AI is controlling.
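
As a sketch, the sidecar is simply a second container in the tenant pod (the gateway image name is illustrative):

```yaml
# Tenant pod fragment -- gateway plus headless Chrome sidecar
containers:
  - name: gateway
    image: openclaw/gateway:latest     # hypothetical image name
  - name: chrome
    image: browserless/chrome:latest
    ports:
      - containerPort: 3000            # browserless default HTTP/CDP port
```

Within the pod, the assistant reaches Chrome over CDP at ws://localhost:3000.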

The problem: cloud IP addresses are widely blocked. Services like Cloudflare, Akamai, and most major websites fingerprint incoming traffic and block requests from known cloud provider IP ranges (Azure, AWS, GCP). A Chrome browser running in an Azure data center looks like a bot to these services -- because from their perspective, it is one.

This is not unique to OpenClaw Box. Every cloud-hosted browser product faces this: Browserless, Playwright cloud services, and every "browser in the cloud" offering deal with the same IP reputation problem.

How we are solving it: We are working on a proxy architecture that lets users deploy a lightweight proxy on their own network -- a home server, a Raspberry Pi, or any residential IP. The tenant's Chrome traffic would route through this proxy, making requests appear to originate from a residential connection. The proxy itself is simple (a SOCKS5 or HTTP CONNECT relay); the complexity is in making the setup seamless and the connection reliable.

This is not shipped yet. Today, cloud-hosted Chrome works well for sites that do not aggressively block cloud IPs -- APIs, developer tools, internal services, and most web applications. For sites with strict bot protection, the proxy solution is on our roadmap.

Challenge 3: Cold start latency

A new tenant pod takes 60-120 seconds to become ready. Seven init containers run in sequence: seeding the application, creating the workspace, writing configuration, installing apt packages, setting up sudo, downloading CLI tools, and copying the Chrome CDP binary. Only after all of these complete does the main gateway container start, which itself needs 30-60 seconds to initialize.

During this window, any message the user sends to the bot fails with a connection error. We handle this in the bot by detecting the ECONNREFUSED state and returning a clear message ("Your assistant is starting up, please wait"), but two minutes of cold start is noticeable.

What we do about it: We pre-seed as much as possible into the container image to reduce init container work. Because Kubernetes runs init containers strictly in sequence, we are also exploring consolidating independent steps into fewer containers, along with warm-standby pods that can be assigned to new tenants immediately.

Challenge 4: StatefulSet rollouts and availability

Kubernetes StatefulSets update pods one at a time, and any spec change -- a kubectl replace, for example -- triggers a pod restart: 2-4 minutes of downtime for that tenant while the new pod goes through the init sequence and the gateway starts up.

When we need to update all tenants (say, for a security patch), doing this across dozens of tenants serially means an hour-long rollout where each tenant experiences a brief outage. We learned the hard way not to do bulk kubectl replace operations -- we now do rolling patches with readiness verification between each tenant.
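
The gate each rolling patch waits on can be an ordinary HTTP readiness probe on the gateway container (the path, port, and timings below are assumptions):

```yaml
# Gateway container fragment -- rollout proceeds only once this passes
readinessProbe:
  httpGet:
    path: /healthz           # assumed health endpoint
    port: 8080               # assumed gateway port
  initialDelaySeconds: 30    # gateway needs 30-60s to initialize
  periodSeconds: 5
  failureThreshold: 12       # tolerate up to another minute of startup
```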

What Developers Get

For developers and power users who want a personal AI assistant without managing infrastructure, the Kubernetes architecture behind OpenClaw Box translates to concrete benefits:

  • Your own isolated environment with persistent storage -- install tools, configure your workspace, and it stays the way you left it.
  • Built-in Chrome browser accessible from your browser or controlled by your AI assistant via CDP. Use it for web research, form automation, testing, or just browsing.
  • Automatic upgrades -- when OpenClaw ships a new version, your tenant gets it without you doing anything. Your data and configuration are preserved.
  • Resource guarantees -- your tenant has reserved CPU and memory. Other users on the cluster cannot affect your performance.
  • Subdomain access -- each tenant gets its own URL (*.oclawbox.com) with TLS, plus API-compatible endpoints for programmatic access.

Why Not Just Use VMs?

We considered it. VMs give you full system-level isolation, persistent filesystems by default, and no container escape concerns. But they lose on every operational axis:

  • Provisioning speed: Creating a VM, installing dependencies, configuring networking, and setting up reverse proxies takes minutes to tens of minutes. Kubernetes does it in seconds.
  • Upgrades: Upgrading a VM fleet means either image baking (slow, rigid) or configuration management (Ansible/Chef/Puppet -- complex). Kubernetes image updates are atomic.
  • Resource efficiency: VMs have fixed resource allocation. Kubernetes packs workloads onto nodes dynamically, so unused resources on one tenant benefit the cluster.
  • Observability: Kubernetes has built-in health checks, readiness probes, automatic restarts, and event logging. With VMs, you build all of this yourself.

The trade-off is complexity. Kubernetes has a steep learning curve and failure modes that do not exist with VMs (init container ordering, PVC attachment delays, StatefulSet update semantics). But for a multi-tenant product that needs to provision, upgrade, and monitor dozens of isolated environments, the operational leverage is worth it.

What We Are Working On

The challenges above are real, and we are actively working on each one:

  • Proxy-based browser routing to solve cloud IP restrictions -- letting users tunnel Chrome traffic through their own residential IPs.
  • Faster cold starts through image optimization and init container parallelism.
  • Zero-downtime upgrades using canary deployments and pod pre-warming.
  • Runtime package snapshots that capture the full system state (not just home directory packages) into a custom container layer, eliminating the rehydration step entirely.

Kubernetes is not the easy path. But for running personal AI assistants at scale -- where each user needs their own isolated, persistent, upgradeable environment -- it is the right one.


OpenClaw Box is available now. Send /start to @OpenClawBoxBot on Telegram or visit openclaw.vibebrowser.app to learn more.