
How We Test OpenClaw Box: Evaluation-Driven Quality with Telegram CLI, Browser CDP, and Live Canary Checks

Real messages, real browsers, real AI responses — how we verify every release before it reaches customers.


Dzianis Vashchuk

8 min read

Most AI products ship fast and test later. We do it the other way around.

OpenClaw Box is a managed AI assistant platform delivered through Telegram. Each customer gets an isolated OpenClaw instance running on AKS with its own browser, file system, shell access, and 16 hosted models. When something breaks, customers lose their AI assistant. There is no "try again later" that feels acceptable.

That is why we built an evaluation-first testing pipeline that runs real scenarios against the live production bot before every merge. Not mocked. Not stubbed. Real Telegram messages, real browser sessions, real AI responses.

This post is a deep dive into how we do it, why it matters, and what value it delivers.

Why evaluation testing is different from unit testing

Unit tests verify that a function returns the right value. Evaluation tests verify that the product works.

For an AI assistant platform, the gap between those two is enormous:

  • A unit test can confirm that the message handler calls the gateway client correctly
  • An evaluation test confirms that a user sending "What is the capital of France?" to the Telegram bot gets "Paris" back in under 20 seconds

The first catches code regressions. The second catches everything else: gateway connectivity, model availability, Kubernetes networking, Telegram API polling, response formatting, and timeout handling.

We run both. But the evaluation tests are what give us confidence to merge.

The architecture: Telegram CLI + Telethon + Vitest

Our evaluation pipeline has three layers:

1. Telegram CLI (scripts/telegram-cli.py)

We built a lightweight CLI tool on top of Telethon, a Python library for the Telegram MTProto API. It authenticates as a real Telegram user (not the bot) and interacts with @OpenClawBoxBot exactly the way a customer would.

Two core commands:

# Send a message and wait for the bot to reply
python3 scripts/telegram-cli.py ask @OpenClawBoxBot "Hello" --wait 60

# Read recent messages from the chat
python3 scripts/telegram-cli.py read @OpenClawBoxBot --limit 10

The ask command sends a message, then polls the chat history waiting for a new message from the bot. It returns the full exchange — sent message and bot reply — as structured text. If the bot does not reply within the timeout, it exits with a non-zero code.

This is not a mock. It is not hitting a test endpoint. It sends a real Telegram message through the real bot, which routes it through the real gateway, which calls a real LLM, which returns a real response.

2. Vitest test runner (tests-live/provisioner/canary-eval.test.ts)

The evaluation tests live in a Vitest test file that wraps the Telegram CLI calls in structured assertions:

function telegramAsk(message: string, waitSeconds = 60) {
  try {
    const stdout = execFileSync("python3", [
      TELEGRAM_CLI, "ask", "@OpenClawBoxBot", message, "--wait", String(waitSeconds)
    ], { timeout: (waitSeconds + 15) * 1000, encoding: "utf-8" });
    return { stdout, exitCode: 0 };
  } catch (error: any) {
    // execFileSync throws when the CLI exits non-zero (no reply in time)
    return { stdout: error.stdout ?? "", exitCode: error.status ?? 1 };
  }
}

Each test sends a specific prompt and validates the response against expected patterns. The tests run sequentially because they share the same Telegram chat context.
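Sequential execution and long timeouts are enforced at the config level. A minimal sketch of what such a config can look like (our actual vitest.live.config.ts has more to it, and the values here are illustrative):

```typescript
// Sketch of a live-eval Vitest config. Key points: no parallelism, because
// all tests talk to the same Telegram chat, and generous timeouts, because
// live AI replies (especially browser tasks) are slow.
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    fileParallelism: false, // never interleave messages in the shared chat
    maxConcurrency: 1,
    testTimeout: 180_000,   // browser scenarios can take two minutes or more
    retry: 0,               // retries happen per-message, not per-test
  },
});
```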

3. Retry logic for transient failures

AI responses are non-deterministic. Networks have hiccups. Models occasionally time out. We handle this with a retry wrapper:

function telegramAskWithRetries(message, { waitSeconds = 60, attempts = 3, retryDelayMs = 4000 } = {}) {
  let lastResult;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    lastResult = telegramAsk(message, waitSeconds);
    if (lastResult.exitCode === 0 && !isTransientFailure(lastResult.stdout)) {
      return { ...lastResult, attemptsUsed: attempt };
    }
    if (attempt < attempts) sleep(retryDelayMs); // no delay after the final attempt
  }
  return { ...lastResult, attemptsUsed: attempts };
}

This separates real failures (bot is broken) from transient noise (model overloaded for 5 seconds). We do not retry indefinitely — 2-3 attempts with a 4-5 second delay between them.
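The isTransientFailure check is a simple pattern match over the reply text. A sketch, assuming a pattern list along these lines (the real list in the test file may differ):

```typescript
// Illustrative transient-failure heuristic: a reply counts as transient when
// it signals a temporary condition worth retrying, not a genuine regression.
const TRANSIENT_PATTERNS = [
  /timed? ?out/i,              // "timed out", "timeout"
  /rate.?limit/i,              // provider rate limiting
  /temporarily unavailable/i,
  /overloaded/i,
];

function isTransientFailure(stdout: string): boolean {
  return TRANSIENT_PATTERNS.some((pattern) => pattern.test(stdout));
}
```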

The test scenarios

Our evaluation suite covers 11 real-world scenarios. Here is what each one tests and why it matters.

Smoke test: basic Q&A

it("smoke: bot responds to a simple question", () => {
  const result = telegramAskWithRetries(
    "What is the capital of France?",
    { waitSeconds: 60, attempts: 3 }
  );
  expect(result.stdout).toMatch(/paris/i);
});

This is the canary in the coal mine. If this fails, everything else will too. It validates the entire chain: Telegram polling → bot handler → gateway proxy → LLM → response delivery.

Command interception: /start, /create, /status

it("command-start: /start is intercepted by bot", () => {
  const result = telegramAsk("/start", 30);
  expect(result.stdout).toMatch(/welcome/i);
  expect(result.stdout).toMatch(/openclaw|plan|instance/i);
});

This is critical for onboarding. The /start command must be handled by the bot itself (showing plans and payment options), not forwarded to the AI gateway. We caught a production bug where tenant gateways were independently polling Telegram, racing with the bot for getUpdates calls. These tests would have caught it before it shipped.

Browser CDP: Wikipedia fact extraction

it("browser-cdp: open GPT Wikipedia page and find first release date (2018)", () => {
  const result = telegramAskWithRetries(
    "Use your browser to open https://en.wikipedia.org/wiki/GPT-1 — " +
    "find the year when the first GPT model was released. Reply with the year.",
    { waitSeconds: 120, attempts: 2 }
  );
  expect(result.stdout).toMatch(/2018/);
});

Every OpenClaw Box tenant has a headless Chrome sidecar connected via CDP (Chrome DevTools Protocol). This test verifies that the AI assistant can actually use it: navigate to a URL, read the page, extract specific information, and return it through Telegram.

This is not something you can unit test. The browser runs in a separate container. The AI decides how to use it. The response travels back through the gateway, through the bot, to Telegram. If any part of that chain is broken — Chrome not starting, CDP connection refused, gateway timeout on a long browser task — this test catches it.

Code execution: Python script

it("code-execution: write and run a Python script", () => {
  const result = telegramAskWithRetries(
    "Write a Python script that prints the first 10 prime numbers, " +
    "save it to /tmp/primes.py, run it, and show me the output.",
    { waitSeconds: 90, attempts: 2 }
  );
  expect(result.stdout).toMatch(/\b29\b/);
});

OpenClaw tenants have full shell access with sudo. This test verifies that code execution works end-to-end: the AI writes a file, runs it, captures output, and sends it back through Telegram.

File operations

it("file-ops: create a file in workspace and read it back", () => {
  const marker = `EVAL_${Date.now().toString(36).toUpperCase()}`;
  const result = telegramAskWithRetries(
    `Create a file at /tmp/eval-marker.txt with the exact content "${marker}", ` +
    "then read it back and tell me what it says.",
    { waitSeconds: 60, attempts: 2 }
  );
  expect(result.stdout).toMatch(new RegExp(marker));
});

The marker is unique per run, so the test cannot pass from cached responses. The AI must actually write and read the file.

Shell commands

it("shell: execute a shell command and report output", () => {
  const result = telegramAskWithRetries(
    'Run this exact shell command: echo "EVAL_SHELL_OK_$(date +%Y)" and show me the output.',
    { waitSeconds: 60, attempts: 2 }
  );
  expect(result.stdout).toMatch(/EVAL_SHELL_OK_20\d{2}/);
});

The year in the output changes every year, so this cannot be answered from training data. It must be executed live.

Multi-turn conversation context

it("context: bot maintains conversation memory across turns", () => {
  const fruit = randomFrom(["mango", "papaya", "kumquat", "dragonfruit", "lychee"]);
  // Turn 1: tell the bot a preference
  telegramAskWithRetries(`My favorite fruit is ${fruit}. Can you confirm?`, { waitSeconds: 60, attempts: 2 });
  // Turn 2: ask the bot to recall it
  const r2 = telegramAskWithRetries("What is my favorite fruit?", { waitSeconds: 60, attempts: 2 });
  expect(r2.stdout).toMatch(new RegExp(fruit, "i"));
});

This verifies that the gateway maintains conversation state across messages. If session routing is broken, the second message would go to a fresh context and fail.

Web fetch: live data

Tests that the AI can navigate to a live URL (httpbin.org) and extract information from the response, verifying real-time web access works.
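The assertion side of this scenario can be kept deliberately loose. As a hypothetical example (not the actual suite code), if the bot is asked to fetch https://httpbin.org/ip and report the origin address, the check only needs to confirm that some IPv4 address made it back through the chain:

```typescript
// Hypothetical validator for the web-fetch scenario: the bot fetches
// https://httpbin.org/ip and reports the "origin" field. We only assert
// that the reply contains an IPv4 address, since the value varies per run.
function containsIpv4(reply: string): boolean {
  return /\b(\d{1,3}\.){3}\d{1,3}\b/.test(reply);
}
```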

Chat history

Confirms that the Telegram CLI can read past messages, validating the test infrastructure itself.

When these tests run

Pre-merge: canary validation

Before merging any PR that touches message handling, gateway integration, or Kubernetes manifests, we run the full evaluation suite against the live production bot:

npx vitest run --config vitest.live.config.ts \
  tests-live/provisioner/canary-eval.test.ts -t "Telegram CLI E2E"

This takes about 4 minutes. It sends 11+ real messages to the bot and validates every response. If any test fails, the PR does not merge.

Post-merge: CI/CD pipeline

Our CI pipeline runs the full unit test suite (668 tests) on every push. The unit tests validate the Kubernetes manifests, provisioner logic, payment handlers, and message routing — everything that can be tested without a live cluster.

The evaluation tests require a Telegram session and live cluster access, so they run in our canary environment rather than in GitHub Actions. This is intentional: we want the evaluation tests to hit the real production infrastructure, not a staging copy.

Incident response

When a user reports "bot is broken," the first thing we do is run the smoke test:

python3 scripts/telegram-cli.py ask @OpenClawBoxBot "What is 2+2?" --wait 30

If it fails, we know the issue is in the core pipeline. If it passes, we run the specific scenario that matches the user's report (browser test, file ops, etc.) to narrow down the problem.
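That triage order is mechanical enough to sketch in code. The helper below is hypothetical (it is not in the repo), and the keyword patterns and scenario names are assumptions mirroring the suite described above:

```typescript
// Hypothetical triage helper: map a user's bug report to the evaluation
// scenario that exercises the same subsystem. Defaults to the smoke test,
// which is always the first thing to run.
const scenarioByKeyword: Array<[RegExp, string]> = [
  [/browser|website|page/i, "browser-cdp"],
  [/file|read|write/i, "file-ops"],
  [/code|script|python/i, "code-execution"],
  [/remember|forgot|context/i, "context"],
];

function pickScenario(report: string): string {
  for (const [pattern, scenario] of scenarioByKeyword) {
    if (pattern.test(report)) return scenario;
  }
  return "smoke";
}
```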

What value this delivers

For customers

Every OpenClaw Box customer gets an AI assistant that has been verified to:

  • Respond to messages within 20 seconds
  • Execute code and return results
  • Browse the web and extract information
  • Maintain conversation context
  • Handle all Telegram commands correctly

This is not a marketing claim. These are automated checks that run before every release.

For the product

Testing the full pipeline — Telegram → bot → gateway → LLM → browser/shell → response — catches classes of bugs that no amount of unit testing can find:

  • Infrastructure drift: A Kubernetes manifest change that breaks gateway connectivity
  • Config conflicts: Tenant gateways accidentally polling Telegram (a real bug we caught)
  • Model availability: An LLM provider having an outage that affects response quality
  • Timeout regressions: A code change that makes browser tasks slower than the timeout
  • Session routing: A gateway update that breaks conversation memory

For engineering velocity

Because we have high-confidence evaluation tests, we can:

  • Ship infrastructure changes without fear of silent breakage
  • Debug production issues in minutes instead of hours
  • Maintain 20+ tenant instances without manual testing each one
  • Refactor internal code knowing the user-facing behavior is locked down

Why this approach is different

Most AI platforms test at the API level: send a request to the model, check the response. That misses everything between the user and the model.

We test at the user level: send a Telegram message, check what the user sees. That covers the entire stack — infrastructure, routing, authentication, model selection, tool use, response formatting, and delivery.

The tools are simple: Python (Telethon) for Telegram interaction, Vitest for test orchestration and assertions, and a small retry wrapper for resilience. The insight is that you should test what your customers experience, not what your API returns.

If you are building an AI product and you are not testing the full user flow end-to-end, you are shipping blind. We decided not to.


OpenClaw Box is a managed AI assistant platform. Each customer gets an isolated AI instance with browser, shell, file system, and 16+ hosted models — delivered entirely through Telegram. Learn more.