
Vibe Engineering: Testing an Agentic AI Browser Auto Pilot: A Three-Stage Verification Strategy



Dzianis Vashchuk

6 min read

Originally on Medium

Author: Dzianis Vashchuk | Site: Medium | Published: 2025-10-07T20:59:03Z


Testing an AI-powered browser automation agent at vibebrowser.app presents unique challenges. Unlike traditional software, where inputs produce deterministic outputs, AI agents operate in dynamic web environments with non-deterministic LLM responses. We developed a three-stage testing pyramid that balances speed, cost, and real-world validation.

## The Testing Pyramid

Our testing strategy follows a progression from fast, deterministic unit tests to expensive, real-world integration tests:

1. **Foundation: Page Extraction with OCR Verification** — validates core content extraction
2. **Middle Layer: Mock LLM Server Testing** — tests agent logic without API costs
3. **Top Layer: Real LLM Integration Tests** — end-to-end validation with production APIs

## Stage 1: Page Extraction Testing with OCR Verification

File: `tests/page-extraction.test.js`
Purpose: Verify that our page content extractors accurately capture visible webpage content.

### Why OCR Verification?

Traditional DOM-based tests can pass even when content isn't actually visible to users. We use Tesseract.js OCR to extract text from screenshots, then compare it with our extractor's output.
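Comparing OCR output against extractor output only works if both sides are normalized first. A minimal normalization pass might look like this (our own illustration; the project's exact preprocessing may differ):

```javascript
// Normalize text before OCR comparison: lowercase, strip punctuation,
// collapse whitespace, and drop very short tokens that OCR often mangles.
function normalizeForComparison(text, minWordLength = 3) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // strip punctuation, keep letters/digits
    .split(/\s+/)
    .filter((w) => w.length >= minWordLength);
}

// Example: both OCR text and extractor text reduce to comparable word lists.
console.log(normalizeForComparison('Sign  in - to VIBE browser!'));
// ['sign', 'vibe', 'browser']
```

Running both texts through the same normalizer means the subsequent word-level comparison measures content, not formatting differences.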
This validates that:

- Our extractor captures what users actually see
- No critical content is hidden or missed
- Indexed elements match visual rendering

### Testing Two Extractors

We test both extraction strategies:

```js
// MarkdownPageExtractor - markdown content extraction
const markdownResult = await page.evaluate(() => {
  const extractor = new MarkdownPageExtractor();
  return extractor.extractContent({ showHighlights: false, maxElements: 1000 });
});

// HtmlPageExtractor - indexed HTML for automation
const htmlResult = await page.evaluate(() => {
  const extractor = new HtmlPageExtractor();
  return extractor.extractIndexedHtml({ maxElements: 1000 });
});
```

### The OCR Validation Process

```js
const ocrText = await extractTextFromScreenshot(screenshotPath);

// Calculate baseline: OCR vs raw HTML (establishes OCR accuracy)
const baselineMatchStats = calculateOCRMatch(ocrText, htmlText);
console.log(`Baseline (OCR quality): ${baselineMatchStats.matchPercentage.toFixed(2)}%`);

// Validate extractor: must match baseline within tolerance
const markdownMatchStats = calculateOCRMatch(ocrText, markdownResult.content);
const requiredPercentage = Math.max(90.0, baselineMatchStats.matchPercentage - 1.5);
if (markdownMatchStats.matchPercentage < requiredPercentage) {
  throw new Error('Extractor failing to capture visible content');
}
```

### Fuzzy Matching for OCR Errors

OCR isn't perfect.
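The matcher in `calculateOCRMatch` relies on a `findBestMatch` helper that the article doesn't show. One plausible implementation, using a standard dynamic-programming edit distance (a sketch; the project's actual helper may differ):

```javascript
// Classic dynamic-programming Levenshtein (edit) distance.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Return the closest candidate within maxDistance edits, or null.
function findBestMatch(word, candidates, maxDistance) {
  let best = null;
  let bestDist = maxDistance + 1;
  for (const candidate of candidates) {
    const d = levenshtein(word, candidate);
    if (d < bestDist) {
      best = candidate;
      bestDist = d;
    }
  }
  return bestDist <= maxDistance ? best : null;
}

console.log(levenshtein('test', 'fest'));                         // 1
console.log(findBestMatch('browsers', ["browser's", 'page'], 2)); // browser's
```

The O(len(a) × len(b)) cost per comparison is why the matcher first filters out short words and checks the exact-match set before falling back to fuzzy search.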
We use Levenshtein distance for fuzzy matching:

```js
function calculateOCRMatch(ocrText, markdownText) {
  const ocrWords = ocrText.split(/\s+/).filter(w => w.length >= 3);
  const mdWordsArray = markdownText.split(/\s+/);
  const mdWordsSet = new Set(mdWordsArray);
  const totalWords = ocrWords.length;

  let exactMatches = 0;
  let fuzzyMatches = 0;
  for (const word of ocrWords) {
    if (mdWordsSet.has(word)) {
      exactMatches++;
    } else {
      // Levenshtein distance <= 2 counts as a fuzzy match
      const match = findBestMatch(word, mdWordsArray, 2);
      if (match) fuzzyMatches++;
    }
  }

  const matchPercentage = ((exactMatches + fuzzyMatches) / totalWords) * 100;
  return { matchPercentage, exactMatches, fuzzyMatches, totalWords };
}
```

### Test Coverage

The test suite validates extraction against diverse web pages:

- **Reference Mode**: static HTML + reference screenshot (Gmail, LinkedIn, HackerNews, Stack Overflow)
- **Live Mode**: real websites (Wikipedia, Example.com, financial sites)

**Success criteria**: extractors must achieve a ≥90% absolute match or come within 1.5% of the baseline OCR quality.

## Stage 2: Mock LLM Server Testing

File: `tests/extension.mock.test.js`
Mock server: `tests/utils/mock-llm-test-server.js`
Purpose: Test the complete agent workflow without expensive API calls.

### Why Mock Testing?

Testing with real LLMs is:

- **Expensive**: $0.01-$0.10+ per test run
- **Slow**: network latency plus LLM inference
- **Non-deterministic**: different responses each run

Our mock server provides deterministic, instant responses while following the exact OpenAI API contract.

### Mock Server State Machine

The mock server implements a phase-based state machine matching the expected agent flow:

```js
const testState = {
  phase: 'initial',
  messageCount: 0,
  toolCallsExecuted: []
};

app.post('/v1/chat/completions', (req, res) => {
  const { messages, tools } = req.body;
  const userMessage = messages?.find(m => m.role === 'user')?.content || '';

  // Phase 1: Initial query -> navigate
  if (testState.phase === 'initial' && userMessage.includes('test')) {
    response = {
      choices: [{
        message: {
          role: 'assistant',
          content: "Let's test it",
          tool_calls: [{
            function: {
              name: 'navigate_to_url',
              arguments: JSON.stringify({ url: `http://localhost:${PORT}/test-page` })
            }
          }]
        }
      }]
    };
    testState.phase = 'navigated';
  }
  // Phase 2: Navigation result -> interact with page
  else if (testState.phase === 'navigated') {
    response = {
      choices: [{
        message: {
          tool_calls: [
            { function: { name: 'click_by_index', arguments: '{"index": 0}' } },
            { function: { name: 'fill_by_index', arguments: '{"index": 9, "value": "Test Input Value"}' } },
            { function: { name: 'select_by_index', arguments: '{"index": 10, "value": "economy"}' } }
          ]
        }
      }]
    };
    testState.phase = 'interacting';
  }
  // Phase 3: Tool results -> completion
  else if (testState.phase === 'interacting') {
    response = { choices: [{ message: { content: 'Test completed successfully!' } }] };
    testState.phase = 'completed';
  }
});
```

### Dynamic Port Allocation

Tests run in parallel, so we dynamically allocate ports to prevent conflicts:

```js
async function findAvailablePort(startPort) {
  return new Promise((resolve, reject) => {
    const server = createServer();
    server.listen(startPort, () => {
      const port = server.address().port;
      server.close(() => resolve(port));
    });
    server.on('error', (err) => {
      if (err.code === 'EADDRINUSE') {
        findAvailablePort(startPort + 1).then(resolve).catch(reject);
      }
    });
  });
}
```

### Extension Test Flow

```js
// 1. Configure the extension via its settings page
await page.goto(`chrome-extension://${extensionId}/settings.html`);
await allInputs[0].type('openai:gpt-5-mini');
await allInputs[1].type('mock-api-key-123');
await allInputs[2].type(`http://127.0.0.1:${MOCK_SERVER_PORT}/v1`);

// 2. Trigger agent via home page
await page.goto(`chrome-extension://${extensionId}/home.html`);
const searchInput = await page.$('input[placeholder*="URL"]');
await searchInput.type("Let's test Vibe Browser");
await page.keyboard.press('Enter');

// 3. Wait for sidepanel to open (validates Chrome side panel API integration)
const sidepanelTarget = targets.find(t => t.url().includes('sidepanel.html'));
if (!sidepanelTarget) {
  throw new Error('Sidepanel failed to open');
}
```
```js
// 4. OCR verification of tool execution
await verifyScreenshotContainsText(
  screenshot,
  ['navigate'],
  'Navigation Tool (OCR)',
  { exactMatch: false, similarityThreshold: 60 }
);
```

### DOM Verification of Tool Execution

OCR proves the UI updated, but we also verify actual form state:

```js
const formValues = await testPage.evaluate(() => {
  const input = document.getElementById('destinationInput');
  const select = document.getElementById('classDropdown');
  return {
    inputValue: input?.value || '',
    selectedClass: select?.value || ''
  };
});

if (!formValues.inputValue.includes('Test Input Value')) {
  throw new Error(`Expected "Test Input Value", found "${formValues.inputValue}"`);
}
if (formValues.selectedClass !== 'economy') {
  throw new Error(`Expected "economy", found "${formValues.selectedClass}"`);
}
```

### Screenshots Throughout Execution

We capture screenshots at every phase using the Chrome DevTools Protocol (CDP):

```js
async function takeAllPagesScreenshots(step) {
  const targets = await browser.targets();
  for (const target of targets) {
    const cdpSession = await target.createCDPSession();
    const { frameTree } = await cdpSession.send('Page.getFrameTree');
    const { data } = await cdpSession.send('Page.captureScreenshot', { format: 'png' });
    fs.writeFileSync(screenshotPath, Buffer.from(data, 'base64'));
  }
}
```

## Stage 3: Real LLM Integration Tests

File: `tests/extension.test.js`
Purpose: Validate end-to-end functionality with the production OpenRouter API.

### Real-World Validation

Mock tests verify our code logic, but real LLMs test:

- Prompt engineering effectiveness
- Tool calling accuracy
- Error handling with actual API responses
- Performance under real network conditions

### Test Configuration

```js
if (!process.env.OPENROUTER_API_KEY) {
  throw new Error('OPENROUTER_API_KEY environment variable not set');
}

// Configure extension with the real API
await allInputs[0].type('openrouter:gpt-oss-120b');      // real model
await allInputs[1].type(process.env.OPENROUTER_API_KEY); // real API key
// Note: no base URL override - uses production OpenRouter
```

### Complex Real-World Task

Instead of controlled test pages, we test against real websites:

```js
const testQuery = "Navigate to google.com, search for 'raspberry pi zero 2 w', click first link";

// Verify actual navigation to external sites
const externalPage = pages.find(page => {
  const url = page.url();
  return !url.includes('chrome-extension://') && url.startsWith('http');
});

if (!externalPage) {
  throw new Error('Agent did not navigate to any external page');
}
console.log(`Agent navigated to: ${externalPage.url()}`);
```

### Handling Real-World Failures

Real websites have CAPTCHAs, rate limits, and anti-bot measures. Our tests accept graceful degradation:

```js
await verifyScreenshotContainsText(
  screenshot,
  ['filled', 'successfully', 'failed', 'CAPTCHA', 'verification'],
  'Agent Completion/Error (OCR)',
  { exactMatch: false, fuzzyMatch: true, similarityThreshold: 60 }
);
```

## Key Testing Insights

### OCR as Ground Truth

OCR provides an objective measure of what's actually rendered. Traditional DOM assertions can pass even when:

- Elements are hidden (`display: none`, `visibility: hidden`)
- Content is off-screen or behind other elements
- CSS renders content as white-on-white
- JavaScript hasn't finished rendering

### Fuzzy Matching is Critical

OCR introduces errors. Levenshtein distance ≤ 2 catches:

- Character recognition errors: "test" → "fest"
- Missing punctuation: "browser's" → "browsers"
- Extra/missing whitespace

### Screenshot Every Step

Automated tests fail. Screenshots enable post-mortem debugging:

```
.test/ExtensionMock-2025-10-07T14-30-00/screenshots/
  01_settings.html.png
  02_home.html.png
  03_home.html.png
  04_sidepanel.html.png
  05_sidepanel.html.png
  06_localhost_3456_test-page.png
  07_localhost_3456_test-page.png
  08_sidepanel.html.png
```

### Test Isolation with Cleanup

Browser automation leaks resources.
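One detail worth making explicit: when the same cleanup can fire from several exit hooks, it should be idempotent, or a second invocation may double-close resources. A minimal guard (our sketch; `makeIdempotent` and the registered callback are illustrative, not the project's code):

```javascript
// Wrap a cleanup routine so repeated exit hooks only run it once.
function makeIdempotent(cleanupFn) {
  let done = false;
  return async (...args) => {
    if (done) return;  // later invocations become no-ops
    done = true;
    await cleanupFn(...args);
  };
}

// Hypothetical usage: register one guarded handler for every exit path.
const guardedCleanup = makeIdempotent(async () => {
  // close the browser, kill the mock server, release OCR workers...
});
process.on('SIGINT', guardedCleanup);
process.on('SIGTERM', guardedCleanup);
```

Because the guard flips its flag before the first `await`, even overlapping signals cannot trigger the teardown twice.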
Proper cleanup is critical:

```js
async function cleanup() {
  if (browser) {
    await browser.close();
  }
  if (mockServer && !mockServer.killed) {
    mockServer.kill('SIGKILL');
  }
  await cleanupOCR();
  cleanupPerformed = true;
}

// Handle all exit scenarios
process.on('SIGINT', async () => await cleanup());
process.on('SIGTERM', async () => await cleanup());
process.on('uncaughtException', async () => await cleanup());
process.on('unhandledRejection', async () => await cleanup());
```

## Testing Philosophy

### Unit Tests Are Not Enough

Traditional unit tests can't validate:

- Chrome extension APIs (side panel, omnibox)
- Browser automation (Puppeteer/Playwright)
- LLM tool calling accuracy
- Visual rendering correctness

### Mock vs. Real: Both Required

Mock tests give fast feedback during development and deterministic CI/CD pipelines. Real tests catch production issues, validate prompts, and verify API integration.

### Evidence-Based Debugging

Every test produces evidence:

- Screenshots at each step
- OCR-extracted text
- Extracted page content (markdown + HTML)
- DOM element lists
- Console logs

When tests fail, we don't guess — we review the evidence.

## Running the Tests

```sh
# Stage 1: Page extraction with OCR verification
node tests/page-extraction.test.js

# Stage 2: Mock LLM server testing
node tests/extension.mock.test.js

# Stage 3: Real LLM integration (requires API key)
export $(< .env)
node tests/extension.test.js
```

## Conclusion

Testing AI agents requires rethinking traditional testing strategies. Our three-stage approach:

1. Validates core extraction with OCR-verified unit tests
2. Tests agent logic with deterministic mock LLMs
3. Confirms production readiness with real API integration

The key insight: don't trust the DOM. Verify what users actually see using OCR. Combined with comprehensive screenshot capture, this creates an evidence-based testing pipeline that catches issues traditional tests miss.