Browser Automation Meets AI
Forms are everywhere. Applications, registrations, vendor portals, government filings, client intake screens. If you run a business, you fill out forms. A lot of them. And most of the time, you're entering the same information you've entered a hundred times before.
I built an agent that handles it. Not a browser extension that autofills your name and address. An AI-powered system that looks at any form on any web page, figures out what it's asking, maps the right values from a profile, fills every field, and then waits for you to review before submitting.
The stack is Node.js, Playwright for browser automation, and Claude for the intelligence layer. Here's how the pipeline works.
The Four-Stage Pipeline
Every form interaction follows the same four stages. The system moves through them sequentially, with each stage producing structured output that feeds the next.
URL
 |
[1] Detect Fields ---- DOM extraction + vision fallback
 |
[2] AI Mapping ------- Claude matches fields to profile values
 |
[3] Fill Form -------- Simulated keystrokes + iframe routing
 |
[4] Human Review ----- Screenshot + interactive CLI
 |
Submit / Edit / Cancel
No stage is optional. The human review at the end is a hard requirement, not a nice-to-have. More on that later.
Stage 1: Field Detection
The agent navigates to the target URL in a Chromium browser and runs JavaScript in the page context. It queries every input, select, and textarea on the page, then extracts metadata for each one: field type, label text, placeholder, ARIA attributes, name attribute, and logical grouping.
Labels are the tricky part. A well-built form wraps each input in a <label> element, but plenty of forms use floating labels, aria-label, placeholder-only patterns, or no label at all. The detection script handles all of these, walking up the DOM tree to find the nearest descriptive text when explicit labels are missing.
It also handles grouped fields. Radio button sets, checkbox groups, and multi-select dropdowns get extracted as a single logical field with all available options listed. This gives the AI mapper the full picture of what's being asked, not just a collection of disconnected elements.
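The grouping step can be sketched as a pure function over the raw extracted elements. The field shape here (type, name, value, label) is illustrative, not the exact schema the agent uses:

```javascript
// Collapse raw radio/checkbox inputs that share a `name` attribute
// into one logical field carrying the full list of options.
function groupFields(rawFields) {
  const grouped = new Map();
  for (const f of rawFields) {
    const groupable = f.type === 'radio' || f.type === 'checkbox';
    // Groupable inputs share a key; everything else gets a unique one.
    const key = groupable && f.name ? `${f.type}:${f.name}` : Symbol();
    if (!grouped.has(key)) {
      grouped.set(key, { ...f, options: groupable ? [] : undefined });
    }
    if (groupable) {
      grouped.get(key).options.push({ value: f.value, label: f.label });
    }
  }
  return [...grouped.values()];
}
```

Downstream, the AI mapper sees one "Size: Small / Large" field instead of two disconnected radio inputs.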
Cross-Origin Iframes
Many forms embed third-party components inside iframes. Payment processors, CAPTCHA widgets, embedded surveys. The detection stage identifies every iframe on the page, switches into each frame's context, runs the same extraction logic, and tags the results with the iframe's source. This means the fill stage knows exactly which frame to target for each field.
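A sketch of that routing, assuming each detected field carries a `frameUrl` tag (null for the main page). The `page.frames()`, `frame.url()`, and `page.mainFrame()` calls mirror Playwright's real API; the `frameUrl` tag itself is an assumption of this sketch:

```javascript
// Resolve a field's frame tag back to a live frame at fill time.
function resolveFrame(page, field) {
  if (!field.frameUrl) return page.mainFrame();
  const frame = page.frames().find((f) => f.url() === field.frameUrl);
  if (!frame) {
    throw new Error(`Frame not found for field: ${field.frameUrl}`);
  }
  return frame;
}
```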
Vision Fallback
Here's where it gets interesting. Some forms are built with custom components, canvas elements, or JavaScript rendering that produces no queryable DOM elements. The detection script checks its results: if it finds fewer than two fields via DOM extraction, it triggers a fallback.
The fallback takes a full-page screenshot and sends it to Claude's vision capability. The AI analyzes the screenshot and identifies form fields visually — reading labels from the image, inferring field types from visual cues, and producing the same structured output that DOM extraction would. It's slower and less precise, but it means the system doesn't just give up on non-standard forms.
Graceful degradation. DOM extraction handles 90% of forms quickly and accurately. Vision handles the rest.
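The degradation logic reduces to a small decision, sketched here with hypothetical extractor functions standing in for the two detection paths (`page.screenshot` is the real Playwright call):

```javascript
const MIN_DOM_FIELDS = 2;

// If DOM extraction finds fewer than two fields, assume the form is
// non-standard and fall back to vision.
function needsVisionFallback(domFields) {
  return domFields.length < MIN_DOM_FIELDS;
}

async function detectFields(page, extractDomFields, extractFromScreenshot) {
  const domFields = await extractDomFields(page);
  if (!needsVisionFallback(domFields)) {
    return { fields: domFields, method: 'dom' };
  }
  const shot = await page.screenshot({ fullPage: true });
  return { fields: await extractFromScreenshot(shot), method: 'vision' };
}
```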
Stage 2: Intelligent Mapping
This is the stage where raw field data becomes an action plan. The extracted fields, the page context (title, URL, any visible headings), and a user profile are sent to Claude in a single structured prompt.
The AI's job is to determine the best-matching value for every field. It returns a mapping: field identifier, the value to enter, and a confidence level. For dropdown fields, it selects the closest matching option from the available choices. For radio groups, it picks the appropriate option.
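The structured prompt might be assembled like this. The wording is illustrative, but the key point survives: the page context rides along with the field list and profile:

```javascript
// Build one prompt carrying fields, page context, and profile data.
function buildMappingPrompt(fields, pageContext, profile) {
  return [
    `You are filling a web form. Page: "${pageContext.title}" (${pageContext.url})`,
    `Profile data: ${JSON.stringify(profile)}`,
    `Fields: ${JSON.stringify(fields)}`,
    'For each field, return a JSON array of {"id", "value", "confidence"}.',
    'Return null as the value for passwords, SSNs, card numbers, or any other sensitive field.',
  ].join('\n');
}
```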
Two critical behaviors are built into this stage:
Context-aware matching. The page title and URL give Claude essential context. A field labeled "Name" on a job application means the applicant's name. The same field on a vendor registration form means the company name. The surrounding page context resolves this ambiguity, and accuracy is measurably better with it included.
Sensitive field skipping. The AI is instructed to return null for any field that requests passwords, Social Security numbers, credit card numbers, or other sensitive data. No exceptions. The profile system stores organizational and contact information — never credentials. If a form asks for a password, that field stays empty for the human to handle.
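Handling the response can be sketched as a parse-and-split: fields the model mapped get filled, fields it returned null for are set aside for the operator. The response shape (id / value / confidence) is an assumption of this sketch:

```javascript
// Split the model's mapping into fields to fill and fields
// deliberately left empty for the human.
function parseMapping(responseText) {
  const entries = JSON.parse(responseText);
  const toFill = [];
  const skipped = [];
  for (const e of entries) {
    (e.value === null ? skipped : toFill).push(e);
  }
  return { toFill, skipped };
}
```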
The Profile System
Profiles are JSON files containing the data the mapper draws from. Organization name, contact details, addresses, EIN-type identifiers, standard answers to common questions. No values are hardcoded in the application logic — everything comes from the active profile at runtime.
This means the same agent works for multiple organizations. Swap the profile, and the mapper uses different data. One codebase, many use cases.
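A profile might look something like this. The field names and values are illustrative, not the actual schema:

```json
{
  "organization": {
    "name": "Acme Services LLC",
    "website": "https://example.com"
  },
  "contact": {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100"
  },
  "address": {
    "street": "123 Main St",
    "city": "Springfield",
    "state": "IL",
    "zip": "62701"
  },
  "answers": {
    "employeeCount": "10-50",
    "yearsInBusiness": "8"
  }
}
```

Note what's absent: no passwords, no payment details, no credentials of any kind.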
Stage 3: Form Filling
With the mapping in hand, the agent fills each field. This is Playwright doing what Playwright does best — but with a few important details.
Selector construction. For each field, the agent builds a CSS selector dynamically. It prefers ID selectors (most specific), then falls back to the name attribute, then to placeholder text. The goal is the most stable, least fragile selector available for each field.
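The preference order reduces to a few lines. This sketch skips CSS escaping for brevity; real IDs with special characters would need `CSS.escape` or equivalent:

```javascript
// Build the most specific stable selector available for a field.
function buildSelector(field) {
  if (field.id) return `#${field.id}`;
  if (field.name) return `[name="${field.name}"]`;
  if (field.placeholder) return `[placeholder="${field.placeholder}"]`;
  throw new Error(`No stable selector for field: ${field.label ?? 'unknown'}`);
}
```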
Iframe routing. If a field was detected inside an iframe, the fill operation switches to that frame's context before interacting with the element. This is transparent to the rest of the pipeline — the mapping stage tags fields with their frame origin, and the fill stage routes accordingly.
Human-like input simulation. This matters more than you'd think. Many modern forms have JavaScript validation that triggers on specific input events. The agent doesn't just set the value property — it clicks the field, types each character with a random delay between 30 and 80 milliseconds, and pauses between fields for 100 to 400 milliseconds. This simulates natural input patterns and avoids triggering anti-bot detection or breaking client-side validation logic.
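The typing pattern can be sketched like this. The `locator` argument is duck-typed to Playwright's Locator API (`click()` and `pressSequentially()` are real methods); the jitter helpers and the exact loop structure are assumptions:

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
// Random integer in [min, max], inclusive.
const jitter = (min, max) => min + Math.floor(Math.random() * (max - min + 1));

async function typeLikeHuman(locator, value) {
  await locator.click(); // focus the field first
  for (const ch of value) {
    // pressSequentially fires real key events, so client-side
    // validation sees each keystroke.
    await locator.pressSequentially(ch, { delay: 0 });
    await sleep(jitter(30, 80)); // per-character delay
  }
  await sleep(jitter(100, 400)); // pause before the next field
}
```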
Scroll handling. Before interacting with any field, the agent scrolls it into view. Long forms with dozens of fields often have off-screen elements that can't receive input events until they're visible in the viewport.
Stage 4: Human Review
After every field is filled, the agent takes a screenshot of the completed form and presents it in the CLI alongside a summary of every field and the value that was entered.
The operator gets three choices: submit, edit specific fields, or cancel entirely.
This isn't a suggestion. The system will not submit a form without explicit human approval. There's no "auto-submit" flag, no batch mode that bypasses review. The agent can detect the submit button on the page — it looks for buttons with submit types, common text patterns, or form-associated elements — but it only clicks after confirmation.
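The detection half of that can be sketched as a heuristic over extracted button descriptors. The descriptor shape and the text patterns are illustrative:

```javascript
const SUBMIT_TEXT = /\b(submit|apply|send|register|continue)\b/i;

// A button is a likely submit target if it has type="submit"
// or submit-like visible text.
function isLikelySubmit(button) {
  if (button.type === 'submit') return true;
  return SUBMIT_TEXT.test(button.text ?? '');
}
```

Finding the button is cheap and automatic; clicking it never happens without the operator's explicit confirmation.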
This is the design decision I feel most strongly about. The entire pipeline is about saving time on the tedious parts: finding fields, figuring out what goes where, typing it all in. But the moment of commitment — "yes, submit this" — stays with the human. Always.
Key Design Patterns
A few patterns emerged during development that I think apply well beyond form automation:
- Graceful degradation: DOM extraction is fast and reliable for standard forms. Vision fallback handles edge cases. The system always produces output, even when the primary method fails. Build the fast path first, then add fallbacks for the weird cases.
- Context-aware prompting: Sending page title and URL alongside the field data made a measurable difference in mapping accuracy. When you're sending structured data to an AI, include the surrounding context. It's cheap and the improvement is real.
- Cross-origin isolation: Iframes are a fact of life on the modern web. Any automation tool that ignores them will break on real-world forms. Tag elements with their origin context at extraction time, and route operations accordingly at fill time.
- Mandatory human review: Automation should handle the repetitive parts. Commitment should stay with the human. This isn't just a safety measure — it's a trust-building pattern. Users trust the system more when they know they'll always get the final say.
What This Isn't
This isn't a general-purpose web scraper. It doesn't collect data from pages. It doesn't navigate multi-step wizards autonomously. It doesn't handle CAPTCHAs. It fills forms — one page, one form, one interaction at a time.
It also isn't a replacement for proper API integrations. If a vendor has an API, use the API. This tool exists for the long tail of web interfaces that don't offer one — the government portals, the legacy vendor systems, the registration pages that only exist as HTML forms.
The Bigger Picture
Browser automation has been around for years. Selenium, Puppeteer, Playwright — the tooling is mature. What's new is the intelligence layer on top. The AI doesn't just follow a script. It reads the form, understands what's being asked, and decides what to enter.
That's the pattern I keep coming back to across every tool I build: take a well-understood automation framework, add an AI layer that handles the judgment calls, and keep a human in the loop for the decisions that matter.
Automation for the tedious parts. Intelligence for the ambiguous parts. Humans for the important parts.
That's the split. And it works.