VeriEnv

Verifiable environments for self-evolving web agents.

VeriEnv automatically clones real-world websites into fully executable synthetic environments, exposing internal state through a Python SDK so agents can learn from deterministic, verifiable rewards instead of brittle LLM-as-a-judge feedback.

Animated showcase of websites reconstructed with VeriEnv
Safe Exploration: clone first, train later

Learn in controlled website replicas without touching live production environments.

Verified Rewards: tasks and judges grounded in code and database state

Every trajectory can be checked programmatically through the synthetic environment.
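As a minimal sketch of what a programmatic trajectory check could look like, here is a deterministic judge over database state. Every name in this example (`judge_order_placed`, the row schema) is a hypothetical stand-in for illustration, not VeriEnv's actual SDK:

```python
# Hypothetical sketch: a deterministic judge that inspects the cloned
# site's database state after an episode. No LLM opinion is involved:
# the final state either matches the goal or it does not.

def judge_order_placed(db_rows, expected_item, expected_qty):
    """Return a verified reward: 1.0 if the goal row exists, else 0.0."""
    for row in db_rows:
        if row["item"] == expected_item and row["qty"] == expected_qty:
            return 1.0
    return 0.0

# Example: rows read from the synthetic environment after the agent acts.
rows = [{"item": "mug", "qty": 2}, {"item": "pen", "qty": 1}]
print(judge_order_placed(rows, "mug", 2))  # 1.0
print(judge_order_placed(rows, "pen", 3))  # 0.0
```

Because the check executes against concrete state, re-running the same trajectory always yields the same reward.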

The core idea is simple: treat language models as environment creators, not just action policies. By reconstructing websites into instrumented training worlds, VeriEnv makes web-agent self-evolution safe, repeatable, and scalable.

Motivation

Why real websites are a bad training substrate

Direct self-evolution on the open web is unsafe, hard to reset, and typically scored against ambiguous instructions or non-verifiable reward signals. VeriEnv replaces that loop with controllable, executable replicas.

Motivation figure comparing traditional self-evolution and VeriEnv

Traditional self-evolution vs. verifiable environments

The paper’s motivation figure contrasts fragile real-world exploration with VeriEnv’s synthetic websites, validated tasks, and deterministic reward signals.


Method Overview

From a real website to a verifiable training environment

VeriEnv uses a coding agent to reconstruct the full stack of a website, then generates tasks and judges that interact with both the UI and the database through an SDK for end-to-end verification.

01 Clone: rebuild frontend, backend, database, and local tooling.

02 Generate: create tasks at multiple difficulty levels automatically.

03 Verify: judge outcomes deterministically with executable checks over environment state.
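The three stages above can be sketched as a small loop. All of the functions and data shapes here (`build_environment`, `evaluate`, the stub lambdas) are illustrative assumptions about the pipeline's structure, not VeriEnv's real API:

```python
# Illustrative sketch of the clone -> generate -> verify loop.
# Every name is a hypothetical stand-in, not VeriEnv's SDK.

def build_environment(site_url, clone_fn, generate_fn):
    """01 Clone the site, then 02 generate (task, judge) pairs against the clone."""
    env = clone_fn(site_url)        # executable replica: frontend + backend + DB
    pairs = generate_fn(env)        # tasks at several difficulty levels
    return env, pairs

def evaluate(env_state, pairs, run_agent):
    """03 Verify: run the agent per task, score with its executable judge."""
    rewards = {}
    for task, judge in pairs:
        final_state = run_agent(env_state, task)
        rewards[task] = judge(final_state)  # deterministic check over state
    return rewards

# Tiny demo with stub implementations (all hypothetical):
env, pairs = build_environment(
    "https://example.com",
    clone_fn=lambda url: {"db": {"todos": []}},
    generate_fn=lambda e: [
        ("add a todo", lambda state: 1.0 if state["db"]["todos"] else 0.0),
    ],
)
rewards = evaluate(env, pairs, run_agent=lambda state, task: {"db": {"todos": ["x"]}})
print(rewards)  # {'add a todo': 1.0}
```

The key property is that the judge is generated alongside the task and closes over the same environment, so success is defined by code, not by a grader's interpretation.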

Method overview figure for VeriEnv

Instrumented website cloning pipeline

This overview figure shows the full VeriEnv loop: clone a website, expose code/database interfaces, generate task-judge pairs, and train agents using verified reward signals.


Why It Matters

Environment scaling instead of prompt hacking

VeriEnv shifts the bottleneck for web-agent training from unverifiable language supervision to scalable environment construction. More websites mean more tasks, more coverage, and more stable reinforcement signals.

149 websites cloned into executable environments
49.5 tasks generated per website on average
40.2% easy tasks for broad coverage and bootstrapping
39.2% medium tasks for steady skill development
20.6% hard tasks for long-horizon, high-value evaluation
Verified judging through executable checks, not free-form opinion
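The figures above can be sanity-checked with quick arithmetic. The total task count below is derived from the stated website count and per-site average, so it is approximate, not a number reported by the project:

```python
# Back-of-the-envelope check on the reported statistics (approximate by nature).
websites = 149
avg_tasks = 49.5
total_tasks = websites * avg_tasks            # 7375.5, roughly 7,400 tasks overall

mix = {"easy": 0.402, "medium": 0.392, "hard": 0.206}

print(round(total_tasks))                     # 7376
print(round(sum(mix.values()), 3))            # 1.0 -> the three tiers cover all tasks
for level, share in mix.items():
    print(level, round(total_tasks * share))  # approximate per-tier task counts
```

The split itself is the point: a large easy tier for bootstrapping, a comparable medium tier for steady progress, and a smaller hard tier reserved for long-horizon evaluation.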

Takeaway

Train web agents in environments you can trust.

VeriEnv makes it possible to scale self-evolving web agents with synthetic websites that are faithful enough to be useful and instrumented enough to be verifiable.