VeriEnv

Verifiable environments for self-evolving web agents.

VeriEnv automatically clones real-world websites into fully executable synthetic environments, exposing internal state through a Python SDK so agents can learn from deterministic, verifiable rewards instead of brittle LLM-as-a-judge feedback.

Animated showcase of websites reconstructed with VeriEnv
Safe Exploration: clone first, train later

Learn in controlled website replicas without touching live production environments.

Verified Rewards: tasks and judges grounded in code and database state

Every trajectory can be checked programmatically through the synthetic environment.
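As a minimal sketch of what a programmatic trajectory check could look like, here is a deterministic judge over database state. Every name in this example (`judge_order_placed`, the row schema) is a hypothetical stand-in for illustration, not VeriEnv's actual SDK:

```python
# Hypothetical sketch: a deterministic judge that inspects the cloned
# site's database state after an episode. No LLM opinion is involved:
# the final state either matches the goal or it does not.

def judge_order_placed(db_rows, expected_item, expected_qty):
    """Return a verified reward: 1.0 if the goal row exists, else 0.0."""
    for row in db_rows:
        if row["item"] == expected_item and row["qty"] == expected_qty:
            return 1.0
    return 0.0

# Example: rows read from the synthetic environment after the agent acts.
rows = [{"item": "mug", "qty": 2}, {"item": "pen", "qty": 1}]
print(judge_order_placed(rows, "mug", 2))  # 1.0
print(judge_order_placed(rows, "pen", 3))  # 0.0
```

Because the check executes against concrete state, re-running the same trajectory always yields the same reward.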

The core idea is simple: treat language models as environment creators, not just action policies. By reconstructing websites into instrumented training worlds, VeriEnv makes web-agent self-evolution safe, repeatable, and scalable.

Motivation

Why real websites are a bad training substrate

Direct self-evolution on the open web is unsafe, hard to reset, and typically scored against ambiguous instructions or non-verifiable reward signals. VeriEnv replaces that loop with controllable, executable replicas.

Motivation figure comparing traditional self-evolution and VeriEnv

Traditional self-evolution vs. verifiable environments

The paper’s motivation figure contrasts fragile real-world exploration with VeriEnv’s synthetic websites, validated tasks, and deterministic reward signals.


Method Overview

From a real website to a verifiable training environment

VeriEnv uses a coding agent to reconstruct the full stack of a website, then generates tasks and judges that interact with both the UI and the database through an SDK for end-to-end verification.

01 Clone: rebuild frontend, backend, database, and local tooling.

02 Generate: create tasks at multiple difficulty levels automatically.

03 Verify: judge outcomes deterministically with executable checks over environment state.
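The three stages above can be sketched as a small loop. All of the functions and data shapes here (`build_environment`, `evaluate`, the stub lambdas) are illustrative assumptions about the pipeline's structure, not VeriEnv's real API:

```python
# Illustrative sketch of the clone -> generate -> verify loop.
# Every name is a hypothetical stand-in, not VeriEnv's SDK.

def build_environment(site_url, clone_fn, generate_fn):
    """01 Clone the site, then 02 generate (task, judge) pairs against the clone."""
    env = clone_fn(site_url)        # executable replica: frontend + backend + DB
    pairs = generate_fn(env)        # tasks at several difficulty levels
    return env, pairs

def evaluate(env_state, pairs, run_agent):
    """03 Verify: run the agent per task, score with its executable judge."""
    rewards = {}
    for task, judge in pairs:
        final_state = run_agent(env_state, task)
        rewards[task] = judge(final_state)  # deterministic check over state
    return rewards

# Tiny demo with stub implementations (all hypothetical):
env, pairs = build_environment(
    "https://example.com",
    clone_fn=lambda url: {"db": {"todos": []}},
    generate_fn=lambda e: [
        ("add a todo", lambda state: 1.0 if state["db"]["todos"] else 0.0),
    ],
)
rewards = evaluate(env, pairs, run_agent=lambda state, task: {"db": {"todos": ["x"]}})
print(rewards)  # {'add a todo': 1.0}
```

The key property is that the judge is generated alongside the task and closes over the same environment, so success is defined by code, not by a grader's interpretation.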

Method overview figure for VeriEnv

Instrumented website cloning pipeline

This overview figure shows the full VeriEnv loop: clone a website, expose code/database interfaces, generate task-judge pairs, and train agents using verified reward signals.


Why It Matters

Environment scaling instead of prompt hacking

VeriEnv shifts the bottleneck for web-agent training from unverifiable language supervision to scalable environment construction. More websites mean more tasks, more coverage, and more stable reinforcement signals.

149 websites cloned into executable environments
49.5 tasks generated per website on average
40.2% easy tasks for broad coverage and bootstrapping
39.2% medium tasks for steady skill development
20.6% hard tasks for long-horizon, high-value evaluation
Verified judging through executable checks, not free-form opinion
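The figures above can be sanity-checked with quick arithmetic. The total task count below is derived from the stated website count and per-site average, so it is approximate, not a number reported by the project:

```python
# Back-of-the-envelope check on the reported statistics (approximate by nature).
websites = 149
avg_tasks = 49.5
total_tasks = websites * avg_tasks            # 7375.5, roughly 7,400 tasks overall

mix = {"easy": 0.402, "medium": 0.392, "hard": 0.206}

print(round(total_tasks))                     # 7376
print(round(sum(mix.values()), 3))            # 1.0 -> the three tiers cover all tasks
for level, share in mix.items():
    print(level, round(total_tasks * share))  # approximate per-tier task counts
```

The split itself is the point: a large easy tier for bootstrapping, a comparable medium tier for steady progress, and a smaller hard tier reserved for long-horizon evaluation.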

Takeaway

Train web agents in environments you can trust.

VeriEnv makes it possible to scale self-evolving web agents with synthetic websites that are faithful enough to be useful and instrumented enough to be verifiable.