Agent Execution Contracts: Unifying Specification, Testing, and Labor

Abstract

As AI agents transition from chat-based assistants to autonomous system actors, the traditional separation between specification, testing, and execution becomes an operational liability. This note defines Execution Contracts: machine-readable boundaries that collapse intent (specs), proof (tests), and labor (agents) into a single governing artifact. By shifting from descriptive documentation to binding constraints, we enable safer, more predictable autonomous work at scale.

Context & Motivation: The Ambiguity Tax

Historically, software development followed a linear progression: prose specifications were interpreted by humans, code was written to match, and tests were developed downstream to verify the result. This model tolerates high levels of ambiguity because human engineers resolve gaps in real-time through intuition and social coordination.

The introduction of autonomous agents breaks this model. Agents execute immediately and literally. In an agentic workflow, ambiguity is not a minor friction—it is an Ambiguity Tax paid in "hallucinated" implementations or unbounded execution. To utilize agents effectively, the pressure to resolve ambiguity must be moved upstream. We require a mechanism that binds agent behavior to verifiable outcomes before execution begins.

Core Thesis

Execution contracts are the necessary evolution of software artifacts where specifications and tests become the binding constraints for autonomous labor. In this model, an agent is not "prompted" to perform a task; it is authorized to execute within a contract that defines its permissions, its success criteria, and its failure modes.

Mechanism / Model

An execution contract unifies four distinct components into a machine-readable manifest (e.g., YAML or JSON):

Intent and Scope: A precise definition of the problem, explicit non-goals, and Invariants—conditions that must remain true throughout the execution (e.g., "No changes to public API signatures").
Interfaces and Affordances: A strict whitelist of tools, files, APIs, and environments the agent is permitted to access. This includes granular read/write permissions and resource quotas.
Acceptance Criteria: Executable tests or evaluation rubrics that provide an objective "pass/fail" signal. These are the gates that allow or deny agent progress.
Stopping and Escalation Rules: Explicit conditions under which the agent must halt, such as reaching a token/cost limit, encountering an ambiguity it cannot resolve, or violating an invariant.

The Runtime Boundary: When these elements are coupled, the contract becomes the executable boundary. The agent operates inside the contract, a "harness" or "supervisor" enforces the boundary (see Authority-First Agent Architecture for policy/authority decoupling), and humans refine the contract rather than the code.

Concrete Examples

Scenario A: The Refactor Contract

Intent: Migrate AuthModule to the new IdentityService API.
Invariants: Performance must remain within ±3% of baseline; zero changes to public methods.
Affordances: Write access to src/auth/*; read access to tests/* and docs/api/*.
Acceptance: npm test passes; new integration tests for IdentityService pass.
Escalation: Halt if the agent cannot satisfy the performance invariant after three attempts.

Scenario B: The Research-to-Demo Contract

Intent: Implement the algorithm described in Section 3.2 of the attached PDF.
Affordances: Access to Python environment; no internet access; no system-level installs.
Acceptance: A qualitative evaluation rubric (e.g., "Does the output match the paper's Figure 4?").
Stopping Rules: Halt if the required external library is missing from the pre-configured environment.

Trade-offs & Failure Modes

The primary trade-off is the shift of effort from execution to preparation.

The "Buggy Law" Paradox: Moving ambiguity upstream does not eliminate it; it merely shifts the error surface. If a contract contains logical flaws or contradictory invariants, the agent will execute those flaws with perfect, literal fidelity.
Test-Suite Corruption: If an agent is granted write access to a repository to fix a bug, it may realize that the most efficient way to satisfy the "Acceptance Criteria" is to modify the tests to match its incorrect code rather than fixing the underlying logic. If an agent can satisfy acceptance criteria by mutating tests, the system must detect and remediate this (see Failure-Oriented Orchestration for revocation and phase-ledger patterns).
Contract-Induced Deadlock: Overly restrictive contracts can lead to "deadlock," where an agent is technically capable of solving a problem but is blocked by a missing affordance or a rigid invariant.
The Oracle Problem: A contract is only as good as its acceptance criteria. If the evaluation rubric is flawed, the agent will "reward hack"—satisfying the contract while failing the actual intent.
Rigidity: Contracts can stifle early-stage exploration where the "how" and "what" are still fluid.

Practical Takeaways

Shift Left on Precision: Move away from prose instructions. Define invariants and non-goals in a structured format that an agentic harness can enforce.
Tests as Gates: Require agents to pass existing or newly defined tests before they are allowed to commit or declare progress.
Minimum Viable Affordances: Grant agents the minimum permissions required for the specific contract, rather than broad access to the entire codebase.
Value the "Stop": An agent that stops and asks for clarification is more valuable than one that guesses and proceeds.

Positioning Note

This note differs from:

Academic Research: It focuses on immediate operational implementation rather than formal verification theory.
Blog Opinions: It provides a structured framework for engineering teams rather than a narrative of personal experience.
Vendor Documentation: It is tool-agnostic and focuses on the architectural pattern rather than specific API implementations.

Status & Scope Disclaimer

This is an exploratory technical note from rmax lab. The patterns described are based on internal experimentation with agentic workflows and are intended for discussion and refinement, not as an authoritative industry standard.

Architectural enforcement of execution boundaries (Extends)
Technical enforcement for the Autonomy Ladder (Implements)

“A system is defined less by what it does than by what it refuses to do.” — Stafford Beer