Generic Support for Meta's Agent Research Environment by cemde · Pull Request #55 · parameterlab/MASEval

cemde · 2026-03-28T07:13:28Z

Description

ARE (Agent Research Environments) is Meta's platform for building and evaluating LLM-based agents in dynamic, multi-step simulations. MASEval already integrates with ARE through the GAIA2 benchmark, but the ARE-specific logic (lifecycle control, tool wrapping, notification polling) lives entirely inside Gaia2Environment. This means any new ARE-based benchmark would need to duplicate that code.

This PR extracts a generic AREEnvironment and AREToolWrapper so that MASEval can support any ARE-based benchmark through a shared base. Gaia2Environment becomes a thin subclass that only retains GAIA2-specific setup and trace gathering.

What's new

AREEnvironment (maseval/interface/environments/are.py): full ARE lifecycle (start/stop/pause/resume), notification polling (poll_notifications, get_turn_notifications), AUI tool filtering, oracle mode, and both scenario-path and shorthand construction paths.
AREToolWrapper (maseval/interface/environments/are_tool_wrapper.py): wraps any ARE AppTool with simulation-time-aware tracing, invocation history, and JSON schema extraction. Delegates metadata to ARE's AppToolAdapter.
maseval/interface/environments/ package for shared environment implementations.
`are` optional dependency extra in pyproject.toml.
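To make the lifecycle and polling responsibilities listed above concrete, here is a minimal sketch of the shape such an environment could take. It does not import ARE or MASEval. `SketchEnvironment`, the `State` enum, `push_notification`, and the drain-on-poll behaviour are assumptions for illustration; only the method names `start`, `stop`, `pause`, `resume`, and `poll_notifications` come from this description.

```python
from collections import deque
from enum import Enum


class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"


class SketchEnvironment:
    """Hypothetical stand-in; not the real AREEnvironment."""

    def __init__(self):
        self.state = State.CREATED
        self._pending = deque()  # notifications not yet delivered to the agent

    def start(self):
        self.state = State.RUNNING

    def pause(self):
        # Pausing only makes sense while the simulation is running.
        if self.state is State.RUNNING:
            self.state = State.PAUSED

    def resume(self):
        if self.state is State.PAUSED:
            self.state = State.RUNNING

    def stop(self):
        self.state = State.STOPPED

    def push_notification(self, message):
        """Helper for this sketch: simulates the simulation emitting a notification."""
        self._pending.append(message)

    def poll_notifications(self):
        """Drain and return all pending notifications in arrival order (assumed semantics)."""
        drained = list(self._pending)
        self._pending.clear()
        return drained


env = SketchEnvironment()
env.start()
env.push_notification("new email")
env.push_notification("calendar reminder")
first_poll = env.poll_notifications()
second_poll = env.poll_notifications()  # nothing new arrived, so this is empty
env.pause()
paused_state = env.state
env.resume()
env.stop()
```

Keeping lifecycle state and the notification queue in one shared class is what lets a benchmark-specific subclass avoid reimplementing either.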

What changed

Gaia2Environment now inherits from AREEnvironment. Only GAIA2-specific logic remains: setup_state with preprocessing + judge, gather_traces, and gather_config.
Gaia2GenericTool is now an alias for AREToolWrapper.
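The refactoring pattern described above (a generic base, a thin subclass that overrides only benchmark-specific hooks, and an alias that keeps an old name importable) can be sketched as follows. All class names here are hypothetical stand-ins, not the MASEval classes; the hook names `setup_state` and `gather_config` and the parameter name `environment_data` are borrowed from this PR, but their bodies are invented for illustration.

```python
class BaseEnvironment:
    """Hypothetical generic base: owns behaviour shared by every benchmark."""

    def __init__(self, environment_data):
        self.environment_data = environment_data
        self.state = self.setup_state(environment_data)

    def setup_state(self, environment_data):
        # Default hook: use a copy of the data as-is.
        return dict(environment_data)

    def gather_config(self):
        return {"type": type(self).__name__}


class BenchmarkEnvironment(BaseEnvironment):
    """Hypothetical thin subclass: overrides only benchmark-specific hooks."""

    def setup_state(self, environment_data):
        state = super().setup_state(environment_data)
        state["preprocessed"] = True  # stands in for preprocessing and judge setup
        return state

    def gather_config(self):
        config = super().gather_config()
        config["benchmark"] = "example"
        return config


class GenericToolWrapper:
    """Hypothetical shared tool wrapper."""


# Backwards-compatible alias: the old name keeps working without a duplicate class.
LegacyToolName = GenericToolWrapper

env = BenchmarkEnvironment({"scenario": "demo"})
```

Because the alias is the same object as the new class, existing `isinstance` checks and imports that use the old name behave identically.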

Tests

Three test files cover this work: unit tests for AREToolWrapper, mock-based tests for AREEnvironment, and real ARE integration tests that exercise the full lifecycle with real apps and scenarios.

Type of Change

New feature (non-breaking change that adds functionality)
Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

I have read the CONTRIBUTING.md guide.
Commits follow "How to write a good git commit message"

Documentation

Added/updated docstrings for new/modified functions
Updated relevant documentation in docs/

Changelog

Added entry to CHANGELOG.md under [Unreleased] section

Architecture (if applicable)

Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
Dependencies: ARE integration uses optional dependencies
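One common way to honour the optional-dependency rule is to import the optional package lazily and raise an error that names the extra to install. The helper below is a sketch of that idea, not code from this PR: `require_optional` is a hypothetical name, and the install hint assumes the distribution is published as `maseval` with an extra called `are`.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional module, or raise an ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'maseval[{extra}]'"  # assumed package name
        ) from exc


# A module that exists imports normally; a missing one yields the helpful message.
json_module = require_optional("json", extra="are")
try:
    require_optional("module_that_does_not_exist_xyz", extra="are")
    message = ""
except ImportError as exc:
    message = str(exc)
```

Calling such a helper inside the interface module, rather than at the top of core modules, keeps the core importable when the extra is not installed.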


github-actions · 2026-03-28T07:20:00Z

Coverage report

File	Lines missing
maseval/benchmark/gaia2/__init__.py	
maseval/benchmark/gaia2/environment.py	
maseval/benchmark/gaia2/evaluator.py	148
maseval/benchmark/gaia2/gaia2.py	
maseval/benchmark/gaia2/tool_wrapper.py	
maseval/interface/__init__.py	
maseval/interface/environments/__init__.py	13-14
maseval/interface/environments/are.py	29-30, 138, 188, 232, 337-338, 384, 394, 495, 505, 518-519, 531-532, 549
maseval/interface/environments/are_tool_wrapper.py	67

This report was generated by python-coverage-comment-action

Record simulation_time_before, simulation_time_after, and simulation_time_elapsed in invocation meta dict, matching Gaia2GenericTool behavior. Gracefully returns None when simulation time is unavailable.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
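The recording behaviour described in this commit can be sketched as below. `TracedTool` is a hypothetical stand-in, not AREToolWrapper, and the `get_sim_time` callable is an assumed way of reading the simulation clock; only the three meta keys are taken from the commit message.

```python
class TracedTool:
    """Hypothetical wrapper that records simulation time around each call."""

    def __init__(self, func, get_sim_time=None):
        self._func = func
        self._get_sim_time = get_sim_time  # callable returning a float, or None
        self.history = []

    def _now(self):
        # None signals that no simulation clock is available.
        return self._get_sim_time() if self._get_sim_time is not None else None

    def __call__(self, **kwargs):
        before = self._now()
        result = self._func(**kwargs)
        after = self._now()
        known = before is not None and after is not None
        self.history.append(
            {
                "inputs": kwargs,
                "output": result,
                "meta": {
                    "simulation_time_before": before,
                    "simulation_time_after": after,
                    "simulation_time_elapsed": after - before if known else None,
                },
            }
        )
        return result


clock = {"t": 10.0}  # fake simulation clock for the example


def wait(seconds):
    clock["t"] += seconds
    return "done"


timed = TracedTool(wait, get_sim_time=lambda: clock["t"])
timed(seconds=5.0)
untimed = TracedTool(wait)  # no clock: all three meta values are None
untimed(seconds=1.0)
```

Storing `None` instead of raising keeps the trace schema stable whether or not a clock exists.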

Delegate AREToolWrapper metadata extraction to ARE's AppToolAdapter (canonical source of truth) instead of reading attributes directly. Remove getattr fallbacks in _extract_schema so missing arg_type or has_default attributes raise immediately, surfacing ARE API changes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
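The difference between a `getattr` fallback and direct attribute access is easiest to see in a small example. The sketch below is not the real `_extract_schema`: `Arg`, `RenamedArg`, and `extract_schema` are hypothetical, and the output layout is only a JSON-schema-like guess. The attribute names `arg_type` and `has_default` are the ones named in the commit message.

```python
from dataclasses import dataclass


@dataclass
class Arg:
    """Hypothetical argument descriptor with the fields named in the commit."""

    name: str
    arg_type: str
    has_default: bool


def extract_schema(args):
    """Build a JSON-schema-like dict; a missing attribute raises AttributeError."""
    properties = {}
    required = []
    for arg in args:
        # Direct access on purpose: no getattr(arg, "arg_type", "string") fallback.
        properties[arg.name] = {"type": arg.arg_type}
        if not arg.has_default:
            required.append(arg.name)
    return {"type": "object", "properties": properties, "required": required}


schema = extract_schema([Arg("query", "string", False), Arg("limit", "integer", True)])


class RenamedArg:
    """Simulates an upstream API change that dropped `arg_type`."""

    name = "query"
    has_default = False


try:
    extract_schema([RenamedArg()])
    failed_loudly = False
except AttributeError:
    failed_loudly = True
```

With a fallback, the renamed attribute would silently produce a default type; without it, the break shows up at the first call.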

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove the outer try/except Exception block that silently returned ([], [], False) on any error. Exceptions now propagate so the benchmark runner can classify them via fail_on_task_error / fail_on_setup_error.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
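Why the swallowed exception matters can be shown with a toy runner. Everything below is a hypothetical sketch: `poll_swallowing`, `poll_propagating`, `broken_fetch`, and `run_task` are invented names, and the way `fail_on_task_error` is interpreted here is an assumption, not MASEval's runner logic.

```python
def poll_swallowing(fetch):
    """Old pattern: any failure is indistinguishable from 'nothing happened'."""
    try:
        return fetch()
    except Exception:
        return ([], [], False)


def poll_propagating(fetch):
    """New pattern: failures reach the caller unchanged."""
    return fetch()


def broken_fetch():
    raise RuntimeError("simulation backend unavailable")


def run_task(poll, fail_on_task_error):
    """Hypothetical runner that decides what to do with a task-level error."""
    try:
        poll(broken_fetch)
        return "ok"
    except RuntimeError as exc:
        if fail_on_task_error:
            raise  # abort the whole run
        return f"task_error: {exc}"  # record the failure and continue


swallowed = run_task(poll_swallowing, fail_on_task_error=False)
recorded = run_task(poll_propagating, fail_on_task_error=False)
```

In the swallowing variant the runner reports `"ok"` for a task whose backend was down, which is the misclassification this commit removes.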

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Existing standalone tool wrapper tests need the same autouse fixture that mocks AppToolAdapter, since AREToolWrapper now delegates to it for metadata extraction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gaia2Environment now inherits tool wrapping, notification polling, lifecycle control, and cleanup from AREEnvironment. Only setup_state (preprocess_scenario + judge) and GAIA2-specific gather_traces/config remain as overrides. Gaia2GenericTool is now an alias for AREToolWrapper.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Main branch renamed task_data to environment_data throughout. Updated AREEnvironment, Gaia2Environment, and all tests to use the new parameter name. Regenerated uv.lock.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Remove unused variable in shorthand path test
- Add ty: ignore[unknown-argument] for ARE Scenario constructor (optional dependency not resolvable by type checker)
- Apply ruff formatting
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

21 integration tests that exercise AREEnvironment against real ARE apps and scenarios, with no mocks. Covers:
- Lifecycle (scenario path, shorthand path, start/stop, pause/resume)
- Tool wrapping (metadata, calling, error tracing, history)
- AUI tool filtering
- Oracle mode with real ARE simulation
- Tracing and config with real tool calls
- Simulation time advancement via wait_for_notification
- Convenience accessors returning real ARE objects
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cemde and others added 9 commits March 27, 2026 22:30

design specifications

1a2b1f5

Add AREEnvironment implementation plan

96e03d1

6-task plan covering: environments package, AREToolWrapper, AREEnvironment core + shorthand path, pyproject.toml extra, and integration tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add maseval/interface/environments/ package

55f36a6

feat: add AREToolWrapper with tracing and metadata

15d1082

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: add AREEnvironment with scenario path and lifecycle control

ee1ee4c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test: add shorthand construction path tests for AREEnvironment

0a0222a

feat: add 'are' optional dependency extra

a9b6dab

test: add ARE integration smoke tests

c239b03

added docs

fb8f8ef

cemde and others added 12 commits March 28, 2026 08:54

ARE issues identified

5915415

gaia2 simplification and issue fixing plan

dcf7139

fix(are): remove hasattr fallbacks from oracle mode

0304e40

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(are): add AUI tool filtering and get_turn_notifications

0c95bdb

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cemde added the enhancement (New feature or request) and interface (regarding the `maseval/interface` subpackage) labels Apr 3, 2026

cemde added 3 commits April 10, 2026 13:49

added code review

0bd966e

updated are integration

56fb7a1

fixed typing error

da6a21d

cemde marked this pull request as ready for review April 10, 2026 15:33

[skip ci] updated changelog

a664b2b

cemde merged commit ef9129f into main Apr 10, 2026

cemde merged 25 commits into main from feature/are-support
