Skip to content

test: interaction-model end-to-end suite with a requirements manifest#2691

Open
maxisbey wants to merge 34 commits into
mainfrom
maxisbey/interaction-test-suite
Open

test: interaction-model end-to-end suite with a requirements manifest#2691
maxisbey wants to merge 34 commits into
mainfrom
maxisbey/interaction-test-suite

Conversation

@maxisbey
Copy link
Copy Markdown
Contributor

@maxisbey maxisbey commented May 26, 2026

Adds tests/interaction/: an end-to-end suite that enumerates the MCP interaction model — one test per behaviour, asserting full client↔server round trips through the public API only — backed by a requirements manifest that maps every behaviour the SDK must satisfy to the spec section that mandates it, the tests that prove it, and the gaps that remain.

Where to look

  • tests/interaction/README.md — the conventions, the manifest model, the divergence lifecycle, and the transport matrix. Read this first.
  • tests/interaction/_requirements.py — the manifest itself: 360 entries, each with a spec link, the required behaviour, and (where the SDK falls short) a structured divergence or deferred note. The 48 divergences are the SDK's spec-compliance gap report; the 62 deferred entries are the feature backlog.
  • src/ — almost entirely coverage-pragma adjustments (removals where the new tests now reach the line, narrowings where a removed block-level pragma exposed a still-uncovered branch). The one behaviour change is in src/mcp/server/sse.py: connect_sse now closes sse_stream_reader after EventSourceResponse returns, fixing a resource leak the in-process SSE tests surfaced.
  • tests/interaction/transports/_bridge.py — the in-process streaming ASGI transport that lets the suite drive the real streamable-HTTP and SSE apps with no sockets.
  • tests/interaction/_connect.py — the connection factories and the connect fixture that runs every transport-agnostic test over in-memory, streamable HTTP, and SSE.
  • tests/interaction/auth/_provider.py / _harness.py — the in-process authorization-server provider and OAuth harness.

Motivation and Context

Two purposes:

  1. A parity bar for internal rewrites of the send/receive path. The suite pins the SDK's current observable behaviour exactly, so before/after runs against the same suite prove the rewrite equivalent. Tests never xfail; where current behaviour falls short of the spec, the test pins the divergent output and the gap is recorded as data on the manifest entry.
  2. A complete, queryable ledger of the SDK's behaviour surface against the 2025-11-25 spec. The manifest tracks the full requirements surface — including behaviours the SDK has not implemented, each with a precise reason and a slot for a tracking issue. The ID vocabulary is aligned with the TypeScript SDK's end-to-end requirements suite so coverage and gaps can be compared across the two SDKs entry by entry.

What's covered

529 tests across 360 tracked requirements:

  • Every protocol feature: tools, resources, prompts, completion, logging, sampling, elicitation, roots, ping, progress, cancellation, pagination, timeouts, _meta, lifecycle and version negotiation, list-changed notifications, wire-level invariants.
  • Transports: every transport-agnostic test runs over the in-memory transport, the real streamable HTTP app, and the legacy SSE transport — all in-process via the streaming ASGI bridge — plus a single SDK-client↔SDK-server stdio subprocess test. Transport-specific tests cover session management, resumability and Last-Event-ID replay, the GET-stream routing rules, stateless mode, origin validation, and stdio framing.
  • OAuth: the full authorization-code flow against an in-process authorization server (discovery, PRM, DCR, PKCE, state, the resource parameter), the bearer middleware error plane, transparent refresh, step-up on 403, CIMD, and the M2M extension providers.
  • Composed flows: resource-link follow-through, multi-step elicitation chains, terminate-then-reconnect, dual-transport servers.

How Has This Been Tested?

uv run --frozen pytest tests/interaction/ — fully in-process and event-driven (no sockets, threads, or sleeps, with one subprocess for stdio); runs in ~10 s. 100% line and branch coverage including the test code itself.

Breaking Changes

None — tests and test-support only.

Types of changes

  • Test/infrastructure addition

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed (tests/interaction/README.md)

Additional context

Known SDK gaps surfaced while writing the suite will be filed as issues separately and linked from the manifest entries' issue fields.

AI Disclaimer

maxisbey added 21 commits May 23, 2026 15:45
New tests/interaction/ suite asserting client<->server round trips through
the public API only. Tests are organised around a requirements manifest
(_requirements.py) mapping each test to the spec or SDK behaviour it
exercises, with known divergences from the spec recorded on the
requirement; test_coverage.py enforces that every non-deferred requirement
is exercised by at least one test.

Covers tools, prompts, resources, and ping against the low-level Server,
plus MCPServer tool-call behaviours. Removes two 'pragma: no cover'
comments on the ping send/answer paths now that they are covered.
…action tests

Extends the interaction suite with the initialize handshake (server identity,
instructions, capability derivation, client identity and capabilities as seen
by the server), completion round trips, logging notifications, and the
MCPServer resource/prompt/structured-output behaviours. Records two more
divergences on the requirements manifest: MCPServer reports unknown resources
and prompts with error code 0 rather than the codes the spec documents.

Removes the 'pragma: no cover' from the method-not-found fallback now that it
is covered.
Covers the server-to-client half of the interaction model: sampling,
form-mode elicitation, roots, progress in both directions, list_changed
notifications, and request cancellation, all against the low-level Server
through the public Client API. Records a further divergence: the server
answers cancelled requests with an error response where the spec says no
response should be sent.

Removes five more 'pragma: no cover' comments on paths these tests now
cover (server list_changed senders, the client roots send path, and the
default elicitation callback).
…eta interaction tests

Covers URL-mode elicitation (including the elicitation/complete lifecycle and
the -32042 rejection flow), resource subscriptions and update notifications,
cursor pagination across all four list methods, request and session read
timeouts, _meta round trips, and the MCPServer Context convenience methods.

Removes the 'pragma: no cover' from the resource-updated send path now that
it is covered.
…ction tests

Adds ClientSession-level tests for pre-initialization request rejection and
protocol version negotiation, a proof that concurrent tool calls are
dispatched simultaneously and answered independently, and tests pinning
three behaviour gaps: tool-set mutations send no list_changed notification,
logging/setLevel is not supported by MCPServer and no level filtering
exists, and tool-enabled sampling is rejected because the high-level client
cannot declare the sampling.tools capability.
A RecordingTransport wrapper tees every message crossing the client's
transport boundary so the suite can assert properties that are invisible to
API callers: request ids are unique and never null, notifications are never
answered, and exactly one initialized notification is sent between the
initialize response and the first feature request.
Drives the streamable HTTP Starlette app through httpx's ASGI transport so
the full HTTP framing layer (session ids, SSE and JSON response encoding,
stateful and stateless modes) runs with no sockets, threads, or
subprocesses. Covers the handshake, tool calls and errors, mid-call
notifications, the stateless rejection of server-initiated requests, and
the routing of unrelated server messages to the standalone stream.

Removes the 'pragma: no cover' comments these tests now cover (the
session-manager accessors, the no-session-id validation path, and the
related-request-id routing branch). The session-manager accessor's
unreachable error guard keeps its pragma, moved onto the raise statement
itself so the now-executed condition above it is measured normally.
…irements manifest

Fixes the spec deep links that pointed at non-existent anchors, records the
divergences for the client's default not-supported answers (the spec names
-32601 for roots and -32602 for an undeclared elicitation mode; the default
callbacks answer -32600), and adds a logging:capability requirement noting
that MCPServer emits log message notifications without declaring the
logging capability.

Also tightens behaviour sentences and docstrings to match what the tests
assert, and adds a test pinning that Context.report_progress is a silent
no-op when the caller supplied no progress token, removing the pragma on
that path.
@maxisbey maxisbey added improves spec compliance When a change improves ability of SDK users to comply with spec definition enhancement Request for a new feature that's not currently supported P1 Significant bug affecting many users, highly requested feature labels May 27, 2026
@maxisbey maxisbey changed the title test: interaction-model end-to-end test suite with a requirements manifest test: interaction-model end-to-end suite with a requirements manifest May 27, 2026
maxisbey added 2 commits May 27, 2026 15:05
- raise pytest floor to 8.4.0 (RaisesGroup, used in the auth tests)
- collect twice in connect_over_sse so the unclosed sse_stream_reader is
  finalized on 3.10 (PEP 442 cycle needs a second pass)
- collapse stacked async-with statements into comma form where it reads
  cleanly, and apply # pragma: no branch on the remaining sync-with +
  async-with shapes coverage.py mis-traces on 3.11/3.14
SseServerTransport.connect_sse never closed sse_stream_reader after
EventSourceResponse returned; the in-process bridge dropped its chunk
reader when a request was cancelled before the response started. Close
both at source so the interaction suite no longer needs a gc.collect()
workaround, and pull one assert inside its async-with body to clear the
last 3.11 coverage gap.
@felixweinberger felixweinberger marked this pull request as ready for review May 27, 2026 15:28
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any bugs in this PR, but it's a very large addition (73 files, ~520 tests, a requirements-manifest framework, in-process OAuth/ASGI harnesses) plus src/ changes beyond pure pragma removal — including a behavioral change in sse.py (closing the SSE stream reader) and many narrowed/relocated coverage pragmas — so it warrants a maintainer's review of the design decisions and the pinned spec divergences.

Extended reasoning...

Overview

This PR adds an entire end-to-end interaction-model test suite under tests/interaction/ (524 tests, a requirements manifest, transport-parametrized connection factories, an in-process streaming ASGI bridge, and an in-process OAuth authorization-server harness), bumps the pytest minimum version, registers a custom requirement marker, and touches 18 src/ files. The src/ changes are predominantly removals or narrowings of # pragma: no cover / # pragma: lax no cover markers now that the new tests execute those lines, but they are not exclusively comment-only: src/mcp/server/sse.py gains an await sse_stream_reader.aclose() call in the SSE response wrapper, which is a real (if small) runtime behavior change.

Security risks

The new code is test-only, so there is no direct production security exposure from the suite itself. However, the suite pins the SDK's current OAuth and bearer-middleware behavior — including documented divergences from the spec (e.g. missing audience validation, missing scope in WWW-Authenticate challenges, lenient redirect-URI scheme registration) — as snapshot-asserted expected behavior. Encoding those divergences as passing tests is a deliberate design choice with security-adjacent implications that maintainers should explicitly endorse rather than have approved automatically.

Level of scrutiny

This is far outside the scope of an auto-approvable change. It is large, introduces a new testing framework and conventions (the manifest model, divergence lifecycle, coverage rules described in the README) that future contributors will be expected to follow, modifies coverage enforcement semantics across many src/ files, and includes one genuine library code change. Design-level decisions of this kind should be weighed in on by a human maintainer, and the PR is labeled P1, suggesting the maintainers intend to review it closely anyway.

Other factors

The bug hunting system reported no bugs, and the suite appears self-consistent (the coverage-contract test, the harness self-tests, and the documented conventions are coherent with the diff). There are no prior reviews or unresolved reviewer comments on the timeline. The deferral here is purely about size, the framework/convention decisions being introduced, and the non-trivial src/ touch surface — not about any specific defect found.

maxisbey added 4 commits May 27, 2026 16:57
…tions deterministically

Four manifest fixes from spec/SDK re-verification:
- lifecycle:capability:* divergence notes used SHOULD; spec basic/lifecycle#operation
  has been MUST since 2025-06-18
- mcpserver:tool:naming-validation deferred reason claimed no naming check exists;
  Tool.from_function calls validate_and_warn_tool_name (warns, doesn't reject) -
  converted to a Divergence with a pinning test
- client-auth:...issuer-validation divergence's second sentence is false
  (OAuthMetadata types the endpoints AnyHttpUrl, so scheme is validated)
- resources:annotations now records that the SDK Annotations model lacks
  lastModified; the round-trip test sends it via model_validate so the snapshot
  pins the drop

Twelve lowlevel tests sent notifications from inside a tool handler without
related_request_id, so on the streamable-HTTP leg they routed to the standalone
GET stream and the assertion relied on cross-stream ordering the suite documents
as not guaranteed. Eight now pass related_request_id; four whose senders don't
accept it use anyio.Event with the snapshot still proving the delivered set. The
module docstrings that overstated the ordering guarantee are corrected.

README §Coverage now documents the four lax-no-cover teardown markers and the
sse.py aclose() fix that landed alongside this suite.
The False arc exits through the enclosing async-with and is dropped by
coverage.py on 3.10 only; 3.11+ record both arcs.
…pass

- _provider.py docstrings now describe the provider, not its authoring history
- eight manifest behavior/divergence notes scoped to what the spec/SDK actually
  state (logging field-shape only, sse endpoint-event spec-only, scope-selection
  AS-fallback step, sampling result-balance ValueError, hosting:session:delete
  doesn't remove transport, low-level elicitation validation/restriction scope,
  progress no-token second clause)
- client-transport:http:reconnect-get deferred reason no longer claims the
  feature is unimplemented
- rename the sampling mixed-content rejection test to match what it asserts
- pagination cursor pass-through asserted by identity per the suite convention
- mcpserver:prompt:unknown-name docstring acknowledges the code-0 divergence
- test_bridge cancel_on_close test bounds the transport-close wait
- drop now-stale no-branch pragma at shared/auth.py:94
Nine notification tests sent capability-gated messages (log, resource_updated,
roots/list_changed, *_list_changed) from a peer that had not declared the
capability, which only works because the SDK does not yet enforce capability
gating. Adding stub handlers / list_roots_callback so the capability is
advertised makes the tests match their requirement preconditions and survive
the gate fix unchanged.

The three list_changed tests cannot set listChanged=True without threading
NotificationOptions through the in-memory and HTTP-manager connection paths;
the requirement behavior text now describes what the tests prove (send ->
arrives) and the module docstring records the remaining coupling on
lifecycle:capability:server-not-advertised.
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No bugs found by the automated review, but this is a very large addition (~73 files, 500+ tests) that also includes a runtime change in src/mcp/server/sse.py, widespread coverage-pragma adjustments across src/, and a pytest version bump — worth a human pass before merging.

Extended reasoning...

Overview

This PR adds a new end-to-end interaction-model test suite under tests/interaction/ (~500+ tests across transports, OAuth, and every protocol feature), a requirements manifest mapping behaviours to spec sections, and a transport matrix that drives the real streamable HTTP / SSE apps in-process. Alongside the tests, it modifies 18 files under src/: almost all are # pragma: no cover deletions or narrowings now that the new tests reach those lines, plus one genuine runtime change — an added await sse_stream_reader.aclose() in src/mcp/server/sse.py — a pytest>=8.4.0 bump, a new requirement pytest marker in pyproject.toml, and a uv.lock update.

Security risks

The runtime code changes themselves are low risk: pragma comments do not change behaviour, and the SSE aclose() is a resource-cleanup fix. However, the suite touches auth-adjacent code paths (OAuth client, bearer middleware, authorize handlers) via pragma adjustments, and the new auth test harness intentionally pins several spec divergences in security-relevant areas (missing audience validation, missing scope in WWW-Authenticate, redirect-URI scheme acceptance). These pinned divergences are documented as known gaps rather than introduced by this PR, but a maintainer should confirm they agree with codifying current behaviour as the baseline.

Level of scrutiny

The sheer size (73 files, a multi-hundred-KB diff), the introduction of a long-lived testing convention (the requirements manifest, divergence lifecycle, 100% coverage including test code), and the one behavioural change in the SSE transport all argue for human review. The conventions established here will shape future contributions, which is a design decision a maintainer should sign off on, not something an automated approval should green-light.

Other factors

The automated bug-hunting pass found no bugs, the PR is labelled P1/enhancement, and the author reports the suite runs in ~10s with full coverage. There are no outstanding reviewer comments in the timeline. The deferral is purely about scope and the precedent-setting nature of the suite, not about any identified defect.

maxisbey added 4 commits May 27, 2026 20:25
The previous attempt cancelled only the child task, but ClientSession and
streamable_http_client both cancel internal task groups in __aexit__, so
the comma-form unwind still tripped 3.11's tracer dead-zone. Hoist the
phase-1 assertions and header setup inside the block before cancelling,
and split fail_after into two phases so no sync statements sit between
the cancel-on-exit unwind and the next await.
…r 3.11

CPython gh-106749 (wontfix on 3.11): coro.throw() omits the call trace
event, desyncing coverage.py's CTracer so the first plain statement after
a ClientSession/streamable_http_client __aexit__ is dropped. Restructures
only relocate the miss (verified empirically across five variants); name
the upstream bug and sanction lax-no-cover for this case in the README.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Request for a new feature that's not currently supported improves spec compliance When a change improves ability of SDK users to comply with spec definition P1 Significant bug affecting many users, highly requested feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant