workspace-chat: csv-streaming-user eval persistently picks type=llm instead of type=user #286

Description

@basedfriday

What happens

tools/evals/agents/workspace-chat/agent-type-default.eval.ts > csv-streaming-user fails consistently against the current workspace-chat prompt. Reproduces on back-to-back runs, so it isn't a flake.

Expected an upsert_agent call with type="user". Got: csv_processor(type=llm, agent=?)
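For readers unfamiliar with the assertion, here is a minimal sketch of the kind of check that could produce a message like the one above. The `ToolCall` shape, the `id` field, and the `checkAgentType` helper are all hypothetical; the actual harness code in `agent-type-default.eval.ts` is not shown in this issue and may differ.

```typescript
// Hypothetical shapes: the real eval harness types are not shown in this issue.
type AgentType = "user" | "llm";

interface ToolCall {
  name: string;
  args: { id?: string; type?: AgentType };
}

// Returns null when some upsert_agent call has the expected type,
// otherwise a failure message describing what was seen instead.
function checkAgentType(calls: ToolCall[], expected: AgentType): string | null {
  const upserts = calls.filter((c) => c.name === "upsert_agent");
  if (upserts.length === 0) {
    return "Expected an upsert_agent call, got none";
  }
  if (upserts.some((c) => c.args.type === expected)) {
    return null;
  }
  const got = upserts
    .map((c) => `${c.args.id ?? "?"}(type=${c.args.type ?? "?"})`)
    .join(", ");
  return `Expected an upsert_agent call with type="${expected}". Got: ${got}`;
}

// The failing shape from this issue: the agent was created, but as type=llm.
const failing: ToolCall[] = [
  { name: "upsert_agent", args: { id: "csv_processor", type: "llm" } },
];
console.log(checkAgentType(failing, "user"));
```

The point of the sketch is that the failure is about the `type` argument only: the agent is created, just with the wrong type.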

The other 53 evals in the same suite (including the sibling agent-type-default/* cases like inbox-triage-llm, send-daily-report-email, explicit-llm-override, etc.) pass. So the prompt classifies the typical cases correctly and misses only this one shape.

User impact

For workspaces where a user-authored streaming process (Python over CSV in this case) is the right tool, the chat agent currently scaffolds it as an inline llm step instead. End state: the workflow ostensibly works but cannot stream incremental output, cannot be re-run as a deterministic script, and loses the other benefits users expect from a user-agent. The result is confusing and quietly worse than what the user described.

Likely structural cause (guess, not verified)

The classifier prompt seems to bias toward llm whenever the natural-language description doesn't explicitly call out streaming / per-row behavior. CSV processing reads as "transform some structured data" which the prompt probably routes to llm. Adding a streaming-shaped few-shot exemplar (or sharpening the "user-agent when output is incremental / row-by-row" rule) would likely close it.
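To make the few-shot suggestion concrete, here is a rough sketch of what a streaming-shaped exemplar could look like next to a typical llm one. Everything in it is an assumption: the `FewShotExemplar` shape, the `renderExemplars` helper, and the exemplar wording are illustrative only, and the real workspace-chat prompt may be structured quite differently.

```typescript
// Hypothetical structure: the real workspace-chat prompt format is not shown in this issue.
interface FewShotExemplar {
  request: string;
  agentType: "user" | "llm";
  rationale: string;
}

const exemplars: FewShotExemplar[] = [
  {
    // A typical case that should keep routing to llm.
    request: "Summarize each new email in my inbox and flag the urgent ones.",
    agentType: "llm",
    rationale: "Open-ended language judgment with no deterministic script.",
  },
  {
    // The streaming-shaped case this issue is about.
    request: "Run my Python script over a large CSV and emit results as each row is processed.",
    agentType: "user",
    rationale: "User-authored code with incremental, row-by-row output.",
  },
];

// Renders the exemplars as plain text blocks that could be appended to a prompt.
function renderExemplars(items: FewShotExemplar[]): string {
  return items
    .map((e) => `Request: ${e.request}\nAgent type: ${e.agentType}\nWhy: ${e.rationale}`)
    .join("\n\n");
}

console.log(renderExemplars(exemplars));
```

Pairing the new exemplar with an existing llm-shaped one is deliberate: it gives the classifier a contrast rather than a one-sided nudge, which may help keep the currently passing cases stable.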

Notes for whoever picks this up

  • Re-run just this case: deno task evals run -F csv-streaming-user
  • The sibling case slack-daily-standup was flaky in the same suite during this investigation (failed once, passed on retry). Worth confirming that a prompt change doesn't make it worse.
  • This was caught during PR-time eval validation on an unrelated FSM change, so it's pre-existing on main, not a regression from current open work.
