What happens
tools/evals/agents/workspace-chat/agent-type-default.eval.ts > csv-streaming-user fails consistently against the current workspace-chat prompt. Reproduces on back-to-back runs, so it isn't a flake.
Expected an upsert_agent call with type="user". Got: csv_processor(type=llm, agent=?)
The other 53 evals in the same suite (including the sibling agent-type-default/* cases like inbox-triage-llm, send-daily-report-email, explicit-llm-override, etc.) pass. So the prompt's classifier is right for the typical cases and just wrong for this one shape.
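For reference, the check this case performs can be approximated as below. This is a minimal sketch, not the eval's actual code: the `ToolCall` shape, the `id` argument, and the `findUpsertAgentType` helper are hypothetical. Only the `upsert_agent` name and the `type` values come from the output above.

```typescript
// Hypothetical shape of a recorded tool call; the real eval harness may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Returns the `type` argument of the first upsert_agent call,
// or undefined if no such call (or no string `type`) is present.
function findUpsertAgentType(calls: ToolCall[]): string | undefined {
  const call = calls.find((c) => c.name === "upsert_agent");
  const type = call?.args["type"];
  return typeof type === "string" ? type : undefined;
}

// Observed behaviour in this report: the agent is scaffolded as llm.
// The `id` key is an assumption about how csv_processor is passed.
const observed: ToolCall[] = [
  { name: "upsert_agent", args: { id: "csv_processor", type: "llm" } },
];

// The eval expects type === "user", so this check fails for the observed call.
const passes = findUpsertAgentType(observed) === "user";
console.log(passes);
```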
User impact
For workspaces where a user-authored streaming process (Python over CSV in this case) is the right tool, the chat agent currently scaffolds it as an inline llm step instead. End-state: the workflow ostensibly works but cannot stream incremental output, can't be re-run as a deterministic script, and loses the upsides users expect from a user-agent. Confusing and quietly worse than what the user described.
Likely structural cause (guess, not verified)
The classifier prompt seems to bias toward llm whenever the natural-language description doesn't explicitly call out streaming / per-row behavior. CSV processing reads as "transform some structured data" which the prompt probably routes to llm. Adding a streaming-shaped few-shot exemplar (or sharpening the "user-agent when output is incremental / row-by-row" rule) would likely close it.
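A streaming-shaped exemplar might look something like the sketch below. Everything here is an assumption for illustration: the `Exemplar` shape, the field names, the wording of the request, and the `renderExemplar` helper are hypothetical, since the real prompt format is not shown in this report.

```typescript
// Hypothetical exemplar shape; the actual workspace-chat prompt may structure few-shots differently.
interface Exemplar {
  request: string;
  agentType: "user" | "llm";
  reason: string;
}

// A request that implies incremental, row-by-row output without saying "streaming" loudly.
const streamingExemplar: Exemplar = {
  request: "Read a large CSV and emit one cleaned row at a time as it is processed.",
  agentType: "user",
  reason: "Output is incremental (row-by-row) and should be re-runnable as a deterministic script.",
};

// Renders an exemplar as three labelled lines of plain text.
function renderExemplar(e: Exemplar): string {
  return [
    `Request: ${e.request}`,
    `Agent type: ${e.agentType}`,
    `Why: ${e.reason}`,
  ].join("\n");
}

console.log(renderExemplar(streamingExemplar));
```

Whether an exemplar or a sharper rule is the better fix is untested; either way the sibling cases should be re-run afterwards.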
Notes for whoever picks this up
- Re-run just this case:
deno task evals run -F csv-streaming-user
- The sibling case slack-daily-standup was flaky in the same suite during this investigation (failed once, passed on retry). Worth confirming that a prompt change doesn't make it worse.
- This was caught during PR-time eval validation on an unrelated FSM change, so it's pre-existing on main, not a regression from current open work.