Adversarial-testing fixes: config drift, health checks, cancellation, and CLI robustness by zc277584121 · Pull Request #159 · zilliztech/mfs

zc277584121 · 2026-07-02T13:49:28Z

Summary

A batch of fixes found by running an extended adversarial/stress-testing pass against the CLI, HTTP API, and every connector type (file/web/postgres/mysql/mongo/s3).

Data-loss / correctness (highest severity):

connector update/add no longer silently wipes a connector's stored credentials when --config is omitted and the URI alone can't reconstruct the real config (postgres/mysql/mongo/s3/web) — now rejects with a clear config_required error instead.
A job-reopen race could double-enqueue the same object, producing a duplicate chunk_id that made Milvus reject an entire upsert batch (including unrelated, valid rows) — batches are now deduped by chunk_id before upsert, and chunk_count now reflects what actually persisted rather than what was attempted.
Plaintext secrets passed directly (instead of via env:/file: reference) are now rejected at connector registration instead of being redacted into an unusable placeholder.

Connector health / probing:

probe/estimate now reuse an already-registered connector's stored config when --config is omitted (matching inspect), instead of silently testing an unrelated, meaningless default connection.
web and s3 connectors now have real healthcheck() implementations instead of inheriting the framework's always-ok=true no-op default.
The web connector's allowed_domains matching is now port-aware in both directions (exact host:port entries require an exact match; bare-host entries match any port), and a sync that crawls zero pages because every seed URL was excluded now fails loudly instead of silently reporting success.

Job lifecycle:

job cancel now actually stops future embed work for a cancelled job (skips already-queued batches instead of only flipping a DB status field while the embed loop keeps running).
mfs serve start/restart no longer spawn a duplicate server process when something not tracked by the pidfile is already listening on the target port; start now waits for the server to actually become reachable (or reports why it didn't) instead of declaring success the instant the process spawns.

CLI robustness:

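cat/head/tail output is now byte-faithful (no spurious trailing newline), and --range now accepts negative-looking values (e.g. -5:10) so they reach the server's validation instead of being rejected by clap as an unknown flag.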
file:// URI forms (file:///abs, file://local/abs) now resolve consistently with bare paths across add/search/cat/ls/grep/etc.
Oversized search/grep queries are now rejected client-side with a clear message instead of echoing the whole query back in a reqwest error.
--upload/--force-upload now correctly conflict with --no-upload; profile URLs are validated at add time; unknown chunk kinds and non-object locator JSON are now rejected before reaching the engine; grep on a nonexistent path now 404s instead of returning an empty result.
job list's plain-text view now shows a truncated error snippet for failed jobs instead of requiring --json to see why something failed.
mfs --version is now stamped with a short git commit, and the server logs its version on startup, so confirming which build is actually running doesn't require cross-referencing ps aux against install paths.

Test plan

cd server/python && uv run --extra dev pytest — 334 passed, 9 skipped (all pre-existing @pytest.mark.live)
cd server/python && uv run --extra dev ruff format --check src/ tests/ and ruff check src/ tests/ — clean
cd cli && cargo test — 20 passed
cd cli && cargo fmt --all -- --check — clean
cd cli && cargo build --release — builds clean
Each connector/job-lifecycle fix additionally verified live against a real running server and real registered connectors (not just unit tests), documented in the corresponding commit messages

register_or_get_connector now validates every config field before it reaches redact()/persistence: a secret-looking key (password, dsn, access_key, ...) or an inline user:pass@host connection string must be an env:VAR or file:/abs/path reference, or registration fails with a 400. Without this, a plaintext secret got redacted to a literal placeholder string in connectors.config_json, and on every rebuild CredentialService.resolve() passed that placeholder straight through as if it were the real credential — permanently breaking auth for the connector. Two more layers close the same gap in depth: redact() now replaces a secret value with None instead of the placeholder string, so a plugin's existing falsy-check fallback to credential_ref works as intended; and resolve() explicitly rejects the old placeholder string if it's ever seen, instead of resolving it as a literal value.
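A minimal sketch of the registration-time check described above, not the actual implementation. The key hints beyond password/dsn/access_key, the regex, and the function names are assumptions made for illustration; the real list of secret-looking keys is elided ("...") in the commit message.

```python
import re

# Assumed subset of "secret-looking" key names; the real list is longer.
SECRET_KEY_HINTS = ("password", "dsn", "access_key")

# Matches an inline scheme://user:pass@host credential in a connection string.
INLINE_CREDS = re.compile(r"^[a-z][a-z0-9+.-]*://[^/\s:@]+:[^/\s@]+@", re.IGNORECASE)


def is_reference(value: str) -> bool:
    """True for an env:VAR or file:/abs/path reference."""
    if value.startswith("env:"):
        return len(value) > len("env:")
    return value.startswith("file:/")


def find_plaintext_secrets(config: dict) -> list:
    """Return the top-level keys whose values look like plaintext secrets."""
    offenders = []
    for key, value in config.items():
        if not isinstance(value, str) or not value:
            continue
        if is_reference(value):
            continue
        secret_key = any(hint in key.lower() for hint in SECRET_KEY_HINTS)
        if secret_key or INLINE_CREDS.match(value):
            offenders.append(key)
    return offenders
```

A caller would reject the registration (the 400 above) whenever the returned list is non-empty, before redact() or persistence ever sees the config.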

cat/head/tail printed server content with println!, which appended an extra newline even when the content already ended in one (e.g. a range reaching end-of-file), producing a spurious trailing blank line. Switch to print! so output matches the source bytes exactly. Also mark --range as allow_hyphen_values so a negative start (`--range -5:10`) reaches the server for its normal validation error instead of being rejected by clap as an unrecognized flag.

… engine

A locator that decodes to a list or number crashed cat with a raw 500 ('list' object has no attribute 'get', etc.), and a JSON null was silently treated as no locator at all instead of the malformed input it is. Validate the decoded locator is a JSON object up front and return a clean 400 for anything else.
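The up-front shape check could look roughly like this. It assumes the locator arrives as a JSON string; BadRequest and parse_locator are hypothetical names standing in for whatever the server maps to a 400.

```python
import json
from typing import Optional


class BadRequest(ValueError):
    """Hypothetical stand-in for the error the server turns into HTTP 400."""


def parse_locator(raw: Optional[str]) -> Optional[dict]:
    """Decode a locator and insist that it is a JSON object."""
    if raw is None:
        return None  # no locator was supplied at all
    try:
        decoded = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise BadRequest(f"locator is not valid JSON: {exc.msg}") from exc
    if not isinstance(decoded, dict):
        # Lists, numbers, strings and JSON null all land here.
        raise BadRequest(
            f"locator must be a JSON object, got {type(decoded).__name__}"
        )
    return decoded
```

Note that a literal JSON null is rejected here, while an absent locator (None) is still allowed, which is the distinction the commit draws.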

…-oping

--kind used to accept any string: an empty value behaved like no filter at all, and a typo'd kind (e.g. "boddy") silently matched nothing with exit 0, indistinguishable from a genuine no-results search. Validate each comma-separated value against the framework's ChunkKind literal and return a 400 listing the valid kinds when one doesn't match.
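A sketch of validating --kind against a Literal type. The members of ChunkKind shown here are placeholders (only "body" is implied by the "boddy" typo above); the real literal lives in the framework and may differ.

```python
from typing import Literal, get_args

# Placeholder members; the framework's real ChunkKind is not shown in this PR.
ChunkKind = Literal["body", "heading", "code"]
VALID_KINDS = get_args(ChunkKind)


def parse_kinds(raw: str) -> list:
    """Split a comma-separated --kind value and reject unknown entries."""
    kinds = [part.strip() for part in raw.split(",")]
    bad = [k for k in kinds if k not in VALID_KINDS]
    if bad:
        # An empty string splits to [""], so it is rejected here as well.
        raise ValueError(
            f"unknown kind(s) {bad}; valid kinds: {', '.join(VALID_KINDS)}"
        )
    return kinds
```

Deriving the valid set from the Literal with get_args keeps the error message and the type annotation from drifting apart.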

mfs profile add silently accepted any string as a profile URL, so a typo or wrong scheme only surfaced later as an opaque "builder error" from reqwest when the profile was actually used. Parse and check the scheme up front and reject non-http(s) URLs before writing to client.toml.

…heck

mfs add only checked Path::exists(target) to decide whether to show the external-connector cost-estimate prompt, so a local directory addressed via file:///abs or file://local/abs still tripped the prompt even though the identical bare path skipped it. Extract the underlying path from those two forms before the existence check, mirroring how the server's file connector already normalizes them.

remote_path/resolve_path_arg canonicalized the raw path argument directly, so a file:///abs or file://local/abs spelling of a local path failed Path::exists()-style checks (the literal URI string is never a path on disk) and fell through unrewritten. Against a local server this meant `mfs cat file:///abs/foo` 404'd and a search scoped to a file:// sub-path silently returned zero hits, even though the bare-path and canonical file://local forms of the identical target worked. Reuse local_fs_path_from_target to strip the scheme before canonicalizing, mirroring the fix already applied to add's target resolution.
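The scheme-stripping step can be illustrated as follows. The real local_fs_path_from_target is a Rust function in the CLI; this Python version is only a sketch of the same idea, and its treatment of other file:// hosts is an assumption.

```python
from typing import Optional


def local_fs_path_from_target(target: str) -> Optional[str]:
    """Return the local path behind a target, or None if it is not local."""
    for prefix in ("file://local/", "file:///"):
        if target.startswith(prefix):
            # Keep the leading slash so the result is an absolute path.
            return "/" + target[len(prefix):]
    if "://" in target:
        # Another connector scheme, or a file:// form this sketch ignores.
        return None
    return target  # already a bare path
```

With this in place, all three spellings of the same target resolve to one path before any exists() or canonicalize step runs:

```python
local_fs_path_from_target("file:///abs/foo")       # "/abs/foo"
local_fs_path_from_target("file://local/abs/foo")  # "/abs/foo"
local_fs_path_from_target("/abs/foo")              # "/abs/foo"
```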

grep's pushdown/BM25/linear-scan dispatch never touched the target path directly, so a scope that resolved to a real connector but a missing sub-path just looked like a real search with zero matches instead of erroring like ls/cat do for the same input. Stat the resolved path up front so a bad path 404s consistently across the read commands.

These flags document opposite upload behaviors but had no mutual exclusion, so passing --upload --no-upload (or --force-upload --no-upload) was silently accepted with whichever branch the code happened to check first winning. Wire up conflicts_with so clap rejects the combination up front with a clear error.

…files

read() let Python's open() raise a raw FileNotFoundError whose message embeds the absolute local path, propagating unfiltered to the CLI (e.g. mfs head/tail/cat on a missing path). stat()/list()/grep() already guard against this; read() now follows the same pattern and re-raises with only the connector-relative path.
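A sketch of the guarded read. ConnectorNotFound and read_object are hypothetical names; the point is only that the re-raised error carries the connector-relative path and drops the original exception, whose message contains the absolute one.

```python
import os


class ConnectorNotFound(Exception):
    """Hypothetical error type carrying only a connector-relative path."""


def read_object(root: str, rel_path: str) -> bytes:
    """Read a file under root without leaking root in error messages."""
    full = os.path.join(root, rel_path.lstrip("/"))
    try:
        with open(full, "rb") as fh:
            return fh.read()
    except FileNotFoundError:
        # `from None` suppresses the chained FileNotFoundError, whose text
        # embeds the absolute local path.
        raise ConnectorNotFound(f"not found: {rel_path}") from None
```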

A query or pattern long enough to push the built request past reqwest's internal URL length limit made client.get(url).query(q) fail inside the request builder, and the resulting error string echoed the entire value back -- dumping 100KB+ of query text into stderr for a single search. Validate query/pattern length client-side before building the request and fail with a short, actionable error instead.

…ctors

register_or_get_connector could not tell "the user passed --config and it happens to differ" from "--config was omitted and the URI-derived default happens to differ from the real stored config". The latter case previously persisted the derived default anyway, silently dropping credentials, schemas, and [[objects]] mappings whenever the scheme couldn't fully reconstruct the stored config from the bare URI (any add/update without --config on a non-trivial connector).

Add a config_explicit flag threaded from add()'s original config argument through to register_or_get_connector: when config was omitted and would drift the stored config, raise config_required before any write happens. Explicit --config still persists on drift as before (with the existing warning); no-drift bare re-syncs remain a safe no-op; brand-new connector registration is unaffected.
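The four cases above reduce to a small decision function. This is a sketch under the assumption that configs compare as plain dicts; ConfigRequired and resolve_config_for_write are illustrative names, not the project's.

```python
from typing import Optional


class ConfigRequired(Exception):
    """Hypothetical error mapped to the config_required response."""


def resolve_config_for_write(
    stored: Optional[dict], candidate: dict, config_explicit: bool
) -> dict:
    """Decide which config to persist, refusing implicit drift."""
    if stored is None:
        return candidate  # brand-new connector: nothing to drift from
    if candidate == stored:
        return stored  # bare re-sync with no drift: safe no-op
    if not config_explicit:
        # --config was omitted and the URI-derived default differs from
        # what is stored: refuse before anything is written.
        raise ConfigRequired("config_required: pass --config to change it")
    return candidate  # explicit drift persists (the caller logs a warning)
```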

…ut chunk_count on failed flushes

EmbedConsumer._flush upserted rows to Milvus without deduping by chunk_id, so two chunks that hash to the same id within one batch (a genuine duplicate-source-row or scheduling-race scenario) made Milvus reject the entire batch, dropping unrelated chunks from other tasks along with it.

Separately, chunk_count was incremented at chunk-intake time rather than on a successful write, so a task whose batch got dropped by _fail_batch still reported a nonzero chunk_count as if its chunks had been written -- exactly the kind of state that lets mfs status report search data that was never actually persisted.

Now rows are deduped by chunk_id (last-write-wins) before the upsert call, and chunk_count is credited per task only from the rows that survive dedup in a flush that actually succeeds.
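Last-write-wins dedup plus success-only crediting might be sketched like this. The row shape (chunk_id and task_id keys) and the upsert callable are assumptions; the real code talks to Milvus through EmbedConsumer._flush.

```python
from collections import Counter
from typing import Callable


def dedupe_rows(rows: list) -> list:
    """Keep one row per chunk_id; a later row replaces an earlier one."""
    latest = {}
    for row in rows:
        latest[row["chunk_id"]] = row
    return list(latest.values())


def flush(rows: list, upsert: Callable) -> Counter:
    """Upsert deduped rows, then credit chunk counts per task."""
    survivors = dedupe_rows(rows)
    # If upsert raises, the exception propagates and nothing is credited.
    upsert(survivors)
    return Counter(row["task_id"] for row in survivors)
```

Because the counter is built only after upsert returns, a task whose rows were dropped, either by dedup or by a failed flush, never gets credit for them.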

…t/regex

Plain-view `job list` showed status/op_kind/id only, so a failed sync gave no visible reason without adding --json. Append a short error snippet for failed jobs; --json output is unchanged.

…be no-op gap

Audited every connector type for the base class's no-op healthcheck default. Most already override it; only web and s3 didn't, so `mfs connector probe` for either always reported ok=true regardless of the actual config or reachability.

web: run the same allowed_domains gate sync() uses against each configured start_url, then a cheap GET against the seed. Verified this catches the exact shape of a prior bug where a port-qualified start_url was silently excluded by a bare-host allowed_domains entry.

s3: list_objects_v2(MaxKeys=1) against the configured bucket. Verified against this environment's real (currently invalid) test credentials: head_bucket only surfaced an undifferentiated 403, while list_objects_v2 correctly surfaced InvalidAccessKeyId, so list_objects_v2 is used for the more actionable error.

Also corrected base.py's healthcheck docstring, which claimed only the github connector overrides the default; most connector types already had real overrides by the time this was checked.

`mfs serve start`/`restart` only ever tracked processes they launched themselves via a pidfile. A server started any other way (e.g. `uv run mfs-server run` directly) was invisible to them: `status` reported "not running" while it was genuinely up, and `restart`/`start` would spawn a second process against the same bind address rather than recognizing the existing one.

Reproduced against a real out-of-band server before fixing: the second process doesn't actually race the port (only one process can ever bind it), but `serve start` declared "started" the instant it spawned the child, well before the child reaches its own bind attempt; Milvus connect + embedding model preload took ~15-18s in this environment. For that whole window `serve status` falsely reported the doomed process as healthy; only after ~15-18s did it fail with EADDRINUSE and exit, flipping status back to "not running" even though the original, untouched server had been fine the entire time.

Fix: `start` now does a cheap TCP probe of the bind address before spawning. If something's already listening there without a matching pidfile, it refuses rather than spawning a duplicate. If nothing's listening, it spawns and polls (bounded, 45s) for either a successful connect or early child exit, only writing the pidfile and reporting success once the server is actually reachable. `status` now distinguishes "nothing running" from "something's listening that this CLI didn't launch."

`restart` now waits for the killed process to actually release the port before calling `start`. The first version of this fix had a race where `restart` could kill the old process, immediately hit `start`'s new pre-flight check while the old process was still mid-shutdown, and refuse to start a replacement, leaving nothing running at all; caught via live testing, not review.
Also stamps `mfs --version` with a short git commit (new build.rs) and adds a version line to mfs-server's own startup log, so confirming which build is actually running no longer requires cross-referencing `ps aux` against install paths.
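The pre-flight probe and bounded wait described for `serve start` can be sketched as below. The real logic lives in the Rust CLI; this Python version only illustrates the shape. The function names, the child_exited callback, and the 0.25s poll interval are assumptions; the 45s bound is from the commit message.

```python
import socket
import time
from typing import Callable


def port_is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Cheap TCP probe: True if something accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_reachable(
    host: str,
    port: int,
    child_exited: Callable[[], bool],
    deadline_s: float = 45.0,
    interval_s: float = 0.25,
) -> bool:
    """Poll until the server accepts connections, the child dies, or time runs out."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        if port_is_listening(host, port):
            return True
        if child_exited():
            return False  # early exit, e.g. the child failed to bind
        time.sleep(interval_s)
    return False
```

A start command built on these would call port_is_listening before spawning (refusing if it returns True with no matching pidfile) and would write the pidfile only after wait_until_reachable returns True.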

Without --config, probe() and estimate() fell back to a URI-derived default config, empty for any scheme the URI alone can't reconstruct (postgres/mysql/mongo/s3/web). For an already-registered connector, that meant testing an unrelated bare connection instead of the real one -- postgres fell through libpq's ambient defaults to a database named after the OS user, and reported that failure as if it were a real connectivity problem with the registered connector. Both now reuse an already-registered connector's stored config, the same way inspect() already does, via a shared _resolve_readonly_config helper. Verified against the real registered postgres connector: probe now reports ok=true instead of the OS-username DSN error, and the `add`-without-y cost-estimate path returns a real object/chunk/token estimate instead of the same misleading 500. Unregistered-connector probing (URI-derived default, no stored row) is unaffected.

…rawl

allowed_domains matching compared urlparse(url).netloc (host:port) against entries that are conventionally bare hostnames, so a bare-host entry never matched a non-default-port URL -- including the connector's own seed URL, which produced a sync that crawled zero pages while still reporting the job succeeded.

An allowed_domains entry with a port now requires an exact host:port match (precision); an entry without one matches that host on any port via .hostname (convenience) -- both config shapes are valid depending on whether the user wants "this exact host:port" or "any port on this host".

Independently of that: a sync that crawls zero pages because every start_url was excluded by allowed_domains now raises instead of silently persisting an empty, "succeeded" connector -- this was true regardless of how the matching semantics above got resolved.

Verified: the historical bug shape (bare-host allowlist, port-qualified seed) now matches instead of excluding, confirmed via direct sync() calls; a genuinely-excluded case (mismatched explicit host:port) now raises with a clear message; re-synced the real registered web://mfs-local-docs connector end-to-end (job succeeded, 50/50 objects, chunk_count 245->246, live search returns real content).
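The two matching modes can be sketched with urllib.parse. domain_allowed is a hypothetical name, and this sketch makes two simplifying assumptions: default ports are not normalized (an entry like example.com:443 does not match a URL with no explicit port), and IPv6 literal entries are not handled.

```python
from urllib.parse import urlsplit


def domain_allowed(url: str, allowed_domains: list) -> bool:
    """Port-aware allowlist check for a single URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    hostport = f"{host}:{parts.port}" if parts.port else host
    for entry in allowed_domains:
        entry = entry.strip().lower()
        if ":" in entry:
            # Entry names a port: require the exact host:port.
            if entry == hostport:
                return True
        elif entry == host:
            # Bare-host entry: any port on that host is fine.
            return True
    return False
```

The historical bug shape is the first call below; under the old netloc comparison it was excluded:

```python
domain_allowed("http://localhost:8080/docs", ["localhost"])       # True
domain_allowed("http://localhost:8080/docs", ["localhost:9090"])  # False
```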

…DB flag

`mfs job cancel` only ever flipped the job's DB status. The producer loop already checks that status at each object boundary (_should_stop), so a job made of many small objects mostly cancels promptly -- but a single large object, once handed to the embed consumer, keeps flowing through the consumer's batch-embed loop with no cancellation check at all, so a large object already mid-embed kept burning real embed CPU/API time for however long it took to finish, regardless of cancel.

EmbedConsumer now tracks cancelled job ids (Engine.cancel_job tells it) and skips embedding/upserting a batch entry belonging to a cancelled job on the NEXT flush, reusing the existing _fail_batch path rather than new bookkeeping -- that method's own docstring already listed "a cancellation" as an intended trigger. This can't interrupt a batch_embed() call already in flight (there's no way to preempt that), but it bounds how much work continues after cancel to at most one more in-flight batch instead of the whole remaining object.

Covered by two new EmbedConsumer unit tests: a cancelled job's queued chunks skip the embed/upsert call entirely and finalize failed instead of leaking pending-task bookkeeping; a second job's chunks landing in the same flush are unaffected. Full test_pipeline.py suite (20 tests) and test_engine_connector_lifecycle.py (8 tests) both pass. Live sanity-checked against the real server: restart, search, job list, and a cancel on a nonexistent job id all still behave correctly.
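A toy model of the skip-on-next-flush behaviour. EmbedConsumerSketch and its methods are hypothetical and far simpler than the real EmbedConsumer (no batching thresholds, no _fail_batch); it only shows that a cancelled job's queued entries never reach the embed or upsert call while another job's entries in the same flush do.

```python
from typing import Callable


class EmbedConsumerSketch:
    """Queue of (job_id, text) entries flushed in one batch."""

    def __init__(self, embed: Callable, upsert: Callable):
        self._embed = embed
        self._upsert = upsert
        self._cancelled = set()
        self._queue = []
        self.failed = []  # entries finalized as failed instead of embedded

    def cancel_job(self, job_id: str) -> None:
        self._cancelled.add(job_id)

    def enqueue(self, job_id: str, text: str) -> None:
        self._queue.append((job_id, text))

    def flush(self) -> int:
        """Embed and upsert live entries; return how many were written."""
        batch, self._queue = self._queue, []
        live = [e for e in batch if e[0] not in self._cancelled]
        self.failed.extend(e for e in batch if e[0] in self._cancelled)
        if live:
            vectors = self._embed([text for _, text in live])
            self._upsert(list(zip(live, vectors)))
        return len(live)
```

As in the real fix, an embed call that is already running is not interrupted; the check only takes effect when the next flush assembles its batch.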

zc277584121 added 20 commits July 2, 2026 13:44

docs(search): note that bm25 grep matches are keyword-based, not exac…

fc37ae9

…t/regex

cli(job list): surface a truncated error snippet on failed rows

1ae82a1

zc277584121 merged commit d7b6d3f into zilliztech:main Jul 2, 2026
10 checks passed

zc277584121 merged 20 commits into zilliztech:main from zc277584121:fix/adversarial-testing-fixes
