Skip to content

Adversarial-testing fixes: config drift, health checks, cancellation, and CLI robustness#159

Merged
zc277584121 merged 20 commits into
zilliztech:mainfrom
zc277584121:fix/adversarial-testing-fixes
Jul 2, 2026
Merged

Adversarial-testing fixes: config drift, health checks, cancellation, and CLI robustness#159
zc277584121 merged 20 commits into
zilliztech:mainfrom
zc277584121:fix/adversarial-testing-fixes

Conversation

@zc277584121

Copy link
Copy Markdown
Collaborator

Summary

A batch of fixes found by running an extended adversarial/stress-testing pass against the CLI, HTTP API, and every connector type (file/web/postgres/mysql/mongo/s3).

Data-loss / correctness (highest severity):

  • connector update/add no longer silently wipes a connector's stored credentials when --config is omitted and the URI alone can't reconstruct the real config (postgres/mysql/mongo/s3/web) — now rejects with a clear config_required error instead.
  • A job-reopen race could double-enqueue the same object, producing a duplicate chunk_id that made Milvus reject an entire upsert batch (including unrelated, valid rows) — batches are now deduped by chunk_id before upsert, and chunk_count now reflects what actually persisted rather than what was attempted.
  • Plaintext secrets passed directly (instead of via env:/file: reference) are now rejected at connector registration instead of being redacted into an unusable placeholder.

Connector health / probing:

  • probe/estimate now reuse an already-registered connector's stored config when --config is omitted (matching inspect), instead of silently testing an unrelated, meaningless default connection.
  • web and s3 connectors now have real healthcheck() implementations instead of inheriting the framework's always-ok=true no-op default.
  • The web connector's allowed_domains matching is now port-aware in both directions (exact host:port entries require an exact match; bare-host entries match any port), and a sync that crawls zero pages because every seed URL was excluded now fails loudly instead of silently reporting success.

Job lifecycle:

  • job cancel now actually stops future embed work for a cancelled job (skips already-queued batches instead of only flipping a DB status field while the embed loop keeps running).
  • mfs serve start/restart no longer spawn a duplicate server process when something not tracked by the pidfile is already listening on the target port; start now waits for the server to actually become reachable (or reports why it didn't) instead of declaring success the instant the process spawns.

CLI robustness:

  • cat/head/tail output byte-fidelity fix + --range argument parsing fix for negative-looking values.
  • file:// URI forms (file:///abs, file://local/abs) now resolve consistently with bare paths across add/search/cat/ls/grep/etc.
  • Oversized search/grep queries are now rejected client-side with a clear message instead of echoing the whole query back in a reqwest error.
  • --upload/--force-upload now correctly conflict with --no-upload; profile URLs are validated at add time; unknown chunk kinds and non-object locator JSON are now rejected before reaching the engine; grep on a nonexistent path now 404s instead of returning an empty result.
  • job list's plain-text view now shows a truncated error snippet for failed jobs instead of requiring --json to see why something failed.
  • mfs --version is now stamped with a short git commit, and the server logs its version on startup, so confirming which build is actually running doesn't require cross-referencing ps aux against install paths.

Test plan

  • cd server/python && uv run --extra dev pytest — 334 passed, 9 skipped (all pre-existing @pytest.mark.live)
  • cd server/python && uv run --extra dev ruff format --check src/ tests/ and ruff check src/ tests/ — clean
  • cd cli && cargo test — 20 passed
  • cd cli && cargo fmt --all -- --check — clean
  • cd cli && cargo build --release — builds clean
  • Each connector/job-lifecycle fix additionally verified live against a real running server and real registered connectors (not just unit tests), documented in the corresponding commit messages

register_or_get_connector now validates every config field before it
reaches redact()/persistence: a secret-looking key (password, dsn,
access_key, ...) or an inline user:pass@host connection string must be
an env:VAR or file:/abs/path reference, or registration fails with a
400. Without this, a plaintext secret got redacted to a literal
placeholder string in connectors.config_json, and on every rebuild
CredentialService.resolve() passed that placeholder straight through
as if it were the real credential — permanently breaking auth for the
connector.

Two more layers close the same gap in depth: redact() now replaces a
secret value with None instead of the placeholder string, so a
plugin's existing falsy-check fallback to credential_ref works as
intended; and resolve() explicitly rejects the old placeholder string
if it's ever seen, instead of resolving it as a literal value.
cat/head/tail printed server content with println!, which appended an
extra newline even when the content already ended in one (e.g. a range
reaching end-of-file), producing a spurious trailing blank line. Switch
to print! so output matches the source bytes exactly.

Also mark --range as allow_hyphen_values so a negative start
(`--range -5:10`) reaches the server for its normal validation error
instead of being rejected by clap as an unrecognized flag.
… engine

A locator that decodes to a list or number crashed cat with a raw 500
('list' object has no attribute 'get', etc.), and a JSON null was silently
treated as no locator at all instead of the malformed input it is. Validate
the decoded locator is a JSON object up front and return a clean 400 for
anything else.
…-oping

--kind used to accept any string: an empty value behaved like no filter
at all, and a typo'd kind (e.g. "boddy") silently matched nothing with
exit 0, indistinguishable from a genuine no-results search. Validate
each comma-separated value against the framework's ChunkKind literal
and return a 400 listing the valid kinds when one doesn't match.
mfs profile add silently accepted any string as a profile URL, so a
typo or wrong scheme only surfaced later as an opaque "builder error"
from reqwest when the profile was actually used. Parse and check the
scheme up front and reject non-http(s) URLs before writing to
client.toml.
…heck

mfs add only checked Path::exists(target) to decide whether to show the
external-connector cost-estimate prompt, so a local directory addressed
via file:///abs or file://local/abs still tripped the prompt even though
the identical bare path skipped it. Extract the underlying path from
those two forms before the existence check, mirroring how the server's
file connector already normalizes them.
remote_path/resolve_path_arg canonicalized the raw path argument
directly, so a file:///abs or file://local/abs spelling of a local
path failed Path::exists()-style checks (the literal URI string is
never a path on disk) and fell through unrewritten. Against a local
server this meant `mfs cat file:///abs/foo` 404'd and a search scoped
to a file:// sub-path silently returned zero hits, even though the
bare-path and canonical file://local forms of the identical target
worked. Reuse local_fs_path_from_target to strip the scheme before
canonicalizing, mirroring the fix already applied to add's target
resolution.
grep's pushdown/BM25/linear-scan dispatch never touched the target
path directly, so a scope that resolved to a real connector but a
missing sub-path just looked like a real search with zero matches
instead of erroring like ls/cat do for the same input. Stat the
resolved path up front so a bad path 404s consistently across the
read commands.
These flags document opposite upload behaviors but had no mutual
exclusion, so passing --upload --no-upload (or --force-upload
--no-upload) was silently accepted with whichever branch the code
happened to check first winning. Wire up conflicts_with so clap
rejects the combination up front with a clear error.
…files

read() let Python's open() raise a raw FileNotFoundError whose message
embeds the absolute local path, propagating unfiltered to the CLI (e.g.
mfs head/tail/cat on a missing path). stat()/list()/grep() already guard
against this; read() now follows the same pattern and re-raises with only
the connector-relative path.
A query or pattern long enough to push the built request past reqwest's
internal URL length limit made client.get(url).query(q) fail inside the
request builder, and the resulting error string echoed the entire value
back -- dumping 100KB+ of query text into stderr for a single search.
Validate query/pattern length client-side before building the request
and fail with a short, actionable error instead.
…ctors

register_or_get_connector could not tell "the user passed --config and it
happens to differ" from "--config was omitted and the URI-derived default
happens to differ from the real stored config". The latter case previously
persisted the derived default anyway, silently dropping credentials,
schemas, and [[objects]] mappings whenever the scheme couldn't fully
reconstruct the stored config from the bare URI (any add/update without
--config on a non-trivial connector).

Add a config_explicit flag threaded from add()'s original config argument
through to register_or_get_connector: when config was omitted and would
drift the stored config, raise config_required before any write happens.
Explicit --config still persists on drift as before (with the existing
warning); no-drift bare re-syncs remain a safe no-op; brand-new connector
registration is unaffected.
…ut chunk_count on failed flushes

EmbedConsumer._flush upserted rows to Milvus without deduping by
chunk_id, so two chunks that hash to the same id within one batch (a
genuine duplicate-source-row or scheduling-race scenario) made Milvus
reject the entire batch, dropping unrelated chunks from other tasks
along with it. Separately, chunk_count was incremented at chunk-intake
time rather than on a successful write, so a task whose batch got
dropped by _fail_batch still reported a nonzero chunk_count as if its
chunks had been written -- exactly the kind of state that lets mfs
status report search data that was never actually persisted.

Now rows are deduped by chunk_id (last-write-wins) before the upsert
call, and chunk_count is credited per task only from the rows that
survive dedup in a flush that actually succeeds.
Plain-view `job list` showed status/op_kind/id only, so a failed sync
gave no visible reason without adding --json. Append a short error
snippet for failed jobs; --json output is unchanged.
…be no-op gap

Audited every connector type for the base class's no-op healthcheck
default. Most already override it; only web and s3 didn't, so `mfs
connector probe` for either always reported ok=true regardless of the
actual config or reachability.

web: run the same allowed_domains gate sync() uses against each
configured start_url, then a cheap GET against the seed. Verified this
catches the exact shape of a prior bug where a port-qualified start_url
was silently excluded by a bare-host allowed_domains entry.

s3: list_objects_v2(MaxKeys=1) against the configured bucket. Verified
against this environment's real (currently invalid) test credentials —
head_bucket only surfaced an undifferentiated 403, list_objects_v2
correctly surfaced InvalidAccessKeyId, so list_objects_v2 is used for
the more actionable error.

Also corrected base.py's healthcheck docstring, which claimed only the
github connector overrides the default — most connector types already
had real overrides by the time this was checked.
`mfs serve start`/`restart` only ever tracked processes they launched
themselves via a pidfile. A server started any other way (e.g. `uv run
mfs-server run` directly) was invisible to them: `status` reported
"not running" while it was genuinely up, and `restart`/`start` would
spawn a second process against the same bind address rather than
recognizing the existing one.

Reproduced against a real out-of-band server before fixing: the second
process doesn't actually race the port (only one process can ever bind
it), but `serve start` declared "started" the instant it spawned the
child, well before the child reaches its own bind attempt — Milvus
connect + embedding model preload took ~15-18s in this environment.
For that whole window `serve status` falsely reported the doomed
process as healthy; only after ~15-18s did it fail with EADDRINUSE and
exit, flipping status back to "not running" even though the original,
untouched server had been fine the entire time.

Fix: `start` now does a cheap TCP probe of the bind address before
spawning — if something's already listening there without a matching
pidfile, refuse rather than spawn a duplicate. If nothing's listening,
it spawns and polls (bounded, 45s) for either a successful connect or
early child exit, only writing the pidfile and reporting success once
the server is actually reachable. `status` now distinguishes "nothing
running" from "something's listening that this CLI didn't launch."
`restart` now waits for the killed process to actually release the
port before calling `start` — the first version of this fix had a race
where `restart` could kill the old process, immediately hit `start`'s
new pre-flight check while the old process was still mid-shutdown, and
refuse to start a replacement, leaving nothing running at all; caught
via live testing, not review.

Also stamps `mfs --version` with a short git commit (new build.rs) and
adds a version line to mfs-server's own startup log, so confirming
which build is actually running no longer requires cross-referencing
`ps aux` against install paths.
Without --config, probe() and estimate() fell back to a URI-derived
default config, empty for any scheme the URI alone can't reconstruct
(postgres/mysql/mongo/s3/web). For an already-registered connector,
that meant testing an unrelated bare connection instead of the real
one -- postgres fell through libpq's ambient defaults to a database
named after the OS user, and reported that failure as if it were a
real connectivity problem with the registered connector.

Both now reuse an already-registered connector's stored config, the
same way inspect() already does, via a shared _resolve_readonly_config
helper. Verified against the real registered postgres connector: probe
now reports ok=true instead of the OS-username DSN error, and the
`add`-without-y cost-estimate path returns a real object/chunk/token
estimate instead of the same misleading 500. Unregistered-connector
probing (URI-derived default, no stored row) is unaffected.
…rawl

allowed_domains matching compared urlparse(url).netloc (host:port)
against entries that are conventionally bare hostnames, so a bare-host
entry never matched a non-default-port URL -- including the
connector's own seed URL, which produced a sync that crawled zero
pages while still reporting the job succeeded.

An allowed_domains entry with a port now requires an exact host:port
match (precision); an entry without one matches that host on any port
via .hostname (convenience) -- both config shapes are valid depending
on whether the user wants "this exact host:port" or "any port on this
host".

Independently of that: a sync that crawls zero pages because every
start_url was excluded by allowed_domains now raises instead of
silently persisting an empty, "succeeded" connector -- this was true
regardless of how the matching semantics above got resolved.

Verified: the historical bug shape (bare-host allowlist, port-
qualified seed) now matches instead of excluding, confirmed via direct
sync() calls; a genuinely-excluded case (mismatched explicit
host:port) now raises with a clear message; re-synced the real
registered web://mfs-local-docs connector end-to-end (job succeeded,
50/50 objects, chunk_count 245->246, live search returns real
content).
…DB flag

`mfs job cancel` only ever flipped the job's DB status. The producer
loop already checks that status at each object boundary (_should_stop),
so a job made of many small objects mostly cancels promptly -- but a
single large object, once handed to the embed consumer, keeps flowing
through the consumer's batch-embed loop with no cancellation check at
all, so a large object already mid-embed kept burning real embed
CPU/API time for however long it took to finish, regardless of cancel.

EmbedConsumer now tracks cancelled job ids (Engine.cancel_job tells it)
and skips embedding/upserting a batch entry belonging to a cancelled
job on the NEXT flush, reusing the existing _fail_batch path rather
than new bookkeeping -- that method's own docstring already listed
"a cancellation" as an intended trigger. This can't interrupt a
batch_embed() call already in flight (there's no way to preempt that),
but it bounds how much work continues after cancel to at most one more
in-flight batch instead of the whole remaining object.

Covered by two new EmbedConsumer unit tests: a cancelled job's queued
chunks skip the embed/upsert call entirely and finalize failed instead
of leaking pending-task bookkeeping; a second job's chunks landing in
the same flush are unaffected. Full test_pipeline.py suite (20 tests)
and test_engine_connector_lifecycle.py (8 tests) both pass. Live
sanity-checked against the real server: restart, search, job list, and
a cancel on a nonexistent job id all still behave correctly.
@zc277584121 zc277584121 merged commit d7b6d3f into zilliztech:main Jul 2, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant