Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
3d40cb7
Clean up e2e, prepare more test cases, Triton model repository is now…
dchoi-viant Nov 26, 2025
f820846
Update comments while considering Field and IO design.
dchoi-viant Nov 26, 2025
de616a6
Refactored out intermediate step in gRPC type handling.
dchoi-viant Nov 26, 2025
e692430
Add model IO detection to Triton models.
dchoi-viant Nov 26, 2025
29169d7
Undo set_sdk change.
dchoi-viant Nov 26, 2025
e0335fd
Fix test to conform with metadata interface.
dchoi-viant Nov 26, 2025
f27c1a7
Fix bug if there is a load error in Router.
dchoi-viant Nov 26, 2025
96ffff9
Fix test to correctly detect auxiliary handling bug.
dchoi-viant Nov 26, 2025
8e72330
Remove unused code, minor refactor.
dchoi-viant Nov 26, 2025
75850e3
Reduce copied code between request and requets tests.
dchoi-viant Nov 26, 2025
3a1f955
Router now can detect and check signatures.
dchoi-viant Dec 2, 2025
306eb25
Add support for multiple routers using the same Triton Inference serv…
dchoi-viant Dec 3, 2025
d922d1d
Retroactively remove support for stating float as an alias for float32
dchoi-viant Dec 8, 2025
34bb186
Add Prometheus metrics to client.
dchoi-viant Jan 26, 2026
3363823
Allow CLI to have non-batch, empty string payloads.
dchoi-viant Jan 26, 2026
034b7b9
Clean up client Prometheus setup, better buckets for batch sizes, add…
dchoi-viant Jan 27, 2026
540d5b2
Add support for batching in router.
dchoi-viant Jan 28, 2026
f918e0b
Fix double decrement on unload.
dchoi-viant Jan 28, 2026
6e2e8ad
Fix nil panics in tests.
dchoi-viant Jan 28, 2026
3948cbd
Fix bug with router registration.
dchoi-viant Jan 28, 2026
882908f
Fix whitespace.
dchoi-viant Jan 28, 2026
bf98130
Add e2e for router
dchoi-viant Jan 29, 2026
7af71fd
Add support for reordered IO in downstream Evaluators for router.
dchoi-viant Jan 29, 2026
259c039
Fix ineffective queue and throttling. Minor code clarifications.
dchoi-viant Jan 29, 2026
73b1b36
Add auxiliary in test case for router.
dchoi-viant Jan 30, 2026
7ec1d4f
Minor comment reformatting.
dchoi-viant Jan 30, 2026
c868968
Add back queueing, minor fix in router config init, refactor router t…
dchoi-viant Feb 2, 2026
b6100a8
Add ARCHITECTURE doc, minor comment updates, removed unused method (t…
dchoi-viant Feb 4, 2026
2ed59ce
Update ARCHITECTURE.md to clarify potential design issues and enhance…
dchoi-viant Feb 12, 2026
3633eec
Add debugging binaries to gitignore, comment router.
dchoi-viant Feb 27, 2026
6cb409e
Add some comments.
dchoi-viant Mar 2, 2026
10fb18a
Fix leak in reloading.
dchoi-viant Mar 2, 2026
75fa792
Add Prometheus Gauge for model health/reload success, some refactors.
dchoi-viant Mar 3, 2026
511e001
Also make Prom gauge for reload OK 1 on boot.
dchoi-viant Mar 3, 2026
c10e88b
Update documentation for dictionary hash behavior.
dchoi-viant Mar 11, 2026
5e6b360
add client support for outputing intended outputs to a file
dchoi-viant Apr 6, 2026
c5ad92d
fix(server handler): no longer swallow Marshal error
dchoi-viant Apr 24, 2026
3d04cb3
Surface io.ReadAll error on 200 OK in client httpPost
dchoi-viant Apr 27, 2026
9678663
refactor(service/handler): explicit-commit response writing with dedi…
dchoi-viant Apr 27, 2026
7a77e45
docs(readme): restructure server endpoints with dedicated eval section
dchoi-viant Apr 28, 2026
b6da5ce
refactor(service/handler): emit JSON error responses with proper stat…
dchoi-viant Apr 28, 2026
90a8e1c
feat(shared/client): parse JSON error body to populate response.Error
dchoi-viant Apr 28, 2026
b4db014
docs(readme): add v0.20.0 versioning note
dchoi-viant Apr 28, 2026
70da0ae
feat(shared/client): add ClientHTTP_shed marker for breaker-rejected …
dchoi-viant Apr 30, 2026
d7730a6
fix(circut): atomic Down writes and lock-protected resetDuration reset
dchoi-viant Apr 30, 2026
0c2dd61
docs(stat): document _pending bugs and per-regime utility
dchoi-viant Apr 30, 2026
1e958f9
feat(circut+client): tail-latency-aware breaker with probabilistic pa…
dchoi-viant Apr 30, 2026
8551b1f
Merge v0.20.x into router-batching
dchoi-viant Apr 30, 2026
8104cd3
fix(self-test): avoid registering native client Prometheus metrics
dchoi-viant Apr 30, 2026
b494c86
fix(shared/client): make disabled Prometheus metrics nil-safe
dchoi-viant Apr 30, 2026
1ecba35
fix(shared/client): validate latency breaker configuration
dchoi-viant May 1, 2026
22b5a0c
Merge remote-tracking branch 'origin/v0.20.x' into router-batching
dchoi-viant May 1, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,15 +3,18 @@
.DS_Cache
.DS_Store

__debug_bin*

/devdata
/vendor

/example/e2e/logs
/example/e2e/data/triton_model_repository

# binaries
/example/client/mlyc/mlyc
/example/server/mly/mly

/toolsv2/aerospike/aerospike
/toolsv2/smasher/cmd/cmd
/toolsv2/toolsv2
/toolsv2/toolsv2
193 changes: 193 additions & 0 deletions ARCHITECTURE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
# Code Architecture

mly consists of the following large conceptual (not strictly programmatically linked) steps and sub-steps:

1. Configuration processing
2. Model initialization
1. Platform evaluator creation
2. Caching support
3. Request handling
1. Input processing
2. Model inference
3. Post-prediction processing
4. Prediction logging
4. Model reloading

---

## 1 Configuration processing

This is a standard step in many services. This will read configuration files and populate defaults.
Details are mainly in [CONFIG.md](CONFIG.md).

*Quirk*: The configuration struct contains both the read configuration and follow-up processed configuration values. For example, `Modified` is populated during model loading, and `DictMeta` is updated when the dictionary is loaded.

---

## 2 Model Initialization

Model initialization occurs in `service.New()` which orchestrates the creation of platform-specific evaluators and supporting infrastructure.

### 2.1 Evaluator Creation

A core concept in mly is the "Evaluator."
An Evaluator is essentially something that can provide some kind of model inference.
All Evaluators implement `platform.PlatformEvaluator`:

```go
type PlatformEvaluator interface {
Predict(ctx context.Context, params []interface{}) ([]interface{}, error)
Signature() *domain.Signature
Dictionary() *common.Dictionary
Inputs() map[string]*domain.Input
ReloadIfNeeded(ctx context.Context) error
Close() error
}
```

There are currently 3 Evaluators:

1. TensorFlow - this Evaluator operates with a `libtensorflow` backend, and has additional logic that supports timeout-based batching.
2. Triton - this Evaluator supports sending prediction requests to a single Triton server via HTTP or gRPC.
3. Router - this Evaluator does not generate any prediction but enables rows in a prediction request to be sent to other Evaluators based on the input.

*Potential design issue*: Evaluator overloading and over-abstraction - the Router operates on the same interface as the TensorFlow and Triton evaluators, but vary in their behavioral labels.

### 2.2 Caching Support

Caching is implemented via `shared/datastore.Service`, which provides a multi-layer cache:

1. **Local in-memory cache** ([`scache`](https://github.com/viant/scache)): Fast local cache with TTL expiration
2. **L1 cache** (Aerospike): Primary distributed cache
3. **L2 cache** (Aerospike): Secondary distributed cache for cache warming

An important concept in mly caching is the *Dictionary hash*.
This is stored with cached values, and is intended to invalidate entries when the model changes (e.g., there is a model weights update).

*Quirk*: Client-read, server-write - based on the observation that if a client does not find a cache entry, then the server is unlikely to also find a cache entry, and to skip the latency overhead from a remote cache check, the server does not check for a cache entry.

*Design debt*: The current client-read, server-write introduces a case when multiple clients concurrently find that a cache entry is missing, and sends the same request to potentially the same mly server, causing the same server to run the same prediction multiple times. This should be controlled on the server side, to avoid unnecessary compute.

*Design debt*: Aerospike coupling - the current implementation depends on Aerospike constructs.

---

## 3 Request Handling

The mly service occupies most of its lifetime serving this purpose. Currently, mly is designed to focus around HTTP requests, using HTTP/2.

Data flow:

```
HTTP Request
→ service.Handler.ServeHTTP()
→ service.Service.Do()
→ service.Evaluator.Predict()
→ service.domain.Transformer
→ service.Response
```

### 3.1 Input Processing

The Input processing step is primarily focused around logic of pulling data from an HTTP-compliant, JSON or URL-based payload and pushing it into a Go (and CGo) compatible data structure for model inference.

This step revolves mainly around the `service/request.Request` struct.

Key components:
- **Feeds**: `[]interface{}` shaped as `[numInputs]([batchSize][1]T)` for model consumption
- **Input**: `*transfer.Input` for Transformer support

The `UnmarshalJSONObject()` method implements `gojay.UnmarshalerJSONObject` for high-performance JSON parsing.

The interaction with *Model inference* involves the `Feeds` field.

*Quirk*: client batching payload reduction - mly provides a convenience / payload reduction feature that permits payloads to have both inputs with a list of 1 values as well as inputs with a list of batch size of values. The server will expand the payload to fit the expected batch size times inputs matrix for the Evaluators.

*Quirk*: payload reading order - the JSON payload must have the `batch_size` key existing before other input keys, as that is required to know if the parser should be expecting a list of values or scalar values.

*Design debt*: `Feeds` type - most of the requests are tracked via input names than offsets; the intermediate data form should be a `map[string]interface{}` (or even `map[string][]interface{}` to capture a potential batch layer), and the conversion to an offset-based slice should be isolated to TensorFlow graph related code.

### 3.2 Model Inference

Model inference is delegated to the platform-specific evaluator via `Predict()`:

**TensorFlow** (`service/tfmodel`):
- Optional batching via `service/tfmodel/batcher.Service` aggregates concurrent requests
- Direct evaluation via `service/tfmodel/evaluator.Service` runs TensorFlow session
- Semaphore-controlled concurrency prevents overload

For Triton, [concurrency and timeout-based batching is controlled via Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html).

**Triton** (`service/triton`):
- HTTP or gRPC call to Triton Inference Server
- Input tensors serialized per Triton protocol
- Timeout-controlled requests

The router is a control layer that can route to various model inference services.
Each downstream evaluator is responsible for controlling their own lifetime and batching concerns.
In theory, a router can also route to another router, but no such implementation yet exists.

**Router** (`service/platform/router`):
- Extracts routing key from input
- Groups rows by target model
- Parallel dispatch to downstream evaluators
- Result reassembly in original order

### 3.3 Post-Prediction Processing

After inference, `Service.buildResponse()` handles output transformation:

**Transformer execution**
The configured `domain.Transformer` function transforms raw model output into a `common.Storable` for serialization.
The default transformer extracts values keyed by output tensor names.

The `domain.Transformer` signature:

```go
type Transformer func(ctx context.Context, signature *Signature, input *gtly.Object, output interface{}) (common.Storable, error)
```

**Cache storage**
If caching is enabled, transformed results are stored asynchronously via `datastore.Put()`.

*Design debt*: Batch-based Transformer - the current Transformer API operates at the request level outputs but at row-level inputs, and is invoked per row.

### 3.4 Prediction Logging

If `Stream` is configured, the `stream.Service` logs requests for analytics:
- Request body
- Model output
- Inference duration

Logging uses `github.com/viant/tapper` for configurable output destinations.

---

## 4 Model Reloading

Model reloading runs continuously in a background goroutine (`Service.pollModelReload()`), checking for updates at configurable intervals (`ReloadPollIntervalSeconds`).

The `ReloadIfNeeded()` implementation is platform-specific, and varies similarly to model prediction in how much is implemented vs. delegated:

**TensorFlow** (`service/tfmodel.Service`):
1. Check file modification times at `URL`
2. If changed: copy model files to `Location`, load SavedModel
3. Extract signature and dictionary from graph
4. Create new `service/tfmodel/evaluator.Service` and optionally `service/tfmodel/batcher.Service`
5. Atomically swap Evaluators under mutex protection

**Triton** (`service/triton.TritonEvaluator`):
1. Check model health via `ModelReady()` API
2. If not ready and in EXPLICIT mode: call `ModelLoad()`
3. Refresh metadata via `ModelMetadata()` if signature not yet captured

**Router** (`service/platform/router.Router`):
1. Check routing configuration file modification
2. Reload routing table if changed
3. Create/destroy downstream Evaluators as needed
4. Atomically swap routing table under mutex protection
5. Unload unused models from Triton via Model Control API

Reload health is tracked via `Service.ReloadOK` for centralized health reporting.

*Design issue*: Over-abstraction of `ReloadIfNeeded()` - we note that this is a very high-level abstraction that could be broken down into separate concerns e.g., check health, load model, check if reload needed, etc.
13 changes: 7 additions & 6 deletions CONFIG.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,10 +19,10 @@ Properties:
* to use GCS, set environment variable `GOOGLE_APPLICATION_CREDENTIALS=true`
- `Location`: `string` - optional - where a copy of the models will be stored when loading the model. Defaults to the system temporary directory.
- `Dir`: `string` - optional - any further path elements in `Location`. Mainly used if using a ZIP file with additional directories.
- `DataStore`: `string` - optional - name of Datastore to cache, should match `Datastores[].ID`.
- `DataStore`: `string` - optional - name of Datastore to use for caching, should match `Datastores[].ID`. Server-side datastore writes are enabled only when `UseDict` is `true` or unset.
- `Transformer`: `string` - optional - name of model output transformer. See [#Transformer](#Transformer).
- `Batch`: optional - enables or overrides server-side batching configuration. See [`service/tfmodel/batcher/config/config.go`](service/tfmodel/batcher/config/config.go).
- `UseDict`: `bool` - optional - if true, enables capabilities designed to shrink the cache key space by replacing out-of-vocabulary inputs from cache keys with a special token.
- `UseDict`: `bool` - optional - if true or unset, enables dictionary-based cache behavior, including replacing out-of-vocabulary inputs in cache keys with a special token and allowing the server to generate datastore cache entries when `DataStore` is configured. If false, the server will not generate new datastore cache entries for the model.
- `Inputs`: used to further provide or define inputs, a list of `shared.Field`. For TensorFlow models, this is automatically populated, but further caching configurations need to be specified.
* `Name`: `string` - required - input name, only required if an entry is provided.
* `Index`: `int` - optional - used to maintain cache key ordering.
Expand Down Expand Up @@ -73,7 +73,7 @@ Can be empty - represent a list of caching data stores.

Properties:

- `ID`: `string` - required - datastore ID (to be matched with `Models[].DataStores[].ID`)
- `ID`: `string` - required - datastore ID (to be matched with `Models[].DataStore`)
- `Connection`: `string` - optional - connection ID
- `Namespace`: `string` - optional - Aerospike namespace
- `Dataset`: `string` - optional - Aerospike dataset
Expand Down Expand Up @@ -109,9 +109,10 @@ mly := client.New("$modelID", []*client.Host{client.NewHost("mlServiceHost", mlS
```

Where optional `options` can be of, but not limited to, the following:
* `NewCacheSize(sizeOption)`
* `NewCacheScope(CacheScopeLocal|CacheScopeL1|CacheScopeL2)`
* `NewGmetric()` - custom instance of `gmetric` service
* `WithCacheSize(sizeOption)`
* `WithCacheScope(CacheScopeLocal|CacheScopeL1|CacheScopeL2)`
* `WithGmetrics()` - custom instance of `gmetric` service
* `WithHashValidation(true)` - enables client-side rejection of cached entries with a non-zero hash that differs from the client's current dictionary hash

See [`shared/client/option.go`](shared/client/option.go) for more options.

Expand Down
Loading