feat(llmobs): add json metric type support for evaluations #16344
aniszoubiramar merged 1 commit into main
## Description

Adds support for a new `json` metric type in LLMObs evaluation metrics. This allows users to submit dictionary values as evaluation metrics.

Changes:
- `LLMObs.submit_evaluation()`: Now accepts `metric_type="json"` with dict values
- Experiment evaluators: Auto-detect dict return values as `json` metric type
- Telemetry: Updated to track `json` metric type
- TypedDict: Added `json_value` field to `LLMObsExperimentEvalMetricEvent`

## Testing

- Added unit test for json metric type validation
- Updated existing tests for new error messages
- Tested E2E with staging experiments

## Risks

None - additive change, backward compatible

## Additional Notes

MLOB-5400
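As a rough illustration of the evaluator auto-detection listed under Changes, here is a minimal standalone sketch. It does not import `ddtrace`, and the helper names `infer_metric_type` and `build_eval_metric` are hypothetical. The mapping of non-dict values to `boolean`, `score`, and `categorical`, and the `<metric_type>_value` key pattern, are assumptions modeled on the `json_value` field named above, not a copy of the actual implementation.

```python
from typing import Any, Dict


def infer_metric_type(value: Any) -> str:
    """Guess a metric type from an evaluator's return value (assumed mapping)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    # The behavior this PR describes: dict return values become "json".
    if isinstance(value, dict):
        return "json"
    # Assumption: anything else is treated as a categorical label.
    return "categorical"


def build_eval_metric(label: str, value: Any) -> Dict[str, Any]:
    """Build a simplified, hypothetical metric event for one evaluator result."""
    metric_type = infer_metric_type(value)
    if metric_type == "categorical":
        value = str(value)
    # The value is stored under "<metric_type>_value", e.g. "json_value".
    return {
        "label": label,
        "metric_type": metric_type,
        f"{metric_type}_value": value,
    }


if __name__ == "__main__":
    print(build_eval_metric("quality", {"tone": "polite", "issues": []}))
    print(build_eval_metric("accuracy", 0.92))
```

Through the public API, the equivalent explicit submission would pass `metric_type="json"` and a dict `value` to `LLMObs.submit_evaluation()`; the remaining arguments are not described in this PR and are left out here.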
## Performance SLOs

Comparing candidate anis.amar/MLOB-5400/support-json-output-in-the-sdk (1c9764b) with baseline main (ae08405)

📈 Performance Regressions (3 suites)

📈 iastaspectsospath - 24/24 ✅

| Benchmark | Time | Time SLO | vs SLO | vs baseline | Memory | Memory SLO | vs SLO | vs baseline |
|---|---|---|---|---|---|---|---|---|
| ospathbasename_aspect | 503.778µs | <700.000µs | -28.0% | 📈 +19.7% | 42.939MB | <46.000MB | -6.7% | +5.7% |
| ospathbasename_noaspect | 430.470µs | <700.000µs | -38.5% | +2.4% | 42.920MB | <46.000MB | -6.7% | +5.3% |
| ospathjoin_aspect | 629.581µs | <700.000µs | -10.1% | +2.2% | 42.959MB | <46.000MB | -6.6% | +6.1% |
| ospathjoin_noaspect | 635.805µs | <700.000µs | -9.2% | +3.0% | 42.900MB | <46.000MB | -6.7% | +5.4% |
| ospathnormcase_aspect | 350.954µs | <700.000µs | -49.9% | +1.4% | 42.939MB | <46.000MB | -6.7% | +5.8% |
| ospathnormcase_noaspect | 363.976µs | <700.000µs | -48.0% | +3.9% | 42.920MB | <46.000MB | -6.7% | +5.4% |
| ospathsplit_aspect | 497.713µs | <700.000µs | -28.9% | +3.7% | 42.959MB | <46.000MB | -6.6% | +5.6% |
| ospathsplit_noaspect | 503.853µs | <700.000µs | -28.0% | +3.4% | 42.979MB | <46.000MB | -6.6% | +5.6% |
| ospathsplitdrive_aspect | 379.258µs | <700.000µs | -45.8% | +1.7% | 42.939MB | <46.000MB | -6.7% | +5.1% |
| ospathsplitdrive_noaspect | 73.178µs | <700.000µs | -89.5% | +0.2% | 43.018MB | <46.000MB | -6.5% | +5.8% |
| ospathsplitext_aspect | 461.901µs | <700.000µs | -34.0% | +0.3% | 42.900MB | <46.000MB | -6.7% | +5.5% |
| ospathsplitext_noaspect | 471.216µs | <700.000µs | -32.7% | +2.0% | 42.939MB | <46.000MB | -6.7% | +5.4% |

📈 iastaspectssplit - 12/12 ✅

| Benchmark | Time | Time SLO | vs SLO | vs baseline | Memory | Memory SLO | vs SLO | vs baseline |
|---|---|---|---|---|---|---|---|---|
| rsplit_aspect | 162.195µs | <250.000µs | -35.1% | 📈 +11.1% | 42.920MB | <46.000MB | -6.7% | +5.9% |
| rsplit_noaspect | 157.883µs | <250.000µs | -36.8% | +3.4% | 42.900MB | <46.000MB | -6.7% | +5.6% |
| split_aspect | 147.189µs | <250.000µs | -41.1% | +1.7% | 42.979MB | <46.000MB | -6.6% | +6.0% |
| split_noaspect | 155.348µs | <250.000µs | -37.9% | +2.4% | 42.939MB | <46.000MB | -6.7% | +5.6% |
| splitlines_aspect | 144.195µs | <250.000µs | -42.3% | -0.7% | 42.880MB | <46.000MB | -6.8% | +5.8% |
| splitlines_noaspect | 149.378µs | <250.000µs | -40.2% | -2.2% | 42.880MB | <46.000MB | -6.8% | +5.3% |

📈 telemetryaddmetric - 30/30 ✅

| Benchmark | Time | Time SLO | vs SLO | vs baseline | Memory | Memory SLO | vs SLO | vs baseline |
|---|---|---|---|---|---|---|---|---|
| 1-count-metric-1-times | 3.419µs | <20.000µs | -82.9% | 📈 +13.1% | 35.507MB | <38.000MB | -6.6% | +4.7% |
| 1-count-metrics-100-times | 210.021µs | <220.000µs | -4.5% | +4.9% | 35.468MB | <38.000MB | -6.7% | +4.6% |
| 1-distribution-metric-1-times | 3.291µs | <20.000µs | -83.5% | -3.7% | 35.606MB | <38.000MB | -6.3% | +5.0% |
| 1-distribution-metrics-100-times | 216.844µs | <230.000µs | -5.7% | +0.9% | 35.527MB | <38.000MB | -6.5% | +5.0% |
| 1-gauge-metric-1-times | 2.175µs | <20.000µs | -89.1% | -0.7% | 35.507MB | <38.000MB | -6.6% | +5.0% |
| 1-gauge-metrics-100-times | 136.264µs | <150.000µs | -9.2% | -0.2% | 35.547MB | <38.000MB | -6.5% | +5.1% |
| 1-rate-metric-1-times | 3.089µs | <20.000µs | -84.6% | -3.4% | 35.488MB | <38.000MB | -6.6% | +5.5% |
| 1-rate-metrics-100-times | 222.497µs | <250.000µs | -11.0% | +3.8% | 35.527MB | <38.000MB | -6.5% | +4.5% |
| 100-count-metrics-100-times | 20.535ms | <22.000ms | -6.7% | +1.3% | 35.488MB | <38.000MB | -6.6% | +4.9% |
| 100-distribution-metrics-100-times | 2.229ms | <2.550ms | -12.6% | -0.9% | 35.645MB | <38.000MB | -6.2% | +5.1% |
| 100-gauge-metrics-100-times | 1.404ms | <1.550ms | -9.4% | -0.7% | 35.547MB | <38.000MB | -6.5% | +4.9% |
| 100-rate-metrics-100-times | 2.239ms | <2.550ms | -12.2% | +2.2% | 35.566MB | <38.000MB | -6.4% | +5.0% |
| flush-1-metric | 4.486µs | <20.000µs | -77.6% | -3.5% | 35.566MB | <38.000MB | -6.4% | +4.7% |
| flush-100-metrics | 174.080µs | <250.000µs | -30.4% | +0.3% | 35.566MB | <38.000MB | -6.4% | +4.5% |
| flush-1000-metrics | 2.195ms | <2.500ms | -12.2% | +0.1% | 36.333MB | <38.750MB | -6.2% | +4.5% |

🟡 Near SLO Breach (1 suite)

🟡 tracer - 6/6 ✅

| Benchmark | Time | Time SLO | vs SLO | vs baseline | Memory | Memory SLO | vs SLO | vs baseline |
|---|---|---|---|---|---|---|---|---|
| large | 31.560ms | <32.950ms | -4.2% | -0.4% | 36.825MB | <39.250MB | -6.2% | +5.9% |
| medium | 3.145ms | <3.200ms | 🟡 -1.7% | +0.6% | 35.212MB | <38.750MB | -9.1% | +4.7% |
| small | 365.285µs | <370.000µs | 🟡 -1.3% | +4.0% | 35.173MB | <38.750MB | -9.2% | +4.4% |