Releases: ModelCloud/Evalution
Evalution v0.0.7
Notable Changes:
What's Changed
- [CI] refactor ci & fix streaming tests by @CSY-ModelCloud in #110
- [CI] install device smi & fix build_server_info has no arg by @CSY-ModelCloud in #111
- Update pyproject.toml by @Qubitium in #113
- Update Transformers fa launch monkeypatch and fold rogue scorer into pkg by @Qubitium in #114
- Update ci scores by @Qubitium in #115
Full Changelog: v0.0.6...v0.0.7
Evalution v0.0.6
Notables
- Added runnable benchmark implementations for hle, supergpqa, hmmt_feb25, hmmt_nov25, hmmt_feb26, imoanswerbench, and livecodebench_v6.
- Registered capability-gated placeholders for swe_bench_verified, swe_bench_multilingual, swe_bench_pro, terminal_bench_2, claw_eval_avg, claw_eval_pass3, skillsbench_avg5, qwenclawbench, nl2repo, qwenwebbench, tau3_bench, vita_bench, deepplanning, tool_decathlon, mcpmark, mcp_atlas, and widesearch, with clear runtime-capability errors instead of misleading partial implementations.
- Exported the new suites through evalution.benchmarks and added integration metadata/baselines in tests/models_support.py.
- Added unit coverage plus standalone Llama 3.2 1B Instruct regression tests for the new runnable suites.
- Hardened math answer extraction to handle boxed answers, explicit final-answer lines, and inline math spans more reliably, using compiled pcre patterns.
- Added an optional apply_chat_template mode for HLE while keeping the default benchmark-faithful prompt path unchanged.
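As an illustration of the math answer extraction described above, here is a minimal sketch of a three-stage fallback: boxed answer first, then an explicit final-answer line, then the last inline math span. It is not Evalution's implementation. The function names, the patterns, and the priority order are assumptions, and it uses the standard-library re module in place of the compiled pcre patterns the release mentions.

```python
import re

# Hypothetical patterns; the real ones used by Evalution are not shown in these notes.
BOXED = re.compile(r"\\boxed\s*\{")
FINAL_LINE = re.compile(r"(?im)^\s*(?:final\s+answer|answer)\s*[:=]\s*(.+?)\s*$")
INLINE_MATH = re.compile(r"\$([^$\n]+)\$")


def _last_boxed(text):
    """Return the content of the last \\boxed{...}, honoring nested braces."""
    result = None
    for match in BOXED.finditer(text):
        depth = 1
        start = match.end()
        i = start
        # Walk forward until the brace opened by \boxed{ is closed.
        while i < len(text) and depth:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:
            result = text[start:i - 1].strip()
    return result


def extract_answer(text):
    """Try boxed answers, then final-answer lines, then inline math spans."""
    boxed = _last_boxed(text)
    if boxed is not None:
        return boxed
    lines = FINAL_LINE.findall(text)
    if lines:
        # Drop surrounding $ delimiters if the line wraps the answer in math.
        return lines[-1].strip().strip("$").strip()
    spans = INLINE_MATH.findall(text)
    if spans:
        return spans[-1].strip()
    return None
```

Brace counting is used for the boxed case because a single regular expression cannot reliably match nested braces such as \boxed{\frac{1}{2}}.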
What's Changed
Full Changelog: v0.0.5...v0.0.6
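The capability-gated placeholders in v0.0.6 can be pictured roughly as follows. This is only a sketch of the general pattern: the class names, the capability strings, and the run signature are hypothetical and are not taken from the Evalution code.

```python
class MissingCapabilityError(RuntimeError):
    """Raised when a benchmark needs a runtime capability that is absent."""


class PlaceholderBenchmark:
    """Registered benchmark that refuses to run without its capabilities."""

    def __init__(self, name, required):
        self.name = name
        self.required = frozenset(required)

    def run(self, available=()):
        missing = sorted(self.required - set(available))
        if missing:
            # Fail loudly instead of returning a misleading partial score.
            raise MissingCapabilityError(
                f"{self.name} requires runtime capabilities that are not "
                f"available: {', '.join(missing)}"
            )
        # A placeholder has no scoring logic of its own yet.
        raise NotImplementedError(f"{self.name} is registered as a placeholder")


# Hypothetical capability name, for illustration only.
bench = PlaceholderBenchmark("swe_bench_verified", required={"docker_sandbox"})
```

Calling bench.run() with no capabilities raises MissingCapabilityError naming docker_sandbox, so the failure names the missing capability rather than producing a partial result.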
Evalution v0.0.5
What's Changed
- [CI] always use a clean env & don't install gptqmodel for other tests by @CSY-ModelCloud in #105
- Use last score_count tokens when logits_to_keep is set (no offset slicing) by @ZX-ModelCloud in #106
- Update pyproject.toml by @Qubitium in #107
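The fix in #106 can be read as follows. When logits_to_keep is passed as an integer, Transformers models return logits only for the trailing positions, so an index computed against the full sequence no longer points at the right rows. The sketch below is an interpretation of the PR title, not the actual patch; the function name, the offset parameter, and the use of plain lists in place of tensors are assumptions.

```python
def continuation_rows(logits, score_count, logits_to_keep=None, offset=0):
    """Pick the rows of `logits` that score the last `score_count` tokens.

    logits: per-position rows returned by the model (plain lists here).
    offset: index of the first scored row in a full-length output (hypothetical).
    """
    if score_count <= 0:
        return []
    if logits_to_keep is not None:
        # Only trailing positions were returned, so take the tail directly.
        return logits[-score_count:]
    # Full-length output: slice by absolute position.
    return logits[offset:offset + score_count]


full = list(range(10))   # stand-in for 10 rows of logits
kept = full[-4:]         # what a model would return with logits_to_keep=4

same_rows = (
    continuation_rows(full, 3, offset=7)
    == continuation_rows(kept, 3, logits_to_keep=4)
)
stale_slice = kept[7:10]  # offset slicing on the truncated output finds nothing
```

Here both paths select rows 7 to 9, while applying the full-sequence offset to the truncated output yields an empty slice.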
Full Changelog: v0.0.4...v0.0.5
Evalution v0.0.4
What's Changed
- cleanup by @ZX-ModelCloud in #92
- [CI] show real exit code & fix no GPU job timeout by @CSY-ModelCloud in #91
- [CI] re-mount /monster for uv by @CSY-ModelCloud in #93
- [CI] clean sh codes, simplify logic by @CSY-ModelCloud in #95
- Evalution setuptools by @Qubitium in #96
- [CI] add common env & add prepare env to init by @CSY-ModelCloud in #97
- [CI] check setuptools compatibility by @CSY-ModelCloud in #98
- [CI] add compatibility check to unit test by @CSY-ModelCloud in #100
- Fix setuptools CI workflow interpreter resolution by @Qubitium in #102
- [CI] mount workspace for uv by @CSY-ModelCloud in #103
- [CI] install torchao==0.17.0+cpu by @CSY-ModelCloud in #104
Full Changelog: v0.0.3...v0.0.4
Evalution v0.0.3
Evalution v0.0.2
What's Changed
🚀 Engines, Integrations, and Runtime Support
- Added new engine support across the stack, including GPTQModel, vLLM, SGLang, TensorRT, and OpenVINO. (#16, #33, #34, #43, #52, #60)
- Improved the Transformers compatibility engine and split client/producer/work-queue responsibilities out of it for a cleaner architecture. (#9, #39)
- Added the compare API and expanded configuration support with cleaner shared base configs and a refactored YAML engine registry. (#14, #41, #46)
- Enabled better runtime defaults and compatibility fixes, including paged FA defaults, FA2 callback work, GGUF tokenizer loading, CPU device resolution, and import-time network fetch fixes. (#51, #57, #66, #68, #38)
📊 Evaluations, Benchmarks, and Scoring
- Expanded evaluation coverage with GSM8K, ARC, MMLU-Pro, additional eval suites, subset control, and LongBench2 legal baselines. (#6, #7, #12, #13, #74, #77)
- Built out benchmarking substantially with a sequence of benchmark additions and baseline updates across multiple PRs. (#20, #21, #22, #23, #24, #25, #28, #35, #36, #70, #67)
- Aligned scoring behavior more closely with original papers and improved metric consistency, including ARC scoring syncs, metric key renames, and citation support. (#5, #17, #18, #64)
🧪 Tests and CI
- Added test coverage and unit-test dependencies, then hardened CI with self-hosted runners, file-matrix execution, isolated test envs, GPU request handling, auth fixes, command fixes, and dependency/install fixes. (#8, #15, #44, #47, #48, #49, #50, #54, #55, #56, #58, #59, #61, #62, #71, #72, #73, #75, #76, #78)
- Updated benchmark baselines and assert messages to make regressions easier to detect. (#67)
🛠️ Refactors, Cleanup, and Reliability
- Cleaned up and refactored large parts of the codebase across several passes, including workflow setup, initial project structure, YAML/registry cleanup, and general maintenance. (#1, #2, #4, #10, #11, #19, #53, #63, #80, #83)
- Fixed multiple correctness and usability issues, including streaming variable consistency, MMLU logging/splits, tokenizer loading, and miscellaneous naming bugs. (#29, #31, #40, #42, #45)
- Replaced stdlib re usage with PyPcre for regex handling. (#81)
📚 Docs and Developer Experience
- Refreshed and expanded documentation, including README improvements, OpenVINO docs, and general docs updates/highlights. (#65, #69, #79, #82)
New Contributors 🌟
- @Qubitium made their first contribution in #1
- @CSY-ModelCloud made their first contribution in #15
- @ZX-ModelCloud made their first contribution in #34
Full Changelog: https://github.com/ModelCloud/Evalution/commits/v0.0.2