Releases · ModelCloud/Evalution

Release list

Evalution v0.0.7 Latest

Qubitium released this 19 Apr 08:10

v0.0.7

0ca87c8

Notable Changes:

Add Tinygrad LLM to Engines by @Qubitium in #112

What's Changed

[CI] refactor ci & fix streaming tests by @CSY-ModelCloud in #110
[CI] install device smi & fix build_server_info has no arg by @CSY-ModelCloud in #111
Update pyproject.toml by @Qubitium in #113
Update Transformers fa launch monkeypatch and fold rogue scorer into pkg by @Qubitium in #114
Update ci scores by @Qubitium in #115

Full Changelog: v0.0.6...v0.0.7

Contributors

Qubitium and CSY-ModelCloud

Evalution v0.0.6

Qubitium released this 16 Apr 15:44

v0.0.6

24e393b

Notable Changes:

Added runnable benchmark implementations for hle, supergpqa, hmmt_feb25, hmmt_nov25, hmmt_feb26, imoanswerbench, and livecodebench_v6.
Registered capability-gated placeholders for swe_bench_verified, swe_bench_multilingual, swe_bench_pro, terminal_bench_2, claw_eval_avg, claw_eval_pass3, skillsbench_avg5, qwenclawbench, nl2repo, qwenwebbench, tau3_bench, vita_bench, deepplanning, tool_decathlon, mcpmark, mcp_atlas, and widesearch, with clear runtime-capability errors instead of misleading partial implementations.
Exported the new suites through evalution.benchmarks and added integration metadata/baselines in tests/models_support.py.
Added unit coverage plus standalone Llama 3.2 1B Instruct regression tests for the new runnable suites.
Hardened math answer extraction to handle boxed answers, explicit final-answer lines, and inline math spans more reliably, using compiled pcre patterns.
Added an optional apply_chat_template mode for HLE while keeping the default benchmark-faithful prompt path unchanged.
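
The answer-extraction hardening described above can be illustrated with a minimal sketch. This is not Evalution's actual code: the function names, the exact patterns, and the priority order (boxed answer first, then an explicit final-answer line, then the last inline math span) are assumptions made for illustration. The stdlib re module stands in for the compiled PCRE patterns the release notes mention.

```python
import re

# Hypothetical patterns, compiled once at import time. The release notes
# mention compiled PCRE patterns; stdlib `re` is only a stand-in here.
_FINAL_LINE = re.compile(r"(?im)^\s*final answer\s*[:=]\s*(.+?)\s*$")
_INLINE_MATH = re.compile(r"\$([^$]+)\$")
_BOXED = "\\boxed{"


def _last_boxed(text):
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind(_BOXED)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(_BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no boxed answer


def extract_answer(text):
    """Try boxed answer, then a final-answer line, then the last inline span."""
    boxed = _last_boxed(text)
    if boxed is not None:
        return boxed
    finals = _FINAL_LINE.findall(text)
    if finals:
        # Drop surrounding dollar signs if the answer was written as math.
        return finals[-1].strip().strip("$").strip()
    spans = _INLINE_MATH.findall(text)
    if spans:
        return spans[-1].strip()
    return None
```

A brace counter is used for the boxed case because a single regular expression cannot reliably match nested braces such as \boxed{\frac{1}{2}}.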

What's Changed

Add new benchmark suites by @Qubitium in #108
Update pyproject.toml by @Qubitium in #109

Full Changelog: v0.0.5...v0.0.6

Contributors

Qubitium

Evalution v0.0.5

Qubitium released this 16 Apr 08:29

v0.0.5

ced49cb

What's Changed

[CI] always use a clean env & don't install gptqmodel for other tests by @CSY-ModelCloud in #105
Use last score_count tokens when logits_to_keep is set (no offset slicing) by @ZX-ModelCloud in #106
Update pyproject.toml by @Qubitium in #107
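
The fix in #106 can be pictured with a small sketch. It is an assumption about the general idea, not the project's code: select_score_rows and its parameters are hypothetical, plain lists stand in for logits tensors, and the usual one-position shift between logits and target tokens is ignored. When a model is asked to keep only the trailing logits_to_keep positions, an offset computed from the full sequence length no longer indexes the returned rows, so the scored rows are taken from the end instead.

```python
def select_score_rows(logits, seq_len, score_count, logits_to_keep=None):
    """Return the rows that score the last `score_count` positions.

    `logits` holds one entry per position the model returned.
    `seq_len` is the length of the full input sequence.
    """
    if score_count <= 0:
        return []
    if logits_to_keep is None:
        # Full-length output: row i belongs to position i, so an offset
        # derived from the sequence length is valid.
        start = seq_len - score_count
        if start < 0:
            raise ValueError("score_count exceeds the sequence length")
        return logits[start:start + score_count]
    # Trimmed output: only trailing rows exist, so index from the end
    # rather than applying a full-sequence offset.
    if score_count > len(logits):
        raise ValueError("score_count exceeds the number of kept logits")
    return logits[-score_count:]


full = list(range(10))   # stand-in for 10 rows of logits
trimmed = full[-4:]      # what a model returns with logits_to_keep=4

same_rows = (
    select_score_rows(full, 10, 3)
    == select_score_rows(trimmed, 10, 3, logits_to_keep=4)
)
naive_offset = trimmed[10 - 3:10]  # offset slicing on trimmed output
```

Here same_rows is True because both paths select the final three rows, while naive_offset is an empty list, which is the failure mode that offset slicing would produce on trimmed output.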

Full Changelog: v0.0.4...v0.0.5

Contributors

Qubitium, ZX-ModelCloud, and CSY-ModelCloud

Evalution v0.0.4

Qubitium released this 16 Apr 06:49

v0.0.4

365383e

What's Changed

cleanup by @ZX-ModelCloud in #92
[CI] show real exit code & fix no GPU job timeout by @CSY-ModelCloud in #91
[CI] re-mount /monster for uv by @CSY-ModelCloud in #93
[CI] clean sh code, simplify logic by @CSY-ModelCloud in #95
Evalution setuptools by @Qubitium in #96
[CI] add common env & add prepare env to init by @CSY-ModelCloud in #97
[CI] check setuptools compatibility by @CSY-ModelCloud in #98
[CI] add compatibility check to unit test by @CSY-ModelCloud in #100
Fix setuptools CI workflow interpreter resolution by @Qubitium in #102
[CI] mount workspace for uv by @CSY-ModelCloud in #103
[CI] install torchao==0.17.0+cpu by @CSY-ModelCloud in #104

Full Changelog: v0.0.3...v0.0.4

Contributors

Qubitium, ZX-ModelCloud, and CSY-ModelCloud

Evalution v0.0.3

Qubitium released this 11 Apr 21:49

v0.0.3

4421bba

What's Changed

🚀 Added an OpenAI endpoint-compatible engine by @Qubitium in #85
🔖 Bumped version from 0.0.2 to 0.0.3 by @Qubitium in #86
🧭 Clarified OpenAI engine model argument mapping by @Qubitium in #87
🦙 Added Llama.cpp engine by @Qubitium in #84

Full Changelog: v0.0.2...v0.0.3

Contributors

Qubitium

Evalution v0.0.2

Qubitium released this 11 Apr 12:46

v0.0.2

920636a

What's Changed

🚀 Engines, Integrations, and Runtime Support

Added new engine support across the stack, including GPTQModel, vLLM, SGLang, TensorRT, and OpenVINO. (#16, #33, #34, #43, #52, #60)
Improved the Transformers compatibility engine and split client/producer/work-queue responsibilities out of it for a cleaner architecture. (#9, #39)
Added the compare API and expanded configuration support with cleaner shared base configs and a refactored YAML engine registry. (#14, #41, #46)
Enabled better runtime defaults and compatibility fixes, including paged FA defaults, FA2 callback work, GGUF tokenizer loading, CPU device resolution, and import-time network fetch fixes. (#51, #57, #66, #68, #38)

📊 Evaluations, Benchmarks, and Scoring

Expanded evaluation coverage with GSM8K, ARC, MMLU-Pro, additional eval suites, subset control, and LongBench2 legal baselines. (#6, #7, #12, #13, #74, #77)
Built out benchmarking substantially with a sequence of benchmark additions and baseline updates across multiple PRs. (#20, #21, #22, #23, #24, #25, #28, #35, #36, #70, #67)
Aligned scoring behavior more closely with original papers and improved metric consistency, including ARC scoring syncs, metric key renames, and citation support. (#5, #17, #18, #64)

🧪 Tests and CI

Added test coverage and unit-test dependencies, then hardened CI with self-hosted runners, file-matrix execution, isolated test envs, GPU request handling, auth fixes, command fixes, and dependency/install fixes. (#8, #15, #44, #47, #48, #49, #50, #54, #55, #56, #58, #59, #61, #62, #71, #72, #73, #75, #76, #78)
Updated benchmark baselines and assert messages to make regressions easier to detect. (#67)

🛠️ Refactors, Cleanup, and Reliability

Cleaned up and refactored large parts of the codebase across several passes, including workflow setup, initial project structure, YAML/registry cleanup, and general maintenance. (#1, #2, #4, #10, #11, #19, #53, #63, #80, #83)
Fixed multiple correctness and usability issues, including streaming variable consistency, MMLU logging/splits, tokenizer loading, and miscellaneous naming bugs. (#29, #31, #40, #42, #45)
Replaced stdlib re usage with PyPcre for regex handling. (#81)

📚 Docs and Developer Experience

Refreshed and expanded documentation, including README improvements, OpenVINO docs, and general docs updates/highlights. (#65, #69, #79, #82)

New Contributors 🌟

@Qubitium made their first contribution in #1
@CSY-ModelCloud made their first contribution in #15
@ZX-ModelCloud made their first contribution in #34

Full Changelog: https://github.com/ModelCloud/Evalution/commits/v0.0.2

Contributors

Qubitium, ZX-ModelCloud, and CSY-ModelCloud
