Skip to content

Releases: ModelCloud/Evalution

Evalution v0.0.7

Choose a tag to compare

@Qubitium Qubitium released this 19 Apr 08:10
0ca87c8

Notable Changes:

What's Changed

Full Changelog: v0.0.6...v0.0.7

Evalution v0.0.6

Choose a tag to compare

@Qubitium Qubitium released this 16 Apr 15:44
24e393b

Notables

  • Added runnable benchmark implementations for hle, supergpqa, hmmt_feb25, hmmt_nov25, hmmt_feb26, imoanswerbench, and livecodebench_v6.
  • Registered capability-gated placeholders for swe_bench_verified, swe_bench_multilingual, swe_bench_pro, terminal_bench_2, claw_eval_avg, claw_eval_pass3, skillsbench_avg5, qwenclawbench,
    nl2repo, qwenwebbench, tau3_bench, vita_bench, deepplanning, tool_decathlon, mcpmark, mcp_atlas, and widesearch, with clear runtime-capability errors instead of misleading partial implementations.
  • Exported the new suites through evalution.benchmarks and added integration metadata/baselines in tests/models_support.py.
  • Added unit coverage plus standalone Llama 3.2 1B Instruct regression tests for the new runnable suites.
  • Hardened math answer extraction to handle boxed answers, explicit final-answer lines, and inline math spans more reliably, using compiled pcre patterns.
  • Added an optional apply_chat_template mode for HLE while keeping the default benchmark-faithful prompt path unchanged.

What's Changed

Full Changelog: v0.0.5...v0.0.6

Evalution v0.0.5

Choose a tag to compare

@Qubitium Qubitium released this 16 Apr 08:29
ced49cb

What's Changed

Full Changelog: v0.0.4...v0.0.5

Evalution v0.0.4

Choose a tag to compare

@Qubitium Qubitium released this 16 Apr 06:49
365383e

What's Changed

Full Changelog: v0.0.3...v0.0.4

Evalution v0.0.3

Choose a tag to compare

@Qubitium Qubitium released this 11 Apr 21:49
4421bba

What's Changed

  • 🚀 Added an OpenAI endpoint-compatible engine by @Qubitium in #85
  • 🔖 Bumped version from 0.0.2 to 0.0.3 by @Qubitium in #86
  • 🧭 Clarified OpenAI engine model argument mapping by @Qubitium in #87
  • 🦙 Added Llama.cpp engine by @Qubitium in #84

Full Changelog: v0.0.2...v0.0.3

Evalution v0.0.2

Choose a tag to compare

@Qubitium Qubitium released this 11 Apr 12:46
920636a

What's Changed

🚀 Engines, Integrations, and Runtime Support

  • Added new engine support across the stack, including GPTQModel, vLLM, SGLang, TensorRT, and OpenVINO. (#16, #33, #34, #43, #52, #60)
  • Improved the Transformers compatibility engine and split client/producer/work-queue responsibilities out of it for a cleaner architecture. (#9, #39)
  • Added the compare API and expanded configuration support with cleaner shared base configs and a refactored YAML engine registry. (#14, #41, #46)
  • Enabled better runtime defaults and compatibility fixes, including paged FA defaults, FA2 callback work, GGUF tokenizer loading, CPU device resolution, and import-time network fetch fixes. (#51, #57, #66, #68, #38)

📊 Evaluations, Benchmarks, and Scoring

  • Expanded evaluation coverage with GSM8K, ARC, MMLU-Pro, additional eval suites, subset control, and LongBench2 legal baselines. (#6, #7, #12, #13, #74, #77)
  • Built out benchmarking substantially with a sequence of benchmark additions and baseline updates across multiple PRs. (#20, #21, #22, #23, #24, #25, #28, #35, #36, #70, #67)
  • Aligned scoring behavior more closely with original papers and improved metric consistency, including ARC scoring syncs, metric key renames, and citation support. (#5, #17, #18, #64)

🧪 Tests and CI

  • Added test coverage and unit-test dependencies, then hardened CI with self-hosted runners, file-matrix execution, isolated test envs, GPU request handling, auth fixes, command fixes, and dependency/install fixes. (#8, #15, #44, #47, #48,
    #49, #50, #54, #55, #56, #58, #59, #61, #62, #71, #72, #73, #75, #76, #78)
  • Updated benchmark baselines and assert messages to make regressions easier to detect. (#67)

🛠️ Refactors, Cleanup, and Reliability

  • Cleaned up and refactored large parts of the codebase across several passes, including workflow setup, initial project structure, YAML/registry cleanup, and general maintenance. (#1, #2, #4, #10, #11, #19, #53, #63, #80, #83)
  • Fixed multiple correctness and usability issues, including streaming variable consistency, MMLU logging/splits, tokenizer loading, and miscellaneous naming bugs. (#29, #31, #40, #42, #45)
  • Replaced stdlib re usage with PyPcre for regex handling. (#81)

📚 Docs and Developer Experience

  • Refreshed and expanded documentation, including README improvements, OpenVINO docs, and general docs updates/highlights. (#65, #69, #79, #82)

New Contributors 🌟

Full Changelog: https://github.com/ModelCloud/Evalution/commits/v0.0.2