Question about length bias mitigation in pairwise evaluation #20

Hi, thanks for open-sourcing the code, it looks great.

I have a question about the length bias mitigation in the WB reward metric.

It seems that [the code](https://github.com/allenai/WildBench/blob/d6b8dcaf377d173d031980f97c16e1a82618c03d/src/eval.py#L354C1-L355C62) truncates both the reference output and the model output to a fixed word count.

I'm curious how the following rule, as described in the paper, is implemented:

> converting outcomes of “slightly better/worse” to “tie” if the winner’s response exceeds the loser’s by more than K characters.

Or am I missing something here?
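For concreteness, here is a minimal sketch of how I read that rule. It is only my interpretation of the quoted sentence, not code from the repository: the function name `mitigate_length_bias`, the outcome label strings, and the parameter `k` are all hypothetical names I made up for illustration.

```python
def mitigate_length_bias(outcome: str, response_a: str, response_b: str, k: int) -> str:
    """Hypothetical helper (not from the WildBench repo).

    `outcome` is the judgment from A's point of view, using assumed labels:
    "much_better", "slightly_better", "tie", "slightly_worse", "much_worse".
    A "slightly" outcome becomes "tie" when the winner's response is more
    than `k` characters longer than the loser's.
    """
    if outcome == "slightly_better":
        winner, loser = response_a, response_b
    elif outcome == "slightly_worse":
        winner, loser = response_b, response_a
    else:
        # "much better/worse" and "tie" are left untouched by the rule.
        return outcome

    # Lengths are measured in characters, as in the quoted sentence.
    if len(winner) - len(loser) > k:
        return "tie"
    return outcome


# Example: A wins slightly but is 600 characters longer, with k = 500.
print(mitigate_length_bias("slightly_better", "x" * 1500, "y" * 900, k=500))
```

If the intended behavior differs from this (for example, if lengths are compared after the truncation linked above), that is exactly the part I would like to understand.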

Thank you in advance!
