Hi, thanks for open-sourcing the code, it looks great.
I have a question about the length bias mitigation part in the WB reward metric.
It seems that the code truncates both the reference output and the model output to a fixed word count.
I'm curious how to implement the following rule, as mentioned in the paper:
"converting outcomes of “slightly better/worse” to “tie” if the winner’s response exceeds the loser’s by more than K characters."
Or am I missing something here?
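For concreteness, here is how I read that rule, as a minimal sketch rather than anything taken from the repository. The function name, the outcome labels, and the default value of `k` are all my own assumptions; the outcome is taken from the model's point of view, so "slightly_better" means the model output won and "slightly_worse" means the reference won.

```python
def mitigate_length_bias(outcome: str, model_output: str, ref_output: str, k: int = 500) -> str:
    """Hypothetical helper: downgrade a narrow win to a tie when the winner is much longer.

    `outcome` is assumed to be a label from the model's perspective;
    `k` is a character threshold (the default of 500 is an arbitrary placeholder).
    """
    if outcome == "slightly_better":
        winner, loser = model_output, ref_output
    elif outcome == "slightly_worse":
        winner, loser = ref_output, model_output
    else:
        # Ties and clear wins/losses are left unchanged.
        return outcome

    # Lengths are compared in characters, matching the wording of the quoted rule.
    if len(winner) - len(loser) > k:
        return "tie"
    return outcome
```

If this reading is right, the check would run on the untruncated outputs after the judge returns its verdict, which is different from shortening both outputs before judging.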
Thank you in advance!