Has anyone successfully reproduced the throughput result?

Model: `relaxml/Llama-2-7b-E8PRVQ-4Bit`
On an A6000, I only get ~82 toks/s, which doesn't match the 95 toks/s reported in the paper.
On a 6000 Ada, I get ~109 toks/s, while the paper reports 140 toks/s.
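
To put the gap in relative terms, here is a quick calculation using only the numbers above. The helper name `shortfall_pct` is just for illustration.

```python
def shortfall_pct(measured: float, reported: float) -> float:
    """Percentage by which the measured throughput falls below the reported one."""
    return (reported - measured) / reported * 100.0


# (measured toks/s, toks/s reported in the paper), taken from the numbers above
results = {
    "A6000": (82.0, 95.0),
    "6000 Ada": (109.0, 140.0),
}

for gpu, (measured, reported) in results.items():
    print(f"{gpu}: {shortfall_pct(measured, reported):.1f}% below the paper")
```

That works out to roughly 13.7% below the paper on the A6000 and 22.1% below on the 6000 Ada.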

Command: `python eval/eval_speed.py --hf_path relaxml/Llama-2-7b-E8PRVQ-4Bit`

I also tried `python interactive_gen.py --hf_path relaxml/Llama-2-7b-chat-E8PRVQ-4Bit`, but the throughput is strangely slow: only 5.77 toks/s.
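
One thing that might be worth ruling out (an assumption on my side, not something I have confirmed) is whether the 5.77 toks/s figure includes one-time start-up cost. A minimal sketch of a timing helper that discards warm-up runs is below. `measure_throughput` and `generate_fn` are hypothetical names, not part of this repo; `generate_fn` stands for any callable that runs one generation and returns the number of tokens it produced.

```python
import time
from typing import Callable


def measure_throughput(
    generate_fn: Callable[[], int],
    warmup: int = 1,
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return tokens per second over `runs` timed calls, after `warmup` untimed calls.

    `generate_fn` is a hypothetical callable returning the number of tokens it
    generated. On a GPU it should wait for the device to finish (for example
    with torch.cuda.synchronize()) before returning, otherwise the timing only
    covers kernel launches.
    """
    # Untimed calls so that one-time start-up cost does not skew the result.
    for _ in range(warmup):
        generate_fn()

    total_tokens = 0
    start = clock()
    for _ in range(runs):
        total_tokens += generate_fn()
    elapsed = clock() - start
    return total_tokens / elapsed
```

If the number from a helper like this is much higher than what the interactive script prints, the difference is probably start-up overhead rather than decoding speed.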

The third-party repository QuIP-for-all has some bugs, so I couldn't run it.


