
Has anyone successfully reproduced the throughput result? #82


@HsChen-sys

I'm using the model relaxml/Llama-2-7b-E8PRVQ-4Bit.
On an A6000, I only get ~82 toks/s, which doesn't match the 95 toks/s reported in the paper.
On a 6000 Ada, I get ~109 toks/s, while the paper reports 140 toks/s.
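
To put the gap in relative terms, here is a small sketch that computes how far each measurement falls below the paper's figure. The helper name `shortfall_pct` is only illustrative, and the measured values are the approximate numbers quoted above, so the percentages are rough.

```python
def shortfall_pct(measured: float, reported: float) -> float:
    """Percentage by which the measured throughput falls below the reported one."""
    return (reported - measured) / reported * 100.0


# (measured toks/s, paper toks/s); measured values are approximate
results = {
    "A6000": (82.0, 95.0),
    "6000 Ada": (109.0, 140.0),
}

for gpu, (measured, reported) in results.items():
    print(f"{gpu}: {shortfall_pct(measured, reported):.1f}% below the paper")
```

That is roughly 14% below the paper on the A6000 and roughly 22% below on the 6000 Ada.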

Command: python eval/eval_speed.py --hf_path relaxml/Llama-2-7b-E8PRVQ-4Bit

I also tried python interactive_gen.py --hf_path relaxml/Llama-2-7b-chat-E8PRVQ-4Bit, but the throughput was unexpectedly slow, at only 5.77 toks/s.

The third-party repository QuIP-for-all contains some bugs, so I couldn't run it.
