This repo is a minimal implementation of the Llama2-7B model, with only two dependencies (torch and sentencepiece) for a straightforward setup.
Beyond minimal-llama, this repo adds:
- Refined decoding
- Correct LoRA fine-tuning
- KV cache (see the sketch after this list)
- Multi-turn conversation
- Beam search (in progress)
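
For context, here is a minimal sketch of what a KV cache does during autoregressive decoding. The class and tensor shapes are illustrative, not this repo's actual API:

```python
import torch

class KVCache:
    """Illustrative key/value cache for one attention layer (not this repo's API)."""

    def __init__(self, max_batch, max_seq, n_heads, head_dim):
        self.k = torch.zeros(max_batch, max_seq, n_heads, head_dim)
        self.v = torch.zeros(max_batch, max_seq, n_heads, head_dim)

    def update(self, start_pos, k_new, v_new):
        # Write the keys/values for the newly computed positions...
        bsz, seqlen = k_new.shape[:2]
        self.k[:bsz, start_pos:start_pos + seqlen] = k_new
        self.v[:bsz, start_pos:start_pos + seqlen] = v_new
        # ...and return everything cached so far, so attention for the new
        # token can reuse all previous positions without recomputing them.
        return self.k[:bsz, :start_pos + seqlen], self.v[:bsz, :start_pos + seqlen]
```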
- Set up the environment:

```bash
conda create --name llama python=3.10
conda activate llama
git clone https://github.com/YUECHE77/LLaMA2.git
cd LLaMA2
pip install torch sentencepiece
```
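
Optionally, a quick sanity check that the two dependencies import (a minimal snippet, nothing repo-specific):

```python
import torch
import sentencepiece

print(torch.__version__)
print(torch.cuda.is_available())   # True if your torch build can see a GPU
print(sentencepiece.__version__)
```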
- Download the model: get the Llama-2-7b-chat model and tokenizer from Hugging Face. You can use the base model instead if you prefer.
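
If you prefer the command line, one way to fetch the weights is with huggingface-cli (an assumption on my part, not a script from this repo; the meta-llama repos are gated, so you need an approved account and access token):

```bash
pip install huggingface_hub
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-chat --local-dir /path/to/Llama-2-7b-chat
```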
- Fine-tune: similar to minimal-llama, this repo uses the Alpaca dataset with only 200 samples for quick experimentation.
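
Each record follows the standard Alpaca instruction/input/output schema; the sample below is illustrative of what alpaca_data_200_samples.json contains:

```json
{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
```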
```bash
python finetune.py \
    --model-path /path/to/Llama-2-7b-chat \
    --data-path alpaca_data_200_samples.json \
    --save-path /path/to/save/lora_weights.pth \
    --lr 1e-5 \
    --accumulate-steps 8
```
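
For background on what the saved LoRA weights represent: a low-rank adapter trains a small update B·A alongside each frozen linear layer. A minimal sketch, not this repo's exact implementation:

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # the update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```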
- Inference: omit --lora-path if you haven't fine-tuned.
```bash
python inference.py \
    --model-path /path/to/Llama-2-7b-chat \
    --lora-path /path/to/lora_weights.pth \
    --max-len 128 \
    --sampling \
    --temperature 0.7 \
    --top-k 50 \
    --top-p 0.9
```
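
--temperature rescales the logits, --top-k keeps only the k most likely tokens, and --top-p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p. A minimal sketch of how these combine (illustrative, not the repo's exact code):

```python
import torch

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Sample one token id from a 1-D logits tensor (illustrative)."""
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens (torch.topk returns them sorted).
    k = min(top_k, logits.size(-1))
    topk_vals, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_vals, dim=-1)
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    cumulative = torch.cumsum(probs, dim=-1)
    keep = cumulative - probs < top_p   # always keeps at least the first token
    probs = probs * keep
    probs = probs / probs.sum()
    next_id = topk_idx[torch.multinomial(probs, num_samples=1)]
    return next_id.item()
```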
- Chat: type 'exit' to end the conversation. You can modify the maximum history length in ModelArgs (see model.py); the history gets truncated once it exceeds this value (see the sketch after the command).
```bash
python chat.py \
    --model-path /path/to/Llama-2-7b-chat \
    --lora-path /path/to/lora_weights.pth \
    --max-len 128 \
    --sampling \
    --temperature 0.7 \
    --top-k 50 \
    --top-p 0.9
```
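
As a rough picture of the history truncation mentioned above: a minimal sketch assuming a list of (user, assistant) turns and a sentencepiece tokenizer. The Llama-2 chat [INST] format is real; the helper and its names are hypothetical, not this repo's exact code:

```python
def build_prompt(history, max_history_len, tokenizer):
    """Concatenate (user, assistant) turns, dropping the oldest ones that overflow.

    `history` is a list of (user_msg, assistant_msg) pairs; `tokenizer` is a
    sentencepiece SentencePieceProcessor. Illustrative only.
    """
    turns = [f"[INST] {u} [/INST] {a}" for u, a in history]
    # Drop the oldest turns until the encoded prompt fits the history budget.
    while turns and len(tokenizer.encode(" ".join(turns))) > max_history_len:
        turns.pop(0)
    return " ".join(turns)
```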
