This project focuses on fine-tuning large language models (LLMs) for Chinese-to-English translation in the education domain. We compare two state-of-the-art multilingual models, mBART and M2M100, using fine-tuning techniques such as LoRA (Low-Rank Adaptation) and layer freezing. The goal is to identify the most effective and resource-efficient approach for producing high-quality translations tailored to this domain.

We provide two main notebooks: one for the education domain corpus and one for the science domain corpus.
- **Load and Preprocess**
  - Import required libraries, load data, initialize models, and define evaluation metrics.
- **Baseline Evaluation**
  - Evaluate the original mBART and M2M100 models on a 50,000-sample dataset to establish baseline BLEU scores.
- **Hyperparameter Tuning**
  - Perform grid search on a smaller 10,000-sample dataset to identify optimal configurations for each fine-tuning technique.
- **Fine-Tune mBART**
  - Fine-tune the mBART model using the best hyperparameters on the full 50,000-sample dataset.
- **Fine-Tune M2M100**
  - Fine-tune the M2M100 model using the best hyperparameters on the full 50,000-sample dataset.
- **Evaluation and Analysis**
  - Assess model performance using BLEU scores, analyze errors (Word-Level, Structural, Other), and compare GPU resource usage.
- **Extension: Quantization and Small Models**
  - Evaluate a pre-quantized LLaMA model and a compact Chinese-to-English MT model (e.g., Marian MT) for low-resource scenarios.
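The LoRA technique used in the fine-tuning steps above can be illustrated on its own: the pretrained weight is frozen, and only a small low-rank update is trained alongside it. Below is a minimal PyTorch-only sketch of this idea; the class name, dimensions, and hyperparameters are illustrative (the notebooks apply LoRA to the actual mBART/M2M100 weights, typically via the `peft` library), not the project's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen base weight plus a trainable low-rank update.

    The forward pass computes base(x) + scaling * (B @ A) x, where A and B
    together hold far fewer parameters than the base weight matrix.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight
            p.requires_grad = False
        # B starts at zero, so the layer initially behaves exactly like base
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # prints "trainable: 16384 / 1065984"
```

With rank 8, only about 1.5% of the layer's parameters receive gradients, which is what makes LoRA fine-tuning of large models memory-efficient. Layer freezing follows the same `requires_grad = False` mechanism, applied to whole blocks of the model instead.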
- **Set Up Environment**
  - Ensure required libraries are installed (see the Dependencies section).
  - Recommendation: upload the notebooks to Google Colab and use an A100 GPU for optimal performance.
- **Download Data**
  - Place the dataset in the working directory or Colab content folder, and update the paths to the `.txt` files accordingly.
  - Provided datasets: `Bi-Education.txt` for the education domain and `Bi-Science.txt` for the science domain.
  - Feel free to experiment with other domain-specific corpora!
- **Run Notebook**
  - Execute the notebook cells sequentially.
- **View Results**
  - Analyze BLEU scores, error types, GPU usage, and model performance comparisons.
The following libraries are required:
- `datasets`
- `optimum`
- `auto-gptq`
- `sentencepiece`
- `bitsandbytes`
- `sacremoses`
- `sacrebleu`
- `transformers`
- `peft`
- `nltk`
- `tqdm`
- `pandas`
- `torch`
To install all dependencies in your Colab environment, run:
```shell
!pip install datasets optimum auto-gptq sentencepiece bitsandbytes sacremoses sacrebleu transformers peft nltk tqdm pandas torch
```

The notebook includes detailed results with tables, figures, and analysis. Key findings include:
- BLEU scores for all models and fine-tuning methods.
- Error analysis to identify common translation challenges.
- Resource usage comparison (GPU efficiency across methods).
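For reference, the BLEU score behind these comparisons is a geometric mean of modified n-gram precisions multiplied by a brevity penalty. The toy single-reference version below only shows the mechanics; the notebooks use `sacrebleu`, which additionally handles tokenization, multiple references, and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU on whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # clip each n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(toy_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```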
The project concludes with recommendations for achieving the best balance between accuracy and efficiency for Chinese-to-English translation tasks.