<!-- # Language Imbalance Driven Rewarding -->
# Language Imbalance Driven Rewarding for Multilingual Self-improving
<div align="center"> <br> <a>Wen Yang</a><sup><span>1,2*</span></sup>, <a href="https://scholar.google.com/citations?user=Ci4l4yQAAAAJ&hl=zh-CN">Junhong Wu</a><sup><span>1,2*</span></sup>, <a href="https://scholar.google.com/citations?user=FgrrqlAAAAAJ&hl=zh-CN">Chen Wang</a><sup><span>1,2</span></sup>, <a href="https://scholar.google.com/citations?user=l8lvKOQAAAAJ&hl=zh-CN">Chengqing Zong</a><sup><span>1,2</span></sup>, <a href="https://scholar.google.com/citations?user=93zngeYAAAAJ&hl=zh-CN">Jiajun Zhang</a><sup><span>1,2,3,4†</span></sup>, <br><sup>*</sup> Equal contribution&nbsp;&nbsp;<sup>†</sup> Corresponding author
<sup>1</sup> School of Artificial Intelligence, University of Chinese Academy of Sciences<br> <sup>2</sup> Institute of Automation, Chinese Academy of Sciences<br> <sup>3</sup> Wuhan AI Research <sup>4</sup> Shanghai Artificial Intelligence Laboratory, Shanghai, China<br>
<a href='https://arxiv.org/pdf/2410.08964'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
</div> <p align="center"> <img src="assets/outline.png" width="100%" height="100%"> </p> <font size=5><div align='center'> [<a href="https://arxiv.org/pdf/2410.08964">arXiv Paper</a>] </div></font>
## Overview
We introduce Language Imbalance Driven Rewarding, a novel approach that leverages the inherent capability imbalance across different languages in large language models (LLMs) as a reward signal for iterative self-improvement. By applying iterative DPO training, our approach not only enhances the performance of non-dominant languages but also improves outcomes in dominant languages.
Our goal with this approach is to contribute a new perspective to the multilingual LLM community by challenging the assumption that language imbalance is solely a challenge to be mitigated. We hope this approach will inspire further exploration into multilingual self-improvement in LLMs, broadening the horizon for more balanced and capable language models.
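To make the reward signal concrete, here is a minimal sketch of how one preference pair could be constructed under this idea. `model.generate` and `translate` are hypothetical helpers standing in for the batch inference and translation scripts in this repo, and the language pair is only an example:

```python
# Minimal sketch (not the repo's actual code): `model.generate` and `translate`
# are hypothetical stand-ins for scripts/batch_inference.sh and scripts/batch_translate.sh.
def build_preference_pair(model, translate, prompt_en, prompt_xx, lang="de"):
    """Build one DPO pair from the language-imbalance reward signal."""
    # Response generated directly in the non-dominant language -> treated as "rejected".
    rejected = model.generate(prompt_xx)

    # Response generated in the dominant language (e.g. English), then translated
    # into the non-dominant language -> treated as "chosen".
    response_en = model.generate(prompt_en)
    chosen = translate(response_en, src="en", tgt=lang)

    return {"prompt": prompt_xx, "chosen": chosen, "rejected": rejected}
```

Running DPO on pairs built this way, and repeating the process with the updated model, is what drives the iterative self-improvement loop.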
## Update
- [28/10/2024] We release the code for Language Imbalance Driven Rewarding!
- [11/10/2024] We release the paper for Language Imbalance Driven Rewarding!
## Contents

- [Setup](#setup)
- [Preparation](#preparation)
- [Train](#train)
- [Evaluation](#evaluation)
- [Experiments](#experiments)
## Setup
Please follow the instructions below to install the required packages.
- Clone this repository

```bash
git clone https://github.com/ZNLP/Language-Imbalance-Driven-Rewarding.git
```

- Install packages

```bash
conda create -n mdpo python=3.10 -y
conda activate mdpo
cd Language-Imbalance-Driven-Rewarding
pip install -r requirements.txt
```
## Preparation
Run batch inference to generate responses, then translate them across languages to construct the preference data:

```bash
bash ./scripts/batch_inference.sh
bash ./scripts/batch_translate.sh
```
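As a rough illustration of what the batch inference step does, the sketch below generates responses for a set of prompts with vLLM. The model name, file paths, and sampling settings are placeholders, not the configuration used in `scripts/batch_inference.sh`:

```python
# Rough sketch of batched response generation with vLLM.
# Model name, paths, and sampling parameters are placeholders.
import json
from vllm import LLM, SamplingParams

# One JSON object with a "prompt" field per line (placeholder path).
prompts = [json.loads(line)["prompt"] for line in open("data/prompts.jsonl")]

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)

# For instruct/chat models you would normally apply the chat template to each prompt first.
outputs = llm.generate(prompts, params)

with open("data/responses.jsonl", "w") as f:
    for prompt, out in zip(prompts, outputs):
        f.write(json.dumps({"prompt": prompt, "response": out.outputs[0].text}, ensure_ascii=False) + "\n")
```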
## Train
Our training is mostly performed with the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) codebase. Please refer to that repo for more details.
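Because training happens in LLaMA-Factory, the preference pairs from the preparation step have to be written in a dataset format its DPO trainer accepts. The sketch below uses one commonly documented sharegpt-style preference layout; treat the exact schema (and the training command) as an assumption and check the LLaMA-Factory documentation for the version you use:

```python
# Sketch: dump preference pairs into a sharegpt-style DPO dataset file.
# The exact field names expected by your LLaMA-Factory version may differ;
# check its data documentation before training.
import json

# [{"prompt": ..., "chosen": ..., "rejected": ...}, ...] -- placeholder path.
pairs = json.load(open("data/preference_pairs.json"))

records = [
    {
        "conversations": [{"from": "human", "value": p["prompt"]}],
        "chosen": {"from": "gpt", "value": p["chosen"]},
        "rejected": {"from": "gpt", "value": p["rejected"]},
    }
    for p in pairs
]

with open("data/mdpo_pairs.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Training itself is then launched from LLaMA-Factory with a DPO config;
# see that repo's examples for the exact command for your version.
```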
## Evaluation
```bash
bash scripts/batch_inference_for_eval.sh
```
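Head-to-head comparisons like those in the Experiments section are typically scored with a judge model; the repo's GPT-4 score code is still to be released (see Schedule). The sketch below only illustrates the general pattern of pairwise judging and win-rate computation with the OpenAI client; the judge prompt, verdict parsing, and model name are simplified placeholders, not this repo's evaluation code:

```python
# Illustrative pairwise-judging sketch (not this repo's GPT-4 score code).
# Judge prompt, verdict parsing, and model name are simplified placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Which answer is better? Reply with exactly 'A', 'B', or 'Tie'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def win_rate(examples):
    """Fraction of head-to-head comparisons won by model A (ties count as half)."""
    verdicts = [judge(e["question"], e["answer_a"], e["answer_b"]) for e in examples]
    wins = sum(v.startswith("A") for v in verdicts)
    ties = sum(v.startswith("Tie") for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)
```

In practice, each pair is usually judged in both orders (A/B swapped) to reduce position bias.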
## Experiments
We provide some results in this section. More detailed results can be found in our paper.
### General Instruction Following

- Head-to-head performance
- X-alpacaEval
### Arithmetic Reasoning

- Performance on the MGSM benchmark
## Schedule

- [x] Release training & evaluation code
- [ ] Release GPT-4 Score code
## Citation
If you find this repo useful for your research, please consider citing our paper:
```bibtex
@article{yang2024language,
  title={Language Imbalance Driven Rewarding for Multilingual Self-improving},
  author={Yang, Wen and Wu, Junhong and Wang, Chen and Zong, Chengqing and Zhang, Jiajun},
  journal={arXiv preprint arXiv:2410.08964},
  year={2024}
}
```
## Acknowledgement
We would like to thank the following projects for their great work: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [vLLM](https://github.com/vllm-project/vllm), [transformers](https://github.com/huggingface/transformers), LLaMA, and Qwen2.
## License
This project is released under the Apache 2.0 license. Parts of this project contain code and models from other sources, which are subject to their respective licenses.