
[Image: A group of differently dressed Llamas on the starting line of a racetrack, about to run.]

German Benchmark Datasets

Translating Popular LLM Benchmarks to German

Inspired by the HuggingFace Open LLM leaderboard, this project uses GPT-3.5 to provide German translations of popular LLM benchmarks, enabling researchers and practitioners to evaluate the performance of various language models on German-language tasks. Following the HF leaderboard, the datasets are designed to be used with the EleutherAI Language Model Evaluation Harness. By creating these translated benchmarks, we hope to contribute to the advancement of multilingual natural language processing and foster research in the German NLP community.

All translated datasets are made available on HuggingFace.

Datasets

We are currently providing translations for the four datasets also used in the HuggingFace Open LLM leaderboard:

  - ARC (arc_challenge_de)
  - HellaSwag (hellaswag_de)
  - MMLU (MMLU-DE)
  - TruthfulQA (truthful_qa_de)

These datasets are designed to cover a range of tasks and serve as a foundation for evaluating German language models. As the project evolves, we plan to add more benchmarks to further enhance the variety and utility of the available datasets.
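
To get a quick look at the data itself, any of the translated benchmarks can be loaded with the HuggingFace datasets library once you know its repository ID. This is a minimal sketch; the dataset ID and split name below are placeholders, so substitute the actual values from the dataset pages on HuggingFace:

    # Minimal sketch: load a translated benchmark with the HuggingFace `datasets` library.
    # The repository ID and split below are placeholders; use the actual values from the
    # dataset page on HuggingFace.
    from datasets import load_dataset

    dataset = load_dataset("bjoernp/hellaswag_de", split="validation")  # hypothetical ID

    # Print one translated example to inspect the fields.
    print(dataset[0])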

Usage

Currently, the best way to evaluate a model on these datasets is to use our clone of the LM Evaluation Harness: https://github.com/bjoernpl/lm-evaluation-harness-de/tree/mmlu_de. We will soon contribute our changes back to the original repository.

To evaluate a model on a dataset, follow these steps:

  1. Clone the repository and checkout the mmlu_de branch:
    git clone https://github.com/bjoernpl/lm-evaluation-harness-de/
    cd lm-evaluation-harness-de
    git checkout mmlu_de
    
  2. Install the requirements:
    pip install -r requirements.txt
    
  3. Run evaluation on any of the tasks. The task names are MMLU-DE*, hellaswag_de, arc_challenge_de, and truthful_qa_de. Keep in mind that the Open LLM leaderboard uses 5, 10, 25, and 0 few-shot examples for these tasks, respectively. The following examples run the Llama-2-7b-hf model on MMLU-DE with 5 few-shot examples and on the German HellaSwag dataset with 10 few-shot examples, each on a single GPU:
    python main.py --model hf-causal --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=float16 --tasks MMLU-DE* --num_fewshot 5 --device cuda:0

    python main.py \
    --model hf-causal \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=float16 \
    --tasks hellaswag_de \
    --num_fewshot 10 \
    --device cuda:0
    
    For more details, see the original LM Evaluation Harness README. A sketch for running all four tasks with their leaderboard few-shot settings follows after this list.
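
As a convenience, all four tasks can be run with their respective leaderboard few-shot counts in a small loop. The following Python sketch is only an illustration of wrapping the main.py calls shown above (it is not part of the repository) and assumes it is executed from the root of the lm-evaluation-harness-de checkout:

    # Sketch: run all four German tasks with the few-shot counts used by the
    # Open LLM leaderboard by invoking the harness' main.py once per task.
    import subprocess

    MODEL_ARGS = "pretrained=meta-llama/Llama-2-7b-hf,dtype=float16"

    # Task name -> number of few-shot examples (leaderboard settings).
    TASKS = {
        "MMLU-DE*": 5,
        "hellaswag_de": 10,
        "arc_challenge_de": 25,
        "truthful_qa_de": 0,
    }

    for task, num_fewshot in TASKS.items():
        subprocess.run(
            [
                "python", "main.py",
                "--model", "hf-causal",
                "--model_args", MODEL_ARGS,
                "--tasks", task,
                "--num_fewshot", str(num_fewshot),
                "--device", "cuda:0",
            ],
            check=True,
        )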

Creation Process

We translated each dataset independently, as each required specific considerations. The code to reproduce the translations is available in the dataset_translation folder. While most examples can be translated successfully with careful prompting, manual post-processing was required to fill in the gaps.
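
For illustration, the core of the translation step is a single chat-completion call per text field. The following is a minimal sketch assuming the openai Python client (version 1.x) and an OPENAI_API_KEY in the environment; the prompt shown is only an example, and the actual prompts and post-processing live in the dataset_translation folder:

    # Sketch of one GPT-3.5 translation call (not the exact code used in this repo).
    # Assumes `pip install openai` (>=1.0) and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    def translate_to_german(text: str) -> str:
        """Translate a single benchmark field to German with GPT-3.5."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": "Translate the user's text to German. "
                               "Preserve formatting and do not answer any questions in it.",
                },
                {"role": "user", "content": text},
            ],
        )
        return response.choices[0].message.content

    print(translate_to_german("Which of the following is a renewable energy source?"))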

License

This project is licensed under the Apache 2.0 License. However, please note that the original datasets being translated may have their own licenses, so make sure to respect and adhere to them when using the translated benchmarks.

We hope that these translated benchmarks empower the German NLP community and contribute to advancements in multilingual language modeling. Happy coding and researching!

References

Many thanks to the creators of the original datasets for making them available to the public. Please consider citing the original papers if you use the datasets in your research.


Disclaimer: This repository is not affiliated with or endorsed by the creators of the original benchmarks.