

<h1 align="center">Analysis360: Analyze LLMs in 360 degrees</h1> <div align="center"> <img src="./docs/imgs/logo-web250.png"><br><br> </div>
<p align="center"> <a href="https://github.com/LLM360/Analysis360/blob/dev/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="license"></a> </p> <p align="center"> HuggingFace Repositories 🤗 <a href="https://huggingface.co/LLM360/Amber">[Amber]</a> • 🤗 <a href="https://huggingface.co/LLM360/CrystalCoder">[CrystalCoder]</a> </p> <p align="center"> Metrics and charts&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 📈 <a href="https://wandb.ai/llm360/Amber"> [Amber]</a> • 📈 <a href="https://wandb.ai/llm360/CrystalCoder"> [CrystalCoder]</a> </p> <p align="center"> Publications &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 📃 <a href="https://www.llm360.ai/paper.pdf">LLM360 Paper</a> </p>

Welcome to Analysis360! <br/>

This repo contains all of the code that we used for model evaluation and analysis. It serves as the single source of truth for all evaluation metrics and provides in-depth analysis from many different angles. Use the links above for a quick look at the LLM360 project and its experiment data.

Our Approach

We run evaluations on a variety of benchmarks: conventional benchmarks such as MMLU, Hellaswag, and ARC; user-preference-aligned benchmarks such as MT-bench; long-context evaluations such as LongEval; and additional studies on safety benchmarks for truthfulness, toxicity, and bias. Moreover, we report results on model samples that we preselected from a suite of LLMs that were all trained on the same data, seen in exactly the same order, so that we can better observe and understand how our models develop and evolve over the training process. We also provide public access to all checkpoints, all code, and all wandb dashboards for detailed training and evaluation curves.

W&B Dashboards

Every model has one wandb project/dashboard, each project has multiple runs, and all projects share the same base structure. For example, the Amber project has the runs train, downstream_eval, and perplexity_eval. The train run collects training data such as loss and learning rate, while the others collect evaluation data. Additionally, we added a resources section to the Amber project to record resource-related information for anyone who's interested. To quickly find the metric you are looking for, use the search bar at the top and/or the filter at the top right.
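The same runs can also be read programmatically. The following is a minimal sketch, assuming the `wandb` package is installed and that the public project path is `llm360/Amber` (taken from the dashboard link above). The metric key passed in the usage comment is hypothetical; check the dashboard for the exact names that were logged.

```python
from typing import Any, Dict, Iterable, List, Optional


def fetch_run_rows(project_path: str, run_name: str,
                   metric_keys: Iterable[str]) -> List[Dict[str, Any]]:
    """Return every logged row containing `metric_keys` from the named run."""
    import wandb  # imported lazily so latest_value() works without wandb

    api = wandb.Api()
    for run in api.runs(project_path):
        if run.name == run_name:
            # scan_history iterates over all logged rows, not a sampled subset
            return list(run.scan_history(keys=list(metric_keys)))
    raise ValueError(f"no run named {run_name!r} in {project_path!r}")


def latest_value(rows: List[Dict[str, Any]], key: str) -> Optional[Any]:
    """Return the most recent non-missing value of `key`, or None."""
    for row in reversed(rows):
        if row.get(key) is not None:
            return row[key]
    return None


# Example usage (needs network access; "loss" is an assumed key name):
# rows = fetch_run_rows("llm360/Amber", "train", ["loss"])
# print(latest_value(rows, "loss"))
```

Keeping the network call in its own function means the small post-processing helper can be reused on rows exported from the dashboard by other means.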

List of Analysis and Metrics

Here's a full list of the analyses/metrics we have collected so far. For each model we have released so far (Amber and CrystalCoder), we link to the specific wandb report once the evaluation is done. Amber and CrystalCoder currently use their own evaluation scripts; we are working on consolidating them, and more details can be found in later sections. Please refer to the model cards (Amber, CrystalCoder) for any terms or technology you find unfamiliar. We will keep updating and expanding the list as our study proceeds, so please stay tuned for upcoming changes!

| Metric | Description | Reports |
| --- | --- | --- |
| mmlu | A test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more | 5 shot / 0 shot, 5 shot |
| race | A test to measure reading comprehension ability | 0 shot / 0 shot |
| arc_challenge | A set of grade-school science questions | 25 shot / 0 shot, 25 shot |
| boolq | A question answering dataset for yes/no questions containing 15942 examples | 0 shot / 0 shot |
| hellaswag | A test of commonsense inference | 10 shot / 0 shot, 10 shot |
| openbookqa | A question-answering dataset modeled after open book exams for assessing human understanding of a subject | 0 shot / 0 shot |
| piqa | A test to measure physical commonsense and reasoning | 0 shot / 0 shot |
| siqa | A test to measure commonsense reasoning about social interactions | 0 shot |
| winogrande | An adversarial and difficult Winograd benchmark at scale, for commonsense reasoning | 0 shot / 0 shot, 5 shot |
| crowspairs | A challenge set for evaluating language models (LMs) on their tendency to generate biased outputs | 0 shot |
| truthfulqa | A test to measure a model’s propensity to reproduce falsehoods commonly found online | 0 shot / 0 shot |
| pile | A test to measure a model's perplexity; we covered 18 of the 22 sub-datasets | perplexity |
| drop | A reading comprehension benchmark requiring discrete reasoning over paragraphs | 3 shot |
| mbpp | Around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers | pass 1, pass 10 |
| humaneval | A test to measure functional correctness for synthesizing programs from docstrings | pass 1, pass 10 |
| gsm8k | Diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems | 5 shot |
| copa | A test to assess progress in open-domain commonsense causal reasoning | 0 shot |
| toxigen | A test to measure a model's toxicity in text generation | toxigen |
| toxicity identification | A test to measure a model's capability to identify toxic text | toxicity identification |
| bold | A test to evaluate fairness in open-ended English language generation | bold |
| memorization and token orders analysis | An analysis to understand a model's memorization abilities | memorization |
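Two of the score types listed above, pass 1 / pass 10 and perplexity, have short standard definitions. The sketch below uses the commonly used unbiased pass@k estimator and the usual exponentiated mean negative log-likelihood. It illustrates the definitions only; the evaluation scripts in this repo may compute these values differently, and the function and parameter names are chosen here for illustration.

```python
import math
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c pass the tests.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least one
    of k samples drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # fewer than k failing samples, so any k-subset contains a pass
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

For example, `pass_at_k(n=10, c=5, k=1)` reduces to the plain fraction of passing samples, while larger `k` gives credit if any of the `k` samples passes.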

How to reproduce our results

Most of our evaluations are built on the core lm_eval module of lm-evaluation-harness. We reused the metrics that the harness supports and added our own to cover more. Please follow the instructions here to get started. For any metric that's not included in the harness folder, you should find a dedicated folder for it at the root level of the repo; follow the instructions there. Note that we are still consolidating and uploading code, so some gaps will be filled in future releases.
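As a rough illustration of a harness-based run, the sketch below calls `simple_evaluate` from lm-evaluation-harness directly. It assumes a recent upstream release (0.4 or later) in which the Hugging Face model type is named `hf`; the version used in this repo may differ, so treat the instructions in the harness folder as authoritative. The few-shot counts are ones that appear in the list above, and the helper names are made up for this example.

```python
from typing import Any, Dict

# Few-shot counts that appear in the metrics list above
FEWSHOT_BY_TASK = {"arc_challenge": 25, "hellaswag": 10, "mmlu": 5}


def build_eval_request(model_id: str, task: str) -> Dict[str, Any]:
    """Build keyword arguments for lm_eval's simple_evaluate."""
    return {
        "model": "hf",  # assumed model type name (lm-evaluation-harness >= 0.4)
        "model_args": f"pretrained={model_id}",
        "tasks": [task],
        "num_fewshot": FEWSHOT_BY_TASK[task],
    }


def run_eval(request: Dict[str, Any]) -> Dict[str, Any]:
    """Run one evaluation; downloads the model and dataset on first use."""
    from lm_eval import evaluator  # lazy import: heavy optional dependency

    return evaluator.simple_evaluate(**request)


# Example usage (requires lm-eval, a GPU is strongly recommended):
# results = run_eval(build_eval_request("LLM360/Amber", "arc_challenge"))
# print(results["results"])
```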


If you are interested in using our results in your work, you can cite the LLM360 overview paper.

title={LLM360: Towards Fully Transparent Open-Source LLMs},
author={Liu, Zhengzhong and Qiao, Aurick and Neiswanger, Willie and Wang, Hongyi and Tan, Bowen and Tao, Tianhua and Li, Junbo and Wang, Yuqi and Sun, Suqi and Pangarkar, Omkar and Fan, Richard and Gu, Yi and Miller, Victor and Zhuang, Yonghao and He, Guowei and Li, Haonan and Koto, Fajri and Tang, Liping and Ranjan, Nikhil and Shen, Zhiqiang and Ren, Xuguang and Iriondo, Roberto and Mu, Cun and Hu, Zhiting and Schulze, Mark and Nakov, Preslav and Baldwin, Tim and Xing, Eric},