Language Models as Science Tutors

This is the official repository for Language Models as Science Tutors.

TutorEval

<br> <p align="center"> <img src="assets/main_radar_fig.png" width="800"> </p> <br>

πŸŽ“ About

TutorEval is a question-answering benchmark that evaluates how well a language model (the LM tutor) can help a user understand a chapter from a science textbook. TutorEval contains over 800 questions written by 17 expert researchers, covering math, computer science, physics, the life sciences, and environmental science. The questions relate to chapters from TutorChat (downloaded from libretexts.org) and require the model to answer free-form questions written from the point of view of a student. TutorEval questions are highly diverse: they may ask for an explanation of complicated content, for additional information going beyond the chapter, for verification of an exercise solution, and so on. Download the TutorEval data from HuggingFace at princeton-nlp/TutorEval.
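For a first look at the data, the benchmark can be loaded with the `datasets` library. A minimal sketch (the split and column names are assumptions; check the dataset card for the actual schema):

```python
# Minimal sketch: load TutorEval from the HuggingFace Hub.
# The split and column names are assumptions -- inspect the dataset
# card at princeton-nlp/TutorEval for the actual schema.
from datasets import load_dataset

tutoreval = load_dataset("princeton-nlp/TutorEval")
print(tutoreval)  # shows the available splits and columns
```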

TutorEval uses an LM as an evaluator. Once the LM tutor has generated responses to TutorEval questions, the evaluator is prompted to compare the tutor's outputs with a set of ground-truth key points. These key points were written by the human experts who created TutorEval, and sketch the most important points that the tutor should cover when answering the student.
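The official evaluator prompts are provided in ./tutoreval. Purely as an illustration of the setup, a hypothetical judge prompt could be assembled as follows; the wording below is not the official prompt.

```python
# Hypothetical sketch of the LM-judge input, not the official prompt:
# the evaluator sees the student question, the expert-written key
# points, and the tutor's answer, and grades key-point coverage.
def build_judge_prompt(question: str, key_points: list[str], tutor_answer: str) -> str:
    points = "\n".join(f"- {p}" for p in key_points)
    return (
        "You are grading an LM tutor's answer to a student question.\n\n"
        f"Question:\n{question}\n\n"
        f"Key points a good answer should cover:\n{points}\n\n"
        f"Tutor's answer:\n{tutor_answer}\n\n"
        "Grade how well the answer covers the key points."
    )
```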

πŸ“– OpenBook and πŸ“• ClosedBook

TutorEval questions are very diverse and rely on the textbook chapter in different ways. Some questions explicitly refer to the chapter (open-book), and some are phrased in such a way that they can be understood without reading the chapter (closed-book). This means that TutorEval contains two evaluations in one:

- TutorEval: the LM tutor is given the full textbook chapter along with the question (open-book).
- TutorEval-ClosedBook: the closed-book questions are asked without providing the chapter.
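As a sketch, the two settings could be separated with a filter like the one below; the boolean column name `closed_book` and the split name are assumptions about the dataset schema, not documented fields.

```python
# Sketch: split TutorEval into open-book and closed-book subsets.
# The column name "closed_book" and the split name "train" are
# assumptions -- check the dataset card for the real schema.
from datasets import load_dataset

data = load_dataset("princeton-nlp/TutorEval", split="train")
closed_book = data.filter(lambda ex: ex["closed_book"])
open_book = data.filter(lambda ex: not ex["closed_book"])
```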

πŸ† Leaderboard

We rank the models based on the full TutorEval score, even though TutorEval-ClosedBook rankings sometimes differ.

| Model | TutorEval | ClosedBook |
|:---|---:|---:|
| GPT-4 | 85.2 | 86.1 |
| Llama-3-70B | 71.3 | 78.3 |
| GPT-3.5-Turbo | 68.3 | 69.6 |
| Phi-3-Medium-128K | 67.6 | 69.5 |
| Mixtral-8x7B | 66.3 | 68.2 |
| Phi-3-Mini-128K | 59.5 | 63.5 |
| Llemma-34B-MathMix | 56.8 | 55.3 |
| Mistral-7B-Instruct-V0.2 | 55.5 | 58.7 |
| Llama-3-8B | 55.3 | 59.1 |
| Mathstral-7B | 53.9 | 55.6 |
| Llemma-7B-32K-MathMix | 50.0 | 45.6 |
| Zephyr-7B-Beta | 45.7 | 49.4 |
| Vicuna-13B-V1.5-16K | 32.9 | 36.8 |
| Mistral-7B-Instruct-V0.1 | 30.5 | 35.5 |
| Gemma-7B-IT | 24.0 | 39.5 |

πŸ§‘β€πŸ’» Evaluating on TutorEval

To evaluate your own model on TutorEval, please use the scripts provided in ./tutoreval.

See ./tutoreval/README.md for detailed instructions.

The file ./tutoreval/human_gpt_grades.csv contains the human grades alongside the GPT-4-1106 grades assigned to four models on each TutorEval question. The human grades can be used to calibrate other LLM judges: human-LLM correlation can be measured on this dataset as in Appendix C.2, Table 9 of the paper. Note that the TutorEval questions in human_gpt_grades.csv may differ slightly from the official set of TutorEval questions, since some grammatical typos were corrected after the human grading was completed.
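As a sketch of that calibration step, assuming the CSV holds one human grade and one GPT-4 grade per row under the column names used below (the actual header may differ):

```python
# Sketch: measure human-LLM judge agreement from human_gpt_grades.csv,
# in the spirit of Appendix C.2, Table 9. The column names are
# assumptions about the CSV schema -- adjust them to the real header.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

df = pd.read_csv("tutoreval/human_gpt_grades.csv")
r = pearsonr(df["human_grade"], df["gpt4_grade"]).statistic
rho = spearmanr(df["human_grade"], df["gpt4_grade"]).statistic
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```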

TutorChat

TutorChat is the first dialogue-tuning dataset for science. TutorChat consists of 80,000 synthetic teacher-student dialogues created with GPT-3.5 and GPT-4. Each conversation is grounded in a textbook chapter downloaded from libretexts.org and can take a variety of formats.

We provide TutorChat dialogues for all chapters contained in the TextbookChapters dataset below, which includes humanities and social sciences. 40% of TutorChat dialogues concern STEM subjects.

Download the TutorChat data from HuggingFace at princeton-nlp/TutorChat.

πŸ“š Textbook chapters

Download the processed textbook chapters from HuggingFace at princeton-nlp/TextbookChapters. This dataset was obtained by scraping libretexts.org and processing the cleaned HTML files with the HTML-to-LaTeX parser from OpenWebMath.

βš™οΈ TutorChat processing

./tokenization/tokenize_tutorchat.py tokenizes TutorChat and creates training labels according to the recipe used to train Llemma-7B-32K-MathMix. Use the flag --stem_only to tokenize only the STEM split of TutorChat.
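For example, the following sketch tokenizes the STEM split only; any further arguments the script may require (e.g. tokenizer or output paths) are not documented here, so consult its --help.

```python
# Sketch: run the TutorChat tokenization for the STEM split only.
# --stem_only is documented above; any other arguments the script may
# require (tokenizer, output paths) are assumptions -- see its --help.
import subprocess

subprocess.run(
    ["python", "tokenization/tokenize_tutorchat.py", "--stem_only"],
    check=True,
)
```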

πŸ”’ MathMix

MathMix is a fine-tuning dataset composed of the STEM split of TutorChat and a processed version of MetaMath. In ./tokenization, we provide some scripts to re-create and tokenize MathMix.

./tokenization/tokenize_metamath.py tokenizes MetaMath by randomly concatenating question/answer pairs to form longer samples. Use the flag --num_concat to set the number of samples to concatenate. MathMix concatenates 10 samples at a time.

./mathmix_combine.py concatenates and shuffles the tokenized TutorChat and MetaMath datasets to create MathMix. Use the flags --tutorchat and --metamath to set the paths to your tokenized datasets created with ./tokenization/tokenize_tutorchat.py and ./tokenization/tokenize_metamath.py.
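Putting these steps together, here is a sketch of the full MathMix recipe. The flags are the ones documented above; the path values are placeholders for your own output locations.

```python
# Sketch: recreate MathMix end-to-end (after tokenizing TutorChat as
# in the previous section). --num_concat, --tutorchat and --metamath
# are documented above; the path values are placeholders -- point them
# at the outputs of the two tokenization scripts.
import subprocess

# Tokenize MetaMath, concatenating 10 QA pairs per sample as in MathMix.
subprocess.run(
    ["python", "tokenization/tokenize_metamath.py", "--num_concat", "10"],
    check=True,
)

# Combine and shuffle the tokenized TutorChat and MetaMath datasets.
subprocess.run(
    [
        "python", "mathmix_combine.py",
        "--tutorchat", "path/to/tokenized_tutorchat",
        "--metamath", "path/to/tokenized_metamath",
    ],
    check=True,
)
```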

Models

Download our models from HuggingFace at princeton-nlp/Llemma-7B-32K-MathMix and princeton-nlp/Llemma-34B-MathMix.
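Both checkpoints are Llama-style causal LMs, so they should load with the standard transformers Auto classes, as sketched below. The generation settings are illustrative, and the prompt format used for tutoring is not shown here; see ./tutoreval for the evaluation prompts.

```python
# Sketch: load the 7B tutor model and generate a short answer.
# Generation settings are illustrative, not the evaluated config.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "princeton-nlp/Llemma-7B-32K-MathMix"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Explain why the sky is blue.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```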

Citation

```bibtex
@misc{chevalier2024language,
      title={Language Models as Science Tutors},
      author={Alexis Chevalier and Jiayi Geng and Alexander Wettig and Howard Chen and Sebastian Mizera and Toni Annala and Max Jameson Aragon and Arturo RodrΓ­guez Fanlo and Simon Frieder and Simon Machado and Akshara Prabhakar and Ellie Thieu and Jiachen T. Wang and Zirui Wang and Xindi Wu and Mengzhou Xia and Wenhan Jia and Jiatong Yu and Jun-Jie Zhu and Zhiyong Jason Ren and Sanjeev Arora and Danqi Chen},
      year={2024},
      eprint={2402.11111},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```