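ChatGPT: Open Source and Beyond

<p align="center"> Simplified Chinese | <a href="README_EN.md"> English </a></p>

The road to implementing and surpassing ChatGPT with open-source ChatGPT-like models

The LLaMA weights were accidentally leaked. Stanford's Alpaca then instruction-tuned LLaMA on data built from the GPT-3 API in a self-instruct manner and achieved impressive performance. Since then, the open-source community has grown increasingly hopeful about reaching ChatGPT-level large language models.

This repo records this process of replication and surpassing, giving the community an overview.

It covers related technical progress, base models, domain models, training, inference, techniques, data, multilinguality, multimodality, and more.

<details> <summary># Contents</summary> </details>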

Base Models

| contributor | model/project | license | language | main feature |
| --- | --- | --- | --- | --- |
| Meta | LLaMA/LLaMA2 |  | multi | LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is competitive with PaLM-540B.<br />Base model for most follow-up works. |
| HuggingFace-BigScience | BLOOM |  | multi | An autoregressive Large Language Model (LLM) trained by HuggingFace BigScience. |
| HuggingFace-BigScience | BLOOMZ |  | multi | Instruction-finetuned versions of the BLOOM & mT5 pretrained multilingual language models on a crosslingual task mixture. |
| EleutherAI | GPT-J |  | en | Transformer model trained using Ben Wang's Mesh Transformer JAX. |
| Meta | OPT |  | en | Open Pre-trained Transformer Language Models. The aim of this suite of OPT models is to enable reproducible<br /> and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. |
| Cerebras Systems | Cerebras-GPT |  | en | Pretrained GPT-3-like LLM, commercially available, efficiently trained on the Andromeda AI supercomputer,<br />trained in accordance with Chinchilla scaling laws (20 tokens per model parameter), which is compute-optimal. |
| EleutherAI | pythia |  | en | Combines interpretability analysis and scaling laws to understand how knowledge develops<br />and evolves during training in autoregressive transformers. |
| Stability-AI | StableLM |  | en | Stability AI Language Models. |
| FDU | MOSS |  | en/zh | An open-source tool-augmented conversational language model from Fudan University. |
| ssymmetry & FDU | BBT-2 |  | zh | 12B open-source LM. |
| @mlfoundations | OpenFlamingo |  | en | An open-source framework for training large multimodal models. |
| EleutherAI | GPT-NeoX-20B |  | en | Its architecture intentionally resembles that of GPT-3, and is almost identical to that of GPT-J-6B. |
| UCB | OpenLLaMA | Apache-2.0 | en | An Open Reproduction of LLaMA. |
| MosaicML | MPT | Apache-2.0 | en | MPT-7B is a GPT-style model, and the first in the MosaicML Foundation Series of models.<br /> Trained on 1T tokens of a MosaicML-curated dataset, MPT-7B is open-source,<br /> commercially usable, and equivalent to LLaMA 7B on evaluation metrics. |
| TogetherComputer | RedPajama-INCITE-Base-3B-v1 | Apache-2.0 | en | A 2.8B parameter pretrained language model, pretrained on RedPajama-Data-1T,<br /> together with an Instruction-tuned Version and a Chat Version. |
| Lightning-AI | Lit-LLaMA | Apache-2.0 | - | Independent implementation of LLaMA that is fully open source under the Apache 2.0 license. |
| @conceptofmind | PaLM | MIT License | en | An open-source implementation of Google PaLM models. |
| TII | Falcon-7B | TII Falcon LLM License | en | A 7B parameter causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. |
| TII | Falcon-40B | TII Falcon LLM License | multi | A 40B parameter causal decoder-only model built by TII and trained on 1,000B tokens of RefinedWeb enhanced with curated corpora. |
| TigerResearch | TigerBot | Apache-2.0 | en/zh | A multi-language and multitask LLM. |
| BAAI | Aquila / Aquila2 | BAAI_Aquila_Model_License | en/zh | The Aquila language model inherits the architectural design advantages of GPT-3 and LLaMA, replaces a batch of underlying operator implementations with more efficient ones,<br /> and redesigns the tokenizer for Chinese-English bilingual support. |
| OpenBMB | CPM-Bee | General Model License Agreement - Source Attribution - Publicity Restrictions - Commercial Authorization | en/zh | CPM-Bee is a fully open-source, commercially usable Chinese-English bilingual base model with ten billion parameters.<br />It has been pre-trained on an extensive corpus of trillion-scale tokens. |
| Baichuan | baichuan-7B | Apache-2.0 | en/zh | It achieves the best performance among models of the same size on standard<br /> authoritative Chinese and English benchmarks (C-EVAL, MMLU, etc.). |
| Tencent | lyraChatGLM | MIT License | en/zh | To the best of our knowledge, it is the first accelerated version of ChatGLM-6B.<br />The inference speed of lyraChatGLM has achieved 300x acceleration over the early original version.<br /> We are still working hard to further improve the performance. |
| SalesForce | XGen | Apache-2.0 | multi | Salesforce open-source LLMs with 8k sequence length. |
| Shanghai AI Lab | InternLM | Apache-2.0 | en/zh | InternLM has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:<br />It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.<br />It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.<br />It provides a versatile toolset for users to flexibly build their own workflows. |
| xverse-ai | XVERSE | Apache-2.0 | multi | Multilingual LLMs developed by XVERSE Technology Inc. |
| Writer | palmyra | Apache-2.0 | en | Extremely powerful while being extremely fast. This model excels at many nuanced tasks<br /> such as sentiment classification and summarization. |
| Mistral AI | Mistral | Apache-2.0 | en | Mistral 7B is a 7.3B parameter model that:<br />1. Outperforms Llama 2 13B on all benchmarks<br />2. Outperforms Llama 1 34B on many benchmarks<br />3. Approaches CodeLlama 7B performance on code, while remaining good at English tasks<br />4. Uses Grouped-query attention (GQA) for faster inference<br />5. Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost |
| SkyworkAI | Skywork | - | en/zh | On major evaluation benchmarks, Skywork-13B is at the forefront of Chinese open-source models and is the best at its parameter scale;<br /> it can be used commercially without application, and a 600G (150 billion tokens) Chinese dataset has also been open-sourced. |
| 01.AI | Yi | - | en/zh | The Yi series models are large language models trained from scratch by developers at 01.AI. |
| IEIT Systems | Yuan-2.0 | - | en/zh | In this work, Localized Filtering-based Attention (LFA) is introduced to incorporate prior knowledge of local dependencies of natural language into Attention.<br /> Based on LFA, we develop and release Yuan 2.0, a large language model with parameters ranging from 2.1 billion to 102.6 billion. A data filtering and generation method<br /> is presented to build high-quality pretraining and fine-tuning datasets. A distributed training method with non-uniform pipeline parallel, data parallel, and optimizer parallel is proposed,<br /> which greatly reduces the bandwidth requirements of intra-node communication, and achieves good performance in large-scale distributed training.<br /> Yuan 2.0 models display impressive ability in code generation, math problem-solving, and chat compared with existing models. |
| Nanbeige | Nanbeige | Apache-2.0 | en/zh | Nanbeige-16B is a 16 billion parameter language model developed by Nanbeige LLM Lab. It uses 2.5T tokens for pre-training. The training data includes a large amount of high-quality internet corpus, various books, code, etc. It has achieved good results on various authoritative evaluation datasets. This release includes the Base, Chat, Base-32k and Chat-32k models. |
| deepseek-ai | deepseek-LLM | MIT License | en/zh | An advanced language model comprising 67 billion parameters, trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. |
| LLM360 | LLM360 | - | - | Most open-source LLM releases include model weights and evaluation results. However, additional information is often needed to genuinely understand a model's behavior, and this information is not typically available to most researchers. Hence, we commit to releasing all of the intermediate checkpoints (up to 360!) collected during training, all of the training data (and its mapping to checkpoints), all collected metrics (e.g., loss, gradient norm, evaluation results), and all source code for preprocessing data and model training. These additional artifacts can help researchers and practitioners take a deeper look into the LLM construction process and conduct research such as analyzing model dynamics. We hope that LLM360 can help make advanced LLMs more transparent, foster research in smaller-scale labs, and improve reproducibility in AI research. |
| FDU, etc. | CT-LLM | - | zh/en | A model focusing on the Chinese language. Trained from scratch, CT-LLM primarily uses Chinese data from a 1,200 billion token corpus, including 800 billion Chinese, 300 billion English, and 100 billion code tokens. By open-sourcing CT-LLM's training process, including data processing and the Massive Appropriate Pretraining Chinese Corpus (MAP-CC), and introducing the Chinese Hard Case Benchmark (CHC-Bench), we encourage further research and innovation, aiming for more inclusive and adaptable language models. |
| TigerLab | MAP-NEO | - | zh/en | The first large model whose entire pipeline is open-sourced, from data processing and the model training process to the model weights. |
| DataComp | DCLM | - | - | Provides tools and guidelines for processing raw data, tokenization, data shuffling, model training, and performance evaluation. The baseline 7B model performs strongly. |
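The Cerebras-GPT row cites the Chinchilla rule of thumb of about 20 training tokens per model parameter. The sketch below shows that arithmetic: it computes a compute-optimal token budget and a rough training cost. The `6 * N * D` FLOPs estimate is a common approximation from the scaling-law literature, not a figure from this list, and all function names are hypothetical.

```python
"""Back-of-the-envelope Chinchilla-style budget (illustrative sketch)."""

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted in the Cerebras-GPT row


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params


def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation (an assumption here): about 6 FLOPs per parameter
    per training token, covering the forward and backward pass."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """Ratio for comparing an actual training run with the 20:1 heuristic."""
    return n_tokens / n_params


if __name__ == "__main__":
    n = 13e9  # a 13B-parameter model
    d = compute_optimal_tokens(n)
    print(f"tokens: {d:.3g}, FLOPs: {approx_training_flops(n, d):.3g}")
    # Falcon-7B row: 1,500B tokens for a 7B-parameter model
    print(f"Falcon-7B tokens/param: {tokens_per_param(1.5e12, 7e9):.0f}")
```

For the Falcon-7B row this gives roughly 214 tokens per parameter. That is well beyond the 20:1 compute-optimal ratio, which is typical of models trained longer to be cheaper at inference time.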

Domain Models

| contributor | model | domain | language | base model | main feature |
| --- | --- | --- | --- | --- | --- |
| UT Southwestern/<br />UIUC/OSU/HDU | ChatDoctor | medical | en | LLaMA | Maybe the first domain-specific chat model tuned on LLaMA. |
| Cambridge | Visual Med-Alpaca | biomedical | en | LLaMA-7B | A multi-modal foundation model designed specifically for the biomedical domain. |
| HIT | BenTsao / ChatGLM-Med | medical | zh | LLaMA/ChatGLM | Fine-tuned with a Chinese medical knowledge dataset generated using the GPT-3.5 API. |
| ShanghaiTech, etc. | DoctorGLM | medical | en/zh | ChatGLM-6B | Chinese medical consultation model fine-tuned on ChatGLM-6B. |
| THU AIR | BioMedGPT-1.6B | biomedical | en/zh | - | A pre-trained multi-modal molecular foundation model with 1.6B parameters that associates 2D molecular graphs with texts. |
| @LiuHC0428 | LawGPT_zh | legal | zh | ChatGLM-6B | A general model in the Chinese legal domain, trained on data generated via Reliable-Self-Instruction. |
| SJTU | MedicalGPT-zh | medical | zh | ChatGLM-6B | A general model in the Chinese medical domain, trained on diverse data generated via self-instruct. |
| SJTU | PMC-LLaMA | medical | zh | LLaMA | Continues training LLaMA on medical papers. |
| HuggingFace | StarCoder | code generation | en | - | A language model (LM) trained on source code and natural language text. Its training data incorporates more than<br /> 80 different programming languages as well as text extracted from GitHub issues and commits and from notebooks. |
| @CogStack | NHS-LLM | medical | en | not clear | A conversational model for healthcare trained using OpenGPT. |
| @pengxiao-song | LaWGPT | legal | zh | LLaMA/ChatGLM | Expands the vocabulary with Chinese legal terminology and is instruction fine-tuned on data generated using self-instruct. |
| Duxiaoman | XuanYuan | finance | zh | BLOOM-176B | A large Chinese financial chat model with hundreds of billions of parameters. |
| CUHK | HuatuoGPT | medical | zh | not clear | HuatuoGPT is a large language model (LLM) trained on a vast Chinese medical corpus. Our objective with HuatuoGPT is<br /> to construct a more professional "ChatGPT" for medical consultation scenarios. |
| PKU | Lawyer LLaMA | legal | zh | LLaMA | Continues pretraining on Chinese legal data and is instruction-tuned on legal exams and legal consulting QA pairs. |
| THU | LexiLaw | legal | zh | ChatGLM-6B | Trained on a mixture of general data (BELLE 1.5M) and legal data. |
| THU, etc. | taoli | education | zh | LLaMA | A large model for international Chinese education. It extends the base model's vocabulary with domain-specific terms,<br /> and uses the domain's proprietary dataset for instruction fine-tuning. |
| NUS | Goat | arithmetic | en | LLaMA | A fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks.<br /> Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on the BIG-bench arithmetic sub-task. |
| CU/NYU | FinGPT | finance | en | - | An end-to-end open-source framework for financial large language models (FinLLMs). |
| microsoft | WizardCoder | code generation | en | StarCoder | Trained with 78k evolved code instructions. Surpasses Claude-Plus (+6.8), Bard (+15.3), and InstructCodeT5+ (+22.3) on the HumanEval benchmark. |
| UCAS | Cornucopia | finance | zh | LLaMA | Fine-tunes LLaMA on Chinese financial knowledge. |
| PKU | ChatLaw | legal | zh | Ziya / Anima | Chinese legal domain model. |
| @michael-wzhu | ChatMed | medical | zh | LLaMA | Chinese medical LLM based on LLaMA-7B. |
| SCUT | SoulChat | mental health | zh | ChatGLM-6B | Chinese dialogue LLM in the mental health domain, based on ChatGLM-6B. |
| @shibing624 | MedicalGPT | medical | zh | ChatGLM-6B | Training Your Own Medical GPT Model with the ChatGPT Training Pipeline. |
| BJTU | TransGPT | transportation | zh | LLaMA-7B | Chinese transportation model. |
| BAAI | AquilaCode | code generation | multi | Aquila | AquilaCode-multi is a multi-language model that supports high-accuracy code generation for various programming languages, including Python/C++/Java/JavaScript/Go, etc.<br /> It has achieved impressive results in HumanEval (Python) evaluation, with Pass@1, Pass@10, and Pass@100 scores of 26/45.7/71.6, respectively. In the HumanEval-X<br /> multi-language code generation evaluation, it significantly outperforms other open-source models with similar parameters (as of July 19, 2023).<br />AquilaCode-py, on the other hand, is a single-language Python version of the model that focuses on Python code generation. <br />It has also demonstrated excellent performance in HumanEval evaluation, with Pass@1, Pass@10, and Pass@100 scores of 28.8/50.6/76.9 (as of July 19, 2023). |
| Meta | CodeLLaMA | code generation | multi | LLaMA-2 | A family of large language models for code based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities,<br /> support for large input contexts, and zero-shot instruction-following ability for programming tasks. |
| UNSW, etc | Darwin | natural science | en | LLaMA-7B | The first open-source LLM for natural science, mainly in physics, chemistry, and materials science. |
| alibaba | EcomGPT | e-commerce | en/zh | BLOOMZ | An instruction-tuned large language model for e-commerce. |
| TIGER-AI-Lab | MAmmoTH | math | en | LLaMA2/CodeLLaMA | A series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct,<br /> a meticulously curated instruction tuning dataset that is lightweight yet generalizable. MathInstruct is compiled from 13 math rationale datasets,<br /> six of which are newly curated by this work. It uniquely focuses on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales,<br /> and ensures extensive coverage of diverse mathematical fields. |
| SJTU | abel | math | en | LLaMA2 | We propose *Parental Oversight*, a Babysitting Strategy for Supervised Fine-tuning. Parental Oversight is not limited to any specific data processing method; instead, it defines the data processing philosophy that should guide supervised fine-tuning in the era of Generative AI (GAI). |
| FDU | DISC-LawLLM | legal | zh | Baichuan-13B | FudanDISC has released DISC-LawLLM, a Chinese intelligent legal system driven by a large language model.<br /> The system can provide various legal services for different user groups. In addition, DISC-Law-Eval is constructed to evaluate large legal language models from both objective and subjective aspects.<br /> The model has clear advantages over existing large legal models.<br />The team also released DISC-Law-SFT, a high-quality supervised fine-tuning (SFT) dataset with 300,000 entries. |
| HKU, etc | ChatPsychiatrist | mental health | en | LLaMA-7B | This repo open-sources an instruction-tuned LLaMA-7B model fine-tuned with counseling domain instruction data.<br /> To construct our 8K-size instruction-tuning dataset, we collected real-world counseling dialogue examples and employed GPT-4 as an extractor and filter.<br /> In addition, we have introduced a comprehensive set of metrics specifically tailored to the LLM+Counseling domain, by incorporating counseling domain evaluation criteria.<br /> These metrics enable the assessment of performance in generating language content that involves multi-dimensional counseling skills. |
| CAS | StarWhisper | astronomical | zh | - | StarWhisper, a large astronomical model, significantly improves the reasoning logic and integrity of the model through fine-tuning on an expert-labeled astrophysics corpus,<br /> logical long-text training, and direct preference optimization. In CG-Eval, jointly published by the Keguei AI Research Institute and LanguageX AI Lab, it ranked second overall,<br /> just below GPT-4, and its mathematical reasoning and astronomical capabilities are close to or exceed those of GPT-3.5 Turbo. |
| ZhiPuAI | FinGLM | finance | zh | ChatGLM | Solutions for SMP2023-ELMFT (The Evaluation of Large Model of Finance Technology). |
| PKU, etc | CodeShell | code generation | en/zh | - | CodeShell is a code large language model (LLM) developed jointly by the Knowledge Computing Lab at Peking University and the AI team of Sichuan Tianfu Bank. CodeShell has 7 billion parameters,<br /> was trained on 500 billion tokens, and has a context window length of 8192. On authoritative code evaluation benchmarks (HumanEval and MBPP), CodeShell achieves the best performance for models of its scale. |
| FDU | DISC-FinLLM | finance | zh | Baichuan-13B-Chat | DISC-FinLLM is a large language model in the financial field. It is a multi-expert intelligent financial system composed of four modules for different financial scenarios: financial consulting,<br /> financial text analysis, financial calculation, and financial knowledge retrieval and question answering. |
| Deepseek | Deepseek Coder | code generation | en/zh | - | Deepseek Coder comprises a series of code language models trained on 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2T tokens.<br />For coding capabilities, Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks. |
| microsoft | MathOctopus | math | multi | LLaMA2 | This work pioneers exploring and building powerful Multilingual Math Reasoning (xMR) LLMs. To accomplish this, we make the following contributions:<br />1. MGSM8KInstruct, the first multilingual math reasoning instruction dataset, encompassing ten distinct languages, thus addressing the issue of training data scarcity in xMR tasks.<br />2. MSVAMP, an out-of-domain xMR test dataset, to conduct a more exhaustive and comprehensive evaluation of the model's multilingual mathematical capabilities.<br />3. MathOctopus, our effective Multilingual Math Reasoning LLMs, trained with different strategies, which notably outperform conventional open-source LLMs and exhibit superiority over ChatGPT in few-shot scenarios. |
| ITREC | Zh-MT-LLM | maritime | en/zh | ChatGLM3-6b | The training data combines the maritime domain data Zh-mt-sft, organized into three main segments, with 300k general conversation samples from moss-003-sft-data. Zh-mt-sft specifically contains CrimeKgAssitant-1.8w, Zh-law-qa, and Zh-law-court for Q&A on maritime laws and regulations; Zh-edu-qa and Zh-edu-qb for maritime education and training; and Zh-mt-qa for Q&A on specialized maritime knowledge. |
| @SmartFlowAI | EmoLLM | mental health | zh | - | EmoLLM is a series of mental health LLMs, instruction-tuned from base LLMs, that support the full counseling chain of understanding users, supporting users, and helping users. |

some medical models: here

some domain llms: Awesome-Domain-LLM

healthcare models: Awesome-Healthcare-Foundation-Models
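Several code models in the domain table (for example AquilaCode) report HumanEval results as Pass@1, Pass@10 and Pass@100. The sketch below implements the unbiased pass@k estimator commonly used for HumanEval-style evaluation, introduced with the Codex paper. It is an assumption that the listed projects computed their scores exactly this way, and the function names are hypothetical.

```python
"""Unbiased pass@k estimator for code-generation benchmarks (illustrative sketch)."""
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes, given
    n generated samples of which c pass the unit tests:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every subset of k samples must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_samples, n_correct)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical results for three problems, 200 samples each.
    results = [(200, 50), (200, 0), (200, 120)]
    for k in (1, 10, 100):
        print(f"pass@{k}: {benchmark_pass_at_k(results, k):.3f}")
```

With k = 1 the estimator reduces to c / n, the fraction of passing samples. Larger k rewards models that solve a problem in at least some of their attempts.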

General Domain Instruction Models

| contributor | model/project | language | base model | main feature |
| --- | --- | --- | --- | --- |
| Stanford | Alpaca | en | LLaMA/OPT | Uses 52K instruction-following data generated by Self-Instruct techniques to fine-tune 7B LLaMA;<br /> the resulting model, Alpaca, behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.<br />Alpaca has inspired many follow-up models. |
| LianJiaTech | BELLE | en/zh | BLOOMZ-7B1-mt | Maybe the first Chinese model to follow Alpaca. |
| THU | ChatGLM-6B | en/zh | - | Well-known Chinese model. |
| Databricks | Dolly | en | GPT-J 6B | Uses Alpaca data to fine-tune a 2-year-old model, GPT-J, which exhibits surprisingly high-quality<br /> instruction-following behavior not characteristic of the foundation model on which it is based. |
| @tloen | Alpaca-LoRA | en | LLaMA-7B | Trained within hours on a single RTX 4090,<br />reproducing the Stanford Alpaca results using low-rank adaptation (LoRA),<br />and can run on a Raspberry Pi. |
| ColossalAI | Coati7B | en/zh | LLaMA-7B | A large language model developed by the ColossalChat project. |
| Shanghai AI Lab | LLaMA-Adapter | en | LLaMA-7B | Fine-tuning LLaMA to follow instructions within 1 hour and 1.2M parameters. |
| AetherCortex | Llama-X | en | LLaMA | Open Academic Research on Improving LLaMA to SOTA LLM. |
| TogetherComputer | OpenChatKit | en | GPT-NeoX-20B | OpenChatKit provides a powerful, open-source base to create both specialized and general-purpose chatbots for various applications.<br /> The kit includes instruction-tuned language models, a moderation model, and an extensible retrieval system for including <br />up-to-date responses from custom repositories. |
| nomic-ai | GPT4All | en | LLaMA | Trained on a massive collection of clean assistant data including code, stories, and dialogue. |
| @ymcui | Chinese-LLaMA-Alpaca | en/zh | LLaMA-7B/13B | Expands the Chinese vocabulary of the original LLaMA and uses Chinese data for secondary pre-training,<br /> further enhancing basic Chinese semantic understanding. Additionally, the project uses Chinese instruction data<br /> for fine-tuning on top of Chinese LLaMA, significantly improving the model's understanding and execution of instructions. |
| UC Berkeley<br />Stanford<br />CMU | Vicuna | en | LLaMA-13B | Impressing GPT-4 with 90% ChatGPT Quality. |
| UCSD/SYSU | baize | en/zh | LLaMA | Fine-tuned with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself. <br />Alpaca's data is also used to improve its performance. |
| UC Berkeley | Koala | en | LLaMA | Rather than maximizing quantity by scraping as much web data as possible, the team focuses on collecting a small, high-quality dataset. |
| @imClumsyPanda | langchain-ChatGLM | en/zh | ChatGLM-6B | Local-knowledge-based ChatGLM with LangChain. |
| @yangjianxin1 | Firefly | zh | bloom-1b4-zh<br />bloom-2b6-zh | Instruction tuning on a Chinese dataset. Vocabulary pruning, ZeRO, and tensor parallelism<br /> are used to effectively reduce memory consumption and improve training efficiency. |
| microsoft | GPT-4-LLM | en/zh | LLaMA | Aims to share data generated by GPT-4 for building instruction-following LLMs with supervised learning and reinforcement learning. |
| Hugging Face | StackLLaMA | en | LLaMA | Trained on StackExchange data; the main goal is to serve as a tutorial and walkthrough on<br /> how to train a model with RLHF, not primarily model performance. |
| Nebuly | ChatLLaMA | en | - | A library that allows you to create hyper-personalized ChatGPT-like assistants using your own data and the least amount of compute possible. |
| @juncongmoo | ChatLLaMA | en | LLaMA | LLaMA-based RLHF model, runnable on a single GPU. |
| @juncongmoo | minichatgpt | en | GPT/OPT ... | To Train ChatGPT In 5 Minutes with ColossalAI. |
| @LC1332 | Luotuo-Chinese-LLM | zh | LLaMA/ChatGLM | Instruction fine-tuned Chinese language models, with Colab provided! |
| @Facico | Chinese-Vicuna | zh | LLaMA | A Chinese instruction-following LLaMA-based model, fine-tuned with LoRA, with cpp inference supported and Colab provided. |
| @yanqiangmiffy | InstructGLM | en/zh | ChatGLM-6B | ChatGLM-based instruction-following model, fine-tuned on a variety of data sources; supports DeepSpeed acceleration and LoRA. |
| alibaba | Wombat | en | LLaMA | Proposes a novel learning paradigm called RRHF as an alternative to RLHF. It scores responses generated by<br /> different sampling policies and learns to align them with human preferences through a ranking loss. Its performance<br />is comparable to RLHF, with fewer models used in the process. |
| @WuJunde | alpaca-glassoff | en | LLaMA | A mini image-acceptable chat AI that can run on your own laptop, based on stanford-alpaca and alpaca-lora. |
| @JosephusCheung | Guanaco | multi | LLaMA-7B | A Multilingual Instruction-Following Language Model. |
| @FreedomIntelligence | LLM Zoo | multi | BLOOMZ/LLaMA | A project that provides data, models, and evaluation benchmarks for large language models.<br />Models released: Phoenix, Chimera. |
| SZU | Linly | en/zh | LLaMA | Expands the Chinese vocabulary; fully fine-tuned models; the largest LLaMA-based Chinese models; aggregation of Chinese instruction data; reproducible details. |
| @lamini-ai | lamini | multi | - | Data generator for generating instructions to train instruction-following LLMs. |
| Stability-AI | StableVicuna | en | LLaMA | A further instruction-fine-tuned and RLHF-trained version of Vicuna v0 13B, with better performance than Vicuna. |
| Hugging Face | HuggingChat | en | LLaMA | Seems to be the first one available to access as a platform that appears similar to ChatGPT. |
| microsoft | WizardLM | en | LLaMA | Trained with 70k evolved instructions. Evol-Instruct is a novel method using LLMs instead of humans to automatically mass-produce<br /> open-domain instructions of various difficulty levels and skill ranges, to improve the performance of LLMs. |
| FDU | OpenChineseLLaMA | en/zh | LLaMA-7B | Further pretrains LLaMA on Chinese data, improving LLaMA's performance on Chinese tasks. |
| @chenfeng357 | open-Chinese-ChatLLaMA | en/zh | LLaMA | The complete training code of the open-source Chinese-Llama model, covering the full process from pre-training and instruction tuning to RLHF. |
| @FSoft-AI4Code | CodeCapybara | en | LLaMA | Open-source LLaMA model that follows instruction tuning for code generation. |
| @mbzuai-nlp | LaMini-LM | en | LLaMA/Flan-T5 ... | A Diverse Herd of Distilled Models from Large-Scale Instructions. |
| NTU | Panda | en/zh | LLaMA | Further pretraining on Chinese data, full-size LLaMA models. |
| IBM/CMU/MIT | Dromedary | en | LLaMA-65B | Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. |
| @melodysdreamj | WizardVicunaLM | multi | Vicuna | Wizard's dataset + ChatGPT's conversation extension + Vicuna's tuning method,<br /> achieving approximately 7% performance improvement over Vicuna. |
| sambanovasystems | BLOOMChat | multi | BLOOM | BLOOMChat is a 176 billion parameter multilingual chat model. It is instruction-tuned from BLOOM (176B) on<br /> assistant-style conversation datasets and supports conversation, question answering, and generative answers in multiple languages. |
| TII | Falcon-7B-Instruct | en | Falcon-7B | A 7B parameter causal decoder-only model built by TII based on Falcon-7B and fine-tuned on a mixture of chat/instruct datasets. |
| TII | Falcon-40B-Instruct | multi | Falcon-40B | A 40B parameter causal decoder-only model built by TII based on Falcon-40B and fine-tuned on a mixture of Baize. |
| USTC, etc. | ExpertLLaMA | en | LLaMA | Uses In-Context Learning to automatically write customized expert identities and finds the quality quite satisfying.<br /> We then prepend the corresponding expert identity to each instruction to produce augmented instruction-following data.<br /> We refer to the overall framework as ExpertPrompting; find more details in our paper. |
| ZJU | CaMA | en/zh | LLaMA | Further pretrained on a Chinese corpus without vocabulary expansion; optimized on Information Extraction (IE) tasks.<br />A pre-training script is available, which includes transformation, construction, and loading of large-scale corpora, as well as the LoRA instruction fine-tuning script. |
| THU | UltraChat | en | LLaMA | First, the UltraChat dataset provides a rich resource for training chatbots. Second, by fine-tuning the LLaMA model,<br /> the researchers successfully created a dialogue model, UltraLLaMA, with superior performance. |
| RUC | YuLan-Chat | en/zh | LLaMA | Developed by fine-tuning LLaMA with high-quality English and Chinese instructions. |
| AI2 | Tülu | en | LLaMA/Pythia/OPT | A suite of LLaMA models fully fine-tuned on a strong mix of datasets. |
| KAIST | SelFee | en | LLaMA | Iterative Self-Revising LLM Empowered by Self-Feedback Generation. |
| @lyogavin | Anima | en/zh | LLaMA | Trained based on QLoRA's 33B Guanaco, fine-tuned for 10,000 steps. |
| THU | ChatGLM2-6B | en/zh | - | ChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model ChatGLM-6B.<br /> It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:<br />- Stronger Performance<br />- Longer Context<br />- More Efficient Inference<br />- More Open License |
| OpenChat | OpenChat | en | LLaMA, etc. | A series of open-source language models fine-tuned on a small yet diverse and high-quality dataset of multi-round conversations.<br /> Specifically, we utilize only ~6K GPT-4 conversations directly filtered from the ~90K ShareGPT conversations.<br /> Despite the small size of the dataset, OpenLLMs has demonstrated remarkable performance. |
| CAS | BayLing | multi | LLaMA | BayLing is an English/Chinese LLM equipped with advanced language alignment,<br /> showing superior capability in English/Chinese generation, instruction following, and multi-turn interaction. |
| stabilityai | FreeWilly/FreeWilly2 | en | LLaMA/LLaMA2 | FreeWilly is a LLaMA 65B model fine-tuned on an Orca-style dataset.<br />FreeWilly2 is a Llama 2 70B model fine-tuned on an Orca-style dataset.<br />FreeWilly2 outperforms Llama 2 70B on the Hugging Face Open LLM Leaderboard. |
| alibaba | Qwen-7B | en/zh | - | The 7B-parameter version of Qwen (abbr. Tongyi Qianwen), the large language model series proposed by Alibaba Cloud. |
| ZJU | KnowLM | en/zh | LLaMA | With the rapid development of deep learning technology, large language models such as ChatGPT have made substantial strides in natural language processing.<br /> However, these expansive models still encounter several challenges in acquiring and comprehending knowledge, including the difficulty of updating knowledge and potential knowledge<br /> discrepancies and biases, collectively known as knowledge fallacies.<br />The KnowLM project endeavors to tackle these issues by launching an open-source large-scale knowledgeable language model framework and releasing corresponding models. |
| NEU | TechGPT | en/zh | LLaMA | TechGPT mainly strengthens the following three types of tasks:<br />- Various information extraction tasks, such as relation triplet extraction, with "knowledge graph construction" as the core<br />- Various intelligent question-answering tasks centered on "reading comprehension"<br />- Various sequence generation tasks, such as keyword generation, with "text understanding" as the core |
| @MiuLab | Taiwan-LLaMa | en/zh | LLaMA2 | Traditional Chinese LLMs for Taiwan. |
| Xwin-LM | Xwin-LM | en | LLaMA2 | Xwin-LM aims to develop and open-source alignment technologies for large language models, including supervised fine-tuning (SFT),<br /> reward models (RM), rejection sampling, reinforcement learning from human feedback (RLHF), etc. Our first release, built upon the<br /> Llama 2 base models, ranked TOP-1 on AlpacaEval. Notably, it is the first to surpass GPT-4 on this benchmark. |
| wenge-research | YaYi | en/zh | LLaMA/LLaMA2 | YaYi was fine-tuned on millions of artificially constructed high-quality domain data samples. This training data covers five key domains:<br /> media publicity, public opinion analysis, public safety, financial risk control, and urban governance, encompassing over a hundred natural language instruction tasks. |
| HuggingFace | zephyr | en | Mistral | Zephyr is a series of language models trained to act as helpful assistants. Zephyr-7B-α is the first model in the series, and is a fine-tuned version of<br />mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). |
| Cohere | Command-R / Command R+ | multi | - | Command-R offers multilingual generation, evaluated in 10 languages, and highly performant RAG capabilities. |
| XAI | grok | en | - | 314B MoE; context length: 8192. |
| databricks | dbrx-instruct | - | - | A fine-grained mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts: DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2. |
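The DBRX row contrasts fine-grained routing (16 experts, 4 active per token) with Mixtral-8x7B and Grok-1 (8 experts, 2 active). The pure-Python sketch below shows generic top-k expert routing: a softmax over router logits, selection of the k highest-scoring experts, and gate weights renormalized over the selected experts. It is a simplified illustration under those assumptions, not DBRX's actual router, and the toy experts and logits are hypothetical.

```python
"""Generic top-k mixture-of-experts routing (illustrative sketch, not DBRX's code)."""
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[List[float]], List[float]]


def top_k_route(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    m = max(router_logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x: List[float], experts: Sequence[Expert],
                router_logits: Sequence[float], k: int) -> List[float]:
    """Gate-weighted sum of the selected experts' outputs.
    Assumes every expert returns a vector with the same length as x."""
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)  # only k experts are evaluated for this token
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    # 16 toy experts that each scale their input by a constant.
    experts = [lambda v, s=s: [s * vi for vi in v] for s in range(16)]
    logits = [0.1 * i for i in range(16)]
    print(top_k_route(logits, k=4))
    print(moe_forward([1.0, 2.0], experts, logits, k=4))
    # Number of possible expert combinations per token: 16 choose 4 vs 8 choose 2
    print(math.comb(16, 4), "vs", math.comb(8, 2))
```

The last line prints 1820 vs 28, a 65x difference in how many expert subsets the router can pick from. This illustrates why "a larger number of smaller experts" gives the router more flexibility.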

Model Merging

| contributor | model/method | main feature | details |
| --- | --- | --- | --- |
| FuseAI | FuseChat | First, it performs pairwise knowledge fusion on the source LLMs to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Then, these target LLMs are merged in parameter space using VaRM, a novel method that determines the merging weights based on the variation ratio of parameter matrices before and after fine-tuning. | A fusion of three prominent chat LLMs with diverse architectures and scales, namely NH2-Mixtral-8x7B, NH2-Solar-10.7B, and OpenChat-3.5-7B. FuseChat-7B-VaRM achieves an average performance of 8.22 on MT-Bench, outperforming various powerful chat LLMs at 7B and 34B scales like Starling-7B and Yi-34B-Chat, even surpassing GPT-3.5 (March) and Claude-2.1, and approaching Mixtral-8x7B-Instruct. |
| arcee-ai | mergekit | Tools for merging pretrained large language models. | |
| SakanaAI | EvoLLM | Evolutionary Optimization of Model Merging Recipes. | |
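
A minimal sketch of what "merging within the parameter space" means. Models with identical structure are combined by taking a weighted average of each parameter tensor. This is plain linear averaging. It is not FuseChat's VaRM, which derives the weights from parameter variation ratios, and it is not mergekit's API. The dict-of-lists model format and `merge_weighted` are hypothetical.

```python
from typing import Dict, List

# Hypothetical format: parameter name -> flattened tensor values.
StateDict = Dict[str, List[float]]


def merge_weighted(models: List[StateDict], weights: List[float]) -> StateDict:
    """Merge same-architecture models by a weighted average of every parameter tensor."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize so the weights sum to 1
    merged: StateDict = {}
    for name in models[0]:
        tensors = [m[name] for m in models]
        if any(len(t) != len(tensors[0]) for t in tensors):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            sum(w * t[i] for w, t in zip(norm, tensors)) for i in range(len(tensors[0]))
        ]
    return merged


if __name__ == "__main__":
    a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
    b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
    print(merge_weighted([a, b], [1.0, 1.0]))  # {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

With equal weights this reduces to simple model averaging. Methods like VaRM instead choose a weight per parameter matrix.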

Alternatives To Transformer

(maybe successors?)

| contributor | method | main feature |
| --- | --- | --- |
| BlinkDL | RWKV-LM | RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable).<br /> So it combines the best of RNN and transformer: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embedding. |
| msra | RetNet | Simultaneously achieves training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention.<br /> Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent.<br /> Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, which improves decoding throughput,<br /> latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity,<br /> where each chunk is encoded in parallel while the chunks are summarized recurrently. Experimental results on language modeling show that RetNet achieves favorable scaling results,<br /> parallel training, low-cost deployment, and efficient inference. These properties make RetNet a strong successor to Transformer for large language models. |
| stanford | Backpack | A Backpack is a drop-in replacement for a Transformer that provides new tools for interpretability-through-control while still enabling strong language models.<br /> Backpacks decompose the predictive meaning of words into components non-contextually, and aggregate them by a weighted sum, allowing for precise, predictable interventions. |
| stanford, etc. | Monarch Mixer (M2) | The basic idea is to replace the major elements of a Transformer with Monarch matrices, a class of structured matrices that generalize the FFT and are sub-quadratic,<br /> hardware-efficient, and expressive. In Monarch Mixer, layers built from Monarch matrices do both mixing across the sequence (replacing the attention operation) and mixing across the model dimension (replacing the dense MLP). |
| CMU, etc. | Mamba | Mamba is a new state space model architecture showing promising performance on information-dense data such as language modeling, where previous subquadratic models fall short of Transformers. It is based on the line of progress on structured state space models, with an efficient hardware-aware design and implementation in the spirit of FlashAttention. |
| TogetherComputer | StripedHyena | StripedHyena is the first alternative model competitive with the best open-source Transformers of similar sizes in short and long-context evaluations.<br />StripedHyena is a hybrid architecture composed of multi-head, grouped-query attention and gated convolutions arranged in Hyena blocks, different from traditional decoder-only Transformers.<br />1. Constant-memory decoding in Hyena blocks via representation of convolutions as state-space models (modal or canonical form), or as truncated filters.<br />2. Low latency, faster decoding and higher throughput than Transformers.<br />3. Improved training- and inference-optimal scaling laws, compared to optimized Transformer architectures such as Llama-2.<br />4. Trained on sequences of up to 32k, allowing it to process longer prompts. |
| microsoft | bGPT | bGPT supports generative modelling via next-byte prediction on any type of data and can perform any task executable on a computer, showing the capability to simulate all activities within the digital world, with its potential limited only by computational resources and our imagination. |
| DeepMind | Griffin-Jax | JAX + Flax implementation of Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (unofficial; the official code has not been released yet).<br />The paper proposes the RG-LRU layer, a novel gated linear recurrent layer, around which a new recurrent block is designed to replace MQA. Two models are built with this recurrent block: Hawk, which interleaves MLPs with recurrent blocks, and Griffin, a hybrid model which interleaves MLPs with a mixture of recurrent blocks and local attention.<br />Griffin-3B outperforms Mamba-3B, and Griffin-7B and Griffin-14B achieve performance competitive with Llama-2 despite being trained on nearly 7 times fewer tokens; Griffin can extrapolate to sequences significantly longer than those seen during training. |
| AI21 | Jamba | Jamba is the first production-scale Mamba implementation. It is a pretrained mixture-of-experts (MoE) generative text model with 12B active parameters and 52B total parameters across all experts. It supports a 256K context length and can fit up to 140K tokens on a single 80GB GPU. |
| Meta | Megalodon | Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), a timestep normalization layer, a normalized attention mechanism, and pre-norm with a two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than Transformer at the scale of 7 billion parameters and 2 trillion training tokens. |
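
The parallel and recurrent forms of retention compute the same outputs. That equivalence is what lets RetNet train in parallel and still decode with a fixed-size state. The sketch below checks it for a single simplified head. It assumes a scalar decay `gamma` and omits the positional rotation, normalization, gating, and the chunkwise form, so it illustrates the idea rather than reproducing RetNet.

```python
from typing import List

Matrix = List[List[float]]  # one row per sequence position


def retention_parallel(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Parallel form: O = (Q K^T * D) V, where D[n][m] = gamma**(n - m) if n >= m else 0."""
    out = []
    for n in range(len(q)):
        row = [0.0] * len(v[0])
        for m in range(n + 1):  # causal mask: only positions m <= n contribute
            score = sum(qi * ki for qi, ki in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(len(row)):
                row[j] += score * v[m][j]
        out.append(row)
    return out


def retention_recurrent(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n and o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # d_k x d_v, independent of sequence length
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        state = [[gamma * state[i][j] + k_n[i] * v_n[j] for j in range(d_v)] for i in range(d_k)]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.3]]
    k = [[0.5, 1.0], [-0.4, 0.2], [1.0, 0.0]]
    v = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
    par = retention_parallel(q, k, v, gamma=0.9)
    rec = retention_recurrent(q, k, v, gamma=0.9)
    print(all(abs(a - b) < 1e-9 for ra, rb in zip(par, rec) for a, b in zip(ra, rb)))  # True
```

The recurrent loop touches only the `d_k x d_v` state at each step. This is the O(1)-per-token decoding cost mentioned in the RetNet row.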

MoE

| contributor | model/project | main feature |
| --- | --- | --- |
| mistralai | Mixtral-8x7B | The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks tested. |
| Shanghai AI Lab, etc. | LLaMA-MoE | A small and affordable MoE model based on LLaMA and SlimPajama. The number of activated model parameters is only 3.0~3.5B, which is friendly for deployment and research usage. |
| NUS, etc. | OpenMoE | A family of open-sourced Mixture-of-Experts (MoE) Large Language Models. |
| Snowflake | Arctic | Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128x3.66B MoE MLP, resulting in 480B total and 17B active parameters chosen using top-2 gating. |
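
Arctic routes with top-2 gating, Mixtral-8x7B chooses 2 of 8 experts, and DBRX picks 4 of 16. Below is a minimal router sketch. It assumes a softmax taken over only the selected logits, and it assumes every expert returns a vector of the input's size. Real implementations differ in details such as load-balancing losses and capacity limits. The toy experts and router matrix are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def top_k_gating(logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Select the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = logits[top[0]]  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x: Vector, router: List[Vector],
                experts: Sequence[Callable[[Vector], Vector]], k: int = 2) -> Vector:
    """Sparse MoE layer: the router scores every expert, but only the top-k experts run."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]  # one logit per expert
    out = [0.0] * len(x)
    for idx, weight in top_k_gating(logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    # Four toy "experts" that scale their input by 1, 2, 3 and 4 (stand-ins for FFNs).
    experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
    router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    print(top_k_gating([1.0, 2.0, -1.0, -2.0]))  # experts 1 and 0 are selected
    print(moe_forward([1.0, 2.0], router, experts))
```

Only `k` expert functions are evaluated per input. This is why "active" parameters (e.g. 17B for Arctic) are far fewer than total parameters.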

Multi-Modal

| contributor | project | language | base model | main feature |
| --- | --- | --- | --- | --- |
| BaihaiAI | IDPChat | en/zh | LLaMA-13B<br />Stable Diffusion | Open Chinese multi-modal model; runs on a single GPU, easy to deploy, UI provided. |
| KAUST | MiniGPT-4 | en/zh | LLaMA | MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer,<br />and yields many emerging vision-language capabilities similar to those demonstrated in GPT-4. |
| MSR, etc. | LLaVA | en | LLaMA | Visual instruction tuning is proposed, towards building large language and vision models with GPT-4-level capabilities. |
| NUS/THU | VPGTrans | en | LLaMA/OPT/<br />Flan-T5/BLIP-2<br />... | Transfers the VPG (visual prompt generator) across LLMs to build VL-LLMs at significantly lower cost. GPU hours<br /> can be reduced by over 10 times and the training data to around 10%.<br />Two novel VL-LLMs are released via VPGTrans: VL-LLaMA and VL-Vicuna.<br />VL-LLaMA is a multimodal version of LLaMA, built by transferring the BLIP-2 OPT-6.7B to LLaMA via VPGTrans.<br />VL-Vicuna is a GPT-4-like multimodal chatbot, based on the Vicuna LLM. |
| CAS, etc. | X-LLM | en/zh | ChatGLM-6B | X-LLM converts multiple modalities (images, speech, videos) into foreign languages using X2L interfaces and feeds them into<br /> a large language model (ChatGLM) to build a multimodal LLM, achieving impressive multimodal chat capabilities. |
| NTU | Otter | en | OpenFlamingo | A multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo),<br /> trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning.<br />Furthermore, it optimizes OpenFlamingo's implementation, democratizing the required<br /> training resources from 1x A100 GPU to 4x RTX-3090 GPUs. |
| XMU | LaVIN | en | LLaMA | Proposes a novel and affordable solution for vision-language instruction tuning, namely Mixture-of-Modality Adaptation (MMA).<br /> In particular, MMA is an end-to-end optimization regime which connects the image encoder and LLM via lightweight adapters.<br /> A novel routing algorithm is also proposed in MMA, which helps the model automatically shift its reasoning paths<br /> for single- and multi-modal instructions. |
| USTC | Woodpecker | - | - | The first work to correct hallucination in multimodal large language models. |
| hpcaitech | Open-Sora | - | - | Open-source alternative to OpenAI Sora. |
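
MiniGPT-4's "one projection layer" connects a frozen visual encoder to a frozen LLM. Below is a rough sketch of that data flow. It assumes the encoder has already produced a list of visual feature vectors. The function names and shapes are illustrative, not MiniGPT-4's code.

```python
from typing import List

Matrix = List[List[float]]


def linear(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """y = W x + b, with W of shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def build_llm_inputs(visual_tokens: Matrix, proj_w: Matrix, proj_b: List[float],
                     text_embeddings: Matrix) -> Matrix:
    """Project frozen-encoder visual tokens into the LLM embedding space and prepend them.

    Only proj_w / proj_b would be trained; the visual encoder and the LLM stay frozen.
    """
    projected = [linear(v, proj_w, proj_b) for v in visual_tokens]
    if projected and text_embeddings and len(projected[0]) != len(text_embeddings[0]):
        raise ValueError("projection output size must match the LLM embedding size")
    return projected + text_embeddings  # the LLM then attends over image and text tokens alike
```

Because everything except the projection is frozen, the trainable parameter count is just `out_dim * (in_dim + 1)`. This is what keeps this style of alignment cheap.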

see also: awesome-Multimodal-Large-Language-Models

Data

Pretrain Data

| contributor | data/project | language | main feature |
| --- | --- | --- | --- |
| TogetherComputer | RedPajama-Data | en | An open-source recipe to reproduce the LLaMA training dataset. |
| @goldsmith | Wikipedia | multi | A Pythonic wrapper for the Wikipedia API. |

Instruction Data

see Alpaca-CoT data collection

| contributor | data | language | main feature |
| --- | --- | --- | --- |
| salesforce | DialogStudio | en | DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI. |

Synthetic Data Generation

| contributor | method | main feature |
| --- | --- | --- |
| UW, etc. | self-instruct | Uses the model's own generations to create a large collection of instruction data. |
| @LiuHC0428 | Reliable-Self-Instruction | Uses ChatGPT to generate questions and answers based on a given text. |
| PKU | Evol-Instruct | A novel method, proposed in WizardLM, that uses LLMs instead of humans to automatically mass-produce open-domain<br /> instructions across a range of difficulty levels and skills, to improve the performance of LLMs. |
| KAUST, etc. | CAMEL | A novel communicative agent framework named role-playing is proposed, which uses inception prompting to guide chat agents<br /> toward task completion while maintaining consistency with human intentions.<br />Role-playing can be used to generate conversational data in a specific task/domain. |
| @chatarena | ChatArena | A library that provides multi-agent language game environments and facilitates research on autonomous LLM agents and their social interactions.<br />It provides a flexible framework, based on Markov Decision Processes, to define multiple players, environments, and the interactions between them. |
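
Self-Instruct-style pipelines must stop the model from regenerating near-duplicate instructions. The sketch below keeps a candidate only if its ROUGE-L F1 against every instruction already in the pool stays below a threshold. The 0.7 default reflects our reading of the Self-Instruct paper, and whitespace tokenization is a simplification; treat both as assumptions. `filter_new_instructions` is a hypothetical helper.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via O(len(a) * len(b)) dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (a simplified tokenizer)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref) if cand and ref else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_new_instructions(pool: List[str], candidates: List[str],
                            threshold: float = 0.7) -> List[str]:
    """Append each sufficiently novel candidate to `pool` (in place) and return the kept ones."""
    kept = []
    for cand in candidates:
        if all(rouge_l_f1(cand, existing) < threshold for existing in pool):
            pool.append(cand)
            kept.append(cand)
    return kept
```

Because accepted candidates join the pool immediately, two near-identical candidates in the same batch cannot both survive.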

Evaluation

| contributor | method | main feature |
| --- | --- | --- |
| - | human evaluation | - |
| OpenAI | GPT-4/ChatGPT | - |
| PKU/CMU/MSRA ... | PandaLM | Reproducible and Automated Language Model Assessment. |
| UCB | Chatbot Arena | Chat with two anonymous models side by side and vote for the better one,<br /> then use the Elo rating system to calculate the relative performance of the models. |
| Stanford | AlpacaEval | GPT-4/Claude evaluation on the AlpacaFarm dataset. |
| clueai | SuperCLUElyb | Chinese version of Chatbot Arena developed by clueai. |
| SJTU, etc. | Auto-J | A new open-source generative judge that can effectively evaluate how well different LLMs align with human preferences. |
| CMU | CodeBERTScore | An automatic metric for code generation, based on BERTScore.<br />Like BERTScore, CodeBERTScore leverages the pre-trained contextual embeddings from a model such as CodeBERT and matches words in candidate and reference sentences by cosine similarity.<br /> Unlike BERTScore, CodeBERTScore also encodes natural language input or other context along with the generated code, but does not use that context to compute cosine similarities. |
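
The Chatbot Arena row relies on Elo ratings. After each vote, the winner takes rating points from the loser in proportion to how surprising the result was. Below is a minimal online-Elo sketch. The K-factor of 32 and the starting rating of 1000 are illustrative assumptions. The leaderboard's actual computation may differ.

```python
from typing import Dict, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: Dict[str, float], battles: List[Tuple[str, str, float]],
               k: float = 32.0, initial: float = 1000.0) -> Dict[str, float]:
    """Apply battles in order; score_a is 1.0 (A wins), 0.0 (B wins) or 0.5 (tie).

    `ratings` is updated in place and also returned.
    """
    for model_a, model_b, score_a in battles:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = expected_score(r_a, r_b)
        ratings[model_a] = r_a + k * (score_a - e_a)
        ratings[model_b] = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return ratings


if __name__ == "__main__":
    print(update_elo({}, [("model-x", "model-y", 1.0)]))  # {'model-x': 1016.0, 'model-y': 984.0}
```

Online Elo depends on the order of the battles. Aggregated leaderboards therefore often recompute ratings over the full set of votes rather than trusting a single pass.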

Benchmark

Current state of LLM evaluation in China

| contributor | benchmark | main feature |
| --- | --- | --- |
| princeton | SWE-bench | A benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue,<br /> a language model is tasked with generating a patch that resolves the described problem. |
| microsoft | AGIEval | A human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. |
| clueai | SuperCLUE-Agent | Agent evaluation benchmark based on native Chinese tasks. |
| bytedance | GPT-Fathom | GPT-Fathom is an open-source and reproducible LLM evaluation suite, benchmarking 10+ leading open-source and closed-source LLMs as well as OpenAI's earlier models on 20+ curated benchmarks under aligned settings. |

LeaderBoard

opencompass, huggingface

Framework/ToolKit/Platform

| contributor | project | main feature |
| --- | --- | --- |
| CAS | Alpaca-CoT | Extends CoT data to Alpaca to boost its reasoning ability.<br />Aims at building an instruction finetuning (IFT) platform with an extensive instruction collection (especially CoT datasets)<br />and a unified interface for various large language models. |
| @hiyouga | ChatGLM-Efficient-Tuning | Efficient fine-tuning of ChatGLM-6B with PEFT. |
| @hiyouga | LLaMA-Efficient-Tuning | Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA). |
| @jianzhnie | Efficient-Tuning-LLMs | Efficient Finetuning of QLoRA LLMs. |
| ColossalAI | ColossalChat | An open-source, low-cost solution for cloning ChatGPT with a complete RLHF pipeline. |
| microsoft | deepspeed-chat | Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales. |
| LAION-AI | Open Assistant | A project meant to give everyone access to a great chat-based large language model. |
| HKUST | LMFlow | An extensible, convenient, and efficient toolbox for finetuning large machine learning models,<br /> designed to be user-friendly, speedy, reliable, and accessible to the entire community. |
| UCB | EasyLM | EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax.<br /> EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality. |
| @CogStack | OpenGPT | A framework for creating grounded instruction-based datasets and training conversational domain-expert Large Language Models (LLMs). |
| HugAILab | HugNLP | A unified and comprehensive NLP library based on HuggingFace Transformers. |
| ProjectD-AI | LLaMA-Megatron-DeepSpeed | Ongoing research training transformer language models at scale, including BERT & GPT-2. |
| @PanQiWei | AutoGPTQ | An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. |
| alibaba | swift | SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible framework designed to facilitate lightweight model fine-tuning and inference.<br /> It integrates implementations of various efficient fine-tuning methods, embracing approaches that are parameter-efficient, memory-efficient, and time-efficient. |
| alibaba | Megatron-LLaMA | To facilitate the training of LLaMA-based models and reduce hardware resource costs,<br /> Alibaba released its internally optimized Megatron-LLaMA training framework to the community. |
| @OpenLLMAI | OpenRLHF | OpenRLHF aims to develop a high-performance RLHF training framework based on Ray and DeepSpeed.<br /> OpenRLHF is the simplest high-performance RLHF library that supports RLHF training of 34B models with a single DGX A100 (script).<br />The key idea of OpenRLHF is to distribute the Actor Model, Reward Model, Reference Model, and Critic Model onto separate GPUs using Ray,<br /> while placing the Adam optimizer on the CPU. This enables full-scale fine-tuning of 7B models across multiple 24GB RTX 4090 GPUs<br /> (or 34B models with multiple A100 80G), with high training efficiency thanks to the ability to use a large generation batch size with Adam Offload and Ray.<br /> Its PPO performance with 13B llama2 models is 4 times that of DeepSpeedChat. |
| @zejunwang1 | LLMTuner | LLMTuner is an LLM instruction tuning tool that supports LoRA, QLoRA and full-parameter fine-tuning. During training, flash attention and xformers attention<br /> can be used to improve training efficiency; combined with techniques such as LoRA, DeepSpeed ZeRO, gradient checkpointing and 4-bit quantization, it effectively<br /> reduces GPU memory usage, making it possible to fine-tune 7B/13B/34B models on a single graphics card (A100/A40/A30/RTX3090/V100). |
| Shanghai AI Lab | XTuner | A toolkit for efficiently fine-tuning LLMs (InternLM, Llama, Baichuan, QWen, ChatGLM2). |
| alibaba | MFTCoder | CodeFuse-MFTCoder is an open-source project of CodeFuse for multitask Code LLMs (large language models for code tasks),<br /> which includes models, datasets, training codebases and inference guides. |
| facebook | llama-recipes | Examples and recipes for the Llama 2 model. |
| microsoft | MS-AMP | The FP8-LM framework is highly optimized and uses the FP8 format throughout the forward and backward passes, which greatly reduces the system's computing, memory and communication overhead. |

Alignment

| contributor | method | used in | main feature |
| --- | --- | --- | --- |
| - | IFT | ChatGPT | Instruction Fine-Tuning. |
| - | RLHF | ChatGPT | RL from Human Feedback. |
| Anthropic | RLAIF | Claude | RL from AI Feedback. |
| alibaba | RRHF | Wombat | A novel learning paradigm called RRHF is proposed as an alternative to RLHF; it scores responses generated by<br />different sampling policies and learns to align them with human preferences through a ranking loss. Its performance<br />is comparable to RLHF while using fewer models in the process. |
| HKUST | RAFT | - | RAFT is a new alignment algorithm that is more efficient than conventional (PPO-based) RLHF. |
| IBM/CMU/MIT | SELF-ALIGN | Dromedary | Combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. |
| PKU | CVA | Beaver | Constrained Value Alignment via Safe RLHF. |
| tencent | RLTF | - | Reinforcement Learning from Unit Test Feedback. |
| stanford | DPO | - | Implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint) but is simple to implement and straightforward to train. Intuitively,<br /> the DPO update increases the relative log probability of preferred to dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model degeneration that occurs with a naive probability ratio objective. |
| THU | BPO | - | The central idea behind BPO is to create an automatic prompt optimizer that rewrites human prompts, which are usually less organized or ambiguous, into prompts that better deliver human intent.<br /> Consequently, these prompts are more LLM-preferred and hence yield better human-preferred responses. Empirical results show that BPO-aligned ChatGPT achieves a 22% increase in win rate against its original version, and BPO-aligned GPT-4 a 10% increase. |
| AI2, etc. | URIAL | - | URIAL (Untuned LLMs with Restyled In-context ALignment) is a simple, tuning-free alignment method. It achieves effective alignment purely through in-context learning (ICL), requiring as few as three constant stylistic examples and a system prompt to achieve performance comparable to SFT/RLHF. |
| openai | weak-to-strong | - | When strong pretrained models are naively finetuned on labels generated by a weak model, they consistently perform better than their weak supervisors. |
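
The DPO objective can be written per preference pair. It is the negative log-sigmoid of `beta` times the difference between two policy-to-reference log-ratios: one for the preferred response and one for the dispreferred response. Below is a minimal scalar sketch. It assumes the summed sequence log-probabilities are already computed. `beta=0.1` is an illustrative default, not a recommendation.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are log pi(y | x) summed over the response tokens, under the policy being
    trained and under the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)  # implicit reward difference
    return softplus(-margin)  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    # The policy already prefers the chosen response more than the reference does,
    # so the loss is below log(2), the value at zero margin.
    print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

No reward model or sampling loop appears anywhere. This is what makes DPO "simple to implement and straightforward to train" compared with PPO-based RLHF.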

Multi-Language

vocabulary expansion

According to the official FAQ in the LLaMA repo, the vocabulary contains few tokens for non-Latin languages, so one line of effort is to expand the vocabulary. Some such works are listed below:

| contributor | model/project | language | base model | main feature |
| --- | --- | --- | --- | --- |
| @ymcui | Chinese-LLaMA-Alpaca | zh | LLaMA | |
| SZU | Linly | en/zh | LLaMA | Full-size LLaMA, further pretrained on a Chinese corpus. |
| @Neutralzz | BiLLa | en/zh | LLaMA-7B | Further pretrained on Wudao, PILE and WMT. |
| @pengxiao-song | LaWGPT | zh | LLaMA/ChatGLM | Expands the vocabulary with Chinese legal terminology; instruction fine-tuned on data generated using self-instruct. |
| IDEA | Ziya | en/zh | LLaMA | A large-scale pre-trained model based on LLaMA with 13 billion parameters.<br />The LLaMA tokenizer is optimized for Chinese, and 110 billion tokens of data are incrementally trained on top of the LLaMA-13B model,<br />which significantly improves Chinese understanding and generation. |
| OpenBuddy | OpenBuddy | multi | LLaMA/Falcon ... | Built upon TII's Falcon model and Facebook's LLaMA model, OpenBuddy is fine-tuned to include an extended vocabulary,<br /> additional common characters, and enhanced token embeddings. By leveraging these improvements and multi-turn dialogue datasets,<br /> OpenBuddy offers a robust model capable of answering questions and performing translation tasks across various languages. |
| FDU | CuteGPT | en/zh | LLaMA | CuteGPT expands the Chinese vocabulary and performs pre-training on the Llama model, improving its ability to understand Chinese.<br /> Subsequently, it is fine-tuned with conversational instructions to enhance the model's ability to understand instructions. |
| FlagAlpha | FlagAlpha | en/zh | LLaMA/LLaMA2 | Based on large-scale Chinese data and starting from pre-training, the Chinese abilities of the models are continuously and iteratively upgraded. |
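
At the model level, vocabulary expansion means adding token ids and growing the embedding matrix to match. Below is a minimal sketch. It assumes contiguous ids `0..n-1` and initializes new rows to the mean of the existing embeddings. That is a common heuristic, not necessarily what the projects above do. A real pipeline also merges the tokenizer models and resizes the output (LM head) matrix. `expand_vocab` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def expand_vocab(vocab: Dict[str, int], embeddings: Matrix,
                 new_tokens: List[str]) -> Tuple[Dict[str, int], Matrix]:
    """Append unseen tokens to the vocabulary and grow the embedding matrix to match.

    Assumes ids are contiguous, i.e. vocab values are exactly 0..len(embeddings)-1.
    """
    vocab = dict(vocab)  # copy so the caller's vocabulary is untouched
    embeddings = [row[:] for row in embeddings]
    dim = len(embeddings[0])
    # New rows start at the mean of the existing embeddings (an assumed heuristic).
    mean = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    for tok in new_tokens:
        if tok not in vocab:  # tokens already in the vocabulary keep their ids
            vocab[tok] = len(embeddings)
            embeddings.append(mean[:])
    return vocab, embeddings


if __name__ == "__main__":
    v, e = expand_vocab({"a": 0, "b": 1}, [[0.0, 2.0], [2.0, 4.0]], ["b", "中", "国"])
    print(v)  # {'a': 0, 'b': 1, '中': 2, '国': 3}
```

The new rows carry no language knowledge yet. This is why the projects above follow expansion with further pre-training on Chinese data.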

Efficient Training/Fine-Tuning

| contributor | method | main feature |
| --- | --- | --- |
| microsoft | LoRA | Low-Rank Adaptation (LoRA) freezes the pre-trained model weights and injects trainable rank decomposition matrices<br /> into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. |
| stanford | Prefix Tuning | A lightweight alternative to fine-tuning for natural language generation tasks, which keeps language model parameters frozen<br /> and instead optimizes a sequence of continuous task-specific vectors, called the prefix. |
| THU | P-Tuning | P-tuning uses a few continuous free parameters as prompts fed as input to the pre-trained language model.<br />The continuous prompts are then optimized with gradient descent as an alternative to discrete prompt search. |
| THU, etc. | P-Tuning v2 | A novel empirical finding that properly optimized prompt tuning can be comparable to fine-tuning universally across various model scales and NLU tasks.<br />Technically, P-tuning v2 is not conceptually novel; it can be viewed as an optimized and adapted implementation of Deep Prompt Tuning. |
| Google | Prompt Tuning | A simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks.<br />Prompt Tuning can be seen as a simplification of "prefix tuning". |
| microsoft, etc. | AdaLoRA | Adaptively allocates the parameter budget among weight matrices according to their importance scores.<br /> In particular, AdaLoRA parameterizes the incremental updates in the form of a singular value decomposition. |
| UW | QLoRA | An efficient finetuning approach that reduces memory usage enough to finetune a 65B-parameter model on a single 48GB GPU while preserving<br /> full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA). |
| FDU | LOMO | A new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update into one step to reduce memory usage,<br />enabling full-parameter fine-tuning of a 7B model on a single RTX 3090, or a 65B model on a single machine with 8×RTX 3090, each with 24GB memory. |
| MBZUAI, etc. | GLoRA | Enhancing Low-Rank Adaptation (LoRA), GLoRA employs a generalized prompt module to optimize pre-trained model weights and adjust intermediate activations,<br /> providing more flexibility and capability across diverse tasks and datasets. |
| UMass Lowell | ReLoRA | ReLoRA performs a high-rank update and achieves performance similar to regular neural network training.<br /> The components of ReLoRA include initial full-rank training of the neural network, LoRA training, restarts, a jagged learning rate schedule, and partial optimizer resets. |
| Huawei | QA-LoRA | Equips the original LoRA with two-fold abilities:<br />(i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage;<br /> (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. |
| UMD, etc. | NEFTune | Adds random noise to the embedding vectors of the training data during the forward pass of fine-tuning. This simple trick<br /> can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead. |
| THU | SoRA | Sparse low-rank adaptation (SoRA) enables dynamic adjustment of the intrinsic rank during adaptation. This is achieved by incorporating<br /> a gate unit, optimized with the proximal gradient method during training, that controls the cardinality of the rank under the sparsity of the gate. In the subsequent inference stage,<br /> the parameter blocks corresponding to zeroed-out ranks are eliminated, reducing each SoRA module back to a concise yet rank-optimal LoRA.<br />Experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time. |
| FDU, etc. | O-LoRA | O-LoRA mitigates catastrophic forgetting of past task knowledge by constraining the gradient updates of the current task to be orthogonal to the gradient subspace of the past tasks. |
| TUDB-Labs | mLoRA | m-LoRA (a.k.a. Multi-LoRA Fine-Tune) is an open-source framework for fine-tuning Large Language Models (LLMs) using efficient multiple-LoRA/QLoRA methods. Key features of m-LoRA include:<br />1. Efficient LoRA/QLoRA: optimizes the fine-tuning process, significantly reducing GPU memory usage by sharing a frozen base model.<br />2. Multiple LoRA adapters: supports concurrent fine-tuning of multiple LoRA/QLoRA adapters.<br />3. LoRA-based Mixture-of-Experts: supports MixLoRA, which implements a Mixture-of-Experts architecture based on multiple LoRA adapters for the frozen FFN layer. |
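
Below is a minimal sketch of the LoRA forward pass, `h = W0 x + (alpha / r) * B A x`. `W0` is frozen, and `B` is zero-initialized, so the adapted layer starts out identical to the pretrained one. `A` is supplied by the caller; the LoRA paper uses a random Gaussian init. Only `A` and `B` are trained: `r * (d_in + d_out)` parameters instead of `d_in * d_out`. The class and method names are hypothetical, not the PEFT API.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


class LoRALinear:
    """y = W0 x + (alpha / r) * B (A x); W0 frozen, A (r x d_in) and B (d_out x r) trainable."""

    def __init__(self, w0: Matrix, a: Matrix, alpha: float):
        self.w0 = w0  # frozen pretrained weight, shape (d_out, d_in)
        self.a = a  # trainable down-projection, shape (r, d_in)
        r, d_out = len(a), len(w0)
        self.b = [[0.0] * r for _ in range(d_out)]  # trainable up-projection, zero-initialized
        self.scale = alpha / r

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.w0, x)
        delta = matvec(self.b, matvec(self.a, x))  # low-rank path: d_in -> r -> d_out
        return [h + self.scale * d for h, d in zip(base, delta)]

    def merged_weight(self) -> Matrix:
        """Fold the adapter into W0 for inference: W = W0 + scale * B A."""
        r = len(self.a)
        return [[self.w0[i][j] + self.scale * sum(self.b[i][k] * self.a[k][j] for k in range(r))
                 for j in range(len(self.w0[0]))] for i in range(len(self.w0))]
```

`merged_weight` shows why LoRA need not add inference latency: once training is done, the low-rank update can be folded into `W0`.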

Low-Cost Inference

quantization

| contributor | algorithm | main feature |
| --- | --- | --- |
| UW, etc. | SpQR | A new compressed format and quantization technique which enables, for the first time, near-lossless compression of LLMs across model scales<br /> while reaching compression levels similar to previous methods. |
| THU | Train_Transformers_with_INT4 | For forward propagation, the challenge of outliers is identified and a Hadamard quantizer is proposed to suppress them.<br /> For backpropagation, the structural sparsity of gradients is leveraged through bit splitting and leverage score sampling techniques to quantize gradients accurately. |
| INTEL | neural-compressor | Aims to provide unified APIs for network compression technologies, such as low-precision quantization, sparsity, pruning, and knowledge distillation,<br /> across different deep learning frameworks to pursue optimal inference performance. |
| INTEL | intel-extension-for-transformers | Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on 4th Gen Intel Xeon Scalable processors (Sapphire Rapids). |
| UCB | KVQuant | Per-channel, pre-RoPE key quantization to better match the outlier channels in keys; Non-Uniform Quantization (NUQ) to better represent the non-uniform activations; Dense-and-Sparse Quantization to mitigate the impact of numerical outliers on quantization difficulty; Q-Norm to mitigate distribution shift at ultra-low precisions (e.g., 2-bit). KVQuant enables serving the LLaMA-7B model with 1M context length on a single A100-80GB GPU, or even with 10M context length on an 8-GPU system 🔥 |
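
Most methods in this table refine a simple baseline: symmetric round-to-nearest quantization with one scale per channel. Below is a minimal sketch of that baseline, not SpQR, KVQuant, or any listed library. It assumes each row is one channel. It lacks exactly what those methods add: outlier handling and non-uniform levels.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(w: Matrix, bits: int = 4) -> Tuple[List[List[int]], List[float]]:
    """Symmetric absmax quantization with one scale per row (channel)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    q_rows, scales = [], []
    for row in w:
        absmax = max(abs(v) for v in row)
        scale = absmax / qmax if absmax > 0 else 1.0  # avoid dividing by zero on all-zero rows
        # Python's round() rounds halves to even; the clamp guards against float overshoot.
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q: List[List[int]], scales: List[float]) -> Matrix:
    """Map integer codes back to approximate real values."""
    return [[v * s for v in row] for row, s in zip(q, scales)]


if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
    q, s = quantize_per_channel(w, bits=4)
    print(q)  # integer codes in [-7, 7]
    print(dequantize(q, s))
```

A single large outlier inflates `absmax` and thus the step size for its whole channel. This is why the methods above isolate outliers or use finer-grained or non-uniform scales.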

projects

| contributor | project | main feature |
| --- | --- | --- |
| @ggerganov | llama.cpp | C/C++ implementation of LLaMA and some other models, using quantization. |
| @NouamaneTazi | bloomz.cpp | C++ implementation for BLOOM inference. |
| @mlc-ai | MLC LLM | A universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications,<br />plus a productive framework for everyone to further optimize model performance for their own use cases. |
| alibaba | ChatGLM-MNN | Converts the ChatGLM-6B model to MNN and performs inference using C++. |
| Jittor | JittorLLMs | Significantly reduces hardware costs (by 80%); currently the lowest-cost deployment library known; supports multiple platforms. |
| OpenBMB | BMInf | BMInf supports running models with more than 10 billion parameters on a single NVIDIA GTX 1060 GPU at minimum.<br /> When GPU memory is sufficient for large-model inference (such as on V100 or A100),<br /> BMInf still delivers a significant performance improvement over the existing PyTorch implementation. |
| hpcaitech | EnergonAI | With tensor parallel operations, a pipeline parallel wrapper, distributed checkpoint loading, and customized CUDA kernels,<br /> EnergonAI enables efficient parallel inference for large-scale models. |
| MegEngine | InferLLM | A lightweight LLM inference framework that mainly references and borrows from the llama.cpp project.<br /> llama.cpp puts almost all core code and kernels in a single file and uses a large number of macros, making it difficult for developers to read and modify. |
| @saharNooby | rwkv.cpp | A port of BlinkDL/RWKV-LM to ggerganov/ggml. |
| FMInference | FlexGen | FlexGen is a high-throughput generation engine for running large language models with limited GPU memory.<br /> FlexGen allows high-throughput generation through IO-efficient offloading, compression, and large effective batch sizes. |
| huggingface<br />bigcode-project | starcoder.cpp | C++ implementation for 💫 StarCoder inference using the ggml library. |
| CMU | SpecInfer | SpecInfer is an open-source distributed multi-GPU system that accelerates generative LLM inference with speculative inference and token tree verification.<br /> A key insight behind SpecInfer is to combine various collectively boost-tuned small speculative models (SSMs) to jointly predict the LLM’s outputs. |
| @ztxz16 | fastllm | Full-platform pure C++ LLM acceleration library; supports MOSS, ChatGLM and Baichuan models and runs smoothly on mobile phones. |
| UCB | vllm | A fast and easy-to-use library for LLM inference and serving, made fast by efficient management of attention key and value memory with PagedAttention. |
| stanford | mpt-30B-inference | Run inference on the latest MPT-30B model using your CPU. This inference code uses a ggml quantized model. |
| Shanghai AI Lab | lmdeploy | A toolkit for compressing, deploying, and serving LLMs. |
| @turboderp | ExLlama / ExLlamaV2 | A fast inference library for running LLMs locally on modern consumer-class GPUs. |
| PyTorch | ExecuTorch | End-to-end solution for enabling on-device AI across mobile and edge devices for PyTorch models. |
| Xorbitsai | Xinference | A powerful and versatile library designed to serve language, speech recognition, and multimodal models.<br /> With Xorbits Inference, you can effortlessly deploy and serve your own or state-of-the-art built-in models using just a single command. |
| NVIDIA | TensorRT-LLM | TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain<br /> state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes<br /> that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server, a production-quality system to serve LLMs.<br /> Models built with TensorRT-LLM can be executed on a wide range of configurations, from a single GPU to<br /> multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism). |
| @sabetAI | Batched LoRAs | Maximizes GPU utilization by routing inference through multiple LoRAs in the same batch. |
| huggingface | TGI | Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs,<br /> including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:<br />- Simple launcher to serve most popular LLMs<br />- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)<br />- Tensor Parallelism for faster inference on multiple GPUs<br />- Token streaming using Server-Sent Events (SSE)<br />- Continuous batching of incoming requests for increased total throughput<br />- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures |
| microsoft | DeepSpeed-MII | Under the hood, MII is powered by DeepSpeed-Inference. Based on model type, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of<br /> system optimizations from DeepSpeed-Inference to minimize latency and maximize throughput. It does so by using one of many pre-specified model injection policies that allow MII and<br /> DeepSpeed-Inference to identify the underlying PyTorch model architecture and replace it with an optimized implementation. In doing so, MII makes the expansive set of<br /> optimizations in DeepSpeed-Inference automatically available for thousands of popular models that it supports. |
| flexflow | FlexFlow | A key technique that enables FlexFlow Serve to accelerate LLM serving is speculative inference, which combines various collectively boost-tuned small speculative models (SSMs)<br /> to jointly predict the LLM’s outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences<br /> represented by a token tree is verified against the LLM’s output in parallel using a novel tree-based parallel decoding mechanism. FlexFlow Serve uses an LLM as a token tree verifier instead of<br /> an incremental decoder, which largely reduces the end-to-end inference latency and computational requirement for serving generative LLMs while provably preserving model quality. |
| BentoML | BentoML | An open platform for machine learning in production. It simplifies model packaging and model management, optimizes model serving workloads<br /> to run at production scale, and accelerates the creation, deployment, and monitoring of prediction services. |
| @ModelTC | LightLLM | A Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. |
| @FasterDecoding | Medusa | Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads. |
| UCB, etc. | S-LoRA | S-LoRA stores all adapters in main memory and fetches the adapters used by the currently running queries to GPU memory. To use GPU memory efficiently and reduce fragmentation,<br /> S-LoRA proposes Unified Paging, which uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths.<br /> Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA<br /> to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving),<br /> S-LoRA can improve throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models<br /> and offers the potential for large-scale customized fine-tuning services. |
@lyogavinAirLLMWhen executing at a certain layer, the corresponding layer will be loaded from the hard drive, and the calculation of that layer will be performed. Once the calculation is complete,<br /> the memory of that layer can be completely released. This way, the GPU memory usage will only be approximately the size of one layer of transformer parameters.
UW, etc.PunicaWe present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models.<br /> This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation.<br /> Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models<br /> compared to state-of-the-art LLM serving systems while only adding 2ms latency per token.
alibabaMergeLMIn this work, we uncover that Language Models (LMs), either encoder- or decoder-based, canobtain new capabilities by assimilating the parameters of homologous models without the need for retraining or GPUs.<br />1. We introduce a novel operation called DARE to directly set most of (90% or even 99%) the delta parameters to zeros without affecting the capabilities of SFT LMs.<br />2. We sparsify delta parameters of multiple SFT homologous models with DARE as a general preprocessing technique and subsequently merge them into a single model by parameter averaging.
ETH ZürichUltraFastBERTa BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference.<br /> This is achieved by replacing feedforward networks with fast feedforward networks (FFFs).
UCB, etc.LookaheadDecodingLookahead decoding breaks the sequential dependency in autoregressive decoding by concurrently extracting and verifying n-grams directly with the LLM, utilizing theJacobi iteration method.<br /> Lookahead decoding functions without the need for a draft model or a data store. It linearly decreases the number of decoding steps directly correlating with the log(FLOPs) used per decoding step.
IntelBigDLbigdl-llm is a library for running LLM (large language model) on Intel XPU (from Laptop to GPU to Cloud ) using INT4/FP4/INT8/FP8 with very low latency  (for any PyTorch model).
SenseTime, etc.LightLLM1. Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.<br />2. Token Attention: implements token-wise's KV cache memory management mechanism, allowing for zero memory waste during inference.<br />3. High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
THU, etc.SoTto guide LLMs to generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-ups across 12 LLMs, but it can also potentially improve the answer quality on several question categories.
ollamaollamadocker-style interation of llm inference.
alibabaRTP-LLMAlibaba's high-performance LLM inference engine for diverse applications. The project is mainly based onFasterTransformer, and on this basis, we have integrated some kernel implementations from TensorRT-LLM. FasterTransformer and TensorRT-LLM have provided us with reliable performance guarantees. Flash-Attention2 and cutlass have also provided a lot of help in our continuous performance optimization process. Our continuous batching and increment decoding draw on the implementation of vllm; sampling draws on transformers, with speculative sampling integrating Medusa's implementation, and the multimodal part integrating implementations from llava and qwen-vl.
腾讯一念一念LLM是面向LLM推理和服务的高性能和高易用的推理引擎。<br />高性能和高吞吐:<br />使用极致优化的 CUDA kernels, 包括来自 vllm, TensorRT-LLM, FastTransformer 等工作的高性能算子<br />基于PagedAttention实现地对注意力机制中key和value的高效显存管理<br />对任务调度和显存占用精细调优的动态batching<br />(实验版) 支持前缀缓存(Prefix caching)
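The token-tree verification that FlexFlow Serve relies on can be illustrated with a minimal sketch of greedy verification. It assumes a deterministic toy `target_next` function standing in for the LLM and a nested-dict token tree standing in for the SSMs' merged predictions; neither is FlexFlow's API, and the real system scores every tree node in one parallel forward pass rather than with the sequential walk shown here.

```python
from typing import Callable, Dict, List

# Hypothetical representation: each key is a candidate token, each value the subtree of continuations.
TokenTree = Dict[int, "TokenTree"]


def verify_token_tree(prefix: List[int], tree: TokenTree,
                      target_next: Callable[[List[int]], int]) -> List[int]:
    """Greedy verification: follow the branch the target model agrees with as far as it goes,
    keeping the target's own token at the first mismatch (or past a leaf)."""
    accepted: List[int] = []
    node = tree
    while True:
        # A real verifier obtains these greedy predictions for all tree nodes in a single
        # batched forward pass using a tree-structured attention mask.
        expected = target_next(prefix + accepted)
        accepted.append(expected)
        if expected not in node:      # mismatch, or a leaf was reached
            return accepted
        node = node[expected]


# Toy target model (an assumption): it always predicts "previous token + 1" modulo 10.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


candidates: TokenTree = {2: {3: {4: {}}, 5: {}}, 7: {}}
print(verify_token_tree([1], candidates, toy_target))  # [2, 3, 4, 5]
```

Here one verification step yields four tokens (three drafted, one from the target model) instead of the single token an incremental decoder would produce per step.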
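Batched LoRAs, S-LoRA, and Punica share one observation: a LoRA-adapted layer computes x W + x A B, so requests that use different adapters can share a single copy of the base weight W and differ only in the small low-rank term. A minimal pure-Python sketch under that assumption; the function names, the adapter values, and the per-request loop are illustrative, whereas the real systems batch the base matmul and run the per-adapter part in custom CUDA kernels.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def vecmat(x: Vector, m: Matrix) -> Vector:
    """Row vector times matrix: (d_in,) @ (d_in, d_out) -> (d_out,)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def multi_lora_forward(xs: List[Vector], adapter_ids: List[str], base_w: Matrix,
                       adapters: Dict[str, Tuple[Matrix, Matrix]],
                       scale: float = 1.0) -> List[Vector]:
    """One batch, one shared base weight, a possibly different LoRA adapter per request."""
    outputs = []
    for x, adapter_id in zip(xs, adapter_ids):
        base_out = vecmat(x, base_w)               # uses the single shared copy of W
        a, b = adapters[adapter_id]                # A: (d_in, r), B: (r, d_out)
        delta = vecmat(vecmat(x, a), b)            # low-rank path: (x A) B
        outputs.append([u + scale * v for u, v in zip(base_out, delta)])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "math": ([[1.0], [0.0]], [[1.0, 0.0]]),   # hypothetical rank-1 adapters
    "code": ([[0.0], [1.0]], [[0.0, 2.0]]),
}
print(multi_lora_forward([[2.0, 3.0], [1.0, 1.0]], ["math", "code"], identity, adapters))
# [[4.0, 3.0], [1.0, 3.0]]
```

Merging each adapter into W would give the same numbers, but it would need a separate copy of the weights per adapter, which is exactly what these systems avoid.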
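AirLLM's layer-by-layer execution can be sketched as "load a layer, apply it, release it". This sketch uses pickle files in a temporary directory as the "hard drive" and plain lists as layers; the file layout, the toy ReLU layer, and the function names are assumptions for illustration, not AirLLM's format or API.

```python
import os
import pickle
import tempfile
from typing import List

Matrix = List[List[float]]


def save_layers(layers: List[Matrix], directory: str) -> List[str]:
    """Write each layer's weights to its own file (a stand-in for per-layer checkpoint shards)."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def layerwise_forward(hidden: List[float], paths: List[str]) -> List[float]:
    """Run the model one layer at a time: load a layer, apply it, release it."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)          # only this layer's parameters are resident
        hidden = [max(0.0, sum(h * weights[i][j] for i, h in enumerate(hidden)))
                  for j in range(len(weights[0]))]   # toy layer: ReLU(hidden @ W)
        del weights                           # drop the layer before loading the next one
    return hidden


with tempfile.TemporaryDirectory() as tmp:
    layer_paths = save_layers([[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, -1.0]]], tmp)
    result = layerwise_forward([1.0, 2.0], layer_paths)
print(result)  # [2.0, 0.0]
```

The trade-off is latency: every forward pass rereads all layers from storage, so peak memory stays near one layer's size at the cost of I/O time.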
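DARE (Drop And REscale), as described in the MergeLM row, can be sketched on flat parameter lists: each delta (fine-tuned minus base) is zeroed with probability p and the survivors are rescaled by 1/(1 - p), which keeps the expected delta unchanged; the sparsified deltas are then averaged into the base weights. The function names, the flat-list representation, and the fixed seed are assumptions for illustration, not MergeLM's API.

```python
import random
from typing import List


def dare(delta: List[float], drop_rate: float, rng: random.Random) -> List[float]:
    """Zero each delta with probability drop_rate and rescale the survivors by 1/(1 - drop_rate)."""
    keep = 1.0 - drop_rate                   # assumes 0 <= drop_rate < 1
    return [d / keep if rng.random() >= drop_rate else 0.0 for d in delta]


def merge_with_dare(base: List[float], finetuned: List[List[float]],
                    drop_rate: float = 0.9, seed: int = 0) -> List[float]:
    """Sparsify each model's delta with DARE, then add the averaged deltas back to the base."""
    rng = random.Random(seed)
    deltas = [dare([f - b for f, b in zip(ft, base)], drop_rate, rng) for ft in finetuned]
    return [b + sum(column) / len(deltas) for b, column in zip(base, zip(*deltas))]


print(merge_with_dare([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], drop_rate=0.0))  # [2.0, 3.0]
sparse = dare([1.0] * 10000, drop_rate=0.9, rng=random.Random(1))
print(sum(v != 0.0 for v in sparse) / len(sparse))  # roughly 10% of the deltas survive
print(sum(sparse) / len(sparse))                    # but the mean delta stays close to 1.0
```

With drop_rate=0 the merge reduces to plain parameter averaging, which makes the first print easy to check by hand.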
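The "12 out of 4095" figure in the UltraFastBERT row follows from organizing a layer's neurons as a balanced binary tree: a tree of depth 12 has 2^12 - 1 = 4095 nodes, and inference evaluates only the 12 neurons on one root-to-leaf path. A minimal sketch of that idea; the GELU activation, the "go right if the pre-activation is positive" rule, heap-order storage, and the random weights are assumptions for illustration rather than UltraFastBERT's exact implementation.

```python
import math
import random
from typing import List, Tuple


def fff_forward(x: List[float], w_in: List[List[float]], w_out: List[List[float]],
                depth: int) -> Tuple[List[float], List[int]]:
    """Evaluate only the neurons on one root-to-leaf path of a balanced binary tree.
    Neurons are stored in heap order: the children of node n are 2n + 1 and 2n + 2."""
    y = [0.0] * len(w_out[0])
    node, visited = 0, []
    for _ in range(depth):
        visited.append(node)
        c = sum(a * b for a, b in zip(x, w_in[node]))          # this neuron's pre-activation
        g = 0.5 * c * (1.0 + math.erf(c / math.sqrt(2.0)))      # GELU(c)
        y = [yi + g * ui for yi, ui in zip(y, w_out[node])]    # add its output contribution
        node = 2 * node + (2 if c > 0 else 1)                   # the sign of c picks the child
    return y, visited


depth, d_model = 12, 4
n_neurons = 2 ** depth - 1                                      # 4095 neurons in the layer
rng = random.Random(0)
w_in = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_neurons)]
w_out = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_neurons)]
y, visited = fff_forward([1.0, -0.5, 0.25, 2.0], w_in, w_out, depth)
print(n_neurons, len(visited))                                  # 4095 12
```

12 / 4095 is about 0.3%, matching the fraction of neurons the row says are used per layer.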

Prompt Compression

| contributor | project | main feature |
| --- | --- | --- |
| microsoft | LLMLingua | LLMLingua uses a well-trained small language model after alignment, such as GPT2-small or LLaMA-7B, to detect the unimportant tokens in the prompt and enable inference with the compressed prompt in black-box LLMs, achieving up to 20x compression with minimal performance loss. |

Prompting

Prompt Engineering Guide

| contributor | method | main feature |
| --- | --- | --- |
| Google | CoT | A technique that allows large language models (LLMs) to solve a problem as a series of intermediate steps before giving a final answer. |
| Princeton, etc. | ToT | (Yao et al. (2023) and Long (2023)) ToT maintains a tree of thoughts, where thoughts represent coherent language sequences that serve as intermediate steps toward solving a problem.<br /> This approach enables an LM to self-evaluate, through a deliberate reasoning process, the progress that intermediate thoughts make toward solving a problem. |
| SJTU, etc. | GoT | We propose Graph-of-Thought (GoT) reasoning, which models human thought processes not only as a chain but also as a graph. By representing thought units as nodes<br />and connections between them as edges, our approach captures the non-sequential nature of human thinking and allows for a more realistic modeling of thought processes. |
| Princeton, etc. | ReAct | LLMs are used to generate both reasoning traces and task-specific actions in an interleaved manner. |
| SJTU | Meta-CoT | Meta-CoT is a generalizable CoT prompting method in mixed-task scenarios where the type of input questions is unknown. It consists of three phases:<br /> (i) scenario identification: categorizes the scenario of the input question;<br /> (ii) demonstration selection: fetches the ICL demonstrations for the categorized scenario;<br /> (iii) answer derivation: performs the answer inference by feeding the LLM with the prompt comprising the fetched ICL demonstrations and the input question. |
| UCLA | RaR | We present a method named "Rephrase and Respond" (RaR), which allows LLMs to rephrase and expand questions posed by humans and provide responses in a single prompt.<br />Our experiments demonstrate that our methods significantly improve the performance of different models across a wide range of tasks. |
| CAS, etc. | EmotionPrompt | Our automatic experiments show that LLMs have a grasp of emotional intelligence, and that their performance can be improved with emotional prompts ("EmotionPrompt", which combines the original prompt with emotional stimuli).<br />Our human study results demonstrate that EmotionPrompt significantly boosts the performance of generative tasks. |
| Meta | S2A | S2A regenerates the input context to include only the relevant portions, then attends to the regenerated context to elicit the final response. |
| Google | Step-Back Prompting | We present STEP-BACK PROMPTING, a simple prompting technique that enables LLMs to perform abstraction, deriving high-level concepts and first principles from instances containing specific details. Using the concepts and principles to guide the reasoning steps, LLMs significantly improve their abilities in following a correct reasoning path towards the solution. We conduct experiments of STEP-BACK PROMPTING with PaLM-2L models and observe substantial performance gains on a wide range of challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, STEP-BACK PROMPTING improves PaLM-2L performance on MMLU Physics and Chemistry by 7% and 11%, TimeQA by 27%, and MuSiQue by 7%. |
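The interleaving of reasoning traces and actions in ReAct can be sketched as a loop that alternates LLM steps with tool observations until the model emits a final answer. The `Thought:` / `Action: Tool[argument]` / `Observation:` layout follows the style of the ReAct paper's prompts, but the parser, the scripted stand-in LLM, and the `Lookup` tool below are hypothetical.

```python
from typing import Callable, Dict


def react(question: str, llm: Callable[[str], str],
          tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Alternate LLM steps (a thought plus an action) with tool observations until Finish[...]."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                       # e.g. "Thought: ...\nAction: Tool[argument]"
        transcript += step + "\n"
        action = step.split("Action:")[-1].strip()
        tool_name, _, argument = action.partition("[")
        argument = argument.rstrip("]")
        if tool_name == "Finish":
            return argument
        # Feed the tool result back so the next reasoning step can use it.
        transcript += f"Observation: {tools[tool_name](argument)}\n"
    return "no answer within step budget"


def scripted_llm(transcript: str) -> str:
    """Stand-in for a real LLM call: looks something up once, then answers."""
    if "Observation:" not in transcript:
        return "Thought: I should look up the capital of France.\nAction: Lookup[capital of France]"
    return "Thought: The observation answers the question.\nAction: Finish[Paris]"


facts = {"capital of France": "Paris"}
print(react("What is the capital of France?", scripted_llm,
            {"Lookup": lambda key: facts.get(key, "not found")}))  # Paris
```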

Safety

| contributor | method | main feature |
| --- | --- | --- |
| thu-coai | Safety-Prompts | Chinese safety prompts for evaluating and improving the safety of LLMs. |

Truthfulness

| contributor | method | main feature |
| --- | --- | --- |
| Harvard | ITI | ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads.<br /> This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark.<br /> On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5 to 65.1. |
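The activation shift ITI applies can be written, as we understand the paper, per selected head h as x_h <- x_h + alpha * sigma_h * theta_h, where theta_h is a unit "truthful" direction found by probing, sigma_h is the standard deviation of activations along that direction, and alpha sets the strength. A minimal sketch of only this shift, with head outputs as plain lists; the directions, head selection, and numbers are hypothetical, and the probing step that finds them is not shown.

```python
from typing import Dict, List


def iti_shift(head_outputs: Dict[int, List[float]], directions: Dict[int, List[float]],
              stds: Dict[int, float], alpha: float) -> Dict[int, List[float]]:
    """Shift the outputs of the selected heads along their truthful directions:
    x_h <- x_h + alpha * sigma_h * theta_h; all other heads pass through unchanged."""
    shifted = {}
    for head, activation in head_outputs.items():
        if head in directions:                        # only the top-K selected heads
            theta, sigma = directions[head], stds[head]
            shifted[head] = [x + alpha * sigma * t for x, t in zip(activation, theta)]
        else:
            shifted[head] = list(activation)
    return shifted


outputs = {0: [1.0, 1.0], 1: [2.0, 2.0]}
print(iti_shift(outputs, directions={1: [1.0, 0.0]}, stds={1: 0.5}, alpha=4.0))
# {0: [1.0, 1.0], 1: [4.0, 2.0]}
```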

Exceeding Context Window

https://zhuanlan.zhihu.com/p/670280576

Extending Context Window

| contributor | method | main feature |
| --- | --- | --- |
| UW, etc. | ALiBi | Instead of adding position embeddings at the bottom of the transformer stack,<br /> ALiBi adds a linear bias to each attention score, allowing the model to be trained on,<br /> for example, 1024 tokens, and then do inference on 2048 (or many more) tokens without any finetuning. |
| DeepPavlov, etc. | RMT | Uses a recurrent memory to extend the context length. |
| bytedance | SCM | Unleashes infinite-length input capacity for large-scale language models. |
| Meta | Position Interpolation | Extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models up to 32768 with minimal fine-tuning (within 1000 steps).<br />Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond<br /> the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. |
| UCB | LongChat | Instead of forcing the LLaMA model to adapt to position_ids > 2048, we condense position_ids > 2048 to be within 0 to 2048 (the same mechanism as Position Interpolation, surprisingly!).<br />We observed that our LongChat-13B-16K model reliably retrieves the first topic, with accuracy comparable to gpt-3.5-turbo. |
| microsoft | LongNet | Replaces the attention of vanilla Transformers with a novel component named dilated attention, and successfully scales the sequence length to 1 billion tokens. |
| IDEAS NCBR, etc. | LongLLaMA | LongLLaMA is built upon the foundation of OpenLLaMA and fine-tuned using the Focused Transformer (FoT) method, and is capable of handling long contexts of 256k tokens or even more. |
| Abacus.AI | Giraffe | A range of experiments with different schemes for extending the context length capabilities of Llama. |
| TogetherComputer | Llama-2-7B-32K-Instruct | A long-context chat model fine-tuned from Llama-2-7B-32K on high-quality instruction and chat data. |
| Jianlin Su | ReRoPE | Sets a window of size $w$: the interval between positions inside the window is 1, while the interval outside the window is $\frac{1}{k}$. |
| CUHK/MIT | longlora | An efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. |
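ALiBi's bias is simple enough to write out: each head gets a fixed slope m, and the score between query position i and key position j <= i is penalized by m * (i - j). For a power-of-two number of heads n, the ALiBi paper uses the geometric slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8). A minimal sketch of the bias matrix; the function names are illustrative.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for a power-of-two number of heads."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Causal bias added to the raw score q_i . k_j: -slope * (i - j) for j <= i, -inf for j > i."""
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]


print(alibi_slopes(8))         # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
print(alibi_bias(3, 0.5)[2])   # [-1.0, -0.5, -0.0]
```

Because the penalty depends only on the distance i - j, nothing in it is tied to the training length, which is what lets a model trained on 1024 tokens run on longer inputs.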
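Position Interpolation (and LongChat's condensing, which uses the same mechanism) changes only the position index fed into RoPE: with a model trained on L positions and a target window of L' > L, position m is replaced by m * L / L'. A minimal sketch, assuming the standard RoPE frequencies base^(-2i/d) with base 10000; the function names are illustrative.

```python
from typing import List


def rope_angles(position: float, dim: int, base: float = 10000.0) -> List[float]:
    """RoPE rotation angles position * base^(-2i/dim), one per pair of hidden dimensions."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]


def interpolate_position(position: int, trained_len: int, target_len: int) -> float:
    """Position Interpolation: scale indices by trained_len / target_len so that every
    position of the longer window lands inside the range seen during pre-training."""
    return position * trained_len / target_len


# Extending a 2048-token model to 8192 tokens: position 8000 is rotated like position 2000 was.
p = interpolate_position(8000, trained_len=2048, target_len=8192)
print(p)                                                   # 2000.0
print(rope_angles(p, dim=8) == rope_angles(2000, dim=8))  # True
```

Interpolated positions become fractional (for example 0.25, 0.5, ...), which is why a short fine-tuning run is still needed for the model to adapt to them.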
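The ReRoPE row can be read as a mapping of the query-key distance d with window w: distances below w are kept as-is, and beyond the window each extra step counts only 1/k; letting k go to infinity clamps every distance at w. A minimal sketch of this mapping only (not of how it is wired into attention); the function name and the choice of a strict `d < w` boundary are assumptions.

```python
import math


def rerope_relative_position(distance: int, window: int, k: float) -> float:
    """Step 1 inside the window, step 1/k outside it.
    k = math.inf collapses every distance >= window to exactly `window`."""
    if distance < window:
        return float(distance)
    if math.isinf(k):
        return float(window)
    return window + (distance - window) / k


print([rerope_relative_position(d, window=4, k=2) for d in range(8)])
# [0.0, 1.0, 2.0, 3.0, 4.0, 4.5, 5.0, 5.5]
print(rerope_relative_position(100, window=4, k=math.inf))  # 4.0
```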

Without Extending Context Window

| contributor | method | main feature |
| --- | --- | --- |
| MIT/Meta/CMU | StreamingLLM/<br />SwiftInfer | Deploys LLMs for infinite-length inputs without sacrificing efficiency and performance.<br />An efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning.<br /> We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.<br /> In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment.<br /> In streaming settings, StreamingLLM achieves up to a 22.2x speedup over the sliding window recomputation baseline.<br />SwiftInfer: implements StreamingLLM based on TensorRT-LLM. |
| UCB | Ring Attention | We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while overlapping the communication of<br /> key-value blocks with the computation of blockwise attention. Ring Attention enables training and inference of sequences that are up to device count times longer than those of prior memory-efficient Transformers,<br /> effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in allowing large sequence input size<br /> and improving performance. |
| UCB | MemGPT | A system that intelligently manages different memory tiers in LLMs in order to effectively provide extended context within the LLM's limited context window.<br /> For example, MemGPT knows when to push critical information to a vector database and when to retrieve it later in the chat, enabling perpetual conversations. |
| FDU, etc. | ScalingRoPE | We first observe that fine-tuning a RoPE-based LLM with either a smaller or larger base in pre-training context length could significantly enhance its extrapolation performance.<br /> After that, we propose Scaling Laws of RoPE-based Extrapolation, a unified framework from the periodic perspective,<br /> to describe the relationship between the extrapolation performance and base value as well as tuning context length. |
| THU | InfLLM | Training-free and memory-based: InfLLM stores distant contexts in additional memory units and employs an efficient mechanism to look up token-relevant units for attention computation. Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies. |
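The core of StreamingLLM is a cache-eviction policy: keep the key/value states of the first few tokens, which act as attention sinks, together with a sliding window of recent tokens, so memory stays bounded however long the stream gets. A minimal sketch of that policy only; the class name and default sizes are assumptions, the strings stand in for KV tensors, and the paper's handling of positional information inside the cache is not shown.

```python
from collections import deque
from typing import Deque, List, Tuple


class SinkKVCache:
    """KV cache that keeps the first `n_sink` entries (attention sinks) plus a rolling window
    of the most recent entries; everything in between is evicted."""

    def __init__(self, n_sink: int = 4, window: int = 8) -> None:
        self.n_sink = n_sink
        self.sinks: List[Tuple[int, str]] = []
        self.recent: Deque[Tuple[int, str]] = deque(maxlen=window)  # oldest entry falls out

    def append(self, position: int, kv: str) -> None:
        if len(self.sinks) < self.n_sink:
            self.sinks.append((position, kv))
        else:
            self.recent.append((position, kv))

    def positions(self) -> List[int]:
        return [p for p, _ in self.sinks] + [p for p, _ in self.recent]


cache = SinkKVCache(n_sink=4, window=8)
for t in range(20):
    cache.append(t, f"kv{t}")  # placeholder for the key/value tensors of token t
print(cache.positions())  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

A plain sliding window would drop positions 0-3 as well; keeping them is what distinguishes this policy.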
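The blockwise computation behind Ring Attention relies on the fact that softmax attention can be accumulated one key-value block at a time using a running maximum and normalizer, so the order in which KV blocks arrive at a device does not matter. A single-process simulation under simplifying assumptions (non-causal attention, one head, no real communication and no compute/communication overlap); the function names are illustrative, not the Ring Attention codebase.

```python
import math
import random
from typing import List, Tuple

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def full_attention(q: List[Vec], k: List[Vec], v: List[Vec]) -> List[Vec]:
    """Reference: ordinary non-causal softmax attention over the whole sequence."""
    out = []
    for qi in q:
        scores = [dot(qi, kj) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z for d in range(len(v[0]))])
    return out


def ring_attention(q: List[Vec], k: List[Vec], v: List[Vec], n_devices: int) -> List[Vec]:
    """Device i keeps query block i while KV blocks travel around the ring; each device
    folds every visiting block into running softmax statistics."""
    size = len(q) // n_devices                      # assumes len(q) % n_devices == 0

    def split(xs: List[Vec]) -> List[List[Vec]]:
        return [xs[i * size:(i + 1) * size] for i in range(n_devices)]

    q_blocks, kv_blocks = split(q), list(zip(split(k), split(v)))
    # Per query: (running max, running normalizer, unnormalized weighted sum of values).
    state: List[List[Tuple[float, float, Vec]]] = [
        [(-math.inf, 0.0, [0.0] * len(v[0])) for _ in blk] for blk in q_blocks]
    for step in range(n_devices):                   # after n_devices steps every block was seen
        for dev in range(n_devices):                # devices run concurrently in the real system
            k_blk, v_blk = kv_blocks[(dev + step) % n_devices]
            for qi, query in enumerate(q_blocks[dev]):
                m, z, acc = state[dev][qi]
                for kj, vj in zip(k_blk, v_blk):
                    s = dot(query, kj)
                    m_new = max(m, s)
                    rescale = math.exp(m - m_new)   # re-base old statistics on the new max
                    p = math.exp(s - m_new)
                    z = z * rescale + p
                    acc = [a * rescale + p * x for a, x in zip(acc, vj)]
                    m = m_new
                state[dev][qi] = (m, z, acc)
    return [[a / z for a in acc] for blk in state for (_, z, acc) in blk]


rng = random.Random(0)
q, k, v = ([[rng.gauss(0, 1) for _ in range(2)] for _ in range(8)] for _ in range(3))
ring = ring_attention(q, k, v, n_devices=4)
ref = full_attention(q, k, v)
print(max(abs(a - b) for r1, r2 in zip(ring, ref) for a, b in zip(r1, r2)) < 1e-9)  # True
```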

Knowledge Editing

Must-read Papers on Model Editing: ModelEditingPapers

| contributor | method | main feature |
| --- | --- | --- |
| MIT, etc. | ROME | First, we trace the causal effects of hidden-state activations within GPT using causal mediation analysis to identify the specific modules that mediate recall of a fact about a subject.<br /> Our analysis reveals that feedforward MLPs at a range of middle layers are decisive when processing the last token of the subject name.<br />Second, we test this finding in model weights by introducing a Rank-One Model Editing method (ROME) to alter the parameters that determine a feedforward layer's behavior at the decisive token.<br />Despite the simplicity of the intervention, we find that ROME is similarly effective to other model-editing approaches on a standard zero-shot relation extraction benchmark. |
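ROME's edit is a closed-form rank-one update. Treating the MLP weight W as a map from keys to values and C as the covariance of keys seen by that layer, the update that makes the layer map a chosen key k* to a new value v* is W' = W + (v* - W k*) u^T / (u^T k*) with u = C^{-1} k*. A minimal sketch that assumes k*, v*, and C^{-1} are already given (in ROME they are estimated separately); all numbers below are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vec = List[float]


def matvec(m: Matrix, x: Vec) -> Vec:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def rome_rank_one_update(w: Matrix, k_star: Vec, v_star: Vec, c_inv: Matrix) -> Matrix:
    """Return W' = W + (v* - W k*) u^T / (u^T k*) with u = C^{-1} k*, so that W' k* = v*."""
    u = matvec(c_inv, k_star)
    denom = sum(a * b for a, b in zip(u, k_star))
    residual = [vs - wk for vs, wk in zip(v_star, matvec(w, k_star))]
    return [[w[i][j] + residual[i] * u[j] / denom for j in range(len(k_star))]
            for i in range(len(w))]


w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # maps 2-d keys to 3-d values
c_inv = [[0.5, 0.0], [0.0, 1.0]]                # hypothetical inverse key covariance
k_star, v_star = [1.0, 2.0], [0.0, 5.0, -1.0]   # hypothetical key/value for the edited fact
w_new = rome_rank_one_update(w, k_star, v_star, c_inv)
print(all(abs(a - b) < 1e-9 for a, b in zip(matvec(w_new, k_star), v_star)))  # True
# A key with u^T k = 0 (here [4, -1], since u = [0.5, 2]) is mapped exactly as before.
print(all(abs(a - b) < 1e-9 for a, b in zip(matvec(w_new, [4.0, -1.0]), matvec(w, [4.0, -1.0]))))  # True
```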

Implementations

| contributor | project | main feature |
| --- | --- | --- |
| PKU | FastEdit | Injects fresh and customized knowledge into large language models efficiently using a single command. |
| ZJU | EasyEdit | A Python package for editing Large Language Models (LLMs) such as GPT-J, Llama, GPT-NEO, GPT2, and T5 (supporting models from 1B to 65B),<br />whose objective is to alter the behavior of LLMs efficiently within a specific domain without negatively impacting performance across other inputs. It is designed to be easy to use and easy to extend. |

External Knowledge

Allowing the model to access external knowledge, such as the internet, KGs, and databases.

| contributor | project | main feature |
| --- | --- | --- |
| @jerryjliu | LlamaIndex | Provides a central interface to connect your LLMs with external data. |
| @imClumsyPanda | langchain-ChatGLM | Local-knowledge-based ChatGLM with langchain. |
| @wenda-LLM | wenda | An LLM calling platform designed to find and design automatic execution actions for small-model plug-in<br /> knowledge bases, to achieve the same generation ability as large models. |
| @csunny | DB-GPT | Builds a complete private large-model solution for all database-based scenarios. |
| THU, BAAI, ZJU | ChatDB | A novel framework integrating symbolic memory with LLMs. ChatDB explores ways of augmenting LLMs with symbolic memory to handle contexts of arbitrary lengths.<br /> Such a symbolic memory framework is instantiated as an LLM with a set of SQL databases. The LLM generates SQL instructions to manipulate the SQL databases<br /> autonomously (including insertion, selection, update, and deletion), aiming to complete a complex task requiring multi-hop reasoning and long-term symbolic memory. |
| IDEA | Ziya-Reader | "Ziya-Reader-13B-v1.0" is a knowledge question-answering model. Given questions and knowledge documents, it can answer the questions accurately,<br /> and is suitable for both multi-document and single-document question answering. The model has an 8k context window, yet compared with models with longer windows,<br /> it comes out ahead in evaluations across multiple long-text tasks, including multi-document question answering, synthetic tasks (document retrieval), and long-text summarization.<br />Additionally, the model demonstrates excellent generalization, so it can also be used for general question answering.<br /> Its performance on our general-ability evaluation set surpassed that of Ziya-Llama-13B. |
| docker | GenAI Stack | Significantly simplifies the entire process by integrating Docker with the Neo4j graph database, LangChain model-linking technology, and Ollama for running large language models (LLMs). |
| UW, etc. | Self-RAG | Unlike the widely adopted Retrieval-Augmented Generation (RAG) approach, Self-RAG retrieves on demand (e.g., it can retrieve multiple times or skip retrieval entirely) given diverse queries,<br /> and critiques its own generation from multiple fine-grained aspects by predicting reflection tokens as an integral part of generation. |
| RUC | StructGPT | Inspired by studies on tool augmentation for LLMs, we develop an Iterative Reading-then-Reasoning (IRR) framework, called StructGPT, to solve question answering tasks based on structured data.<br /> In this framework, we construct specialized interfaces to collect relevant evidence from structured data (i.e., reading), and let LLMs concentrate on the reasoning task based on the collected information (i.e., reasoning).<br /> Specifically, we propose an invoking-linearization-generation procedure to support LLMs in reasoning on the structured data with the help of the interfaces. By iterating this procedure with the provided interfaces,<br /> our approach can gradually approach the target answers to a given query. Experiments conducted on three types of structured data show that StructGPT greatly improves the performance of LLMs<br /> under the few-shot and zero-shot settings. |
| BUPT | ChatKBQA | A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned LLMs. |
| ZJU | KnowPAT | Knowledgeable Preference AlignmenT (KnowPAT) is a new pipeline to align LLMs with human knowledge preferences.<br /> KnowPAT incorporates domain knowledge graphs to construct a preference set and designs a new alignment objective to fine-tune the LLMs. |
| NetEase | QAnything | A local knowledge-base question-answering system designed to support a wide range of file formats and databases, allowing for offline installation and use. With QAnything, you can simply drop in any locally stored file of any format and receive accurate, fast, and reliable answers. Currently supported formats include PDF (pdf), Word (docx), PPT (pptx), XLS (xlsx), Markdown (md), Email (eml), TXT (txt), Image (jpg, jpeg, png), CSV (csv), Web links (html), and more formats coming soon. |
| InfiniFlow | RAGFlow | An open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. |
| dify.ai | Dify | Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production. |
| sealos.io | FastGPT | FastGPT is a knowledge-based Q&A system built on LLMs. It offers out-of-the-box data processing and model invocation capabilities and allows workflow orchestration through Flow visualization. |
| vanna-ai | vanna | Uses RAG to convert text to SQL with LLMs. |

AI Search Engines

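| contributor | project | main feature |
| --- | --- | --- |
| LeptonAI | Search with Lepton | AI search demo; no API key required |
| @rashadphz | Farfalle | Open-source Perplexity clone; requires a search API key and an OpenAI key (demo) |
| @developersdigest | llm-answer-engine | Open-source Perplexity clone; requires a search API key |
| @ItzCrazyKns | Perplexica | Open-source Perplexity clone; no API key required |
| @miurla | Morphic | Open-source Perplexity clone; no API key required (demo) |
| @nilsherzig | LLocalSearch | AI search; no API key required |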

Chat with Docs

| contributor | project | main feature |
| --- | --- | --- |
| @arc53 | DocsGPT | GPT-powered chat for documentation; chat with your documents |

more at: funNLP

Content Parsing

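| contributor | project | main feature |
| --- | --- | --- |
| alibaba | OmniParser | Designs a universal model, OmniParser, that can simultaneously handle three typical visually-situated text parsing tasks: text recognition, key information extraction, and table recognition. In OmniParser, all tasks share a unified encoder-decoder architecture, a unified objective (conditional text generation), and unified input and output representations (prompts and structured sequences). Extensive experiments show that OmniParser achieves state-of-the-art (SOTA) or highly competitive performance on 7 datasets for the three visually-situated text parsing tasks. |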

Vector Database

| contributor | db | main feature |
| --- | --- | --- |
| milvus-io | milvus | A cloud-native vector database with storage and computation separated by design. |
| Meta | faiss | It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. |
| nmslib | hnswlib | Header-only C++ HNSW implementation with Python bindings, insertions, and updates. |
| MyScale | MyScaleDB | An open-source, high-performance SQL vector database built on ClickHouse. |
| chroma | Chroma | The AI-native open-source embedding database. |
| Weaviate | Weaviate | Stores both objects and vectors, allowing vector search to be combined with structured filtering, with the fault tolerance and scalability of a cloud-native database. |
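Every database in this table answers the same basic query: given a query embedding, return the k stored vectors most similar to it. The exact, brute-force version of that query fits in a few lines; what the listed systems add on top (approximate indexes such as HNSW, persistence, filtering, scale) is what makes it practical at size. The class below is a hypothetical illustration, not the API of any listed system, and it does not guard against zero vectors.

```python
import heapq
import math
from typing import Dict, List, Tuple


class FlatCosineIndex:
    """Exact (brute-force) nearest-neighbour search by cosine similarity."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}

    @staticmethod
    def _normalize(v: List[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def add(self, doc_id: str, vector: List[float]) -> None:
        # Store unit vectors so that a dot product equals cosine similarity.
        self.vectors[doc_id] = self._normalize(vector)

    def search(self, query: List[float], k: int) -> List[Tuple[str, float]]:
        q = self._normalize(query)
        scored = ((doc_id, sum(a * b for a, b in zip(q, v))) for doc_id, v in self.vectors.items())
        return heapq.nlargest(k, scored, key=lambda item: item[1])   # top-k, best first


index = FlatCosineIndex()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [1.0, 1.0])
print([doc_id for doc_id, _ in index.search([1.0, 0.1], k=2)])  # ['a', 'c']
```

Each query scans every stored vector, so cost grows linearly with the collection; that cost is what approximate indexes trade a little recall to avoid.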

External Tools

Using Existing Tools

Allowing the model to access external tools, such as search engines and APIs.

| contributor | project | base model | main feature |
| --- | --- | --- | --- |
| UCB/microsoft | Gorilla | LLaMA | Invokes 1,600+ (and growing) API calls accurately while reducing hallucination. |
| THU | ToolLLaMA | LLaMA | This project aims to construct open-source, large-scale, high-quality instruction-tuning SFT data to facilitate the construction<br /> of powerful LLMs with general tool-use capability. We provide the dataset, the corresponding training and evaluation scripts,<br /> and a capable model, ToolLLaMA, fine-tuned on ToolBench. |

Make New Tools

| contributor | project | main feature |
| --- | --- | --- |
| Google, etc. | LATM | LLMs create their own reusable tools for problem-solving. |

Agent

| contributor | project | main feature |
| --- | --- | --- |
| @Significant-Gravitas | Auto-GPT | Chains together LLM "thoughts" to autonomously achieve whatever goal you set. |
| @yoheinakajima | BabyAGI | The main idea behind this system is that it creates tasks based on the result of previous tasks and a predefined objective.<br />The script then uses OpenAI's natural language processing (NLP) capabilities to create new tasks based on the objective,<br />and Chroma/Weaviate to store and retrieve task results for context. |
| microsoft | HuggingGPT | Language serves as an interface for LLMs to connect numerous AI models for solving complicated AI tasks! |
| microsoft/NCSU | ReWOO | Detaches the reasoning process from external observations, significantly reducing token consumption. |
| Stanford | generative_agents | Generative Agents: Interactive Simulacra of Human Behavior. |
| THU, etc. | AgentVerse | 🤖 AgentVerse 🪐 provides a flexible framework that simplifies the process of building custom multi-agent environments for large language models (LLMs). |
| BUAA, etc. | TrafficGPT | By seamlessly intertwining large language models and traffic expertise, TrafficGPT not only advances traffic<br />management but also offers a novel approach to leveraging AI capabilities in this domain. |
| microsoft, etc. | ToRA | ToRA is a series of Tool-integrated Reasoning LLM Agents designed to solve challenging mathematical reasoning problems by interacting with tools. |
| HKU | OpenAgents | An open platform for using and hosting language agents in the wild of everyday life. |
| THU | XAgent | An open-source experimental Large Language Model (LLM) driven autonomous agent that can automatically solve various tasks.<br /> It is designed to be a general-purpose agent that can be applied to a wide range of tasks. |
| Nvidia, etc. | Eureka | A human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement<br /> capabilities of state-of-the-art LLMs, such as GPT-4, to perform in-context evolutionary optimization over reward code. The resulting rewards can then be used to<br /> acquire complex skills via reinforcement learning. Eureka generates reward functions that outperform expert human-engineered rewards without any task-specific<br /> prompting or pre-defined reward templates. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies,<br /> Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. |
| THU | AgentTuning | AgentTuning represents the very first attempt to instruction-tune LLMs using interaction trajectories across multiple agent tasks.<br /> Evaluation results indicate that AgentTuning enables the agent capabilities of LLMs with robust generalization on unseen agent tasks while remaining good on general language abilities. |
| microsoft | AutoGen | AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are<br /> customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools. |
| PKU | RestGPT | We connect LLMs with RESTful APIs and tackle the practical challenges of planning, API calling, and response parsing. To fully evaluate the performance of RestGPT, we propose RestBench,<br /> a high-quality benchmark which consists of two real-world scenarios and human-annotated instructions with gold solution paths.<br />RestGPT adopts an iterative coarse-to-fine online planning framework and uses an executor to call RESTful APIs. |
| microsoft | MusicAgent | A music domain agent powered by large language models (LLMs). Its goal is to help developers and non-professional music creators automatically analyze user requests and select appropriate tools to solve the problem. |
| HW, etc. | LEGO-Prover | The first automated theorem prover powered by an LLM that constructs the proof block by block. |
| alibaba | ModelScope-Agent | An agent framework connecting models in ModelScope with the world. |
| CMU, etc. | RoboGen | A generative and self-guided robotic agent that endlessly proposes and masters new skills. |
| PKU, etc. | LLaMA-Rider | An LLM training framework that enables LLMs to autonomously explore open worlds based on environmental feedback and their own abilities, and to efficiently learn from collected experiences. In the Minecraft environment,<br /> it has demonstrated better multitasking capabilities than other methods, including ChatGPT task planners. This adaptability to open worlds has been a major achievement for LLMs.<br /> Additionally, LLaMA-Rider's ability to use past task experiences to solve new tasks demonstrates the potential of this method for lifelong exploration and learning in large models. |
| IDEA, etc. | ToG | Think-on-Graph (ToG), in which the LLM agent iteratively executes beam search on a KG, discovers the most promising reasoning paths, and returns the most likely reasoning results. |
| Yale, etc. | ToolkenGPT | Represents each tool as a token (toolken) and learns an embedding for it, enabling tool calls in the same way as generating a regular word token. Once a toolken is triggered, the LLM is prompted to complete arguments for the tool to execute. |
| tencent | AppAgent | Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. |
| Stanford, etc. | Meta-Prompting | This approach transforms a single LM into a multi-faceted conductor, adept at managing and integrating multiple independent LM queries. By employing high-level instructions, meta-prompting guides the LM to break down complex tasks into smaller, more manageable subtasks. These subtasks are then handled by distinct "expert" instances of the same LM, each operating under specific, tailored instructions. |
| tencent | More-Agents-Is-All-You-Need | We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method is orthogonal to existing complicated methods to further enhance LLMs, while the degree of enhancement is correlated to the task difficulty. |
| Pythagora-io | gpt-pilot | GPT Pilot aims to research how much LLMs can be utilized to generate fully working, production-ready apps while the developer oversees the implementation. |
| DeepWisdom, etc. | MetaGPT | A Multi-Agent Framework: First AI Software Company, Towards Natural Language Programming. |
| OpenBMB | XAgent | An Autonomous LLM Agent for Complex Task Solving. |
| CrewAI | crewAI | Cutting-edge framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks. |
| stition.ai | devika | An agentic AI software engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve the given objective. Devika aims to be a competitive open-source alternative to Devin by Cognition AI. |
| OpenDevin | OpenDevin | An open-source project aiming to replicate Devin, an autonomous AI software engineer capable of executing complex engineering tasks and collaborating actively with users on software development projects. This project aspires to replicate, enhance, and innovate upon Devin through the power of the open-source community. |
| alibaba | AgentScope | Combining rich syntactic tools, built-in resources, and user-friendly interactions, AgentScope's communication mechanism significantly lowers the barriers to development and understanding. To enable robust and flexible multi-agent applications, AgentScope provides built-in and customizable fault-tolerance mechanisms, along with system-level support for generating, storing, and transmitting multimodal data. In addition, an actor-based distributed framework enables easy conversion between local and distributed deployment, as well as automatic parallel optimization, without extra effort. With these features, AgentScope empowers developers to build applications that fully realize the potential of intelligent agents. |
| @langchain-ai | langgraph | Builds AI agent applications through graph-based orchestration. |
| @Maplemx | Agently | Easy to use; helps developers quickly build AI agent applications. |
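The sampling-and-voting method behind More-Agents-Is-All-You-Need can be sketched as: query the same model several times and return the most frequent answer. The toy agent below is hypothetical (right 40% of the time, otherwise scattered over five wrong answers) and only illustrates why accuracy tends to rise with the number of agents when correct answers agree with each other more often than wrong ones do; it makes no claim about the paper's numbers.

```python
import random
from collections import Counter
from typing import Callable, List


def sample_and_vote(agent: Callable[[random.Random], str], n_agents: int,
                    rng: random.Random) -> str:
    """Query n_agents independent samples and return the most frequent answer."""
    answers: List[str] = [agent(rng) for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]


def noisy_agent(rng: random.Random) -> str:
    """Hypothetical agent: right 40% of the time, otherwise one of five wrong answers."""
    return "42" if rng.random() < 0.4 else rng.choice(["41", "43", "44", "45", "46"])


def accuracy(n_agents: int, trials: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = sum(sample_and_vote(noisy_agent, n_agents, rng) == "42" for _ in range(trials))
    return hits / trials


for n in (1, 5, 15):
    print(n, accuracy(n))
```

Voting by exact match assumes answers can be compared directly; open-ended outputs would need a similarity measure instead.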

Paper list: LLM-Agent-Paper-List

Papers / Repos / Blogs / ... : Awesome LLM-Powered Agent

LLMs as XXX

| contributor | LLM as | repo | main feature |
| --- | --- | --- | --- |
| Google DeepMind | optimizer | OPRO | Optimization by PROmpting (OPRO), a simple and effective approach to using LLMs<br />as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from a prompt that<br />contains previously generated solutions with their values; the new solutions are then evaluated and added to the prompt for the next optimization step. |
| HKU, etc. | part of graph tasks | Awesome-LLMs-in-Graph-tasks | A curated collection of research papers exploring the utilization of LLMs for graph-related tasks. |

Similar Collections

collections of open instruction-following llms
Collection of open-source fine-tuned large language models (LLMs)
机器之心 (Synced) SOTA! Models
Awesome Totally Open Chatgpt
LLM-Zoo
Awesome-LLM
🤗 Open LLM Leaderboard
Open LLMs
Awesome-Chinese-LLM
Awesome Pretrained Chinese NLP Models
LLMSurvey