
:speaking_head: [ 中文 | English ]

<p align="center"> <br> <img src="https://github.com/zjunlp/KnowLM/blob/main/assets/KnowLM.png?raw=true" width="400" height="120"/> <br> </p>

Knowledgeable Large Language Model Framework

KnowLM is a knowledgeable Large Language Model (LLM) framework that covers data processing, model pre-training, fine-tuning, and knowledge-based augmentation and utilization. KnowLM also provides a model zoo of readily accessible models, such as ZhiXi and OneKE, that are ready for immediate use.

Features

All weights and datasets have been uploaded to HuggingFace🤗. Click here to get started right away!

If you encounter any issues during the installation or use of KnowLM, please check FAQ or promptly submit an issue, and we will assist you with resolving the problem!

| Category | Base | Name | Version | Download Link | Note |
| --- | --- | --- | --- | --- | --- |
| Base Model | LLaMA1 | KnowLM-13B-Base | V1.0 | HuggingFace <br/> WiseModel <br/> ModelScope | Base Model |
| Dialogue Model | LLaMA1 | KnowLM-13B-ZhiXi | V1.0 | HuggingFace <br/> WiseModel <br/> ModelScope | Information Extraction Model |
| Dialogue Model | LLaMA1 | KnowLM-13B-IE | V1.0 | HuggingFace <br/> WiseModel <br/> ModelScope | Information Extraction Model |
| Dialogue Model | LLaMA2 | OceanGPT | V1.0 | HuggingFace <br/> WiseModel | Ocean Model |
| Dialogue Model | LLaMA2 | OneKE | V1.0 | HuggingFace <br/> WiseModel <br/> ModelScope | Information Extraction Model |
| Instruction Dataset Name | Number | Download Link | Note |
| --- | --- | --- | --- |
| KnowLM-CR (CoT & Reasoning, Chinese and English) | 202,333 | Google Drive <br/> HuggingFace | |
| KnowLM-Tool (Tool Learning, English) | 38,241 | Google Drive <br/> HuggingFace | |
| OceanBench (Benchmark, English) | 11,000 | HuggingFace | |
| InstructIE (Information Extraction, Chinese and English) | 364,076 | HuggingFace <br/> WiseModel <br/> ModelScope | Built with distant supervision, so the data contains some noise. |
| IEPile (Information Extraction, Chinese and English) | 2,000,000+ | HuggingFace <br/> WiseModel <br/> ModelScope | Constructed from 33 existing IE datasets. |

Data description: 1. Other information extraction data sources include CoNLL, ACE, CASIS, DuEE, People's Daily, DuIE, etc. 2. The KnowLM-Tool dataset comes from the paper "Making Language Models Better Tool Learners with Execution Feedback"; its GitHub repository can be found here. 3. The InstructIE dataset comes from the paper "InstructIE: A Chinese Instruction-based Information Extraction Dataset"; its GitHub repository can be found here.

📬 NEWS

📍 Technologies in KnowLM

<p align="center"> <br> <img src="https://github.com/zjunlp/KnowLM/blob/main/assets/KnowLM-overview.png?raw=true" width="920" height="400"/> <br> </p>

This is an overview of KnowLM, which mainly consists of three technical features:

Knowledge Prompting: It generates knowledge prompts based on structured data such as knowledge graphs and utilizes knowledge augmentation constraints to address knowledge extraction and reasoning issues.

Knowledge Editing: It aligns outdated, incorrect, and biased knowledge within large models using knowledge editing techniques to tackle knowledge fallacy problems (English Tutorial).

Knowledge Interaction: It enables dynamic knowledge interaction and feedback to achieve tool-based learning and multi-agent collaboration, addressing the problem of embodied cognition in LLMs (English Tutorial).

The modules related to these three technologies are EasyInstruct, EasyDetect, and EasyEdit. We provide use cases for these modules based on the KnowLM framework.

🗂️ Contents

All Thanks To Our Contributors:

<a href="https://github.com/zjunlp/KnowLM/graphs/contributors"> <img src="https://contrib.rocks/image?repo=zjunlp/KnowLM" /> </a> <h2 id="1">🚴1. Quick Start</h2> <h3 id="1-1">🛠️1.1 Environment Configuration</h3>

KnowLM supports both manual environment configuration and pre-built Docker images; choose whichever way suits your setup.

🔧Manual Environment Configuration

git clone https://github.com/zjunlp/KnowLM.git
cd KnowLM
conda create -n knowlm python=3.9 -y
conda activate knowlm
pip install torch==1.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

🐳Building With Docker Images

docker pull zjunlp/knowlm:v.1
docker run -it zjunlp/knowlm:v.1 /bin/bash
<h3 id="1-2">💻1.2 Model Usage Guide</h3>

1. Reproduce the results in Section 2

The cases in Section 2 were all run on a V100 GPU. If you run them on other devices, the results may vary; please run multiple times or change the decoding parameters. We trained knowlm-13b-zhixi and knowlm-13b-ie with LoRA on top of knowlm-13b-base; the released knowlm-13b-zhixi and knowlm-13b-ie weights are the result of merging the trained LoRA weights into the knowlm-13b-base parameters.
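
As background, the sketch below shows how a LoRA adapter can be merged into a base model with the peft library. It is an illustration only: the adapter and output paths are placeholders, and the repository ships its own merging script.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

# Load the base model and tokenizer (fp16 to reduce memory usage).
base = LlamaForCausalLM.from_pretrained("zjunlp/knowlm-13b-base-v1.0", torch_dtype=torch.float16)
tokenizer = LlamaTokenizer.from_pretrained("zjunlp/knowlm-13b-base-v1.0")

# Attach the trained LoRA adapter, then fold its weights into the base parameters.
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
merged = model.merge_and_unload()                                # plain LlamaForCausalLM with merged weights
merged.save_pretrained("path/to/merged-model")                   # placeholder path
tokenizer.save_pretrained("path/to/merged-model")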

  1. If you want to reproduce the results in Section 2.1 (pretraining cases), please run the following command:

    python examples/generate_finetune.py --base_model zjunlp/knowlm-13b-base-v1.0
    

    The result in section 2.1 can be obtained.

  2. If you want to reproduce the results in Section 2.2 (information extraction cases), please run the following command:

    python examples/generate_lora.py --base_model zjunlp/knowlm-13b-zhixi --run_ie_cases
    

    The result in section 2.2 can be obtained.

  3. If you want to reproduce the results in Section 2.3 (general abilities cases), please run the following command:

    python examples/generate_lora.py --base_model zjunlp/knowlm-13b-zhixi --run_general_cases
    

    The result in section 2.3 can be obtained.

2. Usage of the Pretrained Model

We offer two methods: the first one is command-line interaction, and the second one is web-based interaction, which provides greater flexibility.

  1. Use the following command to enter command-line interaction:

    python examples/generate_finetune.py --base_model zjunlp/knowlm-13b-base-v1.0 --interactive
    

    The disadvantage is the inability to dynamically change decoding parameters.

    If a single GPU cannot load the model, you can use the following command to load the model across multiple GPUs:

    CUDA_VISIBLE_DEVICES=0,1,2 python examples/generate_finetune.py  --base_model zjunlp/knowlm-13b-base-v1.0 --interactive --multi_gpu     # --allocate [10,10,10]
    

    The --allocate above specifies the amount of memory used by each GPU, measured in GB.

  2. Use the following command to enter web-based interaction:

    python examples/generate_finetune_web.py --base_model zjunlp/knowlm-13b-base-v1.0
    

    If a single GPU cannot load the model, you can use the following command to load the model across multiple GPUs:

    CUDA_VISIBLE_DEVICES=0,1,2 python examples/generate_finetune_web.py --base_model zjunlp/knowlm-13b-base-v1.0 --multi_gpu     # --allocate [10,10,10]
    

    Here is a screenshot of the web-based interaction:

    <p align="center" width="100%"> <a href="" target="_blank"><img src="./assets/finetune_web.jpg" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a> </p>

3. Usage of the Instruction-Tuned Model

Here, we provide a web-based interaction method. Use the following command to access the web:

python examples/generate_lora_web.py --base_model zjunlp/knowlm-13b-zhixi

If a single GPU cannot load the model, you can use the following command to load the model across multiple GPUs:

CUDA_VISIBLE_DEVICES=0,1,2 python examples/generate_lora_web.py --base_model zjunlp/knowlm-13b-zhixi --multi_gpu     # --allocate [10,10,10]

Here is a screenshot of the web-based interaction:

<p align="center" width="100%"> <a href="" target="_blank"><img src="./assets/lora_web.png" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a> </p>

The instruction is a required parameter, while input is an optional parameter. For general tasks (such as the examples in Section 2.3), you can enter the query directly in the instruction field. For information extraction tasks (such as the example in Section 2.2), please enter the instruction in the instruction field and the sentence to be extracted in the input field. We provide information extraction prompts in Section 1.3.

If you want to perform batch testing, please modify the examples/generate_lora.py file and update the examples and hyperparameters in the variable cases.

According to different task requirements, we have the following suggestions for adjusting decoding strategies and their associated hyperparameters (a short configuration sketch follows the list):

  1. If you want more diverse and creative outputs, consider using top-k or top-p (nucleus) sampling with a relatively higher top_k or top_p, and possibly a higher temperature.
  2. If you want more focused and high-quality outputs (e.g., information extraction), consider using beam search with a moderate num_beam, or top-k or top-p sampling with a lower top_k or top_p, and a lower temperature.
  3. Remember to experiment and fine-tune. Depending on your use case, it may be beneficial to iterate and experiment with different strategies and hyperparameters to find the optimal combination.
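
The sketch below illustrates the two regimes from items 1 and 2 using Hugging Face generate-style arguments; the specific values are illustrative examples, not the repository's defaults.

# Illustrative decoding configurations; the values are examples, not the repository's defaults.
creative_kwargs = dict(          # diverse, creative outputs (item 1)
    do_sample=True,
    top_p=0.95,
    top_k=100,
    temperature=1.0,
    max_new_tokens=512,
)
focused_kwargs = dict(           # focused, high-precision outputs such as information extraction (item 2)
    do_sample=False,
    num_beams=4,
    max_new_tokens=512,
)
# outputs = model.generate(**inputs, **focused_kwargs)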

4. vLLM API server

We integrate vLLM to accelerate LLM inference and provide an efficient API service. Use the following command to launch the vLLM API server at http://localhost:8090.

max_num_batched_tokens=8000

CUDA_VISIBLE_DEVICES=1,2 python inference/launch_vllm.py \
    --port 8090 \
    --model data/zhixi-13B \
    --use-np-weights \
    --max-num-batched-tokens $max_num_batched_tokens \
    --dtype half \
    --tensor-parallel-size 2

Query the service using POST request:

curl -X POST "http://127.0.0.1:8090/generate" \
  -H 'Content-Type: application/json' \
  -d '{"instruction": "你好", "input": "", "parameters": {"top_p": 0.7, "max_tokens": 256}}'

You should get a response similar to the following:

{
  "generated_text":"你好,很高兴见到你。我是一个人工智能助手,可以帮助你解决问题和提供信息。有什么我可以帮助你的吗?</s>",
  "num_output_tokens_cf":65,
  "error":null
}
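
The same endpoint can also be queried from Python; the snippet below is a minimal sketch that assumes the same payload format as the curl example above and a server running on 127.0.0.1:8090.

# Minimal Python client for the vLLM API server above (same payload as the curl example).
import requests

payload = {
    "instruction": "你好",
    "input": "",
    "parameters": {"top_p": 0.7, "max_tokens": 256},
}
response = requests.post("http://127.0.0.1:8090/generate", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
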
<h3 id="1-3">🎯1.3 Information Extraction Prompt</h3>

For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this link for examples. Of course, you can also try using your own prompts.
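
As an illustration only (the prompts at the link above are the authoritative versions), an NER request might pair an instruction and an input like this:

# Illustrative NER prompt; refer to the linked prompts for the official formats.
instruction = (
    "You are an expert in named entity recognition. Please extract all entities of the "
    "types [person, organization, location] from the input text and return them as a "
    "JSON list of (entity, type) pairs."
)
input_text = "Steve Jobs co-founded Apple in Cupertino, California."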

Here is a case where knowlm-13b-zhixi is used to accomplish the instruction-based knowledge graph construction task in CCKS2023.

<h3 id="1-4">🐐1.4 LlaMA.cpp</h3>

If you lack sufficient GPU computing resources, you can quantize the model with llama.cpp; this works because KnowLM uses the same LLaMA architecture that llama.cpp supports. Once you have set up your environment, you can download our model to a designated path using the following command:

python tools/download.py --specify --download_path ./your/path --repo_name zjunlp/knowlm-13b-zhixi

Next, substitute the model path at this location with the path of the downloaded model. When running the conversion in practice, remember to adjust the model path within this script accordingly.

<h3 id="1-5">📌1.5 Instruction Processing</h3>

Instruction tuning has emerged as a crucial technique to enhance the capabilities of LLMs, which bridges the gap between the next-word prediction objective of LLMs and human preference. To construct a high-quality instruction dataset, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality.

In instruction processing, we utilized EasyInstruct as our processing framework (details can be found at https://github.com/zjunlp/EasyInstruct). EasyInstruct modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. The code below shows a running example of instruction generation and selection in EasyInstruct:

from easyinstruct import SelfInstructGenerator, GPTScoreSelector
from easyinstruct.utils.api import set_openai_key

# Step1: Set your own API-KEY
set_openai_key("YOUR-KEY")

# Step2: Declare a generator class
generator = SelfInstructGenerator(num_instructions_to_generate=100)

# Step3: Generate self-instruct data
generator.generate()

# Step4: Declare a selector class
selector = GPTScoreSelector()

# Step5: Process the generated instructions
selector.process()
<h3 id="1-6">🖊️1.6 Model Editing</h3>

Although large language models perform exceptionally well in many tasks, they can still provide incorrect answers. Moreover, as time passes, knowledge that was once accurate may become outdated. This necessitates that we adjust the model's responses to meet our expectations through model editing.

In model editing, we utilized EasyEdit as our editing tool (details can be found at https://github.com/zjunlp/EasyEdit). EasyEdit is a highly integrated model editing tool. All you need to do is define your editor in just three lines of code, much as you would with Hugging Face Transformers.

from easyeditor import BaseEditor, MENDHyperParams
hparams = MENDHyperParams.from_hparams('./hparams/MEND/gpt2-xl')
editor = BaseEditor.from_hparams(hparams)

The code above demonstrates the editor definition for editing the gpt2-xl model using the MEND method. The next step is to prepare the editing data and the test data.
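
The snippet below sketches what such data could look like; the field names follow EasyEdit's documented interface, and the concrete facts are made up purely for illustration.

# Illustrative editing and locality (test) data; the facts below are made-up examples.
prompts = ["Who is the president of the United States?"]
ground_truth = ["Donald Trump"]        # the model's current (outdated) answer
target_new = ["Joe Biden"]             # the desired answer after editing
locality_inputs = {                    # unrelated facts that should remain unchanged after editing
    "neighborhood": {
        "prompt": ["What is the capital of France?"],
        "ground_truth": ["Paris"],
    }
}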

metrics, edited_model, _ = editor.edit(
    prompts=prompts,
    ground_truth=ground_truth,
    target_new=target_new,
    locality_inputs=locality_inputs,
    keep_original_weight=True
)

With the provided code, you can complete the editing of the model. The edited model is stored in "edited_model", and the corresponding evaluation metrics are saved in "metrics."

<h2 id="2">🌰2. Cases</h2> <h3 id="2-1">🌰2.1 Pretraining Cases</h3>

Our pre-trained model has demonstrated certain abilities in instruction following, coding, reasoning, as well as some translation capabilities, without any fine-tuning using instructions. Additionally, it has acquired new knowledge. Below are some of our sample cases. If you wish to reproduce our examples and view detailed decoding configuration, please first set up the environment, then follow the steps outlined here.

In the following cases, text in bold represents the prompt, while non-bold text represents the model's output.

Due to the maximum inference length set to 512, our cases fall into three situations:

  1. Completed output. The model generates the termination token EOS and completes the output. We mark this with :white_check_mark:.
  2. Incomplete output. The output is cut off due to the maximum inference length. We mark this with :eight_spoked_asterisk:.
  3. Repeated output. We remove repeated content manually and mark it with :arrow_left:.
<details> <summary><b>Translation</b></summary> </details> <details> <summary><b>Knowledge</b></summary> </details> <details> <summary><b>Instruction Following</b></summary> </details> <details> <summary><b>Coding</b></summary> </details> <details> <summary><b>Generate long text in Chinese</b></summary> </details> <details> <summary><b>Generate long text in English</b></summary> </details> <details> <summary><b>Reasoning</b></summary> </details> <h3 id="2-2">🌰2.2 Information Extraction Cases</h3>

The effectiveness of information extraction is illustrated in the following figure. We tested different instructions for different tasks as well as the same instructions for the same task, and achieved good results for all of them.

<p align="center" width="100%"> <a href="" target="_blank"><img src="./assets/ie-case-new_logo-en.png" alt="IE" style="width: 90%; min-width: 90px; display: block; margin: auto;"></a> </p>

Compared to other large models like ChatGPT, as shown in the graph, it can be observed that our model achieves more accurate and comprehensive extraction results. However, we have also identified some extraction errors in ZhiXi. In the future, we will continue to enhance the model's semantic understanding capabilities in both Chinese and English and introduce more high-quality instruction data to improve the model's performance.

<p align="center" width="100%"> <a href="" target="_blank"><img src="./assets/casevschatgpt.png" width="600" height="900"></a> </p> <h3 id="2-3">🌰2.3 General Abilities Cases</h3>

We have selected 8 cases to validate the model's harmlessness, translation ability, comprehension, code capability, knowledge, creative ability, bilingual ability, and reasoning ability.

<details> <summary><b>Harmlessness</b></summary> </details> <details> <summary><b>Translation Ability</b></summary> </details> <details> <summary><b>Comprehension</b></summary> </details> <details> <summary><b>Code Ability</b></summary> </details> <details> <summary><b>Knowledge</b></summary> </details> <details> <summary><b>Creative Ability</b></summary> </details> <details> <summary><b>Bilingual Ability</b></summary> </details> <details> <summary><b>Reasoning Ability</b></summary> </details> <h3 id="2-4">🌰2.4 Model Editing Cases</h3>

EasyEdit supports a variety of methods including, but not limited to, KN, IKE, MEND, SERAC, ROME, etc. Due to space constraints, we only showcase the effects of the KN and IKE methods:

<details> <summary><b>KN method case</b></summary>

Michael Jordan is born from

Answer before editing: Michael Jordan is born from the USA

Answer after editing: Michael Jordan is born from China

</details> <details> <summary><b>IKE method case</b></summary>

Michael Jordan is born from

Answer before editing: Michael Jordan is born from the USA

Answer after editing: Michael Jordan is born from China

</details> <h2 id="3">🥊3. Training Details</h2>

The following figures illustrate the entire training process and dataset construction. The training process is divided into two stages:

(1) Full pre-training stage. The purpose of this stage is to enhance the model's Chinese language proficiency and knowledge base.

(2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.

<h3 id="3-1">🧾3.1 Dataset Construction (Pretraining)</h3>

In order to enhance the model's understanding of Chinese while preserving its original code and English language capabilities, we did not expand the vocabulary. Instead, we collected Chinese corpora, English corpora, and code corpora. The Chinese corpora were sourced from Baidu Baike, Wudao, and Chinese Wikipedia. The English dataset was sampled from the original English corpus of LLaMA, with the exception of the Wikipedia data. The original paper's English Wikipedia data was up until August 2022, and we additionally crawled data from September 2022 to February 2023, covering a total of six months. As for the code dataset, due to the low-quality code in the Pile dataset, we crawled code data from GitHub and LeetCode. A portion of the data was used for pre-training, while another portion was used for fine-tuning with instructions.

For the crawled datasets mentioned above, we employed a heuristic approach to filter out harmful content. Additionally, we removed duplicate data.
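
A simplified sketch of this kind of filtering and deduplication pipeline is shown below; the actual heuristics used for KnowLM are more elaborate and live in ./pretrain, and the blocklist here is a placeholder.

# Simplified sketch of keyword-based filtering plus hash-based exact deduplication;
# the actual heuristics are more elaborate (see ./pretrain), and BLOCKLIST is a placeholder.
import hashlib

BLOCKLIST = {"example_bad_phrase_1", "example_bad_phrase_2"}

def keep(doc: str, seen_hashes: set) -> bool:
    text = doc.strip().lower()
    if any(phrase in text for phrase in BLOCKLIST):            # heuristic harmful-content filter
        return False
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()     # exact-duplicate removal
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

seen: set = set()
corpus = ["doc one ...", "doc one ...", "doc two ..."]
cleaned = [d for d in corpus if keep(d, seen)]                 # keeps "doc one ..." once and "doc two ..."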

<h3 id="3-2">⏳3.2 Training Process (Pretraining)</h3>

Detailed data processing code, training code, complete training scripts, and detailed training results can be found in ./pretrain.

Before training, we need to tokenize the data. We set the maximum length of a single sample to 1024, while most documents are much longer than this. Therefore, we need to partition these documents. We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample. Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to DeepSpeed-Megatron and used the mmap method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
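
A simplified sketch of such a greedy split is shown below; the real implementation, together with the mmap-based data loader, is in ./pretrain, and the tokenizer interface here is the standard Hugging Face one.

# Simplified sketch of greedily packing complete sentences into samples of at most
# max_len tokens; the real splitting and mmap-based loading code lives in ./pretrain.
def greedy_split(sentences, tokenizer, max_len=1024):
    samples, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(tokenizer.encode(sent, add_special_tokens=False))
        if current and current_len + n_tokens > max_len:
            samples.append(" ".join(current))      # close the sample at a sentence boundary
            current, current_len = [], 0
        current.append(sent)                       # a single over-long sentence still forms its own sample
        current_len += n_tokens
    if current:
        samples.append(" ".join(current))
    return samples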

Finally, we performed pre-training on 5.5 million Chinese samples, 1.5 million English samples, and 0.9 million code samples. We used the Transformers Trainer in conjunction with DeepSpeed ZeRO-3 (we observed that ZeRO-2 was slower in a multi-node, multi-GPU setup). The training was conducted across 3 nodes, each equipped with 8 32GB V100 GPUs. The table below shows our training configuration and speed:

| Parameter | Value |
| --- | --- |
| micro batch size | 20 |
| gradient accumulation | 3 |
| global batch size | 20 × 3 × 24 = 1440 |
| time per step | 260s |
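
For orientation, a minimal DeepSpeed ZeRO-3 configuration consistent with the batch settings above might look like the sketch below; it is illustrative only, and the configuration actually used is in ./pretrain.

# Illustrative DeepSpeed ZeRO-3 configuration matching the batch settings above;
# the configuration actually used for training is in ./pretrain.
ds_config = {
    "train_micro_batch_size_per_gpu": 20,
    "gradient_accumulation_steps": 3,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
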
<h3 id="3-3">🧾3.3 Dataset Construction (Instruction tuning)</h3>

In addition to incorporating general capabilities such as reasoning and coding, we have also introduced additional information extraction abilities, including NER (Named Entity Recognition), RE (Relation Extraction), and EE (Event Extraction), into the current homogeneous models. Note that many open-source datasets, such as the Alpaca dataset, the CoT dataset, and the code dataset, are in English. To obtain the corresponding Chinese datasets, we used GPT-4 for translation. We used two approaches: 1) directly translating the questions and answers into Chinese, and 2) feeding the English questions to GPT-4 and asking it to respond in Chinese. The second approach was used for the general datasets, while the first was used for datasets like the CoT dataset and the code dataset. These datasets are readily available online.
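
The sketch below illustrates the two approaches with the OpenAI Python client; the prompts and client usage are illustrative assumptions, not the original scripts used at the time.

# Illustrative sketch of the two GPT-4-based approaches described above;
# prompts and client usage are assumptions, not the original scripts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_pair(question: str, answer: str) -> str:
    """Approach 1: translate an English question-answer pair into Chinese."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Translate the following QA pair into Chinese.\nQ: {question}\nA: {answer}"}],
    )
    return resp.choices[0].message.content

def answer_in_chinese(question: str) -> str:
    """Approach 2: feed the English question to GPT-4 and ask for a Chinese response."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"请用中文回答下面的问题：{question}"}],
    )
    return resp.choices[0].message.content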

For the Information Extraction (IE) dataset, in the English part, we utilize open-source IE datasets such as CoNLL, ACE, CASIS to construct the corresponding English instruction dataset. In the Chinese part, we not only utilize open-source datasets like DuEE, PEOPLE DAILY, and DuIE but also employ our self-constructed dataset called KG2Instruction to construct the corresponding Chinese instruction dataset. Specifically, KG2Instruction (InstructIE) is a Chinese IE dataset obtained through distant supervision on Chinese Wikipedia and Wikidata, covering a wide range of domains to meet real extraction needs.

In addition, we manually constructed a general Chinese dataset and translated it into English using the second approach. Finally, our data distribution is as follows:

| Dataset | Number |
| --- | --- |
| CoT Datasets (Chinese, English) | 202,333 |
| General Datasets (Chinese, English) | 105,216 |
| Code Datasets (Chinese, English) | 44,688 |
| Information Extraction Datasets (English) | 537,429 |
| Information Extraction Datasets (Chinese) | 486,768 |

Flow diagram of KG2Instruction and the other instruction fine-tuning datasets:

<p align="center" width="100%"> <a href="" target="_blank"><img src="./assets/kg2instructions-en.png"style="width: 90%; min-width: 90px; display: block; margin: auto;"></a> </p> <h3 id="3-4">⏳3.4 Training Process (Instruction tuning)</h3>

Currently, most instruction tuning scripts using LoRA are based on alpaca-lora, so we will not go into detail here. Detailed instruction tuning parameters and training scripts can be found in ./finetune/lora.
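
For orientation, a typical alpaca-lora-style LoRA setup with the peft library looks roughly like the sketch below; the hyperparameters are illustrative, and the values actually used are in ./finetune/lora.

# Typical alpaca-lora-style LoRA setup with peft; hyperparameters are illustrative,
# and the values actually used are in ./finetune/lora.
import torch
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

model = LlamaForCausalLM.from_pretrained("zjunlp/knowlm-13b-base-v1.0", torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the LoRA adapter weights are trainable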

<h2 id="4">🔴4. Limitations</h2>

Due to time constraints, hardware limitations, and technical reasons, our model has limitations, including but not limited to:

<h2 id="5">🕐5. TODO List</h2> <h2 id="6">❓6. FAQ</h2> <h2 id="7">👋7. Others</h2> <h3 id="7-1">👨‍👩‍👦7.1 Contributors</h3>

Ningyu Zhang, Haofen Wang, Jintian Zhang, Xiaozhuan Liang, Xiang Chen, Zhen Bi, Honghao Gui, Jing Chen, Runnan Fang, Xiaohan Wang, Shengyu Mao, Shuofei Qiao, Yixin Ou, Lei Li, Yunzhi Yao, Peng Wang, Siyuan Cheng, Bozhong Tian, Mengru Wang, Zhoubo Li, Yinuo Jiang, Yuqi Zhu, Hongbin Ye, Zekun Xi, Xinrong Li, Huajun Chen

<h3 id="7-2">📇7.2 Citation</h3>

If you use our repository, please cite the following related papers:

@misc{knowlm,
  author = {Ningyu Zhang and Jintian Zhang and Xiaohan Wang and Honghao Gui and Kangwei Liu and Yinuo Jiang and Xiang Chen and Shengyu Mao and Shuofei Qiao and Yuqi Zhu and Zhen Bi and Jing Chen and Xiaozhuan Liang and Yixin Ou and Runnan Fang and Zekun Xi and Xin Xu and Lei Li and Peng Wang and Mengru Wang and Yunzhi Yao and Bozhong Tian and Yin Fang and Guozhou Zheng and Huajun Chen},
  title = {KnowLM Technical Report},
  year = {2023},
  url = {http://knowlm.zjukg.cn/},
}

@article{wang2023easyedit,
  title={EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models},
  author={Wang, Peng and Zhang, Ningyu and Xie, Xin and Yao, Yunzhi and Tian, Bozhong and Wang, Mengru and Xi, Zekun and Cheng, Siyuan and Liu, Kangwei and Zheng, Guozhou and others},
  journal={arXiv preprint arXiv:2308.07269},
  year={2023}
}

@article{ou2024easyinstruct,
  title={EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models},
  author={Ou, Yixin and Zhang, Ningyu and Gui, Honghao and Xu, Ziwen and Qiao, Shuofei and Bi, Zhen and Chen, Huajun},
  journal={arXiv preprint arXiv:2402.03049},
  year={2024}
}

@article{yao2023editing,
  title={Editing Large Language Models: Problems, Methods, and Opportunities},
  author={Yao, Yunzhi and Wang, Peng and Tian, Bozhong and Cheng, Siyuan and Li, Zhoubo and Deng, Shumin and Chen, Huajun and Zhang, Ningyu},
  journal={arXiv preprint arXiv:2305.13172},
  year={2023}
}

<h3 id="7-3">💡7.3 Acknowledgment</h3>

We are very grateful to the following open source projects for their help:

<!--<p align="center"> <br> <img src="./assets/知析 (8).png" width="300"/> <br> </p>-->

Why is it called ZhiXi (智析)?

In Chinese, "Zhi" (智) signifies intelligence, referencing the AI's advanced language understanding capabilities. "Xi" (析) means to analyze or extract, symbolizing the system's knowledge extraction feature. Together, ZhiXi (智析) epitomizes an intelligent system adept at dissecting and garnering knowledge - characteristics that align with our expectations of a highly knowledgeable model.