SecEval: A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models

Chinese version.

The advent of large language models has ignited a transformative era for the cybersecurity industry. Pioneering applications are being developed, deployed, and utilized in areas such as cybersecurity knowledge QA, vulnerability hunting, and alert investigation. Several studies have indicated that LLMs primarily acquire their knowledge during the pretraining phase, with fine-tuning serving essentially to align the model with user intentions and give it the ability to follow instructions. This suggests that the knowledge and skills embedded in the foundation model significantly influence the model's potential on specific downstream tasks.

Yet, a focused evaluation of cybersecurity knowledge is missing in existing datasets. We address this by introducing "SecEval". SecEval is the first benchmark specifically created for evaluating cybersecurity knowledge in Foundation Models. It offers over 2000 multiple-choice questions across 9 domains: Software Security, Application Security, System Security, Web Security, Cryptography, Memory Safety, Network Security, Vulnerability, and PenTest. SecEval generates questions by prompting OpenAI GPT-4 with authoritative sources such as open-licensed textbooks, official documentation, and industry guidelines and standards. The generation process is designed to ensure the dataset meets rigorous quality, diversity, and impartiality criteria. You can explore the dataset on the explore page.

Using SecEval, we evaluate 10 state-of-the-art foundation models, providing new insights into their performance in the field of cybersecurity. The results indicate that LLMs still have a long way to go before they master cybersecurity. We hope that SecEval can serve as a catalyst for future research in this area.

Leaderboard

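| # | Model | Creator | Access | Submission Date | System Security | Application Security | PenTest | Memory Safety | Network Security | Web Security | Vulnerability | Software Security | Cryptography | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-4-turbo | OpenAI | API, Web | 2023-12-20 | 73.61 | 75.25 | 80.00 | 70.83 | 75.65 | 82.15 | 76.05 | 73.28 | 64.29 | 79.07 |
| 2 | gpt-3.5-turbo | OpenAI | API, Web | 2023-12-20 | 59.15 | 57.18 | 72.00 | 43.75 | 60.87 | 63.00 | 60.18 | 58.19 | 35.71 | 62.09 |
| 3 | Yi-6B | 01-AI | Weight | 2023-12-20 | 50.61 | 48.89 | 69.26 | 35.42 | 56.52 | 54.98 | 49.40 | 45.69 | 35.71 | 53.57 |
| 4 | Orca-2-7b | Microsoft | Weight | 2023-12-20 | 46.76 | 47.03 | 60.84 | 31.25 | 49.13 | 55.63 | 50.00 | 52.16 | 14.29 | 51.60 |
| 5 | Mistral-7B-v0.1 | Mistralai | Weight | 2023-12-20 | 40.19 | 38.37 | 53.47 | 33.33 | 36.52 | 46.57 | 42.22 | 43.10 | 28.57 | 43.65 |
| 6 | chatglm3-6b-base | THUDM | Weight | 2023-12-20 | 39.72 | 37.25 | 57.47 | 31.25 | 43.04 | 41.14 | 37.43 | 39.66 | 28.57 | 41.58 |
| 7 | Aquila2-7B | BAAI | Weight | 2023-12-20 | 34.84 | 36.01 | 47.16 | 22.92 | 32.17 | 42.04 | 38.02 | 36.21 | 7.14 | 38.29 |
| 8 | Qwen-7B | Alibaba | Weight | 2023-12-20 | 28.92 | 28.84 | 41.47 | 18.75 | 29.57 | 33.25 | 31.74 | 30.17 | 14.29 | 31.37 |
| 9 | internlm-7b | Sensetime | Weight | 2023-12-20 | 25.92 | 25.87 | 36.21 | 25.00 | 27.83 | 32.86 | 29.34 | 34.05 | 7.14 | 30.29 |
| 10 | Llama-2-7b-hf | MetaAI | Weight | 2023-12-20 | 20.94 | 18.69 | 26.11 | 16.67 | 14.35 | 22.77 | 21.56 | 20.26 | 21.43 | 22.15 |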

Dataset

Format

The dataset is in json format. Each question has the following fields:

Question Distribution

| Topic | No. of Questions |
|---|---|
| SystemSecurity | 1065 |
| ApplicationSecurity | 808 |
| PenTest | 475 |
| MemorySafety | 48 |
| NetworkSecurity | 230 |
| WebSecurity | 773 |
| Vulnerability | 334 |
| SoftwareSecurity | 232 |
| Cryptography | 14 |
| Overall | 2126 |
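The per-topic counts add up to more than the overall total, which suggests that a single question can be tagged with more than one topic. This is an inference from the numbers above, not something the table states. A small sketch of the arithmetic, using only the counts from the table:

```python
# Per-topic counts copied from the Question Distribution table.
topic_counts = {
    "SystemSecurity": 1065,
    "ApplicationSecurity": 808,
    "PenTest": 475,
    "MemorySafety": 48,
    "NetworkSecurity": 230,
    "WebSecurity": 773,
    "Vulnerability": 334,
    "SoftwareSecurity": 232,
    "Cryptography": 14,
}
overall = 2126  # the "Overall" row

# Total number of topic labels across all questions.
total_labels = sum(topic_counts.values())

# Average number of topic labels per question (assumes multi-topic tagging).
avg_topics_per_question = total_labels / overall

# Fraction of all questions that carry each topic label.
topic_share = {topic: n / overall for topic, n in topic_counts.items()}

print(total_labels)  # 3979
print(round(avg_topics_per_question, 2))
for topic, share in sorted(topic_share.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {share:.1%}")
```

The shares make the imbalance concrete: SystemSecurity labels appear on roughly half of the questions, while Cryptography and MemorySafety each cover only a small fraction, so per-topic scores for those two rest on very few questions.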

Download

You can download the json file of the dataset by running:

wget https://huggingface.co/datasets/XuanwuAI/SecEval/resolve/main/questions.json
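Once the file is downloaded, it can be read with the standard library. The sketch below assumes that questions.json contains a top-level JSON array of question objects; the helper name `load_questions` is ours and is not part of the SecEval code.

```python
import json
from pathlib import Path


def load_questions(path):
    """Read a SecEval-style JSON file and return its questions as a list.

    Assumption: the file holds a top-level JSON array of question objects.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of questions")
    return data
```

For example, `questions = load_questions("questions.json")` followed by `len(questions)` gives a quick sanity check against the total in the Question Distribution table.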

Alternatively, you can load the dataset from Hugging Face.
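A minimal sketch of loading from the Hub, assuming the `datasets` library is installed and that the `XuanwuAI/SecEval` repository (the one named in the download URL) loads with its default configuration. The wrapper name `load_seceval` is hypothetical; split and column names are not listed here, so inspect the returned object before relying on them.

```python
def load_seceval(repo_id="XuanwuAI/SecEval"):
    """Load SecEval from the Hugging Face Hub (requires network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the repository loads without extra configuration arguments.
    return load_dataset(repo_id)
```

Calling `print(load_seceval())` shows the available splits and columns.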

Evaluate Your Model on SecEval

You can use our evaluation script to evaluate your model on the SecEval dataset.
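For readers who want to see the shape of such an evaluation, the following is an illustrative sketch of exact-match accuracy on multiple-choice questions. It is not the project's evaluation script. The field name `answer`, the helper names, and the `predict` callable (which stands in for your model and must return option letters only) are all assumptions made for this example.

```python
def normalize(answer):
    """Canonical, order-insensitive form of an option string, e.g. 'c, b' -> 'BC'."""
    return "".join(sorted(ch for ch in answer.upper() if ch.isalpha()))


def accuracy(questions, predict):
    """Fraction of questions whose predicted options match the reference exactly.

    `questions` is a list of dicts with a hypothetical "answer" field holding
    the correct option letters; `predict` maps a question dict to option letters.
    """
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        if normalize(predict(q)) == normalize(q["answer"]):
            correct += 1
    return correct / len(questions)
```

A question counts as correct only when the full set of selected options matches, so a partially correct multi-answer response scores zero under this sketch. Multiplying the result by 100 gives a percentage on the same scale as the leaderboard, though the official script may score differently.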

Generation Process

Data Collection

Questions Generation

To facilitate the evaluation process, we designed the dataset in a multiple-choice question format. Our approach to question generation involved several steps:

  1. Text Parsing: We began by parsing the texts according to their hierarchical structure, such as chapters and sections for textbooks, or tactics and techniques for frameworks like ATT&CK.

  2. Content Sampling: For texts with extensive content, such as CWE or Windows Security Documentation, we employed a sampling strategy to maintain manageability. For example, we selected the top 25 most common weakness types and 175 random types from CWE.

  3. Question Generation: Utilizing GPT-4, we generated multiple-choice questions based on the parsed text, with the level of detail adjusted according to the content's nature. For instance, questions stemming from the CS161 textbook were based on individual sections, while those from ATT&CK were based on techniques.

  4. Question Refinement: We then prompted GPT-4 to identify questions that were too simplistic or not self-contained. Where possible, these questions were revised; otherwise, they were discarded.

  5. Answer Calibration: We checked each answer by presenting GPT-4 with both the question and the source text from which the question was derived. If GPT-4's response diverged from the previously established answer, we took this as a sign that the question does not admit a consistent answer, and we discarded it.

  6. Classification: Finally, we organized the questions into 9 topics, and attached a relevant fine-grained keyword to each question.
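Two of the steps above, content sampling and answer calibration, can be sketched in a few lines. This is an illustration under stated assumptions, not the code used to build SecEval: the function names, the field names (`question`, `source_text`, `answer`), and the `answer_with_source` callable that stands in for a GPT-4 call are all hypothetical. The sketch also assumes that the random sample is drawn from entries outside the top group, which the description does not specify.

```python
import random


def sample_entries(ranked_entries, top_n=25, random_n=175, seed=0):
    """Keep the first `top_n` entries plus `random_n` drawn at random from the rest.

    `ranked_entries` is assumed to be ordered from most to least common.
    """
    head = list(ranked_entries[:top_n])
    tail = list(ranked_entries[top_n:])
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = rng.sample(tail, min(random_n, len(tail)))
    return head + picked


def calibrate(questions, answer_with_source):
    """Drop questions whose re-derived answer disagrees with the stored one.

    `answer_with_source(question_text, source_text)` is a placeholder for a
    model call that answers the question with the source text in view.
    """
    kept = []
    for q in questions:
        if answer_with_source(q["question"], q["source_text"]) == q["answer"]:
            kept.append(q)
    return kept
```

With the defaults, `sample_entries` returns 200 entries from a ranked list, mirroring the CWE example of 25 most common plus 175 random weakness types.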

Limitations

The dataset, while comprehensive, exhibits certain constraints:

  1. Distribution Imbalance: The dataset presents an uneven distribution of questions across different domains, resulting in a higher concentration of questions in certain areas while others are less represented.

  2. Incomplete Scope: Some cybersecurity topics are absent from the dataset, such as content security, reverse engineering, and malware analysis. As such, it does not encapsulate the full breadth of knowledge within the field.

Future Work

  1. Improvement on Distribution: We aim to broaden the dataset's comprehensiveness by incorporating additional questions, thereby enriching the coverage of existing cybersecurity topics.

  2. Improvement on Topic Coverage: Efforts will be made to include a wider array of cybersecurity topics within the dataset, which will help achieve a more equitable distribution of questions across various fields.

Licenses

The dataset is released under the CC BY-NC-SA 4.0 license. The code is released under the MIT license.

Citation

@misc{li2023seceval,
    title={SecEval: A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models},
    author={Li, Guancheng and Li, Yifeng and Wang, Guannan and Yang, Haoyu and Yu, Yang},
    publisher = {GitHub},
    howpublished= "https://github.com/XuanwuAI/SecEval",
    year={2023}
}

Credits

This work is supported by Tencent Security Xuanwu Lab. We also appreciate the Tencent Spark Talent Program for its help.