Intern · WanJuan 1.0 is the first open-source release of the Intern · WanJuan multimodal corpus. It comprises three parts: a text dataset, an image-text dataset, and a video dataset, with a total data volume exceeding 2TB. Building on the corpus assembled by the large model data alliance, the Shanghai AI Lab carried out fine-grained cleaning, deduplication, and value alignment on part of the data to form Intern · WanJuan 1.0, which has four characteristics: diverse integration, fine-grained processing, value alignment, and ease of use and efficiency.

Currently, Intern · WanJuan 1.0 has been used to train large models such as Intern Multimodal and Intern Puyu. By "digesting" this high-quality corpus, the Intern series models have shown excellent performance in a variety of generative tasks such as semantic understanding, knowledge question answering, visual understanding, and visual question answering.



Intern · WanJuan 1.0 - text dataset

The Intern · WanJuan 1.0 text dataset is composed of cleaned pretraining corpora from sources such as web pages, encyclopedias, books, patents, textbooks, and exam questions. It contains more than 500 million documents, with a data size exceeding 1TB. Data in various formats such as HTML, text, PDF, and EPUB is converted into a jsonl format with unified fields and, after fine-grained cleaning, deduplication, and value alignment, forms a safe, reliable, and high-quality pretraining corpus.


    "id": "BkORdv3xK7IA0HG7pccr",
- Field

  - **id:** [string type] the unique ID of the document.
  - **content:** [string type] the content of the document, in plain text or Markdown format.
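
The field layout above can be read with a few lines of Python. This is a minimal sketch, assuming each line of a jsonl file holds one JSON object with string `id` and `content` fields. The function name `iter_documents` and the sample record text are hypothetical and used only for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict]:
    """Yield one document dict per non-empty line of a jsonl stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Both fields are documented as strings; fail early if one is missing.
        for field in ("id", "content"):
            if not isinstance(record.get(field), str):
                raise ValueError(
                    f"line {line_number}: missing or non-string {field!r}"
                )
        yield record


# Hypothetical in-memory sample; a real file would be opened with
# open(path, encoding="utf-8") instead.
sample = io.StringIO(
    '{"id": "BkORdv3xK7IA0HG7pccr", "content": "example text"}\n'
)
docs = list(iter_documents(sample))
print(docs[0]["id"], len(docs[0]["content"]))
```

Reading line by line keeps memory use flat, which matters for files of this size; the field check is optional and can be dropped if the data is trusted.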


Intern · WanJuan 1.0 - image-text dataset

The Intern · WanJuan 1.0 image-text dataset comes mainly from public web pages, which are processed into documents of interleaved images and text. It contains more than 22 million documents, with a data size exceeding 140GB (excluding images), and covers news events, people, natural landscapes, social life, and other fields. The data is in a unified jsonl format in which images are given as URLs. To fetch the image data, you can use the following script: https://github.com/opendatalab/image-downloader

    "id": "BkKuk1zxK3YAbgNSWYik",
    "img_list": [
            "url": "http://digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/images/2021-01/21/02/1007771_wangjj_1611154300505_b.jpg",
            "sha256": "019cca88f37ae5ffe59ad48ad5c392fe64e489f08e841b6ea50c79c18f5c6ec3",
            "caption": "",
            "width": "400",
            "height": "266"
- Field

  - **id:** [string type] the unique ID of the document.
  - **img_list:** [array type] the list of images contained in the document. Each image entry includes its network URL, the sha256 of the URL, a caption, and its width and height.
  - **content:** [string type] the content of the document, in plain text or Markdown format.
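
To work with the image references in a record, the `img_list` entries can be pulled out as follows. This is a minimal sketch, assuming every entry carries `url`, `sha256`, `width`, and `height` as in the sample record, with the dimensions stored as numeric strings. The function name `image_entries`, the example.com URL, and the placeholder hash are hypothetical. The sketch only lists the references; downloading the images is left to the script linked above.

```python
import json
from typing import List, Tuple


def image_entries(line: str) -> List[Tuple[str, str, int, int]]:
    """Return (url, sha256, width, height) for each image in one jsonl record."""
    record = json.loads(line)
    entries = []
    for img in record.get("img_list", []):
        # The sample record stores width and height as strings, so convert them.
        entries.append(
            (img["url"], img["sha256"], int(img["width"]), int(img["height"]))
        )
    return entries


# Hypothetical record built in memory; the URL and hash are placeholders.
sample_line = json.dumps(
    {
        "id": "BkKuk1zxK3YAbgNSWYik",
        "img_list": [
            {
                "url": "http://example.com/a.jpg",
                "sha256": "0" * 64,
                "caption": "",
                "width": "400",
                "height": "266",
            }
        ],
        "content": "example text",
    }
)
for url, digest, width, height in image_entries(sample_line):
    print(url, digest[:8], width, height)
```

Converting the dimensions to integers up front makes it straightforward to filter out images below a chosen size before downloading anything.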


Intern · WanJuan 1.0 - video dataset

The Intern · WanJuan 1.0 video dataset comes mainly from China Media Group and Shanghai Media Group. It contains program videos of various types, with more than 1,000 video files and a data size exceeding 900GB. The content covers military affairs, literature and art, sports, nature, real-life society, knowledge, video art, media, food, historical documentaries, science and education, and more.


Download link

To download the complete dataset, please go to: https://opendatalab.org.cn/WanJuan1.0


The whole of Intern · WanJuan 1.0 is released under the CC BY 4.0 license. You are free to share and adapt this dataset, provided that you follow the license's conditions, such as giving appropriate attribution.

For the complete content of the license, please see the CC BY 4.0 full text.

Special attention items

Note that some subsets of this dataset may be subject to other license agreements. Before using a specific subset, please read the relevant agreement carefully to ensure compliant use. For more detailed license information, check the documents or metadata of the specific subset.

As a non-profit organization, OpenDataLab advocates a harmonious and friendly open-source communication environment. If you find any content in the open-source dataset that infringes your legal rights, you can send an email to OpenDataLab@pjlab.org.cn. In the email, please give a detailed description of the infringement and provide us with the relevant proof of ownership. We will start the investigation and handling process within 3 working days and take the necessary measures. You must ensure that your complaint is truthful; otherwise, you will be solely responsible for any adverse consequences of the measures taken.

Change Log

2023-10-20: Security upgrade: the corpus was further cleaned to improve its purity. The total file size after the upgrade is 2047.6GB.

2023-08-14: First release


      title={WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models}, 
      author={Conghui He and Zhenjiang Jin and Chao Xu and Jiantao Qiu and Bin Wang and Wei Li and Hang Yan and Jiaqi Wang and Dahua Lin},