GlotCC HomePage

<img align="left" src="assets/images/logo.jpg" width="200" height="200" />

<a href="https://arxiv.org/abs/2410.23825"><img alt="arXiv" src="https://img.shields.io/badge/arXiv-2410.23825-b31b1b.svg"></a>

GlotCC is a multilingual corpus built from CommonCrawl using the GlotLID language identification model and the cisnlp/ungoliant pipeline.

The latest version supports more than 1000 languages and is filtered using filters adopted from C4, CCNet, MADLAD-400, RedPajama-Data-v2, OSCAR, Gopher, RefinedWeb, FineWeb, Datatrove, Dolma, Pile-CC, Pretrainer's Guide, and GlotScript.

The logo features a llama, drawn in the style of C.C. from the anime Code Geass, reading a book.

Dataset

GlotCC Dataset, Version 1: https://huggingface.co/datasets/cis-lmu/GlotCC-V1
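
As a rough illustration of reading the corpus, the sketch below streams a few records for one language from the Hugging Face Hub. It is a minimal example under stated assumptions, not an official loader: the helper names `glotcc_config` and `stream_glotcc` are hypothetical, and both the `<language>-<Script>` subset naming (for example `azb-Arab`) and the `train` split are assumptions that should be checked against the dataset card. It requires the third-party `datasets` package.

```python
import re
from itertools import islice

# Repository ID taken from the dataset URL above.
REPO_ID = "cis-lmu/GlotCC-V1"


def glotcc_config(lang: str, script: str) -> str:
    """Build a subset name such as 'azb-Arab'.

    Assumption: subsets are named '<ISO 639-3 code>-<ISO 15924 script>'.
    """
    if not re.fullmatch(r"[a-z]{3}", lang):
        raise ValueError(f"expected a 3-letter lowercase language code, got {lang!r}")
    if not re.fullmatch(r"[A-Z][a-z]{3}", script):
        raise ValueError(f"expected a 4-letter script code like 'Latn', got {script!r}")
    return f"{lang}-{script}"


def stream_glotcc(lang: str, script: str, limit: int = 3) -> list:
    """Return the first `limit` records of one language subset.

    Streaming avoids downloading the whole subset up front.
    Assumption: the subset exposes a 'train' split.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(
        REPO_ID,
        name=glotcc_config(lang, script),
        split="train",
        streaming=True,
    )
    return list(islice(dataset, limit))
```

Calling `stream_glotcc("azb", "Arab")` would then return up to three records as dictionaries; inspect their keys before relying on any particular field name.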

Running the pipeline

We forked oscar-project/ungoliant to cisnlp/ungoliant and made the necessary changes to integrate it with the GlotLID language identification model.

For detailed instructions on running the pipeline, refer to the cisnlp/ungoliant repository, whose README is kept up to date.

Acknowledgements

License

Citation

If you find our repo and data useful for your research, please cite:

@article{kargaran2024glotcc,
  title     = {Glot{CC}: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages},
  author    = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
  journal   = {Advances in Neural Information Processing Systems},
  year      = {2024},
  url       = {https://arxiv.org/abs/2410.23825}
}