ArabicText 2022
ArabicText 2022 is currently the world's largest open-source Arabic pre-training dataset and can be used to train Arabic language models.
Large Models · BAAI Collaboration · Arabic · Natural Language Processing
Dataset Introduction
Dataset Files
The Cognitive Model and Data research team of the Beijing Academy of Artificial Intelligence (BAAI), in cooperation with leading institutions in Arabic-speaking countries and regions, including the Arab Academy for Science, Technology and Maritime Transport (AASTMT), the Bibliotheca Alexandrina (BA), and the Inception Institute of Artificial Intelligence (IIAI) in Abu Dhabi, has released ArabicText 2022, the world's largest open-source Arabic text dataset for pre-training language models.
Data Description
By collecting, aggregating, augmenting and cleaning publicly available Arabic web text, we obtained a 200GB+ high-quality pre-training corpus, the largest Arabic dataset in the open-source community. During cleaning, we adapted and optimized WudaoCleaner1, the efficient and effective web-text cleaning tool behind WuDaoCorpora[1], for Arabic. We also integrated the open-source Arabic preprocessing toolkit ArabertPreprocessor2 into the pipeline to further ensure language-specific text quality. Compared with existing open-source Arabic datasets, ours is the largest in volume, and informative sources such as news and encyclopedia articles account for more than 65% of the data, which helps language models acquire prior knowledge from the corpus.
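To make the cleaning stage concrete, here is a minimal sketch of the kind of heuristic filtering such a pipeline performs. The thresholds, the Arabic-script ratio check, and the exact-duplicate filter are illustrative assumptions, not the actual WudaoCleaner or ArabertPreprocessor logic:

```python
import hashlib
import re

# Basic Arabic Unicode block; a hypothetical heuristic for script detection.
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")

def arabic_ratio(text: str) -> float:
    """Fraction of characters that are Arabic-script letters."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ARABIC_RE.match(ch)) / len(text)

def clean_corpus(docs, min_chars=50, min_arabic_ratio=0.5):
    """Keep documents that are long enough, mostly Arabic, and not exact duplicates."""
    seen = set()
    for doc in docs:
        doc = doc.strip()
        if len(doc) < min_chars:
            continue  # drop very short fragments
        if arabic_ratio(doc) < min_arabic_ratio:
            continue  # drop mostly non-Arabic pages
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        yield doc
```

A production pipeline would add many more stages (boilerplate stripping, near-duplicate detection, quality scoring), but the filter-and-deduplicate structure is the same.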
Data Source and Scale
| Data ID | Source | Scale (GB) |
|---|---|---|
| 001 | ArabicWeb22-A (collected by BAAI) | 67 |
| 002 | ArabicWeb16 | 50 |
| 003 | OSCAR3 | 27 |
| 004 | ArabicWeb22-B (provided by AASTMT & BA) | 20 |
| 005 | CC100[2] | 19 |
| 006 | Abu El-Khair Corpus[3] | 15 |
| 007 | Arabic Tweets | 2.1 |
| 008 | Arabic Wikipedia4 | 1.8 |
| Total | | 201.9 |
Data Format
The resulting data is plain text. The dataset is packaged by source into 8 txt files, whose filenames correspond one-to-one to the dataset IDs above.
Copyright
The ArabicText 2022 dataset is intended for academic research only. To use this dataset, please read and abide by the Data Use Agreement of our platform. The platform does not own the copyright of the data; users bear full responsibility for any operation on the data and must not redistribute it privately or use it maliciously. If the copyright of any data has been violated, please contact us and we will delete it.
Additional Information
Related Projects
1. https://wudaoai.cn/ecology/cleaner-detail
2. https://github.com/aub-mind/arabert
4. https://dumps.wikimedia.org
References
[1] Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang and Jie Tang. 2021. WuDaoCorpora: A Super Large-scale Chinese Corpora for Pre-training Language Models. AI Open, 2:65–68.
[2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.
[3] Ibrahim Abu El-Khair. 2016. Abu El-Khair Corpus: A Modern Standard Arabic Corpus. International Journal of Recent Trends in Engineering & Research, 2(11), 11.
If you encounter any problems accessing the dataset, please contact us at data@baai.ac.cn and include your name, affiliation, contact information, and intended use of the data.