ArabicText 2022

Currently the world's largest open-source Arabic pre-training text dataset, ArabicText 2022 can be used to train Arabic language models.


In cooperation with institutions in Arabic-speaking countries and regions, including the Arab Academy for Science, Technology and Maritime Transport (AASTMT) in Egypt, the Bibliotheca Alexandrina (BA), and the Inception Institute of Artificial Intelligence (IIAI) in Abu Dhabi, the cognitive model and data research team of the Beijing Academy of Artificial Intelligence (BAAI) has published ArabicText 2022, the world's largest open-source Arabic text dataset for pre-training language models.

Data Description

By collecting, aggregating and cleaning publicly available Arabic web data, we obtained a high-quality text dataset of more than 200 GB, the largest in the world's open-source community. During data cleaning, we applied and optimized WudaoCleaner¹, the efficient and effective web-text cleaning tool proven in the construction of WuDaoCorpora [1], adapting it specifically for Arabic. We also integrated the open-source Arabic text cleaning toolkit ArabertPreprocessor² into the cleaning pipeline to further ensure language-specific data quality. Moreover, knowledge-rich data such as news and encyclopedia articles account for more than 65% of the dataset, which makes it easier for language models to acquire prior knowledge from the corpus.
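The cleaning pipeline itself is not bundled with the dataset, but the kind of language-specific normalization such a pipeline performs can be sketched in a few lines of plain Python. This is an illustrative approximation only, not the actual WudaoCleaner or ArabertPreprocessor logic:

```python
import re
import unicodedata

# Arabic short-vowel marks, shadda and sukun (U+064B..U+0652),
# plus the tatweel elongation character (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"

def normalize_arabic(text):
    """Illustrative Arabic text normalization (not the ArabicText 2022 pipeline)."""
    text = unicodedata.normalize("NFC", text)
    text = DIACRITICS.sub("", text)           # strip diacritic marks
    text = text.replace(TATWEEL, "")          # drop elongation filler
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

A real pipeline would additionally handle HTML remnants, boilerplate detection and deduplication, which the tools named above are responsible for.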

Data Source and Scale

Data ID   Source                                      Scale (GB)
001       ArabicWeb22-A (collected by BAAI)           67
002       ArabicWeb16                                 50
003       OSCAR³                                      27
004       ArabicWeb22-B (provided by AASTMT & BA)     20
005       CC100 [2]                                   19
006       Abu El-Khair Corpus [3]                     15
007       Arabic Tweets                               2.1
008       Arabic Wikipedia⁴                           1.8
Total                                                 201.9
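As a quick sanity check, the per-source sizes in the table do sum to the stated total:

```python
# Per-source sizes in GB, transcribed from the table above.
sizes = {
    "ArabicWeb22-A": 67, "ArabicWeb16": 50, "OSCAR": 27,
    "ArabicWeb22-B": 20, "CC100": 19, "Abu El-Khair Corpus": 15,
    "Arabic Tweets": 2.1, "Arabic Wikipedia": 1.8,
}
total = round(sum(sizes.values()), 1)
print(total)  # 201.9
```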

Data Format

The resulting data is plain text. Based on data source, the dataset is packaged into 8 txt files, whose filenames map one-to-one to the data IDs in the table above.
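Given that layout, reading the corpus reduces to iterating over the txt files and recovering each file's data ID. The sketch below assumes each filename begins with its three-digit ID (e.g. `001_sample.txt`); the exact naming scheme is hypothetical, so adjust the glob and ID extraction to the files you actually download:

```python
from pathlib import Path

def iter_documents(data_dir):
    """Yield (data_id, line) pairs from the ArabicText 2022 txt files.

    Assumes filenames start with the three-digit data ID from the
    source table (hypothetical naming; adapt to the real filenames).
    """
    for path in sorted(Path(data_dir).glob("*.txt")):
        data_id = path.stem[:3]  # leading ID such as '001'
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank separator lines
                    yield data_id, line
```

Streaming line by line keeps memory usage flat even for the 67 GB ArabicWeb22-A file.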

Copyright

The ArabicText 2022 dataset may be used for academic research only. To use this dataset, please read and abide by the Data Use Agreement of our platform. The platform does not own the copyright of the data; users are responsible for any operation on the data and must not redistribute it privately or use it maliciously. If the copyright of any data is violated, please contact us and we will delete it.

Additional Information

Related Projects

1. https://wudaoai.cn/ecology/cleaner-detail

2. https://github.com/aub-mind/arabert

3. https://oscar-project.org

4. https://dumps.wikimedia.org

Reference

[1] Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. 2021. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open, 2:65–68.

[2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.

[3] Ibrahim Abu El-Khair. 2016. Abu El-Khair Corpus: A Modern Standard Arabic corpus. International Journal of Recent Trends in Engineering & Research, 2(11).


If you encounter any inconvenience while getting access to the dataset, please contact us at data@baai.ac.cn with your name, affiliation, contact information, and intended use of the data.
