Natural Language Processing 1: Basic Concepts & the WikiText Dataset

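The WikiText dataset is drawn from Wikipedia articles. Only articles verified as high quality are included, totaling more than 100 million tokens (a token is one of the word-level units a sentence is split into, so this figure counts every such unit).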

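Compared with the preprocessed version of Penn Treebank (PTB), WikiText-2 is more than 2 times larger and WikiText-103 is more than 110 times larger. WikiText also has a much larger vocabulary and keeps the original case, punctuation, and numbers, all of which are removed in PTB. Because the dataset consists of complete articles, it is well suited to models that can exploit long-term dependencies.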

WikiText-2 is about 4.3 MB and WikiText-103 is about 181 MB.

Two concepts need to be made clear here:

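Tokens: in natural language processing, a single word or symbol is called a token.
In WikiText, tokens are the word or symbol units that actually occur in the text, such as words, punctuation marks, and numbers.
A token is the smallest unit of the text data and the finest granularity of model input; the goal of a language model is to predict the next token.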
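
To make "predict the next token" concrete, here is a minimal sketch of how a token sequence can be turned into (input, target) pairs by shifting it one position. The sentence is an illustrative assumption, not a line taken from the dataset.

```python
# WikiText text is already space-separated, so a plain split yields the tokens.
tokens = "the lobsters are blue .".split()

# At each position the model sees one token and should predict the next one.
inputs = tokens[:-1]
targets = tokens[1:]
pairs = list(zip(inputs, targets))

for current, following in pairs:
    print(f"{current!r} -> {following!r}")
```

A sequence of n tokens gives n - 1 such pairs; real training code would work on integer IDs rather than strings.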

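Vocab: short for vocabulary, the set of all distinct tokens.
In WikiText, the vocab is the set of every distinct word or symbol that occurs, including words, punctuation marks, numbers, and other special characters.
Each word or symbol can be assigned a unique ID, and the model is trained on, and makes predictions over, these IDs.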

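When training a language model on WikiText, the tokens in the text must be converted to a numeric representation, and a vocabulary (vocab) is built from a subset of the data (typically the training split), so that the model can learn and predict.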
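
The difference between tokens and vocab can be sketched in a few lines. This is only an illustration under simple assumptions: the sentence is made up, and IDs are assigned in sorted order, which is one of many possible conventions.

```python
from collections import Counter

# Illustrative text in WikiText style: tokens are separated by spaces.
text = "the lobster is blue , the claw is large ."

# Tokens: every occurrence counts, and order is preserved.
tokens = text.split()

# Vocab: the distinct tokens, each mapped to a unique integer ID.
counts = Counter(tokens)
vocab = {tok: idx for idx, tok in enumerate(sorted(counts))}

# Numeric representation: replace each token with its ID.
ids = [vocab[tok] for tok in tokens]

print(len(tokens))  # total number of tokens
print(len(vocab))   # vocabulary size (distinct tokens)
print(ids)
```

Here the text has 10 tokens but only 8 vocabulary entries, because "the" and "is" each occur twice.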

Example

The dataset is stored as plain text files, in roughly the following format:

= Homarus gammarus = 
 
 Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . 
 
 = = Description = = 
 
 Homarus gammarus is a large <unk> , with a body length up to 60 centimetres ( 24 in ) and weighing up to 5 – 6 kilograms ( 11 – 13 lb ) , although the lobsters caught in lobster pots are usually 23 – 38 cm ( 9 – 15 in ) long and weigh 0 @.@ 7 – 2 @.@ 2 kg ( 1 @.@ 5 – 4 @.@ 9 lb ) . Like other crustaceans , lobsters have a hard <unk> which they must shed in order to grow , in a process called <unk> ( <unk> ) . This may occur several times a year for young lobsters , but decreases to once every 1 – 2 years for larger animals . 
 The first pair of <unk> is armed with a large , asymmetrical pair of claws . The larger one is the " <unk> " , and has rounded <unk> used for crushing prey ; the other is the " cutter " , which has sharp inner edges , and is used for holding or tearing the prey . Usually , the left claw is the <unk> , and the right is the cutter . 
 The <unk> is generally blue above , with spots that <unk> , and yellow below . The red colour associated with lobsters only appears after cooking . This occurs because , in life , the red pigment <unk> is bound to a protein complex , but the complex is broken up by the heat of cooking , releasing the red pigment . 
 The closest relative of H. gammarus is the American lobster , Homarus americanus . The two species are very similar , and can be crossed artificially , although hybrids are unlikely to occur in the wild since their ranges do not overlap . The two species can be distinguished by a number of characteristics : 
 The <unk> of H. americanus bears one or more spines on the underside , which are lacking in H. gammarus . 
 The spines on the claws of H. americanus are red or red @-@ tipped , while those of H. gammarus are white or white @-@ tipped . 
 The underside of the claw of H. americanus is orange or red , while that of H. gammarus is creamy white or very pale red . 
 
 = = Life cycle = = 
 
 Female H. gammarus reach sexual maturity when they have grown to a carapace length of 80 – 85 millimetres ( 3 @.@ 1 – 3 @.@ 3 in ) , whereas males mature at a slightly smaller size . Mating typically occurs in summer between a recently <unk> female , whose shell is therefore soft , and a hard @-@ shelled male . The female carries the eggs for up to 12 months , depending on the temperature , attached to her <unk> . Females carrying eggs are said to be " <unk> " and can be found throughout the year . 
 The eggs hatch at night , and the larvae swim to the water surface where they drift with the ocean currents , preying on <unk> . This stage involves three <unk> and lasts for 15 – 35 days . After the third moult , the juvenile takes on a form closer to the adult , and adopts a <unk> lifestyle . The juveniles are rarely seen in the wild , and are poorly known , although they are known to be capable of digging extensive burrows . It is estimated that only 1 larva in every 20 @,@ 000 survives to the <unk> phase . When they reach a carapace length of 15 mm ( 0 @.@ 59 in ) , the juveniles leave their burrows and start their adult lives . .....

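The example above is the article = Homarus gammarus = (a species of lobster). That line is also the article's first-level heading; = = Description = = and = = Life cycle = = further down are second-level headings.
Likewise, a line of the form = = = xxx = = = is a third-level heading within the article.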

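Note that everything under one first-level heading is called an article (Article); it can contain several second-level and third-level headings together with their content.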
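
The heading convention described above can be detected with a small regular expression. The sketch below is a heuristic, not an official parser: the function names are hypothetical, the second article title is invented for the demo, and a body line that happens to look like a heading would be misclassified.

```python
import re

# " = Title = " is level 1, " = = Title = = " is level 2, and so on.
HEADING = re.compile(r"^((?:= )+)(.+?)((?: =)+)$")

def heading_level(line):
    """Return (level, title) for a heading line, or None for body text."""
    m = HEADING.match(line.strip())
    if not m:
        return None
    left = m.group(1).count("=")
    right = m.group(3).count("=")
    if left != right:
        return None
    return left, m.group(2).strip()

def split_articles(lines):
    """Group lines into articles; each first-level heading starts a new one."""
    articles = []
    for line in lines:
        info = heading_level(line)
        if info and info[0] == 1:
            articles.append({"title": info[1], "lines": []})
        elif articles and line.strip():
            articles[-1]["lines"].append(line.strip())
    return articles

sample = [
    " = Homarus gammarus = ",
    " ",
    " Homarus gammarus is a large <unk> . ",
    " = = Description = = ",
    " The <unk> is generally blue above . ",
    " = Another article = ",  # invented title, only for the demo
    " Some text . ",
]
articles = split_articles(sample)
for article in articles:
    print(article["title"], len(article["lines"]))
```

Second-level and third-level headings stay inside the article's lines, which matches the definition of an article as everything under one first-level heading.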

Markers

Note that several special markers common in NLP appear frequently in the text:

  • <unk>

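This marks a low-frequency word: it was too rare to be included in the vocabulary, so it is replaced by this placeholder.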

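  • @

This is a joiner marker. For example, if an article contains one-apple, the dataset text records it as one @-@ apple. The sample above shows @.@ and @,@ used the same way inside numbers (0 @.@ 7 and 20 @,@ 000).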

  • <eos>

A marker appended at the end of a passage. Typically a passage is stored as one string and then split into a list, and <eos> is the last element of that list.
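
As a rough sketch of how these markers are handled in practice: the first helper undoes the @-@, @.@ and @,@ joiners for readability, and the second splits a line into tokens and appends <eos>. Both function names are hypothetical. Language-model pipelines commonly keep the joiner tokens as they are, so restoring them is optional.

```python
import re

def restore_joiners(line):
    """Undo the ' @-@ ', ' @.@ ' and ' @,@ ' escapes, e.g. '0 @.@ 7' -> '0.7'."""
    return re.sub(r" @([-.,])@ ", r"\1", line)

def line_to_tokens(line):
    """Split a line on whitespace and append the <eos> marker."""
    return line.split() + ["<eos>"]

print(restore_joiners("red or red @-@ tipped"))
print(restore_joiners("1 larva in every 20 @,@ 000 survives"))
print(line_to_tokens(" a hard @-@ shelled male . "))
```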

Processing

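Processing requires building a corpus (Corpus), usually by constructing a dictionary (Dictionary) that indexes every word in the vocabulary.
The WikiText-2 corpus contains 33,278 distinct words; the WikiText-103 corpus contains 267,735. All of the text is made up of these words, and the dictionary has the following form: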

{word_A: 0, word_B: 1, ..., word_X: N}
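
The dictionary form above can be sketched as a small class that assigns indices in order of first appearance, plus a helper that encodes lines into one flat list of indices. The names Dictionary, add_word and encode_lines are illustrative assumptions, and the two input lines are made up.

```python
class Dictionary:
    """Maps each distinct word to an integer index and back."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Assign the next free index the first time a word is seen.
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


def encode_lines(lines, dictionary):
    """Turn text lines into one flat list of indices, adding <eos> per line."""
    ids = []
    for line in lines:
        for word in line.split() + ["<eos>"]:
            ids.append(dictionary.add_word(word))
    return ids


lines = [" The lobsters are blue . ", " The claws are large . "]
dictionary = Dictionary()
ids = encode_lines(lines, dictionary)
print(dictionary.word2idx)
print(ids)
```

Keeping both word2idx and idx2word makes it cheap to go from words to IDs for training and from IDs back to words when inspecting predictions.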

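Official dataset statistics:
[Image: statistics table, not preserved]
Note the Tokens figure: it is the total count of words obtained after splitting the entire dataset text into words. Every occurrence is counted, and the tokens keep their original order.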

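In addition, counting by lines (that is, segments delimited by <eos>), the Train, Valid, and Test splits of WikiText-2 contain 36,718, 3,760, and 4,358 lines respectively.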

Downloads and references

  • https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
  • https://paperswithcode.com/dataset/wikitext-2