Meaning of each feature in the BERT model's data

features["input_ids"] :每个中文字对应的词库id
features["input_mask"]  : The mask has 1 for real tokens and 0 for padding tokens. Only real
 tokens are attended to.
features["segment_ids"] : 句子标记的id(表明属于哪一个句子)
features["label_ids"]  : 这条样本对应标签的id
features["is_real_example"] :  bool类型,True
tokens ['[CLS]', 'like', 'most', 'of', 'his', 'fellow', 'gold', '-', 'seekers', ',', 'cass', 'was', 'super', '##sti', '##tious', '.', '[SEP]', 'text', 'should', 'be', 'one', '-', 'sentence', '-', 'per', '-', 'line', ',', 'with', 'empty', 'lines', 'between', 'documents', '.', 'this', 'sample', 'text', 'is', 'public', 'domain', 'and', 'was', 'randomly', 'selected', 'from', 'project', 'gut', '##tenberg', '.', '[SEP]']
segment_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cand_indexes (token positions eligible for masking; [CLS] at 0 and the [SEP]s at 16 and 49 are excluded): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48]
cand_indexes after random.shuffle (candidates are consumed in this order): [47, 3, 8, 11, 4, 46, 40, 7, 28, 30, 33, 26, 18, 12, 22, 39, 35, 21, 31, 42, 15, 1, 38, 34, 44, 29, 32, 19, 17, 43, 6, 37, 45, 27, 41, 36, 13, 20, 14, 23, 25, 9, 24, 48, 2, 10, 5]
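
The two lists above are cand_indexes before and after shuffling, produced inside create_masked_lm_predictions in create_pretraining_data.py. A minimal reconstruction (token list abbreviated; the seed is arbitrary):

```python
import random

tokens = ['[CLS]', 'like', 'most', 'of', 'his', '[SEP]',
          'text', 'should', 'be', '[SEP]']          # abbreviated

# [CLS] and [SEP] can never be masked, so their positions are skipped --
# which is why 0, 16 and 49 are missing from the first list above.
cand_indexes = [i for i, tok in enumerate(tokens)
                if tok not in ("[CLS]", "[SEP]")]

rng = random.Random(12345)                          # arbitrary seed
rng.shuffle(cand_indexes)                           # masking candidates are
print(cand_indexes)                                 # consumed in this order
```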

Masked positions and their labels (the original true values):

 [MaskedLmInstance(index=47, label='##tenberg'), MaskedLmInstance(index=3, label='of'), MaskedLmInstance(index=8, label='seekers'), MaskedLmInstance(index=11, label='was'), MaskedLmInstance(index=4, label='his'), MaskedLmInstance(index=46, label='gut'), MaskedLmInstance(index=40, label='and'), MaskedLmInstance(index=7, label='-')]

The values after sorting. (Why sort? In create_pretraining_data.py, masked_lms is sorted with `sorted(masked_lms, key=lambda x: x.index)` so that masked_lm_positions and masked_lm_labels are written out in ascending token-position order, keeping the two lists aligned and the output deterministic.)

 [MaskedLmInstance(index=47, label='##tenberg'), MaskedLmInstance(index=3, label='of'), MaskedLmInstance(index=8, label='seekers'), MaskedLmInstance(index=11, label='was'), MaskedLmInstance(index=4, label='his'), MaskedLmInstance(index=46, label='gut'), MaskedLmInstance(index=40, label='and'), MaskedLmInstance(index=7, label='-')]

Original tokens:

 ['[CLS]', 'like', 'most', 'of', 'his', 'fellow', 'gold', '-', 'seekers', ',', 'cass', 'was', 'super', '##sti', '##tious', '.', '[SEP]', 'text', 'should', 'be', 'one', '-', 'sentence', '-', 'per', '-', 'line', ',', 'with', 'empty', 'lines', 'between', 'documents', '.', 'this', 'sample', 'text', 'is', 'public', 'domain', 'and', 'was', 'randomly', 'selected', 'from', 'project', 'gut', '##tenberg', '.', '[SEP]']

Tokens after masking:

['[CLS]', 'like', 'most', '[MASK]', '[MASK]', 'fellow', 'gold', '[MASK]', '[MASK]', ',', 'cass', 'was', 'super', '##sti', '##tious', '.', '[SEP]', 'text', 'should', 'be', 'one', '-', 'sentence', '-', 'per', '-', 'line', ',', 'with', 'empty', 'lines', 'between', 'documents', '.', 'this', 'sample', 'text', 'is', 'public', 'domain', '[MASK]', 'was', 'randomly', 'selected', 'from', 'project', '[MASK]', '[MASK]', '.', '[SEP]']
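
The masked sequence is produced by walking the shuffled candidates until 15% of the tokens are chosen (here round(50 × 0.15) = 8, matching the eight MaskedLmInstance entries above), replacing each chosen position with [MASK] 80% of the time, keeping the original token 10% of the time, and substituting a random vocabulary token the remaining 10%. A condensed sketch of create_masked_lm_predictions (the de-duplication bookkeeping of the original is omitted):

```python
import collections
import random

MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])

def create_masked_lm_predictions(tokens, cand_indexes, vocab_words, rng,
                                 masked_lm_prob=0.15, max_predictions_per_seq=20):
    output_tokens = list(tokens)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    masked_lms = []
    for index in cand_indexes:           # cand_indexes is already shuffled
        if len(masked_lms) >= num_to_predict:
            break
        if rng.random() < 0.8:
            masked_token = "[MASK]"                 # 80%: replace with [MASK]
        elif rng.random() < 0.5:
            masked_token = tokens[index]            # 10%: keep the original token
        else:
            masked_token = rng.choice(vocab_words)  # 10%: random vocabulary token
        output_tokens[index] = masked_token
        # The label is always the original token, whatever was substituted.
        masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

    # Sorting by index is why masked_lm_positions / masked_lm_labels
    # come out in ascending token-position order.
    masked_lms = sorted(masked_lms, key=lambda x: x.index)
    positions = [p.index for p in masked_lms]
    labels = [p.label for p in masked_lms]
    return output_tokens, positions, labels
```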

Example: the full list of TrainingInstance objects produced from the sample text:

[tokens: [CLS] ceased [MASK] the gray streaks of morning at blazing star , and the [MASK] awoke to a [MASK] sense of clean ##liness , and the finding of forgotten knives , tin cups , and smaller [MASK] ut ##ens ##ils , where the heavy showers had washed away the [MASK] [MASK] dust heap ##s before the cabin doors . indeed , it [MASK] recorded in blazing star that a fortunate [MASK] rise ##r had once picked up on the highway a solid chunk [MASK] gold quartz which the rain had freed from its inc ##umber ##ing soil , and [SEP] this text is [MASK] to [MASK] sure unicode is handled [MASK] : [MASK] 加 勝 北 区 ᴵ bobbie ##ᵀ ##ᵃ ##ছ ##জ ##ট ##ড [MASK] ##ত [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 2 6 8 14 18 36 44 49 50 62 70 83 93 103 105 110 112 118 125
masked_lm_labels: with of at settlement moral camp showers debris and was early of inc included make properly 力 ##ᴺ ##ণ
 
, tokens: [CLS] possibly this may have been the reason why early rise [MASK] [MASK] that locality , during the [MASK] season , adopted [MASK] thoughtful habit of body , and seldom lifted their eyes to the rift ##ed [MASK] [MASK] - ink washed skies above them . [SEP] [MASK] , [MASK] not with a view [MASK] discovery . a leak in his cabin roof , - - quite consistent with his careless , imp ##rov ##ide ##nt habits , - - had rouse ##d him at 4 a . m . , with a flooded " bunk " and wet blankets . the [MASK] [MASK] his wood pile independently to kind ##le a fire to ##ᵘ [MASK] bed - [MASK] , and he had rec honesty ##e to [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: False
masked_lm_positions: 11 12 14 18 22 37 38 47 49 54 73 102 103 107 114 115 118 124 125
masked_lm_labels: ##rs in locality rainy a or india morning but to ##rov chips from refused dry his clothes ##ours ##e
 
, tokens: [CLS] this was nearly opposite . mr . cass ##ius crossed the highway , and stopped suddenly . something glitter ##ed in the [MASK] red pool [MASK] him [MASK] gold , surely ! but [MASK] wonderful [MASK] [MASK] , not an irregular , shape ##less fragment of [MASK] ore , fresh from [MASK] ' s cr ##ucible , but a bit of jewel ##er ' s ⁻ ##ic [MASK] ##t in [MASK] form [MASK] a plain gold ring . [MASK] at it [MASK] at ##ten ##tively , he saw that it [MASK] the inscription , " may to cass . [MASK] [SEP] this sample text is public domain and [MASK] randomly selected from project gut ##tenberg . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 23 26 28 34 36 37 47 52 66 68 71 73 79 82 91 96 100 109
masked_lm_labels: nearest before . , to relate crude nature hand ##raf the of looking more bore may " was
 
, tokens: [CLS] like most [MASK] [MASK] fellow gold [MASK] [MASK] , cass was super ##sti ##tious . [SEP] text should be one - sentence - per - line , with empty lines between documents . this sample text is public domain [MASK] was randomly selected from project [MASK] [MASK] . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 3 4 7 8 11 40 46 47
masked_lm_labels: of his - seekers was and gut ##tenberg
 
]

write_instance_to_example_files: writes the instances to file, converting tokens to IDs.
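
A condensed sketch of what this function does, following write_instance_to_example_files in create_pretraining_data.py (max_seq_length = 128 and max_predictions_per_seq = 20 match the record shown below; the writer setup is abbreviated):

```python
import tensorflow as tf

def int_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

def float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))

def pad(seq, length, value=0):
    return list(seq) + [value] * (length - len(seq))

def instance_to_example(input_ids, segment_ids, masked_lm_positions,
                        masked_lm_ids, is_random_next,
                        max_seq_length=128, max_predictions_per_seq=20):
    # input_mask: 1 for the real tokens, 0 for the zero-padding.
    input_mask = pad([1] * len(input_ids), max_seq_length)
    # masked_lm_weights: 1.0 for real predictions, 0.0 for padded slots --
    # exactly the 1.0/0.0 pattern visible in the record below.
    weights = pad([1.0] * len(masked_lm_ids), max_predictions_per_seq, 0.0)
    features = {
        "input_ids": int_feature(pad(input_ids, max_seq_length)),
        "input_mask": int_feature(input_mask),
        "segment_ids": int_feature(pad(segment_ids, max_seq_length)),
        "masked_lm_positions": int_feature(pad(masked_lm_positions,
                                               max_predictions_per_seq)),
        "masked_lm_ids": int_feature(pad(masked_lm_ids,
                                         max_predictions_per_seq)),
        "masked_lm_weights": float_feature(weights),
        "next_sentence_labels": int_feature([1 if is_random_next else 0]),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))

# with tf.io.TFRecordWriter("pretrain.tfrecord") as writer:
#     writer.write(instance_to_example(...).SerializeToString())
```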

A complete record as written to the TFRecord file:

INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] at [MASK] [unused513] reached the quay at the opposite [MASK] of the street ; [SEP] [MASK] , wonderful to relate [MASK] [MASK] an irregular , shape ##less [MASK] of crude ore [MASK] fresh from nature ' s disagreements ##ucible , muse a bit of jewel ##er ' s hand [MASK] ##raf ##t in the form of a plain gold ring . looking at [MASK] [MASK] at ##ten ##tively , he saw that it bore the inscription , " may to cass . " like most [MASK] his fellow gold - seekers , cass [MASK] super ##sti ##tious . [SEP]
INFO:tensorflow:input_ids: 101 2012 103 518 2584 1996 21048 2012 1996 4500 103 1997 1996 2395 1025 102 103 1010 6919 2000 14396 103 103 2019 12052 1010 4338 3238 103 1997 13587 10848 103 4840 2013 3267 1005 1055 23145 21104 1010 18437 1037 2978 1997 13713 2121 1005 1055 2192 103 27528 2102 1999 1996 2433 1997 1037 5810 2751 3614 1012 2559 2012 103 103 2012 6528 25499 1010 2002 2387 2008 2009 8501 1996 9315 1010 1000 2089 2000 16220 1012 1000 2066 2087 103 2010 3507 2751 1011 24071 1010 16220 103 3565 16643 20771 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 2 3 10 16 21 22 28 32 38 41 50 64 65 86 94 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 2197 2027 2203 2021 1010 2025 15778 1010 13675 2021 2594 2009 2062 1997 2001 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
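
For completeness, reading such a record back for pretraining parses each field with a fixed-length spec, roughly as run_pretraining.py does (sketch only; the filename is hypothetical):

```python
import tensorflow as tf

max_seq_length, max_predictions_per_seq = 128, 20

name_to_features = {
    "input_ids": tf.io.FixedLenFeature([max_seq_length], tf.int64),
    "input_mask": tf.io.FixedLenFeature([max_seq_length], tf.int64),
    "segment_ids": tf.io.FixedLenFeature([max_seq_length], tf.int64),
    "masked_lm_positions": tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
    "masked_lm_ids": tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
    "masked_lm_weights": tf.io.FixedLenFeature([max_predictions_per_seq], tf.float32),
    "next_sentence_labels": tf.io.FixedLenFeature([1], tf.int64),
}

# "pretrain.tfrecord" is a placeholder filename for illustration.
dataset = tf.data.TFRecordDataset("pretrain.tfrecord").map(
    lambda record: tf.io.parse_single_example(record, name_to_features))
```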
BERT (Bidirectional Encoder Representations from Transformers) is a pretrained model that can be used for text feature extraction. BERT learns general-purpose text representations through unsupervised pretraining on large-scale text corpora; on top of these representations, you can fine-tune for downstream tasks or extract features directly.

To extract text features with BERT, the input text must first be converted into the format BERT expects. For English text, WordPiece tokenization splits the text into subword pieces, and special tokens ([CLS] and [SEP]) mark the start and end of sentences. For Chinese text, character-level tokenization is used.

The tokenized text is then fed into the BERT model to obtain its hidden states. BERT stacks multiple Transformer encoder layers, each of which outputs its own hidden states. You can use the last layer's hidden states, or fuse the hidden states of several layers, to obtain the final text representation.

A common approach is to take the last layer's hidden state at a specific token (such as [CLS]) as a feature vector representing the whole sentence. This vector can then serve as the input to downstream tasks such as text classification or sentence-similarity computation.

Besides the last layer, BERT also exposes the hidden states of its other layers, which capture different aspects of the text at finer granularity. Choose which layers to extract from according to the task at hand.

Note that because BERT has many parameters, extracting features for a large corpus can be slow. To improve throughput, batch multiple samples for parallel processing, or reduce model size and compute with model-compression techniques.
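
As an illustration of the [CLS]-vector approach described above, here is a minimal sketch using the Hugging Face transformers library (the library and model name are my choice, not something the original text specifies):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
model.eval()

# Tokenize a small batch; padding/truncation produce the input_ids,
# attention_mask and token_type_ids discussed earlier in this post.
inputs = tokenizer(["这是一个例子。", "再来一句。"],
                   padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

cls_vectors = outputs.last_hidden_state[:, 0, :]  # [CLS] vector per sentence
all_layers = outputs.hidden_states                # embeddings + every encoder layer
print(cls_vectors.shape)                          # (batch, hidden) = (2, 768)
```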