features["input_ids"]: the vocabulary id of each token (for Chinese text, each character)
features["input_mask"]: the mask has 1 for real tokens and 0 for padding tokens. Only real
tokens are attended to.
features["segment_ids"]: the sentence id of each token (indicating which sentence it belongs to)
features["label_ids"]: the id of this sample's label
features["is_real_example"]: bool, True for real examples
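A rough sketch of how these three input features fit together for a sentence pair. The toy vocabulary and `max_seq_length` below are illustrative assumptions, not the real BERT vocabulary:

```python
# Sketch: building input_ids / input_mask / segment_ids for a sentence pair.
# toy_vocab is an assumption -- the real BERT vocab has ~30k WordPiece entries.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "like": 3, "most": 4, "text": 5}

def build_features(tokens_a, tokens_b, max_seq_length=12):
    # [CLS] sentence_a [SEP] sentence_b [SEP]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment 0 covers [CLS] + sentence_a + first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [toy_vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)
    # Pad to max_seq_length; padding positions get input_mask == 0
    # so attention ignores them.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    return input_ids, input_mask, segment_ids

ids, mask, segs = build_features(["like", "most"], ["text"])
```

Note how the 0/1 pattern in `segment_ids` and the trailing zeros in `input_mask` mirror the dumps below.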
tokens ['[CLS]', 'like', 'most', 'of', 'his', 'fellow', 'gold', '-', 'seekers', ',', 'cass', 'was', 'super', '##sti', '##tious', '.', '[SEP]', 'text', 'should', 'be', 'one', '-', 'sentence', '-', 'per', '-', 'line', ',', 'with', 'empty', 'lines', 'between', 'documents', '.', 'this', 'sample', 'text', 'is', 'public', 'domain', 'and', 'was', 'randomly', 'selected', 'from', 'project', 'gut', '##tenberg', '.', '[SEP]']
segment_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
cand_indexes (positions eligible for masking; [CLS] at index 0 and the two [SEP]s at 16 and 49 are excluded): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48]
cand_indexes after shuffling: [47, 3, 8, 11, 4, 46, 40, 7, 28, 30, 33, 26, 18, 12, 22, 39, 35, 21, 31, 42, 15, 1, 38, 34, 44, 29, 32, 19, 17, 43, 6, 37, 45, 27, 41, 36, 13, 20, 14, 23, 25, 9, 24, 48, 2, 10, 5]
The masked positions and their labels (the labels are the original token values):
[MaskedLmInstance(index=47, label='##tenberg'), MaskedLmInstance(index=3, label='of'), MaskedLmInstance(index=8, label='seekers'), MaskedLmInstance(index=11, label='was'), MaskedLmInstance(index=4, label='his'), MaskedLmInstance(index=46, label='gut'), MaskedLmInstance(index=40, label='and'), MaskedLmInstance(index=7, label='-')]
The instances sorted by index (`masked_lms` is sorted by position so that masked_lm_positions and masked_lm_labels are written out in ascending token order, as seen in the dumps below):
[MaskedLmInstance(index=3, label='of'), MaskedLmInstance(index=4, label='his'), MaskedLmInstance(index=7, label='-'), MaskedLmInstance(index=8, label='seekers'), MaskedLmInstance(index=11, label='was'), MaskedLmInstance(index=40, label='and'), MaskedLmInstance(index=46, label='gut'), MaskedLmInstance(index=47, label='##tenberg')]
Original tokens:
['[CLS]', 'like', 'most', 'of', 'his', 'fellow', 'gold', '-', 'seekers', ',', 'cass', 'was', 'super', '##sti', '##tious', '.', '[SEP]', 'text', 'should', 'be', 'one', '-', 'sentence', '-', 'per', '-', 'line', ',', 'with', 'empty', 'lines', 'between', 'documents', '.', 'this', 'sample', 'text', 'is', 'public', 'domain', 'and', 'was', 'randomly', 'selected', 'from', 'project', 'gut', '##tenberg', '.', '[SEP]']
Tokens after masking:
['[CLS]', 'like', 'most', '[MASK]', '[MASK]', 'fellow', 'gold', '[MASK]', '[MASK]', ',', 'cass', 'was', 'super', '##sti', '##tious', '.', '[SEP]', 'text', 'should', 'be', 'one', '-', 'sentence', '-', 'per', '-', 'line', ',', 'with', 'empty', 'lines', 'between', 'documents', '.', 'this', 'sample', 'text', 'is', 'public', 'domain', '[MASK]', 'was', 'randomly', 'selected', 'from', 'project', '[MASK]', '[MASK]', '.', '[SEP]']
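The masking step above can be sketched as follows. This is a simplified version of `create_masked_lm_predictions`: the 15% masking rate and the 80/10/10 replace-with-[MASK]/random-token/keep-original split follow BERT, but the tiny random-replacement pool here is an illustrative assumption (the real code samples from the full vocabulary):

```python
import collections
import random

MaskedLmInstance = collections.namedtuple("MaskedLmInstance", ["index", "label"])

def create_masked_lm_predictions(tokens, masked_lm_prob=0.15,
                                 max_predictions_per_seq=20, rng=None):
    """Simplified sketch of BERT's masking: pick ~15% of non-special tokens;
    replace 80% with [MASK], 10% with a random token, 10% kept unchanged."""
    rng = rng or random.Random(12345)
    # Positions eligible for masking: everything except [CLS] and [SEP].
    cand_indexes = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(cand_indexes)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output_tokens = list(tokens)
    masked_lms = []
    vocab_words = sorted(set(tokens))  # toy replacement pool (assumption)
    for index in cand_indexes[:num_to_predict]:
        r = rng.random()
        if r < 0.8:
            masked_token = "[MASK]"                 # 80%: mask out
        elif r < 0.9:
            masked_token = rng.choice(vocab_words)  # 10%: random token
        else:
            masked_token = tokens[index]            # 10%: keep original
        output_tokens[index] = masked_token
        # The label is always the ORIGINAL token at that position.
        masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
    # Sort by position so masked_lm_positions / masked_lm_labels come out
    # in ascending token order.
    masked_lms.sort(key=lambda x: x.index)
    positions = [m.index for m in masked_lms]
    labels = [m.label for m in masked_lms]
    return output_tokens, positions, labels
```

Running this over the sample tokens above produces the same shape of output: shuffled candidate indexes, a capped number of predictions, and index-sorted (position, label) pairs.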
Examples:
[tokens: [CLS] ceased [MASK] the gray streaks of morning at blazing star , and the [MASK] awoke to a [MASK] sense of clean ##liness , and the finding of forgotten knives , tin cups , and smaller [MASK] ut ##ens ##ils , where the heavy showers had washed away the [MASK] [MASK] dust heap ##s before the cabin doors . indeed , it [MASK] recorded in blazing star that a fortunate [MASK] rise ##r had once picked up on the highway a solid chunk [MASK] gold quartz which the rain had freed from its inc ##umber ##ing soil , and [SEP] this text is [MASK] to [MASK] sure unicode is handled [MASK] : [MASK] 加 勝 北 区 ᴵ bobbie ##ᵀ ##ᵃ ##ছ ##জ ##ট ##ড [MASK] ##ত [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 2 6 8 14 18 36 44 49 50 62 70 83 93 103 105 110 112 118 125
masked_lm_labels: with of at settlement moral camp showers debris and was early of inc included make properly 力 ##ᴺ ##ণ
, tokens: [CLS] possibly this may have been the reason why early rise [MASK] [MASK] that locality , during the [MASK] season , adopted [MASK] thoughtful habit of body , and seldom lifted their eyes to the rift ##ed [MASK] [MASK] - ink washed skies above them . [SEP] [MASK] , [MASK] not with a view [MASK] discovery . a leak in his cabin roof , - - quite consistent with his careless , imp ##rov ##ide ##nt habits , - - had rouse ##d him at 4 a . m . , with a flooded " bunk " and wet blankets . the [MASK] [MASK] his wood pile independently to kind ##le a fire to ##ᵘ [MASK] bed - [MASK] , and he had rec honesty ##e to [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: False
masked_lm_positions: 11 12 14 18 22 37 38 47 49 54 73 102 103 107 114 115 118 124 125
masked_lm_labels: ##rs in locality rainy a or india morning but to ##rov chips from refused dry his clothes ##ours ##e
, tokens: [CLS] this was nearly opposite . mr . cass ##ius crossed the highway , and stopped suddenly . something glitter ##ed in the [MASK] red pool [MASK] him [MASK] gold , surely ! but [MASK] wonderful [MASK] [MASK] , not an irregular , shape ##less fragment of [MASK] ore , fresh from [MASK] ' s cr ##ucible , but a bit of jewel ##er ' s ⁻ ##ic [MASK] ##t in [MASK] form [MASK] a plain gold ring . [MASK] at it [MASK] at ##ten ##tively , he saw that it [MASK] the inscription , " may to cass . [MASK] [SEP] this sample text is public domain and [MASK] randomly selected from project gut ##tenberg . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 23 26 28 34 36 37 47 52 66 68 71 73 79 82 91 96 100 109
masked_lm_labels: nearest before . , to relate crude nature hand ##raf the of looking more bore may " was
, tokens: [CLS] like most [MASK] [MASK] fellow gold [MASK] [MASK] , cass was super ##sti ##tious . [SEP] text should be one - sentence - per - line , with empty lines between documents . this sample text is public domain [MASK] was randomly selected from project [MASK] [MASK] . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 3 4 7 8 11 40 46 47
masked_lm_labels: of his - seekers was and gut ##tenberg
]
write_instance_to_example_files: writes the instances to file, converting tokens to ids.
A complete example of one record written to the TFRecord:
INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] at [MASK] [unused513] reached the quay at the opposite [MASK] of the street ; [SEP] [MASK] , wonderful to relate [MASK] [MASK] an irregular , shape ##less [MASK] of crude ore [MASK] fresh from nature ' s disagreements ##ucible , muse a bit of jewel ##er ' s hand [MASK] ##raf ##t in the form of a plain gold ring . looking at [MASK] [MASK] at ##ten ##tively , he saw that it bore the inscription , " may to cass . " like most [MASK] his fellow gold - seekers , cass [MASK] super ##sti ##tious . [SEP]
INFO:tensorflow:input_ids: 101 2012 103 518 2584 1996 21048 2012 1996 4500 103 1997 1996 2395 1025 102 103 1010 6919 2000 14396 103 103 2019 12052 1010 4338 3238 103 1997 13587 10848 103 4840 2013 3267 1005 1055 23145 21104 1010 18437 1037 2978 1997 13713 2121 1005 1055 2192 103 27528 2102 1999 1996 2433 1997 1037 5810 2751 3614 1012 2559 2012 103 103 2012 6528 25499 1010 2002 2387 2008 2009 8501 1996 9315 1010 1000 2089 2000 16220 1012 1000 2066 2087 103 2010 3507 2751 1011 24071 1010 16220 103 3565 16643 20771 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 2 3 10 16 21 22 28 32 38 41 50 64 65 86 94 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 2197 2027 2203 2021 1010 2025 15778 1010 13675 2021 2594 2009 2062 1997 2001 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
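The trailing zeros in masked_lm_positions/masked_lm_ids and the 0.0 entries in masked_lm_weights come from padding each example to a fixed `max_predictions_per_seq` (20 here); the 0.0 weights make the loss ignore the padded slots. A minimal sketch of that padding step (the real code then wraps these lists in `tf.train.Feature`s inside a `tf.train.Example` before writing the TFRecord; plain Python lists are used here for illustration):

```python
def pad_masked_lm(positions, ids, max_predictions_per_seq=20):
    """Pad masked-LM positions/ids to a fixed length. The weight is 1.0 for
    real predictions and 0.0 for padding, so padded slots contribute nothing
    to the masked-LM loss."""
    weights = [1.0] * len(ids)
    n_pad = max_predictions_per_seq - len(positions)
    return (positions + [0] * n_pad,
            ids + [0] * n_pad,
            weights + [0.0] * n_pad)

# Three of the real predictions from the log above, padded to 5 slots
# (the log uses max_predictions_per_seq=20).
pos, ids, w = pad_masked_lm([2, 3, 10], [2197, 2027, 2203],
                            max_predictions_per_seq=5)
```

This reproduces the `... 0 0 0 0 0` / `... 0.0 0.0 0.0 0.0 0.0` tails visible in the masked_lm_positions, masked_lm_ids, and masked_lm_weights lines of the log.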