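Sequence labeling: sequence labeling is a basic task in NLP. The goal is to assign a label to every element of a sequence. Typical sequence labeling tasks include Chinese word segmentation and named entity recognition (NER).
Every element must receive a label: one label marks the beginning of an entity, and another marks its middle or end. In NER, for example, the most commonly used scheme is the BIO tagging scheme.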
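To make the idea of one label per element concrete, here is a minimal sketch that reads entities back out of a BIO-tagged sentence. The function name `bio_to_spans`, the example sentence, and the `PER`/`LOC` entity types are illustrative assumptions, not part of the original notes.

```python
def bio_to_spans(tags):
    """Collect (start, end_exclusive, entity_type) spans from a list of BIO tags."""
    spans = []
    start, ent_type = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same entity type.
        continues = tag.startswith('I-') and ent_type == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, ent_type))
            start, ent_type = None, None
        # A B- tag always opens a new span; a stray I- with no open span is ignored.
        if tag.startswith('B-'):
            start, ent_type = i, tag[2:]
    if start is not None:
        # Close a span that runs to the end of the sentence.
        spans.append((start, len(tags), ent_type))
    return spans


# Hypothetical example: one character per token, one tag per character.
chars = list('小明在北京工作')
tags = ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
for start, end, ent_type in bio_to_spans(tags):
    print(''.join(chars[start:end]), ent_type)
```

Each tag lines up with exactly one character, so slicing the character list with a span recovers the entity text.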
The common tagging schemes are listed below:
1. BIO tagging scheme:
B-begin: the first token of an entity
I-inside: a middle or final token of an entity
O-outside: a token that is not part of any entity
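2. BIOES tagging scheme:
B-begin: the first token of an entity
I-inside: a middle token of an entity
O-outside: a token that is not part of any entity
E-end: the last token of an entity
S-single: a single character that is an entity by itself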
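3. BMES tagging scheme:
B-begin: the first token of an entity
M-middle: a middle token of an entity
O-outside: a token that is not part of any entity
E-end: the last token of an entity
S-single: a single character that is an entity by itself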
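The three schemes above can be compared side by side by tagging the same sentence in each of them. The following is a minimal sketch; the function `spans_to_tags`, the example sentence, and the entity types are assumptions made for illustration.

```python
def spans_to_tags(length, spans, scheme='BIO'):
    """Tag `length` tokens from (start, end_exclusive, entity_type) spans."""
    if scheme not in ('BIO', 'BIOES', 'BMES'):
        raise ValueError('unknown scheme: {}'.format(scheme))
    # BMES writes the middle of an entity as M, the other two schemes as I.
    middle = 'M' if scheme == 'BMES' else 'I'
    tags = ['O'] * length
    for start, end, ent_type in spans:
        for i in range(start, end):
            if scheme == 'BIO':
                prefix = 'B' if i == start else 'I'
            elif end - start == 1:
                prefix = 'S'  # single-token entity
            elif i == start:
                prefix = 'B'
            elif i == end - 1:
                prefix = 'E'
            else:
                prefix = middle
            tags[i] = '{}-{}'.format(prefix, ent_type)
    return tags


# Hypothetical sentence "李在哈尔滨": 李 is a one-character PER, 哈尔滨 a three-character LOC.
spans = [(0, 1, 'PER'), (2, 5, 'LOC')]
print(spans_to_tags(5, spans, 'BIO'))    # ['B-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC']
print(spans_to_tags(5, spans, 'BIOES'))  # ['S-PER', 'O', 'B-LOC', 'I-LOC', 'E-LOC']
print(spans_to_tags(5, spans, 'BMES'))   # ['S-PER', 'O', 'B-LOC', 'M-LOC', 'E-LOC']
```

BIOES and BMES differ only in the letter used for the middle of an entity, while BIO folds the end and single-token cases into I and B.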
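Overall, on many tasks the choice of tagging scheme makes little difference to performance.
The code below converts BMES tags to the BIO tagging scheme: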
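import json
import logging
import os
from os.path import join

from tqdm import trange

# Assumption: the original logger setup is not shown, so the standard logging module is used here.
logger = logging.getLogger(__name__)


def load_lines(path, encoding='utf8'):
    # Read a text file and return its lines with surrounding whitespace stripped
    with open(path, 'r', encoding=encoding) as f:
        lines = [line.strip() for line in f]
    return lines


def write_lines(lines, path, encoding='utf8'):
    # Write each item on its own line
    with open(path, 'w', encoding=encoding) as f:
        for line in lines:
            f.write('{}\n'.format(line))

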
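def bmes_to_json(bmes_file, json_file):
    """
    Convert a BMES-format file to a JSON Lines file. Each output line is one
    sentence with `text` and `label` fields, and the labels are converted to BIO.
    Args:
        bmes_file: input path; one "token label" pair per line, blank line between sentences
        json_file: output path
    """
    # BMES prefix -> BIO prefix; B- and O are left unchanged
    prefix_map = {'M-': 'I-', 'E-': 'I-', 'S-': 'B-'}
    texts = []
    with open(bmes_file, 'r', encoding='utf8') as f:
        lines = f.readlines()
    words = []
    labels = []
    for idx in trange(len(lines)):
        line = lines[idx].strip()
        if not line:
            # A blank line ends the current sentence
            assert len(words) == len(labels), (len(words), len(labels))
            sample = {}
            sample['text'] = words
            sample['label'] = labels
            texts.append(json.dumps(sample, ensure_ascii=False))
            words = []
            labels = []
        else:
            # Each non-blank line holds a token and its label separated by whitespace
            word, label = line.split()
            label = prefix_map.get(label[:2], label[:2]) + label[2:]
            words.append(word)
            labels.append(label)
    # Flush the last sentence if the file does not end with a blank line
    if words:
        texts.append(json.dumps({'text': words, 'label': labels}, ensure_ascii=False))
    with open(json_file, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write("{}\n".format(text))

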
if __name__ == '__main__':
    # Generate the JSON files
    data_names = ['msra']
    path = '../datasets'
    for data_name in data_names:
        logger.info('processing dataset:{}'.format(data_name))
        files = os.listdir(join(path, data_name))
        for file in files:
            file = join(path, data_name, file)
            data_type = os.path.basename(file).split('.')[0]
            out_path = join(path, data_name, data_type+'.json')