Named Entity Recognition Data Preprocessing: Format Conversion

Latest recommended article published on 2022-10-21 15:34:38

chengjinpei


Category: Natural Language Processing Resources. Tags: natural language processing, machine learning

Article link: https://blog.csdn.net/chengjinpei/article/details/107858437


Named Entity Recognition Data Preprocessing: Converting to the BIEO Format

Common corpora
Data preprocessing code

Common corpora

Link: Microsoft data.
Link: People's Daily.
Link: Weibo corpus data.

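Data preprocessing code

The code below converts a tagged sentence such as 我不是<per>江莱</per> ("I am not Jiang Lai") into the BIEO format: one character per line, followed by a tab and its tag. Characters outside a person name are tagged O; the first, inner and last characters of a name are tagged B-PER, I-PER and E-PER.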

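import codecs


def character_tagging(input_file, output_file):
    """Convert <per>...</per>-annotated text to one character per line with BIEO tags."""
    with codecs.open(input_file, 'r', 'utf-8') as input_data, \
            codecs.open(output_file, 'w', 'utf-8') as output_data:
        for line in input_data:
            # Mark each entity start with '<' and split on whitespace, so every
            # token is either plain text or '<' followed by one person name.
            # lower() lets <PER> match as well (it also lowercases the text).
            word_list = line.strip().lower().replace('<per>', ' <').replace('</per>', ' ').split()
            for word in word_list:
                if word[0] != '<':
                    # Plain text: every character is outside any entity.
                    for w in word:
                        output_data.write(w + "\tO\n")
                    continue
                name = word[1:]
                if not name:
                    continue  # empty <per></per> pair, nothing to tag
                output_data.write(name[0] + "\tB-PER\n")
                for w in name[1:-1]:
                    output_data.write(w + "\tI-PER\n")
                if len(name) > 1:
                    # Assumption: a one-character name gets B-PER only,
                    # since BIEO has no single-character tag.
                    output_data.write(name[-1] + "\tE-PER\n")
            # A blank line separates sentences.
            output_data.write("\n")


if __name__ == "__main__":
    input_file = 'name_only.txt'
    output_file = 'name_only_bieo.txt'
    character_tagging(input_file, output_file)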
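
One way to sanity-check the converted file is to read it back and collect the person names it encodes. The following is a minimal sketch, assuming the layout written by character_tagging (one character and one tag per line separated by a tab, with a blank line between sentences). The helper name extract_names and the sample lines are hypothetical and not part of the original script.

```python
def extract_names(lines):
    """Collect person names from 'char<TAB>tag' lines in the BIEO layout."""
    names, current = [], []

    def flush():
        # Emit the entity collected so far, if any.
        if current:
            names.append("".join(current))
            current.clear()

    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            flush()  # a blank line ends the sentence
            continue
        ch, tag = line.split("\t")
        if tag == "B-PER":
            flush()
            current.append(ch)
        elif tag in ("I-PER", "E-PER") and current:
            current.append(ch)
            if tag == "E-PER":
                flush()
        else:
            flush()  # an O tag (or a stray I/E tag) closes any open entity
    flush()
    return names


# Hypothetical sample matching the example sentence.
sample = ["我\tO\n", "不\tO\n", "是\tO\n", "江\tB-PER\n", "莱\tE-PER\n", "\n"]
print(extract_names(sample))
```

To run it on the real output, pass an open file object instead of the sample list, for example a handle opened on name_only_bieo.txt with encoding 'utf-8'.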

Final conversion result:
[Image: screenshot of the converted output, not reproduced here]
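
For the example sentence 我不是<per>江莱</per>, the expected output can be worked out directly from the tagging rules. The sketch below is a regex-based, in-memory variant rather than the author's script; tag_sentence and PER_PATTERN are hypothetical names, and it assumes that only <per> tags occur and that they are not nested.

```python
import re

# Hypothetical pattern: one <per>...</per> span, matched case-insensitively.
PER_PATTERN = re.compile(r"<per>(.*?)</per>", re.IGNORECASE)


def tag_sentence(sentence):
    """Return a list of (character, tag) pairs in the BIEO scheme."""
    pairs = []
    pos = 0
    for match in PER_PATTERN.finditer(sentence):
        # Text before the entity is outside any entity; whitespace is skipped.
        pairs.extend((ch, "O") for ch in sentence[pos:match.start()] if not ch.isspace())
        name = match.group(1)
        for i, ch in enumerate(name):
            if i == 0:
                tag = "B-PER"  # also used for a one-character name (assumption)
            elif i == len(name) - 1:
                tag = "E-PER"
            else:
                tag = "I-PER"
            pairs.append((ch, tag))
        pos = match.end()
    # Remaining text after the last entity.
    pairs.extend((ch, "O") for ch in sentence[pos:] if not ch.isspace())
    return pairs


for ch, tag in tag_sentence("我不是<per>江莱</per>"):
    print(ch + "\t" + tag)
```

This prints five lines: 我, 不 and 是 each tagged O, then 江 tagged B-PER and 莱 tagged E-PER.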
