1、First, get the model running on the official dataset
Reference model: a CRF-based named entity recognition model for the CoNLL2002 dataset, implemented with pycrfsuite
On installing nltk and pycrfsuite:
How to install and use NLTK in Jupyter Notebook
You can also download nltk_data from GitHub: https://github.com/nltk/nltk_data
Entry-level NER: an introduction to pycrfsuite and its applications
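For reference, a minimal setup sketch for a Jupyter/console environment (note that the PyPI package for pycrfsuite is named python-crfsuite; the corpus download is only needed for the official-dataset run):

# pip install nltk python-crfsuite
import nltk
nltk.download('conll2002')   # fetch the official CoNLL2002 corpus used by the reference model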
2、Inspect the official dataset's format
Print it out and take a look:
**First, get clear on the data types:** the corpus is a list containing several lists (each inner list is one sentence), and each inner list contains several tuples (one tuple per token); every tuple holds three string elements (the word, its POS tag, and its entity label).
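A minimal sketch that prints this nested structure (assuming nltk and the conll2002 corpus from step 1 are available):

import nltk
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
print(type(train_sents))     # <class 'list'>
print(train_sents[0])        # one sentence: a list of 3-element tuples
print(train_sents[0][0])     # e.g. ('Melbourne', 'NP', 'B-LOC') -> (word, POS tag, IOB label)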
Now look at the format of the file we actually want to run:
After readlines() it is simply a list of strings, one raw line per string.
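A quick look at that structure (the file name is taken from the splitting step below; the example line in the comment is hypothetical):

with open("POS-data_all.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()
print(type(lines))   # <class 'list'>
print(lines[0])      # each element is one raw line as a string, e.g. '1\tword\tTAG\n'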
On Python data types:
Python's three basic bracketed types: parentheses ( ), square brackets [ ], and curly braces { }
Python tuples explained in detail
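A one-glance reminder of the three container types those articles cover:

t = ('Melbourne', 'NP', 'B-LOC')   # tuple: parentheses, immutable
l = [t]                            # list: square brackets, mutable
d = {'word': 'Melbourne'}          # dict: curly braces (sets use them too)
print(type(t), type(l), type(d))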
3、Split the dataset
The data to be processed contains 40,729 sentences in total.
Split it 70/30 into a training set and a test set:
That gives roughly 28,000 training sentences and about 12,000 test sentences.
Reference code for splitting the dataset:
with open("POS-data_all.txt","r+",encoding="utf-8") as f1:
with open("esptestb.txt","w+",encoding="utf-8") as f2:
with open("esptrain.txt","w+",encoding="utf-8") as f3:
lines_all = f1.readlines()
cnt=0
print(type(lines_all))
list1=[]
list2=[]
for line in lines_all:
line_t = line.split('\t')
if line_t[0] != '\n':
cnt = int(line_t[0])
# print(line[0])
if line_t[0] == '\n':
if cnt >= 28000: # test
list2.append(line)
print(type(line))
else:
list1.append(line)
print(type(line))
elif int(line_t[0]) >= 28000:#test
list2.append(line)
print(type(line))
else:
list1.append(line)
print(type(line))
for l2 in list2:
f2.write(l2)
for l1 in list1:
f3.write(l1)
f1.close()
f2.close()
f3.close()
On file read/write operations:
Understand Python file I/O in one article
[python] Several ways to write a list to a txt file
On the split() method:
Python's split() method: split a string on a delimiter and return a list
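A tiny illustration of split('\t') on one (hypothetical) token line, which is exactly the conversion used in the next step:

line = "3\tMelbourne\tNP\tB-LOC\n"            # hypothetical token line: index, word, POS, label
fields = line.replace('\n', '').split('\t')   # drop the newline, then split on tabs
print(fields)                                 # ['3', 'Melbourne', 'NP', 'B-LOC']
print(tuple(fields))                          # ('3', 'Melbourne', 'NP', 'B-LOC')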
4、Convert the data format
Reference code:
with open("esptestb.txt", "r+", encoding="utf-8") as f1:
with open("esptrain.txt", "r+", encoding="utf-8") as f2:
# train_sents = f2.readlines()
# test_sents = f1.readlines()
# # train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
# # test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
# print(len(train_sents))
# print(len(test_sents))
# print(type(train_sents))
# test_sents
train_sents = f2.readlines()
test_sents = f1.readlines()
# tset
list1 = []
list_each = []
for line in test_sents:
if line == '\n':
list1.append(list_each)
list_each = []
else:
temp = line.replace('\n', '')
temp = temp.split('\t')
# print(temp)
# print(type(temp))
yuan = tuple(temp)
# print(yuan)
list_each.append(yuan)
# line = line.split('\t')
# list1.append(line)
# print(type(line))
# print(line)
# print(test_sents)
# print(list1)
test_sents = list1
# print(len(test_sents))
# for lie in list1:
# for str_each in lie:
# temp = str_each.replace('\n','')
# temp = temp.split('\t')
# print(temp)
# # print(type(temp))
# yuan = tuple(temp)
# print(yuan)
# train
list1 = []
list_each = []
for line in train_sents:
if line == '\n':
list1.append(list_each)
list_each = []
else:
temp = line.replace('\n', '')
temp = temp.split('\t')
# print(temp)
# print(type(temp))
yuan = tuple(temp)
# print(yuan)
list_each.append(yuan)
# line = line.split('\t')
# list1.append(line)
# print(type(line))
# print(line)
# print(test_sents)
# print(list1)
train_sents = list1
# print(len(test_sents))
# for lie in list1:
# for str_each in lie:
# temp = str_each.replace('\n','')
# temp = temp.split('\t')
# print(temp)
# # print(type(temp))
# yuan = tuple(temp)
# print(yuan)
print(len(train_sents))
print(len(test_sents))
After the conversion, the data looks like this:
It can then be fed straight into the model.
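A quick sanity check (the exact fields in each tuple depend on your file's columns, so the comment describes the expected shape rather than real output):

print(train_sents[0])
# one converted sentence: a list of tuples, one tuple per token,
# matching the nested list-of-lists-of-tuples structure seen in step 2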
5、Run the model
Swap the input data for the converted files and run the reference code; a condensed sketch of that pipeline follows below.
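For completeness, here is a condensed sketch of that code, following the layout of the official pycrfsuite CoNLL2002 tutorial (the feature template is deliberately shortened, and it assumes each token tuple has the word first and the NER label last, so adjust the indices if your columns differ):

import pycrfsuite

def word2features(sent, i):
    # a deliberately small feature set; the reference model uses a much richer one
    word = sent[i][0]
    features = ['bias', 'word.lower=' + word.lower(), 'word.isupper=%s' % word.isupper()]
    if i > 0:
        features.append('-1:word.lower=' + sent[i - 1][0].lower())
    else:
        features.append('BOS')
    if i < len(sent) - 1:
        features.append('+1:word.lower=' + sent[i + 1][0].lower())
    else:
        features.append('EOS')
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [tok[-1] for tok in sent]   # assumes the label is the last field

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
trainer.set_params({'c1': 1.0, 'c2': 1e-3, 'max_iterations': 50,
                    'feature.possible_transitions': True})
trainer.train('my-ner.crfsuite')        # model file name is arbitrary

tagger = pycrfsuite.Tagger()
tagger.open('my-ner.crfsuite')
y_pred = [tagger.tag(xseq) for xseq in X_test]
print(y_pred[0])                        # predicted labels for the first test sentence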
Final results: