In the previous post, the dataset was split 3:7 in its original order, which makes the resulting scores less reliable. This post instead uses sklearn's KFold to implement cross-validation, giving a more accurate estimate.
Previous post ------> Processing data formats in Python and running the model (pycrfsuite) — validating data effectiveness
1. An introduction to cross-validation
The cross-validation section of the sklearn documentation: 3.1. Cross-validation: evaluating estimator performance
If you would rather not read the English sklearn docs, there is also a Chinese edition of the official documentation: scikit-learn (sklearn) official documentation, Chinese edition
My understanding: informally, cross-validation splits the dataset proportionally into n parts (n-fold cross-validation means n parts). Each part takes a turn as the test set while the remaining parts form the training set; a model is trained on each split, and the scores are averaged at the end, which makes the result more reliable.
Reference blog: Cross-Validation in sklearn
Python study notes — sklearn dataset splitting: random splits, K-fold cross-validation, StratifiedKFold, and StratifiedShuffleSplit
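To make the averaging idea concrete, here is a small pure-Python sketch of n-fold scoring. The round-robin split and the toy scoring function are stand-ins for illustration only, not sklearn's implementation:

```python
def kfold_scores(data, n, score_fn):
    """Split data into n parts; each part is the test set exactly once."""
    folds = [data[i::n] for i in range(n)]  # round-robin split into n folds
    scores = []
    for i, test in enumerate(folds):
        # everything outside fold i forms the training set
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(score_fn(train, test))
    return sum(scores) / n  # average the per-fold scores

# Toy "score": the training-set size, just to show the mechanics.
avg = kfold_scores(list(range(10)), 5, lambda tr, te: len(tr))
print(avg)  # each fold trains on 8 of the 10 items, so the average is 8.0
```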
2. The KFold method
For an introduction to KFold, see this reference blog: sklearn.KFold usage examples
3. The processing steps
The key cross-validation code used in this post:
# Split the dataset into five folds, shuffling before assignment (shuffle=True),
# with the same shuffle order on every run (random_state=1)
kf = KFold(5, shuffle=True, random_state=1)
for a in kf.split(data_all):
    train_sents = []  # training set for this fold
    test_sents = []   # test set for this fold
    a_train_index = a[0]
    a_test_index = a[1]
    for index in a_train_index:
        train_sents.append(data_all[index])
    for index in a_test_index:
        test_sents.append(data_all[index])
    # print(a_train_index)
    # print(a_test_index)
    # print(train_sents[1:50])
    # print(test_sents[1:50])
    # print(len(train_sents))  # 32584
    # print(len(test_sents))   # 8146
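The loop above can be seen in miniature on toy data: `kf.split` yields, for each fold, a pair of index arrays (training indices, test indices), which then have to be mapped back through `data_all`. The strings below are made-up stand-ins for sentences:

```python
from sklearn.model_selection import KFold

data = ['s0', 's1', 's2', 's3', 's4', 's5']  # stand-ins for sentences
kf = KFold(3, shuffle=True, random_state=1)
for train_idx, test_idx in kf.split(data):
    # train_idx and test_idx are index arrays, not the sentences themselves
    test_fold = [data[i] for i in test_idx]  # map indices back to items
    print(train_idx, test_idx, test_fold)
```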
The format of data_all is as follows:
a list containing several nested lists (each nested list is one sentence), and each sentence list holds several tuples (each tuple represents one word together with its number and label).
Regarding the parameters of kf = KFold(5, shuffle=True, random_state=1), see the reference blog: sklearn.KFold usage examples
or the official documentation: 3.1. Cross-validation: evaluating estimator performance
One thing to note: what gets shuffled is the index, not the data itself!
That is why the code above uses two loops with data_all[index] to map the shuffled indices back to the actual text, appending the sentences to the training-set and test-set lists before running the model.
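The same index-to-sentence mapping can also be written more compactly with list comprehensions. This toy snippet (the sentence data is made up for illustration) is behaviorally equivalent to the two loops above:

```python
# Toy stand-in for data_all: three one-token "sentences"
data_all = [[('1', 'w1', 'L1')], [('2', 'w2', 'L2')], [('3', 'w3', 'L3')]]
a_train_index, a_test_index = [0, 2], [1]

# One comprehension per split replaces each explicit append loop
train_sents = [data_all[i] for i in a_train_index]
test_sents = [data_all[i] for i in a_test_index]
print(len(train_sents), len(test_sents))  # 2 1
```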
The complete code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from array import array
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn import preprocessing
from sklearn.model_selection import KFold
from itertools import chain
import nltk,pycrfsuite
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.extend([
            '-1:word.lower=%s' % word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=%s' % postag1,
            '-1:postag[:2]=%s' % postag1[:2],
        ])
    else:
        features.append('BOS')
    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.extend([
            '+1:word.lower=%s' % word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=%s' % postag1,
            '+1:postag[:2]=%s' % postag1[:2],
        ])
    else:
        features.append('EOS')
    return features

# Quick check of the feature extraction:
# sent = train_sents[0]
# print(len(sent))
# for i in range(len(sent)):
#     print(word2features(sent, i))
#     print("======================================")

# Convert a whole sentence to features
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Extract the labels, i.e. the classes
def sent2labels(sent):
    return [label for token, postag, label in sent]

# Extract the tokens
def sent2tokens(sent):
    return [token for token, postag, label in sent]
# Load and restructure the data
with open("POS-data_all.txt", "r+", encoding="utf-8") as f1:
    data_all = f1.readlines()
list_data = []
list_target = []
# Group the lines into sentences: a blank line ends the current sentence
list1 = []
list_each = []
for line in data_all:
    if line == '\n':
        list1.append(list_each)
        list_each = []
    else:
        temp = line.replace('\n', '')
        temp = temp.split('\t')
        yuan = tuple(temp)  # one token as a tuple
        list_each.append(yuan)
data_all = list1
print(type(data_all))  # list
# print(data_all[1:50])
# Earlier attempt: extract the Hindi words and labels into flat arrays
# for list_sentences in data_all:
#     for tuple_words in list_sentences:
#         list_data.append(tuple_words[1])
#         list_target.append(tuple_words[2])
# X = np.array(list_data)
# y = np.array(list_target)
# print(X.shape)
# print(y.shape)
# Cross-validated split of the dataset:
# five folds, shuffled before assignment (shuffle=True),
# with the same shuffle order on every run (random_state=1)
kf = KFold(5, shuffle=True, random_state=1)
for a in kf.split(data_all):
    train_sents = []  # training set for this fold
    test_sents = []   # test set for this fold
    a_train_index = a[0]
    a_test_index = a[1]
    for index in a_train_index:
        train_sents.append(data_all[index])
    for index in a_test_index:
        test_sents.append(data_all[index])
    # print(a_train_index)
    # print(a_test_index)
    # print(train_sents[1:50])
    # print(test_sents[1:50])
    # print(len(train_sents))  # 32584
    # print(len(test_sents))   # 8146

    # Train the model and report results for this fold
    # After the feature conversion, one feature vector can be inspected with:
    # print(sent2features(train_sents[0])[0])

    # Build the feature/label sets for training and testing
    X_train = [sent2features(s) for s in train_sents]
    Y_train = [sent2labels(s) for s in train_sents]
    X_test = [sent2features(s) for s in test_sents]
    Y_test = [sent2labels(s) for s in test_sents]
    print(len(Y_test))
    print(type(Y_test))

    # Model training
    # 1) create a pycrfsuite.Trainer
    trainer = pycrfsuite.Trainer(verbose=False)
    # load the training features and their labels
    for xseq, yseq in zip(X_train, Y_train):
        trainer.append(xseq, yseq)
    # 2) set the training parameters: the L-BFGS algorithm (the default)
    #    with Elastic Net regularization
    trainer.set_params({
        'c1': 1.0,             # coefficient for L1 penalty
        'c2': 1e-3,            # coefficient for L2 penalty
        'max_iterations': 50,  # stop earlier
        # include transitions that are possible, but not observed
        'feature.possible_transitions': True
    })
    print(trainer.params())
    # 3) train; the model is saved as conll2002-esp.crfsuite
    #    (overwritten on each fold)
    trainer.train('conll2002-esp.crfsuite')

    # Load the trained model into a tagger for testing
    tagger = pycrfsuite.Tagger()
    tagger.open('conll2002-esp.crfsuite')
    example_sent = test_sents[0]
    # inspect this sentence
    # print(sent2tokens(example_sent))
    # print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
    # print("Correct:  ", ' '.join(sent2labels(example_sent)))

    # Report the model's performance on the test set
    def bio_classification_report(y_true, y_pred):
        lb = LabelBinarizer()
        y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
        y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        tagset = set(lb.classes_) - {'O'}
        tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
        class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
        return classification_report(
            y_true_combined,
            y_pred_combined,
            labels=[class_indices[cls] for cls in tagset],
            target_names=tagset,
        )

    # Tag every test sentence
    Y_pred = [tagger.tag(xseq) for xseq in X_test]
    # print the evaluation report for this fold
    print(bio_classification_report(Y_test, Y_pred))
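Note that the loop above prints one report per fold but never computes the averaged score promised in section 1. To get a single number, collect one score per fold inside the loop and average after it. A minimal sketch, assuming token-level accuracy as the per-fold score (the hypothetical fold_accuracy helper is not part of the code above, which only prints classification_report):

```python
from itertools import chain

fold_scores = []  # append one number per fold inside the cross-validation loop

def fold_accuracy(y_true, y_pred):
    """Token-level accuracy over one fold (flattens the per-sentence label lists)."""
    t = list(chain.from_iterable(y_true))
    p = list(chain.from_iterable(y_pred))
    return sum(a == b for a, b in zip(t, p)) / len(t)

# Inside the loop you would call: fold_scores.append(fold_accuracy(Y_test, Y_pred))
# Toy illustration with two fake folds:
fold_scores.append(fold_accuracy([['O', 'B']], [['O', 'B']]))  # 1.0
fold_scores.append(fold_accuracy([['O', 'B']], [['O', 'O']]))  # 0.5
print(sum(fold_scores) / len(fold_scores))  # 0.75
```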
PS: the format of the POS-data_all file is described in the previous post: Processing data formats in Python and running the model (pycrfsuite) — validating data effectiveness