【Python】Using KFold from sklearn to apply cross-validation to a model

In the previous article the dataset was split 3:7 in its original order. A split like that makes the resulting scores comparatively unreliable, so this article uses the KFold class from sklearn to implement cross-validation and obtain more accurate results.

Previous article ------> Python处理数据格式后跑模型(pycrfsuite)—验证数据有效性 (formatting the data in Python and running the pycrfsuite model to validate the data)

1. An introduction to cross-validation

The cross-validation section of the official sklearn documentation: 3.1. Cross-validation: evaluating estimator performance

If you don't feel like reading the English sklearn docs, there is also a Chinese version of the official documentation: scikit-learn (sklearn) 官方文档中文版

My understanding: put simply, cross-validation splits the dataset proportionally into n parts (n-fold cross-validation means n parts). Each part takes a turn as the test set while the remaining parts form the training set; the model is run once per fold, and the scores are averaged at the end so that the result is more reliable.
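For concreteness, here is a minimal sketch using sklearn's cross_val_score on a toy classifier (the iris dataset and SVC estimator are purely illustrative and not part of this article's task):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=5)  # 5-fold cross-validation
print(scores)         # one score per fold
print(scores.mean())  # the averaged score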

Reference blogs: sklearn中的交叉验证(Cross-Validation)
【Python学习】 - sklearn学习 - 数据集分割方法 - 随机划分与K折交叉划分与StratifiedKFold与StratifiedShuffleSplit

2. The KFold method

For an introduction to the KFold method, see this reference blog: sklearn.KFold用法示例
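Before the main code, a minimal sketch of KFold itself on toy data (the toy array is only for illustration):

import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # ten toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=1)
for train_index, test_index in kf.split(data):
    # each iteration yields index arrays, not the data itself
    print(train_index, test_index)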

3. Processing steps

The core cross-validation code used here:

# Split the dataset into five folds, shuffling it before the split (shuffle=True)
# with the same shuffle order on every run (random_state=1)
kf = KFold(5, shuffle=True, random_state=1)

for train_index, test_index in kf.split(data_all):
    train_sents = []  # training set for this fold
    test_sents = []  # test set for this fold
    for index in train_index:
        train_sents.append(data_all[index])
    for index in test_index:
        test_sents.append(data_all[index])
    # print(train_index)
    # print(test_index)
    # print(train_sents[1:50])
    # print(test_sents[1:50])
    # print(len(train_sents))  # 32584
    # print(len(test_sents))  # 8146

The format of data_all is as follows:
It is a list of nested lists (each inner list is one sentence), where each inner list holds a number of tuples (each tuple represents one word together with its tag and label).
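Judging from sent2labels in the full code below, each tuple is (token, postag, label). A hypothetical miniature data_all for illustration only (the Hindi tokens, POS tags and BIO labels here are made up; the real tag set may differ):

data_all = [
    [('राम', 'NNP', 'B-PER'), ('आया', 'VM', 'O')],  # sentence 1
    [('घर', 'NN', 'B-LOC'), ('में', 'PSP', 'O')],   # sentence 2
]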

For the parameters in kf = KFold(5, shuffle=True, random_state=1), see the reference blog sklearn.KFold用法示例 or the official documentation: 3.1. Cross-validation: evaluating estimator performance

One thing to note: what gets shuffled are the index arrays, not the data itself!!
That is why the code above uses two loops with data_all[index] to map the shuffled indices back to the actual sentences, which are then appended into the training-set and test-set lists before running the model. An equivalent, more compact form is sketched below.
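Since kf.split only yields index arrays, the two reconstruction loops can equivalently be written as list comprehensions; a sketch:

for train_index, test_index in kf.split(data_all):
    train_sents = [data_all[i] for i in train_index]
    test_sents = [data_all[i] for i in test_index]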

The complete code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from array import array
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn import preprocessing
from sklearn.model_selection import KFold
from itertools import chain
import nltk,pycrfsuite
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.extend([
            '-1:word.lower=%s' % word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=%s' % postag1,
            '-1:postag[:2]=%s' % postag1[:2],
        ])
    else:
        features.append('BOS')

    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.extend([
            '+1:word.lower=%s' % word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=%s' % postag1,
            '+1:postag[:2]=%s' % postag1[:2],
        ])
    else:
        features.append('EOS')

    return features


# quick check of the features (train_sents is only defined inside the fold loop
# further down, so this stays commented out here)
# sent = train_sents[0]
# print(len(sent))
# for i in range(len(sent)):
#     print(word2features(sent, i))
#     print("======================================")

# convert a whole sentence into per-token feature lists
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


# extract the classes, i.e. the labels
def sent2labels(sent):
    return [label for token, postag, label in sent]


# extract the tokens
def sent2tokens(sent):
    return [token for token, postag, label in sent]


# main processing
with open("POS-data_all.txt", "r+", encoding="utf-8") as f1:
    data_all = f1.readlines()
    list_data = []
    list_target = []

    # all
    list1 = []
    list_each = []
    for line in data_all:
        if line == '\n':
            list1.append(list_each)
            list_each = []
        else:
            temp = line.replace('\n', '')
            temp = temp.split('\t')
            # print(temp)
            # print(type(temp))
            yuan = tuple(temp)
            # print(yuan)
            list_each.append(yuan)
        # line = line.split('\t')
        # list1.append(line)
        # print(type(line))
        # print(line)

    # print(test_sents)
    # print(list1)
    if list_each:  # keep the last sentence in case the file does not end with a blank line
        list1.append(list_each)
    data_all = list1
    # print(len(test_sents))
    # for lie in list1:
    #     for str_each in lie:
    #         temp = str_each.replace('\n','')
    #         temp = temp.split('\t')
    #         print(temp)
    #         # print(type(temp))
    #         yuan = tuple(temp)
    #         print(yuan)
    print(type(data_all))  # list
    #     print(data_all[1:50])

    #     # extract the Hindi words and their labels
    #     for list_sentences in data_all:
    #         for tuple_words in list_sentences:
    #             list_data.append(tuple_words[1])
    #             list_target.append(tuple_words[2])
    #             # print(tuple_words[1]+" "+tuple_words[2])
    #         # print("===============================")
    #     # print(list_data)
    #     # print(list_target)
    #     X = np.array(list_data)
    #     y = np.array(list_target)
    #     print(type(X))
    #     print(type(y))
    #     print(X.shape)
    #     print(y.shape)
    #     print(X)
    #     print(y)
    #     print(data_all[0])

    # k-fold split of the dataset
    # Split the dataset into five folds, shuffling it before the split (shuffle=True)
    # with the same shuffle order on every run (random_state=1)
    kf = KFold(5, shuffle=True, random_state=1)

    for train_index, test_index in kf.split(data_all):
        train_sents = []  # training set for this fold
        test_sents = []  # test set for this fold
        for index in train_index:
            train_sents.append(data_all[index])
        for index in test_index:
            test_sents.append(data_all[index])
        # print(train_index)
        # print(test_index)
        # print(train_sents[1:50])
        # print(test_sents[1:50])
        # print(len(train_sents))  # 32584
        # print(len(test_sents))  # 8146

        # train the model and report results for this fold

        # with the feature conversion defined above, a single feature list can be inspected:
        # print(sent2features(train_sents[0])[0])

        # build the feature training and test sets
        X_train = [sent2features(s) for s in train_sents]
        Y_train = [sent2labels(s) for s in train_sents]
        # print(len(X_train))
        # print(len(Y_train))
        X_test = [sent2features(s) for s in test_sents]
        Y_test = [sent2labels(s) for s in test_sents]
        # print(len(X_test))
        # print(X_train[0])
        # print(Y_train[0])
        print(len(Y_test))
        print(type(Y_test))

        # model training
        # 1) create a pycrfsuite.Trainer
        trainer = pycrfsuite.Trainer(verbose=False)
        # load the training features and their labels
        for xseq, yseq in zip(X_train, Y_train):
            trainer.append(xseq, yseq)

        # 2) set the training parameters: the (default) L-BFGS algorithm with Elastic Net regularization
        trainer.set_params({
            'c1': 1.0,  # coefficient for L1 penalty
            'c2': 1e-3,  # coefficient for L2 penalty
            'max_iterations': 50,  # stop earlier
            # include transitions that are possible, but not observed
            'feature.possible_transitions': True
        })
        print(trainer.params())

        # 3) start training
        # the trained model is written to a file named conll2002-esp.crfsuite
        trainer.train('conll2002-esp.crfsuite')

        # use the trained model to create a tagger for testing
        tagger = pycrfsuite.Tagger()
        tagger.open('conll2002-esp.crfsuite')
        example_sent = test_sents[0]


        # inspect the content of this example sentence
        # print(type(sent2tokens(example_sent)))
        # print(sent2tokens(example_sent))
        # print(''.join(sent2tokens(example_sent)))
        # print('\n\n')
        # print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
        # print("Predicted:", ' '.join(tagger.tag(X_test[0])))
        # print("Correct: ", ' '.join(sent2labels(example_sent)))

        # report the model's performance on the test set
        def bio_classification_report(y_true, y_pred):
            lb = LabelBinarizer()
            y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
            y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))

            tagset = set(lb.classes_) - {'O'}
            tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
            class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}

            return classification_report(
                y_true_combined,
                y_pred_combined,
                labels=[class_indices[cls] for cls in tagset],
                target_names=tagset,
            )


        # tag every sentence in the test set
        Y_pred = [tagger.tag(xseq) for xseq in X_test]
        print(type(Y_pred))
        print(type(Y_test))
        # print the evaluation report
        print(bio_classification_report(Y_test, Y_pred))
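The introduction said the per-fold scores should be averaged, but the code above only prints a report for each fold. Here is a minimal sketch of one way to add that, using a simple token-level accuracy as the score (token_accuracy and fold_accuracies are hypothetical names, not part of the original code):

def token_accuracy(y_true, y_pred):
    # token-level accuracy over two lists of label sequences
    flat_true = [t for seq in y_true for t in seq]
    flat_pred = [t for seq in y_pred for t in seq]
    return sum(t == p for t, p in zip(flat_true, flat_pred)) / len(flat_true)

# initialize fold_accuracies = [] before the kf.split loop; then, at the end of
# each fold's body:
#     fold_accuracies.append(token_accuracy(Y_test, Y_pred))
# and after the loop:
#     print('mean accuracy over 5 folds:', sum(fold_accuracies) / len(fold_accuracies))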

PS: the format of the POS_data_all file is shown in the previous blog: Python处理数据格式后跑模型(pycrfsuite)—验证数据有效性
