Version: scikit-learn 0.23.2
1. Dataset Splitting
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, KFold
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=None, test_size=None, random_state=None, stratify=None, shuffle=True)
'''Parameters:
x: independent variables of the dataset, usually a 2-D array (list) or matrix (numpy)
y: dependent variable of the dataset, a 1-D array
train_size: size of the training set; float (proportion), int (number of samples), or None
            (if None, it is set to the complement of test_size; if both are None, test_size defaults to 0.25)
test_size: size of the test set; same types as train_size
random_state: random seed, int, default None
stratify: stratified sampling for classification problems; default None (no stratification).
          Pass an array (usually the dependent-variable column) to stratify the split by its classes
shuffle: whether to shuffle the data before splitting, default True; must stay True whenever stratify is not None
Returns: training-set features, test-set features, training-set labels, test-set labels (numpy arrays)'''
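A minimal runnable sketch of a stratified split (the toy data and the 0.3 test proportion are illustrative assumptions, not values from the original):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, imbalanced binary labels (seven 0s, three 1s)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# stratify=y keeps the 7:3 class ratio in both splits;
# random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)                # (7, 2) (3, 2)
print(np.bincount(y_train), np.bincount(y_test))  # [5 2] [2 1]: class ratio preserved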
2. Feature Generation
Based on term frequency or TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
def tfidf_embedding(self, content_cutted):
    '''
    Args:
        content_cutted: list, documents whose words were cut by jieba and joined with spaces
    Returns:
        <class 'scipy.sparse.csr.csr_matrix'>, shape (n_samples, n_features)
    '''
    # Alternative: raw term counts via CountVectorizer, then TfidfTransformer
    # tf_vectorizer = CountVectorizer(strip_accents='unicode',
    #                                 max_features=None,
    #                                 stop_words=self.stopwords_lis,  # or 'english'
    #                                 max_df=0.5,
    #                                 min_df=2)
    # tf = tf_vectorizer.fit_transform(content_cutted)
    # tfidf_transformer = TfidfTransformer()
    # tfidf_matrix = tfidf_transformer.fit(tf).transform(tf)
    # Generate TF-IDF vectors directly
    tfidf_vectorizer = TfidfVectorizer(max_df=0.5,
                                       max_features=None,
                                       min_df=1,
                                       stop_words=self.stopwords_lis,
                                       use_idf=True,
                                       ngram_range=(1, 1))
    tfidf_matrix = tfidf_vectorizer.fit_transform(content_cutted)
    # Vocabulary kept by the TF-IDF vectorizer; not identical to jieba's token list
    feature_names = tfidf_vectorizer.get_feature_names()  # renamed to .get_feature_names_out() in scikit-learn 1.0+
    return tfidf_matrix
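A hedged usage sketch of the same TF-IDF pipeline outside the class: the two sample sentences and the relaxed max_df/min_df bounds are assumptions for this tiny two-document corpus, not values from the original.

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["我爱自然语言处理", "机器学习和自然语言处理"]
# Cut each document with jieba and join the tokens with spaces,
# the input format tfidf_embedding expects
content_cutted = [" ".join(jieba.cut(d)) for d in docs]

vectorizer = TfidfVectorizer(max_df=1.0, min_df=1)  # df bounds relaxed so a 2-doc corpus keeps its terms
tfidf_matrix = vectorizer.fit_transform(content_cutted)
print(tfidf_matrix.shape)              # (2, n_features) sparse csr_matrix
print(vectorizer.get_feature_names())  # the learned vocabulary; .get_feature_names_out() in 1.0+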