Version: scikit-learn 0.23.2
1. Dataset Splitting
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, KFold
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=None, test_size=None, random_state=None, stratify=None, shuffle=True)
'''Parameters:
x: independent variables of the dataset, usually a 2-D array (list) or matrix (numpy)
y: dependent variable of the dataset, a 1-D array
train_size: size of the training set; float (proportion), int (number of samples), or None
            (if None, it is set to the complement of test_size; if both are None, test_size defaults to 0.25)
test_size: size of the test set; same types as train_size
random_state: random seed, int, default None
stratify: stratified sampling for classification problems; default None (no stratification).
          Pass an array (usually the dependent-variable column) to stratify the split by its classes
shuffle: whether to shuffle the data before splitting, default True; must stay True whenever stratify is not None
Returns: training-set features, test-set features, training-set labels, test-set labels (numpy arrays)'''
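A minimal runnable sketch of a stratified split (the toy data and the 0.3 test proportion are illustrative assumptions, not values from the original):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, imbalanced binary labels (seven 0s, three 1s)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# stratify=y keeps the 7:3 class ratio in both splits;
# random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)                # (7, 2) (3, 2)
print(np.bincount(y_train), np.bincount(y_test))  # [5 2] [2 1]: class ratio preserved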
2. Feature Generation
Based on term frequency or TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
def tfidf_embedding(self, content_cutted):
    '''
    Args:
        content_cutted: list, documents whose words were cut by jieba and joined with spaces
    Returns:
        <class 'scipy.sparse.csr.csr_matrix'>, shape (n_samples, n_features)
    '''
    # Alternative: raw term counts via CountVectorizer, then TfidfTransformer
    # tf_vectorizer = CountVectorizer(strip_accents='unicode',
    #                                 max_features=None,
    #                                 stop_words=self.stopwords_lis,  # or 'english'
    #                                 max_df=0.5,
    #                                 min_df=2)
    # tf = tf_vectorizer.fit_transform(content_cutted)
    # tfidf_transformer = TfidfTransformer()
    # tfidf_matrix = tfidf_transformer.fit(tf).transform(tf)
    # Generate TF-IDF vectors directly
    tfidf_vectorizer = TfidfVectorizer(max_df=0.5,
                                       max_features=None,
                                       min_df=1,
                                       stop_words=self.stopwords_lis,
                                       use_idf=True,
                                       ngram_range=(1, 1))
    tfidf_matrix = tfidf_vectorizer.fit_transform(content_cutted)
    # Vocabulary kept by the TF-IDF vectorizer; not identical to jieba's token list
    feature_names = tfidf_vectorizer.get_feature_names()  # renamed to .get_feature_names_out() in scikit-learn 1.0+
    return tfidf_matrix
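A hedged usage sketch of the same TF-IDF pipeline outside the class: the two sample sentences and the relaxed max_df/min_df bounds are assumptions for this tiny two-document corpus, not values from the original.

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["我爱自然语言处理", "机器学习和自然语言处理"]
# Cut each document with jieba and join the tokens with spaces,
# the input format tfidf_embedding expects
content_cutted = [" ".join(jieba.cut(d)) for d in docs]

vectorizer = TfidfVectorizer(max_df=1.0, min_df=1)  # df bounds relaxed so a 2-doc corpus keeps its terms
tfidf_matrix = vectorizer.fit_transform(content_cutted)
print(tfidf_matrix.shape)              # (2, n_features) sparse csr_matrix
print(vectorizer.get_feature_names())  # the learned vocabulary; .get_feature_names_out() in 1.0+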