Python third-party modules for machine learning: Scikit-Learn overview, base classes, datasets, and exceptions

Latest recommended article published on 2024-03-04 20:55:58

EdVzAs


Tags: python, machine learning, scikit-learn, dataset

Article link: https://blog.csdn.net/weixin_46131409/article/details/109555328


Official English documentation: https://scikit-learn.org/stable/
Chinese documentation: https://scikit-learn.org.cn/ and https://www.cntofu.com/book/170/index.html

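I. Overview
1. Introduction:

Scikit-Learn (sklearn for short) is an open-source Python machine learning library built on NumPy, SciPy, and Matplotlib.
It covers every stage from data preprocessing to model training and includes nearly all mainstream machine learning algorithms.
Its components are highly reusable, which helps programmers develop efficiently.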

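2. Features:

Submodule list: https://blog.csdn.net/newmarui/article/details/52094383

① Classification: identifying which category an object belongs to
  Common algorithms: SVM (support vector machines), nearest neighbors, random forest
  Common applications: spam detection, image recognition
② Regression: predicting a continuous-valued attribute associated with an object
  Common algorithms: SVR (support vector regression), ridge regression, Lasso
  Common applications: drug response, stock price prediction
③ Clustering: automatically grouping similar objects
  Common algorithms: k-Means, spectral clustering, mean-shift
  Common applications: customer segmentation, grouping experiment results
④ Dimensionality reduction: reducing the number of random variables to consider
  Common algorithms: PCA (principal component analysis), feature selection, non-negative matrix factorization
  Common applications: visualization, improved efficiency
⑤ Model selection: comparing, validating, and choosing parameters and models, with the goal of improving accuracy through parameter tuning
  Common modules: grid search, cross validation, metrics
⑥ Preprocessing: feature extraction and normalization
  Common modules: preprocessing, feature extraction
  Common applications: transforming input data (such as text) into a form that machine learning algorithms can use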
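
As a rough illustration of how several of the areas listed above fit together, the sketch below chains a preprocessing step and a classifier and evaluates them with cross validation. The choice of the iris dataset, `StandardScaler`, `SVC`, and 5 folds is an assumption made only for this example; any other combination would serve equally well.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in classification dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Preprocessing (standardization) followed by classification (SVM).
model = make_pipeline(StandardScaler(), SVC())

# Model selection: 5-fold cross validation returns one accuracy per fold.
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean())
```

Wrapping the scaler and the classifier in one pipeline means the scaler is refitted on each training fold, so no information from the held-out fold leaks into preprocessing.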

3. Installation:

pip install scikit-learn
  #the package is installed under the name scikit-learn and imported as sklearn

II. Choosing an algorithm

See: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

III. Base classes and utility functions
1. Introduction:

This module (sklearn.base) provides the base classes for all estimators

2. Base classes:

Base class for all estimators in scikit-learn:sklearn.base.BaseEstimator()
Mixin class for all bicluster estimators in scikit-learn:sklearn.base.BiclusterMixin
Mixin class for all classifiers in scikit-learn:sklearn.base.ClassifierMixin
Mixin class for all cluster estimators in scikit-learn:sklearn.base.ClusterMixin
Mixin class for all density estimators in scikit-learn:sklearn.base.DensityMixin
Mixin class for all regression estimators in scikit-learn:sklearn.base.RegressorMixin
Mixin class for all transformers in scikit-learn:sklearn.base.TransformerMixin
Transformer mixin that performs feature selection given a support mask:sklearn.feature_selection.SelectorMixin
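
A minimal sketch of how these base classes are typically combined in a custom estimator. `ThresholdClassifier` and its `threshold` parameter are hypothetical names invented for this example; only `BaseEstimator` and `ClassifierMixin` come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predicts 1 when the first feature exceeds a threshold."""

    def __init__(self, threshold=0.5):
        # BaseEstimator discovers parameters from the __init__ signature,
        # so each argument is stored unchanged under the same name.
        self.threshold = threshold

    def fit(self, X, y):
        # Nothing is learned here; we only record the class labels seen.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        return (X[:, 0] > self.threshold).astype(int)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 1, 1])

clf = ThresholdClassifier().fit(X, y)
params = clf.get_params()            # provided by BaseEstimator
accuracy = clf.score(X, y)           # provided by ClassifierMixin (mean accuracy)

clf.set_params(threshold=1.5)        # also provided by BaseEstimator
accuracy_shifted = clf.score(X, y)
```

Inheriting from `BaseEstimator` is what supplies `get_params()` and `set_params()`, while `ClassifierMixin` contributes the default `score()` method.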

3. Utility functions:

Constructs a new unfitted estimator with the same parameters:sklearn.base.clone(<estimator>[,safe=True])
Return True if the given estimator is (probably) a classifier:[<out>=]sklearn.base.is_classifier(<estimator>)
Return True if the given estimator is (probably) a regressor:[<out>=]sklearn.base.is_regressor(<estimator>)
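
The three helpers above can be tried out as follows. This is only a sketch: the estimators (`LogisticRegression`, `Ridge`) and the tiny training set are arbitrary choices for illustration.

```python
from sklearn.base import clone, is_classifier, is_regressor
from sklearn.linear_model import LogisticRegression, Ridge

# Fit an estimator so that it carries learned attributes such as coef_.
original = LogisticRegression(C=0.1)
original.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

# clone() copies the parameters but not the fitted state.
fresh = clone(original)
same_params = fresh.get_params()["C"] == 0.1
fresh_is_fitted = hasattr(fresh, "coef_")

print(is_classifier(original), is_regressor(original))
print(is_classifier(Ridge()), is_regressor(Ridge()))
```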

######################################################################################################################

Context manager for global scikit-learn configuration:sklearn.config_context([**new_config])
Retrieve current values for configuration set by set_config:[<config>=]sklearn.get_config()
Set global scikit-learn configuration:sklearn.set_config([assume_finite=None,working_memory=None,print_changed_only=None,display=None])
Print useful debugging information:sklearn.show_versions()
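
A short sketch of the configuration helpers, assuming the `assume_finite` option is the one being toggled; the variable names are illustrative only.

```python
import sklearn

# Read the current global value.
before = sklearn.get_config()["assume_finite"]

# Override it temporarily; the previous value is restored on exit.
with sklearn.config_context(assume_finite=True):
    inside = sklearn.get_config()["assume_finite"]

after = sklearn.get_config()["assume_finite"]
print(before, inside, after)
```

`config_context` is usually preferable to `set_config` when a setting should apply to a single block of code, because the change cannot leak into the rest of the program.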

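IV. Datasets

Official documentation: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

1. Overview:

This module (sklearn.datasets) is used to obtain built-in or custom datasets

2. Loaders
(1) Introduction:

Loaders are mainly used to load the built-in datasets

(2) Utility functions:

Delete the cache:sklearn.datasets.clear_data_home([data_home=None])
  #Parameters:
    data_home: path of the cache directory; str or None (None means "~/scikit_learn_data")
      #the directory and everything in it are deleted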

######################################################################################################################

Dump a dataset in svmlight/libsvm format:sklearn.datasets.dump_svmlight_file(<X>,<y>,<f>[,zero_based=True,comment=None,query_id=None,multilabel=False])
  #Parameters:
    X: the samples; array-like or sparse matrix of shape (n_samples, n_features)
    y: the true labels of the samples; array-like or sparse matrix of shape (n_samples,) or (n_samples, n_labels)
    f: where to write the dataset; str (a path) or a file-like object opened in binary mode
    zero_based: if True, column indices are written zero-based
                if False, column indices are written one-based
                "auto" (detect automatically) applies only when loading, see load_svmlight_file()
    comment: comment to insert at the top of the file; Unicode str or ASCII bytes
    query_id: array containing pairwise preference constraints (qid in svmlight format); array-like of shape (n_samples,)
    multilabel: whether this is a multilabel classification dataset; bool

######################################################################################################################

Return the path of the scikit-learn data directory:sklearn.datasets.get_data_home([data_home=None])
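
A small sketch of the cache helpers, pointed at a throwaway directory so that the real cache is not touched. The `sk_cache` folder name is an assumption made for this example.

```python
import os
import tempfile

from sklearn.datasets import clear_data_home, get_data_home

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sk_cache")  # hypothetical cache location

    # get_data_home() creates the directory if it does not exist yet.
    path = get_data_home(data_home=target)
    created = os.path.isdir(path)

    # Remove the cached data again.
    clear_data_home(data_home=target)

print(created)
```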

######################################################################################################################

Load text files whose categories are given by the subfolder names:[<data>=]sklearn.datasets.load_files(<container_path>[,description=None,categories=None,load_content=True,shuffle=True,encoding=None,decode_error='strict',random_state=0])
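
The folder layout expected by `load_files` can be sketched as follows. The category names (`ham`, `spam`), file names, and texts are made up for this example and written to a temporary directory.

```python
import os
import tempfile

from sklearn.datasets import load_files

samples = [("ham", "see you at lunch"), ("spam", "win a free prize")]

with tempfile.TemporaryDirectory() as root:
    # One subfolder per category, one text file per sample.
    for category, text in samples:
        os.makedirs(os.path.join(root, category))
        with open(os.path.join(root, category, "msg.txt"), "w", encoding="utf-8") as fh:
            fh.write(text)

    # encoding="utf-8" decodes the file contents to str instead of bytes.
    bunch = load_files(root, encoding="utf-8", shuffle=False)

print(bunch.target_names)
print(bunch.data, list(bunch.target))
```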

######################################################################################################################

Load a dataset file in svmlight/libsvm format as a CSR sparse matrix:[<X>,<y>,<query_id>=]sklearn.datasets.load_svmlight_file(<f>[,n_features=None,dtype=<class 'numpy.float64'>,multilabel=False,zero_based='auto',query_id=False,offset=0,length=-1])
  #Parameters (the remaining ones are the same as for sklearn.datasets.dump_svmlight_file()):
    f: the file to load; str (a path), file-like object, or int (a file descriptor)
    n_features: number of features to use; int
    dtype: data type of the loaded data; NumPy data type
    query_id: whether to return the query IDs; bool
    offset: ignore the first offset bytes by seeking forward, then discard the following bytes up to the next newline character; int
    length: if strictly positive, stop reading new lines of data once the position in the file has reached the (offset + length) byte threshold; int
    X: returned samples; sparse matrix of shape (n_samples, n_features)
    y: returned labels; ndarray of shape (n_samples,), or a list of tuples of length n_samples
    query_id: returned query IDs; array of shape (n_samples,)

######################################################################################################################

Load datasets from multiple files in SVMlight format:[<Xy>=]sklearn.datasets.load_svmlight_files(<files>[,n_features=None,dtype=<class 'numpy.float64'>,multilabel=False,zero_based='auto',query_id=False,offset=0,length=-1])
  #Parameters (the remaining ones are the same as for sklearn.datasets.load_svmlight_file()):
    files: the files to load; array-like of str, file-like objects, or int
    Xy: returned datasets; [<X1>,<y1>[,<q1>],...<Xn>,<yn>[,<qn>]]
      #<qi> is returned only when query_id=True
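
The dump and load functions above can be checked with a round trip through an in-memory binary buffer. This is a sketch with a hand-made 2 × 3 matrix; the values are arbitrary.

```python
import io

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0]])
y = np.array([1, 0])

# Write the dataset to a binary file-like object instead of a file on disk.
buffer = io.BytesIO()
dump_svmlight_file(X, y, buffer, zero_based=True)

# Rewind and read it back; X comes back as a sparse matrix.
buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(buffer, n_features=3, zero_based=True)

print(X_loaded.toarray())
print(y_loaded)
```

Passing `n_features` and `zero_based` explicitly when loading avoids relying on automatic detection, which matters when trailing all-zero columns would otherwise be dropped.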

(2) Datasets for classification problems:

Load the filenames and data from the 20 newsgroups dataset:[<bunch>=]sklearn.datasets.fetch_20newsgroups([data_home=None,subset='train',categories=None,shuffle=True,random_state=42,remove=(),download_if_missing=True,return_X_y=False])
Load and vectorize the 20 newsgroups dataset:[<bunch>=]sklearn.datasets.fetch_20newsgroups_vectorized([subset='train',remove=(),data_home=None,download_if_missing=True,return_X_y=False,normalize=True,as_frame=False])
Load the covertype dataset:[<dataset>=]sklearn.datasets.fetch_covtype([data_home=None,download_if_missing=True,random_state=None,shuffle=False,return_X_y=False,as_frame=False])
Load the kddcup99 dataset:[<data>=]sklearn.datasets.fetch_kddcup99([subset=None,data_home=None,shuffle=False,random_state=None,percent10=True,download_if_missing=True,return_X_y=False,as_frame=False])
Load the Labeled Faces in the Wild (LFW) pairs dataset:[<data>=]sklearn.datasets.fetch_lfw_pairs([subset='train',data_home=None,funneled=True,resize=0.5,color=False,slice_=(slice(70,195),slice(78,172)),download_if_missing=True])
Load the Labeled Faces in the Wild (LFW) people dataset:[<dataset>=]sklearn.datasets.fetch_lfw_people([data_home=None,funneled=True,resize=0.5,min_faces_per_person=0,color=False,slice_=(slice(70,195),slice(78,172)),download_if_missing=True,return_X_y=False])
Load the Olivetti faces data-set from AT&T:[<data>=]sklearn.datasets.fetch_olivetti_faces([data_home=None,shuffle=False,random_state=0,download_if_missing=True,return_X_y=False])
Load the RCV1 multilabel dataset:[<dataset>=]sklearn.datasets.fetch_rcv1([data_home=None,subset='all',download_if_missing=True,random_state=None,shuffle=False,return_X_y=False])
Load and return the breast cancer wisconsin dataset:[<data>=]sklearn.datasets.load_breast_cancer([return_X_y=False,as_frame=False])
Load and return the digits datasets:[<data>=]sklearn.datasets.load_digits([n_class=10,return_X_y=False,as_frame=False])
Load and return the iris dataset:[<data>=]sklearn.datasets.load_iris([return_X_y=False,as_frame=False])
Load and return the wine dataset:[<data>=]sklearn.datasets.load_wine([return_X_y=False,as_frame=False])
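
As a brief sketch of what the `load_*` functions return, the example below loads the iris dataset both as a Bunch object and as an `(X, y)` tuple. The variable names are illustrative.

```python
from sklearn.datasets import load_iris

# Default: a Bunch, a dict-like object with attribute access.
iris = load_iris()
print(iris.data.shape, iris.target.shape)
print(list(iris.target_names))

# return_X_y=True returns only the feature matrix and the label vector.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```

The `load_*` datasets ship with the library, whereas the `fetch_*` functions download their data on first use and store it in the data home directory.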

(3) Datasets for regression problems:

Load the California housing dataset:[<dataset>=]sklearn.datasets.fetch_california_housing([data_home=None,download_if_missing=True,return_X_y=False,as_frame=False])
Load and return the boston house-prices dataset:[<data>=]sklearn.datasets.load_boston([return_X_y=False])
  #load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
Load and return the diabetes dataset:[<data>=]sklearn.datasets.load_diabetes([return_X_y=False,as_frame=False])
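
A minimal sketch of using one of the bundled regression datasets. The `Ridge` model, the 25% test split, and the fixed `random_state` are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# 442 samples, 10 features, continuous target.
X, y = load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

reg = Ridge(alpha=1.0).fit(X_train, y_train)
r2 = reg.score(X_test, y_test)  # coefficient of determination on held-out data
print("R^2:", r2)
```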

(4) Other datasets:

Fetch dataset from openml by name or dataset id:[<data>=]sklearn.datasets.fetch_openml([name=None,version='active',data_id=None,data_home=None,target_column='default-target',cache=True,return_X_y=False,as_frame='auto'])
Loader for species distribution dataset from Phillips et al. (2006):[<data>=]sklearn.datasets.fetch_species_distributions([data_home=None,download_if_missing=True])
Load and return the physical exercise linnerud dataset:[<data>=]sklearn.datasets.load_linnerud([return_X_y=False,as_frame=False])
Load the numpy array of a single sample image:[<img>=]sklearn.datasets.load_sample_image(<image_name>)
  #loads china.jpg or flower.jpg
Load sample images for image manipulation:[<data>=]sklearn.datasets.load_sample_images()
  #loads both china.jpg and flower.jpg
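
Among the loaders above, `load_linnerud` is a convenient one to try because it is bundled with the library and has a multi-output target. A short sketch:

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# Both the features and the targets have three columns.
print(linnerud.data.shape, linnerud.target.shape)
print(linnerud.feature_names, linnerud.target_names)

# The tuple form works here as well.
X, Y = load_linnerud(return_X_y=True)
```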

3. Sample generators:

Generate an array with constant block diagonal structure for biclustering:[<X>,<rows>,<cols>=]sklearn.datasets.make_biclusters(<shape>,<n_clusters>[,noise=0.0,minval=10,maxval=100,shuffle=True,random_state=None])
Generate isotropic Gaussian blobs for clustering:[<X>,<y>,<centers>=]sklearn.datasets.make_blobs([n_samples=100,n_features=2,centers=None,cluster_std=1.0,center_box=(-10.0,10.0),shuffle=True,random_state=None,return_centers=False])
Generate an array with block checkerboard structure for biclustering:[<X>,<rows>,<cols>=]sklearn.datasets.make_checkerboard(<shape>,<n_clusters>[,noise=0.0,minval=10,maxval=100,shuffle=True,random_state=None])
Make a large circle containing a smaller circle in 2d:[<X>,<y>=]sklearn.datasets.make_circles([n_samples=100,shuffle=True,noise=None,random_state=None,factor=0.8])
Generate a random n-class classification problem:[<X>,<y>=]sklearn.datasets.make_classification([n_samples=100,n_features=20,n_informative=2,n_redundant=2,n_repeated=0,n_classes=2,n_clusters_per_class=2,weights=None,flip_y=0.01,class_sep=1.0,hypercube=True,shift=0.0,scale=1.0,shuffle=True,random_state=None])
Generate the “Friedman #1” regression problem:[<X>,<y>=]sklearn.datasets.make_friedman1([n_samples=100,n_features=10,noise=0.0,random_state=None])
Generate the “Friedman #2” regression problem:[<X>,<y>=]sklearn.datasets.make_friedman2([n_samples=100,noise=0.0,random_state=None])
Generate the “Friedman #3” regression problem:[<X>,<y>=]sklearn.datasets.make_friedman3([n_samples=100,noise=0.0,random_state=None])
Generate isotropic Gaussian and label samples by quantile:[<X>,<y>=]sklearn.datasets.make_gaussian_quantiles([mean=None,cov=1.0,n_samples=100,n_features=2,n_classes=3,shuffle=True,random_state=None])
Generates data for binary classification used in Hastie et al:[<X>,<y>=]sklearn.datasets.make_hastie_10_2([n_samples=12000,random_state=None])
Generate a mostly low rank matrix with bell-shaped singular values:[<X>=]sklearn.datasets.make_low_rank_matrix([n_samples=100,n_features=100,effective_rank=10,tail_strength=0.5,random_state=None])
Make two interleaving half circles:[<X>,<y>=]sklearn.datasets.make_moons([n_samples=100,shuffle=True,noise=None,random_state=None])
Generate a random multilabel classification problem:[<X>,<y>,<p_c>,<p_w_c>=]sklearn.datasets.make_multilabel_classification([n_samples=100,n_features=20,n_classes=5,n_labels=2,length=50,allow_unlabeled=True,sparse=False,return_indicator='dense',return_distributions=False,random_state=None])
Generate a random regression problem:[<X>,<y>,<coef>=]sklearn.datasets.make_regression([n_samples=100,n_features=100,n_informative=10,n_targets=1,bias=0.0,effective_rank=None,tail_strength=0.5,noise=0.0,shuffle=True,coef=False,random_state=None])
Generate an S curve dataset:[<X>,<t>=]sklearn.datasets.make_s_curve([n_samples=100,noise=0.0,random_state=None])
Generate a signal as a sparse combination of dictionary elements:[<data>,<dictionary>,<code>=]sklearn.datasets.make_sparse_coded_signal(<n_samples>,<n_components>,<n_features>,<n_nonzero_coefs>[,random_state=None])
Generate a sparse symmetric definite positive matrix:[<prec>=]sklearn.datasets.make_sparse_spd_matrix([dim=1,alpha=0.95,norm_diag=False,smallest_coef=0.1,largest_coef=0.9,random_state=None])
Generate a random regression problem with sparse uncorrelated design:[<X>,<y>=]sklearn.datasets.make_sparse_uncorrelated([n_samples=100,n_features=10,random_state=None])
Generate a random symmetric, positive-definite matrix:[<X>=]sklearn.datasets.make_spd_matrix(<n_dim>[,random_state=None])
Generate a swiss roll dataset:[<X>,<t>=]sklearn.datasets.make_swiss_roll([n_samples=100,noise=0.0,random_state=None])
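
A sketch of a few of the generators listed above, showing the shapes they return. The sample counts and `random_state=0` are arbitrary choices made so that the output is reproducible.

```python
import numpy as np
from sklearn.datasets import (
    make_blobs,
    make_classification,
    make_moons,
    make_regression,
)

# Clustering: three Gaussian blobs in 2-D; return_centers=True adds the centers.
X_blobs, y_blobs, centers = make_blobs(
    n_samples=60, n_features=2, centers=3, random_state=0, return_centers=True
)

# Classification: 20 features, of which 2 are informative and 2 redundant.
X_clf, y_clf = make_classification(
    n_samples=80, n_features=20, n_informative=2, n_redundant=2, random_state=0
)

# Regression: coef=True also returns the coefficients of the underlying linear model.
X_reg, y_reg, coef = make_regression(
    n_samples=50, n_features=5, n_informative=3, coef=True, random_state=0
)

# Two interleaving half circles with a little Gaussian noise.
X_moons, y_moons = make_moons(n_samples=40, noise=0.1, random_state=0)

print(X_blobs.shape, centers.shape, X_clf.shape, X_reg.shape, X_moons.shape)
print("informative coefficients:", np.count_nonzero(coef))
```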

V. exceptions
1. Introduction:

This module (sklearn.exceptions) defines all of the custom errors and warnings used in sklearn

2. Usage:

Custom warning to capture convergence problems:class sklearn.exceptions.ConvergenceWarning
Warning used to notify implicit data conversions happening in the code:class sklearn.exceptions.DataConversionWarning
Custom warning to notify potential issues with data dimensionality:class sklearn.exceptions.DataDimensionalityWarning
Warning used to notify the user of inefficient computation:class sklearn.exceptions.EfficiencyWarning
Warning class used if there is an error while fitting the estimator:class sklearn.exceptions.FitFailedWarning
Exception class to raise if estimator is used before fitting:class sklearn.exceptions.NotFittedError
Warning used when the metric is invalid:class sklearn.exceptions.UndefinedMetricWarning
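
Two of these classes can be triggered on purpose, as in the sketch below. Using `LogisticRegression` with `max_iter=1` on the iris data is an assumption chosen to provoke a convergence warning; it is not a recommended setting.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning, NotFittedError
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1)  # far too few iterations on purpose

# Calling predict() before fit() raises NotFittedError.
try:
    model.predict(X)
    raised_not_fitted = False
except NotFittedError:
    raised_not_fitted = True

# Record warnings emitted during fitting instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X, y)

convergence_warned = any(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
print(raised_not_fitted, convergence_warned)
```

`NotFittedError` inherits from both `ValueError` and `AttributeError`, so existing `except ValueError` handlers will also catch it.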