Machine Learning - The sklearn Library and Examples

Latest recommended article published on 2024-06-24 21:05:35

开码牛


Category: python. Tags: python, machine learning

Article link: https://blog.csdn.net/helunqu2017/article/details/113148635


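I. Introduction to the sklearn library

scikit-learn is a simple and efficient tool for data mining and data analysis. It is built on NumPy, SciPy, and matplotlib. It mainly covers the following:

(1) By functionality:

Classification
Regression
Clustering
Dimensionality reduction
Model selection
Preprocessing

(2) By API module: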

sklearn.base: Base classes and utility functions
sklearn.cluster: Clustering
sklearn.cluster.bicluster: Biclustering (a module in older releases; the biclustering classes now live directly in sklearn.cluster)
sklearn.covariance: Covariance estimators
sklearn.model_selection: Model selection
sklearn.datasets: Datasets
sklearn.decomposition: Matrix decomposition
sklearn.dummy: Dummy estimators
sklearn.ensemble: Ensemble methods
sklearn.exceptions: Exceptions and warnings
sklearn.feature_extraction: Feature extraction
sklearn.feature_selection: Feature selection
sklearn.gaussian_process: Gaussian processes
sklearn.isotonic: Isotonic regression
sklearn.kernel_approximation: Kernel approximation
sklearn.kernel_ridge: Kernel ridge regression
sklearn.discriminant_analysis: Discriminant analysis
sklearn.linear_model: Generalized linear models
sklearn.manifold: Manifold learning
sklearn.metrics: Metrics
sklearn.mixture: Gaussian mixture models
sklearn.multiclass: Multiclass and multilabel classification
sklearn.multioutput: Multioutput regression and classification
sklearn.naive_bayes: Naive Bayes
sklearn.neighbors: Nearest neighbors
sklearn.neural_network: Neural network models
sklearn.calibration: Probability calibration
sklearn.cross_decomposition: Cross decomposition
sklearn.pipeline: Pipeline
sklearn.preprocessing: Preprocessing and normalization
sklearn.random_projection: Random projection
sklearn.semi_supervised: Semi-supervised learning
sklearn.svm: Support vector machines
sklearn.tree: Decision trees
sklearn.utils: Utilities

At my current beginner level, the parts I find myself using most often are clustering and classification (svm, tree, linear regression, etc.).

II. Clustering (cluster)

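Reading the sklearn.cluster API, there are two main kinds of content: classes for the various clustering methods, such as cluster.KMeans, and functions that can be called directly, such as cluster.k_means.

So in practice there are correspondingly two ways to use them.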

At the time of writing, sklearn.cluster listed nine clustering classes (newer releases add more, such as OPTICS):

AffinityPropagation: affinity propagation
AgglomerativeClustering: hierarchical (agglomerative) clustering
Birch
DBSCAN
FeatureAgglomeration: feature agglomeration
KMeans: k-means clustering
MiniBatchKMeans
MeanShift
SpectralClustering: spectral clustering
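These classes share the same estimator interface, so swapping one method for another is mostly a one-line change. The following is a minimal sketch rather than anything from the original article: the synthetic data from make_blobs and the parameter values (n_clusters=3, eps=0.5, min_samples=5) are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): 150 samples around 3 centers
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

clusterers = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=3),
    # DBSCAN takes no n_clusters; it infers the number of clusters from density
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

results = {}
for name, clusterer in clusterers.items():
    # fit_predict returns one label per sample; DBSCAN marks noise points as -1
    labels = clusterer.fit_predict(X)
    results[name] = labels
    print(name, "found", len(set(labels) - {-1}), "clusters")
```

Note that DBSCAN's result depends heavily on eps, so the number of clusters it reports on other data may differ from what KMeans is told to produce.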

Introduction to KMeans clustering

Use the class constructor to build a KMeans clusterer:

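sklearn.cluster.KMeans(
    n_clusters=8,                 # number of clusters, i.e. how many groups you want
    init='k-means++',             # method used to choose the initial cluster centers
    n_init=10,                    # number of runs with different initial centers; the best run is kept
    max_iter=300,                 # maximum number of iterations per run
    tol=0.0001,                   # tolerance used to declare convergence
    precompute_distances='auto',  # whether to precompute distances (removed in scikit-learn 1.0)
    verbose=0,                    # verbosity mode
    random_state=None,            # seed controlling the random initialization of the centers
    copy_x=True,                  # if True, the data is copied and the original is not modified
    n_jobs=1,                     # parallelism setting (removed in scikit-learn 1.0)
    algorithm='auto'              # k-means variant: 'auto', 'full', 'elkan'; 'full' is the classic EM-style algorithm (called 'lloyd' in newer releases)
)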

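Although there are many parameters, all of them have default values, so we usually do not need to pass them; set them only when the task requires it. Here is a simple example: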

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # random data: 100 samples, 3 features

# Suppose we want a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)  # build the clusterer
estimator.fit(data)               # run clustering on the sample data

estimator.fit_predict(data)    # refit and return the cluster label of each sample
estimator.fit_transform(data)  # compute clustering and transform X to cluster-distance space
estimator.predict(data)        # predict the closest cluster each sample in X belongs to
estimator.score(data)          # opposite of the value of X on the K-means objective
estimator.transform(data)      # transform X to a cluster-distance space

label_pred = estimator.labels_           # cluster labels, i.e. the clustering result
centroids = estimator.cluster_centers_   # cluster centers
inertia = estimator.inertia_             # sum of squared distances from each sample to its closest center
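How these methods and attributes relate to each other is easier to see with a check. The sketch below assumes a fixed seed and n_init=10 (neither is in the original) and verifies three relationships: the closest column of transform gives the predicted label, the squared minimum distances sum to inertia_, and score is the negative of inertia_ on the training data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)       # fixed seed (an assumption) so the run is repeatable
data = rng.rand(100, 3)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

distances = km.transform(data)       # shape (100, 3): distance from each sample to each center
nearest = distances.argmin(axis=1)   # index of the closest center per sample

# The closest center in cluster-distance space should be the predicted label
same_labels = np.array_equal(nearest, km.predict(data))

# inertia_ is the sum of squared distances to the closest center
inertia_by_hand = (distances.min(axis=1) ** 2).sum()

print("argmin(transform) matches predict:", same_labels)
print("inertia matches:", np.isclose(inertia_by_hand, km.inertia_))
print("score is -inertia:", np.isclose(km.score(data), -km.inertia_))
```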

Using the k_means function directly:

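import numpy as np
from sklearn import cluster

data = np.random.rand(100, 3)  # random data: 100 samples, 3 features
k = 3                          # suppose we want 3 clusters

# Returns the cluster centers, the label of each sample, and the sum of
# squared distances from each sample to its closest center
centroid, label, inertia = cluster.k_means(data, k)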
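The function form and the class form are two entry points to the same algorithm. As a rough check, and assuming the same random_state and n_init are passed to both (values chosen here for illustration), they should produce the same centers and labels:

```python
import numpy as np
from sklearn import cluster

rng = np.random.RandomState(0)   # fixed seed (an assumption) for repeatability
data = rng.rand(100, 3)

# Function form: results come back as a tuple
centroid, label, inertia = cluster.k_means(data, 3, n_init=10, random_state=0)

# Class form: results are stored as attributes on the fitted estimator
estimator = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

print("same centers:", np.allclose(centroid, estimator.cluster_centers_))
print("same labels:", np.array_equal(label, estimator.labels_))
```

The class form is usually the more convenient one when new samples need to be assigned later, since the fitted estimator also offers predict and transform.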

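III. Classification

Classification is one of the most important parts of data mining and machine learning. However, because the classic classification methods work quite differently from one another, sklearn does not appear to provide a single dedicated classifier class; each method lives in its own module.

Commonly used classification methods include:

KNN (nearest neighbors): sklearn.neighbors
Logistic regression: sklearn.linear_model.LogisticRegression
SVM (support vector machines): sklearn.svm
Naive Bayes: sklearn.naive_bayes
Decision tree: sklearn.tree
Neural network: sklearn.neural_network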
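Even though the classifiers live in separate modules, they all expose the same fit / predict / score methods. The sketch below is an illustration under assumptions not in the original: it uses the iris dataset, a 70/30 split with random_state=0, and default hyperparameters except where noted. The neural network is left out only to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LogisticRegression": LogisticRegression(max_iter=1000),  # raised so the solver converges
    "SVC": SVC(),
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                 # same call for every classifier
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the held-out data
    print(f"{name}: {scores[name]:.3f}")
```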

1. KNN example (mainly Nearest Neighbors Classification)

import numpy as np
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()

n_neighbors = 15
X = iris.data[:, :2]  # we only take the first two features; a two-dimensional dataset would avoid this slicing
y = iris.target

weights = 'distance'  # can also be set to 'uniform'
clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)

# If you have test data, just predict with the calls below.
# Here xx, yy form a constructed grid of test points.
h = 0.02  # grid step size; undefined in the original, 0.02 is an assumed value
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z holds the predicted labels
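The weights option controls how neighbors vote: 'uniform' gives every neighbor an equal vote, while 'distance' weights closer neighbors more heavily. One way to compare them is on held-out data. The split below (30% test, random_state=0) is an assumption for illustration, and which setting scores higher depends on the data.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, :2]   # first two features only, as in the example above
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

knn_scores = {}
for weights in ("uniform", "distance"):
    clf = neighbors.KNeighborsClassifier(n_neighbors=15, weights=weights)
    clf.fit(X_train, y_train)
    knn_scores[weights] = clf.score(X_test, y_test)  # accuracy on unseen samples
    print(weights, round(knn_scores[weights], 3))
```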

2. SVM example

from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

# build the support vector classification model
clf = svm.SVC()
# fit the training data to obtain the model parameters
clf.fit(X, y)

# predict for the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])
# print the predicted labels
print(res)

# get support vectors
print("support vectors:", clf.support_vectors_)
# get indices of support vectors
print("indices of support vectors:", clf.support_)
# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)
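To see what the fitted model actually computes, it can help to switch to a linear kernel, where the decision function is just w·x + b. This is a sketch under that assumption (the example above uses the default kernel, and the extra test point [-1, 0] is made up for illustration): the sign of decision_function decides which class predict returns.

```python
import numpy as np
from sklearn import svm

X = np.array([[0, 0], [1, 1]])
y = np.array([0, 1])

# Linear kernel (an assumption for this sketch) so that coef_ is available
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

X_new = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, 0.0]])

scores = clf.decision_function(X_new)                     # signed score per sample
manual = X_new @ clf.coef_.ravel() + clf.intercept_[0]    # w . x + b computed by hand
pred = clf.predict(X_new)

print("decision_function:", scores)
print("w . x + b:        ", manual)
print("predicted labels: ", pred)
```

A positive score maps to clf.classes_[1] and a negative score to clf.classes_[0].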

SVM also has a corresponding regression model, SVR:

from sklearn import svm

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]

clf = svm.SVR()
clf.fit(X, y)

res = clf.predict([[1, 1]])
print(res)

3. Logistic regression

from sklearn import linear_model

X = [[0, 0], [1, 1]]
y = [0, 1]

# create an instance of the LogisticRegression classifier and fit the data;
# a large C means weak regularization
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)

res = logreg.predict([[2, 2]])
print(res)
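Besides hard labels, LogisticRegression can report class probabilities through predict_proba. The sketch below reuses the two training points above; the three query points are made up for illustration. Each row of probabilities sums to 1, and predict returns the class with the largest probability.

```python
import numpy as np
from sklearn import linear_model

X = [[0, 0], [1, 1]]
y = [0, 1]

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)

# Hypothetical query points, chosen away from the decision boundary
X_new = [[2, 2], [-1, -1], [0.2, 0.2]]

proba = logreg.predict_proba(X_new)   # one row per sample, one column per class
pred = logreg.predict(X_new)

print("classes:", logreg.classes_)
print("probabilities:\n", proba)
print("predicted labels:", pred)
```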
