8 Top Python Machine Learning Algorithms You Must Learn

Latest recommended article published on 2024-07-18 15:30:34

weixin_33755554

Tags: artificial intelligence, data structures and algorithms, big data

Original link: https://juejin.im/post/5b83eb926fb9a019f47d1a7d


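Today, we will take a closer look at eight top Python machine learning algorithms and implement each of them.

Let us begin our tour of machine learning algorithms in Python.

The following are the Python machine learning algorithms: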

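1. Linear Regression

Linear regression is a supervised Python machine learning algorithm that observes continuous features and predicts a continuous outcome. Depending on whether it runs on a single variable or on many features, we call it simple linear regression or multiple linear regression.

It is one of the most popular Python ML algorithms, and an often under-appreciated one. It assigns optimal weights to the variables to create a line ax + b that predicts the output. We often use linear regression to estimate real values, such as the number of calls or the cost of a house, from continuous variables. The regression line is the best-fitting line Y = a*X + b, and it expresses the relationship between the independent and dependent variables.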
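To make "assigning optimal weights" concrete, here is a minimal sketch of ordinary least squares for a single feature, written with the standard library only. The function name `fit_line` and the toy data are assumptions made for illustration; this is not how scikit-learn implements `LinearRegression` internally.

```python
# Ordinary least squares for one feature: choose a and b that minimise
# the sum of squared errors between y and a * x + b.

def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    # The fitted line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b

# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With several features the same idea applies, but the slope becomes a weight vector and the closed form is usually solved with linear algebra routines.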


Let us plot this for the diabetes dataset.

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from sklearn import datasets, linear_model
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]  # keep a single feature
>>> diabetes_X_train = diabetes_X[:-30]  # split the data into training and test sets
>>> diabetes_X_test = diabetes_X[-30:]
>>> diabetes_y_train = diabetes.target[:-30]  # split the targets into training and test sets
>>> diabetes_y_test = diabetes.target[-30:]
>>> regr = linear_model.LinearRegression()  # linear regression object
>>> regr.fit(diabetes_X_train, diabetes_y_train)  # train the model on the training set
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> diabetes_y_pred = regr.predict(diabetes_X_test)  # make predictions
>>> regr.coef_
array([941.43097333])
>>> mean_squared_error(diabetes_y_test, diabetes_y_pred)
3035.0601152912695
>>> r2_score(diabetes_y_test, diabetes_y_pred)  # variance score
0.410920728135835
>>> plt.scatter(diabetes_X_test, diabetes_y_test, color='lavender')
<matplotlib.collections.PathCollection object at 0x0584FF70>
>>> plt.plot(diabetes_X_test, diabetes_y_pred, color='pink', linewidth=3)
[<matplotlib.lines.Line2D object at 0x0584FF30>]
>>> plt.xticks(())
([], <a list of 0 Text xticklabel objects>)
>>> plt.yticks(())
([], <a list of 0 Text yticklabel objects>)
>>> plt.show()

[Figure: Python machine learning algorithms - linear regression]

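2. Logistic Regression

Logistic regression is a supervised classification Python machine learning algorithm. It estimates discrete values such as 0/1, yes/no, and true/false from a given set of independent variables. We use a logistic function to predict the probability of an event, which gives an output between 0 and 1.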
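The step from a linear score to a probability can be sketched in a few lines. The coefficient and intercept below are hypothetical values chosen for illustration, not values fitted to any dataset in this article, and the 0.5 threshold is the usual default rather than a requirement.

```python
import math

def logistic(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, coef, intercept):
    """Probability of the positive class for a single feature value x."""
    return logistic(coef * x + intercept)

def predict(x, coef, intercept, threshold=0.5):
    """Turn the probability into a discrete 0/1 label."""
    return 1 if predict_proba(x, coef, intercept) >= threshold else 0

# Hypothetical model parameters, for illustration only
coef, intercept = 2.0, -1.0
for x in (-3.0, 0.5, 3.0):
    print(x, round(predict_proba(x, coef, intercept), 3), predict(x, coef, intercept))
```

The decision boundary sits where `coef * x + intercept` equals zero, because that is where the logistic function returns exactly 0.5.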

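Although it says 'regression', this is actually a classification algorithm. Logistic regression fits the data to a logit function, which is why it is also called logit regression. Let us plot this.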

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn import linear_model
>>> xmin, xmax = -7, 7  # test set; straight line with Gaussian noise
>>> n_samples = 77
>>> np.random.seed(0)
>>> x = np.random.normal(size=n_samples)
>>> y = (x > 0).astype(float)
>>> x[x > 0] *= 3
>>> x += .4 * np.random.normal(size=n_samples)
>>> x = x[:, np.newaxis]
>>> clf = linear_model.LogisticRegression(C=1e4)  # classifier
>>> clf.fit(x, y)
>>> plt.figure(1, figsize=(3, 4))
<Figure size 300x400 with 0 Axes>
>>> plt.clf()
>>> plt.scatter(x.ravel(), y, color='lavender', zorder=17)
<matplotlib.collections.PathCollection object at 0x057B0E10>
>>> x_test = np.linspace(-7, 7, 277)
>>> def model(x):
...     return 1 / (1 + np.exp(-x))
...
>>> loss = model(x_test * clf.coef_ + clf.intercept_).ravel()
>>> plt.plot(x_test, loss, color='pink', linewidth=2.5)
[<matplotlib.lines.Line2D object at 0x057BA090>]
>>> ols = linear_model.LinearRegression()
>>> ols.fit(x, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> plt.plot(x_test, ols.coef_ * x_test + ols.intercept_, linewidth=1)
[<matplotlib.lines.Line2D object at 0x057BA0B0>]
>>> plt.axhline(.4, color='0.4')
<matplotlib.lines.Line2D object at 0x05860E70>
>>> plt.ylabel('y')
Text(0,0.5,'y')
>>> plt.xlabel('x')
Text(0.5,0,'x')
>>> plt.xticks(range(-7, 7))
>>> plt.yticks([0, 0.4, 1])
>>> plt.ylim(-.25, 1.25)
(-0.25, 1.25)
>>> plt.xlim(-4, 10)
(-4, 10)
>>> plt.legend(('Logistic Regression', 'Linear Regression'), loc='lower right', fontsize='small')
<matplotlib.legend.Legend object at 0x057C89F0>
>>> plt.show()

[Figure: machine learning algorithms - logistic regression]

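3. Decision Trees

Decision trees belong to supervised Python machine learning and are used for both classification and regression, although mostly for classification. The model takes an instance, traverses the tree, and compares important features against fixed conditional statements. Whether it descends to the left child branch or the right one depends on the result. Usually, the more important features sit closer to the root.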
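The traversal described above can be sketched with a hand-built tree. The nested-dictionary layout, the thresholds, and the labels here are all hypothetical and exist only to show the control flow; a trained `DecisionTreeClassifier` stores its nodes differently.

```python
# A tiny hand-built tree: internal nodes test one feature against a threshold,
# leaves carry a class label. All values are made up for illustration.
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"label": "L"},
    "right": {
        "feature": 1, "threshold": 3.0,
        "left": {"label": "B"},
        "right": {"label": "R"},
    },
}

def classify(node, instance):
    """Walk from the root to a leaf and return the leaf's label."""
    while "label" not in node:
        # Go left when the tested feature is at or below the threshold
        if instance[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

print(classify(tree, [1, 5]))
print(classify(tree, [4, 2]))
print(classify(tree, [4, 4]))
```

The feature tested at the root is consulted for every instance, which is why the most informative features tend to end up there.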

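This Python machine learning algorithm works on both categorical and continuous dependent variables. Here, we split the population into two or more homogeneous sets. Let us see the algorithm in action.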

>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
>>> def importdata():  # import the data (the URL is truncated in the source)
...     balance_data = pd.read_csv(
...         'archive.ics.uci.edu/ml/machine-…' +
...         'databases/balance-scale/balance-scale.data',
...         sep=',', header=None)
...     print(len(balance_data))
...     print(balance_data.shape)
...     print(balance_data.head())
...     return balance_data
...
>>> def splitdataset(balance_data):  # split the data
...     x = balance_data.values[:, 1:5]  # the four feature columns
...     y = balance_data.values[:, 0]    # the class label
...     x_train, x_test, y_train, y_test = train_test_split(
...         x, y, test_size=0.3, random_state=100)
...     return x, y, x_train, x_test, y_train, y_test
...
>>> def train_using_gini(x_train, x_test, y_train):  # train with the Gini index
...     clf_gini = DecisionTreeClassifier(criterion="gini",
...         random_state=100, max_depth=3, min_samples_leaf=5)
...     clf_gini.fit(x_train, y_train)
...     return clf_gini
...
>>> def train_using_entropy(x_train, x_test, y_train):  # train with entropy
...     clf_entropy = DecisionTreeClassifier(criterion="entropy",
...         random_state=100, max_depth=3, min_samples_leaf=5)
...     clf_entropy.fit(x_train, y_train)
...     return clf_entropy
...
>>> def prediction(x_test, clf_object):  # make predictions
...     y_pred = clf_object.predict(x_test)
...     print(f"Predicted values: {y_pred}")
...     return y_pred
...
>>> def cal_accuracy(y_test, y_pred):  # compute the accuracy
...     print(confusion_matrix(y_test, y_pred))
...     print(accuracy_score(y_test, y_pred) * 100)
...     print(classification_report(y_test, y_pred))
...
>>> data = importdata()
625
(625, 5)
   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5
>>> x, y, x_train, x_test, y_train, y_test = splitdataset(data)
>>> clf_gini = train_using_gini(x_train, x_test, y_train)
>>> clf_entropy = train_using_entropy(x_train, x_test, y_train)
>>> y_pred_gini = prediction(x_test, clf_gini)

[Figure: Python machine learning algorithms - decision tree]

>>> cal_accuracy(y_test, y_pred_gini)
[[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
73.40425531914893

[Figure: Python machine learning algorithms - decision tree]

>>> y_pred_entropy = prediction(x_test, clf_entropy)

[Figure: Python machine learning algorithms - decision tree]

>>> cal_accuracy(y_test, y_pred_entropy)
[[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
70.74468085106383

[Figure: Python machine learning algorithms - decision tree]
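The two `criterion` values used above, "gini" and "entropy", both measure how mixed the class labels in a node are, and a split is good when it makes the child nodes purer. The following is a minimal standard-library sketch of the two measures; the helper names and the sample labels are assumptions for illustration, not scikit-learn internals.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

# Hypothetical nodes: one perfectly pure, one evenly mixed
pure = ["R", "R", "R", "R"]
mixed = ["L", "L", "R", "R"]
print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two criteria often produce similar trees.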

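4. Support Vector Machines (SVM)

SVM is a supervised classification Python machine learning algorithm that draws a line dividing the different categories of data. In this ML algorithm, we compute the vector that optimizes the line, so that the closest points in each group end up as far away from it as possible. Although you will almost always see this as a linear vector, it does not have to be one.

In this Python machine learning tutorial, we plot each data item as a point in an n-dimensional space. We have n features, and the value of each feature is the value of one coordinate.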
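The quantity an SVM optimizes is the margin: the distance from the separating line to the nearest point of either class. The sketch below is not an SVM solver. It only scores a few hand-picked candidate lines by their margin and keeps the widest one, and the points, labels, and candidate lines are all hypothetical.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    Labels are +1 or -1. The result is positive only when every point
    lies on its own side of the line.
    """
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * x + w[1] * y + b) / norm
               for (x, y), label in zip(points, labels))

# Hypothetical, linearly separable toy data
points = [(1.0, 1.0), (2.0, 0.5), (4.0, 4.0), (5.0, 3.5)]
labels = [-1, -1, 1, 1]

# Candidate separating lines as (w, b), meaning w[0]*x + w[1]*y + b = 0
candidates = {
    "x + y = 5": ((1.0, 1.0), -5.0),
    "x = 3": ((1.0, 0.0), -3.0),
    "y = 2": ((0.0, 1.0), -2.0),
}

best = max(candidates, key=lambda name: margin(*candidates[name], points, labels))
for name, (w, b) in candidates.items():
    print(name, round(margin(w, b, points, labels), 3))
print("widest margin:", best)
```

All three lines separate the two classes, but they leave different amounts of room. A real SVM searches over all possible lines instead of a short list.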

First, let us plot a dataset.

>>> from sklearn.datasets import make_blobs
>>> x, y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
<matplotlib.collections.PathCollection object at 0x04E1BBF0>
>>> plt.show()

[Figure: Python machine learning algorithms - SVM]

>>> import numpy as np
>>> xfit = np.linspace(-1, 3.5)
>>> plt.scatter(x[:, 0], x[:, 1], c=y, s=50, cmap='plasma')
<matplotlib.collections.PathCollection object at 0x07318C90>
>>> for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
...     yfit = m * xfit + b  # candidate separating line
...     plt.plot(xfit, yfit, '-k')
...     plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
...                      color='#AFFEDC', alpha=0.4)  # margin of width d around the line
...
[<matplotlib.lines.Line2D object at 0x07318FF0>]
<matplotlib.collections.PolyCollection object at 0x073242D0>
[<matplotlib.lines.Line2D object at 0x07318B70>]
<matplotlib.collections.PolyCollection object at 0x073246F0>
[<matplotlib.lines.Line2D object at 0x07324370>]
<matplotlib.collections.PolyCollection object at 0x07324B30>
>>> plt.xlim(-1, 3.5)
(-1, 3.5)
>>> plt.show()

[Figure: Python machine learning algorithms - SVM]

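5. Naive Bayes

Naive Bayes is a classification method based on Bayes' theorem. It assumes independence between predictors: a naive Bayes classifier assumes that a feature in a class is unrelated to any other feature. Consider a fruit. It may be considered an apple if it is round, red, and about 2.5 inches in diameter. A naive Bayes classifier treats each of these features as contributing independently to the probability that the fruit is an apple, even if the features actually depend on each other.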
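The apple example can be worked through numerically. Every probability below is a made-up number chosen for illustration, and the second class ("orange") is an assumption added so that there is something to compare against. The point is only the mechanics: multiply the prior by one likelihood per feature, then normalize.

```python
# Hypothetical priors and per-feature likelihoods P(feature | class)
priors = {"apple": 0.5, "orange": 0.5}
likelihoods = {
    "apple": {"round": 0.9, "red": 0.8, "about 2.5 in": 0.7},
    "orange": {"round": 0.9, "red": 0.1, "about 2.5 in": 0.6},
}

def posterior(features):
    """Return P(class | features) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in features:
            # Independence: each feature contributes its own factor
            score *= likelihoods[cls][feature]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

result = posterior(["round", "red", "about 2.5 in"])
print(result)
```

Because "round" is equally likely for both classes here, it does not move the result. The colour feature does almost all of the work.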

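A naive Bayes model is easy to build, even for very large datasets. Despite its simplicity, it can hold its own against, and sometimes outperform, far more sophisticated classification methods. Let us build one.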

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn import datasets
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.model_selection import train_test_split
>>> iris = datasets.load_iris()
>>> x = iris.data
>>> y = iris.target
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
>>> gnb = GaussianNB()
>>> mnb = MultinomialNB()
>>> y_pred_gnb = gnb.fit(x_train, y_train).predict(x_test)
>>> cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)
>>> cnf_matrix_gnb
array([[16,  0,  0],
       [ 0, 18,  0],
       [ 0,  0, 11]], dtype=int64)
>>> y_pred_mnb = mnb.fit(x_train, y_train).predict(x_test)
>>> cnf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb)
>>> cnf_matrix_mnb
array([[16,  0,  0],
       [ 0,  0, 18],
       [ 0,  0, 11]], dtype=int64)

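6. kNN (k-Nearest Neighbors)

This is a Python machine learning algorithm for classification and regression, although it is mostly used for classification. It is a supervised learning algorithm that keeps the training examples and compares a new point against them with a distance function, usually the Euclidean distance. It classifies a new case by a majority vote of its k nearest neighbors: the case is assigned to the class that is most common among them.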
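The majority vote can be sketched with the standard library alone. The function name `knn_predict` and the toy points are assumptions for illustration; `KNeighborsClassifier` uses faster neighbor-search structures, but the decision rule is the same idea.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 6.0), (6.0, 7.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))
print(knn_predict(train_points, train_labels, (5.0, 5.0), k=3))
```

Note that there is no training step beyond storing the data. All the work happens at prediction time.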

I. Training and testing on the entire dataset

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> x = iris.data
>>> y = iris.target
>>> from sklearn.linear_model import LogisticRegression
>>> logreg = LogisticRegression()
>>> logreg.fit(x, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
>>> logreg.predict(x)
array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
       0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
       2,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,
       1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,
       2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
>>> y_pred = logreg.predict(x)
>>> len(y_pred)
150
>>> from sklearn import metrics
>>> metrics.accuracy_score(y, y_pred)
0.96
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(x, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x)
>>> metrics.accuracy_score(y, y_pred)
1.0

II. Splitting into train/test
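Scoring a model on the same rows it was trained on rewards memorization, so it is common to hold some rows back and score on those instead. The sketch below shows one way such a hold-out split could be written with the standard library. The helper `holdout_split` is hypothetical and is not scikit-learn's `train_test_split` implementation, so the same seed will not reproduce the same rows.

```python
import random

def holdout_split(xs, ys, test_size=0.4, seed=4):
    """Shuffle the row indices, then reserve a fraction of them for testing."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = int(round(len(xs) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(xs, train_idx), pick(xs, test_idx),
            pick(ys, train_idx), pick(ys, test_idx))

# Hypothetical toy data: ten rows with one feature each
xs = [[float(i)] for i in range(10)]
ys = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = holdout_split(xs, ys)
print(len(x_train), len(x_test))
```

Features and labels are selected with the same indices, so each row stays paired with its own label after shuffling.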

>>> x.shape
(150, 4)
>>> y.shape
(150,)
>>> from sklearn.model_selection import train_test_split
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
>>> x_train.shape
(90, 4)
>>> x_test.shape
(60, 4)
>>> y_train.shape
(90,)
>>> y_test.shape
(60,)
>>> logreg = LogisticRegression()
>>> logreg.fit(x_train, y_train)
>>> y_pred = logreg.predict(x_test)
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(x_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
>>> y_pred = knn.predict(x_test)
>>> metrics.accuracy_score(y_test, y_pred)
0.9666666666666667
>>> k_range = range(1, 26)
>>> scores = []
>>> for k in k_range:  # try every k from 1 to 25
...     knn = KNeighborsClassifier(n_neighbors=k)
...     knn.fit(x_train, y_train)
...     y_pred = knn.predict(x_test)
...     scores.append(metrics.accuracy_score(y_test, y_pred))
...
>>> scores
[0.95, 0.95, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9833333333333333, 0.9666666666666667, 0.9833333333333333, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.9666666666666667, 0.95, 0.95]
>>> import matplotlib.pyplot as plt
>>> plt.plot(k_range, scores)
[<matplotlib.lines.Line2D object at 0x05FDECD0>]
>>> plt.xlabel('k for kNN')
Text(0.5,0,'k for kNN')
>>> plt.ylabel('Testing Accuracy')
Text(0,0.5,'Testing Accuracy')
>>> plt.show()

[Figure: Python machine learning algorithms - kNN (k-Nearest Neighbors)]


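7. k-Means

k-Means is an unsupervised algorithm that solves the clustering problem. It partitions the data into a chosen number of clusters. Data points inside a cluster are homogeneous, and heterogeneous with respect to the points in other clusters.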
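The clustering loop itself is short: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The sketch below runs that loop on six two-dimensional points. Starting from the first two points as centroids is a simplifying assumption made here; scikit-learn's `KMeans` uses the smarter k-means++ initialization by default.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and update until stable."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 2.0), (5.0, 8.0), (1.5, 1.8), (8.0, 8.0), (1.0, 0.6), (9.0, 11.0)]
# Assumption: use the first two points as the starting centroids
centroids, labels = kmeans(points, centroids=[points[0], points[1]])
print(centroids)
print(labels)
```

k-means can settle in different solutions depending on where the centroids start, which is why library implementations typically run several initializations and keep the best one.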

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib import style
>>> style.use('ggplot')
>>> from sklearn.cluster import KMeans
>>> x = [1, 5, 1.5, 8, 1, 9]
>>> y = [2, 8, 1.7, 6, 0.2, 12]
>>> plt.scatter(x, y)
<matplotlib.collections.PathCollection object at 0x0642AF30>
>>> x = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
>>> kmeans = KMeans(n_clusters=2)
>>> kmeans.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
>>> centroids = kmeans.cluster_centers_
>>> labels = kmeans.labels_
>>> centroids
array([[1.16666667, 1.46666667],
       [7.33333333, 9.        ]])
>>> labels
array([0, 1, 0, 1, 0, 1])
>>> colors = ['g.', 'r.', 'c.', 'y.']
>>> for i in range(len(x)):  # plot each point in the colour of its cluster
...     print(x[i], labels[i])
...     plt.plot(x[i][0], x[i][1], colors[labels[i]], markersize=10)
...
[1. 2.] 0
[<matplotlib.lines.Line2D object at 0x0642AE10>]
[5. 8.] 1
[<matplotlib.lines.Line2D object at 0x06438930>]
[1.5 1.8] 0
[<matplotlib.lines.Line2D object at 0x06438BF0>]
[8. 8.] 1
[<matplotlib.lines.Line2D object at 0x06438EB0>]
[1.  0.6] 0
[<matplotlib.lines.Line2D object at 0x06438FB0>]
[ 9. 11.] 1
[<matplotlib.lines.Line2D object at 0x043B1410>]
>>> plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=150, linewidths=5, zorder=10)
<matplotlib.collections.PathCollection object at 0x043B14D0>
>>> plt.show()

[Figure: Python machine learning algorithms - k-Means]

8. Random Forest

A random forest is an ensemble of decision trees. To classify a new object based on its attributes, every tree gives a classification, that is, it votes for a class. The classification with the most votes wins in the forest.
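The voting step can be sketched on its own. The three "trees" below are hand-written rules with made-up thresholds over a hypothetical (petal length, petal width) pair. A real random forest would learn each tree from a bootstrap sample of the data with random feature subsets, which this sketch leaves out.

```python
from collections import Counter

# Three hypothetical "trees": fixed rules over (petal_length, petal_width)
def tree_a(sample):
    return "setosa" if sample[0] < 2.5 else "versicolor"

def tree_b(sample):
    if sample[1] < 0.8:
        return "setosa"
    return "versicolor" if sample[1] < 1.7 else "virginica"

def tree_c(sample):
    if sample[0] < 2.5:
        return "setosa"
    return "versicolor" if sample[0] < 4.9 else "virginica"

def forest_predict(trees, sample):
    """Let every tree vote, then return the class with the most votes."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
print(forest_predict(trees, (1.4, 0.2)))
print(forest_predict(trees, (5.5, 2.0)))
```

In the second call `tree_a` cannot predict "virginica" at all, yet the forest still gets there because the other two trees outvote it. That tolerance for individual weak trees is the main appeal of the ensemble.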

>>> import numpy as np
>>> import pylab as pl
>>> x = np.random.uniform(1, 100, 1000)
>>> y = np.log(x) + np.random.normal(0, .3, 1000)  # log(x) plus Gaussian noise
>>> pl.scatter(x, y, s=1, label='log(x) with noise')
<matplotlib.collections.PathCollection object at 0x0434EC50>
>>> pl.plot(np.arange(1, 100), np.log(np.arange(1, 100)), c='b', label='log(x) true function')
[<matplotlib.lines.Line2D object at 0x0434EB30>]
>>> pl.xlabel('x')
Text(0.5,0,'x')
>>> pl.ylabel('f(x)=log(x)')
Text(0,0.5,'f(x)=log(x)')
>>> pl.legend(loc='best')
<matplotlib.legend.Legend object at 0x04386450>
>>> pl.title('A Basic Log Function')
Text(0.5,1,'A Basic Log Function')
>>> pl.show()

[Figure: Python machine learning algorithms - random forest]

>>> from sklearn.datasets import load_iris
>>> from sklearn.ensemble import RandomForestClassifier
>>> import pandas as pd
>>> import numpy as np
>>> iris = load_iris()
>>> df = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75  # roughly 75% of rows for training
>>> df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
>>> df.head()
   sepal length (cm)  sepal width (cm)  ...  is_train  species
0                5.1               3.5  ...      True   setosa
1                4.9               3.0  ...      True   setosa
2                4.7               3.2  ...      True   setosa
3                4.6               3.1  ...      True   setosa
4                5.0               3.6  ...     False   setosa

[5 rows x 6 columns]
>>> train, test = df[df['is_train'] == True], df[df['is_train'] == False]
>>> features = df.columns[:4]
>>> clf = RandomForestClassifier(n_jobs=2)
>>> y, _ = pd.factorize(train['species'])  # encode species names as integers
>>> clf.fit(train[features], y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
>>> preds = iris.target_names[clf.predict(test[features])]
>>> pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
preds       setosa  versicolor  virginica
actual
setosa          12           0          0
versicolor       0          17          2
virginica        0           1         15

So, that was all for this tutorial on Python machine learning algorithms. We hope you liked it.

Today we discussed eight important Python machine learning algorithms. Which one do you think has the most potential?


Reposted from: https://juejin.im/post/5b83eb926fb9a019f47d1a7d