Machine Learning with PySpark

weixin_26756255

Original article: https://medium.com/towards-artificial-intelligence/machine-learning-with-pyspark-23d54d82dbc4

Machine Learning, Programming

In this article, I share some machine learning work I have done in Spark using PySpark.

Machine learning is one of the most popular applications of artificial intelligence (AI), which is a much larger ecosystem with many other applications. In simple terms, machine learning is the ability of a machine to learn automatically and improve from experience without being explicitly programmed. The learning process starts with observing data; the machine then finds patterns in the data and uses what it has learned to make better decisions.

Categories of Machine Learning Algorithms

Algorithms are usually divided into supervised and unsupervised learning, but some fall in between:

Supervised machine learning algorithms
Unsupervised machine learning algorithms
Semi-supervised machine learning algorithms
Reinforcement machine learning algorithms

Why feed data to a machine?

One of the main reasons is the quantity of data. Can you imagine how much data Amazon's servers produce in a day? A lot. The second reason is quality: a machine is generally faster and more accurate, and it takes less effort.

Machine Learning in Spark

Apache Spark has a machine learning library called MLlib.

From https://spark.apache.org/docs/1.1.0/mllib-guide.html:

MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.


Many machine learning methods are built into this library, which lets data scientists focus on their tasks instead of configuration and infrastructure.

We are going to discuss three popular machine learning algorithms:

Decision Tree Regression
Random Forest Regression
Gradient Boosted Tree Regression

So, what is regression in machine learning?

Regression is the task of predicting the value of a target (usually a numerical variable) by building a model based on one or more predictors. Predictors can be numerical or categorical variables. Categorical variables, also called nominal variables, have categories but no intrinsic ordering; examples are gender and hair color.
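
Because categorical predictors have no intrinsic ordering, they are often converted to numbers before modeling. One common approach (not necessarily what the models later in this article do internally) is one-hot encoding. The sketch below is plain Python; the `one_hot_encode` helper and the sample hair colors are made up for illustration.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    encoded = [[1.0 if value == category else 0.0 for category in categories]
               for value in values]
    return categories, encoded


# Hypothetical categorical predictor: no value is "larger" than another.
hair_colors = ["brown", "black", "blond", "black"]
categories, encoded = one_hot_encode(hair_colors)

for color, vector in zip(hair_colors, encoded):
    print(color, vector)
```

Each category gets its own column, so the model never sees an artificial ordering such as black < blond < brown.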

Regression can also be described in statistical terms: it is a statistical process for estimating the relationships between a dependent variable and one or more independent variables. For more, see https://en.wikipedia.org/wiki/Regression_analysis
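
To make the statistical description concrete, here is a minimal sketch of the simplest case: one independent variable and a straight-line fit by ordinary least squares. It is plain Python rather than Spark, and the function name and data are assumptions chosen for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data in which the dependent variable is exactly 2 * x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear_regression(xs, ys)

# Use the estimated relationship to predict the target for a new x.
predicted = intercept + slope * 5.0
print(intercept, slope, predicted)
```

The tree-based models discussed below estimate the same kind of relationship, but without assuming it is a straight line.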

Datasets and Problems

We are predicting next year's housing price based on this year's price.

1. Decision Tree Regression

This is a simple regression model built in the form of a tree structure. Here is how it works. It recursively breaks the dataset down into smaller and smaller subsets until a stopping threshold is reached, for example a minimum number of rows per subset (e.g. 50) or a maximum tree depth. At each step it picks the split that most reduces the variance of the target within the resulting subsets. Each final subset becomes a leaf, and the prediction for a leaf is the average target value of the training rows that fall into it. A decision tree can work with both categorical and numerical data. For more detail on decision trees, see the image source below.
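
The core step of the procedure described above is choosing one split. The sketch below shows that single step in plain Python, using weighted variance as the split criterion; the helper names and the tiny dataset are assumptions for illustration, and a real implementation such as Spark's adds binning, depth limits, and recursion.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, score) for the split x <= threshold with the lowest weighted variance."""
    best_threshold, best_score = None, float("inf")
    n = len(xs)
    # Every distinct x except the largest is a candidate threshold.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up data with two obvious groups of target values.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, score = best_split(xs, ys)

# Each side of the split becomes a leaf that predicts the mean of its targets.
left_targets = [y for x, y in zip(xs, ys) if x <= threshold]
right_targets = [y for x, y in zip(xs, ys) if x > threshold]
left_prediction = sum(left_targets) / len(left_targets)
right_prediction = sum(right_targets) / len(right_targets)
print(threshold, left_prediction, right_prediction)
```

Applying `best_split` again to each side, until a stopping threshold is hit, yields the full tree.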

Image for post — Image Source: https://saedsayad.com/decision_tree_reg.htm

Code


# %%
# Create a Spark session so that we can create DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# %%
# Import the algorithm and the evaluator needed to build the model and measure its performance
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# %%
# Load the data, print the schema, and show the top 20 rows
ecommerceData = spark.read.csv(r'Data/ServiceUsage.csv', header=True, inferSchema=True)
ecommerceData.printSchema()
ecommerceData.show()

# %%
# Import the vector assembler used to transform the dataset
from pyspark.ml.feature import VectorAssembler

# %%
# Feed the DataFrame into the vector assembler (a transformer) and combine 4 columns into one column called "Features"
assembler = VectorAssembler(inputCols=['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership'], outputCol='Features')
transformedEcommerceData = assembler.transform(ecommerceData)
transformedEcommerceData.show()

# %%
# Keep only two columns for the model: the features and the known values we are trying to predict
finalData = transformedEcommerceData.select('Features', 'Yearly Amount Spent')
finalData.show()

# %%
# Randomly split the rows into 70% for training and 30% for testing
trainingData, testingData = finalData.randomSplit([0.7, 0.3])

# %%
# Decision Tree Regression
decisionTree = DecisionTreeRegressor(featuresCol="Features", labelCol="Yearly Amount Spent", maxDepth=15, maxBins=32)
decisionTreeModel = decisionTree.fit(trainingData)
dtresults = decisionTreeModel.transform(testingData)
dtresults.select("prediction", "Yearly Amount Spent", "Features").show()

# Use RMSE to evaluate the model
dtevaluator = RegressionEvaluator(labelCol="Yearly Amount Spent", predictionCol="prediction", metricName="rmse")
dtrmse = dtevaluator.evaluate(dtresults)
print("Decision Tree RMSE: ", dtrmse)
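
The evaluator above reports the root mean squared error (RMSE): the square root of the average squared difference between the known labels and the predictions. As a rough illustration of what that number means, here is the same calculation in plain Python; the `rmse` helper and the label and prediction values are made up and are not output from the model above.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equal-length lists of numbers."""
    squared_errors = [(label - prediction) ** 2
                      for label, prediction in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical yearly amounts spent and the corresponding predictions.
labels = [500.0, 450.0, 600.0]
predictions = [510.0, 440.0, 600.0]
value = rmse(labels, predictions)
print(value)
```

RMSE is expressed in the same units as the label, so a lower value means the predictions are, on average, closer to the known amounts.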

Prediction


2. Random Forest Regression

Random forest regression is one of the most effective and accurate machine learning models for prediction. It is a supervised learning algorithm: a meta-estimator that fits several regression decision trees on various sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting. It is very good at handling tabular data with numerical or categorical features. For more on random forest regression, see this article: https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
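
The "fit on sub-samples, then average" idea can be sketched in a few lines of plain Python. This is a simplified illustration, not Spark's implementation: each "tree" here is a one-split stump, only the rows are resampled (real random forests also sample features), and all names and data are assumptions.

```python
import random


def fit_stump(rows):
    """Fit a one-split regression tree on (x, y) rows and return a predict function."""
    overall_mean = sum(y for _, y in rows) / len(rows)
    best = (float("inf"), None, overall_mean, overall_mean)
    for threshold in sorted({x for x, _ in rows})[:-1]:
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best

    def predict(x):
        # With no usable split (a single distinct x), fall back to the overall mean.
        if threshold is None:
            return left_mean
        return left_mean if x <= threshold else right_mean

    return predict


def fit_forest(rows, num_trees, seed):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(num_trees)]


def predict_forest(trees, x):
    """Average the predictions of all trees."""
    return sum(tree(x) for tree in trees) / len(trees)


# Made-up (x, y) rows with a low group and a high group.
rows = [(1.0, 5.0), (2.0, 6.0), (3.0, 7.0), (10.0, 20.0), (11.0, 21.0), (12.0, 22.0)]
forest = fit_forest(rows, num_trees=25, seed=0)
low = predict_forest(forest, 2.0)
high = predict_forest(forest, 11.0)
print(low, high)
```

Because every tree sees a slightly different sample, their individual errors differ, and averaging them tends to give a steadier prediction than any single tree.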

Code


# %%
# Create a Spark session so that we can create DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# %%
# Import the algorithm and the evaluator needed to build the model and measure its performance
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# %%
# Load the data, print the schema, and show the top 20 rows
ecommerceData = spark.read.csv(r'Data/ServiceUsage.csv', header=True, inferSchema=True)
ecommerceData.printSchema()
ecommerceData.show()

# %%
# Import the vector assembler used to transform the dataset
from pyspark.ml.feature import VectorAssembler

# %%
# Feed the DataFrame into the vector assembler (a transformer) and combine 4 columns into one column called "Features"
assembler = VectorAssembler(inputCols=['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership'], outputCol='Features')
transformedEcommerceData = assembler.transform(ecommerceData)
transformedEcommerceData.show()

# %%
# Keep only two columns for the model: the features and the known values we are trying to predict
finalData = transformedEcommerceData.select('Features', 'Yearly Amount Spent')
finalData.show()

# %%
# Randomly split the rows into 70% for training and 30% for testing
trainingData, testingData = finalData.randomSplit([0.7, 0.3])

# %%
# Random Forest Regression
randomForest = RandomForestRegressor(featuresCol="Features", labelCol="Yearly Amount Spent", maxDepth=15, maxBins=32, numTrees=200)
randomForestModel = randomForest.fit(trainingData)
rfresults = randomForestModel.transform(testingData)
rfresults.select("prediction", "Yearly Amount Spent", "Features").show()

# Use RMSE to evaluate the model
rfevaluator = RegressionEvaluator(labelCol="Yearly Amount Spent", predictionCol="prediction", metricName="rmse")
rfrmse = rfevaluator.evaluate(rfresults)
print("Random Forest RMSE: ", rfrmse)

Prediction


3. Gradient Boosted Tree Regression

Boosting is a machine learning method that converts weak learners into a strong learner. It combines many simple models so that the final model is a strong predictor. Gradient boosting uses gradient descent to minimize the loss. It can perform regression, classification, and ranking, and it is one of the more powerful machine learning algorithms.

It builds a series of trees in which each tree is trained to correct the mistakes of the previous trees in the series. Typically, gradient boosted tree ensembles use many shallow trees, known as weak learners. The trees are built in a non-random way, producing a model that makes fewer and fewer mistakes as more trees are added. Once trained, the model is fast at prediction time and doesn't use a lot of memory. For more information on gradient boosted decision trees, see this video on Coursera: https://www.coursera.org/lecture/python-machine-learning/gradient-boosted-decision-trees-emwn3
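
The "each tree corrects the previous ones" idea can be sketched in plain Python for the squared-error loss, where the negative gradient is simply the residual (label minus current prediction). This is an illustrative simplification, not Spark's `GBTRegressor`: the weak learners are one-split stumps, and the function names, learning rate, and data are assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, targets):
    """Fit a one-split regression tree; return (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted_stumps(xs, ys, num_trees, learning_rate):
    """Start from the mean label, then repeatedly fit a stump to the current residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction that reduces the residuals.
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps


def predict(model, x, learning_rate):
    base, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def training_mse(xs, ys, num_trees, learning_rate=0.5):
    model = fit_boosted_stumps(xs, ys, num_trees, learning_rate)
    return mean([(y - predict(model, x, learning_rate)) ** 2 for x, y in zip(xs, ys)])


# Made-up data; compare the training error as more stumps are added.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]
mse_0 = training_mse(xs, ys, num_trees=0)
mse_1 = training_mse(xs, ys, num_trees=1)
mse_20 = training_mse(xs, ys, num_trees=20)
print(mse_0, mse_1, mse_20)
```

The training error shrinks as stumps are added, which is the behavior the paragraph above describes; on real data the number of trees and the learning rate are tuned against held-out data to avoid over-fitting.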

Code


# %%
# Create a Spark session so that we can create DataFrames
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# %%
# Import the algorithm and the evaluator needed to build the model and measure its performance
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# %%
# Load the data, print the schema, and show the top 20 rows
ecommerceData = spark.read.csv(r'Data/ServiceUsage.csv', header=True, inferSchema=True)
ecommerceData.printSchema()
ecommerceData.show()

# %%
# Import the vector assembler used to transform the dataset
from pyspark.ml.feature import VectorAssembler

# %%
# Feed the DataFrame into the vector assembler (a transformer) and combine 4 columns into one column called "Features"
assembler = VectorAssembler(inputCols=['Avg Session Length', 'Time on App', 'Time on Website', 'Length of Membership'], outputCol='Features')
transformedEcommerceData = assembler.transform(ecommerceData)
transformedEcommerceData.show()

# %%
# Keep only two columns for the model: the features and the known values we are trying to predict
finalData = transformedEcommerceData.select('Features', 'Yearly Amount Spent')
finalData.show()

# %%
# Randomly split the rows into 70% for training and 30% for testing
trainingData, testingData = finalData.randomSplit([0.7, 0.3])

# %%
# Gradient Boosted Tree Regression
boostedTree = GBTRegressor(featuresCol="Features", labelCol="Yearly Amount Spent", maxDepth=5, maxBins=32, maxIter=200)
boostedTreeModel = boostedTree.fit(trainingData)
gbtresults = boostedTreeModel.transform(testingData)
gbtresults.select("prediction", "Yearly Amount Spent", "Features").show()

# Use RMSE to evaluate the model
gbtevaluator = RegressionEvaluator(labelCol="Yearly Amount Spent", predictionCol="prediction", metricName="rmse")
gbtrmse = gbtevaluator.evaluate(gbtresults)
print("Gradient-Boosted Tree RMSE: ", gbtrmse)

Prediction


Conclusion

The predictions turned out well. All three models were easy to implement, but random forest and GBT took longer to run. The best algorithm in terms of score and prediction was GBT.
