3. Classification Models
3.1 Continue from Task 5: treat Type 1 as the label and label-encode it
# encoding=utf-8
from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Task 6: SparkML basics: classification models
spark = SparkSession.builder.appName('pyspark').getOrCreate()
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://"+SparkFiles.get("Pokemon.csv")
df = spark.read.csv(path=path, header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df = df.withColumn("Legendary", col("Legendary").cast('string'))
# df.show()
# Step 1: continue from Task 5; treat Type 1 as the label and label-encode it
indexer = StringIndexer(inputCol="Type1", outputCol="Type1_idx")
df = indexer.fit(df).transform(df)
# df.show()
We label-encode Type 1 with StringIndexer, which appends a new index column, Type1_idx, at the far right. Call df.show() to inspect the resulting DataFrame:
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+
| Name|Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+
| Bulbasaur|Grass|Poison| 318| 45| 49| 49| 65| 65| 45| 1| false| 2.0|
| Ivysaur|Grass|Poison| 405| 60| 62| 63| 80| 80| 60| 1| false| 2.0|
| Venusaur|Grass|Poison| 525| 80| 82| 83| 100| 100| 80| 1| false| 2.0|
|VenusaurMega Venu...|Grass|Poison| 625| 80| 100| 123| 122| 120| 80| 1| false| 2.0|
| Charmander| Fire| null| 309| 39| 52| 43| 60| 50| 65| 1| false| 5.0|
| Charmeleon| Fire| null| 405| 58| 64| 58| 80| 65| 80| 1| false| 5.0|
| Charizard| Fire|Flying| 534| 78| 84| 78| 109| 85| 100| 1| false| 5.0|
|CharizardMega Cha...| Fire|Dragon| 634| 78| 130| 111| 130| 85| 100| 1| false| 5.0|
|CharizardMega Cha...| Fire|Flying| 634| 78| 104| 78| 159| 115| 100| 1| false| 5.0|
| Squirtle|Water| null| 314| 44| 48| 65| 50| 64| 43| 1| false| 0.0|
| Wartortle|Water| null| 405| 59| 63| 80| 65| 80| 58| 1| false| 0.0|
| Blastoise|Water| null| 530| 79| 83| 100| 85| 105| 78| 1| false| 0.0|
|BlastoiseMega Bla...|Water| null| 630| 79| 103| 120| 135| 115| 78| 1| false| 0.0|
| Caterpie| Bug| null| 195| 45| 30| 35| 20| 20| 45| 1| false| 3.0|
| Metapod| Bug| null| 205| 50| 20| 55| 25| 25| 30| 1| false| 3.0|
| Butterfree| Bug|Flying| 395| 60| 45| 50| 90| 80| 70| 1| false| 3.0|
| Weedle| Bug|Poison| 195| 40| 35| 30| 20| 20| 50| 1| false| 3.0|
| Kakuna| Bug|Poison| 205| 45| 25| 50| 25| 25| 35| 1| false| 3.0|
| Beedrill| Bug|Poison| 395| 65| 90| 40| 45| 80| 75| 1| false| 3.0|
|BeedrillMega Beed...| Bug|Poison| 495| 65| 150| 40| 15| 80| 145| 1| false| 3.0|
+--------------------+-----+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+
only showing top 20 rows
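The values in Type1_idx are not alphabetical. By default StringIndexer orders labels by descending frequency, so the label mapped to 0.0 (Water in the table above) would be the most frequent one. The following is a minimal pure-Python sketch of that ordering rule, assuming the default stringOrderType="frequencyDesc"; the function name and the toy column are illustrative only and are not part of the original script.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each label to a float index: most frequent label -> 0.0, next -> 1.0, ..."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


# Hypothetical toy column, not the real Pokemon counts.
toy_types = ["Water", "Water", "Water", "Grass", "Grass", "Fire"]
mapping = frequency_desc_index(toy_types)
print(mapping)
```

Because the index depends on label counts, the same label can receive a different index on a different dataset, so the fitted indexer should be reused rather than refitted when new data arrives.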
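3.2 Import a suitable evaluation metric for the label and explain the choice
# Step 2: import a suitable evaluation metric for the label and explain the choice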
# Accuracy, Precision, Recall
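The original lists the metrics without giving the reason. One plausible rationale (an assumption, not stated in the source) is that Type 1 is a multi-class label with uneven class sizes, so accuracy alone can hide poor performance on rare types, while per-class precision and recall expose it. The sketch below computes accuracy and macro-averaged precision and recall in plain Python on hypothetical labels. In PySpark, MulticlassClassificationEvaluator offers related metrics through metricName values such as "accuracy", "weightedPrecision" and "weightedRecall"; note that those are weighted averages, whereas this sketch uses a plain macro average.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # A class that is never predicted (or never present) scores 0.0 here.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical imbalanced labels: a model that always predicts class 0.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
prec, rec = macro_precision_recall(y_true, y_pred)
print("accuracy=%.4f macro_precision=%.4f macro_recall=%.4f" % (acc, prec, rec))
```

In this toy case the accuracy looks acceptable while macro recall is much lower, which is the kind of gap that motivates reporting more than one metric.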
3.3 Train at least three classification methods
First, encode the categorical features with StringIndexer:
# Step 3: train at least three classification methods
# encode categorical features
# in_cols = ["Name", "Type2", "Generation", "Legendary"]
# out_cols = ["Name_idx", "Type2_idx", "Generation_idx", "Legendary_idx"]
in_cols = ["Type2", "Generation", "Legendary"]
out_cols = ["Type2_idx", "Generation_idx", "Legendary_idx"]
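# handleInvalid="skip" drops rows whose Type2 is null (e.g. Charmander is absent from the later output)
indexer = StringIndexer(inputCols=in_cols, outputCols=out_cols, handleInvalid="skip")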
df = indexer.fit(df).transform(df)
The numerical features are min-max normalized through a Pipeline:
# encode numerical features
columns_to_scale = ["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]
assemblers, scalers = list(), list()
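for col_name in columns_to_scale:  # col_name avoids shadowing the imported col()
    # MinMaxScaler expects a vector column, so wrap each scalar column first
    vec = VectorAssembler(inputCols=[col_name], outputCol=col_name + "_vec")
    assemblers.append(vec)
    sc = MinMaxScaler(inputCol=col_name + "_vec", outputCol=col_name + "_scl")
    scalers.append(sc)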
pipeline = Pipeline(stages=assemblers + scalers)
df = pipeline.fit(df).transform(df)
+--------------------+------+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+-------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Name| Type1| Type2|Total| HP|Attack|Defense|SpAtk|SpDef|Speed|Generation|Legendary|Type1_idx|Type2_idx|Generation_idx|Legendary_idx|Total_vec|HP_vec|Attack_vec|Defense_vec|SpAtk_vec|SpDef_vec|Speed_vec| Total_scl| HP_scl| Attack_scl| Defense_scl| SpAtk_scl| SpDef_scl| Speed_scl|
+--------------------+------+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+-------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Bulbasaur| Grass|Poison| 318| 45| 49| 49| 65| 65| 45| 1| false| 2.0| 2.0| 0.0| 0.0| [318.0]|[45.0]| [49.0]| [49.0]| [65.0]| [65.0]| [45.0]|[0.21694915254237...|[0.2953020134228188]|[0.21666666666666...|[0.15813953488372...|[0.3235294117647059]|[0.2142857142857143]|[0.25806451612903...|
| Ivysaur| Grass|Poison| 405| 60| 62| 63| 80| 80| 60| 1| false| 2.0| 2.0| 0.0| 0.0| [405.0]|[60.0]| [62.0]| [63.0]| [80.0]| [80.0]| [60.0]|[0.3644067796610169]|[0.3959731543624161]|[0.2888888888888889]|[0.22325581395348...|[0.4117647058823529]|[0.28571428571428...|[0.3548387096774194]|
| Venusaur| Grass|Poison| 525| 80| 82| 83| 100| 100| 80| 1| false| 2.0| 2.0| 0.0| 0.0| [525.0]|[80.0]| [82.0]| [83.0]| [100.0]| [100.0]| [80.0]|[0.5677966101694915]|[0.5302013422818792]| [0.4]|[0.31627906976744...|[0.5294117647058824]| [0.380952380952381]|[0.4838709677419355]|
|VenusaurMega Venu...| Grass|Poison| 625| 80| 100| 123| 122| 120| 80| 1| false| 2.0| 2.0| 0.0| 0.0| [625.0]|[80.0]| [100.0]| [123.0]| [122.0]| [120.0]| [80.0]|[0.7372881355932204]|[0.5302013422818792]| [0.5]|[0.5023255813953489]|[0.6588235294117647]|[0.4761904761904762]|[0.4838709677419355]|
| Charizard| Fire|Flying| 534| 78| 84| 78| 109| 85| 100| 1| false| 5.0| 0.0| 0.0| 0.0| [534.0]|[78.0]| [84.0]| [78.0]| [109.0]| [85.0]| [100.0]|[0.5830508474576271]|[0.5167785234899329]|[0.41111111111111...|[0.2930232558139535]|[0.5823529411764706]|[0.30952380952380...|[0.6129032258064516]|
|CharizardMega Cha...| Fire|Dragon| 634| 78| 130| 111| 130| 85| 100| 1| false| 5.0| 9.0| 0.0| 0.0| [634.0]|[78.0]| [130.0]| [111.0]| [130.0]| [85.0]| [100.0]| [0.752542372881356]|[0.5167785234899329]|[0.6666666666666667]|[0.44651162790697...|[0.7058823529411764]|[0.30952380952380...|[0.6129032258064516]|
|CharizardMega Cha...| Fire|Flying| 634| 78| 104| 78| 159| 115| 100| 1| false| 5.0| 0.0| 0.0| 0.0| [634.0]|[78.0]| [104.0]| [78.0]| [159.0]| [115.0]| [100.0]| [0.752542372881356]|[0.5167785234899329]|[0.5222222222222223]|[0.2930232558139535]|[0.8764705882352941]|[0.45238095238095...|[0.6129032258064516]|
| Butterfree| Bug|Flying| 395| 60| 45| 50| 90| 80| 70| 1| false| 3.0| 0.0| 0.0| 0.0| [395.0]|[60.0]| [45.0]| [50.0]| [90.0]| [80.0]| [70.0]|[0.34745762711864...|[0.3959731543624161]|[0.19444444444444...|[0.16279069767441...|[0.47058823529411...|[0.28571428571428...|[0.41935483870967...|
| Weedle| Bug|Poison| 195| 40| 35| 30| 20| 20| 50| 1| false| 3.0| 2.0| 0.0| 0.0| [195.0]|[40.0]| [35.0]| [30.0]| [20.0]| [20.0]| [50.0]|[0.00847457627118...|[0.26174496644295...|[0.1388888888888889]|[0.06976744186046...|[0.05882352941176...| [0.0]|[0.2903225806451613]|
| Kakuna| Bug|Poison| 205| 45| 25| 50| 25| 25| 35| 1| false| 3.0| 2.0| 0.0| 0.0| [205.0]|[45.0]| [25.0]| [50.0]| [25.0]| [25.0]| [35.0]|[0.02542372881355...|[0.2953020134228188]|[0.08333333333333...|[0.16279069767441...|[0.08823529411764...|[0.02380952380952...|[0.1935483870967742]|
| Beedrill| Bug|Poison| 395| 65| 90| 40| 45| 80| 75| 1| false| 3.0| 2.0| 0.0| 0.0| [395.0]|[65.0]| [90.0]| [40.0]| [45.0]| [80.0]| [75.0]|[0.34745762711864...|[0.42953020134228...|[0.4444444444444445]|[0.11627906976744...|[0.20588235294117...|[0.28571428571428...|[0.45161290322580...|
|BeedrillMega Beed...| Bug|Poison| 495| 65| 150| 40| 15| 80| 145| 1| false| 3.0| 2.0| 0.0| 0.0| [495.0]|[65.0]| [150.0]| [40.0]| [15.0]| [80.0]| [145.0]|[0.5169491525423728]|[0.42953020134228...|[0.7777777777777778]|[0.11627906976744...|[0.02941176470588...|[0.28571428571428...|[0.9032258064516129]|
| Pidgey|Normal|Flying| 251| 40| 45| 40| 35| 35| 56| 1| false| 1.0| 0.0| 0.0| 0.0| [251.0]|[40.0]| [45.0]| [40.0]| [35.0]| [35.0]| [56.0]|[0.10338983050847...|[0.26174496644295...|[0.19444444444444...|[0.11627906976744...|[0.14705882352941...|[0.07142857142857...|[0.32903225806451...|
| Pidgeotto|Normal|Flying| 349| 63| 60| 55| 50| 50| 71| 1| false| 1.0| 0.0| 0.0| 0.0| [349.0]|[63.0]| [60.0]| [55.0]| [50.0]| [50.0]| [71.0]|[0.2694915254237288]|[0.4161073825503356]|[0.2777777777777778]|[0.18604651162790...|[0.23529411764705...|[0.14285714285714...|[0.4258064516129032]|
| Pidgeot|Normal|Flying| 479| 83| 80| 75| 70| 70| 101| 1| false| 1.0| 0.0| 0.0| 0.0| [479.0]|[83.0]| [80.0]| [75.0]| [70.0]| [70.0]| [101.0]|[0.48983050847457...|[0.5503355704697986]|[0.3888888888888889]|[0.27906976744186...|[0.3529411764705882]|[0.2380952380952381]|[0.6193548387096774]|
| PidgeotMega Pidgeot|Normal|Flying| 579| 83| 80| 80| 135| 80| 121| 1| false| 1.0| 0.0| 0.0| 0.0| [579.0]|[83.0]| [80.0]| [80.0]| [135.0]| [80.0]| [121.0]| [0.659322033898305]|[0.5503355704697986]|[0.3888888888888889]|[0.3023255813953488]|[0.7352941176470588]|[0.28571428571428...|[0.7483870967741936]|
| Spearow|Normal|Flying| 262| 40| 60| 30| 31| 31| 70| 1| false| 1.0| 0.0| 0.0| 0.0| [262.0]|[40.0]| [60.0]| [30.0]| [31.0]| [31.0]| [70.0]|[0.12203389830508...|[0.26174496644295...|[0.2777777777777778]|[0.06976744186046...|[0.12352941176470...|[0.05238095238095...|[0.41935483870967...|
| Fearow|Normal|Flying| 442| 65| 90| 65| 61| 61| 100| 1| false| 1.0| 0.0| 0.0| 0.0| [442.0]|[65.0]| [90.0]| [65.0]| [61.0]| [61.0]| [100.0]|[0.4271186440677966]|[0.42953020134228...|[0.4444444444444445]|[0.23255813953488...| [0.3]|[0.19523809523809...|[0.6129032258064516]|
| Nidoqueen|Poison|Ground| 505| 90| 92| 87| 75| 85| 76| 1| false| 12.0| 1.0| 0.0| 0.0| [505.0]|[90.0]| [92.0]| [87.0]| [75.0]| [85.0]| [76.0]|[0.5338983050847458]|[0.5973154362416108]|[0.45555555555555...|[0.33488372093023...|[0.38235294117647...|[0.30952380952380...|[0.45806451612903...|
| Nidoking|Poison|Ground| 505| 81| 102| 77| 85| 75| 85| 1| false| 12.0| 1.0| 0.0| 0.0| [505.0]|[81.0]| [102.0]| [77.0]| [85.0]| [75.0]| [85.0]|[0.5338983050847458]|[0.5369127516778524]|[0.5111111111111112]|[0.28837209302325...|[0.4411764705882353]|[0.2619047619047619]|[0.5161290322580645]|
+--------------------+------+------+-----+---+------+-------+-----+-----+-----+----------+---------+---------+---------+--------------+-------------+---------+------+----------+-----------+---------+---------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
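The *_scl columns above come from min-max scaling, which maps each value x to (x - min) / (max - min). A minimal pure-Python sketch of that formula follows. The function name is illustrative, and the extremes 190 and 780 in the toy list are assumed values, not read from the dataset; with them, 318 maps to about 0.2169, which is consistent with the Total_scl shown for Bulbasaur.

```python
def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Same convention as Spark's MinMaxScaler for a zero-range feature.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy "Total" values; 190 and 780 are assumed extremes for illustration.
totals = [190, 318, 405, 780]
scaled = min_max_scale(totals)
print(scaled)
```

Since the minimum and maximum are computed from whatever rows are present, rows removed earlier (for example by handleInvalid="skip") change the scaled values.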
Assemble the encoded and scaled features into a single vector column:
# encode all features into vectors
# cols = ["Name_idx", "Type2_idx", "Generation_idx", "Legendary_idx",
# "Total_scl", "HP_scl", "Attack_scl", "Defense_scl", "SpAtk_scl", "SpDef_scl", "Speed_scl"]
cols = ["Type2_idx", "Generation_idx", "Legendary_idx",
"Total_scl", "HP_scl", "Attack_scl", "Defense_scl", "SpAtk_scl", "SpDef_scl", "Speed_scl"]
assembler = VectorAssembler(inputCols=cols, outputCol="feature")
df = assembler.transform(df)
# df.show()
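Split the data 80/20 into a training set and a test set. As the task requires, three classification algorithms are trained (decision tree, random forest, naive Bayes):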
train, test = df.randomSplit(weights=[0.8, 0.2], seed=42)
evaluator = MulticlassClassificationEvaluator(
labelCol="Type1_idx",
predictionCol="prediction",
metricName="accuracy")
models = {
"Decision Tree": DecisionTreeClassifier(labelCol="Type1_idx", featuresCol="feature", predictionCol="prediction"),
"Random Forest": RandomForestClassifier(labelCol="Type1_idx", featuresCol="feature", predictionCol="prediction"),
"Naive Bayes": NaiveBayes(labelCol="Type1_idx", featuresCol="feature", predictionCol="prediction"),
}
for name, cls in models.items():
    # fit on the training set, then score the held-out test set
    predictions = cls.fit(train).transform(test)
    accuracy = evaluator.evaluate(predictions)
    print("Accuracy of %s is %.4f" % (name, accuracy))
4. Clustering Models
4.1 Continue from Task 5: treat Type 1 as the label and label-encode it
# encoding=utf-8
from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.sql.types import DoubleType
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Task 7: SparkML basics: clustering models
spark = SparkSession.builder.appName('pyspark').getOrCreate()
spark.sparkContext.addFile("https://cdn.coggle.club/Pokemon.csv")
path = "file://"+SparkFiles.get("Pokemon.csv")
df = spark.read.csv(path=path, header=True, inferSchema= True)
df = df.withColumnRenamed('Sp. Atk', 'SpAtk')
df = df.withColumnRenamed('Sp. Def', 'SpDef')
df = df.withColumnRenamed('Type 1', 'Type1')
df = df.withColumnRenamed('Type 2', 'Type2')
df = df.withColumn("Legendary", col("Legendary").cast('string'))
# Step 1: continue from Task 5; treat Type 1 as the label and label-encode it
indexer = StringIndexer(inputCol="Type1", outputCol="Type1_idx")
df = indexer.fit(df).transform(df)
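4.2 Cluster the Pokemon with k-means and use the elbow method to choose the number of clusters
# Step 2: cluster the Pokemon with k-means and use the elbow method to choose the number of clusters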
# encode categorical features
in_cols = ["Type2", "Generation", "Legendary"]
out_cols = ["Type2_idx", "Generation_idx", "Legendary_idx"]
indexer = StringIndexer(inputCols=in_cols, outputCols=out_cols, handleInvalid="skip")
df = indexer.fit(df).transform(df)
# encode numerical features
columns_to_scale = ["Total", "HP", "Attack", "Defense", "SpAtk", "SpDef", "Speed"]
assemblers, scalers = list(), list()
for col_name in columns_to_scale:  # col_name avoids shadowing the imported col()
    vec = VectorAssembler(inputCols=[col_name], outputCol=col_name + "_vec")
    assemblers.append(vec)
    sc = MinMaxScaler(inputCol=col_name + "_vec", outputCol=col_name + "_scl")
    scalers.append(sc)
pipeline = Pipeline(stages=assemblers + scalers)
df = pipeline.fit(df).transform(df)
# encode all features into vectors
cols = ["Type2_idx", "Generation_idx", "Legendary_idx",
"Total_scl", "HP_scl", "Attack_scl", "Defense_scl", "SpAtk_scl", "SpDef_scl", "Speed_scl"]
assembler = VectorAssembler(inputCols=cols, outputCol="feature")
df = assembler.transform(df)
# df.show()
train, test = df.randomSplit(weights=[0.8, 0.2], seed=42)
evaluator = MulticlassClassificationEvaluator(
labelCol="Type1_idx",
predictionCol="prediction",
metricName="accuracy")
num_of_type1 = df.select("Type1").distinct().count()
for k in range(2, num_of_type1+1):
    cluster = KMeans(featuresCol="feature", predictionCol="prediction", k=k, seed=42)
    model = cluster.fit(train)
    prediction = model.transform(test)
    # cast the integer cluster id to double so the evaluator accepts it
    prediction = prediction.withColumn("prediction", prediction.prediction.cast(DoubleType()))
    cost = model.summary.trainingCost
    # cluster ids are arbitrary, so accuracy against Type1_idx is only a rough reference
    accuracy = evaluator.evaluate(prediction)
    print("Accuracy of k=%d is %.4f, with cost is %.4f" % (k, accuracy, cost))
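K-means assigns cluster ids in no particular order, so comparing them directly with Type1_idx depends on how the ids happen to line up. One common alternative, swapped in here purely for illustration and not what the script above does, is cluster purity: each cluster is credited with its majority label. The sketch below uses hypothetical labels and cluster ids; the function name is an assumption.

```python
from collections import Counter, defaultdict


def purity(labels, clusters):
    """Share of samples that carry the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for lab, cl in zip(labels, clusters):
        by_cluster[cl].append(lab)
    # For each cluster, count how many members carry its most common label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)


# Hypothetical data: the clustering is good, but the ids are swapped.
labels = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 1, 0, 0, 1]
naive_accuracy = sum(l == c for l, c in zip(labels, clusters)) / len(labels)
score = purity(labels, clusters)
print("naive accuracy=%.4f purity=%.4f" % (naive_accuracy, score))
```

Purity always rises as k grows, so it is a sanity check on how clusters relate to labels rather than a criterion for choosing k.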
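K-means is simple, but the number of clusters has to be chosen in advance. Two common ways to choose it are the elbow method and the silhouette coefficient. The elbow method picks k at the inflection point where the loss stops dropping sharply and levels off, while the silhouette method picks the k that maximizes the silhouette coefficient: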
(1) Elbow method:
$$SSE=\sum_{i=1}^{K} \sum_{p \in C_{i}}\left|p-m_{i}\right|^{2}$$
(where $C_{i}$ is the $i$-th cluster, $p$ is a sample in it, and $m_{i}$ is the centroid of the $i$-th cluster)
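The SSE formula can be checked by hand on a tiny example. The following pure-Python sketch sums the squared distance from each point to the centroid of its assigned cluster; the points, assignments and centroids are hypothetical and fixed by hand, so no clustering is actually run.

```python
def sse(points, assignments, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for p, i in zip(points, assignments):
        m = centroids[i]
        total += sum((pc - mc) ** 2 for pc, mc in zip(p, m))
    return total


# Four hypothetical points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k=1: one centroid at the overall mean.
sse_k1 = sse(points, [0, 0, 0, 0], [(5.0, 1.0)])
# k=2: one centroid per group.
sse_k2 = sse(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])
print(sse_k1, sse_k2)
```

The large drop from k=1 to k=2 is the "elbow": adding further clusters to this data could only reduce the SSE by a small remaining amount.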
(2) Silhouette coefficient:
$$S_{i}=\dfrac{b_{i}-a_{i}}{\max \left(a_{i}, b_{i}\right)}$$
(where $a_{i}$ is the mean distance from sample $i$ to the other points in its own cluster, and $b_{i}$ is the mean distance from sample $i$ to the points of the nearest other cluster)
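As a worked instance of the silhouette definition, the sketch below computes the coefficient of a single sample in pure Python. The data and function name are hypothetical, and returning 0.0 for a sample alone in its cluster is a convention assumed here, not something the original states.

```python
import math


def silhouette(points, labels, i):
    """Silhouette coefficient of sample i: (b - a) / max(a, b)."""
    own = labels[i]
    same = [
        math.dist(points[i], q)
        for j, q in enumerate(points)
        if labels[j] == own and j != i
    ]
    if not same:
        return 0.0  # assumed convention for a single-member cluster
    a = sum(same) / len(same)  # mean distance inside the own cluster
    other_means = []
    for c in set(labels) - {own}:
        dists = [math.dist(points[i], q) for j, q in enumerate(points) if labels[j] == c]
        other_means.append(sum(dists) / len(dists))
    b = min(other_means)  # mean distance to the nearest other cluster
    return (b - a) / max(a, b)


# Hypothetical one-dimensional samples in two well-separated clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
s0 = silhouette(points, labels, 0)
mean_s = sum(silhouette(points, labels, i) for i in range(len(points))) / len(points)
print(s0, mean_s)
```

For sample 0, a = 1 and b = (10 + 11) / 2 = 10.5, so the coefficient is 9.5 / 10.5. Averaging the coefficient over all samples gives the score that is compared across candidate values of k.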
5. Choosing k with the Elbow Method and the Silhouette Coefficient
5.1 Simulate three random clusters
import os
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] # font needed to render Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
num = 100
np.random.seed(0)
# Cluster 1
mu1 =np.array([1,1])
sigma1=np.array([[0.5,0],[0,0.5]])
R1 = np.linalg.cholesky(sigma1)
s1 = np.dot(np.random.randn(num, 2), R1) + mu1
plt.plot(s1[:,0],s1[:,1],'y.')
# Cluster 2
mu2 =np.array([6,0])
# Note: this matrix is not symmetric; np.linalg.cholesky reads only the lower triangle
sigma2=np.array([[0.1,0.1],[0,0.5]])
R2 = np.linalg.cholesky(sigma2)
s2 = np.dot(np.random.randn(num, 2), R2) + mu2
plt.plot(s2[:,0],s2[:,1],'*r')
# Cluster 3
mu3 = np.array([-2,-2])
sigma3 = np.array([[0.6,0],[0,1]])
R3 = np.linalg.cholesky(sigma3)
s3 = np.dot(np.random.randn(num,2),R3)+mu3
plt.plot(s3[:,0],s3[:,1],'b+')
plt.show()
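5.2 Choosing k with the elbow method
# Apply the elbow method to choose k for KMeans
from scipy.spatial.distance import cdist
# s stacks the three simulated clusters into one (300, 2) array
s = np.vstack((s1, s2, s3))
K=range(1,10)
sse_result=[]
for k in K:
    kmeans=KMeans(n_clusters=k)
    kmeans.fit(s)
    # mean distance from each point to its nearest centroid (average distortion)
    sse_result.append(sum(np.min(cdist(s,kmeans.cluster_centers_,'euclidean'),axis=1))/s.shape[0])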
plt.plot(K,sse_result,'gx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Elbow method for choosing the best k')
plt.show()
Note: s here is the combination of the three simulated datasets above.