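Chapter Quiz 8

Level 1: Question 1

Task description

Task for this level: complete the task according to the programming requirements.

Programming requirements

Open the code file window on the right and add code in the Begin/End region to complete the task.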

from pyspark.ml.feature import PCA
from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import Vectors
################ Begin ################
# Create the SparkSession
spark = SparkSession.builder.appName("PCA").getOrCreate()
# Read the training set, drop malformed rows, and build the feature vectors
df = spark.sparkContext.textFile("/data/bigfiles/adult.data") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Read the test set and apply the same filtering and conversion
test = spark.sparkContext.textFile("/data/bigfiles/adult.test") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Fit the PCA model (3 principal components) on the training set
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures").fit(df)
# Apply the fitted PCA model to the test set
testdata = pca.transform(test)
# Show the first 20 rows, ordered by the features column
testdata.orderBy("features").show(20, truncate=False)
# Stop the SparkSession
spark.stop()
################ End ################
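
To see what the two filters and the Row mapping do to a single record, the parsing step can be sketched in plain Python without Spark. The sample row is an assumption: it follows the layout of the UCI Adult dataset, which the file names and the 15-field check suggest but the task does not confirm. `parse_line` is a hypothetical helper, not part of the exercise.

```python
# Indices of the fields that the exercise converts to floats.
NUMERIC_INDICES = [0, 2, 4, 10, 11, 12]

def parse_line(line):
    """Mirror the RDD steps for one line: split, filter, convert.

    Returns (features, label), or None when a filter would drop the line.
    """
    p = line.split(",")
    if len(p) != 15:                              # first filter
        return None
    if not all(field != "" for field in p[::2]):  # second filter
        return None
    features = [float(p[i]) for i in NUMERIC_INDICES]
    return features, p[14]

# Assumed sample row in the UCI Adult layout (comma + space separated).
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")

parsed = parse_line(sample)
header = parse_line("|1x3 Cross validator")  # non-data line: wrong field count
blank = parse_line("")                       # empty line: a single empty field
```

Because the split is on "," alone, every field after the first keeps its leading space, so the label comes out as " <=50K" rather than "<=50K". `float` tolerates the padding, which is why the numeric conversion still succeeds.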

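Level 2: Question 2

Task description

Task for this level: complete the task according to the programming requirements.

Programming requirements

Open the code file window on the right and add code in the Begin/End region to complete the task.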

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import PCA, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession, Row
################ Begin ################
# Create the SparkSession
spark = SparkSession.builder.appName("LogisticRegression").getOrCreate()
# Read the training set, drop malformed rows, and build the feature vectors
df = spark.sparkContext.textFile("/data/bigfiles/adult.data") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Read the test set and apply the same filtering and conversion
test = spark.sparkContext.textFile("/data/bigfiles/adult.test") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Fit the PCA model on the training set
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures").fit(df)
# Apply the PCA model to the test set
result = pca.transform(test)
# Index the labels and features.
# Note: VectorIndexer reads the raw "features" column, so the classifier
# below does not consume "pcaFeatures".
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(result)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(result)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
# Define the logistic regression model
lr = LogisticRegression(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=100)
# Combine the stages into a pipeline
lrPipeline = Pipeline(stages=[labelIndexer, featureIndexer, lr, labelConverter])
# Train the model.
# Note: the pipeline is fitted on the PCA-transformed test set, so the score
# printed below is not a held-out estimate.
lrPipelineModel = lrPipeline.fit(result)
# Run the fitted pipeline on the test set
lrPredictions = lrPipelineModel.transform(test)
# Build the evaluator; metricName defaults to "f1", so request accuracy explicitly
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
# Compute and print the accuracy
lrAccuracy = evaluator.evaluate(lrPredictions)
print("Accuracy:%.3f"%lrAccuracy)
# Release resources
spark.stop()
################ End ################
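
`MulticlassClassificationEvaluator` reports weighted F1 when `metricName` is left at its default, and that number is generally not the same as accuracy. The plain-Python sketch below illustrates the difference on four invented labels; it is not Spark's implementation, and `accuracy` and `weighted_f1` are hypothetical helpers written only for this comparison.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * count / len(y_true)
    return total

# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

acc = accuracy(y_true, y_pred)    # 3 of 4 predictions are correct
f1 = weighted_f1(y_true, y_pred)  # 0.75 * 0.8 + 0.25 * (2/3)
```

With these labels the accuracy is 0.75 while the weighted F1 is about 0.767, so a value printed under the heading "Accuracy" is only an accuracy if the evaluator was asked for one.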

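Level 3: Question 3

Task description

Task for this level: complete the task according to the programming requirements.

Programming requirements

Open the code file window on the right and add code in the Begin/End region to complete the task.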

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import PCA, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession, Row
################ Begin ################
# Create the SparkSession
spark = SparkSession.builder.appName("LogisticRegression").getOrCreate()
# Read the training set, drop malformed rows, and build the feature vectors
df = spark.sparkContext.textFile("/data/bigfiles/adult.data") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Define the PCA estimator; leave it unfitted so that CrossValidator can vary k
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
# Index the labels and features
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(df)
featureIndexer = VectorIndexer(inputCol="pcaFeatures", outputCol="indexedFeatures")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
# Define the logistic regression model
lr = LogisticRegression(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=100)
# Combine the stages into a pipeline
lrPipeline = Pipeline(stages=[pca, labelIndexer, featureIndexer, lr, labelConverter])
# Define the parameter grid
paramGrid = (ParamGridBuilder()
             .addGrid(pca.k, [1, 2, 3, 4, 5, 6])
             .addGrid(lr.elasticNetParam, [0.2, 0.8])
             .addGrid(lr.regParam, [0.01, 0.1, 0.5])
             .build())
# Define the CrossValidator
cv = CrossValidator(estimator=lrPipeline,
                    evaluator=MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction"),
                    estimatorParamMaps=paramGrid,
                    numFolds=3)
# Run cross-validation on the training data
cvModel = cv.fit(df)
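# Retrieve the PCA stage from the best pipeline
bestModel = cvModel.bestModel
pcaModel = bestModel.stages[0]
# Print the principal-component matrix; its number of columns is the number of components k
print("Primary Component:" + str(pcaModel.pc))
# Release resources
spark.stop()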
################ End ################
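
The cost of the grid search follows directly from the grid sizes. As a rough sketch in plain Python (no Spark involved), the parameter grid is the Cartesian product of the three value lists, and `CrossValidator` fits the pipeline once per combination per fold before refitting the best combination on the full training set. The dictionary keys are illustrative names, not Spark objects.

```python
from itertools import product

# Candidate values copied from the ParamGridBuilder calls in the listing.
k_values = [1, 2, 3, 4, 5, 6]
elastic_net_values = [0.2, 0.8]
reg_values = [0.01, 0.1, 0.5]
num_folds = 3

# ParamGridBuilder.build() yields one param map per element of the Cartesian product.
grid = [
    {"k": k, "elasticNetParam": e, "regParam": r}
    for k, e, r in product(k_values, elastic_net_values, reg_values)
]

cv_fits = len(grid) * num_folds  # one pipeline fit per combination per fold
total_fits = cv_fits + 1         # plus the final refit of the best combination
```

Here the grid has 6 × 2 × 3 = 36 combinations, so three folds mean 108 pipeline fits plus one final refit. Trimming any one of the lists shrinks the run time proportionally.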
