Chapter Quiz 8

11.11.1

Last modified on 2024-07-17 20:05:31

Category: Spark Big Data Processing Technology. Tags: spark

First published on 2024-07-17 20:04:08

Article link: https://blog.csdn.net/m0_55885128/article/details/140504193

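Level 1: Question 1

Task description

Task for this level: complete the task according to the programming requirements.

Programming requirements

Open the code file window on the right and add code between Begin and End to complete the task.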

from pyspark.ml.feature import PCA
from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import Vectors
################ Begin ################
# Create the SparkSession
spark = SparkSession.builder.appName("PCA").getOrCreate()
# Read the training set, keep only well-formed 15-field rows,
# and build a dense vector from the six numeric columns
df = spark.sparkContext.textFile("/data/bigfiles/adult.data") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Read the test set with the same filtering and conversion
test = spark.sparkContext.textFile("/data/bigfiles/adult.test") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Fit the PCA model (3 principal components) on the training set
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures").fit(df)
# Apply the PCA model to the test set
testdata = pca.transform(test)
# Show the first 20 rows, ordered by the original feature vector
testdata.orderBy("features").show(20, truncate=False)
# Stop the SparkSession
spark.stop()
################ End ################
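The parsing and filtering chain above can be checked without a Spark cluster. The following is a minimal plain-Python sketch of what each `map`/`filter` step does to one line. The helper name `parse_adult_line` and the sample line are illustrative assumptions, not part of the quiz code; the column indices and the two filter conditions are copied from the answer.

```python
from typing import List, Optional, Tuple

# Indices of the six numeric columns used as features in the answer code
NUMERIC_COLUMNS = [0, 2, 4, 10, 11, 12]


def parse_adult_line(line: str) -> Optional[Tuple[List[float], str]]:
    """Hypothetical helper mirroring the RDD chain for a single line.

    Returns (features, label), or None when the line would be filtered out.
    """
    fields = line.split(",")
    # First filter: keep only rows with exactly 15 comma-separated fields
    if len(fields) != 15:
        return None
    # Second filter: as in the answer, only even-indexed fields are checked
    if any(field == "" for field in fields[::2]):
        return None
    # float() tolerates the leading space left over after splitting on ","
    features = [float(fields[i]) for i in NUMERIC_COLUMNS]
    # The label is kept unstripped, exactly as p[14] is in the answer code
    return features, fields[14]


# Illustrative sample row in the comma-plus-space layout the code expects
sample = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
          "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")

print(parse_adult_line(sample))
print(parse_adult_line(""))  # an empty line has one field, so it is dropped
```

Note that the label keeps its leading space because the line is split on `","` only; calling `.strip()` on it would be a reasonable cleanup if the labels are compared elsewhere.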

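Level 2: Question 2

Task description

Task for this level: complete the task according to the programming requirements.

Programming requirements

Open the code file window on the right and add code between Begin and End to complete the task.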

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import PCA, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession, Row
################ Begin ################
# Create the SparkSession
spark = SparkSession.builder.appName("LogisticRegression").getOrCreate()
# Read the training set, keep only well-formed 15-field rows,
# and build a dense vector from the six numeric columns
df = spark.sparkContext.textFile("/data/bigfiles/adult.data") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Read the test set with the same filtering and conversion
test = spark.sparkContext.textFile("/data/bigfiles/adult.test") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()
# Fit the PCA model (3 principal components) on the training set
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures").fit(df)
# Apply the PCA model to the test set
result = pca.transform(test)
# Index the labels and features.
# Note: featureIndexer reads the raw "features" column, so the classifier
# below does not consume "pcaFeatures".
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(result)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(result)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
# Define the logistic regression model
lr = LogisticRegression(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=100)
# Combine the stages into a pipeline
lrPipeline = Pipeline(stages=[labelIndexer, featureIndexer, lr, labelConverter])
# Train the model.
# Note: the pipeline is fit on the PCA-transformed test set, not on df,
# so the score below is measured on the same rows it was trained on.
lrPipelineModel = lrPipeline.fit(result)
# Run the fitted pipeline on the test set
lrPredictions = lrPipelineModel.transform(test)
# Build the evaluator. With no metricName it reports weighted F1;
# pass metricName="accuracy" to get plain accuracy instead.
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction")
# Compute and print the score
lrAccuracy = evaluator.evaluate(lrPredictions)
print("Accuracy:%.3f"%lrAccuracy)
# Release resources
spark.stop()
################ End ################
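The answer prints its score under the label "Accuracy", but `MulticlassClassificationEvaluator` defaults to `metricName="f1"`, a support-weighted F1 score. The plain-Python sketch below illustrates how the two metrics can differ on the same predictions. The function names and the toy labels are assumptions made for illustration, and the weighted F1 here follows the usual definition (per-class F1 weighted by true-label frequency) rather than calling Spark itself.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Toy, imbalanced example (made-up labels): one minority-class row is missed
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

print("accuracy    = %.3f" % accuracy(y_true, y_pred))
print("weighted F1 = %.3f" % weighted_f1(y_true, y_pred))
```

If plain accuracy is what the exercise wants, constructing the evaluator with `metricName="accuracy"` is the direct way to get it in PySpark.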

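Level 3: Question 3

Task description

Task for this level: complete the task according to the programming requirements.

Programming requirements

Open the code file window on the right and add code between Begin and End to complete the task.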

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import PCA, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession, Row
################ Begin ################
# Create the SparkSession
spark = SparkSession.builder.appName("LogisticRegression").getOrCreate()
# Read the training set, keep only well-formed 15-field rows,
# and build a dense vector from the six numeric columns
df = spark.sparkContext.textFile("/data/bigfiles/adult.data") \
    .map(lambda line: line.split(",")) \
    .filter(lambda p: len(p) == 15) \
    .filter(lambda p: all(field != "" for field in p[::2])) \
    .map(lambda p: Row(features=Vectors.dense([float(p[i]) for i in [0, 2, 4, 10, 11, 12]]), label=p[14])) \
    .toDF()

# Define the PCA estimator. It is left unfitted so that the pipeline refits it
# for every value of k in the grid; a pre-fitted PCAModel would keep k=3
# regardless of the grid.
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
# Index the labels and features
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(df)
featureIndexer = VectorIndexer(inputCol="pcaFeatures", outputCol="indexedFeatures")
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
# Define the logistic regression model
lr = LogisticRegression(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=100)
# Combine the stages into a pipeline
lrPipeline = Pipeline(stages=[pca, labelIndexer, featureIndexer, lr, labelConverter])
# Define the parameter grid
paramGrid = (ParamGridBuilder()
             .addGrid(pca.k, [1, 2, 3, 4, 5, 6])
             .addGrid(lr.elasticNetParam, [0.2, 0.8])
             .addGrid(lr.regParam, [0.01, 0.1, 0.5])
             .build())
# Define the CrossValidator (3 folds; the evaluator defaults to weighted F1)
cv = CrossValidator(estimator=lrPipeline,
                    evaluator=MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction"),
                    estimatorParamMaps=paramGrid,
                    numFolds=3)
# Run cross-validation on the training data
cvModel = cv.fit(df)
# Take the PCA stage of the best pipeline
bestModel = cvModel.bestModel
pcaModel = bestModel.stages[0]
# Print the principal components matrix of the best model;
# its number of columns is the selected k
print("Primary Component:" + str(pcaModel.pc))
# Release resources
spark.stop()
################ End ################
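To get a feel for the cost of this search: `CrossValidator` fits the pipeline once per parameter combination per fold, then refits the best combination on the full dataset. The sketch below enumerates the same grid in plain Python and counts the fits. The variable names are illustrative; the value lists and the fold count are copied from the answer code.

```python
from itertools import product

# Grid values copied from the ParamGridBuilder calls in the answer
k_values = [1, 2, 3, 4, 5, 6]
elastic_net_values = [0.2, 0.8]
reg_param_values = [0.01, 0.1, 0.5]
num_folds = 3

# Cartesian product of the three value lists, one dict per combination
grid = [
    {"k": k, "elasticNetParam": e, "regParam": r}
    for k, e, r in product(k_values, elastic_net_values, reg_param_values)
]

cv_fits = len(grid) * num_folds  # one pipeline fit per combination per fold
total_fits = cv_fits + 1         # plus the final refit on the full dataset

print("combinations:", len(grid))
print("cross-validation fits:", cv_fits)
print("total pipeline fits:", total_fits)
```

Trimming any one of the value lists shrinks the run time roughly in proportion, which is worth keeping in mind on a shared exercise platform.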