Spark Education Project Practice Code

# -*- coding: utf-8 -*-
# Program function: read the data and analyze hot/recommended questions
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Import data types
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# The Spark code can run against a local PySpark environment or the PySpark
# environment inside the VM; either can be configured via os.environ
os.environ['SPARK_HOME'] = '/export/server/spark-2.3.0-bin-hadoop2.7'
PYSPARK_PYTHON = "/root/anaconda3/envs/pyspark_env/bin/python"
# When multiple Python versions exist, not specifying one explicitly is likely
# to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON


if __name__ == '__main__':
    # 1 - create the SparkSession
    spark = SparkSession.builder \
        .appName('test') \
        .getOrCreate()
    sc = spark.sparkContext
    # 2 - load the data
    csvDF = spark.read.format("csv")\
        .option("header", True)\
        .option("sep", "\t")\
        .option("inferSchema", "true")\
        .load("file:///tmp/pycharm_project_553/EduAnalysis/data/eduxxx.csv")

    csvDF.printSchema()
    csvDF.show(2)
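
    # The StructType imports above are otherwise unused. A minimal sketch of
    # how an explicit schema could replace inferSchema; only the three columns
    # the queries below actually reference are listed, so the full file layout
    # remains an assumption:
    # edu_schema = StructType([
    #     StructField("question_id", StringType(), True),
    #     StructField("subject_id", StringType(), True),
    #     StructField("recommendations", StringType(), True),
    # ])
    # csvDF = spark.read.format("csv") \
    #     .option("header", True) \
    #     .option("sep", "\t") \
    #     .schema(edu_schema) \
    #     .load("file:///tmp/pycharm_project_553/EduAnalysis/data/eduxxx.csv")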

    """
    要求: 找到Top50热点题对应的科目,然后统计这些科目中,分别包含这几道热点题的条目数
    热点题
    题号 热度(数量) 学科
    1   100  数学
    2   99   数学
    3   98   语文

    最终结果:
    学科  热点题数量
    数学  2
    语文 1
    """
    allInfoDS = csvDF
    allInfoDS.createOrReplaceTempView("t_answer")
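
    # Optional tuning step (not in the original): q1 and q2 below each scan the
    # DataFrame more than once, so caching it avoids re-reading the CSV:
    # allInfoDS.cache()

    # Q1: find the subjects of the Top 50 hot questions, then count how many
    # hot questions each subject contains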
    def q1():
        spark.sql("""
            select subject_id, count(t_answer.question_id) as hot_question_count
            from
            (select question_id, count(*) as frequency
            from t_answer
            group by question_id
            order by frequency desc limit 50) t1
            join t_answer on t1.question_id = t_answer.question_id
            group by subject_id
            order by hot_question_count desc
        """).show()
        # DSL equivalent of the SQL above
        top50DS = allInfoDS.groupBy("question_id").count().orderBy("count", ascending=False).limit(50)
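        # join back to the full data to attach subject_id, then count the hot
        # questions per subject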
        top50DS.join(allInfoDS, "question_id").groupBy("subject_id").count().orderBy("count", ascending=False).show()
    # q1()

    """
    +----------+------------------+
    |subject_id|hot_question_count|
    +----------+------------------+
    | 科目ID_1_数学|               311|
    | 科目ID_2_语文|               276|
    | 科目ID_3_英语|               267|
    +----------+------------------+
    """

    # Q2: for each subject, count the distinct questions that appear in the
    # recommendation lists of the Top 20 hot questions
    def q2():
        spark.sql("""
            select t4.subject_id, count(*) as frequency
            from
            (select distinct(t3.question_id), t_answer.subject_id
            from
            (select explode(split(t2.recommendations, ',')) as question_id
            from
            (select recommendations
            from
            (select question_id, count(*) as frequency
            from t_answer
            group by question_id
            order by frequency desc limit 20) t1
            join t_answer
            on t1.question_id=t_answer.question_id) t2) t3
            join t_answer
            on t3.question_id=t_answer.question_id) t4
            group by t4.subject_id
            order by frequency desc
        """).show()

        """
        +----------+---------+
        |subject_id|frequency|
        +----------+---------+
        | 科目ID_3_英语|      262|
        | 科目ID_2_语文|      240|
        | 科目ID_1_数学|      239|
        +----------+---------+
        """
        # DSL equivalent of the SQL above
        top20DS = allInfoDS.groupBy("question_id").count().orderBy("count", ascending=False).limit(20)
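        # rows belonging to the top-20 hot questions, carrying their
        # comma-separated recommendations column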
        recommendListDF = top20DS.join(allInfoDS, "question_id")
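        # flatten the recommendations lists into one recommended question_id per row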
        questionIdDF = recommendListDF.select(F.explode(F.split("recommendations", ",")).alias("question_id"))
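        # deduplicate the recommended ids, then look up each one's subject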
        questionIdAndSubjectIdDF = questionIdDF.distinct().\
            join(allInfoDS.dropDuplicates(["question_id"]), on="question_id").select("question_id", "subject_id")
        questionIdAndSubjectIdDF.groupBy("subject_id").count().orderBy("count", ascending=False).show()
    # q2()
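
    # A small self-contained sketch (hypothetical data, not from the project)
    # of the explode(split(...)) step used in q2: each comma-separated
    # recommendations string becomes one row per recommended question_id.
    # demo = spark.createDataFrame(
    #     [("q1", "q2,q3"), ("q4", "q5,q6,q7")],
    #     ["question_id", "recommendations"])
    # demo.select(F.explode(F.split("recommendations", ",")).alias("question_id")).show()
    # # yields five rows: q2, q3, q5, q6, q7

    # Stop the session once the analyses are done (not in the original, but
    # good practice):
    spark.stop()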

