勒布朗生涯数据可视化(一)用Spark SQL从NBA数据集中提取相关数据

7 篇文章 0 订阅
6 篇文章 1 订阅

简介

配置

  • 语言:Scala 2.11

  • Spark版本:Spark 2.3.1

主要内容

  • DataFrame中添加新列
  • 列内容的精确度调整
  • 将列的内容转化为List形式
  • 用PrinterWriter模仿csv格式储存数据

编程实践

格式化获取表格信息

首先我们在数据集网站上获取每一列所代表的信息,我们看到如下区域:
表格信息
我们根据它来设计我们的seasonStats类,并用getStatsSchema函数获取通过这一类获得的StructType,用于格式化读取DataFrame:

  case class seasonStats (ID: Int, Year: Int, Player: String, Pos: String, Age: Int, Team: String,
                          Games: Double, GameStarted: Double, MinutesPlayed: Double, PlayerEfficiencyRating: Double,
                          TrueShootingPercentage: Double, TreePointAttemptRate: Double, FreeThrowRate: Double,
                          OffensiveReboundPercentage: Double, DefensiveReboundPercentage: Double, TotalReboundPercentage: Double,
                          AssistPercentage: Double, StealPercentage: Double, BlockPercentagee: Double, TurnoverPercentage: Double,
                          UsagePercentage: Double, blanl: String,
                          OffensiveWinShares: Double, DefensiveWinShares: Double, WinShares: Double,
                          WinSharesPer48Minutes: Double, blank2: String,
                          OffensiveBoxPlusOrMinus: String, DefensiveBoxPlusOrMinus: String, BoxPlusOrMinus: Double,
                          ValueOverReplacement: Double,
                          FieldGoals: Double, FieldGoalAttempts: Double, FieldGoalPercentage: Double,
                          ThreePointGoals: Double, ThreePointAttempts: Double, ThreePointPercentage: Double,
                          TwoPointGoals: Double, TwoPointAttempts: Double, TwoPointPercentage: Double,
                          EffectiveFieldGoalPercentage: Double,
                          FreeThrows: Double, FreeThrowAttempts: Double, FreeThrowPercentage: Double,
                          OffensiveRebounds: Double, DefensiveRebounds: Double, TotalRebounds: Double,
                          Assists: Double, Steals: Double, Blocks: Double, Turnovers: Double,
                          PersonalFouls: Double, Points: Double)
   def getStatsSchema: StructType = Encoders.product[seasonStats].schema

我们建立一个SparkSession并通过我们之前获取的StructType来格式化读取Season_Stats.csv文件的信息:

	  import org.apache.spark.sql.SparkSession
      val spark = SparkSession
        .builder()
        .master("local[*]")
        .appName("NBADataBase")`在这里插入代码片`
        .getOrCreate()

      import spark.implicits._

      val season_stats = spark.read.format("csv")
        .option("sep", ",")
        .option("header", "true")
        .schema(getStatsSchema)
        .load("./resources/Seasons_Stats.csv")
        .persist()

得到数据集如下:

+---+----+---------------+---+---+----+-----+-----------+-------------+----------------------+----------------------+--------------------+-------------+--------------------------+--------------------------+----------------------+----------------+---------------+----------------+------------------+---------------+-----+------------------+------------------+---------+---------------------+------+-----------------------+-----------------------+--------------+--------------------+----------+-----------------+-------------------+---------------+------------------+--------------------+-------------+----------------+------------------+----------------------------+----------+-----------------+-------------------+-----------------+-----------------+-------------+-------+------+------+---------+-------------+------+
| ID|Year|         Player|Pos|Age|Team|Games|GameStarted|MinutesPlayed|PlayerEfficiencyRating|TrueShootingPercentage|TreePointAttemptRate|FreeThrowRate|OffensiveReboundPercentage|DefensiveReboundPercentage|TotalReboundPercentage|AssistPercentage|StealPercentage|BlockPercentagee|TurnoverPercentage|UsagePercentage|blanl|OffensiveWinShares|DefensiveWinShares|WinShares|WinSharesPer48Minutes|blank2|OffensiveBoxPlusOrMinus|DefensiveBoxPlusOrMinus|BoxPlusOrMinus|ValueOverReplacement|FieldGoals|FieldGoalAttempts|FieldGoalPercentage|ThreePointGoals|ThreePointAttempts|ThreePointPercentage|TwoPointGoals|TwoPointAttempts|TwoPointPercentage|EffectiveFieldGoalPercentage|FreeThrows|FreeThrowAttempts|FreeThrowPercentage|OffensiveRebounds|DefensiveRebounds|TotalRebounds|Assists|Steals|Blocks|Turnovers|PersonalFouls|Points|
+---+----+---------------+---+---+----+-----+-----------+-------------+----------------------+----------------------+--------------------+-------------+--------------------------+--------------------------+----------------------+----------------+---------------+----------------+------------------+---------------+-----+------------------+------------------+---------+---------------------+------+-----------------------+-----------------------+--------------+--------------------+----------+-----------------+-------------------+---------------+------------------+--------------------+-------------+----------------+------------------+----------------------------+----------+-----------------+-------------------+-----------------+-----------------+-------------+-------+------+------+---------+-------------+------+
|  0|1950|Curly Armstrong|G-F| 31| FTW| 63.0|       null|         null|                  null|                 0.368|                null|        0.467|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|              -0.1|               3.6|      3.5|                 null|  null|                   null|                   null|          null|                null|     144.0|            516.0|              0.279|           null|              null|                null|        144.0|           516.0|             0.279|                       0.279|     170.0|            241.0|              0.705|             null|             null|         null|  176.0|  null|  null|     null|        217.0| 458.0|
|  1|1950|   Cliff Barker| SG| 29| INO| 49.0|       null|         null|                  null|                 0.435|                null|        0.387|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|               1.6|               0.6|      2.2|                 null|  null|                   null|                   null|          null|                null|     102.0|            274.0|              0.372|           null|              null|                null|        102.0|           274.0|             0.372|                       0.372|      75.0|            106.0|              0.708|             null|             null|         null|  109.0|  null|  null|     null|         99.0| 279.0|
|  2|1950|  Leo Barnhorst| SF| 25| CHS| 67.0|       null|         null|                  null|                 0.394|                null|        0.259|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|               0.9|               2.8|      3.6|                 null|  null|                   null|                   null|          null|                null|     174.0|            499.0|              0.349|           null|              null|                null|        174.0|           499.0|             0.349|                       0.349|      90.0|            129.0|              0.698|             null|             null|         null|  140.0|  null|  null|     null|        192.0| 438.0|
|  3|1950|     Ed Bartels|  F| 24| TOT| 15.0|       null|         null|                  null|                 0.312|                null|        0.395|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|              -0.5|              -0.1|     -0.6|                 null|  null|                   null|                   null|          null|                null|      22.0|             86.0|              0.256|           null|              null|                null|         22.0|            86.0|             0.256|                       0.256|      19.0|             34.0|              0.559|             null|             null|         null|   20.0|  null|  null|     null|         29.0|  63.0|
|  4|1950|     Ed Bartels|  F| 24| DNN| 13.0|       null|         null|                  null|                 0.308|                null|        0.378|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|              -0.5|              -0.1|     -0.6|                 null|  null|                   null|                   null|          null|                null|      21.0|             82.0|              0.256|           null|              null|                null|         21.0|            82.0|             0.256|                       0.256|      17.0|             31.0|              0.548|             null|             null|         null|   20.0|  null|  null|     null|         27.0|  59.0|
|  5|1950|     Ed Bartels|  F| 24| NYK|  2.0|       null|         null|                  null|                 0.376|                null|         0.75|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|               0.0|               0.0|      0.0|                 null|  null|                   null|                   null|          null|                null|       1.0|              4.0|               0.25|           null|              null|                null|          1.0|             4.0|              0.25|                        0.25|       2.0|              3.0|              0.667|             null|             null|         null|    0.0|  null|  null|     null|          2.0|   4.0|
|  6|1950|    Ralph Beard|  G| 22| INO| 60.0|       null|         null|                  null|                 0.422|                null|        0.301|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|               3.6|               1.2|      4.8|                 null|  null|                   null|                   null|          null|                null|     340.0|            936.0|              0.363|           null|              null|                null|        340.0|           936.0|             0.363|                       0.363|     215.0|            282.0|              0.762|             null|             null|         null|  233.0|  null|  null|     null|        132.0| 895.0|
|  7|1950|     Gene Berce|G-F| 23| TRI|  3.0|       null|         null|                  null|                 0.275|                null|        0.313|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|              -0.1|               0.0|     -0.1|                 null|  null|                   null|                   null|          null|                null|       5.0|             16.0|              0.313|           null|              null|                null|          5.0|            16.0|             0.313|                       0.313|       0.0|              5.0|                0.0|             null|             null|         null|    2.0|  null|  null|     null|          6.0|  10.0|
|  8|1950|  Charlie Black|F-C| 28| TOT| 65.0|       null|         null|                  null|                 0.346|                null|        0.395|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|              -2.2|               5.0|      2.8|                 null|  null|                   null|                   null|          null|                null|     226.0|            813.0|              0.278|           null|              null|                null|        226.0|           813.0|             0.278|                       0.278|     209.0|            321.0|              0.651|             null|             null|         null|  163.0|  null|  null|     null|        273.0| 661.0|
|  9|1950|  Charlie Black|F-C| 28| FTW| 36.0|       null|         null|                  null|                 0.362|                null|         0.48|                      null|                      null|                  null|            null|           null|            null|              null|           null| null|              -0.7|               2.2|      1.5|                 null|  null|                   null|                   null|          null|                null|     125.0|            435.0|              0.287|           null|              null|                null|        125.0|           435.0|             0.287|                       0.287|     132.0|            209.0|              0.632|             null|             null|         null|   75.0|  null|  null|     null|        140.0| 382.0|
+---+----+---------------+---+---+----+-----+-----------+-------------+----------------------+----------------------+--------------------+-------------+--------------------------+--------------------------+----------------------+----------------+---------------+----------------+------------------+---------------+-----+------------------+------------------+---------+---------------------+------+-----------------------+-----------------------+--------------+--------------------+----------+-----------------+-------------------+---------------+------------------+--------------------+-------------+----------------+------------------+----------------------------+----------+-----------------+-------------------+-----------------+-----------------+-------------+-------+------+------+---------+-------------+------+

通过withColumn函数添加新列

通过分析每一列的信息可知,这一数据集只提供了赛季总得分、总助攻、总篮板等数据,而球迷最关心的一半是平均数据,因此我们需要根据总数除以上场数的值来确定平均数据,而withColumn实现了这一功能,我们先在官方文档中找到它:
Spark 2.3.2 ScalaDoc - Dataset
其中定义withColumn如下:
withColumn
可知这一函数可以添加新列或替换一个同名列,第一个参数是添加列的名字,后一个参数可以规定列的数据,在此我们实现如下代码:

import org.apache.spark.sql.functions.bround
val lebron_stats = season_stats
        .withColumn("AveragePoints", col = bround($"Points" / $"Games", 3))
        .withColumn("AverageRebounds", col = bround($"TotalRebounds" / $"Games", 3))
        .withColumn("AverageAssists", col = bround($"Assists" / $"Games", 3))
        .withColumn("AverageSteals", col = bround($"Steals" / $"Games", 3))
        .withColumn("AverageBlocks", col = bround($"Blocks" / $"Games", 3))
        .filter($"Player" === "LeBron James").orderBy("Year")
        .select("Year", "Player", "Age", "Games", "Team",
          "AveragePoints", "AverageRebounds", "AverageAssists", "AverageSteals", "AverageBlocks",
          "FieldGoalPercentage", "ThreePointPercentage")

这样我们获得了如下的列:

+----+------------+---+-----+----+-------------+---------------+--------------+-------------+-------------+-------------------+--------------------+
|Year|      Player|Age|Games|Team|AveragePoints|AverageRebounds|AverageAssists|AverageSteals|AverageBlocks|FieldGoalPercentage|ThreePointPercentage|
+----+------------+---+-----+----+-------------+---------------+--------------+-------------+-------------+-------------------+--------------------+
|2004|LeBron James| 19| 79.0| CLE|       20.937|          5.468|         5.886|        1.646|        0.734|              0.417|                0.29|
|2005|LeBron James| 20| 80.0| CLE|       27.188|           7.35|         7.212|        2.212|         0.65|              0.472|               0.351|
|2006|LeBron James| 21| 79.0| CLE|       31.367|          7.038|         6.595|        1.557|        0.835|               0.48|               0.335|
|2007|LeBron James| 22| 78.0| CLE|       27.333|          6.744|         6.026|        1.603|        0.705|              0.476|               0.319|
|2008|LeBron James| 23| 75.0| CLE|         30.0|          7.893|         7.187|         1.84|         1.08|              0.484|               0.315|
|2009|LeBron James| 24| 81.0| CLE|       28.444|          7.568|         7.247|        1.691|        1.148|              0.489|               0.344|
|2010|LeBron James| 25| 76.0| CLE|       29.711|          7.289|         8.566|        1.645|        1.013|              0.503|               0.333|
|2011|LeBron James| 26| 79.0| MIA|       26.722|          7.468|         7.013|         1.57|        0.633|               0.51|                0.33|
|2012|LeBron James| 27| 62.0| MIA|       27.145|          7.935|         6.242|        1.855|        0.806|              0.531|               0.362|
|2013|LeBron James| 28| 76.0| MIA|       26.789|          8.026|          7.25|        1.697|        0.882|              0.565|               0.406|
+----+------------+---+-----+----+-------------+---------------+--------------+-------------+-------------+-------------------+--------------------+

可见我们已经成功提取出我们需要的数据集,下面我们将列信息其转化为List类,其中collect函数返回一个Array:

      val avePointsList = lebron_stats.select("AveragePoints").collect()
        .map(_(0)).toList
      val aveReboundsList = lebron_stats.select("AverageRebounds").collect()
        .map(_(0)).toList
      val aveAssistsList = lebron_stats.select("AverageAssists").collect()
        .map(_(0)).toList
      val aveStealsList = lebron_stats.select("AverageSteals").collect()
        .map(_(0)).toList
      val aveBlocksList = lebron_stats.select("AverageBlocks").collect()
        .map(_(0)).toList
      val gamesList = lebron_stats.select("Games").collect()
        .map(_(0)).toList
      val fieldPercent = lebron_stats.select("FieldGoalPercentage").collect()
        .map(_(0)).toList
      val threePercent = lebron_stats.select("ThreePointPercentage").collect()
        .map(_(0)).toList
      val yearList = lebron_stats.select("Year").collect()
        .map(_(0)).toList

得到如下格式的数据:

List(20.937, 27.188, 31.367, 27.333, 30.0, 28.444, 29.711, 26.722, 27.145, 26.789, 27.13, 25.261, 25.263, 26.405)
List(5.468, 7.35, 7.038, 6.744, 7.893, 7.568, 7.289, 7.468, 7.935, 8.026, 6.922, 6.029, 7.434, 8.649)
List(5.886, 7.212, 6.595, 6.026, 7.187, 7.247, 8.566, 7.013, 6.242, 7.25, 6.338, 7.406, 6.763, 8.73)

然后我们用PrintWriter将其写入文件中:

	  import java.io._
      val writer = new PrintWriter(new File("./data/lebron_data.csv" ))
      writer.write(yearList.mkString(",") + "\r\n")
      writer.write(avePointsList.mkString(",") + "\r\n")
      writer.write(aveReboundsList.mkString(",") + "\r\n")
      writer.write(aveAssistsList.mkString(",") + "\r\n")
      writer.write(aveStealsList.mkString(",") + "\r\n")
      writer.write(aveBlocksList.mkString(",") + "\r\n")
      writer.write(gamesList.mkString(",") + "\r\n")
      writer.write(fieldPercent.mkString(",") + "\r\n")
      writer.write(threePercent.mkString(",") + "\r\n")
      writer.close()

最终我们得到了内容如下的一个lebron_data.csv文件:

2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
20.937,27.188,31.367,27.333,30.0,28.444,29.711,26.722,27.145,26.789,27.13,25.261,25.263,26.405
5.468,7.35,7.038,6.744,7.893,7.568,7.289,7.468,7.935,8.026,6.922,6.029,7.434,8.649
5.886,7.212,6.595,6.026,7.187,7.247,8.566,7.013,6.242,7.25,6.338,7.406,6.763,8.73
1.646,2.212,1.557,1.603,1.84,1.691,1.645,1.57,1.855,1.697,1.571,1.58,1.368,1.243
0.734,0.65,0.835,0.705,1.08,1.148,1.013,0.633,0.806,0.882,0.338,0.71,0.645,0.595
79.0,80.0,79.0,78.0,75.0,81.0,76.0,79.0,62.0,76.0,77.0,69.0,76.0,74.0
0.417,0.472,0.48,0.476,0.484,0.489,0.503,0.51,0.531,0.565,0.567,0.488,0.52,0.548
0.29,0.351,0.335,0.319,0.315,0.344,0.333,0.33,0.362,0.406,0.379,0.354,0.309,0.363

下面我们用转而使用python,用Matplotlib库进行可视化

参考资料

  • 1
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值