I. Scala and Spark
Scala is a multi-paradigm programming language similar to Java. It was designed to be a scalable language, integrating features of both object-oriented and functional programming.
Spark is a fast, general-purpose, scalable in-memory engine for big data analytics and computation.
II. Environment Setup
1. Configuring the Scala environment
First add SCALA_HOME as a system environment variable, then reference the bin directory under SCALA_HOME in PATH.
2. The Scala plugin in IDEA
When developing with IDEA, the Scala plugin can set up the Scala environment for us very quickly.
3. Adding pom.xml dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
</dependencies>
III. Writing Code to Process the Game Data
1. Compute each month's highest peak player count for Counter-Strike
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

def main(args: Array[String]): Unit = {
  val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
  val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
  // Read the CSV; the file is GBK-encoded and has a header row
  val read: DataFrame = spark.read
    .format("csv")
    .option("encoding", "GBK")
    .option("sep", ",")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("datas/AllSteamData.csv")
  read.createOrReplaceTempView("steam")
  // Keep only the Counter-Strike rows, drop the "Last 30 Days" rows,
  // and extract the month number (everything before the '月' character)
  val frame: DataFrame = spark.sql(
    """select *, substring_index(Month, '月', 1) as mon
      |from steam
      |where Month <> 'Last 30 Days'
      |  and Name = 'Counter-Strike'""".stripMargin)
  frame.createOrReplaceTempView("steam1")
  // Highest peak per month; cast to bigint so max compares
  // numerically rather than lexicographically
  spark.sql(
    """select mon, max(cast(PeakPlayers as bigint)) as max_Players
      |from steam1
      |group by mon
      |order by max_Players desc""".stripMargin).show()
  spark.close()
}
Final result
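The query above relies on substring_index(Month, '月', 1), which returns everything before the first occurrence of '月' (the Chinese character for "month"). The same parsing can be sketched in plain Scala; the sample labels below are hypothetical, not values from the real dataset:

```scala
object MonthParse {
  // Equivalent of SQL substring_index(s, delim, 1):
  // everything before the first occurrence of the delimiter,
  // or the whole string if the delimiter is absent.
  def substringIndex(s: String, delim: String): String = {
    val i = s.indexOf(delim)
    if (i >= 0) s.substring(0, i) else s
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical month labels in the dataset's format
    val samples = Seq("7月 2021", "12月 2020", "Last 30 Days")
    samples.foreach(s => println(substringIndex(s, "月")))
    // prints: 7, 12, Last 30 Days
  }
}
```

Note that a row like "Last 30 Days" passes through unchanged, which is why the SQL filters those rows out before grouping.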
2. Find the top ten games by peak player count in the month with the most games whose peak exceeded 10,000
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

def main(args: Array[String]): Unit = {
  val conf: SparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
  val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
  val read: DataFrame = spark.read
    .format("csv")
    .option("encoding", "GBK")
    .option("sep", ",")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("datas/AllSteamData.csv")
  read.createOrReplaceTempView("steam")
  // Rows with more than 10,000 peak players, with the month number extracted
  val frame: DataFrame = spark.sql(
    """select Name, PeakPlayers, substring_index(Month, '月', 1) as mon
      |from steam
      |where Month <> 'Last 30 Days'
      |  and PeakPlayers > 10000""".stripMargin)
  frame.createOrReplaceTempView("steam1")
  // Count how many qualifying rows each month has
  val frame1: DataFrame = spark.sql(
    """select mon, count(mon) as ct
      |from steam1
      |group by mon""".stripMargin)
  frame1.createOrReplaceTempView("steam2")
  // Within the month with the most qualifying rows, take the top 10
  // games by their highest peak (assumes a single month holds the max count)
  spark.sql(
    """select Name, mon, max(cast(PeakPlayers as bigint)) as max_players
      |from steam1
      |where mon = (select mon from steam2 where ct = (select max(ct) from steam2))
      |group by Name, mon
      |order by max_players desc limit 10""".stripMargin).show()
  spark.close()
}
Final result
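The nested subqueries implement a two-step computation: first find the month in which the most games crossed the 10,000 threshold, then rank that month's games by their maximum peak. With plain Scala collections the same logic can be sketched as follows (the row shape and sample data are hypothetical, not the real dataset):

```scala
object TopGamesSketch {
  // Hypothetical row shape: (game name, month number, peak players)
  case class Row(name: String, mon: String, peak: Long)

  // Step 1: find the month with the most rows above the threshold.
  // Step 2: within that month, rank games by their maximum peak.
  def topGames(rows: Seq[Row], threshold: Long, n: Int): Seq[(String, Long)] = {
    val qualifying = rows.filter(_.peak > threshold)
    val busiestMonth = qualifying.groupBy(_.mon).maxBy(_._2.size)._1
    qualifying
      .filter(_.mon == busiestMonth)
      .groupBy(_.name)
      .map { case (name, rs) => (name, rs.map(_.peak).max) }
      .toSeq
      .sortBy(-_._2)
      .take(n)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row("Game A", "7", 50000L), Row("Game B", "7", 20000L),
      Row("Game A", "8", 15000L), Row("Game C", "7", 12000L)
    )
    topGames(rows, 10000L, 10).foreach(println)
    // prints (Game A,50000), (Game B,20000), (Game C,12000)
  }
}
```

As with the SQL version, this sketch assumes a single month holds the maximum count; a tie would need an explicit tie-breaking rule.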
Summary
The dataset I found contained some dirty data, and handling it involved quite a few pitfalls that took a long time to work through, such as converting data types. Overall, though, the project went without too much difficulty.
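On the type-conversion pitfalls mentioned above: numeric columns read from a CSV may arrive as strings (possibly with thousands separators or stray whitespace), and a blind cast silently turns unparseable values into nulls. One defensive approach is to parse into Option so bad values are explicit. A minimal plain-Scala sketch, with hypothetical dirty values rather than rows from the real dataset:

```scala
object CleanNumbers {
  // Parse a possibly dirty numeric string ("12,345", " 678 ", "N/A")
  // into Option[Long]; None marks values that cannot be recovered.
  def parsePeak(raw: String): Option[Long] = {
    val cleaned = raw.trim.replace(",", "")
    if (cleaned.nonEmpty && cleaned.forall(_.isDigit)) Some(cleaned.toLong)
    else None
  }

  def main(args: Array[String]): Unit = {
    val dirty = Seq("12,345", " 678 ", "N/A", "")
    println(dirty.map(parsePeak))
    // prints List(Some(12345), Some(678), None, None)
  }
}
```

In a Spark job the same idea can be applied as a UDF, or by filtering out rows where the cast produced null before aggregating.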