1. Hadoop Installation
1. Modify the Hadoop configuration files
hadoop-env.sh
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
core-site.xml
<!-- Address and port of the HDFS NameNode (fs.defaultFS is the preferred key name in Hadoop 2.x) -->
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop000:8020</value>
</property>
hdfs-site.xml
<!-- Local directory where the HDFS NameNode persists its metadata on disk -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/tmp/dfs/name</value>
</property>
<!-- Local directory where the HDFS DataNode stores block data on disk -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/tmp/dfs/data</value>
</property>
<!-- Replication factor for files stored on HDFS -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- Disable HDFS permission checking -->
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
yarn-site.xml
<!-- Use mapreduce_shuffle as the NodeManager auxiliary service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
mapred-site.xml
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Configure system environment variables
vi /etc/profile
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1
export PATH=$HADOOP_HOME/bin:$PATH
source /etc/profile
2. Startup
Starting HDFS
Before starting HDFS for the first time, HDFS must be formatted.
Remember: format only once; once you re-format, the data on HDFS is gone.
Format command: hdfs namenode -format
Start HDFS
Option 1: start/stop the processes one by one
hadoop-daemon.sh start/stop namenode
hadoop-daemon.sh start/stop datanode
Verify with jps:
NameNode
DataNode
If a process is missing, look at the log file named after the missing process (the .log file, not the .out file).
Option 2: start HDFS with a single command
start-dfs.sh
stop-dfs.sh
Start YARN
sbin/start-yarn.sh
2. Hive Deployment
System environment variables
vi /etc/profile
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.15.1
export PATH=$HIVE_HOME/bin:$PATH
source /etc/profile
MySQL needs to be installed; you can install it directly with yum.
Copy the MySQL driver mysql-connector-java-5.1.25.jar into the hive/lib directory.
Note: the MySQL driver must be a 5.x version.
Create $HIVE_HOME/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://10.139.12.149:3306/pk?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Test2016@</value>
</property>
<!-- Verification of the Hive metastore schema version -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<!-- Metastore event DB notification API authorization -->
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<!-- Default Hive warehouse directory on HDFS -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>
Start Hive
Using Hive
hive> show databases;
hive> show tables;
hive> create table test1(id String); # column types use Java-like basic data types
hive> create table test(id int);
hive> insert into test values(1);
hive> select * from test;
3. Spark Run Modes
Focus on the local and YARN modes; Standalone only needs a basic understanding.
Whatever the run mode, the code does not change; you only set the run mode
with --master xxx when submitting through the spark-submit script.
Local mode
Run spark-shell in local mode
./spark-shell --master local
Package our application and run it in local mode
./spark-submit \
--class com.imooc.bigdata.chapter02.SparkWordCountAppV2 \
--master local \
/home/hadoop/lib/sparksql-train-1.0.jar \
file:///home/hadoop/data/wc.data file:///home/hadoop/data/out
With local mode you only need to unpack the Spark distribution; nothing else has to change and it is ready to use.
YARN mode
Code
import org.apache.spark.{SparkConf, SparkContext}
object SparkWordCountAppV2 {
def main(args: Array[String]): Unit = {
//Do not hard-code the run mode in the code; specify it when submitting, e.g. ./spark-submit --master yarn for YARN mode
// val sparkConf = new SparkConf().setMaster("local").setAppName("SparkWordCountApp")
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
// Spark feature: provides 80+ high-level operators
val rdd = sc.textFile(args(0))
rdd.flatMap(_.split(",")).map(word => (word, 1))
.reduceByKey(_+_).map(x => (x._2, x._1)).sortByKey(false)
.map(x=> (x._2, x._1))
.saveAsTextFile(args(1))
sc.stop()
}
}
Submit the Spark application to run on YARN
./spark-submit \
--class com.imooc.bigdata.chapter02.SparkWordCountAppV2 \
--master yarn \
--name SparkWordCountAppV2 \
/home/hadoop/lib/sparksql-train-1.0.jar \
hdfs://hadoop000:8020/pk/wc.data hdfs://hadoop000:8020/pk/out
To run Spark applications on YARN, HADOOP_CONF_DIR or YARN_CONF_DIR must be configured,
pointing to $HADOOP_HOME/etc/hadoop.
Configure system environment variables
vi /etc/profile
# add
export HADOOP_CONF_DIR=/root/ysw/hadoop/hadoop-2.7.2/etc/hadoop
source /etc/profile
Standalone mode
Standalone: a basic understanding is enough
With multiple machines, Spark must be deployed on every machine.
Related configuration:
$SPARK_HOME/conf/slaves
hadoop000
$SPARK_HOME/conf/spark-env.sh
SPARK_MASTER_HOST=hadoop000
Start the Spark cluster
$SPARK_HOME/sbin/start-all.sh
jps: Master Worker
./spark-submit \
--class com.imooc.bigdata.chapter02.SparkWordCountAppV2 \
--master spark://hadoop000:7077 \
--name SparkWordCountAppV2 \
/home/hadoop/lib/sparksql-train-1.0.jar \
hdfs://hadoop000:8020/pk/wc.data hdfs://hadoop000:8020/pk/out2
Spark SQL
Big data components that support SQL
Use SQL statements to run statistics and analysis over big data; the data lives on Hadoop.
**Apache Hive**
Translates SQL into a series of MapReduce/Tez/Spark jobs that can run on Hadoop.
Which distributed engine the SQL ultimately runs on can be chosen with a single parameter (hive.execution.engine).
Features:
SQL: command line and code
Multi-language access via Apache Thrift drivers
Custom UDFs: implement the standard interface, package it, and load it into Hive
Metadata (metastore)
**Cloudera Impala**
Uses its own set of execution daemons; normally these are installed on the same nodes as the Hadoop DataNodes.
Features:
SQL-92 support
Hive support
Command line and code
Shares metadata with Hive
Performance-wise it is faster than Hive, since it is memory-based.
**Spark SQL**
A submodule of Spark; is it limited to processing with SQL only?
Hive: SQL ==> MapReduce
Spark: can SQL be run directly on the Spark engine?
Shark: SQL ==> Spark (now abandoned)
Pros: fast, and compatible with Hive
Cons: execution plan optimization depended entirely on Hive; process-based vs thread-based execution
Usage: required maintaining a separately patched branch of the Hive source code
==>
1) Spark SQL
The SQL support inside Spark itself
2) Hive on Spark
Inside Hive: just switch Hive's execution engine; Spark support was added as an execution engine underneath.
**Presto**
An interactive SQL query engine
Features:
Shares metadata
SQL-92 syntax
Provides a series of connectors: Hive, Cassandra, ...
**Drill**
HDFS, Hive, Spark SQL
Supports many backend stores and processes data from those backends directly.
**Phoenix**
HBase data normally has to be queried through its API;
Phoenix lets you query the data in HBase with SQL.
Key point: query speed still largely depends on the rowkey design.
What is Spark SQL
Spark SQL is Apache Spark's module for working with structured data (structured data: think of tables in a database).
Misconception 1: Spark SQL is just a SQL processing framework.
1) Integrated: seamlessly mix complex SQL into Spark programs
2) Uniform data access: access many different data sources in a similar way and operate on them
spark.read.format("json").load(path)
spark.read.format("text").load(path)
spark.read.format("parquet").load(path)
spark.read.format("json").option("...","...").load(path)
3) Hive compatibility
allowing you to access existing Hive warehouses
If you want to migrate Hive jobs to Spark SQL, this makes the migration cost much lower.
4) Standard connectivity: standard JDBC/ODBC connections (thrift server)
Spark SQL applications are not limited to SQL;
they also support reading and operating on Hive, JSON and Parquet files directly.
SQL is just one of Spark SQL's features.
Advantages of Spark SQL
The convenience brought by SQL
Spark Core: RDD, Scala/Java
You must be familiar with Java or Scala, otherwise you cannot write the code; the entry barrier and learning cost are relatively high.
Spark SQL
Catalyst automatically does a lot of optimization work for us.
SQL (just understand the business logic and express it in SQL)
DF/DS: API-oriented programming, using some Java/Scala
Spark SQL Architecture
spark-shell
Each Spark application (spark-shell) started in a different directory gets its own metastore_db under that directory;
they are independent of each other.
If you want spark-shell sessions to share metadata, you must point them at a shared metastore ==> covered later when integrating Spark SQL with Hive.
spark.sql(<sql statement>)
Using spark-sql
In spark-shell, anything SQL-related has to go through spark.sql(<sql statement>).
spark-shell calls spark-submit under the hood,
and spark-submit in turn calls spark-class.
With spark-sql you can type SQL statements directly, for example:
select a.key*(3+5), b.value from t a join t b on a.key = b.key and a.key > 3;
Spark SQL optimizes this SQL for us, e.g. pushing the filter condition (a.key > 3) down before the join; see the sketch below for how to inspect the optimized plan.
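One way to see this is to print the query plan with explain; a minimal sketch, assuming a SparkSession named spark and an already registered temporary view t(key, value) (both are assumptions here):
// explain(true) prints the parsed, analyzed and optimized logical plans plus the physical plan;
// in the optimized plan the (a.key > 3) filter sits below the join.
spark.sql("select a.key*(3+5), b.value from t a join t b on a.key = b.key and a.key > 3").explain(true)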
spark-sql execution flow
spark-sql also goes through spark-submit under the hood,
because spark-sql is itself a Spark application, just like spark-shell.
To launch any Spark application you have to submit it with the spark-submit script.
The class spark-sql invokes is org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver;
the class spark-shell invokes is org.apache.spark.repl.Main.
4. Spark SQL API Programming
SparkSession
To use SQL you first need to create a SparkSession; SparkSession is the entry point for DF/DS programming.
In Spark 1.x the entry points for Spark SQL programming were SQLContext and HiveContext; since Spark 2.0 they are unified into SparkSession.
Spark 2.x: SparkSession
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* Getting to know SparkSession
*/
object SparkSessionApp {
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
// create the SparkSession, the entry point for DF/DS programming
val spark: SparkSession = SparkSession
.builder()
.master("local")
.appName("Spark SQL basic example")
.getOrCreate()
// file-reading API
val df: DataFrame = spark.read.text("E:\\apps\\data\\12.txt")
// TODO... business logic goes here, implemented via the API provided by DF/DS
df.printSchema()
df.show() // show it: there is only one column, a string column named value
spark.stop()
}
}
root
|-- value: string (nullable = true)
+-----------------+
| value|
+-----------------+
|spark hello world|
|spark hello world|
|spark hello world|
+-----------------+
Spark 1.x: SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
object SQLContextApp {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setAppName("SQLContextApp").setMaster("local")
val sc: SparkContext = new SparkContext(sparkConf) // the SparkConf must be passed in here
val sqlContext: SQLContext = new SQLContext(sc)
val df: DataFrame = sqlContext.read.text("E:\\apps\\data\\12.txt")
df.show()
sc.stop()
}
}
Spark DataFrame
A distributed dataset is a data collection split up and distributed across different machines; RDD, DataFrame and Dataset are all distributed datasets.
A DataFrame is a distributed dataset organized into columns (column name, column type, column value); in effect, a DataFrame can be viewed as a table in a database. And once you have a table, you can query it with SQL.
Queries via the DataFrame API
// TODO... the DF has two columns; keep only the name column ==> select name from people
people.select("name").show()
people.select($"name").show()
// TODO... select * from people where age > 21
people.filter($"age" > 21).show()
people.filter("age > 21").show()
// TODO... select age, count(1) from people group by age
people.groupBy("age").count().show()
// TODO... select name,age+10 from people
people.select($"name", ($"age"+10).as("new_age")).show()
Queries via SQL
// TODO... operate using SQL
people.createOrReplaceTempView("people") // register the DataFrame as a temporary view
spark.sql("select name from people where age > 21").show()
Taking the first N rows of a DataFrame
Data
{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" }
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA" }
{ "_id" : "01005", "city" : "BARRE", "loc" : [ -72.10835400000001, 42.409698 ], "pop" : 4546, "state" : "MA" }
{ "_id" : "01007", "city" : "BELCHERTOWN", "loc" : [ -72.41095300000001, 42.275103 ], "pop" : 10579, "state" : "MA" }
{ "_id" : "01008", "city" : "BLANDFORD", "loc" : [ -72.936114, 42.182949 ], "pop" : 1240, "state" : "MA" }
{ "_id" : "01010", "city" : "BRIMFIELD", "loc" : [ -72.188455, 42.116543 ], "pop" : 3706, "state" : "MA" }
{ "_id" : "01011", "city" : "CHESTER", "loc" : [ -72.988761, 42.279421 ], "pop" : 1688, "state" : "MA" }
{ "_id" : "01012", "city" : "CHESTERFIELD", "loc" : [ -72.833309, 42.38167 ], "pop" : 177, "state" : "MA" }
{ "_id" : "01013", "city" : "CHICOPEE", "loc" : [ -72.607962, 42.162046 ], "pop" : 23396, "state" : "MA" }
Show columns without truncating them
Set how many rows to show
zips.show(10, false)
zips.head(3).foreach(println)
zips.first() // internally calls head(1)
zips.take(5) // internally calls head(5)
Count the number of rows
val count: Long = zips.count()
println(s"Total Counts: $count")
Other queries
// filter rows with pop > 40000 and rename the _id column
zips.filter(zips.col("pop") > 40000).withColumnRenamed("_id","new_id").show(10,false)
//import org.apache.spark.sql.functions._
// top 10 California cities by population (name and ID); desc is a built-in function
zips.select("_id","city","pop","state").filter(zips.col("state") === "CA").orderBy(desc("pop")).show(10,false)
Query with SQL
// top 10 California cities by population (name and ID)
// to use SQL, first register the DataFrame as a temporary view
zips.createOrReplaceTempView("zips")
spark.sql("select _id,city,pop,state from zips where state='CA' order by pop desc limit 10").show()
Dataset
Example
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
object DatasetApp {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local").appName("DatasetApp").getOrCreate()
import spark.implicits._
//val ds: Dataset[Person] = Seq(Person("PK","30")).toDS()
//ds.show()
// val primitiveDS: Dataset[Int] = Seq(1,2,3).toDS()
//primitiveDS.map(x => x+1).collect().foreach(println)
val peopleDF: DataFrame = spark.read.json("./data/people.json")
// converting between DataFrame and Dataset
val peopleDS: Dataset[Person] = peopleDF.as[Person]
peopleDS.show(false)
// a misspelled column name only fails at runtime
//peopleDF.select("anme").show()
peopleDS.map(x => x.name).show() // a misspelled field name here would fail at compile time
spark.stop()
}
case class Person(name: String, age: String)
}
DataFrame vs Dataset
// DataFrame is untyped: a misspelled column name (anme) only fails at runtime
peopleDF.select("anme").show()
// Dataset is strongly typed: a misspelled field name is flagged in the IDE and fails at compile time
peopleDS.map(x => x.name).show()
Converting between RDD and DataFrame/Dataset
In day-to-day development we mostly use Spark SQL for log processing (roughly 90% of the time).
You need to process log data from a directory or a given file, and the data is plain text.
Reading it directly with spark.read.text(path) gives you a single string column named value.
There are two main ways to convert an RDD to a DataFrame/Dataset:
1) uses reflection to infer the schema of an RDD that contains specific types of objects
2) creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD
For scenarios with few fields, I tend to use the first approach;
for scenarios with many fields, I tend to use the second, which is more flexible to customize.
The typical three steps when converting between RDD and DataFrame/Dataset are:
1. Read the raw data as an RDD
2. Convert it to a DataFrame/Dataset
3. Create a temporary view and do the work in SQL
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
object InteroperatingRDDApp {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local").appName("DatasetApp").getOrCreate()
// Approach 1: reflection: read the raw data as an RDD and map each line to the corresponding object
runInferSchema(spark)
// Approach 2: programmatically specifying the schema
runProgrammaticSchema(spark)
spark.stop()
}
/**
* Approach 2: programmatically specifying the schema
*/
def runProgrammaticSchema(spark:SparkSession): Unit = {
import spark.implicits._
// step1
val peopleRDD: RDD[String] = spark.sparkContext.textFile("./data/people.txt")
val peopleRowRDD: RDD[Row] = peopleRDD.map(_.split(",")) // RDD
.map(x => Row(x(0), x(1).trim.toInt))
// step2
val struct =
StructType(
StructField("name", StringType, true) ::
StructField("age", IntegerType, false) ::Nil)
// step3
val peopleDF: DataFrame = spark.createDataFrame(peopleRowRDD, struct)
peopleDF.show()
}
/**
* Approach 1: reflection: read the raw data as an RDD and use RDD.map to turn each line into the corresponding object
* 1) define a case class
* 2) RDD map: convert each line of data into the case class
*/
def runInferSchema(spark: SparkSession): Unit = {
import spark.implicits._
// read the raw data as an RDD
val peopleRDD: RDD[String] = spark.sparkContext.textFile("./data/people.txt")
//TODO... RDD => DF
// use RDD.map to convert each line into the corresponding object
val peopleDF: DataFrame = peopleRDD.map(_.split(",")) //RDD
.map(x => People(x(0), x(1).trim.toInt)) //RDD
.toDF() // convert the RDD into a DataFrame
//peopleDF.show(false)
// query with SQL
peopleDF.createOrReplaceTempView("people")
val queryDF: DataFrame = spark.sql("select name,age from people where age between 19 and 29")
//queryDF.show()
//queryDF.map(x => "Name:" + x(0)).show() // from index
queryDF.map(x => "Name:" + x.getAs[String]("name")).show // from field
}
case class People(name:String, age:Int)
}
5. Spark SQL Data Sources
Spark can process data from many different data sources, and these sources can live in different places:
file/HDFS/S3/OSS/COS/RDBMS
json/ORC/Parquet/JDBC
Spark loads the data from these different sources into DF/DS; all subsequent operations are based on the DF/DS.
Text data source read/write example
Data
./data/people.txt
Michael, 29
Andy, 30
Justin, 19
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
val textDF: DataFrame = spark.read.text("./data/people.txt")
// textDF.show()
val result: Dataset[(String)] = textDF.map(x => {
// split each line on ","
val splits: Array[String] = x.getString(0).split(",")
// return value
(splits(0).trim) //, splits(1).trim
})
// SaveMode.Append
result.write.mode("overwrite").text("./data/out") // how could we output multiple columns with the text format?
// Recall: with MapReduce output on Hadoop, the first run is fine, the second fails because the output directory already exists
spark.stop()
}
The text data source supports only a single string-typed column: neither multiple columns nor int-typed columns.
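Two common workarounds, as a minimal sketch (assuming the spark session, import spark.implicits._, and the textDF from the example above; the paths are illustrative):
// Hypothetical two-column result; the built-in text sink would reject it as-is.
val twoCols: Dataset[(String, String)] = textDF.map(x => {
  val splits = x.getString(0).split(",")
  (splits(0).trim, splits(1).trim)
})
// Option 1: concatenate the columns into one string column, which the text sink accepts
twoCols.map { case (name, age) => s"$name,$age" }.write.mode("overwrite").text("./data/out_text")
// Option 2: use the csv sink, which supports multiple (and typed) columns
twoCols.write.mode("overwrite").option("sep", ",").csv("./data/out_csv")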
SaveMode: data save modes
// SaveMode.Append
result.write.mode("overwrite").text("./data/out")
In practice Append is rarely used for this kind of output, because re-running the job would duplicate the data.
@InterfaceStability.Stable
public enum SaveMode {
/**
* Append mode means that when saving a DataFrame to a data source, if data/table already exists,
* contents of the DataFrame are expected to be appended to existing data.
*
* @since 1.3.0
*/
Append,
/**
* Overwrite mode means that when saving a DataFrame to a data source,
* if data/table already exists, existing data is expected to be overwritten by the contents of
* the DataFrame.
*
* @since 1.3.0
*/
Overwrite,
/**
* ErrorIfExists mode means that when saving a DataFrame to a data source, if data already exists,
* an exception is expected to be thrown.
*
* @since 1.3.0
*/
ErrorIfExists,
/**
* Ignore mode means that when saving a DataFrame to a data source, if data already exists,
* the save operation is expected to not save the contents of the DataFrame and to not
* change the existing data.
*
* @since 1.3.0
*/
Ignore
}
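The enum values map directly onto the string form used above; for example, the same overwrite of the result Dataset from the text example, expressed with the enum:
import org.apache.spark.sql.SaveMode
result.write.mode(SaveMode.Overwrite).text("./data/out")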
JSON data source example
Data
./data/people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
./data/people2.json
{"name":"PK","age":30,"info":{"work":"beijing","home":"shenzhen"}}
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
// simple JSON
val jsonDF: DataFrame = spark.read.json("./data/people.json")
//jsonDF.show()
// TODO... keep only rows with age > 20
//jsonDF.filter("age > 20").select("name").write.mode(SaveMode.Overwrite).json("out")
// nested JSON
val jsonDF2: DataFrame = spark.read.json("./data/people2.json")
jsonDF2.select($"name",$"age",$"info.work".as("work"), $"info.home".as("home"))
.write.mode("overwrite").json("out")
}
Saved data
{"name":"PK","age":30,"work":"beijing","home":"shenzhen"}
Parquet data source
Parquet is Spark SQL's default data source format.
// when reading, if no data source format is specified, Parquet is used by default
val parquetDF: DataFrame = spark.read.parquet("./data/users.parquet")
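Because spark.sql.sources.default defaults to parquet, the generic load() call reads the same file without naming a format explicitly; a small sketch:
val sameDF: DataFrame = spark.read.load("./data/users.parquet")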
// standard API style
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
val parquetDF: DataFrame = spark.read.parquet("./data/users.parquet")
// parquetDF.printSchema()
// parquetDF.show()
parquetDF.select("name","favorite_numbers")
.write.mode("overwrite")
.option("compression","none")
.parquet("out")
// spark.read.parquet("./out").show()
}
Standard Data Source API style
// standard API style
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
// in front of the source code, there are no secrets
// val textDF: DataFrame = spark.read.format("text").load("./data/people.txt")
val jsonDF: DataFrame = spark.read.format("json").load("./data/people.json")
//
// textDF.show()
// println("~~~~~~~~")
// jsonDF.show()
jsonDF.write.format("json").mode("overwrite").save("out")
}
Data source format conversion
// storage format conversion: JSON ==> Parquet
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
val jsonDF: DataFrame = spark.read.format("json").load("./data/people.json")
// jsonDF.show()
jsonDF.filter("age>20")
.write.format("parquet").mode(SaveMode.Overwrite).save("out")
spark.read.parquet("./out").show()
}
JDBC data source example
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
/**
* Some data lives in MySQL; to process it with Spark, Spark must first read that data out of MySQL.
* When the source is text/json, after processing with Spark we want to write the statistics back into MySQL.
*/
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
val jdbcDF = spark.read
.format("jdbc")
.option("url","jdbc:mysql://kafka2:3306")
.option("dbtable","sqltest.user1")
.option("user","root")
.option("password","root")
.load()
jdbcDF.show();
jdbcDF.filter($"cnt" > 100).show(100)
println("==============")
jdbcDF.filter($"age" > 30).write
.format("jdbc")
.option("url","jdbc:mysql://kafka3:3306")
.option("dbtable","sqltest.user3")
.option("user","root")
.option("password","root")
.save()
// Saving data to a JDBC source
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "sqltest.user3")
.option("user", "root")
.option("password", "root")
.save()
// Alternative: the read/write jdbc() shortcut methods
val url = "jdbc:mysql://kafka2:3306"
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")
val jdbcDF2: DataFrame = spark.read
.jdbc(url, "sqltest.user1", connectionProperties)
jdbcDF2.filter($"cnt" > 100)
.write.jdbc(url, "sqltest.user2", connectionProperties)
}
Managing the parameters used in the project through a unified configuration
<dependency>
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.3</version>
</dependency>
application.conf
db.default.driver="com.mysql.jdbc.Driver"
db.default.url="jdbc:mysql://hadoop000:3306"
db.default.user=root
db.default.password=root
db.default.database=spark
db.default.table=browser_stat
db.default.sink.table=browser_stat_3
kudu.masters=hadoop000
kudu.table.ods=ods
json.hdfs.path="hdfs://hadoop000:8020/project"
In production, parameters are mainly passed in through configuration files like this.
import com.typesafe.config.{Config, ConfigFactory}
object ParamsApp {
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.7.1")
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
val config: Config = ConfigFactory.load()
val url = config.getString("db.default.url")
val user = config.getString("db.default.user")
val password = config.getString("db.default.password")
val driver = config.getString("db.default.driver")
val database = config.getString("db.default.database")
val table = config.getString("db.default.table")
val sinkTable = config.getString("db.default.sink.table")
println(url)
val connectionProperties = new Properties()
connectionProperties.put("user", user)
connectionProperties.put("password", password)
val jdbcDF: DataFrame = spark.read.jdbc(url, s"$database.$table", connectionProperties)
jdbcDF.filter($"cnt" > 100).show() //.write.jdbc(url, s"$database.$sinkTable", connectionProperties)
}
}
Integrating Spark with Hive
How Spark SQL connects to Hive
Scenario: for historical reasons, a lot of accumulated data was originally processed with Hive;
now we want to process it with Spark, so Spark must be able to seamlessly work with the existing Hive data
and transition smoothly in a Spark + Hive setup.
The operation is very simple: it all comes down to the MetaStore.
Hive's metadata is stored in MySQL, configured in $HIVE_HOME/conf/hive-site.xml.
If Spark can directly access that existing metadata in MySQL, it can work with Hive: copy hive-site.xml to $SPARK_HOME/conf/hive-site.xml.
How Spark connects to Hive: principle and practice
Hive's metadata lives in MySQL; as long as Spark can read that metadata, it can connect to Hive.
package com.imooc.bigdata.chapter06
import java.util.Properties
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.{DataFrame, SparkSession}
object HiveSourceApp {
def main(args: Array[String]): Unit = {
// To access Hive from Spark, Hive support must be enabled
val spark: SparkSession = SparkSession.builder().master("local").appName("HiveSourceApp")
.enableHiveSupport() // remember: this must be enabled
.getOrCreate()
// this reads the pk table in the default database; for other databases use the same db.table style
//spark.table("default.pk").show()
// input (Hive/MySQL/JSON...) ==> processing ==> output (Hive)
import spark.implicits._
// connect to MySQL and read the source data (the db.default.* connection settings)
val config = ConfigFactory.load()
val url = config.getString("db.default.url")
val user = config.getString("db.default.user")
val password = config.getString("db.default.password")
val driver = config.getString("db.default.driver")
val database = config.getString("db.default.database")
val table = config.getString("db.default.table")
val sinkTable = config.getString("db.default.sink.table")
val connectionProperties = new Properties()
connectionProperties.put("user", user)
connectionProperties.put("password", password)
val jdbcDF: DataFrame = spark.read
.jdbc(url, s"$database.$table", connectionProperties).filter($"cnt" > 100)
//jdbcDF.show()
jdbcDF.write.saveAsTable("browser_stat_hive")
// TODO... what is the difference between saveAsTable and insertInto?
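// Difference (Spark 2.x behavior):
// - saveAsTable creates the table from the DataFrame schema if it does not exist and resolves columns by NAME;
// - insertInto requires the target table to already exist and resolves columns by POSITION,
//   so the column order must match the table definition.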
jdbcDF.write.insertInto("browser_stat_hive_1")
spark.stop()
}
}
Spark SQL Built-in Functions in Practice
All of Spark SQL's built-in functions live under the functions object.
At work, source code is hosted in GitLab/SVN/CVS; after you clone it, never reformat the code style.
package com.imooc.bigdata.chapter06
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
object BuildinFunctionApp {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.master("local").appName("HiveSourceApp")
.getOrCreate()
// At work, source code is hosted in GitLab/SVN/CVS; after you clone it, never casually reformat the code style
val userAccessLog = Array (
"2016-10-01,1122", // day userid
"2016-10-01,1122",
"2016-10-01,1123",
"2016-10-01,1124",
"2016-10-01,1124",
"2016-10-02,1122",
"2016-10-02,1121",
"2016-10-02,1123",
"2016-10-02,1123"
)
import spark.implicits._
// Array ==> RDD
val userAccessRDD: RDD[String] = spark.sparkContext.parallelize(userAccessLog)
val userAccessDF: DataFrame = userAccessRDD.map(x => {
val splits: Array[String] = x.split(",")
Log(splits(0), splits(1).toInt)
}).toDF
//userAccessDF.show()
import org.apache.spark.sql.functions._
// the SQL equivalent:
// select day, count(user_id) from xxx group by day;
// PV: no dedup needed; group and count within each group
userAccessDF.groupBy("day").agg(count("userId").as("pv")).show()
// UV: dedup needed; group, de-duplicate within each group, then count
userAccessDF.groupBy("day").agg(countDistinct("userId").as("uv")).show()
// TODO... write the stats to a MySQL table via the JDBC data source (see the sketch after this example)
spark.stop()
}
case class Log(day:String,userId:Int)
}
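A minimal sketch of the TODO above: write the UV statistics out through the JDBC sink, assuming the userAccessDF and imports from the example. The MySQL URL, database, table and credentials are placeholders, not values from the course environment:
import org.apache.spark.sql.SaveMode
val uvDF = userAccessDF.groupBy("day").agg(countDistinct("userId").as("uv"))
uvDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://hadoop000:3306/spark") // hypothetical URL/database
  .option("dbtable", "uv_stat")                       // hypothetical table
  .option("user", "root")
  .option("password", "root")
  .mode(SaveMode.Append)
  .save()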
Spark SQL Custom UDFs in Practice
Spark SQL functions: user-defined functions
Many functions are tied to your company's own business logic; when they are not available as built-in functions, we implement them as user-defined functions.
Three steps:
1) define the function
2) register the function
3) use the function
Data
pk###jogging,Coding,cooking
jepson###travel,dance
package com.imooc.bigdata.chapter06
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* Requirement: count the number of hobbies per person
* pk: 3
* jepson: 2
*
*
* 1) define the function
* 2) register the function
* 3) use the function
*/
object UDFFunctionApp {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.master("local").appName("HiveSourceApp")
.getOrCreate()
import spark.implicits._
val infoRDD: RDD[String] = spark.sparkContext.textFile("./data/hobbies.txt")
val infoDF: DataFrame = infoRDD.map(_.split("###")).map(x => {
Hobbies(x(0), x(1))
}).toDF
//infoDF.show(false)
// TODO... define the function and register the function
spark.udf.register("hobby_num", (s:String) => s.split(",").size)
// create a temporary view
infoDF.createOrReplaceTempView("hobbies")
// TODO... use the function
spark.sql("select name, hobbies, hobby_num(hobbies) as hobby_count from hobbies").show(false)
// select name, hobby_num(hobbies) from xxx
spark.stop()
}
case class Hobbies(name:String, hobbies:String)
}
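Besides registering the UDF for SQL, the same function can be used directly through the DataFrame API; a small sketch assuming the infoDF and import spark.implicits._ from the example above:
import org.apache.spark.sql.functions.udf
val hobbyNum = udf((s: String) => s.split(",").length)
infoDF.withColumn("hobby_count", hobbyNum($"hobbies")).show(false)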
Integrating Spark with Kudu
Kudu introduction
Spark SQL + Kudu integration code
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-spark2_2.11</artifactId>
<version>1.7.0</version>
</dependency>
package com.imooc.bigdata.chapter07
import java.util.Properties
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
object SparkKuduApp {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.master("local").getOrCreate()
import spark.implicits._
val config = ConfigFactory.load()
val url = config.getString("db.default.url")
val user = config.getString("db.default.user")
val password = config.getString("db.default.password")
val driver = config.getString("db.default.driver")
val database = config.getString("db.default.database")
val table = "wc"
val connectionProperties = new Properties()
connectionProperties.put("user", user)
connectionProperties.put("password", password)
// TODO... the code above loads the configuration
// TODO... this is where the business logic you need to develop goes
// 1. read data from MySQL
val jdbcDF: DataFrame = spark.read
.jdbc(url, s"$database.$table", connectionProperties).filter($"cnt" > 11)
// TODO... the code below is the sink
val kuduMasters = "hadoop000"
// how load/save are implemented for a custom data source
// 2. write to Kudu
jdbcDF.write.mode(SaveMode.Append).format("org.apache.kudu.spark.kudu")
.option("kudu.master",kuduMasters)
.option("kudu.table", "pk")
.save()
// read the written data back from Kudu
spark.read.format("org.apache.kudu.spark.kudu")
.option("kudu.master",kuduMasters)
.option("kudu.table", "app_stat_20181007")
.load().show()
spark.stop()
}
}
Hands-on Ad-analytics Project Based on Spark and Kudu
Data
JSON-formatted data (one raw record, followed by the same record pretty-printed):
{"sessionid":"qld2dU4cfhEa3yhADzgphOf3ySv9vMml","advertisersid":66,"adorderid":142848,"adcreativeid":212312,"adplatformproviderid":174663,"sdkversion":"Android 5.0","adplatformkey":"PLMyYnDKQgOPL55frHhxkUIQtBThHfHq","putinmodeltype":1,"requestmode":1,"adprice":8410.0,"adppprice":5951.0,"requestdate":"2018-10-07","ip":"182.91.190.221","appid":"XRX1000014","appname":"支付宝 - 让生活更简单","uuid":"QtxDH9HUueM2IffUe8z2UqLKuZueZLqq","device":"HUAWEI GX1手机","client":1,"osversion":"","density":"","pw":1334,"ph":750,"lang":"","lat":"","provincename":"","cityname":"","ispid":46007,"ispname":"移动","networkmannerid":1,"networkmannername":"4G","iseffective":1,"isbilling":1,"adspacetype":3,"adspacetypename":"全屏","devicetype":1,"processnode":3,"apptype":0,"district":"district","paymode":1,"isbid":1,"bidprice":6812.0,"winprice":89934.0,"iswin":0,"cur":"rmb","rate":0.0,"cnywinprice":0.0,"imei":"","mac":"52:54:00:41:ba:02","idfa":"","openudid":"FIZHDPIKQYVNHOHOOAWMTQDFTPNWAABZTAFVHTEL","androidid":"","rtbprovince":"","rtbcity":"","rtbdistrict":"","rtbstreet":"","storeurl":"","realip":"182.92.196.236","isqualityapp":0,"bidfloor":0.0,"aw":0,"ah":0,"imeimd5":"","macmd5":"","idfamd5":"","openudidmd5":"","androididmd5":"","imeisha1":"","macsha1":"","idfasha1":"","openudidsha1":"","androididsha1":"","uuidunknow":"","userid":"vtUO8pPXfwdsPnvo6ttNGhAAnHi8NVbA","reqdate":null,"reqhour":null,"iptype":1,"initbidprice":0.0,"adpayment":175547.0,"agentrate":0.0,"lomarkrate":0.0,"adxrate":0.0,"title":"中信建投首次公开发行股票发行结果 本次发行价格为5.42元/股","keywords":"IPO,中信建投证券,股票,投资,财经","tagid":"rBRbAEQhkcAaeZ6XlTrGXOxyw6w9JQ7x","callbackdate":"2018-10-07","channelid":"123528","mediatype":2,"email":"e4aqd67bo@263.net","tel":"13105823726","age":"29","sex":"0"}
{
"sessionid":"qld2dU4cfhEa3yhADzgphOf3ySv9vMml",
"advertisersid":66,
"adorderid":142848,
"adcreativeid":212312,
"adplatformproviderid":174663,
"sdkversion":"Android 5.0",
"adplatformkey":"PLMyYnDKQgOPL55frHhxkUIQtBThHfHq",
"putinmodeltype":1,
"requestmode":1,
"adprice":8410,
"adppprice":5951,
"requestdate":"2018-10-07",
"ip":"182.91.190.221",
"appid":"XRX1000014",
"appname":"支付宝 - 让生活更简单",
"uuid":"QtxDH9HUueM2IffUe8z2UqLKuZueZLqq",
"device":"HUAWEI GX1手机",
"client":1,
"osversion":"",
"density":"",
"pw":1334,
"ph":750,
"lang":"",
"lat":"",
"provincename":"",
"cityname":"",
"ispid":46007,
"ispname":"移动",
"networkmannerid":1,
"networkmannername":"4G",
"iseffective":1,
"isbilling":1,
"adspacetype":3,
"adspacetypename":"全屏",
"devicetype":1,
"processnode":3,
"apptype":0,
"district":"district",
"paymode":1,
"isbid":1,
"bidprice":6812,
"winprice":89934,
"iswin":0,
"cur":"rmb",
"rate":0,
"cnywinprice":0,
"imei":"",
"mac":"52:54:00:41:ba:02",
"idfa":"",
"openudid":"FIZHDPIKQYVNHOHOOAWMTQDFTPNWAABZTAFVHTEL",
"androidid":"",
"rtbprovince":"",
"rtbcity":"",
"rtbdistrict":"",
"rtbstreet":"",
"storeurl":"",
"realip":"182.92.196.236",
"isqualityapp":0,
"bidfloor":0,
"aw":0,
"ah":0,
"imeimd5":"",
"macmd5":"",
"idfamd5":"",
"openudidmd5":"",
"androididmd5":"",
"imeisha1":"",
"macsha1":"",
"idfasha1":"",
"openudidsha1":"",
"androididsha1":"",
"uuidunknow":"",
"userid":"vtUO8pPXfwdsPnvo6ttNGhAAnHi8NVbA",
"reqdate":null,
"reqhour":null,
"iptype":1,
"initbidprice":0,
"adpayment":175547,
"agentrate":0,
"lomarkrate":0,
"adxrate":0,
"title":"中信建投首次公开发行股票发行结果 本次发行价格为5.42元/股",
"keywords":"IPO,中信建投证券,股票,投资,财经",
"tagid":"rBRbAEQhkcAaeZ6XlTrGXOxyw6w9JQ7x",
"callbackdate":"2018-10-07",
"channelid":"123528",
"mediatype":2,
"email":"e4aqd67bo@263.net",
"tel":"13105823726",
"age":"29",
"sex":"0"
}
Feature 1:
Run an ETL pass over the JSON data ==> ODS
The raw JSON contains many fields, but some processing is still needed, e.g. resolving province, city and ISP information from the IP address,
and splitting or enriching fields according to your business logic.
After ETL the data is well-structured (structured data is usually columnar); all later operations are based on this ETL output.
**1) Resolving the IP field**
a) an IP rule library is provided and we resolve the IP ourselves *****
log file join ipRule
b) the Chunzhen (CZ88) IP database
c) in real production you use an IP service from a specialized vendor, i.e. pay for an IP rule library
**2) Land the resolved data into the Kudu ods table for later statistics/analysis jobs**
SQL implementation
lazy val SQL = "select " +
"logs.ip ," +
"logs.sessionid," +
"logs.advertisersid," +
"logs.adorderid," +
"logs.adcreativeid," +
"logs.adplatformproviderid" +
",logs.sdkversion" +
",logs.adplatformkey" +
",logs.putinmodeltype" +
",logs.requestmode" +
",logs.adprice" +
",logs.adppprice" +
",logs.requestdate" +
",logs.appid" +
",logs.appname" +
",logs.uuid, logs.device, logs.client, logs.osversion, logs.density, logs.pw, logs.ph" +
",ips.province as provincename" +
",ips.city as cityname" +
",ips.isp as isp" +
",logs.ispid, logs.ispname" +
",logs.networkmannerid, logs.networkmannername, logs.iseffective, logs.isbilling" +
",logs.adspacetype, logs.adspacetypename, logs.devicetype, logs.processnode, logs.apptype" +
",logs.district, logs.paymode, logs.isbid, logs.bidprice, logs.winprice, logs.iswin, logs.cur" +
",logs.rate, logs.cnywinprice, logs.imei, logs.mac, logs.idfa, logs.openudid,logs.androidid" +
",logs.rtbprovince,logs.rtbcity,logs.rtbdistrict,logs.rtbstreet,logs.storeurl,logs.realip" +
",logs.isqualityapp,logs.bidfloor,logs.aw,logs.ah,logs.imeimd5,logs.macmd5,logs.idfamd5" +
",logs.openudidmd5,logs.androididmd5,logs.imeisha1,logs.macsha1,logs.idfasha1,logs.openudidsha1" +
",logs.androididsha1,logs.uuidunknow,logs.userid,logs.iptype,logs.initbidprice,logs.adpayment" +
",logs.agentrate,logs.lomarkrate,logs.adxrate,logs.title,logs.keywords,logs.tagid,logs.callbackdate" +
",logs.channelid,logs.mediatype,logs.email,logs.tel,logs.sex,logs.age " +
"from logs left join " +
"ips on logs.ip_long between ips.start_ip and ips.end_ip "
Main code
package com.imooc.bigdata.chapter08.business
import com.imooc.bigdata.chapter08.utils.{IPUtils, KuduUtils, SQLUtils, SchemaUtils}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* Log ETL / cleansing job
*/
object LogETLApp {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder() //
.master("local[2]").appName("LogETLApp") //
.getOrCreate()
// log data: load the JSON to be processed directly via the Data Source API
var jsonDF: DataFrame = spark.read.json("./data/data-test.json")
// jsonDF.printSchema()
// jsonDF.show(false)
import spark.implicits._
val ipRowRDD: RDD[String] = spark.sparkContext.textFile("./data/ip.txt")
// recommended: convert the RDD to a DF ==> then use the DF API, or register the DF as a view and use SQL
// extract the fields we need from each row of the IP rule library
val ipRuleDF: DataFrame = ipRowRDD.map(x => {
val splits: Array[String] = x.split("\\|")
val startIP: Long = splits(2).toLong
val endIP: Long = splits(3).toLong
val province: String = splits(6)
val city: String = splits(7)
val isp: String = splits(9)
(startIP, endIP, province, city, isp)
}).toDF("start_ip", "end_ip", "province", "city", "isp")
//ipRuleDF.show(false)
// TODO resolve each log line's IP to its province, city and ISP
import org.apache.spark.sql.functions._
// TODO... convert the IP string in the JSON using the Spark SQL UDF technique covered earlier
// UDF that converts the string IP in jsonDF into its numeric (long) form
def getLongIp() = udf((ip:String) => {
IPUtils.ip2Long(ip)
})
// add a column to jsonDF holding the converted IP
jsonDF = jsonDF.withColumn("ip_long", getLongIp()($"ip"))
// DataFrame API implementation
// join the two DFs where the JSON IP falls within the rule's IP range: ip between ... and ...
jsonDF.join(ipRuleDF,jsonDF("ip_long")
.between(ipRuleDF("start_ip"), ipRuleDF("end_ip")))
// .show(false)
// TODO... do you know which join types exist and how they differ?
// SQL implementation
// create the temporary views first
jsonDF.createOrReplaceTempView("logs")
ipRuleDF.createOrReplaceTempView("ips")
val sql = SQLUtils.SQL
// run the SQL into a DataFrame so it is easy to print and display
val result: DataFrame = spark.sql(sql)
// result.printSchema()
//result.show(false)
// ===> printSchema
// refactor: Client result tableName master schema partitionId
// after ETL, the data must land somewhere: KUDU
// we only define the table info; table create/drop is all encapsulated in KuduUtils.sink
val tableName = "ods" // name of the ods table in Kudu
val masterAddresses = "hadoop000"
val partitionId = "ip"
KuduUtils.sink(result,tableName,masterAddresses,SchemaUtils.ODSSchema, partitionId)
spark.stop()
}
}
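IPUtils.ip2Long is a project utility whose source is not shown in these notes; a minimal sketch of what such a dotted-IPv4-to-long conversion typically looks like (an assumption, not necessarily the course's exact implementation):
object IPUtils {
  def ip2Long(ip: String): Long = {
    // e.g. 182.91.190.221 => ((182*256 + 91)*256 + 190)*256 + 221
    ip.split("\\.").foldLeft(0L)((acc, part) => acc * 256 + part.toLong)
  }
}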
Feature 2: distribution of counts by province and city
input: kudu ods
Stats: group by provincename, cityname and count
Extension: the TopN cities by count within each province, a classic TopN problem (see the sketch after the SQL below).
SQL implementation
lazy val PROVINCE_CITY_SQL = "select provincename, cityname, count(1) as cnt from ods group by provincename,cityname"
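A sketch of the TopN extension mentioned above: the top 3 cities by count within each province, using a window function over the ods view registered in the code below (N = 3 is an arbitrary choice):
val topN = spark.sql(
  """
    |select provincename, cityname, cnt
    |from (
    |  select provincename, cityname, cnt,
    |         row_number() over (partition by provincename order by cnt desc) as rk
    |  from (
    |    select provincename, cityname, count(1) as cnt
    |    from ods
    |    group by provincename, cityname
    |  ) a
    |) b
    |where rk <= 3
  """.stripMargin)
topN.show(false)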
package com.imooc.bigdata.chapter08.business
import com.imooc.bigdata.chapter08.utils.{KuduUtils, SQLUtils, SchemaUtils}
import org.apache.spark.sql.{DataFrame, SparkSession}
object ProvinceCityStatApp {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.master("local[2]").appName("ProvinceCityStatApp").getOrCreate()
// read from the Kudu ods table, then simply group by province and city and count
val sourceTableName = "ods"
val masterAddresses = "hadoop000"
val odsDF: DataFrame = spark.read.format("org.apache.kudu.spark.kudu") //
.option("kudu.table", sourceTableName)
.option("kudu.master", masterAddresses)
.load()
//odsDF.show(false)
odsDF.createOrReplaceTempView("ods")
val result: DataFrame = spark.sql(SQLUtils.PROVINCE_CITY_SQL)
// result.show(false)
val sinkTableName = "province_city_stat"
val partitionId = "provincename"
KuduUtils.sink(result,sinkTableName,masterAddresses,SchemaUtils.ProvinceCitySchema, partitionId)
spark.stop()
}
}
Refactoring Features 1 and 2
The two requirements were implemented in two separate classes; that is unwieldy and does not meet production standards, so we refactor.
package com.imooc.bigdata.chapter08.`trait`
import org.apache.spark.sql.SparkSession
/**
* Top-level data processing interface
*/
trait DataProcess {
def process(spark:SparkSession)
}
Feature 1 implemented against the interface
package com.imooc.bigdata.chapter08.business
import com.imooc.bigdata.chapter08.`trait`.DataProcess
import com.imooc.bigdata.chapter08.utils._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
object LogETLProcessor extends DataProcess{
override def process(spark: SparkSession): Unit = {
// log data: load the JSON to be processed directly via the Data Source API
val rawPath: String = spark.sparkContext.getConf.get("spark.raw.path")
var jsonDF: DataFrame = spark.read.json(rawPath)
// var jsonDF: DataFrame = spark.read.json("file:///Users/rocky/IdeaProjects/imooc-workspace/sparksql-train/data/data-test.json")
//jsonDF.printSchema()
//jsonDF.show(false)
import spark.implicits._
val ipRulePath: String = spark.sparkContext.getConf.get("spark.ip.path")
val ipRowRDD: RDD[String] = spark.sparkContext.textFile(ipRulePath)
// val ipRowRDD: RDD[String] = spark.sparkContext.textFile("file:///Users/rocky/IdeaProjects/imooc-workspace/sparksql-train/data/ip.txt")
// recommended: convert the RDD to a DF ==> then use the DF API, or register the DF as a view and use SQL
val ipRuleDF: DataFrame = ipRowRDD.map(x => {
val splits: Array[String] = x.split("\\|")
val startIP: Long = splits(2).toLong
val endIP: Long = splits(3).toLong
val province: String = splits(6)
val city: String = splits(7)
val isp: String = splits(9)
(startIP, endIP, province, city, isp)
}).toDF("start_ip", "end_ip", "province", "city", "isp")
//ipRuleDF.show(false)
// TODO resolve each log line's IP to its province, city and ISP
// join the two DFs where the JSON IP falls within the rule's IP range: ip between ... and ...
// TODO... convert the IP string in the JSON using the Spark SQL UDF technique covered earlier
import org.apache.spark.sql.functions._
def getLongIp() = udf((ip:String) => {
IPUtils.ip2Long(ip)
})
jsonDF = jsonDF.withColumn("ip_long", getLongIp()($"ip"))
// jsonDF.join(ipRuleDF,jsonDF("ip_long")
// .between(ipRuleDF("start_ip"), ipRuleDF("end_ip")))
// .show(false)
// TODO... do you know which join types exist and how they differ?
jsonDF.createOrReplaceTempView("logs")
ipRuleDF.createOrReplaceTempView("ips")
// after ETL, the data must land somewhere: KUDU
val sql = SQLUtils.SQL
val result: DataFrame = spark.sql(sql)
// ===> printSchema
// refactor: Client result tableName master schema partitionId
// we only define the table info; table create/drop is all encapsulated in KuduUtils.sink
val tableName = DateUtils.getTableName("ods", spark)
val masterAddresses = "hadoop000"
val partitionId = "ip"
KuduUtils.sink(result,tableName,masterAddresses,SchemaUtils.ODSSchema, partitionId)
}
}
Feature 2 implemented against the interface
package com.imooc.bigdata.chapter08.business
import com.imooc.bigdata.chapter08.`trait`.DataProcess
import com.imooc.bigdata.chapter08.utils.{DateUtils, KuduUtils, SQLUtils, SchemaUtils}
import org.apache.spark.sql.{DataFrame, SparkSession}
object ProvinceCityStatProcessor extends DataProcess{
override def process(spark: SparkSession): Unit = {
// read from the Kudu ods table, then simply group by province and city and count
val sourceTableName = DateUtils.getTableName("ods", spark)
val masterAddresses = "hadoop000"
val odsDF: DataFrame = spark.read.format("org.apache.kudu.spark.kudu") //
.option("kudu.table", sourceTableName)
.option("kudu.master", masterAddresses)
.load()
//odsDF.show(false)
odsDF.createOrReplaceTempView("ods")
val result: DataFrame = spark.sql(SQLUtils.PROVINCE_CITY_SQL)
// result.show(false)
val sinkTableName = DateUtils.getTableName("province_city_stat", spark)
val partitionId = "provincename"
KuduUtils.sink(result,sinkTableName,masterAddresses,SchemaUtils.ProvinceCitySchema, partitionId)
}
}
package com.imooc.bigdata.chapter08
import com.imooc.bigdata.chapter08.business.{AppStatProcessor, AreaStatProcessor, LogETLProcessor, ProvinceCityStatProcessor}
import org.apache.commons.lang3.StringUtils
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
/**
* Entry point of the project's Spark jobs
* Offline processing: one batch per day
*/
object SparkApp extends Logging{
def main(args: Array[String]): Unit = {
// .master("local[2]").appName("SparkApp")  the Spark docs stress not to hard-code these; set appName and master uniformly via spark-submit at submit time
val spark: SparkSession = SparkSession.builder().getOrCreate()
/**
* Input parameters:
* 1) spark.time
* 2) spark.raw.path
* 3) spark.ip.path
*/
// spark-submit ...... --conf spark.time=20181007
val time = spark.sparkContext.getConf.get("spark.time") // Spark only picks up --conf parameters whose names start with spark.
if(StringUtils.isBlank(time)) { // if it is empty, the rest of the code should not run
logError("The processing batch (spark.time) must not be empty....")
System.exit(0)
}
// STEP1: ETL
LogETLProcessor.process(spark)
// STEP2: province/city statistics
ProvinceCityStatProcessor.process(spark)
// STEP3: regional distribution statistics
AreaStatProcessor.process(spark)
// STEP4: app distribution statistics
AppStatProcessor.process(spark)
spark.stop()
}
}
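DateUtils.getTableName is a project helper whose source is not shown here; given the spark.time parameter above, it presumably appends the batch date to the base table name. A minimal sketch under that assumption (not necessarily the course's exact implementation):
import org.apache.spark.sql.SparkSession
object DateUtils {
  // e.g. getTableName("ods", spark) => "ods_20181007" when submitted with --conf spark.time=20181007
  def getTableName(tableName: String, spark: SparkSession): String = {
    val time = spark.sparkContext.getConf.get("spark.time")
    tableName + "_" + time
  }
}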
Requirement 3: regional distribution statistics
Group by province and city, compute the per-group metrics in a first step, then filter and derive the rates in a second step.
lazy val AREA_SQL_STEP1 = "select provincename,cityname, " +
"sum(case when requestmode=1 and processnode >=1 then 1 else 0 end) origin_request," +
"sum(case when requestmode=1 and processnode >=2 then 1 else 0 end) valid_request," +
"sum(case when requestmode=1 and processnode =3 then 1 else 0 end) ad_request," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and isbid=1 and adorderid!=0 then 1 else 0 end) bid_cnt," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1 then 1 else 0 end) bid_success_cnt," +
"sum(case when requestmode=2 and iseffective=1 then 1 else 0 end) ad_display_cnt," +
"sum(case when requestmode=3 and processnode=1 then 1 else 0 end) ad_click_cnt," +
"sum(case when requestmode=2 and iseffective=1 and isbilling=1 then 1 else 0 end) medium_display_cnt," +
"sum(case when requestmode=3 and iseffective=1 and isbilling=1 then 1 else 0 end) medium_click_cnt," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1 and adorderid>20000 then 1*winprice/1000 else 0 end) ad_consumption," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1 and adorderid>20000 then 1*adpayment/1000 else 0 end) ad_cost " +
"from ods group by provincename,cityname"
// step 2: filter out rows where a divisor would be 0
lazy val AREA_SQL_STEP2 = "select provincename,cityname, " +
"origin_request," +
"valid_request," +
"ad_request," +
"bid_cnt," +
"bid_success_cnt," +
"bid_success_cnt/bid_cnt bid_success_rate," +
"ad_display_cnt," +
"ad_click_cnt," +
"ad_click_cnt/ad_display_cnt ad_click_rate," +
"ad_consumption," +
"ad_cost from area_tmp " +
"where bid_cnt!=0 and ad_display_cnt!=0"
package com.imooc.bigdata.chapter08.business
import com.imooc.bigdata.chapter08.`trait`.DataProcess
import com.imooc.bigdata.chapter08.utils.{DateUtils, KuduUtils, SQLUtils, SchemaUtils}
import org.apache.spark.sql.{DataFrame, SparkSession}
object AreaStatProcessor extends DataProcess{
override def process(spark: SparkSession): Unit = {
val sourceTableName = DateUtils.getTableName("ods", spark)
val masterAddresses = "hadoop000"
val odsDF: DataFrame = spark.read.format("org.apache.kudu.spark.kudu") //
.option("kudu.table", sourceTableName)
.option("kudu.master", masterAddresses)
.load()
odsDF.createOrReplaceTempView("ods")
//
val resultTmp: DataFrame = spark.sql(SQLUtils.AREA_SQL_STEP1)
resultTmp.createOrReplaceTempView("area_tmp")
val result: DataFrame = spark.sql(SQLUtils.AREA_SQL_STEP2)
// result.show(false)
// filter out rows where a divisor would be 0
val sinkTableName = DateUtils.getTableName("area_stat", spark)
val partitionId = "provincename"
KuduUtils.sink(result,sinkTableName,masterAddresses,SchemaUtils.AREASchema, partitionId)
}
}
Requirement 4: app distribution statistics
Group by appid and appname, compute the per-group metrics (per app), then filter in a second step.
lazy val APP_SQL_STEP1 = "select appid,appname, " +
"sum(case when requestmode=1 and processnode >=1 then 1 else 0 end) origin_request," +
"sum(case when requestmode=1 and processnode >=2 then 1 else 0 end) valid_request," +
"sum(case when requestmode=1 and processnode =3 then 1 else 0 end) ad_request," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and isbid=1 and adorderid!=0 then 1 else 0 end) bid_cnt," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1 then 1 else 0 end) bid_success_cnt," +
"sum(case when requestmode=2 and iseffective=1 then 1 else 0 end) ad_display_cnt," +
"sum(case when requestmode=3 and processnode=1 then 1 else 0 end) ad_click_cnt," +
"sum(case when requestmode=2 and iseffective=1 and isbilling=1 then 1 else 0 end) medium_display_cnt," +
"sum(case when requestmode=3 and iseffective=1 and isbilling=1 then 1 else 0 end) medium_click_cnt," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1 and adorderid>20000 then 1*winprice/1000 else 0 end) ad_consumption," +
"sum(case when adplatformproviderid>=100000 and iseffective=1 and isbilling=1 and iswin=1 and adorderid>20000 then 1*adpayment/1000 else 0 end) ad_cost " +
"from ods group by appid,appname"
lazy val APP_SQL_STEP2 = "select appid,appname, " +
"origin_request," +
"valid_request," +
"ad_request," +
"bid_cnt," +
"bid_success_cnt," +
"bid_success_cnt/bid_cnt bid_success_rate," +
"ad_display_cnt," +
"ad_click_cnt," +
"ad_click_cnt/ad_display_cnt ad_click_rate," +
"ad_consumption," +
"ad_cost from app_tmp " +
"where bid_cnt!=0 and ad_display_cnt!=0"
package com.imooc.bigdata.chapter08.business
import com.imooc.bigdata.chapter08.`trait`.DataProcess
import com.imooc.bigdata.chapter08.utils.{DateUtils, KuduUtils, SQLUtils, SchemaUtils}
import org.apache.spark.sql.{DataFrame, SparkSession}
object AppStatProcessor extends DataProcess{
override def process(spark: SparkSession): Unit = {
val sourceTableName = DateUtils.getTableName("ods", spark)
val masterAddresses = "hadoop000"
val odsDF: DataFrame = spark.read.format("org.apache.kudu.spark.kudu") //
.option("kudu.table", sourceTableName)
.option("kudu.master", masterAddresses)
.load()
odsDF.createOrReplaceTempView("ods")
val resultTmp: DataFrame = spark.sql(SQLUtils.APP_SQL_STEP1)
//resultTmp.show(false)
resultTmp.createOrReplaceTempView("app_tmp")
val result: DataFrame = spark.sql(SQLUtils.APP_SQL_STEP2)
// result.show(false)
val sinkTableName = DateUtils.getTableName("app_stat", spark)
val partitionId = "appid"
KuduUtils.sink(result,sinkTableName,masterAddresses,SchemaUtils.APPSchema, partitionId)
}
}
Passing parameters into the Spark job, refactoring the code and packaging it (the refactored SparkApp and LogETLProcessor are as shown above)
Running the project on the server
We no longer run the code locally, since local development and testing are complete.
==> adjust the code, package it, and run it on the server.
Data layout on HDFS: one directory per day, named YYYYMMDD.
Our offline processing granularity: one run per day, at 3 a.m.
==> so a processing date must be passed in.
Our requirement: build a thin jar, not a fat jar (a fat jar uses a plugin to bundle every dependency from the pom).
Copy the Kudu dependency jars over; they can be taken directly from the local Maven repository.
job.sh
time=20181007
${SPARK_HOME}/bin/spark-submit \
--class com.imooc.bigdata.chapter08.SparkApp \
--master local \
--jars /home/hadoop/lib/kudu-client-1.7.0.jar,/home/hadoop/lib/kudu-spark2_2.11-1.7.0.jar \
--conf spark.time=$time \
--conf spark.raw.path="hdfs://hadoop000:8020/pk/access/$time" \
--conf spark.ip.path="hdfs://hadoop000:8020/pk/access/ip.txt" \
/home/hadoop/lib/sparksql-train-1.0.jar
Scheduling the Spark job to run on the server
time is hard-coded, which is obviously not acceptable; how do we make it dynamic?
This is where a scheduler comes in.
Schedulers: crontab, Azkaban, Oozie, ... under the hood these task schedulers all drive shell scripts.
crontab -e   edit
Exercise: write date to a file every minute.
Run our job at 3 a.m.; note that it should process the previous day (current day - 1).
crontab -l   list
crontab -r   remove
Homework: use crontab to schedule the job to run at 3 a.m. every day against the previous day's data.