1. Hive tables
Case study:
// Spark connecting to Hive
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("hive.metastore.uris", "thrift://192.168.195.20:9083")
  .enableHiveSupport()
  .appName("sparkonhive")
  .getOrCreate()
spark.sql("show databases").collect().foreach(println)
// Spark connects to Hive's default database by default.
// To query another database, qualify the table as dbName.tableName.
val df = spark.sql("select * from toronto")
df.show()
// df.printSchema()
df.where(df("ssn").startsWith("111")).show()
df.filter(df("ssn").like("111%")).show()
df.filter(df("ssn").startsWith("111")).show()
Connecting Spark to Hive on a Linux system:
Copy the hive-site.xml you configured for Hive into the /opt/spark/conf directory,
and add the following to Spark's copy only (not to Hive's own hive-site.xml):
<property>
<name>hive.metastore.uris</name>
<value>thrift://192.168.195.20:9083</value>
</property>
Copy in the MySQL driver JAR (Hive's metadata is stored in MySQL, so the driver is required to reach the metastore).
Once everything is configured, start the Hive metastore service before using Hive: nohup hive --service metastore &
(Note: the metastore service must be running before every Spark start as well, otherwise Spark will report an error on launch, because Spark now loads Hive's configuration file.)
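With hive-site.xml in place under /opt/spark/conf, the metastore URI no longer needs to be set in code. A minimal sketch of what the connection code reduces to in that setup (meant to be launched via spark-shell or spark-submit, which supply the master):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sparkOnHiveCluster")
  .enableHiveSupport()  // hive.metastore.uris is picked up from hive-site.xml
  .getOrCreate()
spark.sql("show databases").show()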
2. MySQL tables
// Spark connecting to MySQL
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sparkqlonmysql")
  .getOrCreate()
val sc = spark.sparkContext
val url = "jdbc:mysql://192.168.195.20:3306/hive"
val user = "root"
val pwd = "ok"
val driver = "com.mysql.jdbc.Driver"
val prop = new Properties()
prop.setProperty("user", user)
prop.setProperty("password", pwd)
prop.setProperty("driver", driver)
val df = spark.read.jdbc(url, "TBLS", prop) // TBLS is a table in the MySQL "hive" database
df.show()
df.where(df("CREATE_TIME").startsWith("16008")).show()
df.groupBy(df("DB_ID")).count().show()
3. Spark built-in functions
The built-in functions are imported with: import org.apache.spark.sql.functions._
First example:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("innerFunction")
  .getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
val accessLog = Array(
"2016-12-27,001",
"2016-12-27,001",
"2016-12-27,002",
"2016-12-28,003",
"2016-12-28,004",
"2016-12-28,002",
"2016-12-28,002",
"2016-12-28,001"
)
val schema = StructType(Array(
StructField("day", StringType, true),
StructField("userId", IntegerType, true)
))
val rdd = sc.parallelize(accessLog).map(_.split(","))
.map(x=>Row(x(0),x(1).toInt))
val df = spark.createDataFrame(rdd,schema)
df.printSchema()
df.show()
// import the built-in functions
import org.apache.spark.sql.functions._
df.groupBy(df("day")).agg(count(df("userId")).as("pv"))
.collect().foreach(println)
println("-----------------")
df.groupBy(df("day")).agg(countDistinct(df("userId")).as("pv1"))
.collect().foreach(println)
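For comparison, a sketch of the same PV / distinct-UV counts expressed in SQL against a temporary view (the view name access_log is ours):

df.createOrReplaceTempView("access_log")
spark.sql(
  """select day,
    |       count(userId)          as pv,
    |       count(distinct userId) as pv1
    |from access_log
    |group by day""".stripMargin).show()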
Second example:
// Student is assumed to be a case class defined outside the main method,
// with field names inferred from the column references below:
case class Student(id: Int, name: String, gender: String, age: Int)

import org.apache.spark.sql.DataFrame
val stuDF: DataFrame = Seq(
Student(1001, "zhangsan", "F", 20),
Student(1002, "lisi", "M", 16),
Student(1003, "wangwu", "M", 21),
Student(1004, "zhaoliu", "F", 21),
Student(1005, "zhouqi", "M", 22),
Student(1006, "qianba", "M", 19),
Student(1007, "liuliu", "F", 23),
Student(1008, "liulij", "F", 23)
).toDF()
stuDF.show()
stuDF.groupBy(stuDF("gender")).agg(count(stuDF("age"))).show()
stuDF.groupBy(stuDF("gender")).agg(max(stuDF("age"))).show()
//put all the required aggregate values on one row per group, producing a single result table
stuDF.groupBy(stuDF("gender")).agg("age"->"max","age"->"min"
,"age"->"avg","id"->"count").show()
//equivalent to: select gender, age, count(*) from table group by gender, age
println("----------------")
stuDF.groupBy("gender","age").count().show()
Result screenshot attached (this feels quite handy and convenient for day-to-day work).
4. Spark UDFs
1. Define the function
2. Register the function
- SparkSession.udf.register(): only usable inside sql() queries
- functions.udf(): also works with the DataFrame API (a sketch follows the example code below)
3. Call the function
// Hobbies is assumed to be a case class defined outside the main method:
case class Hobbies(name: String, hobbies: String)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("UDF")
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
// Assumes each line of in/hobbies.txt looks like "name hobby1,hobby2,...",
// i.e. a space between the name and the comma-separated hobby list.
val rdd = sc.textFile("in/hobbies.txt")
val df = rdd.map(x => x.split(" "))
  .map(x => Hobbies(x(0), x(1)))
  .toDF()
df.printSchema()
df.show()
df.createOrReplaceTempView("hobbies") // registerTempTable is deprecated
spark.udf.register("hoby_num", (s: String) => s.split(",").length)
val frame = spark.sql("select name, hobbies, hoby_num(hobbies) as hobnum from hobbies")
frame.show()
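As noted in point 2 above, functions.udf() produces a function that can be used directly in the DataFrame API, without going through sql(). A sketch of the same hobby counter in that style, reusing df from above:

import org.apache.spark.sql.functions.udf

val hobbyNum = udf((s: String) => s.split(",").length)
df.withColumn("hobnum", hobbyNum(df("hobbies"))).show()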
5. Spark SQL CLI (the Spark SQL command-line client)
1. The Spark SQL CLI is a convenient tool that runs the Hive metastore service in local mode and executes queries typed on the command line.
2. Note that the Spark SQL CLI cannot communicate with the Thrift JDBC server.
3. The Spark SQL CLI plays the same role as the Hive CLI (old CLI) and Beeline (new CLI).
4. To start the Spark SQL CLI, run the following in the Spark directory: ./bin/spark-sql
This is equivalent to the way of connecting to Hive tables described earlier in this article, and also to working with Hive directly on the Linux system.
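As with the Hive CLI, a single statement can also be passed non-interactively with -e (assuming the metastore service from section 1 is already running), for example:
./bin/spark-sql -e "show databases"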