Spark SQL - 2. Example: Using Spark SQL with Hive Tables

 

The Hive table has already been created; for details, see: Hive example: Sogou user search logs.

 

Configuration:

1. Copy core-site.xml and hive-site.xml into Spark's conf directory

core-site.xml is in Hadoop's configuration directory, and hive-site.xml is in Hive's configuration directory.

2. Copy mysql-connector-java-5.1.47-bin.jar (the MySQL JDBC driver used by the Hive metastore) into Spark's jars directory

3. Edit Spark's hive-site.xml

In /spark/conf/hive-site.xml, add:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoop01:9083</value>
</property>
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>

Explanation:

hive.metastore.uris: Spark SQL connects to this address; it is the URI of Hive's metastore, used to look up Hive table metadata.

hive.metastore.schema.verification: disables the metastore schema version check.

4. Start Hive's metastore service

For Spark SQL to use Hive tables, Hive's metastore service must be running:

hive --service metastore &

Start it once and leave it running in the background for Spark SQL to use.

You can also run hive --service metastore without the trailing &; the only difference is that & puts the process in the background.

 

 

If you hit a schema version mismatch error when starting the metastore, such as:

MetaException(message:Hive Schema version 2.3.0 does not match metastore's schema version 1.2.0 Metastore is not upgraded or corrupt)

edit Hive's own configuration file /hive/conf/hive-site.xml and add the same property that disables the metastore schema version check:

<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>

 

 

 

The common operations below are also covered in the official documentation:

http://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

 

Usage:

Start the Spark shell:

[root@hadoop01 ~]# spark-shell
scala>
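
In spark-shell, the built-in spark session normally already has Hive support enabled, which is why spark.sql can see Hive tables directly. If you were writing a standalone application instead, you would build the session yourself with enableHiveSupport; a minimal sketch, assuming a placeholder app name and the default warehouse path:

import org.apache.spark.sql.SparkSession

// Minimal standalone setup: the app name and warehouse path below are placeholders
val spark = SparkSession.builder()
  .appName("SparkSQLHiveExample")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()   // read hive-site.xml and use the Hive metastore
  .getOrCreate()

spark.sql("show databases").show()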

 

Testing:

(1) List the databases in Hive's metastore

scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
|     default|
|        hive|
+------------+

scala>

 

(2) List the tables in Hive's metastore

scala> val priors = spark.sql("show tables")
priors: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]

scala> priors.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|    sogou|      false|
+--------+---------+-----------+

scala>
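
The same metadata can also be read through the catalog API instead of SQL; for example:

// Equivalent to "show databases" / "show tables" via the catalog API
spark.catalog.listDatabases.show()
spark.catalog.listTables("default").show()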

 

(3) Count the rows in the sogou table

scala> val x = spark.sql("select count(*) from sogou")
x: org.apache.spark.sql.DataFrame = [count(1): bigint]

scala> x.show
+--------+
|count(1)|
+--------+
|   10000|
+--------+

scala>
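
The same count can be obtained without SQL by loading the table as a DataFrame; for example:

// spark.table loads a Hive table as a DataFrame; count() returns a Long
val n = spark.table("sogou").count()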

 

(4) Create a new table

scala> spark.sql("create table if not exists people(id int,name string)")
res4: org.apache.spark.sql.DataFrame = []

scala> spark.sql("show tables").show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|   people|      false|
| default|    sogou|      false|
+--------+---------+-----------+

scala>

 

(5) Insert data

scala> spark.sql("insert into people values (1,'a'),(2,'b'),(3,'c')")
19/05/28 17:29:48 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res7: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from people").show()
+---+----+
| id|name|
+---+----+
|  1|   a|
|  2|   b|
|  3|   c|
+---+----+

scala>
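
Rows can also be appended without SQL by building a DataFrame and writing it into the table. A small sketch (the extra rows below are made up for illustration):

import spark.implicits._

// Build a DataFrame of new rows and append it to the Hive table;
// insertInto matches columns by position, so the order must be (id, name)
val moreRows = Seq((4, "d"), (5, "e")).toDF("id", "name")
moreRows.write.insertInto("people")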

 

 

(6) Load local data

The local file /test/age.txt contains:

a,18
b,24
c,22

Create a new table with comma-delimited fields (the SQL follows Hive syntax):

spark.sql("create table if not exists people_age(name string,age int) row format delimited fields terminated by ','")

Load the data (the SQL follows Hive syntax):

scala> spark.sql("load data local inpath '/test/age.txt' into table people_age")
19/05/28 23:20:51 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
res10: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from people_age").show
+----+---+
|name|age|
+----+---+
|   a| 18|
|   b| 24|
|   c| 22|
+----+---+

scala>
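
An alternative to the Hive-style load is Spark's CSV reader: read the same file into a DataFrame and append it to the table. A sketch, assuming the file lives on the driver's local filesystem:

// Default CSV delimiter is a comma; file:// forces the local filesystem
val ages = spark.read
  .option("inferSchema", "true")
  .csv("file:///test/age.txt")
  .toDF("name", "age")

ages.write.insertInto("people_age")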

 

(7) Join tables, filter, and sort in descending order

scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age desc").show
+---+----+---+
| id|name|age|
+---+----+---+
|  2|   b| 24|
|  3|   c| 22|
+---+----+---+

scala>

To return only n rows, append limit n to the query, as shown below.
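
For example, keeping only the single top row (the 1 is only for illustration):

spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age desc limit 1").show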

 

(8) Save the result as a permanent table

val resultDF = spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age")

Save it as a permanent table:

resultDF.write.saveAsTable("people_result")

scala> spark.sql("show tables").show
+--------+-------------+-----------+
|database|    tableName|isTemporary|
+--------+-------------+-----------+
| default|       people|      false|
| default|   people_age|      false|
| default|people_result|      false|
| default|        sogou|      false|
+--------+-------------+-----------+

 

A saved permanent table can be loaded directly with Spark's table method:

scala> spark.table("people_result").show
+---+----+---+
| id|name|age|
+---+----+---+
|  3|   c| 22|
|  2|   b| 24|
+---+----+---+

scala>
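
By default saveAsTable fails if the table already exists; the save mode and storage format can be set explicitly if needed. A sketch:

import org.apache.spark.sql.SaveMode

// Overwrite an existing table and store the data as Parquet
resultDF.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .saveAsTable("people_result")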

 

 

 

All the examples above use SQL statements; you can also use the DSL, as well as RDD operations.

First, create a DataFrame:

scala> val df = spark.sql("select * from people_result")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

Or:

scala> val df = spark.table("people_result")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> df.show
+---+----+---+
| id|name|age|
+---+----+---+
|  3|   c| 22|
|  2|   b| 24|
+---+----+---+

DSL statements

(1) Select, where, and filter

scala> df.select("name").show //df.select("*").show
+----+
|name|
+----+
|   c|
|   b|
+----+

scala> df.where("id=2").show //df.select("name").where("id=2").show
+---+----+---+
| id|name|age|
+---+----+---+
|  2|   b| 24|
+---+----+---+

scala> df.filter("age<23").show
+---+----+---+
| id|name|age|
+---+----+---+
|  3|   c| 22|
+---+----+---+
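
The string conditions above can also be written as column expressions, which is handy when a condition involves column arithmetic. A sketch (the $ syntax needs import spark.implicits._, which spark-shell provides automatically):

import spark.implicits._

df.filter($"age" < 23).show()           // same result as df.filter("age<23")
df.select($"name", $"age" + 1).show()   // column arithmetic: age + 1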

(2) Sort

scala> df.sort("id").show //df.sort($"id".desc).show
+---+----+---+
| id|name|age|
+---+----+---+
|  2|   b| 24|
|  3|   c| 22|
+---+----+---+

(3) Create a new DataFrame with toDF

scala> val df_new = df.toDF
df_new: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> df_new.show
+---+----+---+
| id|name|age|
+---+----+---+
|  3|   c| 22|
|  2|   b| 24|
+---+----+---+

The toDF method can also be given new column names:

scala> val df_new = df.toDF("newTD","newName","newAge")
df_new: org.apache.spark.sql.DataFrame = [newTD: int, newName: string ... 1 more field]

scala> df_new.show
+-----+-------+------+
|newTD|newName|newAge|
+-----+-------+------+
|    3|      c|    22|
|    2|      b|    24|
+-----+-------+------+

scala>
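
If only one column needs a new name, withColumnRenamed is an alternative to toDF; the new name below is just an example:

// Rename a single column; all other columns keep their names
val df_renamed = df.withColumnRenamed("id", "personId")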

 

 

RDD operations:

A DataFrame is a wrapper around an RDD: essentially an RDD plus a schema (the column headers).

RDD-style operations such as collect, count, cache, filter, sort, join, and sample are all available.

scala> df.collect
res35: Array[org.apache.spark.sql.Row] = Array([3,c,22], [2,b,24])

scala> df.take(2)
res36: Array[org.apache.spark.sql.Row] = Array([3,c,22], [2,b,24])

scala> df.first
res38: org.apache.spark.sql.Row = [3,c,22]

scala> df.count
res39: Long = 2

scala> df.cache
res40: df.type = [id: int, name: string ... 1 more field]

scala> df.sample(0.5).show //df.sample(false,0.5).show (false means sampling without replacement)
+---+----+---+
| id|name|age|
+---+----+---+
|  2|   b| 24|
+---+----+---+

 

Converting a DataFrame to an RDD:

val rdd1 = df.rdd
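
rdd1 is an RDD[Row]; fields are pulled out of each Row by pattern matching or by index. A small sketch:

import org.apache.spark.sql.Row

// Turn each Row into a plain (id, name, age) tuple
val tuples = rdd1.map { case Row(id: Int, name: String, age: Int) => (id, name, age) }
tuples.collect().foreach(println)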

 

 

 

Spark's bundled Hive example can be found at:

examples/src/main/scala/org/apache/spark/examples/sql/SparkHiveExample.scala
