Spark SQL on external data sources (Hive and MySQL)

Credit / reference:

https://blog.csdn.net/weixin_42490528/article/details/90708106

1. Spark SQL on a Hive data source

(1) Copy Hive's hive-site.xml into Spark's conf directory

cd /opt/install/hive/conf/
cp hive-site.xml ${SPARK_HOME}/conf

(2) Copy the MySQL connector JAR into Spark's jars directory

cd /opt/install/hive/lib
cp mysql-connector-java-5.1.44-bin.jar ${SPARK_HOME}/jars

(3) Edit hive-site.xml under Spark's conf directory

[root@nodefour conf]# cd ${SPARK_HOME}/conf
[root@nodefour conf]# ll
total 48
-rw-r--r-- 1 1000 1000  996 Aug 28  2019 docker.properties.template
-rw-r--r-- 1 1000 1000 1105 Aug 28  2019 fairscheduler.xml.template
-rw-r--r-- 1 root root 1114 Jan  7 22:11 hive-site.xml
-rw-r--r-- 1 1000 1000 2025 Aug 28  2019 log4j.properties.template
-rw-r--r-- 1 1000 1000 7801 Aug 28  2019 metrics.properties.template
-rw-r--r-- 1 1000 1000  865 Aug 28  2019 slaves.template
-rw-r--r-- 1 1000 1000 1292 Aug 28  2019 spark-defaults.conf.template
-rwxr-xr-x 1 root root 4471 Jan  4 00:36 spark-env.sh
-rwxr-xr-x 1 1000 1000 4221 Aug 28  2019 spark-env.sh.template
[root@nodefour conf]# vi hive-site.xml 

Add the following (remember to replace the address with your own hostname mapping or IP):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://192.168.202.204:9083</value>
</property>
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
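
The two properties above point Spark at the remote Hive metastore and relax the metastore schema-version check. If you access Hive from a standalone Spark application rather than spark-shell, Hive support must also be enabled when building the SparkSession. A minimal sketch (the application name is arbitrary):

import org.apache.spark.sql.SparkSession

// Builds a session that picks up hive-site.xml from $SPARK_HOME/conf
// and uses the Hive metastore as its catalog.
val spark = SparkSession.builder()
  .appName("SparkSQLHiveDemo")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()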

(4) Create a table in Hive

# Create a Hive table
hive>create table toronto(full_name string, ssn string, office_address string);
hive>insert into toronto(full_name, ssn, office_address) values('John S. ', '111-222-333 ', '123 Yonge Street ');

(5) Start the Hive metastore service

hive --service metastore &

Then start the Spark shell:

spark-shell
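
As a quick sanity check, you can confirm from inside spark-shell that the Hive catalog is actually in use (a small sketch):

// In spark-shell, `spark` is the pre-built SparkSession.
spark.conf.get("spark.sql.catalogImplementation")   // expected: "hive"
spark.catalog.listDatabases.show(false)             // should list default, demo, exam, test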

(6) Working with Hive tables from the Spark shell

Show the databases and tables:

scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
|     default|
|        demo|
|        exam|
|        test|
+------------+


scala> val priors = spark.sql("show tables")
priors: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]

scala> val priors = spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|       t1|      false|
| default|  toronto|      false|
+--------+---------+-----------+

priors: Unit = ()

scala> val x=spark.sql("select count(*) from toronto")
x: org.apache.spark.sql.DataFrame = [count(1): bigint]

scala> x.show
+--------+
|count(1)|
+--------+
|       1|
+--------+

Create a table named people:

scala> spark.sql("create table if not exists people(id int,name string)")
res18: org.apache.spark.sql.DataFrame = []

Insert data into the people table:

scala> spark.sql("insert into people values (1,'zhangsan'),(2,'lisi'),(3,'wangwu')")
res19: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from people").show()
+---+--------+
| id|    name|
+---+--------+
|  1|zhangsan|
|  2|    lisi|
|  3|  wangwu|
+---+--------+
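
Rows can also be appended through the DataFrame API instead of an INSERT statement. A small sketch (the extra row (4, 'zhaoliu') is just an illustrative value):

// In spark-shell the implicits needed for toDF are already imported;
// in a standalone application add: import spark.implicits._
val newPeople = Seq((4, "zhaoliu")).toDF("id", "name")
newPeople.write.insertInto("people")     // appends to the existing Hive table
spark.sql("select * from people").show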


scala> spark.sql("create table if not exists people_age(name string,age int) row format delimited fields terminated by ','")
res21: org.apache.spark.sql.DataFrame = []

scala> spark.sql("load data local inpath '/opt/datas/age.txt' into table people_age")
res22: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from people_age").show
+--------+----+
|    name| age|
+--------+----+
|zhangsan|  18|
|        |null|
|    lisi|  24|
|        |null|
|  wangwu|  22|
+--------+----+
scala> spark.sql("select * from people_age").show
+--------+---+
|    name|age|
+--------+---+
|zhangsan| 18|
|    lisi| 24|
|  wangwu| 22|
+--------+---+


scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age desc").show
+---+------+---+
| id|  name|age|
+---+------+---+
|  2|  lisi| 24|
|  3|wangwu| 22|
+---+------+---+


scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>10 order by age desc").show
+---+--------+---+
| id|    name|age|
+---+--------+---+
|  2|    lisi| 24|
|  3|  wangwu| 22|
|  1|zhangsan| 18|
+---+--------+---+


scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>10 order by age desc limit 2").show
+---+------+---+
| id|  name|age|
+---+------+---+
|  2|  lisi| 24|
|  3|wangwu| 22|
+---+------+---+


scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>10 order by age desc limit 1").show
+---+----+---+
| id|name|age|
+---+----+---+
|  2|lisi| 24|
+---+----+---+
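
The same join can also be written with the DataFrame API instead of SQL, using the two tables created above; a sketch:

// In spark-shell the $-column syntax is available by default;
// in a standalone application add: import spark.implicits._
val joined = spark.table("people")
  .join(spark.table("people_age"), "name")   // join on the shared name column
  .where($"age" > 18)
  .orderBy($"age".desc)

joined.select("id", "name", "age").show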


scala> val resultDF = spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age")
resultDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> resultDF.write.saveAsTable("people_result")
scala> spark.sql("show tables").show
+--------+-------------+-----------+
|database|    tableName|isTemporary|
+--------+-------------+-----------+
| default|       people|      false|
| default|   people_age|      false|
| default|people_result|      false|
| default|           t1|      false|
| default|      toronto|      false|
+--------+-------------+-----------+
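
saveAsTable creates a managed table (Parquet format by default) and raises an error if the table already exists; the save mode and format can be set explicitly. A minimal sketch:

// Replace the table if it already exists instead of failing.
resultDF.write
  .mode("overwrite")
  .format("parquet")
  .saveAsTable("people_result")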


scala> spark.table("people_result").show
+---+------+---+
| id|  name|age|
+---+------+---+
|  3|wangwu| 22|
|  2|  lisi| 24|
+---+------+---+


scala> val df = spark.sql("select * from people_result")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> df.show
+---+------+---+
| id|  name|age|
+---+------+---+
|  3|wangwu| 22|
|  2|  lisi| 24|
+---+------+---+


scala> df.select("name").show //df.select("*").show
+------+
|  name|
+------+
|wangwu|
|  lisi|
+------+


scala> df.select("*").show
+---+------+---+
| id|  name|age|
+---+------+---+
|  3|wangwu| 22|
|  2|  lisi| 24|
+---+------+---+


scala> df.where("id=2").show //df.select("name").where("id=2").show
+---+----+---+
| id|name|age|
+---+----+---+
|  2|lisi| 24|
+---+----+---+


scala> df.select("name").where("id=2").show
+----+
|name|
+----+
|lisi|
+----+


scala> df.filter("age<23").show
+---+------+---+
| id|  name|age|
+---+------+---+
|  3|wangwu| 22|
+---+------+---+


scala> df.sort("id").show //df.sort($"id".desc).show
+---+------+---+
| id|  name|age|
+---+------+---+
|  2|  lisi| 24|
|  3|wangwu| 22|
+---+------+---+


scala> val df_new = df.toDF
df_new: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> df_new.show
+---+------+---+
| id|  name|age|
+---+------+---+
|  3|wangwu| 22|
|  2|  lisi| 24|
+---+------+---+


scala> val df_new = df.toDF("newTD","newName","newAge")
df_new: org.apache.spark.sql.DataFrame = [newTD: int, newName: string ... 1 more field]

scala> df_new.show
+-----+-------+------+
|newTD|newName|newAge|
+-----+-------+------+
|    3| wangwu|    22|
|    2|   lisi|    24|
+-----+-------+------+


scala> df.collect
res47: Array[org.apache.spark.sql.Row] = Array([3,wangwu,22], [2,lisi,24])

scala> df.take(2)
res48: Array[org.apache.spark.sql.Row] = Array([3,wangwu,22], [2,lisi,24])

scala> df.first
res49: org.apache.spark.sql.Row = [3,wangwu,22]

scala> df.take(1)
res50: Array[org.apache.spark.sql.Row] = Array([3,wangwu,22])

scala> df.count
res51: Long = 2

scala> df.cache
res52: df.type = [id: int, name: string ... 1 more field]

scala> df.sample(0.5).show //df.sample(false, 0.5).show, where false means sampling without replacement
+---+------+---+
| id|  name|age|
+---+------+---+
|  3|wangwu| 22|
+---+------+---+

2. Spark SQL on a MySQL data source

Start spark-shell with the MySQL driver on the classpath:

spark-shell --master local[2] --jars /opt/install/spark/jars/mysql-connector-java-5.1.44-bin.jar

Connect to MySQL and load the table:

scala> val jdbcDF=spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/hive").option("dbtable", "hive.TBLS").option("user", "root").option("password", "123QWEasd!").option("driver","com.mysql.jdbc.Driver").load()
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
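
The same read can be expressed with spark.read.jdbc and a java.util.Properties object, which is convenient when several tables share one connection; a sketch with the same connection settings as above:

val props = new java.util.Properties()
props.setProperty("user", "root")
props.setProperty("password", "123QWEasd!")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Equivalent to the option-based form above.
val jdbcDF2 = spark.read.jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", props)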

Print the table schema:

scala> jdbcDF.printSchema
root
 |-- TBL_ID: long (nullable = true)
 |-- CREATE_TIME: integer (nullable = true)
 |-- DB_ID: long (nullable = true)
 |-- LAST_ACCESS_TIME: integer (nullable = true)
 |-- OWNER: string (nullable = true)
 |-- RETENTION: integer (nullable = true)
 |-- SD_ID: long (nullable = true)
 |-- TBL_NAME: string (nullable = true)
 |-- TBL_TYPE: string (nullable = true)
 |-- VIEW_EXPANDED_TEXT: string (nullable = true)
 |-- VIEW_ORIGINAL_TEXT: string (nullable = true)

Show the table contents:

scala> jdbcDF.show
+------+-----------+-----+----------------+--------------+---------+-----+--------------------+--------------+------------------+------------------+
|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME|         OWNER|RETENTION|SD_ID|            TBL_NAME|      TBL_TYPE|VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|
+------+-----------+-----+----------------+--------------+---------+-----+--------------------+--------------+------------------+------------------+
|     3| 1607589229|    6|               0|     anonymous|        0|    3|   employee_external|EXTERNAL_TABLE|              null|              null|
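
A DataFrame can also be written back to MySQL through the same JDBC data source. A minimal sketch (the target database test and table tbls_copy are hypothetical names used only for illustration):

// Appends the rows of jdbcDF into a MySQL table; mode("overwrite") would replace it.
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "test.tbls_copy")
  .option("user", "root")
  .option("password", "123QWEasd!")
  .option("driver", "com.mysql.jdbc.Driver")
  .mode("append")
  .save()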