Credit (original source):
https://blog.csdn.net/weixin_42490528/article/details/90708106
1. Using Spark SQL with a Hive data source
(1) Copy Hive's hive-site.xml into Spark's conf directory
cd /opt/install/hive/conf/
cp hive-site.xml ${SPARK_HOME}/conf
(2) Copy the MySQL connector jar into Spark's jars directory
cd /opt/install/hive/lib
cp mysql-connector-java-5.1.44-bin.jar ${SPARK_HOME}/jars
(3) Edit hive-site.xml under Spark's conf directory
[root@nodefour conf]# cd ${SPARK_HOME}/conf
[root@nodefour conf]# ll
total 48
-rw-r--r-- 1 1000 1000  996 Aug 28  2019 docker.properties.template
-rw-r--r-- 1 1000 1000 1105 Aug 28  2019 fairscheduler.xml.template
-rw-r--r-- 1 root root 1114 Jan  7 22:11 hive-site.xml
-rw-r--r-- 1 1000 1000 2025 Aug 28  2019 log4j.properties.template
-rw-r--r-- 1 1000 1000 7801 Aug 28  2019 metrics.properties.template
-rw-r--r-- 1 1000 1000  865 Aug 28  2019 slaves.template
-rw-r--r-- 1 1000 1000 1292 Aug 28  2019 spark-defaults.conf.template
-rwxr-xr-x 1 root root 4471 Jan  4 00:36 spark-env.sh
-rwxr-xr-x 1 1000 1000 4221 Aug 28  2019 spark-env.sh.template
[root@nodefour conf]# vi hive-site.xml
Add the following properties (replace the address with your own host mapping or IP):
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://192.168.202.204:9083</value>
</property>
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
(4) Create a table in Hive
# Create a Hive table
hive>create table toronto(full_name string, ssn string, office_address string);
hive>insert into toronto(full_name, ssn, office_address) values('John S. ', '111-222-333 ', '123 Yonge Street ');
(5) Start the Hive metastore service
hive --service metastore &
Then start the Spark shell:
spark-shell
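As an optional sanity check (not a required step), you can confirm from inside spark-shell that the Hive catalog is actually in use, for example:

// "hive" means the session picked up hive-site.xml and was built with Hive support;
// "in-memory" means it did not.
spark.conf.get("spark.sql.catalogImplementation")

// The Hive databases should also be visible through the catalog API.
spark.catalog.listDatabases.show()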
(6) Querying Hive from the Spark shell
Show the databases and tables:
scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
| default|
| demo|
| exam|
| test|
+------------+
scala> val priors = spark.sql("show tables")
priors: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
scala> val priors = spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| t1| false|
| default| toronto| false|
+--------+---------+-----------+
priors: Unit = ()
scala> val x=spark.sql("select count(*) from toronto")
x: org.apache.spark.sql.DataFrame = [count(1): bigint]
scala> x.show
+--------+
|count(1)|
+--------+
| 1|
+--------+
Create the people table
scala> spark.sql("create table if not exists people(id int,name string)")
res18: org.apache.spark.sql.DataFrame = []
Insert rows into the people table
scala> spark.sql("insert into people values (1,'zhangsan'),(2,'lisi'),(3,'wangwu')")
res19: org.apache.spark.sql.DataFrame = []
scala> spark.sql("select * from people").show()
+---+--------+
| id| name|
+---+--------+
| 1|zhangsan|
| 2| lisi|
| 3| wangwu|
+---+--------+
scala> spark.sql("create table if not exists people_age(name string,age int) row format delimited fields terminated by ','")
res21: org.apache.spark.sql.DataFrame = []
scala> spark.sql("load data local inpath '/opt/datas/age.txt' into table people_age")
res22: org.apache.spark.sql.DataFrame = []
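The contents of /opt/datas/age.txt are not shown in the original; judging from the query results below, the file would look roughly like this (name and age separated by a comma, with a couple of stray blank lines that produce the null rows):

zhangsan,18

lisi,24

wangwu,22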
scala> spark.sql("select * from people_age").show
+--------+----+
| name| age|
+--------+----+
|zhangsan| 18|
| |null|
| lisi| 24|
| |null|
| wangwu| 22|
+--------+----+
scala> spark.sql("select * from people_age").show
+--------+---+
| name|age|
+--------+---+
|zhangsan| 18|
| lisi| 24|
| wangwu| 22|
+--------+---+
scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age desc").show
+---+------+---+
| id| name|age|
+---+------+---+
| 2| lisi| 24|
| 3|wangwu| 22|
+---+------+---+
scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>10 order by age desc").show
+---+--------+---+
| id| name|age|
+---+--------+---+
| 2| lisi| 24|
| 3| wangwu| 22|
| 1|zhangsan| 18|
+---+--------+---+
scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>10 order by age desc limit 2").show
+---+------+---+
| id| name|age|
+---+------+---+
| 2| lisi| 24|
| 3|wangwu| 22|
+---+------+---+
scala> spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>10 order by age desc limit 1").show
+---+----+---+
| id|name|age|
+---+----+---+
| 2|lisi| 24|
+---+----+---+
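The same join can also be expressed with the DataFrame API instead of a SQL string. A rough equivalent of the age > 10 query above (a sketch, not the only way to write it; spark-shell already imports spark.implicits._, which provides the $ column syntax):

val people    = spark.table("people")
val peopleAge = spark.table("people_age")

people.join(peopleAge, Seq("name"))   // inner join on the name column
  .where($"age" > 10)
  .orderBy($"age".desc)
  .select("id", "name", "age")
  .show()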
scala> val resultDF = spark.sql("select people.id,people.name,people_age.age from people join people_age on people.name=people_age.name where people_age.age>18 order by age")
resultDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> resultDF.write.saveAsTable("people_result")
scala> spark.sql("show tables").show
+--------+-------------+-----------+
|database| tableName|isTemporary|
+--------+-------------+-----------+
| default| people| false|
| default| people_age| false|
| default|people_result| false|
| default| t1| false|
| default| toronto| false|
+--------+-------------+-----------+
scala> spark.table("people_result").show
+---+------+---+
| id| name|age|
+---+------+---+
| 3|wangwu| 22|
| 2| lisi| 24|
+---+------+---+
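By default saveAsTable fails if the target table already exists, so re-running the step above would raise an error. A small sketch of setting an explicit save mode (choose the mode that fits your case):

// Overwrite the table if it exists, or use "append" to add rows instead.
resultDF.write.mode("overwrite").saveAsTable("people_result")

// Writing to files instead of a managed table also works, e.g. Parquet:
// resultDF.write.mode("overwrite").parquet("/tmp/people_result_parquet")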
scala> val df = spark.sql("select * from people_result")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df.show
+---+------+---+
| id| name|age|
+---+------+---+
| 3|wangwu| 22|
| 2| lisi| 24|
+---+------+---+
scala> df.select("name").show //df.select("*").show
+------+
| name|
+------+
|wangwu|
| lisi|
+------+
scala> df.select("*").show
+---+------+---+
| id| name|age|
+---+------+---+
| 3|wangwu| 22|
| 2| lisi| 24|
+---+------+---+
scala> df.where("id=2").show //df.select("name").where("id=2").show
+---+----+---+
| id|name|age|
+---+----+---+
| 2|lisi| 24|
+---+----+---+
scala> df.select("name").where("id=2").show
+----+
|name|
+----+
|lisi|
+----+
scala> df.filter("age<23").show
+---+------+---+
| id| name|age|
+---+------+---+
| 3|wangwu| 22|
+---+------+---+
scala> df.sort("id").show //df.sort($"id".desc).show
+---+------+---+
| id| name|age|
+---+------+---+
| 2| lisi| 24|
| 3|wangwu| 22|
+---+------+---+
scala> val df_new = df.toDF
df_new: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df_new.show
+---+------+---+
| id| name|age|
+---+------+---+
| 3|wangwu| 22|
| 2| lisi| 24|
+---+------+---+
scala> val df_new = df.toDF("newTD","newName","newAge")
df_new: org.apache.spark.sql.DataFrame = [newTD: int, newName: string ... 1 more field]
scala> df_new.show
+-----+-------+------+
|newTD|newName|newAge|
+-----+-------+------+
| 3| wangwu| 22|
| 2| lisi| 24|
+-----+-------+------+
scala> df.collect
res47: Array[org.apache.spark.sql.Row] = Array([3,wangwu,22], [2,lisi,24])
scala> df.take(2)
res48: Array[org.apache.spark.sql.Row] = Array([3,wangwu,22], [2,lisi,24])
scala> df.first
res49: org.apache.spark.sql.Row = [3,wangwu,22]
scala> df.take(1)
res50: Array[org.apache.spark.sql.Row] = Array([3,wangwu,22])
scala> df.count
res51: Long = 2
scala> df.cache
res52: df.type = [id: int, name: string ... 1 more field]
scala> df.sample(0.5).show //or: df.sample(false, 0.5).show
+---+------+---+
| id| name|age|
+---+------+---+
| 3|wangwu| 22|
+---+------+---+
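Everything above runs inside spark-shell, which builds the Hive-enabled session for you. In a standalone application you create the session yourself; a minimal sketch (object name and query are illustrative):

import org.apache.spark.sql.SparkSession

object HiveQueryDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport makes spark.sql / spark.table go through the Hive metastore,
    // provided hive-site.xml is on the classpath (e.g. in ${SPARK_HOME}/conf).
    val spark = SparkSession.builder()
      .appName("HiveQueryDemo")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("select * from people_age").show()
    spark.stop()
  }
}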
2. Using Spark SQL with a MySQL data source
Start spark-shell with the MySQL connector jar on the classpath:
spark-shell --master local[2] --jars /opt/install/spark/jars/mysql-connector-java-5.1.44-bin.jar
Connect and load the table:
scala> val jdbcDF=spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/hive").option("dbtable", "hive.TBLS").option("user", "root").option("password", "123QWEasd!").option("driver","com.mysql.jdbc.Driver").load()
jdbcDF: org.apache.spark.sql.DataFrame = [TBL_ID: bigint, CREATE_TIME: int ... 9 more fields]
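The same read can also be written with the shorter read.jdbc(...) overload, passing the connection details as java.util.Properties (a rough equivalent of the option(...) chain above):

import java.util.Properties

val connProps = new Properties()
connProps.put("user", "root")
connProps.put("password", "123QWEasd!")
connProps.put("driver", "com.mysql.jdbc.Driver")

// arguments: JDBC url, table name, connection properties
val jdbcDF2 = spark.read.jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", connProps)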
Print the table schema:
scala> jdbcDF.printSchema
root
|-- TBL_ID: long (nullable = true)
|-- CREATE_TIME: integer (nullable = true)
|-- DB_ID: long (nullable = true)
|-- LAST_ACCESS_TIME: integer (nullable = true)
|-- OWNER: string (nullable = true)
|-- RETENTION: integer (nullable = true)
|-- SD_ID: long (nullable = true)
|-- TBL_NAME: string (nullable = true)
|-- TBL_TYPE: string (nullable = true)
|-- VIEW_EXPANDED_TEXT: string (nullable = true)
|-- VIEW_ORIGINAL_TEXT: string (nullable = true)
Show the table contents:
scala> jdbcDF.show
+------+-----------+-----+----------------+--------------+---------+-----+--------------------+--------------+------------------+------------------+
|TBL_ID|CREATE_TIME|DB_ID|LAST_ACCESS_TIME| OWNER|RETENTION|SD_ID| TBL_NAME| TBL_TYPE|VIEW_EXPANDED_TEXT|VIEW_ORIGINAL_TEXT|
+------+-----------+-----+----------------+--------------+---------+-----+--------------------+--------------+------------------+------------------+
| 3| 1607589229| 6| 0| anonymous| 0| 3| employee_external|EXTERNAL_TABLE| null| null|
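Going the other direction, a DataFrame can be written back to MySQL through the same jdbc source. A minimal sketch (the database "test" and table "people_result_mysql" are placeholders, adjust them to your setup):

// Write a Spark table out to MySQL over JDBC.
spark.table("people_result")
  .write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("dbtable", "people_result_mysql")
  .option("user", "root")
  .option("password", "123QWEasd!")
  .option("driver", "com.mysql.jdbc.Driver")
  .mode("append")   // or "overwrite" to drop and recreate the target table
  .save()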
