01. Create the SparkSession and set the log level
from pyspark.sql import SparkSession
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.10")\
    .config("spark.ui.showConsoleProgress", "false")\
    .appName("hdfs_hive").master("local[*]").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
02. Select the Hive database and list all of its tables
spark.sql("""
use hive_test_one
""")
spark.sql("""
show tables
""").show()
Output:
+-------------+-------------+-----------+
| database| tableName|isTemporary|
+-------------+-------------+-----------+
|hive_test_one|seeds_dataset| false|
+-------------+-------------+-----------+
03. Load the data from HDFS
The schema is inspected here so it can be used when writing the CREATE TABLE statement in Hive.
hdfs_df = spark.read.csv(
    "hdfs://192.168.1.10:9000/HadoopFileS/DataSet/MLdataset/Customers.csv",
    inferSchema=True, header=True)
print(hdfs_df.count())
hdfs_df.printSchema()
Output:
1000
root
|-- Dummy_Id: string (nullable = true)
|-- Email: string (nullable = true)
|-- Address: string (nullable = true)
|-- Avatar: double (nullable = true)
|-- Avg Session Length: double (nullable = true)
|-- Time on App: double (nullable = true)
|-- Time on Website: double (nullable = true)
|-- Length of Membership: double (nullable = true)
|-- Yearly Amount Spent: string (nullable = true)
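When a CSV has many columns, the printed schema can also be turned into the Hive DDL programmatically instead of typing it by hand. A minimal plain-Python sketch (the `build_hive_ddl` helper is hypothetical, and the `(name, type)` pairs are copied from the `printSchema()` output above; column names containing spaces are rewritten with underscores, since unquoted Hive identifiers cannot contain spaces):

```python
# Hypothetical helper: build a Hive CREATE TABLE statement from (name, type) pairs.
def build_hive_ddl(table, columns):
    # Unquoted Hive identifiers cannot contain spaces, so replace them with underscores.
    cols = ",\n  ".join(f"{name.replace(' ', '_')} {dtype}" for name, dtype in columns)
    return (f"create table IF NOT EXISTS {table}(\n  {cols}\n)\n"
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")

# (name, type) pairs read off the printSchema() output above.
schema = [("Dummy_Id", "string"), ("Email", "string"), ("Address", "string"),
          ("Avatar", "double"), ("Avg Session Length", "double"),
          ("Time on App", "double"), ("Time on Website", "double"),
          ("Length of Membership", "double"), ("Yearly Amount Spent", "string")]

ddl = build_hive_ddl("Customers", schema)
print(ddl)
```

The string this produces matches the statement run in step 04, and the same helper could be driven directly from `hdfs_df.schema` on a live session.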
04. Create the Hive table and inspect it
spark.sql("""
create table IF NOT EXISTS Customers(
Dummy_Id string ,
Email string ,
Address string,
Avatar double ,
Avg_Session_Length double ,
Time_on_App double,
Time_on_Website double ,
Length_of_Membership double ,
Yearly_Amount_Spent string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")
spark.sql("show tables").show()
spark.sql("select * from Customers").show()
Output: the table has been created (note that Hive stores the name lowercased as customers), but it contains no data yet
+-------------+-------------+-----------+
| database| tableName|isTemporary|
+-------------+-------------+-----------+
|hive_test_one| customers| false|
|hive_test_one|seeds_dataset| false|
+-------------+-------------+-----------+
05. Register the data to be imported as a temporary view
hdfs_df.createOrReplaceTempView("hdfs_df")
spark.sql("select * from hdfs_df").show()
Output:
06. Insert the data into the Hive table
spark.sql("""
insert into Customers select * from hdfs_df
""")
spark.sql("select * from Customers").show()
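Note that `insert into Customers select * from hdfs_df` matches columns by position, not by name: the select must produce columns in exactly the order the Hive table declares them. That is why the insert works here even though the DataFrame names contain spaces while the Hive names use underscores. A plain-Python sanity check of that alignment (column lists copied from the two schemas above):

```python
# Columns of the Hive table (target) and the temp view (source), in declared order.
hive_cols = ["Dummy_Id", "Email", "Address", "Avatar", "Avg_Session_Length",
             "Time_on_App", "Time_on_Website", "Length_of_Membership",
             "Yearly_Amount_Spent"]
df_cols = ["Dummy_Id", "Email", "Address", "Avatar", "Avg Session Length",
           "Time on App", "Time on Website", "Length of Membership",
           "Yearly Amount Spent"]

# A positional insert needs the column counts to match and each target/source
# pair to be type-compatible; the names themselves may differ.
assert len(hive_cols) == len(df_cols)
for target, source in zip(hive_cols, df_cols):
    # Here the names line up once spaces are normalized to underscores.
    assert target == source.replace(" ", "_")
```

On a live session the same check could be driven from `hdfs_df.columns` and the output of `describe Customers`.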
Output:
07. Drop the table from Hive
# drop the Customers table from Hive
spark.sql("drop table Customers")
spark.sql("show tables").show()
Output: the isTemporary column indicates whether a table is a temporary view; for example, the hdfs_df view registered earlier is temporary.