So far we have worked with Hive on Hadoop; now we connect to Hive from Spark.
Enter the Spark directory: cd opt/spark-2.2.1-bin-hadoop2.7
Enter the conf directory under Spark: cd conf
Copy the Hive config file over: cp /home/hadoop/apache-hive-2.3.2-bin/conf/hive-site.xml hive-site.xml
Go back up one level, enter sbin, and run the following:
cd ..
cd sbin
start-thriftserver.sh
jps shows no SparkSubmit process, so check the logs:
cd ..
cd logs
cat spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-python2.localdomain.out
The log contains the error: ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
Spark is written in Scala, so its jar directory is named jars. Enter the jars directory:
cd ..
cd jars
cp /home/hadoop/apache-hive-2.3.2-bin/lib/mysql-connector-java-5.1.38.jar mysql-connector-java-5.1.38.jar
Restart start-thriftserver.sh. Now jps shows one more process: SparkSubmit.
But the logs show another exception:
because the Hive version we are using is 2.3, while the Hive client bundled with Spark is 1.2.
Go back to Spark's conf directory and edit the file we copied earlier: vi hive-site.xml
Search for: hive.metastore.schema.verification
and change true to false.
Then restart start-thriftserver.sh.
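The property being changed in hive-site.xml looks roughly like this (a sketch; the rest of your hive-site.xml stays as copied from Hive):

```xml
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
  <!-- Disabled so the Hive 1.2 client bundled with Spark can talk to
       the Hive 2.3 metastore without a schema-version check failure -->
</property>
```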
Finally, start Spark and run a few queries to confirm the Hive connection.
With Spark connected to Hive, list the databases and tables:
print(spark.sql("show databases"))  # prints the DataFrame object, not its rows
spark.sql("show databases").show()  # .show() prints the actual rows
spark.sql("use school")
spark.sql("show tables").show()
View the table's contents and structure:
spark.sql("select * from tb_student2").show()
spark.sql("desc formatted tb_student2").show()
datas=[
'9,test1,男,12',
'10,test2,男,15',
'11,test3,男,16',
]
from pyspark.sql.types import *
rdd=sc.parallelize(datas)
rdd=rdd.map(lambda x:x.split(','))
rdd=rdd.map(lambda x:[int(x[0]),x[1],x[2],int(x[3])])
print(rdd.collect())
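The two map calls above parse each CSV line and cast the numeric fields; the same per-record logic can be checked in plain Python without Spark (parse_record is a hypothetical helper name, not from the original code):

```python
# Per-record transformation equivalent to the two rdd.map steps above:
# split on commas, then cast id and age to int.
def parse_record(line):
    fields = line.split(',')
    return [int(fields[0]), fields[1], fields[2], int(fields[3])]

datas = ['9,test1,男,12', '10,test2,男,15', '11,test3,男,16']
rows = [parse_record(line) for line in datas]
print(rows)  # [[9, 'test1', '男', 12], [10, 'test2', '男', 15], [11, 'test3', '男', 16]]
```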
# convert the RDD to a DataFrame
sch=StructType([StructField('id',IntegerType(),True),StructField('name',StringType(),True),StructField('sex',StringType(),True),StructField('age',IntegerType(),True)])
df=spark.createDataFrame(rdd,sch)
# create a temporary view
df.createOrReplaceTempView('tmpV')
spark.sql("select * from tmpV").show()
# insert the temp view's rows into the existing table
#spark.sql("insert into table tb_student2 select * from tmpV")
#spark.sql("select * from tb_student2").show()
# put name and age into a new table
spark.sql("create table student2 as select name,age from tmpV")
spark.sql("select * from student2").show()
Sum the ages in the table, and create a parquet-format table:
spark.sql("select sum(age) from student2").show()
strsql='''
create table name_age2(name string,age integer)
stored as parquet
'''
#spark.sql(strsql)
Compute count, average, sum, max, and min, grouped by sex:
spark.sql("select count(age),avg(age),sum(age),max(age),min(age) from tb_student2 group by sex").show()
from pyspark.sql import functions as F
df=spark.sql("select * from tb_student2 where sex='男'")
df.show()
df.select(df.age).distinct().show()  # drop duplicate age values
df.agg(F.max(df.age)).show()
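As a sanity check, the aggregates for the three sample records defined earlier (ages 12, 15, 16) can be computed in plain Python, independent of the Spark session:

```python
# Ages of the three sample '男' records defined above.
ages = [12, 15, 16]

count_age = len(ages)            # count(age)
avg_age = sum(ages) / len(ages)  # avg(age)
sum_age = sum(ages)              # sum(age)
max_age, min_age = max(ages), min(ages)

print(count_age, sum_age, max_age, min_age)  # 3 43 16 12
print(round(avg_age, 2))                     # 14.33
```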
from pyspark.sql import functions as F
# append the character '岁' (years old) after each age
df=spark.sql("select *,concat(age,'岁') as sui from tb_student2 where sex='男'")
df.show()
def busui(age):
    return str(age)+'岁'
scx=sqlContext
scx.registerFunction("bu",busui)
# UDF registration, method 1: via sqlContext.registerFunction
df=spark.sql("select *,bu(age) as bu from tb_student2 where sex='男'")
df.show()
# UDF registration, method 2: via functions.udf
bu=F.udf(busui,StringType())
df=df.withColumn('aa',bu(df.age))
df.show()
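The busui function underlying both registrations is plain string concatenation, so its behavior can be verified without a Spark session:

```python
# Same helper as registered above: append '岁' (years old) to the age.
def busui(age):
    return str(age) + '岁'

print(busui(12))    # 12岁
print(busui('15'))  # 15岁 — the str() cast makes it work for strings too
```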
![](https://i-blog.csdnimg.cn/blog_migrate/cd470bfe1d7ac3fd686f1750ba9c16bf.png)