First, I connected to Hive in the following way:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  // Register the Hive JDBC driver
  Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
  val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .getOrCreate()
  // Read the Hive table through HiveServer2 over JDBC
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:hive2://34.223.237.55:10000/default")
    .option("dbtable", "students1")
    .option("user", "hduser")
    .option("password", "hadoop")
    //.option("driver", "org.apache.hive.jdbc.HiveDriver")
    .load()
  println("able to connect------------------")
  jdbcDF.show()
  jdbcDF.printSchema()
  jdbcDF.createOrReplaceTempView("std")
  val sqlDF = spark.sql("select * from std")
  println("Start println-----")
  spark.sqlContext.sql("select * from std").collect().foreach(println)
  println("end println-----")
  sqlDF.show(false)
}
This does connect to Hive, but it only returns the column names; no actual row data comes back.
I searched online, and the suggested fix was to add a fetchsize option:
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://34.223.237.55:10000/default")
  .option("dbtable", "students1")
  .option("user", "hduser")
  .option("password", "hadoop")
  .option("fetchsize", "20")
  //.option("driver", "org.apache.hive.jdbc.HiveDriver")
  .load()
That attempt threw an array-index-out-of-bounds error, so I gave up on this route. If anyone knows why, please leave a comment, thanks.
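For what it's worth, a commonly reported cause of the column-names-but-no-data symptom is that Spark's generic JDBC dialect quotes identifiers with double quotes, which Hive parses as string literals, so every row comes back as the column names themselves. Below is a minimal sketch of the usual workaround, registering a custom JdbcDialect that quotes with backticks instead; I have not verified it against this exact setup, so treat it as an assumption:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Sketch: route jdbc:hive2 URLs through Hive-style backtick identifier quoting
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  // Hive reads "col" as a string literal; `col` is the identifier form
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register the dialect before calling spark.read.format("jdbc")...
JdbcDialects.registerDialect(HiveDialect)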
I then tried another approach, shown below:
import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, RowFactory, SQLContext, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def main(args: Array[String]): Unit = {
  Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance()
  val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .getOrCreate()
  val sqlContext: SQLContext = spark.sqlContext
  // Open a plain JDBC connection to HiveServer2
  val conn = DriverManager.getConnection("jdbc:hive2://xxxxxxx:10000/dbname", "username", "password")
  /**
   * Step 3: create a Statement handle; all SQL operations go through it.
   */
  val state = conn.createStatement()
  val sql = "select * from data_xy_b_content_asr_nlp limit 100"
  val resultSet = state.executeQuery(sql)
  val rowList = new scala.collection.mutable.MutableList[Row]
  var cRow: Row = null
  // Loop over the ResultSet
  while (resultSet.next()) {
    // Pack the first two columns into a Row object
    cRow = RowFactory.create(resultSet.getObject(1), resultSet.getObject(2))
    // Collect each Row into the list
    rowList += cRow
  }
  // Build a DataFrame from the collected rows
  val DF = sqlContext.createDataFrame(sc.parallelize(rowList, 2), getSchema())
  DF.show() // show the dataframe
  DF.repartition(1).write.parquet("output_path")
}
def getSchema(): StructType = {
  // Field names/types must line up positionally with the values packed into each Row above
  val schema = StructType(
    StructField("COUNT", StringType, false) ::
    StructField("TABLE_NAME", StringType, false) :: Nil)
  // Return the schema that defines the DataFrame columns
  schema
}
This second approach treats Hive the same way you would treat MySQL: pull the rows over plain JDBC, convert them into a DataFrame, and carry on from there.
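As a side note, instead of hardcoding the schema you could derive it from the ResultSet's metadata. A minimal sketch, where schemaFromMetadata is a hypothetical helper that maps every column to a nullable String for simplicity (so you would read the values as strings accordingly):

import java.sql.ResultSetMetaData
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: build a StructType from JDBC metadata instead of hardcoding it
def schemaFromMetadata(md: ResultSetMetaData): StructType = {
  val fields = (1 to md.getColumnCount).map { i =>
    // Treat every column as a nullable String for simplicity
    StructField(md.getColumnLabel(i), StringType, nullable = true)
  }
  StructType(fields)
}

// Usage against the ResultSet from above:
// val schema = schemaFromMetadata(resultSet.getMetaData)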
Hope this helps.