Structured Data in Spark SQL
To let Spark SQL access Hive-managed data, copy the following two configuration files into the conf directory under the Spark installation directory:
core-site.xml
hive-site.xml
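As a concrete sketch, assuming Hadoop, Hive, and Spark are installed under /opt/module (hypothetical paths; adjust to your environment), the copy step might look like:

```shell
# Hypothetical install locations -- substitute your own paths
cp /opt/module/hadoop/etc/hadoop/core-site.xml /opt/module/spark/conf/
cp /opt/module/hive/conf/hive-site.xml /opt/module/spark/conf/
```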
Then start spark-shell.
Interactive (command-line) mode
scala> sc
res4: org.apache.spark.SparkContext = org.apache.spark.SparkContext@27ea552d
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> val hc = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@790431bf
scala> val rows = hc.sql("select * from zpark.users")
rows: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> rows.first
res5: org.apache.spark.sql.Row = [1,zhangsan]
scala>
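The deprecation warning in the transcript appears because HiveContext is superseded in Spark 2.x by SparkSession with Hive support enabled. An equivalent session could be built as sketched below (not runnable without a Spark installation and the Hive config files copied as above; the zpark.users table is the one from the transcript):

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.x replacement for HiveContext: a SparkSession with Hive support.
// It picks up hive-site.xml from Spark's conf directory.
val spark = SparkSession.builder()
  .appName("hiveQuery")
  .enableHiveSupport()
  .getOrCreate()

val rows = spark.sql("select * from zpark.users")
rows.first()
```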
Programmatic mode
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

object SparkSqlReadJson {
  val inputFilePath = "F:/input1/sparkdata/people.json"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("readJson"))
    // Create a HiveContext from the SparkContext
    val hc = new HiveContext(sc)
    // Read the JSON file into a DataFrame via hc.read.json
    val frame: DataFrame = hc.read.json(inputFilePath)
    // Register the DataFrame as a temporary view so it can be queried with SQL
    frame.createOrReplaceTempView("jsontable")
    // Run a SQL query against the view through the HiveContext
    val res: DataFrame = hc.sql("select name from jsontable")
    println(res.first())
    sc.stop()
  }
}
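Once the object above is packaged into a jar, it could be submitted with something like the following (the jar name is hypothetical; the class name and master setting match the example):

```shell
# Hypothetical jar path; --master local matches setMaster("local") in the code
spark-submit \
  --class SparkSqlReadJson \
  --master local \
  sparksql-demo.jar
```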
Note:
If spark-shell fails to start, with the key error message:
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
the cause is a missing MySQL JDBC driver. Add the following lines to spark-defaults.conf, pointing them at the location of the driver jar:
spark.executor.extraClassPath /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar
spark.driver.extraClassPath /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar
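Alternatively, instead of editing spark-defaults.conf, the driver jar can be supplied per session on the spark-shell command line:

```shell
# Same jar as above: --driver-class-path covers the driver,
# --jars ships the jar to the executors
spark-shell \
  --driver-class-path /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar \
  --jars /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar
```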