1. SparkSQL Queries - Using the Spark Shell
1. The raw data in employees.json is as follows:
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
2. Create a directory in the HDFS file system (the local employees.json file will be uploaded into it)
hadoop fs -mkdir -p /user/luomk/source/
3. Upload the local file to HDFS
hadoop fs -copyFromLocal /opt/module/spark-2.1.1-bin-hadoop2.7/source/employees.json /user/luomk/source/
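Optionally, verify the upload before moving on (paths as in the steps above):
hadoop fs -ls /user/luomk/source/
hadoop fs -cat /user/luomk/source/employees.json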
4. Read the HDFS file in cluster mode (the relative path source/employees.json resolves against the HDFS home directory, here /user/luomk/, when HDFS is the default filesystem)
val df = spark.read.json("source/employees.json")
or
val peopleDF = spark.read.format("json").load("source/employees.json")
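Both calls infer the schema by scanning the JSON file. As a minimal sketch (field names taken from employees.json above), the schema can also be supplied explicitly to skip inference:
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("salary", LongType)
))
val dfWithSchema = spark.read.schema(schema).json("source/employees.json")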
5. Display and query (peopleDF refers to the second read variant above)
peopleDF.show()
peopleDF.filter($"salary" > 3300).show()
peopleDF.createOrReplaceTempView("employees")
spark.sql("SELECT * FROM employees").show()
spark.sql("SELECT * FROM employees WHERE salary >= 4000").show()
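The same DataFrame and temp view also support aggregations; a short sketch in both styles (avg comes from Spark's built-in SQL functions):
import org.apache.spark.sql.functions._
// DSL style: average salary across all employees
peopleDF.agg(avg($"salary")).show()
// SQL style: the same aggregation through the temp view
spark.sql("SELECT avg(salary) AS avg_salary FROM employees").show()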
2. SparkSQL Queries - Using IDEA
2.1 Packaging and running the program in IDEA work the same way as for Spark Core; the Maven dependencies need one new entry:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
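For reference, if the project were built with sbt instead of Maven, the equivalent dependency would look roughly like this (sparkVersion is a placeholder for the version in use, e.g. 2.1.1):
val sparkVersion = "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"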
2.2 The program is as follows:
package com.luomk.sql

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/**
 * @author luomingkui
 * @date 2018/6/16 12:59 AM
 * @desc
 */
object HelloWord {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sql").setMaster("local[*]")
    // SparkSession is the entry point for Spark SQL; it wraps SparkContext
    // and adds support for Spark SQL operations. In other words: Spark Core
    // uses SparkContext, while Spark SQL uses SparkSession.
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val df = spark.read.json("/Users/g2/workspace/myprive/spark/sparkDemo/sparkCore/sparkcore_sql/src/main/resources/employees.json")
    // Note: import the implicit conversions (needed for the $"column" syntax).
    import spark.implicits._
    println("=== Show the whole table ===")
    df.show()
    println("=== Show the table's schema ===")
    df.printSchema()
    println("=== DSL-style query ===")
    df.filter($"salary" > 3300).show()
    println("=== SQL-style query ===")
    // Register the DataFrame as a temporary view
    df.createOrReplaceTempView("employee")
    println("=== Conditional query ===")
    spark.sql("SELECT * FROM employee WHERE salary > 3300").show()
    spark.close()
  }
}
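As a follow-up sketch, the same DataFrame can also be converted into a strongly typed Dataset through a case class; Employee below is a hypothetical name whose fields mirror the two JSON columns:
// Must live at the top level (outside main) in compiled code so that
// spark.implicits._ can derive an encoder for it.
case class Employee(name: String, salary: Long)

val ds = df.as[Employee]
ds.filter(_.salary > 3300).map(_.name).show()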