I. SQLContext
1. Applicable Spark version: Spark 1.x
2. Add the dependencies
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
    <scope>compile</scope>
</dependency>
3. Code
(1) Create the context
(2) Perform the processing (load the data)
(3) Close the connection
package MoocSparkSQL

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

/**
 * Usage of SQLContext
 */
object SQLContextApp {
  def main(args: Array[String]): Unit = {
    val path = args(0)
    //val path = "file:///E:\\Tools\\WorkspaceforMyeclipse\\sparksqlworking\\data\\people.json"
    //1) Create the corresponding context
    val sparkConf = new SparkConf()
      .setAppName("SQLContextApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)
    //2) Perform the processing
    //people.json comes from /opt/modules/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json
    val people = sqlContext.read.format("json").load(path)
    people.printSchema()
    people.show()
    //3) Release resources: stop the SparkContext
    sc.stop()
  }
}
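Besides the DataFrame API, SQLContext can also execute SQL directly. A minimal sketch (not part of the original code; it assumes the same people DataFrame loaded above, and uses createOrReplaceTempView, which is available with the Spark 2.1.0 dependency declared earlier):

//Register the loaded DataFrame as a temporary view, then query it with SQL
people.createOrReplaceTempView("people")
sqlContext.sql("select name, age from people where age > 20").show()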
(4) Test
Enter the following in Program arguments:
file:///E:\\Tools\\WorkspaceforMyeclipse\\sparksqlworking\\data\\people.json
Result:
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
(5) Running on the server
(a) Comment out the setAppName and setMaster calls in the code
(b) Package with Maven
From the cmd command line, in the project root directory, run:
mvn clean package -DskipTests
(c) Check the build output
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ sparksql ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ sparksql ---
[INFO] Building jar: E:\Tools\WorkspaceforMyeclipse\sparksqlworking\target\sparksql-1.0.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24.444 s
[INFO] Finished at: 2019-01-16T22:15:33+08:00
[INFO] Final Memory: 24M/261M
[INFO] ------------------------------------------------------------------------
(d) Locate the jar file
Building jar: E:\Tools\WorkspaceforMyeclipse\sparksqlworking\target\sparksql-1.0.jar
(e) Upload it to the server at
/opt/datas/sparksql-1.0.jar
(f) Submit the Spark job
[Reference (official docs): https://spark.apache.org/docs/2.1.0/submitting-applications.html]
./bin/spark-submit \
--class sparkworking.sqlcontext \
--master local[2] \
/opt/datas/sparksql-1.0.jar \
file:///opt/modules/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json
Note: if the final path argument is not prefixed with file:///, it is resolved as an HDFS (Hadoop) path by default.
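To illustrate the note above, a hedged sketch of the HDFS variant (it assumes the file was first uploaded with hdfs dfs -put; the /user/hadoop target directory is only an illustration, not from the original):

hdfs dfs -put /opt/modules/spark-2.1.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json /user/hadoop/
./bin/spark-submit \
--class sparkworking.sqlcontext \
--master local[2] \
/opt/datas/sparksql-1.0.jar \
/user/hadoop/people.json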
II. HiveContext
1. Applicable Spark version: Spark 1.x
2. Prerequisites:
(1) A full Hive environment is not required
(2) hive-site.xml is required
Copy hive-site.xml into the project's resources directory, at the same level as the scala folder: ...\src\main\resources\hive-site.xml (a minimal sketch follows this list)
(3) Start the metastore, if Hive is configured:
bin/hive --service metastore &
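A minimal hive-site.xml sketch for reference (the hostname hadoop000 and port 9083 are placeholders, not from the original; point hive.metastore.uris at your own metastore):

<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop000:9083</value>
    </property>
</configuration>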
3. Add the dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
4. Code
package SparkSQL

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/**
 * Usage of HiveContext
 */
object HiveContextApp {
  def main(args: Array[String]): Unit = {
    // val path = args(0)
    //1) Create the corresponding context
    val sparkConf = new SparkConf()
      //Comment out the following line in production
      .setAppName("HiveContextApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    //2) Perform the processing
    hiveContext.table("emp").show() //reads and displays the Hive table emp
    //3) Close the context
    sc.stop()
  }
}
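Besides table(), HiveContext can also run HiveQL directly through sql(). A minimal sketch (not from the original; it assumes the same emp table exists in Hive):

hiveContext.sql("show tables").show()
hiveContext.sql("select deptno, count(1) from emp group by deptno").show()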
Server version:
package sparkworking

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object _02hivecontext {
  def main(args: Array[String]): Unit = {
    //1) Create the corresponding context
    val sparkconf = new SparkConf().setAppName("Hivesql").setMaster("local[2]")
      .set("dfs.client.use.datanode.hostname", "true") //make DataNodes return hostnames instead of internal IPs
    val sc = SparkContext.getOrCreate(sparkconf)
    val hiveContext = new HiveContext(sc)
    //2) Perform the processing
    // val people = hiveContext.table("default.emp").show()
    hiveContext.table("emp").show()
    //3) Release resources
    sc.stop()
  }
}
5. Result
Server:
+-----+------+---------+----+----------+------+------+------+
|empno| ename| job| mgr| hiredate| sal| comm|deptno|
+-----+------+---------+----+----------+------+------+------+
| 7369| SMITH| CLERK|7902|1980-12-17| 800.0| null| 20|
| 7499| ALLEN| SALESMAN|7698| 1981-2-20|1600.0| 300.0| 30|
| 7521| WARD| SALESMAN|7698| 1981-2-22|1250.0| 500.0| 30|
| 7566| JONES| MANAGER|7839| 1981-4-2|2975.0| null| 20|
| 7654|MARTIN| SALESMAN|7698| 1981-9-28|1250.0|1400.0| 30|
| 7698| BLAKE| MANAGER|7839| 1981-5-1|2850.0| null| 30|
| 7782| CLARK| MANAGER|7839| 1981-6-9|2450.0| null| 10|
| 7788| SCOTT| ANALYST|7566| 1987-4-19|3000.0| null| 20|
| 7839| KING|PRESIDENT|null|1981-11-17|5000.0| null| 10|
| 7844|TURNER| SALESMAN|7698| 1981-9-8|1500.0| 0.0| 30|
| 7876| ADAMS| CLERK|7788| 1987-5-23|1100.0| null| 20|
| 7900| JAMES| CLERK|7698| 1981-12-3| 950.0| null| 30|
| 7902| FORD| ANALYST|7566| 1981-12-3|3000.0| null| 20|
| 7934|MILLER| CLERK|7782| 1982-1-23|1300.0| null| 10|
+-----+------+---------+----+----------+------+------+------+
III. SparkSession
1. Applicable Spark version: Spark 2.x
2. Code
(1) Local (virtual machine)
package SparkSQL

import org.apache.spark.sql.SparkSession

/**
 * Usage of SparkSession
 */
object SparkSessionApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkSessionApp")
      .master("local[2]")
      .enableHiveSupport() //only needed if Hive access is required
      .getOrCreate()
    val people = spark.read.json("datas/people.json")
    people.show()
    spark.stop()
  }
}
Result:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
(2) Server
package sparkworking

import org.apache.spark.sql.SparkSession

object _04overall {
  def main(args: Array[String]): Unit = {
    //1. Build the context
    val spark = SparkSession.builder()
      .config("dfs.client.use.datanode.hostname", "true") //make DataNodes return hostnames instead of internal IPs
      .appName("sparkDemo")
      .master("local[2]")
      .enableHiveSupport()
      .getOrCreate()
    //2. Perform the processing
    spark.sql("select * from emp").show()
    //3. Release resources
    spark.stop()
  }
}
3. Server test result
+-----+------+---------+----+----------+------+------+------+
|empno| ename| job| mgr| hiredate| sal| comm|deptno|
+-----+------+---------+----+----------+------+------+------+
| 7369| SMITH| CLERK|7902|1980-12-17| 800.0| null| 20|
| 7499| ALLEN| SALESMAN|7698| 1981-2-20|1600.0| 300.0| 30|
| 7521| WARD| SALESMAN|7698| 1981-2-22|1250.0| 500.0| 30|
| 7566| JONES| MANAGER|7839| 1981-4-2|2975.0| null| 20|
| 7654|MARTIN| SALESMAN|7698| 1981-9-28|1250.0|1400.0| 30|
| 7698| BLAKE| MANAGER|7839| 1981-5-1|2850.0| null| 30|
| 7782| CLARK| MANAGER|7839| 1981-6-9|2450.0| null| 10|
| 7788| SCOTT| ANALYST|7566| 1987-4-19|3000.0| null| 20|
| 7839| KING|PRESIDENT|null|1981-11-17|5000.0| null| 10|
| 7844|TURNER| SALESMAN|7698| 1981-9-8|1500.0| 0.0| 30|
| 7876| ADAMS| CLERK|7788| 1987-5-23|1100.0| null| 20|
| 7900| JAMES| CLERK|7698| 1981-12-3| 950.0| null| 30|
| 7902| FORD| ANALYST|7566| 1981-12-3|3000.0| null| 20|
| 7934|MILLER| CLERK|7782| 1982-1-23|1300.0| null| 10|
+-----+------+---------+----+----------+------+------+------+
IV. Combined usage example of the three
package _0728sql

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveContext

/**
 * Combined example of SQLContext / HiveContext / SparkSession
 */
object SparkSQLDemo {
  def main(args: Array[String]) {
    //System.setProperty("user.name","hadoop")
    //1. Build the context
    val conf = new SparkConf()
      .setAppName("SparkSQLDemo")
      .setMaster("local[*]")
    val sc = SparkContext.getOrCreate(conf)
    // val sc2 = new SparkContext(conf)
    /**
     * 2. If the Hive features are not needed, a plain SQLContext object is enough to execute SQL.
     * If Hive is needed, i.e. the Hive configuration must be read, create a HiveContext instead.
     *
     * If java.lang.OutOfMemoryError: PermGen space occurs, add:
     * -XX:PermSize=128M -XX:MaxPermSize=256M
     */
    val sqlContext = new HiveContext(sc)
    sqlContext.sql("select * from default.emp").show(5, false)
    //Certain Hive functions, such as the window function below, require a HiveContext
    sqlContext.sql(
      """select *,
        |row_number() over (partition by deptno order by sal desc)
        |from
        |default.emp""".stripMargin)
      .show(5, false)
    //3. On Spark 2.x and later, SparkSession is recommended
    //SparkSession == HiveContext + SQLContext
    val spark = SparkSession.builder()
      .appName("SparkSQLDemo")
      .master("local[*]")
      // .config("xxx","xx")
      .enableHiveSupport()
      .getOrCreate()
    spark.sql("select * from default.emp").show()
    spark.table("default.emp").show()
  }
}
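If the PermGen error mentioned in the comment above shows up, the flags can be set in the IDE's VM options when running locally, or passed through spark-submit's --driver-java-options when submitting. A hedged sketch (the class is the one above; the jar path reuses the earlier upload location purely as an illustration):

./bin/spark-submit \
--class _0728sql.SparkSQLDemo \
--master local[*] \
--driver-java-options "-XX:PermSize=128M -XX:MaxPermSize=256M" \
/opt/datas/sparksql-1.0.jar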
V. Differences among the three
1. In Spark 1.x, SQLContext and HiveContext must be distinguished:
if the Hive features are not needed, a plain SQLContext object is enough to execute SQL; if Hive is needed, i.e. the Hive configuration must be read, a HiveContext must be created.
2. In Spark 2.x, SparkSession is the unified entry point (see the sketch at the end of this list).
3. The client must be configured to use DataNode hostnames rather than internal IPs. Without this setting, the NameNode hands out the DataNodes' internal IPs, so a debugging environment outside the LAN may be unable to reach the DataNodes.
(1) Required configuration for HiveContext
val sparkconf=new SparkConf().setAppName("Hivesql").setMaster("local[2]")
.set("dfs.client.use.datanode.hostname","true")
(2) Required configuration for SparkSession
val spark=SparkSession.builder()
.config("dfs.client.use.datanode.hostname","true")
.appName("sparkDemo")
.master("local[2]")
.enableHiveSupport()
.getOrCreate()
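A minimal sketch for point 2 above (not from the original): a SparkSession also exposes the older entry points, so legacy SQLContext-style code keeps working against it. The file path reuses datas/people.json from the earlier local example; the app name is just a placeholder.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("unifiedEntry")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()
val sc = spark.sparkContext       //the underlying SparkContext
val sqlContext = spark.sqlContext //the legacy SQLContext view of the same session
sqlContext.read.json("datas/people.json").show()
spark.stop()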