一、Spark Closure Handling
------------------------------------------------------------
RDD: resilient distributed dataset, a fault-tolerant, distributed collection of data.
An RDD is described by five properties: a list of partitions, a compute function for each partition, dependencies on parent RDDs, an optional Partitioner (key-value RDDs only), and optional preferred locations for each partition.
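For reference, a simplified sketch of how these five properties show up in the org.apache.spark.rdd.RDD API (signatures abridged):
    protected def getPartitions: Array[Partition]                       // the list of partitions
    def compute(split: Partition, context: TaskContext): Iterator[T]    // compute function for one partition
    protected def getDependencies: Seq[Dependency[_]]                   // dependencies on parent RDDs
    @transient val partitioner: Option[Partitioner]                     // only defined for key-value RDDs
    protected def getPreferredLocations(split: Partition): Seq[String]  // preferred locations for a partition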
When a job runs, Spark breaks the RDD computation into tasks, and each task is executed by one executor. Before execution, Spark computes the task's closure: the variables and methods that must be visible to the executor in order to run its computation on the RDD (in foreach, for example). The closure is serialized and sent to every executor.
In local mode, the whole Spark program runs in a single JVM and objects are shared, so a counter incremented inside foreach does accumulate: every task updates the same reference.
In cluster mode it does not. The counter is captured by closure processing, so the counter on the driver is invisible to the worker nodes; each executor only ever sees its own deserialized copy.
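A minimal sketch of the behaviour described above, runnable in spark-shell (where sc already exists); the data values are arbitrary, and the accumulator at the end is the standard alternative rather than part of the notes above:
    // Naive counter: fine in local mode, but in cluster mode every executor
    // increments its own deserialized copy, so the driver-side value stays 0.
    var counter = 0
    val rdd = sc.parallelize(1 to 10)
    rdd.foreach(x => counter += x)
    println("counter = " + counter)   // 55 in local mode, 0 in cluster mode

    // Reliable cross-executor aggregation uses an accumulator instead.
    val acc = sc.longAccumulator("counter")
    rdd.foreach(x => acc.add(x))
    println("acc = " + acc.value)     // 55 in every mode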
二、Spark Application Deploy Modes [client mode and cluster mode]
--------------------------------------------------------------------
a.spark-submit --class xxx --deploy-mode (client | cluster) xx.jar
--deploy-mode specifies where the driver program runs: on a worker node inside the cluster or on the client host.
b.[client]
The driver runs on the client host. The client does not have to be a member of the cluster.
c.[cluster]
The driver program is handed to the Spark cluster and runs on one of the worker nodes.
The worker is a member of the cluster.
The exported jar must be placed somewhere visible to every worker node (e.g. HDFS).
d.In either mode, the RDD computation itself runs on the workers.
e.Verifying the deploy modes
1)Start the Spark cluster
2)Write the test program
package com.test.spark.scala

import java.net.{InetAddress, Socket}

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Sends the local host's IP plus a message to a remote socket listener,
  * so we can observe on which host each part of the job runs.
  */
object DeployModeTest {

  def printInfo(str: String): Unit = {
    val ip = InetAddress.getLocalHost.getHostAddress
    val sock = new Socket("192.168.231.205", 8888)
    val out = sock.getOutputStream
    out.write((ip + " : " + str + "\r\n").getBytes())
    out.flush()
    sock.close()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("DeployModeTest")
    conf.setMaster("spark://s201:7077")
    val sc = new SparkContext(conf)

    // Runs on the driver.
    printInfo("hello world")

    val rdd1 = sc.parallelize(1 to 10, 3)
    // These functions run on the executors.
    val rdd2 = rdd1.map(e => {
      printInfo(" map : " + e)
      e * 2
    })
    val rdd3 = rdd2.repartition(2)
    val rdd4 = rdd3.map(e => {
      printInfo(" map2 : " + e)
      e
    })
    val res = rdd4.reduce((a, b) => {
      printInfo("reduce : " + a + "," + b)
      a + b
    })

    // Runs on the driver again.
    printInfo("driver : " + res)
  }
}
3)Package it as a jar.
For cluster deploy mode, the jar must be placed somewhere every worker can read it, e.g. HDFS.
4)Copy the jar to s100 and distribute it to the same directory on every node.
5)Submit the job to the Spark cluster (a note on verifying the output follows the commands).
$s100> spark-submit --class com.test.spark.scala.DeployModeTest --master spark://s100:7077 --deploy-mode client TestSpark-2-1.0-SNAPSHOT.jar
//upload the jar to hdfs first, then submit in cluster mode
$> spark-submit --class com.test.spark.scala.DeployModeTest --master spark://s100:7077 --deploy-mode cluster hdfs://s500:8020/data/spark/TestSpark-2-1.0-SNAPSHOT.jar
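To actually see where each printInfo call ran, start a listener on the host the code writes to before submitting (the sample code uses 192.168.231.205; the later YARN section uses s500 for the same purpose):
$> nc -lk 8888
With --deploy-mode client, the "hello world" and "driver : ..." lines should arrive tagged with the client host's IP; with --deploy-mode cluster they should arrive from whichever worker was chosen to run the driver. The map/map2/reduce lines come from the executors in both modes.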
三、Spark Cluster Manager Modes
-----------------------------------------------------
1.The cluster modes differ mainly in which process acts as the cluster manager (master):
if it is Spark's own Master process, the mode is local or standalone;
if it is the Mesos master, the mode is Mesos;
if it is YARN's ResourceManager, the mode is YARN.
2.Selecting a mode
yarn: --master yarn (resolved through yarn-site.xml)
standalone: --master spark://s100:7077
mesos: --master mesos://xxx:xxx
3.[local]/[standalone]
Spark's own Master process is the management node.
4.[mesos]
The Mesos master is the management node.
5.[yarn]
a.Hadoop's ResourceManager takes the role of the Spark master; Spark's own master is not used,
b.so there is no need to start a spark-master process.
c.Make sure the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables point to the directory containing the Hadoop configuration files; they ensure that data is written to HDFS and that the application connects to YARN's ResourceManager.
--> copy core-site.xml, hdfs-site.xml and yarn-site.xml to spark/conf
--> distribute them to all nodes
--> configure the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables
edit /soft/spark/conf/spark-env.sh
-----------------------------------
export HADOOP_CONF_DIR=/soft/hadoop/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=3
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=500M
export SPARK_DRIVER_MEMORY=500M
---> distribute to all nodes
d.Distribute these configuration files to every node of the YARN cluster and make sure they are identical everywhere; every property set in them must be resolvable on every node.
e.A Spark application on YARN can be submitted in two deploy modes.
a.cluster deploy mode: the driver runs inside the YARN ApplicationMaster process.
b.client deploy mode: the driver runs in the client process, and the YARN ApplicationMaster is only used to request resources.
四、Running Spark on YARN
-------------------------------------------------------------
1.Modify the code (set the master to yarn) and repackage
package com.test.spark.scala

import java.net.{InetAddress, Socket}

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Same test program as in section 二, except that the master is now yarn.
  */
object DeployModeTest {

  def printInfo(str: String): Unit = {
    val ip = InetAddress.getLocalHost.getHostAddress
    val sock = new Socket("192.168.231.205", 8888)
    val out = sock.getOutputStream
    out.write((ip + " : " + str + "\r\n").getBytes())
    out.flush()
    sock.close()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("DeployModeTest")
    conf.setMaster("yarn")
    val sc = new SparkContext(conf)

    // Runs on the driver.
    printInfo("hello world")

    val rdd1 = sc.parallelize(1 to 10, 3)
    // These functions run on the executors.
    val rdd2 = rdd1.map(e => {
      printInfo(" map : " + e)
      e * 2
    })
    val rdd3 = rdd2.repartition(2)
    val rdd4 = rdd3.map(e => {
      printInfo(" map2 : " + e)
      e
    })
    val res = rdd4.reduce((a, b) => {
      printInfo("reduce : " + a + "," + b)
      a + b
    })

    // Runs on the driver again.
    printInfo("driver : " + res)
  }
}
2.Copy the Hadoop config files and set the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables
a.copy core-site.xml, hdfs-site.xml and yarn-site.xml to spark/conf and distribute them to all nodes
b.edit /soft/spark/conf/spark-env.sh and distribute it
export HADOOP_CONF_DIR=/soft/hadoop/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=3
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=500M
export SPARK_DRIVER_MEMORY=500M
3.Put Spark's jars onto HDFS [the Hadoop cluster does not ship with Spark's jars; put them there once, otherwise every submission packages the jars and uploads them to a temporary directory]
a.upload the /soft/spark/jars directory to HDFS
$> hdfs dfs -put /soft/spark/jars /data/spark
4.Configure the Spark properties file and distribute it to all nodes
[spark/conf/spark-defaults.conf]
spark.yarn.jars hdfs://mycluster/data/spark/jars/*.jar
spark.yarn.am.memory=512M
spark.driver.memory=512M
spark.executor.memory=512M
5.Start nc on s500
$> nc -lk 8888
6.Submit the job
//yarn + cluster
spark-submit --class com.test.spark.scala.DeployModeTest --master yarn --deploy-mode cluster hdfs://mycluster/data/spark/TestSpark-2-1.0-SNAPSHOT.jar
//yarn + client
spark-submit --class com.test.spark.scala.DeployModeTest --master yarn --deploy-mode client TestSpark-2-1.0-SNAPSHOT.jar
7.When launching in YARN mode, the Spark standalone cluster does not need to be started.
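Optionally, the YARN command-line tools can be used to confirm that the application really ran on YARN and to fetch its logs (replace <applicationId> with the id printed by the first command):
$> yarn application -list
$> yarn logs -applicationId <applicationId>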
五、Spark Master HA
---------------------------------------------------------------
1.[Description]
Only relevant for standalone and Mesos deployments; the YARN mode already has HA.
ZooKeeper is used to coordinate multiple masters and to store the cluster state.
The master is only responsible for scheduling.
2.[Edit the config file and distribute it to all nodes]
[spark/conf/spark-env.sh]
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=s100:2181,s200:2181,s300:2181 -Dspark.deploy.zookeeper.dir=/spark"
The options set above break down as:
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=s100:2181,s200:2181,s300:2181
spark.deploy.zookeeper.dir=/spark
3.[Starting]
a.Simply start a master process on several nodes; through ZooKeeper, masters are added to or removed from the HA group automatically.
$s100> ./start-all.sh
$s500> ./start-master.sh
b.Clients can also list several master addresses: spark://host1:port1,host2:port2.
Using Spark HA in code:
conf.setMaster("spark://s100:7077,s500:7077")
六、Refactoring the Call-Log Project: Aggregating the Hive Call Logs with Spark
-----------------------------------------------------------------
1.Original code --- hive + hadoop mr
/**
 * Query the number of calls per month for a given caller in a given year.
 */
public List<CalllogStat> statCalllogsCount_1(String caller, String year) {
    List<CalllogStat> list = new ArrayList<CalllogStat>();
    try {
        Connection conn = DriverManager.getConnection(url);
        Statement st = conn.createStatement();
        // build the query: select count(*), substr(calltime,1,6) from ext_calllogs_in_hbase
        //                  where caller = '15338597777' and substr(calltime,1,4) == '2018'
        //                  group by substr(calltime,1,6)
        String sql = "select count(*) ,substr(calltime,1,6) from ext_calllogs_in_hbase " +
                "where caller = '" + caller + "' and substr(calltime,1,4) == '" + year
                + "' group by substr(calltime,1,6)";
        ResultSet rs = st.executeQuery(sql);
        while (rs.next()) {
            CalllogStat logSt = new CalllogStat();
            logSt.setCount(rs.getInt(1));
            logSt.setYearMonth(rs.getString(2));
            list.add(logSt);
        }
        rs.close();
        st.close();
        conn.close();
        return list;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
2.Refactored code --- hive + spark
/**
 * Query the number of calls per month for a given caller in a given year.
 */
public List<CalllogStat> statCalllogsCount(String caller, String year) {
    List<CalllogStat> list = new ArrayList<CalllogStat>();
    SparkConf conf = new SparkConf();
    conf.setAppName("SparkHive");
    conf.setMaster("spark://s100:7077,s500:7077");
    // enableHiveSupport() is required so that sess.sql() can resolve the Hive tables.
    SparkSession sess = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate();
    // build the query: select count(*), substr(calltime,1,6) from ext_calllogs_in_hbase
    //                  where caller = '15338597777' and substr(calltime,1,4) == '2018'
    //                  group by substr(calltime,1,6)
    String sql = "select count(*) ,substr(calltime,1,6) from ext_calllogs_in_hbase " +
            "where caller = '" + caller + "' and substr(calltime,1,4) == '" + year
            + "' group by substr(calltime,1,6)";
    Dataset<Row> df = sess.sql(sql);
    List<Row> lst = df.collectAsList();
    for (Row r : lst) {
        CalllogStat logSt = new CalllogStat();
        // Row columns are 0-based: column 0 is count(*) (a long), column 1 is the year-month.
        logSt.setCount((int) r.getLong(0));
        logSt.setYearMonth(r.getString(1));
        list.add(logSt);
    }
    return list;
}
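One design note on the refactored version: SparkSession.builder().getOrCreate() returns the already-created session on subsequent calls, so in a web application it is usually better to build the session once at startup (and stop it on shutdown) and only run sess.sql(...) per request, rather than constructing the builder inside every query method; the sketch above keeps the original per-method structure for comparison.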
七、Resolving ClassNotFound Errors When Spark-Integrated Hive Accesses the HBase-backed Tables
----------------------------------------------------------
1.Spark local[*] mode
a.copy Hive's hive-hbase-handler jar to spark/jars [on all nodes]
$> xcall.sh "cp /soft/hive/lib/hive-hbase-handler-2.1.1.jar /soft/spark/jars/"
b.copy the metrics jars under hive/lib to spark/jars [on all nodes]
$>cp /soft/hive/lib/*metrics*.jar /soft/spark/jars/
c.start spark-shell in local mode and test
$>spark-shell --master local[4]
$scala>spark.sql("select * from mydb.ext_calllogs_in_hbase").show();
$scala>spark.sql("select count(*) ,substr(calltime,1,6) from ext_calllogs_in_hbase where caller = '15778423030' and substr(calltime,1,4) == '2017' group by substr(calltime,1,6)").show();
2.Spark standalone mode
a.copy Hive's hive-hbase-handler jar to spark/jars [on all nodes]
$> xcall.sh "cp /soft/hive/lib/hive-hbase-handler-2.1.1.jar /soft/spark/jars/"
b.copy the metrics jars under hive/lib to spark/jars [on all nodes]
$>cp /soft/hive/lib/*metrics*.jar /soft/spark/jars/
c.copy all jars under spark/jars to the HDFS cluster
$> hdfs dfs -put /soft/spark/jars /data/spark/
d.start the Spark cluster
$> ./start-all.sh
e.start a spark shell against the standalone cluster and test
$>spark-shell --master spark://s100:7077,s500:7077 //both masters listed because of HA
$scala>spark.sql("select * from mydb.ext_calllogs_in_hbase").show();
$scala>spark.sql("select count(*) ,substr(calltime,1,6) from ext_calllogs_in_hbase where caller = '15778423030' and substr(calltime,1,4) == '2017' group by substr(calltime,1,6)").show();
3.Accessing HBase + Hive programmatically from IDEA
a.add the new dependencies [the versions must match the cluster]
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-hbase-handler -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-hbase-handler</artifactId>
<version>2.1.0</version>
</dependency>
b.Test program
@Test
public void test1() {
    String caller = "13341109505";
    String year = "2017";
    // enableHiveSupport() lets the session resolve the Hive tables backed by HBase.
    SparkSession sess = SparkSession.builder().enableHiveSupport().appName("SparkHive").master("spark://s201:7077").getOrCreate();
    String sql = "select count(*) ,substr(calltime,1,6) from ext_calllogs_in_hbase " +
            "where caller = '" + caller + "' and substr(calltime,1,4) == '" + year
            + "' group by substr(calltime,1,6) order by substr(calltime,1,6)";
    Dataset<Row> df = sess.sql(sql);
    List<Row> rows = df.collectAsList();
    List<CallLogStat> list = new ArrayList<CallLogStat>();
    for (Row row : rows) {
        System.out.println(row.getString(1));
        // Row columns are 0-based: column 0 is count(*) (a long), column 1 is the year-month.
        list.add(new CallLogStat(row.getString(1), (int) row.getLong(0)));
    }
}
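Note: for enableHiveSupport() to find the metastore when the program is launched from IDEA, hive-site.xml (and, since the tables are HBase-backed, usually hbase-site.xml as well) typically has to be on the application's classpath, e.g. under src/main/resources; the spark-shell cases above pick these files up from spark/conf instead.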
八、Spark SQL Aggregate Queries: Accessing the HBase-backed Tables Through Spark's Thrift Server
--------------------------------------------------------------------------------
1.Start the Thrift server that ships with Spark
$>./start-thriftserver.sh --master spark://s100:7077
2.Integrate the web application through the hive-jdbc driver
add to pom.xml
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>2.1.0</version>
</dependency>
3.Code
// Body of a query method analogous to statCalllogsCount(caller, year) in section 六,
// but going through the Thrift server with plain hive-jdbc instead of a SparkSession.
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection("jdbc:hive2://s201:10000");
String sql = "select count(*) ,substr(calltime,1,6) from mydb.ext_calllogs_in_hbase " +
        "where caller = '" + caller + "' and substr(calltime,1,4) == '" + year
        + "' group by substr(calltime,1,6) order by substr(calltime,1,6) desc";
Statement st = conn.createStatement();
ResultSet rs = st.executeQuery(sql);
List<CallLogStat> list = new ArrayList<CallLogStat>();
while (rs.next()) {
    // JDBC columns are 1-based: column 1 is count(*), column 2 is the year-month.
    long count = rs.getLong(1);
    String ym = rs.getString(2);
    list.add(new CallLogStat(ym, (int) count));
}
rs.close();
st.close();
conn.close();
return list;
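Before wiring this into the web application, the connection can be sanity-checked with beeline, which ships with both Hive and Spark, using the same JDBC URL:
$> beeline -u jdbc:hive2://s201:10000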