Spark case study: compute a website's daily PV and UV and store the results in a MySQL database
PV analysis: count the records, per day, whose url field is non-empty
UV analysis: deduplicate per day by the user id (guid) or the IP address in the data
When storing the results in a MySQL table, the target table must be created first (see the DDL sketch below)
UNION and JOIN produce differently shaped result tables; use whichever fits the requirement
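A minimal DDL sketch for that table, assuming the (date, pv, uv) schema implied by the INSERT statement in the code below; the column types are assumptions:
-- assumed schema matching INSERT INTO tb_pvuv_result(date,pv,uv)
CREATE TABLE tb_pvuv_result (
  `date` VARCHAR(10) NOT NULL, -- yyyy-MM-dd, as produced by substring(0, 10)
  pv     INT NOT NULL,         -- daily page views
  uv     INT NOT NULL          -- daily unique visitors
);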
Code implementation
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object TrackLogAnalyseSpark {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf()
      .setAppName("TrackLogAnalyseSpark")
      .setMaster("local[2]")
    // Create the SparkContext: reads the input data into RDDs and schedules job execution
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")

    // Step 1: read the input data
    val trackRDD = sc.textFile("file:///E:\\JavaWork\\20190802", 4)
    println(s"${trackRDD.count()}")

    // Step 2: process (analyze) the data
    // Keep only well-formed records, then project (date, url, guid)
    val filterRDD: RDD[(String, String, String)] = trackRDD
      .filter(_.split("\t").length > 35)
      .map(line => {
        val arr = line.split("\t")
        (arr(17).substring(0, 10), arr(1), arr(5))
      })
    // filterRDD is reused for both PV and UV, so cache it
    filterRDD.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

    // PV: per date, count the records with a non-empty url
    val pv: RDD[(String, Int)] = filterRDD
      .map { case (date, url, guid) => (date, url) }
      .filter(_._2.trim.length > 0)
      .map(x => (x._1, 1))
      .reduceByKey(_ + _)
    pv.foreachPartition(iter => iter.foreach(println))

    // UV: per date, count the distinct guids
    val uv: RDD[(String, Int)] = filterRDD
      .map { case (date, url, guid) => (date, guid) }
      .filter(_._2.trim.length > 0)
      .distinct()
      .map(x => (x._1, 1))
      .reduceByKey(_ + _)
    uv.foreachPartition(iter => iter.foreach(println))

    // Step 3: output the results
    println("=========union===========")
    val unionRDD: RDD[(String, Int)] = pv.union(uv)
    unionRDD.coalesce(1).foreachPartition(iter => iter.foreach(println))

    println("=========join===========")
    // def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
    val joinRDD: RDD[(String, (Int, Int))] = pv.join(uv)
    joinRDD.foreach {
      case (date, (pv1, uv1)) =>
        println(s"date=$date,pv=$pv1,uv=$uv1")
    }

    joinRDD
      // The RDD is small, so reduce the number of partitions (one JDBC connection per partition)
      .coalesce(1)
      .foreachPartition(iter => {
        // Create the connection inside the partition, so it is opened on the executor
        Class.forName("com.mysql.jdbc.Driver")
        val url = "jdbc:mysql://bigdata-hpsk01.huadian.com/test"
        val userName = "root"
        val password = "123456"
        var conn: Connection = null
        try {
          // Obtain a database connection
          conn = DriverManager.getConnection(url, userName, password)
          val pst: PreparedStatement = conn.prepareStatement(
            "INSERT INTO tb_pvuv_result(date,pv,uv) VALUES(?,?,?)")
          // Insert the data, executing once per date
          iter.foreach {
            case (date, (pv, uv)) =>
              println(s"date=$date,pv=$pv,uv=$uv")
              pst.setString(1, date)
              pst.setInt(2, pv)
              pst.setInt(3, uv)
              pst.executeUpdate()
          }
        } catch {
          case e: Exception => e.printStackTrace()
        } finally {
          // Closing the connection also closes its statements
          if (conn != null) conn.close()
        }
      })

    // Release the cached data
    filterRDD.unpersist()

    // During development, each Spark app's web UI (port 4040) is useful for monitoring
    // job execution, but it disappears as soon as the app finishes (once sc.stop() runs)
    // Close resources
    sc.stop()
  }
}
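For larger result sets, the per-row executeUpdate() above can be replaced with JDBC batching. A sketch only, reusing the imports, connection settings, and joinRDD from the code above; addBatch/executeBatch are standard JDBC, so all rows of a partition go out in one round trip:
// Sketch: batched variant of the write step above (same tb_pvuv_result table assumed)
joinRDD.coalesce(1).foreachPartition(iter => {
  Class.forName("com.mysql.jdbc.Driver")
  val conn = DriverManager.getConnection(
    "jdbc:mysql://bigdata-hpsk01.huadian.com/test", "root", "123456")
  try {
    conn.setAutoCommit(false) // commit once per partition instead of per row
    val pst = conn.prepareStatement("INSERT INTO tb_pvuv_result(date,pv,uv) VALUES(?,?,?)")
    iter.foreach { case (date, (pv, uv)) =>
      pst.setString(1, date)
      pst.setInt(2, pv)
      pst.setInt(3, uv)
      pst.addBatch() // queue the row instead of executing immediately
    }
    pst.executeBatch() // execute all queued rows at once
    conn.commit()
  } finally {
    conn.close() // also closes the statement
  }
})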
Spark Standalone
Introduction
Spark's built-in distributed cluster resource management and task scheduling framework, similar to Hadoop YARN:
Standalone    YARN
Master        ResourceManager
Worker(s)     NodeManager
The one notable difference: YARN runs a single NodeManager per machine, whereas Spark Standalone can run multiple Workers on the same machine at once.
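Running several Workers on one machine is configured in conf/spark-env.sh via SPARK_WORKER_INSTANCES (a standard spark-env.sh option; the value below is only an example):
# conf/spark-env.sh: run two Worker processes on this machine (default: 1)
SPARK_WORKER_INSTANCES=2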
Installation, deployment, and configuration (pseudo-distributed)
conf/spark-env.sh
JAVA_HOME=/opt/modules/jdk1.8.0_201
SCALA_HOME=/opt/modules/scala-2.11.8
HADOOP_CONF_DIR=/opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/etc/hadoop
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
SPARK_MASTER_HOST=bigdata-hpsk01.huadian.com
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
conf/slaves
mv slaves.template slaves
List the hosts that run Worker processes, one per line:
bigdata-hpsk01.huadian.com

Starting the services
Start the master node:
sbin/start-master.sh
Start the worker nodes:
sbin/start-slaves.sh
Note: start-slaves.sh requires passwordless SSH login to each worker host:
ssh-keygen -t rsa        # generate the key pair
ssh-copy-id <hostname>   # distribute the public key
Testing
The cluster can be checked through the web UI:
master on port 8080, worker on port 8081
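With the cluster running, the example application from the first section could be submitted to the standalone master. A sketch, assuming the app has been packaged into a jar (the jar path is illustrative); note that the hardcoded setMaster("local[2]") in the code takes precedence over --master and would have to be removed first:
bin/spark-submit \
  --master spark://bigdata-hpsk01.huadian.com:7077 \
  --class TrackLogAnalyseSpark \
  /path/to/track-log-analyse-spark.jar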