Flume + Kafka + Spark Streaming
Install Flume. Version 1.6 probably does not support the Taildir source (unverified), so download 1.7 or 1.8. Download link:
http://www.apache.org/dyn/closer.lua/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
Or find a mirror yourself on the official site.
1. Set up Flume first
Extract the downloaded tarball: tar -zxvf apache-flume-1.8.0-bin.tar.gz
1) Configure conf
cp flume-env.sh.template flume-env.sh
Set JAVA_HOME in flume-env.sh (this can be skipped if JAVA_HOME is already fully configured in your environment variables).
I skipped it here; my /etc/profile contents are:
export JAVA_HOME=/usr/local/soft/jdk/jdk1.8.0_45
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export FLUME_HOME=/usr/local/soft/apache-flume-1.8.0-bin
export FLUME_CONF_DIR=$FLUME_HOME/conf
export FLUME_PATH=$FLUME_HOME/bin
export PATH=$SCALA_HOME/bin:$SPARK_HOME/bin:$HADOOP_HOME/bin:${JAVA_HOME}/bin:$FLUME_HOME/bin:$PATH
2) Key configuration
kafkanode.conf is configured as follows:
#Agent
flumeAgent.channels = c1
flumeAgent.sources = s1
flumeAgent.sinks = k1
#flumeAgent Taildir Source
# Note (1)
flumeAgent.sources.s1.type = TAILDIR
flumeAgent.sources.s1.positionFile = /opt/apps/log4j/taildir_position.json
flumeAgent.sources.s1.fileHeader = true
#flumeAgent.sources.s1.deletePolicy = immediate
#flumeAgent.sources.s1.batchSize = 1000
flumeAgent.sources.s1.channels = c1
flumeAgent.sources.s1.filegroups = f1 f2
flumeAgent.sources.s1.filegroups.f1 = /usr/logs/.*log.*
flumeAgent.sources.s1.filegroups.f2 = /logs/.*log.*
#flumeAgent.sources.s1.deserializer.maxLineLength = 1048576
#flumeAgent FileChannel
# Note (2)
flumeAgent.channels.c1.type = file
flumeAgent.channels.c1.checkpointDir = /var/flume/spool/checkpoint
flumeAgent.channels.c1.dataDirs = /var/flume/spool/data
flumeAgent.channels.c1.capacity = 200000000
flumeAgent.channels.c1.keep-alive = 30
flumeAgent.channels.c1.write-timeout = 30
flumeAgent.channels.c1.checkpoint-timeout=600
# flumeAgent Sinks
# Note (3)
flumeAgent.sinks.k1.channel = c1
flumeAgent.sinks.k1.type = avro
# connect to CollectorMainAgent
flumeAgent.sinks.k1.hostname = data17.Hadoop
flumeAgent.sinks.k1.port = 44444
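To smoke-test the Taildir source, append a few lines to a file whose name matches the f1 filegroup pattern (/usr/logs/.*log.*) and confirm they show up downstream. A minimal Python sketch, using a stand-in directory and a made-up file name (the real agent tails /usr/logs):

```python
import os

# Stand-in for /usr/logs; adjust to the directory your filegroup actually watches
log_dir = "/tmp/flume-demo-logs"
os.makedirs(log_dir, exist_ok=True)

# Hypothetical file name; it matches the .*log.* pattern from the config
log_file = os.path.join(log_dir, "app.log.2024-01-01")

# Append space-separated "events", the shape the Spark job later splits on
with open(log_file, "a") as f:
    for i in range(3):
        f.write(f"user{i} click /home\n")

# Show what the taildir source would pick up
with open(log_file) as f:
    print(f.read(), end="")
```

Once the agents are running, lines appended like this should appear on the Kafka topic, and the positionFile (taildir_position.json) records how far each file has been read, so restarts do not re-emit old lines.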
kafka.conf is configured as follows:
#flumeConsolidationAgent
flumeConsolidationAgent.channels = c1
flumeConsolidationAgent.sources = s1
flumeConsolidationAgent.sinks = k1
#flumeConsolidationAgent Avro Source
# Note (4)
flumeConsolidationAgent.sources.s1.type = avro
flumeConsolidationAgent.sources.s1.channels = c1
flumeConsolidationAgent.sources.s1.bind = data17.Hadoop
flumeConsolidationAgent.sources.s1.port = 44444
flumeConsolidationAgent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
flumeConsolidationAgent.sinks.k1.topic = myflume
flumeConsolidationAgent.sinks.k1.brokerList = data14.Hadoop:9092,data15.Hadoop:9092,data16.Hadoop:9092
flumeConsolidationAgent.sinks.k1.requiredAcks = 1
flumeConsolidationAgent.sinks.k1.batchSize = 20
flumeConsolidationAgent.sinks.k1.channel = c1
flumeConsolidationAgent.channels.c1.type = file
flumeConsolidationAgent.channels.c1.checkpointDir = /var/flume/spool/checkpoint
flumeConsolidationAgent.channels.c1.dataDirs = /var/flume/spool/data
flumeConsolidationAgent.channels.c1.capacity = 200000000
flumeConsolidationAgent.channels.c1.keep-alive = 30
flumeConsolidationAgent.channels.c1.write-timeout = 30
flumeConsolidationAgent.channels.c1.checkpoint-timeout=600
kafka.conf is for the main (aggregator) node. Start it with:
bin/flume-ng agent --conf conf --conf-file conf/kafka.conf --name flumeConsolidationAgent -Dflume.root.logger=DEBUG,console
kafkanode.conf is for the secondary nodes, of which there can be several. Start each with:
bin/flume-ng agent --conf conf --conf-file conf/kafkanode.conf --name flumeAgent -Dflume.root.logger=DEBUG,console
(The configuration is identical on every secondary node.)
2. Kafka configuration
There is nothing special to say here, since no special configuration is needed.
Use your existing Kafka cluster and just create a new topic:
/usr/local/soft/kafka_2.11-0.9.0.1/bin/kafka-topics.sh --create --topic myflume --replication-factor 2 --partitions 5 --zookeeper data4.Hadoop:2181
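For reference, --partitions 5 --replication-factor 2 means 5 × 2 = 10 partition replicas spread across the brokers, so with the three brokers in the sink's brokerList each ends up hosting 3-4 replicas. A quick sanity check, assuming a simplified round-robin placement (Kafka's real assignment also randomizes the starting broker, so the exact layout will differ):

```python
partitions = 5
replication_factor = 2
brokers = ["data14.Hadoop", "data15.Hadoop", "data16.Hadoop"]

# Total partition replicas the cluster must place
total_replicas = partitions * replication_factor
print(total_replicas)  # 10

# Simplified round-robin placement over the brokers
placement = {b: 0 for b in brokers}
for i in range(total_replicas):
    placement[brokers[i % len(brokers)]] += 1
print(placement)
```

With replication-factor 2, losing any single broker leaves every partition with one surviving replica, which is why 2 is a reasonable minimum for a 3-broker cluster.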
3. Spark Streaming
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RealTimeMonitorStart extends Serializable {
  def main(args: Array[String]): Unit = {
    // Quiet the framework logs so the application output stays readable
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("stocker").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1)) // 1-second batches

    // Kafka configuration: topic name -> number of receiver threads
    val topicMap = "myflume".split(",").map((_, 3)).toMap
    println(topicMap)

    // Receiver-based stream (Kafka 0.8 API), connecting through ZooKeeper
    val kafkaStrings = KafkaUtils.createStream(
      ssc,
      "data4.Hadoop:2181,data5.Hadoop:2181,data6.Hadoop:2181",
      "myflumegroup",
      topicMap)

    // Split each message value into words and pair each word with a count of 1
    val urlClickLogPairsDStream = kafkaStrings.flatMap(_._2.split(" ")).map((_, 1))

    // Sum the counts over a 60-second window, recomputed every 5 seconds
    val urlClickCountDaysDStream = urlClickLogPairsDStream.reduceByKeyAndWindow(
      (v1: Int, v2: Int) => v1 + v2,
      Seconds(60),
      Seconds(5))

    urlClickCountDaysDStream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
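The reduceByKeyAndWindow call above produces, every 5 seconds, per-word counts over the last 60 seconds of batches. The aggregation itself is just split-and-sum; a minimal Python sketch of what one window computes, over made-up sample batches:

```python
from collections import Counter

# Hypothetical 1-second batches that fell inside one 60-second window
window_batches = [
    ["user1 click /home", "user2 click /cart"],
    ["user1 click /home"],
]

# flatMap(_.split(" ")).map((_, 1)).reduce by key, in plain Python
counts = Counter()
for batch in window_batches:
    for line in batch:
        for word in line.split(" "):
            counts[word] += 1

print(counts["click"])  # 3
print(counts["/home"])  # 2
```

The difference in the streaming job is that this computation slides: each output reuses the batches still inside the window rather than recounting from scratch when an incremental (reduce/inverse-reduce) form is supplied.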
That's all for now; if you have questions, leave a comment.