This note ties together what we have already covered separately: how Spark Streaming reads data from Flume, and how it reads data from Kafka. Now we build the full pipeline: application logs are sent to Flume, Flume forwards the collected data to Kafka in real time, and Spark Streaming reads the data from Kafka and processes it.
Flume conveniently provides a log4j appender, so a plain log4j.properties configuration is enough to ship log output to Flume. For example:
GenerateLog.java
import java.util.ArrayList;
import java.util.Random;

import org.apache.log4j.Logger;

public class GenerateLog {

    private static final Logger logger = Logger.getLogger(GenerateLog.class.getName());

    public static void main(String[] args) throws InterruptedException {
        // Candidate words; every two seconds a random pair such as "a,b" is logged.
        ArrayList<String> list = new ArrayList<String>();
        list.add("a");
        list.add("b");
        list.add("c");
        int len = list.size();

        Random random = new Random();
        while (true) {
            Thread.sleep(2000);
            String s1 = list.get(random.nextInt(len));
            String s2 = list.get(random.nextInt(len));
            // Goes to the console and to the Flume appender configured in log4j.properties.
            logger.info(s1 + "," + s2);
        }
    }
}
log4j.properties
log4j.rootLogger=INFO,stdout,flume
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = master
log4j.appender.flume.Port = 9449
The Java program (GenerateLog.java) picks up log4j.properties and, through the flume appender defined there, forwards its log events to Flume on host master, port 9449.
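For the Log4jAppender class to be on the application's classpath, the Flume log4j appender client has to be declared as a build dependency. A minimal sketch, assuming a Maven project; the version numbers are assumptions and should match your Flume installation:
pom.xml (fragment)
<!-- provides org.apache.flume.clients.log4jappender.Log4jAppender -->
<dependency>
    <groupId>org.apache.flume.flume-ng-clients</groupId>
    <artifactId>flume-ng-log4jappender</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>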
When the Flume agent on master receives the log data, it collects it and forwards it to Kafka according to the following agent configuration:
agent1.sources=avro-source
agent1.channels=memory-channel
agent1.sinks=kafka-sink
#define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=master
agent1.sources.avro-source.port=9449
#define channel
agent1.channels.memory-channel.type=memory
#define sink
agent1.sinks.kafka-sink.type=org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.kafka.topic=test
agent1.sinks.kafka-sink.kafka.bootstrap.servers=master:9092,slave1:9092,slave2:9092
#bind source and sink to the channel
agent1.sources.avro-source.channels=memory-channel
agent1.sinks.kafka-sink.channel=memory-channel
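Before the agent can deliver anything, the Kafka topic should exist and the agent itself has to be started. A rough sequence, assuming the configuration above is saved as avro-memory-kafka.conf and ZooKeeper/Kafka are already running (the file name and the partition/replication counts are only illustrative):
# create the topic used by the Kafka sink (older Kafka releases use --zookeeper master:2181 instead of --bootstrap-server)
kafka-topics.sh --create --bootstrap-server master:9092 --replication-factor 3 --partitions 3 --topic test
# start the Flume agent defined above
flume-ng agent --name agent1 --conf $FLUME_HOME/conf --conf-file avro-memory-kafka.conf -Dflume.root.logger=INFO,console
# optional: confirm that log lines reach Kafka before wiring up Spark Streaming
kafka-console-consumer.sh --bootstrap-server master:9092 --topic test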
Once the data has been pushed to Kafka, the Spark Streaming application can consume it. The processing code looks like this:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object KafkaDirectWordCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")
    ssc.checkpoint(".")

    // Kafka consumer settings for the direct (receiver-less) stream.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "master:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "kafkatest",
      "enable.auto.commit" -> "false"
    )
    val topics = Set("test")
    val consumerStrategy = ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
    val kafkaDStream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent, consumerStrategy)

    // Each record value is a line such as "a,b": split on the comma and count words per batch.
    val res = kafkaDStream
      .map(_.value())
      .flatMap(_.split(","))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
    res.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
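The code above uses the kafka010 direct stream API, so the project needs the spark-streaming-kafka-0-10 artifact in addition to Spark Streaming itself. A sketch of the Maven dependencies, assuming Spark 2.x built against Scala 2.11 (the versions are assumptions; align them with your cluster):
pom.xml (fragment)
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.0</version>
</dependency>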
With that, the end-to-end log processing pipeline (log4j to Flume to Kafka to Spark Streaming) is in place.