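I. Installing ZooKeeper
1. Download ZooKeeper
Copy the installation package to the target directory (mine is /home/metaq), extract it, and rename the extracted directory to zookeeper:
tar -zxvf zookeeper-3.4.6.tar.gz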
mv zookeeper-3.4.6 zookeeper
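2. Edit the ZooKeeper configuration file
On one of the machines (bigdatasvr01), after extracting zookeeper-3.4.6.tar.gz, edit conf/zoo.cfg (copy conf/zoo_sample.cfg to conf/zoo.cfg first if it does not exist yet) so that it contains:
tickTime=2000
dataDir=/home/metaq/zookeeper/data
dataLogDir=/home/metaq/zookeeper/log
clientPort=2181
initLimit=5
syncLimit=2
server.1=bigdatasvr01:2888:3888
server.2=bigdatasvr02:2888:3888
server.3=bigdatasvr03:2888:3888
Once the configuration on bigdatasvr01 is finished, copy the installation directory to the same location on the other two ZooKeeper servers: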
cd /home/metaq
scp -r zookeeper metaq@bigdatasvr02:/home/metaq
scp -r zookeeper metaq@bigdatasvr03:/home/metaq
Set myid: in the directory given by dataDir, create a file named myid that contains a single number identifying the current host. The Num in each server.Num entry of conf/zoo.cfg must equal the number in that server's myid file. For example, with server.1=bigdatasvr01:2888:3888, the myid file on bigdatasvr01 contains 1. Run the matching command on each server:
[metaq@bigdatasvr01 data]$ echo "1" > /home/metaq/zookeeper/data/myid
[metaq@bigdatasvr02 data]$ echo "2" > /home/metaq/zookeeper/data/myid
[metaq@bigdatasvr03 data]$ echo "3" > /home/metaq/zookeeper/data/myid
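Because the myid value has to agree with the server.Num entries in zoo.cfg, it can be derived from the configuration instead of being typed by hand on each node. The following is a minimal sketch in POSIX shell; the function name myid_for_host is hypothetical and not part of ZooKeeper, and it assumes the server.Num=host:port:port layout shown above.

```shell
# Print the Num of the "server.Num=<host>:<port>:<port>" line in a
# zoo.cfg-style file whose host part equals the given hostname.
# Prints nothing if no line matches.
# (myid_for_host is a hypothetical helper, not a ZooKeeper tool.)
myid_for_host() {
    # $1 = path to zoo.cfg, $2 = hostname to look up
    awk -v host="$2" '
        /^server\.[0-9]+=/ {
            split($0, kv, "=")      # kv[1] = "server.Num", kv[2] = "host:port:port"
            split(kv[2], hp, ":")   # hp[1] = host
            if (hp[1] == host) {
                sub(/^server\./, "", kv[1])
                print kv[1]
                exit
            }
        }' "$1"
}
```

On each node it could then be used as myid_for_host /home/metaq/zookeeper/conf/zoo.cfg "$(hostname)" > /home/metaq/zookeeper/data/myid, assuming that the output of hostname is exactly the name used in zoo.cfg.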
3. Start ZooKeeper
On every node of the ZooKeeper cluster, run the script that starts the ZooKeeper service:
[metaq@bigdatasvr01 zookeeper]$ bin/zkServer.sh start
[metaq@bigdatasvr02 zookeeper]$ bin/zkServer.sh start
[metaq@bigdatasvr03 zookeeper]$ bin/zkServer.sh start
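4. Verify the installation
Use the ZooKeeper script to check the cluster status. Each node reports its role (leader or follower). The result on each node is shown below:
[metaq@bigdatasvr01 zookeeper]$ bin/zkServer.sh status
JMX enabled by default
Using config: /home/metaq/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[metaq@bigdatasvr02 zookeeper]$ bin/zkServer.sh status
JMX enabled by default
Using config: /home/metaq/zookeeper/bin/../conf/zoo.cfg
Mode: follower
[metaq@bigdatasvr03 zookeeper]$ bin/zkServer.sh status
JMX enabled by default
Using config: /home/metaq/zookeeper/bin/../conf/zoo.cfg
Mode: leader
Problems encountered with ZooKeeper: if the cluster fails at startup with routing errors (the nodes cannot reach each other), add the hostnames and IP addresses of the other cluster nodes to /etc/hosts on each host. A firewall that has not been turned off and therefore blocks ports 2181, 2888, and 3888 causes the same kind of problem.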
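The /etc/hosts problem can be checked before starting the cluster. The sketch below lists every ensemble hostname from zoo.cfg that has no entry in a hosts file. The function name missing_hosts is hypothetical, and the check only looks at the hosts file, so names that resolve through DNS would still be reported.

```shell
# Print each hostname from the server.Num lines of a zoo.cfg-style file
# that does not appear as a name in the given hosts file.
# (missing_hosts is a hypothetical helper; it does not consult DNS.)
missing_hosts() {
    cfg=$1
    hosts_file=${2:-/etc/hosts}
    # Field 2 of "server.Num=host:port:port", split on "=" and ":", is the host.
    awk -F'[=:]' '/^server\.[0-9]+=/ { print $2 }' "$cfg" |
    while read -r host; do
        # Drop comments, then compare the name against every field after the IP column.
        if ! awk -v h="$host" '
                { sub(/#.*/, ""); for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
                END { exit !found }' "$hosts_file"; then
            echo "$host"
        fi
    done
}
```

Running missing_hosts /home/metaq/zookeeper/conf/zoo.cfg on each node and getting no output would suggest that the hosts file covers the whole ensemble. It says nothing about the firewall, which still has to be checked separately.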
II. Installing Kafka
Kafka download URL: https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.1/kafka_2.10-0.8.1.tgz
Install Kafka on each of the three servers:
tar zxvf kafka_2.10-0.8.1.tgz
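1. Edit the configuration file
Edit config/server.properties on every server:
broker.id: must be unique across brokers; use an integer
host.name: must be unique; use the server's IP address
zookeeper.connect=192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181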
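Since only broker.id and host.name differ between the three brokers, the per-server edits can be scripted. The following is a rough sketch of a helper that sets one key in a Java-style properties file. The name set_property is hypothetical, and the handling of a commented-out "#key=" line is an assumption about how the shipped server.properties may look.

```shell
# Set key=value in a properties file: replace the first existing "key=" or
# commented-out "#key=" line, drop any later duplicates, and append the
# setting if the key was absent.
# (set_property is a hypothetical helper. Values containing backslashes are
# not handled, because awk -v interprets escape sequences.)
set_property() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        BEGIN { FS = "=" }
        $1 == k || $1 == "#" k { if (!done) print k "=" v; done = 1; next }
        { print }
        END { if (!done) print k "=" v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

On the second broker, for example, this could be called as set_property config/server.properties broker.id 2 followed by set_property config/server.properties host.name 192.168.1.108. Which broker.id goes with which IP address is a choice left to the reader; the text above only requires the values to be unique.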
2. Start Kafka
Then run the following on every machine:
bin/kafka-server-start.sh -daemon config/server.properties
3. Create a topic
bin/kafka-topics.sh --create --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181 --replication-factor 1 --partitions 3 --topic mykafka_test
4. List topics
bin/kafka-topics.sh --list --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181
5. Show topic details
bin/kafka-topics.sh --describe --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181
6. Send messages
bin/kafka-console-producer.sh --broker-list 192.168.1.107:9092 --topic mykafka_test
7. Receive messages
bin/kafka-console-consumer.sh --zookeeper 192.168.1.107:2181,192.168.1.108:2181,192.168.1.109:2181 --topic mykafka_test --from-beginning
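III. Installing Logstash
Download the installation package logstash-5.2.1.tar.gz.
Install Logstash on the designated server:
tar -zxvf logstash-5.2.1.tar.gz
Watch the Tomcat logs and ship them to Kafka
Add a configuration file named tomcat_log_to_kafka.conf: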
input {
  file {
    type => "apache"
    path => "/home/zeus/apache-tomcat-7.0.72/logs/*"
    exclude => ["*.gz", "*.log", "*.out"]
    sincedb_path => "/dev/null"
  }
}
filter {
  if [type] == "apache" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
}
output {
  kafka {
    topic_id => "logstash_topic"
    bootstrap_servers => "192.168.1.107:9092,192.168.1.108:9092,192.168.1.109:9092"
    codec => plain {
      format => "%{message}"
    }
  }
}
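Start it: bin/logstash -f tomcat_log_to_kafka.conf --config.reload.automatic
With --config.reload.automatic, Logstash reloads the configuration file whenever it changes, so there is no need to stop and restart Logstash after each edit.
The grok filter parses each log line into structured fields using a regular-expression pattern (here COMBINEDAPACHELOG). Because the output codec's format is "%{message}", only the original log line is written to Kafka.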
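To make the idea of regex-based log parsing concrete, here is a small sketch that splits one access-log line into named fields with sed. It is not how grok is implemented and it does not use the COMBINEDAPACHELOG pattern. Instead it mirrors the nine capture groups of the LOG_ENTRY_PATTERN used in section IV and prints five of them. The function name parse_log_line is hypothetical.

```shell
# Read access-log lines on stdin and print selected fields as key=value
# pairs. Lines that do not match the pattern produce no output.
# Capture groups: 1 clientip, 2 ident, 3 auth, 4 timestamp, 5 verb,
#                 6 request, 7 httpversion, 8 response, 9 bytes.
# (parse_log_line is a hypothetical helper for illustration.)
parse_log_line() {
    sed -nE 's/^([0-9.]+) ([^ ]+) ([^ ]+) \[([^]]+)\] "([A-Za-z]+) ([^ ]+) ([^ ]+)" ([0-9]{3}) ([^ ]+)$/clientip=\1 verb=\5 request=\6 response=\8 bytes=\9/p'
}
```

Piping a Tomcat access-log line through parse_log_line gives a quick way to check whether the log format matches the expected layout before wiring up Logstash and Spark.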
IV. Reading Kafka messages with Spark Streaming
The code is as follows:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

import scala.collection.mutable

// ResourcesUtil and Constants are project-specific helpers that are not shown in this post.
object ApacheLogAnalysis {
  val LOG_ENTRY_PATTERN = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\w+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)".r

  def main(args: Array[String]): Unit = {
    // Run locally by default; the first argument overrides the master URL
    val masterUrl = if (args.length > 0) args(0) else "local[2]"
    val sparkConf = new SparkConf().setMaster(masterUrl).setAppName("ApacheLogAnalysis")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    //ssc.checkpoint(".") // a checkpoint directory must be set if updateStateByKey is used
    // Topic to subscribe to
    val topics = Set(ResourcesUtil.getValue(Constants.KAFKA_TOPIC_NAME))
    // Kafka broker addresses
    val brokerList = ResourcesUtil.getValue(Constants.KAFKA_HOST_PORT)
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokerList,
      "serializer.class" -> "kafka.serializer.StringEncoder"
    )
    // Connect to Kafka and create the direct stream of (key, message) pairs
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    val events = kafkaStream.flatMap(line => {
      // Parse the Apache log line with the regular expression, for example:
      // 192.168.1.249 - - [23/Jun/2017:12:48:43 +0800] "POST /zeus/zeus_platform/user.rpc HTTP/1.1" 200 99
      line._2 match {
        case LOG_ENTRY_PATTERN(clientip, ident, auth, timestamp, verb, request, httpversion, response, bytes) =>
          val logEntryMap = mutable.Map.empty[String, String]
          logEntryMap("clientip") = clientip
          logEntryMap("ident") = ident
          logEntryMap("auth") = auth
          logEntryMap("timestamp") = timestamp
          logEntryMap("verb") = verb
          logEntryMap("request") = request
          logEntryMap("httpversion") = httpversion
          logEntryMap("response") = response
          logEntryMap("bytes") = bytes
          Some(logEntryMap)
        case _ => None // skip lines that do not match instead of failing with a MatchError
      }
    })
    events.print()
    // Count requests per URL within each batch
    val requestUrls = events.map(x => (x("request"), 1L)).reduceByKey(_ + _)
    requestUrls.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val requestUrl = pair._1
          val clickCount = pair._2
          // Printed on the executor's stdout (the console when running in local mode)
          println(s"=================requestUrl count==================== requestUrl:${requestUrl} clickCount:${clickCount}.")
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
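For a quick cross-check of the numbers the streaming job prints, the same per-URL aggregation can be approximated offline over an access-log file. This is only a sketch. The function name count_requests is hypothetical, and it assumes the whitespace-separated layout of the sample line in the code above, in which the request path is the 7th field. It also counts over the whole file, whereas the job counts within each 5-second batch.

```shell
# Count requests per URL in one or more access-log files (or stdin) and
# print "<count> <url>" lines, highest count first.
# Assumes the request path is the 7th whitespace-separated field.
# (count_requests is a hypothetical helper for illustration.)
count_requests() {
    awk '{ count[$7]++ } END { for (url in count) print count[url], url }' "$@" |
    sort -rn
}
```

For example, count_requests /home/zeus/apache-tomcat-7.0.72/logs/some_access_log.txt would print one line per distinct request path, where some_access_log.txt stands for whichever access-log file Tomcat has written.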