In a previous post we covered how to ship nginx logs into Kafka; now let's try consuming those messages and persisting them into storage on the cluster (here, into Kudu).
Before walking through the implementation, this post assumes the reader is already familiar with the following (not covered in detail):
Java and the basics of the Spring framework
How Spark and Spark Streaming work
Basic usage of Kudu
Implementation plan
Spark Streaming consumes the messages from Kafka
While iterating over each RDD, the log records are inserted into Kudu
Finally, the data in Kudu can be queried with Impala
Create the table (assuming node1 is the address of one of the Impala instances)
su hdfs;
#the kudu_vip database already exists; next, create the NGINX_LOG table
#each user action produces one nginx log line with no natural key, and a Kudu table requires a primary key, so we add a uuid column
impala-shell -i node1:21000 -q "
CREATE TABLE kudu_vip.NGINX_LOG
(
uuid STRING,
remote_addr STRING,
time_local STRING,
remote_user STRING,
status STRING,
body_bytes_sent STRING,
http_referer STRING,
http_user_agent STRING,
http_x_forwarded_for STRING,
PRIMARY KEY(uuid)
)
PARTITION BY HASH PARTITIONS 3
STORED AS KUDU;
"
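Since each nginx request produces one log line with no natural key, the table above uses a synthetic uuid as its primary key. Below is a minimal sketch of how the consumer side could attach that key while parsing a log line; the class and method names are illustrative, and the regex assumes a default combined-style nginx log format, which may differ from the actual log_format on the server:

```java
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NginxLogKey {
    // Rough pattern for a combined-format nginx access line (assumption:
    // the actual log_format configured on the server may differ).
    static final Pattern LOG =
            Pattern.compile("^(\\S+) - (\\S+) \\[([^\\]]+)\\] \"[^\"]*\" (\\d{3}) (\\d+)");

    // Returns {uuid, remote_addr, remote_user, time_local, status, body_bytes_sent},
    // or null when the line does not match.
    static String[] toRow(String line) {
        Matcher m = LOG.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Kudu requires a primary key, so a random uuid is generated per row.
        return new String[] { UUID.randomUUID().toString(),
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String[] row = toRow(
                "192.168.1.10 - - [21/May/2018:10:00:00 +0800] \"GET / HTTP/1.1\" 200 612");
        System.out.println(String.join(",", row));
    }
}
```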
Writing the Kafka consumer
We won't go into the details here; let's go straight to the code.
pom configuration
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spring.version>5.0.2.RELEASE</spring.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-core</artifactId>
        <version>${spring.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-beans</artifactId>
        <version>${spring.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-context</artifactId>
        <version>${spring.version}</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.3.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.3.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.23</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kudu</groupId>
        <artifactId>kudu-client</artifactId>
        <version>1.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kudu</groupId>
        <artifactId>kudu-spark2_2.11</artifactId>
        <version>1.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kudu</groupId>
        <artifactId>kudu-client-tools</artifactId>
        <version>1.4.0</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.46</version>
    </dependency>
</dependencies>
Spring configuration
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:aop="http://www.springframework.org/schema/aop" xmlns:context="http://www.springframework.org/schema/context"
       xmlns:jee="http://www.springframework.org/schema/jee" xmlns:tx="http://www.springframework.org/schema/tx"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop-3.1.xsd
       http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.1.xsd
       http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.1.xsd
       http://www.springframework.org/schema/jee http://www.springframework.org/schema/jee/spring-jee-3.1.xsd
       http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-3.1.xsd">

    <!-- Windows (development) config path -->
    <context:property-placeholder
        location="file:D:/yougouconf/bi/bi-sparkstreaming/*.properties"
        file-encoding="UTF-8"
        ignore-unresolvable="true" ignore-resource-not-found="true" order="2" system-properties-mode="NEVER" />

    <!-- Linux (production) config path -->
    <context:property-placeholder
        location="file:/etc/wonhighconf/bi/bi-sparkstreaming/*.properties"
        file-encoding="UTF-8"
        ignore-unresolvable="true" ignore-resource-not-found="true" order="2" system-properties-mode="NEVER" />

</beans>
The configuration file goes under D:/yougouconf/bi/bi-sparkstreaming/ (Windows) or /etc/wonhighconf/bi/bi-sparkstreaming/ (Linux):
#Spark settings
#change this to yarn-cluster when running on YARN
conf.master=local[*]
conf.appName=testSparkStreaming
conf.bootStrapServers=node1:9092,node2:9092,node3:9092
#topic to subscribe to
conf.topics=testnginx
conf.loglevel=ERROR
#kudu master instance
kudu.instances=node1:7051
#kudu database (tables created through Impala are prefixed with impala::)
kudu.schema=impala::KUDU_VIP
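These keys are resolved by the property-placeholder entries in the Spring XML above. For a quick check outside the Spring container, the same file can be read with plain java.util.Properties; this is just a sketch (the class name is illustrative), reading with UTF-8 to match the file-encoding declared in the XML:

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class StreamingConf {
    // Loads a *.properties file as UTF-8, matching the file-encoding
    // attribute on the property-placeholder above.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Demo: write a couple of the keys from the config above to a temp
        // file and read them back.
        Path tmp = Files.createTempFile("bi-sparkstreaming", ".properties");
        Files.write(tmp, "conf.master=local[*]\nconf.topics=testnginx\n"
                .getBytes(StandardCharsets.UTF_8));
        Properties props = load(tmp);
        System.out.println(props.getProperty("conf.master")
                + " / " + props.getProperty("conf.topics"));
    }
}
```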
Spark Streaming code
package sparkStreaming.spark;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
imp