Hadoop: Collecting Nginx Logs into a Hive Transactional Table with Flume

Please credit the source when reposting: https://blog.csdn.net/l1028386804/article/details/97975539

Note: the software versions used here are Hadoop 3.2.0, Flume 1.9.0, Hive 2.3.5, and Nginx 1.17.2.

The overall flow is simple: Nginx writes access logs, Flume tails the log file, and Flume's Hive sink streams the records into a Hive transactional table.

1. Nginx Log Format

For Nginx installation and configuration, see the post "Nginx+Tomcat+Memcached Load-Balanced Cluster Setup".

The Nginx log format uses the following variables:

  • $remote_addr client IP address
  • $time_local local time in the common log format
  • $status response status code
  • $body_bytes_sent number of bytes sent to the client, excluding the size of the response headers
  • $http_user_agent client browser (User-Agent) information
  • $http_referer the referer address of the request
  • $request the complete original request line
  • $request_method HTTP request method, usually "GET" or "POST"
  • $request_time request processing time
  • $request_uri the complete request URI
  • $server_protocol the HTTP version of the request, usually "HTTP/1.0" or "HTTP/1.1"
  • $request_body POST request parameters; the parameters must be submitted in a form body
  • token $http_token (prefix a custom header field with http_ to write that header field into the log)
  • version $arg_version (prefix a custom request argument with arg_ to write that argument into the log)

Configure the output log format in the Nginx configuration file:

log_format main "$remote_addr,$time_local,$status,$body_bytes_sent,$http_user_agent,$http_referer,$request_method,$request_time,$request_uri,$server_protocol,$request_body,$http_token";
access_log  logs/access.log  main;

The complete Nginx configuration is as follows:

user  hadoop hadoop;
worker_processes  auto;

error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;

#pid        logs/nginx.pid;


events {
	use epoll;
    worker_connections  1024;
}


http {
     include       mime.types;
     default_type application/octet-stream;
 	 client_max_body_size     16m;
     client_body_buffer_size  256k;
     proxy_connect_timeout    1200;
     proxy_read_timeout       1200;
     proxy_send_timeout       6000;
     proxy_buffer_size        32k;
     proxy_buffers            4 64k;
     proxy_busy_buffers_size 128k;
     proxy_temp_file_write_size 128k;
	 
     #custom Nginx log format
     log_format main "$remote_addr,$time_local,$status,$body_bytes_sent,$http_user_agent,$http_referer,$request_method,$request_time,$request_uri,$server_protocol,$request_body,$http_token";
     access_log  logs/access.log  main;

    sendfile        on;
    #tcp_nopush     on;
    #duration of HTTP keep-alive connections
    keepalive_timeout  65;
 
    #gzip compression settings
    gzip  on;            #enable gzip
    gzip_min_length 1k;  #minimum file size to compress
    gzip_buffers 4 16k;  #compression buffers
 
    #HTTP protocol version (1.0/1.1), default is 1.1; use 1.0 if the frontend is squid 2.5
    gzip_http_version 1.1;
 
    #gzip compression level: 1 is the lowest compression and fastest, 9 is the highest compression but slowest (faster transfer, more CPU)
    gzip_comp_level 2;
 
    #adds a Vary header for proxy servers; some clients support compression and some do not, so the decision to compress is made from the client's HTTP headers
    gzip_vary on;
 
    #MIME types to gzip; do not add text/html, otherwise a warning is emitted
    gzip_types text/plain text/javascript text/css application/xml application/x-javascript application/json;
	
    server {
        listen       80;
        server_name  192.168.175.100;
        location / {
            root   html;
            index  index.html index.htm;
        }
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}

A sample log line matching the configured format looks like this:

192.168.175.100,31/Jul/2019:23:12:50 +0000,200,556,okhttp/3.8.1,-,GET,0.028,/resource/test.jpg,HTTP/1.1,-,-
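
To check that the format takes effect, a quick sketch (assuming Nginx is installed under /usr/local/nginx-1.17.2, as in the Flume configuration later, and that a /resource/test.jpg file exists under the web root) is to reload Nginx, issue a request, and look at the last access-log line:

/usr/local/nginx-1.17.2/sbin/nginx -s reload
curl http://192.168.175.100/resource/test.jpg
tail -n 1 /usr/local/nginx-1.17.2/logs/access.log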

2. Collecting and Cleaning with Flume

Copy the dependency jars required by the Hive sink into Flume's lib directory.

The complete set of jar files Flume depends on can be downloaded from https://download.csdn.net/download/l1028386804/11459670.
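
Alternatively, the jars can be copied from a local Hive installation; a minimal sketch, assuming Hive is installed at /usr/local/hive-2.3.5 and Flume at /usr/local/flume-1.9.0, would be:

cp /usr/local/hive-2.3.5/hcatalog/share/hcatalog/*.jar /usr/local/flume-1.9.0/lib/
cp /usr/local/hive-2.3.5/lib/hive-*.jar /usr/local/flume-1.9.0/lib/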

Copy the flume-env.sh.template file in the FLUME_HOME/conf directory to flume-env.sh, and add the following line to flume-env.sh:

export JAVA_HOME=/usr/local/jdk1.8.0_212

The contents of the flume-hive-acc.conf.properties configuration file are as follows:

#define the agent name and the names of the source, channel, and sink
myagent.sources = s1
myagent.channels = c1
myagent.sinks = k1
# configure the source
myagent.sources.s1.type = exec
myagent.sources.s1.batchSize=50
myagent.sources.s1.channels = c1
myagent.sources.s1.deserializer.outputCharset = UTF-8
# the log file to monitor
myagent.sources.s1.command = tail -F /usr/local/nginx-1.17.2/logs/access.log
# maximum line length accepted by the deserializer
myagent.sources.s1.deserializer.maxLineLength =1048576
# the following properties apply to a spooling directory source and are ignored by the exec source above
myagent.sources.s1.fileSuffix = .DONE
myagent.sources.s1.ignorePattern = access(_\d{4}\-\d{2}\-\d{2}_\d{2})?\.log(\.DONE)?
myagent.sources.s1.consumeOrder = oldest
myagent.sources.s1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# define the channel
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 10000
myagent.channels.c1.transactionCapacity = 100
# define the sink
myagent.sinks.k1.type=hive
myagent.sinks.k1.channel = c1
myagent.sinks.k1.batchSize=50
# Hive metastore address
myagent.sinks.k1.hive.batchSize =50
myagent.sinks.k1.hive.metastore=thrift://binghe100:9083
myagent.sinks.k1.hive.database=hive_test
myagent.sinks.k1.hive.table=nginx_log
myagent.sinks.k1.serializer=delimited
# input field delimiter
myagent.sinks.k1.serializer.delimiter=","
# output (SerDe) field separator
myagent.sinks.k1.serializer.serdeSeparator=','
myagent.sinks.k1.serializer.fieldnames=remote_addr,time_local,status,body_bytes_sent,http_user_agent,http_referer,request_method,request_time,request_uri,server_protocol,request_body,http_token,id,appkey,sing,version
# wire the source, channel, and sink together
myagent.sources.s1.channels = c1
myagent.sinks.k1.channel = c1

Start Flume:

nohup flume-ng agent -c /usr/local/flume-1.9.0/conf -f /usr/local/flume-1.9.0/conf/flume-hive-acc.conf.properties -n myagent -Dflume.root.logger=INFO,console >> /dev/null &
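
Because the output is discarded to /dev/null, you can verify that the agent is actually running with jps or ps, for example:

jps -l | grep flume
ps -ef | grep flume-hive-acc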

3. Hive Operations

Log in to the server as the hadoop user and run the following commands to start Hive:

nohup hive --service metastore >> ~/metastore.log 2>&1 &        

nohup  hive --service hiveserver2 >> ~/hiveserver2.log 2>&1 &   

Change the permissions:

hadoop fs -ls /user/hive/warehouse

hadoop fs -chmod 777 /user/hive/warehouse/hive_test.db

Connect to Hive in remote mode:

-bash-4.1$ beeline
beeline>  !connect jdbc:hive2://localhost:10000 hadoop hadoop
0: jdbc:hive2://localhost:10000> 

Table creation statement (the columns match the Flume output):

DROP TABLE IF EXISTS nginx_log;
create table nginx_log(
  remote_addr string, time_local string, status string, body_bytes_sent string,
  http_user_agent string, http_referer string, request_method string, request_time string,
  request_uri string, server_protocol string, request_body string, http_token string,
  id string, appkey string, sing string, version string
)
clustered by (id) into 5 buckets
stored as orc
TBLPROPERTIES ('transactional'='true');
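
You can confirm that the table was created as a bucketed ORC table with the transactional property set by describing it:

DESCRIBE FORMATTED nginx_log;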

Next, configure transactions in hive-site.xml so that Hive supports ACID operations. The newly added configuration is shown below:

<!-- enable transactions -->
<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>5</value>
</property>
<property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
</property>

The complete hive-site.xml configuration for Hive is as follows:

<configuration>
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://192.168.175.100:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;characterEncoding=UTF-8</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>hive</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>hive</value>
	</property>
	<property>
		<name>hive.metastore.local</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.server2.logging.operation.log.location</name>
		<value>/usr/local/hive-2.3.5/operation_logs</value>
	</property>
	<property> 
		<name>hive.exec.scratchdir</name> 
		<value>/usr/local/hive-2.3.5/exec</value> 
	</property> 
	<property>
		<name>hive.exec.local.scratchdir</name>
		<value>/usr/local/hive-2.3.5/scratchdir</value>
	</property>
	<property>
		<name>hive.downloaded.resources.dir</name>
		<value>/usr/local/hive-2.3.5/resources</value>
	</property>
	<property>
		<name>hive.querylog.location</name>
		<value>/usr/local/hive-2.3.5/querylog</value>
	</property>
	<property>
		<name>hive.metastore.uris</name>
		<value>thrift://binghe100:9083</value>
	</property>
	
	<property>
		<name>hive.support.concurrency</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.exec.dynamic.partition.mode</name>
		<value>nonstrict</value>
	</property>
	<property>
		<name>hive.txn.manager</name>
		<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
	</property>
	<property>
		<name>hive.compactor.initiator.on</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.compactor.worker.threads</name>
		<value>5</value>
	</property>
	<property>
		<name>hive.enforce.bucketing</name>
		<value>true</value>
	</property>
</configuration>
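
Since hive-site.xml is only read when the services start, restart the metastore and HiveServer2 after changing it, then confirm in Beeline that the transactional settings are active:

SET hive.txn.manager;
SET hive.support.concurrency;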

Enter http://192.168.175.100 in the browser's address bar to access Nginx, then run the following commands in the Hive CLI to query the data:

hive> show databases;

hive> use hive_test;

hive> show tables;

hive> select * from nginx_log;
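
Once records are flowing in, the table can be queried like any other Hive table; as an illustrative example, an aggregation counting requests per status code:

hive> select status, count(*) as cnt from nginx_log group by status;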

 
