Hadoop: Collecting Nginx Logs into a Hive Transactional Table with Flume

Please credit the source when reposting: https://blog.csdn.net/l1028386804/article/details/97975539

Note: the software versions used here are Hadoop 3.2.0, Flume 1.9.0, Hive 2.3.5, and Nginx 1.17.2.

The overall flow is simple: Nginx writes access logs, Flume tails the log file, and Flume's Hive sink streams the records into a Hive transactional table.

1. Nginx Log Format

For Nginx installation and configuration, see the post "Nginx+Tomcat+Memcached Load-Balanced Cluster Setup".

The Nginx log format uses the following variables:

  • $remote_addr client IP address
  • $time_local local time in the common log format
  • $status response status code
  • $body_bytes_sent number of bytes sent to the client, excluding the size of the response headers
  • $http_user_agent client browser (User-Agent) information
  • $http_referer the referer address of the request
  • $request the complete original request line
  • $request_method HTTP request method, usually "GET" or "POST"
  • $request_time request processing time
  • $request_uri the complete request URI
  • $server_protocol the HTTP version of the request, usually "HTTP/1.0" or "HTTP/1.1"
  • $request_body POST request parameters; the parameters must be submitted in a form body
  • token $http_token (prefix a custom header field with http_ to write that header field into the log)
  • version $arg_version (prefix a custom request argument with arg_ to write that argument into the log)

Configure the output log format in the Nginx configuration file:

log_format main "$remote_addr,$time_local,$status,$body_bytes_sent,$http_user_agent,$http_referer,$request_method,$request_time,$request_uri,$server_protocol,$request_body,$http_token";
access_log  logs/access.log  main;

The complete Nginx configuration is as follows:

user  hadoop hadoop;
worker_processes  auto;

error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;

#pid        logs/nginx.pid;


events {
	use epoll;
    worker_connections  1024;
}


http {
     include       mime.types;
     default_type application/octet-stream;
 	 client_max_body_size     16m;
     client_body_buffer_size  256k;
     proxy_connect_timeout    1200;
     proxy_read_timeout       1200;
     proxy_send_timeout       6000;
     proxy_buffer_size        32k;
     proxy_buffers            4 64k;
     proxy_busy_buffers_size 128k;
     proxy_temp_file_write_size 128k;
	 
     #custom Nginx log format
     log_format main "$remote_addr,$time_local,$status,$body_bytes_sent,$http_user_agent,$http_referer,$request_method,$request_time,$request_uri,$server_protocol,$request_body,$http_token";
     access_log  logs/access.log  main;

    sendfile        on;
    #tcp_nopush     on;
    #duration of HTTP keep-alive connections
    keepalive_timeout  65;
 
    #gzip compression settings
    gzip  on;            #enable gzip
    gzip_min_length 1k;  #minimum file size to compress
    gzip_buffers 4 16k;  #compression buffers
 
    #HTTP protocol version (1.0/1.1), default is 1.1; use 1.0 if the frontend is squid 2.5
    gzip_http_version 1.1;
 
    #gzip compression level: 1 is the lowest compression and fastest, 9 is the highest compression but slowest (faster transfer, more CPU)
    gzip_comp_level 2;
 
    #adds a Vary header for proxy servers; some clients support compression and some do not, so the decision to compress is made from the client's HTTP headers
    gzip_vary on;
 
    #MIME types to gzip; do not add text/html, otherwise a warning is emitted
    gzip_types text/plain text/javascript text/css application/xml application/x-javascript application/json;
	
    server {
        listen       80;
        server_name  192.168.175.100;
        location / {
            root   html;
            index  index.html index.htm;
        }
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}

A sample log line matching the configured format looks like this:

192.168.175.100,31/Jul/2019:23:12:50 +0000,200,556,okhttp/3.8.1,-,GET,0.028,/resource/test.jpg,HTTP/1.1,-,-
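
To check that the format takes effect, a quick sketch (assuming Nginx is installed under /usr/local/nginx-1.17.2, as in the Flume configuration later, and that a /resource/test.jpg file exists under the web root) is to reload Nginx, issue a request, and look at the last access-log line:

/usr/local/nginx-1.17.2/sbin/nginx -s reload
curl http://192.168.175.100/resource/test.jpg
tail -n 1 /usr/local/nginx-1.17.2/logs/access.log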

2. Collecting and Cleaning with Flume

Copy the dependency jars required by the Hive sink into Flume's lib directory.

The complete set of jar files Flume depends on can be downloaded from https://download.csdn.net/download/l1028386804/11459670.
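
Alternatively, the jars can be copied from a local Hive installation; a minimal sketch, assuming Hive is installed at /usr/local/hive-2.3.5 and Flume at /usr/local/flume-1.9.0, would be:

cp /usr/local/hive-2.3.5/hcatalog/share/hcatalog/*.jar /usr/local/flume-1.9.0/lib/
cp /usr/local/hive-2.3.5/lib/hive-*.jar /usr/local/flume-1.9.0/lib/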

Copy the flume-env.sh.template file in the FLUME_HOME/conf directory to flume-env.sh, and add the following line to flume-env.sh:

export JAVA_HOME=/usr/local/jdk1.8.0_212

The contents of the flume-hive-acc.conf.properties configuration file are as follows:

#define the agent name and the names of the source, channel, and sink
myagent.sources = s1
myagent.channels = c1
myagent.sinks = k1
# configure the source
myagent.sources.s1.type = exec
myagent.sources.s1.batchSize=50
myagent.sources.s1.channels = c1
myagent.sources.s1.deserializer.outputCharset = UTF-8
# the log file to monitor
myagent.sources.s1.command = tail -F /usr/local/nginx-1.17.2/logs/access.log
# maximum line length accepted by the deserializer
myagent.sources.s1.deserializer.maxLineLength =1048576
# the following properties apply to a spooling directory source and are ignored by the exec source above
myagent.sources.s1.fileSuffix = .DONE
myagent.sources.s1.ignorePattern = access(_\d{4}\-\d{2}\-\d{2}_\d{2})?\.log(\.DONE)?
myagent.sources.s1.consumeOrder = oldest
myagent.sources.s1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# define the channel
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 10000
myagent.channels.c1.transactionCapacity = 100
# define the sink
myagent.sinks.k1.type=hive
myagent.sinks.k1.channel = c1
myagent.sinks.k1.batchSize=50
# Hive metastore address
myagent.sinks.k1.hive.batchSize =50
myagent.sinks.k1.hive.metastore=thrift://binghe100:9083
myagent.sinks.k1.hive.database=hive_test
myagent.sinks.k1.hive.table=nginx_log
myagent.sinks.k1.serializer=delimited
# input field delimiter
myagent.sinks.k1.serializer.delimiter=","
# output (SerDe) field separator
myagent.sinks.k1.serializer.serdeSeparator=','
myagent.sinks.k1.serializer.fieldnames=remote_addr,time_local,status,body_bytes_sent,http_user_agent,http_referer,request_method,request_time,request_uri,server_protocol,request_body,http_token,id,appkey,sing,version
# wire the source, channel, and sink together
myagent.sources.s1.channels = c1
myagent.sinks.k1.channel = c1

Start Flume:

nohup flume-ng agent -c /usr/local/flume-1.9.0/conf -f /usr/local/flume-1.9.0/conf/flume-hive-acc.conf.properties -n myagent -Dflume.root.logger=INFO,console >> /dev/null &
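
Because the output is discarded to /dev/null, you can verify that the agent is actually running with jps or ps, for example:

jps -l | grep flume
ps -ef | grep flume-hive-acc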

3. Hive Operations

Log in to the server as the hadoop user and run the following commands to start Hive:

nohup hive --service metastore >> ~/metastore.log 2>&1 &        

nohup  hive --service hiveserver2 >> ~/hiveserver2.log 2>&1 &   

Change the permissions:

hadoop fs -ls /user/hive/warehouse

hadoop fs -chmod 777 /user/hive/warehouse/hive_test.db

Connect to Hive in remote mode:

-bash-4.1$ beeline
beeline>  !connect jdbc:hive2://localhost:10000 hadoop hadoop
0: jdbc:hive2://localhost:10000> 

Table creation statement (the columns match the Flume output):

DROP TABLE IF EXISTS nginx_log;
create table nginx_log(
  remote_addr string, time_local string, status string, body_bytes_sent string,
  http_user_agent string, http_referer string, request_method string, request_time string,
  request_uri string, server_protocol string, request_body string, http_token string,
  id string, appkey string, sing string, version string
)
clustered by (id) into 5 buckets
stored as orc
TBLPROPERTIES ('transactional'='true');
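
You can confirm that the table was created as a bucketed ORC table with the transactional property set by describing it:

DESCRIBE FORMATTED nginx_log;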

Next, configure transactions in hive-site.xml so that Hive supports ACID operations. The newly added configuration is shown below:

<!-- enable transactions -->
<property>
    <name>hive.support.concurrency</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
</property>
<property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
</property>
<property>
    <name>hive.compactor.worker.threads</name>
    <value>5</value>
</property>
<property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
</property>

The complete hive-site.xml configuration for Hive is as follows:

<configuration>
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://192.168.175.100:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;characterEncoding=UTF-8</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>hive</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>hive</value>
	</property>
	<property>
		<name>hive.metastore.local</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.server2.logging.operation.log.location</name>
		<value>/usr/local/hive-2.3.5/operation_logs</value>
	</property>
	<property> 
		<name>hive.exec.scratchdir</name> 
		<value>/usr/local/hive-2.3.5/exec</value> 
	</property> 
	<property>
		<name>hive.exec.local.scratchdir</name>
		<value>/usr/local/hive-2.3.5/scratchdir</value>
	</property>
	<property>
		<name>hive.downloaded.resources.dir</name>
		<value>/usr/local/hive-2.3.5/resources</value>
	</property>
	<property>
		<name>hive.querylog.location</name>
		<value>/usr/local/hive-2.3.5/querylog</value>
	</property>
	<property>
		<name>hive.metastore.uris</name>
		<value>thrift://binghe100:9083</value>
	</property>
	
	<property>
		<name>hive.support.concurrency</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.exec.dynamic.partition.mode</name>
		<value>nonstrict</value>
	</property>
	<property>
		<name>hive.txn.manager</name>
		<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
	</property>
	<property>
		<name>hive.compactor.initiator.on</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.compactor.worker.threads</name>
		<value>5</value>
	</property>
	<property>
		<name>hive.enforce.bucketing</name>
		<value>true</value>
	</property>
</configuration>
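
Since hive-site.xml is only read when the services start, restart the metastore and HiveServer2 after changing it, then confirm in Beeline that the transactional settings are active:

SET hive.txn.manager;
SET hive.support.concurrency;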

Enter http://192.168.175.100 in the browser's address bar to access Nginx, then run the following commands in the Hive CLI to query the data:

hive> show databases;

hive> use hive_test;

hive> show tables;

hive> select * from nginx_log;
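
Once records are flowing in, the table can be queried like any other Hive table; as an illustrative example, an aggregation counting requests per status code:

hive> select status, count(*) as cnt from nginx_log group by status;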

 
