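1. Determine which host Flume is installed on.
2. Confirm that Flume works on that host.
In the directory you have chosen, create a file named bigdata_page_to_hive.conf.
Its content can be the example from the official user guide: http://flume.apache.org/FlumeUserGuide.html
Start the agent: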
flume-ng agent --conf conf --conf-file bigdata_page_to_hive.conf --name a1 -Dflume.root.logger=INFO,console
3. Have Flume write data into Hive
3.1 Verify that Hive itself works
3.2 Create the table
create table t_pages(
  date string,
  user_id string,
  session_id string,
  page_id string,
  action_time string,
  search_keyword string,
  click_category_id string,
  click_product_id string,
  order_category_ids string,
  order_product_ids string,
  pay_category_ids string,
  pay_product_ids string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
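3.3 Use hive as the Flume sink type
The Hive sink depends on Hive's metastore service, so first check that the service is running. The sink refers to it like this:
a1.sinks.k1.hive.metastore = thrift://master:9083
You can use telnet to check whether the port is reachable (although checking in the CDH web UI is preferable).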
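If telnet is not installed, the same reachability check can be sketched in shell. This is only a minimal sketch: the helper name port_open is hypothetical, it assumes bash and the coreutils timeout command are available, and it only tells you that something accepts TCP connections on the port, not that the metastore is healthy.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
port_open() {
  host="$1"
  port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP connection; timeout bounds the wait.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Host and port taken from a1.sinks.k1.hive.metastore = thrift://master:9083
if port_open master 9083; then
  echo "metastore port is reachable"
else
  echo "metastore port is NOT reachable"
fi
```

A non-zero result covers both a closed port and an unresolvable host name, so check name resolution separately if the result is unexpected.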
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://master:9083
a1.sinks.k1.hive.database = default
a1.sinks.k1.hive.table = t_pages
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = date,user_id,session_id,page_id,action_time,search_keyword,click_category_id,click_product_id,order_category_ids,order_product_ids,pay_category_ids,pay_product_ids,city_id

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent:
nohup flume-ng agent --conf conf --conf-file bigdata_page_to_hive.conf --name a1 &
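An exception appears:
java.lang.NoClassDefFoundError: org/apache/hive/hcatalog/streaming/RecordWriter
Possible causes:
1. A dependency has not been added
2. Maven may not have downloaded the dependency completely
3. A jar conflict
Here the cause is a missing dependency: Flume lacks a required jar.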
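1. Use the exception message to determine which jar is missing
A web search pointed to the jar that was missing:
https://zhidao.baidu.com/question/923836961800918739.html
find / -name 'hive-hcatalog-core*'
After filtering out symlinks, comparing versions, and some guesswork, this jar was picked as the first candidate:
/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars/hive-hcatalog-core-1.1.0-cdh5.11.1.jar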
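2. If the jar found is the one you need, where should it go?
The log output printed when flume-ng starts shows the directory on its classpath:
/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/flume-ng/lib/*
3. Fix the problem
The jar copied is hive-hcatalog-streaming, which matches the org.apache.hive.hcatalog.streaming package named in the exception:
cp /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars/hive-hcatalog-streaming-1.1.0-cdh5.11.1.jar /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/flume-ng/lib/
A symbolic link works as well (run it from inside the flume-ng lib directory so that the link is created there):
ln -s /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars/hive-hcatalog-streaming-1.1.0-cdh5.11.1.jar hive-hcatalog-streaming-1.1.0-cdh5.11.1.jar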
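Instead of guessing from search results, the jar that provides a missing class can be looked up directly. The following is a rough sketch, not part of the original procedure: the function names class_to_entry and find_jar_for_class are hypothetical, and it assumes the unzip command is installed on the host.

```shell
# Turn a class name from an exception message into its entry path inside a jar.
# Accepts both dotted names (ClassNotFoundException) and slashed names (NoClassDefFoundError).
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar under a directory whose listing contains the given class.
find_jar_for_class() {
  dir="$1"
  entry="$(class_to_entry "$2")"
  find "$dir" -name '*.jar' 2>/dev/null | while IFS= read -r jar; do
    # -F treats the entry as a fixed string, so '$' in inner class names is safe.
    if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
      printf '%s\n' "$jar"
    fi
  done
}

# Example with the CDH jars directory used in this walkthrough.
JARS_DIR="/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars"
if [ -d "$JARS_DIR" ]; then
  find_jar_for_class "$JARS_DIR" 'org/apache/hive/hcatalog/streaming/RecordWriter'
fi
```

Scanning every jar is slow on a large parcel directory, but it only needs to run once per missing class.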
Exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/MetaException
Fix:
ln -s /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars/hive-metastore-1.1.0-cdh5.11.1.jar hive-metastore-1.1.0-cdh5.11.1.jar
Exception: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.session.SessionState
Fix:
ln -s /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars/hive-exec-1.1.0-cdh5.11.1.jar hive-exec-1.1.0-cdh5.11.1.jar
Exception: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliSessionState
Fix:
ln -s /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars/hive-cli-1.1.0-cdh5.11.1.jar /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/flume-ng/lib/hive-cli-1.1.0-cdh5.11.1.jar
Exception: org.apache.commons.cli.MissingOptionException: Missing required option: n
Cause: the --name option was left out when running the command.
Exception: java.lang.ClassNotFoundException: com.facebook.fb303.FacebookService$Iface
Fix:
ln -s /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/jars/libfb303-0.9.3.jar /opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/flume-ng/lib/libfb303-0.9.3.jar
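The missing-jar fixes above can also be applied in one pass. Here is a minimal sketch that links the jars named in this walkthrough into the flume-ng lib directory; the helper name link_jars is hypothetical, and the jar list covers only the jars that surfaced here, so other environments may need more.

```shell
# Hypothetical helper: symlink the named jars from src_dir into dest_dir.
link_jars() {
  src_dir="$1"
  dest_dir="$2"
  shift 2
  for jar in "$@"; do
    if [ -f "$src_dir/$jar" ]; then
      # -f replaces a stale link left by an earlier attempt.
      ln -sf "$src_dir/$jar" "$dest_dir/$jar"
    else
      echo "missing: $src_dir/$jar" >&2
    fi
  done
}

# Paths from the CDH 5.11.1 parcel used in this walkthrough.
CDH_HOME="/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4"
if [ -d "$CDH_HOME/lib/flume-ng/lib" ]; then
  link_jars "$CDH_HOME/jars" "$CDH_HOME/lib/flume-ng/lib" \
    hive-hcatalog-streaming-1.1.0-cdh5.11.1.jar \
    hive-metastore-1.1.0-cdh5.11.1.jar \
    hive-exec-1.1.0-cdh5.11.1.jar \
    hive-cli-1.1.0-cdh5.11.1.jar \
    libfb303-0.9.3.jar
fi
```

Restart the agent after linking so that the new jars are picked up on its classpath.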
Exception: Cannot stream to table that has not been bucketed : {metaStoreUri='thrift://master:9083', database='default', table='t_pages', partitionVals=[] }
Fix: the Hive sink requires the target table to be bucketed.
create table t_pages(
  date string,
  user_id string,
  session_id string,
  page_id string,
  action_time string,
  search_keyword string,
  click_category_id string,
  click_product_id string,
  order_category_ids string,
  order_product_ids string,
  pay_category_ids string,
  pay_product_ids string,
  city_id string
)
CLUSTERED BY (city_id) INTO 20 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Exception: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.AcidOutputFormat
Fix: the only class that implements AcidOutputFormat is OrcOutputFormat, so the Hive table must be stored as ORC.
create table t_pages(
  date string,
  user_id string,
  session_id string,
  page_id string,
  action_time string,
  search_keyword string,
  click_category_id string,
  click_product_id string,
  order_category_ids string,
  order_product_ids string,
  pay_category_ids string,
  pay_product_ids string,
  city_id string
)
CLUSTERED BY (city_id) INTO 20 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC;
Test: check in Hive whether the data has arrived.
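One way to run this test is to push a single tab-separated record into the netcat source and then query the table. The sketch below assumes nc is installed on the Flume host and that the agent listens on localhost:44444 as configured above. The helper name make_record and all field values are made up for illustration. The commands that touch Flume and Hive are left commented out.

```shell
# Hypothetical helper: join all arguments with tab characters into one line.
make_record() {
  first=1
  for field in "$@"; do
    if [ "$first" -eq 1 ]; then
      printf '%s' "$field"
      first=0
    else
      printf '\t%s' "$field"
    fi
  done
  printf '\n'
}

# 13 sample values (made up), in the order given by serializer.fieldnames.
record="$(make_record 2017-08-01 u1 s1 p1 '2017-08-01 10:00:00' kw 1 2 3 4 5 6 7)"
printf '%s\n' "$record"

# On the Flume host, send the record to the netcat source, then look in Hive:
# printf '%s\n' "$record" | nc localhost 44444
# hive -e 'select * from default.t_pages limit 10;'
```

The number of fields must match the 13 names in serializer.fieldnames, otherwise the delimited serializer cannot map the record onto the table columns.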
4. Modify the sources
The following message was logged:
capacity 100 full, consider committing more frequently, increasing capacity, or increasing thread count
5. Preferably switch the channel storage to file
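For the switch to a file-backed channel, the relevant settings can be generated as shown below. This is a sketch under assumptions: type, checkpointDir and dataDirs are the Flume file channel properties, but the helper name file_channel_conf and the base directory are hypothetical, so pick a directory on a disk with enough space that the Flume user can write to.

```shell
# Hypothetical helper: print file channel settings for a given agent and channel.
file_channel_conf() {
  agent="$1"
  channel="$2"
  base_dir="$3"
  cat <<EOF
${agent}.channels.${channel}.type = file
${agent}.channels.${channel}.checkpointDir = ${base_dir}/checkpoint
${agent}.channels.${channel}.dataDirs = ${base_dir}/data
EOF
}

# The base directory is an assumption, not taken from the original setup.
file_channel_conf a1 c1 /var/lib/flume-ng/file-channel
```

Use the printed lines in place of the memory channel lines in bigdata_page_to_hive.conf. A file channel keeps events on disk, so they survive an agent restart, at the cost of lower throughput than the memory channel.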