Flume: importing data from MySQL into HDFS

# Declare the source, channel, and sink
a1.sources=sqlSource
a1.channels=c1
a1.sinks=s1

# Declare the source type
a1.sources.sqlSource.type=org.keedio.flume.source.SQLSource
a1.sources.sqlSource.hibernate.connection.url=jdbc:mysql://10.1.20.115:3306/t_datax
a1.sources.sqlSource.hibernate.connection.user=root
a1.sources.sqlSource.hibernate.connection.password=123456

# Important: autocommit defaults to false; if it is not set to true, the query will not be executed automatically
a1.sources.sqlSource.hibernate.connection.autocommit=true
# Hibernate dialect for MySQL
a1.sources.sqlSource.hibernate.dialect=org.hibernate.dialect.MySQL5Dialect
a1.sources.sqlSource.hibernate.connection.driver_class=com.mysql.jdbc.Driver

# Interval between queries, in milliseconds
a1.sources.sqlSource.run.query.delay=10000

# Directory where Flume keeps the status file, and the status file name
a1.sources.sqlSource.status.file.path=/opt/flume/log
a1.sources.sqlSource.status.file.name=sqlSource.status

# Start value for the incremental column
a1.sources.sqlSource.start.from=0

# Custom SQL query. Note: the incremental column is the id field, i.e. the primary key column; testing shows this is the default behaviour.
# The primary key must be included in the SELECT, because without it Flume cannot record where the previous query stopped.
# $@$ stands for the last value of the incremental column from the previous query, which is recorded in the status file.
a1.sources.sqlSource.custom.query=select id, name, img, utime from t_datax1 where id > $@$

# Batch size and maximum rows per query
a1.sources.sqlSource.batch.size=1000
a1.sources.sqlSource.max.rows=1000

# Delimiter for the query result fields
a1.sources.sqlSource.delimiter.entry=|

#a1.sources.sqlSource.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
#a1.sources.sqlSource.hibernate.c3p0.min_size=3
#a1.sources.sqlSource.hibernate.c3p0.max_size=10


# Interceptor that strips the double quotes the source wraps around values (enclose.by.quotes defaults to true)
a1.sources.sqlSource.interceptors=i1
a1.sources.sqlSource.interceptors.i1.type=search_replace
a1.sources.sqlSource.interceptors.i1.searchPattern="
a1.sources.sqlSource.interceptors.i1.replaceString=

# Use a memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=10000
a1.channels.c1.byteCapacityBufferPercentage=20
a1.channels.c1.byteCapacity=800000


a1.sinks.s1.type=HDFS
# ns is the NameNode nameservice; using it instead of a fixed address has two benefits: it guards against a NameNode failure, and the path always resolves to the active NameNode
a1.sinks.s1.hdfs.path=hdfs://10.1.80.5:8020/user/root/test
a1.sinks.s1.hdfs.fileType=DataStream
a1.sinks.s1.hdfs.writeFormat=Text
# Roll interval: generate a new file every N seconds; 0 disables time-based rolling so all data can be written to a single file
a1.sinks.s1.hdfs.rollInterval=0
# Roll size: generate the next file once this many bytes have been written; 128 MB is recommended to match the HDFS block size
a1.sinks.s1.hdfs.rollSize=134217728
# Roll count: generate the next file after this many events (rows); 0 disables count-based rolling
a1.sinks.s1.hdfs.rollCount=0

a1.sources.sqlSource.channels=c1
a1.sinks.s1.channel=c1
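
With the configuration above saved to a file (the name flume-mysql-hdfs.conf below is only an assumption), the agent can be started with the standard flume-ng command, using the agent name a1 declared in the config:

  $ flume-ng agent -n a1 -c $FLUME_HOME/conf -f flume-mysql-hdfs.conf -Dflume.root.logger=INFO,console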

Incremental import into HDFS with Flume:

Requirements:

  • JDK 1.8

  • Maven

Plugin: flume-ng-sql-source

https://github.com/keedio/flume-ng-sql-source
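
A minimal sketch of fetching and building the plugin from that repository, assuming git and Maven are available; the jar produced under target/ will carry whatever version the checked-out code declares:

  $ git clone https://github.com/keedio/flume-ng-sql-source.git
  $ cd flume-ng-sql-source
  $ mvn package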

flume-ng-sql-source

This project is used for flume-ng to communicate with sql databases

Current sql database engines supported

  • After the last update the code has been integrated with hibernate, so all databases supported by this technology should work.

Compilation and packaging

  $ mvn package

Deployment

Copy the flume-ng-sql-source-<version>.jar from the target folder into the Flume plugins directory

  $ mkdir -p $FLUME_HOME/plugins.d/sql-source/lib $FLUME_HOME/plugins.d/sql-source/libext
  $ cp flume-ng-sql-source-0.8.jar $FLUME_HOME/plugins.d/sql-source/lib
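
When Flume is launched through the flume-ng script it picks up everything under $FLUME_HOME/plugins.d automatically, so a quick listing is enough to verify the layout (a sketch):

  $ ls $FLUME_HOME/plugins.d/sql-source/lib $FLUME_HOME/plugins.d/sql-source/libext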

Specific installation by database engine

MySQL

Download the official MySQL JDBC driver and copy it into the libext flume plugins directory:

$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz
$ tar xzf mysql-connector-java-5.1.35.tar.gz
$ cp mysql-connector-java-5.1.35-bin.jar $FLUME_HOME/plugins.d/sql-source/libext

Microsoft SQLServer

Download the official Microsoft 4.1 SQL Server JDBC driver and copy it into the libext flume plugins directory:
Download URL: https://www.microsoft.com/es-es/download/details.aspx?id=11774

$ tar xzf sqljdbc_4.1.5605.100_enu.tar.gz
$ cp sqljdbc_4.1/enu/sqljdbc41.jar $FLUME_HOME/plugins.d/sql-source/libext

IBM DB2

Download the official IBM DB2 JDBC driver and copy it into the libext flume plugins directory.
Download URL: http://www-01.ibm.com/support/docview.wss?uid=swg21363866
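
For example (a sketch only; the jar name db2jcc4.jar is an assumption and depends on the driver package you download):

$ cp db2jcc4.jar $FLUME_HOME/plugins.d/sql-source/libext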

Configuration of SQL Source:

Mandatory properties: channels, type, hibernate.connection.url, status.file.name

Property Name | Default | Description
channels | - | Connected channel names
type | - | The component type name, needs to be org.keedio.flume.source.SQLSource
hibernate.connection.url | - | Url to connect with the remote database
hibernate.connection.user | - | Username to connect with the database
hibernate.connection.password | - | Password to connect with the database
table | - | Table to export data
status.file.name | - | Local file name to save the last row number read
status.file.path | /var/lib/flume | Path to save the status file
start.from | 0 | Start value to import data
delimiter.entry | , | Delimiter of incoming entry
enclose.by.quotes | true | If quotes are applied to all values in the output
columns.to.select | * | Which columns of the table will be selected
run.query.delay | 10000 | ms to wait between run queries
batch.size | 100 | Batch size to send events to the flume channel
max.rows | 10000 | Max rows to import per query
read.only | false | Sets a read-only session with the DB
custom.query | - | Custom query to force a special request to the DB; be careful. Check the explanation of this property below
hibernate.connection.driver_class | - | Driver class to use by hibernate; if not specified the framework will auto-assign one
hibernate.dialect | - | Dialect to use by hibernate; if not specified the framework will auto-assign one. Check https://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch03.html#configuration-optional-dialects for a complete list of available dialects
hibernate.connection.provider_class | - | Set to org.hibernate.connection.C3P0ConnectionProvider to use the C3P0 connection pool (recommended for production)
hibernate.c3p0.min_size | - | Min connection pool size
hibernate.c3p0.max_size | - | Max connection pool size
default.charset.resultset | UTF-8 | Result set from the DB converted to this charset character encoding

Standard Query

If no custom query is set, SELECT <columns.to.select> FROM <table> will be executed every run.query.delay milliseconds.
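
For example, a minimal standard-query setup for the MySQL source configured earlier might look like this (a sketch, used instead of custom.query; the table name t_datax1 is taken from the custom query above):

a1.sources.sqlSource.table = t_datax1
a1.sources.sqlSource.columns.to.select = *
a1.sources.sqlSource.run.query.delay = 10000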

Custom Query

A custom query is supported, which makes the entire SQL language available. This is powerful but risky; be careful with the custom queries you use.

To avoid exporting the same rows repeatedly, use the $@$ special token in the WHERE clause to incrementally export rows that have not yet been processed as well as newly inserted ones.

IMPORTANT: For proper operation of a custom query, make sure the incremental field is returned in the first position of the query result.

Example:

agent.sources.sql-source.custom.query = SELECT incrementalField,field2 FROM table1 WHERE incrementalField > $@$ 

Configuration example

# For each one of the sources, the type is defined
agent.sources.sqlSource.type = org.keedio.flume.source.SQLSource

agent.sources.sqlSource.hibernate.connection.url = jdbc:db2://192.168.56.70:50000/sample

# Hibernate Database connection properties
agent.sources.sqlSource.hibernate.connection.user = db2inst1
agent.sources.sqlSource.hibernate.connection.password = db2inst1
agent.sources.sqlSource.hibernate.connection.autocommit = true
agent.sources.sqlSource.hibernate.dialect = org.hibernate.dialect.DB2Dialect
agent.sources.sqlSource.hibernate.connection.driver_class = com.ibm.db2.jcc.DB2Driver

#agent.sources.sqlSource.table = employee1

# Columns to import (default * imports the entire row)
#agent.sources.sqlSource.columns.to.select = *

# Query delay: the query is run every configured number of milliseconds
agent.sources.sqlSource.run.query.delay=10000

# The status file is used to save the last row read
agent.sources.sqlSource.status.file.path = /var/log/flume
agent.sources.sqlSource.status.file.name = sqlSource.status

# Custom query
agent.sources.sqlSource.start.from = 19700101000000000000
agent.sources.sqlSource.custom.query = SELECT * FROM (select DECIMAL(test) * 1000000 AS INCREMENTAL, EMPLOYEE1.* from employee1 UNION select DECIMAL(test) * 1000000 AS INCREMENTAL, EMPLOYEE2.* from employee2) WHERE INCREMENTAL > $@$ ORDER BY INCREMENTAL ASC

agent.sources.sqlSource.batch.size = 1000
agent.sources.sqlSource.max.rows = 1000
agent.sources.sqlSource.delimiter.entry = |

agent.sources.sqlSource.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
agent.sources.sqlSource.hibernate.c3p0.min_size=1
agent.sources.sqlSource.hibernate.c3p0.max_size=10

# The channel can be defined as follows.
agent.sources.sqlSource.channels = memoryChannel
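
The example wires the source to a channel named memoryChannel but does not define it; a minimal definition (a sketch, mirroring the memory-channel settings used at the top of this post) would be:

agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 10000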

Known Issues

An issue with Java SQL types and Hibernate types can appear when using SQL Server databases with the SQL Server dialect that ships with Hibernate.

Something like:

org.hibernate.MappingException: No Dialect mapping for JDBC type: -15

Use org.keedio.flume.source.SQLServerCustomDialect in the Flume configuration file to solve this problem.
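
In the configuration file that amounts to pointing the hibernate.dialect property at the custom dialect (the agent and source names below are simply the ones used in the example above):

agent.sources.sqlSource.hibernate.dialect = org.keedio.flume.source.SQLServerCustomDialect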

Special thanks

I used flume-ng-kafka to guide me (https://github.com/baniuyao/flume-ng-kafka-source.git). Thanks to Frank Yao.

Version History

  • Version 1.5.1: the charset encoding of the result set is now configurable.
  • Stable version is 1.5.0 (compatible with Apache Flume 1.8.0)
  • Previous stable version is 1.4.3 (compatible with Apache Flume prior to 1.7.0)

Note: to import data into HDFS, the Hadoop-related jar files must be placed in Flume's lib directory.
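
One way to do this, sketched below for a typical Hadoop 2.x layout; HADOOP_HOME and FLUME_HOME are assumed to be set, and further dependency jars (for example commons-configuration or htrace) may be needed depending on the Hadoop version, as reported by ClassNotFoundException errors at agent startup:

$ cp $HADOOP_HOME/share/hadoop/common/hadoop-common-*.jar $FLUME_HOME/lib/
$ cp $HADOOP_HOME/share/hadoop/hdfs/hadoop-hdfs-*.jar $FLUME_HOME/lib/
$ cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-auth-*.jar $FLUME_HOME/lib/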
