如何使用Maxwell和flume,kafka 把MySQL数据实时同步到HDFs？

最新推荐文章于 2024-07-30 09:52:05 发布

Fei-joe

最新推荐文章于 2024-07-30 09:52:05 发布

阅读量2.1k

点赞数

分类专栏： mapreduce hadoop

本文链接：https://blog.csdn.net/qq_33290422/article/details/80225432

版权

mapreduce hadoop 专栏收录该内容

10 篇文章 0 订阅

订阅专栏

Hadoop实战：使用Maxwell把MySQL数据实时同步到HDFs

Maxwell介绍

Maxwell是一个守护程序，一个应用程序，能够读取MySQL Binlogs然后解析输出为json。支持数据输出到Kafka中，支持表和库过滤。

→ Reference：http://maxwells-daemon.io

→ Download: https://github.com/zendesk/maxwell/releases/download/v1.10.3/maxwell-1.10.3.tar.gz

→ Source: https://github.com/zendesk/maxwell

配置MySQL->Maxwell->Kafka->Flume->HDFS

1）MySQL配置要求

配置要求

 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
        [ 
        mysqld 
        ] 
       
        server 
        - 
        id 
        = 
        1 
       
        log 
        - 
        bin 
        = 
        master 
       
        binlog_format 
        = 
        row #######################将mixed 改成row !!!! 
       
        binlog_row_image 
        = 
        FULL

权限要求

 
  
         1 
       

         2 
       

         3 
       

         4 
       
 
        GRANT  
        ALL  
        on  
        maxwell 
        . 
        * 
          
        to 
          
        'maxwell' 
        @ 
        '%' 
          
        identified  
        by 
          
        'maxwell' 
        ; 
       
 
        GRANT  
        ALL  
        on  
        maxwell 
        . 
        * 
          
        to 
          
        'maxwell' 
        @ 
        'localhost' 
          
        identified  
        by 
          
        'maxwell' 
        ; 
       
 
        GRANT  
        SELECT 
        , 
          
        REPLICATION  
        CLIENT 
        , 
          
        REPLICATION  
        SLAVE  
        on 
          
        * 
        . 
        * 
          
        to 
          
        'maxwell' 
        @ 
        '%' 
        ; 
       
 
        GRANT  
        SELECT 
        , 
          
        REPLICATION  
        CLIENT 
        , 
          
        REPLICATION  
        SLAVE  
        on 
          
        * 
        . 
        * 
          
        to 
          
        'maxwell' 
        @ 
        'localhost' 
        ; 
       
 
 

2）安装配置Kafka

确认已安装java运行环境，直接解压Kafka即可使用。

 
         1 
       
        $ 
          
        tar  
        xvf  
        kafka_2 
        . 
        10 
        - 
        0.10.2.1.tgz 
          
        - 
        C 
          
        / 
        mnt

解压后，编辑配置文件：

 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
         6 
       
         7 
       
         8 
       
         9 
       
         10 
       
         11 
       
         12 
       
         13 
       
         14 
       
         15 
       
         16 
       
         17 
       
         18 
       
         19 
       
         20 
       
         21 
       
         22 
       
         23 
       
         24 
       
         25 
       
         26 
       
         27 
       
         28 
       
         29 
       
         30 
       
        $ 
          
        cat 
          
        / 
        mnt 
        / 
        kafka_2 
        . 
        10 
        - 
        0.10.2.1 
        / 
        config 
        / 
        server 
        .properties 
              
        ############################# Server Basics ############################# 
       
        broker 
        .id 
        = 
        0 
       
        delete 
        .topic 
        .enable 
        = 
        true 
       
        ############################# Socket Server Settings ############################# 
       
        listeners 
        = 
        PLAINTEXT 
        : 
        / 
        / 
        0.0.0.0 
        : 
        9092 
       
        num 
        .network 
        .threads 
        = 
        3 
       
        num 
        .io 
        .threads 
        = 
        8 
       
        socket 
        .send 
        .buffer 
        .bytes 
        = 
        102400 
       
        socket 
        .receive 
        .buffer 
        .bytes 
        = 
        102400 
       
        socket 
        .request 
        .max 
        .bytes 
        = 
        104857600 
       
        ############################# Log Basics ############################# 
       
        log 
        .dirs 
        = 
        / 
        tmp 
        / 
        kafka 
        - 
        logs 
       
        num 
        .partitions 
        = 
        1 
       
        num 
        .recovery 
        .threads 
        .per 
        .data 
        .dir 
        = 
        1 
       
        ############################# Log Flush Policy ############################# 
       
        log 
        .flush 
        .interval 
        .messages 
        = 
        10000 
       
        log 
        .flush 
        .interval 
        .ms 
        = 
        1000 
       
        ############################# Log Retention Policy ############################# 
       
        log 
        .retention 
        .hours 
        = 
        168 
       
        log 
        .segment 
        .bytes 
        = 
        1073741824 
       
        log 
        .retention 
        .check 
        .interval 
        .ms 
        = 
        300000 
       
        ############################# Zookeeper ############################# 
       
        zookeeper 
        .connect 
        = 
        localhost 
        : 
        2181 
       
        zookeeper 
        .connection 
        .timeout 
        .ms 
        = 
        6000

kafka需要依赖zookeeper，所以需要先启动zookeeper。

 
  
         1 
       
 
        $ 
          
        nohup 
          
        /mnt/ 
        kafka_2 
        . 
        10 
        - 
        0.10.2.1 
        / 
        bin 
        / 
        zookeeper 
        - 
        server 
        - 
        start 
        .sh 
          
        / 
        mnt 
        / 
        kafka_2 
        . 
        10 
        - 
        0.10.2.1 
        / 
        config 
        / 
        zookeeper 
        .properties 
          
        & 
       
 
 

启动Kafka Server：(指定JMX_PORT端口，可以通过Kafka-manager获取统计信息)

 
  
         1 
       
 
        $ 
          
        nohup 
          
        / 
        mnt 
        / 
        kafka_2 
        . 
        10 
        - 
        0.10.2.1 
        / 
        bin 
        / 
        kafka 
        - 
        server 
        - 
        start 
        .sh 
          
        / 
        mnt 
        / 
        kafka_2 
        . 
        10 
        - 
        0.10.2.1 
        / 
        config 
        / 
        server 
        .properties 
          
        & 
       
 
 

3）安装配置Flume

去Apache官网下载Flume二进制安装包，然后解压即可。

设置环境变量

 
  
         1 
       

         2 
       

         3 
       

         4 
       
 
        $ 
          
        cat 
          
        / 
        etc 
        / 
        profile 
        .d 
        / 
        flume 
        .sh 
       
 
        export  
        FLUME_HOME 
        = 
        / 
        usr 
        / 
        local 
        / 
        flume 
       
 
        export  
        FLUME_CONF_DIR 
        = 
        $FLUME_HOME 
        / 
        conf 
       
 
        export  
        PATH 
        = 
        $PATH 
        : 
        $FLUME_HOME 
        / 
        bin 
       
 
 

查看Flume版本

 
  
         1 
       

         2 
       

         3 
       

         4 
       

         5 
       
 
        $ 
          
        flume 
        - 
        ng  
        version 
       
 
        Flume 
          
        1.7.0 
       
 
        Source  
        code  
        repository 
        : 
          
        https 
        : 
        / 
        / 
        git 
        - 
        wip 
        - 
        us 
        .apache 
        .org 
        / 
        repos 
        / 
        asf 
        / 
        flume 
        .git 
       
 
        Revision 
        : 
          
        511d868555dd4d16e6ce4fedc72c2d1454546707 
       
 
        Compiled  
        by  
        bessbd  
        on  
        Wed  
        Oct 
          
        12 
          
        20 
        : 
        51 
        : 
        10 
          
        CEST 
          
        2016 
       
 
 

创建配置文件和环境变量

 
  
         1 
       

         2 
       
 
        $ 
          
        cp 
          
        - 
        fr 
          
        / 
        usr 
        / 
        local 
        / 
        flume 
        / 
        conf 
        / 
        flume 
        - 
        conf 
        .properties 
        .template 
          
        / 
        usr 
        / 
        local 
        / 
        flume 
        / 
        conf 
        / 
        flume 
        .conf 
       
 
        $ 
          
        cp 
          
        - 
        fr 
          
        / 
        usr 
        / 
        local 
        / 
        flume 
        / 
        conf 
        / 
        flume 
        - 
        env 
        .sh 
        .template 
          
        / 
        usr 
        / 
        local 
        / 
        flume 
        / 
        conf 
        / 
        flume 
        - 
        env 
        .sh 
       
 
 

如果上面的JAVA_HOME设置好了，这里其实不需要设置flume-env.sh，也可以选择配置。

 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
         6 
       
         7 
       
         8 
       
         9 
       
         10 
       
         11 
       
         12 
       
        $ 
          
        cat 
          
        / 
        usr 
        / 
        local 
        / 
        flume 
        / 
        conf 
        / 
        flume 
        - 
        env 
        .sh 
       
        # Enviroment variables can be set here. 
       
        export  
        JAVA_HOME 
        = 
        / 
        ................ 
       
        # Give Flume more memory and pre-allocate, enable remote monitoring via JMX 
       
        export  
        JAVA_OPTS 
        = 
        "-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote" 
       
        # as it may result in logging sensitive user information or encryption secrets. 
       
        export  
        JAVA_OPTS 
        = 
        "$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true " 
       
        # Note that the Flume conf directory is always included in the classpath. 
       
        #FLUME_CLASSPATH=""

4）安装配置Maxwell

Maxwell存储在MySQL服务器本身所需要的所有状态，在schema_database选项指定的数据库中。默认情况下, 数据库被命名为maxwell。

 
  
         1 
       
 
        $ 
          
        cd 
          
        / 
        usr 
        / 
        local 
        / 
        maxwell 
        / 
        ; 
        . 
        / 
        bin 
        / 
        maxwell 
          
        -- 
        user 
        = 
        'maxwell' 
          
        -- 
        password 
        = 
        'maxwell' 
          
        -- 
        host 
        = 
        '127.0.0.1' 
          
        -- 
        port 
        = 
        '3306' 
          
        -- 
        producer 
        = 
        stdout 
       
 
 

MySQL创造点数据

 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
         6 
       
         7 
       
         8 
       
         9 
       
         10 
       
        mysql 
        > 
          
        create  
        database  
        demo  
        charset  
        utf8 
        ; 
       
        Query  
        OK 
        , 
          
        1 
          
        row  
        affected 
          
        ( 
        0.02 
          
        sec 
        ) 
       
        mysql 
        > 
          
        use 
          
        demo 
        ; 
       
        Database  
        changed 
       
        mysql 
        > 
          
        create  
        table  
        test 
        ( 
        id 
          
        int 
        , 
        name  
        varchar 
        ( 
        10 
        ) 
        , 
        address  
        varchar 
        ( 
        20 
        ) 
        ) 
        ; 
       
        Query  
        OK 
        , 
          
        0 
          
        rows  
        affected 
          
        ( 
        0.00 
          
        sec 
        ) 
       
        mysql 
        > 
          
        insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(8,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');   
        
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(9,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');   
        
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(10,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');   
        
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(8,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');   
         
        Query  
        OK 
        , 
          
        4 
          
        row  
        affected 
          
        ( 
        0.01 
          
        sec 
        )

然后可以看到Maxwell的输出信息：

 
  
         1 
       

         2 
       

         3 
       

         4 
       

         5 
       

         6 
       
 
        04 
        : 
        16 
        : 
        48 
        , 
        341 
          
        INFO  
        OpenReplicator 
          
        - 
          
        starting  
        replication  
        at 
          
        mysql 
        - 
        bin 
        . 
        000004 
        : 
        6777 
       
 
        04 
        : 
        18 
        : 
        18 
        , 
        654 
          
        INFO  
        AbstractSchemaStore 
          
        - 
          
        storing  
        schema 
          
        @ 
        Position 
        [ 
        BinlogPosition 
        [ 
        mysql 
        - 
        bin 
        . 
        000004 
        : 
        136974 
        ] 
        , 
          
        lastHeartbeat 
        = 
        1497601097500 
        ] 
          
       
 
        
 
       
 
 

5）数据输出到HDFS

Kafka创建topic

 
  
         1 
       
 
        $ 
          
        / 
        usr 
        / 
        local 
        / 
        kafka 
        / 
        bin 
        / 
        kafka 
        - 
        topics 
        .sh 
          
        -- 
        zookeeper 
          
        127.0.0.1 
        : 
        2181 
          
        -- 
        create 
          
        -- 
        topic  
        maxwell 
          
        -- 
        partitions 
          
        20 
          
        -- 
        replication 
        - 
        factor 
          
        1 
       
 
 

查看主题

查看主题详情

 
  
         1 
       

         2 
       

         3 
       
 
        $ 
          
        / 
        usr 
        / 
        local 
        / 
        kafka 
        / 
        bin 
        / 
        kafka 
        - 
        topics 
        .sh 
          
        -- 
        zookeeper 
        = 
        127.0.0.1 
        : 
        2181 
          
        -- 
        describe 
          
        -- 
        topic  
        maxwell 
       
 
        Topic 
        : 
        maxwell  
        PartitionCount 
        : 
        1 
          
        ReplicationFactor 
        : 
        1 
          
        Configs 
        : 
       
 
        Topic 
        : 
          
        maxwell  
        Partition 
        : 
          
        0 
          
        Leader 
        : 
          
        0 
          
        Replicas 
        : 
          
        0 
          
        Isr 
        : 
          
        0 
       
 
 

提供一份Flume配置文件（从Kafka收集日志到HDFS）

 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
         6 
       
         7 
       
         8 
       
         9 
       
         10 
       
         11 
       
         12 
       
         13 
       
         14 
       
         15 
       
         16 
       
         17 
       
         18 
       
         19 
       
         20 
       
         21 
       
         22 
       
         23 
       
         24 
       
         25 
       
         26 
       
         27 
       
         28 
       
         29 
       
         30 
       
         31 
       
         32 
       
         33 
       
        $ 
          
        cat 
          
        / 
        usr 
        / 
        local 
        / 
        flume 
        / 
        conf 
        / 
        mysql 
        - 
        flume 
        - 
        hdfs 
        .conf 
       
        a1 
        .sources 
          
        = 
          
        r1 
       
        a1 
        .sinks 
          
        = 
          
        k1 
       
        a1 
        .channels 
          
        = 
          
        c1 
       
        # Describe/configure the source 
       
        a1 
        .sources 
        .r1 
        .type 
          
        = 
          
        org 
        .apache 
        .flume 
        .source 
        .kafka 
        .KafkaSource 
       
        a1 
        .sources 
        .r1 
        .zookeeperConnect 
          
        = 
          
        127.0.0.1 
        : 
        2181 
       
        a1 
        .sources 
        .r1 
        .topic 
          
        = 
          
        maxwell 
       
        a1 
        .sources 
        .r1 
        .groupId 
          
        = 
          
        flume 
       
        a1 
        .sources 
        .r1 
        .channels 
          
        = 
          
        c1 
       
        a1 
        .sources 
        .r1 
        .interceptors 
          
        = 
          
        i1 
       
        a1 
        .sources 
        .r1 
        .interceptors 
        .i1 
        .type 
          
        = 
          
        timestamp 
       
        a1 
        .sources 
        .r1 
        .kafka 
        .consumer 
        .timeout 
        .ms 
          
        = 
          
        100 
       
        # Describe the sink 
       
        a1 
        .sinks 
        .k1 
        .type 
          
        = 
          
        hdfs 
       
        #a1.sinks.k1.hdfs.path = /mysql/%{topic}/%y-%m-%d 
       
        a1 
        .sinks 
        .k1 
        .hdfs 
        .path 
          
        = 
          
        hdfs 
        : 
        / 
        / 
        172.30.10.2 
        : 
        8020 
        / 
        mysql 
        / 
        % 
        { 
        topic 
        } 
        / 
        % 
        y 
        - 
        % 
        m 
        - 
        % 
        d 
       
        a1 
        .sinks 
        .k1 
        .hdfs 
        .rollInterval 
          
        = 
          
        5 
       
        a1 
        .sinks 
        .k1 
        .hdfs 
        .rollSize 
          
        = 
          
        0 
       
        a1 
        .sinks 
        .k1 
        .hdfs 
        .rollCount 
          
        = 
          
        0 
       
        a1 
        .sinks 
        .k1 
        .hdfs 
        .fileType 
          
        = 
          
        DataStream 
       
        a1 
        .sinks 
        .k1 
        .channel 
          
        = 
          
        c1 
       
        # Use a channel which buffers events in memory 
       
        a1 
        .channels 
        .c1 
        .type 
          
        = 
          
        memory 
       
        a1 
        .channels 
        .c1 
        .capacity 
          
        = 
          
        10000 
       
        a1 
        .channels 
        .c1 
        .transactionCapacity 
          
        = 
          
        1000 
       
        # Bind the source and sink to the channel 
       
        a1 
        .sources 
        .r1 
        .channels 
          
        = 
          
        c1 
       
        a1 
        .sinks 
        .k1 
        .channel 
          
        = 
          
        c1

启动Flume

如果启动Flume时报错：ERROR – org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:146)] Failed to start agent because dependencies were not found in classpath. Error follows.java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType.

可能是因为你的Flume是独立部署，需要依赖Hadoop HDFS的jar包，解决方法也很简单，就是在Flume主机上解压好Hadoop的二进制安装包，然后输出Hadoop环境变量即可，Flume会根据环境变量自动找到相关的依赖jar包。

另外，当Flume-ng正常运行后，写入HDFS时报错：java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode=”/”:hadoop:supergroup:drwxr-xr-x.

这个提示很明显，就是没有写入权限（因为你当前运行flume-ng的用户不是Hadoop用户），解决方案也很简单，就是切换到Hadoop用户执行flume-ng命令即可。或者开启HDFS允许所有用户进行文件写入，默认可能你没有开启。

启动Maxwell

 
         1 
       
         2 
       
         3 
       
         4 
       
         5 
       
         6 
       
         7 
       
        $ 
          
        cd 
          
        / 
        usr 
        / 
        local 
        / 
        maxwell 
        / 
        ; 
        . 
        / 
        bin 
        / 
        maxwell 
          
        \ 
       
        -- 
        user 
        = 
        'maxwell' 
          
        \ 
       
        -- 
        password 
        = 
        'maxwell' 
          
        \ 
       
        -- 
        host 
        = 
        '127.0.0.1' 
          
        \ 
       
        -- 
        port 
        = 
        '3306' 
          
        \ 
       
        -- 
        producer 
        = 
        kafka 
          
        \ 
       
        -- 
        kafka 
        .bootstrap 
        .servers 
        = 
        127.0.0.1 
        : 
        9092

测试MySQL->Maxwell->Kafka->Flume->HDFS

相关组件现在都已经跑通了，接下来就是测试了，我们在MySQL插入一条数据：

 
  
         1 
       

         2 
       
 
        mysql 
        > 
          
        insert  
        into  
        hadoop 
        .test 
          
        values 
        (insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(8,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');  
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(9,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');  
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(10,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');  
insert into wlslog(id,time_stamp,category,type,servername,code,msg) values(8,'apr-8-2014-7:06:22-pm-pdt','notice','weblogicserver','adminserver','bea-000360','server started in running mode');   
        ) 
        ; 
       
 
        Query  
        OK 
        , 
          
        1 
          
        row  
        affected 
          
        ( 
        0.00 
          
        sec 
        ) 
       
 
 

查看Kafka队列

 
  
         1 
       

         2 
       

         3 
       
 
        $ 
          
        / 
        usr 
        / 
        local 
        / 
        kafka 
        / 
        bin 
        / 
        kafka 
        - 
        console 
        - 
        consumer 
        .sh 
          
        - 
        zookeeper 
        = 
        127.0.0.1 
        : 
        2181 
          
        -- 
        from 
        - 
        beginning 
          
        -- 
        topic  
        maxwell 
       
 
        Using  
        the  
        ConsoleConsumer  
        with  
        old  
        consumer  
        is 
          
        deprecated  
        and 
          
        will  
        be  
        removed  
        in 
          
        a 
          
        future  
        major  
        release 
        . 
          
        Consider  
        using  
        the  
        new 
          
        consumer  
        by  
        passing 
          
        [ 
        bootstrap 
        - 
        server 
        ] 
          
        instead  
        of 
          
        [ 
        zookeeper 
        ] 
        . 
       
 
        { 
        "database" 
        : 
        "hadoop" 
        , 
        "table" 
        : 
        "test" 
        , 
        "type" 
        : 
        "insert" 
        , 
        "ts" 
        : 
        1497607783 
        , 
        "xid" 
        : 
        2414 
        , 
        "commit" 
        : 
        true 
        , 
        "data" 
        : 
        { 
        "id" 
        : 
        5 
        , 
        "name" 
        : 
        "dkey5" 
        , 
        "address" 
        : 
        "Shanghai" 
        } 
        } 
       
 
 

然后去HDFS查看：

 
  
         1 
       

         2 
       

         3 
       
 
        [ 
        hadoop 
        @ 
        hadoop 
        - 
        nn 
          
        ~ 
        ] 
        $ 
          
        hdfs  
        dfs 
          
        - 
        ls 
          
        / 
        mysql 
        / 
        maxwell 
        / 
        18 
        - 
        05 
        -0 
        7 
        / 
       
 
        Found 
          
        1 
          
        items 
       
 
        - 
        rw 
        - 
        r 
        -- 
        r 
        -- 
            
        3 
          
        cyhp  
        supergroup 
                 
        148 
          
        2018 
        - 
        05 
        - 
        07 
          
        13 
        : 
        57 
          
        / 
        mysql 
        / 
        maxwell 
        / 
        18 
        - 
        05 
        - 
        19 
        / 
        FlumeData 
        . 
        1525673813779