MapReduce Reading and Writing DB Data

Sometimes a project has a very large input data set but a very small output, for example pv/uv statistics. To serve real-time queries or OLAP needs, we want MapReduce to exchange data directly with MySQL, and this kind of interoperability is exactly where HBase and Hive are still weak today.

With that out of the way, here is a quick summary of the background, the mechanism, and the points to watch out for:

1. To let MapReduce access relational databases (MySQL, Oracle) directly, Hadoop provides two classes: DBInputFormat and DBOutputFormat. DBInputFormat reads database table rows in as the job's input records, and DBOutputFormat writes the result records produced by MapReduce back into a database table.
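To make the contract concrete, a record class read by DBInputFormat and written by DBOutputFormat implements both Writable (Hadoop serialization) and DBWritable (JDBC binding). Below is a minimal sketch only; the field names id/name mirror the test tables, the class name RowRecord is made up here, and the full version actually used in this article is the StudentinfoRecord class in section 5.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// Minimal record class: one object per table row, with columns (id, name).
public class RowRecord implements Writable, DBWritable {
    int id;
    String name;

    public RowRecord() {}  // a public no-arg constructor is required (instances are created by reflection)

    // DBWritable: fill the object from one row of the JDBC ResultSet (DBInputFormat side)
    public void readFields(ResultSet result) throws SQLException {
        this.id = result.getInt(1);
        this.name = result.getString(2);
    }

    // DBWritable: bind the fields to the INSERT statement (DBOutputFormat side)
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, this.id);
        stmt.setString(2, this.name);
    }

    // Writable: (de)serialization used when the record travels through the MR framework
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.name = Text.readString(in);
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(this.id);
        Text.writeString(out, this.name);
    }
}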

2. Because the 0.20 release's support for DBInputFormat and DBOutputFormat is not great, this example uses the 0.19 API to demonstrate how the two classes are used.

At least in my 0.20.203 installation there is no db package under org.apache.hadoop.mapreduce.lib, so this article also sticks with the old API for its example.
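For comparison, later Hadoop releases do ship org.apache.hadoop.mapreduce.lib.db, and the equivalent wiring with the new API looks roughly like the sketch below. This is a hedged, untested sketch for illustration only (class names such as Mysql2MrNewApi and Row are made up here, not from the original article); the link in section 8 points to a full new-API write-up.

// Hedged sketch: new-API equivalent for Hadoop releases that include org.apache.hadoop.mapreduce.lib.db.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class Mysql2MrNewApi {

    // Same contract as StudentinfoRecord in section 5, but against the new-API DBWritable.
    public static class Row implements Writable, DBWritable {
        int id;
        String name;

        public Row() {}

        public void readFields(ResultSet rs) throws SQLException { id = rs.getInt(1); name = rs.getString(2); }
        public void write(PreparedStatement ps) throws SQLException { ps.setInt(1, id); ps.setString(2, name); }
        public void readFields(DataInput in) throws IOException { id = in.readInt(); name = Text.readString(in); }
        public void write(DataOutput out) throws IOException { out.writeInt(id); Text.writeString(out, name); }
    }

    public static class ReadMapper extends Mapper<LongWritable, Row, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Row value, Context ctx) throws IOException, InterruptedException {
            ctx.write(new LongWritable(value.id), new Text(value.id + " " + value.name));
        }
    }

    public static class WriteReducer extends Reducer<LongWritable, Text, Row, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text v : values) {
                String[] parts = v.toString().split(" ");
                Row r = new Row();
                r.id = Integer.parseInt(parts[0]);
                r.name = parts[1];
                ctx.write(r, new Text(r.name));  // DBOutputFormat inserts the key fields into the table
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // driver class, JDBC URL, user, password -- same values as the old-API example in section 5
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://192.168.1.101:3306/test", "root", "root");

        Job job = Job.getInstance(conf);
        job.setJarByClass(Mysql2MrNewApi.class);
        job.setInputFormatClass(DBInputFormat.class);
        job.setOutputFormatClass(DBOutputFormat.class);
        job.setMapperClass(ReadMapper.class);
        job.setReducerClass(WriteReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Row.class);
        job.setOutputValueClass(Text.class);

        // read table t (no WHERE condition, ordered by id), write table t2
        DBInputFormat.setInput(job, Row.class, "t", null, "id", "id", "name");
        DBOutputFormat.setOutput(job, "t2", "id", "name");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}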

3. If running the MapReduce job fails with java.io.IOException: com.mysql.jdbc.Driver, the program usually cannot find the MySQL driver jar. The fix is to make sure every TaskTracker can find that driver jar when it runs the MapReduce tasks.

There are two ways to add the jar:

(1) Put the jar under ${HADOOP_HOME}/lib on every node and restart the cluster. This is the more primitive approach.

(2) a) Upload the jar to the cluster: hadoop fs -put mysql-connector-java-5.1.0-bin.jar /hdfsPath/

       b) Before the MR program submits the job, add: DistributedCache.addFileToClassPath(new Path("/hdfsPath/mysql-connector-java-5.1.0-bin.jar"), conf); (see the sketch after this list)

One more note: although the API used here is the 0.19 one, the same code also works with the 0.20 API; you will just get warnings that the methods are deprecated.
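For context, here is a minimal sketch of option (2), assuming the connector jar has already been uploaded with the hadoop fs -put command above. The class name SubmitWithDriverJar and the elided job settings are illustrative only; the same DistributedCache call appears in the full program in section 5.

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWithDriverJar {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitWithDriverJar.class);
        // add the HDFS copy of the MySQL driver to every task's classpath via the distributed cache;
        // this must happen on the JobConf before the job is submitted
        DistributedCache.addFileToClassPath(
                new Path("/hdfsPath/mysql-connector-java-5.1.0-bin.jar"), conf);
        // ... set input/output formats, mapper, reducer, and DB configuration here ...
        JobClient.runJob(conf);
    }
}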

4. Test data:

CREATE TABLE `t` (
  `id` int DEFAULT NULL,
  `name` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `t2` (
  `id` int DEFAULT NULL,
  `name` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

insert into t values (1, "june"),(2, "decli"),(3, "hello"),
     (4, "june"),(5, "decli"),(6, "hello"),(7, "june"),
     (8, "decli"),(9, "hello"),(10, "june"),
     (11, "june"),(12, "decli"),(13, "hello");


5. Code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

/**
 * Function: test data exchange between MR and MySQL; this test case copies the rows
 *           of one table into another table. In practice you may only need to read
 *           from MySQL, or only write to MySQL.
 * date: 2013-7-29 02:34:04 AM <br/>
 * @author june
 */
public class Mysql2Mr {
    // DROP TABLE IF EXISTS `hadoop`.`studentinfo`;
    // CREATE TABLE studentinfo (
    // id INTEGER NOT NULL PRIMARY KEY,
    // name VARCHAR(32) NOT NULL);

    public static class StudentinfoRecord implements Writable, DBWritable {
        int id;
        String name;

        public StudentinfoRecord() {

        }

        public void readFields(DataInput in) throws IOException {
            this.id = in.readInt();
            this.name = Text.readString(in);
        }

        public String toString() {
            return this.id + " " + this.name;
        }

        @Override
        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setInt(1, this.id);
            stmt.setString(2, this.name);
        }

        @Override
        public void readFields(ResultSet result) throws SQLException {
            this.id = result.getInt(1);
            this.name = result.getString(2);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(this.id);
            Text.writeString(out, this.name);
        }
    }

    // Note: this must be a static inner class; otherwise provide a no-arg constructor yourself,
    // or expect the exception:
    // Caused by: java.lang.NoSuchMethodException: DBInputMapper.<init>()
    // http://stackoverflow.com/questions/7154125/custom-mapreduce-input-format-cant-find-constructor
    // The posts copied around the web all miss this; I have yet to see one that gets it right...
    public static class DBInputMapper extends MapReduceBase implements
            Mapper<LongWritable, StudentinfoRecord, LongWritable, Text> {
        public void map(LongWritable key, StudentinfoRecord value,
                OutputCollector<LongWritable, Text> collector, Reporter reporter) throws IOException {
            collector.collect(new LongWritable(value.id), new Text(value.toString()));
        }
    }

    public static class MyReducer extends MapReduceBase implements
            Reducer<LongWritable, Text, StudentinfoRecord, Text> {
        @Override
        public void reduce(LongWritable key, Iterator<Text> values,
                OutputCollector<StudentinfoRecord, Text> output, Reporter reporter) throws IOException {
            String[] splits = values.next().toString().split(" ");
            StudentinfoRecord r = new StudentinfoRecord();
            r.id = Integer.parseInt(splits[0]);
            r.name = splits[1];
            output.collect(r, new Text(r.name));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(Mysql2Mr.class);
        DistributedCache.addFileToClassPath(new Path("/tmp/mysql-connector-java-5.0.8-bin.jar"), conf);

        conf.setMapOutputKeyClass(LongWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setOutputFormat(DBOutputFormat.class);
        conf.setInputFormat(DBInputFormat.class);
        // // mysql to hdfs
        // conf.setReducerClass(IdentityReducer.class);
        // Path outPath = new Path("/tmp/1");
        // FileSystem.get(conf).delete(outPath, true);
        // FileOutputFormat.setOutputPath(conf, outPath);

        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://192.168.1.101:3306/test",
                "root", "root");
        String[] fields = { "id", "name" };
        // read data from table t
        DBInputFormat.setInput(conf, StudentinfoRecord.class, "t", null, "id", fields);
        // MapReduce writes its output to table t2
        DBOutputFormat.setOutput(conf, "t2", "id", "name");
        // conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
        conf.setMapperClass(DBInputMapper.class);
        conf.setReducerClass(MyReducer.class);

        JobClient.runJob(conf);
    }
}


6. Results:

After running the job twice you can see the result in MySQL (t2 has no primary key and DBOutputFormat simply inserts, so each run appends another 13 rows):

mysql> select * from t2;
+------+-------+
| id   | name  |
+------+-------+
|    1 | june  |
|    2 | decli |
|    3 | hello |
|    4 | june  |
|    5 | decli |
|    6 | hello |
|    7 | june  |
|    8 | decli |
|    9 | hello |
|   10 | june  |
|   11 | june  |
|   12 | decli |
|   13 | hello |
|    1 | june  |
|    2 | decli |
|    3 | hello |
|    4 | june  |
|    5 | decli |
|    6 | hello |
|    7 | june  |
|    8 | decli |
|    9 | hello |
|   10 | june  |
|   11 | june  |
|   12 | decli |
|   13 | hello |
+------+-------+
26 rows in set (0.00 sec)
 
mysql>


7. Log:

13/07/29 02:33:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/29 02:33:03 INFO filecache.TrackerDistributedCacheManager: Creating mysql-connector-java-5.0.8-bin.jar in /tmp/hadoop-june/mapred/local/archive/-8943686319031389138_-1232673160_640840668/192.168.1.101/tmp-work--8372797484204470322 with rwxr-xr-x
13/07/29 02:33:03 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://192.168.1.101:9000/tmp/mysql-connector-java-5.0.8-bin.jar as /tmp/hadoop-june/mapred/local/archive/-8943686319031389138_-1232673160_640840668/192.168.1.101/tmp/mysql-connector-java-5.0.8-bin.jar
13/07/29 02:33:03 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://192.168.1.101:9000/tmp/mysql-connector-java-5.0.8-bin.jar as /tmp/hadoop-june/mapred/local/archive/-8943686319031389138_-1232673160_640840668/192.168.1.101/tmp/mysql-connector-java-5.0.8-bin.jar
13/07/29 02:33:03 INFO mapred.JobClient: Running job: job_local_0001
13/07/29 02:33:03 INFO mapred.MapTask: numReduceTasks: 1
13/07/29 02:33:03 INFO mapred.MapTask: io.sort.mb = 100
13/07/29 02:33:03 INFO mapred.MapTask: data buffer = 79691776/99614720
13/07/29 02:33:03 INFO mapred.MapTask: record buffer = 262144/327680
13/07/29 02:33:03 INFO mapred.MapTask: Starting flush of map output
13/07/29 02:33:03 INFO mapred.MapTask: Finished spill 0
13/07/29 02:33:03 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/07/29 02:33:04 INFO mapred.JobClient:  map 0% reduce 0%
13/07/29 02:33:06 INFO mapred.LocalJobRunner:
13/07/29 02:33:06 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/07/29 02:33:06 INFO mapred.LocalJobRunner:
13/07/29 02:33:06 INFO mapred.Merger: Merging 1 sorted segments
13/07/29 02:33:06 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 235 bytes
13/07/29 02:33:06 INFO mapred.LocalJobRunner:
13/07/29 02:33:06 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/07/29 02:33:07 INFO mapred.JobClient:  map 100% reduce 0%
13/07/29 02:33:09 INFO mapred.LocalJobRunner: reduce > reduce
13/07/29 02:33:09 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/07/29 02:33:09 WARN mapred.FileOutputCommitter: Output path is null in cleanup
13/07/29 02:33:10 INFO mapred.JobClient:  map 100% reduce 100%
13/07/29 02:33:10 INFO mapred.JobClient: Job complete: job_local_0001
13/07/29 02:33:10 INFO mapred.JobClient: Counters: 18
13/07/29 02:33:10 INFO mapred.JobClient:   File Input Format Counters
13/07/29 02:33:10 INFO mapred.JobClient:     Bytes Read=0
13/07/29 02:33:10 INFO mapred.JobClient:   File Output Format Counters
13/07/29 02:33:10 INFO mapred.JobClient:     Bytes Written=0
13/07/29 02:33:10 INFO mapred.JobClient:   FileSystemCounters
13/07/29 02:33:10 INFO mapred.JobClient:     FILE_BYTES_READ=1211691
13/07/29 02:33:10 INFO mapred.JobClient:     HDFS_BYTES_READ=1081704
13/07/29 02:33:10 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2392844
13/07/29 02:33:10 INFO mapred.JobClient:   Map-Reduce Framework
13/07/29 02:33:10 INFO mapred.JobClient:     Map output materialized bytes=239
13/07/29 02:33:10 INFO mapred.JobClient:     Map input records=13
13/07/29 02:33:10 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/07/29 02:33:10 INFO mapred.JobClient:     Spilled Records=26
13/07/29 02:33:10 INFO mapred.JobClient:     Map output bytes=207
13/07/29 02:33:10 INFO mapred.JobClient:     Map input bytes=13
13/07/29 02:33:10 INFO mapred.JobClient:     SPLIT_RAW_BYTES=75
13/07/29 02:33:10 INFO mapred.JobClient:     Combine input records=0
13/07/29 02:33:10 INFO mapred.JobClient:     Reduce input records=13
13/07/29 02:33:10 INFO mapred.JobClient:     Reduce input groups=13
13/07/29 02:33:10 INFO mapred.JobClient:     Combine output records=0
13/07/29 02:33:10 INFO mapred.JobClient:     Reduce output records=13
13/07/29 02:33:10 INFO mapred.JobClient:     Map output records=13


8. References:

New-API version:

http://superlxw1234.iteye.com/blog/1880712

Old-API version:

http://blog.csdn.net/dajuezhao/article/details/5799371

http://www.zhengmenbb.com/archives/583.htm


Reposted from: https://my.oschina.net/stevie/blog/691858
