MongoDB Connector for Hadoop


滕百川


Categories: Databases, Distributed Systems. Tags: mongodb, hadoop



Purpose

The MongoDB Connector for Hadoop is a library that allows MongoDB (or backup files in its data format, BSON) to be used as an input source or output destination for Hadoop MapReduce tasks. It is designed to offer greater flexibility and performance, and to make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.

Current release: 1.1
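
As a rough illustration of the input-source and output-destination roles described above, the sketch below assembles job properties and renders them as `-D` arguments for a Hadoop command line. The property names `mongo.input.uri`, `mongo.output.uri`, and `mongo.input.query` are assumptions taken from the connector's configuration documentation and do not appear in this article; the URIs and the query are invented for the example. Check them against the release you actually build.

```python
import json


def mongo_job_properties(input_uri, output_uri, query=None):
    """Collect connector job properties in a plain dict.

    The property names are assumed, not confirmed by this article.
    """
    props = {
        "mongo.input.uri": input_uri,
        "mongo.output.uri": output_uri,
    }
    if query is not None:
        # The filter is written in the MongoDB query language, serialized as JSON.
        props["mongo.input.query"] = json.dumps(
            query, separators=(",", ":"), sort_keys=True
        )
    return props


def as_hadoop_args(props):
    """Render properties as a flat list of -D key=value arguments."""
    args = []
    for key in sorted(props):
        args.extend(["-D", f"{key}={props[key]}"])
    return args


# Hypothetical database and collection names.
props = mongo_job_properties(
    "mongodb://localhost:27017/demo.events",
    "mongodb://localhost:27017/demo.event_counts",
    query={"status": "active"},
)
print(" ".join(as_hadoop_args(props)))
```

Keeping the properties in a dict makes it easy to reuse the same settings whether they end up on a command line or in a job configuration file.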


Features

Can create data splits to read from standalone, replica set, or sharded configurations
Source data can be filtered with queries using the MongoDB query language
Supports Hadoop Streaming, so job code can be written in other languages (Python, Ruby, and Node.js are currently supported)
Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
Can write data out in .bson format, which can then be imported into any MongoDB database with mongorestore
Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive
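
To see why .bson backup files lend themselves to being read and split, it helps to know that, per the BSON specification, every document starts with a little-endian int32 holding its total length and ends with a zero byte, and a dump file is simply documents laid end to end. The sketch below walks such a stream using only the Python standard library. It is an illustration of the file layout, not the connector's own splitting code, and `iter_bson_documents` is a hypothetical helper name.

```python
import io
import struct


def iter_bson_documents(stream):
    """Yield the raw bytes of each BSON document in a binary stream."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        # The length covers the 4 prefix bytes and the trailing zero byte.
        (length,) = struct.unpack("<i", header)
        if length < 5:
            raise ValueError(f"invalid document length: {length}")
        body = stream.read(length - 4)
        if len(body) != length - 4 or body[-1:] != b"\x00":
            raise ValueError("truncated or malformed document")
        yield header + body


# Hand-built sample documents: {} and {"a": 1} (int32 value).
EMPTY_DOC = b"\x05\x00\x00\x00\x00"
ONE_FIELD_DOC = (
    b"\x0c\x00\x00\x00" + b"\x10" + b"a\x00" + b"\x01\x00\x00\x00" + b"\x00"
)

dump = io.BytesIO(EMPTY_DOC + ONE_FIELD_DOC)
print([len(doc) for doc in iter_bson_documents(dump)])  # prints [5, 12]
```

Because document boundaries can be found from the length prefixes alone, a reader can record byte offsets without decoding any field values.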

Download

Building

To build, first edit the value of hadoopRelease in ThisBuild in the build.sbt file to select the distribution of Hadoop that you want to build against. For example, to build for CDH4:

hadoopRelease in ThisBuild := "cdh4"

or for Hadoop 1.0.x:

hadoopRelease in ThisBuild := "1.0"

To determine which value you need to set in this file, refer to the list of distributions below. Then run ./sbt package to build the jars, which will be generated in the target/ directory.

After successfully building, you must copy the jars to the lib directory on each node in your hadoop cluster. This is usually one of the following locations, depending on which Hadoop release you are using:

$HADOOP_HOME/lib/
$HADOOP_HOME/share/hadoop/mapreduce/
$HADOOP_HOME/share/hadoop/lib/
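
The copy step can be scripted. The following is a minimal sketch, assuming the built jars sit directly in the build directory and that the first of the three candidate directories that exists is the right one for your release. That second assumption may not hold on every layout, so confirm the target directory before using anything like this. `find_lib_dir` and `install_jars` are hypothetical helper names.

```python
import shutil
from pathlib import Path

# Candidate lib locations relative to $HADOOP_HOME, in the order listed above.
LIB_SUBDIRS = ("lib", "share/hadoop/mapreduce", "share/hadoop/lib")


def find_lib_dir(hadoop_home):
    """Return the first candidate lib directory that exists, or None."""
    home = Path(hadoop_home)
    for sub in LIB_SUBDIRS:
        candidate = home / sub
        if candidate.is_dir():
            return candidate
    return None


def install_jars(build_dir, hadoop_home):
    """Copy every *.jar in build_dir into the Hadoop lib directory.

    Returns the names of the copied jars.
    """
    lib_dir = find_lib_dir(hadoop_home)
    if lib_dir is None:
        raise FileNotFoundError(f"no lib directory found under {hadoop_home}")
    copied = []
    for jar in sorted(Path(build_dir).glob("*.jar")):
        shutil.copy2(jar, lib_dir / jar.name)
        copied.append(jar.name)
    return copied
```

A call such as `install_jars("target", os.environ["HADOOP_HOME"])` would have to run on every node in the cluster, since each node needs its own copy of the jars.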

Supported Distributions of Hadoop

Apache Hadoop 1.0

Does not support Hadoop Streaming.

Build using "1.0" or "1.0.x".

Apache Hadoop 1.1

Includes support for Hadoop Streaming.

Build using "1.1" or "1.1.x".

Apache Hadoop 0.20.*

Does not support Hadoop Streaming.

Includes Pig 0.9.2.

Build using "0.20" or "0.20.x".

Apache Hadoop 0.23

Includes Pig 0.9.2.

Includes support for Streaming.

Build using "0.23" or "0.23.x".

Apache Hadoop 0.21

Includes Pig 0.9.1.

Includes support for Streaming.

Build using "0.21" or "0.21.x".

Cloudera Distribution for Hadoop Release 3

This is derived from Apache Hadoop 0.20.2 and includes custom patches.

Includes support for Streaming and Pig 0.8.1.

Build with "cdh3".

Cloudera Distribution for Hadoop Release 4

This is the newest release from Cloudera, which is based on Apache Hadoop 2.0. The newer MR2/YARN APIs are not yet supported, but MR1 is still fully compatible.

Includes support for Streaming and Pig 0.11.1.

Build with "cdh4".
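
The list above can be condensed into a small lookup table. The sketch below is only a convenience for picking the build.sbt value; the release strings and the Streaming flags are copied from the list, while `DISTRIBUTIONS`, `sbt_setting`, and `streaming_capable` are hypothetical names introduced for the example.

```python
# Distribution name -> (hadoopRelease value, supports Hadoop Streaming).
# Only the short form of each build value is recorded here.
DISTRIBUTIONS = {
    "Apache Hadoop 1.0": ("1.0", False),
    "Apache Hadoop 1.1": ("1.1", True),
    "Apache Hadoop 0.20.*": ("0.20", False),
    "Apache Hadoop 0.23": ("0.23", True),
    "Apache Hadoop 0.21": ("0.21", True),
    "Cloudera Distribution for Hadoop Release 3": ("cdh3", True),
    "Cloudera Distribution for Hadoop Release 4": ("cdh4", True),
}


def sbt_setting(distribution):
    """Return the build.sbt line that selects the given distribution."""
    release, _ = DISTRIBUTIONS[distribution]
    return f'hadoopRelease in ThisBuild := "{release}"'


def streaming_capable():
    """List the distributions whose builds include Streaming support."""
    return [name for name, (_, streaming) in DISTRIBUTIONS.items() if streaming]


print(sbt_setting("Cloudera Distribution for Hadoop Release 4"))
```

If you intend to write jobs with Hadoop Streaming, `streaming_capable()` narrows the choice to the builds that support it.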

Configuration

Streaming

Examples

Usage with static .bson (mongo backup) files

BSON Usage

Usage with Amazon Elastic MapReduce

Amazon Elastic MapReduce is a managed Hadoop framework that allows you to submit jobs to a cluster of customizable size and configuration, without needing to deal with provisioning nodes and installing software.

Using EMR with the MongoDB Connector for Hadoop allows you to run MapReduce jobs against MongoDB backup files stored in S3.

Submitting jobs that use the MongoDB Connector for Hadoop to EMR simply requires that the bootstrap actions fetch the dependencies (MongoDB Java driver, mongo-hadoop-core libs, etc.) and place them in the Hadoop distribution's lib folders.

For a full example (running the Enron example on Elastic MapReduce), please see here.

Usage with Pig

Documentation on Pig with the MongoDB Connector for Hadoop.

For examples on using Pig with the MongoDB Connector for Hadoop, also refer to the examples section.

Notes for Contributors

If your code introduces new features, please add tests that cover them where possible, and make sure that the existing test suite still passes. If you're not sure how to write a test for a feature, or have trouble with a test failure, please post the details to the Google group and we will try to help.

Maintainers

Mike O'Brien (mikeo@10gen.com)

Contributors

Brendan McAdams brendan@10gen.com
Eliot Horowitz erh@10gen.com
Ryan Nitz ryan@10gen.com
Russell Jurney (@rjurney) (Lots of significant Pig improvements)
Sarthak Dudhara sarthak.83@gmail.com (BSONWritable comparable interface)
Priya Manda priyakanth024@gmail.com (Test Harness Code)
Rushin Shah rushin10@gmail.com (Test Harness Code)
Joseph Shraibman jks@iname.com (Sharded Input Splits)
Sumin Xia xiasumin1984@gmail.com (Sharded Input Splits)
Jeremy Karn
bpfoster
Ross Lawley
Carsten Hufe
Asya Kamsky
Thomas Millar

Support

Issue tracking: https://jira.mongodb.org/browse/HADOOP/

Discussion: http://groups.google.com/group/mongodb-user/
