Sqoop User Guide (v1.3.0-cdh3u2)

1. Introduction

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

This document describes how to get started using Sqoop to move data between databases and Hadoop and provides reference information for the operation of the Sqoop command-line tool suite. This document is intended for:

  • System and application programmers
  • System administrators
  • Database administrators
  • Data analysts
  • Data engineers

2. Supported Releases

This documentation applies to Sqoop v1.3.0-cdh3u2.

3. Sqoop Releases

Sqoop is an open source software product of Cloudera, Inc.

Software development for Sqoop occurs at http://github.com/cloudera/sqoop. At that site you can obtain:

  • New releases of Sqoop as well as its most recent source code
  • An issue tracker
  • A wiki that contains Sqoop documentation

Sqoop is compatible with Apache Hadoop 0.21 and Cloudera’s Distribution of Hadoop version 3.

4. Prerequisites

The following prerequisite knowledge is required for this product:

  • Basic computer technology and terminology
  • Familiarity with command-line interfaces such as bash
  • Relational database management systems
  • Basic familiarity with the purpose and operation of Hadoop

Before you can use Sqoop, a release of Hadoop must be installed and configured. We recommend that you download Cloudera’s Distribution for Hadoop (CDH3) from the Cloudera Software Archive at http://archive.cloudera.com for straightforward installation of Hadoop on Linux systems.
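
As a quick sanity check that Hadoop is installed and on your PATH, you can print its version (a minimal sketch; the exact output depends on your distribution):

    $ hadoop version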

This document assumes you are using a Linux or Linux-like environment. If you are using Windows, you may be able to use cygwin to accomplish most of the following tasks. If you are using Mac OS X, you should see few (if any) compatibility errors. Sqoop is predominantly operated and tested on Linux.


5. Basic Usage

With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
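
For example, the following is a minimal sketch of an import, assuming a MySQL database named corp on the host db.example.com that contains a table named EMPLOYEES; the connect string, table name, and credentials are placeholders for your own environment:

    $ sqoop import --connect jdbc:mysql://db.example.com/corp \
        --table EMPLOYEES --username someuser -P

By default this produces comma-delimited text files under a directory named after the table in your HDFS home directory, with one part file per map task (for example, EMPLOYEES/part-m-00000).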

A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.
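
If you want this generated source without performing a full import, the sqoop-codegen tool produces it on its own. A minimal sketch, reusing the hypothetical corp database from the previous example:

    $ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
        --table EMPLOYEES --username someuser -P

This writes EMPLOYEES.java (and a compiled jar) to a local working directory, whose location you can control with the --outdir and --bindir arguments.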

After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop’s export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new rows in a target database table, for consumption by external applications or users.
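
A minimal sketch of the corresponding export, assuming the result data lives under /results/bar_data in HDFS and a table named BAR already exists in the target database (Sqoop's export does not create the table for you):

    $ sqoop export --connect jdbc:mysql://db.example.com/corp \
        --table BAR --export-dir /results/bar_data --username someuser -P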

Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
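
A few sketches of these inspection tools, again against the hypothetical MySQL server used above:

    $ sqoop list-databases --connect jdbc:mysql://db.example.com/ \
        --username someuser -P
    $ sqoop list-tables --connect jdbc:mysql://db.example.com/corp \
        --username someuser -P
    $ sqoop eval --connect jdbc:mysql://db.example.com/corp \
        --username someuser -P --query "SELECT * FROM EMPLOYEES LIMIT 10"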

Most aspects of the import, code generation, and export processes can be customized. You can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other arguments to Sqoop.
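
For instance, the following sketch combines several of these options: it imports two columns of only the rows matching a predicate, writes tab-delimited text, and names the generated class (all of the values shown are illustrative):

    $ sqoop import --connect jdbc:mysql://db.example.com/corp \
        --table EMPLOYEES --username someuser -P \
        --columns "emp_id,salary" --where "start_date > '2010-01-01'" \
        --fields-terminated-by '\t' --class-name com.example.Employee

Arguments such as --as-sequencefile or --as-avrodatafile select a binary file format instead of the default delimited text.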

