1 Installing Sqoop
First, Sqoop can be downloaded from http://sqoop.apache.org/
Click Download on that page, then click the release link to download the tarball.
After downloading Sqoop, upload the tarball to your Linux machine, extract it to a directory of your choice, edit the configuration file, add Sqoop to the environment variables in /etc/profile, and put the JDBC driver jar for your database into the lib directory.
Because the shipped env file is only a template, copy it first
Edit sqoop-env.sh
Add Sqoop to the environment variables
Add the JDBC driver jar to the lib directory
Run source /etc/profile to reload the configuration
Then run sqoop version; if the version is printed, the installation succeeded.
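A minimal sketch of these steps, assuming Sqoop was extracted to /opt/sqoop and MySQL is the source database (the paths and the connector jar name are placeholders, adjust them to your environment):
cd /opt/sqoop/conf
cp sqoop-env-template.sh sqoop-env.sh            # then set HADOOP_COMMON_HOME, HADOOP_MAPRED_HOME, HIVE_HOME inside it
echo 'export SQOOP_HOME=/opt/sqoop' >> /etc/profile
echo 'export PATH=$PATH:$SQOOP_HOME/bin' >> /etc/profile
cp mysql-connector-java-5.1.47.jar /opt/sqoop/lib/   # MySQL JDBC driver
source /etc/profile
sqoop version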
2 Introduction to Sqoop
Sqoop is an Apache tool for transferring data between Hadoop and relational database servers.
It has two core functions:
Import (migrating data in)
Export (migrating data out)
Importing data: loading data from MySQL, Oracle, etc. into Hadoop storage systems such as HDFS, Hive, and HBase
Exporting data: moving data from the Hadoop file system out to relational databases such as MySQL. At its core, Sqoop is just a command-line tool; compared with HDFS or Hive, there is no deep theory behind it.
The Sqoop tool: essentially migrates data, and it does so by translating the Sqoop migration command into a MapReduce program
The Hive tool: essentially runs computation; it relies on HDFS to store data and translates SQL into MapReduce programs
How Sqoop works: it translates an import or export command into a MapReduce program; the generated MapReduce job mainly customizes the InputFormat and OutputFormat
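Because each job is compiled around a generated per-table record class, you can inspect that generated code directly with the codegen tool; a sketch, reusing the placeholder connection and table from the examples below:
sqoop codegen --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root --password 123456 --table trade_detail --outdir /tmp/sqoop_codegen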
3 Using Sqoop
Common parameters:
--connect <jdbc-uri>    JDBC connection string
--connection-manager <class-name>    connection manager class
--driver <class-name>    JDBC driver class
--hadoop-mapred-home <dir>    override $HADOOP_MAPRED_HOME
--help    print usage information
-P    read the password from the console
--password <password>    password
--username <username>    username
--verbose    print progress information
--connection-param-file <filename>    optional connection parameters file
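Before running any import, a quick way to confirm that the connection arguments above are correct is to list the tables in the source database (a sketch using the same placeholder host, database, and credentials as the examples below):
sqoop list-tables --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root -P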
Type 1: import data from the database into HDFS
sqoop import --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root --password 123456 --table trade_detail --columns 'id, account, income, expenses'
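When no --target-dir is given, the data normally lands under the current HDFS user's home directory, in a directory named after the table, as part-m-* files. Assuming the job runs as root, you could inspect it with:
hdfs dfs -cat /user/root/trade_detail/part-m-00000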
Specify the output path and the field delimiter
sqoop import --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root --password 123456 --table trade_detail --target-dir '/sqoop/td' --fields-terminated-by '\t'
Specify the number of map tasks with -m
sqoop import --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root --password 123456 --table trade_detail --target-dir '/sqoop/td1' --fields-terminated-by '\t' -m 2
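With -m 2 the output directory should contain two part files (part-m-00000 and part-m-00001), which you can confirm with:
hdfs dfs -ls /sqoop/td1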
Add a WHERE condition. Note: the condition must be enclosed in quotes
sqoop import --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root --password 123456 --table trade_detail --where 'id>3' --target-dir '/sqoop/td2'
Add a query statement (use \ to break the statement across lines)
sqoop import --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root --password 123456 \
--query 'SELECT * FROM trade_detail where id > 2 AND $CONDITIONS' --split-by trade_detail.id --target-dir '/sqoop/td3'
Note: when using --query, the WHERE clause must end with AND $CONDITIONS. Quoting also matters: if the value of --query is wrapped in double quotes, the dollar sign must be escaped, i.e. \$CONDITIONS. If the number of map tasks is set to 1 (-m 1), --split-by ${tablename.column} can be omitted; otherwise it must be provided.
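For example, the same query wrapped in double quotes needs the dollar sign escaped (a sketch; /sqoop/td4 is just a placeholder output path):
sqoop import --connect jdbc:mysql://192.168.1.10:3306/bigdata30 --username root --password 123456 \
--query "SELECT * FROM trade_detail where id > 2 AND \$CONDITIONS" --split-by trade_detail.id --target-dir '/sqoop/td4'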
Type 2: export data from HDFS to the database (do not forget to specify the delimiter)
sqoop export --connect jdbc:mysql://192.168.8.120:3306/bigdata30 --username root --password 123456 --export-dir '/td3' --table td_bak -m 1 --fields-terminated-by ','
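Note that sqoop export does not create the target table: td_bak must already exist in MySQL with a schema matching the exported fields. A hypothetical way to create it for illustration (the column names and types are assumptions, match them to your actual data):
mysql -h 192.168.8.120 -u root -p -e "CREATE TABLE bigdata30.td_bak (id INT, account VARCHAR(50), income DOUBLE, expenses DOUBLE)"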
Run sqoop help to see which commands Sqoop supports.
If you are not sure how to use a specific command, you can view its usage with sqoop help command. For example, sqoop help import prints the following:
Common arguments:
   --connect <jdbc-uri>                                       Specify JDBC connect string
   --connection-manager <class-name>                          Specify connection manager class name
   --connection-param-file <properties-file>                  Specify connection parameters file
   --driver <class-name>                                      Manually specify JDBC driver class to use
   --hadoop-home <hdir>                                       Override $HADOOP_MAPRED_HOME_ARG
   --hadoop-mapred-home <dir>                                 Override $HADOOP_MAPRED_HOME_ARG
   --help                                                     Print usage instructions
   --metadata-transaction-isolation-level <isolationlevel>    Defines the transaction isolation level for metadata queries. For more details check java.sql.Connection javadoc or the JDBC specification
   --oracle-escaping-disabled <boolean>                       Disable the escaping mechanism of the Oracle/OraOop connection managers
   -P                                                         Read password from console
   --password <password>                                      Set authentication password
   --password-alias <password-alias>                          Credential provider password alias
   --password-file <password-file>                            Set authentication password file path
   --relaxed-isolation                                        Use read-uncommitted isolation for imports
   --skip-dist-cache                                          Skip copying jars to distributed cache
   --temporary-rootdir <rootdir>                              Defines the temporary root directory for the import
   --throw-on-error                                           Rethrow a RuntimeException on error occurred during the job
   --username <username>                                      Set authentication username
   --verbose                                                  Print more information while working

Import control arguments:
   --append                                                   Imports data in append mode
   --as-avrodatafile                                          Imports data to Avro data files
   --as-parquetfile                                           Imports data to Parquet files
   --as-sequencefile                                          Imports data to SequenceFiles
   --as-textfile                                              Imports data as plain text (default)
   --autoreset-to-one-mapper                                  Reset the number of mappers to one mapper if no split key available
   --boundary-query <statement>                               Set boundary query for retrieving max and min value of the primary key
   --columns <col,col,col...>                                 Columns to import from table
   --compression-codec <codec>                                Compression codec to use for import
   --delete-target-dir                                        Imports data in delete mode
   --direct                                                   Use direct import fast path
   --direct-split-size <n>                                    Split the input stream every 'n' bytes when importing in direct mode
   -e,--query <statement>                                     Import results of SQL 'statement'
   --fetch-size <n>                                           Set number 'n' of rows to fetch from the database when more rows are needed
   --inline-lob-limit <n>                                     Set the maximum size for an inline LOB
   -m,--num-mappers <n>                                       Use 'n' map tasks to import in parallel
   --mapreduce-job-name <name>                                Set name for generated mapreduce job
   --merge-key <column>                                       Key column to use to join results
   --split-by <column-name>                                   Column of the table used to split work units
   --split-limit <size>                                       Upper Limit of rows per split for split columns of Date/Time/Timestamp and integer types. For date or timestamp fields it is calculated in seconds. split-limit should be greater than 0
   --table <table-name>                                       Table to read
   --target-dir <dir>                                         HDFS plain table destination
   --validate                                                 Validate the copy using the configured validator
   --validation-failurehandler <validation-failurehandler>    Fully qualified class name for ValidationFailureHandler
   --validation-threshold <validation-threshold>              Fully qualified class name for ValidationThreshold
   --validator <validator>                                    Fully qualified class name for the Validator
   --warehouse-dir <dir>                                      HDFS parent for table destination
   --where <where clause>                                     WHERE clause to use during import
   -z,--compress                                              Enable compression

Incremental import arguments:
   --check-column <column>        Source column to check for incremental change
   --incremental <import-type>    Define an incremental import of type 'append' or 'lastmodified'
   --last-value <value>           Last imported value in the incremental check column

Output line formatting arguments:
   --enclosed-by <char>               Sets a required field enclosing character
   --escaped-by <char>                Sets the escape character
   --fields-terminated-by <char>      Sets the field separator character
   --lines-terminated-by <char>       Sets the end-of-line character
   --mysql-delimiters                 Uses MySQL's default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
   --optionally-enclosed-by <char>    Sets a field enclosing character

Input parsing arguments:
   --input-enclosed-by <char>               Sets a required field encloser
   --input-escaped-by <char>                Sets the input escape character
   --input-fields-terminated-by <char>      Sets the input field separator
   --input-lines-terminated-by <char>       Sets the input end-of-line char
   --input-optionally-enclosed-by <char>    Sets a field enclosing character

Hive arguments:
   --create-hive-table                          Fail if the target hive table exists
   --external-table-dir <hdfs path>             Sets where the external table is in HDFS
   --hive-database <database-name>              Sets the database name to use when importing to hive
   --hive-delims-replacement <arg>              Replace Hive record \0x01 and row delimiters (\n\r) from imported string fields with user-defined string
   --hive-drop-import-delims                    Drop Hive record \0x01 and row delimiters (\n\r) from imported string fields
   --hive-home <dir>                            Override $HIVE_HOME
   --hive-import                                Import tables into Hive (Uses Hive's default delimiters if none are set.)
   --hive-overwrite                             Overwrite existing data in the Hive table
   --hive-partition-key <partition-key>         Sets the partition key to use when importing to hive
   --hive-partition-value <partition-value>     Sets the partition value to use when importing to hive
   --hive-table <table-name>                    Sets the table name to use when importing to hive
   --map-column-hive <arg>                      Override mapping for specific column to hive types.

HBase arguments:
   --column-family <family>    Sets the target column family for the import
   --hbase-bulkload            Enables HBase bulk loading
   --hbase-create-table        If specified, create missing HBase tables
   --hbase-row-key <col>       Specifies which input column to use as the row key
   --hbase-table <table>       Import to <table> in HBase

HCatalog arguments:
   --hcatalog-database <arg>                         HCatalog database name
   --hcatalog-home <hdir>                            Override $HCAT_HOME
   --hcatalog-partition-keys <partition-key>         Sets the partition keys to use when importing to hive
   --hcatalog-partition-values <partition-value>     Sets the partition values to use when importing to hive
   --hcatalog-table <arg>                            HCatalog table name
   --hive-home <dir>                                 Override $HIVE_HOME
   --hive-partition-key <partition-key>              Sets the partition key to use when importing to hive
   --hive-partition-value <partition-value>          Sets the partition value to use when importing to hive
   --map-column-hive <arg>                           Override mapping for specific column to hive types.

HCatalog import specific options:
   --create-hcatalog-table             Create HCatalog before import
   --drop-and-create-hcatalog-table    Drop and Create HCatalog before import
   --hcatalog-storage-stanza <arg>     HCatalog storage stanza for table creation

Accumulo arguments:
   --accumulo-batch-size <size>           Batch size in bytes
   --accumulo-column-family <family>      Sets the target column family for the import
   --accumulo-create-table                If specified, create missing Accumulo tables
   --accumulo-instance <instance>         Accumulo instance name.
   --accumulo-max-latency <latency>       Max write latency in milliseconds
   --accumulo-password <password>         Accumulo password.
   --accumulo-row-key <col>               Specifies which input column to use as the row key
   --accumulo-table <table>               Import to <table> in Accumulo
   --accumulo-user <user>                 Accumulo user name.
   --accumulo-visibility <vis>            Visibility token to be applied to all rows imported
   --accumulo-zookeepers <zookeepers>     Comma-separated list of zookeepers (host:port)

Code generation arguments:
   --bindir <dir>                             Output directory for compiled objects
   --class-name <name>                        Sets the generated class name. This overrides --package-name. When combined with --jar-file, sets the input class.
   --escape-mapping-column-names <boolean>    Disable special characters escaping in column names
   --input-null-non-string <null-str>         Input null non-string representation
   --input-null-string <null-str>             Input null string representation
   --jar-file <file>                          Disable code generation; use specified jar
   --map-column-java <arg>                    Override mapping for specific columns to java types
   --null-non-string <null-str>               Null non-string representation
   --null-string <null-str>                   Null string representation
   --outdir <dir>                             Output directory for generated code
   --package-name <name>                      Put auto-generated classes in this package

Generic Hadoop command-line arguments:
(must precede any tool-specific arguments)
Generic options supported are
   -conf <configuration file>                      specify an application configuration file
   -D <property=value>                             use value for given property
   -fs <file:///|hdfs://namenode:port>             specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
   -jt <local|resourcemanager:port>                specify a ResourceManager
   -files <comma separated list of files>          specify comma separated files to be copied to the map reduce cluster
   -libjars <comma separated list of jars>         specify comma separated jar files to include in the classpath.
   -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
command [genericOptions] [commandOptions]

At minimum, you must specify --connect and --table
Arguments to mysqldump and other subprograms may be supplied after a '--' on the command line.