DataX-HDFS (Read/Write)

Operating on HDFS with DataX

Reading from HDFS

1 Quick Introduction

HdfsReader provides the ability to read data stored on a distributed file system. Under the hood, HdfsReader fetches file data from the distributed file system, converts it into the DataX transport protocol, and hands it to the Writer.

The file formats HdfsReader currently supports are textfile (text), orcfile (orc), rcfile (rc), sequence file (seq), and plain csv files; in every case the file content must be a logically two-dimensional table.

HdfsReader requires JDK 1.7 or later.

2 Features and Limitations

HdfsReader reads file data from the Hadoop distributed file system (HDFS) and converts it into the DataX protocol. textfile is the default storage format when creating a Hive table; its data is not compressed, and it essentially stores the data on HDFS as plain text, so for DataX the implementation of HdfsReader closely resembles TxtFileReader. orcfile, whose full name is Optimized Row Columnar file, is an optimization of RCFile; according to the official documentation, this format provides an efficient way to store Hive data. HdfsReader uses the OrcSerde class provided by Hive to read and parse orcfile data. HdfsReader currently supports the following:

1. Supports textfile, orcfile, rcfile, sequence file, and csv files; the file content must be a logically two-dimensional table.

2. Supports reading multiple data types (all represented as String), column pruning, and column constants.

3. Supports recursive reads and wildcard patterns ("*" and "?").

4. Supports orcfile compression; SNAPPY and ZLIB are currently supported.

5. Supports reading multiple files concurrently.

6. Supports sequence file compression; lzo is currently supported.

7. For csv files the supported compression formats are gzip, bz2, zip, lzo, lzo_deflate, and snappy.

8. The plugin currently bundles Hive 1.1.1 and Hadoop 2.7.1 (Apache, built against JDK 1.7). Writing has been verified against Hadoop 2.5.0, Hadoop 2.6.0, and Hive 1.2.0 test environments; other versions still need further testing.

9. Supports Kerberos authentication. (Note: if Kerberos authentication is needed, the Hadoop version of the user's cluster must match the Hadoop version bundled with hdfsreader; if the cluster's version is newer than hdfsreader's Hadoop version, Kerberos authentication is not guaranteed to work.) A configuration sketch is given after the limitations list below.

Not yet supported:

1. Multi-threaded concurrent reads within a single file, which requires an intra-file split algorithm; support is planned for a later phase.

2. HDFS HA is not yet supported.
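
As a hedged sketch of the Kerberos support mentioned in point 9 above: the parameter names haveKerberos, kerberosKeytabFilePath, and kerberosPrincipal follow the upstream hdfsreader documentation, while the keytab path and principal below are placeholder values, not taken from this article's environment.

    {
        "reader": {
            "name": "hdfsreader",
            "parameter": {
                "path": "/user/hive/warehouse/test/*",
                "defaultFS": "hdfs://192.168.1.121:8020",
                "fileType": "text",
                "fieldDelimiter": ",",
                "encoding": "UTF-8",
                "haveKerberos": true,
                "kerberosKeytabFilePath": "/etc/security/keytabs/hdfs.keytab",
                "kerberosPrincipal": "hdfs@EXAMPLE.COM",
                "column": [
                    {"index": 0, "type": "long"},
                    {"index": 1, "type": "string"}
                ]
            }
        }
    }

When haveKerberos is true, the keytab and principal are used to log in before HDFS is accessed; defaultFS and path here simply reuse the values from the example that follows.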

Example

Reading from HDFS and printing to the console

The job JSON is as follows:

{"job": {"setting": {"speed": {"channel": 3}

},"content": [{"reader": {"name": "hdfsreader","parameter": {"path": "/user/hive/warehouse/test/*","defaultFS": "hdfs://192.168.1.121:8020","column": [{"index": 0,"type": "long"},

{"index": 1,"type": "string"},

{"type": "string","value": "hello"}

],"fileType": "text","encoding": "UTF-8","fieldDelimiter": ","}

},"writer": {"name": "streamwriter","parameter": {"print": true}

}

}]

}

}
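
For comparison with the text job above, reading an orcfile table mainly means setting fileType to "orc" and relying on index-based columns rather than a field delimiter. This is only a sketch based on the format list in section 2; the path is a hypothetical ORC table location, not the directory used in the run below.

    {
        "reader": {
            "name": "hdfsreader",
            "parameter": {
                "path": "/user/hive/warehouse/test_orc/*",
                "defaultFS": "hdfs://192.168.1.121:8020",
                "fileType": "orc",
                "column": [
                    {"index": 0, "type": "long"},
                    {"index": 1, "type": "string"}
                ]
            }
        }
    }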

Run the job (the path uses the * wildcard):

FengZhendeMacBook-Pro:bin FengZhen$ ./datax.py /Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/1.reader_all.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2018-11-18 17:28:30.540 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-11-18 17:28:30.551 [main] INFO Engine - the machine info =>

osInfo: Oracle Corporation 1.8 25.162-b12
jvmInfo: Mac OS X x86_64 10.13.4
cpu num: 4

totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1

GC Names [PS MarkSweep, PS Scavenge]

MEMORY_NAME| allocation_size |init_size

PS Eden Space| 256.00MB | 256.00MB

Code Cache| 240.00MB | 2.44MB

Compressed Class Space| 1,024.00MB | 0.00MB

PS Survivor Space| 42.50MB | 42.50MB

PS Old Gen| 683.00MB | 683.00MB

Metaspace| -0.00MB | 0.00MB

2018-11-18 17:28:30.572 [main] INFO Engine -
{
    "content": [
        {
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "column": [
                        {"index": 0, "type": "long"},
                        {"index": 1, "type": "string"},
                        {"type": "string", "value": "hello"}
                    ],
                    "defaultFS": "hdfs://192.168.1.121:8020",
                    "encoding": "UTF-8",
                    "fieldDelimiter": ",",
                    "fileType": "text",
                    "path": "/user/hive/warehouse/test/*"
                }
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {
                    "print": true
                }
            }
        }
    ],
    "setting": {
        "speed": {
            "channel": 3
        }
    }
}

2018-11-18 17:28:30.601 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null

2018-11-18 17:28:30.605 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0

2018-11-18 17:28:30.605 [main] INFO JobContainer - DataX jobContainer starts job.

2018-11-18 17:28:30.609 [main] INFO JobContainer - Set jobId = 0

2018-11-18 17:28:30.650 [job-0] INFO HdfsReader$Job - init() begin...

2018-11-18 17:28:31.318 [job-0] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":[]}

2018-11-18 17:28:31.318 [job-0] INFO HdfsReader$Job - init() ok and end...

2018-11-18 17:28:31.326 [job-0] INFO JobContainer - jobContainer starts to do prepare ...

2018-11-18 17:28:31.327 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] do prepare work .

2018-11-18 17:28:31.327 [job-0] INFO HdfsReader$Job - prepare(), start to getAllFiles...

2018-11-18 17:28:31.327 [job-0] INFO HdfsReader$Job - get HDFS all files in path = [/user/hive/warehouse/test/*]

Nov 18, 2018 5:28:31 PM org.apache.hadoop.util.NativeCodeLoader

警告: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2018-11-18 17:28:33.323 [job-0] INFO HdfsReader$Job - [hdfs://192.168.1.121:8020/user/hive/warehouse/test/data]是[text]类型的文件, 将该文件加入source files列表

2018-11-18 17:28:33.327 [job-0] INFO HdfsReader$Job - 您即将读取的文件数为: [1], 列表为: [hdfs://192.168.1.121:8020/user/hive/warehouse/test/data]

2018-11-18 17:28:33.328 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .

2018-11-18 17:28:33.328 [job-0] INFO JobContainer - jobContainer starts to do split ...

2018-11-18 17:28:33.329 [job-0] INFO JobContainer - Job set Channel-Number to 3 channels.

2018-11-18 17:28:33.329 [job-0] INFO HdfsReader$Job - split() begin...

2018-11-18 17:28:33.330 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] splits to [1] tasks.

2018-11-18 17:28:33.331 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks.

2018-11-18 17:28:33.347 [job-0] INFO JobContainer - jobContainer starts to do schedule ...

2018-11-18 17:28:33.356 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.

2018-11-18 17:28:33.359 [job-0] INFO JobContainer - Running by standalone Mode.

2018-11-18 17:28:33.388 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.

2018-11-18 17:28:33.396 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.

2018-11-18 17:28:33.397 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.

2018-11-18 17:28:33.419 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started

2018-11-18 17:28:33.516 [0-0-0-reader] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}

2018-11-18 17:28:33.517 [0-0-0-reader] INFO Reader$Task - read start

2018-11-18 17:28:33.518 [0-0-0-reader] INFO Reader$Task - reading file : [hdfs://192.168.1.121:8020/user/hive/warehouse/test/data]

2018-11-18 17:28:33.790 [0-0-0-reader] INFO UnstructuredStorageReaderUtil - CsvReader使用默认值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":",","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值为[null]

2018-11-18 17:28:33.845 [0-0-0-reader] INFO Reader$Task - end read source files...

1 张三 hello

2 李四 hello

2018-11-18 17:28:34.134 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[715]ms

2018-11-18 17:28:34.137 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.

2018-11-18 17:28:43.434 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 16 bytes | Speed 1B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.425s | Percentage 100.00%

2018-11-18 17:28:43.435 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.

2018-11-18 17:28:43.436 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.

2018-11-18 17:28:43.436 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] do post work.

2018-11-18 17:28:43.437 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.

2018-11-18 17:28:43.438 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /Users/FengZhen/Desktop/Hadoop/dataX/datax/hook

2018-11-18 17:28:43.446 [job-0] INFO JobContainer -

[total cpu info] =>

averageCpu | maxDeltaCpu | minDeltaCpu

-1.00% | -1.00% | -1.00%

[total gc info] =>

NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | min
