kettle数据抽取--抽取文本数据

一.tsv文件抽取

1.通过使用Kettle工具,创建一个转换tsv_extract,添加“文本文件输入”控件、“表输出”控件以及Hop跳连接线,具体如图所示。

 

2.双击“文本文件输入”控件,进入“文本文件输入”界面

 单击【浏览】按钮,选择要抽取的文件tsv_extract.tsv。单击【增加】按钮,将要抽取的TSV文件添加到转换“tsv_extract”中。

 

单击“内容”选项卡,清除分隔符处默认分隔符“;”并单击【Insert TAB】按钮,在分隔符处插入一个制表符;取消勾选“头部”复选框。

单击【预览记录】按钮,查看文件tsv_extract.tsv的数据是否成功抽取到文本文件输入流中。

 

3.配置表输出控件

双击“表输出”控件,进入“表输出”控件的配置界面。单击【新建】按钮,配置数据库连接,配置完成后单击【确认】按钮。

 

单击目标表右侧的【浏览】按钮,获取目标表,即数据表tsv。

 勾选图4-13中的“指定数据库字段”复选框。

 

单击“数据库字段”选项卡,再单击【输入字段映射】按钮,弹出“映射匹配”对话框,将“源字段”选项框的字段和“目标字段”选项框对应的字段进行映射匹配。

 

“表输出”控件配置的最终效果,具体如图所示。

 

3.运行转换

单击转换工作区顶部的     按钮,运行创建的tsv_extract转换。结果如图

 在数据库中结果如图

二.CSV文件的抽取

1.通过使用Kettle工具,创建一个转换csv_extract,并添加“CSV文件输入”控件、“表输出”控件以及Hop跳连接线,具体如图所示。

 2.配置CSV文件输入控件

双击“CSV文件输入”控件,进入“CSV文件输入”界面。单击【浏览】按钮,选择要抽取的文件csv_extract.csv。

单击【获取字段】按钮,Kettle自动检索CSV文件,并对文件中的字段类型、格式、长度、精度等属性进行分析。 

 单击【预览】按钮,查看文件csv_extract.csv的数据是否抽取到CSV文件输入流中。

 

3.配置表输出控件

双击“表输出”控件,进入“表输出”控件的配置界面。单击【新建】按钮,配置数据库连接,配置完成后单击【确认】按钮。

 单击目标表右侧的【浏览】按钮,获取目标表,即数据表csv;勾选“指定数据库字段”的复选框。

 单击“数据库字段”选项卡,再单击【输入字段映射】按钮,弹出“映射匹配”对话框,将“源字段”选项框的字段和“目标字段”选项框对应的字段进行映射匹配。

 表输出”控件配置的最终效果,具体如图所示。

 

4.运行转换

单击转换工作区顶部的     按钮,运行创建的csv_extract转换。运行结果如图

 

 

在数据库中结果如图

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
目录 Kettle 3.0 用户手册 ...................................................................................................................... 1 Kettle 3.0 用户手册 ...................................................................................................................... 6 1. Kettle 介绍 ................................................................................................................................ 6 1.1 什么是kettle ............................................................................................................. 6 1.2 Kettle 的安装 ............................................................................................................ 6 1.3 运行Spoon ............................................................................................................... 6 1.4 资源库 ....................................................................................................................... 6 1.5 资源库自动登录 ....................................................................................................... 7 1.6 定义 ........................................................................................................................... 8 1.6.1 转换 ................................................................................................................... 8 1.6.2 任务 ................................................................................................................... 8 1.7 选项 ........................................................................................................................... 9 1.7.1 General 标签 ................................................................................................... 10 1.7.2 Look Feel 标签 ............................................................................................... 11 1.8 搜索元数据 ............................................................................................................. 12 1.9 设置环境变量 ......................................................................................................... 13 2. 创建一个转换或任务 ............................................................................................................. 13 3. 数据库连接(Database Connections) ...................................................................................... 14 3.1 描述 ......................................................................................................................... 14 3.2 设置窗口 ................................................................................................................. 14 3.3 选项 ......................................................................................................................... 14 3.4 数据库用法 ............................................................................................................. 15 4. SQL 编辑器(SQL Editor) ....................................................................................................... 16 4.1 描述 ......................................................................................................................... 16 4.2 屏幕截图 ................................................................................................................. 16 4.3 局限性 ..................................................................................................................... 16 5. 数据库浏览器(Database Explorer) ........................................................................................ 17 5.1 屏幕截图 ................................................................................................................. 17 5.2 描述 ......................................................................................................................... 17 6. 节点连接(Hops) ................................................................................................................ 18 6.1 描述 ......................................................................................................................... 18 6.2 转换连接 ................................................................................................................. 18 6.3 任务连接 ................................................................................................................. 18 6.4 屏幕截图 ................................................................................................................. 18 6.5 创建一个连接 ......................................................................................................... 19 6.6 拆分一个连接 ......................................................................................................... 19 6.7 转换连接颜色 ......................................................................................................... 19 7. 变量(Variables) .................................................................................................................. 20 技术资料,【Kette3.0 用户手册】 ©深圳市神盾信息技术有限公司,2008 第2 页/共202 页 7.1 变量使用 ................................................................................................................. 20 7.2 变量范围 ................................................................................................................. 20 7.2.1 环境变量 ......................................................................................................... 20 7.2.2 Kettle 变量 ...................................................................................................... 21 7.2.3 内部变量 ......................................................................................................... 21 8. 转换设置(Transformation Settings) .................................................................................. 22 8.1 描述 ......................................................................................................................... 22 8.2 屏幕截图 ................................................................................................................. 22 8.3 选项 ......................................................................................................................... 25 8.4 其它 ......................................................................................................................... 26 9. 转换步骤(Transformation steps) ....................................................................................... 27 9.1 描述 ......................................................................................................................... 27 9.2 运行步骤的多个副本 ............................................................................................. 27 9.3 分发或者复制 ......................................................................................................... 28 9.4 常用错误处理 ......................................................................................................... 29 9.5 Apache 虚拟文件系统(VFS)支持 .................................................................... 31 9.6 转换步骤类型 ......................................................................................................... 33 9.6.1 文本文件输入(Text Input) ........................................................................ 33 9.6.2 表输入(Table Input) ................................................................................... 45 9.6.3 获取系统信息(Get System Info) ............................................................... 47 9.6.4 生成行(Generate Rows) ............................................................................ 51 9.6.5 文件反序列化(De-serialize from file)(原来名称为Cube 输入) .......... 52 9.6.6 XBase 输入(XBase input) ......................................................................... 53 9.6.7 Excel 输入(Excel Input) ............................................................................ 54 9.6.8 XML 输入(XML input) .................................................................................. 58 9.6.9 获取文件名(Get File Names) ......................................................................... 61 9.6.10 文本文件输出(Text File Output) .................................................................... 62 9.6.11 表输出(Table output) ...................................................................................... 65 9.6.12 插入/更新(Insert/Update) ............................................................................... 68 9.6.13 更新(Update) ............................................................................................. 70 9.6.14 删除(Delete) .................................................................................................... 71 9.6.15 序列化到文件(Serialize to file)(以前是Cube Output) .................................. 72 9.6.16 XML 输出(XML output) ........................................................................... 73 9.6.17 Excel 输出(Excel Output) ............................................................................... 76 9.6.18 Access 输出(Microsoft Access Output) ..................................................... 78 9.6.19 数据库查询(Database lookup) ....................................................................... 80 9.6.20 流查询(Stream lookup) ............................................................................. 81 9.6.21 调用数据库存储过程(Call DB Procedure) .................................................... 83 9.6.22 HTTP 客户端(HTTP Cient) ............................................................................ 84 9.6.23 字段选择 (Select values) ............................................................................... 86 9.6.24 过滤记录(Filter rows) ..................................................................................... 89 9.6.25 排序记录(Sort rows) ................................................................................. 90 9.6.26 添加序列(Add sequence) ................................................................................ 91 9.6.27 空操作-什么都不做(Dummy-do nothing) ..................................................... 93 技术资料,【Kette3.0 用户手册】 ©深圳市神盾信息技术有限公司,2008 第3 页/共202 页 9.6.28 行转列(Row Normaliser) ................................................................................ 95 9.6.29 拆分字段(Split Fields)............................................................................... 97 9.6.30 去除重复记录(Unique rows) .......................................................................... 98 9.6.31 分组(Group By) ............................................................................................ 100 9.6.32 设置为空值(Null if) ...................................................................................... 101 9.6.33 计算器(Calculator) .................................................................................. 102 9.6.34 增加XML(XML Add) ............................................................................. 104 9.6.35 增加常量(Add constants) ........................................................................ 106 9.6.36 行转列(Row Denormaliser) ..................................................................... 107 9.6.37 行扁平化(Flattener) ................................................................................. 108 9.6.38 值映射(Value Mapper) ............................................................................ 110 9.6.39 被冻结的步骤(Blocking step) ................................................................. 111 9.6.40 记录关联(笛卡尔输出)(Join Rows-Cartesian Product)....................... 112 9.6.41 数据库连接(Database Join) ..................................................................... 114 9.6.42 合并记录(Merge rows) ............................................................................ 115 9.6.43 存储合并(Stored Merge) ......................................................................... 116 9.6.44 合并连接(Merge Join) .................................................................................. 117 9.6.45 JavaScript 值(JavaScript Value) ............................................................. 119 9.6.46 改进的JavaScript 值(Modified JavaScript Value) ................................ 127 9.6.47 执行SQL 语句(Execute SQL script) ...................................................... 129 9.6.48 维度更新/查询(Dimension lookup/update) ............................................ 132 9.6.49 联合更新/查询(Combination lookup/update) ......................................... 133 9.6.50 映射(Mapping) ........................................................................................ 134 9.6.51 从结果获取记录(Get rows from result) .................................................. 135 9.6.52 复制记录到结果(Copy rows to result) .................................................... 135 9.6.53 设置变量(Set Variable) ........................................................................... 136 9.6.54 获取变量(Get Variable) ........................................................................... 137 9.6.55 从以前的结果获取文件(Get files from result) ....................................... 138 9.6.56 复制文件名到结果(Set files in result) .................................................... 139 9.6.57 记录注射器(Injector) .............................................................................. 140 9.6.58 套接字读入器(Socket Reader) ................................................................ 141 9.6.59 套接字输写器(Socket Writer) ................................................................. 141 9.6.60 聚合行(Aggregate Rows) ........................................................................ 142 9.6.61 流XML 输入(Streaming XML Input) .................................................. 143 9.6.62 中止(Abort) ............................................................................................. 149 9.6.63 Oracle 批量装载(Oracle bulk loader) ...................................................... 151 10. 任务设置(Job Settings) ........................................................................................... 153 10.1 描述 ....................................................................................................................... 153 10.2 屏幕截图 ............................................................................................................... 153 10.3 选项 ....................................................................................................................... 153 10.4 其它 ....................................................................................................................... 154 11. 任务条目(Job Entries) ............................................................................................. 154 11.1 描述 ....................................................................................................................... 154 11.2 任务条目类型 ....................................................................................................... 155 技术资料,【Kette3.0 用户手册】 ©深圳市神盾信息技术有限公司,2008 第4 页/共202 页 11.2.1 特殊的任务条目 ........................................................................................... 155 11.2.2 转换 ............................................................................................................... 157 11.2.3 任务 ............................................................................................................... 159 11.2.4 Shell .............................................................................................................. 161 11.2.5 Mail ............................................................................................................... 163 11.2.6 SQL ............................................................................................................... 165 11.2.7 FTP ................................................................................................................ 166 11.2.8 Table Exists ................................................................................................... 168 11.2.9 File Exists ...................................................................................................... 169 11.2.10 Evaluation(javascript) ................................................................................... 170 11.2.11 SFTP .............................................................................................................. 171 11.2.12 HTTP ............................................................................................................. 173 11.2.13 Create file ...................................................................................................... 175 11.2.14 Delete file ...................................................................................................... 176 11.2.15 Wait for file ................................................................................................... 177 11.2.16 File compare .................................................................................................. 178 11.2.17 Put files with secureFTP ............................................................................... 180 11.2.18 Ping a host ..................................................................................................... 181 11.2.19 Wait for .......................................................................................................... 182 11.2.20 Display Msgbox info ..................................................................................... 183 11.2.21 Abort job ....................................................................................................... 184 11.2.22 XSL transformation ....................................................................................... 185 11.2.23 Zip files ......................................................................................................... 186 12. 图形界面(Graphical View) ...................................................................................... 187 12.1 描述 ....................................................................................................................... 187 12.2 添加步骤或者任务条目 ....................................................................................... 188 12.2.1 拖放创建步骤 ............................................................................................... 188 12.2.2 从步骤类型树创建步骤 ............................................................................... 188 12.2.3 在你想要的位置创建步骤 ........................................................................... 189 12.3 隐藏步骤 ............................................................................................................... 189 12.4 转换步骤选项(右键上下文菜单) ................................................................... 189 12.4.1 编辑步骤 ....................................................................................................... 189 12.4.2 编辑步骤描述 ............................................................................................... 189 12.4.3 数据迁移 ....................................................................................................... 189 12.4.4 复制 ............................................................................................................... 189 12.4.5 复制步骤 ....................................................................................................... 189 12.4.6 删除步骤 ....................................................................................................... 190 12.4.7 显示输入字段 ............................................................................................... 190 12.4.8 显示输出字段 ............................................................................................... 190 12.5 任务条目选项(右键上下文菜单) ................................................................... 190 12.5.1 打开转换/任务 .............................................................................................. 190 12.5.2 编辑任务入口 ............................................................................................... 190 12.5.3 编辑任务入口描述 ....................................................................................... 190 12.5.4 复制任务入口 ............................................................................................... 190 技术资料,【Kette3.0 用户手册】 ©深圳市神盾信息技术有限公司,2008 第5 页/共202 页 12.5.5 复制选择的任务入口到剪贴板 ................................................................... 190 12.5.6 排列/分布 ...................................................................................................... 191 12.5.7 拆开节点 ....................................................................................................... 191 12.5.8 删除所有任务入口的副本 ........................................................................... 191 12.6 添加节点连接 ....................................................................................................... 191 12.7 运行转换 ............................................................................................................... 191 12.8 屏幕截图 ............................................................................................................... 191 12.9 执行选项 ............................................................................................................... 192 12.9.1 在哪里执行 ................................................................................................... 192 12.9.2 预览 ............................................................................................................... 192 12.9.3 使用安全模式 ............................................................................................... 192 12.9.4 日志级别 ....................................................................................................... 192 12.9.5 重放日期 ....................................................................................................... 192 12.9.6 参数 ............................................................................................................... 192 12.9.7 变量 ............................................................................................................... 192 12.10 设置远程或者从属服务器 ................................................................................... 193 12.10.1 概述 ....................................................................................................... 193 12.10.2 屏幕截图 ............................................................................................... 193 13. 日志(Logging) ......................................................................................................... 193 13.1 日志描述 ............................................................................................................... 193 13.2 屏幕截图 ............................................................................................................... 194 13.3 日志网格 ............................................................................................................... 194 13.3.1 转换日志网格 ............................................................................................... 194 13.3.2 任务日志网格 ............................................................................................... 195 13.4 按钮 ....................................................................................................................... 195 13.4.1 转换按钮 ....................................................................................................... 195 13.4.2 任务按钮 ....................................................................................................... 197 14. 网格(Grids) .............................................................................................................. 198 14.1 描述 ....................................................................................................................... 198 14.2 功能 ....................................................................................................................... 198 14.3 导航 ....................................................................................................................... 199 15. 资源库浏览器(Repository Explorer) ...................................................................... 199 15.1 描述 ....................................................................................................................... 199 15.2 屏幕截图 ............................................................................................................... 200 15.3 右键单击功能 ....................................................................................................... 200 15.4 备份/资源库 .......................................................................................................... 200 16. 共享对象(Share objects) ......................................................................................... 201
Kettle是一个强大的数据集成工具,可以用来进行数据抽取、转换和加载(ETL)操作。数据抽取ETL过程的重要一步,它主要负责从外部数据源中抽取数据并将其输入到Kettle数据流中。 在Kettle中,数据抽取的步骤主要包括两个类别:输入类和文件抽取。 - 输入类的步骤用于从外部数据源中抽取数据,常用的步骤有“文本文件输入”和“表输入”。这些步骤通常在转换层完成,而准备要读取的数据的功能则在作业层完成。你可以通过选项名称来了解各个步骤和作业项的功能选项,如果需要详细的说明,可以查阅Kettle的在线帮助文档。 - 文件抽取Kettle数据抽取过程中的一个重要步骤。它涉及到从源抽取数据并将其导入数据仓库或过渡区。在文件抽取中,可以通过两种方式来实现数据抽取:拉模式和推模式。拉模式是指数据仓库主动从源系统拉取数据,而推模式则是源系统将数据推送给数据仓库。选择数据抽取方式的一个重要因素是操作型系统的可用性和数据量,以及需要抽取哪部分源数据加载到数据仓库。 此外,Kettle还支持与Hadoop生态圈中的工具Sqoop进行数据抽取的集成。Sqoop可以在关系数据库和HDFS或Hive之间进行数据导入和导出。你可以使用Kettle的Sqoop输入和输出作业项来实现与Sqoop的集成,并进行数据抽取操作。 总结起来,Kettle是一个功能强大的数据集成工具,可以通过不同的步骤和作业项来实现数据抽取操作,并支持与Hadoop生态圈中的工具进行集成。你可以根据具体的需求选择适合的步骤和作业项来完成数据抽取任务。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值