Using Kettle Transformation Steps: Usage Notes and Lessons Learned

Buty9147

Category: ETL tools. Tags: kettle, Kettle database connection configuration, running transformations and jobs from scripts, using steps

Article link: https://blog.csdn.net/Butingnal/article/details/73176048

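While practicing I followed the tutorial by teacher 子建 (Zijian). Thanks to him for sharing it so generously; here is the link, as a matter of credit: https://edu.hellobi.com/course/37/play/lesson/669

Kettle version used: 7.1

My notes follow.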

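Using the steps:

1. "Value Mapper" can also add a new column to hold the mapped value. For example, fill in "Target field name" and a new column is added that holds the name corresponding to each id.

2. When concatenating year and month I found a space in the middle, e.g. 2017_ 1. Fix: in Concat Fields, set Trim Type to trim both ends. If that still does not help, set Format to # and the space goes away.

3. Prefer "Microsoft Excel Writer" ("Microsoft Excel 输出" in the Chinese UI) over "Excel Output" ("Excel输出"). With the older step, date/time formats do not take effect and the formatting of the generated workbook is poor. The newer step lets you define header names and offers Stream XLSX data to improve performance, among other benefits.

4. I have gotten used to debugging with "Write to log" to print variables and results. It works well.

5. You can drop a step onto an existing hop. Kettle asks whether you want to split the hop; confirm and the step is inserted in between.

6. Whatever the step, if you know a field's type, set it explicitly. Leaving Kettle to guess sometimes causes format conversion errors. The same goes for Format: when no special format is needed, set it to # to reduce the chance of problems such as the one in item 2.

7. In "Text file output", the Content tab has an option called "Fast data dump (no formatting)". Tick it and the output runs much faster and the file is much smaller.

8. When "Excel Input" reads several files from a directory, the wildcard is a regular expression. For example, if the file names are: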
e1.xlsx
e2.xlsx
e3.xlsx

then the wildcard should be e.*.xlsx

To learn more about wildcards and regular expressions, see http://www.dataguru.cn/thread-506092-1-1.html. It is very well written.
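To make the wildcard in item 8 concrete, here is a minimal sketch of how such a pattern behaves as a regular expression. It assumes that the pattern must match the whole file name, and it uses Python's `re` module as a stand-in for the Java regex engine that Kettle uses (for a pattern this simple the two should agree). The helper `matching_files` and the last two file names are made up for illustration.

```python
import re

# Assumption: Python's `re` stands in for Kettle's Java regex engine here.
LOOSE = re.compile(r"e.*.xlsx")    # the pattern from the note above
STRICT = re.compile(r"e.*\.xlsx")  # same idea, with the dot before the extension escaped


def matching_files(pattern, names):
    """Return the names that the pattern matches in full (hypothetical helper)."""
    return [name for name in names if pattern.fullmatch(name)]


# The last two names are hypothetical, added to show the difference.
names = ["e1.xlsx", "e2.xlsx", "e3.xlsx", "e1_xlsx", "notes.txt"]

print(matching_files(LOOSE, names))   # ['e1.xlsx', 'e2.xlsx', 'e3.xlsx', 'e1_xlsx']
print(matching_files(STRICT, names))  # ['e1.xlsx', 'e2.xlsx', 'e3.xlsx']
```

Both patterns pick up the three Excel files. The unescaped dot also matches any other character, so escaping it is a slightly stricter variant if stray names like e1_xlsx could ever show up in the directory.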

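Running transformations and jobs from scripts:

1) Pan.bat/Pan.sh runs transformation files

Parameters:
/rep : Repository name
/user : Repository username
/pass : Repository password
/trans : The name of the transformation to launch
/dir : The directory (don't forget the leading /)
/file : The filename to launch (the XML file containing the transformation)
/level : The logging level (Basic, Detailed, Debug, Rowlevel, Error, Nothing)
/logfile : The logging file to write to
/listdir : List the directories in the repository
/listtrans : List the transformations in the specified directory
/listrep : List the available repositories
/exprep : Export all repository objects to one XML file
/norep : Do not log into the repository
/safemode : Run in safe mode: with extra checking enabled
/version : Show the version, revision and build date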
/param : Set a named parameter <NAME>=<VALUE>. For example -param:FOO=bar
/listparam : List information concerning the defined named parameters in the specified transformation.
/metrics : Gather metrics during execution
/maxloglines : The maximum number of log lines that are kept internally by Kettle. Set to 0 to keep all rows (default)
/maxlogtimeout : The maximum age (in minutes) of a log line while being kept internally by Kettle. Set to 0 to keep all rows indefinitely (default)


Examples:
F:\Program Files\pdi-ce-7.1.0.0-12\data-integration>Pan /file C:\Users\butin\Desktop\Kettle_Repository\PRACTICE\EXCEL_SINGLE_TEST.ktr
F:\Program Files\pdi-ce-7.1.0.0-12\data-integration>Pan /rep local_repository /user admin /pass admin /dir /practice /trans EXCEL_SINGLE_TEST

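Things to note:
1. When running a transformation from a repository, the name after /trans must not include the .ktr suffix.
2. The repository's database connection needs an encoding set, otherwise Chinese text comes out garbled.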
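When Pan is called from a scheduler or a wrapper script, it can help to assemble the command in one place. The following is only a sketch: `build_pan_args` is a hypothetical helper, it follows the space-separated option form used in the two examples above, and it only builds the argument list (the result could be handed to `subprocess.run` from the data-integration directory). It also enforces the first note and the leading / on the directory.

```python
def build_pan_args(file=None, rep=None, user=None, password=None,
                   directory="/", trans=None, level=None, logfile=None):
    """Build a Pan argument list for a .ktr file or a repository transformation.

    Hypothetical helper: it assembles arguments only and does not start Pan.
    """
    if file is not None:
        options = [("/file", file)]
    else:
        if rep is None or trans is None:
            raise ValueError("pass either file, or both rep and trans")
        if trans.lower().endswith(".ktr"):
            # Note 1 above: repository transformation names carry no suffix.
            raise ValueError("do not add .ktr to a repository transformation name")
        if not directory.startswith("/"):
            raise ValueError("the repository directory needs a leading /")
        options = [("/rep", rep), ("/user", user), ("/pass", password),
                   ("/dir", directory), ("/trans", trans)]
    options += [("/level", level), ("/logfile", logfile)]

    args = ["Pan.bat"]
    for flag, value in options:
        if value is not None:  # skip options that were not supplied
            args += [flag, value]
    return args


print(build_pan_args(rep="local_repository", user="admin", password="admin",
                     directory="/practice", trans="EXCEL_SINGLE_TEST"))
```

Raising an error for a .ktr suffix surfaces the mistake before Pan is even started, which is easier to spot than a failed lookup in the repository.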


2) Kitchen.bat/Kitchen.sh runs job files

Parameters:
/rep : Repository name
/user : Repository username
/pass : Repository password
/job : The name of the job to launch
/dir : The directory (don't forget the leading /)
/file : The filename (Job XML) to launch
/level : The logging level (Basic, Detailed, Debug, Rowlevel, Error, Minimal, Nothing)
/logfile : The logging file to write to
/listdir : List the directories in the repository
/listjobs : List the jobs in the specified directory
/listrep : List the available repositories
/norep : Do not log into the repository
/version : show the version, revision and build date
/param : Set a named parameter <NAME>=<VALUE>. For example -param:FILE=customers.csv
/listparam : List information concerning the defined parameters in the specified job.
/export : Exports all linked resources of the specified job. The argument is the name of a ZIP file.
/custom : Set a custom plugin specific option as a String value in the job using <NAME>=<Value>, for example: -custom:COLOR=Red
/maxloglines : The maximum number of log lines that are kept internally by Kettle. Set to 0 to keep all rows (default)
/maxlogtimeout : The maximum age (in minutes) of a log line while being kept internally by Kettle. Set to 0 to keep all rows indefinitely (default)
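As a rough illustration of how these options combine, here is a small sketch that assembles a Kitchen command with a log level and named parameters. `build_kitchen_args` and the job file name daily_load.kjb are hypothetical. The parameter form `-param:NAME=VALUE` is copied from the example in the list above; mixing it with the `/file` style of the other options is an assumption, so adjust it to whatever your Kettle version accepts. The function only builds the argument list and does not run Kitchen.

```python
# Log levels as listed for /level above.
LOG_LEVELS = ("Basic", "Detailed", "Debug", "Rowlevel", "Error", "Minimal", "Nothing")


def build_kitchen_args(file, level="Basic", logfile=None, params=None):
    """Build a Kitchen argument list for a job file (hypothetical helper)."""
    if level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    args = ["Kitchen.bat", "/file", file, "/level", level]
    if logfile is not None:
        args += ["/logfile", logfile]
    # Named parameters, in the -param:NAME=VALUE form shown in the option list.
    for name, value in (params or {}).items():
        args.append(f"-param:{name}={value}")
    return args


# daily_load.kjb is a made-up job file name.
print(build_kitchen_args("daily_load.kjb", level="Detailed",
                         params={"FILE": "customers.csv"}))
```

Checking the level against the documented list catches typos such as "Detail" before the job is launched.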