DataCleaner, Part VII

SunWuKong_Hadoop

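Part VII. Invoking DataCleaner jobs from the command line

Table of Contents

21. Command-line interface

Executables
Usage scenarios
Executing an analysis job
Listing datastore contents and available components
Parameterizable jobs
Dynamically overriding configuration elements

Chapter 21. Command-line interface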

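Abstract

DataCleaner offers a command-line interface (CLI) for performing various tasks, including executing analysis jobs, through simple commands that can be invoked, for example, as a scheduled task.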

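Executables

Depending on your distribution of DataCleaner, you will have one of these CLI executables included:

datacleaner-console.exe, which is a Windows-only executable.
datacleaner.cmd, which is a script to start DataCleaner on Windows.
datacleaner.sh, which is a script to start DataCleaner on Unix-like systems, such as Linux and Mac OS.

If you are running DataCleaner in Java Webstart mode, then there is no command-line interface!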
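When the CLI is driven from a script, the launcher name has to be chosen per platform. The following is a minimal Python sketch of that choice. The function name `cli_executable` is hypothetical, and the sketch assumes that the console executable is the one shipped in your Windows distribution (datacleaner.cmd would be the alternative).

```python
import platform


def cli_executable(system=None):
    """Return the DataCleaner CLI launcher name for the given (or current) OS.

    `system` uses the values returned by platform.system(),
    e.g. "Windows", "Linux" or "Darwin".
    """
    system = system or platform.system()
    if system == "Windows":
        # Assumption: the distribution ships the console executable;
        # datacleaner.cmd is the script-based alternative on Windows.
        return "datacleaner-console.exe"
    # Linux, Mac OS ("Darwin") and other Unix-like systems use the shell script.
    return "datacleaner.sh"
```

Passing `system` explicitly makes the helper easy to check without switching machines.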

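Usage scenarios

The usage scenarios of DataCleaner's CLI are:

Executing an analysis job
Listing registered datastores
Listing schemas in a datastore
Listing tables in a schema
Listing columns in a table
Listing available analyzers, transformers or filters

How these scenarios are invoked is shown by running your executable with the -usage argument: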

> datacleaner-console.exe -usage
-conf (-configuration, --configuration-file) FILE : XML file describing the configuration of DataCleaner
-ds (-datastore, --datastore-name) VAL : Name of datastore when printing a list of schemas, tables or columns
-job (--job-file) FILE : An analysis job XML file to execute
-list [ANALYZERS | TRANSFORMERS | FILTERS | DATASTORES | SCHEMAS | TABLES | COLUMNS] : Used to print a list of various elements available in the configuration
-s (-schema, --schema-name) VAL : Name of schema when printing a list of tables or columns
-t (-table, --table-name) VAL : Name of table when printing a list of columns
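The options in the usage text can be assembled programmatically. Below is a small Python sketch that builds the argument list for a -list call. The function name `build_list_command` and its parameters are hypothetical. The flags are taken from the usage output above, and the element name is passed in lower case, as the examples in this chapter do.

```python
# Element types accepted by -list, as printed in the usage text.
VALID_LIST_TYPES = {
    "ANALYZERS", "TRANSFORMERS", "FILTERS",
    "DATASTORES", "SCHEMAS", "TABLES", "COLUMNS",
}


def build_list_command(element, datastore=None, schema=None, table=None,
                       conf=None, executable="datacleaner-console.exe"):
    """Build an argument list for a `-list` invocation of the CLI."""
    if element.upper() not in VALID_LIST_TYPES:
        raise ValueError(f"unsupported -list element: {element!r}")
    cmd = [executable]
    if conf:
        cmd += ["-conf", conf]       # configuration XML file
    cmd += ["-list", element.lower()]
    if datastore:
        cmd += ["-ds", datastore]    # needed for schemas, tables, columns
    if schema:
        cmd += ["-s", schema]        # needed for tables, columns
    if table:
        cmd += ["-t", table]         # needed for columns
    return cmd
```

The returned list can be handed to `subprocess.run` as-is, which avoids any shell quoting of datastore or table names.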

Executing an analysis job

Here is how to execute an analysis job. We will use the bundled example "employees.analysis.xml":

> datacleaner-console.exe -job examples/employees.analysis.xml
SUCCESS!
...
RESULT:
Value distribution for column: REPORTSTO
Top values:
 - 1102: 6
 - 1143: 6
 - 1088: 5
Null count: 0
Unique values: 0

RESULT:
            Match count  Sample
Aaaaaaa              22  William
Aaaa Aaa              1  Foon Yue

RESULT:
            Match count  Sample
aaaaaaaaaa           23  jfirrelli

RESULT:
                             Match count  Sample
Aaaaa Aaa                             17  Sales Rep
AA Aaaaaaaaa                           2  VP Marketing
Aaaa Aaaaaaa (AAAA)                    1  Sale Manager (EMEA)
Aaaaa Aaaaaaa (AA)                     1  Sales Manager (NA)
Aaaaa Aaaaaaa (AAAAA, AAAA)            1  Sales Manager (JAPAN, APAC)
Aaaaaaaaa                              1  President
...

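As you can see from the listing, the results of the analysis are printed directly to the command-line output. If you want to save the results to a file, simply use your operating system's built-in ability to redirect command-line output to a file, typically with the ">" operator.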
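The same redirection can be done from a script, for instance when the job runs as a scheduled task. This is a minimal Python sketch using the standard `subprocess` module. The helper name `run_and_save` is hypothetical, and the output file name in the usage comment is only an example.

```python
import subprocess
import sys


def run_and_save(cmd, output_path):
    """Run `cmd` and write its standard output to `output_path`.

    Returns the exit code of the process. This mirrors what
    `command > file` does in a shell.
    """
    with open(output_path, "wb") as out:
        completed = subprocess.run(cmd, stdout=out)
    return completed.returncode


# Usage (assumes the executable is reachable from the working directory):
# run_and_save(
#     ["datacleaner-console.exe", "-job", "examples/employees.analysis.xml"],
#     "employees_result.txt",
# )
```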

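Listing datastore contents and available components

The command-line interface allows you to list datastore contents and available components. The intended purpose is to help with hand-editing an analysis file, if that is desired. By using the -list argument you can get the metadata of your datastore and of the DataCleaner components, which allows you to compose an analysis file manually.

Listing the contents of a datastore is fairly self-explanatory if you look at the output of the -usage command. Here are a few examples, using the example database "orderdb":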

> datacleaner-console.exe -list datastores
Datastores:
-----------
Country codes
orderdb

> datacleaner-console.exe -list tables -ds orderdb
Tables:
-------
CUSTOMERS
CUSTOMER_W_TER
DEPARTMENT_MANAGERS
DIM_TIME
EMPLOYEES
OFFICES
ORDERDETAILS
ORDERFACT
ORDERS
PAYMENTS
PRODUCTS
QUADRANT_ACTUALS
TRIAL_BALANCE

> datacleaner-console.exe -list columns -ds orderdb -table employees
Columns:
--------
EMPLOYEENUMBER
LASTNAME
FIRSTNAME
EXTENSION
EMAIL
OFFICECODE
REPORTSTO
JOBTITLE
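If a script needs the listed names rather than the raw text, the output can be parsed. The sketch below assumes the layout shown in the examples above: a header line, an underline of dashes, then one entry per line. The function name `parse_listing` is hypothetical, and the layout may differ between DataCleaner versions.

```python
def parse_listing(output):
    """Return the entries printed after the dashed underline of a listing."""
    entries = []
    in_body = False
    for raw in output.splitlines():
        line = raw.strip()
        if not in_body:
            # A line consisting only of dashes marks the end of the header.
            if line and set(line) == {"-"}:
                in_body = True
            continue
        if line:
            entries.append(line)
    return entries
```

Entries are kept as whole lines, so names containing spaces (such as "Country codes") stay intact.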

Listing DataCleaner's components is done by setting the -list argument to one of the three component types: ANALYZERS, TRANSFORMERS or FILTERS:

> datacleaner-console.exe -list analyzers
...
name: Matching analyzer
- Consumes multiple input columns (type: UNDEFINED)
- Property: name=Dictionaries, type=Dictionary, required=false
- Property: name=String patterns, type=StringPattern, required=false
name: Pattern finder
- Consumes 2 named inputs
   Input column: Column (type: STRING)
   Input column: Group column (type: STRING)
- Property: name=Discriminate text case, type=Boolean, required=false
- Property: name=Discriminate negative numbers, type=Boolean, required=false
- Property: name=Discriminate decimals, type=Boolean, required=false
- Property: name=Enable mixed tokens, type=Boolean, required=false
- Property: name=Ignore repeated spaces, type=Boolean, required=false
- Property: name=Upper case patterns expand in size, type=boolean, required=false
- Property: name=Lower case patterns expand in size, type=boolean, required=false
- Property: name=Predefined token name, type=String, required=false
- Property: name=Predefined token regexes, type=String, required=false
- Property: name=Decimal separator, type=Character, required=false
- Property: name=Thousands separator, type=Character, required=false
- Property: name=Minus sign, type=Character, required=false
...

> datacleaner-console.exe -list transformers
...
name: Tokenizer
- Consumes a single input column (type: STRING)
- Property: name=Delimiters, type=char, required=true
- Property: name=Number of tokens, type=Integer, required=true
- Output type is: STRING
name: Whitespace trimmer
- Consumes multiple input columns (type: STRING)
- Property: name=Trim left, type=boolean, required=true
- Property: name=Trim right, type=boolean, required=true
- Property: name=Trim multiple to single space, type=boolean, required=true
- Output type is: STRING
...
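When composing an analysis file by hand, it can help to turn the component listing into a lookup of property names and types. The following Python sketch parses the "name:" and "- Property:" lines in the format shown above. The function name `parse_components` is hypothetical, and the line format is assumed from this listing only, so treat it as a starting point rather than a stable contract.

```python
import re

# Matches lines such as:
#   - Property: name=Number of tokens, type=Integer, required=true
PROPERTY_RE = re.compile(
    r"- Property: name=(.+), type=(\w+), required=(true|false)"
)


def parse_components(output):
    """Map each component name to a list of its property descriptions."""
    components = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("name: "):
            current = line[len("name: "):]
            components[current] = []
        elif current is not None:
            match = PROPERTY_RE.fullmatch(line)
            if match:
                name, type_name, required = match.groups()
                components[current].append({
                    "name": name,
                    "type": type_name,
                    "required": required == "true",
                })
    return components
```

Lines that are neither a component name nor a property (input descriptions, output types, the "..." markers) are skipped.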

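Parameterizable jobs

If you want to make part of a job parameterizable (variable), this is possible. Currently the feature is only supported by editing the .analysis.xml files, since the DataCleaner graphical user interface does not store job variables when saving jobs.

In the source section of your job, you can add variables: key/value pairs that can be referenced throughout your job. Each variable can have a default value, which is used when no value is specified for the variable. Here is a simple example: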

...
<source>
  <data-context ref="my_datastore" />
  <columns>
    <column path="column1" id="col_1" />
    <column path="column2" id="col_2" />
  </columns>
  <variables>
    <variable id="filename" value="/output/dc_output.csv" />
    <variable id="separator" value="," />
  </variables>
</source>
...
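To check which variables a job declares and what their defaults are, the source section can be read with Python's standard `xml.etree.ElementTree`. This sketch parses the fragment exactly as shown above. The function name `variable_defaults` is hypothetical. If a complete job file declares an XML namespace on its root element, the tag names in the search path would need that namespace as well.

```python
import xml.etree.ElementTree as ET

# The <source> fragment from the example above.
SOURCE_XML = """
<source>
  <data-context ref="my_datastore" />
  <columns>
    <column path="column1" id="col_1" />
    <column path="column2" id="col_2" />
  </columns>
  <variables>
    <variable id="filename" value="/output/dc_output.csv" />
    <variable id="separator" value="," />
  </variables>
</source>
"""


def variable_defaults(source_element):
    """Return {variable id: default value} for a <source> element."""
    return {
        variable.get("id"): variable.get("value")
        for variable in source_element.findall("./variables/variable")
    }


defaults = variable_defaults(ET.fromstring(SOURCE_XML))
```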

In this example we have defined two variables: filename and separator. We can refer to them for specific property values further down in the job:

...
<analyzer>
  <descriptor ref="Write to CSV file"/>
  <properties>
    <property name="File" ref="filename" />
    <property name="Quote char" value="&quot;" />
    <property name="Separator char" ref="separator" />
  </properties>
  <input ref="col_1" />
  <input ref="col_2" />
</analyzer>
...

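Now the values of the "File" and "Separator char" properties of the "Write to CSV file" analyzer have been made parameterizable. To execute the job with new variable values, use the -var argument on the command line, like this: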

> datacleaner-console.exe -job my_job.analysis.xml -var filename=/output/my_file.csv -var separator=;
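When variable values come from a script, the -var arguments can be generated from a dictionary. The sketch below is one way to do that in Python. The function name `build_job_command` is hypothetical, and only the -job and -var arguments shown in this chapter are used.

```python
def build_job_command(job_file, variables=None,
                      executable="datacleaner-console.exe"):
    """Build an argument list that runs `job_file` with -var overrides."""
    cmd = [executable, "-job", job_file]
    for key, value in (variables or {}).items():
        # Each variable becomes its own "-var key=value" pair.
        cmd += ["-var", f"{key}={value}"]
    return cmd


command = build_job_command(
    "my_job.analysis.xml",
    {"filename": "/output/my_file.csv", "separator": ";"},
)
```

Passing the resulting list to `subprocess.run` without a shell hands a value such as ";" to the program unchanged. In a Unix-like shell the same character would have to be quoted, because ";" separates commands there.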