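Purpose of configuring a UIMA pipeline
The main task in configuring a UIMA pipeline is to determine the sequence of resources that are used to analyze documents. Language resources are the tools that analyze and annotate documents. Annotation is a key UIMA concept: it turns natural-language text into annotated, semi-structured text. In practice, the UIMA pipeline configuration is the place where the parameters for the language resources are stored.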
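Tip
If a stage requires an input type that the preceding stage does not provide, a warning icon is displayed next to that input type. Right-click the missing type and click Find to see a list of the Content Analytics Studio resources in your project that can generate the missing type. Then add one of those resources to the pipeline.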
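The check behind the warning icon can be pictured with a small model. The following Python sketch is illustrative only and is not the Content Analytics Studio implementation: the Stage class, the find_missing_inputs function, and the annotation type names are all hypothetical, and the sketch assumes that a type counts as provided when any earlier stage creates it.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stand-in for a pipeline stage; not a UIMA or Studio class."""
    name: str
    inputs: set = field(default_factory=set)   # annotation types the stage needs
    outputs: set = field(default_factory=set)  # annotation types the stage creates


def find_missing_inputs(stages):
    """Return (stage name, type) pairs for inputs that no earlier stage created."""
    available = set()
    missing = []
    for stage in stages:
        # Anything required but not yet available would get a warning icon.
        for type_name in sorted(stage.inputs - available):
            missing.append((stage.name, type_name))
        available |= stage.outputs
    return missing


# Hypothetical two-stage pipeline: nothing creates "ProductName".
pipeline = [
    Stage("Lexical Analysis", outputs={"Token", "Sentence"}),
    Stage("Parsing Rules", inputs={"Token", "ProductName"}, outputs={"Complaint"}),
]

for stage_name, type_name in find_missing_inputs(pipeline):
    print(f"warning: {stage_name} needs {type_name}, which no earlier stage creates")
```

In this model, adding a stage that creates ProductName before the Parsing Rules stage (for example, a custom dictionary in the lexical analysis stage) would clear the warning.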
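Configuring a UIMA pipeline is an iterative process. As you create more resources, such as new custom dictionaries and parsing rules databases, you must go back and edit the UIMA pipeline configuration file to include those resources in the analysis process.
Configuration steps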
In the Studio Explorer view, right-click the Configuration/Annotators directory in your project and click New > UIMA Pipeline Configuration.
Configure the stages of the UIMA pipeline:
(a) In the UIMA Pipeline Stages list, click Document Language and specify a method for identifying the language of each document. If all documents are in the same language, you can manually specify that language.
Tip
If you accept the default option to automatically determine the document language, edit the Acceptable Languages list to specify the languages for which you expect to have documents. Specifying the list of possible languages helps to ensure that Content Analytics Studio identifies the correct language for each document.
(b) Click Lexical Analysis and specify a list of resources such as lexical dictionaries, character rules dictionaries, and custom dictionaries for each language in which you expect to have documents. You can also specify which break rules to use for splitting a document into paragraphs, sentences, and tokens.
(c) If your pipeline includes a parsing rules stage, click Parsing Rules and specify a list of parsing rule files for each language in which you expect to have documents.
Tip
If you specify multiple parsing rule files, the order in which you list the files affects the order in which the rules are processed. That is, rules in the first file are processed first, followed by the rules in the second file. If the rules in a file depend on annotations that are created by rules in a different file, ensure that the files are listed in the correct order.
(d) Optional: Add and configure additional pipeline stages. For example, you can add a PEAR stage to include annotators that are packaged as a PEAR file. You can also add a semantic analysis stage to find connections between annotations that are identified in the document. You can add a condition or switch stage to run an annotator stage only under certain conditions, such as running different lexical analysis stages with particular sets of dictionaries depending on the source of the document.
(e) Click Clean Up and select the annotation types that are not to be included in the final output.
Tip
If you want to remove some intermediary types from the final output but still view these types in the Content Analytics Studio annotation editor, select the Show removed types in the annotation editor check box. For example, you might want to view these intermediary types in the annotation editor so that you can use these types as inputs to a parsing rule.
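The ordering advice for parsing rule files in the steps above amounts to a dependency sort: a file that uses an annotation type must be listed after the file that creates it. The following Python sketch shows one way to work out a valid listing order on paper. It is an illustration under stated assumptions, not a Content Analytics Studio feature: the rule file names, the annotation types, and the suggest_order function are hypothetical. It uses graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical rule files: name -> (types the rules need, types the rules create).
rule_files = {
    "relations": ({"Person", "Company"}, {"Employment"}),
    "people": ({"Token"}, {"Person"}),
    "companies": ({"Token"}, {"Company"}),
}


def suggest_order(files):
    """Return a listing order in which each file follows the files it depends on."""
    # Map each annotation type to the files whose rules create it.
    creators = {}
    for name, (_, creates) in files.items():
        for type_name in creates:
            creators.setdefault(type_name, set()).add(name)

    # A file depends on every other file that creates a type it needs.
    # Types that no file creates (here "Token") are assumed to come from
    # an earlier stage and add no dependency.
    graph = {}
    for name, (requires, _) in files.items():
        deps = set()
        for type_name in requires:
            deps |= creators.get(type_name, set())
        deps.discard(name)
        graph[name] = deps

    # static_order() raises graphlib.CycleError if files depend on each other.
    return list(TopologicalSorter(graph).static_order())


order = suggest_order(rule_files)
print(order)
```

Any order that the function returns lists people and companies before relations. If two files depend on each other's annotations, no valid order exists and the rules would need to be reorganized.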