Western European Text into Oracle: An Introduction to Oracle Text

Latest recommended article published on 2022-05-19 10:40:06

温斯顿1984

Tags: Western European text into Oracle

Oracle Text: The Oracle Text Architecture

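I. When Oracle Text indexes documents, it follows these main logical steps:

(1) Scan all rows of the table and read the data in the column. Usually this is just the column data, but some datastores use the column data as a pointer to the document data. For example, URL_DATASTORE treats the column data as a URL.

(2) Take the document data and convert it to a text representation. This is needed when storing binary documents such as Word or Acrobat files. The filter output does not have to be plain text; it can be a text format such as XML or HTML.

(3) Take the filter output and convert it to plain text. Different text formats, including XML and HTML, have different sectioners. Converting to plain text involves detecting important section tags, removing invisible information, and reformatting the text.

(4) Take the plain text from the sectioner and split it into discrete tokens. There are lexers for whitespace-delimited languages as well as specialized lexers for Asian languages, whose segmentation is more complex.

(5) Take all the tokens from the lexer, the section offsets from the sectioner, and a list of low-information words called stopwords, and build an inverted index. The inverted index stores the tokens and the documents that contain them.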
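The stages above map fairly directly onto the keywords accepted in the PARAMETERS string of CREATE INDEX. The following is a minimal sketch, assuming a hypothetical table articles with a text column body; the table, column, and index names are illustrative only, while the CTXSYS.* preferences are the system-defined defaults. The final query peeks at the generated $I table, which is where the inverted index (tokens and their occurrence counts) ends up.

```sql
-- Hypothetical table; the names articles, body, and articles_idx are illustrative.
CREATE TABLE articles (
  id   NUMBER PRIMARY KEY,
  body VARCHAR2(4000)
);

INSERT INTO articles VALUES (1, 'The lexer splits plain text into tokens');
INSERT INTO articles VALUES (2, 'The filter converts documents to text');
COMMIT;

-- Each keyword names one pipeline stage: datastore, filter, sectioner, lexer, stoplist.
-- The CTXSYS.* preferences used here are system-defined.
CREATE INDEX articles_idx ON articles (body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE CTXSYS.DEFAULT_DATASTORE FILTER CTXSYS.NULL_FILTER SECTION GROUP CTXSYS.NULL_SECTION_GROUP LEXER CTXSYS.DEFAULT_LEXER STOPLIST CTXSYS.DEFAULT_STOPLIST');

-- The DR$<index>$I table holds the inverted index: tokens and how often they occur.
SELECT token_text, token_count
FROM   dr$articles_idx$i
ORDER  BY token_text;
```

Stopwords should not appear in the token list, which is a quick way to confirm that the stoplist stage took effect.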

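Many of the options for an index are grouped into functional groups called "classes". Each class covers one aspect of configuration and can be thought of as a question about the document database, for example: datastore, filter, lexer, wordlist, storage. Each class has a number of predefined behaviors, called objects. Each object is one possible answer to the class's question, and most objects have attributes. Attributes customize an object, which makes index configuration flexible enough to suit different applications.

(1) Storage class. The storage class specifies the tablespace and creation parameters for the database tables and indexes that make up an Oracle Text index. It has a single basic object, BASIC_STORAGE, whose attributes include I_Index_Clause, I_Table_Clause, K_Table_Clause, N_Table_Clause, P_Table_Clause, and R_Table_Clause.

(2) Datastore class. The datastore describes where the text for a column is stored, along with other information about it. By default, text is stored directly in the column, and each row of the table represents one complete document. Other locations include separate files or web pages identified by URL. The seven basic objects are Default_Datastore, Detail_Datastore, Direct_Datastore, File_Datastore, Multi_Column_Datastore, URL_Datastore, and User_Datastore.

(3) Section Group class. A section group is the object used to specify a set of document sections. Sections must be defined before the index can be used to query within a section with the WITHIN operator. Sections are defined as part of a section group. The seven basic objects are AUTO_SECTION_GROUP, BASIC_SECTION_GROUP, HTML_SECTION_GROUP, NEWS_SECTION_GROUP, NULL_SECTION_GROUP, XML_SECTION_GROUP, and PATH_SECTION_GROUP.

(4) Wordlist class. The wordlist identifies the language used for the index's stemming and fuzzy-matching query options. It has a single basic object, BASIC_WORDLIST, whose attributes are Fuzzy_Match, Fuzzy_Numresults, Fuzzy_Score, Stemmer, Substring_Index, Wildcard_Maxterms, Prefix_Index, Prefix_Max_Length, and Prefix_Min_Length.

(5) Index Set. An index set is a collection of one or more Oracle indexes (not Oracle Text indexes) used to create an Oracle Text index of type CTXCAT. It has a single basic object, BASIC_INDEX_SET.

(6) Lexer class. The lexer class identifies the language of the text and determines how tokens are identified in it. The default lexer is for English and other Western European languages; it identifies tokens using spaces, standard punctuation, and non-alphanumeric characters, and it is case-insensitive. The eight basic objects are BASIC_LEXER, CHINESE_LEXER, CHINESE_VGRAM_LEXER, JAPANESE_LEXER, JAPANESE_VGRAM_LEXER, KOREAN_LEXER, KOREAN_MORPH_LEXER, and MULTI_LEXER.

(7) Filter class. The filter determines how text is filtered for indexing. Filters can be used to index word-processor documents, formatted documents, plain text, and HTML documents. The five basic objects are CHARSET_FILTER, INSO_FILTER, NULL_FILTER, PROCEDURE_FILTER, and USER_FILTER.

(8) Stoplist class. The stoplist class specifies a set of words that are not indexed (called stopwords). It has two basic objects: BASIC_STOPLIST (all stopwords in a single language) and MULTI_STOPLIST (a multilingual stoplist containing stopwords from several languages).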
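To make the class / object / attribute relationship concrete, here is a small sketch using the CTX_DDL package. Every preference name (my_lexer, my_wordlist, my_storage, my_stoplist, my_sections) is made up for illustration, and the USERS tablespace is an assumption about the target database.

```sql
BEGIN
  -- Lexer class, BASIC_LEXER object, customized through the mixed_case attribute.
  ctx_ddl.create_preference('my_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('my_lexer', 'mixed_case', 'NO');

  -- Wordlist class, BASIC_WORDLIST object, with stemming, fuzzy matching, and a prefix index.
  ctx_ddl.create_preference('my_wordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('my_wordlist', 'stemmer', 'ENGLISH');
  ctx_ddl.set_attribute('my_wordlist', 'fuzzy_match', 'ENGLISH');
  ctx_ddl.set_attribute('my_wordlist', 'prefix_index', 'YES');
  ctx_ddl.set_attribute('my_wordlist', 'prefix_min_length', '3');
  ctx_ddl.set_attribute('my_wordlist', 'prefix_max_length', '5');

  -- Storage class, BASIC_STORAGE object; the USERS tablespace is an assumption.
  ctx_ddl.create_preference('my_storage', 'BASIC_STORAGE');
  ctx_ddl.set_attribute('my_storage', 'I_TABLE_CLAUSE', 'tablespace users');

  -- Stoplists and section groups have their own creation procedures.
  ctx_ddl.create_stoplist('my_stoplist', 'BASIC_STOPLIST');
  ctx_ddl.add_stopword('my_stoplist', 'the');

  ctx_ddl.create_section_group('my_sections', 'HTML_SECTION_GROUP');
  ctx_ddl.add_zone_section('my_sections', 'title', 'TITLE');
END;
/
```

Preferences created this way are then referenced by name in the PARAMETERS string of CREATE INDEX, for example 'LEXER my_lexer WORDLIST my_wordlist'.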

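II. The complete procedure for building a full-text index with Oracle Text can be summarized as follows:

(1) Create the table and load the text (the table includes the text column to be searched)

(2) Configure the index

(3) Create the index

(4) Issue queries

(5) Maintain the index: synchronization and optimization (covered later)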
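As a rough end-to-end illustration of these five steps, the sketch below uses a hypothetical docs table; the table, column, preference, and index names are all assumptions chosen for the example.

```sql
-- (1) Create the table and load the text.
CREATE TABLE docs (
  id   NUMBER PRIMARY KEY,
  text VARCHAR2(4000)
);

INSERT INTO docs VALUES (1, 'Oracle Text builds an inverted index');
INSERT INTO docs VALUES (2, 'A lexer splits text into tokens');
COMMIT;

-- (2) Configure the index: here only a lexer preference is defined.
BEGIN
  ctx_ddl.create_preference('docs_lexer', 'BASIC_LEXER');
END;
/

-- (3) Create the index.
CREATE INDEX docs_idx ON docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER docs_lexer');

-- (4) Issue a query; SCORE(1) pairs with the label 1 passed to CONTAINS.
SELECT id, SCORE(1) AS relevance
FROM   docs
WHERE  CONTAINS(text, 'index', 1) > 0
ORDER  BY SCORE(1) DESC;

-- (5) Maintain the index: synchronize pending DML, then optimize.
BEGIN
  ctx_ddl.sync_index('docs_idx');
  ctx_ddl.optimize_index('docs_idx', 'FULL');
END;
/
```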

III. Index Types

(1) CONTEXT

A CONTEXT index is the basic type of Oracle Text index. This is an index on a text column. A CONTEXT index is useful when your source text consists of many large, coherent documents. Query this index with the CONTAINS operator in the WHERE clause of a SELECT statement. This index requires manual synchronization after DML. See Syntax for CONTEXT Index Type.
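The manual synchronization requirement can be observed directly. The sketch below assumes a table docs(id, text) that already has a CONTEXT index named docs_idx; both names are hypothetical. Rows changed by DML are queued and show up in the CTX_USER_PENDING view until the index is synchronized.

```sql
-- Assumes a table docs(id, text) with a CONTEXT index named docs_idx (hypothetical names).
INSERT INTO docs VALUES (3, 'A newly inserted document about synchronization');
COMMIT;

-- Rows waiting to be indexed are listed in the pending queue.
SELECT pnd_index_name, pnd_rowid, pnd_timestamp
FROM   ctx_user_pending;

-- Synchronize the index so the new row becomes searchable.
BEGIN
  ctx_ddl.sync_index('docs_idx');
END;
/

SELECT id
FROM   docs
WHERE  CONTAINS(text, 'synchronization') > 0;
```

Depending on the database release, an index can also be created with PARAMETERS ('SYNC (ON COMMIT)') so that synchronization happens at commit time; whether that trade-off suits a given workload is a separate question.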

(2) CTXCAT

The CTXCAT type of index is a combined index on a text column and one or more other columns. CTXCAT is typically used to index small documents or text fragments, such as item names, prices and descriptions found in catalogs. Query this index with the CATSEARCH operator in the WHERE clause of a SELECT statement. This type of index is optimized for mixed queries. This index is transactional, automatically updating itself with DML to the base table. See Syntax for CTXCAT Index Type.
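A minimal sketch of a CTXCAT setup follows, assuming a hypothetical auction table. The index set tells Oracle Text which structured columns to combine with the text column, and CATSEARCH takes the text query and the structured condition as separate arguments.

```sql
-- Hypothetical catalog table; all names are illustrative.
CREATE TABLE auction (
  item_id NUMBER PRIMARY KEY,
  title   VARCHAR2(100),
  price   NUMBER
);

-- The index set lists the structured columns to index alongside the text.
BEGIN
  ctx_ddl.create_index_set('auction_iset');
  ctx_ddl.add_index('auction_iset', 'price');
END;
/

CREATE INDEX auction_titlex ON auction (title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('index set auction_iset');

-- Mixed query: text condition on title, structured condition and ordering on price.
SELECT item_id, title, price
FROM   auction
WHERE  CATSEARCH(title, 'camera', 'price < 200 order by price') > 0;
```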

(3) CTXRULE

A CTXRULE index is used to build a document classification application. The CTXRULE index is an index created on a table of queries or a column containing a set of queries, where the queries serve as rules to define the classification criteria. Query this index with the MATCHES operator in the WHERE clause of a SELECT statement.
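As an illustration of rule-based classification, the sketch below stores two rules in a hypothetical rules table and classifies a single piece of text with MATCHES. The table, column names, and rule queries are invented for the example.

```sql
-- Hypothetical rule table: each row is a query that defines one category.
CREATE TABLE rules (
  rule_id    NUMBER PRIMARY KEY,
  category   VARCHAR2(30),
  rule_query VARCHAR2(2000)
);

INSERT INTO rules VALUES (1, 'databases', 'oracle OR index');
INSERT INTO rules VALUES (2, 'sports', 'football OR tennis');
COMMIT;

CREATE INDEX rules_idx ON rules (rule_query)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify a document: return every category whose rule matches the text.
SELECT category
FROM   rules
WHERE  MATCHES(rule_query, 'Oracle Text builds an inverted index') > 0;
```

Note the reversal compared with CONTEXT: the indexed column holds queries, and the document is supplied at query time.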

(4) CTXXPATH

Create this index when you need to speed up existsNode() queries on an XMLType column. See Syntax for CTXXPATH Index Type.
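A short sketch of how such an index might be declared and used, assuming a hypothetical purchase_orders table with an XMLType column; the XPath expression is only an example and depends on the actual document structure.

```sql
-- Hypothetical table with an XMLType column.
CREATE TABLE purchase_orders (
  id     NUMBER PRIMARY KEY,
  po_doc XMLTYPE
);

CREATE INDEX po_xpath_idx ON purchase_orders (po_doc)
  INDEXTYPE IS CTXSYS.CTXXPATH;

-- existsNode() returns 1 when the XPath matches at least one node.
SELECT id
FROM   purchase_orders
WHERE  existsNode(po_doc, '/purchaseOrder/items/item') = 1;
```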

IV. Lexer Types

Lexer Types

Type: Description

AUTO_LEXER: Lexer for indexing columns that contain documents of different languages.

BASIC_LEXER: Lexer for extracting tokens from text in languages, such as English and most western European languages, that use white space delimited words.

MULTI_LEXER: Lexer for indexing tables containing documents of different languages such as English, German, and Japanese.

CHINESE_VGRAM_LEXER: Lexer for extracting tokens from Chinese text.

CHINESE_LEXER: Lexer for extracting tokens from Chinese text. This lexer offers benefits over the CHINESE_VGRAM lexer:

· Generates a smaller index

· Better query response time

· Generates real world tokens resulting in better query precision

· Supports stop words

JAPANESE_VGRAM_LEXER: Lexer for extracting tokens from Japanese text.

JAPANESE_LEXER: Lexer for extracting tokens from Japanese text. This lexer offers the following advantages over the JAPANESE_VGRAM lexer:

· Generates a smaller index

· Better query response time

· Generates real world tokens resulting in better precision

KOREAN_MORPH_LEXER: Lexer for extracting tokens from Korean text.

USER_LEXER: Lexer you create to index a particular language.

WORLD_LEXER: Lexer for indexing tables containing documents of different languages; autodetects languages in a document.
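For Western European text, the lexer for whitespace-delimited languages is usually the relevant choice. The sketch below is one possible configuration, assuming a hypothetical we_docs table; it enables the base_letter attribute of BASIC_LEXER so that accented characters are folded to their base letters, which lets a query for "resume" also find "résumé". The preference, table, and index names are illustrative.

```sql
-- Hypothetical table holding Western European text.
CREATE TABLE we_docs (
  id   NUMBER PRIMARY KEY,
  text VARCHAR2(4000)
);

BEGIN
  ctx_ddl.create_preference('we_lexer', 'BASIC_LEXER');
  -- Fold characters with diacritics to their base letter at index and query time.
  ctx_ddl.set_attribute('we_lexer', 'base_letter', 'YES');
END;
/

CREATE INDEX we_docs_idx ON we_docs (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER we_lexer');
```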

Use the WORLD_LEXER to index text columns that contain documents of different languages. For example, use this lexer to index a text column that stores English, Japanese, and German documents.

WORLD_LEXER differs from MULTI_LEXER in that WORLD_LEXER automatically detects the language(s) of a document. Unlike MULTI_LEXER, WORLD_LEXER does not require you to have a language column in your base table nor to specify the language column when you create the index. Moreover, it is not necessary to use sub-lexers, as with MULTI_LEXER.

WORLD_LEXER supports all database character sets, and for languages whose character sets are Unicode-based, it supports the Unicode 5.0 standard. For a list of languages that WORLD_LEXER
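For contrast, here is roughly what the MULTI_LEXER setup involves: a language column in the base table, one sub-lexer per language, and a LANGUAGE COLUMN clause at index creation. The table, column, and preference names below are hypothetical.

```sql
-- The base table needs a column that identifies each document's language.
CREATE TABLE globaldoc (
  doc_id NUMBER PRIMARY KEY,
  lang   VARCHAR2(3),
  text   CLOB
);

BEGIN
  -- One lexer preference per language.
  ctx_ddl.create_preference('english_lexer', 'BASIC_LEXER');

  ctx_ddl.create_preference('german_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('german_lexer', 'composite', 'GERMAN');
  ctx_ddl.set_attribute('german_lexer', 'mixed_case', 'YES');
  ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'GERMAN');

  ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');

  -- The multi-lexer routes each row to a sub-lexer based on the language column.
  ctx_ddl.create_preference('global_lexer', 'MULTI_LEXER');
  ctx_ddl.add_sub_lexer('global_lexer', 'default', 'english_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer', 'ger');
  ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer', 'jpn');
END;
/

CREATE INDEX globalx ON globaldoc (text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER global_lexer LANGUAGE COLUMN lang');
```

With WORLD_LEXER none of the sub-lexer or language-column plumbing is needed, at the cost of per-language tuning such as the German composite and alternate-spelling settings shown here.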

WORLD_LEXER Attribute

The WORLD_LEXER has the following attribute:

Attribute: Attribute Value

mixed_case: Enable mixed-case (upper- and lower-case) searches of text (for example, cat and Cat). Allowable values are YES and NO (default).
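Setting this attribute follows the same pattern as for any other preference. A minimal sketch, with a made-up preference name:

```sql
BEGIN
  ctx_ddl.create_preference('my_world_lexer', 'WORLD_LEXER');
  -- With mixed_case set to YES, tokens keep their case, so "cat" and "Cat" are distinct.
  ctx_ddl.set_attribute('my_world_lexer', 'mixed_case', 'YES');
END;
/
```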

WORLD_LEXER Example

Here is an example of creating an index using WORLD_LEXER.

exec ctx_ddl.create_preference('MYLEXER', 'world_lexer');

create index doc_idx on doc(data)
  indextype is CONTEXT
  parameters ('lexer MYLEXER');
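Once an index such as doc_idx exists, it is worth checking that indexing completed cleanly and that the column is actually searchable. The sketch below assumes the doc table and doc_idx index from the example; the search term is arbitrary.

```sql
-- Per-document indexing errors, if any, are recorded in this view.
SELECT err_timestamp, err_textkey, err_text
FROM   ctx_user_index_errors
WHERE  err_index_name = 'DOC_IDX'
ORDER  BY err_timestamp DESC;

-- A simple sanity query against the indexed column.
SELECT COUNT(*)
FROM   doc
WHERE  CONTAINS(data, 'oracle') > 0;
```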