Storing Documents in Multiple Languages in LOB Data Types

本文介绍了OracleText如何支持在CLOB、NCLOB和BLOB数据类型中存储多语言文档,并进行内容搜索。CLOB依赖数据库字符集,NCLOB独立于字符集但不支持内容搜索,BLOB则不转换数据,适合存储二进制文档。创建多语种索引时,需建立特定语言的词法器并指定语言列。对于CLOB和BLOB类型的文档,可以创建相应的索引来实现内容搜索。
摘要由CSDN通过智能技术生成

You can store documents in multiple languages in CLOBNCLOB, or BLOB data types and set up Oracle Text to enable content search for the documents.

Data in CLOB columns is stored in the AL16UTF16 character set when the database character set is multibyte, such as UTF8 or AL32UTF8. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document in a CLOB column requires less storage space than the same document in a LONG column using AL32UTF8, typically around 30% less, depending on the contents of the document.

Documents in NCLOB format are also stored in the AL16UTF16 character set regardless of the database character set or national character set. The storage space requirement is the same as for CLOB data. Document contents are converted to UTF-16 when they are inserted into a NCLOB column. If you want to store multilingual documents in a non-Unicode database, then choose NCLOB. However, content search on NCLOB with Oracle Text is not supported.

Documents in BLOB format are stored as they are. No data conversion occurs during insertion and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) cannot be applied to the BLOB data type.

The following table lists the advantages and disadvantages of the CLOBNCLOB, and BLOB data types when storing documents:

Table 6-3 Comparison of LOB Data Types for Document Storage

Data Types

Advantages

Disadvantages

CLOB

Content search support with Oracle Text

String manipulation support

Depends on database character set

Data conversion is necessary for insertion

Cannot store binary documents

NCLOB

Independent of database character set

String manipulation support

No content search support

Data conversion is necessary for insertion

Cannot store binary documents

BLOB

Independent of database character set

Content search support

No data conversion, data stored as is

Can store binary documents such as Microsoft Word or Microsoft Excel

No string manipulation support

Creating Indexes for Searching Multilingual Document Contents

Oracle Text enables you to build indexes for content search on multilingual documents stored in CLOB format and BLOB format. It uses a language-specific lexer to parse the CLOB or BLOB data and produces a list of searchable keywords.

Create a multilexer to search multilingual documents. The multilexer chooses a language-specific lexer for each row, based on a language column. This section describes the high level steps to create indexes for documents in multiple languages. It contains the following topics:

  1. Creating Multilexers

The first step in creating the multilexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:

ctx_ddl.create_preference('english_lexer', 'basic_lexer');

ctx_ddl.set_attribute('english_lexer','index_themes','yes');

ctx_ddl.create_preference('german_lexer', 'basic_lexer');

ctx_ddl.set_attribute('german_lexer','composite','german');

ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.set_attribute('german_lexer','mixed_case','yes');

ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');

After the language-specific lexer preferences are created, they need to be gathered together under a single multilexer preference. First, create the multilexer preference, using the MULTI_LEXER object:

ctx_ddl.create_preference('global_lexer','multi_lexer');

Now add the language-specific lexers to the multilexer preference using the add_sub_lexer call:

ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');

ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');

ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');

This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.

  1. Creating Indexes for Documents Stored in the CLOB Data Type

The multilexer decides which lexer to use for each row based on a language column in the table. This is a character column that stores the language of the document in a text column. Use the Oracle language name to identify the language of a document in this column. For example, if you use the CLOB data type to store your documents, then add the language column to the table where the documents are stored:

CREATE TABLE globaldoc

  (doc_id    NUMBER       PRIMARY KEY,

  language  VARCHAR2(30),

  text      CLOB);

To create an index for this table, use the multilexer preference and specify the name of the language column:

CREATE INDEX globalx ON globaldoc(text)

  indextype IS ctxsys.context

  parameters ('lexer

               global_lexer

               language

               column

               language');

  1. Creating Indexes for Documents Stored in the BLOB Data Type

In addition to the language column, the character set and format columns must be added in the table where the documents are stored. The character set column stores the character set of the documents using the Oracle character set names. The format column specifies whether a document is a text or binary document. For example, the CREATE TABLE statement can specify columns called characterset and format:

CREATE TABLE  globaldoc (

   doc_id       NUMBER       PRIMARY KEY,

   language     VARCHAR2(30),

   characterset VARCHAR2(30),

   format       VARCHAR2(10),

   text         BLOB

  );

You can put word-processing or spreadsheet documents into the table and specify binary in the format column. For documents in HTML, XML and text format, you can put them into the table and specify text in the format column.

Because there is a column in which to specify the character set, you can store text documents in different character sets.

When you create the index, specify the names of the format and character set columns:

CREATE INDEX globalx ON globaldoc(text)

  indextype is ctxsys.context

  parameters ('filter inso_filter

                lexer global_lexer

                language column language

                format  column format

                charset column characterset');

You can use the charset_filter if all documents are in text format. The charset_filter converts data from the character set specified in the charset column to the database character set.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值