You can store documents in multiple languages in CLOB, NCLOB, or BLOB data types and set up Oracle Text to enable content search for the documents.
Data in CLOB columns is stored in the AL16UTF16 character set when the database character set is multibyte, such as UTF8 or AL32UTF8. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document in a CLOB column requires less storage space than the same document in a LONG column using AL32UTF8, typically around 30% less, depending on the contents of the document.
Documents in NCLOB format are also stored in the AL16UTF16 character set regardless of the database character set or national character set. The storage space requirement is the same as for CLOB data. Document contents are converted to UTF-16 when they are inserted into an NCLOB column. If you want to store multilingual documents in a non-Unicode database, then choose NCLOB. However, content search on NCLOB with Oracle Text is not supported.
Documents in BLOB format are stored as they are. No data conversion occurs during insertion and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation operations (such as the NLSSORT function and ORDER BY) cannot be applied to the BLOB data type.
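For example, the string functions work directly on CLOB (and NCLOB) columns, whereas BLOB contents must go through the DBMS_LOB package. A minimal sketch, assuming a hypothetical doc_table with clob_doc and blob_doc columns:

-- LENGTH and SUBSTR operate on a CLOB, counting characters.
SELECT LENGTH(clob_doc), SUBSTR(clob_doc, 1, 20)
  FROM doc_table;

-- The same functions cannot be applied to a BLOB; DBMS_LOB works
-- instead, but it counts bytes, not characters, and SUBSTR returns RAW.
SELECT DBMS_LOB.GETLENGTH(blob_doc), DBMS_LOB.SUBSTR(blob_doc, 20, 1)
  FROM doc_table;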
The following table lists the advantages and disadvantages of the CLOB, NCLOB, and BLOB data types when storing documents:
Table 6-3 Comparison of LOB Data Types for Document Storage
| Data Type | Advantages | Disadvantages |
|---|---|---|
| CLOB | Content search support with Oracle Text; string manipulation support | Depends on the database character set; data conversion is necessary for insertion; cannot store binary documents |
| NCLOB | Independent of the database character set; string manipulation support | No content search support; data conversion is necessary for insertion; cannot store binary documents |
| BLOB | Independent of the database character set; content search support; no data conversion, data stored as is; can store binary documents such as Microsoft Word or Microsoft Excel | No string manipulation support |
Creating Indexes for Searching Multilingual Document Contents
Oracle Text enables you to build indexes for content search on multilingual documents stored in CLOB format and BLOB format. It uses a language-specific lexer to parse the CLOB or BLOB data and produces a list of searchable keywords.
Create a multilexer to search multilingual documents. The multilexer chooses a language-specific lexer for each row, based on a language column. This section describes the high-level steps to create indexes for documents in multiple languages. It contains the following topics:
- Creating Multilexers
The first step in creating the multilexer is to create a language-specific lexer preference for each supported language. The following example creates English, German, and Japanese lexers using the CTX_DDL PL/SQL package:
BEGIN
  ctx_ddl.create_preference('english_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('english_lexer', 'index_themes', 'yes');

  ctx_ddl.create_preference('german_lexer', 'basic_lexer');
  ctx_ddl.set_attribute('german_lexer', 'composite', 'german');
  ctx_ddl.set_attribute('german_lexer', 'alternate_spelling', 'german');
  ctx_ddl.set_attribute('german_lexer', 'mixed_case', 'yes');

  ctx_ddl.create_preference('japanese_lexer', 'japanese_vgram_lexer');
END;
/
After the language-specific lexer preferences are created, they need to be gathered together under a single multilexer preference. First, create the multilexer preference, using the MULTI_LEXER object:
BEGIN
  ctx_ddl.create_preference('global_lexer', 'multi_lexer');
END;
/
Now add the language-specific lexers to the multilexer preference using the add_sub_lexer call:
BEGIN
  ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'default', 'english_lexer');
END;
/
This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.
- Creating Indexes for Documents Stored in the CLOB Data Type
The multilexer decides which lexer to use for each row based on a language column in the table. The language column is a character column that stores the language of the document; use the Oracle language name (for example, GERMAN or JAPANESE) as its value. If you use the CLOB data type to store your documents, then add the language column to the table where the documents are stored:
CREATE TABLE globaldoc
(doc_id NUMBER PRIMARY KEY,
language VARCHAR2(30),
text CLOB);
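For example, rows might record the language with Oracle language names that match the sub-lexer languages defined earlier (the sample documents are illustrative; a string literal is implicitly converted to a CLOB on insert):

INSERT INTO globaldoc VALUES (1, 'german',   'Ein Beispieldokument auf Deutsch.');
INSERT INTO globaldoc VALUES (2, 'japanese', 'これは日本語の文書です。');
INSERT INTO globaldoc VALUES (3, 'english',  'A sample document in English.');
COMMIT;

Because english is not registered as a sub-lexer, the third row is handled by the default lexer, which in this example is english_lexer.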
To create an index for this table, use the multilexer preference and specify the name of the language column:
CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('lexer global_lexer language column language');
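After the index is built, content search works across all languages through the standard Oracle Text CONTAINS operator; the search term below is illustrative:

SELECT doc_id
  FROM globaldoc
 WHERE CONTAINS(text, 'Beispieldokument') > 0;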
- Creating Indexes for Documents Stored in the BLOB Data Type
In addition to the language column, character set and format columns must be added to the table where the documents are stored. The character set column stores the character set of each document, using the Oracle character set name. The format column specifies whether a document is a text or binary document. For example, the CREATE TABLE statement can specify columns called characterset and format:
CREATE TABLE globaldoc (
doc_id NUMBER PRIMARY KEY,
language VARCHAR2(30),
characterset VARCHAR2(30),
format VARCHAR2(10),
text BLOB
);
You can put word-processing or spreadsheet documents into the table and specify binary in the format column. For documents in HTML, XML, or plain text format, put them into the table and specify text in the format column.
Because there is a column in which to specify the character set, you can store text documents in different character sets.
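As an illustrative sketch of loading this table (assuming AL32UTF8 as the database character set; production loads typically use DBMS_LOB.LOADBLOBFROMFILE or SQL*Loader):

-- A plain-text document: UTL_RAW.CAST_TO_RAW produces bytes in the
-- database character set, so that character set is recorded here.
INSERT INTO globaldoc (doc_id, language, characterset, format, text)
VALUES (1, 'german', 'AL32UTF8', 'TEXT',
        TO_BLOB(UTL_RAW.CAST_TO_RAW('Ein Beispieldokument auf Deutsch.')));

-- A binary document row; the BLOB content itself would be loaded
-- separately, for example with DBMS_LOB.LOADBLOBFROMFILE.
INSERT INTO globaldoc (doc_id, language, characterset, format, text)
VALUES (2, 'english', 'US7ASCII', 'BINARY', EMPTY_BLOB());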
When you create the index, specify the names of the format and character set columns:
CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter inso_filter
               lexer global_lexer
               language column language
               format column format
               charset column characterset');
You can use the charset_filter if all documents are in text format. The charset_filter converts data from the character set specified in the charset column to the database character set.
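A minimal sketch of such an index, assuming a filter preference created from the CHARSET_FILTER type (the preference name cs_filter and the UTF8 default source character set are illustrative):

BEGIN
  -- The charset attribute gives the default source character set;
  -- the charset column named in the index parameters overrides it
  -- for each row.
  ctx_ddl.create_preference('cs_filter', 'CHARSET_FILTER');
  ctx_ddl.set_attribute('cs_filter', 'charset', 'UTF8');
END;
/

CREATE INDEX globalx ON globaldoc(text)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('filter cs_filter
               lexer global_lexer
               language column language
               charset column characterset');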