Filtering Documents

- 11 -
Filtering Documents


In this chapter you'll learn how Index Server uses the process known as filtering to identify the information that is included in the indexes against which queries are executed. You'll also learn how Index Server uses word-breaker DLLs to split text into words and then strips out commonly used words known as noise words. Finally, you'll learn about NULL filters, the default filter, and the pre-installed filters supported by Index Server.

What is Document Filtering?


Document filtering is the process by which Index Server parses (filters) documents and inserts specific data from the documents into content indexes. Special filters known as content filters are used to extract text without formatting and break the remaining content into key words. From these words, language-specific common words (such as a, and, the, you, and so on) known as noise words are removed.



Noise words are removed because they appear so often in the text of a given language that indexing them adds little value. If you are familiar with other indexing engines commonly used in the Internet community, noise words are essentially the same as stop words. Wide Area Information Servers (WAIS) search engines, for example, strip stop words from the text before indexing a document. Index Server performs the same task but calls these words noise words.

After the noise words have been removed, the remaining content is normalized and the result is stored in an index.
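The steps described above (break the text into words, drop noise words, normalize what remains) can be sketched in a few lines of Python. This is only an illustration, not Index Server's implementation: the regular expression stands in for a real word-breaker DLL, the noise list is a tiny hypothetical sample, and normalization is assumed to mean lowercasing.

```python
import re

# Hypothetical sample; the real noise lists ship with Index Server.
NOISE_WORDS = {"a", "and", "the", "you"}

def break_words(text):
    """Stand-in for a word-breaker: split text into alphanumeric runs."""
    return re.findall(r"[A-Za-z0-9]+", text)

def index_terms(text, noise_words=NOISE_WORDS):
    """Return the words that would be stored: noise removed, then normalized."""
    terms = []
    for word in break_words(text):
        # Compare case-insensitively so "The" and "the" are both dropped.
        if word.lower() in noise_words:
            continue
        terms.append(word.lower())  # assumed normalization: lowercase
    return terms

print(index_terms("The filter extracts text and properties"))
# ['filter', 'extracts', 'text', 'properties']
```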

How Index Server Filters Documents


Index Server uses a child process known as the CiDaemon to manage the filtering of documents. The CiDaemon process filters documents by using a filter DLL (dynamic link library) to extract the text and properties from a document. Those familiar with the Microsoft Office suite of applications know that all documents contain a set of general properties such as a title, subject, author, keywords, and comments. These are just a few of the document properties that Index Server will index.

Next, a word-breaker DLL is used to parse the text and textual properties into words. Noise words are then removed from the parsed words and the remaining words are stored in the index.



By default, Index Server does not filter directory entries. However, if you set the registry setting FilterDirectories to a value of 1 (0x1), Index Server filters the system properties of directories. The full path to the registry setting for FilterDirectories is
/HKEY_LOCAL_MACHINE

  /SYSTEM

    /CurrentControlSet

      /Control

        /contentindex

          /FilterDirectories
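If you prefer to check such a setting from a script rather than with a registry editor, the following Python sketch shows one way to do it. It assumes that FilterDirectories is stored as a value under the contentindex key, and it uses the standard winreg module, which exists only on Windows. The helper name read_contentindex_setting is hypothetical; on other platforms, or when the value is missing, it simply returns the default you pass in.

```python
try:
    import winreg  # standard library, available only on Windows
except ImportError:
    winreg = None

# Key path taken from the chapter; backslashes are the native separator.
CONTENTINDEX_KEY = r"SYSTEM\CurrentControlSet\Control\contentindex"

def read_contentindex_setting(name, default):
    """Return the named value under the contentindex key, or default."""
    if winreg is None:
        return default
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CONTENTINDEX_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, name)
            return value
    except OSError:
        # Key or value does not exist, or access was denied.
        return default

# Assumption: a missing value behaves like 0 (directories not filtered).
filter_directories = read_contentindex_setting("FilterDirectories", 0) == 1
print("Directories are filtered:", filter_directories)
```

The same helper can be used to read the other contentindex settings mentioned in this chapter.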

One of the major advantages Index Server has over other indexing engines is its capability to filter documents in the background. This background processing allows normal querying to continue without interruption. If the CiDaemon has a file open for reading and another process attempts to open it for writing, the CiDaemon immediately closes the file and retries filtering that file as soon as possible.



Index Server maintains a registry setting known as FilterRetries, which identifies the maximum number of retries for a document that fails to be filtered. If a document reaches this maximum, it is added to the unfiltered documents list. The default number of retries for a document is 4. You can change this value by editing the Windows NT registry entry for FilterRetries. The full path to this registry setting is

/HKEY_LOCAL_MACHINE

  /SYSTEM

    /CurrentControlSet

      /Control

        /contentindex

          /FilterRetries
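The retry rule just described can be modeled as a small loop. The sketch below is an illustration only: filter_with_retries and its parameters are hypothetical names, try_filter is a callable you supply, and the code assumes that FilterRetries counts retries made after the first failed attempt.

```python
def filter_with_retries(path, try_filter, unfiltered, max_retries=4):
    """Attempt to filter path, retrying up to max_retries times.

    try_filter is a caller-supplied callable that returns True on success.
    On final failure the path is appended to the unfiltered list.
    Returns the number of attempts made.
    """
    attempts = 0
    while attempts <= max_retries:  # one initial attempt plus the retries
        attempts += 1
        if try_filter(path):
            return attempts
    unfiltered.append(path)
    return attempts

# Example: a document that can never be filtered ends up on the list.
unfiltered_documents = []
filter_with_retries("locked.doc", lambda p: False, unfiltered_documents)
print(unfiltered_documents)  # ['locked.doc']
```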


A listing of unfiltered documents can be determined by executing the query

@unfiltered = TRUE



The capability to detect the opening of a document for write purposes and the subsequent retry for filtering is not available on network shared drives.

The CiDaemon process is capable of building abstracts, sometimes called summaries, of the documents it indexes. The size of an abstract is limited by the registry setting MaxCharacterization. By default, the MaxCharacterization parameter is set to a value of 140 (bytes).



Deactivate the generation of document abstracts by setting the registry setting GenerateCharacterization to a value of zero (0x0).
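To make the effect of these two settings concrete, here is a minimal Python sketch of abstract generation. It is not how the CiDaemon builds abstracts; it assumes an abstract is simply the leading text of a document, and it treats the MaxCharacterization limit as a character count for simplicity. The function and parameter names are hypothetical.

```python
def build_abstract(text, max_characterization=140, generate=True):
    """Return a truncated summary, or None when abstracts are disabled.

    generate mirrors GenerateCharacterization (False corresponds to 0x0).
    max_characterization mirrors MaxCharacterization.
    """
    if not generate:
        return None
    # Collapse runs of whitespace so the abstract reads as one line.
    collapsed = " ".join(text.split())
    return collapsed[:max_characterization]

sample = "Index Server filters documents in the background. " * 10
print(len(build_abstract(sample)))            # 140
print(build_abstract(sample, generate=False))  # None
```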

In this section, you'll learn how each step of the filtering process works and how the CiDaemon process identifies which filters to use when filtering documents.

Using Filter DLLs

Filter DLLs are used by Index Server to interpret the formatting of documents so that they can be filtered and the extracted data (text and properties) can be inserted into the appropriate index. Microsoft publishes the IFilter ActiveX Software Developers Kit (SDK), which can be used by developers to build custom document filters.



RESOURCE

The IFilter ActiveX SDK is freely available for downloading from the Microsoft Index Server Web Site at http://www.microsoft.com/ntserver/search/.

All documents are filtered by an associated filter DLL. The only exceptions to this rule are documents that are password protected, which are not filtered.



If Index Server attempts to filter a corrupted document or uses a corrupted filter DLL, the document will not be filtered.


Identifying Filters Through File Types and Extensions

When determining the filter to use for a specific document, the CiDaemon process first looks at the file extension and file type. If the document matches any of the predefined extension specifications (specified in the NT registry), the CiDaemon determines whether a filter has been specified for the document. If no filter has been defined for the document, the default filter is used.

To determine the filter DLL to be used based on a file extension, check the Windows NT registry for the specified file type and extension.



Use the regedt32.exe program to view any of the registry entries. Be careful to not make mistakes when editing the registry. Errors in the registry can make your Index Server or Windows NT implementation fail, requiring a reload of the software. A complete Windows NT software reload can take hours to complete due to the extensive configuration required to reach the state that your operating system was in before the registry entries were edited.

Each of the known document extensions is listed in the Windows NT registry under the path

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

For example, a document with the extension .avi would be found under the full path

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /.avi

Figure 11.1 displays the document type for a file extension of .avi.

Figure 11.1. The file type registry entry for an .avi file extension.

Looking at the right-hand window pane of Figure 11.1, you see the file type (specified with the key <NO NAME>) for an .avi document is AVIFile. When the CiDaemon knows the class for a specific document, it checks the registry entries (as described in the next section) to determine what filter DLL to use for the document.

Determining the Filter DLL Based On Document Class

Checking the NT registry for the filter DLL that is used for a specified document can be somewhat cumbersome. Luckily for you, the CiDaemon can perform the filter DLL lookup effortlessly. However, should you manually need to determine the proper filter DLL for a document, you can trek through the Windows NT registry entries using the regedt32.exe program mentioned previously.

Let's step through an example of checking the filter DLL associated with Microsoft Word documents. In the previous section, you learned that filter DLLs can be determined based on the document extension and class. In the case of .avi files, the class is AVIFile. The class for a Word document is Word.Document.6. Class registry entries for Word documents can be found under the registry path

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /Word.Document.6

When you know the document class, you must determine the class identification (CLSID) number associated with the specified document class.

Under the registry entry for each class of document, you will find an entry for an identification number that is used to identify the specific class. This identification number is stored under the registry key CLSID. In the case of a Microsoft Word document, the CLSID can be found under the registry key

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /Word.Document.6

        /CLSID

Figure 11.2 displays the CLSID registry entry for a Microsoft Word document.

Figure 11.2. The CLSID registry entry for a Microsoft Word document.

Looking at the CLSID entry displayed in Figure 11.2, you see a value of 00020900-0000-0000-C000-000000000046 of type REG_SZ. This is the CLSID value you will use to determine the persistent handler for Microsoft Word documents.

To determine the persistent handler for a document, look under the registry key for the specified CLSID. For a Microsoft Word document, look under the registry key

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /CLSID

        /00020900-0000-0000-C000-000000000046

          /PersistentHandler

Figure 11.3 displays the PersistentHandler registry entry for a Microsoft Word document.

Figure 11.3. The PersistentHandler registry entry for a Microsoft Word document.

Looking at the PersistentHandler entry displayed in Figure 11.3, you see a value of 98de59a0-d175-11cd-a7bd-00006b827d94 of type REG_SZ. This value is the globally unique identifier (GUID) of the persistent handler used for Microsoft Word documents.

To determine the persistent handler GUID for a document, look under the registry key of the specified PersistentHandler value for the registry key PersistentAddinsRegistered. For a Microsoft Word document, look under the registry key

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /CLSID

        /98de59a0-d175-11cd-a7bd-00006b827d94

          /PersistentAddinsRegistered

Figure 11.4 displays the PersistentAddinsRegistered registry entry for a Microsoft Word document.

Figure 11.4. The PersistentAddinsRegistered registry entry for a Microsoft Word document.

Looking at the PersistentAddinsRegistered entry displayed in Figure 11.4, you see a value of 53524bdc-3e9c-101b-abe2-00608c86f49a of type REG_SZ. This value identifies the IFilter interface GUID that is used in the final step of identifying the filter DLL for a specified document class.

In the final step, look up the registry entry for the InprocServer32 key under the IFilter interface GUID determined in the previous step. For a Microsoft Word document, look under the registry key

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /CLSID

        /53524bdc-3e9c-101b-abe2-00608c86f49a

          /InprocServer32

Figure 11.5 displays the InprocServer32 registry entry for a Microsoft Word document.

Figure 11.5. The InprocServer32 registry entry for a Microsoft Word document.

Looking at the right-hand window pane of Figure 11.5, you see a value of sccifilt.dll of type REG_SZ. This registry entry identifies the filter DLL used when filtering Microsoft Word documents.
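The whole lookup chain can be summarized in code. The following Python sketch walks a small in-memory dictionary that stands in for the registry, so it runs anywhere and touches no real registry. It contains only the values quoted in this section, and it follows the simplified walk described above rather than the exact registry layout. The .doc extension mapping is an assumption added for illustration, and find_filter_dll is a hypothetical name.

```python
# Simplified model of HKEY_LOCAL_MACHINE/SOFTWARE/Classes.
CLASSES = {
    ".doc": "Word.Document.6",  # assumed extension-to-class association
    "Word.Document.6": {"CLSID": "00020900-0000-0000-C000-000000000046"},
    "CLSID": {
        "00020900-0000-0000-C000-000000000046": {
            "PersistentHandler": "98de59a0-d175-11cd-a7bd-00006b827d94",
        },
        "98de59a0-d175-11cd-a7bd-00006b827d94": {
            "PersistentAddinsRegistered": "53524bdc-3e9c-101b-abe2-00608c86f49a",
        },
        "53524bdc-3e9c-101b-abe2-00608c86f49a": {
            "InprocServer32": "sccifilt.dll",
        },
    },
}

def find_filter_dll(extension, classes=CLASSES):
    """Follow extension -> class -> CLSID -> handler -> IFilter GUID -> DLL."""
    document_class = classes[extension]
    clsid = classes[document_class]["CLSID"]
    handler = classes["CLSID"][clsid]["PersistentHandler"]
    ifilter_guid = classes["CLSID"][handler]["PersistentAddinsRegistered"]
    return classes["CLSID"][ifilter_guid]["InprocServer32"]

print(find_filter_dll(".doc"))  # sccifilt.dll
```

Each line of find_filter_dll corresponds to one of the registry lookups performed by hand in this section.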



You can add new entries to the Windows NT registry so that documents with a different extension use the same filter as an extension that is already defined. For example, to specify that documents with the extension .xyz should use the same filter as Microsoft Word documents, create a registry entry that associates the extension .xyz with the file type Word.Document.6, as shown in the following listing.
/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /.xyz

        = REG_SZ Word.Document.6


NULL Filters

The NULL filter does not extract the contents of a document. Instead, it is used to extract only the system properties of documents. The NULL filter is primarily used against binary files such as executable files, images, zip files, and so on. Table 11.1 lists the default extensions used to identify binary files.

 

Table 11.1. Default binary file extensions.

.aif .avi .cgm .com .dct .dic .dll .exe
.eyb .fnt .ghi .gif .hqx .ico .inv .jbf
.jpg .m14 .mov .movie .mv .pdf .pic .pma
.pmc .pml .pmr .psd .sc2 .tar .tif .tiff
.ttf .wav .wll .wlt .wmf .z .z96 .zip
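As a rough illustration of how the extensions in Table 11.1 select the NULL filter, the Python sketch below checks a filename against that list. Index Server itself resolves the filter through the registry rather than through a fixed list, so treat this as a model of the default behavior only; uses_null_filter is a hypothetical helper name.

```python
import os

# Default binary file extensions, copied from Table 11.1.
BINARY_EXTENSIONS = set("""
.aif .avi .cgm .com .dct .dic .dll .exe
.eyb .fnt .ghi .gif .hqx .ico .inv .jbf
.jpg .m14 .mov .movie .mv .pdf .pic .pma
.pmc .pml .pmr .psd .sc2 .tar .tif .tiff
.ttf .wav .wll .wlt .wmf .z .z96 .zip
""".split())

def uses_null_filter(filename):
    """Return True if only system properties would be extracted."""
    extension = os.path.splitext(filename)[1].lower()
    return extension in BINARY_EXTENSIONS

print(uses_null_filter("setup.EXE"))   # True
print(uses_null_filter("report.txt"))  # False
```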


For a list of document properties available for documents filtered with the NULL IFilter interface, see Figure 1.9 in Chapter 1, "Overview of Document Indexing, Queries, and Result Sets."

The easiest way to add a new file extension that uses the NULL filter is to add the file extension to the registry and set the type to BinaryFile. For example, to associate the extension .bin with the binary file type and have it use the NULL IFilter interface, simply add the following entry to the Windows NT registry:

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /.bin

        = REG_SZ BinaryFile


Changing the file type of an extension that already has an associated file type is not the recommended method for specifying the IFilter DLL to be used by Index Server. Doing so could alter the operation of Windows NT and damage your installation. Rather, the preferred method for specifying the IFilter DLL is to set the IFilter PersistentAddinsRegistered registry entry for the specified document extension to the NULL filter GUID {C3278E90-BEA7-11CD-B579-08002B30BFEB}. The previous section describes how to check the PersistentAddinsRegistered registry entry for a specified document type.


The Default Filter

The default filter is used by Index Server to filter plain text documents (.txt) and all other documents for which no file-extension association exists in the Windows NT registry. Because Index Server does not know the formatting of documents when using the default filter, only the system properties and the contents of a file are filtered.



Index Server uses the registry setting FilterFilesWithUnknownExtensions to determine whether the default filter should be activated against files of unknown extensions. The Windows NT registry path for this setting is
/HKEY_LOCAL_MACHINE

  /SYSTEM

    /CurrentControlSet

      /Control

        /contentindex

          /FilterFilesWithUnknownExtensions

Default filtering is activated if the FilterFilesWithUnknownExtensions registry setting is set to a value of 1 (0x1).
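The decision described in this section can be written as a short function. The Python sketch below is a simplified model, not Index Server's logic: the associations dictionary is a hypothetical stand-in for the registry lookups, the extensions in it are assumptions, and the function returns None to mean that the file is not filtered.

```python
DEFAULT_FILTER = "default filter"

def choose_filter(extension, associations, filter_unknown_extensions):
    """Pick a filter for a file extension.

    associations maps known extensions to filter DLL names.
    filter_unknown_extensions mirrors FilterFilesWithUnknownExtensions.
    """
    extension = extension.lower()
    if extension in associations:
        return associations[extension]
    if filter_unknown_extensions == 1:
        return DEFAULT_FILTER  # contents and system properties only
    return None                # unknown extension, default filtering off

# DLL names come from this chapter; the extension mapping is assumed.
known = {".doc": "sccifilt.dll", ".htm": "htmlfilt.dll"}
print(choose_filter(".DOC", known, 1))  # sccifilt.dll
print(choose_filter(".log", known, 1))  # default filter
print(choose_filter(".log", known, 0))  # None
```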


Removing Filter DLLs

Removing a filter DLL requires that the IFilter PersistentAddinsRegistered and InprocServer32 Windows NT registry entries for a specific document type be removed. To find the PersistentAddinsRegistered entry for a document type, refer to the section in this chapter titled "Determining the Filter DLL Based On Document Class."

For example, to remove the HTML filter DLL entries, remove the following two entries from the Windows NT registry.

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /CLSID

        {EEC97550-47A9-11CF-B952-00AA0051FE20}

        /PersistentAddinsRegistered

          {89BCB740-6119-101A-BCB7-00DD010655AF}

            REG_SZ {E0CA5340-4534-11CF-B952-00AA0051FE20}

/HKEY_LOCAL_MACHINE

  /SOFTWARE

    /Classes

      /CLSID

        {E0CA5340-4534-11CF-B952-00AA0051FE20}

          REG_SZ HTML Filter

        /InprocServer32

          REG_SZ htmlfilt.dll

Using Word-Breaker DLLs and Noise Words

After the CiDaemon filters a document using the specified filter DLL, the next step in indexing a document is to process the filtered data. Word-breaker DLLs are used to break up the text (as well as document properties that are textual in nature) returned by filters into words. Next, noise words are stripped from the data and everything that remains is normalized and inserted into the index.

Word-breaker DLLs and noise words are language specific because both word boundaries and commonly used (noise) words differ from one language to another. For example, in English the word you is very common and is therefore considered a noise word by Index Server. The word tu is used in Spanish and carries the same meaning as you does in English. So, the word you would not be a noise word for Spanish, but the word tu would be.
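The you/tu example can be expressed as a lookup keyed by language. In the Python sketch below, the noise lists are tiny hypothetical samples rather than the lists shipped with Index Server, and the language key Spanish is an assumed name (only English_US appears in this chapter).

```python
# Hypothetical per-language noise lists for illustration only.
NOISE_BY_LANGUAGE = {
    "English_US": {"you", "the", "and"},
    "Spanish": {"tu", "el", "y"},
}

def strip_noise(words, language):
    """Remove the noise words defined for the given language."""
    noise = NOISE_BY_LANGUAGE[language]
    return [word for word in words if word.lower() not in noise]

words = ["you", "tu", "index"]
print(strip_noise(words, "English_US"))  # ['tu', 'index']
print(strip_noise(words, "Spanish"))     # ['you', 'index']
```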

Noise words are supported by Index Server for the following languages.

  • Dutch

  • U.K. English

  • U.S. English

  • French

  • German

  • Italian

  • Japanese

  • Spanish

  • Swedish

The extensions .deu, .eng, .enu, .esn, .fra, .ita, .nld, and .sve are commonly used to identify language-specific files.



You can find a complete listing of the noise words for each language in your Index Server installation by checking the Windows NT registry for the name of the language-specific noise file, which contains every noise word for that language. For U.S. English, the full registry path that points to the noise file is
/HKEY_LOCAL_MACHINE

  /SYSTEM

    /CurrentControlSet

      /Control

        /contentindex

          /Language

            /English_US

              /NoiseFile

                /noise.enu


Pre-installed Filter DLLs


Index Server comes pre-installed with DLLs for filtering documents created by most of the commonly used Microsoft Office products as well as Internet, plain text, and binary files. The pre-installed filter DLLs include support for the following documents.

  • HTML

      Document Type: Internet Document (HTML)
      Class: htmlfile
      CLSID: 25336920-03F9-11CF-8FD0-00AA00686F13
      PersistentHandler: EEC97550-47A9-11CF-B952-00AA0051FE20
      PersistentAddinsRegistered: E0CA5340-4534-11CF-B952-00AA0051FE20
      InprocServer32 (Filter DLL): htmlfilt.dll

  • Microsoft Word

      Document Type: Microsoft Word Document
      Class: Word.Document.6, WordDocument
      CLSID: 00020900-0000-0000-C000-000000000046
      PersistentHandler: 98DE59A0-D175-11CD-A7BD-00006B827D94
      PersistentAddinsRegistered: 53524BDC-3E9C-101B-ABE2-00608C86F49A
      InprocServer32 (Filter DLL): sccifilt.dll

  • Microsoft Excel

      Document Type: Microsoft Excel Worksheet
      Class: Excel.Sheet.5
      CLSID: 00020810-0000-0000-C000-000000000046
      PersistentHandler: 98DE59A0-D175-11CD-A7BD-00006B827D94
      PersistentAddinsRegistered: 53524BDC-3E9C-101B-ABE2-00608C86F49A
      InprocServer32 (Filter DLL): sccifilt.dll

  • Microsoft PowerPoint

      Document Type: Microsoft PowerPoint Presentation
      Class: PowerPoint.Show.7
      CLSID: EA7BAE70-FB3B-11CD-A903-00AA00510EA3
      PersistentHandler: 98DE59A0-D175-11CD-A7BD-00006B827D94
      PersistentAddinsRegistered: 53524BDC-3E9C-101B-ABE2-00608C86F49A
      InprocServer32 (Filter DLL): sccifilt.dll

  • Plain text files

      Document Type: Text Document
      Class: txtfile
      CLSID: 89BCB7A4-6119-101A-BCB7-00DD010655AF
      PersistentHandler: 5E941D80-BF96-11CD-B579-08002B30BFEB
      PersistentAddinsRegistered: C1243CA0-BF96-11CD-B579-08002B30BFEB
      InprocServer32 (Filter DLL): query.dll

  • Binary files

      Document Type: Binary file
      Class: BinaryFile
      CLSID: 08C524E0-89B0-11CF-88A1-00AA004B9986
      PersistentHandler: 098F2470-BAE0-11CD-B579-08002B30BFEB
      PersistentAddinsRegistered: C3278E90-BEA7-11CD-B579-08002B30BFEB
      InprocServer32 (Filter DLL): query.dll

Summary


Document filtering is completed in the background by the Index Server CiDaemon process. When filtering documents, the CiDaemon determines which filter DLL to use based on the extension and the type of document being filtered. Document properties as well as content text are extracted from documents. Word-breaker DLLs are used to split the extracted raw data into words, which are then culled to remove common language-specific words known as noise words.
