Autonomy Research Report

Date:  02 January 2008

 

Reports on: How to implement Boolean Search using Autonomy search engine

 

Introduction:

Current Implementation in Mocca (Keyword Search):

When a user searches for red dress in the search field, the results match red or dress. For example, the first 3 results may match only the keyword red and the next 4 only dress.

 

To be implemented (Boolean Search):

     Implement Boolean Search: when a user enters red dress, the search rules should ensure that the displayed results match the following rules, in this order:

1.    red and dress

2.    red or dress

 

Findings:

1.        How to index Records for Boolean Search

 

The diagram below shows how the Autonomy search engine works.

1.1   There are 2 main steps to use Autonomy:

-         Index records into the IDOL/DRE server

-         Create query rules to query information from the IDOL/DRE server

We can follow the supplied documentation to install and configure the fetch modules that generate the .idx file.

 

1.2   To index the .idx file into Autonomy DRE

Enter the following command in the address field of your web browser:

http://host:port/DREADD?file_name

host: The IP address or name of the machine on which the Autonomy DRE is installed.

port: The port used to index files into the Autonomy DRE.

file_name: The IDX or XML file that you want to index.

For example:

http://localhost:3001/DREADD?c:/db/dre3indexdata.idx
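
The same command can also be issued from code instead of a browser. The following is only a minimal sketch: the class name DreAddRequest and its methods are hypothetical helpers, not part of any Autonomy API, and it assumes the DRE accepts the command as a plain HTTP GET, exactly as it does when typed into the address field.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the Autonomy API.
final class DreAddRequest {

    // Builds http://host:port/DREADD?file_name as described above.
    static String buildUrl(String host, int port, String fileName) {
        return "http://" + host + ":" + port + "/DREADD?" + fileName;
    }

    // Sends the command with a plain HTTP GET and returns the HTTP status code.
    static int send(String host, int port, String fileName) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(buildUrl(host, port, fileName)).openConnection();
        try {
            conn.setRequestMethod("GET");
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Prints the URL from the example above without contacting any server.
        System.out.println(buildUrl("localhost", 3001, "c:/db/dre3indexdata.idx"));
    }
}
```

Calling send(...) requires a running DRE on the given host and port; buildUrl(...) can be checked on its own.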

 

1.3   Preventing duplicate content

Decide which fields you want to use as reference fields (for example, */DRETITLE), and set up a field process that identifies these fields as reference fields when a Connector indexes documents into the DRE.

 

1.4   The code below shows how records are indexed in Mocca

// Requires: import java.io.*; import java.net.URL; import java.net.URLConnection;
// IDX_FILE, SUIRSERVER_IP, SUIRSERVER_PORT and DREDBNAME are class-level settings defined elsewhere.
public static void sendIdxToDRE() {
    try {
        // Read in the index file
        File IDX_FILE_fp = new File(IDX_FILE);
        if (!IDX_FILE_fp.exists()) {
            System.out.println("");
            System.out.println(IDX_FILE + " index file is not found !!!");
            System.exit(1);
        }

        FileInputStream FIS_DRE = new FileInputStream(IDX_FILE_fp);
        DataInputStream in = new DataInputStream(FIS_DRE);
        String inputString = "";
        StringBuffer FullIDXString = new StringBuffer();
        BufferedReader bReader = new BufferedReader(new InputStreamReader(in));

        while ((inputString = bReader.readLine()) != null) {
            if (inputString.indexOf("summary=") != -1) {
                // Drop the "| | | " marker (plus the 4 characters before it, unless it
                // directly follows a double quote) and the last character of the line.
                int idx = inputString.indexOf("| | | ");
                if (idx >= 0) {
                    String tmp = inputString;
                    String de = "";
                    if (idx != 0) {
                        de = tmp.substring(idx - 1, idx);
                        if (!de.equals("\""))
                            tmp = inputString.substring(0, idx - 4);
                        else
                            tmp = inputString.substring(0, idx);
                    }
                    int strlen = inputString.length() - 1;
                    String temp = inputString.substring(idx + 6, strlen);
                    inputString = tmp + temp;
                }
            }
            FullIDXString.append(inputString + "\n");
        }

        // Close the file
        in.close();

        // Build the DREADDDATA URL, adding the target database when one is configured
        URL suir_url;
        if (DREDBNAME == null) {
            suir_url = new URL("http://" + SUIRSERVER_IP + ":" + SUIRSERVER_PORT + "/DREADDDATA?");
        }
        else {
            suir_url = new URL("http://" + SUIRSERVER_IP + ":" + SUIRSERVER_PORT + "/DREADDDATA?&DREDBNAME=" + DREDBNAME);
        }

        URLConnection suir_connection = suir_url.openConnection();
        suir_connection.setDoOutput(true);
        PrintWriter suir_out = new PrintWriter(suir_connection.getOutputStream());

        //suir_out.println("\n" + FullIDXString + "#DREENDDATA\n\n"); // Pass IDX as POST to Suir

        // Pass IDX as POST to Suir. Kill duplicated DREREFERENCE
        suir_out.println("\n" + FullIDXString + "#DREENDDATAREFERENCE\n\n");
        suir_out.close();

        // Opening the input stream sends the request and makes the response available
        BufferedReader suir_www = new BufferedReader(new InputStreamReader(suir_connection.getInputStream()));
        String suir_inputLine;
        /*
        System.out.println("-------------------------------------------------------------------");
        while ((suir_inputLine = suir_www.readLine()) != null) {
            System.out.println("RESPONSE : " + suir_inputLine);
        }
        System.out.println("-------------------------------------------------------------------");
        */
        System.out.println("Done sending to SUIR Server " + SUIRSERVER_IP + " Port " + SUIRSERVER_PORT);
        if (DREDBNAME != null) {
            System.out.println("DREDBNAME is " + DREDBNAME);
        }
    }
    catch (Exception e) {
        e.printStackTrace();
        System.out.println(e.toString());
    }
}// sendIdxToDRE

 

 

2.        Any differences between indexing records for Keyword and Boolean Search

 

Records are indexed the same way for Keyword Search and Boolean Search, unless Boolean Search has special requirements.

 

 

3.        How to create Search Rules for Boolean Search

 

3.1 Boolean searches in Autonomy

IDOL Server

Boolean and bracketed Boolean searches

01 Jan 08 13.41.09

 

Description

 

You can submit standard Boolean searches to IDOL server. The following operators allow you to manipulate a query by applying them to words, exact phrases or other Boolean expressions. Note that APCM (Adaptive Probabilistic Concept Modeling) is used to rank the results that match the Boolean query.

AND

Binary operator. Ensures that both terms are matched in every document that is returned.

For example:

action=Query&Text=cat+AND+dog

This query only returns documents that contain both cat and dog.

NOT

……..

OR

Binary operator. One or both terms must appear for the document to be returned. This is the default behavior if no explicit operator is given between two terms.

For example:

action=Query&Text=cat+OR+dog

This query only returns documents that contain either cat, dog or both terms.

EOR or XOR

….

( )

Bracketed expressions. These are evaluated left to right and can be nested. They dictate the precedence and behavior of combined operator statements.

For example:

action=Query&Text=(fish EOR pie) AND (chips EOR mash)

This query only returns documents that contain one of the following:

"fish" and "chips"

"fish" and "mash"

"pie" and "chips"

"pie" and "mash"

 

 

 

3.2 Implement Boolean Search

To implement Boolean Search: when a user enters red dress, the search rules should ensure that the displayed results match the following rules, in this order:

1st.  red and dress

2nd. red or dress

 

We can create the rules as follows.

1st, parse the keyword into tokens.

E.g. if the user inputs ‘new red dress’, we split it on ‘ ’ into ‘new’, ‘red’ and ‘dress’.

2nd, connect all the tokens with ‘+AND+’ and bracket them.

So it becomes ‘(new+AND+red+AND+dress)’.

3rd, connect all the tokens with ‘+OR+’ and bracket them.

So it becomes ‘(new+OR+red+OR+dress)’.

4th, connect the results of the 2nd and 3rd steps with ‘+’.

So it becomes ‘(new+AND+red+AND+dress)+(new+OR+red+OR+dress)’.

5th, add different weightings to change the relevance.

(new+AND+red+AND+dress)[50]+(new+OR+red+OR+dress)[40], so that the results follow this order: 1st (new+AND+red+AND+dress), 2nd (new+OR+red+OR+dress).
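
The five steps above can be sketched in Java roughly as follows. This is an illustration only: the class BooleanQueryBuilder and its build method are hypothetical names, splitting on any run of whitespace and passing a single token through unchanged are assumptions, and the sketch does not escape tokens that are themselves operators or contain special characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the five steps; not part of the ACI API.
final class BooleanQueryBuilder {

    // Builds "(t1+AND+t2...)[andWeight]+(t1+OR+t2...)[orWeight]" from the raw user input.
    static String build(String userInput, int andWeight, int orWeight) {
        // Step 1: split the input into tokens on whitespace, skipping empty pieces.
        List<String> tokens = new ArrayList<>();
        for (String token : userInput.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            // Assumption: a single keyword needs no Boolean rule.
            return tokens.get(0);
        }
        // Step 2: join with +AND+ and bracket.
        String andPart = "(" + String.join("+AND+", tokens) + ")";
        // Step 3: join with +OR+ and bracket.
        String orPart = "(" + String.join("+OR+", tokens) + ")";
        // Steps 4 and 5: connect both parts with '+' and attach the weights.
        return andPart + "[" + andWeight + "]+" + orPart + "[" + orWeight + "]";
    }

    public static void main(String[] args) {
        // Prints (new+AND+red+AND+dress)[50]+(new+OR+red+OR+dress)[40]
        System.out.println(build("new red dress", 50, 40));
    }
}
```

The returned string is what would be passed as the text parameter of the query.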

 

3.3 Query by ACI API.

client aci = new client();  // aci Client Object

aciObject acioDRE = aci.aciObjectCreate(aciObject.ACI_CONNECTION); // Create DRE connection object

acioDRE.paramSetString(aciObject.ACI_HOSTNAME, "192.168.14.51");

acioDRE.paramSetString(aciObject.ACI_PORTNUMBER, "3002");

acioDRE.paramSetInt(aciObject.ACI_CONN_TIMEOUT, 60000);

 

aciObject acioCommand = aci.aciObjectCreate(aciObject.ACI_COMMAND); // Create query command object

acioCommand.paramSetString(aciObject.ACI_COM_COMMAND,"Query");// Set action

 

acioCommand.paramSetString("text", keyword);

//acioCommand.paramSetString("FieldText",prevQuery);

          

//acioCommand.paramSetString("DatabaseMatch",classifiedDB);

acioCommand.paramSetInt("MaxResults",300);

acioCommand.paramSetInt("Characters",3000);

acioCommand.paramSetBool("Spellcheck",true);

acioCommand.paramSetString("Combine","Simple"); 

//comment out IgnoreSpecials=true to enable Boolean search

//acioCommand.paramSetString("IgnoreSpecials","true");   

acioCommand.paramSetString("Sort","DISTRICT:reversealphabetical");

 

aciObject acioResult = acioDRE.aciObjectExecute(acioCommand); // Execute the query command created above

 

 

 

4.        Any differences between creating search rules for Keyword and Boolean Search

 

In fact, a Keyword Search for ‘red dress’ is treated by default as ‘red+OR+dress’, so the results include both the records matching ‘red+AND+dress’ and those matching only ‘red’ or only ‘dress’.

 

We can use the following rule to make sure the results follow the order 1st ‘red+AND+dress’, 2nd ‘red+OR+dress’:

(red+AND+dress)[50]+(red+OR+dress)[40]

 

However, this requires extra code rework. In my tests, the results of a plain ‘red dress’ query are already ordered by relevance, and records matching ‘red+AND+dress’ appear to score higher than those matching only one term, so they should already come out first.

 

5.        How to get results when using the search engine

 

Suppose the search returns a result like the one below.

  <?xml version="1.0" encoding="ISO-8859-1" ?>

 <autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">

  <action>QUERY</action>

  <response>SUCCESS</response>

 <responsedata>

  <autn:numhits>14</autn:numhits>

 <autn:hit>

  <autn:reference>http://192.168.10.158:7003/vapext/customjsp/report/AutonomyQuery.jsp&ID=21</autn:reference>

  <autn:id>23</autn:id>

  <autn:section>0</autn:section>

  <autn:weight>97.43</autn:weight>

  <autn:links>RED,DRESS</autn:links>

  <autn:database>ADHOCTEXTS</autn:database>

  <autn:title>red dress dress red dress red dress red 12/27/2007...</autn:title>

  <autn:content>

  <DOCUMENT>

  <DREREFERENCE>http://192.168.10.158:7003/vapext/customjsp/report/AutonomyQuery.jsp&ID=21</DREREFERENCE>

  <DRETITLE>red dress dress red dress red dress red 12/27/2007...</DRETITLE>

  <DREDATE>1198899205</DREDATE>

  <DREDBNAME>ADHOCTEXTS</DREDBNAME>

  <ASS1ADHOCTEXTID>21</ASS1ADHOCTEXTID>

  <CMATITLE>red dress</CMATITLE>

  <TITLE>dress red</TITLE>

  <LOCATION>dress red</LOCATION>

  <BODYTEXT><p>dress red</p></BODYTEXT>

  <LAUNCHDATE />

  <LAUNCHTIME />

  <EXPIRYDATE />

  <EXPIRYTIME />

  <CREATEDATE>12/27/2007</CREATEDATE>

  <CREATEDBY />

  <LASTMODIFIEDDATE />

  <LASTMODIFIEDBY />

  <DRECONTENT>red dress dress red dress red dress red 12/27/2007</DRECONTENT>

  </DOCUMENT>

  </autn:content>

  </autn:hit>

………………repeat other records

 

We can use the code below to get the results.

 

//Create DRE connection object

aciObject acioDRE = aci.aciObjectCreate(aciObject.ACI_CONNECTION);

……………………//set connection information

// Create DRE command object

aciObject acioCommand = aci.aciObjectCreate(aciObject.ACI_COMMAND);

……………………//set query rules

 

aciObject acioResult = acioDRE.aciObjectExecute(acioCommand);

aciObject acioSingleResult = null;

acioSingleResult = acioResult.findFirstOccurrenceFromRoot("autn:hit");

 

int iCount = 0;

if(acioSingleResult != null) {

do {

        iCount++;

        //The search result is formatted as XML; get the tag values

       String cmaTitle = acioSingleResult.getTagValue("CMATITLE");

       String title = acioSingleResult.getTagValue("TITLE");

       String bodyText = acioSingleResult.getTagValue("BODYTEXT");

       String location = acioSingleResult.getTagValue("LOCATION");

       String crtDt = acioSingleResult.getTagValue("CREATEDATE");

       …………//Handle the record

 

        //get next record

       acioSingleResult = acioSingleResult.aciObjectNextEntry();

    }while(acioSingleResult!=null) ;

}

 

 

6.        Any differences between using a search engine that uses Keyword Search and Boolean Search

 

The search engine is used the same way for Keyword Search and Boolean Search, unless Boolean Search has special requirements.

 

7.        Any special rules that users can use (e.g. AND, OR , "", or any special keywords used in common search engines that can be used in Autonomy)

 

There are many special characters and keywords that users can use to customize their searches, such as *?":() and the Boolean / Proximity operators AND, NOT, OR, EOR, XOR, NEAR, DNEAR, WNEAR, BEFORE, AFTER.

 

However, before these can be used, the rule IgnoreSpecials=true must be removed from the query.

 

IDOL Server

Parameter - Query/IgnoreSpecials

01 Dec 08 15.04.00

 

Description

 

Enter true if you want IDOL server to interpret the following as normal characters in query syntax. This disables wildcarding, phrase queries, field restriction and Boolean operations.

*?":() and Boolean / Proximity operators AND, NOT, OR, EOR, XOR, NEAR, DNEAR, WNEAR, BEFORE, AFTER

 

 

 

 

8.        After displaying results, report on whether it is possible to redefine these rules to give weightage to individual records, so that the order in which records appear can be influenced (i.e. Record M "red dress" appears before Record A "red dress")

 

It is possible to redefine the rules to give weightage to individual records.

 

For example, we can use a field such as ASS1ADHOCTEXTID to carry the weightage while leaving the relevance unchanged. The records most relevant to the user’s search come first, and among records of equal relevance the one with the higher weightage appears first.

For example if we query as below

http://192.168.14.51:3002/action=query&text=red%20dress&maxresults=300&print=all&sort=Relevance+ASS1ADHOCTEXTID:numberdecreasing

Suppose 4 records are returned.

Record A, matches “RED”, ASS1ADHOCTEXTID is 18

Record B, matches “RED”, ASS1ADHOCTEXTID is 17

Record C, matches “RED DRESS”, ASS1ADHOCTEXTID is 10

Record D, matches “RED DRESS” and same relevance with C, ASS1ADHOCTEXTID is 15

The display order will be DCAB

 

If we remove the relevance by following query

http://192.168.14.51:3002/action=query&text=red%20dress&maxresults=300&print=all&sort=ASS1ADHOCTEXTID:numberdecreasing

The display order will be ABDC
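
The two orderings can be reproduced locally with a small comparator sketch. This only simulates the sorting and does not call IDOL: the class SortOrderDemo is hypothetical, and the relevance scores (60.0 for records matching only “RED”, 97.0 for records matching “RED DRESS”) are made-up values chosen to fit the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Local simulation of the two sort rules; it does not query the search engine.
final class SortOrderDemo {

    static final class Rec {
        final String name;
        final double relevance;   // made-up score for illustration
        final int adhocTextId;    // value of the ASS1ADHOCTEXTID field

        Rec(String name, double relevance, int adhocTextId) {
            this.name = name;
            this.relevance = relevance;
            this.adhocTextId = adhocTextId;
        }
    }

    // The four records from the example above.
    static List<Rec> sample() {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec("A", 60.0, 18));
        recs.add(new Rec("B", 60.0, 17));
        recs.add(new Rec("C", 97.0, 10));
        recs.add(new Rec("D", 97.0, 15));
        return recs;
    }

    // Returns the record names in display order.
    static String order(List<Rec> recs, boolean useRelevance) {
        // ASS1ADHOCTEXTID:numberdecreasing
        Comparator<Rec> byId = Comparator.comparingInt((Rec r) -> r.adhocTextId).reversed();
        // Relevance first (highest first), then the id as tie-breaker.
        Comparator<Rec> byRelevanceThenId =
                Comparator.comparingDouble((Rec r) -> r.relevance).reversed().thenComparing(byId);

        List<Rec> sorted = new ArrayList<>(recs);
        sorted.sort(useRelevance ? byRelevanceThenId : byId);

        StringBuilder names = new StringBuilder();
        for (Rec r : sorted) {
            names.append(r.name);
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(order(sample(), true));   // prints DCAB
        System.out.println(order(sample(), false));  // prints ABDC
    }
}
```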

 

Conclusion:

1st, we only need to change the query rules and remove the rule IgnoreSpecials=true to implement Boolean Search.

2nd, no change is needed to how records are indexed or how results are retrieved from the search engine.

3rd, to implement different weightage for individual records, we may need to change the configuration file to add one more field and generate a new .idx file.

 

 