Date: 02 January 2008
Reports on: How to implement Boolean Search using Autonomy search engine
Introduction:
Current Implementation in Mocca (Keyword Search):
• When searching for red dress in the search field, the returned results match red or dress. For example, the first 3 results may match only the keyword red and the next 4 may match only dress.
To be implemented (Boolean Search):
• Implement Boolean Search: when the user enters red dress, the search rules should ensure that the displayed results match the following rules, in this order:
1. red and dress
2. red or dress
Findings:
1. How to index Records for Boolean Search
The diagram below shows how the Autonomy search engine works.
1.1 There are 2 main steps to use Autonomy:
• Index records into the IDOL/DRE server
• Create query rules to query information from the IDOL/DRE server
We can follow the provided documentation to install and configure the fetch modules that generate the .idx file.
1.2 To index the .idx file into Autonomy DRE
Enter the following command in the address field of your web browser:
http://host:port/DREADD?file_name
• host: The IP address or name of the machine on which the Autonomy DRE is installed.
• port: The port used to index files into the Autonomy DRE.
• file_name: The IDX or XML file that you want to index.
For example:
http://localhost:3001/DREADD?c:/db/dre3indexdata.idx
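The same request can be prepared from Java rather than typed into a browser. The sketch below only assembles the URL in the form shown above. The class and method names are hypothetical, and it assumes the DRE accepts the file path unencoded, as in the example.

```java
final class DreAddUrl {
    /** Builds http://host:port/DREADD?file_name as described above. */
    static String build(String host, int port, String fileName) {
        return "http://" + host + ":" + port + "/DREADD?" + fileName;
    }

    public static void main(String[] args) {
        // Same host, port and file as the example above.
        System.out.println(build("localhost", 3001, "c:/db/dre3indexdata.idx"));
    }
}
```

The resulting string can then be opened with java.net.URL in the same way as the indexing code in section 1.4.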
1.3 Preventing duplicate content
Decide which fields you want to use as reference fields (for example, */DRETITLE), and
set up a field process that identifies these fields as reference fields when a Connector
indexes documents into the DRE.
1.4 The code below shows how records are indexed in Mocca
import java.io.*;
import java.net.URL;
import java.net.URLConnection;

public static void sendIdxToDRE() {
    try {
        // Read in the index file
        File IDX_FILE_fp = new File(IDX_FILE);
        if (!IDX_FILE_fp.exists()) {
            System.out.println("");
            System.out.println(IDX_FILE + " index file is not found !!!");
            System.exit(1);
        }
        FileInputStream FIS_DRE = new FileInputStream(IDX_FILE_fp);
        DataInputStream in = new DataInputStream(FIS_DRE);
        String inputString = "";
        StringBuffer FullIDXString = new StringBuffer();
        BufferedReader bReader = new BufferedReader(new InputStreamReader(in));
        while ((inputString = bReader.readLine()) != null) {
            // Strip the "| | | " separator from summary lines
            if (inputString.indexOf("summary=") != -1) {
                int idx = inputString.indexOf("| | | ");
                //System.out.println("before-->" + inputString);
                if (idx >= 0) {
                    String tmp = inputString;
                    String de = "";
                    if (idx != 0) {
                        de = tmp.substring(idx - 1, idx);
                        if (!de.equals("\""))
                            tmp = inputString.substring(0, idx - 4);
                        else
                            tmp = inputString.substring(0, idx);
                    }
                    int strlen = inputString.length() - 1;
                    String temp = inputString.substring(idx + 6, strlen);
                    inputString = tmp + temp;
                    //System.out.println(idx + " after-->" + inputString);
                }
            }
            //System.out.println(inputString);
            FullIDXString.append(inputString + "\n");
        }
        // Close file
        in.close();
        URL suir_url;
        if (DREDBNAME == null) {
            suir_url = new URL("http://" + SUIRSERVER_IP + ":" + SUIRSERVER_PORT + "/DREADDDATA?");
        } else {
            suir_url = new URL("http://" + SUIRSERVER_IP + ":" + SUIRSERVER_PORT + "/DREADDDATA?&DREDBNAME=" + DREDBNAME);
        }
        URLConnection suir_connection = suir_url.openConnection();
        suir_connection.setDoOutput(true);
        PrintWriter suir_out = new PrintWriter(suir_connection.getOutputStream());
        //suir_out.println("\n" + FullIDXString + "#DREENDDATA\n\n"); // Pass IDX as POST to SUIR
        // Pass IDX as POST to SUIR. Kill duplicated DREREFERENCE
        suir_out.println("\n" + FullIDXString + "#DREENDDATAREFERENCE\n\n");
        suir_out.close();
        BufferedReader suir_www = new BufferedReader(new InputStreamReader(suir_connection.getInputStream()));
        String suir_inputLine;
        /*
        System.out.println("-------------------------------------------------------------------");
        while ((suir_inputLine = suir_www.readLine()) != null) {
            System.out.println("RESPONSE : " + suir_inputLine);
        }
        System.out.println("-------------------------------------------------------------------");
        */
        System.out.println("Done sending to SUIR Server " + SUIRSERVER_IP + " Port " + SUIRSERVER_PORT);
        if (DREDBNAME != null) {
            System.out.println("DREDBNAME is " + DREDBNAME);
        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println(e.toString());
    }
} // sendIdxToDRE
2. Any differences between indexing records for Keyword and Boolean Search
Indexing is the same for Keyword Search and Boolean Search, unless Boolean Search has special requirements.
3. How to create Search Rules for Boolean Search
3.1 Boolean searches in Autonomy
IDOL Server Boolean and bracketed Boolean searches (01 Jan 08 13.41.09)
Description:
You can submit standard Boolean searches to IDOL server. The following operators allow you to manipulate a query by applying them to words, exact phrases or other Boolean expressions. Note that APCM (Adaptive Probabilistic Concept Modeling) is used to rank the results that match the Boolean query.
• AND: Binary operator. Ensures that both terms are matched in every document that is returned. For example:
action=Query&Text=cat+AND+dog
This query only returns documents that contain both cat and dog.
• NOT: ……..
• OR: Binary operator. One or both terms must appear for the document to be returned. This is the default behavior if no explicit operator is given between two terms. For example:
action=Query&Text=cat+OR+dog
This query only returns documents that contain either cat, dog or both terms.
• EOR or XOR: ….
• ( ): Bracketed expressions. These are evaluated left to right and can be nested. They dictate the precedence and behavior of combined operator statements. For example:
action=Query&Text=(fish EOR pie) AND (chips EOR mash)
This query only returns documents that contain one of the following:
"fish" and "chips"
"fish" and "mash"
"pie" and "chips"
"pie" and "mash"
3.2 Implement Boolean Search
To implement Boolean Search, when the user enters red dress, the search rules should ensure that the displayed results match the following rules, in this order:
1st. red and dress
2nd. red or dress
We can create the rules as follows.
1st, parse the keyword into tokens.
E.g. if the user inputs ‘new red dress’, we split it on ‘ ’ into ‘new’, ‘red’ and ‘dress’.
2nd, connect all the tokens with ‘+AND+’ and bracket them.
So it becomes ‘(new+AND+red+AND+dress)’.
3rd, connect all the tokens with ‘+OR+’ and bracket them.
So it becomes ‘(new+OR+red+OR+dress)’.
4th, connect the 2nd and 3rd results with ‘+’.
So it becomes ‘(new+AND+red+AND+dress)+(new+OR+red+OR+dress)’.
5th, add different weightings to change the relevance:
(new+AND+red+AND+dress)[50]+(new+OR+red+OR+dress)[40], so that the results follow this order: 1st (new+AND+red+AND+dress), 2nd (new+OR+red+OR+dress).
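The five steps above can be sketched in Java as follows. This is a minimal illustration, not the Mocca implementation. The class name, the method name and the choice to return a single token unchanged are assumptions, and the weighting syntax is copied from step 5 as written in this report.

```java
import java.util.ArrayList;
import java.util.List;

final class BooleanQueryBuilder {
    /** Turns "new red dress" into "(new+AND+red+AND+dress)[50]+(new+OR+red+OR+dress)[40]". */
    static String build(String userInput, int andWeight, int orWeight) {
        // Step 1: split the keyword into tokens on whitespace.
        List<String> tokens = new ArrayList<>();
        for (String token : userInput.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        if (tokens.isEmpty()) {
            return "";
        }
        if (tokens.size() == 1) {
            // Assumption: a single term needs no Boolean expansion.
            return tokens.get(0);
        }
        // Steps 2 and 3: join the tokens with +AND+ and +OR+, each in brackets.
        String andPart = "(" + String.join("+AND+", tokens) + ")";
        String orPart = "(" + String.join("+OR+", tokens) + ")";
        // Steps 4 and 5: connect both parts with '+' and attach the weights.
        return andPart + "[" + andWeight + "]+" + orPart + "[" + orWeight + "]";
    }

    public static void main(String[] args) {
        System.out.println(build("new red dress", 50, 40));
    }
}
```

User input that already contains operators or brackets is not handled here and would need escaping or validation before being passed to the query.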
3.3 Query by ACI API.
client aci = new client(); // ACI client object
// Create DRE connection object
aciObject acioDRE = aci.aciObjectCreate(aciObject.ACI_CONNECTION);
acioDRE.paramSetString(aciObject.ACI_HOSTNAME, "192.168.14.51");
acioDRE.paramSetString(aciObject.ACI_PORTNUMBER, "3002");
acioDRE.paramSetInt(aciObject.ACI_CONN_TIMEOUT, 60000);
// Create query command object
aciObject acioCommand = aci.aciObjectCreate(aciObject.ACI_COMMAND);
acioCommand.paramSetString(aciObject.ACI_COM_COMMAND, "Query"); // Set action
acioCommand.paramSetString("text", keyword);
//acioCommand.paramSetString("FieldText", prevQuery);
//acioCommand.paramSetString("DatabaseMatch", classifiedDB);
acioCommand.paramSetInt("MaxResults", 300);
acioCommand.paramSetInt("Characters", 3000);
acioCommand.paramSetBool("Spellcheck", true);
acioCommand.paramSetString("Combine", "Simple");
// Comment out IgnoreSpecials=true to enable Boolean search
//acioCommand.paramSetString("IgnoreSpecials", "true");
acioCommand.paramSetString("Sort", "DISTRICT:reversealphabetical");
// Execute the query command on the DRE connection
aciObject acioResult1 = acioDRE.aciObjectExecute(acioCommand);
4. Any differences between creating search rules for Keyword and Boolean Search
By default, a Keyword Search for ‘red dress’ is treated as ‘red+OR+dress’, so the results include both records matching ‘red+AND+dress’ and records matching only one of the two terms.
We can use the following rule to make sure the results follow the order 1st ‘red+AND+dress’, 2nd ‘red+OR+dress’:
(red+AND+dress)[50]+(red+OR+dress)[40]
However, this requires extra code rework. In my tests, the results of the plain query ‘red dress’ are already ordered by relevance, and records matching ‘red+AND+dress’ appear to have higher relevance than records matching only ‘red+OR+dress’, so they come out first.
5. How to get results when using the search engine
Suppose the search returns a result like the one below.
<?xml version="1.0" encoding="ISO-8859-1" ?>
<autnresponse xmlns:autn="http://schemas.autonomy.com/aci/">
  <action>QUERY</action>
  <response>SUCCESS</response>
  <responsedata>
    <autn:numhits>14</autn:numhits>
    <autn:hit>
      <autn:reference>http://192.168.10.158:7003/vapext/customjsp/report/AutonomyQuery.jsp&ID=21</autn:reference>
      <autn:id>23</autn:id>
      <autn:section>0</autn:section>
      <autn:weight>97.43</autn:weight>
      <autn:links>RED,DRESS</autn:links>
      <autn:database>ADHOCTEXTS</autn:database>
      <autn:title>red dress dress red dress red dress red 12/27/2007...</autn:title>
      <autn:content>
        <DOCUMENT>
          <DREREFERENCE>http://192.168.10.158:7003/vapext/customjsp/report/AutonomyQuery.jsp&ID=21</DREREFERENCE>
          <DRETITLE>red dress dress red dress red dress red 12/27/2007...</DRETITLE>
          <DREDATE>1198899205</DREDATE>
          <DREDBNAME>ADHOCTEXTS</DREDBNAME>
          <ASS1ADHOCTEXTID>21</ASS1ADHOCTEXTID>
          <CMATITLE>red dress</CMATITLE>
          <TITLE>dress red</TITLE>
          <LOCATION>dress red</LOCATION>
          <BODYTEXT><p>dress red</p></BODYTEXT>
          <LAUNCHDATE />
          <LAUNCHTIME />
          <EXPIRYDATE />
          <EXPIRYTIME />
          <CREATEDATE>12/27/2007</CREATEDATE>
          <CREATEDBY />
          <LASTMODIFIEDDATE />
          <LASTMODIFIEDBY />
          <DRECONTENT>red dress dress red dress red dress red 12/27/2007</DRECONTENT>
        </DOCUMENT>
      </autn:content>
    </autn:hit>
    ………………repeat other records
We can use the code below to get the results.
// Create DRE connection object
aciObject acioDRE = aci.aciObjectCreate(aciObject.ACI_CONNECTION);
……………………// set connection information
// Create DRE command object
aciObject acioCommand = aci.aciObjectCreate(aciObject.ACI_COMMAND);
……………………// set query rules
aciObject acioResult = acioDRE.aciObjectExecute(acioCommand);
aciObject acioSingleResult = null;
// Locate the first <autn:hit> element in the response
acioSingleResult = acioResult.findFirstOccurrenceFromRoot("autn:hit");
int iCount = 0;
if (acioSingleResult != null) {
    do {
        iCount++;
        // The search result is formatted as XML; read the tag values
        String cmaTitle = acioSingleResult.getTagValue("CMATITLE");
        String title = acioSingleResult.getTagValue("TITLE");
        String bodyText = acioSingleResult.getTagValue("BODYTEXT");
        String location = acioSingleResult.getTagValue("LOCATION");
        String crtDt = acioSingleResult.getTagValue("CREATEDATE");
        …………// Handle the record
        // Get the next record
        acioSingleResult = acioSingleResult.aciObjectNextEntry();
    } while (acioSingleResult != null);
}
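Where the ACI client library is not available, for example in a unit test, the same tag values can be read with the JDK's standard DOM parser. The sketch below swaps in that technique and does not use the ACI API shown above. The class and method names are hypothetical, and it assumes the response is well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class HitTitleReader {
    /** Returns the CMATITLE of every autn:hit, or "" when a hit has none. */
    static List<String> cmaTitles(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getElementsByTagName matches the qualified tag name "autn:hit".
            NodeList hits = doc.getElementsByTagName("autn:hit");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Element hit = (Element) hits.item(i);
                NodeList found = hit.getElementsByTagName("CMATITLE");
                titles.add(found.getLength() > 0 ? found.item(0).getTextContent() : "");
            }
            return titles;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not parse query response", e);
        }
    }
}
```

Other fields such as TITLE, LOCATION or CREATEDATE can be read the same way by changing the tag name.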
6. Any differences between using a search engine that uses Keyword Search and Boolean Search
Using the search engine is the same for Keyword Search and Boolean Search, unless Boolean Search has special requirements.
7. Any special rules that users can use (e.g. AND, OR , "", or any special keywords used in common search engines that can be used in Autonomy)
There are many special characters and keywords that users can use to customize their search, such as *?":() and the Boolean / Proximity operators AND, NOT, OR, EOR, XOR, NEAR, DNEAR, WNEAR, BEFORE, AFTER.
However, before these can be used, the rule IgnoreSpecials=true must be removed from the query.
IDOL Server Parameter - Query/IgnoreSpecials (01 Dec 08 15.04.00)
Description:
Enter true if you want IDOL server to interpret the following as normal characters in query syntax. This disables wildcarding, phrase queries, field restriction and Boolean operations.
*?":() and Boolean / Proximity operators AND, NOT, OR, EOR, XOR, NEAR, DNEAR, WNEAR, BEFORE, AFTER
8. After displaying results, report on whether it is possible to redefine these rules to give weightage to individual records, so that the order in which records appear can be influenced (i.e. Record M "red dress" appears before Record A "red dress")
It is possible to redefine the rules to give weightage to individual records.
For example, we can use a field such as ASS1ADHOCTEXTID to carry the weightage, without changing the relevance. Records are then ordered by relevance to the user’s search first, and among records with the same relevance the one with the higher weightage appears first.
For example, suppose we query as below:
http://192.168.14.51:3002/action=query&text=red%20dress&maxresults=300&print=all&sort=Relevance+ASS1ADHOCTEXTID:numberdecreasing
Suppose 4 records are returned:
Record A, matches “RED”, ASS1ADHOCTEXTID is 18
Record B, matches “RED”, ASS1ADHOCTEXTID is 17
Record C, matches “RED DRESS”, ASS1ADHOCTEXTID is 10
Record D, matches “RED DRESS” with the same relevance as C, ASS1ADHOCTEXTID is 15
The display order will be DCAB.
If we remove the relevance from the sort, using the following query:
http://192.168.14.51:3002/action=query&text=red%20dress&maxresults=300&print=all&sort=ASS1ADHOCTEXTID:numberdecreasing
The display order will be ABDC
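The two orderings can be reproduced with a small Java model of the sort. This is only an illustration of the sorting logic, not of how IDOL computes relevance. The relevance scores (60 for A and B, 90 for C and D) are made-up values chosen so that A and B tie with each other and rank below C and D, and the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortOrderSketch {
    static final class Rec {
        final String name;
        final double relevance; // assumed score, higher is more relevant
        final int textId;       // value of ASS1ADHOCTEXTID

        Rec(String name, double relevance, int textId) {
            this.name = name;
            this.relevance = relevance;
            this.textId = textId;
        }
    }

    /** Returns the record names in display order, e.g. "DCAB". */
    static String order(List<Rec> records, boolean relevanceFirst) {
        // ASS1ADHOCTEXTID:numberdecreasing
        Comparator<Rec> byIdDesc = Comparator.comparingInt((Rec r) -> r.textId).reversed();
        // Relevance+ASS1ADHOCTEXTID:numberdecreasing
        Comparator<Rec> byRelevanceThenId =
                Comparator.comparingDouble((Rec r) -> r.relevance).reversed().thenComparing(byIdDesc);
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(relevanceFirst ? byRelevanceThenId : byIdDesc);
        StringBuilder names = new StringBuilder();
        for (Rec r : sorted) {
            names.append(r.name);
        }
        return names.toString();
    }

    static List<Rec> sample() {
        return Arrays.asList(
                new Rec("A", 60.0, 18),
                new Rec("B", 60.0, 17),
                new Rec("C", 90.0, 10),
                new Rec("D", 90.0, 15));
    }

    public static void main(String[] args) {
        System.out.println(order(sample(), true));  // DCAB
        System.out.println(order(sample(), false)); // ABDC
    }
}
```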
Conclusion:
1st, to implement Boolean Search we only need to change the query rules and remove the rule IgnoreSpecials=true.
2nd, no change is needed to how records are indexed or how results are read from the search engine.
3rd, to implement different weightage for individual records, we may need to change the configuration file to add one more field and generate a new .idx file.