/* hash

Searching the Web
Time Limit: 5000MS    Memory Limit: 65536K
Total Submissions: 1670    Accepted: 367

Description

The term "search engine" is probably not strange to you. Generally speaking, a
search engine searches the web pages available on the Internet, extracts and
organizes the information, and responds to users' queries with the most
relevant pages. World-famous search engines, like GOOGLE, have become very
important tools for us to use when we visit the web. Conversations like this
are now common in our daily life:

    "What does a word like ****** mean?"
    "Um... I am not sure, just google it."

In this problem, you are required to construct a small search engine. Sounds
impossible, doesn't it? Don't worry, here is a tutorial teaching you, step by
step, how to organize a large collection of texts efficiently and respond to
queries quickly. You don't need to worry about fetching the web pages; all of
them are provided to you in text format as the input data. Besides, a lot of
queries are also provided to validate your system.

Modern search engines use a technique called inversion for dealing with very
large sets of documents. The method relies on the construction of a data
structure, called an inverted index, which associates terms (words) with their
occurrences in the collection of documents. The set of terms of interest is
called the vocabulary, denoted as V. In its simplest form, an inverted index is
a dictionary where each search key is a term ω ∈ V. The associated value b(ω)
is a pointer to an additional intermediate data structure, called a bucket.
The bucket associated with a certain term ω is essentially a list of pointers
marking all the occurrences of ω in the text collection. Each entry in a
bucket simply consists of the document identifier (DID, the ordinal number of
the document within the collection) and the ordinal line number of the term's
occurrence within that document.

Take Figure-1 (not reproduced in this file) as an example of the general
structure. Assuming that we only have three documents to handle, shown at the
right part of Figure-1, first we need to tokenize the text into words (blanks,
punctuation and other non-alphabetic characters separate words) and construct
our vocabulary from the terms occurring in the documents. For simplicity, we
don't consider any phrases; only a single word counts as a term. Furthermore,
the terms are case-insensitive (e.g. we consider "book" and "Book" to be the
same term), we don't consider morphological variants (e.g. we consider "books"
and "book", "protected" and "protect" to be different terms), and hyphenated
words are not kept whole (e.g. "middle-class" is not a single term; the hyphen
splits it into the two terms "middle" and "class"). The vocabulary is shown at
the left part of Figure-1. Each term of the vocabulary has a pointer to its
bucket. The collection of buckets is shown at the middle part of Figure-1.
Each item in a bucket records the DID of the term's occurrence.

After constructing the whole inverted index structure, we may apply it to the
queries. A query is in any of the following formats:

    term
    term AND term
    term OR term
    NOT term

Single terms can be combined with the Boolean operators AND, OR and NOT:
"term1 AND term2" queries the documents including both term1 and term2,
"term1 OR term2" queries the documents including term1 or term2, and
"NOT term1" queries the documents not including term1. Terms are single words
as defined above. (A small illustrative sketch of how these operators map onto
per-document bit masks follows this comment.)

You are guaranteed that no non-alphabetic characters appear in a term and that
all the terms are in lowercase. Furthermore, some meaningless stop words
(common words such as articles, prepositions and adverbs, specified to be
"the, a, to, and, or, not" in our problem) will not appear in the query either.
For each query, the engine looks the terms up in the vocabulary of the
constructed inverted index, compares their bucket information, and returns the
result to the user. Now, can you construct the engine?

Input

The input starts with an integer N (0 < N < 100), the number of documents
provided. The next N sections are the N documents. Each section contains the
document content and ends with a single line of ten asterisks:

**********

You may assume that each line contains no more than 80 characters and that the
total number of lines in the N documents does not exceed 1500.

Next, an integer M (0 < M <= 50000) is given, the number of queries, followed
by M lines, one query per line. All the queries follow the format described
above.

Output

For each query, find the documents satisfying the query and output just the
lines within those documents that include the searched term(s) (for a NOT
query, output the whole document). Print the lines in the same order as they
appear in the input. Separate different documents with a single line of ten
dashes:

----------

If no document matching the query is found, output the single line
"Sorry, I found nothing." The output of each query ends with a single line of
ten equal signs:

==========

Sample Input

4
A manufacturer, importer, or seller of
digital media devices may not (1) sell,
or offer for sale, in interstate
commerce, or (2) cause to be transported
in, or in a manner affecting, interstate
commerce, a digital media device unless
the device includes and utilizes
standard security technologies that
adhere to the security system standards.
**********
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from a
middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
**********
Research in analysis (i.e., the evaluation
of the strengths and weaknesses of
computer system) is essential to the
development of effective security, both
for works protected by copyright law
and for information in general. Such
research can progress only through the
open publication and exchange of
complete scientific results
**********
I am very very very happy!
What about you?
**********
6
computer
books AND computer
books OR protected
NOT security
very
slick

Sample Output

want the computer only to write her
----------
computer system) is essential to the
==========
intend to read his books. She might
want the computer only to write her
fees. Books might be the only way she
==========
intend to read his books. She might
fees. Books might be the only way she
----------
for works protected by copyright law
==========
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from a
middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
----------
I am very very very happy!
What about you?
==========
I am very very very happy!
==========
Sorry, I found nothing.
==========

Source

Beijing 2004
*/
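
/*
 * Illustrative sketch, not part of the accepted solution below: how the three
 * query operators reduce to bitwise operations on per-term document bit masks.
 * This is the same idea the solution applies to its aiBlongedDoc[] fields,
 * except that the real code spreads the mask over four 32-bit words so that up
 * to 100 documents fit. The name demo_boolean_query and the single-word masks
 * are assumptions made for the example; it is kept header-free on purpose so
 * it can sit ahead of the #include lines.
 */
static unsigned int demo_boolean_query(unsigned int uiDocsWithTerm1,
                                       unsigned int uiDocsWithTerm2,
                                       char cOperator)
{
    /* Bit i is set in a mask when document i contains the corresponding term. */
    switch (cOperator) {
    case '&': return uiDocsWithTerm1 & uiDocsWithTerm2; /* term1 AND term2          */
    case '|': return uiDocsWithTerm1 | uiDocsWithTerm2; /* term1 OR term2           */
    case '~': return ~uiDocsWithTerm1;                  /* NOT term1 (all the rest) */
    default:  return 0;                                 /* unknown operator         */
    }
}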

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DOCUMENT_NUM      100
#define MAX_LINE_LENGTH       128
#define MAX_TOTAL_LINE_NUMBER 1500
#define MAX_WORD_LENGTH       32
#define MAX_WORD_NUM          65535
#define NULL_INDEX            MAX_WORD_NUM
#define NOT_FIND              0
#define FIND                  1

/* One occurrence of a term: a node in the term's bucket list. */
typedef struct _INDEX_NODE_ST_ {
    int iUsedFlag;
    int iDocumentNo;
    int iLineIndex;
    int iNext;
} INDEX_NODE_ST;

/* One vocabulary entry: the term, its document bitmap and its bucket list. */
typedef struct _DICTIONARY_NODE_ST_ {
    char cUsedFlag;
    char acWord[MAX_WORD_LENGTH];
    int  aiBlongedDoc[4];               /* 4 x 32-bit document bitmap; bits 0..99 are used */
    int  iHeadIndex;
    int  iTailIndex;
} DICTIONARY_NODE_ST;

/* All document lines plus the [start, end] line range of each document. */
typedef struct _DATABASE_ST_ {
    char acDataBase[MAX_TOTAL_LINE_NUMBER][MAX_LINE_LENGTH];
    int  iDocNum;
    struct {
        int iStartLine;
        int iEndLine;
    } astDocument[MAX_DOCUMENT_NUM];
} DATABASE_ST;

DATABASE_ST        gstDataBase;
INDEX_NODE_ST      gastIndex[MAX_WORD_NUM]      = {0};
DICTIONARY_NODE_ST gastDictionary[MAX_WORD_NUM] = {0};

/* qsort comparator: ascending global line index. */
int LineIndexCmp(const void *a, const void *b)
{
    return *(int *)a - *(int *)b;
}

/* Linear-probe gastIndex[] for a free bucket node, starting at viHashIndex. */
int AllocateValidIndexNode(int viHashIndex)
{
    while (1 == gastIndex[viHashIndex].iUsedFlag) {
        viHashIndex = (viHashIndex + 1) % MAX_WORD_NUM;
    }
    gastIndex[viHashIndex].iUsedFlag = 1;
    return viHashIndex;
}

/* Record one occurrence of vpcWord (document viDocIndex, global line viLineIndex). */
void InsertIntoDictionary(char *vpcWord, int viHashIndex, int viDocIndex, int viLineIndex)
{
    int  iBitMap = 1;
    int  iIndex;
    char cOldWordFlag = 0;

    /* Probe for the word; if it is already known, append the new line to its bucket. */
    while (1 == gastDictionary[viHashIndex].cUsedFlag) {
        if (0 == strcmp(vpcWord, gastDictionary[viHashIndex].acWord)) {
            cOldWordFlag = 1;
            /* Skip duplicates of the same word on the same line. */
            if (viLineIndex != gastIndex[gastDictionary[viHashIndex].iTailIndex].iLineIndex) {
                iIndex = AllocateValidIndexNode(viHashIndex);
                gastIndex[iIndex].iDocumentNo = viDocIndex;
                gastIndex[iIndex].iLineIndex  = viLineIndex;
                gastIndex[iIndex].iNext       = NULL_INDEX;
                gastIndex[gastDictionary[viHashIndex].iTailIndex].iNext = iIndex;
                gastDictionary[viHashIndex].iTailIndex = iIndex;
                gastDictionary[viHashIndex].aiBlongedDoc[viDocIndex / 32] |= (iBitMap << (viDocIndex % 32));
            }
            break;
        }
        viHashIndex = (viHashIndex + 1) % MAX_WORD_NUM;
    }

    /* New word: create the dictionary slot and its first bucket node. */
    if (0 == cOldWordFlag) {
        gastDictionary[viHashIndex].cUsedFlag = 1;
        strcpy(gastDictionary[viHashIndex].acWord, vpcWord);
        gastDictionary[viHashIndex].aiBlongedDoc[viDocIndex / 32] = (iBitMap << (viDocIndex % 32));
        iIndex = AllocateValidIndexNode(viHashIndex);
        gastIndex[iIndex].iDocumentNo = viDocIndex;
        gastIndex[iIndex].iLineIndex  = viLineIndex;
        gastIndex[iIndex].iNext       = NULL_INDEX;
        gastDictionary[viHashIndex].iHeadIndex = iIndex;
        gastDictionary[viHashIndex].iTailIndex = iIndex;
    }
    return;
}

/* Query "term": print every line containing vpcWord, grouped per document. */
int SearchSingleWord(char *vpcWord)
{
    int iLastDoc;
    int iHashIndex = 0;
    int iLoopCharInWord;
    int iNextIndex;

    for (iLoopCharInWord = 0; iLoopCharInWord < strlen(vpcWord); iLoopCharInWord++) {
        iHashIndex = 131 * iHashIndex + vpcWord[iLoopCharInWord];
    }
    iHashIndex = (iHashIndex & 0x7FFFFFFF) % MAX_WORD_NUM;

    while (1) {
        if (1 == gastDictionary[iHashIndex].cUsedFlag &&
            0 == strcmp(vpcWord, gastDictionary[iHashIndex].acWord)) {
            /* Walk the bucket list; occurrences are already ordered by document and line. */
            iNextIndex = gastDictionary[iHashIndex].iHeadIndex;
            iLastDoc   = gastIndex[iNextIndex].iDocumentNo;
            while (NULL_INDEX != iNextIndex) {
                if (iLastDoc != gastIndex[iNextIndex].iDocumentNo) {
                    printf("----------\n");
                    iLastDoc = gastIndex[iNextIndex].iDocumentNo;
                }
                printf("%s\n", gstDataBase.acDataBase[gastIndex[iNextIndex].iLineIndex]);
                iNextIndex = gastIndex[iNextIndex].iNext;
            }
            return FIND;
        }
        if (0 == gastDictionary[iHashIndex].cUsedFlag) {
            return NOT_FIND;
        }
        iHashIndex = (iHashIndex + 1) % MAX_WORD_NUM;
    }
    return NOT_FIND;
}

/* Query "term1 AND term2" (vcAndOrFlag == 1) or "term1 OR term2" (vcAndOrFlag == 0). */
int SearchWord_AND_OR(char *vpcWord1, char *vpcWord2, char vcAndOrFlag)
{
    int iLoop;
    int iLoopBit;
    int iLoopLine;
    int iBitMap = 1;
    int iDocument;
    int iDocumentIndex;
    int iLastDoc = 0xFFFF;
    int iHashIndex1 = 0;
    int iHashIndex2 = 0;
    int iLoopCharInWord;
    int iNextIndex;
    int iLineNum = 0;
    int iWord1FoundFlag;
    int iWord2FoundFlag;
    int aiLineIndex[MAX_TOTAL_LINE_NUMBER];

    for (iLoopCharInWord = 0; iLoopCharInWord < strlen(vpcWord1); iLoopCharInWord++) {
        iHashIndex1 = 131 * iHashIndex1 + vpcWord1[iLoopCharInWord];
    }
    for (iLoopCharInWord = 0; iLoopCharInWord < strlen(vpcWord2); iLoopCharInWord++) {
        iHashIndex2 = 131 * iHashIndex2 + vpcWord2[iLoopCharInWord];
    }
    iHashIndex1 = (iHashIndex1 & 0x7FFFFFFF) % MAX_WORD_NUM;
    iHashIndex2 = (iHashIndex2 & 0x7FFFFFFF) % MAX_WORD_NUM;

    /* Locate both terms in the dictionary (open addressing). */
    while (1) {
        if (1 == gastDictionary[iHashIndex1].cUsedFlag &&
            0 == strcmp(vpcWord1, gastDictionary[iHashIndex1].acWord)) {
            iWord1FoundFlag = 1;
            break;
        }
        if (0 == gastDictionary[iHashIndex1].cUsedFlag) {
            iWord1FoundFlag = 0;
            break;
        }
        iHashIndex1 = (iHashIndex1 + 1) % MAX_WORD_NUM;
    }
    while (1) {
        if (1 == gastDictionary[iHashIndex2].cUsedFlag &&
            0 == strcmp(vpcWord2, gastDictionary[iHashIndex2].acWord)) {
            iWord2FoundFlag = 1;
            break;
        }
        if (0 == gastDictionary[iHashIndex2].cUsedFlag) {
            iWord2FoundFlag = 0;
            break;
        }
        iHashIndex2 = (iHashIndex2 + 1) % MAX_WORD_NUM;
    }

    if (1 == vcAndOrFlag && (0 == iWord1FoundFlag || 0 == iWord2FoundFlag)) {
        return NOT_FIND;
    }
    if (0 == vcAndOrFlag && (0 == iWord1FoundFlag && 0 == iWord2FoundFlag)) {
        return NOT_FIND;
    }

    /* Combine the two document bitmaps, 32 documents at a time. */
    for (iLoop = 0; iLoop < 4; iLoop++) {
        iBitMap = 1;
        if (1 == vcAndOrFlag) { /* AND */
            iDocument = (gastDictionary[iHashIndex1].aiBlongedDoc[iLoop] &
                         gastDictionary[iHashIndex2].aiBlongedDoc[iLoop]);
        } else {                /* OR */
            if (1 == iWord1FoundFlag && 1 == iWord2FoundFlag) {
                iDocument = (gastDictionary[iHashIndex1].aiBlongedDoc[iLoop] |
                             gastDictionary[iHashIndex2].aiBlongedDoc[iLoop]);
            } else if (1 == iWord1FoundFlag && 0 == iWord2FoundFlag) {
                iDocument = gastDictionary[iHashIndex1].aiBlongedDoc[iLoop];
            } else if (0 == iWord1FoundFlag && 1 == iWord2FoundFlag) {
                iDocument = gastDictionary[iHashIndex2].aiBlongedDoc[iLoop];
            }
        }
        if (0 == iDocument) {
            continue;
        }
        for (iLoopBit = 0; iLoopBit < 32; iLoopBit++) {
            if (iBitMap == (iDocument & iBitMap)) {
                iLineNum = 0;
                iDocumentIndex = 32 * iLoop + iLoopBit;
                if (gstDataBase.iDocNum <= iDocumentIndex) {
                    break;
                }
                /* Collect this document's matching line numbers from both buckets. */
                if (1 == iWord1FoundFlag) {
                    iNextIndex = gastDictionary[iHashIndex1].iHeadIndex;
                    while (NULL_INDEX != iNextIndex) {
                        if (iDocumentIndex == gastIndex[iNextIndex].iDocumentNo) {
                            aiLineIndex[iLineNum] = gastIndex[iNextIndex].iLineIndex;
                            iLineNum++;
                        }
                        iNextIndex = gastIndex[iNextIndex].iNext;
                    }
                }
                if (1 == iWord2FoundFlag) {
                    iNextIndex = gastDictionary[iHashIndex2].iHeadIndex;
                    while (NULL_INDEX != iNextIndex) {
                        if (iDocumentIndex == gastIndex[iNextIndex].iDocumentNo) {
                            for (iLoopLine = 0; iLoopLine < iLineNum; iLoopLine++) {
                                if (aiLineIndex[iLoopLine] == gastIndex[iNextIndex].iLineIndex) {
                                    break;
                                }
                            }
                            if (iLineNum <= iLoopLine) { /* not already collected */
                                aiLineIndex[iLineNum] = gastIndex[iNextIndex].iLineIndex;
                                iLineNum++;
                            }
                        }
                        iNextIndex = gastIndex[iNextIndex].iNext;
                    }
                }
                if (0xFFFF != iLastDoc) {
                    printf("----------\n");
                }
                qsort(aiLineIndex, iLineNum, sizeof(int), LineIndexCmp);
                for (iLoopLine = 0; iLoopLine < iLineNum; iLoopLine++) {
                    printf("%s\n", gstDataBase.acDataBase[aiLineIndex[iLoopLine]]);
                }
                iLastDoc = iDocumentIndex;
            }
            iBitMap = iBitMap << 1;
        }
    }
    if (0xFFFF != iLastDoc) {
        return FIND;
    }
    return NOT_FIND;
}
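
/*
 * Illustrative sketch, not called by the solution: the hash-and-linear-probe
 * lookup that SearchSingleWord, SearchWord_AND_OR and SearchWord_NOT all
 * repeat inline, shown once in isolation. The 131-based rolling hash and the
 * open-addressing probe are taken from the code itself; the name
 * demo_probe_lookup is an assumption made for the example.
 */
static int demo_probe_lookup(const DICTIONARY_NODE_ST *pstTable, const char *pcWord)
{
    int iHash = 0;
    int iChar;

    for (iChar = 0; pcWord[iChar] != '\0'; iChar++) {
        iHash = 131 * iHash + pcWord[iChar];        /* rolling hash over the word   */
    }
    iHash = (iHash & 0x7FFFFFFF) % MAX_WORD_NUM;    /* clamp into the table         */

    while (1 == pstTable[iHash].cUsedFlag) {        /* probe until an empty slot    */
        if (0 == strcmp(pcWord, pstTable[iHash].acWord)) {
            return iHash;                           /* found: slot of the term      */
        }
        iHash = (iHash + 1) % MAX_WORD_NUM;
    }
    return -1;                                      /* term not in the vocabulary   */
}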

/* Query "NOT term": print, in full, every document that does not contain vpcWord. */
int SearchWord_NOT(char *vpcWord)
{
    int iLoop;
    int iLoopBit;
    int iLoopLine;
    int iBitMap = 1;
    int iDocument;
    int iDocumentIndex;
    int iLastDoc = 0xFFFF;
    int iHashIndex = 0;
    int iLoopCharInWord;
    int iLineNum = 0;

    for (iLoopCharInWord = 0; iLoopCharInWord < strlen(vpcWord); iLoopCharInWord++) {
        iHashIndex = 131 * iHashIndex + vpcWord[iLoopCharInWord];
    }
    iHashIndex = (iHashIndex & 0x7FFFFFFF) % MAX_WORD_NUM;

    while (1) {
        if (1 == gastDictionary[iHashIndex].cUsedFlag &&
            0 == strcmp(vpcWord, gastDictionary[iHashIndex].acWord)) {
            break;
        }
        if (0 == gastDictionary[iHashIndex].cUsedFlag) {
            /* The term never occurs, so every document matches. */
            for (iLoop = 0; iLoop < gstDataBase.iDocNum; iLoop++) {
                if (iLoop) {
                    printf("----------\n");
                }
                for (iLoopLine = gstDataBase.astDocument[iLoop].iStartLine;
                     iLoopLine <= gstDataBase.astDocument[iLoop].iEndLine; iLoopLine++) {
                    printf("%s\n", gstDataBase.acDataBase[iLoopLine]);
                }
            }
            return FIND;
        }
        iHashIndex = (iHashIndex + 1) % MAX_WORD_NUM;
    }

    /* Complement the document bitmap: a set bit now means "does not contain the term". */
    for (iLoop = 0; iLoop < 4; iLoop++) {
        iBitMap = 1;
        iDocument = (~gastDictionary[iHashIndex].aiBlongedDoc[iLoop]);
        if (0 == iDocument) {
            continue;
        }
        for (iLoopBit = 0; iLoopBit < 32; iLoopBit++) {
            if (iBitMap == (iDocument & iBitMap)) {
                iLineNum = 0;
                iDocumentIndex = 32 * iLoop + iLoopBit;
                if (gstDataBase.iDocNum <= iDocumentIndex) {
                    break;
                }
                if (0xFFFF != iLastDoc) {
                    printf("----------\n");
                }
                iLastDoc = iDocumentIndex;
                for (iLoopLine = gstDataBase.astDocument[iDocumentIndex].iStartLine;
                     iLoopLine <= gstDataBase.astDocument[iDocumentIndex].iEndLine; iLoopLine++) {
                    printf("%s\n", gstDataBase.acDataBase[iLoopLine]);
                }
            }
            iBitMap = iBitMap << 1;
        }
    }
    if (0xFFFF != iLastDoc) {
        return FIND;
    }
    return NOT_FIND;
}

/* Entry point: build the inverted index from the documents, then answer the queries. */
int SearchingtheWebmain(void)
{
    int  iLoop;
    int  iLoopChar;
    int  iLoopCharInWord;
    int  iStrLen;
    int  iLineNum = 0;
    int  iHashIndex;
    int  iQueryNum;
    int  iReturn;
    char acWord[MAX_WORD_LENGTH];
    char acWord_1[MAX_WORD_LENGTH];
    char acWord_2[MAX_WORD_LENGTH];
    char acQueryBuff[MAX_LINE_LENGTH];

    scanf("%d\n", &gstDataBase.iDocNum);
    for (iLoop = 0; iLoop < gstDataBase.iDocNum; iLoop++) {
        gstDataBase.astDocument[iLoop].iStartLine = iLineNum;
        while (1) {
            gets(gstDataBase.acDataBase[iLineNum]);
            if (0 == strcmp("**********", gstDataBase.acDataBase[iLineNum])) {
                break;
            }
            iStrLen = strlen(gstDataBase.acDataBase[iLineNum]);
            iLoopCharInWord = 0;
            iHashIndex = 0;
            /* Tokenize the line; iLoopChar runs up to the terminating '\0' so the
             * last word on the line is flushed as well. */
            for (iLoopChar = 0; iLoopChar <= iStrLen; iLoopChar++) {
                if ('a' <= gstDataBase.acDataBase[iLineNum][iLoopChar] &&
                    'z' >= gstDataBase.acDataBase[iLineNum][iLoopChar]) {
                    acWord[iLoopCharInWord] = gstDataBase.acDataBase[iLineNum][iLoopChar];
                    iHashIndex = 131 * iHashIndex + acWord[iLoopCharInWord];
                    iLoopCharInWord++;
                } else if ('A' <= gstDataBase.acDataBase[iLineNum][iLoopChar] &&
                           'Z' >= gstDataBase.acDataBase[iLineNum][iLoopChar]) {
                    acWord[iLoopCharInWord] = gstDataBase.acDataBase[iLineNum][iLoopChar] + 32; /* to lower case */
                    iHashIndex = 131 * iHashIndex + acWord[iLoopCharInWord];
                    iLoopCharInWord++;
                } else {
                    /* A non-alphabetic character ends the current word, if any. */
                    if (0 != iLoopCharInWord) {
                        acWord[iLoopCharInWord] = 0;
                        iLoopCharInWord++;
                        iHashIndex = (iHashIndex & 0x7FFFFFFF) % MAX_WORD_NUM;
                        InsertIntoDictionary(acWord, iHashIndex, iLoop, iLineNum);
                        iHashIndex = 0;
                        iLoopCharInWord = 0;
                    }
                }
            }
            iLineNum++;
        }
        gstDataBase.astDocument[iLoop].iEndLine = iLineNum - 1;
    }

    scanf("%d\n", &iQueryNum);
    while (iQueryNum--) {
        gets(acQueryBuff);
        acWord_1[0] = 0;
        acWord_2[0] = 0;
        sscanf(acQueryBuff, "%s %s %s", acWord, acWord_1, acWord_2);
        /* Lower-case all three tokens so AND/OR/NOT become and/or/not. */
        for (iLoopCharInWord = 0; iLoopCharInWord < strlen(acWord); iLoopCharInWord++) {
            if ('A' <= acWord[iLoopCharInWord] && 'Z' >= acWord[iLoopCharInWord]) {
                acWord[iLoopCharInWord] += 32;
            }
        }
        for (iLoopCharInWord = 0; iLoopCharInWord < strlen(acWord_1); iLoopCharInWord++) {
            if ('A' <= acWord_1[iLoopCharInWord] && 'Z' >= acWord_1[iLoopCharInWord]) {
                acWord_1[iLoopCharInWord] += 32;
            }
        }
        for (iLoopCharInWord = 0; iLoopCharInWord < strlen(acWord_2); iLoopCharInWord++) {
            if ('A' <= acWord_2[iLoopCharInWord] && 'Z' >= acWord_2[iLoopCharInWord]) {
                acWord_2[iLoopCharInWord] += 32;
            }
        }
        if (0 == acWord_1[0]) {          /* "term"          */
            iReturn = SearchSingleWord(acWord);
        } else if (0 == acWord_2[0]) {   /* "NOT term"      */
            iReturn = SearchWord_NOT(acWord_1);
        } else if ('a' == acWord_1[0]) { /* "term AND term" */
            iReturn = SearchWord_AND_OR(acWord, acWord_2, 1);
        } else if ('o' == acWord_1[0]) { /* "term OR term"  */
            iReturn = SearchWord_AND_OR(acWord, acWord_2, 0);
        }
        if (NOT_FIND == iReturn) {
            printf("Sorry, I found nothing.\n");
        }
        printf("==========\n");
    }
    return 0;
}