1597 - Searching the Web

Latest recommended article published on 2022-08-26 17:18:23

gaoxiangnumber1

Article tags: iostream command structure google

Article link: https://blog.csdn.net/gaoxiangnumber1/article/details/43878501

The term "search engine" may not be strange to you. Generally speaking, a search engine searches the web pages available on the Internet, extracts and organizes the information, and responds to users' queries with the most relevant pages. World-famous search engines, like GOOGLE, have become very important tools for us when we visit the web. Such conversations are now common in our daily life:

"What does the word like ****** mean?"
"Um... I am not sure, just google it."

In this problem, you are required to construct a small search engine. Sounds impossible, doesn't it? Don't worry, here is a tutorial teaching you, step by step, how to organize a large collection of texts efficiently and respond to queries quickly. You don't need to worry about fetching web pages: all the web pages are provided to you in text format as the input data. In addition, many queries are provided to validate your system.

Modern search engines use a technique called inversion for dealing with very large sets of documents. The method relies on the construction of a data structure, called an inverted index, which associates terms (words) with their occurrences in the collection of documents. The set of terms of interest is called the vocabulary, denoted as V. In its simplest form, an inverted index is a dictionary where each search key is a term ω ∈ V. The associated value b(ω) is a pointer to an additional intermediate data structure, called a bucket. The bucket associated with a certain term ω is essentially a list of pointers marking all the occurrences of ω in the text collection. Each entry in each bucket simply consists of the document identifier (DID), which is the ordinal number of the document within the collection, and the ordinal line number of the term's occurrence within the document.
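As a minimal sketch of the structure described above, the inverted index can be modelled as a dictionary from each term to its bucket, where a bucket is a list of (DID, line number) pairs. The function name `build_inverted_index` and the sample documents are assumptions made for illustration, and terms are split naively on whitespace here.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to its bucket: a list of (DID, line number) entries.

    `documents` is a list of documents, each given as a list of text lines.
    Terms are naive whitespace-separated, lowercased words (an assumption
    made to keep this sketch short).
    """
    index = defaultdict(list)
    for did, lines in enumerate(documents):
        for line_no, line in enumerate(lines):
            for term in line.lower().split():
                index[term].append((did, line_no))
    return index


# Hypothetical two-document collection.
docs = [["the quick fox", "jumps"], ["quick brown dog"]]
index = build_inverted_index(docs)
print(index["quick"])  # [(0, 0), (1, 0)]
```

Both DIDs and line numbers are zero-based in this sketch; only their relative order matters for looking up occurrences.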

Let's take Figure-1 as an example, which describes the general structure. Assume that we only have three documents to handle, shown in the right part of Figure-1. First we need to tokenize the text into words (blanks, punctuation and other non-alphabetic characters are used to separate words) and construct our vocabulary from the terms occurring in the documents. For simplicity, we don't need to consider any phrases, only a single word as a term. Furthermore, the terms are case-insensitive (e.g. we consider "book" and "Book" to be the same term), and we don't consider any morphological variants (e.g. we consider "books" and "book", "protected" and "protect" to be different terms) or hyphenated words (e.g. "middle-class" is not a single term, but is separated into 2 terms, "middle" and "class", by the hyphen). The vocabulary is shown in the left part of Figure-1. Each term of the vocabulary has a pointer to its bucket. The collection of the buckets is shown in the middle part of Figure-1. Each item in a bucket records the DID of the term's occurrence.
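The tokenization rules above (any non-alphabetic character separates words, and terms are case-insensitive) can be sketched with a regular expression. The helper name `tokenize` is an assumption for illustration, not something the problem statement defines.

```python
import re


def tokenize(line):
    """Split a line into lowercase terms.

    Every maximal run of ASCII letters is one term, so blanks, digits,
    punctuation and hyphens all act as separators.
    """
    return [word.lower() for word in re.findall(r"[A-Za-z]+", line)]


print(tokenize("A middle-class Book, (2) books"))
# ['a', 'middle', 'class', 'book', 'books']
```

Note that "book" and "books" remain distinct terms, matching the rule that morphological variants are not merged.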

After constructing the whole inverted index structure, we may apply it to the queries. A query is in any of the following formats:

term

term AND term

term OR term

NOT term

A single term can be combined with the Boolean operators AND, OR and NOT ("term1 AND term2" means to query the documents including both term1 and term2; "term1 OR term2" means to query the documents including term1 or term2; "NOT term1" means to query the documents not including term1). Terms are single words as defined above. You are guaranteed that no non-alphabetic characters appear in a term, and all the terms are in lowercase. Furthermore, some meaningless stop words (common words such as articles, prepositions, and adverbs, specified to be "the, a, to, and, or, not" in our problem) will not appear in the query, either.
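One way to evaluate these query forms is with set operations over document identifiers. The sketch below assumes a hypothetical `doc_terms` list in which `doc_terms[did]` is the set of terms occurring in document `did`; the function name `matching_documents` is likewise an illustration. Because query terms are lowercase and the stop words "and", "or" and "not" never appear as terms, the uppercase operators can be recognized unambiguously.

```python
def matching_documents(query, doc_terms):
    """Return the sorted DIDs of the documents that satisfy `query`.

    `doc_terms[did]` is the set of terms occurring in document `did`.
    """
    parts = query.split()
    all_dids = set(range(len(doc_terms)))

    def containing(term):
        return {did for did in all_dids if term in doc_terms[did]}

    if len(parts) == 1:                # term
        result = containing(parts[0])
    elif parts[0] == "NOT":            # NOT term
        result = all_dids - containing(parts[1])
    elif parts[1] == "AND":            # term AND term
        result = containing(parts[0]) & containing(parts[2])
    else:                              # term OR term
        result = containing(parts[0]) | containing(parts[2])
    return sorted(result)


# Hypothetical collection of three documents.
doc_terms = [{"digital", "media"}, {"media", "security"}, {"book"}]
print(matching_documents("media", doc_terms))                 # [0, 1]
print(matching_documents("digital AND security", doc_terms))  # []
print(matching_documents("NOT media", doc_terms))             # [2]
```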

For each query, the engine, based on the constructed inverted index, searches for the terms in the vocabulary, compares the terms' bucket information, and then gives the result to the user. Now can you construct the engine?
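To compare bucket information, it can help to group a bucket's entries by document first. The following is a small sketch under the assumption that a bucket is a list of (DID, line number) pairs; the helper name `lines_by_document` is hypothetical. Duplicate entries, which arise when a term occurs more than once on the same line, are collapsed.

```python
def lines_by_document(bucket):
    """Group (DID, line number) entries into {DID: sorted line numbers}."""
    grouped = {}
    for did, line_no in bucket:
        grouped.setdefault(did, set()).add(line_no)
    return {did: sorted(nums) for did, nums in sorted(grouped.items())}


# Hypothetical bucket: the term occurs twice on line 2 of document 0.
bucket = [(0, 2), (0, 2), (1, 0), (0, 5)]
print(lines_by_document(bucket))  # {0: [2, 5], 1: [0]}
```

The keys of the result give the documents containing the term, and the values give the lines to print for each of them.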

Input

The input starts with an integer N (0 < N < 100) representing the number of documents provided. The next N sections are the N documents. Each section contains the document content and ends with a single line of ten asterisks.

**********

You may assume that each line contains no more than 80 characters and the total number of lines in the N documents will not exceed 1500.

Next, an integer M (0 < M ≤ 50000) is given, representing the number of queries, followed by M lines, one query per line. All the queries correspond to the format described above.
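The input layout described above could be read along the following lines. This is only a sketch: the function name `parse_input` and the tiny sample text are assumptions, and the input is taken as a single string rather than read from standard input.

```python
def parse_input(text):
    """Split the raw input into (documents, queries).

    Each document is a list of its lines; the terminating line of
    ten asterisks is not stored.
    """
    lines = text.split("\n")
    n = int(lines[0])
    pos = 1
    documents = []
    for _ in range(n):
        doc = []
        while lines[pos] != "*" * 10:
            doc.append(lines[pos])
            pos += 1
        pos += 1  # skip the line of asterisks
        documents.append(doc)
    m = int(lines[pos])
    queries = lines[pos + 1 : pos + 1 + m]
    return documents, queries


# Hypothetical miniature input with two documents and two queries.
sample = "2\nhello world\nfoo\n**********\nbar baz\n**********\n2\nhello\nNOT foo\n"
documents, queries = parse_input(sample)
print(documents)  # [['hello world', 'foo'], ['bar baz']]
print(queries)    # ['hello', 'NOT foo']
```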

Output

For each query, you need to find the documents satisfying the query, and output just the lines within those documents that include the search term (for a NOT query, you need to output the whole document). You should print the lines in the same order as they appear in the input. Separate different documents with a single line of 10 dashes.

----------

If no documents matching the query are found, just output a single line: "Sorry, I found nothing." The output of each query ends with a single line of 10 equal signs.

==========
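The output rules can be sketched as a small formatting helper. It assumes the lines to print have already been selected per matching document, in input order; the name `format_answer` and the parameter `matched` are hypothetical.

```python
def format_answer(matched):
    """Format the answer to one query.

    `matched` holds, for each matching document in input order,
    the list of lines to print for that document.
    """
    if not matched:
        out = ["Sorry, I found nothing."]
    else:
        out = []
        for i, lines in enumerate(matched):
            if i > 0:
                out.append("-" * 10)  # separator between documents
            out.extend(lines)
    out.append("=" * 10)  # every query's output ends with this line
    return "\n".join(out)


print(format_answer([["line a", "line b"], ["line c"]]))
print(format_answer([]))
```

The dashes appear only between documents, never before the first or after the last one, while the line of equal signs closes every answer, including the "nothing found" case.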

Sample Input

A manufacturer, importer, or seller of

digital media devices may not (1) sell,

or offer for sale, in interstate commerce,

or (2) cause to be transported in, or in a

manner affecting, interstate commerce,

a digital media device unless the device

includes and utilizes standard security

technologies that adhere to the security

system standards.

**********