UVa 1597: Searching the Web

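UVa 1597, problem summary: the input gives an integer n followed by n documents, each terminated by a line of ten asterisks (**********). Then come m queries, each of which is either a single word or a Boolean relation between words. For each query, output the lines in which the queried words occur.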

The word “search engine” may not be strange to you. Generally speaking, a search engine searches
the web pages available in the Internet, extracts and organizes the information and responds to users’
queries with the most relevant pages. World famous search engines, like GOOGLE, have become very
important tools for us to use when we visit the web. Such conversations are now common in our daily
life:
“What does the word like ****** mean?”
“Um. . . I am not sure, just google it.”
In this problem, you are required to construct a small search engine. Sounds impossible, does it?
Don’t worry, here is a tutorial teaching you how to organize large collection of texts efficiently and
respond to queries quickly step by step. You don’t need to worry about the fetching process of web
pages, all the web pages are provided to you in text format as the input data. Besides, a lot of queries
are also provided to validate your system.
Modern search engines use a technique called inversion for dealing with very large sets of documents.
The method relies on the construction of a data structure, called an inverted index, which associates
terms (words) to their occurrences in the collection of documents. The set of terms of interest is called
the vocabulary, denoted as V . In its simplest form, an inverted index is a dictionary where each search
key is a term ω ∈ V . The associated value b(ω) is a pointer to an additional intermediate data structure,
called a bucket. The bucket associated with a certain term ω is essentially a list of pointers marking all
the occurrences of ω in the text collection. Each entry in each bucket simply consists of the document
identifier (DID), the ordinal number of the document within the collection and the ordinal line number
of the term’s occurrence within the document.
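As a rough sketch of the structure described above, the inverted index can be modelled in C++ as a map from each term to a sorted set of (document, line) pairs. The names `Posting`, `Bucket`, `InvertedIndex`, `addOccurrence` and `lookup` are my own and do not come from the problem statement.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// One bucket entry: (document id, line number). Names here are illustrative only.
typedef std::pair<int, int> Posting;
// The bucket of a term: every posting where the term occurs, kept sorted and unique.
typedef std::set<Posting> Bucket;
// The inverted index itself: term -> bucket.
typedef std::map<std::string, Bucket> InvertedIndex;

// Record that `term` occurs in document `did` on line `line`.
void addOccurrence(InvertedIndex& index, const std::string& term, int did, int line) {
    index[term].insert(Posting(did, line));
}

// Return a copy of the bucket for `term`, or an empty bucket if the term is unknown.
// Using find() avoids creating an empty entry for every term that is queried.
Bucket lookup(const InvertedIndex& index, const std::string& term) {
    InvertedIndex::const_iterator it = index.find(term);
    if (it == index.end()) {
        return Bucket();
    }
    return it->second;
}
```

Because the bucket is a `std::set`, a word that appears several times on the same line is stored only once, and iterating over the bucket visits the lines in input order.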
Let’s take Figure-1 for an example, which describes the general structure. Assuming that we only
have three documents to handle, shown at the right part in Figure-1; first we need to tokenize the
text for words (blank, punctuations and other non-alphabetic characters are used to separate words)
and construct our vocabulary from terms occurring in the documents. For simplicity, we don’t need to
consider any phrases, only a single word as a term. Furthermore, the terms are case-insensitive (e.g. we
consider “book” and “Book” to be the same term) and we don’t consider any morphological variants
(e.g. we consider “books” and “book”, “protected” and “protect” to be different terms) and hyphenated
words (e.g. “middle-class” is not a single term, but separated into 2 terms “middle” and “class” by
the hyphen). The vocabulary is shown at the left part in Figure-1. Each term of the vocabulary has a
pointer to its bucket. The collection of the buckets is shown at the middle part in Figure-1. Each item
in a bucket records the DID of the term’s occurrence.
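The tokenization rules above (split on anything that is not a letter, ignore case) can be sketched as a small helper. The function name `tokenize` is hypothetical, and the sketch assumes the input is plain ASCII text.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split one line into lowercase terms.
// Every non-alphabetic character (blank, punctuation, digit, hyphen) acts as a separator.
std::vector<std::string> tokenize(const std::string& line) {
    std::vector<std::string> terms;
    std::string current;
    for (std::string::size_type i = 0; i < line.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(line[i]);
        if (std::isalpha(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            terms.push_back(current);   // a separator ends the current term
            current.clear();
        }
    }
    if (!current.empty()) {
        terms.push_back(current);       // flush the last term on the line
    }
    return terms;
}
```

With this rule, "middle-class" yields the two terms "middle" and "class", and "(i.e.," yields "i" and "e".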
After constructing the whole inverted index structure, we may apply it to the queries. The query
is in any of the following formats:
term
term AND term
term OR term
NOT term
A single term can be combined by Boolean operators: ‘AND’, ‘OR’ and ‘NOT’ (‘term1 AND term2’ means
to query the documents including term1 and term2; ‘term1 OR term2’ means to query the documents
including term1 or term2; ‘NOT term1’ means to query the documents not including term1). Terms
are single words as defined above. You are guaranteed that no non-alphabetic characters appear in a
term, and all the terms are in lowercase. Furthermore, some meaningless stop words (common words
such as articles, prepositions, and adverbs, specified to be “the, a, to, and, or, not” in our problem)
will not appear in the query, either.
For each query, the engine based on the constructed inverted index searches the term in the vocabulary,
compares the terms’ bucket information, and then gives the result to user. Now can you construct
the engine?
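At the document level, the three Boolean operators correspond to set intersection, union and difference over document identifiers. The following is a minimal sketch using the standard algorithms. The names `DocSet`, `docsAnd`, `docsOr` and `docsNot` are my own, and this is one possible way to evaluate queries, not the approach prescribed by the problem.

```cpp
#include <algorithm>
#include <iterator>
#include <set>

typedef std::set<int> DocSet;   // sorted set of document ids (illustrative name)

// Documents that contain both terms.
DocSet docsAnd(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// Documents that contain at least one of the terms.
DocSet docsOr(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.begin()));
    return out;
}

// Documents from `all` that do not contain the term.
DocSet docsNot(const DocSet& all, const DocSet& a) {
    DocSet out;
    std::set_difference(all.begin(), all.end(), a.begin(), a.end(),
                        std::inserter(out, out.begin()));
    return out;
}
```

These algorithms require sorted input ranges, which `std::set` guarantees, and each runs in time linear in the sizes of the two sets.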
Figure-1. Inverted Index
Input
The input starts with integer N (0 < N < 100) representing N documents provided. Then the next N
sections are N documents. Each section contains the document content and ends with a single line of
ten asterisks.
**********
You may assume that each line contains no more than 80 characters and the total number of lines
in the N documents will not exceed 1500.
Next, integer M (0 < M ≤ 50000) is given representing the number of queries, followed by M lines,
each query in one line. All the queries correspond to the format described above.
Output
For each query, you need to find the document satisfying the query, and output just the lines within the
documents that include the search term (For a ‘NOT’ query, you need to output the whole document).
You should print the lines in the same order as they appear in the input. Separate different documents
with a single line of 10 dashes.
----------
If no documents matching the query are found, just output a single line: ‘Sorry, I found nothing.’.
The output of each query ends with a single line of 10 equal signs.
==========
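The output rules are easy to get subtly wrong, so here is a small sketch of the formatting step in isolation. It assumes the matching lines have already been grouped per document. The function name `formatResult` and the nested-vector representation are my own choices for illustration.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Format the answer to one query.
// docs[d] holds the lines to print for document d; an empty vector means "no match".
std::string formatResult(const std::vector<std::vector<std::string> >& docs) {
    std::ostringstream out;
    bool printedAny = false;
    for (std::size_t d = 0; d < docs.size(); ++d) {
        if (docs[d].empty()) {
            continue;                    // this document did not match
        }
        if (printedAny) {
            out << "----------\n";       // 10 dashes between documents, never before the first
        }
        for (std::size_t k = 0; k < docs[d].size(); ++k) {
            out << docs[d][k] << '\n';
        }
        printedAny = true;
    }
    if (!printedAny) {
        out << "Sorry, I found nothing.\n";
    }
    out << "==========\n";               // 10 equal signs end every query
    return out.str();
}
```

The separator is written only between two documents that both produced output, so a non-matching document in the middle does not leave a stray line of dashes.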
Sample Input
4
A manufacturer, importer, or seller of
digital media devices may not (1) sell,
or offer for sale, in interstate commerce,
or (2) cause to be transported in, or in a
manner affecting, interstate commerce,
a digital media device unless the device
includes and utilizes standard security
technologies that adhere to the security
system standards.
**********
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from
a middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
**********
Research in analysis (i.e., the evaluation
of the strengths and weaknesses of
computer system) is essential to the
development of effective security, both
for works protected by copyright law
and for information in general. Such
research can progress only through the
open publication and exchange of
complete scientific results
**********
I am very very very happy!
What about you?
**********
6
computer
books AND computer
books OR protected
NOT security
very
slick
Sample Output
want the computer only to write her
---------
computer system) is essential to the
==========
intend to read his books. She might
want the computer only to write her
fees. Books might be the only way she
==========
intend to read his books. She might
fees. Books might be the only way she
---------
for works protected by copyright law
==========
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from
a middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
---------
I am very very very happy!
What about you?
==========
I am very very very happy!
==========
Sorry, I found nothing.
==========

#include<map>
#include<set>
#include<string>
#include<iostream>
#include<algorithm>
#include<sstream>
#include<vector>
#include<cstdio>     // getchar
#include<cctype>     // tolower
#include<iterator>   // inserter

using namespace std;
map<string,set<pair<int,int> > >   id;  // maps each word to the (document index, global line index) pairs where it occurs; lines are numbered across the whole input
int main(){
/*		freopen("data.in","r",stdin);
		freopen("data.out","w",stdout);
*/		int n,i = 0;
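		string s[1605];		// the separator lines are stored too, so leave room for up to 1500 document lines plus 99 separators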
		int f = 0;
		cin>>n;
		set<int > a;  	 	// set of all document indices
		int ri[1505];		// ri[d] = global line index of the asterisk line that ends document d (marks each document's boundary)
		getchar();
		while(1){
			while(getline(cin,s[i])){
				if(s[i]=="**********") {
					a.insert(f);
					ri[f]=i;			 
					f++;
					i++;
					break;
				}
				if(f==n) break;
				stringstream ss(s[i]);
				string word;
				int u = 0;
				while(ss>>word){
					string mood;
					int j = 0;
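					for(int p = 0;p <= (int)word.length();p++){   // small pitfall: a token such as "(i.e" contains two words, i and e; at p == length the character is '\0', which flushes the last word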
						if('A'<=word[p]&&word[p]<='Z'){
							word[p] = tolower(word[p]);
						}
						if('a'<=word[p]&&word[p]<='z'){
								j++;
						}else{
							if(j>0){   // skip empty runs between separators
								mood = word.substr(p-j,j);
								id[mood].insert(make_pair(f,i));   // insert in place instead of copying the whole set
							}
							j = 0;
						}
					/*	if(p==word.length()){
							mood = word.substr(p-j,p);
							pair<int,int > wei(f,i);
							set<pair<int ,int> > ip = id[mood];
							ip.insert(wei);
							id[mood] = ip;
							cout<<mood<<endl;
							mood.clear();
						}*/
					}
				}
				i++;
			/*	set<int>::iterator it = id["books"].begin();
				for(;it != id["books"].end();it++)
				cout<<*it<<endl;*/
			}
			if(f==n) break;
		}	
		int m;
		cin>>m;
		getchar();
		while(m--){
			int first = 0,flag = 0;  // first: whether an earlier document has been printed; flag: whether anything was found
			string fin;
			getline(cin,fin);
			if(fin.find(' ')!=string::npos){   // a space means the query uses AND, OR or NOT; otherwise it is a single word
				string::iterator h = fin.begin();
				h = find(h,fin.end(),' ');	
				int p = h - fin.begin() + 1;
				if(fin[p]=='A'){
					string str1 = fin.substr(0,h-fin.begin());  // str1 and str2 are the words on either side of AND
					string str2 = fin.substr(h-fin.begin()+5,fin.end()-h);
		//			cout<<str1<<endl;
		//			cout<<str2<<endl;
					set<pair<int ,int> > g;  // collects the "coordinates" of the matching lines; a set sorts automatically and holds no duplicates
					for(set<pair<int ,int> >::iterator it = id[str1].begin();it != id[str1].end();it++){
						pair<int,int > k = *it;
						for(set<pair<int ,int> >::iterator y = id[str2].begin();y != id[str2].end();y++){
						pair<int,int > d = *y;
							if(k.first==d.first){
								g.insert(k);
								g.insert(d);
							}
						}
					}
					set<pair<int ,int> >::iterator l;
					for(set<pair<int ,int> >::iterator it = g.begin();it != g.end();it++){
						pair<int,int > k = *it;
						if(it!=g.begin()){
							pair<int,int > x = *l;
							if(k.first>x.first)
							cout<<"----------"<<endl; 		// big pitfall: the sample output shows only 9 dashes, but the problem requires 10
						}
						cout<<s[k.second]<<endl;
						l = it;
						flag = 1;
					}
				}else if(fin[p]=='O'){
					string str1 = fin.substr(0,h-fin.begin());
					string str2 = fin.substr(h-fin.begin()+4,fin.end()-h);	
			//		cout<<str1<<endl;
			//		cout<<str2<<endl;
					set<pair<int ,int> > g;  // same idea, but this set holds the union
					for(set<pair<int ,int> >::iterator it = id[str1].begin();it != id[str1].end();it++){
						pair<int,int > k = *it;
							g.insert(k);
					}
					for(set<pair<int ,int> >::iterator y = id[str2].begin();y != id[str2].end();y++){
						pair<int,int > d = *y;
							g.insert(d);
					}
					set<pair<int ,int> >::iterator l;
					for(set<pair<int ,int> >::iterator it = g.begin();it != g.end();it++){
						pair<int,int > k = *it;
						if(it!=g.begin()){
							pair<int,int > x = *l;
							if(k.first>x.first)
							cout<<"----------"<<endl;
						}
						cout<<s[k.second]<<endl;
						l = it;
						flag = 1;
					}
				}else{
					string str1 = fin.substr(p,fin.end()-fin.begin());
			//		cout<<str1<<endl;	
			//		cout<<str1<<endl;
			//		cout<<str2<<endl;
					set<int > g;
					for(set<pair<int ,int> >::iterator it = id[str1].begin();it != id[str1].end();it++){
						pair<int,int > k = *it;
						g.insert(k.first);	
					}
					set<int > c;
					int first = 0;
					set_difference(a.begin(),a.end(),g.begin(),g.end(),inserter(c,c.begin()));  // all documents minus the documents that contain the NOT word
					for(set<int>::iterator it = c.begin();it != c.end();it++){
						if(*it==0){
							if(first)
							cout<<"----------"<<endl;
							int k = *it;
							for(int i = 0;i < ri[k];i++){
								cout<<s[i]<<endl;
							}
							first = 1;
							flag = 1;
						}else{
							if(first)
							cout<<"----------"<<endl;
							int j = *it;
							for(int i = ri[j-1]+1 ;i < ri[j];i++){
								cout<<s[i]<<endl;
							}
							first = 1;
							flag = 1;
						}
					
					}
				}
			}else{
				string str1 = fin.substr(0,fin.end()-fin.begin());
				set<pair<int ,int> >::iterator l;
				for(set<pair<int ,int> >::iterator it = id[str1].begin();it != id[str1].end();it++){
						pair<int,int > k = *it;
						if(it!=id[str1].begin()){
							pair<int,int > x = *l;
							if(k.first>x.first)
							cout<<"----------"<<endl;
						}
						cout<<s[k.second]<<endl;
						flag = 1;
						l = it;
					}
			}
			if(flag==0){
				cout<<"Sorry, I found nothing."<<endl; 
			}
		cout<<"=========="<<endl;
		}
}

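There are a few small pitfalls, and the code came out rather long. It ran in 280 ms, which is acceptable. I am not very familiar with the C++ STL yet, so I will leave it at that.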
