7-4 Document Distance (PAT ADSAA) (24/35)

这道题只拿了24/35分,我猜问题出在stemming函数里,但是暂时不知道该怎么处理。

#include <string>
#include <iostream>
#include <cstdio>
#include <map>
#include <set>
#include <vector>
#include <cmath>
const int MAXN = 100;

int N, M, sz, t;
char ch;
double result;
std::string title, tmp, a, b;
std::map<std::string, int> titleNbr;
std::map<std::string, int> wordCnt[MAXN];
std::set<std::string> words;
std::vector<int> frequency[MAXN];
int normSquare[MAXN];

std::string stemming(std::string str){
    if(str.size() >= 2 && str.substr(str.size() - 2) == "ed"){
        return str.substr(0, str.size() - 2);
    }
    if(str.size() >= 3 && str.substr(str.size() - 3) == "ies"){
        return str.substr(0, str.size() - 3) + "y";
    }
    if(str.size() >= 3 && str.substr(str.size() - 3) == "ing"){
        return str.substr(0, str.size() - 3);
    }
    if(str.size() >= 2 && str.substr(str.size() - 2) == "es"){
        return str.substr(0, str.size() - 2);
    }
    return str;
}

double calDistance(int m, int n){
    int sum = 0;
    for(int i = 0; i < sz; ++i){
        sum += frequency[m][i] * frequency[n][i];
    }
    return acos(sum * 1.0 / sqrt(normSquare[m] * 1.0) / sqrt(normSquare[n] * 1.0));
}

int main(){
    scanf("%d", &N);
    for(int i = 0; i < N; ++i){
        std::cin >> title;
        titleNbr[title] = i;
        do{
            scanf("%c", &ch);
            if(isalnum(ch)){
                tmp += std::tolower(ch);
            } else{
                if(!tmp.empty()){
                    tmp = stemming(tmp);
                    wordCnt[i][tmp]++;
                    words.insert(tmp);
                    tmp.clear();
                }
            }
        } while(ch != '#');
    }
    sz = words.size();
    for(int i = 0; i < N; ++i){
        normSquare[i] = 0;
        frequency[i].resize(sz);
        int j = 0;
        for(auto it = words.begin(); it != words.end(); ++it, ++j){
            t = wordCnt[i][*it];
            frequency[i][j] = t;
            normSquare[i] += t * t;
        }
    }
    scanf("%d", &M);
    for(int i = 1; i <= M; ++i){
        std::cin >> a >> b;
        printf("Case %d: %.3f\n", i, calDistance(titleNbr[a], titleNbr[b]));
    }
    return 0;
}

题目如下:

Plagiarism is a form of academic dishonesty. To fight with such misconducts, plagiarism checkers are developed to compare any submitted article with all the articles stored in a database. The key is that given two documents, how to judge their similarity, or in other words, document distance? The document distance can be measured with their word metrics. In this project, you are supposed to write a program to calculate the distance of any given pair of documents.

Some important terms are defined as follows:

  1. Word: is a continuous sequence of alphanumeric characters. For example, the phrase "the course data structure is fun" consists of 6 distinct words. Here we can assume that no word contains more than 20 characters.

  2. Word frequency: denoted by FD​(w) is the number of times the word w occurs in a document D.

  3. Document distance metric: is defined to be the inner product of the two vectors containing the word frequencies for all words in two documents. Denote the frequency vector for document Di​ by F(Di​)=(Fi​(w1​),⋯,Fi​(wn​))T, then the metric is given by (F(D1​),F(D2​))=∑i=1n​F1​(wi​)F2​(wi​). In other words, the document distance metric is the projection of vector F(D1​) onto F(D2​), or vice versa.

  4. Angle metric: is the angle between the vectors F(D1​) and F(D2​) which gives an indication of overlap between the two documents D1​ and D2​. The angle can be computed by the following formula:

θ(D1​,D2​)=arccos(∣∣F(D1​)∣∣⋅∣∣F(D2​)∣∣(F(D1​),F(D2​))​)

where θ∈[0,π/2] and ∣∣⋅∣∣ can be any norm of a vector. To be more specific, here we use the 2-norm which is defined by ∣∣F(D)∣∣=(F(D),F(D))​.

Input Specification:

Each input file contains one test case. For each case, the first line of input contains a positive integer N (≤100) which is the number of text files to be processed. Then N blocks follow, each contains a file. The first line of each file contains the title of the article (string of no more than 6 characters without any space), and the last line contains a single character #. After the file blocks, there is a line containing a positive integer M (≤100,000), followed by M lines of inquiries. Each inquiry gives a pair of file names separated by a space. The size of the largest test case is about 1 MB.

Output Specification:

For each inquiry, print in a line Case #: *, where # is the test case number starting from 1 and * is the document distance value accurate up to 3 decimal places.

Note: Your stemming algorithm must be able to handle "es", "ed", "ing", "ies" and must be case-insensitive. Stop words are not supposed to be ignored and must be treated as normal words.

Sample Input:

3
A00
A B C
#
A01
B C D
#
A02
A C
D A
#
2
A00 A01
A00 A02

Sample Output:

Case 1: 0.841
Case 2: 0.785
  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值