Chinese Segment
Problem Description
A few weeks ago, ZJS was busy with his Graduation Design (GD); after weeks of effort he finally finished it. His GD is on Chinese Word Segmentation.
As we know, Western languages are written with spaces separating the words, while Chinese is written continuously, without separators. Because the word is the basic unit of any language, Word Segmentation (separating the words from each other with a space or other marking) has become the necessary first step in automatic Chinese text analysis.
ZJS's GD uses a "biggest probability" algorithm, described as follows. Before segmentation there is a trusted dictionary with word frequencies: each word takes one line, followed by its frequency, and the words are stored in lexicographic order. To segment a sentence, we find all the words it contains according to the dictionary, enumerate all possible cut paths (word sequences), and choose the best (biggest-probability) path as the output; the key is to find the best path efficiently. For simplicity, we transform the biggest probability into the minimum cost and look for the path whose total cost is smallest.
fee(word) = -log((fre(word) + 1) / MaxFre), where fee(word) is the cost of the word and fre(word) is its frequency in the dictionary. We suppose MaxFre = 5196588. If a word cannot be found in the dictionary, its frequency is 0.
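The formula can be checked directly against the cost table below; a minimal sketch (`fee` is a hypothetical helper name, not part of the problem):

```cpp
#include <cmath>

// Cost of a word given its dictionary frequency, per the statement.
// An unknown word has frequency 0, so its cost is -log(1 / MaxFre).
const double MaxFre = 5196588.0;

double fee(long long fre) {
    return -std::log((fre + 1) / MaxFre);
}
```

For example, fee(2871) ≈ 7.50075 and fee(8076) ≈ 6.46674, matching the values listed for 成 and 时 below.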
For example, a part of the dictionary is like this:
成 2871 成分 160 合 276 合成 21 分 1986 分子 373 结 247 结合 2208 时 8076 子 127
Transforming the frequencies into costs, the result is as follows:
成 7.50075 成分 10.3821 合 9.8395 合成 12.3725 分 7.86913 分子 9.53926 结 9.95008 结合 7.76322 时 6.46674 子 10.6115
"结合成分子时" can be segmented into "结/合/成/分/子/时/", "结合/成/分/子/时/", "结合/成分/子/时/", "结合/成/分子/时/", and so on.
Obviously, the best path of "结合成分子时" is "结合/成/分子/时". Its cost, 31.26997, is the smallest among all paths. Your task is to segment Chinese sentences and output the best path, using "/" as the separation mark. For simplicity, the input contains no punctuation marks and no Western characters.
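As a quick sanity check, the quoted cost is just the sum of the four word costs computed from the frequencies in the example dictionary; a small sketch:

```cpp
#include <cmath>

// Sum of fee(word) over the best path 结合/成/分子/时, using the
// frequencies 结合 2208, 成 2871, 分子 373, 时 8076 from the dictionary.
double best_path_cost() {
    const double MaxFre = 5196588.0;
    auto fee = [&](long long fre) { return -std::log((fre + 1) / MaxFre); };
    return fee(2208) + fee(2871) + fee(373) + fee(8076); // ≈ 31.26997
}
```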
Input
Output
Example Input
10
成 2871
成分 160
合 276
合成 21
分 1986
分子 373
结 247
结合 2208
时 8076
子 127
1
结合成分子时
Example Output
结合/成/分子/时/
Problem summary: first read n, the number of dictionary words; the next n entries each give a word and its fre(word), from which you compute fee(word) with the formula above. Then read m, followed by m lines, each a string containing only Chinese characters; for each string, find the segmentation that minimizes the total fee(word).
Approach: DP. Each time we take a new character, the states to update correspond to the words that end with this character. For example, in 结合成分子时, when we reach 分, the prefix 结合成 yields four new candidate transitions (indices are in bytes; each character occupies two bytes):
1. 分 by itself: dp[i] = dp[i-2] + fee(分)
2. 成+分: dp[i] = dp[i-4] + fee(成分)
3. 合成+分: dp[i] = dp[i-6] + fee(合成分)
4. 结合成+分: dp[i] = dp[i-8] + fee(结合成分)
Notes:
1. If a word is not in the dictionary, we take its cost to be -log(1/MaxFre).
2. Since we must output the actual segmentation, I record a predecessor node for each position.
3. Following note 2, the first segment necessarily has no predecessor; this is the case where all the preceding characters form a single word, and we mark its predecessor as -1.
4. "The word does not exist" means none of the four enumerated candidates is in the dictionary (think it through carefully), i.e., no split ends at this character.
5. Honestly, the problem statement itself is not very clear.
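The recurrence above can be sketched at the token level, which hides the two-byte encoding detail of the full program below (a minimal sketch, not the judge solution; `segment` is a hypothetical name, and multi-character words are only accepted from the dictionary while a lone unknown character falls back to -log(1/MaxFre), mirroring the fallback in the full program):

```cpp
#include <bits/stdc++.h>

// Token-level DP: tok[k] is one Chinese character, freq maps dictionary
// words to their frequencies. dp[i] is the minimum cost of segmenting the
// first i characters; pre[i] records where the last word starts.
std::vector<std::string> segment(const std::vector<std::string>& tok,
                                 const std::map<std::string, long long>& freq) {
    const double MaxFre = 5196588.0;
    int n = tok.size();
    std::vector<double> dp(n + 1, 1e18);
    std::vector<int> pre(n + 1, -1);
    dp[0] = 0.0;
    for (int i = 1; i <= n; ++i) {
        for (int j = std::max(0, i - 4); j < i; ++j) { // words of up to 4 characters
            std::string w;
            for (int k = j; k < i; ++k) w += tok[k];
            auto it = freq.find(w);
            if (it == freq.end() && i - j > 1) continue; // unknown multi-char word
            long long f = (it == freq.end()) ? 0 : it->second;
            double cost = -std::log((f + 1) / MaxFre);
            if (dp[j] + cost < dp[i]) { dp[i] = dp[j] + cost; pre[i] = j; }
        }
    }
    std::vector<std::string> out; // recover the words by walking pre[] backwards
    for (int i = n; i > 0; i = pre[i]) {
        std::string w;
        for (int k = pre[i]; k < i; ++k) w += tok[k];
        out.push_back(w);
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

On the example, segmenting 结/合/成/分/子/时 with the dictionary above yields 结合/成/分子/时, matching the expected output.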
#include <bits/stdc++.h>
using namespace std;

const double M = 5196588.0;
char s[100005];
char w[20];
double dp[100005];
int pre[100005];
int ans[100005];
map<string, double> p; // cost (fee) of each dictionary word

int main()
{
    int n, m, cnt;
    int x;
    int i, j, k;
    scanf("%d", &n);
    for (i = 0; i < n; ++i)
    {
        scanf("%s %d", w, &x);
        p[w] = -log((x + 1) / M);
    }
    scanf("%d", &m);
    while (m--)
    {
        scanf("%s", s);
        int l = strlen(s); // every Chinese character occupies two bytes
        for (i = 0; i < l; i += 2)
        {
            dp[i] = 100000000.0; // "infinity"
            int flag = 0;
            // enumerate words of 1..4 characters ending at character i
            for (j = i; j >= 0 && j >= i - 6; j -= 2)
            {
                cnt = 0;
                for (k = j; k <= i + 1; ++k) w[cnt++] = s[k];
                w[cnt] = 0;
                // dictionary costs are strictly positive, so a value of 0
                // means the word is absent (map default-inserts 0.0)
                if (p[w] == 0.0) continue;
                flag = 1; // at least one dictionary word ends here
                if (j >= 2)
                {
                    if (dp[i] > dp[j - 2] + p[w])
                    {
                        dp[i] = dp[j - 2] + p[w];
                        pre[i] = j - 2;
                    }
                }
                else // the word starts at the beginning of the sentence
                {
                    if (dp[i] > p[w])
                    {
                        dp[i] = p[w];
                        pre[i] = -1;
                    }
                }
            }
            if (!flag) // no dictionary word ends here: force a lone unknown character
            {
                if (i >= 2)
                {
                    dp[i] = dp[i - 2] - log(1.0 / M);
                    pre[i] = i - 2;
                }
                else // first character: no predecessor to add
                {
                    dp[i] = -log(1.0 / M);
                    pre[i] = -1;
                }
            }
        }
        // walk the predecessor chain backwards to collect the cut positions
        i = l - 2;
        cnt = 0;
        while (pre[i] != -1)
        {
            ans[cnt++] = pre[i] + 1;
            i = pre[i];
        }
        cnt--;
        for (i = 0; i < l; i += 2)
        {
            printf("%c%c", s[i], s[i + 1]);
            if (cnt >= 0 && i + 1 == ans[cnt]) { printf("/"); cnt--; }
        }
        printf("/\n");
    }
    return 0;
}