Chinese Segment
Problem Description
A few weeks ago, ZJS was busy with his Graduation Design (GD); after weeks of effort he finally finished it. His GD is on Chinese Word Segmentation.
As we know, Western languages are written with spaces separating the words, while Chinese is written continuously, without separators. Because the word is the basic unit of any language, Word Segmentation (separating the words from each other with a space or other marking) has become the necessary first step in automatic Chinese text analysis.
ZJS's GD uses a "biggest probability" algorithm, described as follows. Before segmentation there is a trusted dictionary with word frequencies: each word takes one line, followed by its frequency, and the words are stored in lexicographic order. To segment a sentence, we find all the words it contains according to the dictionary, enumerate all possible cut paths (word sequences), and choose the best (biggest-probability) path as the output; the key is to find the best path efficiently. For simplicity, we transform the biggest probability into the minimum cost and look for the path whose total cost is smallest.
fee(word) = -log((fre(word) + 1) / MaxFre), where fee(word) is the cost of the word and fre(word) is its frequency in the dictionary. We suppose MaxFre = 5196588. If a word cannot be found in the dictionary, its frequency is 0.
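The formula can be checked directly against the cost table below; a minimal sketch (`fee` is a hypothetical helper name, not part of the problem):

```cpp
#include <cmath>

// Cost of a word given its dictionary frequency, per the statement.
// An unknown word has frequency 0, so its cost is -log(1 / MaxFre).
const double MaxFre = 5196588.0;

double fee(long long fre) {
    return -std::log((fre + 1) / MaxFre);
}
```

For example, fee(2871) ≈ 7.50075 and fee(8076) ≈ 6.46674, matching the values listed for 成 and 时 below.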
For example, a part of the dictionary is like this:
成 2871 成分 160 合 276 合成 21 分 1986 分子 373 结 247 结合 2208 时 8076 子 127
Transforming the frequencies into costs, the result is as follows:
成 7.50075 成分 10.3821 合 9.8395 合成 12.3725 分 7.86913 分子 9.53926 结 9.95008 结合 7.76322 时 6.46674 子 10.6115
"结合成分子时" can be segmented into "结/合/成/分/子/时/", "结合/成/分/子/时/", "结合/成分/子/时/", "结合/成/分子/时/", and so on.
Obviously, the best path of "结合成分子时" is "结合/成/分子/时". Its cost, 31.26997, is the smallest among all paths. Your task is to segment Chinese sentences and output the best path, using "/" as the separation mark. For simplicity, the input contains no punctuation marks and no Western characters.
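As a quick sanity check, the quoted cost is just the sum of the four word costs computed from the frequencies in the example dictionary; a small sketch:

```cpp
#include <cmath>

// Sum of fee(word) over the best path 结合/成/分子/时, using the
// frequencies 结合 2208, 成 2871, 分子 373, 时 8076 from the dictionary.
double best_path_cost() {
    const double MaxFre = 5196588.0;
    auto fee = [&](long long fre) { return -std::log((fre + 1) / MaxFre); };
    return fee(2208) + fee(2871) + fee(373) + fee(8076); // ≈ 31.26997
}
```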
Input
Output
Example Input
10
成 2871
成分 160
合 276
合成 21
分 1986
分子 373
结 247
结合 2208
时 8076
子 127
1
结合成分子时
Example Output
结合/成/分子/时/
Problem summary: first read n, the number of dictionary words; the next n entries each give a word and its fre(word), from which you compute fee(word) with the formula above. Then read m, followed by m lines, each a string containing only Chinese characters; for each string, find the segmentation that minimizes the total fee(word).
Approach: DP. Each time we take a new character, the states to update correspond to the words that end with this character. For example, in 结合成分子时, when we reach 分, the prefix 结合成 yields four new candidate transitions (indices are in bytes; each character occupies two bytes):
1. 分 by itself: dp[i] = dp[i-2] + fee(分)
2. 成+分: dp[i] = dp[i-4] + fee(成分)
3. 合成+分: dp[i] = dp[i-6] + fee(合成分)
4. 结合成+分: dp[i] = dp[i-8] + fee(结合成分)
Notes:
1. If a word is not in the dictionary, we take its cost to be -log(1/MaxFre).
2. Since we must output the actual segmentation, I record a predecessor node for each position.
3. Following note 2, the first segment necessarily has no predecessor; this is the case where all the preceding characters form a single word, and we mark its predecessor as -1.
4. "The word does not exist" means none of the four enumerated candidates is in the dictionary (think it through carefully), i.e., no split ends at this character.
5. Honestly, the problem statement itself is not very clear.
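The recurrence above can be sketched at the token level, which hides the two-byte encoding detail of the full program below (a minimal sketch, not the judge solution; `segment` is a hypothetical name, and multi-character words are only accepted from the dictionary while a lone unknown character falls back to -log(1/MaxFre), mirroring the fallback in the full program):

```cpp
#include <bits/stdc++.h>

// Token-level DP: tok[k] is one Chinese character, freq maps dictionary
// words to their frequencies. dp[i] is the minimum cost of segmenting the
// first i characters; pre[i] records where the last word starts.
std::vector<std::string> segment(const std::vector<std::string>& tok,
                                 const std::map<std::string, long long>& freq) {
    const double MaxFre = 5196588.0;
    int n = tok.size();
    std::vector<double> dp(n + 1, 1e18);
    std::vector<int> pre(n + 1, -1);
    dp[0] = 0.0;
    for (int i = 1; i <= n; ++i) {
        for (int j = std::max(0, i - 4); j < i; ++j) { // words of up to 4 characters
            std::string w;
            for (int k = j; k < i; ++k) w += tok[k];
            auto it = freq.find(w);
            if (it == freq.end() && i - j > 1) continue; // unknown multi-char word
            long long f = (it == freq.end()) ? 0 : it->second;
            double cost = -std::log((f + 1) / MaxFre);
            if (dp[j] + cost < dp[i]) { dp[i] = dp[j] + cost; pre[i] = j; }
        }
    }
    std::vector<std::string> out; // recover the words by walking pre[] backwards
    for (int i = n; i > 0; i = pre[i]) {
        std::string w;
        for (int k = pre[i]; k < i; ++k) w += tok[k];
        out.push_back(w);
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

On the example, segmenting 结/合/成/分/子/时 with the dictionary above yields 结合/成/分子/时, matching the expected output.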
#include <bits/stdc++.h>
using namespace std;

const double M = 5196588.0;
char s[100005];
char w[20];
double dp[100005];
int pre[100005];
int ans[100005];
map<string, double> p; // cost (fee) of each dictionary word

int main()
{
    int n, m, cnt;
    int x;
    int i, j, k;
    scanf("%d", &n);
    for (i = 0; i < n; ++i)
    {
        scanf("%s %d", w, &x);
        p[w] = -log((x + 1) / M);
    }
    scanf("%d", &m);
    while (m--)
    {
        scanf("%s", s);
        int l = strlen(s); // every Chinese character occupies two bytes
        for (i = 0; i < l; i += 2)
        {
            dp[i] = 100000000.0; // "infinity"
            int flag = 0;
            // enumerate words of 1..4 characters ending at character i
            for (j = i; j >= 0 && j >= i - 6; j -= 2)
            {
                cnt = 0;
                for (k = j; k <= i + 1; ++k) w[cnt++] = s[k];
                w[cnt] = 0;
                // dictionary costs are strictly positive, so a value of 0
                // means the word is absent (map default-inserts 0.0)
                if (p[w] == 0.0) continue;
                flag = 1; // at least one dictionary word ends here
                if (j >= 2)
                {
                    if (dp[i] > dp[j - 2] + p[w])
                    {
                        dp[i] = dp[j - 2] + p[w];
                        pre[i] = j - 2;
                    }
                }
                else // the word starts at the beginning of the sentence
                {
                    if (dp[i] > p[w])
                    {
                        dp[i] = p[w];
                        pre[i] = -1;
                    }
                }
            }
            if (!flag) // no dictionary word ends here: force a lone unknown character
            {
                if (i >= 2)
                {
                    dp[i] = dp[i - 2] - log(1.0 / M);
                    pre[i] = i - 2;
                }
                else // first character: no predecessor to add
                {
                    dp[i] = -log(1.0 / M);
                    pre[i] = -1;
                }
            }
        }
        // walk the predecessor chain backwards to collect the cut positions
        i = l - 2;
        cnt = 0;
        while (pre[i] != -1)
        {
            ans[cnt++] = pre[i] + 1;
            i = pre[i];
        }
        cnt--;
        for (i = 0; i < l; i += 2)
        {
            printf("%c%c", s[i], s[i + 1]);
            if (cnt >= 0 && i + 1 == ans[cnt]) { printf("/"); cnt--; }
        }
        printf("/\n");
    }
    return 0;
}