Chinese Segment

Time Limit: 1000MS  Memory Limit: 65536KB
Problem Description

A few weeks ago, ZJS was busy with his Graduation Design (GD); after weeks of effort he finally finished it. His GD is on Chinese word segmentation.

As we know, Western languages separate words with spaces, while Chinese text is written continuously, with no delimiters between words. Since the word is the basic unit of any language, word segmentation (separating words from one another with spaces or other marks) is the necessary first step in automatic Chinese text analysis.

ZJS's GD uses a "biggest probability" algorithm, described as follows. Before segmentation we are given a trusted dictionary with word frequencies: each word takes one line, followed by its frequency, and the words are stored in lexicographic order. To segment a sentence, we find all dictionary words it contains, enumerate all possible segmentation paths (word sequences), and choose the best (biggest probability) path as the output; the key to this approach is finding the best path efficiently. For simplicity, we turn the biggest probability into a minimum cost and look for the path whose total cost is smallest.

fee(word) = -log((fre(word) + 1) / MaxFre), where fee(word) is the cost of the word and fre(word) is its frequency in the dictionary. We assume MaxFre = 5196588. If a word cannot be found in the dictionary, its frequency is 0. (Judging from the sample costs below, log here is the natural logarithm.)
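As a quick check, here is a minimal sketch of the cost computation (the helper name fee and the standalone program are ours, for illustration only):

#include <cstdio>
#include <cmath>

const double MaxFre = 5196588.0;

// fee(word) = -log((fre(word) + 1) / MaxFre), natural logarithm
double fee(double fre) { return -log((fre + 1.0) / MaxFre); }

int main()
{
    printf("%.5f\n", fee(2208)); // 结合 -> 7.76322, matching the cost table below
    printf("%.5f\n", fee(0));    // unknown word -> 15.46351
    return 0;
}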

For example, a part of the dictionary is like this:

		成	2871
		成分	160
		合	276
		合成	21
		分	1986
		分子	373
		结	247
		结合	2208
		时	8076
		子	127

Converting the frequencies into costs, the result is as follows:

	成  	7.50075		成分	10.3821		合	9.8395
	合成	12.3725		分	7.86913		分子	9.53926
	结	9.95008		结合	7.76322		时	6.46674
	子	10.6115

“结合成分子时” can be segmented into “结/合/成/分/子/时/”, “结合/成/分/子/时/”, “结合/成分/子/时/”, “结合/成/分子/时/”, and so on.

Obviously, the best path for “结合成分子时” is “结合/成/分子/时”. Its cost, 31.26997, is the smallest among all paths. Your task is to segment Chinese sentences and output the best path, using “/” as the separation mark. For simplicity, the input contains no punctuation marks and no Western-language characters.
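This agrees with the cost table above: fee(结合) + fee(成) + fee(分子) + fee(时) = 7.76322 + 7.50075 + 9.53926 + 6.46674 = 31.26997.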
 

Input
The first line contains an integer m, the number of words in the dictionary, followed by the m-line dictionary described above; every word contains at most four Chinese characters. Then comes an integer n. Each of the following n lines contains a Chinese sentence with no punctuation, composed only of Chinese characters.
Output
n sentences with separation marks after segmentation, one sentence per line.
Example Input
10
成		2871
成分	160
合		276
合成	21
分		1986
分子	373
结		247
结合	2208
时		8076
子		127
1
结合成分子时
Example Output
结合/成/分子/时/

Problem gist: The first integer gives the number of dictionary words; each of the following lines holds a word and its fre(word) value, from which you compute fee(word) with the formula above. Then comes the number of sentences, followed by that many lines, each a string containing only Chinese characters; the question is how to cut each string so that the sum of the fee(word) values of the pieces is minimized.

Approach: DP. Each time we take a new character, the states to update correspond to the words that end with this new character. For example, in 结合成分子时, when we reach 分, four new states become available on top of the prefix 结合成:

1. 分 by itself (word 分): dp[i] = dp[i-2] + fee(分)

2. 分 + 成 (word 成分): dp[i] = dp[i-4] + fee(成分)

3. 分 + 合成 (word 合成分): dp[i] = dp[i-6] + fee(合成分)

4. 分 + 结合成 (word 结合成分): dp[i] = dp[i-8] + fee(结合成分)
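The four cases generalize to one recurrence (our restatement; indices are byte offsets with two bytes per character, and dp[i-2k] is read as 0 when i-2k < 0):

    dp[i] = min over k in {1,2,3,4} of ( dp[i-2k] + fee(s[i-2k+2 .. i+1]) )

taking only those k for which the k-character word s[i-2k+2 .. i+1] appears in the dictionary; if no such k exists, fall back to the single character alone with fee = -log(1/MaxFre) (note 1 below).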

Notes:

1. If a word is not in the dictionary, we take its cost to be -log(1/MaxFre).

2. What we ultimately need is the way the string is split, so I record a predecessor node for each position.

3. Following 2, the first node of a chain necessarily has no predecessor; this is the case where all the preceding characters form a single word, and we mark that predecessor specially (pre = -1 in the code).

4. "The word does not exist" means that none of the words in the four enumerated cases exists (think it over carefully), i.e., no dictionary word ends at this character and the string cannot be split there.

5. Frankly, the problem statement is not very clear.

#include <bits/stdc++.h>

using namespace std;

const double M = 5196588.0;   // MaxFre
char s[100005];               // current sentence (two bytes per Chinese character)
char w[20];                   // scratch buffer for one candidate word (at most 4 characters)
double dp[100005];            // dp[i]: minimum cost of the prefix whose last character starts at byte i
int pre[100005];              // pre[i]: dp index of the previous word's last character, -1 at the front
int ans[100005];              // split positions collected while walking pre[]
map<string, double> p;        // word -> fee(word)

int main()
{
    int n, m, cnt;
    int x;
    int i, j, k;

    // Read the dictionary and precompute fee(word) = -log((fre(word)+1)/MaxFre).
    scanf("%d", &n);
    for (i = 0; i < n; ++i)
    {
        scanf("%s %d", w, &x);
        p[w] = -log((x + 1) / M);
    }

    scanf("%d", &m);
    while (m--)
    {
        scanf("%s", s);
        int l = strlen(s);
        for (i = 0; i < l; i += 2)      // i: first byte of the current character
        {
            dp[i] = 100000000.0;        // "infinity" by default
            int flag = 0;               // does any dictionary word end at this character?
            // Enumerate words of 1..4 characters ending here; j is the word's first byte.
            for (j = i; j >= 0 && j >= i - 6; j -= 2)
            {
                cnt = 0;
                for (k = j; k <= i + 1; ++k) w[cnt++] = s[k];
                w[cnt] = 0;
                // find() instead of p[w], so missing words are not inserted into the map
                map<string, double>::iterator it = p.find(w);
                if (it == p.end()) continue;   // not a dictionary word, try a longer one
                flag = 1;                      // a dictionary word ends here, so we can split
                if (j >= 2)                    // some prefix remains before the word
                {
                    if (dp[i] > dp[j - 2] + it->second)
                    {
                        dp[i] = dp[j - 2] + it->second;
                        pre[i] = j - 2;
                    }
                }
                else                           // the word covers the whole prefix
                {
                    if (dp[i] > it->second)
                    {
                        dp[i] = it->second;
                        pre[i] = -1;
                    }
                }
            }
            if (!flag)  // no dictionary word ends here: take the character alone, fee = -log(1/M)
            {
                if (i >= 2) { dp[i] = dp[i - 2] - log(1.0 / M); pre[i] = i - 2; }
                else        { dp[i] = -log(1.0 / M);            pre[i] = -1; }   // avoid reading dp[-2]
            }
        }

        // Recover the split points by walking the predecessor chain backwards.
        i = l - 2;
        cnt = 0;
        while (pre[i] != -1)
        {
            ans[cnt++] = pre[i] + 1;   // a '/' goes right after this byte
            i = pre[i];
        }
        cnt--;

        // Print the sentence, inserting '/' after each recorded split point.
        for (i = 0; i < l; i += 2)
        {
            printf("%c%c", s[i], s[i + 1]);
            if (cnt >= 0 && i + 1 == ans[cnt]) { printf("/"); cnt--; }   // guard against ans[-1]
        }
        printf("/\n");  // the expected output ends each sentence with a trailing '/'
    }
    return 0;
}
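For reference: the code assumes a two-byte encoding such as GBK for the Chinese characters (note the i += 2 steps and the %c%c output), so it will not work unmodified on UTF-8 input. Under that assumption, compiling it and feeding it the sample input above should print 结合/成/分子/时/.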




