Random Text Generator

【Problem Description】

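One method of random text generation is based on the Markov chain algorithm. From any existing text in some language (an English novel, say), it builds a statistical model of how the language is used in that text, and the random text generated from the model has statistical properties similar to the original, that is, a similar writing style.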

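The basic idea is to treat the input as a sequence of overlapping phrases and to split each phrase into two parts: a prefix made up of several words and a suffix consisting of a single word. When generating text, the algorithm relies on the statistics of the original text (for a given prefix, the set of all suffixes that can follow it) and randomly picks one of the suffixes that follow the current prefix. Assuming a prefix length of two words, the Markov chain random text generation algorithm is: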

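Set w1 and w2 to the first two words of the text
Output w1 and w2
Loop:
    Randomly choose w3, one of the suffixes that follow the prefix w1 w2 in the original text
    Output w3
    w1 = w2
    w2 = w3
Repeat the loop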

 

The following example illustrates how the algorithm works. Suppose the original text is:

Show your flowcharts and conceal your tables and I will be mystified. Show your tables and your flowcharts will be obvious.

Here are some of the prefixes of this text together with their suffixes (only a partial list):

 

Prefix              Suffixes
Show your           flowcharts  tables
your flowcharts     and  will
flowcharts and      conceal
flowcharts will     be
your tables         and  and
will be             mystified.  obvious.
be mystified.       Show
be obvious.         (end)
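To see where the rows of this table come from, here is a minimal sketch that gathers the suffixes of one prefix by scanning the word sequence from left to right. The function name collect_suffixes and its parameters are hypothetical, and the linear scan is chosen only for brevity; it is not the lookup structure a full solution would normally use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: gathers, in order of appearance, every word that
   follows the pair (w1, w2) in words[0..n-1]. A pair that closes the text
   gets the marker "(end)" as its suffix. Returns the number of suffixes
   found; only the first max of them are stored in out. */
size_t collect_suffixes(const char *const words[], size_t n,
                        const char *w1, const char *w2,
                        const char *out[], size_t max)
{
    size_t count = 0;

    assert(words != NULL && out != NULL);
    for (size_t i = 0; i + 1 < n; i++) {
        if (strcmp(words[i], w1) == 0 && strcmp(words[i + 1], w2) == 0) {
            if (count < max)
                out[count] = (i + 2 < n) ? words[i + 2] : "(end)";
            count++;   /* duplicates are counted too, e.g. "and and" */
        }
    }
    return count;
}
```

Called on the 21 words of the example text with the prefix Show your, the scan finds the pair at the start of each sentence and stores flowcharts and tables, in that order.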

 

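Based on this text, generation with the Markov chain algorithm starts by outputting Show your and then randomly picks flowcharts or tables. If flowcharts is chosen, the next prefix becomes your flowcharts, and the next suffix must be and or will; if tables is chosen, the next prefix becomes your tables, and the next word must be and. This continues until enough output has been produced or the end marker is encountered when looking up a suffix.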

 

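Write a program that reads an English text from a file and uses the Markov chain algorithm, based on how often fixed-length phrases occur in the text, to generate a new text of at most N words into a given file. The prefix must consist of 2 words, and the maximum word count N is read from standard input.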

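Notes:

  1. To obtain better statistical properties, punctuation and other non-letter characters (such as ’ “ . , ? – ()) are treated as part of a word, so “words” and “words.” are different words. A "word" is therefore defined as a string delimited by whitespace;
  2. The suffixes of a prefix are stored in order of appearance (a suffix is appended even if it is already present);
  3. When processing the text, the end of the file is also recorded as the suffix of some prefix, as in the example above. (When the suffix for the last two prefix words “be obvious.” is read, the end of the file is reached, so the prefix has no real suffix. A special marker can stand in for it: for instance, store a custom string such as “(end)” as the suffix to indicate the end of the file);
  4. For a given prefix, the suffix is chosen at random as follows (if the prefix has only one suffix, that suffix is chosen directly):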

n = (int)(rrand() * N);

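Here N is the total number of suffixes of the prefix (not the word limit N read from standard input), and n is the index of the chosen suffix in that prefix's suffix sequence (counting from 0: n = 0 selects the first suffix, n = 1 the second, and so on). The random number function rrand() is defined as follows: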

double seed = 997;

double rrand()
{
    double lambda = 3125.0;
    double m = 34359738337.0;
    double r;

    seed = fmod(lambda*seed, m); // requires #include <math.h>
    r = seed / m;
    return r;
}

Note: to keep the results deterministic, be sure to use the random number function given here.

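Text generation ends when either of the following holds: 1) the chosen suffix is the end-of-file marker; or 2) the number of generated words reaches the specified maximum. In the implementation, when the end of the file is read, a special marker can be assigned to the suffix string variable suffix.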


【Input Format】

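Open the English text file “article.txt” in the current directory for statistical analysis, and read a positive integer from standard input as the maximum number of words to generate.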
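Since a word is any whitespace-delimited string, the analysis starts by splitting the text into words. The following is a minimal sketch of that step on an in-memory string rather than on the file itself; the name split_words is hypothetical, and the choice to split in place (overwriting separators with terminators) is an assumption made to avoid copying.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: splits text in place into whitespace-delimited
   words. Each separator that ends a word is overwritten with '\0', and
   words[i] points at the start of the i-th word inside text. Returns the
   number of words stored, never more than max. */
size_t split_words(char *text, char *words[], size_t max)
{
    size_t count = 0;

    assert(text != NULL && words != NULL);
    char *p = text;
    while (*p != '\0') {
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;                          /* skip the whitespace run */
        if (*p == '\0' || count == max)
            break;
        words[count++] = p;               /* a word starts here */
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;                          /* punctuation stays in the word */
        if (*p != '\0')
            *p++ = '\0';                  /* terminate the word */
    }
    return count;
}
```

Punctuation is never stripped, so “words” and “words.” come out as two distinct entries, as the word definition requires.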

 

【Output Format】

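Write the generated text to the file “markov.txt” in the current directory. Words are separated by a single space; a trailing space after the last word is optional.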


【Sample Input】

Suppose the file article.txt in the current directory contains:

I will give you some advice about life.

Eat more roughage;

Do more than others expect you to do and do it pains;

Remember what life tells you;

do not take to heart every thing you hear.

do not spend all that you have.

do not sleep as long as you want;

Whenever you say "I love you", please say it honestly;

Whevever you say "I am sorry", please look into the other person's eyes;

Whenever you find your wrongdoing, be quick with reparation!

Whenever you make a phone call smil when you pick up the phone, because someone feel it!

Understand rules completely and change them reasonably;

Remember, the best love is to love others unconditionally rather than make demands on them;

Comment on the success you have attained by looking in the past at the target you wanted to achieve most;

In love and cooking, you must give 100% effort - but expect little appreciation.

The word count entered on standard input is:

1000


【Sample Output】

The generated file markov.txt in the current directory then contains:

I will give you some advice about life. Eat more roughage; Do more than others expect you to do and do it pains; Remember what life tells you; do not take to heart every thing you hear. do not take to heart every thing you hear. do not spend all that you have. do not sleep as long as you want; Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you find your wrongdoing, be quick with reparation! Whenever you say "I am sorry", please look into the other person's eyes; Whenever you say "I am sorry", please look into the other person's eyes; Whenever you make a phone call smil when you pick up the phone, because someone feel it! Understand rules completely and change them reasonably; Remember, the best love is to love others unconditionally rather than make demands on them; Comment on the success you have attained by looking in the past at the target you wanted to achieve most; In love and cooking, you must give 100% effort - but expect little appreciation.

 

 

【Solution】

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<math.h>

#define NHASH 1048576    /* number of buckets; must be a power of two */
#define MAXWORDS 1200000 /* capacity of the word list */

/* One node per distinct two-word prefix; nodes sharing a bucket are chained. */
typedef struct HashTable
{
    char *prefix[2];
    char **suffix;          /* suffixes in order of appearance, duplicates kept */
    int num;                /* number of suffixes */
    unsigned int conflict;  /* second hash of prefix[1], a cheap filter before strcmp */
    struct HashTable *next;
}HashTable;

double seed=997;
char *list[MAXWORDS+1];     /* all words of the article, then the end marker */
char *end="(end)";          /* end-of-file marker, recognized by its address */
HashTable *Hash[NHASH];

void InsertHash(char *pre1,char *pre2,char *suf);
HashTable *HashSearch(char *pre1,char *pre2);
void WriteText(int num,int Nmax);   /* renamed from write() to avoid clashing with the POSIX function */
unsigned int HashFirst(char *pre1,char *pre2);
unsigned int HashConflict(char *pre);
double rrand(void);

int main()
{
    FILE *in;
    int len=0,num=0,Nmax,i,j;
    char *book;

    in=fopen("article.txt","r");
    if(!in)
        return 1;
    fseek(in,0,SEEK_END);
    len=ftell(in);
    fseek(in,0,SEEK_SET);
    if(len<0)
        return 1;

    book=(char *)malloc(sizeof(char)*(len+1));
    if(!book)
        return 1;
    len=fread(book,sizeof(char),len,in);
    fclose(in);

    /* Split into words: maximal runs of printable, non-space ASCII.
       Any other byte is treated as a separator. */
    for(i=0;i<len && num<MAXWORDS;i++)
    {
        if(book[i]>32 && book[i]<127)
        {
            j=i;
            while(i<len && book[i]>32 && book[i]<127)
                i++;
            list[num]=(char *)malloc(i-j+1);
            memcpy(list[num],book+j,i-j);
            list[num][i-j]='\0';
            num++;
        }
    }
    free(book);
    list[num]=end;
    num++;

    if(scanf("%d",&Nmax)!=1)
        return 1;

    /* Every triple of consecutive words yields one (prefix, suffix) entry. */
    for(i=0;i<num-2;i++)
        InsertHash(list[i],list[i+1],list[i+2]);

    WriteText(num,Nmax);
    return 0;
}
/* Append suf to the suffix list of (pre1,pre2), creating the node if needed. */
void InsertHash(char *pre1,char *pre2,char *suf)
{
    unsigned int addr=HashFirst(pre1,pre2);
    HashTable *p=HashSearch(pre1,pre2);

    if(!p)
    {
        p=(HashTable *)malloc(sizeof(HashTable));
        p->prefix[0]=pre1;
        p->prefix[1]=pre2;
        p->num=0;
        p->conflict=HashConflict(pre2);
        p->suffix=NULL;
        p->next=Hash[addr];   /* push onto the front of the chain */
        Hash[addr]=p;
    }
    p->num++;
    p->suffix=(char **)realloc(p->suffix,p->num*sizeof(char *));
    p->suffix[p->num-1]=suf;
}
/* Return the node for (pre1,pre2), or NULL if the prefix was never inserted. */
HashTable *HashSearch(char *pre1,char *pre2)
{
    unsigned int conflict=HashConflict(pre2);
    HashTable *p=Hash[HashFirst(pre1,pre2)];

    for(;p;p=p->next)
        if(conflict==p->conflict && !strcmp(p->prefix[0],pre1) && !strcmp(p->prefix[1],pre2))
            return p;
    return NULL;
}
/* Generate at most Nmax words into markov.txt; list[num-1] is the end marker. */
void WriteText(int num,int Nmax)
{
    FILE *out;
    char *pre1,*pre2;
    int n;

    out=fopen("markov.txt","w");
    if(!out)
        return;

    /* The first two words are copied as-is, within the limits of the text and Nmax. */
    for(n=0;n<2 && n<num-1 && n<Nmax;n++)
        fprintf(out,"%s ",list[n]);

    if(num>=3)
    {
        pre1=list[0];
        pre2=list[1];

        for(Nmax-=2;Nmax>0;Nmax--)
        {
            HashTable *p=HashSearch(pre1,pre2);

            if(!p)
                break;
            if(p->num==1)
                n=0;              /* single suffix: rrand() is not called */
            else
                n=(int)(rrand()*p->num);

            pre1=pre2;
            pre2=p->suffix[n];

            if(pre2==end)
                break;

            fprintf(out,"%s ",pre2);
        }
    }
    fclose(out);
}
/* Bucket index: polynomial hash over both prefix words. */
unsigned int HashFirst(char *pre1,char *pre2)
{
    unsigned int hash=0;

    while(*pre1)
        hash=131*hash+*pre1++;
    while(*pre2)
        hash=131*hash+*pre2++;

    return hash&(NHASH-1);
}
/* Second hash, computed over the second prefix word only. */
unsigned int HashConflict(char *pre)
{
    unsigned int hash=5381;

    while (*pre)
        hash+=(hash<<5)+(*pre++);

    return hash&(NHASH-1);
}
/* Random number function required by the problem statement. */
double rrand(void)
{
    static double lambda=3125.0;
    static double m=34359738337.0;
    double r;
    seed=fmod(lambda*seed,m);
    r=seed/m;

    return r;
}

 

Reposted from: https://www.cnblogs.com/tuoniao/p/10346398.html
