Here k = 2:
int k = 2;
char inputchars[5000000];
char *word[1000000];
int nword = 0;
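First, the algorithm scans the entire input text to produce each word. We treat the array word as a kind of suffix array pointing into the characters, except that its entries start only on word boundaries. The variable nword holds the number of words. We read the file with the following code: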
word[0] = inputchars;
while (scanf("%s", word[nword]) != EOF) {
    /* the next word starts just past this word's terminating null */
    word[nword+1] = word[nword] + strlen(word[nword]) + 1;
    nword++;
}
This appends each word in the file to inputchars, and each word is terminated by the null character that scanf supplies.
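The reading step can be sketched as a small function. This is only an illustration under some assumptions: `read_words`, `MAXCHARS` and `MAXWORDS` are hypothetical names that do not appear in the text, and the function reads from a `FILE *` instead of standard input so that it can be exercised on a temporary file.

```c
#include <stdio.h>
#include <string.h>

#define MAXCHARS 5000000   /* capacity of inputchars, as in the text */
#define MAXWORDS 1000000   /* capacity of word, as in the text */

static char inputchars[MAXCHARS];
static char *word[MAXWORDS];
static int nword = 0;

/* Hypothetical helper: read whitespace-separated words from `in` into
 * inputchars and record where each one starts in word[].
 * Returns the number of words read. */
int read_words(FILE *in)
{
    word[0] = inputchars;
    nword = 0;
    /* Stop at end of input or when the word table is full. Like the
     * loop in the text, this does not guard against overflowing
     * inputchars itself. */
    while (nword < MAXWORDS - 1 && fscanf(in, "%s", word[nword]) == 1) {
        /* The next word starts right after this word's terminating null. */
        word[nword + 1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    return nword;
}
```

After the call, word[i] points at the i-th word, and consecutive words sit back to back in inputchars, separated by single null characters.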
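Second, after reading the input, we sort the word array so that all pointers to the same sequence of k words end up next to each other. The following function does the comparison: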
int wordncmp(char *p, char *q)
{
    int n = k;                      /* words still to be matched */
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)    /* end of the k-th identical word */
            return 0;
    return *p - *q;
}
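While the characters are equal, it scans both strings. Each time it reaches a null character it decrements the counter n, and it returns 0 (equal) once it has seen k identical words. When it finds differing characters, it returns their difference (*p - *q).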
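To see the comparison at work, here is a minimal sketch of the same function that can be called on string literals with embedded null characters. The only change from the text is the `const` qualifier on the parameters, which is an assumption made so that literals can be passed directly.

```c
/* Number of words per phrase; the text uses 2. */
static int k = 2;

/* Compare the k-word phrases starting at p and q. Words are separated
 * by null characters, so the scan continues past the first k-1 nulls
 * and stops at the k-th one. */
int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;            /* k identical words seen */
    return *p - *q;              /* first differing character decides */
}
```

With k = 2, the phrases "the people by" and "the people for" (stored as null-separated words) compare equal, because only the first two words are examined.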
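After reading the input, we append k null characters after the last word (so that the comparison function never runs past the end of the string), print the first k words of the document (to start the random output), and call the sort: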
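for (i = 0; i < k; i++)
    word[nword][i] = 0;             /* k null characters of padding */
for (i = 0; i < k; i++)
    printf("%s\n", word[i]);        /* first k words of the document */
/* sortcmp is not shown in this section; presumably it dereferences its
   two arguments and calls wordncmp */
qsort(word, nword, sizeof(word[0]), sortcmp);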
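The text does not show sortcmp. A plausible sketch, assuming it is just an adapter between qsort and wordncmp, follows; `sort_words` is a hypothetical wrapper added only to make the example callable. qsort passes pointers to the array elements, so the adapter receives `char **` values disguised as `const void *` and must dereference them once.

```c
#include <stdlib.h>

static int k = 2;   /* words per phrase, as in the text */

static int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* Assumed shape of sortcmp: unwrap the element pointers that qsort
 * hands over, then compare the phrases they point to. */
static int sortcmp(const void *a, const void *b)
{
    const char *p = *(char * const *)a;
    const char *q = *(char * const *)b;
    return wordncmp(p, q);
}

/* Hypothetical wrapper: sort nword word pointers by their k-word phrases.
 * The buffer they point into must end with k null characters of padding. */
void sort_words(char **word, int nword)
{
    qsort(word, nword, sizeof(word[0]), sortcmp);
}
```

Phrases that agree in their first k words compare equal, so their relative order after sorting is unspecified; the generator only needs them to be adjacent.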
Our space-efficient data structure now holds a great deal of information about the k-grams in the text. If k is 1 and the input text is "of the people, by the people, for the people", the word array looks like this:
Before sorting:
word[0]: of the people, by the people ...
word[1]: the people, by the people, for ...
word[2]: people, by the people, for the ...
word[3]: by the people, for the people
word[4]: the people, for the people
word[5]: people, for the people
word[6]: for the people
word[7]: the people
word[8]: people
After sorting:
word[0]: by the people, for the people
word[1]: for the people
word[2]: of the people, by the people ...
word[3]: people
word[4]: people, by the people ...
word[5]: people, for the people
word[6]: the people, by the people ...
word[7]: the people
word[8]: the people, for the people
To find the words that follow "the", we look it up in the suffix array and find three choices: "people," twice and "people" once.
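The lookup described above can be sketched as a binary search followed by a scan over the equal entries. `first_match` and `count_followers` are hypothetical helper names; the sketch assumes the word array has already been sorted and that the phrase passed in consists of k null-terminated words.

```c
#include <string.h>

static int k = 1;   /* the example above uses single-word phrases */

static int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* Index of the first entry in sorted word[0..nword-1] that is not
 * less than phrase (nword if there is none). */
int first_match(char **word, int nword, const char *phrase)
{
    int l = -1, u = nword;
    while (l + 1 != u) {
        int m = (l + u) / 2;
        if (wordncmp(word[m], phrase) < 0)
            l = m;
        else
            u = m;
    }
    return u;
}

/* Count the occurrences of phrase that are followed by the word next. */
int count_followers(char **word, int nword, const char *phrase, const char *next)
{
    int count = 0;
    for (int i = first_match(word, nword, phrase);
         i < nword && wordncmp(word[i], phrase) == 0; i++) {
        const char *f = word[i];
        for (int j = 0; j < k; j++)      /* step over the k words of the phrase */
            f += strlen(f) + 1;
        if (strcmp(f, next) == 0)
            count++;
    }
    return count;
}
```

On the sorted array shown above, looking up "the" lands on word[6], and the three matching entries are followed by "people," twice and "people" once.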
Now we can generate nonsense text with the following pseudocode:
phrase = first phrase in input array
loop
    perform a binary search for phrase in word[0..nword-1]   // find the first occurrence of phrase
    for all phrases equal in the first k words                // scan all matching phrases
        select one at random, pointed to by p
    phrase = word following p
    if k-th word of phrase is length 0                        // the phrase is at the end of the document
        break
    print k-th word of phrase
The complete pseudocode implementation is:
phrase = inputchars
for (wordsleft = 10000; wordsleft > 0; wordsleft--)
    l = -1
    u = nword
    while l+1 != u
        m = (l + u) / 2
        if wordncmp(word[m], phrase) < 0
            l = m
        else
            u = m
    for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
        if rand() % (i+1) == 0
            p = word[u+i]
    phrase = skip(p, 1)
    if strlen(skip(phrase, k-1)) == 0
        break
    print skip(phrase, k-1)
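Putting the pieces together, the pseudocode can be turned into C roughly as follows. This is a sketch, not a definitive implementation: `markov`, `MAXCHARS` and `MAXWORDS` are hypothetical names, the `skip` and `sortcmp` bodies are assumptions (the text uses both without defining them), and the function reads from and writes to `FILE *` streams so that it can be tested without standard input.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXCHARS 5000000
#define MAXWORDS 1000000

static int k = 2;
static char inputchars[MAXCHARS];
static char *word[MAXWORDS];
static int nword = 0;

static int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* Assumed adapter between qsort and wordncmp. */
static int sortcmp(const void *a, const void *b)
{
    return wordncmp(*(char * const *)a, *(char * const *)b);
}

/* Assumed definition of skip: advance p past n null-terminated words. */
static char *skip(char *p, int n)
{
    for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

/* Hypothetical driver: read words from in, write at most maxwords
 * generated words (after the first k) to out, one per line.
 * Returns the total number of words written. */
int markov(FILE *in, FILE *out, int maxwords)
{
    int i, l, m, u, wordsleft, printed = 0;
    char *phrase, *p;

    nword = 0;
    word[0] = inputchars;
    while (nword < MAXWORDS - 1 && fscanf(in, "%s", word[nword]) == 1) {
        word[nword + 1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;                 /* k nulls of padding */
    for (i = 0; i < k && i < nword; i++, printed++)
        fprintf(out, "%s\n", word[i]);      /* seed the output */
    qsort(word, nword, sizeof(word[0]), sortcmp);

    phrase = inputchars;
    for (wordsleft = maxwords; wordsleft > 0; wordsleft--) {
        l = -1;
        u = nword;
        while (l + 1 != u) {                /* find first occurrence of phrase */
            m = (l + u) / 2;
            if (wordncmp(word[m], phrase) < 0)
                l = m;
            else
                u = m;
        }
        p = NULL;
        /* Pick one of the equal phrases uniformly at random: the i-th
         * candidate replaces the current choice with probability 1/(i+1). */
        for (i = 0; u + i < nword && wordncmp(phrase, word[u + i]) == 0; i++)
            if (rand() % (i + 1) == 0)
                p = word[u + i];
        if (p == NULL)
            break;                          /* no match, e.g. empty input */
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k - 1)) == 0)
            break;                          /* reached the end of the document */
        fprintf(out, "%s\n", skip(phrase, k - 1));
        printed++;
    }
    return printed;
}
```

A call such as `markov(stdin, stdout, 10000)` would mirror the loop bound used in the pseudocode. The added `u + i < nword` guard and the `p == NULL` check are defensive additions not present in the pseudocode. When every k-word phrase in the input is unique, there is only one choice at each step and the output simply reproduces the input.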