I've been working with NLTK for the past three days to get familiar with it, and reading the "Natural Language Processing" book to understand what's going on. I'm curious if someone could clarify the following for me:
Note that the first time you run this command, it is slow because it
gathers statistics about word sequences. Each time you run it, you
will get different output text. Now try generating random text in the
style of an inaugural address or an Internet chat room. Although the
text is random, it re-uses common words and phrases from the source
text and gives us a sense of its style and content. (What is lacking
in this randomly generated text?)
This part of the text, from chapter 1, says only that it "gathers statistics" and that you will get "different output text".
What specifically does generate() do, and how does it work?
This example of generate() uses text3, which is the Bible's Genesis:
In the beginning , between me and thee and in the garden thou mayest
come in unto Noah into the ark , and Mibsam , And said , Is there yet
any portion or inheritance for us , and make thee as Ephraim and as
the sand of the dukes that came with her ; and they were come . Also
he sent forth the dove out of thee , with tabret , and wept upon them
greatly ; and she conceived , and called their names , by their names
after the end of the womb ? And he
Here, the generate() function seems to simply output phrases created by cutting the text at punctuation and randomly reassembling the pieces, yet the result is somewhat readable.
Solution
type(text3) will tell you that text3 is of type nltk.text.Text.
To cite the documentation of Text.generate():
Print random text, generated using a trigram language model.
That means that NLTK has created an N-gram model for the Genesis text, counting each occurrence of sequences of three words so that it can estimate which words are likely to follow any given two words in this text. Because the next word is chosen at random according to those statistics, the output differs from run to run. N-gram models are explained in more detail in chapter 5 of the NLTK book.
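To make the idea concrete, here is a minimal sketch of a trigram text generator in plain Python. It is not NLTK's actual implementation, which may differ in details such as smoothing; the names build_trigram_model and generate, and the short token list, are made up for illustration.

```python
import random
from collections import defaultdict


def build_trigram_model(tokens):
    """Map each pair of consecutive words to the list of words seen after it."""
    model = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        # Repeated successors stay in the list, so frequent ones are
        # proportionally more likely to be picked later.
        model[(w1, w2)].append(w3)
    return model


def generate(model, seed_pair, length=20, rng=None):
    """Extend seed_pair one word at a time by sampling an observed successor."""
    rng = rng or random.Random()
    w1, w2 = seed_pair
    output = [w1, w2]
    while len(output) < length:
        successors = model.get((w1, w2))
        if not successors:
            break  # this word pair never had a successor in the source text
        w3 = rng.choice(successors)
        output.append(w3)
        w1, w2 = w2, w3
    return output


# Illustrative token list; with NLTK you could pass list(text3) instead.
tokens = ("in the beginning god created the heaven and the earth "
          "and the earth was without form").split()
model = build_trigram_model(tokens)
print(" ".join(generate(model, ("in", "the"), length=12)))
```

Every run can produce a different word sequence, but each three-word window in the output occurs somewhere in the source. That is why the generated Genesis text reads locally like the original while making little sense overall.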
See also the answers to this question.