Assuming the text of each blog is a string, and you have a list of such strings in blogs, you can build the matrix as follows.

import re
# Sample input for the following code.
blogs = ["This is a blog.","This is another blog.","Cats? Cats are awesome."]
# This is a list that will contain dictionaries counting the wordcounts for each blog
wordcount = []
# This is a list of all unique words in all blogs.
wordlist = []
# Consider each blog sequentially
for blog in blogs:
    # Remove all the non-alphanumeric, non-whitespace characters,
    # convert to lowercase, and then split the string on runs of whitespace.
    # eg: "That's not mine." -> "Thats not mine" -> ["thats","not","mine"]
    words = re.sub(r"[^\w\s]", "", blog).lower().split()
    # Add a new dictionary to the list. As it is at the end,
    # it can be referred to by wordcount[-1]
    wordcount.append({})
    # Consider each word in the list generated above.
    for word in words:
        # If that word has been encountered before, increment the count
        if word in wordcount[-1]:
            wordcount[-1][word] += 1
        # Else, create a new entry in the dictionary
        else:
            wordcount[-1][word] = 1
        # If it is not already in the list of unique words, add it.
        if word not in wordlist:
            wordlist.append(word)
# We now have wordlist, which has a unique list of all words in all blogs.
# and wordcount, which contains len(blogs) dictionaries, containing word counts.
# Matrix is the table that you need of wordcounts. The number of rows will be
# equal to the number of unique words, and the number of columns = no. of blogs.
matrix = []
# Consider each word in the unique list of words (corresponding to each row)
for word in wordlist:
    # Add as many columns as there are blogs, all initialized to zero.
    matrix.append([0]*len(wordcount))
    # Consider each blog one by one
    for i in range(len(wordcount)):
        # Check if the currently selected word appears in that blog
        if word in wordcount[i]:
            # If yes, increment the counter for that blog/column
            matrix[-1][i] += wordcount[i][word]
# For printing matrix, first generate the column headings
temp = "\t"
for i in range(len(blogs)):
    temp += "Blog "+str(i+1)+"\t"
print(temp)
# Then generate each row, with the word at the start, and tabs between numbers.
for i in range(len(matrix)):
    temp = wordlist[i]+"\t"
    for j in matrix[i]:
        temp += str(j)+"\t"
    print(temp)

Now, matrix[i][j] will contain the number of times the word wordlist[i] appears in the blog blogs[j].
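
For comparison, the same word-by-blog table can be sketched more compactly with collections.Counter from the standard library. This is only an alternative sketch, not a replacement for the approach above: the function name build_matrix is hypothetical, and it assumes Python 3.7+ so that dictionaries keep insertion order.

```python
import re
from collections import Counter


def build_matrix(blogs):
    """Return (wordlist, matrix) where matrix[i][j] is the count of wordlist[i] in blogs[j].

    build_matrix is a hypothetical helper name used for illustration only.
    """
    # One Counter per blog: strip punctuation, lowercase, split on whitespace.
    counts = [Counter(re.sub(r"[^\w\s]", "", blog).lower().split()) for blog in blogs]
    # Unique words in first-seen order (relies on dict ordering, Python 3.7+).
    wordlist = list(dict.fromkeys(word for c in counts for word in c))
    # A Counter returns 0 for a missing key, so no membership check is needed.
    matrix = [[c[word] for c in counts] for word in wordlist]
    return wordlist, matrix


blogs = ["This is a blog.", "This is another blog.", "Cats? Cats are awesome."]
wordlist, matrix = build_matrix(blogs)

# Print the table: header row first, then one row per word.
print("\t" + "\t".join("Blog %d" % (j + 1) for j in range(len(blogs))))
for word, row in zip(wordlist, matrix):
    print(word + "\t" + "\t".join(str(n) for n in row))
```

Because a Counter yields zero for words it has never seen, the inner "if word in wordcount[i]" check from the longer version is not needed here.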