
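Note: this article is excerpted from 《Python网络数据采集》 (Web Scraping with Python) by Ryan Mitchell. The source code that accompanies the book does not run correctly as shipped. After some modification and debugging it produces the expected results, and the analysis process is shared here.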







from urllib.request import urlopen
import re
import string
import operator

def cleanInput(input):
    input = re.sub(r'\n+', " ", input).lower()        # replace newlines with a space
    input = re.sub(r'\[[0-9]*\]', "", input)          # remove citation marks such as [1]
    input = re.sub(r' +', " ", input)                 # collapse runs of spaces into a single space
    input = bytes(input, "UTF-8")                     # encode the text as UTF-8 bytes
    input = input.decode("ascii", "ignore")           # decode as ASCII, dropping non-ASCII characters
    cleanInput = []
    input = input.split(' ')                          # split on spaces
    for item in input:
        item = item.strip(string.punctuation)         # string.punctuation holds all ASCII punctuation characters
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):      # keep words, including the one-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:                  # count how often each n-gram occurs
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),'utf-8')      # read the text from the web
ngrams = getNgrams(content, 2)                       # 2-grams are used here
sortedNGrams = sorted(ngrams.items(), key = operator.itemgetter(1), reverse=True)       # sort by frequency, highest first
print(sortedNGrams)


('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ('of our', 29), ('to be', 26), ('the people', 24), ('from the', 24), ('it is', 23)……
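
The counting logic can also be checked offline on a short string. The following is a minimal sketch, not the book's code: it uses `collections.Counter` in place of the hand-written dictionary, and the names `clean_words`, `count_ngrams`, and `sample` are made up for illustration.

```python
import re
import string
from collections import Counter


def clean_words(text):
    """Lowercase the text, drop [n]-style citation marks, and split it into words."""
    text = re.sub(r"\[[0-9]*\]", "", text.lower())
    words = []
    for token in text.split():              # split() with no argument handles any whitespace
        token = token.strip(string.punctuation)
        if len(token) > 1 or token in ("a", "i"):
            words.append(token)
    return words


def count_ngrams(text, n):
    """Return a Counter mapping each n-gram (words joined by spaces) to its frequency."""
    words = clean_words(text)
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


# Hypothetical sample text, used only to exercise the functions above.
sample = "We the people, of the people, by the people [1]."
bigrams = count_ngrams(sample, 2)
print(bigrams.most_common(1))   # the most frequent 2-gram and its count
```

`Counter.most_common()` returns the entries already sorted by frequency, so the separate `sorted(...)` call with `operator.itemgetter(1)` is not needed in this variant.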

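Among the 2-gram sequences obtained from the speech, "the constitution" looks like its main subject, while "of the", "in the", "to the" and similar sequences carry little meaning, so these common words need to be filtered out. The code below does that.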

from urllib.request import urlopen
import re
import string
import operator

def isCommon(ngram):                                 # return True for common words that carry little meaning
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that",
                   "for", "you", "he", "with", "on", "do", "say", "this", "they", "is", 
                   "an", "at", "but","we", "his", "from", "that", "not", "by", "she", 
                   "or", "as", "what", "go", "their","can", "who", "get", "if", "would",
                   "her", "all", "my", "make", "about", "know", "will","as", "up", "one",
                   "time", "has", "been", "there", "year", "so", "think", "when", "which",
                   "them", "some", "me", "people", "take", "out", "into", "just", "see", 
                   "him", "your", "come", "could", "now", "than", "like", "other", "how", 
                   "then", "its", "our", "two", "more", "these", "want", "way", "look", 
                   "first", "also", "new", "because", "day", "more", "use", "no", "man", 
                   "find", "here", "thing", "give", "many", "well"]
    if ngram in commonWords:
        return True
    return False

def cleanText(input):
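    input = re.sub(r'\n+', " ", input).lower()       # replace newlines with a space
    input = re.sub(r'\[[0-9]*\]', "", input)         # remove citation marks such as [1]
    input = re.sub(r' +', " ", input)                # collapse runs of spaces into a single space
    input = re.sub(r"u\.s\.", "us", input)           # rewrite "u.s." as "us" so it survives as one word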
    input = bytes(input, "UTF-8")                     
    input = input.decode("ascii", "ignore")           
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if not (isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1])):   # skip the 2-gram if either of its words is in commonWords
            if ngramTemp not in output:
                output[ngramTemp] = 0
            output[ngramTemp] += 1
    return output

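def getFirstSentenceContaining(ngram, content):             # find the first sentence containing a core 2-gram; the rationale is that in English the first sentence of a passage usually states what follows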
    sentences = content.split(".")
    for sentence in sentences: 
        if ngram in sentence:
            return sentence
    return ""

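content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')   # read the text from the web, as in the first script
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key = operator.itemgetter(1), reverse = True)
newng = []                                                 # sequences that occur at least three times
for i in sortedNGrams:
    if i[1] > 2:
        newng.append(i)
print(newng)

for i in newng:                                            # print the first sentence containing each high-frequency sequence
    print("#" + getFirstSentenceContaining(i[0], content.lower()) + '\n')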


[('united states', 10), ('general government', 4), ('executive department', 4), ('government should', 3), ('same causes', 3), ('called upon', 3), ('mr jefferson', 3), ('legislative body', 3), ('chief magistrate', 3), ('whole country', 3)]
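
The list above is what remains once every 2-gram containing a common word has been dropped. The filtering step can be sketched on its own as follows. This is an illustration under stated assumptions rather than the article's code: `COMMON_WORDS` is a deliberately shortened, hypothetical stop-word set, and a `set` is used because membership tests on it do not scan the whole collection the way a list lookup does.

```python
from collections import Counter

# Abbreviated, illustrative stop-word set; the commonWords list in the article is much longer.
COMMON_WORDS = {"the", "of", "a", "in", "to", "is", "and", "by"}


def count_meaningful_bigrams(words):
    """Count 2-grams in which neither word is a common word."""
    counts = Counter()
    for first, second in zip(words, words[1:]):
        if first in COMMON_WORDS or second in COMMON_WORDS:
            continue                        # skip 2-grams such as "of the"
        counts[first + " " + second] += 1
    return counts


# Hypothetical sample sentence, already lowercased and free of punctuation.
words = "the constitution of the united states is the law of the united states".split()
result = count_meaningful_bigrams(words)
print(result)
```

Only "united states" survives in this sample, because every other adjacent pair contains at least one word from the stop-word set.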


#the constitution of the united states is the instrument containing this grant of power to the several departments composing the government

#the general government has seized upon none of the reserved rights of the states

#such a one was afforded by the executive department constituted by the constitution

#the presses in the necessary employment of the government should never be used "to clear the guilty or to varnish crime

#it could not but have occurred to the convention that in a country so extensive, embracing so great a variety of soil and climate, and consequently of products, and which from the same causes must ever exhibit a great difference in the amount of the population of its various sections, calling for a great diversity in the employments of the people, that the legislation of the majority might not always justly regard the rights and interests of the minority, and that acts of this character might be passed under an express grant by the words of the constitution, and therefore not within the competency of the judiciary to declare void; that however enlightened and patriotic they might suppose from past experience the members of congress might be, and however largely partaking, in the general, of the liberal feelings of the people, it was impossible to expect that bodies so constituted should not sometimes be controlled by local interests and sectional feelings

#called from a retirement which i had supposed was to continue for the residue of my life to fill the chief executive office of this great and free nation, i appear before you, fellow-citizens, to take the oaths which the constitution prescribes as a necessary qualification for the performance of its duties; and in obedience to a custom coeval with our government and what i believe to be your expectations i proceed to present to you a summary of the principles which will govern me in the discharge of the duties which i shall be called upon to perform

#it may be said, indeed, that the constitution has given to the executive the power to annul the acts of the legislative body by refusing to them his assent

#although the fiat of the people has gone forth proclaiming me the chief magistrate of this glorious union, nothing upon their part remaining to be done, it may be thought that a motive may exist to keep up the delusion under which they may be supposed to have acted in relation to my principles and opinions; and perhaps there may be some in this assembly who have come here either prepared to condemn those i shall now deliver, or, approving them, to doubt the sincerity with which they are now uttered

#on the contrary, it is our duty to encourage them to the extent of our constitutional authority to apply their best means and cheerfully to make all necessary sacrifices and submit to all necessary burdens to fulfill their engagements and maintain their credit, for the character and credit of the several states form a part of the character and credit of the whole country
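
One thing to watch in the sentence lookup: the 2-grams are built from words with punctuation stripped, while the sentences come from the raw text split on ".". A 2-gram that spans a comma or an abbreviation period in the original text may therefore never match. The sketch below shows one possible way around this, by normalizing each sentence the same way before comparing. The names `normalize` and `first_sentence_containing`, the sample `text`, and the sentence-boundary rule are all assumptions made for illustration.

```python
import re
import string


def normalize(text):
    """Lowercase the text and strip punctuation from each word, mirroring the n-gram cleaning."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return " ".join(w for w in words if w)


def first_sentence_containing(ngram, content):
    """Return the first sentence whose normalized form contains the n-gram, or "" if none does."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", content)
    padded_ngram = " " + ngram + " "        # padding avoids matching inside longer words
    for sentence in sentences:
        if padded_ngram in " " + normalize(sentence) + " ":
            return sentence.strip()
    return ""


# Hypothetical sample text with a comma between the words of one 2-gram.
text = ("The general government has limited powers. "
        "The people, however, trusted the general government!")
print(first_sentence_containing("general government", text))
print(first_sentence_containing("people however", text))
```

This simple boundary rule still splits after abbreviations such as "mr.", so a 2-gram that crosses an abbreviation would need extra handling.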




Reference: 《Python网络数据采集》 (Web Scraping with Python), Ryan Mitchell, 人民邮电出版社 (Posts & Telecom Press)





