Data Cleaning with the N-Gram Model

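Note: This article is adapted from Web Scraping with Python (《Python网络数据采集》) by Ryan Mitchell. The source code that accompanies the book did not run correctly as shipped. After some modification and debugging it produces the expected results, and this post shares that analysis.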

1. Prerequisites

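This article assumes familiarity with regular expressions, basic web scraping, and Python dictionaries and lists. The code targets Python 3; a few statements differ slightly from their Python 2 equivalents.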

2. The N-Gram Model

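In natural language processing, an n-gram is a sequence of n consecutive words in a text or utterance. When analyzing natural language, splitting text into n-grams, or looking for commonly used phrases, makes it easy to break a sentence into smaller fragments. The model can also be used to find the key phrases of a text. Here a key phrase simply means one with a high repetition rate: the phrases mentioned most often are treated as the core of the text.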
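As a minimal sketch of the definition above, the following builds every 2-gram from a short made-up sentence and counts repetitions. The helper name `ngrams` and the sample sentence are illustrative only and are not taken from the book.

```python
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Made-up sample sentence, already lowercased and free of punctuation.
words = "the cat sat on the mat and the cat slept".split()

bigrams = ngrams(words, 2)
counts = Counter(bigrams)

print(bigrams[:3])            # ['the cat', 'cat sat', 'sat on']
print(counts.most_common(1))  # [('the cat', 2)]
```

A list of k words yields k - n + 1 n-grams, so the 10 words here give 9 2-grams; a list shorter than n yields none.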

3. Code Walkthrough

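The n-gram model is applied to a text to extract its key phrases. The text is the inaugural address of William Henry Harrison, the ninth President of the United States. (Source: http://pythonscraping.com/files/inaugurationSpeech.txt)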

from urllib.request import urlopen
import re
import string
import operator

def cleanInput(input):
    input = re.sub(r'\n+', " ", input).lower()        # replace newlines with a space and lowercase
    input = re.sub(r'\[[0-9]*\]', "", input)          # strip citation markers such as [1]
    input = re.sub(r' +', " ", input)                 # collapse runs of spaces into one
    input = bytes(input, "UTF-8")                     # encode as UTF-8 bytes
    input = input.decode("ascii", "ignore")           # decode as ASCII, dropping non-ASCII characters
    cleanInput = []
    input = input.split(' ')                          # split on single spaces
    for item in input:
        item = item.strip(string.punctuation)         # string.punctuation holds all ASCII punctuation
        if len(item) > 1 or item.lower() in ('a', 'i'):   # keep words, including the one-letter words "a" and "i"
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:                  # count occurrences of each n-gram
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')     # fetch the text over HTTP
ngrams = getNgrams(content, 2)                       # use 2-grams here
sortedNGrams = sorted(ngrams.items(), key = operator.itemgetter(1), reverse=True)       # sort by frequency, highest first
print(sortedNGrams)

The result is shown below. (There are many entries, so only the top-ranked ones are listed.)

('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ('of our', 29), ('to be', 26), ('the people', 24), ('from the', 24), ('it is', 23)……
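Printing the whole sorted list is unwieldy. As a hedged alternative, `collections.Counter.most_common` returns only the top k entries. The dictionary below is a hand-copied subset of the counts listed above, standing in for the value returned by `getNgrams`; the name `ngram_counts` is illustrative.

```python
from collections import Counter

# Subset of the counts shown above, standing in for getNgrams(content, 2).
ngram_counts = {
    "of the": 213,
    "in the": 65,
    "to the": 61,
    "by the": 41,
    "the constitution": 34,
}

# most_common(k) sorts by count, highest first, and keeps the first k pairs.
top3 = Counter(ngram_counts).most_common(3)
for phrase, count in top3:
    print(f"{count:>4}  {phrase}")
```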

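Among these 2-grams, "the constitution" looks like a plausible theme of the speech, whereas "of the", "in the", "to the" and similar pairs carry little meaning, so 2-grams built from such common words need to be filtered out. The revised code follows.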
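Before the full listing, the filtering idea can be sketched in isolation: an n-gram is dropped as soon as any one of its words appears in a common-word collection. The names `COMMON_WORDS` and `has_common_word` and the tiny word list are assumptions made for this illustration; the counts are taken from the outputs quoted in this article.

```python
# Small illustrative subset of common words; a real list would be much longer.
COMMON_WORDS = frozenset(["the", "of", "in", "to", "by", "our", "be", "it", "is", "from"])

def has_common_word(ngram, common_words=COMMON_WORDS):
    """Return True if any word of the space-joined n-gram is a common word."""
    return any(word in common_words for word in ngram.split())

counts = {"of the": 213, "the constitution": 34, "united states": 10, "general government": 4}

# Keep only n-grams made entirely of uncommon words.
kept = {ngram: c for ngram, c in counts.items() if not has_common_word(ngram)}
print(kept)  # {'united states': 10, 'general government': 4}
```

Note that this rule also drops "the constitution", because "the" is a common word. A frozenset is used because membership tests on a set do not have to scan the whole collection, unlike a list.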

from urllib.request import urlopen
import re
import string
import operator

def isCommon(ngram):                                 # return True if the word is a common word with little meaning
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that",
                   "for", "you", "he", "with", "on", "do", "say", "this", "they", "is", 
                   "an", "at", "but","we", "his", "from", "that", "not", "by", "she", 
                   "or", "as", "what", "go", "their","can", "who", "get", "if", "would",
                   "her", "all", "my", "make", "about", "know", "will","as", "up", "one",
                   "time", "has", "been", "there", "year", "so", "think", "when", "which",
                   "them", "some", "me", "people", "take", "out", "into", "just", "see", 
                   "him", "your", "come", "could", "now", "than", "like", "other", "how", 
                   "then", "its", "our", "two", "more", "these", "want", "way", "look", 
                   "first", "also", "new", "because", "day", "more", "use", "no", "man", 
                   "find", "here", "thing", "give", "many", "well"]
    return ngram in commonWords

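def cleanText(input):
    input = re.sub(r'\n+', " ", input).lower()        # replace newlines with a space and lowercase
    input = re.sub(r'\[[0-9]*\]', "", input)          # strip citation markers such as [1]
    input = re.sub(r' +', " ", input)                 # collapse runs of spaces into one
    input = re.sub(r"u\.s\.", "us", input)            # normalize "u.s." to "us"
    input = bytes(input, "UTF-8")                     # encode as UTF-8 bytes
    input = input.decode("ascii", "ignore")           # decode as ASCII, dropping non-ASCII characters
    return input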

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if any(isCommon(word) for word in input[i:i+n]):   # skip any n-gram that contains a word from commonWords
            continue
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

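# Find the first sentence that contains a given key 2-gram. The idea is that, in English,
# the first sentence of a passage often states what the following content is about.
def getFirstSentenceContaining(ngram, content):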
    #print(ngram)
    sentences = content.split(".")
    for sentence in sentences: 
        if ngram in sentence:
            return sentence
    return ""

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key = operator.itemgetter(1), reverse = True)
newng = []                                                 # keep the n-grams that occur at least three times
for i in sortedNGrams:
    if i[1] > 2:
        newng.append(i)
print(newng)

for i in newng:                                            # print the first sentence containing each frequent n-gram
    print("#" + getFirstSentenceContaining(i[0], content.lower()) + '\n')

The high-frequency 2-grams are:

[('united states', 10), ('general government', 4), ('executive department', 4), ('government should', 3), ('same causes', 3), ('called upon', 3), ('mr jefferson', 3), ('legislative body', 3), ('chief magistrate', 3), ('whole country', 3)]

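The sentences containing these high-frequency 2-grams are listed below. No sentence appears for 'mr jefferson': most likely the speech writes it as "Mr. Jefferson", and because getFirstSentenceContaining splits on every period and searches the uncleaned text, the phrase never occurs inside a single sentence and an empty string is returned.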

#the constitution of the united states is the instrument containing this grant of power to the several departments composing the government

#the general government has seized upon none of the reserved rights of the states

#such a one was afforded by the executive department constituted by the constitution

#the presses in the necessary employment of the government should never be used "to clear the guilty or to varnish crime

#it could not but have occurred to the convention that in a country so extensive, embracing so great a variety of soil and climate, and consequently of products, and which from the same causes must ever exhibit a great difference in the amount of the population of its various sections, calling for a great diversity in the employments of the people, that the legislation of the majority might not always justly regard the rights and interests of the minority, and that acts of this character might be passed under an express grant by the words of the constitution, and therefore not within the competency of the judiciary to declare void; that however enlightened and patriotic they might suppose from past experience the members of congress might be, and however largely partaking, in the general, of the liberal feelings of the people, it was impossible to expect that bodies so constituted should not sometimes be controlled by local interests and sectional feelings

#called from a retirement which i had supposed was to continue for the residue of my life to fill the chief executive office of this great and free nation, i appear before you, fellow-citizens, to take the oaths which the constitution prescribes as a necessary qualification for the performance of its duties; and in obedience to a custom coeval with our government and what i believe to be your expectations i proceed to present to you a summary of the principles which will govern me in the discharge of the duties which i shall be called upon to perform

#it may be said, indeed, that the constitution has given to the executive the power to annul the acts of the legislative body by refusing to them his assent

#although the fiat of the people has gone forth proclaiming me the chief magistrate of this glorious union, nothing upon their part remaining to be done, it may be thought that a motive may exist to keep up the delusion under which they may be supposed to have acted in relation to my principles and opinions; and perhaps there may be some in this assembly who have come here either prepared to condemn those i shall now deliver, or, approving them, to doubt the sincerity with which they are now uttered

#on the contrary, it is our duty to encourage them to the extent of our constitutional authority to apply their best means and cheerfully to make all necessary sacrifices and submit to all necessary burdens to fulfill their engagements and maintain their credit, for the character and credit of the several states form a part of the character and credit of the whole country
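Splitting on every period also splits abbreviations such as "Mr.", and the search runs on text that still contains punctuation while the n-grams were built from cleaned words. One possible way to make the lookup more forgiving is sketched below. The names `ABBREVIATIONS`, `normalize`, and `first_sentence_containing`, as well as the sample text, are assumptions made for this illustration and do not come from the book.

```python
import string

# Illustrative, not exhaustive: abbreviations whose periods should not end a sentence.
ABBREVIATIONS = {"mr.": "mr", "mrs.": "mrs", "u.s.": "us"}

def normalize(sentence):
    """Lowercase a sentence and strip punctuation from the ends of each word."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return " ".join(w for w in words if w)

def first_sentence_containing(ngram, content):
    text = content.lower()
    for abbr, plain in ABBREVIATIONS.items():
        text = text.replace(abbr, plain)   # remove abbreviation periods before splitting
    for sentence in text.split("."):
        if ngram in normalize(sentence):   # compare against punctuation-free words
            return sentence.strip()
    return ""

# Made-up sample text, not a quotation from the speech.
sample = "The people spoke. It was the opinion of Mr. Jefferson, that power corrupts. The end."
print(first_sentence_containing("mr jefferson", sample))
# it was the opinion of mr jefferson, that power corrupts
```

This remains a simple heuristic: the abbreviation table has to be maintained by hand, and the substring test can still match across partial words.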

4. Summary

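First the useful words are selected by discarding common connecting words, then the remaining key phrases are counted and sorted, and finally the sentences that contain those key phrases are retrieved. This gives a simple way to filter out the core content of a text.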

5. Acknowledgments

Web Scraping with Python (《Python网络数据采集》), Ryan Mitchell, Posts & Telecom Press (人民邮电出版社)
