Data Cleaning with the N-Gram Model

浅笑古今

Article link: https://blog.csdn.net/u012735708/article/details/81195542

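Note: This article is excerpted from Web Scraping with Python by Ryan Mitchell. The source code shipped with the book does not run cleanly as is; after some modification and debugging it produces the expected results, and I am sharing the analysis process here.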

1. Prerequisites

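This article assumes familiarity with regular expressions, basic web scraping, dictionaries, and lists. The code targets Python 3, and a few statements differ slightly from their Python 2 equivalents.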

2. The N-Gram Model

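In natural language processing, an n-gram is a sequence of n consecutive words in a text or utterance. When analyzing natural language, splitting a sentence into n-grams, or looking for frequently used phrases, makes it easy to break the sentence into text fragments. The model can also be used to find the core terms of a text. Generally speaking, the sequences that are repeated most often, that is, mentioned the most times, are treated as the core terms.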
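As a minimal sketch of the definition above, the following function slides a window of n words across a sentence. The function name `word_ngrams` and the sample sentence are illustrative assumptions and do not come from the book.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text (hypothetical helper)."""
    words = text.split()
    # A text of len(words) words holds len(words) - n + 1 windows of size n.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "the quick brown fox jumps"  # illustrative sentence
print(word_ngrams(sample, 2))
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']
```

A five-word sentence yields four 2-grams; if n exceeds the number of words, the result is an empty list.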

3. Code Walkthrough

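The code below uses the n-gram model to analyze a text and extract its core terms. The text is the inaugural address of William Henry Harrison, the ninth President of the United States (available at http://pythonscraping.com/files/inaugurationSpeech.txt).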

from urllib.request import urlopen
import re
import string
import operator

def cleanInput(input):
    input = re.sub(r'\n+', " ", input).lower()        # replace newlines with a space and lowercase the text
    input = re.sub(r'\[[0-9]*\]', "", input)          # remove citation markers such as [1]
    input = re.sub(r' +', " ", input)                 # collapse runs of spaces into a single space
    input = bytes(input, "UTF-8")                     # encode the text as UTF-8 bytes
    input = input.decode("ascii", "ignore")           # decode as ASCII, dropping non-ASCII characters
    cleanedWords = []
    input = input.split(' ')                          # split on spaces
    for item in input:
        item = item.strip(string.punctuation)         # strip leading and trailing punctuation
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):      # keep words, including the single-letter words "a" and "i"
            cleanedWords.append(item)
    return cleanedWords

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:                  # count occurrences of each n-gram
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')     # read the text from the web
ngrams = getNgrams(content, 2)                       # use 2-grams here
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)       # sort by frequency, highest first
print(sortedNGrams)

The result is as follows (there are many entries, so only the top-ranked ones are listed):

('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ('of our', 29), ('to be', 26), ('the people', 24), ('from the', 24), ('it is', 23)...

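Among these 2-grams, "the constitution" looks like the main theme of the speech, whereas "of the", "in the", "to the", and the like carry little meaning, so such common words need to be filtered out. The code follows.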

from urllib.request import urlopen
import re
import string
import operator

def isCommon(ngram):                                 # return True if the word is a common, low-information word
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that",
                   "for", "you", "he", "with", "on", "do", "say", "this", "they", "is", 
                   "an", "at", "but","we", "his", "from", "that", "not", "by", "she", 
                   "or", "as", "what", "go", "their","can", "who", "get", "if", "would",
                   "her", "all", "my", "make", "about", "know", "will","as", "up", "one",
                   "time", "has", "been", "there", "year", "so", "think", "when", "which",
                   "them", "some", "me", "people", "take", "out", "into", "just", "see", 
                   "him", "your", "come", "could", "now", "than", "like", "other", "how", 
                   "then", "its", "our", "two", "more", "these", "want", "way", "look", 
                   "first", "also", "new", "because", "day", "more", "use", "no", "man", 
                   "find", "here", "thing", "give", "many", "well"]
    return ngram in commonWords

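def cleanText(input):
    input = re.sub(r'\n+', " ", input).lower()       # replace newlines with a space and lowercase the text
    input = re.sub(r'\[[0-9]*\]', "", input)         # remove citation markers such as [1]
    input = re.sub(r' +', " ", input)                # collapse runs of spaces into a single space
    input = re.sub(r"u\.s\.", "us", input)           # normalize "u.s." to "us"
    input = bytes(input, "UTF-8")                    # encode the text as UTF-8 bytes
    input = input.decode("ascii", "ignore")          # decode as ASCII, dropping non-ASCII characters
    return input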

def cleanInput(input):
    input = cleanText(input)
    cleanedWords = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanedWords.append(item)
    return cleanedWords

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if any(isCommon(word) for word in ngramTemp.split()):   # skip the n-gram if any of its words is in commonWords
            continue
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

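# Return the first sentence that contains the given core 2-gram. The rationale is that,
# in English, the first sentence of a paragraph often summarizes what follows.
def getFirstSentenceContaining(ngram, content):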
    sentences = content.split(".")
    for sentence in sentences: 
        if ngram in sentence:
            return sentence
    return ""

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
newng = [item for item in sortedNGrams if item[1] > 2]     # keep sequences that occur at least three times
print(newng)

for item in newng:                                         # print the first sentence containing each high-frequency sequence
    print("#" + getFirstSentenceContaining(item[0], content.lower()) + '\n')

The high-frequency sequences in the output are:

[('united states', 10), ('general government', 4), ('executive department', 4), ('government should', 3), ('same causes', 3), ('called upon', 3), ('mr jefferson', 3), ('legislative body', 3), ('chief magistrate', 3), ('whole country', 3)]

The sentences containing the high-frequency sequences are:

#the constitution of the united states is the instrument containing this grant of power to the several departments composing the government

#the general government has seized upon none of the reserved rights of the states

#such a one was afforded by the executive department constituted by the constitution

#the presses in the necessary employment of the government should never be used "to clear the guilty or to varnish crime

#it could not but have occurred to the convention that in a country so extensive, embracing so great a variety of soil and climate, and consequently of products, and which from the same causes must ever exhibit a great difference in the amount of the population of its various sections, calling for a great diversity in the employments of the people, that the legislation of the majority might not always justly regard the rights and interests of the minority, and that acts of this character might be passed under an express grant by the words of the constitution, and therefore not within the competency of the judiciary to declare void; that however enlightened and patriotic they might suppose from past experience the members of congress might be, and however largely partaking, in the general, of the liberal feelings of the people, it was impossible to expect that bodies so constituted should not sometimes be controlled by local interests and sectional feelings

#called from a retirement which i had supposed was to continue for the residue of my life to fill the chief executive office of this great and free nation, i appear before you, fellow-citizens, to take the oaths which the constitution prescribes as a necessary qualification for the performance of its duties; and in obedience to a custom coeval with our government and what i believe to be your expectations i proceed to present to you a summary of the principles which will govern me in the discharge of the duties which i shall be called upon to perform

#it may be said, indeed, that the constitution has given to the executive the power to annul the acts of the legislative body by refusing to them his assent

#although the fiat of the people has gone forth proclaiming me the chief magistrate of this glorious union, nothing upon their part remaining to be done, it may be thought that a motive may exist to keep up the delusion under which they may be supposed to have acted in relation to my principles and opinions; and perhaps there may be some in this assembly who have come here either prepared to condemn those i shall now deliver, or, approving them, to doubt the sincerity with which they are now uttered

#on the contrary, it is our duty to encourage them to the extent of our constitutional authority to apply their best means and cheerfully to make all necessary sacrifices and submit to all necessary burdens to fulfill their engagements and maintain their credit, for the character and credit of the several states form a part of the character and credit of the whole country
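
For comparison, the counting and filtering steps can be written more compactly with `collections.Counter` from the standard library. This is only a sketch under assumptions, not the book's code: the names `STOP_WORDS` and `count_ngrams`, the shortened stop-word set, and the sample text are all made up for illustration, and the sample is an inline string so that the example runs without network access.

```python
import re
import string
from collections import Counter

# Assumption: a shortened stop-word set, for illustration only.
STOP_WORDS = frozenset({"the", "of", "in", "to", "by", "a", "is", "and"})


def count_ngrams(text, n=2):
    """Count n-grams that contain no stop word (hypothetical helper)."""
    text = re.sub(r"\s+", " ", text.lower())          # normalize whitespace, lowercase
    words = [w.strip(string.punctuation) for w in text.split(" ")]
    words = [w for w in words if len(w) > 1 or w in ("a", "i")]
    counts = Counter()
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if not any(w in STOP_WORDS for w in window):  # drop n-grams with a stop word
            counts[" ".join(window)] += 1
    return counts


sample = ("The general government is limited. The general government "
          "acts by the consent of the states.")
print(count_ngrams(sample).most_common(1))
# [('general government', 2)]
```

A `frozenset` gives constant-time membership checks, and `Counter.most_common` returns the entries already sorted by frequency, so the separate `sorted` call is not needed.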