Web Scraping with Python: Study Notes 8

Chapter 8: Reading and Writing Natural Languages

Summarizing Data


def isCommon(ngram):
    # An n-gram counts as "common" if it contains any of these frequent English words
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that", "for", "you",
                   "he", "with", "on", "do", "say", "this", "they", "is", "an", "at", "but", "we", "his", "from", "that",
                   "not", "by", "she", "or", "as", "what", "go", "their", "can", "who", "get", "if", "would", "her", "all",
                   "my", "make", "about", "know", "will", "as", "up", "one", "time", "has", "been", "there", "year",
                   "so", "think", "when", "which", "them", "some", "me", "people", "take", "out", "into", "just",
                   "see", "him", "your", "come", "could", "now", "than", "like", "other", "how", "then", "its",
                   "our", "two", "more", "these", "want", "way", "look", "first", "also", "new", "because",
                   "day", "more", "use", "no", "man", "find", "here", "thing", "give", "many", "well"]

    for word in ngram:
        if word in commonWords:
            return True
    return False
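
As a rough illustration of how a filter like isCommon can be used when summarizing text, the sketch below counts the 2-grams in a short sample string and skips any 2-gram that contains a common word. The names is_common and count_uncommon_ngrams, the shortened COMMON_WORDS set, and the sample sentence are assumptions made for this example; they are not code from the book.

```python
from collections import Counter

# Abbreviated stand-in for the full common-word list (an assumption, for brevity).
COMMON_WORDS = {"the", "be", "and", "of", "a", "in", "to", "it", "is"}

def is_common(ngram):
    """Return True if any word in the n-gram is a common word."""
    return any(word in COMMON_WORDS for word in ngram)

def count_uncommon_ngrams(text, n=2):
    """Count n-grams (tuples of lowercase words) that contain no common word."""
    words = text.lower().split()
    counts = Counter()
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        if not is_common(ngram):
            counts[ngram] += 1
    return counts

# Hypothetical sample text, made up for this example.
sample = "the general government is the general government of the people"
counts = count_uncommon_ngrams(sample)
print(counts.most_common(3))
```

Here only ("general", "government") survives the filter, which is the point: the remaining n-grams are more likely to say something about the topic of the text.
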
Markov Models


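  • The probabilities of all transitions leaving any one state sum to 1, no matter how complex the model is
  • Although the example has only three states to move between, it can generate infinitely many chains of weather states
  • The next state depends only on the current state: if today is Sunny, the probability that tomorrow is Rainy is 10%, regardless of the weather over the previous 100 days
  • If the model is complex enough, the probability of reaching one particular state may be much smaller than the probability of reaching the others; this comes down to the mathematics behind the model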
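
The properties above can be sketched with a tiny weather model. This is a minimal sketch under stated assumptions: only the Sunny -> Rainy probability of 10% comes from these notes; every other number in the table, and the names TRANSITIONS, next_state, and weather_chain, are hypothetical values chosen so that each row sums to 1.

```python
import random

# Hypothetical transition table. Only Sunny -> Rainy = 0.10 is taken from the notes;
# the remaining probabilities are made up so that each row sums to 1.
TRANSITIONS = {
    "Sunny":  {"Sunny": 0.70, "Cloudy": 0.20, "Rainy": 0.10},
    "Cloudy": {"Sunny": 0.30, "Cloudy": 0.40, "Rainy": 0.30},
    "Rainy":  {"Sunny": 0.25, "Cloudy": 0.25, "Rainy": 0.50},
}

def next_state(current, rng):
    """Choose tomorrow's weather using only today's state (the Markov property)."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def weather_chain(start, days, rng):
    """Generate a chain of `days` weather states beginning with `start`."""
    chain = [start]
    for _ in range(days - 1):
        chain.append(next_state(chain[-1], rng))
    return chain

# Every row of outgoing probabilities must sum to 1.
for state, row in TRANSITIONS.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9

print(weather_chain("Sunny", 7, random.Random(0)))
```

Note that next_state never looks at anything except the current state, which is exactly why the previous 100 days do not matter.
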

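It is worth mentioning that Google's PageRank algorithm is partly based on a Markov model, which has been one of the more popular models in recent years. Let's write a simple text generator whose 2-gram dictionary is built from the inauguration speech of William Henry Harrison itself.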

from urllib.request import urlopen
from random import randint

def wordListSum(wordList):
    # Total number of occurrences across all follower words
    total = 0
    for word, value in wordList.items():
        total += value
    return total

def retrieveRandomWord(wordList):
    # Pick a follower word at random, weighted by how often it occurred
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    # Remove newlines and quotes
    text = text.replace("\n", " ")
    text = text.replace("\"", "")

    # Treat punctuation marks as their own "words" so they appear in the chain
    punctuation = [',', '.', ':', ';']
    for symbol in punctuation:
        text = text.replace(symbol, " "+symbol+" ")

    words = text.split(" ")
    # Filter out empty words
    words = [word for word in words if word != ""]

    # Build a two-level dictionary such as
    #        {"America": {"people": 2, "railway": 5, "foreign ministry": 1}}
    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1

    return wordDict

# In Python 3, read() returns bytes, so decode them as UTF-8
text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
wordDict = buildWordDict(text)

# Generate a Markov chain of 100 words, starting from "I"
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord+" "
    currentWord = retrieveRandomWord(wordDict[currentWord])

print(chain)


I shall now so long as those of the arrangement and fostering a people and so far the destruction of Congress should doubt that the master of half a distinguished for one time act not of a misconstruction of ascertaining the United States . Fellow- citizens , indeed , the most essential difference . There are told by their subjects . It is to them . There is an Executive of liberty , how much the power to prevent his disposal . He claims of wealth , and knowing the institutions of interest , quadrupled in this danger to matters connected
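
To see what kind of 2-gram dictionary sits behind generated text like this, here is a small offline sketch that builds one from a made-up ten-word sentence instead of the speech. It uses collections.Counter and random.choices in place of hand-written counting and sampling, which is a substitution on my part; the names build_word_dict and pick_next and the sample sentence are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_word_dict(words):
    """Map each word to a Counter of the words that directly follow it."""
    word_dict = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        word_dict[prev][cur] += 1
    return word_dict

def pick_next(followers, rng):
    """Pick a follower at random, weighted by how often it followed the word."""
    candidates = list(followers)
    weights = [followers[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical sample sentence, made up for this example.
words = "the cat sat on the mat and the cat ran".split()
word_dict = build_word_dict(words)

# "the" is followed by "cat" twice and "mat" once
print(dict(word_dict["the"]))

# "on" has a single follower, so the choice is forced
print(pick_next(word_dict["on"], random.Random(0)))
```

With counts of 2 for "cat" and 1 for "mat", a generator standing on "the" should move to "cat" about two thirds of the time.
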

Six Degrees of Wikipedia: Conclusion


#SolutionFound (an exception class) and constructDict are assumed to be
#defined elsewhere in the chapter's code; they are not shown in these notes
#The link tree may either be empty or contain multiple links
def searchDepth(targetPageId, currentPageId, linkTree, depth):
    if depth == 0:
        #Stop recursing and return, regardless
        return linkTree
    if not linkTree:
        linkTree = constructDict(currentPageId)
        if not linkTree:
            #No links found. Cannot continue at this node
            return {}
    if targetPageId in linkTree.keys():
        print("TARGET "+str(targetPageId)+" FOUND!")
        raise SolutionFound("PAGE: "+str(currentPageId))
    for branchKey, branchValue in linkTree.items():
        try:
            #Recurse here to continue building the tree
            linkTree[branchKey] = searchDepth(targetPageId, branchKey, branchValue, depth-1)
        except SolutionFound as e:
            #Pass the solution up the call stack, one page at a time
            raise SolutionFound("PAGE: "+str(currentPageId))

    return linkTree


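  • The algorithm stops when the recursion reaches the given depth limit
  • If a starting page has no outgoing links, that is, no candidate targetPageId, the function returns
  • If a starting page has an outgoing link that has already been visited, the search goes back to the starting page, picks another of its links, and continues
  • If no A->D path has been found at the current depth, depth is reduced by 1 and the function is called again on each link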
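
The stopping rules above can be sketched on a small in-memory graph. This is a simplified sketch, not the chapter's code: it returns the path instead of raising an exception, it reads links from a dictionary rather than a database, and the names find_path and GRAPH are hypothetical.

```python
def find_path(graph, start, target, depth):
    """Return a list of pages from start to target using at most `depth` links, or None."""
    if start == target:
        return [start]
    if depth == 0:
        # Depth limit reached: stop recursing
        return None
    for neighbor in graph.get(start, []):
        # Recurse with one fewer link allowed
        path = find_path(graph, neighbor, target, depth - 1)
        if path is not None:
            return [start] + path
    # No links, or no link led to the target within the limit
    return None

# Hypothetical link graph: A links to B and C, B links to D, D links back to A.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["A"]}

print(find_path(GRAPH, "A", "D", 2))
print(find_path(GRAPH, "A", "D", 1))
```

The depth limit is also what keeps the D -> A cycle from recursing forever, so the sketch does not need a separate set of visited pages.
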

Natural Language Toolkit

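        Anyone working in natural language processing has probably heard of NLTK, a very powerful third-party Python library for natural language processing. For installation details, visit the NLTK website. After installing it, you need to download its corpora; nltk.book contains nine books, named text1 through text9.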

Lexicographical Analysis with NLTK


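        Note: These are reading notes. The content comes mostly from Web Scraping with Python (by Ryan Mitchell), with some of the code modified. Copyright belongs to the author; please credit the source when reposting.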

