Web Scraping with Python (Python网络数据采集)

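Having read through Web Scraping with Python, my overall impression is that the content is fairly basic. It mainly uses urllib, BeautifulSoup, and requests for scraping, and only touches briefly on frameworks such as Scrapy. Below are three representative example programs.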

Collecting links from Wikipedia pages with urllib and BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

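def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    # Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
    bsObj = BeautifulSoup(html, "html.parser")
    # Find links whose href starts with /wiki/
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A page we have not seen yet: record it, then crawl it in turn
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                # Unbounded recursion: Python's recursion limit (1000 by default)
                # will eventually stop a long crawl with a RecursionError
                getLinks(newPage)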

getLinks("")

Output

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_permissions
/wiki/Wikipedia:Requesting_copyright_permission
/wiki/Wikipedia:User_access_levels
/wiki/Wikipedia:Requests_for_adminship
/wiki/Wikipedia:Protection_policy#extended
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
/wiki/Wikipedia:Protection_policy#move
/wiki/Wikipedia:WPPP
/wiki/File:People_icon.svg
/wiki/Special:WhatLinksHere/File:People_icon.svg
/wiki/Help:What_links_here
/wiki/Wikipedia:Policies_and_guidelines
/wiki/Wikipedia:Shortcut
/wiki/Wikipedia:Keyboard_shortcuts
/wiki/Wikipedia:WikiProject_Kansas
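
Because the recursive version above goes one call deeper for every new page, a long crawl can run into Python's recursion limit. Below is a minimal sketch of the same idea written as a breadth-first loop with a page cap. It is not the book's code: the names `LinkParser`, `extract_links`, `crawl`, and `FAKE_SITE` are made up for illustration, the standard-library `html.parser` stands in for BeautifulSoup so the sketch runs without extra packages, and the `fetch` argument is an assumed callable that returns the HTML for a path (a small in-memory dictionary here instead of Wikipedia).

```python
import re
from collections import deque
from html.parser import HTMLParser

WIKI_LINK = re.compile(r"^/wiki/")


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag whose href starts with /wiki/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and WIKI_LINK.match(href):
            self.links.append(href)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def crawl(start, fetch, max_pages=50):
    """Breadth-first crawl. fetch(path) is a hypothetical callable returning HTML."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        path = queue.popleft()
        visited.append(path)
        for link in extract_links(fetch(path)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "site" used instead of real network requests (illustrative data)
FAKE_SITE = {
    "": '<a href="/wiki/A">A</a><a href="/about">About</a>',
    "/wiki/A": '<a href="/wiki/B">B</a><a href="/wiki/A">self</a>',
    "/wiki/B": '<a href="/wiki/A">A</a>',
}

print(crawl("", lambda path: FAKE_SITE.get(path, "")))
```

To crawl a real site, `fetch` could wrap `urlopen` and decode the response. The `max_pages` cap keeps the run bounded either way.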

Login form example. The form fields are collected in params and sent to the server with a POST request.

import requests

session = requests.Session()

params = {'username': 'TomatoSir', 'password': 'password'}
# Pass the dict as data= so it is sent as the form-encoded POST body
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print('------------------')
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

Output

Cookie is set to:
{'loggedin': '1', 'username': 'TomatoSir'}
------------------
Going to profile page...
Hey TomatoSir! Looks like you're still logged into the site!
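
To make "sending the form with POST" more concrete, here is a small sketch of what the request body looks like on the wire. It uses only the standard library and never contacts the server. It is an illustration of form encoding, not a description of how requests is implemented internally. The variable names `body` and `req` are just for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'username': 'TomatoSir', 'password': 'password'}

# A dict of form fields is serialized as key=value pairs joined by "&"
body = urlencode(params).encode('ascii')
print(body)

# Building the Request does not send anything; attaching data makes it a POST
req = Request("http://pythonscraping.com/pages/cookies/welcome.php", data=body)
print(req.get_method())
```

In the login example, the Session object then keeps the cookies returned by this POST and sends them along with the later GET to profile.php, which is why the second page still sees the user as logged in.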

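Just for fun: generating text automatically with a Markov chain. Each next word is picked at random, with probability proportional to how often it follows the current word in the source text (the bigram frequency).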

from urllib.request import urlopen
from random import randint

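def wordListSum(wordList):
    # Total number of occurrences across all candidate next words
    total = 0  # named total so the built-in sum() is not shadowed
    for word, value in wordList.items():
        total += value
    return total

# Pick a word at random, weighted by its frequency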
def retrieveRandomWord(wordList):
    randIndex = randint(1,wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    # Replace newlines with spaces (so words on adjacent lines do not merge) and drop quotes
    text = text.replace("\n"," ").replace("\"","")

    # Surround punctuation with spaces so each mark becomes its own token
    # instead of being lost
    punctuation = [',','.',';',':']
    for symbol in punctuation:
        text = text.replace(symbol," "+symbol+" ")

    words = text.split(" ")
    # Filter out empty strings
    words = [word for word in words if word != ""]
    # Build the dictionary: wordDict[previous][next] = count
    wordDict = {}
    # Count every 2-gram (pair of consecutive words)
    for i in range(1,len(words)):
        if words[i-1] not in wordDict:
            wordDict[words[i-1]]={}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1

    return wordDict

text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),'utf-8')
wordDict = buildWordDict(text)

# Generate the Markov chain
length = 100
chain = ""
currentWord = "I"
for i in range(0,length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])

print(chain)

Output

I deem the framers of every patriot . And although there was intended to be to me in principle of the spoils and I have been known to their affections changed . Amongst the Constitution has produced . I can unmake , and knowing the days of government and a reference to create or classed with whose situation could not appear to a full participation in as well understood , they looked with every other consequences than a necessary burdens to be applied upon their hoards and his objections . But with these grants of correction . It would , 
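
The frequency-weighted choice at the heart of this example can also be sketched more compactly with the standard library. The version below is an alternative illustration, not the book's code: `build_bigrams` and `next_word` are hypothetical helper names, and the sample sentence is made up so the counts are easy to check by hand. It relies on `random.choices`, which accepts a `weights` argument.

```python
import random
from collections import Counter, defaultdict


def build_bigrams(words):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table


def next_word(table, current, rng=random):
    """Pick a follower of `current`, weighted by how often it was seen."""
    followers = table[current]
    candidates = list(followers)
    weights = list(followers.values())
    return rng.choices(candidates, weights=weights, k=1)[0]


words = "the cat sat on the mat and the cat ran".split()
table = build_bigrams(words)

# "the" is followed by "cat" twice and "mat" once,
# so "cat" should be chosen about two times out of three
print(dict(table["the"]))
print(next_word(table, "the"))
```

As in the original script, a word that only appears at the very end of the text has no followers, so a longer generator would need to handle that case (for example by restarting from a chosen seed word).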