Having read through *Web Scraping with Python*, the content is fairly introductory overall: it mostly crawls with urllib, BeautifulSoup, and requests, and only briefly touches on frameworks such as Scrapy. Below are three representative example programs.
Collecting links from Wikipedia pages with urllib and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Find links whose href starts with /wiki/
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A page we have not seen yet: record it and recurse
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
Output
/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_permissions
/wiki/Wikipedia:Requesting_copyright_permission
/wiki/Wikipedia:User_access_levels
/wiki/Wikipedia:Requests_for_adminship
/wiki/Wikipedia:Protection_policy#extended
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
/wiki/Wikipedia:Protection_policy#move
/wiki/Wikipedia:WPPP
/wiki/File:People_icon.svg
/wiki/Special:WhatLinksHere/File:People_icon.svg
/wiki/Help:What_links_here
/wiki/Wikipedia:Policies_and_guidelines
/wiki/Wikipedia:Shortcut
/wiki/Wikipedia:Keyboard_shortcuts
/wiki/Wikipedia:WikiProject_Kansas
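One caveat with the recursive crawl above: it recurses once per new link, so on a site the size of Wikipedia it will eventually hit Python's recursion limit (about 1000 frames by default). A minimal sketch of a depth-capped variant is below; it runs offline against a hypothetical in-memory "site" (a dict standing in for urlopen, with a regex standing in for BeautifulSoup's findAll), so the names SITE and maxDepth are assumptions for illustration, not part of the original program.

```python
import re

# Hypothetical in-memory "site": page path -> HTML body, standing in
# for urlopen + BeautifulSoup on real Wikipedia pages
SITE = {
    "": '<a href="/wiki/A">A</a> <a href="/wiki/B">B</a>',
    "/wiki/A": '<a href="/wiki/B">B</a> <a href="/wiki/C">C</a>',
    "/wiki/B": '<a href="/wiki/A">A</a>',
    "/wiki/C": '',
}

pages = set()

def getLinks(pageUrl, depth=0, maxDepth=3):
    # Stop recursing past a fixed depth instead of running until
    # Python raises RecursionError
    if depth > maxDepth:
        return
    html = SITE.get(pageUrl, "")
    # Same idea as findAll("a", href=re.compile("^(/wiki/)"))
    for newPage in re.findall(r'href="(/wiki/[^"]*)"', html):
        if newPage not in pages:
            pages.add(newPage)
            getLinks(newPage, depth + 1, maxDepth)

getLinks("")
print(sorted(pages))  # ['/wiki/A', '/wiki/B', '/wiki/C']
```

An iterative version with an explicit queue would remove the depth limit entirely; the visited set alone already guarantees each page is fetched at most once.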
A login form example: the form fields are collected in params and sent to the server as an HTTP POST request.
import requests

session = requests.Session()
params = {'username': 'TomatoSir', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print('------------------')
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
Output
Cookie is set to:
{'loggedin': '1', 'username': 'TomatoSir'}
------------------
Going to profile page...
Hey TomatoSir! Looks like you're still logged into the site!
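On the wire, what requests sends for `data=params` is simply the dict form-encoded into the POST body (with a `Content-Type: application/x-www-form-urlencoded` header). A quick sketch with the standard library's urlencode shows the exact body the server receives; no network access is needed:

```python
from urllib.parse import urlencode

# The same form fields as in the example above
params = {'username': 'TomatoSir', 'password': 'password'}

# requests form-encodes the dict into the POST body, roughly like this
body = urlencode(params)
print(body)  # username=TomatoSir&password=password
```

The Session object then stores the `loggedin` and `username` cookies from the response and attaches them to the follow-up GET automatically, which is why the profile page recognizes the user.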
Just for fun: generating text automatically with a Markov chain. The idea is to use bigram (2-gram) frequencies as the probability distribution over the next word, given the current word.
from urllib.request import urlopen
from random import randint

def wordListSum(wordList):
    # Total count of all words that follow a given word
    total = 0
    for word, value in wordList.items():
        total += value
    return total

# Pick the next word at random, weighted by frequency
def retrieveRandomWord(wordList):
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

def buildWordDict(text):
    # Replace newlines with spaces and strip quotation marks
    text = text.replace("\n", " ").replace("\"", "")
    # Pad punctuation with spaces so it is kept as a "word"
    # rather than discarded
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, " " + symbol + " ")
    words = text.split(" ")
    # Filter out empty strings
    words = [word for word in words if word != ""]
    # Build the dictionary, counting 2-gram occurrences
    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict

text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
wordDict = buildWordDict(text)

# Generate the Markov chain
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])
print(chain)
Output
I deem the framers of every patriot . And although there was intended to be to me in principle of the spoils and I have been known to their affections changed . Amongst the Constitution has produced . I can unmake , and knowing the days of government and a reference to create or classed with whose situation could not appear to a full participation in as well understood , they looked with every other consequences than a necessary burdens to be applied upon their hoards and his objections . But with these grants of correction . It would ,
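The same bigram machinery is easier to see on a toy sentence than on a full speech. The sketch below re-implements the two core functions in compact form (the cleanup steps are dropped, and the sample text is an invented one chosen so that every word has a successor), and seeds the RNG so the walk is reproducible:

```python
import random

def buildWordDict(text):
    # Same 2-gram counting as the listing above, minus the text cleanup
    words = text.split()
    wordDict = {}
    for i in range(1, len(words)):
        wordDict.setdefault(words[i-1], {})
        wordDict[words[i-1]][words[i]] = wordDict[words[i-1]].get(words[i], 0) + 1
    return wordDict

def retrieveRandomWord(wordList):
    # Weighted random choice: frequent successors are picked more often
    randIndex = random.randint(1, sum(wordList.values()))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word

wordDict = buildWordDict("the cat sat on the mat and the cat ran to the")
# "the" is followed by "cat" twice and "mat" once
print(wordDict["the"])  # {'cat': 2, 'mat': 1}

random.seed(0)
chain = ["the"]
for _ in range(5):
    chain.append(retrieveRandomWord(wordDict[chain[-1]]))
print(" ".join(chain))
```

Starting from "the", the walk picks "cat" with probability 2/3 and "mat" with probability 1/3, which is exactly how the inauguration-speech generator chooses each next word.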