Source: lecture notes from 韦玮老师's class
1. urllib basics
2. Timeout settings
3. Automatically simulating HTTP requests
Regular-expression homework: last class's assignment was to crawl all the publishers on Douban Read.
import urllib.request
import re
# fetch the Douban Read publishers page and decode it
data=urllib.request.urlopen("https://read.douban.com/provider/all").read()
data=data.decode("utf-8")
# each publisher name sits inside a <div class="name"> tag
pat='<div class="name">(.*?)</div>'
mydata=re.compile(pat).findall(data)
fh=open("f:/result/22/1.txt","w")
for i in range(0,len(mydata)):
    fh.write(mydata[i]+"\n")
fh.close()
The urllib module
Common functions in urllib.request (a short demo follows this list):
urlopen(url): fetches a URL over HTTP and returns a file-like response object
urlretrieve(url,filename=""): downloads a page or resource straight to a local file (e.g. .html, .jpg)
urlcleanup(): clears the cache left behind by urlretrieve
getcode(): method of the response object returned by urlopen; gives the HTTP status code
geturl(): method of the response object; gives the URL that was actually fetched
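A minimal sketch of these calls (the URL and filename are illustrative):
import urllib.request
response=urllib.request.urlopen("http://www.example.com")
print(response.getcode())   # HTTP status code, e.g. 200
print(response.geturl())    # the URL actually fetched, after any redirects
urllib.request.urlretrieve("http://www.example.com",filename="example.html")   # save straight to disk
urllib.request.urlcleanup() # clear the cache urlretrieve leaves behind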
Timeout settings: pass the timeout parameter (in seconds).
The point of timeout: some sites respond quickly, so anything over, say, 2 s can be treated as a timeout; other servers are slow, and only a wait beyond, say, 100 s should count as one.
file=urllib.request.urlopen(url,timeout=1)   # raise an exception if no response within 1 second
Crawler exception handling
HTTPError is a subclass of URLError; it carries both a status code (e.code) and a reason (e.reason).
URLError is the parent class and covers HTTPError; a plain URLError (e.g. a failed connection) has a reason but no status code.
for i in range(0,100):
    try:
        file=urllib.request.urlopen("http://",timeout=1)  # placeholder URL from the notes; substitute a real site
        data=file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred: "+str(e))
import urllib.error
import urllib.request
try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    if hasattr(e,"code"):
        print(e.code)
    if hasattr(e,"reason"):
        print(e.reason)
The result is a 403 Forbidden exception; on a bad connection you may instead see "urlopen error timed out".
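Because HTTPError is a subclass of URLError, the two can also be caught separately; a minimal sketch using the same URL:
import urllib.error
import urllib.request
try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.HTTPError as e:   # subclass first: has both code and reason
    print(e.code,e.reason)
except urllib.error.URLError as e:    # parent class: only guaranteed a reason
    print(e.reason)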
Browser spoofing for crawlers:
Press F12 to open the developer tools, select the Network tab, refresh the page, find the Request Headers, and copy the User-Agent value.
urllib.request.build_opener() creates an opener object; installing it with urllib.request.install_opener(opener) makes it the global default for urlopen.
import urllib.request
url="http://blog.csdn.net/weiwei_pig/article/details/52123738"
# the User-Agent string copied from the browser's request headers
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
data=opener.open(url).read()
# save the fetched page as a local HTML file
fh=open("F:/天善-Python数据分析与挖掘课程/result/22/4.html","wb")
fh.write(data)
fh.close()
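The same spoofing can also be done per request instead of through an opener, by attaching the header to a Request object (same URL and User-Agent as above):
import urllib.request
url="http://blog.csdn.net/weiwei_pig/article/details/52123738"
req=urllib.request.Request(url,headers={"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"})
data=urllib.request.urlopen(req).read()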
Application: crawling a CSDN blog
Difficulties: browser spoofing; crawling every article in a loop.
Approach: fetch the homepage first, use a regex to filter out all the article URLs, then loop over those URLs and download each one to a local file.
import urllib.request
import re
url="http://blog.csdn.net/"
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
urllib.request.install_opener(opener)   # make the spoofed opener the global default
data=urllib.request.urlopen(url).read().decode("utf-8","ignore")
# article links on the homepage sit inside these <h3> tags
pat='<h3 class="tracking-ad" data-mod="popu_254"><a href="(.*?)"'
result=re.compile(pat).findall(data)
for i in range(0,len(result)):
    file="F:/天善-Python数据分析与挖掘课程/result/31/"+str(i)+".html"
    urllib.request.urlretrieve(result[i],filename=file)
    print("Crawl "+str(i+1)+" succeeded")
Classic example:
Crawl all the news articles on Sina News and save each one locally as HTML.
Approach: open the Sina News homepage and inspect the page source; every article link sits inside an <a href="..."> tag, so the URLs can be extracted with a regex.
import urllib.request
import urllib.error
import re
data=urllib.request.urlopen("http://news.sina.com.cn/").read()
data2=data.decode("utf-8","ignore")
pat='href="(http://news.sina.com.cn/.*?)"'
allurl=re.compile(pat).findall(data2)
for i in range(0,len(allurl)):
    try:
        print("Crawl "+str(i))
        thisurl=allurl[i]
        file="F:/天善-Python数据分析与挖掘课程/result/22/sinanews/"+str(i)+".html"
        urllib.request.urlretrieve(thisurl,file)
        print("-------success-------")
    except urllib.error.URLError as e:
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
Baidu keyword search
import urllib.request
keywd='Python'
# build the search URL with the keyword as the wd= query parameter
url="http://www.baidu.com/s?wd="+keywd+"&ie=utf-8&tn=96542061_hao_pg"
req=urllib.request.Request(url)
data=urllib.request.urlopen(req).read()
fh=open("路径.html","wb")   # substitute your own output path
fh.write(data)
fh.close()
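If the keyword contains Chinese (or any other non-ASCII) characters, it has to be percent-encoded before it goes into the URL; a minimal sketch (the keyword is illustrative):
import urllib.request
keywd="韦玮"                    # a non-ASCII keyword
key=urllib.request.quote(keywd) # percent-encode it for the URL
url="http://www.baidu.com/s?wd="+key+"&ie=utf-8"
data=urllib.request.urlopen(url).read()
print(len(data))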
Anti-blocking: proxy servers
import urllib.request
def use_proxy(url,proxy_addr):
    # route HTTP traffic through the given proxy address
    proxy=urllib.request.ProxyHandler({"http":proxy_addr})
    opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data=urllib.request.urlopen(url).read().decode("utf-8","ignore")
    return data
proxy_addr="119.183.220.224:8888"   # use_proxy expects a single address string, not a list
url="http://www.baidu.com"
data=use_proxy(url,proxy_addr)
print(len(data))
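The original note hints at keeping a pool of proxy addresses; a sketch that tries each address in turn until one works (the addresses are illustrative, and use_proxy is the function defined above):
import urllib.error
proxy_pool=["119.183.220.224:8888","127.0.0.1:8080"]
for addr in proxy_pool:
    try:
        page=use_proxy("http://www.baidu.com",addr)
        print(addr+" OK, "+str(len(page))+" bytes")
        break   # stop at the first proxy that responds
    except urllib.error.URLError as e:
        print(addr+" failed: "+str(e))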
Image crawler
import urllib.request
import re
keyname="连衣裙"                    # the search keyword ("dress")
key=urllib.request.quote(keyname)   # percent-encode the Chinese keyword for the URL
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4549.400 QQBrowser/9.7.12900.400")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
urllib.request.install_opener(opener)
for i in range(1,101):
    # s= is the result offset; each page holds 60 items
    url="https://s.taobao.com/list?q="+key+"&cat=16&style=grid&seller_type=taobao&spm=a219r.lm874.1000187.1&bcoffset=12&s="+str(i*60)
    data=urllib.request.urlopen(url).read().decode("utf-8","ignore")
    pat='pic_url":"//(.*?)"'
    imagelist=re.compile(pat).findall(data)
    for j in range(0,len(imagelist)):
        thisimg=imagelist[j]
        thisimgurl="http://"+thisimg    # pic_url values are protocol-relative
        file="E:/文档/python/Practise/taobao"+str(i)+str(j)+".jpg"
        urllib.request.urlretrieve(thisimgurl,filename=file)
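One failed image download will abort the whole loop above; a drop-in variant of the inner loop that skips failures instead (same variables as above, and it needs import urllib.error):
    for j in range(0,len(imagelist)):
        try:
            thisimgurl="http://"+imagelist[j]
            file="E:/文档/python/Practise/taobao"+str(i)+str(j)+".jpg"
            urllib.request.urlretrieve(thisimgurl,filename=file)
        except urllib.error.URLError as e:
            print("image "+str(j)+" failed: "+str(e))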
Automatically handling forms (simulating a POST request)
import urllib.request
import urllib.parse
url="http://www.iqianyue.com/mypost/"
# urlencode builds the form body; passing it as data turns the request into a POST
mydata=urllib.parse.urlencode({
    "name":"ceo@iqianyue.com",
    "pass":"1235jkds"
}).encode("utf-8")
req=urllib.request.Request(url,mydata)
data=urllib.request.urlopen(req).read()
#fh=open("F:/天善-Python数据分析与挖掘课程/result/22/3.html","wb")
#fh.write(data)
#fh.close()
print(data)