urllib syntax
The urllib.request module
- Request(url, data): builds a request carrying parameters; the returned Request object can be passed straight to urlopen
- urlretrieve(url, path): downloads the page at url directly to a local file
- urlcleanup(): clears the cache left behind by urlretrieve
- urlopen(url[, timeout]): opens url; if timeout is set and exceeded, an <urlopen error timed out> exception is raised. Returns a Response object whose common methods are listed below (a short sketch of these calls follows the list):
  - response.getcode(): status code of the response
  - response.geturl(): the url actually visited
  - response.info(): header information of the page
  - response.read().decode('utf-8'): page source
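A minimal sketch of the calls above, assuming access to https://www.baidu.com and a placeholder local file name:

from urllib import request

# open a page; raises <urlopen error timed out> if the timeout is exceeded
response = request.urlopen('https://www.baidu.com', timeout=5)
print(response.getcode())                # status code, e.g. 200
print(response.geturl())                 # the url actually visited
print(response.info())                   # response headers
html = response.read().decode('utf-8')   # page source

# download the same page to a local file, then clear the cache
request.urlretrieve('https://www.baidu.com', 'baidu.html')
request.urlcleanup()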
- build_opener([proxy, request.HTTPHandler]): builds an opener object, used as follows:
  - addheaders = [(), (), ()...]: a list of request-header tuples, e.g. User-Agent and Cookie headers (note the attribute is addheaders, not add_headers)
  - open(url): visits url with those headers attached, returns a Response object
url = 'https://www.baidu.com'
# each header must be passed as a (name, value) tuple
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) ...')
cookies = ('Cookie', 'B2352_3239RTM=207 ...')
opener = request.build_opener()
opener.addheaders = [headers, cookies]
response = opener.open(url)
- install_opener(opener): registers the opener globally; once installed, opener.open(url) can be replaced with request.urlopen(url)
url = 'https://www.baidu.com'
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) ...')
cookies = ('Cookie', 'B2352_3239RTM=207 ...')
opener = request.build_opener()
opener.addheaders = [headers, cookies]
# register the opener globally
request.install_opener(opener)
# opener.open -> request.urlopen
response = request.urlopen(url)
- ProxyHandler({"scheme": "ip:port"}): proxy related; returns a proxy handler that can be passed to build_opener()
The urllib.parse module
- quote(text): URL-encodes a string, e.g. converting Chinese characters into percent-encoded form

from urllib.parse import quote

params = '?name=玛尔扎哈&age=20&sex=girl'
print(quote(params))
params = '沙扬娜拉'
print(quote(params))
- urlencode(form_data): takes a form dict and builds a query string, percent-encoding Chinese and special characters

from urllib.parse import urlencode

formdata = {
    'name': '阿兹尔',  # Chinese is quote-encoded automatically
    'username': 'demanwei',
    'password': '123abc',
    'extraA': '',
    'extraB': None,
}
print(urlencode(formdata))
- parse_qs(urlencoded): the inverse of urlencode; decodes a query string into a dict and automatically decodes the percent-encoded Chinese

from urllib.parse import parse_qs

urlencoded = 'name=%E9%98%BF%E5%85%B9%E5%B0%94&username=demanwei&password=123abc&extraA=&extraB=None'
# every value in the dict is a list
print(parse_qs(urlencoded))
# usually each key has only one value, so a plain dict can be built like this
print({k: v[0] for k, v in parse_qs(urlencoded).items()})
- urlparse(url): returns the structure of a url as (scheme, netloc, path, params, query, fragment)

from urllib.parse import urlparse

url = 'http://localhost:8080/user/get?id=12'
print(urlparse(url))
- urlsplit(url): basically the same as urlparse, but without params, i.e. (scheme, netloc, path, query, fragment)

from urllib.parse import urlsplit

url = 'http://localhost:8080/user/get?id=12'
print(urlsplit(url))
Note
Prefer urlencode over quote whenever possible!
from urllib.parse import quote, urlencode

him = '玛尔扎哈'
her = '沙扬娜拉'
raw_url = 'http://localhost:8080?him={}&her={}'.format(him, her)
print(raw_url)
# quote encodes each value separately; never run it over the whole url!!
new_url = 'http://localhost:8080?him={}&her={}'.format(quote(him), quote(her))
print(new_url)
# urlencode returns the whole query string at once; use this from now on
formdata = {'him': him, 'her': her}
new_url = 'http://localhost:8080?{}'.format(urlencode(formdata))
print(new_url)
GET and POST requests
GET
from urllib import request, parse

kw = parse.quote('马云')
url = 'https://www.baidu.com/s?wd={}'.format(kw)
response = request.urlopen(url)
POST
from urllib import request, parse

url = 'https://www.iqianyue.com/mypost'
# POST form parameters
body = {'name': '林在超', 'pass': '111111'}
data = parse.urlencode(body).encode('utf-8')
req = request.Request(url, data=data)  # Request object
response = request.urlopen(req)
text = response.read().decode('utf-8')
print(text)
Request headers
from urllib import request

url = 'https://www.baidu.com'
# headers must be (name, value) tuples
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)....')
cookies = ('Cookie', '........')
opener = request.build_opener()
opener.addheaders = [headers, cookies]
# register the opener globally
request.install_opener(opener)
response = opener.open(url)
print(response.getcode())
Loading and saving cookies
from http.cookiejar import MozillaCookieJar
from urllib import request
cookiejar = MozillaCookieJar('resource/cookie.txt')
cookiejar.load(ignore_discard=True)
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
opener.open('https://httpbin.org/cookies/set?oourse=abc')
cookiejar.save(ignore_discard=True)
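If resource/cookie.txt does not exist yet, the file has to be created first by saving; a minimal sketch under that assumption (same placeholder path and httpbin url as above):

from http.cookiejar import MozillaCookieJar
from urllib import request

cookiejar = MozillaCookieJar('resource/cookie.txt')
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
# the response sets a cookie, which the CookieJar captures automatically
opener.open('https://httpbin.org/cookies/set?course=abc')
# write the captured cookies to disk so they can be loaded next time
cookiejar.save(ignore_discard=True, ignore_expires=True)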
Exception handling
- URLError: a crawler will sooner or later run into one exception or another while it runs. Without exception handling it simply crashes when an exception occurs and has to start over from scratch on the next run, so a robust crawler must handle exceptions. Typical causes:
  - the server cannot be reached
  - the remote url does not exist
  - there is no network connection
  - an HTTPError was raised
- Relationship with HTTPError: both are exception classes. HTTPError is a subclass of URLError; HTTPError carries a status code as well as a reason, while URLError only has a reason and no status code. So URLError cannot simply stand in for HTTPError; if you catch the broader URLError, check whether the exception actually has a status-code attribute before using it (see the sketch after the example below).
import urllib.error
from urllib import request

try:
    # the single slash is intentional: this malformed url triggers "no host given"
    request.urlopen('http:/www.ncut.edu.com')
except urllib.error.URLError as e:
    print(e)         # <urlopen error no host given>
    print(e.reason)  # no host given
    print(repr(e))   # URLError('no host given')
except Exception as e:
    print(repr(e))
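Since HTTPError is a subclass of URLError, an except clause for URLError also catches HTTPError; testing for the code attribute tells the two apart. A minimal sketch of that pattern, with a placeholder url assumed to return a 404:

import urllib.error
from urllib import request

try:
    request.urlopen('https://www.baidu.com/some-missing-page')  # assumed to 404
except urllib.error.URLError as e:
    # only HTTPError instances carry a status code, so check before using it
    if hasattr(e, 'code'):
        print(e.code)
    print(e.reason)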
Setting up proxy IPs
- How a proxy works: instead of requesting the target site directly, the request goes to a proxy server first; the proxy server requests the target site on our behalf, and once it has the data it forwards it back to our code
- Steps:
  - use request.ProxyHandler({"scheme": "ip:port"}) to build a handler from the proxy
  - use that handler to build an opener
  - use the opener to send the request
- Method 1: suits a stable proxy IP; works just like building an opener for a user agent
from urllib import request

ip = '175.6.6.101:5000'
proxy = request.ProxyHandler({'http': ip})
opener = request.build_opener(proxy, request.HTTPHandler)
request.install_opener(opener)
response = request.urlopen('http://www.win4000.com/meitu.html')
- Method 2: interface-call style, opening requests through the opener directly; suits an unstable proxy IP
from urllib import request

handler = request.ProxyHandler({"http": "233.241.78.43:8010"})
opener = request.build_opener(handler)
response = opener.open('http://www.win4000.com/meitu.html')
Building proxy and user-agent pools
IP pool
import random
from urllib import request

def spider_ip():
    """Scrapes a proxy-listing site (e.g. 快代理) and returns an IP pool."""
    ip_pool = []
    pass  # scraping omitted
    # [{'scheme': 'ip:port'}, {'scheme': 'ip:port'}, {'scheme': 'ip:port'}, ...]
    return ip_pool

ip_pool = spider_ip()
# each pool entry is already a {'scheme': 'ip:port'} dict, so pass it to ProxyHandler directly
proxy = request.ProxyHandler(random.choice(ip_pool))
opener = request.build_opener(proxy, request.HTTPHandler)
request.install_opener(opener)
response = request.urlopen('https://www.baidu.com')
UA pool
import random
from urllib import request

# a pool of User-Agent strings
UA_pool = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
]
headers = ('user-agent', random.choice(UA_pool))
opener = request.build_opener()
opener.addheaders = [headers]
request.install_opener(opener)
response = request.urlopen('https://www.baidu.com')