Web Scraping Basics: Urllib
Preface
This post takes a quick look at the basic use of the Urllib library in web scraping.
1. What is Urllib?
Urllib is Python's built-in HTTP request library. When scraping, we mainly use the following modules:
Module | Purpose |
---|---|
urllib.request | sends requests |
urllib.error | handles exceptions |
urllib.parse | parses URLs |
urllib.robotparser | parses robots.txt |
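The last module, urllib.robotparser, is not covered again below, so here is a minimal sketch of what it does: it reads a site's robots.txt and answers whether a given user agent may fetch a given path (the Baidu URL is just an arbitrary example).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")  # point at the site's robots.txt
rp.read()                                      # download and parse it
# Ask whether a generic crawler ("*") may fetch this path
print(rp.can_fetch("*", "http://www.baidu.com/s?wd=python"))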
2. Urllib in Detail
2.1 urlopen
The full signature:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Basic usage:
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))
# +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ #
import urllib.parse
import urllib.request
# POST the keyword to the test site http://httpbin.org/post
data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read())
In the returned JSON you can find "form": {"word": "hello"}, i.e. the keyword we just posted.
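To check this programmatically rather than by eyeballing the raw bytes, one rough sketch is to decode the JSON body that httpbin returns (assuming the test site is reachable):
import json
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({"word": "hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
payload = json.loads(response.read().decode("utf-8"))  # httpbin echoes the request back as JSON
print(payload["form"])  # {'word': 'hello'}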
# Unlike above, this sends a GET request and sets a timeout
response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
print(response.read())
Another variant, catching the timeout explicitly:
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT!")
2.2 Response
# Response type
response = urllib.request.urlopen("http://www.python.org")
print(type(response))
# Status code and response headers
response = urllib.request.urlopen("https://www.python.org")
print(response.status)
print(response.getheaders())
print(response.getheader("Server"))
print(response.read().decode("utf-8"))
With a Request object we can attach headers and a POST body to the request:
from urllib import request, parse
url = "http://httpbin.org/post"
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict_info = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict_info), encoding="utf-8")
req = request.Request(url=url, data=data, headers=headers, method="POST")
response = request.urlopen(req)
print(response.read().decode("utf-8"))
url = 'http://httpbin.org/post'
dict_info = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict_info), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
2.3 Handler
# Proxy: send requests through a proxy so pages are fetched from the proxy's IP
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://127.0.0.1:9743",
    "https": "https://127.0.0.1:9743"
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open("http://httpbin.org/get")
print(response.read())
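build_opener accepts any number of handlers, so a proxy can be combined with, say, cookie handling in a single opener. A minimal sketch (the proxy address is a placeholder and only works if you actually run a proxy there):
import http.cookiejar
import urllib.request

proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:9743"})
cookie_handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
opener = urllib.request.build_opener(proxy_handler, cookie_handler)
# opener.open("http://httpbin.org/get") would now go through the proxy and keep cookies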
# Cookie
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie: print(item.name + "=" + item.value)
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
Compare these snippets carefully: they save and load cookies in different file formats (MozillaCookieJar vs. LWPCookieJar), but all follow the same CookieJar, handler, opener pattern.
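One more trick worth knowing: urllib.request.install_opener makes an opener the global default, so plain urlopen calls also carry the cookies. A sketch, assuming cookie.txt was produced by the LWPCookieJar example above:
import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load("cookie.txt", ignore_discard=True, ignore_expires=True)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)   # from now on, urlopen() uses this opener
response = urllib.request.urlopen("http://www.baidu.com")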
2.4 Exception handling
from urllib import request, error

try:
    response = request.urlopen("https://mp.csdn.ne")
except error.URLError as e:
    print(e.reason)
try:
    response = request.urlopen("http://csdn.ne")
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep="\n")
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully!")
try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
Different failures surface differently: HTTPError carries a status code and headers, while a plain URLError only has a reason (here a socket.timeout), so checking which one you caught tells you how to adjust the code.
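As a rough sketch of how these branches can be folded into one helper (the fetch name, the retry count, and the test URL are all made up for illustration):
import socket
from urllib import request, error

def fetch(url, timeout=5, retries=2):
    for attempt in range(retries + 1):
        try:
            return request.urlopen(url, timeout=timeout).read()
        except error.HTTPError as e:       # the server answered, but with an error status
            print("HTTP error:", e.code, e.reason)
            return None
        except error.URLError as e:        # network-level failure, possibly a timeout
            if isinstance(e.reason, socket.timeout) and attempt < retries:
                continue                   # retry on timeout
            print("URL error:", e.reason)
            return None

print(fetch("http://httpbin.org/get"))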
2.5 URL parsing
The urlparse signature:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
from urllib.parse import urlparse
result = urlparse("http://www.baidu.com.index.html;user?id=5#comment")
print(type(result), result)
# The result has six fields: scheme, netloc, path, params, query, fragment.
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# Compared with above: when the URL itself has no scheme, the scheme argument supplies one.
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# When the URL already contains a scheme, it takes precedence over the scheme argument.
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
# With allow_fragments=False the fragment field is empty; '#comment' is folded into the query.
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
The parse result (a ParseResult) has six fields: scheme, netloc, path, params, query, fragment. Which delimiters appear in the URL (://, /, ;, ?, #) determines how each field is filled; in the last example there is no query, so '#comment' ends up appended to the path instead.
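A ParseResult is a named tuple, so its fields can be read by name or by index, and the query string can be further decoded with parse_qs; a small sketch:
from urllib.parse import urlparse, parse_qs

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result.scheme, result.netloc, result.path)  # access fields by name
print(result[0], result[4])                       # or by position
print(parse_qs(result.query))                     # {'id': ['5']}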
# urlunparse
from urllib.parse import urlunparse
data = ["http", "www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))
# urlunparse builds (joins) a URL from its six components.
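Closely related are urlsplit and urlunsplit, the five-component counterparts that fold params into the path; a quick sketch for comparison:
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit("http://www.baidu.com/index.html;user?id=5#comment")
print(parts)               # SplitResult has no separate params field
print(urlunsplit(parts))   # rebuilds the original URL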
# urljoin
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
# urljoin merges two URLs: parts missing from the second argument are taken from the base, and when the second argument already has its own scheme/netloc/path, it overrides the base.
# urlencode: serialize a dict of parameters into a query string.
from urllib.parse import urlencode
params = {
    "name": "germey",
    "age": 22
}
base_url = "http://www.baidu.com?"
url = base_url + urlencode(params)
print(url)
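urlencode works on whole dicts; for a single value, especially Chinese or other non-ASCII text, quote and unquote do the percent-encoding and decoding. A small sketch (the keyword is arbitrary):
from urllib.parse import quote, unquote

keyword = "壁纸"
url = "https://www.baidu.com/s?wd=" + quote(keyword)  # percent-encode the keyword
print(url)           # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
print(unquote(url))  # decodes it back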
Summary
Today we covered the common uses of the Urllib library and looked at its four main modules in more detail, with example code along the way; interested readers can dig deeper from here.
Gotta run, my head hurts. Loading (29/100)…