1、urllib.parse : 处理参数或者url
urllib.parse.quote(): url编码, (除了字母、数字、下划线、冒号 // ? =等)
urllib.parse.unquote(): url解码,
urllib.parse.urlencode(data): 将data字典里面的键和值转化为拼接字符串格式, 并对url编码
2、构件请求对象
先模拟请求头里面的User-Agent
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
发送请求:
response = urllib.request.urlopen(request, data='')
3、模拟各种请求方式
get: 例如百度搜索
post: 例如百度翻译
urlopen(url, data=None)
如果有data,代表是post请求,如果没有data,代表是get请求,get的参数需要拼接到url的后面
表单数据的处理
formdata = urllib.parse.urlencode(formdata).encode('utf8')
ajax-get: 豆瓣电影排行榜
https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=40&limit=20
每页显示10条数据
第一页:start=0 limit=10
第二页:start=10 limit=10
第三页:start=20 limit=10
第n页: start=(n-1)*10 limit=10
ajax-post:
肯德基店铺位置
爬取豆瓣电影:
import urllib.request
import urllib.parse
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
print('每页显示10条数据')
page = int(input('请输入页码:'))
start = (page-1) * 10
limit = 10
data = {
'start': start,
'limit': limit,
}
url += urllib.parse.urlencode(data)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf8'))
爬取肯德基地址:
import urllib.request
import urllib.parse
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
city = input('请输入要搜索的城市:')
data = {
'cname': city,
'pid': '',
'pageIndex': '1',
'pageSize': '10'
}
data = urllib.parse.urlencode(data).encode('utf8')
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request, data=data)
print(response.read().decode('utf8'))
4、URLError\HTTPError
是异常处理类,属于urllib.error这个模块
URLError:断网或者主机不存在的时候会触发
Exception:官方的异常基类,所有的异常类都直接或者间接的继承它
HTTPError: 是URLError的子类,如果多个except同时捕获,注意将子类写到上面,将父类写到下面
import urllib.request
import urllib.error
'''
url = 'http://www.maodan.com/'
# response = urllib.request.urlopen(url)
try:
response = urllib.request.urlopen(url)
# except Exception as e:
except urllib.error.URLError as e:
# except NameError as e: 这个不能捕获
print(e)
print('不影响这一句代码的运行')
'''
url = 'https://www.cnblogs.com/fh-fendou/p/7479811.html'
try:
response = urllib.request.urlopen(url)
except (urllib.error.URLError, urllib.error.HTTPError) as e:
print(e)
print('httperror')
print('正常运行')
5、复杂get
百度贴吧
第一页:https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85&ie=utf-8&pn=0
第二页:https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85&ie=utf-8&pn=50
第三页:https://tieba.baidu.com/f?kw=%E6%9D%8E%E6%AF%85&ie=utf-8&pn=100
第n页:pn = (n-1) * 50
需求:输入贴吧名字,输入要爬取的起始页码,结束页码,以贴吧的名字创建一个文件夹,将每一页的内容全部拿下来保存到第n页.html文件中
import urllib.request
import urllib.parse
import os
import time
def main():
baming = input('请输入要爬取的贴吧的名字:')
start_page = int(input('请输入要爬取的起始页码:'))
end_page = int(input('请输入要爬取的结束页码:'))
url = 'https://tieba.baidu.com/f?'
for page in range(start_page, end_page + 1):
print('正在爬取第%s页......' % page)
request = handle_request(page, baming, url)
down_load(request, baming, page)
print('结束爬取第%s页' % page)
time.sleep(3)
def down_load(request, baming, page):
response = urllib.request.urlopen(request)
dirname = baming
if not os.path.exists(dirname):
os.mkdir(dirname)
filename = '第%s页.html' % page
filepath = os.path.join(dirname, filename)
with open(filepath, 'wb') as fp:
fp.write(response.read())
def handle_request(page, baming, url):
pn = (page-1) * 50
data = {
'kw': baming,
'ie': 'utf8',
'pn': pn
}
url += urllib.parse.urlencode(data)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
request = urllib.request.Request(url, headers=headers)
return request
if __name__ == '__main__':
main()
6、Handler处理器、自定义Opener
urlopen()
请求对象
为了解决代理和cookie这些更加高级的功能而引入的
实现最简单的功能,高级功能的步骤和这个步骤一模一样
import urllib.request
url = 'http://www.baidu.com/'
handler = urllib.request.HTTPHandler()
opener = urllib.request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf8'))
7、代理
代理服务器:快代理、西祠代理、芝麻代理、阿布云代理
218.60.8.98:3129
(1)浏览器如何设置代理
(2)代码中如何设置代理
import urllib.request
url = 'http://www.baidu.com/s?ie=UTF-8&wd=ip'
handler = urllib.request.ProxyHandler(proxies={'http': '218.60.8.98:3129'})
opener = urllib.request.build_opener(handler)
r = opener.open(url)
with open('代理.html', 'wb') as fp:
fp.write(r.read())