Table of Contents
1.1 Using the urllib Library
1.1.1 Basic Usage
| Method | Description |
| --- | --- |
| urllib.request.urlopen(url) | Simulates a browser sending a request to the server and returns the response (url can be a string or a Request object) |
| decode('encoding') | Decodes binary data (bytes) into a string (str) using the specified encoding |
Fetching the source code of the Baidu homepage with urllib
import urllib.request
# 1. Define a url
url = "http://www.baidu.com/"
# 2. Simulate a browser sending a request to the server; response holds the server's reply
response = urllib.request.urlopen(url)
# 3. Get the page source from the response
# read() returns the body as binary data (bytes)
# decode('encoding') converts bytes -> str
content = response.read().decode('utf-8')
# 4. Print the data
print(content)
1.1.2 One Type and Six Methods
| Method | Description |
| --- | --- |
| read() / read(num) | Reads byte by byte until the end / returns the first num bytes |
| readline() | Reads one line |
| readlines() | Reads line by line until the end |
| getcode() | Returns the status code |
| geturl() | Returns the url |
| getheaders() | Returns the headers |
import urllib.request
url = "http://www.baidu.com"
response = urllib.request.urlopen(url)
# <class 'http.client.HTTPResponse'>
# HTTPResponse is the type of response
print(type(response))
# Note: the response body is a stream and can only be consumed once; after a
# full read(), the later read(5)/readline()/readlines() calls return empty
# results. Re-open the url to try each method on fresh data.
# read() reads byte by byte until the end
content = response.read()
print(content)
# read(num) returns the first num bytes
content = response.read(5)
print(content)
# readline() reads one line
content = response.readline()
print(content)
# readlines() reads line by line until the end
content = response.readlines()
print(content)
# getcode() returns the status code
content = response.getcode()
# 200 (2xx means success)
print(content)
# geturl() returns the url
content = response.geturl()
print(content)
# getheaders() returns the headers
content = response.getheaders()
print(content)
1.1.3 Downloading
| Method | Description |
| --- | --- |
| urllib.request.urlretrieve(url,filename) | Downloads the resource at url into a local file named filename |
import urllib.request
# Download a web page
url_page = 'http://www.baidu.com'
# Keyword arguments
urllib.request.urlretrieve(url=url_page,filename='baidu.html')
# Download an image
url_img = 'https://img0.baidu.com/it/u=2518378277,1696634197&fm=253&fmt=auto&app=138&f=JPEG?w=500&h=773'
# Positional arguments
urllib.request.urlretrieve(url_img,'sea.jpg')
# Download a video
url_video = 'https://vd2.bdstatic.com/mda-jk4pkuv7mykyvnir/sc/mda-jk4pkuv7mykyvnir.mp4?v_from_s=hkapp-haokan-nanjing&auth_key=1660807657-0-0-34d30bdaa97d9f64af9358243461518e&bcevod_channel=searchbox_feed&pd=1&cd=0&pt=3&logid=3457120974&vid=10995348722593480009&abtest=103525_1-103890_1-103579_2&klogid=3457120974'
urllib.request.urlretrieve(url_video,'transform.mp4')
1.2 Customizing the Request Object
Components of a url
# Components of a url
# https://www.baidu.com/s?wd=周杰伦
# protocol     host           port    path  params     anchor
# http/https   www.baidu.com  80/443  s     wd=周杰伦   #
# Common default ports:
# http      80
# https     443
# mysql     3306
# oracle    1521
# redis     6379
# mongodb   27017
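The breakdown above can be reproduced with `urllib.parse.urlparse`, which splits a URL string into these components (a quick offline sketch):

```python
from urllib.parse import urlparse

# urlparse splits a URL string into scheme, host, path, query, etc.
parts = urlparse('https://www.baidu.com/s?wd=周杰伦')
print(parts.scheme)  # https
print(parts.netloc)  # www.baidu.com
print(parts.path)    # /s
print(parts.query)   # wd=周杰伦
print(parts.port)    # None -> no explicit port, so the default (443 for https) is used
```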
About the User-Agent
The User Agent (UA for short) is a special header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser kernel, rendering engine, browser language, browser plugins, and so on.
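The reason servers can tell urllib apart from a real browser is that, without a custom header, urllib identifies itself with its own UA string. A quick offline way to see the default (the exact version suffix depends on your Python):

```python
import urllib.request

# The default opener carries urllib's own User-Agent, e.g. 'Python-urllib/3.10',
# which anti-scraping checks can easily recognize and block
opener = urllib.request.build_opener()
print(opener.addheaders)
```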
| Method | Description |
| --- | --- |
| urllib.request.Request(url=url,headers=headers) | Builds a customized request object from url (a string) and headers (a dict) |
import urllib.request
# Unlike the earlier url, this one uses https; requesting it directly can trigger
# anti-scraping measures, so a browser User-Agent is needed
url = 'https://www.baidu.com/'
# Key: User-Agent  Value: the user agent string
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
# Request builds a customized request object
# headers is not the second positional parameter, so keyword arguments are used here
# headers is a dict
request = urllib.request.Request(url=url,headers=headers)
# urlopen accepts either a string or a Request object
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
1.3 Encoding and Decoding
1.3.1 GET Requests
| Method | Description |
| --- | --- |
| urllib.parse.quote(string) | Percent-encodes a single parameter (a string) |
| urllib.parse.urlencode(query) | Percent-encodes multiple parameters (a dict) into a query string |
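Both functions can be tried offline before sending any request; they percent-encode the UTF-8 bytes of non-ASCII characters:

```python
import urllib.parse

# quote encodes a single string value
print(urllib.parse.quote('周杰伦'))
# %E5%91%A8%E6%9D%B0%E4%BC%A6

# urlencode encodes a whole dict into a key=value query string
print(urllib.parse.urlencode({'wd': '周杰伦'}))
# wd=%E5%91%A8%E6%9D%B0%E4%BC%A6
```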
import urllib.request
import urllib.parse
url = 'https://www.baidu.com/s?wd='
# Percent-encode the characters 周杰伦
name = urllib.parse.quote('周杰伦')
# %E5%91%A8%E6%9D%B0%E4%BC%A6
# print(name)
# Concatenate the url
url = url + name
# https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6
# print(url)
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
# Build the customized request object
request = urllib.request.Request(url=url,headers=headers)
# Simulate a browser sending a request to the server
response = urllib.request.urlopen(request)
# Get the response body
content = response.read().decode('utf-8')
# Print it
print(content)
import urllib.request
import urllib.parse
base_url = 'https://www.baidu.com/s?'
query = {
'wd':'周杰伦',
'sex':'男',
'location':'中国台湾'
}
# urlencode encodes multiple parameters; here it encodes the dict query
new_url = urllib.parse.urlencode(query)
# wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7&location=%E4%B8%AD%E5%9B%BD%E5%8F%B0%E6%B9%BE
# print(new_url)
# Concatenate the url
url = base_url + new_url
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
1.3.2 POST Requests
| Method | Description |
| --- | --- |
| encode('encoding') | Encodes a string (str) into binary data (bytes) using the specified encoding |
| urllib.request.Request(url=url,data=data,headers=headers) | Builds a customized request object from url, data (bytes), and headers |
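The str-to-bytes step can be checked offline: urlencode produces a str, and encode turns it into the bytes that the data parameter requires:

```python
import urllib.parse

data = {'kw': 'spider'}
encoded = urllib.parse.urlencode(data)  # 'kw=spider', type str
payload = encoded.encode('utf-8')       # b'kw=spider', type bytes
print(type(encoded), type(payload))
```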
import urllib.request
import urllib.parse
import json
url = 'https://fanyi.baidu.com/sug'
keyword = input('Enter the word to translate:')
data = {
'kw':keyword
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
# <class 'str'>
# print(type(urllib.parse.urlencode(data)))
# kw=spider
# print(urllib.parse.urlencode(data))
# After urlencode, POST parameters must also be converted to bytes with encode
# <class 'bytes'>
# print(type(urllib.parse.urlencode(data).encode('utf-8')))
# b'kw=spider'
# print(urllib.parse.urlencode(data).encode('utf-8'))
data = urllib.parse.urlencode(data).encode('utf-8')
# POST parameters cannot be appended to the url; they must be passed via the
# data parameter of the customized request object
request = urllib.request.Request(url=url, data=data, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# <class 'str'>
# print(type(content))
# Parse the JSON string into a Python object, then print it
print(json.loads(content))
Summary (check which request method a page uses in the browser dev tools, under Headers -> General)
| Request method | Params need encoding? | Call encode after encoding? | Where do the processed params go? |
| --- | --- | --- | --- |
| GET | yes (urlencode) | no | appended to the url (url = url + new_url) |
| POST | yes (urlencode) | yes (encode) | passed via the data parameter of Request |
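The table can be condensed into one helper (build_request is not part of urllib; it is only a hypothetical sketch of the two branches):

```python
import urllib.parse
import urllib.request

def build_request(url, params, headers, method='get'):
    # GET: the encoded params are appended to the url
    # POST: the encoded params are additionally encode()d to bytes
    #       and passed via the data parameter
    query = urllib.parse.urlencode(params)
    if method.lower() == 'get':
        return urllib.request.Request(url=url + '?' + query, headers=headers)
    return urllib.request.Request(url=url, data=query.encode('utf-8'), headers=headers)
```

A Request built with data is sent as a POST automatically, which is why the GET/POST distinction reduces to where the encoded parameters end up.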
1.3.3 Example: Baidu Translate (detailed results)
import urllib.request
import urllib.parse
import json
url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'
data = {
'from':'en',
'to':'zh',
'query':'love',
'transtype':'realtime',
'simple_means_flag':'3',
'sign':'198772.518981',
'token':'3f765f437db272b3f081d2eea91bd77d',
'domain':'common'
}
headers = {
# 'Accept':'*/*',
# 'Accept-Encoding':'gzip, deflate, br',
# 'Accept-Language':'zh-CN,zh;q=0.9',
# 'Acs-Token':'1660892576854_1660894875634_KzbN99axk1XtoRVQllim0Zqj/Ym/xrh1apQKLZmKKmvk/yIrG7+wv6bwcJ5yWSSl+s6Z6uVKhdz07PfsI0R0r43QHkqg8Boaty1bmgVGbJGXWOSYu5vBaae8eGbirWy9YZeHjSLxLuVollWIhyBlmZuUMhORffz3T2kdkmDbEsUDYX/cQJqbINDZiU4s5qGBA8i0hLOTDUWB4taU/qvOwVa6JUqmxlcnzaUiq1U97Lf5F5l34B4SAy38qHtVktnUFn6118824akrXDXu2OQ3iC/0/mzX+aRxD2U0AFTHmf2w33TOqton8IiBe7SHjJFbb5Ic9y953GF8/NfB/0UHujjg8e7BJIrd28Yxu9T/raYeqRgFov7HdECdyY8c/Rqm7eapDATGueMkmKz3C71F4g==',
# 'Connection':'keep-alive',
# 'Content-Length':'135',
# 'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
# The Cookie is the decisive header in this example and cannot be omitted
'Cookie': 'BIDUPSID=DEB36B2F438D5DB90BFC2FEAB5F604C5; PSTM=1642773734; BAIDUID=046A9599B0368042D0F70F627C1D63B7:FG=1; REALTIME_TRANS_SWITCH=1; HISTORY_SWITCH=1; FANYI_WORD_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; __yjs_duid=1_7989f5820ab4dca194ae2a276fa03cd31642818890414; BDUSS=BzdXdtVm1RZmdaUC1zeFdxYWRHMklNcDBnUXJrQTVDN3hUZmRsVH5JMVFBUmRpRVFBQUFBJCQAAAAAAAAAAAEAAABIemeAyMjH6bXEeWRoMTIzAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFB072FQdO9hcz; BDUSS_BFESS=BzdXdtVm1RZmdaUC1zeFdxYWRHMklNcDBnUXJrQTVDN3hUZmRsVH5JMVFBUmRpRVFBQUFBJCQAAAAAAAAAAAEAAABIemeAyMjH6bXEeWRoMTIzAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFB072FQdO9hcz; APPGUIDE_10_0_2=1; CHINA_PINYIN_SWITCH=0; DOUBLE_LANG_SWITCH=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BAIDUID_BFESS=046A9599B0368042D0F70F627C1D63B7:FG=1; ZFY=DQtZD7HTg9BASnvfpG2E2AT0coXUof9XrjYy05RDF9c:C; ariaDefaultTheme=undefined; RT="z=1&dm=baidu.com&si=0csbuzjxc45n&ss=l6z4rker&sl=5&tt=38a&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=lpd&ul=m3b&hd=m4o"; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1660575073,1660826231,1660873724,1660894851; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1660894851; ab_sr=1.0.1_ZjZiMTQwMjJkZDQ2MTA0ZGQ2MzczYTIxNTM3NjViNzlkMDgzMWYwZTNkZWQyYTk3NzAxYTUzMzc4NmY3MDI3MmIyM2JmMDgwMjJlMWMwNTg5YzMxMGYxNGFmM2QxOTRkNGMzN2FiMzdkMWRjOGM3M2NlNTFiOGU4Y2QyY2UxOGY0NjQ3MmQ0ZjYzZGE1MWE3YjJlMDI5ZTExMTgzOTRjYmFjYTExNmZlN2EzODZlMDk0MjEzN2FmNjA0N2ExNWI4',
# 'Host':'fanyi.baidu.com',
# 'Origin': 'https://fanyi.baidu.com',
# 'Referer': 'https://fanyi.baidu.com/translate?aldtype=16047&query=&keyfrom=baidu&smartresult=dict&lang=auto2zh',
# 'sec-ch-ua':' "Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
# 'sec-ch-ua-mobile':' ?0',
# 'sec-ch-ua-platform':' "Windows"',
# 'Sec-Fetch-Dest':' empty',
# 'Sec-Fetch-Mode':' cors',
# 'Sec-Fetch-Site':' same-origin',
# 'User-Agent':' Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
# 'X-Requested-With':' XMLHttpRequest'
}
# POST parameters must be urlencoded and then converted to bytes with encode
data = urllib.parse.urlencode(data).encode('utf-8')
# Build the customized request object
request = urllib.request.Request(url=url, data=data, headers=headers)
# Simulate a browser sending a request to the server
response = urllib.request.urlopen(request)
# Get the data from the response
content = response.read().decode('utf-8')
# Parse the JSON string into a Python object, then print it
print(json.loads(content))
1.3.4 Example: AJAX GET Request for the First Page of Douban Movies
import urllib.request
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# <class 'str'>
# print(type(content))
# Save the data locally
# open defaults to gbk encoding; to save Chinese characters, specify utf-8
fp = open('douban.json','w',encoding='utf-8')
fp.write(content)
fp.close()
# This works as well, and closes the file automatically
with open('douban1.json','w',encoding='utf-8') as fp:
    fp.write(content)
1.3.5 案例:ajax的get请求豆瓣电影的多页
''' 分析
# https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20
# https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=20&limit=20
# https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=40&limit=20
...
# start 0 20 40
# page (1-1)*20 (2-1)*20 (3-1)*20
# limit 20 20 20
# start = (page - 1) * 20
'''
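The start/page pattern derived above can be verified with a tiny function:

```python
# start = (page - 1) * 20: page 1 begins at record 0, page 2 at 20, and so on
def page_to_start(page, limit=20):
    return (page - 1) * limit

print([page_to_start(p) for p in (1, 2, 3)])  # [0, 20, 40]
```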
import urllib.request
import urllib.parse
def create_request(start):
    query = {
        'start':start,
        'limit':20
    }
    new_url = urllib.parse.urlencode(query)
    # base_url is assigned at module level in the entry-point block below,
    # so create_request can read it as a global
    url = base_url + new_url
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request
def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content
def down_load(page,content):
    with open('douban_' + str(page) + '.json','w',encoding='utf-8') as fp:
        fp.write(content)
# Entry point of the program
if __name__ == '__main__':
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
    start_page = int(input('Enter the first page of Douban movies:'))
    end_page = int(input('Enter the last page of Douban movies:'))
    for page in range(start_page,end_page + 1):
        start = (page - 1) * 20
        # Build a customized request object for each page
        request = create_request(start)
        # Get the data from each page's response
        content = get_content(request)
        # Save each page's data to a local file
        down_load(page,content)
1.3.6 Example: AJAX POST Request to the KFC Website
'''
Analysis:
Page 1:
    General
        Request URL: http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
        Request method: POST
    Form data
        cname: 北京
        pid:
        pageIndex: 1
        pageSize: 10
Page 2:
    General
        Request URL: http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname
        Request method: POST
    Form data
        cname: 北京
        pid:
        pageIndex: 2
        pageSize: 10
Pattern for page number `page`:
    pageIndex: page
'''
import urllib.request
import urllib.parse
def create_request(page):
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data = {
        'cname': '北京',
        'pid': '',
        'pageIndex': page,
        'pageSize': 10
    }
    # POST parameters must be urlencoded and then converted to bytes with encode
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }
    request = urllib.request.Request(url=url, data=data, headers=headers)
    return request
def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content
def down_load(page,content):
    with open('kfc_' + str(page) + '.json','w',encoding='utf-8') as fp:
        fp.write(content)
if __name__ == '__main__':
    start_page = int(input('Enter the first page of the restaurant query:'))
    end_page = int(input('Enter the last page of the restaurant query:'))
    for page in range(start_page,end_page+1):
        # Build a customized request object for each page
        request = create_request(page)
        # Get the data from each page's response
        content = get_content(request)
        # Save each page's data locally
        down_load(page,content)
1.4 URLError and HTTPError
- HTTPError is a subclass of URLError.
- Both are imported from urllib.error: urllib.error.URLError and urllib.error.HTTPError.
- An HTTP error is the error message shown when the browser cannot reach the requested resource; it tells the visitor what went wrong with the page.
- A request sent with urllib may fail. To make your code more robust, catch these exceptions with try-except.
import urllib.request
import urllib.error
url = 'https://blog.csdn.net/m0_60121089/article/details/123673883'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
# Catch HTTPError before its parent class URLError
except urllib.error.HTTPError:
    print("HTTPError...")
except urllib.error.URLError:
    print("URLError...")
1.5 Logging in to QQ Zone with a Cookie
# The personal profile page is utf-8 encoded, yet decoding may still fail: without valid
# login cookies the request is redirected to the login page,
# whose encoding is not utf-8, hence the error.
import urllib.request
url = 'https://user.qzone.qq.com/623651791'
headers = {
# ':authority':'user.qzone.qq.com',
# ':method':'GET',
# ':path':'/623651791',
# ':scheme':'https',
'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
# 'accept-encoding':'gzip, deflate, br',
'accept-language':'zh-CN,zh;q=0.9',
'cache-control':'max-age=0',
# The cookie carries the user's login state. With a logged-in cookie we can bypass the login and reach any page
'cookie':'RK=rFkNeGKAX0; ptcz=af6739142b83b9afe7357d55872d7cd683c81601820b61c32c4d9defb0bc9752; pgv_pvid=4538973684; eas_sid=g1p6J584p5K1F5E4a3a7G2C6m0; pac_uid=0_c76ac082612f8; iip=0; luin=o0623651791; lskey=000100007867d9543945e8a2398ddfbb7904d0b6744b9009b8e1bd2e6fadd2ffaba61197407c8b2e0fd1c1dc; qz_screen=1536x864; QZ_FE_WEBP_SUPPORT=1; _qpsvr_localtk=0.2895591104739106; pgv_info=ssid=s6379958680; ptui_loginuin=623651791; uin=o0623651791; skey=@sQkWqCZNv; p_uin=o0623651791; pt4_token=-r2c1UrMZh7TrnL2zYf*u97wAeTKUVEYqToMqqTgqLQ_; p_skey=sN*zjRV7pmZvzunVY4DnKAm6eBJBlRcNiJFjWIFITVk_; Loading=Yes; x-stgw-ssl-info=fc6c32a0710975d0348c6340943f45a2|0.084|-|1|.|Y|TLSv1.2|ECDHE-RSA-AES128-GCM-SHA256|20500|h2|0; cpu_performance_v8=1; rv2=8073FCFEE4B23DBC0B31B4556CB81729A06B0F95EA3B7B0DE8; property20=D678C2120EB4D4A310217ACB8706D6FEEBCF6614ECE45E58C66CEC91A67E2BFCFDE1817D28CD5FC2',
# referer checks whether the current url was reached from the referer path; commonly used for image hotlink protection
'referer': 'https://qzs.qq.com/',
'sec-ch-ua':'"Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
'sec-ch-ua-mobile':'?0',
'sec-ch-ua-platform':'"Windows"',
'sec-fetch-dest':'document',
'sec-fetch-mode':'navigate',
'sec-fetch-site':'same-site',
'sec-fetch-user':'?1',
'upgrade-insecure-requests':'1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url,headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
with open('qq空间.html','w',encoding='utf-8') as fp:
    fp.write(content)
1.6 The handler Processor
# Cannot customize request headers
urllib.request.urlopen(url)
# Can customize request headers (needed when setting a UA)
urllib.request.Request(url,data,headers)
# Supports more advanced customization (needed for proxies and dynamic cookies)
handler
| Method | Description |
| --- | --- |
| urllib.request.HTTPHandler() | Creates a handler object |
| urllib.request.build_opener(handler) | Creates an opener object |
| opener.open(request) | Sends the request and returns the response |
Fetching a page's source code with a handler
import urllib.request
url = 'https://www.baidu.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url, headers=headers)
# Create a handler object
handler = urllib.request.HTTPHandler()
# Create an opener object
opener = urllib.request.build_opener(handler)
# Call the open method
response = opener.open(request)
content = response.read().decode('utf-8')
print(content)
1.7 Proxies and Proxy Pools
1.7.1 Proxies
Free and paid proxy IPs are available from services such as "Kuaidaili" (快代理).
Common uses of proxies
- Bypass the client's own IP restrictions to access otherwise blocked sites.
- Access internal resources of an organization.
  - For example, a university FTP server: with a proxy address inside the resource's allowed range, free proxy servers on the education network grant access to FTP upload/download, database queries, and other services open to that network.
- Improve access speed.
  - Proxy servers usually keep a large disk cache; information passing through is saved there, so when other users request the same content it is served directly from the cache, which speeds up access.
- Hide the real IP.
  - Users can hide their own IP this way to avoid attacks.
Steps to configure a proxy
- Create a Request object
- Create a handler object with ProxyHandler
- Create an opener object from the handler
- Send the request with opener.open and get the response
import urllib.request
url = 'https://www.baidu.com/s?wd=ip'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url,headers=headers)
# The key is the proxy scheme, the value is the proxy ip:port
proxies = {
'http':'118.24.219.151:16817'
}
# Create a handler object
handler = urllib.request.ProxyHandler(proxies=proxies)
# Create an opener object
opener = urllib.request.build_opener(handler)
# Call the open method
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili.html','w',encoding='utf-8') as fp:
    fp.write(content)
1.7.2 Proxy Pools
A proxy pool is a collection of proxy IPs that provides multiple stable, usable proxies.
import urllib.request
import random
url = 'https://www.baidu.com/s?wd=ip'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
request = urllib.request.Request(url=url,headers=headers)
proxies_pool = [
{'http':'118.24.219.151:16817'},
{'http':'111.3.118.247:30001'}
]
# Pick a random proxy ip from the pool
proxies = random.choice(proxies_pool)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili.html','w',encoding='utf-8') as fp:
    fp.write(content)