I. AJAX GET Requests
Douban Movies (Page 1)
Goal: scrape the first page of the Douban movie ranking chart.
Open the browser's developer tools (Inspect).
Find the request that returns the first page of movie data.
Start writing the code.
- GET request: fetch the first page of Douban movie data and save it locally.
1. Define the URL and request headers
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
2. Build the request object
request = urllib.request.Request(url=url, headers=headers)
3. Get the response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
4. Save the data locally (method one)
fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()  # close the file when it is opened manually
(Method two: a with statement closes the file automatically)
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
Full code:
import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# fp = open('douban.json', 'w', encoding='utf-8')
# fp.write(content)
# alternative way to save the JSON data
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
Running result:
Douban Movies (First Ten Pages)
Observe the API URL of each page:
Page 1
Page 2
Page 3
The pattern: only the start parameter differs from page to page.

page:  1   2   3   4
start: 0  20  40  60

So start = (page - 1) * 20.
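A quick sanity check of the formula, as a minimal sketch:

# print the computed start value for pages 1-4; it should match the observed 0, 20, 40, 60
for page in range(1, 5):
    print(page, (page - 1) * 20)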
Start writing the code.
Three steps:
- build the request object
- get the response data
- save the data locally
Write the program entry point (it should print pages 1-10):
if __name__ == '__main__':
    start_page = int(input("Enter the starting page number: "))
    end_page = int(input("Enter the ending page number: "))
    for page in range(start_page, end_page + 1):
        print(page)
Build the request object; each page needs its own request object.
# create a function (the page parameter is passed in so it can be used inside the function)
create_request(page)
Define the function that builds the request object:
def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'
    # this is a GET request, so the parameters can simply be appended to the URL
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    # build the query string; a GET request does not need a further .encode() call
    data = urllib.parse.urlencode(data)
    url = base_url + data
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'}
This prints the URL of each of pages 1-10.
Inside the function, build the request object (one request per page):
request = urllib.request.Request(url=url, headers=headers)
Step two: get the response data (define a function).
def get_content():
    response = urllib.request.urlopen()
Inside the function that gets the response data we need the request, so we have to use return values!
The request-building function must return the request,
and the main block must receive it:
request = create_request(page)
Then pass it on to the function that gets the response data:
get_content(request)
Now the response function can use the request parameter:
def get_content(request):
    response = urllib.request.urlopen(request)
Step three: save the data.
# define the function
down_load()
As in the previous step, the down_load() function needs the content and page parameters, so remember to pass them in!
def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)
Full code:
import urllib.parse
import urllib.request

# key point: each page has a different URL
def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = base_url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input("Enter the starting page number: "))
    end_page = int(input("Enter the ending page number: "))
    for page in range(start_page, end_page + 1):
        request = create_request(page)
        content = get_content(request)
        down_load(page, content)
The first ten pages of Douban movies:
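Once the files are downloaded, you may want to check their contents. A minimal parsing sketch, assuming each saved file holds a JSON array whose items carry 'title' and 'score' fields (the exact field names of Douban's response are an assumption here):

import json

# read one of the downloaded pages and print a couple of fields per movie
with open('DB_1.json', 'r', encoding='utf-8') as fp:
    movies = json.load(fp)

for movie in movies:
    # 'title' and 'score' are assumed field names; .get avoids a KeyError if they differ
    print(movie.get('title'), movie.get('score'))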
II. AJAX POST Requests
KFC Official Website
Goal: find which locations in a region have KFC stores; scrape the first ten pages of data and save them locally.
Open the KFC official website, click the restaurant search, and select the city to scrape (Chengdu here).
Copy the API URL.
Compare the form data and the API URL of page 1 and page 2.
The pattern: only pageIndex differs.
This is broadly the same as the two Douban examples above. The only difference: a POST request must also byte-encode its parameters, as the sketch below shows.
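A minimal sketch of that difference (example.com is a placeholder URL; the point is the extra .encode() call and passing data= instead of appending a query string):

import urllib.parse

params = {'pageIndex': 1, 'pageSize': 10}

# GET: urlencode the parameters and append them to the URL
get_url = 'http://example.com/api?' + urllib.parse.urlencode(params)

# POST: urlencode AND byte-encode, then pass via the data= argument of Request
post_data = urllib.parse.urlencode(params).encode('utf-8')
# urllib.request.Request(url='http://example.com/api', data=post_data, headers=...)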
Source code:
import urllib.request
import urllib.parse

def create_request(page):
    base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data = {
        'cname': "成都",  # Chengdu
        'pid': "",
        'pageIndex': page,
        'pageSize': "10"
    }
    # POST parameters must be urlencoded AND byte-encoded
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    # build the request object
    request = urllib.request.Request(url=base_url, headers=headers, data=data)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input("Enter the starting page number: "))
    end_page = int(input("Enter the ending page number: "))
    for page in range(start_page, end_page + 1):
        # build the request object
        request = create_request(page)
        # get the page source
        content = get_content(request)
        # save the data
        down_load(page, content)
Result:
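The saved files can be checked the same way. A minimal sketch; the 'Table1', 'storeName', and 'addressDetail' keys are assumptions about the shape of KFC's response and should be verified against an actual downloaded file:

import json

with open('kfc_1.json', 'r', encoding='utf-8') as fp:
    result = json.load(fp)

# 'Table1' / 'storeName' / 'addressDetail' are assumed key names
for store in result.get('Table1', []):
    print(store.get('storeName'), store.get('addressDetail'))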
III. URLError and HTTPError
Components of a URL:
- protocol
- host
- port
- path
- parameters (e.g. wd, kw)
- anchor (fragment)
Note: HTTPError is a subclass of URLError; both are defined in the urllib.error module.
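A quick way to confirm the relationship from the interpreter, as a minimal sketch:

import urllib.error

# HTTPError derives from URLError, so except clauses must list HTTPError first
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True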
Goal: get the source of a web page.
import urllib.request

url = 'https://blog.csdn.net/qq_64451048/article/details/127775623'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
Now suppose the URL is accidentally changed (an extra 1 appended): an HTTPError is raised.
Catch the exception:
try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:
    print('The system is being upgraded...')
If the problem lies in the URL itself (for example a bad host), catch URLError instead:
except urllib.error.URLError:
    print('The system is being upgraded...')
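Putting the two handlers together: HTTPError must be caught before URLError, because HTTPError is a subclass of URLError and a later clause would never be reached. Note the explicit import of urllib.error. A minimal sketch:

import urllib.request
import urllib.error

url = 'https://blog.csdn.net/qq_64451048/article/details/127775623'  # mistype this URL to trigger HTTPError
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError as e:
    # HTTP-level failure such as 404; e.code holds the status code
    print('The system is being upgraded...', e.code)
except urllib.error.URLError as e:
    # lower-level failure such as a bad host; e.reason holds the cause
    print('The system is being upgraded...', e.reason)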
IV. Handler Processors
Handlers let us customize requests at a more advanced level. As business logic grows more complex, plain request-object customization no longer meets our needs: dynamic cookies and proxies cannot be handled through request-object customization alone.
1. Basic usage
Goal: use a handler to visit Baidu and get the page source.
Three key names: handler, build_opener, open.
import urllib.request

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
# create the handler object
handler = urllib.request.HTTPHandler()
# build an opener from the handler
opener = urllib.request.build_opener(handler)
# call the open method
response = opener.open(request)
content = response.read().decode('utf-8')
print(content)
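As a side note, one concrete benefit of creating the HTTPHandler yourself is its debuglevel argument, which prints the raw HTTP exchange to stdout; a minimal sketch:

import urllib.request

# debuglevel=1 makes the handler print request and response headers while the request runs
handler = urllib.request.HTTPHandler(debuglevel=1)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.status)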
2. Proxy servers
Common uses of a proxy:
- bypass restrictions on your own IP and access otherwise blocked sites
- access internal resources of an organization or group
- improve access speed
- hide your real IP
Use a proxy IP to change the IP address you appear to come from:
import urllib.request

url = 'http://www.baidu.com/s?wd=IP'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Cookie': 'BAIDUID=12196FD75453346657491E87390AC35B:FG=1; BIDUPSID=12196FD7545334661F0AE8D4B062BE2E; PSTM=1666008285; ispeed_lsm=2; BDUSS=FhRbk81OEFrZ1RFRFJrWUxCQ1dmRTZUQXp0VXA4ZGZtT0QyOUZ0T0hDRGYtSVpqRVFBQUFBJCQAAAAAAAAAAAEAAACS5FTMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN9rX2Pfa19je; BD_UPN=13314752; baikeVisitId=9616ee70-5c47-41fe-865c-14dc1b170603; COOKIE_SESSION=195933_1_0_1_1_1_1_0_0_1_0_0_0_0_6_0_1666259938_1666064006_1666259932%7C2%230_1_1666063999%7C1; ZFY=M8B:A6gXyHZyKVBf:AqksGBg5jNPPKTmxNoclm:BgHpXzI:C; B64_BOT=1; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDSFRCVID=umLOJeC62ZlGj87jjNM7q-J69LozULrTH6_n1tn9KcRk7KlESLLqEG0PWf8g0KubzcDrogKKXeOTHiFF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJIe_C-atC-3fP36q4rVhP4Sqxby26ntamJ9aJ5nJDoADh3Fe5J8MxCIjpLLBjK8BIOE-lR-QpP-_nul5-IByPtwMNJi2UQgBgJDKl0MLU3tbb0xynoD24tvKxnMBMnv5mOnanTI3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDcnK4-XjjOBDNrP; H_PS_PSSID=36554_37555_37518_37687_37492_34813_37778_37721_37794_36807_37662_37533_37720_37740_26350_22157; delPer=0; BD_CK_SAM=1; PSINO=1; BDRCVFR[Fc9oatPmwxn]=aeXf-1x8UdYcs; BD_HOME=1; sugstore=1; BA_HECTOR=0h040g8k252hak242h8g8rou1hna11u1e; H_PS_645EC=e13b%2FA3XVtQZqyt9d0m3A8twSI3IrHVjaGptlJbr4wMhPOUE0G9YUipXLjIqNjZ2UHOS; BDSVRTM=231'
}
request = urllib.request.Request(url=url, headers=headers)
# simulate a browser visiting the server
# response = urllib.request.urlopen(request)
# proxy IP
proxies = {
    'http': '222.74.73.202:42055'
}
handler = urllib.request.ProxyHandler(proxies=proxies)
# the handler must be passed to build_opener, otherwise the proxy is never applied
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
3. Proxy pools
A simple version of a proxy pool:
import random

# placeholder proxy addresses, just to demonstrate random.choice
proxies_pool = [
    {'http': '222.74.73.202:42055111'},
    {'http': '222.74.73.202:42055222'}
]

proxies = random.choice(proxies_pool)
print(proxies)
Full source for a custom proxy pool:
import random
import urllib.request

proxies_pool = [
    {'http': '121.13.252.60:41564'},
    {'http': '121.13.252.60:41564'}
]

proxies = random.choice(proxies_pool)

url = 'http://www.baidu.com/s?wd=IP'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
# build the request object
request = urllib.request.Request(url=url, headers=headers)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili1.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
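Note that random.choice here picks one proxy for the whole run; to rotate proxies per request, move the choice (and the opener construction) inside the loop. A minimal sketch under the same assumptions (placeholder proxy addresses):

import random
import urllib.request

proxies_pool = [
    {'http': '121.13.252.60:41564'},  # placeholder proxy addresses
    {'http': '121.13.252.60:41564'}
]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

for i in range(3):
    proxies = random.choice(proxies_pool)          # a fresh proxy for each request
    handler = urllib.request.ProxyHandler(proxies=proxies)
    opener = urllib.request.build_opener(handler)  # opener bound to that proxy
    request = urllib.request.Request(url='http://www.baidu.com', headers=headers)
    response = opener.open(request)
    print(i, response.status)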