I. AJAX GET Requests
Douban Movies (Page 1)
Goal: scrape the first page of the Douban movie ranking chart.
Open the browser's developer tools (Inspect).
Find the request that returns the first page of movie data.
Start writing the code.
- GET request: fetch the first page of Douban movie data and save it locally.
1. Define the URL and request headers
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
2. Build the request object
request = urllib.request.Request(url=url, headers=headers)
3. Get the response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
4. Save the data locally (method one)
fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()  # close the file when it is opened manually
(Method two: a with statement closes the file automatically)
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
Full code:
import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# fp = open('douban.json', 'w', encoding='utf-8')
# fp.write(content)
# alternative way to save the JSON data
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
Running result:
Douban Movies (First Ten Pages)
Observe the API URL of each page:
Page 1
Page 2
Page 3
The pattern: only the start parameter differs from page to page.

page:  1   2   3   4
start: 0  20  40  60

So start = (page - 1) * 20.
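A quick sanity check of the formula, as a minimal sketch:

# print the computed start value for pages 1-4; it should match the observed 0, 20, 40, 60
for page in range(1, 5):
    print(page, (page - 1) * 20)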
Start writing the code.
Three steps:
- build the request object
- get the response data
- save the data locally
Write the program entry point (it should print pages 1-10):
if __name__ == '__main__':
    start_page = int(input("Enter the starting page number: "))
    end_page = int(input("Enter the ending page number: "))
    for page in range(start_page, end_page + 1):
        print(page)
Build the request object; each page needs its own request object.
# create a function (the page parameter is passed in so it can be used inside the function)
create_request(page)
Define the function that builds the request object:
def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'
    # this is a GET request, so the parameters can simply be appended to the URL
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    # build the query string; a GET request does not need a further .encode() call
    data = urllib.parse.urlencode(data)
    url = base_url + data
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'}
This prints the URL of each of pages 1-10.
Inside the function, build the request object (one request per page):
request = urllib.request.Request(url=url, headers=headers)
Step two: get the response data (define a function).
def get_content():
    response = urllib.request.urlopen()
Inside the function that gets the response data we need the request, so we have to use return values!
The request-building function must return the request,
and the main block must receive it:
request = create_request(page)
Then pass it on to the function that gets the response data:
get_content(request)
Now the response function can use the request parameter:
def get_content(request):
    response = urllib.request.urlopen(request)
Step three: save the data.
# define the function
down_load()
As in the previous step, the down_load() function needs the content and page parameters, so remember to pass them in!
def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)
Full code:
import urllib.parse
import urllib.request

# key point: each page has a different URL
def create_request(page):
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100:90&action=&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = base_url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('DB_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input("Enter the starting page number: "))
    end_page = int(input("Enter the ending page number: "))
    for page in range(start_page, end_page + 1):
        request = create_request(page)
        content = get_content(request)
        down_load(page, content)
The first ten pages of Douban movies:
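Once the files are downloaded, you may want to check their contents. A minimal parsing sketch, assuming each saved file holds a JSON array whose items carry 'title' and 'score' fields (the exact field names of Douban's response are an assumption here):

import json

# read one of the downloaded pages and print a couple of fields per movie
with open('DB_1.json', 'r', encoding='utf-8') as fp:
    movies = json.load(fp)

for movie in movies:
    # 'title' and 'score' are assumed field names; .get avoids a KeyError if they differ
    print(movie.get('title'), movie.get('score'))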
II. AJAX POST Requests
KFC Official Website
Goal: find which locations in a region have KFC stores; scrape the first ten pages of data and save them locally.
Open the KFC official website, click the restaurant search, and select the city to scrape (Chengdu here).
Copy the API URL.
Compare the form data and the API URL of page 1 and page 2.
The pattern: only pageIndex differs.
This is broadly the same as the two Douban examples above. The only difference: a POST request must also byte-encode its parameters, as the sketch below shows.
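A minimal sketch of that difference (example.com is a placeholder URL; the point is the extra .encode() call and passing data= instead of appending a query string):

import urllib.parse

params = {'pageIndex': 1, 'pageSize': 10}

# GET: urlencode the parameters and append them to the URL
get_url = 'http://example.com/api?' + urllib.parse.urlencode(params)

# POST: urlencode AND byte-encode, then pass via the data= argument of Request
post_data = urllib.parse.urlencode(params).encode('utf-8')
# urllib.request.Request(url='http://example.com/api', data=post_data, headers=...)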
Source code:
import urllib.request
import urllib.parse

def create_request(page):
    base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    data = {
        'cname': "成都",  # Chengdu
        'pid': "",
        'pageIndex': page,
        'pageSize': "10"
    }
    # POST parameters must be urlencoded AND byte-encoded
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    # build the request object
    request = urllib.request.Request(url=base_url, headers=headers, data=data)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input("Enter the starting page number: "))
    end_page = int(input("Enter the ending page number: "))
    for page in range(start_page, end_page + 1):
        # build the request object
        request = create_request(page)
        # get the page source
        content = get_content(request)
        # save the data
        down_load(page, content)
Result:
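The saved files can be checked the same way. A minimal sketch; the 'Table1', 'storeName', and 'addressDetail' keys are assumptions about the shape of KFC's response and should be verified against an actual downloaded file:

import json

with open('kfc_1.json', 'r', encoding='utf-8') as fp:
    result = json.load(fp)

# 'Table1' / 'storeName' / 'addressDetail' are assumed key names
for store in result.get('Table1', []):
    print(store.get('storeName'), store.get('addressDetail'))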
III. URLError and HTTPError
Components of a URL:
- protocol
- host
- port
- path
- parameters (e.g. wd, kw)
- anchor (fragment)
Note: HTTPError is a subclass of URLError; both are defined in the urllib.error module.
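A quick way to confirm the relationship from the interpreter, as a minimal sketch:

import urllib.error

# HTTPError derives from URLError, so except clauses must list HTTPError first
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True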
Goal: get the source of a web page.
import urllib.request

url = 'https://blog.csdn.net/qq_64451048/article/details/127775623'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
Now suppose the URL is accidentally changed (an extra 1 appended): an HTTPError is raised.
Catch the exception:
try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError:
    print('The system is being upgraded...')
If the problem lies in the URL itself (for example a bad host), catch URLError instead:
except urllib.error.URLError:
    print('The system is being upgraded...')
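Putting the two handlers together: HTTPError must be caught before URLError, because HTTPError is a subclass of URLError and a later clause would never be reached. Note the explicit import of urllib.error. A minimal sketch:

import urllib.request
import urllib.error

url = 'https://blog.csdn.net/qq_64451048/article/details/127775623'  # mistype this URL to trigger HTTPError
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
except urllib.error.HTTPError as e:
    # HTTP-level failure such as 404; e.code holds the status code
    print('The system is being upgraded...', e.code)
except urllib.error.URLError as e:
    # lower-level failure such as a bad host; e.reason holds the cause
    print('The system is being upgraded...', e.reason)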
IV. Handler Processors
Handlers let us customize requests at a more advanced level. As business logic grows more complex, plain request-object customization no longer meets our needs: dynamic cookies and proxies cannot be handled through request-object customization alone.
1. Basic usage
Goal: use a handler to visit Baidu and get the page source.
Three key names: handler, build_opener, open.
import urllib.request

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
request = urllib.request.Request(url=url, headers=headers)
# create the handler object
handler = urllib.request.HTTPHandler()
# build an opener from the handler
opener = urllib.request.build_opener(handler)
# call the open method
response = opener.open(request)
content = response.read().decode('utf-8')
print(content)
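As a side note, one concrete benefit of creating the HTTPHandler yourself is its debuglevel argument, which prints the raw HTTP exchange to stdout; a minimal sketch:

import urllib.request

# debuglevel=1 makes the handler print request and response headers while the request runs
handler = urllib.request.HTTPHandler(debuglevel=1)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.status)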
2. Proxy servers
Common uses of a proxy:
- bypass restrictions on your own IP and access otherwise blocked sites
- access internal resources of an organization or group
- improve access speed
- hide your real IP
Use a proxy IP to change the IP address you appear to come from:
import urllib.request

url = 'http://www.baidu.com/s?wd=IP'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Cookie': 'BAIDUID=12196FD75453346657491E87390AC35B:FG=1; BIDUPSID=12196FD7545334661F0AE8D4B062BE2E; PSTM=1666008285; ispeed_lsm=2; BDUSS=FhRbk81OEFrZ1RFRFJrWUxCQ1dmRTZUQXp0VXA4ZGZtT0QyOUZ0T0hDRGYtSVpqRVFBQUFBJCQAAAAAAAAAAAEAAACS5FTMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN9rX2Pfa19je; BD_UPN=13314752; baikeVisitId=9616ee70-5c47-41fe-865c-14dc1b170603; COOKIE_SESSION=195933_1_0_1_1_1_1_0_0_1_0_0_0_0_6_0_1666259938_1666064006_1666259932%7C2%230_1_1666063999%7C1; ZFY=M8B:A6gXyHZyKVBf:AqksGBg5jNPPKTmxNoclm:BgHpXzI:C; B64_BOT=1; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDSFRCVID=umLOJeC62ZlGj87jjNM7q-J69LozULrTH6_n1tn9KcRk7KlESLLqEG0PWf8g0KubzcDrogKKXeOTHiFF_2uxOjjg8UtVJeC6EG0Ptf8g0f5; H_BDCLCKID_SF=tJIe_C-atC-3fP36q4rVhP4Sqxby26ntamJ9aJ5nJDoADh3Fe5J8MxCIjpLLBjK8BIOE-lR-QpP-_nul5-IByPtwMNJi2UQgBgJDKl0MLU3tbb0xynoD24tvKxnMBMnv5mOnanTI3fAKftnOM46JehL3346-35543bRTLnLy5KJtMDcnK4-XjjOBDNrP; H_PS_PSSID=36554_37555_37518_37687_37492_34813_37778_37721_37794_36807_37662_37533_37720_37740_26350_22157; delPer=0; BD_CK_SAM=1; PSINO=1; BDRCVFR[Fc9oatPmwxn]=aeXf-1x8UdYcs; BD_HOME=1; sugstore=1; BA_HECTOR=0h040g8k252hak242h8g8rou1hna11u1e; H_PS_645EC=e13b%2FA3XVtQZqyt9d0m3A8twSI3IrHVjaGptlJbr4wMhPOUE0G9YUipXLjIqNjZ2UHOS; BDSVRTM=231'
}
request = urllib.request.Request(url=url, headers=headers)
# simulate a browser visiting the server
# response = urllib.request.urlopen(request)
# proxy IP
proxies = {
    'http': '222.74.73.202:42055'
}
handler = urllib.request.ProxyHandler(proxies=proxies)
# the handler must be passed to build_opener, otherwise the proxy is never applied
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
3. Proxy pools
A simple version of a proxy pool:
import random

# placeholder proxy addresses, just to demonstrate random.choice
proxies_pool = [
    {'http': '222.74.73.202:42055111'},
    {'http': '222.74.73.202:42055222'}
]

proxies = random.choice(proxies_pool)
print(proxies)
Full source for a custom proxy pool:
import random
import urllib.request

proxies_pool = [
    {'http': '121.13.252.60:41564'},
    {'http': '121.13.252.60:41564'}
]

proxies = random.choice(proxies_pool)

url = 'http://www.baidu.com/s?wd=IP'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
# build the request object
request = urllib.request.Request(url=url, headers=headers)
handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
content = response.read().decode('utf-8')
with open('daili1.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
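Note that random.choice here picks one proxy for the whole run; to rotate proxies per request, move the choice (and the opener construction) inside the loop. A minimal sketch under the same assumptions (placeholder proxy addresses):

import random
import urllib.request

proxies_pool = [
    {'http': '121.13.252.60:41564'},  # placeholder proxy addresses
    {'http': '121.13.252.60:41564'}
]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

for i in range(3):
    proxies = random.choice(proxies_pool)          # a fresh proxy for each request
    handler = urllib.request.ProxyHandler(proxies=proxies)
    opener = urllib.request.build_opener(handler)  # opener bound to that proxy
    request = urllib.request.Request(url='http://www.baidu.com', headers=headers)
    response = opener.open(request)
    print(i, response.status)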