Web Crawling with urllib

Latest recommended article published on 2023-10-14 10:24:04

Asonare

Tags: web crawler, python

Article link: https://blog.csdn.net/Asonare/article/details/128810380

Web crawlers: the urllib library

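I. Types of crawlers

1. General-purpose crawlers

Examples: search engines such as Baidu, Google, and 360.

Drawbacks: most of the data they collect is irrelevant, and they cannot target exactly what a particular user needs.

2. Focused crawlers

A crawler program written for a specific need that fetches only the required data.

Design approach: 1. Decide which URL to crawl.

2. Simulate a browser and access the URL over HTTP to get the returned HTML.

3. Parse the HTML string.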
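
The three steps can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler: `TitleParser`, `fetch_html`, and `extract_title` are names made up for this example, and the sample HTML string stands in for a real page so the demo runs offline.

```python
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch_html(url):
    """Steps 1 and 2: request the chosen URL with a browser-like User-Agent."""
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')


def extract_title(html):
    """Step 3: parse the HTML string and return the page title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Offline demo of step 3 on a hand-written HTML string
print(extract_title('<html><head><title>Example page</title></head></html>'))
```

For a live page, `extract_title(fetch_html(url))` chains all three steps.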

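II. Anti-crawling measures

1. User-Agent checks: the server inspects the User-Agent header to identify the client

2. IP restrictions, which crawlers work around with proxy IPs

3. CAPTCHA verification

4. Dynamically loaded pages: the response contains JavaScript rather than the actual data

5. Data encryption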

III. Basic usage of the urllib library

import urllib.request
# The URL you want to visit
url = 'http://www.baidu.com'
# Send the request
response = urllib.request.urlopen(url)
# Read the response body (bytes) and decode it into a string
content = response.read().decode('utf-8')
print(content)
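
Hard-coding `'utf-8'` fails on pages served in another encoding such as GBK. One possible way around this, sketched below, is to read the charset from the `Content-Type` header. The helper names `charset_from_content_type` and `decode_body` are made up for this example. With a live response, `response.headers.get_content_charset()` gives the same information, since the headers object is an `email.message.Message` subclass.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode response bytes using the charset declared by the server."""
    charset = charset_from_content_type(content_type)
    return raw.decode(charset, errors='replace')


# Offline demo: GBK-encoded bytes with a matching header
print(decode_body('中文'.encode('gbk'), 'text/html; charset=GBK'))
```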

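1) Downloading with urllib (very important)

import urllib.request
# Download a web page (images and videos work the same way)
url = 'https://www.baidu.com'
# The second argument is the local file name to save to
urllib.request.urlretrieve(url, 'baidu.html')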
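
`urlretrieve` returns a tuple of the saved file name and the response headers. The sketch below wraps it in a small helper (`download` is a made-up name) and, so that it runs without network access, uses a `file://` URL pointing at a temporary file as a stand-in for a remote image.

```python
import pathlib
import tempfile
import urllib.request


def download(url, filename):
    """Save the resource at url to filename and return the saved path."""
    saved_path, _headers = urllib.request.urlretrieve(url, filename)
    return saved_path


# Offline demo: a temporary local file plays the role of the remote resource
with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / 'source.bin'
    source.write_bytes(b'fake image bytes')
    target = pathlib.Path(tmp) / 'copy.bin'
    download(source.as_uri(), str(target))
    copied = target.read_bytes()

print(copied)
```

For a real image, pass its `https://` address and a file name with the matching extension, for example `download(image_url, 'photo.jpg')`.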

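2) Handling User-Agent checks: a bare urlopen call is not recognized as a real browser, so the request object has to be customized.

import urllib.request

# The URL uses HTTPS
url = 'https://www.baidu.com'
# Copy the UA from the Network tab of the browser's developer tools (F12)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}

# Use keyword arguments: the second positional parameter of Request is data, not headers
request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

content = response.read().decode('utf-8')
print(content)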

3) GET requests with urllib: the quote method from the URL encoding module (use the functions below to percent-encode Chinese characters)

urllib.parse.urlencode({dict})  # for several parameters at once
urllib.parse.quote('str')       # percent-encode a single string
urllib.parse.unquote('str')     # decode a percent-encoded string

import urllib.parse
# Named params so that the built-in dict is not shadowed
params = {
    'wd':'张三',
    'sex':'男',
}
a = urllib.parse.urlencode(params)
print(a)   # prints wd=%E5%BC%A0%E4%B8%89&sex=%E7%94%B7
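
For a single value, `quote` and `unquote` round-trip as shown in this short sketch. It also shows one difference worth knowing: `quote` encodes a space as `%20`, while `urlencode` encodes it as `+`. The search URL is only an illustration of where the encoded keyword would go.

```python
import urllib.parse

keyword = '张三'
encoded = urllib.parse.quote(keyword)
print(encoded)                        # %E5%BC%A0%E4%B8%89
print(urllib.parse.unquote(encoded))  # 张三

# The encoded keyword can now be appended to a query string
url = 'https://www.baidu.com/s?wd=' + encoded
print(url)

# Spaces are handled differently by quote and urlencode
print(urllib.parse.quote('a b'))             # a%20b
print(urllib.parse.urlencode({'q': 'a b'}))  # q=a+b
```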

4) POST requests with urllib, using a Baidu Translate lookup of "spider" as the example

import urllib.parse
import urllib.request

url = 'https://fanyi.baidu.com/sug' # find the API endpoint first; here it is sug
header = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
data = {'kw': 'spider'}
# POST parameters must be URL-encoded and then converted to bytes with encode()
data = urllib.parse.urlencode(data).encode('utf-8')
request = urllib.request.Request(url,data,header)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)
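
What turns the request into a POST is the `data` argument: when it is present, `Request.get_method()` reports `POST`, otherwise `GET`. The sketch below checks this without sending anything over the network. `build_post_request` is a helper name made up for this example.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form, headers=None):
    """Return a Request whose body is the URL-encoded form, as bytes."""
    body = urllib.parse.urlencode(form).encode('utf-8')
    return urllib.request.Request(url, data=body, headers=headers or {})


request = build_post_request('https://fanyi.baidu.com/sug', {'kw': 'spider'})
print(request.get_method())  # POST, because data is not None
print(request.data)          # b'kw=spider'

# Without data the same URL would be requested with GET
print(urllib.request.Request('https://fanyi.baidu.com/sug').get_method())  # GET
```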

Parse the JSON string into a Python object:

import json

# Named obj so that the built-in object is not shadowed
obj = json.loads(content)
print(obj)
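
The raw response text usually shows Chinese characters as `\uXXXX` escapes; `json.loads` turns them back into readable text. The sample string below is a hypothetical payload written for this example, so the field names (`errno`, `data`, `k`, `v`) are assumptions and the real response may look different.

```python
import json

# Hypothetical sample shaped like a suggestion response; the real payload may differ
sample = '{"errno": 0, "data": [{"k": "spider", "v": "n. \\u8718\\u86db"}]}'

result = json.loads(sample)
for item in result['data']:
    print(item['k'], item['v'])

# ensure_ascii=False keeps Chinese characters readable when dumping back to text
print(json.dumps(result, ensure_ascii=False, indent=2))
```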

5) AJAX GET request: the first page of Douban's action movie chart

import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20' # find the API endpoint
header = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
request = urllib.request.Request(url=url,headers=header)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# Save the data to a local file; the with block closes the file automatically
with open("douban.json",'w',encoding='utf-8') as fp:
    fp.write(content)

6) AJAX GET request: the first 10 pages of Douban's action movie chart

import urllib.parse
import urllib.request

def create_request(page):
    # Build the request for one page, send it, and print the response
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
    # Page 1 starts at offset 0, page 2 at offset 20, and so on
    stat_url = {
        'start' :(page-1)*20,
        'limit' : 20
    }
    stat_url = urllib.parse.urlencode(stat_url)
    url = base_url + stat_url
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
    request = urllib.request.Request(url=url,headers=headers)
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    print(content)
# Each page needs its own request, so loop over the page numbers
if __name__ == '__main__':
    start_page = int(input("Enter the first page number: "))
    end_page = int(input("Enter the last page number: "))
    for page in range(start_page,end_page+1):
        create_request(page)
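
The paging rule is simply `start = (page - 1) * 20`. The sketch below isolates that calculation in a small function (`page_url` is a made-up name) so the generated URLs can be checked before any request is sent.

```python
import urllib.parse

BASE_URL = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
PAGE_SIZE = 20


def page_url(page, page_size=PAGE_SIZE):
    """Return the URL for a 1-based page number."""
    query = urllib.parse.urlencode({
        'start': (page - 1) * page_size,
        'limit': page_size,
    })
    return BASE_URL + query


# Print the URLs of the first three pages
for page in range(1, 4):
    print(page, page_url(page))
```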

7) AJAX POST request: the first 10 pages of KFC store listings

import urllib.parse
import urllib.request

def create_request(page):
    # Build the POST request for one page of results
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
    base_data = {
        'cname': '北京',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10',
    }
    data = urllib.parse.urlencode(base_data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
    request = urllib.request.Request(url=url,data=data,headers=headers)
    return request
def get_content(request):
    # Send the request and return the decoded response body
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content
def down_load(page,content):
    # Save each page to its own file, for example KFC1.json
    with open('KFC' + str(page) +'.json','w',encoding='utf-8') as fp:
        fp.write(content)

if __name__ == '__main__':
    start_page = int(input("Enter the first page: "))
    end_page = int(input("Enter the last page: "))
    for page in range(start_page,end_page+1):
        request = create_request(page)
        content = get_content(request)
        down_load(page,content)

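8) Weibo cookie login: when collecting data, you need to bypass the login step and go straight to a particular page

The custom request headers need more than just 'User-Agent'; the 'cookie' header is just as important.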
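
A minimal sketch of such a request is shown below. The URL and the cookie value are placeholders invented for this example; in practice the Cookie string is copied from the request headers shown in the browser's developer tools after logging in.

```python
import urllib.request

# Placeholder values: replace them with the page you need and your own Cookie string
url = 'https://example.com/profile'  # hypothetical page behind a login
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Cookie': 'session_id=PLACEHOLDER; other_key=PLACEHOLDER',
}
request = urllib.request.Request(url=url, headers=headers)

# Request stores header names capitalized, so 'User-Agent' becomes 'User-agent'
print(request.get_header('Cookie'))
print(request.has_header('User-agent'))
```

Sending it works as before with `urllib.request.urlopen(request)`. Cookies expire, so a copied value only works for a limited time.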

9) urllib handlers

urllib.request.urlopen(url)                 # cannot customize request headers
urllib.request.Request(url, data, headers)  # cannot handle dynamic cookies or proxies
# Handler: supports more advanced customization (dynamic cookies, proxies)

import urllib.request

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
request = urllib.request.Request(url=url,headers=headers)
# 1. Create the handler object
handler = urllib.request.HTTPHandler()
# 2. Build an opener from the handler
opener = urllib.request.build_opener(handler)
# 3. Call open() on the opener
response = opener.open(request)
content = response.read().decode('utf-8')
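
For the dynamic cookies mentioned above, one option in the standard library is `HTTPCookieProcessor` combined with a `CookieJar`. The sketch below only builds the opener and sends nothing; it is an illustration of the same three-step pattern, not the method used elsewhere in this article.

```python
import http.cookiejar
import urllib.request

# 1. A handler that stores cookies from responses and resends them on later requests
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)
# 2. Build an opener from the handler
opener = urllib.request.build_opener(cookie_handler)

# Optional: make plain urlopen() calls use this opener as well
# urllib.request.install_opener(opener)

# The jar stays empty until a response sets a cookie
print(len(cookie_jar))
```

Step 3 is unchanged: `opener.open(request)` sends the request, and cookies set by the server end up in `cookie_jar`.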

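10) Common uses of a proxy server

1. Get around restrictions on your own IP, for example to reach sites hosted abroad

2. Access the internal resources of an organization or group

3. Speed up access (a proxy server usually keeps a disk cache; data passing through is stored in the cache, and later requests are served directly from it)

4. Hide your real IP

import urllib.request

# Usage: only the handler differs from the Handler example; everything else stays the same
proxies = {
    'http' :'121.13.252.58:41564'
}
handler = urllib.request.ProxyHandler(proxies=proxies)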
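
Put together with the opener steps, the whole flow might look like the sketch below. `build_proxy_opener` and `fetch_via_proxy` are names made up for this example, the 10-second timeout is an arbitrary choice, and the proxy address is the sample one from above, which may no longer be reachable. Nothing is sent until `fetch_via_proxy` is called.

```python
import urllib.request


def build_proxy_opener(proxies):
    """Return an opener that routes requests through the given proxies."""
    handler = urllib.request.ProxyHandler(proxies=proxies)
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxies, timeout=10):
    """Request url through the proxy and return the decoded body."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
    request = urllib.request.Request(url=url, headers=headers)
    opener = build_proxy_opener(proxies)
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode('utf-8')


# Building the opener does not touch the network
opener = build_proxy_opener({'http': '121.13.252.58:41564'})
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

A call such as `fetch_via_proxy('http://www.baidu.com', {'http': '121.13.252.58:41564'})` then performs the actual request.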

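A single proxy IP that sends requests too frequently can still get banned, which is why proxy pools are used.

import random
import urllib.request

proxies_pool = [
    {'http': '121.13.252.58:41564'},
    {'http': '222.74.73.202:42055'},
]
# Pick one proxy at random for this request
proxies = random.choice(proxies_pool)

handler = urllib.request.ProxyHandler(proxies=proxies)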

