[Python Web Scraping] Basic Operations with the urllib Library

猪猪传奇

Category: Python 学习 (Python Learning)

Article link: https://blog.csdn.net/qq_42127861/article/details/108410884

Note: urllib ships with Python's standard library, so there is nothing to install; just import it.

1. GET Requests

# urllib is part of the Python standard library
import urllib
# The request submodule provides urlopen
from urllib import request

if __name__ == '__main__':
    # Fetch a web page; urlopen returns the server's response object
    response1 = urllib.request.urlopen(url="http://www.baidu.com")
    # read() returns bytes, so decode them into a str
    text = response1.read().decode("utf-8")
    with open('baidu.html', mode='w', encoding='utf-8') as fp:
        fp.write(text)
        print('Text data written')

    picture = 'https://tse4-mm.cn.bing.net/th/id/OIP.F64SlQFR9lbjnlDYm4ojOAHaEo?pid=Api&rs=1'
    response = urllib.request.urlopen(url=picture)
    data = response.read()
    # Image data is written as raw bytes, so no decoding or encoding is needed
    with open('picture.jpg', mode='wb') as fp:
        fp.write(data)
        print('Image data written')


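The code above performs GET requests with urllib. The two key calls are urlopen and decode. If you open the source of the Baidu home page, you can see that it declares charset=utf-8, so decoding the body as UTF-8 is all that is needed. Decoding is necessary because a network carries bytes, not characters: the server encodes the page text (Chinese characters included) into bytes before sending it, and read() hands those bytes back unchanged. Decoding with the same charset turns them back into text.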

Without decoding, the result looks like this:

<title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>

After decoding:

<title>百度一下，你就知道</title>
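The round trip between text and bytes can be reproduced without any network access. The following is a minimal sketch; the helper name decode_body is hypothetical and not part of urllib, and the UTF-8 fallback is an assumption that suits pages like the one above.

```python
# The page title from the example above, as a Python str
title = "百度一下，你就知道"

# Encoding turns the str into the bytes that travel over the network
raw = title.encode("utf-8")
print(raw)

# Decoding with the same charset restores the original text
print(raw.decode("utf-8"))


def decode_body(raw_bytes, charset=None):
    """Decode a response body, falling back to UTF-8 when no charset is known.

    With a real urlopen() response, the charset can usually be taken from
    response.headers.get_content_charset(), which returns None when the
    server did not declare one.
    """
    return raw_bytes.decode(charset or "utf-8")
```

With a live response this could be called as decode_body(response.read(), response.headers.get_content_charset()).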

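To sum up: a URL is a Uniform Resource Locator, and every URL identifies a resource on the World Wide Web, whether that resource is a page, an image, or a video. In every case the response body is just a stream of bytes. What differs is how we store and interpret it: image bytes should be written in binary mode to a file with an image extension so that viewers recognize it, while a web page is text that we decode and save as .html. In essence, pages, images, and videos are all just data streams on the Internet.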
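Since every body is just bytes, the type of a download can often be recognized from its first few bytes rather than from the URL. The sketch below checks a handful of well-known file signatures; sniff_extension is a hypothetical helper written for illustration, and the list of signatures is deliberately incomplete.

```python
def sniff_extension(data):
    """Guess a file extension from the first bytes of a response body.

    Only a few well-known signatures are checked; anything else is
    reported as '.bin'. This helper is illustrative, not exhaustive.
    """
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith(b"GIF87a") or data.startswith(b"GIF89a"):
        return ".gif"
    head = data[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return ".html"
    return ".bin"


print(sniff_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(sniff_extension(b"<!DOCTYPE html><html></html>"))
```

A caller could use the result to choose the file name before writing the bytes in 'wb' mode.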

2. Custom Request Headers: Carrying a Cookie to Simulate a Login

import gzip
import urllib
from io import BytesIO
from urllib import request

'''Carry a Cookie to simulate a login'''

if __name__ == '__main__':

    '''Method 1'''
    '''
    To build headers, copy the Request Headers section from the browser's
    F12 Network panel, then turn it into dict syntax with a regex replace:
        1. Shortcut Ctrl + R
            Search field:  (.*): (.*)
            Replace field: '$1':'$2',
            Replace all
    '''
    # Define the headers.
    # The pseudo-headers :authority, :method, :path and :scheme are not needed;
    # start from accept.
    # accept-encoding asks only for gzip, because the code below can only
    # decompress gzip (the browser's original value was "gzip, deflate, br").
    headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
               'accept-encoding': 'gzip',
               'accept-language': 'zh-CN,zh;q=0.9',
               'cache-control': 'max-age=0',
               'cookie': "cookie copied from your browser",
               'referer': 'https://weibo.com/u/7280903412/home?wvr=5',
               'sec-fetch-dest': 'document',
               'sec-fetch-mode': 'navigate',
               'sec-fetch-site': 'same-origin',
               'sec-fetch-user': '?1',
               'upgrade-insecure-requests': '1',
               'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36', }
    # Build the Request object
    url = 'https://weibo.com/u/7280903412/home?wvr=5'
    request1 = request.Request(url=url, headers=headers)
    # Send the request
    response = urllib.request.urlopen(request1)
    # Read the raw response body
    htmls = response.read()
    # The body is gzip-compressed (not encrypted) when the server honors
    # accept-encoding, so check the header and decompress before decoding
    if response.headers.get('Content-Encoding') == 'gzip':
        buff = BytesIO(htmls)
        f = gzip.GzipFile(fileobj=buff)
        htmls = f.read()
    htmls = htmls.decode('utf-8')
    #print(htmls)

    '''Method 2'''
    request2 = request.Request(url=url)
    request2.add_header('cookie', "cookie copied from your browser")
    # Send the request
    response = urllib.request.urlopen(request2)
    # No accept-encoding header was sent, so the body is not compressed
    # and no gzip step is needed.
    # read() still returns bytes (printed with a leading b), hence decode()
    htmls = response.read().decode('utf-8')

    print(htmls)

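The code above simulates a Weibo login. The cookie carries the user's session information, so the request skips the login page and gets the logged-in user's content directly. As shown, there are two ways to add the cookie: put it in the headers dict passed to Request, or call add_header on the Request object. In both cases urlopen is given a Request object rather than a bare URL string. Note that you do not have to send every header; include only those you need, for example only cookie. Also, cookies expire, so a copied cookie stops working after a while.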
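The two ways of attaching the cookie can be compared offline, since building a Request object sends nothing. The sketch below also shows one possible way to decompress a body only when the server reports gzip; decode_body is a hypothetical helper, and the cookie string is a placeholder rather than a real cookie.

```python
import gzip
from urllib import request

URL = "https://weibo.com/u/7280903412/home?wvr=5"
COOKIE = "cookie copied from your browser"  # placeholder, not a real cookie

# Way 1: pass the cookie in the headers dict
req1 = request.Request(url=URL, headers={"cookie": COOKIE})

# Way 2: add it to an existing Request object
req2 = request.Request(url=URL)
req2.add_header("cookie", COOKIE)

# Request stores header names via str.capitalize(), so both end up under "Cookie"
print(req1.get_header("Cookie") == req2.get_header("Cookie"))


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress the body only if the server says it is gzip, then decode it."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)
```

With a live response, the helper could be called as decode_body(response.read(), response.headers.get("Content-Encoding")), which covers both the compressed and the uncompressed case.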

This cookie is the one shown in the browser's Network panel (press F12). Requests to a given URL must carry the cookie that belongs to that site.
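As an alternative to an editor's regex replace, the header text copied from the Network panel can be converted to a dict in Python. This is a minimal sketch; parse_raw_headers is a hypothetical helper, and the sample text is made up to mirror the headers used in this section.

```python
def parse_raw_headers(raw_text):
    """Turn 'name: value' lines copied from the browser into a dict.

    Pseudo-headers such as ':authority' or ':method' (names starting with
    a colon) are skipped, and surrounding whitespace is stripped.
    """
    headers = {}
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name: value' line
        headers[name.strip()] = value.strip()
    return headers


sample = """
:authority: weibo.com
:method: GET
accept-language: zh-CN,zh;q=0.9
cookie: cookie copied from your browser
referer: https://weibo.com/u/7280903412/home?wvr=5
"""
headers = parse_raw_headers(sample)
print(headers)
```

The resulting dict can be passed directly as the headers argument of request.Request.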

3. GET Requests with Parameters

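import urllib
from urllib import request
from urllib import parse

# Test a GET request; two equivalent URL templates
url1 = 'http://httpbin.org/get?{}'
url2 = 'http://httpbin.org/get?%s'
if __name__ == '__main__':
    params = {'age':35,'sex':'男','work_years':'15'}
    # params contains the non-ASCII value '男', so it must be URL-encoded
    params = parse.urlencode(params)
    # Both ways of filling in the query string produce the same URL,
    # so a single request is enough
    full_url1 = url1.format(params)
    full_url2 = url2 % params
    assert full_url1 == full_url2
    response = urllib.request.urlopen(url=full_url1)
    print(response.read().decode('utf-8'))
    # "With parameters" here means GET parameters, which are exposed in the URL,
    # so the query string is appended to the URL before calling urlopen.
    # urlopen's data argument is not used: passing data turns the request into a POST.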


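The important points in the code above are the test site, the two ways of inserting the params into the URL, and the fact that with urllib you must URL-encode the params yourself.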

http://httpbin.org/

This site lets you test GET and POST requests, and it can also report the IP address of the requesting host.
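The manual encoding step can be inspected without sending a request. The sketch below shows what urlencode produces for the params used in this section and how parse_qs reverses it; the variable names are chosen for illustration.

```python
from urllib import parse

params = {'age': 35, 'sex': '男', 'work_years': '15'}

# urlencode percent-encodes each value (as UTF-8 by default) and joins the pairs with '&'
query = parse.urlencode(params)
print(query)  # age=35&sex=%E7%94%B7&work_years=15

# Both templates from the example produce the same final URL
url_a = 'http://httpbin.org/get?{}'.format(query)
url_b = 'http://httpbin.org/get?%s' % query
print(url_a == url_b)

# parse_qs reverses the encoding; every value comes back as a list of str
decoded = parse.parse_qs(query)
print(decoded)
```

Note that the integer 35 comes back as the string '35': query strings carry text only.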

4. POST Requests with Parameters

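import urllib
from urllib import request
from urllib import parse

# httpbin exposes its POST endpoint at /post, with no trailing slash.
# The path has to match what the server defines: by convention a trailing slash
# suggests a directory-like resource, while a single service endpoint has none.
url = 'http://httpbin.org/post'

if __name__ == '__main__':
    '''
        About encode() and decode():
            An encoding is a rule for mapping text (str) to bytes and back.
            encode() turns a str into bytes using a named encoding (UTF-8 by default);
            decode() turns bytes back into a str and must use the same encoding.
            Bytes carry no record of how they were produced, so the encoding has to be
            known from context, for example from the charset the server declares.
    '''
    params = {'Language':'Python','salary':20000,'work_time':996}
    # urlencode() returns a str; encode() converts it to bytes
    params = parse.urlencode(params).encode()
    # Send a POST request: passing data makes urlopen use POST
    # By default the User-Agent header is "Python-urllib/x.y" (e.g. "Python-urllib/3.7")
    response = urllib.request.urlopen(url=url,data=params)
    print(response.read().decode())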


The code above tests a POST request against http://httpbin.org. Note that for a POST request with urllib, the parameters must be encoded to bytes.
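How the data argument switches the request method can be checked offline by building Request objects and asking for their method. This is a small sketch; the User-Agent value "my-crawler/0.1" is made up for illustration.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
params = {'Language': 'Python', 'salary': 20000, 'work_time': 996}

# urlencode gives a str; data must be bytes, so encode it
data = parse.urlencode(params).encode('utf-8')
print(data)

# A Request without data is a GET; with data it becomes a POST
get_req = request.Request(url='http://httpbin.org/get')
post_req = request.Request(url=url, data=data)
print(get_req.get_method(), post_req.get_method())

# Optionally replace the default "Python-urllib/x.y" User-Agent
# ("my-crawler/0.1" is a made-up value for illustration)
post_req.add_header('User-Agent', 'my-crawler/0.1')
```

Passing post_req to urllib.request.urlopen would then send the form data in the request body.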

5. Using urlretrieve to Download a Video and Save It Locally

import urllib
from urllib import request

# Download a video to the local disk
url = 'http://vfx.mtime.cn/Video/2019/08/28/mp4/190828213254920717_480.mp4'

if __name__ == '__main__':
    # High-level helper: no need to open a file yourself
    print('begin')
    urllib.request.urlretrieve(url = url, filename='video.mp4')
    print('ok')

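The filename argument in the code above specifies where the video is saved. urlretrieve wraps the whole download: it fetches the URL and writes the file in a single call.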
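For large files it can help to see progress. urlretrieve accepts a reporthook callback that receives the block number, block size, and total size. The sketch below is one possible way to use it; percent_done, show_progress, and download are hypothetical helper names, and nothing is downloaded until download is called.

```python
from urllib import request


def percent_done(block_num, block_size, total_size):
    """Return download progress as a percentage, or None if the size is unknown."""
    if total_size <= 0:
        return None
    return min(100.0, block_num * block_size * 100.0 / total_size)


def show_progress(block_num, block_size, total_size):
    """reporthook callback: called once at the start and after each block read."""
    pct = percent_done(block_num, block_size, total_size)
    if pct is not None:
        print('\rdownloaded {:.1f}%'.format(pct), end='')


def download(url, filename):
    """Download url to filename, printing progress along the way."""
    path, headers = request.urlretrieve(url, filename=filename, reporthook=show_progress)
    return path
```

A call such as download(url, 'video.mp4') would behave like the example above while printing a running percentage. The Python documentation describes urlretrieve as a legacy interface, so for new code it may be worth considering urlopen combined with a manual file write.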

6. IP Proxies

import urllib
from urllib import request

# Free proxies from Kuaidaili
# https://www.kuaidaili.com/free/inha/

if __name__ == '__main__':
    # This endpoint returns the IP address of the caller
    url = 'http://httpbin.org/ip'
    # Send a request without a proxy
    response1 = urllib.request.urlopen(url = url)

    # print(response1.getcode())# get the response status code
    # print(response1.geturl()) # get the URL that was accessed
    print(response1.read().decode())

    # Use a proxy
    ph = urllib.request.ProxyHandler({'http':'61.163.32.88:3128'})
    # Build an opener that routes requests through the proxy handler
    opener = urllib.request.build_opener(ph)
    # Open the URL through the proxy
    response2 = opener.open(url)

    print('Proxy IP in use:',response2.read().decode())

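This is an example of urllib using a proxy IP to access another site. Because free proxy IPs are unreliable, no result is shown here. If you have collected many proxy IPs, you can use random.choice(proxies) to pick one at random from proxies.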
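The random selection mentioned above could look like the following sketch. The addresses are placeholders from the 192.0.2.0/24 documentation range, not working proxies, and build_random_proxy_opener is a hypothetical helper name.

```python
import random
from urllib import request

# Placeholder addresses from the 192.0.2.0/24 documentation range;
# replace them with proxies you have actually verified
proxies = ['192.0.2.10:3128', '192.0.2.11:8080', '192.0.2.12:8888']


def build_random_proxy_opener(proxy_list):
    """Pick one proxy at random and return (proxy, opener) for HTTP requests."""
    proxy = random.choice(proxy_list)
    handler = request.ProxyHandler({'http': proxy})
    return proxy, request.build_opener(handler)


proxy, opener = build_random_proxy_opener(proxies)
print('chosen proxy:', proxy)
```

With real proxies, opener.open('http://httpbin.org/ip', timeout=10) would send the request through the chosen one; setting a timeout is advisable because free proxies often hang.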
