02 - Python Crawler Request Modules

Latest recommended article published on 2024-03-21 08:34:54

傲寒

Category: Python Crawlers

Article link: https://blog.csdn.net/qq_43407841/article/details/105932663

Class notes

1. The urllib module

Python 2: urllib2 and urllib are separate modules
Python 3: urllib and urllib2 were merged into the urllib package; requests are made with urllib.request

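1.1 Basic usage

urllib.request.urlopen("URL") sends a request to the site and returns the response object
bytes = response.read()
string = response.read().decode("utf-8")
urllib.request.Request("URL", headers=dict) builds a request object with custom headers. urlopen() called with only a URL string cannot set the User-Agent, so pass it a Request object instead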
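As a minimal sketch of the point above, the snippet below builds a Request with a custom User-Agent and inspects it without sending anything. The URL and the User-Agent string are placeholders assumed for illustration.

```python
import urllib.request

# Placeholder URL and User-Agent, assumed for illustration only
url = 'http://example.com/'
headers = {'User-Agent': 'Mozilla/5.0 (demo)'}

# A Request object carries the headers that urlopen(url) alone cannot set
request = urllib.request.Request(url, headers=headers)

# Request stores header names capitalized, so the key becomes 'User-agent'
user_agent = request.get_header('User-agent')
# No data was passed, so the request would be sent as GET
method = request.get_method()

print(user_agent, method)
```

Passing this object to urllib.request.urlopen(request) is what actually sends the request.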

Simulating a Youdao Translate request

import urllib.request
import urllib.parse
import json

# Read the text to translate
key = input('Enter text: ')
# Form data captured from the web page
data = {
    'i': key,
    'from': 'AUTO',
    'to': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '15886625875503',
    'sign': '87e7658bf4f3db9b29a0857d5c67b8cf',
    'ts': '1588662587550',
    'bv': 'cc652a2ad669c22da983a705e3bca726',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_REALTlME',
}
# URL-encode the form data
data = urllib.parse.urlencode(data)
# Convert the string to bytes
data = bytes(data, 'utf-8')
# Target URL
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'

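headers = {
    # Adjacent string literals are concatenated, which avoids the stray
    # whitespace that a backslash continuation inside the string would add
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/78.0.3904.108 Safari/537.36'
}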

# Build the request object (passing data makes it a POST request)
request = urllib.request.Request(url, data=data, headers=headers)
# Send the request and get the response object
response = urllib.request.urlopen(request)
# Read the response body and decode it as UTF-8
html = response.read().decode('utf-8')
print(type(html))           # <class 'str'>
# Parse the JSON string into a dict
dic = json.loads(html)
print(dic)                 # {'type': 'ZH_CN2EN', 'errorCode': 0, 'elapsedTime': 1,
                            # 'translateResult': [[{'src': '你好', 'tgt': 'hello'}]]}
# Extract the translated text from the dict
res = dic['translateResult'][0][0]['tgt']
print(res)                  # hello
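The two data-handling steps in the script (encoding the form, then parsing the JSON reply) can be tried offline. The sketch below uses a reduced form with two of the fields and a sample response string copied from the output comment in the script; it makes no network request.

```python
import json
import urllib.parse

# A reduced form with two of the fields used in the script
form = {'i': '你好', 'doctype': 'json'}

# urlencode percent-encodes the UTF-8 bytes of non-ASCII values
encoded = urllib.parse.urlencode(form)
# urlopen expects the POST body as bytes, not str
body = encoded.encode('utf-8')

# Sample response text, taken from the output comment in the script
sample = ('{"type": "ZH_CN2EN", "errorCode": 0, "elapsedTime": 1, '
          '"translateResult": [[{"src": "你好", "tgt": "hello"}]]}')
# translateResult is a list of lists of dicts, hence the [0][0]
result = json.loads(sample)['translateResult'][0][0]['tgt']

print(encoded)
print(result)
```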

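2. The requests module

Install requests from the command line: pip install requests
requests compared with urllib:
- requests is built on urllib3, a third-party library, not on the standard-library urllib
- requests offers the same API on Python 2 and Python 3 (current releases support Python 3 only)
- requests is simple to use
- requests automatically decompresses gzip-compressed page content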

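2.1 Basic usage

Request methods
- requests.get()
- requests.post()

Response attributes

response.text returns the body as str, decoded using response.encoding
response.content returns the body as raw bytes
response.content.decode('utf-8') decodes the bytes manually
response.url returns the URL of the response
response.encoding = 'encoding name' sets the encoding that response.text uses
response.status_code returns the HTTP status code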
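Why the choice of encoding matters can be shown with plain bytes, without any request. The sketch below assumes a page served in GBK; the bytes stand in for what response.content would hold, and the sample text is made up for illustration.

```python
# Bytes as a server might send them for a GBK-encoded page (assumed sample)
raw = '你好'.encode('gbk')

# Decoding with the matching codec recovers the text
text_ok = raw.decode('gbk')

# Decoding with the wrong codec raises an error for these bytes
try:
    raw.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(text_ok, utf8_failed)
```

With requests, the same choice is made either by setting response.encoding before reading response.text, or by calling response.content.decode() with the right codec.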

Crawling Baidu Tieba pages

import requests

'''
Baidu Tieba list-page URLs
https://tieba.baidu.com/f?kw=python&ie=utf-8&pn=0
https://tieba.baidu.com/f?kw=python&ie=utf-8&pn=50
https://tieba.baidu.com/f?kw=python&ie=utf-8&pn=100
'''
# Spider class
class BaiDuSpider(object):
    # Store the attributes that stay the same for every request
    def __init__(self, name):
        self.name = name
        # pn is the paging offset; getUrlList fills in the placeholder
        self.url = 'https://tieba.baidu.com/f?kw='+name+'&ie=utf-8&pn={}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/78.0.3904.108 Safari/537.36'
        }

    # Build the page URLs from the pattern: pn grows by 50 per page
    def getUrlList(self):
        # url_List = []
        # for i in range(5):
        #     url_List.append(self.url.format(i*50))
        return [self.url.format(i*50) for i in range(5)]

    # Send the request and return the response body as text
    def req_Page(self, url):
        return requests.get(url, headers=self.headers).text

    # Save the page to an HTML file
    def save_Page(self, html, index):
        file_Path = '{}-page-{}.html'.format(self.name, index)
        with open(file_Path, 'w', encoding='utf-8') as f:
            f.write(html)

    # Run the spider
    def run(self):
        # Build the list of page URLs
        url_List = self.getUrlList()
        # Request each page in turn; enumerate supplies the 1-based page number
        for page_Num, url in enumerate(url_List, start=1):
            html = self.req_Page(url)
            # Save the page
            self.save_Page(html, page_Num)

if __name__ == '__main__':
    # Create the spider instance
    spider = BaiDuSpider('python')
    # Start crawling
    spider.run()
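The URL pattern the spider relies on can be checked on its own before sending any request. The sketch below is a hypothetical helper (build_page_urls is not part of the original class) that builds the same URLs with urllib.parse.urlencode, which also escapes keywords containing non-ASCII characters.

```python
import urllib.parse


def build_page_urls(keyword, pages, page_size=50):
    """Return Tieba list-page URLs; pn advances by page_size per page."""
    base = 'https://tieba.baidu.com/f'
    urls = []
    for page in range(pages):
        # urlencode joins the parameters and escapes unsafe characters
        query = urllib.parse.urlencode(
            {'kw': keyword, 'ie': 'utf-8', 'pn': page * page_size})
        urls.append(base + '?' + query)
    return urls


urls = build_page_urls('python', 3)
for u in urls:
    print(u)
```

For the keyword 'python' this reproduces the three URLs listed in the docstring of the script.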

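2.2 Using proxies

Why a crawler needs proxies

- The server sees the requests as coming from different clients rather than a single one
- The crawler's real address is not exposed, which makes it harder to trace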

........
# proxies is a dict that maps the URL scheme to a proxy address
proxies = {
    'https': 'https://183.166.251.22:4216',
    'http': 'http://115.221.247.50:9999'
}
# Send the request through the proxy and get the response
res = requests.get('https://www.baidu.com', headers=headers, proxies=proxies)

print(res.status_code)
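To make successive requests appear to come from different clients, a crawler usually rotates through a pool of proxies. The sketch below is one possible way to do that, not part of the original notes: pick_proxies and PROXY_POOL are hypothetical names, and the addresses are the samples from the snippet above, which are unlikely to still be live.

```python
import random

# Sample addresses taken from the snippet above; free proxies expire
# quickly, so treat these as placeholders
PROXY_POOL = [
    'http://115.221.247.50:9999',
    'https://183.166.251.22:4216',
]


def pick_proxies(pool, rng=random):
    """Pick one address and return it in the dict form requests expects."""
    address = rng.choice(pool)
    # Route both http and https traffic through the chosen proxy
    return {'http': address, 'https': address}


proxies = pick_proxies(PROXY_POOL)
print(proxies)
```

The returned dict can be passed as the proxies argument of requests.get(), with a new pick for each request.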

Proxy sites
- Xici free proxy IPs (西刺免费代理IP): http://www.xicidaili.com/
- Kuaidaili (快代理): http://www.kuaidaili.com/
- Dailiyun (代理云): http://www.dailiyun.com/

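2.3 Differences between cookies and sessions

Cookie data is stored in the browser, while session data is stored on the server
Cookies are less secure than sessions: someone can inspect the locally stored cookie data and use it to impersonate the user
Session data is kept on the server for a period of time, so a large number of visitors can hurt server performance
A single cookie cannot hold more than 4 KB of data, and many browsers limit a site to at most 20 cookies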
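The split between the two can be seen in a Set-Cookie header: the browser keeps only a small name/value pair, typically a session ID, while the data behind that ID stays on the server. The sketch below parses a made-up header value with the standard-library http.cookies module; the cookie name and value are assumptions for illustration.

```python
from http.cookies import SimpleCookie

# A Set-Cookie header value as a server might send it (made-up sample)
header = 'sessionid=abc123; Path=/; Max-Age=3600'

cookie = SimpleCookie()
cookie.load(header)

# The client stores only this name/value pair plus its attributes;
# the data the ID refers to lives in the server-side session
session_id = cookie['sessionid'].value
path = cookie['sessionid']['path']
max_age = cookie['sessionid']['max-age']

print(session_id, path, max_age)
```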

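Pros and cons of using cookies and sessions

Benefit: with cookies and a session, the crawler can fetch pages that are only available after login
Drawback: a set of cookies and its session usually belongs to one user, so too many requests may lead the server to identify that user as a crawler
Requesting a site that requires login
- Create a session object from the Session class that ships with requests
- Send the login request with the session, so the cookies set by the site are stored in it
- Use the same session to request the pages behind the login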

Accessing Renren with session cookies

import requests

# Create a session object
session = requests.session()
# Login URL
post_url = "http://www.renren.com/PLogin.do"
# Account credentials (replace the placeholders with real values)
post_data = {'email':'邮箱','password':'密码'}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/78.0.3904.108 Safari/537.36'
}
# Send the login request; the session stores the cookies from the response
session.post(post_url,data=post_data,headers=headers)
# Request a page that requires login, reusing the stored cookies
r = session.get('http://www.renren.com/474133869/profile',headers = headers)
# Save the page source
with open('renren.html','w', encoding='utf-8') as f:
    f.write(r.text)