Spider [Part 3] The Python crawler module requests

Introduction to requests

The requests module is a third-party Python library (install it with pip install requests); it is mainly used to send HTTP requests.

Chinese documentation: http://docs.python-requests.org/zh_CN/latest/index.html

Basic usage of the requests module

Send simple requests and read the responses with the requests module

1. requests.get()

Where we use GET requests:
    downloading web pages
    searching

 

1.1 Downloading a web page

import requests  # install the requests library beforehand (pip install requests)

response = requests.get('https://www.baidu.com/')  # send the HTTP request
response.encoding = "utf-8"  # decode the body as utf-8, otherwise the text comes out garbled
print(response.text)  # print the page content
print(response.status_code)  # print the status code; 200 means OK

 

response.text
    Type: str
    Decoding: requests makes an educated guess at the text encoding from the HTTP headers and decodes the body with that guess
    How to change it: response.encoding = "gbk"

response.content
    Type: bytes
    Decoding: none; you get the raw bytes
    How to decode them yourself: response.content.decode("utf-8") (see the sketch below)
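A minimal sketch contrasting the two properties and the two ways of fixing the encoding (same Baidu page as above; apparent_encoding is the encoding requests detects from the body itself):

import requests

response = requests.get("https://www.baidu.com/")

print(response.encoding)           # guessed from the HTTP headers, e.g. 'ISO-8859-1'
print(response.apparent_encoding)  # detected from the body itself, e.g. 'utf-8'

# Option 1: fix the encoding, then read the already-decoded str
response.encoding = response.apparent_encoding
text = response.text               # str

# Option 2: decode the raw bytes yourself
text2 = response.content.decode("utf-8")  # bytes -> str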

(Screenshots omitted: inspecting the page's charset in the HTML source, running the code, and opening the downloaded page in PyCharm or a browser to check the result.)

1.2 Saving an image

import requests  # install the requests library beforehand

response = requests.get('https://www.baidu.com/img/bd_logo1.png')  # send the HTTP request
print(response.status_code)  # print the status code; 200 means OK

with open('baidu.png', 'wb') as f:  # the image is binary (byte) data
    f.write(response.content)
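For a large file, a streamed download avoids holding the whole body in memory. A minimal sketch using the stream parameter (the URL is the same Baidu logo; the chunk size is arbitrary):

import requests

url = 'https://www.baidu.com/img/bd_logo1.png'
response = requests.get(url, stream=True)  # the body is not downloaded until it is iterated

with open('baidu_stream.png', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):  # read roughly 1 KB at a time
        if chunk:
            f.write(chunk)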

1.3 Searching

Notes on URL parameters
  Many of the parameters in a URL are useless. In Baidu's search URL, for example, only one field matters; all the others can be deleted.

  Accordingly, when later crawls run into URLs carrying lots of parameters, it is worth trying to strip them out.

(Screenshot omitted: the Baidu search URL with the redundant parameters removed.)

 

import requests

query_string = input(":")
params = {"wd": query_string}

url = "https://www.baidu.com/s?wd={}".format(query_string)
response = requests.get(url)

print(response.text)
print(response.request.headers)

Baidu's anti-crawler measures key on the User-Agent here, so grab a real User-Agent string (plenty can be found online):

 
# coding=utf-8
import requests

query_string = input(":")
params = {"wd": query_string}

url = "https://www.baidu.com/s?wd={}".format(query_string)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"}

response = requests.get(url, headers=headers)

print(response.text)
print(response.request.headers)

More anti-crawler measures, and how to get around them, will be covered in a dedicated later post.

2. requests.post()

Where we use POST requests:

	logging in and registering (POST is safer than GET)
	transmitting large bodies of text (POST places no length limit on the data); a short sketch of the two body formats follows below
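As a quick sketch of the two common request bodies requests.post() can send (httpbin.org is used here only because it echoes the request back; it is not part of the original examples):

import requests

# form-encoded body (Content-Type: application/x-www-form-urlencoded), like a browser form
r1 = requests.post('http://httpbin.org/post', data={'kw': 'hello'})

# JSON body (Content-Type: application/json), serialized and labelled by requests itself
r2 = requests.post('http://httpbin.org/post', json={'kw': 'hello'})

print(r1.json()['form'])  # {'kw': 'hello'}
print(r2.json()['json'])  # {'kw': 'hello'}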

 

Baidu word translation

 

import json
import requests


def fanyi(keyword):
    base_url = 'https://fanyi.baidu.com/sug'

    # build the request body
    data = {
        'kw': keyword
    }

    # pretend to be a browser
    header = {
        "User-Agent": "mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
        "Content-Type": "application/x-www-form-urlencoded"
    }

    req = requests.post(url=base_url, data=data, headers=header)

    # get the JSON string from the response
    str_json = req.text
    # parse the JSON into a dict
    myjson = json.loads(str_json)
    info = myjson['data'][0]['v']
    print(info)


if __name__ == '__main__':
    while True:
        keyword = input('Enter a word to translate (q to quit): ')
        if keyword == 'q':
            break
        fanyi(keyword)

 

requests in detail

Python's standard library provides urllib, urllib2, httplib and other modules for making HTTP requests, but their API is awful. It was built for another era and another internet, and even the simplest task takes an enormous amount of work, including overriding various methods.

The request methods

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)
 
# all of the methods above are built on top of this one
requests.request(method, url, **kwargs)

The requests module already wraps the common HTTP request methods; just call the corresponding function. The full parameter list shared by these methods is shown below, followed by a small sketch combining a few of them:

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
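A minimal sketch combining a few of these keyword arguments (the values are placeholders; httpbin.org simply echoes the request back):

import requests

response = requests.get(
    'http://httpbin.org/get',
    params={'q': 'python'},                 # appended to the query string
    headers={'User-Agent': 'Mozilla/5.0'},  # extra request headers
    timeout=(3, 10),                        # (connect timeout, read timeout)
    allow_redirects=True,                   # follow 3xx redirects
)
print(response.url)          # http://httpbin.org/get?q=python
print(response.status_code)  # 200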

Digging deeper into requests

1. Using proxies

Types of proxy IP (covered below)
Why use a proxy
    to make the server think the requests are not all coming from the same client
    to keep our real address from being exposed and traced back to us
requests.get("http://www.baidu.com", proxies = proxies)
requests.post("http://www.baidu.com", proxies = proxies)

proxies = { 
    "http": "http://12.34.56.78:9000", 
    "https": "https://12.34.56.78:9000", 
    }
Types of proxy IP
    transparent proxy (Transparent Proxy)
    anonymous proxy (Anonymous Proxy)
    distorting proxy (Distorting Proxy)
    elite / high-anonymity proxy (Elite Proxy or High Anonymity Proxy)


Choosing an IP:
    an elite (high-anonymity) proxy keeps the target from noticing that a proxy is in use at all; the other types can be detected
    by protocol, proxy IPs split into HTTP proxies, HTTPS proxies, SOCKS proxies and so on; pick one that matches the protocol of the site being crawled
Refreshing the proxy IP pool

    With purchased proxies (Beagle and the like), a large share of the IPs (often more than 60%) may be unusable at any given time, so a program is needed to probe which ones still work and delete the dead ones, as the sketch below shows.
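A minimal sketch of such a liveness check (the proxy addresses are placeholders; any fast URL works as the probe target):

import requests

candidate_proxies = [
    'http://12.34.56.78:9000',
    'http://98.76.54.32:8080',
]

def alive(proxy, test_url='http://httpbin.org/ip', timeout=5):
    # a proxy counts as usable if a request through it succeeds within the timeout
    try:
        r = requests.get(test_url, proxies={'http': proxy, 'https': proxy}, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

usable = [p for p in candidate_proxies if alive(p)]
print(usable)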

2. Getting cookies

Method 1

requests.utils.dict_from_cookiejar: converts a CookieJar object into a plain dict

import requests

url = "http://www.baidu.com"
response = requests.get(url)
print(type(response.cookies))

cookies = requests.utils.dict_from_cookiejar(response.cookies)
print(cookies)

'''
<class 'requests.cookies.RequestsCookieJar'>
{'BDORZ': '27315'}
'''
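The reverse conversion also exists in requests.utils, which helps when a cookie dict (for example one copied out of the browser) needs to become a jar again; a short sketch:

import requests

cookie_dict = {'BDORZ': '27315'}
jar = requests.utils.cookiejar_from_dict(cookie_dict)
print(type(jar))  # <class 'requests.cookies.RequestsCookieJar'>

# the jar (or the plain dict) can be passed straight back into a request
requests.get("http://www.baidu.com", cookies=jar)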

Method 2

import os
import sqlite3

import requests
from bs4 import BeautifulSoup
from win32crypt import CryptUnprotectData  # from the pywin32 package


def getcookiefromchrome(host='.oschina.net'):
    "Read a site's cookies out of the local Chrome profile (Windows)."
    cookiepath = os.environ['LOCALAPPDATA'] + r"\Google\Chrome\User Data\Default\Cookies"
    sql = "select host_key,name,encrypted_value from cookies where host_key='%s'" % host
    with sqlite3.connect(cookiepath) as conn:
        cu = conn.cursor()
        cookies = {name: CryptUnprotectData(encrypted_value)[1].decode() for host_key, name, encrypted_value in
                   cu.execute(sql).fetchall()}
        return cookies


def ret_soup(url, cookies_dict):
    "Fetch the page with the browser's cookies and return a BeautifulSoup."
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, cookies=cookies_dict, headers=headers, proxies=None,
                            timeout=10)
    response.encoding = "utf8"
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


def start():
    "Entry point."
    cookies_dict = getcookiefromchrome(host='.bitinfocharts.com')
    site = 'https://bitinfocharts.com'

    baseurl = site + '/top-100-richest-bitcoin-addresses.html'
    soup = ret_soup(baseurl, cookies_dict)
    print(soup)


if __name__ == '__main__':
    start()
(Windows only: the snippet reads Chrome's local cookie database and decrypts the values with the Win32 DPAPI.)

3. Handling certificate errors

This error is caused by an SSL certificate that fails verification (ssl.CertificateError); skipping verification works around it:

import requests

url = "https://www.12306.cn/mormhweb/"
response = requests.get(url, verify=False)
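With verify=False, urllib3 prints an InsecureRequestWarning on every request; if that noise is unwanted it can be silenced explicitly (a small sketch, not required for the request to work):

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # suppress the warning

url = "https://www.12306.cn/mormhweb/"
response = requests.get(url, verify=False)
print(response.status_code)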

4. The timeout parameter

While browsing we regularly hit network hiccups, and a request may hang for a long time without ever producing a result.

Likewise, in a crawler a request that never returns drags the whole project's throughput down, so we force the request to come back within a fixed time or raise an error:

response = requests.get(url, timeout=5)
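The timeout raises an exception instead of returning an empty response, so in practice it is paired with exception handling; a minimal sketch:

import requests

url = "https://www.baidu.com/"
try:
    response = requests.get(url, timeout=5)
    print(response.status_code)
except requests.exceptions.Timeout:
    print("the request did not finish within 5 seconds")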

5. The retrying module

The timeout above speeds things up overall, but when a page is slow during normal browsing we simply refresh it. Can we "refresh" a request in code as well?

The retrying module: https://pypi.org/project/retrying/

How to use retrying

    use the retry decorator that the retrying module provides
    applied as a decorator, it makes the decorated function run again when it fails
    retry takes a stop_max_attempt_number argument: after an error the function is retried up to that many times; if every attempt fails the whole call raises, and if any attempt succeeds execution carries on
import requests
from retrying import retry

headers = {}


@retry(stop_max_attempt_number=3)  # retry up to 3 times; only raise if all 3 attempts fail
def _parse_url(url):
    response = requests.get(url, headers=headers, timeout=3)  # a timeout raises an error and triggers a retry
    assert response.status_code == 200  # a non-200 status code also raises and triggers a retry
    return response


def parse_url(url):
    try:  # catch the final failure
        response = _parse_url(url)
    except Exception as e:
        print(e)
        response = None
    return response

Getting past basic anti-crawler measures

The examples above show:

    Basic anti-crawler measures:
        1. Check how browser-like the client is, i.e. inspect request-header fields (User-Agent, Cookie, ...)
        2. Rate-limit or even ban a particular IP or account

    Countermeasures:
        1. Copy a real browser's headers into the program's headers
        2. Rotate multiple proxy IPs / multiple accounts (see the sketch below)
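A minimal sketch of rotating User-Agent strings and proxy IPs from request to request (all the values are placeholders):

import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Safari/605.1.15",
]
PROXIES = [
    {"http": "http://12.34.56.78:9000"},
    {"http": "http://98.76.54.32:8080"},
]

def fetch(url):
    # pick a different identity for every request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = random.choice(PROXIES)
    return requests.get(url, headers=headers, proxies=proxies, timeout=5)

response = fetch("http://httpbin.org/ip")
print(response.text)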

Common anti-crawler measures

Request headers
    User-Agent: the device the user is browsing with
    Referer: "xxx"
    Content-Type: application/json
    Host


Requiring a cookie or token with the request
Encrypting request parameters
Detecting changes of IP
Limiting the request frequency
CAPTCHAs
Hiding part of the data behind the login page
Loading data dynamically with JS so the page is hard to analyse
Flagging clients that only request the HTML and never load the CSS/JS/media
Serving fake data to detected crawlers
A robust account system

More examples

Crawling Autohome

eg1:

import requests
from bs4 import BeautifulSoup  # install the BeautifulSoup4 package beforehand
 
 
response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = 'gbk'
soup = BeautifulSoup(response.text,'html.parser')
tag = soup.find(id='auto-channel-lazyload-article')  # BeautifulSoup tags support chained find calls
h3 = tag.find(name='h3')
print(h3)

eg2:

import requests
from bs4 import BeautifulSoup
import re

# find all the news items
# title, summary, url, image
HTTPS = 'https:'  # the urls in the page have no scheme, so prepend https before requesting them
response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = 'gbk'
soup = BeautifulSoup(response.text, 'html.parser')

li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')[:3]  # only take 3 news items

for li in li_list:
    title = li.find('h3')
    if not title:
        continue

    summary = li.find('p').text
    # li.find('a').attrs is a dict of the tag's attributes

    url = HTTPS + li.find('a').get('href')  # equivalent to li.find('a').attrs['href']
    img = HTTPS + li.find('img').get('src')

    # download the image
    res = requests.get(img)
    re_image_name = re.match(r'.*__(.*)\.jpg', img)
    if re_image_name:
        image_name = re_image_name.group(1)
        file_name = "%s.jpg" % (image_name,)
        with open(file_name, 'wb') as f:
            f.write(res.content)

Logging in to GitHub

 Open the browser's developer tools and look at the Form Data of the login request

Tip: in the Network panel, the entry for the page you are currently on is shown with a blue background

 

import requests
from bs4 import BeautifulSoup

# get the token
r1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(r1.text,'html.parser')
token = s1.find(name='input',attrs={'name':'authenticity_token'}).get('value')
r1_cookie_dict = r1.cookies.get_dict()
# POST the username, password and token to the server
"""
utf8:✓
authenticity_token:ollV+avLm6Fh3ZevegPO7gOH7xUzEBL0NWdA1aOQ1IO3YQspjOHbfnaXJOtVLQ95BtW9GZlaCIYd5M6v7FGUKg==
login:
password:
commit:Sign in
"""
r2 = requests.post(
    'https://github.com/session',
    data={
        "utf8": '✓',
        "authenticity_token": token,
        'login': 'xxx',
        'password': 'xxx',
        'commit': 'Sign in'
    },
    cookies=r1_cookie_dict
)

r2_cookie_dict = r2.cookies.get_dict()

# for most sites that require login, the cookies set by the login POST are enough; some sites, like GitHub, also need the cookies from the login GET page
cookie_dict = {}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)
#
r3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict
)

r4 = requests.get(
    url='https://github.com/ecithy/online-edu',
    cookies=cookie_dict
)
print(r4.text)

Logging in to Chouti

 Method 1

# 1. log in and get the cookie
# 2. hit the item's url, xxxx
import requests
from bs4 import BeautifulSoup

# 1. get the cookie
r0 = requests.get('http://dig.chouti.com/')
r0_cookie_dict = r0.cookies.get_dict()

# 2. send the username, password and cookie
r1 = requests.post(
    'http://dig.chouti.com/login',
    data={
        'phone': 'xxx',
        'password': 'xxx',
        'oneMonth':1
    },
    cookies=r0_cookie_dict
)
r1_cookie_dict = r1.cookies.get_dict()


cookie_dict = {}
cookie_dict.update(r0_cookie_dict)
cookie_dict.update(r1_cookie_dict)


r2 = requests.post('http://dig.chouti.com/link/vote?linksId=13915601',cookies=cookie_dict)
print(r2.text)

 Method 2: Session (cookies are saved automatically, sparing us the tedium of handling them by hand)

import requests

session = requests.Session()
r1 = session.get(url="http://dig.chouti.com/")
r2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "xxx",
        'password': "xxx",
        'oneMonth': ""
    }
)
r3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=13915601"
)
print(r3.text)
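A Session can also carry default headers, so the User-Agent only has to be set once for every request it sends (a short sketch, not part of the original example):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # sent with every request from this session

r = session.get("http://httpbin.org/headers")
print(r.request.headers["User-Agent"])  # Mozilla/5.0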

Logging in to Zhihu

# -*- coding: utf-8 -*-
__author__ = 'hy'

import requests

try:
    import cookielib
except:
    import http.cookiejar as cookielib

import re

session = requests.session()
# session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")  # use a file-backed cookie jar
try:
    session.cookies.load(ignore_discard=True)
except:
    print("could not load cookies")

agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhizhu.com",
    'User-Agent': agent
}


def is_login():
    # use the status code of a page that requires login to tell whether we are logged in
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    if response.status_code != 200:
        print('not logged in')
    else:
        print('already logged in')


def get_xsrf():
    # get the xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return (match_obj.group(1))
    else:
        return ""

def zhihu_login(account, password):
    # log in to Zhihu
    if re.match(r"^1\d{10}", account):
        print("logging in with a phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
        }
    else:
        if "@" in account:
            # the account name looks like an email address
            print("logging in with an email address")
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": get_xsrf(),
                "email": account,
                "password": password
            }

    response_text = session.post(post_url, data=post_data, headers=header)
    # session.cookies.save()


# zhihu_login("xxx", "xxx")
zhihu_login("xxx", "xxx")
is_login()


Reposted from: https://www.cnblogs.com/hyit/articles/10662682.html
