Getting Into Web Scraping (Part 1)

Latest recommended article published on 2024-07-27 12:20:46

扁桃体治愈者


Category: Web Scraping. Tags: python

Article link: https://blog.csdn.net/weixin_44366264/article/details/107663717


Learning web scraping

Preparation:

Software: PyCharm 2019.1.2 (cracked version)

Link: https://pan.baidu.com/s/1Nq0h7JmiRorMXQXdh8xYwQ
Extraction code: sbnq

If the link has expired, message me.

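Install Python first, then PyCharm. If you don't know how to crack it, you can simply search online.

Here is another option: download PyCharm Professional from the official site.

Open this URL:

http://lookdiv.com/index/index/indexcodeindex.html

Enter the following to get the continuously updated Activate Code:

lookdiv.com

Installing the library

Open the Run dialog (Win+R), type cmd, confirm, and then enter: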

pip install requests


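The difference between requests and request:

According to people online, the package named request was uploaded by someone known as Teacher Guo.

The only genuine one I recommend is requests. Inside requests, get() calls the library's own request() function, which is unrelated to that package.

Let's get cracking:

Please welcome our first victim (Dedicator):

Take https://818ps.com/ (图吧) as an example:

The first function to take the stage in requests is get.

Let's look at the code first: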

def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object: ``GET``, ``OPTIONS``, ``HEAD``, ``POST``, ``PUT``, ``PATCH``, or ``DELETE``.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param json: (optional) A JSON serializable Python object to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How many seconds to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'https://httpbin.org/get')
      >>> req
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

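A bit overwhelming, isn't it? Haha.

Bear with me. As you can see, get accepts the following arguments:

1. url: the full URL of the page to request (not just the site's domain);

2. params: a dictionary, list of tuples, or bytes to send in the query string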

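import requests

params1 = {'recharge': 36, 'fee_id': 'ireader_nonrenew_vip'}
res = requests.get("https://818ps.com/", params=params1)
print(res.status_code)  # print the status code; it will be very useful later
print(res.url)  # the URL that was requested, with params in the query string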

Result: [screenshot of the program output]
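To make the effect of params concrete without sending a request, here is a minimal sketch that uses only the standard library. It assumes requests encodes a simple dictionary the same way urllib.parse.urlencode does, which holds for plain string and integer values like the ones used here. The helper name build_url is made up for illustration.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append params to base_url as a query string (hypothetical helper)."""
    query = urlencode(params)  # {'a': 1, 'b': 'x'} -> 'a=1&b=x'
    return f"{base_url}?{query}" if query else base_url


params1 = {'recharge': 36, 'fee_id': 'ireader_nonrenew_vip'}
url = build_url("https://818ps.com/", params1)
print(url)  # https://818ps.com/?recharge=36&fee_id=ireader_nonrenew_vip
```

This should match what res.url prints for the requests.get call, provided the site does not redirect the request somewhere else.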

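3. **kwargs: covers many optional arguments:

1) method: the HTTP method name, such as 'get' or 'post' (this is the first argument of request(); get() fills it in for you)

data: a dictionary, list of tuples, bytes, or file-like object to send as the request body

json: JSON-serializable data to send as the request body

cookies: a dictionary or CookieJar of cookies to send with the request

headers: a dictionary of custom HTTP headers

auth: a tuple that enables HTTP authentication

files: a dictionary, used to upload files

timeout: how many seconds to wait for the server before giving up, as a float or a (connect timeout, read timeout) tuple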
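One way to see what these arguments do is to build a request without sending it. The sketch below uses requests.Request and its prepare() method. The URL points at httpbin.org, borrowed from the docstring of request(), and the header, cookie, and payload values are placeholders made up for illustration.

```python
import requests

# Build a POST request object; nothing is sent over the network here.
req = requests.Request(
    method="post",
    url="https://httpbin.org/post",
    params={"page": 1},                    # goes into the query string
    data={"name": "demo"},                 # form-encoded request body
    headers={"User-Agent": "demo-agent"},  # custom HTTP header (placeholder)
    cookies={"session": "abc123"},         # sent in the Cookie header (placeholder)
)
prepared = req.prepare()

print(prepared.method)                   # POST
print(prepared.url)                      # https://httpbin.org/post?page=1
print(prepared.body)                     # name=demo
print(prepared.headers["User-Agent"])    # demo-agent
print(prepared.headers["Cookie"])        # session=abc123
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```

Arguments such as timeout are not part of the request object itself; they apply when the request is actually sent, for example requests.get(url, timeout=10).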

(1) If you want to spell out method="get" explicitly:

response = requests.request(url="https://movie.douban.com/top250",method="get")
or
response = requests.request("get","https://movie.douban.com/top250")

(2) If that feels like too much trouble, use the following, where the method defaults to get:

response = requests.get("https://movie.douban.com/top250")

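That's all for now; I'll explain the rest as we run into it later.

On any web page, right-click and choose Inspect from the menu, and you will see a big pile of HTML code.

The HTML holds a lot of text, text links, images, image links, video links, and so on.

So a scraper typically starts by fetching the HTML that the server returns for the URL.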
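As a rough illustration of what "links and images inside the HTML" means in practice, here is a minimal sketch that pulls them out with the standard library's html.parser. The HTML snippet and the class name LinkCollector are made up for illustration; in a real scraper the input would be response.text, and many people would reach for a dedicated parsing library instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


# A made-up HTML snippet standing in for response.text.
sample_html = """
<html><body>
  <a href="https://example.com/page1">Page 1</a>
  <img src="https://example.com/a.png" alt="A">
  <a href="/page2">Page 2</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)   # ['https://example.com/page1', '/page2']
print(collector.images)  # ['https://example.com/a.png']
```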

1. A simple, unprotected web page.

import requests
def run():
    response = requests.get("https://818ps.com/")
    print(response)
    print(response.text)

if __name__ == "__main__":
    run()

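First, let's get to know status codes:

When a visitor opens a web page, the browser sends a request to the server that hosts the page. Before the browser receives and displays the page, the server returns a response header containing an HTTP status code in answer to the request.

Common HTTP status codes:

200: the request succeeded

301: the resource has moved permanently to another URL

404: the requested page does not exist

500: internal server error

More status codes can be found at the link below: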

https://www.runoob.com/http/http-status-codes.html
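The codes listed here can also be looked up from Python itself. The sketch below uses the standard library's http.HTTPStatus to turn a numeric code into its reason phrase and groups codes by their first digit. The helper name describe_status is made up for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code not known to this Python version
    category = categories.get(code // 100, "unknown")
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
# 200 OK (success)
# 301 Moved Permanently (redirection)
# 404 Not Found (client error)
# 500 Internal Server Error (server error)
```

In a scraper, the same helper could be applied to response.status_code to log a readable message when a request does not come back with 200.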

The returned status code is 200, which means the request succeeded.

2. A protected page.

import requests
def run():
    response = requests.get("https://movie.douban.com/top250")
    print(response)
    print(response.text)

if __name__ == "__main__":
    run()

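Returned status code: 418

According to the status code table, 418 comes from RFC 2324, the Hyper Text Coffee Pot Control Protocol (HTCPCP), a joke RFC published by the IETF on April Fools' Day 1998; real HTTP servers are not expected to implement it. Under HTCPCP, a teapot that receives a BREW or POST request asking it to brew coffee should answer with this error, whose message is "I'm a teapot". Some websites use the code as an Easter egg, and some use it as a warning to scrapers.

So how do we get around this? We have to pretend to be a human, and that is where the UA comes in. What is a UA? The User-Agent (UA) string lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, browser language, and browser plugins, so it can tell whether the user is on a computer or a phone and adapt the page accordingly. In other words, a site reads the UA and adjusts its layout to give the user a better browsing experience. By default, requests announces itself with a User-Agent of the form python-requests/<version>, which makes a script easy to spot; sending a browser's User-Agent instead is the simplest workaround.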
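To see why a plain requests.get call is easy to recognize as a script, it helps to look at the User-Agent header that goes out by default. The sketch below prepares two requests through a Session without sending either of them, so it needs no network access. The Chrome string is only an example of a browser-like value, and the variable names are made up for illustration.

```python
import requests

URL = "https://movie.douban.com/top250"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
)

session = requests.Session()

# Prepare (but do not send) a request with the library's default headers.
default_req = session.prepare_request(requests.Request("GET", URL))

# Prepare the same request with a browser-like User-Agent header.
browser_req = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": BROWSER_UA})
)

print(default_req.headers["User-Agent"])  # python-requests/<installed version>
print(browser_req.headers["User-Agent"])  # the Chrome string above
```

A header passed with the request takes precedence over the session default, which is why supplying headers to requests.get is enough to change what the server sees.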

import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"}
def run():
    response = requests.get("https://movie.douban.com/top250",headers = headers)
    print(response)
    print(response.text)

if __name__ == "__main__":
    run()

Done, boss!
