Summary
Today I'll introduce a crawler built with requests + re (regular expressions).
First, a quick introduction to the requests library. It is a very easy-to-use library for crawling, and paired with BS4 it can handle most scraping work. If you run into JS-rendered pages, though, you'll need another approach; take a look at the Scrapy framework, which is built specifically for crawling. Back to requests: the two methods I use most are get and post. GET asks the specified resource for data, similar to a query; POST submits data to be processed to the specified resource, similar to an update.

Next, request headers. When crawling some sites, we need the crawler to imitate a browser's request, and that is what request headers are for (details can be found on Baidu Baike). For most requests, adding a User-Agent is enough to imitate a browser; if a site requires login, we can add cookies to complete that step.
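As a sketch of the idea above, here is how a User-Agent header and a login cookie can be attached to every request via a Session. The User-Agent string and the cookie name/value below are placeholders for illustration, not taken from any real site or session:

```python
import requests

# Hypothetical example: a browser-like User-Agent plus a login cookie.
# Both values are placeholders, not real credentials.
headers = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"),
}
cookies = {"sessionid": "placeholder-session-id"}

# A Session carries the headers and cookies on every request it makes.
session = requests.Session()
session.headers.update(headers)
session.cookies.update(cookies)

# session.get("https://example.com/") would now send both;
# no request is actually made in this sketch.
print("User-Agent" in session.headers)   # True
print(session.cookies.get("sessionid"))  # placeholder-session-id
```

Passing `headers=headers, cookies=cookies` to a single `requests.get()` call works just as well; the Session form simply saves repeating them on every page of a multi-page crawl.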
Now for the re regular-expression module (see the official documentation first). Regular expressions are a staple of crawling, but I admit I'm a bit dull-witted and often can't work out how to write them; in most cases I just use .*? for non-greedy matching, and for anything special I search online. Racking your brains is rarely as quick and convenient as just looking it up. There are plenty of crawler-oriented regex references online worth consulting.
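To make the .*? point concrete, here is a minimal comparison of greedy .* and non-greedy .*? on a snippet shaped like the Douban markup used later (the titles are placeholders):

```python
import re

# Two title spans in one string, mimicking the page markup.
html = '<span class="title">A</span><span class="title">B</span>'

greedy = re.findall(r'<span class="title">(.*)</span>', html)
lazy = re.findall(r'<span class="title">(.*?)</span>', html)

print(greedy)  # ['A</span><span class="title">B']  - .* grabs as much as possible
print(lazy)    # ['A', 'B']                         - .*? stops at the first </span>
```

This is exactly why the scraping regex below strings together .*? between landmarks: each capture stops at the nearest closing tag instead of swallowing the rest of the page.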
Introduction
The requests and re modules were briefly introduced above; now let's look at some code to see them in action:
```python
import requests
import re


class ReqTest:

    def get_baidu(self):
        """Request Baidu with GET.

        :return: status code and final URL
        """
        req = requests.get('https://www.baidu.com/')
        return req.status_code, req.url

    def post_baidu(self):
        """Request Baidu with POST.

        :return: status code
        """
        data = {
            'name': 'test',
            'tags': '1234'
        }
        req = requests.post('https://www.baidu.com/', data=data)
        return req.status_code

    def get_movie_detail(self):
        """Scrape the Douban Top 250 with requests + re."""
        tmp_list = []
        for start in range(0, 250, 25):  # 10 pages, 25 movies each
            url = 'https://movie.douban.com/top250?start=%s&filter=' % start
            headers = {
                'User-Agent': "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
            }
            res = requests.get(url, headers=headers)
            retag = re.compile(
                r'.*?<em class="">(.*?)</em>'              # rank
                r'.*?<span class="title">(.*?)</span>'     # title
                r'.*?<span class="title"> / (.*?)</span>'  # English title
                r'.*?<span class="other"> / (.*?)</span>'  # other titles
                r'.*?<p class="">.*?: (.*?) .*? (\d+).*?</p>'  # director, year
                r'.*?<span class="rating_num" property="v:average">(.*?)</span>'  # rating
                r'.*?<span class="inq">(.*?)</span>',      # one-line review
                re.S)
            for item in retag.findall(res.text):
                (rank, title, title_en, title_other,
                 author, year, scores, paragraph) = item
                tmp_list.append({
                    'rank': rank,
                    'title': title + '/' + title_en + '/' + title_other,
                    'author': author,
                    'year': year,
                    'score': scores,
                    'paragraph': paragraph
                })
        return tmp_list


a = ReqTest()
get = a.get_baidu()    # 200
post = a.post_baidu()  # 302
movie = a.get_movie_detail()
for i in movie:
    print(i)
```
Run results:
First I send a GET and then a POST request to Baidu; in the third function I use re and requests to scrape the details of Douban's Top 250 movies. The first two calls return status codes, 200 and 302 respectively. 200 means the request succeeded; 302 means the resource was temporarily moved (similar to 301, but since the move is only temporary, the client should keep using the original URI). A full list of HTTP status codes is easy to find online; when crawling, the one you'll see most of the time is 200. You might wonder what happens if you send a request with no network connection. Without further ado, here is the result:
/home/chen/Desktop/chne_lean/venv/bin/python /home/chen/Desktop/Exercise/reqtest.py
Traceback (most recent call last):
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
conn.connect()
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connection.py", line 301, in connect
conn = self._new_conn()
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/chen/Desktop/Exercise/reqtest.py", line 71, in <module>
get = a.get_baidu() # 200
File "/home/chen/Desktop/Exercise/reqtest.py", line 15, in get_baidu
req = requests.get('https://www.baidu.com/')
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Process finished with exit code 1
The error output is long, but we only need to look at the last line: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known',)). Roughly, it means there is no network connection.
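Rather than letting the whole script crash like this, the request can be wrapped in a try/except. Here is a minimal sketch; the `safe_get` helper is my own name, and the `.invalid` hostname is a reserved name that is guaranteed never to resolve, so the except branch fires even with a working network:

```python
import requests


def safe_get(url, timeout=5):
    # Return the status code, or None when the connection fails
    # (no network, DNS failure, connection refused, ...).
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.ConnectionError:
        return None


print(safe_get("https://nonexistent.invalid/"))  # None - the name never resolves
```

In a real crawl you might also want to catch `requests.exceptions.Timeout` separately and retry, but the pattern is the same.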
That wraps up today's crawler introduction. Suggestions for improvement are welcome.