Using the Basic Libraries -- requests

Latest recommended article published on 2021-05-31 23:12:39

小孟Tec

Category: Web Scraping. Tags: requests, web scraping

Article link: https://blog.csdn.net/m0_38024592/article/details/82755618

import requests

r = requests.get('https://www.baidu.com')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

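Here we call get() to do the same thing as urlopen() and obtain a Response object. We then print the type of the Response, the status code, the type of the response body, the body itself, and the cookies.

The response body is actually of type str, but in this case it is special: its content is in JSON format. So if you want to parse the result directly into a dictionary, you can call the json() method.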

import requests

r = requests.get("https://httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))

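As you can see, calling json() converts a response whose body is a JSON-formatted string into a dictionary. If the body is not valid JSON, json() raises an exception instead.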
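As a rough illustration of what json() does, the sketch below parses a hard-coded JSON string with the standard json module. The sample body is hypothetical (loosely shaped like an httpbin-style reply) so that the snippet runs without network access; requests applies comparable parsing to the response body.

```python
import json

# Hypothetical response body, standing in for r.text so no request is needed.
sample_text = '{"args": {}, "url": "https://httpbin.org/get"}'

parsed = json.loads(sample_text)   # str -> dict, roughly what r.json() returns
print(type(sample_text))           # <class 'str'>
print(type(parsed))                # <class 'dict'>
print(parsed["url"])               # https://httpbin.org/get

# Invalid JSON raises an error instead of returning a dict.
try:
    json.loads("<html>not json</html>")
except json.JSONDecodeError as exc:
    print("not valid JSON:", exc.msg)
```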

What is the difference between the .*? and .* regular expressions?

It is the difference between greedy and non-greedy quantifiers.

Consider the input 101000000000100.

Using 1.*1, * is greedy: it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
Using 1.*?1, *? is non-greedy: it will match nothing at first, and then try to match extra characters until it can match 1, eventually matching 101.

In other words, .* in a regular expression is a greedy match, while .*? is a non-greedy match.
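The difference can be checked directly with Python's re module. This is a small sketch using the same input string as the quoted answer; the variable names are only illustrative.

```python
import re

text = "101000000000100"

greedy = re.search(r"1.*1", text).group()   # .* grabs as much as possible, then backtracks
lazy = re.search(r"1.*?1", text).group()    # .*? expands only as far as needed

print(greedy)  # 1010000000001
print(lazy)    # 101
```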

Scraping a Web Page

import requests
import re

# Set the request headers (the User-Agent identifies the client as a browser)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
}
r = requests.get("https://www.zhihu.com/explore", headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)
# print(r.text)

Here we added headers that include the User-Agent field, which is the browser identification string. Without it, Zhihu blocks the request.
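To see how the pattern picks out titles, here is a sketch that runs the same regular expression against a small piece of made-up HTML. The markup is an assumption that only mimics the class names the pattern targets; the live page may look different.

```python
import re

# Made-up markup that mimics the class names targeted by the pattern.
sample_html = """
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/1">First question title</a></h2>
</div>
<div class="explore-feed feed-item">
  <h2><a class="question_link" href="/question/2">
Second question title
</a></h2>
</div>
"""

# re.S lets "." match newlines, so a title may span several lines.
pattern = re.compile(r'explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles = [t.strip() for t in pattern.findall(sample_html)]
print(titles)  # ['First question title', 'Second question title']
```

The non-greedy .*? pieces keep each match inside a single feed item; a greedy .* would run on to the last closing tag on the page.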

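Fetching Binary Data

Images, audio, and video files are all made up of binary data at bottom. It is only because each has a specific storage format and a matching way of decoding it that we get to see all these different kinds of media. So to fetch them, you need to get hold of their binary content.

Take GitHub's site icon (favicon) as an example.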
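The original listing for this step did not survive, so the following is only a minimal sketch of one way to do it. It assumes the icon is served at https://github.com/favicon.ico and uses hypothetical helper names (fetch_bytes, save_bytes); the key point is that response.content returns bytes, which are written with the file opened in binary mode.

```python
import requests


def fetch_bytes(url):
    """Return the raw response body as bytes."""
    response = requests.get(url)
    response.raise_for_status()   # fail loudly on 4xx/5xx responses
    return response.content       # bytes, unlike response.text (str)


def save_bytes(path, data):
    """Write binary data to disk; 'wb' opens the file in binary write mode."""
    with open(path, "wb") as f:
        f.write(data)


# Example (needs network access; the URL and file name are assumptions):
# save_bytes("favicon.ico", fetch_bytes("https://github.com/favicon.ico"))
```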

Audio and video files can be fetched in the same way.

POST Requests

Sending a POST request with requests is just as simple.
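As a sketch of what such a request looks like, the snippet below builds a POST request without sending it, so the form encoding can be inspected offline. The form fields are hypothetical, and https://httpbin.org/post is assumed to be available as an echo endpoint.

```python
import requests

# Hypothetical form fields; https://httpbin.org/post is assumed as an echo endpoint.
data = {"name": "alice", "age": "22"}

# Build the request without sending it, to see what requests would transmit.
prepared = requests.Request("POST", "https://httpbin.org/post", data=data).prepare()
print(prepared.method)                    # POST
print(prepared.headers["Content-Type"])   # application/x-www-form-urlencoded
print(prepared.body)                      # name=alice&age=22

# Sending it for real (needs network access):
# r = requests.post("https://httpbin.org/post", data=data)
# print(r.status_code)
```

Passing a dict as data makes requests form-encode it; passing json=... instead would send a JSON body.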

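So with a Session you can simulate a single browsing session without having to manage cookies yourself. It is typically used to carry on with further steps after a simulated login has succeeded.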
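A minimal sketch of the idea, assuming httpbin.org's cookie endpoints are reachable: the first request makes the server set a cookie, and the second request, sent through the same Session, carries that cookie back. The helper name and URL constants are illustrative only.

```python
import requests

# Assumed test endpoints; any site that sets and reads cookies would do.
SET_URL = "https://httpbin.org/cookies/set/number/123456789"
READ_URL = "https://httpbin.org/cookies"


def cookies_seen_by_server(session):
    """Set a cookie in one request, then read it back in a second one."""
    session.get(SET_URL)               # the server replies with a Set-Cookie header
    response = session.get(READ_URL)   # the same session sends the cookie back
    return response.json()


# Example (needs network access):
# with requests.Session() as s:
#     print(cookies_seen_by_server(s))
```

Two separate requests.get() calls would not share the cookie, because each one starts with an empty cookie jar.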

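Regular Expressions

In match(), the first argument is the regular expression and the second argument is the string to match. -- re.match()

As mentioned earlier, match() starts matching at the beginning of the string; if the beginning does not match, the whole match fails.

That is where another method, search(), comes in: it scans the entire string and returns the first successful match.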
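The contrast between the two methods shows up as soon as the target text does not sit at the start of the string. The sample string and pattern below are made up for illustration.

```python
import re

content = "Extra text Hello 1234567 World"
pattern = r"Hello.*?(\d+)"

matched = re.match(pattern, content)     # anchored at the start of the string
found = re.search(pattern, content)      # scans forward until a match is found

print(matched)          # None, because the string does not start with "Hello"
print(found.group(1))   # 1234567
```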

Hands-on -- Scraping the Maoyan Movies Top 100

import json
import requests
from requests.exceptions import RequestException
import re
import time


def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    # Raw strings keep \d from being treated as a string escape sequence.
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
                         r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {  # yield makes this function a generator that produces one dict per movie
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }


def write_to_file(content):
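    # Earlier variant, kept for reference: overwrite result.txt using json.dump.
    #     with open('F:\\spider\\MaoYanTop100\\result.txt', 'w', encoding='UTF-8') as f:
    #         json.dump(new_dict, f, ensure_ascii=False)  # likewise, this avoids encoding problems
    #         print('Writing done...')

    # 'a' opens the file in append mode; each item is written as one JSON object per line.
    with open('F:\\spider\\MaoYanTop100\\result.csv', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')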


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed or returned a non-200 status
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)
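To check the parsing step without hitting the site, the sketch below runs the same pattern against a made-up snippet. The markup and movie values are assumptions shaped to fit the pattern and are not real site data. It also shows why the slices [3:] and [5:] are used: they strip the label prefixes "主演：" (starring) and "上映时间：" (release date) from the captured text.

```python
import re

# Made-up markup shaped like the structure the pattern expects; not real site data.
SAMPLE_HTML = """
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：2000-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
"""

PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)


def parse_one_page(html):
    """Yield one dict per <dd> entry found in the page source."""
    for item in PATTERN.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],   # drop the 3-character prefix "主演："
            'time': item[4].strip()[5:],    # drop the 5-character prefix "上映时间："
            'score': item[5] + item[6],     # "9." + "5" -> "9.5"
        }


for movie in parse_one_page(SAMPLE_HTML):
    print(movie)
```

Because the slices depend on fixed prefix lengths, they would need adjusting if the page changed its labels.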