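Scraping Toutiao avatar and street-snap images via Ajax: fixes for the problems I hit with Cui Qingcai's Python crawler example for Toutiao street-snap content.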

Article link: https://blog.csdn.net/m0_61791601/article/details/123126689

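I am new to web scraping too. While working through Cui Qingcai's hands-on example "Scraping Toutiao street-snap photos", I found that parts of it are out of date: the program now raises errors, and the page source has changed long since. Below are the problems I ran into.

1. When filtering the Ajax requests in the developer tools, the preview shows content that differs from the book and is mostly useless information

https://www.toutiao.com/ Open the Toutiao site here, then switch to the Images tab.

Right-click and open the developer tools (Inspect), then choose Network -> Fetch/XHR -> Preview.

In the preview, expand rawData -> data to see the information about the images.

Expand one of the entries to see its basic fields.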
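To make the rawData -> data path concrete, here is a minimal sketch that walks a hand-written payload. The `sample` dictionary and the `list_entries` helper are hypothetical and exist only for illustration; the field names `img_url` and `text` are the ones read by the scraping code later in this post, and a real response carries many more fields.

```python
# Hypothetical, trimmed-down payload; real responses contain many more fields.
sample = {
    "rawData": {
        "data": [
            {"img_url": "https://example.com/a.jpg", "text": "avatar one"},
            {"img_url": "https://example.com/b.jpg", "text": "avatar two"},
        ]
    }
}


def list_entries(payload):
    """Return (img_url, text) pairs found under rawData -> data."""
    raw = payload.get("rawData") or {}      # tolerate a missing rawData key
    entries = raw.get("data") or []         # tolerate a missing data list
    return [(entry.get("img_url"), entry.get("text")) for entry in entries]


pairs = list_entries(sample)
for url, text in pairs:
    print(text, url)
```

Falling back to an empty dict and an empty list means a payload without image data produces no entries instead of raising an AttributeError.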

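The book already covers the detailed analysis, so I will not repeat it here. Below I use scraping avatar images as the demonstration; scraping street-snap images works the same way.

To start scraping, first import the modules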

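import requests, os                     # requests sends the GET requests; os creates the directories
from hashlib import md5               # name each file by its content hash to avoid duplicates
from urllib.parse import urlencode   # percent-encodes the query parameters
import json                      # JSON support (response.json() does the actual parsing below)
import urllib.parse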

Set the request headers

headers = {
    'Host': 'so.toutiao.com',

    'Referer': 'https://so.toutiao.com/search?keyword=%E5%A4%B4%E5%83%8F&pd=atlas'
               '&source=search_subtab_switch&dvpf=pc&aid=4916&page_num=0',

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.462.71 Safari/537.36',

    'X-Requested-With': 'XMLHttpRequest',       # must be set by hand when scraping via Ajax

    'Cookie': 'msToken=IhrwBM9Ey700XurL2Gi96VxhmwEaowsTNjECUtAshKK8f6aHwgYH2jkkNmb_aX'
              'kfDLuR3scbdjyqfM0JhRbMOC9bbqhaM1UI8ByJBDI5JfYeRcj82Q==; tt_webid=7067817804582716958;'
              ' _S_DPR=1.25; _S_IPAD=0; MONITOR_WEB_ID=7067817804582716958; ttwid=1%7CQUN-NY_ss4kS-ZH1oO'
              'xAEDOFDnq0b5qd8rmwMF0omt4%7C1645666798%7Ce2d6cc391c7c6063ff0dc2ecd1ec2e3751934f95fa95ffb9a'
              'e704d482eaab47e; _S_WIN_WH=1536_228'

}

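All of the values above can be found in the developer tools panel; just copy and paste them.

Be sure to include the Cookie header, though. Without it, even when response.status_code is 200, the response does not contain the content you want.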
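Since a 200 status alone does not prove the request worked, it can help to check the body itself. The following is a minimal sketch, assuming a successful response is JSON with a non-empty list under rawData -> data (the path seen in the preview); `has_image_data` is a hypothetical helper name, not part of requests or of the book's code.

```python
import json


def has_image_data(body_text):
    """Return True only if the body is JSON with a non-empty rawData -> data list."""
    try:
        payload = json.loads(body_text)
    except ValueError:                      # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(payload, dict):
        return False
    raw = payload.get("rawData")
    if not isinstance(raw, dict):
        return False
    data = raw.get("data")
    return isinstance(data, list) and len(data) > 0
```

With a real response this would be called as `has_image_data(response.text)`; a False result while the status is 200 suggests that the Cookie or another header is missing or stale.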

Scrolling down the page shows that the page_num parameter changes, so page_num is passed to get_page

def get_page(page_num):             # the URL is long, so build it from a params dict
    params = {
        'keyword': urllib.parse.unquote('%E5%A4%B4%E5%83%8F'),   # decodes to '头像' (avatar)
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': 4916,
        'page_num': page_num,
        'search_id': '2022022418081401021218716924C4EAFC',
        'rawJSON': 1
    }

    url = 'https://so.toutiao.com/search?' + urlencode(params)       # assemble the full URL
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:      # on success, return the parsed JSON body
            return response.json()
    except requests.ConnectionError as e:     # on a connection error, print its details
        print('Error', e.args)
    return None                               # any other outcome yields None
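To see what urlencode does with these parameters, here is a small sketch that builds the URLs for the first three pages without sending any request. `build_url` is a hypothetical helper; search_id is left out for brevity, and whether the endpoint accepts requests without it has not been verified.

```python
from urllib.parse import urlencode, unquote

BASE = 'https://so.toutiao.com/search?'


def build_url(page_num, keyword='头像'):
    """Assemble the search URL for one page; only page_num changes between pages."""
    params = {
        'keyword': keyword,       # urlencode percent-encodes the UTF-8 bytes
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': 4916,
        'page_num': page_num,
        'rawJSON': 1,
    }
    return BASE + urlencode(params)


urls = [build_url(n) for n in range(3)]
for url in urls:
    print(url)
```

requests can also do the encoding itself if the dictionary is passed as `requests.get(base_url, params=params, headers=headers)`, which avoids building the string by hand.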

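The params above are actually part of the URL; they can also be found in the developer tools (Inspect) panel.

Note that readers need to set these parameters themselves; the values shown here are for demonstration only.

def removepunctuation(image):    # same logic as remove_fuhao() defined further down
    text = image.get('title')    # turn the title into a string that is valid as a directory name
    ls = []
    for item in text:
        if item.isdigit() or item.isalpha():
            ls.append(item)
    em = "".join(ls)
    return em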

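def get_images(page_json):
    if not page_json:                    # get_page() returned nothing for this page
        return
    images = (page_json.get('rawData') or {}).get('data') or []    # the analysis shows the entries live here
    for image in images:
        link = image.get('img_url')    # image link
        title = image.get('text')      # title
        yield {                        # generator: yields one dict per image
            'link': link,
            'title': title
        }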

Running the code from the book raised NotADirectoryError: [WinError 267] The directory name is invalid, or OSError: [Errno 22] Invalid argument (details: https://blog.csdn.net/m0_61791601/article/details/123122269?utm_source=app&app_version=4.15.0&code=app_1562916241&uLinkId=usr1mkqgl919blen). The function below solves this problem.

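def remove_fuhao(image):                 # turn the title into a string that is valid as a directory name
    text = image.get('title') or ''      # tolerate a missing title
    ls = []
    for item in text:
        if item.isdigit() or item.isalpha():     # keep only letters and digits
            ls.append(item)
    em = "".join(ls)
    return em or 'untitled'              # never return an empty directory name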
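The effect of this filter is easiest to see on a few sample titles. The sketch below applies the same letters-and-digits rule to a plain string; `safe_dir_name` and its 'untitled' fallback are hypothetical names chosen for this illustration. Note that `str.isalpha()` is also true for Chinese characters, so they survive the filter.

```python
def safe_dir_name(title, fallback='untitled'):
    """Keep only letters and digits so the result is usable as a directory name."""
    cleaned = ''.join(ch for ch in (title or '') if ch.isdigit() or ch.isalpha())
    return cleaned or fallback      # an empty name would make os.mkdir fail


for raw_title in ['街拍|美图: 2022?', 'a/b\\c*d', '???', None]:
    print(repr(raw_title), '->', safe_dir_name(raw_title))
```

Characters such as `|`, `:`, `?`, `*`, `/` and `\` are not allowed in Windows file names, which is why titles containing them trigger the errors quoted above.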

Create the directory and save the image

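def saving_img(em, image):
    if not os.path.exists(em):
        os.mkdir(em)                    # create the directory on first use
    try:                                # save every image, not only the first one per directory
        data = requests.get(image.get('link')).content
        # name the file by the MD5 of its content so the same image is not stored twice
        file_path = '{0}/{1}.{2}'.format(em, md5(data).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(data)

    except (FileNotFoundError, NotADirectoryError):     # fail silently and skip the image
        pass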
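One Python detail matters when catching more than one exception type: `except A or B:` does not catch both, because `A or B` is an ordinary expression that evaluates to `A`. A tuple is needed instead. The short sketch below demonstrates the difference; `caught_by` is a hypothetical helper written only for this demonstration.

```python
# `A or B` between two classes evaluates to the first one, because classes are truthy.
only_first = FileNotFoundError or NotADirectoryError


def caught_by(exc_types, exc):
    """Return True if raising `exc` is caught by an `except exc_types` clause."""
    try:
        raise exc
    except exc_types:
        return True
    except Exception:
        return False


print(only_first is FileNotFoundError)                                              # True
print(caught_by(only_first, NotADirectoryError()))                                  # False
print(caught_by((FileNotFoundError, NotADirectoryError), NotADirectoryError()))     # True
```

Both exception classes derive from OSError, so `except OSError:` would be another way to cover them, at the cost of also swallowing other OS-level errors.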

This os.mkdir() call is where my program used to fail, which is why I added the remove_fuhao() function.

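def main(page_num):
    page_json = get_page(page_num)
    if not page_json:                   # nothing came back for this page
        return
    for image in get_images(page_json):
        em = remove_fuhao(image)        # directory name derived from the title
        saving_img(em, image)


if __name__ == '__main__':
    for i in range(0, 3):    # only a few pages are scraped here; widen the range() to fetch more
        main(i)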
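When more pages are scraped, it may be worth skipping pages that return nothing and pausing between requests. The following is a rough sketch of such a loop, not the book's approach; `crawl_pages`, `fetch_page`, `handle_page` and `delay` are hypothetical names, and the fetch function is passed in so the loop can be tried without any network access.

```python
import time


def crawl_pages(page_numbers, fetch_page, handle_page, delay=0.0):
    """Fetch each page, skip pages that came back as None, and return how many were handled."""
    handled = 0
    for number in page_numbers:
        page = fetch_page(number)
        if page is None:            # failed request: move on to the next page
            continue
        handle_page(page)
        handled += 1
        if delay:
            time.sleep(delay)       # pause between successful requests
    return handled


# Offline demonstration with a fake fetcher: page 1 "fails" and is skipped.
seen = []
count = crawl_pages(range(3), lambda n: None if n == 1 else {'page': n}, seen.append)
print(count, seen)
```

In the real script, get_page would be passed as `fetch_page`, and `handle_page` would be a function that loops over get_images(page) and saves each image.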

Scraping results