Scraping Toutiao Street-Snap Photos by Analyzing Ajax Requests

Latest recommended article published on 2019-07-30 10:51:59

binarywz

Categories: Python, Web Scraping

Article link: https://blog.csdn.net/Qaz_wz/article/details/70267212

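For some web pages, the HTML returned by a direct request does not contain the content you see in the browser, because that content is loaded via Ajax and rendered by JavaScript. In such cases you need to analyze the page's network requests to get the data you want to scrape. This post walks through the steps, using the street-snap (街拍) photo galleries on Toutiao (今日头条) as the example.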
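
As a minimal illustration of the problem, the sketch below compares a static HTML skeleton with the JSON payload that a page's JavaScript would fetch separately. Both strings, and the helper `titles_from_ajax`, are made up for illustration and are not real Toutiao responses.

```python
import json

# Hypothetical HTML returned by a plain request: the result container is empty
RAW_HTML = ('<html><body><div id="results"></div>'
            '<script src="app.js"></script></body></html>')

# Hypothetical JSON body that the page's JavaScript would load via Ajax
AJAX_JSON = ('{"data": [{"title": "Street snap gallery", '
             '"article_url": "http://example.com/a/1"}]}')

def titles_from_ajax(payload):
    """Return the titles found in the 'data' list of an Ajax JSON payload."""
    data = json.loads(payload)
    return [item.get('title') for item in data.get('data', [])]

# The title is missing from the static HTML but present in the Ajax payload
print('Street snap gallery' in RAW_HTML)  # False
print(titles_from_ajax(AJAX_JSON))        # ['Street snap gallery']
```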

First, open the Toutiao website and search for 街拍 (street snaps).

Select the 图集 (Galleries) tab, since the goal is to scrape photo sets.

Open the browser's developer tools and inspect the page's HTML source.

The content we want is not in the HTML. Next, look at the Ajax requests (circled in red in the original screenshot).

Check the URL and method of the Ajax request. It is a GET request, so it can be reproduced with requests.get from the requests library.

Check the query parameters of the Ajax request.
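
To make the role of the parameters concrete, here is a small sketch that turns them into a request URL with `urlencode` and then decodes the URL again. The endpoint and parameter names are the ones used in this article's code. The helper `build_index_url` is a hypothetical name, and the endpoint may no longer behave this way on the live site.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the article's code; it may have changed since then
BASE_URL = 'http://www.toutiao.com/search_content/?'

def build_index_url(offset, keyword, count=20):
    """Hypothetical helper: build the Ajax search URL for one page of results."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'cur_tab': 3,
    }
    # urlencode percent-encodes non-ASCII values such as the Chinese keyword
    return BASE_URL + urlencode(params)

url = build_index_url(20, '街拍')
print(url)

# Decoding the query string recovers the original parameters
query = parse_qs(urlsplit(url).query)
print(query['offset'], query['keyword'])
```

Only `offset` and `keyword` vary between calls here; the remaining values are copied as observed in the request.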

这里写图片描述

Inspect the content of a detail page.
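
The parsing code in this article looks for a `var gallery = ...;` assignment in the detail page and reads the image list from its `sub_images` field. The sketch below shows that extraction on a made-up HTML snippet. As an alternative to the article's non-greedy regex, it uses `json.JSONDecoder().raw_decode`, which parses one JSON value starting at a given position and so is not confused by a semicolon inside a string. The sample HTML, the `caption` field, and the helper name `extract_gallery` are assumptions for illustration.

```python
import json
import re

# Made-up detail page; the real page layout may differ
SAMPLE_HTML = '''
<html><head><title>Sample gallery</title></head><body>
<script>
var gallery = {"sub_images": [{"url": "http://example.com/1.jpg"}, {"url": "http://example.com/2.jpg"}], "caption": "a; b"};
var other = 1;
</script>
</body></html>
'''

def extract_gallery(html):
    """Return the object assigned to 'var gallery', or None if it is absent."""
    match = re.search(r'var gallery\s*=\s*', html)
    if not match:
        return None
    try:
        # raw_decode parses a single JSON value starting at the given index
        data, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return data

gallery = extract_gallery(SAMPLE_HTML)
images = [item.get('url') for item in gallery.get('sub_images', [])]
print(images)
```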

Code

import requests
from urllib.parse import urlencode
from requests.exceptions import RequestException
import json
from bs4 import BeautifulSoup
import re

# Fetch the JSON text returned by the Ajax search request
def get_page_index(offset,keyword):
    # Query parameters of the Ajax request
    query_parameters = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 3
    }
    # URL of the Ajax request
    url = 'http://www.toutiao.com/search_content/?' + urlencode(query_parameters)
    try:
        # The request must be made inside the try block,
        # otherwise connection errors are never caught
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None

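# Extract the gallery URLs from the index response
def parse_page_index(parse_index_html):
    # get_page_index returns None on failure, which json.loads cannot parse
    if not parse_index_html:
        return
    data = json.loads(parse_index_html)
    if data and 'data' in data.keys():
        for item in data.get('data') or []:
            article_url = item.get('article_url')
            if article_url:
                # yield turns this function into a generator
                yield article_url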

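# Fetch the HTML of a detail page
def get_page_detail(get_detail_url):
    try:
        # The request must be made inside the try block,
        # otherwise connection errors are never caught
        response = requests.get(get_detail_url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the detail page',get_detail_url)
        return None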

# Parse a detail page
def parse_page_detail(html,url):
    soup = BeautifulSoup(html,'lxml')
    title = soup.select('title')[0].get_text()
    print(title)
    # The image list is embedded in the page as a JavaScript variable: var gallery = {...};
    images_pattern = re.compile('var gallery = (.*?);',re.S)
    result = re.search(images_pattern,html)
    if result:
        data = json.loads(result.group(1))
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            return {
                'title':title,
                'url':url,
                'images':images
            }
    # Reached when the page contains no gallery data
    return None

def main():
    html = get_page_index(0,'街拍')
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            result = parse_page_detail(html,url)
            print(result)

if __name__ == '__main__':
    main()
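
The `main` function above only fetches the first page of results (offset 0). A possible extension is to walk several pages by increasing `offset`. The sketch below assumes that the offset advances in steps equal to `count` (20), which the article does not confirm. `iter_article_urls` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `get_page_index` so that the example runs without network access.

```python
import json

def iter_article_urls(fetch_index, keyword, pages=3, page_size=20):
    """Yield article URLs from several result pages.

    fetch_index is any callable (offset, keyword) -> JSON text or None,
    for example the article's get_page_index.
    """
    for page in range(pages):
        # Assumption: each page starts page_size items after the previous one
        text = fetch_index(page * page_size, keyword)
        if not text:
            continue  # skip pages that failed to load
        for item in json.loads(text).get('data') or []:
            url = item.get('article_url')
            if url:
                yield url

def fake_fetch(offset, keyword):
    """Stand-in for get_page_index that returns one fake result per page."""
    item = {'article_url': 'http://example.com/%s/%d' % (keyword, offset)}
    return json.dumps({'data': [item]})

print(list(iter_article_urls(fake_fetch, 'jiepai', pages=2)))
# ['http://example.com/jiepai/0', 'http://example.com/jiepai/20']
```

With the real functions, the call would be `iter_article_urls(get_page_index, '街拍')`, and each yielded URL can be passed to `get_page_detail` as in `main`.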