Fetching Ajax Data (Which Is Harder: Posing a Problem or Solving It?)

猪猪_女孩

Original article: https://blog.csdn.net/weixin_50590724/article/details/109961477

I. Crawlers and Anti-Crawlers

…Which is harder: posing a problem, or solving it?

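Crawler: spider
Anti-crawler: anti-spider
Anti-anti-crawler: anti-anti-spider
Crawlers and anti-crawler measures are locked in a continuing contest.
In that contest, the crawler wins in the end.
In other words, as long as a human can access a page normally, a crawler with the same resources can obtain the same data.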

II. Fetching Ajax Data

1. What is Ajax?

Ajax stands for Asynchronous JavaScript and XML.
With Ajax, a page can exchange data with the web server (send and receive) without making a separate full-page request.

2. How Ajax works (diagram)

[Figure: diagram of the Ajax workflow]

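3. Steps for analyzing Ajax requests

Analyze the requests
The goal is to find out which of the page's requests are Ajax requests. Open the browser's developer tools and select the XHR filter; the requests listed there are Ajax requests.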

For example: Baidu Translate. [Figure: screenshot]

Analyze the responses

The goal is to determine which of the Ajax requests found above is the one that returns the page's data.

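Parse the response content

Ajax responses are usually JSON, although some endpoints return an HTML fragment instead (the annotation request in the first example is parsed as HTML).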
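
To make the parsing step concrete, here is a minimal sketch of decoding a JSON body. The payload and its field names (`total`, `items`, `title`, `score`) are hypothetical and only stand in for whatever a real endpoint returns; with `requests`, `response.json()` performs the same decoding as `json.loads(response.text)`.

```python
import json

# Hypothetical response body; a real endpoint defines its own fields.
sample_body = (
    '{"total": 2, "items": ['
    '{"title": "A", "score": "9.1"}, '
    '{"title": "B", "score": "8.7"}]}'
)


def parse_items(body):
    """Decode a JSON response body and return (title, score) pairs."""
    payload = json.loads(body)
    return [(item["title"], float(item["score"])) for item in payload["items"]]


for title, score in parse_items(sample_body):
    print(title, score)
```

Once the body is decoded, the data is ordinary Python dictionaries and lists, so no XPath or regular expressions are needed.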

III. Examples

1. Scraping classical poems (gushiwen)


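# Scraping classical poems
# URL: https://so.gushiwen.cn/shiwen/
# Requirements:
# 1. Get every category
# 2. For every poem, get the title, content, annotations, and translation

# Approach:
# 1. Get the URL of every category, request it, and enter each category
# 2. Get the link of each poem, request it, and enter the poem's detail page
# 3. Extract the data we need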

import re

import requests
from lxml import etree

# Request headers
headers = {
    'User-Agent': '<your own browser User-Agent string>',
}

base_url = 'https://so.gushiwen.cn/shiwen/'

# Pattern for extracting a poem id from the poem's URL
gs_id_pattern = re.compile(r'_(.*?)\.')
# Pattern for extracting the ajax id from a fanyiShow(...) href
ajax_id_pattern = re.compile(r'javascript:fanyiShow\(\d+,\'(.*?)\'\)')
# URL template for the Ajax request
ajax_url = 'https://so.gushiwen.cn/nocdn/ajaxfanyi.aspx?id={}'

response = requests.get(url=base_url,headers=headers)

# Parse the response text into an HTML element tree
html = etree.HTML(response.text)
# Get the category links
type_list = html.xpath('//*[@id="html"]/body/div[2]/div[2]/div[2]/div[2]/a')
# Loop over the category links
for type_link in type_list:
    # Category name
    type_name = type_link.xpath('./text()')[0]
    # Category URL
    type_url = type_link.xpath('./@href')[0]
    # The URL is relative, so join it with the site root
    type_url_full = 'https://so.gushiwen.cn' + type_url
    # Request the category page
    type_response = requests.get(url=type_url_full,headers=headers)
    # Parse the response text into an HTML element tree
    type_html = etree.HTML(type_response.text)
    # Get the poem links
    gushi_list = type_html.xpath('//*[@id="html"]/body/div[2]/div[1]/div[2]/div[1]/span')
    # Loop over the poem links
    for gushi in gushi_list:
        # Get the link
        gushi_url = gushi.xpath('./a/@href')[0]
        # Some links are absolute and some are relative,
        # so check before joining
        if not gushi_url.startswith('http'):
            # Build the absolute URL
            gushi_url = 'https://so.gushiwen.cn' + gushi_url
        # Request the poem's detail page
        gushi_response = requests.get(url=gushi_url,headers=headers)
        # Parse the response text into an HTML element tree
        gushi_html = etree.HTML(gushi_response.text)
        # Extract the data
        # Poem title
        gushi_name = gushi_html.xpath('//h1/text()')[0]
        # Author and dynasty
        author = gushi_html.xpath('//div[@id="sonsyuanwen"]/div[1]/p/a[2]/text()')[0]
        chaodai = gushi_html.xpath('//div[@id="sonsyuanwen"]/div[1]/p/a[1]/text()')[0]

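        # Poem content
        # content = gushi_html.xpath('//div[@id="contson45c396367f59"]/text()')
        # Problem:
        # looking up by this id finds only one poem; every other page gives an empty list
        # Cause: a problem with the XPath
        # Analysis:
        # 行宫: //div[@id="contson45c396367f59"]/text()
        # 登鹳雀楼: //div[@id="contsonc90ff9ea5a71"]/text()
        # Comparing the two, only the "contson" prefix is shared; the rest of the id differs
        # So we need each poem's own id
        # The id appears in the poem's URL, so extract it from there
        # Get the poem id
        gs_id = gs_id_pattern.findall(gushi_url)[0]
        content_list = gushi_html.xpath('//div[@id="contson{}"]//text()'.format(gs_id))
        # Another problem: this path retrieves most of the content,
        # but a small part is still missed, so the path needs further checking
        content = ''.join(content_list)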

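        # Annotations and translation
        # Some poems show truncated annotations; the full text only appears after
        # clicking "展开阅读全文" (expand to read the full text)
        # So check whether the annotations are complete (i.e. whether that link exists):
        # if the link exists, request ajax_url;
        # otherwise, read the data straight from the page

        # Get the href of the link in the last div of the annotation block
        zhushi_hrefs = gushi_html.xpath('//div[@class="left"]/div[3]'
                                        '/div[@class="contyishang"]/div[last()]/a/@href')
        # Guard against pages without this link, where [0] would raise IndexError
        zhushi_href = zhushi_hrefs[0] if zhushi_hrefs else ''
        # The href is either PlayFanyi or fanyiShow;
        # fanyiShow means the expand link is present,
        # so check for fanyiShow
        if 'fanyiShow' in zhushi_href:
            # Request ajax_url
            # 蒹葭: https://so.gushiwen.cn/nocdn/ajaxfanyi.aspx?id=700F301DB4FA7B10
            # 登鹳雀楼: https://so.gushiwen.cn/nocdn/ajaxfanyi.aspx?id=75B9117033181D93
            # The ids differ, so we need to find the id for each poem
            # The id is the second argument of fanyiShow in the a tag's href
            # Get the ajax_id
            ajax_id = ajax_id_pattern.findall(zhushi_href)[0]
            # Send the request and receive the response
            ajax_response = requests.get(url=ajax_url.format(ajax_id),headers=headers)
            # Parse the response text into an HTML element tree
            ajax_html = etree.HTML(ajax_response.text)
            # Get the translation and annotations
            explain_zhushi_list = ajax_html.xpath('//div[@class="contyishang"]/p//text()')
            explain_zhushi = '\n'.join(explain_zhushi_list)
        else:
            # Extract from the page itself
            explain_zhushi_list = gushi_html.xpath('//div[@class="left"]/div[3]/div[@class="contyishang"]/p/text()')
            explain_zhushi = '\n'.join(explain_zhushi_list)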
        # Save the data
        with open('古诗合集.txt','a',encoding='utf-8') as fp:
            fp.write(type_name+'\n'+gushi_name+'\n'+author+chaodai+'\n'+content+'\n'+explain_zhushi+'\n')
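
The two regular expressions in the script can be checked offline before any request is sent. The sketch below reuses the patterns from the script; the sample URL shape, the first argument of `fanyiShow`, and the helper `extract_first` are assumptions made for illustration. Unlike `findall(...)[0]`, the helper returns `None` when nothing matches instead of raising `IndexError`.

```python
import re

# Same patterns as in the script above.
gs_id_pattern = re.compile(r'_(.*?)\.')
ajax_id_pattern = re.compile(r'javascript:fanyiShow\(\d+,\'(.*?)\'\)')


def extract_first(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Sample inputs: the URL shape and the numeric first argument are assumed.
poem_url = 'https://so.gushiwen.cn/shiwenv_45c396367f59.aspx'
expand_href = "javascript:fanyiShow(123,'700F301DB4FA7B10')"
play_href = 'javascript:PlayFanyi(123)'

print(extract_first(gs_id_pattern, poem_url))
print(extract_first(ajax_id_pattern, expand_href))
print(extract_first(ajax_id_pattern, play_href))
```

Testing the patterns against a few saved strings like this makes it easier to notice when the site changes its markup.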

2. Scraping Douban movie data


# Requirements:
# Get the information for every movie in every category (rank, title, actors, score, etc.)

# Approach:
# 1. Get the URL of each category ---> extract the category's type id
# 2. Build the URL that returns the total number of movies ---> request it to get the count for the category
# 3. Build the URL that returns the movie information ---> request it to get the data

# Analysis:
# ajax_url for the movie information
# Page 1: https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=0&limit=20
# Page 2: https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=20&limit=20
# Page 3: https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=40&limit=20
# Page 4: https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=60&limit=20
# Pattern: start increases by 20 on each page

# Film noir: https://movie.douban.com/j/chart/top_list?type=31&interval_id=100%3A90&action=&start=0&limit=20

# URL for the total number of movies:
# https://movie.douban.com/j/chart/top_list_count?type=24&interval_id=100%3A90
# Total number of pages: math.ceil(total/20)


import math
import re

import requests
from lxml import etree

# Request helper
def get_requests(url):
    response = requests.get(url=url,headers=headers)
    # flag == 0: the response is an HTML page; otherwise it is JSON
    if flag == 0:
        return etree.HTML(response.text)
    return response.json()

# Get the category URLs
def get_type(url):
    # Send the request and receive the response
    type_html = get_requests(url)
    # get_requests already returns the parsed HTML element tree
    # type_html = etree.HTML(response.text)
    # Extract the category links
    type_list = type_html.xpath('//div[@class="types"]/span/a')
    for type_link in type_list:
        # Category URL
        type_href = type_link.xpath('./@href')[0]
        # Category name
        type_name = type_link.xpath('./text()')[0]
        # Category id, taken from the type= query parameter
        type_id = type_id_pattern.findall(type_href)[0]
        get_total(type_id)
        # print(type_name,type_href,type_id)

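# Get the total number of movies in a category
def get_total(type_id):
    global flag
    # Every request from here on returns JSON; the only HTML request
    # (the category page) has already been made by get_type
    flag = 1
    # Request
    # response = requests.get(url=total_url.format(type_id),headers=headers)
    response = get_requests(total_url.format(type_id))
    # Read the total
    total = response['total']
    get_movie(type_id,total)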

# Get the movie information
def get_movie(type_id,total):
    for i in range(0,math.ceil(total/20)):
        # Send the request
        data_list = get_requests(movie_url.format(type_id,i*20))
        # data_list = response.json()
        for data in data_list:
            dic = {}
            # Rank
            rank = data['rank']
            # Movie title
            movie = data['title']
            # Lead actors
            actor = ';'.join(data['actors'])
            # Release date
            release_time = data['release_date']
            # Score
            score = data['score']

            dic['排名'] = rank
            dic['电影名'] = movie
            dic['主演'] = actor
            dic['上映时间'] = release_time
            dic['评分'] = score
            # Save the data
            write_to_txt(dic)
            print(rank,movie,actor,release_time,score)

# Save one record
def write_to_txt(dic):
    # Append the record to the output file
    with open('豆瓣电影.txt','a',encoding='utf-8') as fp:
        fp.write(str(dic)+'\n')

if __name__ == '__main__':
    # Global flag: 0 means parse responses as HTML, 1 means parse them as JSON
    flag = 0
    # Base URL
    base_url = 'https://movie.douban.com/chart'
    # Request headers
    headers = {
        'User-Agent': '<your own browser User-Agent string>',
    }
    # URL template for the movie data
    movie_url = 'https://movie.douban.com/j/chart/top_list?type={}&interval_id=100%3A90&action=&start={}&limit=20'
    # URL template for the total number of movies
    total_url = 'https://movie.douban.com/j/chart/top_list_count?type={}&interval_id=100%3A90'
    # Regular expression for extracting type_id
    type_id_pattern = re.compile(r'type=(.*?)&')
    get_type(base_url)
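
The paging arithmetic can be checked without sending any request. The sketch below is a minimal illustration; the helper `page_starts` and the sample totals are made up for the example, while the URL template is the one used in the script. A page count written as `total // 20 + 1` asks for one extra, empty page whenever `total` is an exact multiple of 20, which is why the script's loop uses `math.ceil(total/20)`.

```python
import math

PAGE_SIZE = 20  # matches limit=20 in the URL template

movie_url = ('https://movie.douban.com/j/chart/top_list'
             '?type={}&interval_id=100%3A90&action=&start={}&limit=20')


def page_starts(total, page_size=PAGE_SIZE):
    """Return the start offset of every page needed to cover `total` items."""
    return [page * page_size for page in range(math.ceil(total / page_size))]


# Hypothetical totals, chosen to show the boundary cases.
for total in (0, 40, 45):
    print(total, page_starts(total))

# Build the page URLs for a category with 45 movies (type id 24 as in the notes above).
urls = [movie_url.format(24, start) for start in page_starts(45)]
print(urls[-1])
```

Keeping the offset calculation in its own function makes the boundary cases (no movies, an exact multiple of the page size) easy to verify separately from the network code.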