Notes on a simple crawler for [https://www.acg81.cn/]

Article link: https://blog.csdn.net/qq_43492356/article/details/129428021

Language: Python
IDE: PyCharm
Target site: https://www.acg81.cn/
Goal: collect every site listed on this navigation site

Analysis

Inspect the page structure and use XPath to pin down where the fields we want to collect live.
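As a rough illustration of that step, here is a minimal sketch that pulls the card fields out of a snippet with XPath. The class names (`card-item`, `card-title`, `card-desc`) and the `data-src` attribute come from the full code at the end of this post; the sample markup, the `/sites/1.html` path, and the `extract_cards` helper are made up for the example. The standard library's `xml.etree.ElementTree` stands in for scrapy's `Selector` so the snippet runs without extra packages. It only supports a subset of XPath and needs well-formed markup, so it is not a drop-in replacement for real pages.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; only the class/attribute names mirror the real page.
SAMPLE_HTML = """
<div>
  <a class="card-item" href="/sites/1.html">
    <div data-src="//dingmancn.com/logo.png"></div>
    <span class="card-title"> Example Site </span>
    <span class="card-desc"> A placeholder description </span>
  </a>
</div>
"""


def extract_cards(markup):
    """Collect href, image, title and description from every card node."""
    root = ET.fromstring(markup.strip())
    cards = []
    for node in root.findall(".//a[@class='card-item']"):
        image = node.find("./div[1]")
        title = node.find(".//span[@class='card-title']")
        desc = node.find(".//span[@class='card-desc']")
        cards.append({
            "href": node.get("href"),
            "image": image.get("data-src") if image is not None else None,
            "title": (title.text or "").strip() if title is not None else "",
            "describe": (desc.text or "").strip() if desc is not None else "",
        })
    return cards


for card in extract_cards(SAMPLE_HTML):
    print(card)
```

Note that the card only carries a link to a detail page, not the site's own address.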

Note that each site's actual URL is not available directly on this page, so we extract the other fields first.

Follow the link into the detail page and inspect its structure.

There is a problem here, though: the XPath //*[@id="gourl"]/@href fails to capture the URL, presumably because the attribute is filled in by JS.

A quick search shows that the link is set by inline JS in one of the page's own script tags, so a simple regex is enough to extract it.
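A minimal sketch of that regex step, assuming the inline script sets the link with an `attr("href", ...)` call, which is the shape the pattern in the full code expects. The sample script text, the `/go/` path, and the `extract_jump_url` helper are hypothetical and only serve the illustration.

```python
import re

# Hypothetical inline script; only the attr("href","...") shape is taken
# from the regex used in the full code.
SAMPLE_SCRIPT = '''
$(function () {
    $("#gourl").attr("href","//www.acg81.cn/go/?url=aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8=");
});
'''

HREF_PATTERN = re.compile(r'attr\("href","(.*?)"\);')


def extract_jump_url(script_text):
    """Return the jump link found in the script text, or None if absent."""
    match = HREF_PATTERN.search(script_text)
    if match is None:
        return None
    href = match.group(1)
    # Protocol-relative links need a scheme before they can be requested.
    return "https:" + href if href.startswith("//") else href


print(extract_jump_url(SAMPLE_SCRIPT))
```

Returning None instead of indexing `[0]` blindly makes it easier to skip a detail page whose script does not match.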

OK, this still isn't the real URL, so we follow it one more level.

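This page redirects automatically after 3 seconds, so we freeze it with the pause trick (debugger). With the page stopped, the real URL we are after is visible.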

I soon found that this one is filled in by JS as well.

Good, now for the key part. Reading through this code shows that it breaks down into three main parts, as follows.

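"""
The detailed flow of the code is:
1. Call get_query_string
    - Take the substr  #url=aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8=
    - Match the second half  #aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8=
    - Decode that string with unescape
2. If the result is not empty, call b64_decode
    - Decode it with base64
    - Decode the resulting url with decodeURIComponent
"""

# At this point it turns out to be just two steps
# 1. Get the value of the url parameter from the query string
# 2. Base64-decode it once
# And that's it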

Here is the decoding code; it is quite simple.

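    import base64
    from urllib.parse import unquote


    def parse_real_url(url__: str) -> str:
        """
        Recover the real url from a jump link, mirroring the page's JS:
        1. Call get_query_string
            - Take the substr  #url=aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8=
            - Match the second half  #aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8=
            - Decode that string with unescape
        2. If the result is not empty, call b64_decode
            - Decode it with base64
            - Decode the resulting url with decodeURIComponent
        """
        # Everything after '?url=' is the encoded target
        query_string = url__.split('?url=')[-1]
        # Undo percent-encoding (e.g. '%3D' back to '=') before base64
        encoded = unquote(query_string)
        # Decode the bytes to text directly rather than slicing str(bytes)
        decoded = base64.b64decode(encoded).decode('utf-8')
        return unquote(decoded)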

Ran it, and everything works as expected.
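For a quick check without touching the network, the sketch below decodes the sample value from the analysis above. It is a variant rather than the exact function shown earlier: it reads the `url` parameter through `urlsplit` instead of splitting on `'?url='`, which also copes with extra query parameters. The `decode_jump_url` name and the `/go/` path in the sample link are assumptions made for the example.

```python
import base64
from urllib.parse import unquote, urlsplit


def decode_jump_url(jump_url):
    """Return the target hidden in the `url` query parameter, or None."""
    query = urlsplit(jump_url).query
    for pair in query.split("&"):
        # partition() splits on the first '=', so base64 padding survives.
        key, _, value = pair.partition("=")
        if key == "url" and value:
            raw = base64.b64decode(unquote(value))
            return unquote(raw.decode("utf-8"))
    return None


# Hypothetical jump link; the base64 value is the one from the analysis.
sample = "https://www.acg81.cn/go/?url=aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8="
print(decode_jump_url(sample))
```

With that sample the call prints http://www.cilicili.cc/. `parse_qs` is deliberately avoided here because it turns `+` into a space, which would corrupt base64 values that contain `+`.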

Full code

import base64
import json
import re
import time

import requests
from scrapy import Selector
from urllib.parse import unquote


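def request(url_: str, retries: int = 5):
    """
    Request a url, retrying a limited number of times on network errors
    :param url_: the link to request
    :param retries: maximum number of attempts
    :return: the response
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url=url_, headers=headers, timeout=60)
            response.encoding = response.apparent_encoding
            return response
        except requests.RequestException:
            print(f"Network hiccup, retrying ({attempt}/{retries})...")
            time.sleep(1)
    # Give up instead of recursing forever when the site stays unreachable
    raise RuntimeError(f"Failed to fetch {url_} after {retries} attempts")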


def parse_html(html_):
    """
    Parse an html page
    :param html_: the html text
    :return: a selector for the page
    """
    return Selector(text=html_)


def parse_page_five(url_: str):
    """
    Parse the page https://www.acg81.cn/
    :param url_: https://www.acg81.cn/
    :return: the collected site data
    """
    html = request(url_).text
    web = parse_html(html)
    web_list = []

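    def parse_real_url(url__: str) -> str:
        """
        Recover the real url from a jump link
        1. Call get_query_string
            - Take the substr  #url=aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8=
            - Match the second half  #aHR0cDovL3d3dy5jaWxpY2lsaS5jYy8=
            - Decode that string with unescape
        2. If the result is not empty, call b64_decode
            - Decode it with base64
            - Decode the resulting url with decodeURIComponent
        """
        # Everything after '?url=' is the encoded target
        query_string = url__.split('?url=')[-1]
        # Undo percent-encoding (e.g. '%3D' back to '=') before base64
        encoded = unquote(query_string)
        # Decode the bytes to text directly rather than slicing str(bytes)
        decoded = base64.b64decode(encoded).decode('utf-8')
        return unquote(decoded)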

    for node in web.xpath("//a[@class='card-item']"):
        # First request: open the site's detail page
        info_url = url_ + node.xpath('./@href').get().strip()
        info_html = request(info_url)
        info_web = parse_html(info_html.text)
        # Get the url of the next (jump) page from the inline script
        temp_script = info_web.xpath('/html/body/script[12]/text()').get().strip()
        next_url = 'https:' + re.findall(r'attr\("href","(.*?)"\);', temp_script)[0]
        # Decode the real url with the helper defined above
        real_url = parse_real_url(next_url)

        image = node.xpath("./div[1]/@data-src").get()
        # get() returns None when the attribute is missing, so guard first
        if image and image.startswith("//dingmancn.com"):
            image = "https:" + image
        item = {
            'url': real_url,
            'image': image,
            'title': node.xpath(".//span[@class='card-title']/text()").get().strip(),
            'describe': node.xpath(".//span[@class='card-desc']/text()").get().strip(),
            'state': 1,
            'original_site': url_
        }
        web_list.append(item)
        print(item)

    return web_list
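The loop above builds the detail-page link by string concatenation and special-cases a single image host. As a possible alternative, here is a small sketch that resolves both kinds of link with `urllib.parse.urljoin`. The `absolutize` helper and the sample paths are hypothetical, and whether the site's `href` values start with a slash is an assumption, not something verified here.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.acg81.cn/"


def absolutize(base, link):
    """Resolve a relative or protocol-relative link against the base url."""
    if not link:
        return None
    return urljoin(base, link.strip())


# Hypothetical paths, used only to show the three cases.
print(absolutize(BASE_URL, "/sites/1.html"))
print(absolutize(BASE_URL, "sites/1.html"))
print(absolutize(BASE_URL, "//dingmancn.com/logo.png"))
```

`urljoin` fills in the scheme for protocol-relative links of any host and avoids a doubled slash when the path already starts with `/`.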

That's all. March 9, 2023.