Spider mini case study: https://industry.cfi.cn/BCA0A4127A4128A4141.html

石头里蹦出的猴子

Last modified: 2023-12-11 18:45:14

Tags: python, web scraping, javascript

First published: 2023-12-11 18:18:14

Article link: https://blog.csdn.net/Amber_shi/article/details/134932310

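This article shows how to use packet capture analysis, Python's requests library, and regular expressions to decode list page and detail page data that the site processes with its JavaScript `unes` function and the built-in `unescape`, by converting between the `%u` and `\u` Unicode escape forms.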

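1. Fetching the list page

Packet capture shows that the list page data is not returned as ordinary HTML (the raw response appears in a screenshot in the original post).

Inspecting the page shows that the list data is processed by a function named unes, so that function is the next thing to look at.

The function first replaces every "~" in the list data with "%u", then decodes the resulting string with unescape. That is all it takes to recover the list page data, so we reproduce it in Python: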

import re
from urllib.parse import unquote

import requests


def get_list_page():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    }
    url = 'https://industry.cfi.cn/BCA0A4127A4128A4141.html'
    response = requests.get(url, headers=headers)
    # The list data is embedded in JavaScript variables of the form: var n...="...";
    re_data = re.findall(r'var n.*?="(.*?)";', response.text)
    for data in re_data:
        # Mirror unes(): turn "~XXXX" into "\uXXXX", then decode the escapes
        result = data.replace("~", "\\u")
        list_info = unquote(result).encode('utf8').decode('unicode_escape')
        # Detail page URL
        detail_url = "https://industry.cfi.cn/" + ''.join(re.findall(r'onclick="window\.open\(\'(.*?)\'\);"', list_info, re.S))
        print(detail_url)
        # Title: the text after the onclick handler, with HTML tags stripped.
        # Prefixing '<' completes the cut-off opening tag so it is removed too.
        # (The original used a character class, which also deleted letters such as f, o, n, t from titles.)
        tail = list_info.split(');"')[-1]
        title_info = re.sub(r'<[^>]*>', '', '<' + tail).strip()
        print(title_info)
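
The `unes` logic described above (replace "~" with "%u", then `unescape`) can also be ported to Python more directly, without going through the `unicode_escape` codec. The following is only a minimal sketch: the names `js_unescape` and `_ESCAPE_RE` are introduced here for illustration and are not part of the site's code, and the sketch does not join surrogate pairs for characters outside the Basic Multilingual Plane.

```python
import re

# Matches %uXXXX (a UTF-16 code unit) or %XX (a code point below 256),
# the two escape forms that JavaScript's legacy unescape() recognizes.
_ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def js_unescape(text):
    """Approximate JavaScript's unescape(); malformed sequences are left as-is."""
    def _decode(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(_decode, text)


def unes(text):
    """Sketch of the site's unes() as described: '~' -> '%u', then unescape."""
    return js_unescape(text.replace("~", "%u"))


print(unes("~4F60~597D"))  # 你好
```

Because only well-formed escapes are replaced, a stray "~" that is not followed by four hex digits passes through as "%u" instead of raising a decoding error.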

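2. Fetching the detail page

With the detail page URL in hand, we can now look at how to retrieve the detail page.

Packet capture shows the detail content in the response (screenshot in the original post). The function that processes it appears to be ifrnews, so the next step is to locate that function and see what it does (also shown in a screenshot in the original post).

The line marked with an arrow in that screenshot produces the result we want. The handling is similar to the list page, so we reproduce the detail page retrieval in Python: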

def get_detail_page():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    }
    url = 'https://industry.cfi.cn/p20231209000312.html'
    response = requests.get(url, headers=headers)
    # Extract the detail content from the response (variables named nr<digits>)
    content = ''.join(re.findall(r"var nr\d+=\"(.*?)\";", response.text, re.S))
    # Decode the detail content: "~XXXX" -> "\uXXXX", then decode the escapes
    detail_page_html = unquote(content).replace('~', "\\u").encode('utf8').decode('unicode_escape')
    print(detail_page_html)
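
To check the extraction and decoding steps without hitting the network, the parsing can be separated from the request. The sketch below assumes a made-up response body (`SAMPLE_HTML`) shaped like the one described above; the function name `extract_detail` is hypothetical, and the real page may differ.

```python
import re
from urllib.parse import unquote

# Hypothetical sample shaped like the response described above: the article
# body sits in a JavaScript variable named nr<digits>, with "~XXXX" escapes.
SAMPLE_HTML = '<script>var nr1="<p>~4F60~597D</p>";</script>'


def extract_detail(html):
    """Collect every var nr<digits>="..." payload and decode its ~XXXX escapes."""
    payload = "".join(re.findall(r'var nr\d+="(.*?)";', html, re.S))
    # Percent-decode first, as the function above does.
    payload = unquote(payload)
    # Turn each ~XXXX into the character with that hexadecimal code point.
    return re.sub(r"~([0-9A-Fa-f]{4})", lambda m: chr(int(m.group(1), 16)), payload)


print(extract_detail(SAMPLE_HTML))  # <p>你好</p>
```

In real use, `response.text` from the request above would be passed in place of `SAMPLE_HTML`.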

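Summary:

JavaScript's legacy escape and unescape functions write a Unicode code unit as “%u” followed by four hex digits, whereas Python writes it as “\u” followed by four hex digits. Converting one prefix into the other is what lets Python decode the site's data.

Examples:

In JavaScript, unescape decodes “%u” sequences: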

var str = "%u4F60%u597D";
var decodedStr = unescape(str);
console.log(decodedStr); // Output: 你好

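In Python, “\u” sequences can be decoded with the unicode_escape codec:

Note that the backslash must itself be escaped in a regular Python string literal, so “\\” is written to represent a single backslash. This keeps the escape sequence as literal text in the string, so that the codec decodes it at runtime.

encoded = "\\u4F60\\u597D"
decoded_str = bytes(encoded, "utf-8").decode("unicode_escape")
print(decoded_str)  # Output: 你好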
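
One caveat with the `unicode_escape` approach: the codec treats every non-escape byte as Latin-1, so text that already contains non-ASCII characters gets garbled. The sketch below illustrates this with a made-up string and shows a regex-based replacement as one possible alternative; the variable names are for illustration only.

```python
import re

# 'café' followed by two literal \uXXXX escape sequences
encoded = "café \\u4F60\\u597D"

# unicode_escape reads the UTF-8 bytes of 'é' as two Latin-1 characters.
via_codec = encoded.encode("utf-8").decode("unicode_escape")

# Replacing only the \uXXXX sequences leaves the rest of the text untouched.
via_regex = re.sub(r"\\u([0-9A-Fa-f]{4})", lambda m: chr(int(m.group(1), 16)), encoded)

print(via_codec)  # cafÃ© 你好
print(via_regex)  # café 你好
```

For data that is pure ASCII plus escapes, as in the examples above, both approaches give the same result.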

The content above is for learning purposes only.
