Web Scraping Notes: Splash

Latest recommended article published on 2024-08-10 10:09:09

Spark_zzz

Tags: web scraping, python, programming languages

Article link: https://blog.csdn.net/m0_60255954/article/details/128378796

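This post shows how to use Python and a Splash Lua script to search for products on JD.com and scrape each page of search results. The script requests the given search page, waits for it to load, finds and extracts the titles of all products, and saves a screenshot of each results page as a PNG file. A loop generates several URLs so that multiple pages of search results are covered.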

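Example: use a Splash Lua script to search for products on JD.com, scrape the names of the products found, and save a screenshot of each page of search results as a PNG file.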

import base64

import requests

# Lua script executed by Splash; args.page comes from the "page" request parameter
lua = """
function main(splash, args)
    -- Request the search results page
    assert(splash:go("https://search.jd.com/Search?keyword=python&page=" .. args.page))
    splash:wait(1)
    -- Find all <a> nodes that match the selector
    local a_list = splash:select_all('#J_goodsList > ul > li > div > div > a')
    -- Holds the titles of the books found by the search
    local titles = {}
    for _, a in ipairs(a_list) do
        -- Get the book title; #titles is the current length of the titles array
        titles[#titles + 1] = a.node.attributes.title
    end
    return {
        titles = titles,
        png = splash:png()
    }
end
"""

# Request six result pages (page = 1, 3, 5, 7, 9, 11)
for i, page in enumerate(range(1, 13, 2), start=1):
    # Let requests URL-encode the Lua source and the page number
    response = requests.get(
        'http://localhost:8050/execute',
        params={'lua_source': lua, 'page': page},
    )
    response.raise_for_status()
    json_obj = response.json()
    # Print the titles of all books on the current page
    print(json_obj['titles'])
    # The screenshot arrives as a base64-encoded string
    png_bytes = base64.b64decode(json_obj['png'])
    # Save a screenshot of each search results page
    with open(str(i) + '.png', 'wb') as f:
        f.write(png_bytes)
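
The response handling in the loop above can also be pulled into a small helper, which makes it testable without a running Splash instance. The following is only a sketch: the names `normalize_titles` and `save_result` are hypothetical and not part of Splash or requests. It assumes the response JSON has the `titles` and `png` fields returned by the Lua script, and it accepts `titles` either as a JSON array or as an object keyed by index, since a plain Lua table may be serialized either way depending on how Splash treats it.

```python
import base64
from pathlib import Path

# First eight bytes of every valid PNG file
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def normalize_titles(titles):
    """Return titles as a list (hypothetical helper).

    Accepts a list, or a dict keyed by "1", "2", ... as produced when a
    Lua table is serialized as a JSON object.
    """
    if isinstance(titles, dict):
        return [titles[key] for key in sorted(titles, key=int)]
    return list(titles or [])


def save_result(result, index, out_dir="."):
    """Decode the screenshot in `result` and write it to <out_dir>/<index>.png.

    Returns a (titles, path) tuple. Raises ValueError if the decoded
    bytes do not look like a PNG image.
    """
    png_bytes = base64.b64decode(result["png"])
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("decoded 'png' field is not a PNG image")
    path = Path(out_dir) / f"{index}.png"
    path.write_bytes(png_bytes)
    return normalize_titles(result.get("titles")), path
```

Inside the loop, this would be called as `titles, path = save_result(json_obj, i)` in place of the decode-and-write steps.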