Writing an http workshop script to parse music from cached web pages

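The previous article, "Writing an http workshop script to download music from a website", showed how to call an API with HttpClient and how to parse the JSON it returns.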

Today we parse a web page to show how to use the built-in LibXml2 support to parse HTML and extract the content we care about.

I picked two resource sites more or less at random, and they turned out to use exactly the same page format:

https://www.51miz.com/so-sound/86888.html
https://www.yespik.com/search-sound/86838.html

1. Analyzing the page structure

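Open the browser's developer tools (F12) and use the element picker to inspect the page structure. Alternatively, save the page as HTML, open it in VS Code, and format it.

The page turns out to be very simple: every resource is rendered as a node with the same structure.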

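Normally we would use XPath to find the node list, but HTML differs from XML in that it is often not well-formed. After the document has been parsed in recovery mode, XPath is not guaranteed to work, and in that case we have to walk the DOM tree to find the nodes.

The xpathSimple function searches by node name, plus one optional attribute name and an optional attribute value. In the calls used in this article, an empty attribute value matches any node that carries the attribute.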

-- Find all div elements that carry a data-id attribute
nodes = doc:xpathSimple("div", "data-id", "")
n = nodes:size()
if n == 0 then
    return 0
end

Alternatively, we can look up the nodes one level further down directly:

-- Find all div elements whose class is SoundCotent
nodes = doc:xpathSimple("div", "class", "SoundCotent")

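Each first-level node found this way contains two child nodes that carry the information we need.

The first holds the title text and the link to the detail page: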

<div class="SoundTitle"> 
<a target="_blank"
 href="https://www.yespik.com/show-sound_293365.html">企业宣传片 希望未来</a> 
</div>

The second holds the resource link:

<source
src="//img-bsy2.yespik.com/sound/00/29/33/65/293365_3d8657e566ff4b2aebffe9b09658f5bd.mp3"
type="audio/mpeg">
</audio>

The corresponding lookups are as follows. For the title text:

titles = item:xpathSimple("div", "class", "SoundTitle")
if (titles:size() > 0) then
    title_a = titles:at(0):getChildByIndex("a", 0)
    if not(title_a:isNull() ) then 
        titleStr = title_a:getValue()
        titleStr = utf8ToAnsi(titleStr)
        print(titleStr)
    end    
else
    print("not found title")
    goto continue
end

For the resource node:

audio = item:xpathSimple("source", "src", "")
if  audio:size() > 0 then 
    src = audio:at(0):getAttrByName("src")
    if not src:isNull() then 
        linkStr = "https:" .. src:getValue()
        print(linkStr)
    end
end 
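
The snippet above prepends "https:" unconditionally. That works here because the site emits protocol-relative src values (starting with //), but a src that is already a full URL would come out as "https:https://...". A minimal sketch of a more defensive version follows. toAbsoluteUrl is a hypothetical helper name, not part of http workshop, and it only handles the protocol-relative case.

```lua
-- Hypothetical helper (not an http workshop API): turn the value of a
-- src attribute into an absolute URL.
local function toAbsoluteUrl(src, scheme)
    scheme = scheme or "https"
    if src:sub(1, 2) == "//" then
        -- Protocol-relative URL: only the scheme is missing.
        return scheme .. ":" .. src
    end
    -- Already absolute, or a form this sketch does not handle: return unchanged.
    return src
end

print(toAbsoluteUrl("//img-bsy2.yespik.com/sound/a.mp3"))  --> https://img-bsy2.yespik.com/sound/a.mp3
print(toAbsoluteUrl("https://example.com/a.mp3"))          --> https://example.com/a.mp3
```

In the script, this would replace the string concatenation: linkStr = toAbsoluteUrl(src:getValue()).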

Note that both xpath and xpathSimple return a node set. You have to check the number of elements yourself, because it may be 0 (nothing found).
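
Because every lookup needs the same size check, it can be folded into a small helper. The sketch below relies only on the size() and at() methods used in this article. firstOrNil and fakeNodeSet are hypothetical names, and fakeNodeSet is just a stand-in so the helper can be tried outside http workshop.

```lua
-- Hypothetical helper: return the first node of a node set, or nil when it is empty.
local function firstOrNil(nodes)
    if nodes:size() > 0 then
        return nodes:at(0)
    end
    return nil
end

-- Stand-in for a node set (an assumption, not the tool's real object).
-- at() is 0-based, matching the way the article calls it.
local function fakeNodeSet(items)
    return {
        size = function(self) return #items end,
        at = function(self, i) return items[i + 1] end,
    }
end

print(firstOrNil(fakeNodeSet({})))          --> nil
print(firstOrNil(fakeNodeSet({"a", "b"})))  --> a
```

With a real node set, the caller would test the result against nil before calling getValue() or getAttrByName() on it.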

2. Script code

The complete code is as follows:

author = "Example"
version = 1.0

setting = {
    name = "mizhi music search",
    dir = "d:\\MP3",
    desc = "After a search the site also returns a static page address, presumably because nginx is used for acceleration, e.g. https://www.51miz.com/so-sound/86888.html",
    input1 = "Page URL",
    input2 = "unused",
    input3 = "unused",
}
  
client = HttpClient()
header = HttpHeader()
header:setItem('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36')

function search(url)
    local urlSearch = url 
    
    local code = client:doGet(urlSearch, header)
    printMessage("HTTP response: " .. code)
    
    printMessage("Parsing HTML data")
    --printMessage(utf8ToAnsi(client:getBodyAsString()))
    --printMessage(client:getBodyAsAnsiString())
    
    body = client:getBodyAsString()
    
    -- XML_PARSE_NOBLANKS + XML_PARSE_RECOVER
    local opt =  256 + 1
    doc = parseXmlString(body, "", opt)
    if doc:isNull() then
        print("Failed to parse HTML")
        return 0
    end
    
    -- Find all div elements that carry a data-id attribute
    nodes = doc:xpathSimple("div", "data-id", "");
    n = nodes:size()
    if n == 0 then
        return 0
    end
 
    count = 0
    for i = 0,  n-1 do
        
        local titleStr = ""
        local linkStr = ""
        local lenStr = ""
        item = nodes:at(i)
        
        titles = item:xpathSimple("div", "class", "SoundTitle")
        if (titles:size() > 0) then
            title_a = titles:at(0):getChildByIndex("a", 0)
            if not(title_a:isNull() ) then 
                titleStr = title_a:getValue()
                titleStr = utf8ToAnsi(titleStr)
                print(titleStr)
            end    
        else
            print("not found title")
            goto continue
        end
        
        audio = item:xpathSimple("source", "src", "")
        if  audio:size() > 0 then 
            src = audio:at(0):getAttrByName("src")
            if not src:isNull() then 
                linkStr = "https:" .. src:getValue()
                print(linkStr)
            end
        end 
        
        tms = item:xpathSimple("div", "class", "SoundEndTime fl end-time")
        if tms:size() > 0 then 
            lenStr = tms:at(0):getValue()
            print(lenStr)
        end
        
        
        singer = "椰子音效"
        album_name = ""
        
        --printMessage(singer .. " | " .. titleStr .. " | " .. album_name  .. " | ".. linkStr)
        
        --downloadMp3(music_url, singer, song_name)
        --downloadMp3(music_url, keyWord, song_name)
        
        local tbl = {
            singer = singer,
            song = titleStr,
            album = album_name,
            tags =  titleStr,
            size = lenStr,
            url = linkStr,
        }
        -- Once a music entry has been parsed, call this function to notify the UI
        count = count + 1
        notifyData(1, tbl)
        ::continue::
    end --for
    return count
end


function lua_main(url, pageIndex, pageSize)
    printMessage("Preparing search")
    printMessage(url)

    --printMessage("engine name is ".. engine_name())
    --printMessage("engine version is ".. engine_version())
    printMessage("Current directory: " .. setting.dir)

    -- https://www.51miz.com/so-sound/86888.html
    local n = 0
    n = n + search(url)

    return n
end



function downloadMp3(music_url, singer, song)
    -- Store each singer's files in their own subdirectory of setting.dir
    local subDir = combinePath(setting.dir, singer)
    mkDir(subDir)

    local fileName = subDir .. "\\" .. song .. ".mp3"
    printMessage(fileName)
    local code = client:doGetToFile(music_url, header, fileName)
    print(code)
end
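
downloadMp3 builds the file name directly from the title, and titles scraped from a page may contain characters that Windows does not allow in file names. The following is a minimal sketch of a cleanup step. sanitizeFileName is a hypothetical helper, not part of http workshop. It should be applied to the UTF-8 text before utf8ToAnsi, because in multi-byte ANSI code pages such as GBK the trailing byte of a character can coincide with an ASCII character like \ or |.

```lua
-- Hypothetical helper: replace characters that are invalid in Windows
-- file names (\ / : * ? " < > |) with an underscore.
local function sanitizeFileName(name)
    -- Inside a Lua character class these characters are all literal.
    -- The parentheses drop gsub's second return value (the match count).
    return (name:gsub('[\\/:*?"<>|]', "_"))
end

print(sanitizeFileName('a/b:c?.mp3'))  --> a_b_c_.mp3
print(sanitizeFileName("plain name"))  --> plain name
```

Bytes of multi-byte UTF-8 characters are all 0x80 or above, so Chinese titles pass through this function unchanged.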

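To use the script, find a page in the browser whose resources you want to export, paste its address into the input field, and run the parse.

For example, take this link:

https://www.51miz.com/so-sound/1558589.html

After parsing, every audio entry found on the page is reported to the interface through notifyData.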

end here.
