Paper Reading: HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection


License: CC 4.0 BY-SA

Tags: paper reading

1. Paper overview

A Chinese-language walkthrough of the paper is available here:
https://blog.csdn.net/ggqyh/article/details/136098693

Code:
https://github.com/RetroCirce/HTS-Audio-Transformer

2. Questions about event localization

This section collects the author's replies to questions about audio event localization.

2.1 Event localization on AudioSet

https://github.com/RetroCirce/HTS-Audio-Transformer/issues/25

# Make sure that fl_local=True in config.py before running the test
CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
# Organize and gather the localization results
python fl_evaluate.py
# Then open fl_evaluate_f1.ipynb and follow the notebook to produce the results

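Yes, this function is a temporary one. As you may know, AudioSet released a small subset with strong (localization) labels last year. I processed that data on my company's server for later use, but I can no longer access it.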

I think doing localization on AudioSet is different from doing it on DESED. There are two differences, and for both I would suggest writing your own processing code:

If you want to train a new HTS-AT model on localization data (my HTS-AT can support it, but I did not write it), you need to extract a different output of HTS-AT (I believe it is the feature-map output of the second-to-last layer) and add a loss function to train on it. This might actually become a new piece of work. One thing to keep in mind is that the interpolation and resolution of the output may differ from the time resolution of the input localization labels, so you need to find a way to align them.
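The alignment problem mentioned in the reply can be illustrated with a minimal sketch. It is not taken from the HTS-AT repository: the function names `events_to_frames` and `resample_linear`, the 10-second clip length, and the frame counts are all hypothetical, chosen only to show one way of putting strong labels and frame-wise model outputs on the same time grid. Linear interpolation is an assumption here; the author does not say which scheme should be used.

```python
from typing import List, Sequence, Tuple


def events_to_frames(
    events: Sequence[Tuple[float, float]],
    clip_seconds: float,
    num_frames: int,
) -> List[float]:
    """Rasterize (onset, offset) pairs in seconds into a 0/1 target per frame.

    A frame is marked active when its center falls inside an event.
    """
    frame_seconds = clip_seconds / num_frames
    target = [0.0] * num_frames
    for onset, offset in events:
        for i in range(num_frames):
            center = (i + 0.5) * frame_seconds
            if onset <= center < offset:
                target[i] = 1.0
    return target


def resample_linear(values: Sequence[float], new_length: int) -> List[float]:
    """Linearly interpolate a frame-wise sequence to new_length frames.

    Frame centers of both grids are placed on the same time axis; positions
    before the first or after the last source center are clamped.
    """
    old_length = len(values)
    if old_length == 0 or new_length <= 0:
        raise ValueError("need a non-empty input and a positive target length")
    if old_length == 1:
        return [float(values[0])] * new_length
    out = []
    for j in range(new_length):
        # Center of target frame j, expressed in units of source frames.
        pos = (j + 0.5) * old_length / new_length - 0.5
        pos = min(max(pos, 0.0), old_length - 1.0)
        left = int(pos)
        right = min(left + 1, old_length - 1)
        frac = pos - left
        out.append((1.0 - frac) * values[left] + frac * values[right])
    return out


# Hypothetical example: one event from 1 s to 3 s in a 10 s clip,
# labels rasterized to 10 frames, model output available at only 5 frames.
labels = events_to_frames([(1.0, 3.0)], clip_seconds=10.0, num_frames=10)
coarse_output = [0.1, 0.9, 0.2, 0.1, 0.1]  # made-up frame probabilities
aligned_output = resample_linear(coarse_output, new_length=len(labels))
```

Once `aligned_output` and `labels` have the same length, a frame-wise loss (for example binary cross-entropy) or a frame-wise metric can be computed between them. The opposite direction, downsampling the labels to the model's resolution, is equally possible; which one is preferable depends on the evaluation protocol.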

If you want to evaluate the model on a localization dataset, fl_evaluate.py can serve as a code base, but you need to