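[Graduation Project] Ecological Simulation and 3D Content Generation System Based on Procedural Generation and Audio Detection: Audio Detection Algorithm Design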

EndlessDaydream

Last modified 2023-04-20 16:41:22


First published 2023-04-16 18:55:31

Article link: https://blog.csdn.net/Angelloveyatou/article/details/130185539


Related development log (2022.09.02): ZENO, Audio, Beat detection algorithm, Combine Wav&Mp3 with minimp3 and ffmpeg (EndlessDaydream's blog, CSDN): https://blog.csdn.net/Angelloveyatou/article/details/126670613

4 Audio Detection Algorithm Design

4.1 Beat Detection Algorithm

4.1.1 Beat detection algorithms

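To implement a beat detection algorithm, we first need to compute the sound energy within a chosen frequency range. This can be done by running an FFT and summing the squared magnitudes of the frequency bins that fall inside that range. We then compute the average energy over a period of time (for example, a few seconds) preceding the current playback position.

Once the average energy is known, it is compared with the current energy in the same frequency range. If the current energy exceeds the average by more than a certain threshold, we conclude that a beat has occurred. The threshold can be tuned to control the sensitivity of the detection.

To run the algorithm in real time, we need to maintain a buffer of previous energy values and update it each time a new energy value is computed. A circular buffer can store the energy values, with a pointer tracking the current position in the buffer.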
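
The circular buffer described above can be sketched roughly as follows. This is only an illustration: the class name `EnergyHistory`, the helper `exceedsAverage`, and the `factor` parameter are hypothetical names chosen for this sketch, not part of the system's code.

```cpp
#include <cstddef>
#include <vector>

// Fixed-capacity circular buffer holding the most recent energy values.
// (Hypothetical helper for illustration only.)
class EnergyHistory {
public:
    explicit EnergyHistory(std::size_t capacity)
        : values_(capacity > 0 ? capacity : 1, 0.0) {}

    // Overwrite the oldest slot with the newest energy and advance the pointer.
    void push(double energy) {
        values_[next_] = energy;
        next_ = (next_ + 1) % values_.size();
        if (count_ < values_.size()) {
            ++count_;
        }
    }

    // Mean of the values stored so far (0 while the buffer is still empty).
    double average() const {
        if (count_ == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (std::size_t i = 0; i < count_; ++i) {
            sum += values_[i];
        }
        return sum / static_cast<double>(count_);
    }

    std::size_t size() const { return count_; }

private:
    std::vector<double> values_;
    std::size_t next_ = 0;   // index of the slot that will be written next
    std::size_t count_ = 0;  // number of valid entries
};

// True when the current energy exceeds the running average scaled by `factor`
// (an assumed sensitivity parameter, for example 1.5).
inline bool exceedsAverage(double energy, const EnergyHistory& history, double factor) {
    return history.size() > 0 && energy > factor * history.average();
}
```

A fixed-size buffer keeps the per-window cost constant and avoids reallocations, which matters when the check runs once per audio window.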

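The beat detection algorithm is not perfect: it may miss some beats or report false positives. It does, however, approximate the rhythm of a song well, and it can be used to synchronize visual effects or to trigger events in games and interactive applications.

Beat detection algorithms analyze an audio signal to determine its rhythm or beat. Some common approaches are:

1. Autocorrelation: this method detects beats by computing the autocorrelation function of the signal, which shows how strongly the signal correlates with a time-delayed copy of itself. When the signal contains a repeating pattern, the autocorrelation function has pronounced peaks, and these peaks correspond to the beat of the signal.

2. Peak detection: this method detects beats by looking for peaks in the signal. These peaks are usually related to the intensity or energy of the signal. Ideally, the number of peaks detected in a time span matches the number of beats in that span.

3. Fast Fourier transform: this method detects beats by applying a fast Fourier transform (FFT) to the signal. The FFT converts the signal to the frequency domain, where frequencies and their intensities can be measured. The beat of the signal can then be determined from energy peaks in the frequency domain.

4. Model-based methods: these methods use a time-based model to detect beats. The model represents the signal as a series of timed events and is used to recognize beat patterns.

These methods can be used alone or in combination to obtain more accurate beat detection.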
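
As an illustration of the autocorrelation approach from the list above, the sketch below searches for the lag at which an energy envelope best matches a delayed copy of itself. It assumes the input is a per-window energy (or onset strength) sequence; the function names `autocorrelation` and `dominantLag` are hypothetical, and this is not the method used by the system described here.

```cpp
#include <cstddef>
#include <vector>

// Unnormalized autocorrelation of `x` at the given lag.
double autocorrelation(const std::vector<double>& x, std::size_t lag) {
    double sum = 0.0;
    for (std::size_t i = 0; i + lag < x.size(); ++i) {
        sum += x[i] * x[i + lag];
    }
    return sum;
}

// Lag in [minLag, maxLag] with the largest autocorrelation value.
// For a periodic envelope this is the beat period measured in windows.
std::size_t dominantLag(const std::vector<double>& x,
                        std::size_t minLag, std::size_t maxLag) {
    std::size_t best = minLag;
    double bestValue = autocorrelation(x, minLag);
    for (std::size_t lag = minLag + 1; lag <= maxLag && lag < x.size(); ++lag) {
        const double value = autocorrelation(x, lag);
        if (value > bestValue) {
            bestValue = value;
            best = lag;
        }
    }
    return best;
}
```

With roughly 43 windows per second, a dominant lag of L windows would correspond to a tempo of about 60 * 43 / L beats per minute.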

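This system uses a simple statistical model based on sound energy.

4.1.2 Beat detection with a simple statistical model based on sound energy

This system implements a simple beat detection algorithm using a statistical model of sound energy. The basic idea is to detect rhythm from changes in the energy of the audio data: compute the average energy of the sound over the few seconds before the current playback position and compare it with the current energy. If the difference exceeds a certain threshold, a beat is reported.

With a window size of 1024 samples and a sampling rate of 44100 Hz, a buffer of 44100/1024 ≈ 43 elements is needed to store 1 second of history. The value stored for each window is obtained from the FFT analysis.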

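We focus the analysis on the lowest part of the spectrum. The reason is that the low frequencies capture the kick drum and snare drum of the drum kit, which is one of the instruments most commonly used to carry the rhythm of a song. In our experiments we use the bass range of 60 Hz to 130 Hz, where the kick drum is found, and the low midrange of 301 Hz to 750 Hz, where the snare drum is found. The low midrange contains the low-order harmonics of most instruments and is generally regarded as the bass presence range.

We therefore need the sound information in these ranges, that is, the corresponding elements of the FFT result. To find the frequency of each element of the FFT result, we compute the frequency resolution (44100/1024 ≈ 43 Hz) and multiply it by the index into the data array. So the first element holds the result for roughly 0-43 Hz, the second for 43-86 Hz, the third for 86-129 Hz, and so on.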
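
The arithmetic above can be checked with a small sketch. It follows the convention stated in the text that element i covers the range from i times the bin width up to the next multiple; the function names `binWidthHz`, `historyLength`, and `bandBins` are hypothetical helpers introduced only for this example.

```cpp
#include <cstddef>
#include <utility>

// Width in Hz of one FFT bin: sample rate divided by window size.
double binWidthHz(double sampleRate, std::size_t windowSize) {
    return sampleRate / static_cast<double>(windowSize);
}

// Number of whole windows needed to cover `seconds` of audio.
std::size_t historyLength(double sampleRate, std::size_t windowSize, double seconds) {
    return static_cast<std::size_t>(sampleRate * seconds /
                                    static_cast<double>(windowSize));
}

// First and last bin index (inclusive) that contain lowHz and highHz,
// assuming bin i covers [i * width, (i + 1) * width).
std::pair<std::size_t, std::size_t> bandBins(double lowHz, double highHz,
                                             double sampleRate,
                                             std::size_t windowSize) {
    const double width = binWidthHz(sampleRate, windowSize);
    return {static_cast<std::size_t>(lowHz / width),
            static_cast<std::size_t>(highHz / width)};
}
```

Under these assumptions, the 60 Hz to 130 Hz bass range maps to bins 1 through 3 and the 301 Hz to 750 Hz low midrange maps to bins 6 through 17 for a 1024-sample window at 44100 Hz.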

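Algorithm

Let k and k+n be the limits of the frequency range being processed, and let FFT[i] be the magnitude of the frequency component at index i. As described in section 4.1.1, the current energy of the range is the sum of the squared magnitudes over those bins:

$$E = \sum_{i=k}^{k+n} \left| \mathrm{FFT}[i] \right|^2$$

(The implementation in section 4.1.3 additionally divides this sum by the window size.)

This value is stored together with the next 42 values, giving 43 entries, or 1 second of history (H).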
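
A minimal sketch of this energy computation is shown below. It assumes the spectrum is available as a vector of complex values, which matches how the code in section 4.1.3 reads `real()` and `imag()` from each element; the function name `bandEnergy` is hypothetical.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Sum of squared magnitudes of bins k..k+n (inclusive).
// Indices beyond the end of the spectrum are ignored.
double bandEnergy(const std::vector<std::complex<double>>& spectrum,
                  std::size_t k, std::size_t n) {
    double energy = 0.0;
    for (std::size_t i = k; i <= k + n && i < spectrum.size(); ++i) {
        energy += std::norm(spectrum[i]);  // real^2 + imag^2
    }
    return energy;
}
```

`std::norm` returns the squared magnitude directly, so no square root is taken and then squared again.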

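The average energy of the band can now be computed from this history:

$$\mathrm{avg}(H) = \frac{1}{43} \sum_{i=0}^{42} H[i]$$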

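Usually, a value exceeding the average plus half of it (1.5 times the average) is a good threshold for detecting a beat. However, the variance of the history values can be used to adjust this factor:

$$\mathrm{var}(H) = \frac{1}{43} \sum_{i=0}^{42} \left( H[i] - \mathrm{avg}(H) \right)^2$$

where avg(H) is the mean of the history H. In very noisy music such as hard rock or rock and roll, beat detection becomes tricky, so the threshold needs to be lowered as the variance grows.

The relationship between threshold and variance can be expressed as a straight line in the (variance, threshold) plane passing through the two points (0, 1.55) and (0.02, 1.25). Its slope is (1.25 - 1.55) / 0.02 = -15, which gives

$$\mathrm{threshold} = -15 \cdot \mathrm{var}(H) + 1.55$$

Our FFT results lie in the range 0..1, so the variance values also lie in the range 0..1.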

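Finally, a beat is detected if

$$E > \left( -15 \cdot \mathrm{var}(H) + 1.55 \right) \cdot \mathrm{avg}(H)$$

where avg(H) and var(H) are the mean and variance of the history H. In that case the node outputs 1, otherwise 0, producing a 0/1 sequence that is passed to the next node.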
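
The decision for a single window can be sketched as below. It assumes the comparison used by the `AudioBeats` node in section 4.1.3, namely that the current energy must exceed (-15 * variance + 1.55) times the average, and it leaves out that node's extra `threshold` offset. The function name `isBeat` is hypothetical.

```cpp
#include <deque>

// Returns 1 when `energy` exceeds the variance-adjusted threshold derived
// from `history`, otherwise 0. An empty history never reports a beat.
int isBeat(double energy, const std::deque<double>& history) {
    if (history.empty()) {
        return 0;
    }
    double avg = 0.0;
    for (double h : history) {
        avg += h;
    }
    avg /= static_cast<double>(history.size());

    double var = 0.0;
    for (double h : history) {
        var += (h - avg) * (h - avg);
    }
    var /= static_cast<double>(history.size());

    return energy > (-15.0 * var + 1.55) * avg ? 1 : 0;
}
```

Calling this once per window and collecting the results yields the 0/1 sequence described above.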

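To implement the algorithm, I defined variables that store the history, the sampling frequency, the window size, and similar information, and wrote helper code to compute the average, the variance, and the threshold. To store the history I use a double-ended queue (deque): new values are appended at the back and the oldest value is removed from the front, matching the `push_back` and `pop_front` calls in the implementation.

The system detects beats with the following steps:

1. Split the audio data into windows of the chosen size and compute the average energy within each window.

2. Select the bass range of the spectrum, for example 60 Hz to 130 Hz, to capture the kick drum and snare drum of the drum kit.

3. For the data in each window, compute the FFT result within the bass range and obtain the corresponding frequency magnitudes.

4. Over a given span of history, for example the last 1 second, compute the energy at the current time point and the average energy of the history.

5. Adjust the threshold applied to the average energy according to the variance of the history.

6. Check whether the energy at the current time point exceeds the threshold and report a beat accordingly.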
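
The six steps can be put together in one self-contained sketch. To avoid depending on an FFT library, it evaluates only the selected bins with a direct DFT, and it scales the squared magnitudes by 1/N² so the values stay small; both choices are assumptions of this sketch rather than what the system's nodes do. The names `windowBandEnergy` and `detectBeats` are hypothetical, and each window is compared against the history of earlier windows only.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

// Steps 2-3: energy of bins [firstBin, lastBin] in one window, computed
// with a direct DFT of just those bins and scaled by 1 / windowSize^2.
double windowBandEnergy(const std::vector<double>& signal, std::size_t start,
                        std::size_t windowSize, std::size_t firstBin,
                        std::size_t lastBin) {
    const double pi = 3.14159265358979323846;
    const double size = static_cast<double>(windowSize);
    double energy = 0.0;
    for (std::size_t bin = firstBin; bin <= lastBin; ++bin) {
        double re = 0.0;
        double im = 0.0;
        for (std::size_t n = 0; n < windowSize; ++n) {
            const double angle = 2.0 * pi * static_cast<double>(bin) *
                                 static_cast<double>(n) / size;
            re += signal[start + n] * std::cos(angle);
            im -= signal[start + n] * std::sin(angle);
        }
        energy += (re * re + im * im) / (size * size);
    }
    return energy;
}

// Steps 1 and 4-6: walk over the signal window by window and emit 0 or 1.
std::vector<int> detectBeats(const std::vector<double>& signal,
                             std::size_t windowSize, std::size_t firstBin,
                             std::size_t lastBin, std::size_t historySize) {
    std::deque<double> history;
    std::vector<int> beats;
    for (std::size_t start = 0; start + windowSize <= signal.size();
         start += windowSize) {
        const double e =
            windowBandEnergy(signal, start, windowSize, firstBin, lastBin);
        int beat = 0;
        if (!history.empty()) {
            double avg = 0.0;
            for (double h : history) avg += h;
            avg /= static_cast<double>(history.size());
            double var = 0.0;
            for (double h : history) var += (h - avg) * (h - avg);
            var /= static_cast<double>(history.size());
            // Variance-adjusted threshold: line through (0, 1.55), (0.02, 1.25).
            beat = e > (-15.0 * var + 1.55) * avg ? 1 : 0;
        }
        beats.push_back(beat);
        history.push_back(e);
        if (history.size() > historySize) {
            history.pop_front();
        }
    }
    return beats;
}
```

For real audio, a call such as `detectBeats(samples, 1024, 1, 3, 43)` would cover roughly the 60 Hz to 130 Hz range at 44100 Hz; a production version would use an FFT instead of the direct DFT.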

4.1.3 Selected audio nodes in this system

Algorithm implementation

    struct AudioBeats : zeno::INode {
        std::deque<double> H;
        virtual void apply() override {
            auto wave = get_input<PrimitiveObject>("wave");
            float threshold = get_input<NumericObject>("threshold")->get<float>();
            auto start_time = get_input<NumericObject>("time")->get<float>();
            float sampleFrequency = wave->userData().get<zeno::NumericObject>("SampleRate")->get<float>();
            int start_index = int(sampleFrequency * start_time);
            int duration_count = 1024;
            auto fft = Aquila::FftFactory::getFft(duration_count);
            std::vector<double> samples;
            samples.resize(duration_count);
            for (auto i = 0; i < duration_count; i++) {
                // Clamp the index so that reads past the end repeat the last sample.
                samples[i] = wave->attr<float>("value")[min((start_index + i), wave->size()-1)];
            }
            Aquila::SpectrumType spectrums = fft->fft(samples.data());

            {
                double E = 0;
                for (const auto& spectrum: spectrums) {
                    E += spectrum.real() * spectrum.real() + spectrum.imag() * spectrum.imag();
                }
                E /= duration_count;
                H.push_back(E);
            }

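            // Keep at most 43 windows, about 1 second at 44100 Hz with 1024-sample windows.
            while (H.size() > 43) {
                H.pop_front();
            }
            // Mean energy of the history.
            double avg_H = 0;
            for (const auto& E: H) {
                avg_H += E;
            }
            avg_H /= H.size();

            // Variance of the history.
            double var_H = 0;
            for (const auto& E: H) {
                var_H += (E - avg_H) * (E - avg_H);
            }
            var_H /= H.size();
            // Beat if the newest energy, minus the user-supplied offset, exceeds C * avg_H,
            // where C = -15 * var_H + 1.55 is the line through (0, 1.55) and (0.02, 1.25).
            int beat = H.back() - threshold > (-15 * var_H + 1.55) * avg_H;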
            set_output("beat", std::make_shared<NumericObject>(beat));
            set_output("var_H", std::make_shared<NumericObject>((float)var_H));


            auto output_H = std::make_shared<ListObject>();
            // Pad the front with zeros so the output list always has 43 entries.
            for (std::size_t i = H.size(); i < 43; i++) {
                output_H->arr.emplace_back(std::make_shared<NumericObject>((float)0));
            }
            for (const auto & h: H) {
                output_H->arr.emplace_back(std::make_shared<NumericObject>((float)h));
            }
            set_output("H", output_H);

            auto output_E = std::make_shared<ListObject>();
            for (const auto& spectrum: spectrums) {
                double e = spectrum.real() * spectrum.real() + spectrum.imag() * spectrum.imag();
                output_E->arr.emplace_back(std::make_shared<NumericObject>((float)e));
            }
            set_output("E", output_E);
        }
    };

    ZENDEFNODE(AudioBeats, {
        {
            "wave",
            {"float", "time", "0"},
            {"float", "threshold", "0.005"},
        },
        {
            "beat",
            "var_H",
            "H",
            "E",
        },
        {},
        {
            "audio"
        },
    });

    struct AudioEnergy : zeno::INode {
        double minE = std::numeric_limits<double>::max();
        // lowest() is the most negative double; min() is the smallest positive value.
        double maxE = std::numeric_limits<double>::lowest();
        std::vector<double> init;
        virtual void apply() override {
            auto wave = get_input<PrimitiveObject>("wave");
            int duration_count = 1024;
            if (init.empty()) {
                auto fft = Aquila::FftFactory::getFft(duration_count);
                int clip_count = wave->size() / duration_count;
                init.reserve(clip_count);
                for (auto i = 0; i < clip_count; i++) {
                    std::vector<double> samples;
                    samples.resize(duration_count);
                    for (auto j = 0; j < duration_count; j++) {
                        samples[j] = wave->attr<float>("value")[min(duration_count * i + j, wave->size()-1)];
                    }
                    Aquila::SpectrumType spectrums = fft->fft(samples.data());
                    {
                        double E = 0;
                        for (const auto& spectrum: spectrums) {
                            E += spectrum.real() * spectrum.real() + spectrum.imag() * spectrum.imag();
                        }
                        E /= duration_count;
                        minE = min(minE, E);
                        maxE = max(maxE, E);
                        init.push_back(E);
                    }
                }
    //            for (auto i = 0; i < clip_count; i++) {
    //                init[i] = init[i] / maxE;
    //            }
            }

    //        auto vis = std::make_shared<PrimitiveObject>();
    //        vis->resize(init.size());
    //        auto &index = vis->add_attr<float>("index");
    //        auto &listE = vis->add_attr<float>("E");
    //        for (auto i = 0; i < init.size(); i++) {
    //            index[i] = i;
    //            listE[i] = init[i];
    //        }
    //        set_output("vis", vis);

            set_output("minE", std::make_shared<NumericObject>((float)minE));
            set_output("maxE", std::make_shared<NumericObject>((float)maxE));

            auto start_time = get_input2<float>("time");
            float sampleFrequency = wave->userData().get<zeno::NumericObject>("SampleRate")->get<float>();
            int start_index = int(sampleFrequency * start_time);
            auto fft = Aquila::FftFactory::getFft(duration_count);
            std::vector<double> samples;
            samples.resize(duration_count);
            for (auto i = 0; i < duration_count; i++) {
                samples[i] = wave->attr<float>("value")[min((start_index + i), wave->size()-1)];
            }
            Aquila::SpectrumType spectrums = fft->fft(samples.data());
            double E = 0;
            for (const auto& spectrum: spectrums) {
                E += spectrum.real() * spectrum.real() + spectrum.imag() * spectrum.imag();
            }
            E /= duration_count;
            set_output("E", std::make_shared<NumericObject>((float)E));
            double uniE = (E - minE) / (maxE - minE);
            set_output("uniE", std::make_shared<NumericObject>((float)uniE));
            start_index /= duration_count;
            start_index = min(start_index, init.size() - 1);
            std::vector<double> _queue;
            for (int i = max(start_index - 43, 0); i < start_index; i++) {
                _queue.push_back((init[i] - minE) / (maxE - minE));
            }
            if (_queue.size() > 0) {
                double avg_H = 0;
                for (const double & e: _queue) {
                    avg_H += e;
                }
                avg_H /= _queue.size();
                double var_H = 0;
                for (const double & e: _queue) {
                    var_H += (e - avg_H) * (e - avg_H);
                }
                var_H /= _queue.size();
                double std_H = sqrt(var_H);
    //            zeno::log_info("E: {}, avg_H: {}, std_H: {}, var_H: {}", uniE, avg_H, std_H, var_H);
                float threshold = get_input2<float>("threshold");
                int beat = uniE > avg_H + std_H * threshold;
                set_output("beat", std::make_shared<NumericObject>(beat));
            }
            else {
                set_output("beat", std::make_shared<NumericObject>(0));
            }
        }
    };
    ZENDEFNODE(AudioEnergy, {
        {
            "wave",
            {"float", "time", "0"},
            {"float", "threshold", "1"},
        },
        {
            "beat",
            "E",
            "uniE",
            "minE",
            "maxE",
//            "vis",
        },
        {},
        {
            "audio"
        },
    });

    struct AudioFFT : zeno::INode {
        virtual void apply() override {
            auto wave = get_input<PrimitiveObject>("wave");
            int duration_count = 1024;
            auto start_time = get_input2<float>("time");
            float sampleFrequency = wave->userData().get<zeno::NumericObject>("SampleRate")->get<float>();
            int start_index = int(sampleFrequency * start_time);
            std::vector<double> samples;
            samples.resize(duration_count+1);
            for (auto i = 0; i < duration_count+1; i++) {
                samples[i] = wave->attr<float>("value")[min((start_index + i), wave->size()-1)];
            }
            auto pre_emphasis = get_input2<int>("preEmphasis");
            if (pre_emphasis) {
                auto alpha = get_input2<float>("preEmphasisAlpha");
                for (auto i = 0; i < duration_count; i++) {
                    samples[i] = samples[i+1] - alpha * samples[i];
                }
            }
            samples.pop_back();
            auto hamming_window = get_input2<int>("hammingWindow");
            if (hamming_window) {
                for (auto i = 0; i < duration_count; i++) {
                    double i_value = 0.54 - 0.46 * std::cos(2.0 * M_PI * i / (duration_count - 1));
                    samples[i] = samples[i] * i_value;
                }
            }

            auto fft = Aquila::FftFactory::getFft(duration_count);
            Aquila::SpectrumType spectrums = fft->fft(samples.data());

            auto fft_prim = std::make_shared<PrimitiveObject>();
            fft_prim->resize(duration_count / 2 + 1);
            auto &freq = fft_prim->add_attr<float>("freq");
            auto &real = fft_prim->add_attr<float>("real");
            auto &image = fft_prim->add_attr<float>("image");
            auto &square = fft_prim->add_attr<float>("square");
            auto &power = fft_prim->add_attr<float>("power");
            for (std::size_t i = 0; i < fft_prim->verts.size(); ++i) {
                float r = spectrums[i].real();
                float im = spectrums[i].imag();
                freq[i] = float(i);
                real[i] = r;
                image[i] = im;
                float square_v = r * r + im * im;
                square[i] = square_v;
                power[i] = square_v / duration_count;
            }
            set_output("FFTPrim", fft_prim);
        }
    };
    ZENDEFNODE(AudioFFT, {
        {
            "wave",
            {"float", "time", "0"},
            {"bool", "preEmphasis", "0"},
            {"float", "preEmphasisAlpha", "0.97"},
            {"bool", "hammingWindow", "1"},
        },
        {
            "FFTPrim",
        },
        {},
        {
            "audio"
        },
    });
    struct MelFilter : zeno::INode {
        virtual void apply() override {
            auto fftPrim = get_input<PrimitiveObject>("FFTPrim");
            auto &power = fftPrim->attr<float>("power");
            auto sampleFreq = get_input2<float>("sampleFreq");
            auto rangePerFilter = get_input2<float>("rangePerFilter");
            float halfFreq = sampleFreq / 2;
            auto count = get_input2<int>("count");
            std::vector<float> hz_points;
            float mel_fh = 2595.0 * log10(1+halfFreq/700.0);
            for (int i = 0; i <= count + 1; i++) {
                float mel = mel_fh * i / (count + 1);
                float hz = 700.0 * (pow(10.0, mel / 2595.0) - 1);
                hz_points.push_back(hz);
            }
            std::vector<int> bin;
            for (const auto& hz: hz_points) {
                int index = (1024.0+1.0) * hz / sampleFreq;
                bin.push_back(index);
            }
            auto fbank = std::make_shared<PrimitiveObject>();
            fbank->resize(count);
            auto& fbank_v = fbank->add_attr<float>("fbank");
            for (auto i = 1; i <= count; i++) {
                int s = bin[i-1];
                int m = bin[i];
                int e = bin[i+1];
                s = (int) zaudio::lerp(m, s, rangePerFilter);
                e = (int) zaudio::lerp(m, e, rangePerFilter);
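                float total = 0;
                // Rising slope of the triangular filter: weight 0 at s, approaching 1 at m.
                for (auto b = s; b < m; b++) {
                    float cof = (float)(b - s) / (float)(m - s);
                    total += power[b] * cof;
                }
                // Falling slope: weight 1 at m, approaching 0 at e.
                for (auto b = m; b < e; b++) {
                    float cof = 1 - (float)(b - m) / (float)(e - m);
                    total += power[b] * cof;
                }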
                if (total == 0) {
                    fbank_v[i-1] = std::numeric_limits<float>::min();
                }
                else {
                    fbank_v[i-1] = log(total);
                }
            }
            auto indexType = get_input2<std::string>("indexType");
            if (indexType == "index") {
                auto& index = fbank->add_attr<float>("i");
                for (auto i = 1; i <= count; i++) {
                    index[i-1] = (float)(i-1);
                };
            } else if (indexType == "indexdivcount") {
                auto& index = fbank->add_attr<float>("i");
                for (auto i = 1; i <= count; i++) {
                    index[i-1] = (float)(i-1) /count;
                };
            }
            set_output("FilterBank", fbank);
        }
    };
    ZENDEFNODE(MelFilter, {
        {
            "FFTPrim",
            {"int", "count", "15"},
            {"float", "sampleFreq", "44100"},
            {"float", "rangePerFilter", "1"},
            {"enum none index indexdivcount", "indexType", "index"},
        },
        {
            "FilterBank",
        },
        {},
        {
            "audio",
        },
    });
} // namespace zeno

References

BEAT DETECTION ALGORITHMS.doc (parallelcube.com): https://www.parallelcube.com/web/wp-content/uploads/2018/03/BeatDetectionAlgorithms.pdf

TODO:

"A Review on Audio Event Detection," H. Su, et al., IEEE Access, vol. 8, pp. 77580-77593, 2020.
"Acoustic Event Detection with SEDNN: A Deep Learning Approach," P. Jaiswal and Y. Han, 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 316-320.
"Event Detection Using Multitask Learning of Auditory Features and Sound Event Classifiers," D. D. Lee, et al., IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1190-1201, 2017.
"Audio Event Detection Using Deep Learning with Mel-Frequency Cepstral Coefficients," S. Gupta, et al., 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, 2019, pp. 186-191.
"Deep Convolutional Neural Networks for Acoustic Event Detection in Domestic Environments," M. L. Seltzer, et al., IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 111-125, 2016.
"Environmental Sound Classification with Convolutional Neural Networks," J. Salamon, et al., IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 2015, pp. 732-736.

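The papers above describe commonly used audio detection algorithms, including deep learning methods and traditional feature-based methods. These algorithms can be used to recognize various events in audio, such as speech, sneezes, and car horns. A suitable algorithm can be chosen according to the actual requirements.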