webassembly003 TTS BARK.CPP WordPiece

FakeOccupational

Last modified: 2024-02-19 20:04:30

First published: 2024-02-02 11:30:00

Article link: https://blog.csdn.net/ResumeProject/article/details/135960405

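WordPiece is a subword tokenization algorithm that is widely used in natural language processing (NLP) tasks. bark.cpp implements it as shown below, encoding greedily: at each position i it picks the longest matching substring (the substring between i and j) and stores the corresponding token in the tokens array. Pieces after the first one in a word are looked up with the "##" prefix.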

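    // apply wordpiece
    for (const auto &word : words) {
        if (word.size() == 0)
            continue;

        std::string prefix = "";  // the first piece has no prefix; later pieces use "##"
        int i = 0;
        int n = word.size();

        loop:
            while (i < n) {
                if (t >= n_max_tokens - 1)  // stop when the token limit is reached
                    break;
                int j = n;
                // greedy search: try the longest substring word[i, j) first
                while (j > i) {
                    auto it = token_map->find(prefix + word.substr(i, j - i));
                    if (it != token_map->end()) {
                        tokens[t++] = it->second;
                        i = j;
                        prefix = "##";
                        goto loop;  // matched: restart the outer loop at the new position
                    }
                    --j;
                }
                // nothing starting at i is in the token map: report it and skip one byte
                if (j == i) {
                    fprintf(stderr, "%s: unknown token '%s'\n", __func__, word.substr(i, 1).data());
                    prefix = "##";
                    ++i;
                }
            }
        }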

Converted into a standalone, runnable program:

#include <cstdio>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::string> words = {"用法解析"};
    const int n_max_tokens = 10;  // maximum number of tokens; adjust as needed
    std::unordered_map<std::string, int> tokenmap = {
        {"用法", 1},
        {"##解析", 2},
        {"##析", 3}};  // a toy token map, assumed for this example

    std::unordered_map<std::string, int> *token_map = &tokenmap;

    std::vector<int> tokens(n_max_tokens, 0);  // array that stores the tokens
    int t = 0;  // index of the next free slot in tokens

    // Apply the WordPiece tokenization algorithm
    for (const auto &word : words) {
        if (word.size() == 0)
            continue;

        std::string prefix = "";
        int i = 0;
        int n = word.size();

    loop:  // loop is a label; it is the target of the goto statement below
        while (i < n) {
            if (t >= n_max_tokens - 1)
                break;

            int j = n;
            while (j > i) {
                auto it = token_map->find(prefix + word.substr(i, j - i));
                if (it != token_map->end()) {
                    tokens[t++] = it->second;
                    i = j;
                    prefix = "##";
                    goto loop;
                }
                --j;
            }

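            // Handle an unknown single character; adapt this to your needs.
            // Note that i counts bytes, so UTF-8 text is skipped one byte at a time.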
            if (j == i) {
                fprintf(stderr, "%s: unknown token '%s'\n", __func__, word.substr(i, 1).c_str());
                prefix = "##";
                ++i;
            }
        }
    }

    // Print the resulting tokens; the expected output is: Tokens: 1 2
    std::cout << "Tokens: ";
    for (int i = 0; i < t; ++i) {
        std::cout << tokens[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
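
The same greedy longest-match loop can also be written without `goto`. The block below is a minimal sketch, not code from bark.cpp: the function name `wordpiece_encode` and its signature are made up for illustration, and it leaves out the `n_max_tokens` limit and the error message of the original.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of bark.cpp): greedy longest-match-first
// WordPiece encoding of a single word, written without goto.
// Bytes that cannot be matched are skipped, as in the loop above.
std::vector<int> wordpiece_encode(const std::string &word,
                                  const std::unordered_map<std::string, int> &vocab) {
    std::vector<int> ids;
    std::size_t i = 0;
    const std::size_t n = word.size();
    std::string prefix;  // empty for the first piece, "##" afterwards

    while (i < n) {
        bool matched = false;
        // Try the longest candidate word[i, j) first, then shrink it.
        for (std::size_t j = n; j > i; --j) {
            auto it = vocab.find(prefix + word.substr(i, j - i));
            if (it != vocab.end()) {
                ids.push_back(it->second);
                i = j;  // continue right after the matched piece
                matched = true;
                break;
            }
        }
        if (!matched) {
            ++i;  // no token map entry starts here: skip one byte
        }
        prefix = "##";  // every piece after the first attempt is a continuation
    }
    return ids;
}
```

With the toy token map from the program above, `wordpiece_encode("用法解析", tokenmap)` yields the IDs 1 and 2. Using a `matched` flag instead of `goto` keeps the control flow local, at the cost of one extra variable.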

CG

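WordPiece was proposed in the paper "Japanese and Korean Voice Search".
WordPiece and BPE work well, so can they be used everywhere? In fact they are not a great fit for Chinese. First, Chinese text is written continuously; it is not separated by spaces the way English and other European languages are. Second, a single Chinese character is already the smallest unit and cannot be split any further. Chinese is therefore usually handled in one of two ways: word segmentation or character segmentation. In theory word segmentation is the better choice, because its units carry more precise meaning. Character segmentation is simpler and faster, and its vocabulary is small: there are only about 3,000 commonly used characters.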
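
Character segmentation as described above can be sketched in a few lines. The helper below is an illustration only; the name `split_utf8_chars` is hypothetical, and the code assumes the input is valid UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into one std::string per
// character (code point). Assumes valid UTF-8 input; nothing is validated.
std::vector<std::string> split_utf8_chars(const std::string &text) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;                      // 0xxxxxxx: ASCII
        if ((lead & 0xE0) == 0xC0) len = 2;       // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx (most CJK characters)
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        chars.push_back(text.substr(i, len));     // substr clamps at the end of text
        i += len;
    }
    return chars;
}
```

The lead byte determines how many bytes belong to a character, so a string such as "用法" comes back as two three-byte entries. This differs from the WordPiece loop earlier in the article, which indexes the word byte by byte.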
A C++ implementation of WordPiece for bert-chinese (C++ 96.3%, CMake 3.7%); see main.cpp for usage: https://github.com/a2409895438/wordpiece
git clone https://github.com/a2409895438/wordpiece.git
cd wordpiece
cmake -B ./build  # the -B option specifies the path of the build directory
cmake --build ./build
HuggingFace Transformers WordPiece Tokenizer in C++ (C++ 100.0%): https://github.com/Sorrow321/huggingface_tokenizer_cpp
Split tokens into word pieces (Rust 100.0%): https://github.com/danieldk/wordpieces
https://huggingface.co/learn/nlp-course/chapter6/6
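Word segmentation is the foundation of Chinese NLP. Languages written in the Latin script do not need it: their words are separated by spaces, so splitting on whitespace is enough to separate them.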
https://www.bilibili.com/video/BV1iz4y1X7q8?p=9
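
For space-delimited languages, the whitespace split mentioned in the notes above is what produces the `words` list that a WordPiece loop consumes. A minimal sketch, assuming plain whitespace splitting with no punctuation handling (the name `split_on_whitespace` is hypothetical):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split text into words on runs of whitespace.
// Punctuation stays attached to the neighbouring word.
std::vector<std::string> split_on_whitespace(const std::string &text) {
    std::vector<std::string> words;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word) {  // operator>> skips leading whitespace
        words.push_back(word);
    }
    return words;
}
```

Applied to Chinese text without spaces, this returns the whole sentence as a single word, which is why Chinese needs a separate segmentation step.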