Pseudo-English Typing Practice with Machine Learning

Articles in this series:

1. Introduction
2. Pseudo-English (You are here)
3. Keyboard Input (Coming soon)
4. Web Workers (Coming soon)

The finished project is located here: https://www.bayanbennett.com/projects/rnn-typing-practice

Objective

Generate English-looking words using a recurrent neural network.

Trivial Methods

Before settling on ML, I first had to convince myself that the trivial methods did not produce adequate results.

Random Letters

const getRandom = (distribution) => {
  const randomIndex = Math.floor(Math.random() * distribution.length);
  return distribution[randomIndex];
};

const alphabet = "abcdefghijklmnopqrstuvwxyz";
const randomLetter = getRandom(alphabet);

Unsurprisingly, the output bears no resemblance to English words. The generated character sequences were painful to type. Here are a few examples of five-letter words:

snyam   iqunm   nbspl   onrmx   wjavb   nmlgj
arkpt ppqjn zgwce nhnxl rwpud uqhuq
yjwpt vlxaw uxibk rfkqa hepxb uvxaw

Weighted Random Letters

What if we generated sequences that had the same distribution of letters as English? I obtained the letter frequencies from Wikipedia and created a JSON file that mapped each letter of the alphabet to its relative frequency.

// letter-frequencies.json
{
"a": 0.08497, "b": 0.01492, "c": 0.02202, "d": 0.04253,
"e": 0.11162, "f": 0.02228, "g": 0.02015, "h": 0.06094,
"i": 0.07546, "j": 0.00153, "k": 0.01292, "l": 0.04025,
"m": 0.02406, "n": 0.06749, "o": 0.07507, "p": 0.01929,
"q": 0.00095, "r": 0.07587, "s": 0.06327, "t": 0.09356,
"u": 0.02758, "v": 0.00978, "w": 0.02560, "x": 0.00150,
"y": 0.01994, "z": 0.00077
}

The idea here is to create a long sequence of letters whose distribution closely matches the frequencies above. Math.random has a uniform distribution, so when we select random letters from that sequence, the probability of picking a given letter matches its frequency.

const TARGET_DISTRIBUTION_LENGTH = 1e4; // 10,000
const letterFrequencyMap = require("./letter-frequencies.json");
const letterFrequencyEntries = Object.entries(letterFrequencyMap);

const reduceLetterDistribution = (result, [letter, frequency]) => {
  const num = Math.round(TARGET_DISTRIBUTION_LENGTH * frequency);
  const letters = letter.repeat(num);
  return result.concat(letters);
};

const letterDistribution = letterFrequencyEntries.reduce(reduceLetterDistribution, "");
const randomLetter = getRandom(letterDistribution);

The increase in the number of vowels was noticeable, but the generated sequences still fail to resemble English words. Here are a few examples of five-letter words:

aoitv   aertc   cereb   dettt   rtrsl   ararm
oftoi rurtd ehwra rnfdr rdden kidda
nieri eeond cntoe rirtp srnye enshk

Markov Chains

This would have been the next logical step: build a table of probabilities for letter pairs, so that each letter is chosen based on the one that precedes it. This was the point at which I decided to go straight to RNNs. If anyone would like to implement this approach, I’d be interested in seeing the results.

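For context, here is a minimal sketch of what a first-order (letter-pair) Markov chain could look like. It is not the approach used in this project, and the wordList input is a hypothetical array of lowercase training words; letter frequencies are encoded implicitly by pushing each observed successor into an array and sampling uniformly from it with the getRandom helper from above.

// A rough first-order Markov chain sketch (not used in this project).
// `wordList` is a hypothetical array of lowercase words.
const buildTransitions = (wordList) => {
  const transitions = {};
  for (const word of wordList) {
    const padded = `^${word}`; // "^" marks the start of a word
    for (let i = 0; i < padded.length - 1; i++) {
      const current = padded[i];
      const next = padded[i + 1];
      transitions[current] = transitions[current] || [];
      transitions[current].push(next);
    }
  }
  return transitions;
};

const sampleMarkovWord = (transitions, length) => {
  let current = "^";
  let word = "";
  for (let i = 0; i < length; i++) {
    const candidates = transitions[current];
    if (!candidates || candidates.length === 0) break;
    // Frequent successors appear more often in the array, so uniform
    // sampling with getRandom reproduces their probabilities.
    current = getRandom(candidates);
    word += current;
  }
  return word;
};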

Recurrent Neural Networks

Neural networks are usually memoryless: the system has no information about previous steps. RNNs are a type of neural network in which the previous state of the network is an input to the current step.

  • Input: A character

  • Output: A tensor with the probabilities for the next character.

NNs are inherently bad at processing inputs of varying length; there are ways around this (such as positional encoding in transformers). With RNNs, the inputs are consistent in size: a single character. Natural language processing has a natural affinity for RNNs, as languages are unidirectional (LTR or RTL) and the order of the characters is important. In other words, although the words united and untied differ only by two swapped characters, they have opposite meanings.

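As a concrete illustration of that input/output contract (the character set and integer encoding are defined later in this post, and the probability values below are made up):

// The character set used later in this post has 28 entries:
// "\0", " ", and "a" through "z", so each character maps to an integer 0-27.
//
//   input:  one character, encoded as an integer       e.g. "c" -> 4
//   output: a length-28 tensor of probabilities, one per possible next character
//
// After training, feeding "q" should, for example, put most of the
// probability mass on "u" (illustrative values only):
//   p("u") ~ 0.9, p("a") ~ 0.02, ..., p("z") ~ 0.0001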

The model below is based on the TensorFlow “Text generation with an RNN” tutorial.

Input Layer with Embedding

This was the first time I encountered the concept of an embedding layer. It was a fascinating concept and I was excited to start using it.

I wrote a short post summarizing embeddings here: https://bayanbennett.com/posts/embeddings-in-machine-learning

const generateEmbeddingLayer = (batchSize, outputDim) =>
  tf.layers.embedding({
    inputDim: vocabSize,
    outputDim,
    maskZero: true,
    batchInputShape: [batchSize, null],
  });

Gated Recurrent Unit (GRU)

I don’t have enough knowledge to justify why a GRU was chosen, so I deferred to the implementation in the aforementioned TensorFlow tutorial.

const generateRnnLayer = (units) =>
  tf.layers.gru({
    units,
    returnSequences: true,
    recurrentInitializer: "glorotUniform",
    activation: "softmax",
  });

Putting it all together

Since we are sequentially feeding the output of one layer into the input of the next, tf.Sequential is the model class to use.

const generateModel = (embeddingDim, rnnUnits, batchSize) => {
  const layers = [
    generateEmbeddingLayer(batchSize, embeddingDim),
    generateRnnLayer(rnnUnits),
  ];
  return tf.sequential({ layers });
};

Training Data

I used Princeton’s WordNet 3.1 data set as a source for words.

“WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets)…” — Princeton University “About WordNet.” WordNet. Princeton University. 2010.

Since I was only interested in the words themselves, I parsed each file and extracted only the words. Words containing spaces were split into separate words. Words matching any of the following criteria were also removed (a rough sketch of the filtering follows the list):

  • Words with diacritics
  • Single character words
  • Words with numbers
  • Roman numerals
  • Duplicate words
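
A sketch of that filtering, assuming rawWords is a hypothetical array of words already extracted from the WordNet files and split on spaces (the regular expressions here are illustrative, not the exact ones used):

// Hypothetical filtering sketch; not the exact implementation.
const ROMAN_NUMERAL_RE = /^[ivxlcdm]+$/; // crude: would also drop real words like "mix"

const filterWords = (rawWords) => {
  const seen = new Set(); // a Set removes duplicate words
  for (const word of rawWords) {
    const lower = word.toLowerCase();
    if (lower.length < 2) continue;             // single-character words
    if (/[0-9]/.test(lower)) continue;          // words with numbers
    if (/[^a-z]/.test(lower)) continue;         // diacritics and other non a-z characters
    if (ROMAN_NUMERAL_RE.test(lower)) continue; // roman numerals
    seen.add(lower);
  }
  return [...seen];
};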

Dataset Generator

Both tf.LayersModel and tf.Sequential have the .fitDataset method, which is a convenient way of fitting a dataset. We need to create a tf.data.Dataset, but first, here are some helper functions:

// utils.js
const characters = Array.from("\0 abcdefghijklmnopqrstuvwxyz");
const mapCharToInt = Object.fromEntries(
  characters.map((char, index) => [char, index])
);
const vocabSize = characters.length;
const int2Char = (int) => characters[int];
const char2Int = (char) => mapCharToInt[char];

// dataset.js
const wordsJson = require("./wordnet-3.1/word-set.json");
const wordsArray = Array.from(wordsJson);

// add 1 to max length to accommodate a single space that follows each word
const maxLength = wordsArray.reduce((max, s) => Math.max(max, s.length), 0) + 1;

const data = wordsArray.map((word) => {
  const paddedWordInt = word
    .concat(" ")
    .padEnd(maxLength, "\0")
    .split("")
    .map(char2Int);
  return { input: paddedWordInt, expected: paddedWordInt.slice(1).concat(0) };
});

function* dataGenerator() {
  for (let { input, expected } of data) {
    /* If I try to make the tensors inside `wordsArray.map`,
     * I get an error on the second epoch of training */
    yield { xs: tf.tensor1d(input), ys: tf.tensor1d(expected) };
  }
}

module.exports.dataset = tf.data.generator(dataGenerator);

Note that we need all the inputs to be the same length, so we pad every word with null characters, which the char2Int function converts to the integer 0.

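As a concrete example of one training pair, here is what a single word becomes, assuming a hypothetical maxLength of 8 and the char2Int mapping above:

// A concrete (hypothetical) example with maxLength = 8, using the
// char2Int mapping above ("\0" -> 0, " " -> 1, "a" -> 2, ..., "z" -> 27):
//
//   word:     "cat"
//   padded:   "cat " followed by four "\0" characters
//   input:    [4, 2, 21, 1, 0, 0, 0, 0]
//   expected: [2, 21, 1, 0, 0, 0, 0, 0]   // input shifted left by one, with 0 appended
//
// At each position the model is trained to predict the character that
// follows it, with the trailing space marking the end of the word.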

Generating and compiling the model

Here it is, the moment we’ve been building towards:

const BATCH_SIZE = 500;

const batchedData = dataset.shuffle(10 * BATCH_SIZE).batch(BATCH_SIZE, false);
const model = generateModel(vocabSize, vocabSize, BATCH_SIZE);
const optimizer = tf.train.rmsprop(1e-2);

model.compile({
  optimizer,
  loss: "sparseCategoricalCrossentropy",
  metrics: tf.metrics.sparseCategoricalAccuracy,
});

model.fitDataset(batchedData, { epochs: 100 });

A batch size of 500 was selected because that was roughly the largest batch I could fit without running out of memory.

Examples

ineco uno kam whya qunaben qunobin
xexaela sadinon zaninab mecoomasph
anonyus lyatra fema inimo unenones

It’s not perfect, but it produces words that vaguely appear to come from another Romance or Germanic language. The combined size of the model.json and weights.bin files is only 44 kB. This is important, since simpler models generally run inference faster and are light enough for the end user to download without affecting perceived page performance.

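The sampling loop that produces pseudo-words like these isn’t shown above; a minimal sketch might look like the following. It assumes the trained weights have been reloaded into a model built with a batch size of 1 (so a single sequence can be fed in) and that the char2Int, int2Char, and vocabSize helpers from utils.js are in scope; the function name and defaults are illustrative.

const tf = require("@tensorflow/tfjs-node");

// Sketch of sampling one pseudo-word from the trained model.
// Assumes `model` was rebuilt with batchSize = 1 and the trained weights loaded.
const generateWord = (model, maxWordLength = 10) => {
  const ints = [char2Int(" ")]; // a leading space signals "start of a new word"
  let word = "";
  for (let i = 0; i < maxWordLength; i++) {
    const input = tf.tensor2d([ints]);   // shape [1, ints.length]
    const output = model.predict(input); // shape [1, ints.length, vocabSize]
    // Probabilities for the character following the last one fed in.
    const probabilities = output
      .squeeze([0])
      .slice([ints.length - 1, 0], [1, vocabSize])
      .squeeze();
    // Sample from the distribution; the last argument marks it as already normalized.
    const nextInt = tf.multinomial(probabilities, 1, undefined, true).dataSync()[0];
    const char = int2Char(nextInt);
    if (char === " " || char === "\0") break; // end-of-word markers
    word += char;
    ints.push(nextInt);
  }
  return word;
};

Calling generateWord repeatedly (and discarding empty results) yields pseudo-words like the examples above.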

The next step is where the fun begins: building a typing practice web app!

Originally from: https://levelup.gitconnected.com/pseudo-english-typing-practice-with-machine-learning-5700eb4dc54
