What are the continuous bag-of-words (CBOW) and skip-gram architectures?

Both architectures describe how the neural network “learns” the underlying word representations for each word. Since learning word representations is essentially unsupervised, you need some way to “create” labels to train the model. Skip-gram and CBOW are two ways of creating the “task” for the neural network – you can think of this as the output layer of the neural network, where we create “labels” for the given input (which depends on the architecture).

For both descriptions below, we assume that the current word in a sentence is $w_i$.

CBOW: The input to the model could be $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$, the words preceding and following the current word. The output of the neural network will be $w_i$. Hence you can think of the task as “predicting the word given its context”.

Note that the number of context words used depends on the window size you set.
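To make this concrete, here is a minimal sketch in plain Python of how CBOW training pairs could be generated from a tokenized sentence. The function name `cbow_pairs` and its `window` parameter are illustrative, not from any particular library:

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for CBOW training.

    For each position i, the context is up to `window` words on each
    side of tokens[i]; the model learns to predict tokens[i] from it.
    """
    for i, target in enumerate(tokens):
        context = (tokens[max(0, i - window):i]      # preceding words
                   + tokens[i + 1:i + 1 + window])   # following words
        if context:
            yield context, target

sentence = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
# e.g. ['the', 'quick', 'fox', 'jumps'] -> 'brown'
```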

Skip-gram: The input to the model is $w_i$, and the output could be $w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}$. So the task here is “predicting the context given a word”. Also, the context is not limited to the immediately adjacent words: training instances can be created by skipping a constant number of words in the context, for example $w_{i-3}, w_{i-4}, w_{i+3}, w_{i+4}$, hence the name skip-gram.

Note that the window size determines how far forward and backward to look for context words to predict.
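A matching sketch for skip-gram, under the same assumptions as the CBOW sketch above: each center word is paired with every word in its window, so a single position yields up to 2 × window training instances:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word) pairs for skip-gram training.

    Each position i produces one pair per context word, so one center
    word yields up to 2 * window training instances.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. 'brown' -> 'the', 'brown' -> 'quick', 'brown' -> 'fox', ...
```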

According to Mikolov:

Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.
CBOW: trains several times faster than skip-gram, with slightly better accuracy for frequent words.

This can get even a bit more complicated if you consider that there are two different ways to train the models: the normalized hierarchical softmax and the un-normalized negative sampling. Both work quite differently.
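As a usage sketch, both choices are exposed as flags in the gensim library (parameter names here follow gensim 4.x): `sg` selects the architecture, while `hs` and `negative` select the training method:

```python
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]

# Skip-gram (sg=1) trained with negative sampling (hs=0, negative=5).
sg_model = Word2Vec(sentences, vector_size=100, window=2,
                    min_count=1, sg=1, hs=0, negative=5)

# CBOW (sg=0) trained with hierarchical softmax (hs=1, negative=0).
cbow_model = Word2Vec(sentences, vector_size=100, window=2,
                      min_count=1, sg=0, hs=1, negative=0)

print(sg_model.wv["fox"].shape)  # (100,) - the learned word vector
```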

This makes sense: with skip-gram you can create many more training instances from a limited amount of data, whereas CBOW needs more data because it conditions on the whole context, and the number of possible contexts grows combinatorially with the window size.
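Counting the pairs produced by the two sketches above makes the difference concrete: on the same nine-word sentence with a window size of 2, skip-gram generates several times more training instances than CBOW:

```python
# Assumes cbow_pairs and skipgram_pairs from the sketches above.
sentence = "the quick brown fox jumps over the lazy dog".split()
print(len(list(cbow_pairs(sentence, window=2))))      # 9: one instance per position
print(len(list(skipgram_pairs(sentence, window=2))))  # 30: up to 2*window per position
```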
