Andrew Ng Deep Learning study notes: C5W2, Natural Language Processing and Word Embeddings, Assignment 1: Operations on word vectors

This post outlines the main content and approach of the assignment. The complete assignment files are available at:

https://github.com/pandenghuang/Andrew-Ng-Deep-Learning-notes/tree/master/assignments/C5W2

For full screenshots of the assignment, see the end of this post: Full assignment screenshots.

Operations on word vectors

Welcome to your first assignment of this week!

Because word embeddings are very computationally expensive to train, most ML practitioners will load a pre-trained set of embeddings.

After this assignment you will be able to:

  • Load pre-trained word vectors, and measure similarity using cosine similarity
  • Use word embeddings to solve word analogy problems such as Man is to Woman as King is to __.
  • Modify word embeddings to reduce their gender bias

Let's get started! Run the following cell to load the packages you will need.

...

1 - Cosine similarity

To measure how similar two words are, we need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors 𝑢 and 𝑣, cosine similarity is defined as follows:

CosineSimilarity(𝑢, 𝑣) = (𝑢 · 𝑣) / (||𝑢||₂ ||𝑣||₂) = cos(𝜃)

where 𝑢 · 𝑣 is the dot product (or inner product) of the two vectors, ||𝑢||₂ is the L2 norm (or length) of the vector 𝑢, and 𝜃 is the angle between 𝑢 and 𝑣. This similarity depends on the angle between 𝑢 and 𝑣. If 𝑢 and 𝑣 are very similar, their cosine similarity will be close to 1; if they are dissimilar, the cosine similarity will take a smaller value.

Figure 1: The cosine of the angle between two vectors is a measure of how similar they are
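The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's graded solution; the function name cosine_similarity and the toy vectors are assumptions made for the example, and the function assumes neither input is the zero vector.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return cos(theta) for two 1-D vectors u and v (neither may be all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    dot = np.dot(u, v)            # u . v
    norm_u = np.linalg.norm(u)    # ||u||_2
    norm_v = np.linalg.norm(v)    # ||v||_2
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration (not real GloVe embeddings).
a = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity(a, 2 * a)   # parallel vectors
opposite = cosine_similarity(a, -a)            # anti-parallel vectors
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Because the result depends only on the angle, rescaling either vector leaves the similarity unchanged; the value always lies between -1 and 1.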

...

2 - Word analogy task

In the word analogy task, we complete the sentence "a is to b as c is to ____". An example is "man is to woman as king is to queen". In detail, we are trying to find a word d such that the associated word vectors 𝑒_a, 𝑒_b, 𝑒_c, 𝑒_d are related in the following manner: 𝑒_b − 𝑒_a ≈ 𝑒_d − 𝑒_c. We will measure the similarity between 𝑒_b − 𝑒_a and 𝑒_d − 𝑒_c using cosine similarity.
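One way to read the search described above is as a loop over the vocabulary that keeps the word whose difference vector is most similar to 𝑒_b − 𝑒_a. The sketch below assumes the embeddings are held in a plain dict from word to NumPy array; the names complete_analogy and toy, and all vector values, are hypothetical. Skipping the three input words is a design choice made here so that the search does not simply return one of its inputs.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Find d maximizing cosine_similarity(e_b - e_a, e_d - e_c)."""
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    best_word, best_sim = None, -np.inf
    for w, e_w in word_to_vec_map.items():
        if w in (word_a, word_b, word_c):
            continue  # do not return one of the input words
        sim = cosine_similarity(e_b - e_a, e_w - e_c)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# Toy 3-D "embeddings", made up for illustration only.
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, -1.0, 0.5]),
}
result = complete_analogy("man", "woman", "king", toy)
```

With real pre-trained vectors the loop visits the whole vocabulary, so the search is linear in vocabulary size for each query.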

...

Congratulations!

You've come to the end of this assignment. Here are the main points you should remember:

  • Cosine similarity is a good way to compare the similarity between pairs of word vectors. (Though L2 distance works too.)
  • For NLP applications, using a pre-trained set of word vectors from the internet is often a good way to get started.

Even though you have finished the graded portions, we recommend you also take a look at the rest of this notebook.

Congratulations on finishing the graded portions of this notebook!

...

3 - Debiasing word vectors (OPTIONAL/UNGRADED)

In the following exercise, you will examine gender biases that can be reflected in a word embedding, and explore algorithms for reducing the bias. In addition to learning about the topic of debiasing, this exercise will also help hone your intuition about what word vectors are doing. This section involves a bit of linear algebra, though you can probably complete it even without being expert in linear algebra, and we encourage you to give it a shot. This portion of the notebook is optional and is not graded.

Let's first see how the GloVe word embeddings relate to gender. You will first compute a vector 𝑔 = 𝑒_woman − 𝑒_man, where 𝑒_woman represents the word vector corresponding to the word woman, and 𝑒_man represents the word vector corresponding to the word man. The resulting vector 𝑔 roughly encodes the concept of "gender". (You might get a more accurate representation if you compute 𝑔1 = 𝑒_mother − 𝑒_father, 𝑔2 = 𝑒_girl − 𝑒_boy, etc. and average over them. But just using 𝑒_woman − 𝑒_man will give good enough results for now.)
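As a rough sketch of the two options mentioned above (a single difference vector versus an average over several pairs), the snippet below uses made-up 3-D vectors in place of real GloVe embeddings. The dict toy, the pair list, and the "receptionist" values are all assumptions for illustration only.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors, made up for illustration (not real GloVe embeddings).
toy = {
    "man":    np.array([-1.0, 0.5, 0.0]),
    "woman":  np.array([1.0, 0.5, 0.0]),
    "father": np.array([-1.0, 0.0, 0.5]),
    "mother": np.array([0.8, 0.1, 0.5]),
    "boy":    np.array([-1.0, 0.2, 0.2]),
    "girl":   np.array([1.2, 0.1, 0.2]),
    "receptionist": np.array([0.6, 0.3, 0.4]),
}

# Single-pair estimate of the gender direction: g = e_woman - e_man.
g = toy["woman"] - toy["man"]

# Averaged estimate over several (female, male) pairs.
pairs = [("woman", "man"), ("mother", "father"), ("girl", "boy")]
g_avg = np.mean([toy[f] - toy[m] for f, m in pairs], axis=0)

# A positive similarity means the word leans toward the "woman" end of g.
receptionist_score = cosine_similarity(toy["receptionist"], g)
```

Averaging tends to cancel out pair-specific noise in the difference vectors, which is why it may give a cleaner direction than any single pair.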

...

3.1 - Neutralize bias for non-gender specific words

The figure below should help you visualize what neutralizing does. If you're using a 50-dimensional word embedding, the 50-dimensional space can be split into two parts: the bias direction 𝑔, and the remaining 49 dimensions, which we'll call 𝑔⊥. In linear algebra, we say that the 49-dimensional 𝑔⊥ is perpendicular (or "orthogonal") to 𝑔, meaning it is at 90 degrees to 𝑔. The neutralization step takes a vector such as 𝑒_receptionist and zeros out the component in the direction of 𝑔, giving us 𝑒_receptionist^debiased.

Even though 𝑔⊥ is 49-dimensional, given the limitations of what we can draw on a screen, we illustrate it using a 1-dimensional axis below.
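Zeroing out the component along 𝑔 amounts to subtracting the projection of the word vector onto 𝑔. The sketch below shows that projection step on toy 3-D vectors; the function name neutralize and the vector values are assumptions for illustration, and 𝑔 is assumed to be non-zero.

```python
import numpy as np

def neutralize(e, g):
    """Remove from e its component along the bias direction g."""
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g   # projection of e onto g
    return e - e_bias                            # what remains lies in g-perp

# Toy vectors, made up for illustration (not real GloVe embeddings).
g = np.array([2.0, 0.0, 0.0])
e_receptionist = np.array([0.6, 0.3, 0.4])
e_debiased = neutralize(e_receptionist, g)

# After neutralizing, the vector has no component along g.
residual = np.dot(e_debiased, g)
```

Dividing by the squared norm of 𝑔 means the function works whether or not 𝑔 has been normalized to unit length.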

...

3.2 - Equalization algorithm for gender-specific words

Next, let's see how debiasing can also be applied to word pairs such as "actress" and "actor." Equalization is applied to pairs of words that you might want to have differ only through the gender property. As a concrete example, suppose that "actress" is closer to "babysit" than "actor" is. By applying neutralizing to "babysit" we can reduce the gender stereotype associated with babysitting. But this still does not guarantee that "actor" and "actress" are equidistant from "babysit." The equalization algorithm takes care of this.

The key idea behind equalization is to make sure that a particular pair of words is equidistant from the 49-dimensional 𝑔⊥. The equalization step also ensures that the two equalized words are now the same distance from 𝑒_receptionist^debiased, or from any other word that has been neutralized. In pictures, this is how equalization works:
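The text here does not spell out the equalization formulas, so the sketch below is an assumption: it follows the commonly cited formulation from Bolukbasi et al. (2016), in which both words keep the shared component orthogonal to 𝑔 and receive mirrored components along 𝑔. The function name equalize and all vector values are hypothetical, and the square-root scaling is meant for embeddings of roughly unit length.

```python
import numpy as np

def equalize(e_w1, e_w2, g):
    """Equalize a word pair about the bias axis g (sketch after Bolukbasi et al.)."""
    g_sq = np.dot(g, g)
    mu = (e_w1 + e_w2) / 2.0                      # midpoint of the pair
    mu_B = (np.dot(mu, g) / g_sq) * g             # midpoint component along g
    mu_orth = mu - mu_B                           # midpoint component in g-perp
    e_w1B = (np.dot(e_w1, g) / g_sq) * g          # each word's component along g
    e_w2B = (np.dot(e_w2, g) / g_sq) * g
    scale = np.sqrt(np.abs(1.0 - np.dot(mu_orth, mu_orth)))
    e_w1B_corr = scale * (e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)
    e_w2B_corr = scale * (e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)
    # Both words share mu_orth and differ only by mirrored components along g.
    return e_w1B_corr + mu_orth, e_w2B_corr + mu_orth

# Toy vectors, made up for illustration (not real GloVe embeddings).
g = np.array([1.0, 0.0, 0.0])
actress = np.array([0.5, 0.3, 0.1])
actor = np.array([-0.2, 0.4, 0.2])
e1, e2 = equalize(actress, actor, g)

# A vector with no component along g, standing in for a neutralized "babysit".
babysit_neutral = np.array([0.0, 0.6, -0.2])
d1 = np.linalg.norm(e1 - babysit_neutral)
d2 = np.linalg.norm(e2 - babysit_neutral)
```

Since the two outputs differ only in the sign of their component along 𝑔, their distance to any vector lying entirely in 𝑔⊥ is the same, which is the equidistance property described above.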

...

Congratulations

You have come to the end of this notebook, and have seen a lot of the ways that word vectors can be used as well as modified.

Congratulations on finishing this notebook!


Full assignment screenshots:

 
