Reading Notes: Reinforcement Learning for Relation Classification from Noisy Data
"Reinforcement Learning for Relation Classification from Noisy Data" (original paper link)
Problem
instance selection
- given a set of ⟨sentence, relation label⟩ pairs X = {(x1, r1), (x2, r2), ..., (xn, rn)}, where xi is a sentence associated with an entity pair (hi, ti) and ri is a noisy relation label produced by distant supervision
- determine which sentence truly describes the relation and should be selected as a training instance
relation classification
- given a sentence xi and the mentioned entity pair (hi,ti)
- predict the semantic relation ri in xi
Overview
Instance selector
The agent follows a policy to decide an action (whether to select the current sentence) at each state (consisting of the current sentence, the set of sentences chosen so far, and the entity pair), and receives a reward from the relation classifier at the terminal state, once all selections have been made.
We split the training sentences X = {x1, ..., xn} into N bags B = {B1, B2, ..., BN} and compute a reward only when data selection in a bag is finished. Each bag corresponds to a distinct entity pair, and each bag Bk is a sequence of sentences x1^k, x2^k, ..., x|Bk|^k sharing the same relation label rk.
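The bag construction above can be sketched as follows; the tuple layout, helper name, and example facts are illustrative stand-ins, not taken from the paper's released code.

```python
from collections import defaultdict

def split_into_bags(instances):
    """instances: list of (sentence, head, tail, relation_label) tuples."""
    bags = defaultdict(list)
    for sentence, head, tail, relation in instances:
        # all sentences sharing an entity pair go into one bag and
        # share one (possibly noisy) distant-supervision label
        bags[(head, tail)].append((sentence, relation))
    return bags

data = [
    ("Obama was born in Honolulu.",     "Obama",  "Honolulu", "born_in"),
    ("Obama visited Honolulu.",         "Obama",  "Honolulu", "born_in"),
    ("Paris is the capital of France.", "France", "Paris",    "capital"),
]
bags = split_into_bags(data)  # two bags: one per entity pair
```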
policy: select or not
state: a continuous real-valued vector F(si), concatenating
- the vector representation of the current sentence
  - taken from the nonlinear layer of the CNN
- the representation of the chosen sentence set
  - the average of the chosen sentences' vector representations
- the embeddings of the two entities in the sentence
  - looked up in a pre-trained knowledge graph embedding table (TransE)
action: select the sentence or not, given by the policy function (logistic regression)
reward:
- For the special case B̂ = ∅, we set the reward to the average likelihood of all sentences in the training data
- where B̂ is the set of selected sentences, which is a subset of B, and r is the relation label of bag B
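A minimal sketch of this terminal reward as the average log-likelihood of the bag's label over the selected sentences. The probabilities stand in for the classifier's outputs p(r | xj), and the fallback pool stands in for the "all sentences" special case; both are assumptions for illustration.

```python
import math

def bag_reward(selected_probs, all_probs):
    """selected_probs: p(r|x) for chosen sentences; all_probs: fallback pool
    used when the selected set B_hat is empty."""
    pool = selected_probs if selected_probs else all_probs
    # average log-likelihood of the bag's relation label
    return sum(math.log(p) for p in pool) / len(pool)
```

Selecting sentences the classifier scores highly pushes the reward toward 0 (its maximum), while low-probability selections drive it negative.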
Optimization:
objective function:
value function: determined by reward
update policy:
According to the policy gradient theorem and the REINFORCE algorithm
Monte Carlo based policy gradient method
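A toy REINFORCE sketch for the selector's logistic-regression policy pi(select | s) = sigmoid(w·s): sample one action per sentence, then scale every step's log-policy gradient by the shared terminal reward. The state vectors and reward value here are stand-ins, not the paper's CNN/TransE features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_update(w, states, reward, lr=0.1):
    """One REINFORCE update over a bag with a delayed terminal reward."""
    actions, grads = [], []
    for s in states:
        p = sigmoid(w @ s)            # pi(select | state)
        a = int(rng.random() < p)     # sample select (1) or skip (0)
        actions.append(a)
        grads.append((a - p) * s)     # grad of log pi(a|s) for a Bernoulli policy
    for g in grads:                   # every step shares the bag's reward
        w = w + lr * reward * g
    return w, actions

w = np.zeros(3)
w, actions = reinforce_update(w, [np.ones(3), -np.ones(3)], reward=1.0)
```

Because the gradient is estimated from sampled trajectories rather than expectations, the estimate is unbiased but noisy, which is the source of the high variance noted later.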
Relation classifier
CNN + softmax (proposed by Kim, 2014)
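A toy sketch of the classifier head: a softmax over relation scores and the cross-entropy training loss. In the actual model the scores would come from the CNN's sentence representation; here they are plain vectors for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(scores, label_idx):
    """Negative log-probability of the gold relation label."""
    return -np.log(softmax(scores)[label_idx])
```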
Model training
- the pre-training strategy is quite crucial, and is widely recommended by other reinforcement learning studies
- In order to obtain stable updates, we update Θ′ and Φ′ by linear interpolation: Θ′ ← (1 − τ)Θ′ + τΘ and Φ′ ← (1 − τ)Φ′ + τΦ, where τ ≪ 1 is a hyper-parameter
- procedure
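The linear-interpolation (soft) update can be written directly; parameter names are generic placeholders.

```python
import numpy as np

def soft_update(target, source, tau=0.01):
    """Move the target parameters a small step (tau << 1) toward the
    current parameters: target <- (1 - tau) * target + tau * source."""
    return (1.0 - tau) * target + tau * source

theta_target = soft_update(np.zeros(4), np.ones(4), tau=0.1)
# the target moves 10% of the way toward the current parameters
```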
Note
distant supervision: In distant supervision, we make use of an already existing database, such as Freebase or a domain-specific database, to collect examples for the relation we want to extract. We then use these examples to automatically generate our training data. For example, Freebase contains the fact that Barack Obama and Michelle Obama are married. We take this fact, and then label each pair of “Barack Obama” and “Michelle Obama” that appear in the same sentence as a positive example for our marriage relation. This way we can easily generate a large amount of (possibly noisy) training data.
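The labeling scheme described above can be sketched with a toy knowledge base; substring matching is a simplification of real entity linking, and this is where the noise comes from (a sentence can mention both entities without expressing the relation).

```python
# Toy knowledge base: known facts as (head, tail) -> relation
KB = {("Barack Obama", "Michelle Obama"): "marriage"}

def label_sentences(sentences):
    """Label any sentence containing both entities of a KB fact with
    that fact's relation, as distant supervision does."""
    labeled = []
    for sent in sentences:
        for (head, tail), relation in KB.items():
            if head in sent and tail in sent:
                labeled.append((sent, head, tail, relation))  # possibly noisy
    return labeled

corpus = [
    "Barack Obama and Michelle Obama attended the gala.",  # matched
    "Barack Obama praised Michelle Obama's book.",         # matched, but noisy
    "Barack Obama gave a speech.",                         # no pair match
]
```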
Monte Carlo policy gradient
high variance, slow convergence rate
Thoughts
- When training the policy function, the reward is split across multiple bags for updating, and all sentences within a bag share the same relation label r. For other tasks, how should the bags be partitioned?
- For the instance selector's parameter updates, does the order of sentences in the sequence matter? Under the current way of computing v(s) (or the reward), if the (s, a) pairs are identical in number but differ in order, the result should be unaffected.
- Is part of the relation classifier's loss function (cross entropy) missing?
- Training uses a Monte Carlo based policy gradient; what other options are there?