Java recognition training: incrementally training an entity-recognition classifier

Latest recommended article published 2023-04-28 02:05:49

weixin_39968852


Article tags: Java recognition training

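I'm doing some semantic-web/NLP research, and I have a set of sparse records containing a mix of numeric and non-numeric data. Each record represents an entity labeled with various features extracted from simple English sentences.

For example: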

uid|features

87w39423|speaker=432, session=43242, sentence=34, obj_called=bob,favorite_color_is=blue

4535k3l535|speaker=512, session=2384, sentence=7, obj_called=tree,isa=plant,located_on=wilson_street

23432424|speaker=997, session=8945305, sentence=32, obj_called=salty,isa=cat,eats=mice

09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana

928374923|speaker=876, session=43242, sentence=57, obj_called=it,was=delicious

294234234|speaker=876, session=43243, sentence=58, obj_called=the monkey,ate=the banana

sd09f8098|speaker=876, session=43243, sentence=59, obj_called=it,was=hungry

...
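To make the record layout concrete, here is a minimal parsing sketch. It assumes each record is a single `uid|key=value, key=value,...` line as in the sample above, that values never contain a comma or `|`, and that all values can stay strings. The name `parse_record` is hypothetical and not part of the question.

```python
def parse_record(line):
    """Split 'uid|k=v, k=v,...' into (uid, {key: value}); values stay strings."""
    uid, _, feature_text = line.partition("|")
    features = {}
    for item in feature_text.split(","):
        # Items may or may not be preceded by a space, so strip each one.
        key, sep, value = item.strip().partition("=")
        if sep:  # ignore fragments that carry no '='
            features[key] = value
    return uid.strip(), features


uid, features = parse_record(
    "09834502|speaker=876, session=43242, sentence=56, obj_called=the monkey,ate=the banana"
)
print(uid, features)
```

Keeping every value as a string treats numeric fields such as `speaker` as categorical identifiers, which seems to match how they are used in the sample.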

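A single entity may appear more than once (each time with a different UID), and its features may overlap with those of its other instances. A second dataset records which of the above UIDs are known for certain to be the same entity.

For example: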

uid|sameas

87w39423|234k2j,234l24jlsd,dsdf9887s

4535k3l535|09d8fgdg0d9,l2jk34kl,sd9f08sf

23432424|io43po5,2l3jk42,sdf90s8df

09834502|294234234,sd09f8098

...
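If the sameas relation is taken to be symmetric and transitive (an assumption; the data above only lists it in one direction), the second dataset can be collapsed into groups of UIDs that denote one entity. The sketch below does this with a small union-find; `group_same_entities` is a hypothetical helper name.

```python
def group_same_entities(sameas_lines):
    """Merge 'uid|a,b,c' lines into sets of UIDs that refer to one entity."""
    parent = {}

    def find(uid):
        parent.setdefault(uid, uid)
        while parent[uid] != uid:
            parent[uid] = parent[parent[uid]]  # path halving keeps trees shallow
            uid = parent[uid]
        return uid

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for line in sameas_lines:
        uid, _, others = line.partition("|")
        for other in others.split(","):
            if other.strip():
                union(uid.strip(), other.strip())

    groups = {}
    for uid in list(parent):
        groups.setdefault(find(uid), set()).add(uid)
    return list(groups.values())


groups = group_same_entities([
    "87w39423|234k2j,234l24jlsd,dsdf9887s",
    "09834502|294234234,sd09f8098",
])
print(groups)
```

Each resulting group is a natural source of positive training pairs, since any two UIDs in a group are asserted to be the same entity.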

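What algorithm(s) could I use to incrementally train a classifier that takes a set of features and instantly recommends the N most similar UIDs, along with the probability that each of those UIDs actually represents the same entity? Optionally, I'd also like to get suggestions for missing features to fill in, so that I can re-classify and get a more certain match.

I have looked at traditional approximate nearest-neighbor libraries such as FLANN and ANN, but they cannot be trained (in the supervised-learning sense), and they are not typically designed for sparse, non-numeric input.

As a very naive first attempt, I was thinking of using a naive Bayes classifier and converting each sameas relationship into a set of training samples. So for every entity A with B sameas relationships, I would iterate over each relationship and train the classifier like this: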
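For sparse categorical records, one common alternative to a numeric nearest-neighbor library is an inverted index over `key=value` tokens, scored with Jaccard similarity. The sketch below is only an illustration of that idea, not a trained classifier: the score is a set-overlap ratio, not a probability, and the class name `FeatureIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class FeatureIndex:
    """Inverted index from 'key=value' tokens to UIDs, with Jaccard ranking."""

    def __init__(self):
        self.tokens_by_uid = {}
        self.uids_by_token = defaultdict(set)

    @staticmethod
    def tokenize(features):
        return {f"{key}={value}" for key, value in features.items()}

    def add(self, uid, features):
        # Incremental update: cost is proportional to the record's feature count.
        tokens = self.tokenize(features)
        self.tokens_by_uid[uid] = tokens
        for token in tokens:
            self.uids_by_token[token].add(uid)

    def most_similar(self, features, n=5):
        query = self.tokenize(features)
        # Only records sharing at least one token with the query are scored.
        candidates = set()
        for token in query:
            candidates |= self.uids_by_token.get(token, set())
        scored = []
        for uid in candidates:
            other = self.tokens_by_uid[uid]
            scored.append((len(query & other) / len(query | other), uid))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:n]


index = FeatureIndex()
index.add("23432424", dict(speaker="997", session="8945305", sentence="32",
                           obj_called="salty", isa="cat", eats="mice"))
index.add("4535k3l535", dict(obj_called="tree", isa="plant"))
print(index.most_similar(dict(isa="cat", eats="mice"), n=7))
```

An index like this could serve as a cheap candidate generator, with a supervised model then deciding which candidates are really the same entity.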

# Pseudocode: Classifier, get_features and sameas_dataset are placeholders.
classifier = Classifier()
for entity, sameas_entities in sameas_dataset:
    entity_features = get_features(entity)
    for other_entity in sameas_entities:
        other_entity_features = get_features(other_entity)
        # Train in both directions, so each UID of the pair is used as a class once.
        classifier.train(cls=entity, features=['left_' + f for f in entity_features] + ['right_' + f for f in other_entity_features])
        classifier.train(cls=other_entity, features=['left_' + f for f in other_entity_features] + ['right_' + f for f in entity_features])

And then use it like this:

>>> print(classifier.findSameAs(dict(speaker=997, session=8945305, sentence=32, obj_called='salty', isa='cat', eats='mice'), n=7))
[(1.0, '23432424'), (0.999, 'io43po5'), (1.0, '2l3jk42'), (1.0, 'sdf90s8df'), (0.76, 'jerwljk'), (0.34, 'rlekwj32424'), (0.08, '09843jlk')]
>>> print(classifier.findSameAs(dict(isa='cat', eats='mice'), n=7))
[(0.09, '23432424'), (0.06, 'jerwljk'), (0.03, 'rlekwj32424'), (0.001, '09843jlk')]
>>> print(classifier.findMissingFeatures(dict(isa='cat', eats='mice'), n=4))
['obj_called', 'has_fur', 'has_claws', 'lives_at_zoo']
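One simple way to approximate the missing-feature suggestion is to count which feature keys occur among the best-matching candidate records but are absent from the query. This is a frequency heuristic chosen for illustration, not the behavior of any real `findMissingFeatures`; the function name and the sample records are hypothetical.

```python
from collections import Counter


def suggest_missing_features(query, candidate_records, n=4):
    """Rank feature keys that candidates have but the query lacks."""
    counts = Counter()
    for record in candidate_records:
        counts.update(key for key in record if key not in query)
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [key for key, _ in ranked[:n]]


candidates = [
    dict(obj_called="salty", isa="cat", eats="mice", has_fur="yes"),
    dict(obj_called="tom", isa="cat", has_claws="yes"),
]
print(suggest_missing_features(dict(isa="cat", eats="mice"), candidates, n=2))
```

A more informative ranking would weight each key by how much it separates the remaining candidates, but plain counts are enough to show the shape of the idea.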

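How viable is this approach? The initial batch training would be very slow, at least O(N^2), but support for incremental training would allow later updates to happen much faster.

What would be a better approach?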
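One possible variation, sketched here under stated assumptions rather than as a recommendation, is to classify pairs of records as same or different instead of using each UID as its own class. Each training pair then only updates a few counters, so training is incremental by construction. The sketch uses naive Bayes over per-key agreement with Laplace smoothing; `PairMatchModel` is a hypothetical name, keys present in only one record are simply ignored, and negative pairs would have to be sampled from records not linked by sameas.

```python
import math
from collections import defaultdict


class PairMatchModel:
    """Incremental naive Bayes over per-key agreement between two records."""

    def __init__(self):
        self.class_counts = {True: 0, False: 0}
        # counts[label][(key, agrees)] -> number of training pairs observed
        self.counts = {True: defaultdict(int), False: defaultdict(int)}

    @staticmethod
    def compare(left, right):
        # One observation per key present in both records: do the values agree?
        return [(k, left[k] == right[k]) for k in sorted(left.keys() & right.keys())]

    def train(self, left, right, same):
        self.class_counts[same] += 1
        for observation in self.compare(left, right):
            self.counts[same][observation] += 1

    def probability_same(self, left, right):
        total = sum(self.class_counts.values())
        log_score = {}
        for label in (True, False):
            score = math.log((self.class_counts[label] + 1) / (total + 2))
            for key, agrees in self.compare(left, right):
                hits = self.counts[label][(key, agrees)]
                seen = self.counts[label][(key, True)] + self.counts[label][(key, False)]
                score += math.log((hits + 1) / (seen + 2))  # Laplace smoothing
            log_score[label] = score
        top = max(log_score.values())  # subtract the max for numerical stability
        weight = {label: math.exp(s - top) for label, s in log_score.items()}
        return weight[True] / (weight[True] + weight[False])


monkey_a = dict(speaker="876", obj_called="the monkey", ate="the banana")
monkey_b = dict(speaker="876", obj_called="the monkey", ate="the banana")
cat = dict(speaker="997", obj_called="salty", ate="mice")

model = PairMatchModel()
model.train(monkey_a, monkey_b, same=True)
model.train(monkey_a, cat, same=False)
print(model.probability_same(monkey_a, monkey_b), model.probability_same(monkey_a, cat))
```

Combined with a candidate generator that returns only a handful of records per query, a model like this would score a few pairs per lookup rather than all N records, which is where the quadratic cost could be avoided.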