python - Implementing a K-neighbors classifier and linear SVM in scikit-learn

Article link: https://blog.csdn.net/weixin_30540871/article/details/113500587

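I am trying to use a linear SVM and a K-neighbors classifier for word sense disambiguation (WSD). Here is a piece of the data I am using for training: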

Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to

activate it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with .

For neurophysiologists and neuropsychologists , the way forward in understanding perception has been to correlate these dimensions of experience with , firstly , the material properties of the experienced object or event ( usually regarded as the stimulus ) and , secondly , the patterns of discharges in the sensory system . Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are

activated : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's nineteenth - century doctrine of specific energies formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated , sensations specific to those organs are experienced . It was proposed that there are endings ( or receptors ) within the nervous system which are attuned to specific types of energy , For example , retinal receptors in the eye respond to light energy , cochlear endings in the ear to vibrations in the air , and so on .

.....

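The difference between the training and test data is that the test data has no "answer" tag. I have built a dictionary that stores, for each instance, the neighbors of the "head" word within a window of size 10. When an instance has more than one "head" word, I only consider the first one. I have also built a set of every word in the training file's vocabulary so that I can compute a vector for each instance. For example, if the total vocabulary is [a, b, c, d, e] and an instance contains the words [a, a, d, d, e], the resulting vector for that instance is [2, 0, 0, 2, 1]. Here is part of the dictionary I built for each word: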

{
    "activate.v": {
        "activate.v.bnc.00024693": {
            "instanceId": "activate.v.bnc.00024693",
            "senseId": "38201",
            "vocab": {
                "although": 1,
                "back": 1,
                "bend": 1,
                "bicycl": 1,
                "correct": 1,
                "dig": 1,
                "general": 1,
                "handlebar": 1,
                "hefti": 1,
                "lever": 1,
                "nt": 2,
                "quit": 1,
                "rear": 1,
                "spade": 1,
                "sprung": 1,
                "step": 1,
                "type": 1,
                "use": 1,
                "wo": 1
            }
        },
        "activate.v.bnc.00044852": {
            "instanceId": "activate.v.bnc.00044852",
            "senseId": "38201",
            "vocab": {
                "caus": 1,
                "ear": 1,
                "energi": 1,
                "experi": 1,
                "inner": 1,
                "light": 1,
                "nervous": 1,
                "part": 1,
                "qualiti": 1,
                "reach": 1,
                "receptor": 2,
                "retin": 1,
                "sensori": 1,
                "stimul": 2,
                "system": 2,
                "upon": 2
            }
        },
        ......
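For reference, the count-vector construction described above (vocabulary [a, b, c, d, e], instance words [a, a, d, d, e]) can be sketched in plain Python. The function names count_vector and vector_from_counts are made up for this illustration and are not part of the original code; the second helper assumes a "vocab" mapping shaped like the ones in the dictionary excerpt.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocabulary]


def vector_from_counts(vocab_counts, vocabulary):
    """Same idea, starting from an already-counted {word: count} mapping."""
    return [vocab_counts.get(word, 0) for word in vocabulary]


# The example from the question.
vocabulary = ["a", "b", "c", "d", "e"]
instance_tokens = ["a", "a", "d", "d", "e"]

example_vector = count_vector(instance_tokens, vocabulary)
same_vector = vector_from_counts({"a": 2, "d": 2, "e": 1}, vocabulary)
print(example_vector)  # [2, 0, 0, 2, 1]
```

The vocabulary must be kept in a fixed order (a sorted list rather than a bare set), otherwise the columns of different instances will not line up.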

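Now I just need to provide the input for the K-neighbors classifier and the linear SVM from scikit-learn in order to train the classifiers. But I am not sure how I should build the feature vector and the label for each one. My understanding is that the label should be a tuple of the instance tag and the senseid tag from the "answer". I am not sure about the feature vector, though. Should I group together all the vectors that share the same instance tag and senseid tag in the "answer"? There are about 100 words and hundreds of instances for each word, so how should I handle that?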
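One common way to set this up, offered only as a sketch and not as the asker's eventual solution, is to treat every instance as one row, use its "vocab" counts as the features and its senseId alone as the label, and train a separate classifier per head word (for example one for "activate.v"). The toy instances, ids, and sense labels below are invented stand-ins for the real dictionary; DictVectorizer, KNeighborsClassifier, and LinearSVC are the scikit-learn classes.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy stand-in for the instances of ONE head word (hypothetical data;
# the real dictionary holds hundreds of instances per word).
train_instances = {
    "w.1": {"senseId": "s1", "vocab": {"spade": 1, "dig": 1, "lever": 1}},
    "w.2": {"senseId": "s1", "vocab": {"spade": 1, "handlebar": 1}},
    "w.3": {"senseId": "s2", "vocab": {"receptor": 2, "stimul": 2}},
    "w.4": {"senseId": "s2", "vocab": {"receptor": 1, "nervous": 1}},
}

# One row per instance: features are the count dict, label is the senseId.
feature_dicts = [inst["vocab"] for inst in train_instances.values()]
labels = [inst["senseId"] for inst in train_instances.values()]

# DictVectorizer builds the vocabulary and a sparse count matrix in one step.
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(feature_dicts)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, labels)
svm = LinearSVC().fit(X_train, labels)

# A test instance has no senseId. Reuse the fitted vectorizer so the
# columns match; words never seen in training are silently dropped.
test_vocab = {"spade": 1, "dig": 1, "unseen": 1}
X_test = vectorizer.transform([test_vocab])

knn_prediction = knn.predict(X_test)[0]
svm_prediction = svm.predict(X_test)[0]
print(knn_prediction, svm_prediction)
```

With this layout there is no need to group vectors by sense: the classifier sees many rows that share a label. The instance id is only needed afterwards, to report which prediction belongs to which test instance.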

Also, the vector is only one feature, and I need to add more features later, such as synsets, hypernyms, hyponyms, and so on. How should I do that?

Thanks in advance!
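On the question of adding more features later, one possible approach is to keep every feature group as its own {name: value} mapping and merge them into a single dict per instance, prefixing each key with its group name so that a context word and a synset with the same spelling do not collide. The helper merge_features and all feature values below are hypothetical and chosen only for illustration.

```python
def merge_features(**feature_groups):
    """Flatten several {name: value} groups into one dict with prefixed keys."""
    merged = {}
    for group_name, features in feature_groups.items():
        for name, value in features.items():
            merged[f"{group_name}={name}"] = value
    return merged


# Hypothetical values: the synset and hypernym names are made up.
context_counts = {"spade": 1, "dig": 1}
synset_counts = {"spade.n.01": 1}
hypernym_counts = {"hand_tool.n.01": 1}

combined = merge_features(
    context=context_counts,
    synset=synset_counts,
    hypernym=hypernym_counts,
)
print(combined)
```

A list of such merged dicts, one per instance, can be passed to scikit-learn's DictVectorizer, which turns each distinct key into its own column, so new feature groups only add columns and the training code stays the same.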