Shenyang Institute of Aeronautical Engineering, Master's Thesis
Abstract (translated from the Chinese)
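Quantifying the relationships between words is an important research topic in natural language processing, with wide application in areas such as information retrieval, word sense disambiguation, and machine translation. Based on HowNet, this thesis studies methods for measuring the semantic similarity and relation similarity of words. It proposes a semantic similarity algorithm that combines semantics with statistics and a relation similarity algorithm based on latent semantic indexing, both of which improve the computed similarity results. The specific contributions are as follows: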
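Existing semantic and relation similarity algorithms fall into two main classes: those based on semantic resources and those based on statistics. The former compute similarity from manually built semantic dictionaries or semantic networks, whereas the latter are entirely data-driven: they gather, from a large corpus, the context information that co-occurs with a word and compute similarity from it. This thesis studies the HowNet semantic similarity computation and, to address its poor performance on words whose sememes belong to different classes, proposes a semantic similarity algorithm that combines semantics with statistics in order to improve the final similarity results. The thesis introduces the word-substitution questions of the national civil service examination as a test set for Chinese word similarity algorithms, which to some extent addresses the lack of a public Chinese test set for this task. When different semantic similarity algorithms are compared on this test set, the proposed algorithm achieves good results.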
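As a rough illustration of the statistical side described above, the following Python sketch builds co-occurrence context vectors from a tokenized corpus and compares two words by cosine similarity. The toy corpus, the window size, and the `fused_similarity` helper (a plain linear interpolation with an assumed weight `alpha`) are hypothetical choices made for this example. The abstract does not state how the thesis actually combines the HowNet score with the statistical score, so this should not be read as the thesis's algorithm.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """For each word, count the words that occur within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def fused_similarity(semantic_sim, statistical_sim, alpha=0.5):
    """Hypothetical fusion: linear interpolation with an assumed weight alpha."""
    return alpha * semantic_sim + (1 - alpha) * statistical_sim


# Toy corpus (assumed data, for illustration only).
corpus = [
    ["the", "doctor", "treated", "the", "patient"],
    ["the", "physician", "treated", "the", "patient"],
    ["the", "pilot", "flew", "the", "plane"],
]
vecs = context_vectors(corpus)
stat_doctor_physician = cosine(vecs["doctor"], vecs["physician"])
stat_doctor_pilot = cosine(vecs["doctor"], vecs["pilot"])
```

In this toy corpus "doctor" and "physician" share identical contexts, so their statistical similarity is higher than that of "doctor" and "pilot". A dictionary-based score for the same pair could then be passed to `fused_similarity` together with the statistical score.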
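To address the data sparsity problem that is hard to solve in traditional unsupervised or semi-supervised relation similarity computation, this thesis uses HowNet for synonym expansion and applies singular value decomposition to reduce dimensionality and remove noise, thereby proposing a relation similarity algorithm based on latent semantic indexing. In a relation classification experiment on a patent corpus, classification accuracy improves by 6% over a traditional SVM, reaching 44%.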
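The dimensionality-reduction step can be sketched as follows, assuming a matrix whose rows are word pairs and whose columns are context patterns. The matrix contents, the pair labels, and the choice `k=2` are hypothetical toy values, and the HowNet synonym expansion step is not modeled here. The sketch only shows how a truncated singular value decomposition, as used in latent semantic indexing, maps each row into a lower-dimensional space before cosine similarity is computed.

```python
import numpy as np


def lsi_embed(matrix, k):
    """Project each row of `matrix` onto its top-k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Rows: word pairs; columns: context patterns (assumed toy counts).
pairs = ["(engine, car)", "(motor, vehicle)", "(author, book)"]
counts = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
])

reduced = lsi_embed(counts, k=2)
sim_related = cosine(reduced[0], reduced[1])
sim_unrelated = cosine(reduced[0], reduced[2])
```

Keeping only the two largest singular values discards the smallest component, which here carries the small difference between the first two rows. Those two pairs therefore become more similar in the reduced space than in the raw counts, while the third pair stays orthogonal to them.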
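To further verify the effectiveness of the two proposed similarity algorithms, this thesis implements a similar-question retrieval system for FAQs and an entity relation classification system, and runs corresponding experiments with the two word similarity algorithms.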
Keywords: word similarity; relation similarity; latent semantic indexing; HowNet
Abstract
The complex relationships between words in natural language need to be analyzed quantitatively in practice. This thesis introduces two kinds of word similarity algorithm: one measures semantic similarity between words, and the other measures relation similarity between pairs of words. Both are widely used in natural language processing, in areas such as information retrieval, information extraction, text classification, word sense disambiguation, and example-based machine translation.
Existing semantic similarity and relation similarity algorithms fall into two main types: those based on semantic resources and those based on statistics. The former compute similarity from a manually built semantic dictionary, while the latter are entirely data-driven, that is, they gather the co-occurrence information of a word in context from a large corpus. This thesis studies the HowNet-based word similarity algorithm and several statistical word similarity algorithms and, to handle words whose sememes belong to different classes, proposes a new similarity algorithm that combines semantics with statistics. For the first time, word-substitution questions from the national civil service examination are used to demonstrate the
efficiency of the alg