Text similarity comparison with Python

鱼香土豆丝

Article link: https://blog.csdn.net/he_min/article/details/51476272

 
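 Correction: the book 机器学习系统设计 (Building Machine Learning Systems with Python) computes the Euclidean norm with linalg.norm from scipy. In practice I had to use linalg.norm from numpy instead. It is also possible that the scipy package I downloaded differs from the one used in the book.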

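  One way to measure text similarity is the Levenshtein distance, also called edit distance.

  It is the minimum number of edits (inserting, deleting, or replacing a character) needed to turn one word into another.

  An alternative to comparing edit distances is the bag-of-words approach, which is based on word counts.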
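  To make the definition concrete, here is a minimal sketch of the edit distance using the standard dynamic-programming recurrence. The function name levenshtein is my own choice and is not from the book; the sketch assumes Python 3.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed
    # so far (before the current character) and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("machine", "mchiene"))  # 2
```

  Edit distance works character by character, so it gets expensive on whole documents, which is one reason to look at word counts instead.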

  ------------------------------------------------------------------------------------------------------- 

  Some practice code with the class used for word counting, CountVectorizer from scikit-learn. Printing a CountVectorizer shows its parameters:

  CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
          dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
          lowercase=True, max_df=1.0, max_features=None, min_df=1,
          ngram_range=(1, 1), preprocessor=None, stop_words=None,
          strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
          tokenizer=None, vocabulary=None)

  Word counts can be computed with this class:

  from sklearn.feature_extraction.text import CountVectorizer

  # min_df=1: ignore words that appear in fewer than 1 document,
  # which means every word is kept
  vectorizer = CountVectorizer(min_df=1)

  ------------------------------------------ 

  Experiment:

  vectorizer = CountVectorizer(min_df=1)
  # print(vectorizer)
  contex = ['how to format my hard disk', 'hard disk format problems']
  # learn the vocabulary and count the words of each text
  X = vectorizer.fit_transform(contex)
  # scikit-learn 1.0 and later call this get_feature_names_out()
  print(vectorizer.get_feature_names())
  # transposed: one row per word, one column per text
  print(X.toarray().transpose())

  ++++++++++++++++++++++++++++

  Result:

  [u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']

  [[1 1]
   [1 1]
   [1 1]
   [1 0]
   [1 0]
   [0 1]
   [1 0]]
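  To see what the counting step does, here is a rough standard-library sketch that rebuilds the same table without scikit-learn. It is not how CountVectorizer is implemented: the names tokenize and count_matrix are my own, and I only assume the default behaviour shown in the parameters above (lowercasing, and tokens of two or more word characters). Written for Python 3.

```python
import re
from collections import Counter

# same idea as the default token_pattern: words of 2+ characters
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def count_matrix(docs):
    """Return (sorted vocabulary, one count row per document)."""
    vocabulary = sorted({word for doc in docs for word in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


docs = ["how to format my hard disk", "hard disk format problems"]
vocabulary, rows = count_matrix(docs)
print(vocabulary)
# print one line per word, one count per document
for word, column in zip(vocabulary, zip(*rows)):
    print(word, column)
```

  Each text becomes a vector of counts over the shared vocabulary, and word order is thrown away. That is the "bag" in bag of words.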

  ------------------------------------------------------ 

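  Reading the files: use the os package to read every file in a directory

  import os

  DIRs = r'D:\workspace\bulid_ML_system\src\first\toy'
  # read the full text of each file in the directory into a list
  post = [open(os.path.join(DIRs, f)).read() for f in os.listdir(DIRs)]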
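  The one-liner above never closes the files it opens. A slightly more careful variant could look like the sketch below. The function name read_posts, the utf-8 encoding, and the temporary demo files are my own assumptions, not part of the original code; written for Python 3.

```python
import os
import tempfile


def read_posts(directory, encoding="utf-8"):
    """Read every regular file in directory, in sorted name order."""
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, encoding=encoding) as f:  # closed automatically
            posts.append(f.read())
    return posts


# demo on a throwaway directory with two made-up files
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("01.txt", "first post"), ("02.txt", "second post")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo_posts = read_posts(tmp)

print(demo_posts)
```

  Sorting the file names keeps the index of each post stable between runs, which helps when results are reported by index.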

  --------------------------------------------------------- 

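  Computing the Euclidean norm

  import numpy as np

  def dist_raw(v1, v2):
      # v1 and v2 are sparse count vectors (rows from the vectorizer)
      delta = v1 - v2
      # convert to a dense array, then take the Euclidean norm
      return np.linalg.norm(delta.toarray())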
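  As a small worked example, the sketch below computes the same distance by hand for the two count vectors from the experiment above (the columns of the printed table). It uses plain lists instead of sparse matrices, and the name euclidean_distance is my own; written for Python 3.

```python
import math


def euclidean_distance(v1, v2):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


# word order: disk, format, hard, how, my, problems, to
doc1 = [1, 1, 1, 1, 1, 0, 1]  # 'how to format my hard disk'
doc2 = [1, 1, 1, 0, 0, 1, 0]  # 'hard disk format problems'

print(euclidean_distance(doc1, doc2))  # 2.0
```

  The two texts differ in four words (how, my, problems, to), each by a count of 1, so the distance is the square root of 4.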

  ------------------------------------------------------------ 

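  Finding the most similar text by distance:

  # Assumed to be defined earlier (not shown in this post):
  #   contex          list of training texts
  #   x_train         vectorizer.fit_transform(contex)
  #   num_sample      number of rows in x_train
  #   new_contex      the new text to compare against
  #   new_contex_vec  vectorizer.transform([new_contex])
  best_doc = None
  best_dist = float('inf')   # replaces sys.maxint, which needs Python 2 and import sys
  best_i = None
  # print(range(0, num_sample))
  for i in range(0, num_sample):
      contexs = contex[i]
      # skip a text that is identical to the new one
      if contexs == new_contex:
          continue
      count_vec = x_train.getrow(i)
      d = dist_raw(count_vec, new_contex_vec)
      print('=====%i=======%.2f====:%s' % (i, d, contexs))
      # remember the closest text seen so far
      if d < best_dist:
          best_dist = d
          best_i = i
  print('best=%i====is===%.2f' % (best_i, best_dist))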
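  Since the loop above depends on variables that are not shown, here is a self-contained sketch of the same nearest-text search using only the standard library. The sample texts, the new text, and the names bag_of_words, distance, and nearest are all made up for illustration. It mimics, but is not, the scikit-learn pipeline; written for Python 3.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word counts of a text (lowercased, words of 2+ characters)."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def distance(c1, c2):
    """Euclidean distance between two word-count Counters."""
    words = set(c1) | set(c2)
    return math.sqrt(sum((c1[w] - c2[w]) ** 2 for w in words))


def nearest(posts, new_post):
    """Return (index, distance) of the post closest to new_post."""
    train = [bag_of_words(p) for p in posts]
    vocabulary = set().union(*train)
    # like vectorizer.transform, ignore words never seen in training
    new_vec = Counter({w: n for w, n in bag_of_words(new_post).items()
                       if w in vocabulary})
    best_i, best_dist = None, float("inf")
    for i, (post, vec) in enumerate(zip(posts, train)):
        if post == new_post:
            continue  # skip an identical text
        d = distance(vec, new_vec)
        if d < best_dist:
            best_i, best_dist = i, d
    return best_i, best_dist


posts = [
    "how to format my hard disk",
    "hard disk format problems",
    "imaging databases store data",
]
best_i, best_dist = nearest(posts, "format hard disk")
print(best_i, best_dist)  # 1 1.0
```

  The second text wins because it differs from the new text by a single word (problems).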

  ------------------------------------------------------------------------- 

   Removing unimportant words:

   This is done with the stop_words parameter:

  vectorizer = CountVectorizer(min_df=1, stop_words=['interesting'])

   stop_words can be a list of words, or simply the string 'english'. With 'english', a built-in list of 318 common English words is filtered out (words that occur very often but carry little useful meaning).

   The vocabulary before and after setting stop_words='english':

    [u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']

    [u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'learning', u'machine', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'toy']

   The result of using a list to exclude 'interesting':

    [u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage', u'store', u'stuff', u'this', u'toy']
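
   The effect of a stop-word list can be sketched in a few lines of plain Python. The two sample sentences and the short hand-picked stop list below are made up for illustration. This is not the built-in 'english' list from scikit-learn, and the function name vocabulary is my own; written for Python 3.

```python
import re


def vocabulary(docs, stop_words=()):
    """Sorted vocabulary of docs with the stop words removed."""
    words = {w for doc in docs
             for w in re.findall(r"\b\w\w+\b", doc.lower())}
    return sorted(words - set(stop_words))


docs = [
    "This is a toy post about machine learning.",
    "It contains not much interesting stuff.",
]

print(vocabulary(docs))
print(vocabulary(docs, stop_words=["interesting"]))
# a tiny hand-picked stop list, just for this demo
print(vocabulary(docs, stop_words=["about", "is", "it", "much", "not", "this"]))
```

   Dropping such words shrinks the vectors and keeps very frequent filler words from dominating the distance.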
