A simple implementation of N-grams, tf-idf, and cosine similarity in Python

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine similarity:

import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    # Despite the name, the value is a similarity: 1.0 means same direction.
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
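To make the returned value concrete, here is a minimal dependency-free sketch of the same formula on plain lists. The name `cosine_similarity`, the sample vectors, and the choice to return 0.0 for a zero vector are assumptions made for illustration, not part of the original answer.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; 0.0 is a convention
        # chosen for this sketch.
        return 0.0
    return dot / (norm_u * norm_v)

# Parallel vectors give a value close to 1, orthogonal vectors give 0.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Note that the value is a similarity, not a distance: larger means more alike. If a distance is needed, 1 minus this value is a common choice.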

For n-grams:

from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    # Prepend or append n-1 padding symbols so edge items appear in n ngrams.
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    # Number of full windows of length n; zero if the sequence is too short.
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
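As a usage sketch, the same sliding-window idea can be applied to the words of a sentence. The helper `word_ngrams`, the lowercasing, and the whitespace tokenisation are assumptions for illustration; real text usually needs a proper tokenizer.

```python
def word_ngrams(text, n):
    """Split text on whitespace and return its word n-grams as tuples."""
    tokens = text.lower().split()
    # Zipping n staggered views of the token list yields each window once;
    # zip stops at the shortest view, so short inputs give an empty list.
    return list(zip(*(tokens[i:] for i in range(n))))

bigrams = word_ngrams("The quick brown fox", 2)
too_short = word_ngrams("a b", 3)
```

Here `bigrams` holds three tuples, one per adjacent word pair, and `too_short` is empty because the input has fewer than three words.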

For tf-idf, you first have to compute the term distribution. I use Lucene for that, but you could do much the same with NLTK, using FreqDist.
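A rough sketch of the count-then-weight idea, using the standard library's `collections.Counter` as a stand-in for FreqDist so that it runs without NLTK. The function name `tf_idf`, the sample documents, and the particular weighting (term count divided by document length, times log of N over document frequency) are assumptions; this is one common variant, and Lucene's Similarity classes use different formulas.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts, the role FreqDist would play
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = [["apple", "banana", "apple"], ["banana", "cherry"]]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "banana" here, gets weight 0 under this variant, because log(N / df) is 0 when df equals N.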

If you prefer PyLucene, this shows how to compute tf.idf:

# Fragment: reader, searcher, sim, tmap, fieldname, max_doc, maxtokensperdoc
# and cosine_normalization are assumed to be defined elsewhere.
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
docs = reader.numDocs()
for i in range(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        # zipping with range(maxtokensperdoc) caps the terms taken per document
        for (t, f, x) in zip(terms, frequencies, range(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))  # assign each new term the next index
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()])**0.5
            for k, v in rec.items():
                rec[k] = v / denom
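The cosine normalization step above pays off at comparison time: once every document vector has unit length, cosine similarity reduces to a plain dot product. A minimal sketch on `{term: weight}` dicts follows; the names `l2_normalize` and `sparse_dot` and the sample weights are hypothetical, not from the original answer.

```python
import math

def l2_normalize(vec):
    """Scale a {term: weight} dict to unit Euclidean length."""
    denom = math.sqrt(sum(w * w for w in vec.values()))
    if denom == 0:
        # Nothing to scale; returning a copy is a convention for this sketch.
        return dict(vec)
    return {term: w / denom for term, w in vec.items()}

def sparse_dot(a, b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())

doc_a = l2_normalize({"apple": 3.0, "banana": 4.0})
doc_b = l2_normalize({"banana": 2.0})
# For unit-length vectors the dot product is the cosine similarity.
similarity = sparse_dot(doc_a, doc_b)
```

Only terms present in both documents contribute to the sum, which keeps the comparison cheap for large, sparse vocabularies.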
