[NLP: Using TF-IDF to measure how well a sentence matches a document] Given a sentence and a document, which sentences in the document match it best?

Result

Given a sentence:

I get a coffee cup

And given a document (a list of sentences):

docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

The three most similar sentences come out as:

['It is coffee time, bring your cup',
 'I like coffee, I like book and I like apple',
 'I have a party today']


TF-IDF

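TF stands for term frequency: how often a word occurs in a document.
IDF stands for inverse document frequency: the more documents a word appears in, the less it helps tell documents apart (for example, "I" shows up in almost every document, but that does not make it important), while a word that appears in only a few documents discriminates much better.

TF*IDF gives each word a weight that combines its frequency with its discriminating power.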
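
To make the weighting concrete, here is a minimal sketch on a three-sentence toy corpus. The raw-count TF and the `log(N / df)` IDF are common textbook choices picked for illustration, not necessarily the variants used later in this post; `toy_docs` and `tf_idf` are hypothetical names.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
toy_docs = [
    "I like coffee",
    "I like tea",
    "I am bob",
]
tokenized = [d.split(" ") for d in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(w for words in tokenized for w in set(words))


def tf_idf(word, doc_index):
    """Raw-count TF times log(N / df) IDF for one word in one document."""
    tf = tokenized[doc_index].count(word)
    idf = math.log(n_docs / df[word])
    return tf * idf


# "I" occurs in every document, so its IDF is log(3 / 3) = 0 and its weight vanishes.
print(tf_idf("I", 0))
# "coffee" occurs in a single document, so it gets the largest weight in sentence 0.
print(tf_idf("coffee", 0))
```

A word shared by all sentences contributes nothing to the match, while a rare word dominates it, which is exactly the behavior described above.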

Document preprocessing

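import itertools

# remove commas, then split each sentence into words on single spaces
docs_words = [d.replace(",", "").split(" ") for d in docs]
# build an unordered set of unique words (the vocabulary)
vocab = set(itertools.chain(*docs_words))
# lookup tables: word -> id and id -> word
v2i = {v: i for i, v in enumerate(vocab)}
i2v = {i: v for v, i in v2i.items()}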
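
The following sketch runs the same preprocessing steps on a two-sentence sample so the intermediate values are easy to inspect. The names `sample_docs`, `word_to_id` and `id_to_word` are hypothetical, and sorting the vocabulary is an addition made here (not part of the original code) so that the ids stay the same from run to run.

```python
import itertools

sample_docs = ["I am bob", "I am kitty, I like bob"]  # hypothetical mini corpus

# Same steps as in the post: drop commas, split on single spaces
sample_words = [d.replace(",", "").split(" ") for d in sample_docs]

# Assumption: sort the set so that word ids are reproducible across runs
sample_vocab = sorted(set(itertools.chain(*sample_words)))
word_to_id = {v: i for i, v in enumerate(sample_vocab)}
id_to_word = {i: v for v, i in word_to_id.items()}

print(sample_words[1])  # ['I', 'am', 'kitty', 'I', 'like', 'bob']
print(word_to_id)       # {'I': 0, 'am': 1, 'bob': 2, 'kitty': 3, 'like': 4}
```

Note that the split is case-sensitive, so "It" and "it" would end up as two different vocabulary entries.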

TF function: computing term frequency

import numpy as np
from collections import Counter

# three ways to weight the normalized term frequency
tf_methods = {
    "log": lambda x: np.log(1+x),
    # note: axis=1 takes each word's maximum across all documents
    "augmented": lambda x: 0.5 + 0.5 * x / np.max(x, axis=1, keepdims=True),
    "boolean": lambda x: np.minimum(x, 1),
    # "log_avg": lambda x: (1 + safe_log(x)) / (1 + safe_log(np.mean(x, axis=1, keepdims=True))),
}

def get_tf(method="log"):
    # term frequency: how frequent a word appears in a doc
    _tf = np.zeros((len(vocab), len(docs)), dtype=np.float64)    # [n_vocab, n_doc]
    for i, d in enumerate(docs_words):
        # e.g. Counter([7, 7, 8, 9]) gives Counter({7: 2, 8: 1, 9: 1})
        counter = Counter(d)
        for v in counter.keys():
            # normalize by the count of the most frequent word in this doc
            _tf[v2i[v], i] = counter[v] / counter.most_common(1)[0][1]

    weighted_tf = tf_methods.get(method, None)
    if weighted_tf is None:
        raise ValueError(f"unknown tf method: {method}")
    return weighted_tf(_tf)     # [n_vocab, n_doc]
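
As a quick check of what this function produces, the sketch below repeats the same normalization and the "log" weighting on a tiny pre-tokenized corpus. It is a standalone illustration under assumed inputs: `toy_words`, `toy_vocab` and `toy_v2i` are hypothetical names, and the vocabulary is sorted only to make the row order predictable.

```python
from collections import Counter

import numpy as np

toy_words = [["a", "b", "a"], ["b", "c"]]  # hypothetical tokenized corpus
toy_vocab = sorted({w for d in toy_words for w in d})  # ['a', 'b', 'c']
toy_v2i = {v: i for i, v in enumerate(toy_vocab)}

# Term frequency normalized by the most frequent word of each document
raw_tf = np.zeros((len(toy_vocab), len(toy_words)), dtype=np.float64)  # [n_vocab, n_doc]
for j, d in enumerate(toy_words):
    counter = Counter(d)
    max_count = counter.most_common(1)[0][1]
    for w, c in counter.items():
        raw_tf[toy_v2i[w], j] = c / max_count

# The "log" weighting: log(1 + tf)
log_tf = np.log(1 + raw_tf)

print(raw_tf)
# [[1.  0. ]
#  [0.5 1. ]
#  [0.  1. ]]
print(log_tf.shape)  # (3, 2)
```

Each column describes one document and each row one word, so the most frequent word of every document always has a normalized frequency of 1 before weighting.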

IDF: computing discriminating power


# x: presumably each word's document frequency (number of docs containing the word)
idf_methods = {
    "log": lambda x: 1 + np.log(len(docs) / (x+1)),
    "prob": lambda x: np.maximum(0, np.log((len(docs) - x) / (x+1))),
    "len_norm": lambda x: x / (np.sum(np.square(x))+1),
}
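
The listing stops at the `idf_methods` dictionary, so how it is applied is not shown here. As a hedged sketch, assuming the lambdas receive each word's document frequency as a `[n_vocab, 1]` array, the "log" variant could be evaluated as follows. The names `toy_words`, `toy_vocab` and `doc_freq` are hypothetical.

```python
import numpy as np

toy_words = [["a", "b"], ["a", "c"], ["a", "b", "d"]]  # hypothetical tokenized corpus
toy_vocab = sorted({w for d in toy_words for w in d})   # ['a', 'b', 'c', 'd']
n_docs = len(toy_words)

# Assumption: document frequency = number of documents containing the word, shape [n_vocab, 1]
doc_freq = np.array(
    [[sum(1 for d in toy_words if w in d)] for w in toy_vocab],
    dtype=np.float64,
)

# The "log" variant from idf_methods: 1 + log(N / (df + 1))
idf = 1 + np.log(n_docs / (doc_freq + 1))

for w, value in zip(toy_vocab, idf[:, 0]):
    print(w, round(float(value), 3))
# a 0.712
# b 1.0
# c 1.405
# d 1.405
```

The word "a" appears in every document and receives the lowest score, while "c" and "d", each found in a single document, receive the highest. Keeping the `[n_vocab, 1]` shape would let the result broadcast against a `[n_vocab, n_doc]` TF matrix in an element-wise product.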