NLP Tutorial Notes: TF-IDF

NLP Tutorial Notes

TF-IDF

Word Vectors

Sentence Vectors

Seq2Seq Language Generation Models

CNN Language Models

Attention in Language Models

Transformer: Pushing Attention to the Extreme

ELMo: One Word, Many Meanings

GPT: A Unidirectional Language Model

BERT: A Bidirectional Language Model

Applications of NLP Models


Contents

NLP Tutorial

TF-IDF Algorithm Steps

I. Direct Implementation

1. itertools.chain()

2. Counter.most_common([n])

II. Implementation with sklearn

1. numpy.ravel()

2. numpy.argsort()



TF-IDF Algorithm Steps

Step 1: compute term frequency (TF).

Since documents vary in length, the term frequency is normalized so that documents of different lengths can be compared.
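
A standard formulation, consistent with the description above and with the code below (which normalizes each count by the count of the most frequent word in the document), is:

\mathrm{TF}(w, d) = \frac{\text{number of occurrences of } w \text{ in } d}{\text{count of the most frequent word in } d}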

Step 2: compute inverse document frequency (IDF).

This step requires a corpus to model the environment in which the language is used.

The more common a word is, the larger the denominator becomes and the smaller (closer to 0) the IDF gets. The 1 is added to the denominator to avoid a zero denominator (i.e. the case where no document contains the word), and log means taking the logarithm of the result.
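
In the standard smoothed form described here:

\mathrm{IDF}(w) = \log\frac{N}{\mathrm{df}(w) + 1}

where N is the total number of documents in the corpus and \mathrm{df}(w) is the number of documents that contain w.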

Step 3: compute TF-IDF.

As you can see, TF-IDF is proportional to how often a word occurs within a document and inversely proportional to how often it occurs across the corpus as a whole. The algorithm for automatic keyword extraction is therefore straightforward: compute the TF-IDF of every word in the document, sort in descending order, and take the top few words.
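
Putting the two parts together:

\text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w)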


I. Direct Implementation

import numpy as np
from collections import Counter
import itertools
from visual import show_tfidf   # this refers to visual.py in my [repo](https://github.com/MorvanZhou/NLP-Tutorials/)

docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

docs_words = [d.replace(",", "").split(" ") for d in docs]
vocab = set(itertools.chain(*docs_words))
v2i = {v: i for i, v in enumerate(vocab)}
i2v = {i: v for v, i in v2i.items()}


def safe_log(x):
    mask = x != 0
    x[mask] = np.log(x[mask])
    return x


tf_methods = {
        "log": lambda x: np.log(1+x),
        "augmented": lambda x: 0.5 + 0.5 * x / np.max(x, axis=1, keepdims=True),
        "boolean": lambda x: np.minimum(x, 1),
        "log_avg": lambda x: (1 + safe_log(x)) / (1 + safe_log(np.mean(x, axis=1, keepdims=True))),
    }
idf_methods = {
        "log": lambda x: 1 + np.log(len(docs) / (x+1)),
        "prob": lambda x: np.maximum(0, np.log((len(docs) - x) / (x+1))),
        "len_norm": lambda x: x / (np.sum(np.square(x))+1),
    }
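
# Note: tf_methods and idf_methods are alternative weighting schemes; this post uses the
# "log" variants, which are the defaults of get_tf() and get_idf() below, but the others
# can be selected via get_tf(method=...) / get_idf(method=...).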


def get_tf(method="log"):
    # term frequency: how frequently a word appears in a doc
    _tf = np.zeros((len(vocab), len(docs)), dtype=np.float64)    # [n_vocab, n_doc]
    for i, d in enumerate(docs_words):
        counter = Counter(d)
        for v in counter.keys():
            _tf[v2i[v], i] = counter[v] / counter.most_common(1)[0][1]

    weighted_tf = tf_methods.get(method, None)
    if weighted_tf is None:
        raise ValueError
    return weighted_tf(_tf)


def get_idf(method="log"):
    # inverse document frequency: a word that appears in more docs gets a lower idf, i.e. it is less informative
    df = np.zeros((len(i2v), 1))
    for i in range(len(i2v)):
        d_count = 0
        for d in docs_words:
            d_count += 1 if i2v[i] in d else 0
        df[i, 0] = d_count

    idf_fn = idf_methods.get(method, None)
    if idf_fn is None:
        raise ValueError
    return idf_fn(df)


def cosine_similarity(q, _tf_idf):
    unit_q = q / np.sqrt(np.sum(np.square(q), axis=0, keepdims=True))
    unit_ds = _tf_idf / np.sqrt(np.sum(np.square(_tf_idf), axis=0, keepdims=True))
    similarity = unit_ds.T.dot(unit_q).ravel()
    return similarity


def docs_score(q, len_norm=False):
    q_words = q.replace(",", "").split(" ")

    # add unknown words
    unknown_v = 0
    for v in set(q_words):
        if v not in v2i:
            v2i[v] = len(v2i)
            i2v[len(v2i)-1] = v
            unknown_v += 1
    if unknown_v > 0:
        _idf = np.concatenate((idf, np.zeros((unknown_v, 1), dtype=np.float64)), axis=0)
        _tf_idf = np.concatenate((tf_idf, np.zeros((unknown_v, tf_idf.shape[1]), dtype=np.float64)), axis=0)
    else:
        _idf, _tf_idf = idf, tf_idf
    counter = Counter(q_words)
    q_tf = np.zeros((len(_idf), 1), dtype=np.float64)     # [n_vocab, 1]
    for v in counter.keys():
        q_tf[v2i[v], 0] = counter[v]

    q_vec = q_tf * _idf            # [n_vocab, 1]

    q_scores = cosine_similarity(q_vec, _tf_idf)
    if len_norm:
        len_docs = [len(d) for d in docs_words]
        q_scores = q_scores / np.array(len_docs)
    return q_scores


def get_keywords(n=2):
    for c in range(3):
        col = tf_idf[:, c]
        idx = np.argsort(col)[-n:]
        print("doc{}, top{} keywords {}".format(c, n, [i2v[i] for i in idx]))


tf = get_tf()           # [n_vocab, n_doc]
idf = get_idf()         # [n_vocab, 1]
tf_idf = tf * idf       # [n_vocab, n_doc]
print("tf shape(vecb in each docs): ", tf.shape)
print("\ntf samples:\n", tf[:2])
print("\nidf shape(vecb in all docs): ", idf.shape)
print("\nidf samples:\n", idf[:2])
print("\ntf_idf shape: ", tf_idf.shape)
print("\ntf_idf sample:\n", tf_idf[:2])


# test
get_keywords()
q = "I get a coffee cup"
scores = docs_score(q)
d_ids = scores.argsort()[-3:][::-1]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in d_ids]))

show_tfidf(tf_idf.T, [i2v[i] for i in range(tf_idf.shape[0])], "tfidf_matrix")

Result:

tf shape (vocab in each doc):  (47, 15)

tf samples:
 [[0.         0.         0.         0.         0.         0.40546511
  0.69314718 0.         0.         0.         0.         0.
  0.         0.         0.        ]
 [0.         0.69314718 0.69314718 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.40546511 0.         0.        ]]

idf shape (vocab in all docs):  (47, 1)

idf samples:
 [[2.60943791]
 [2.32175584]]

tf_idf shape:  (47, 15)

tf_idf sample:
 [[0.         0.         0.         0.         0.         1.05803603
  1.80872453 0.         0.         0.         0.         0.
  0.         0.         0.        ]
 [0.         1.60931851 1.60931851 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.94139098 0.         0.        ]]
doc0, top2 keywords ['to', 'stay']
doc1, top2 keywords ['be', 'happy']
doc2, top2 keywords ['bob', 'am']

top 3 docs for 'I get a coffee cup':
['It is coffee time, bring your cup', 'I like coffee, I like book and I like apple', 'I have a party today']

Process finished with exit code 0

 

1. itertools.chain()

chain() strings a group of iterables together into one larger iterator:

import itertools

for c in itertools.chain('ABC', 'XYZ'):
    print(c)
# iterates over: 'A' 'B' 'C' 'X' 'Y' 'Z'
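
This is exactly how the vocabulary is built in the implementation above: chain(*docs_words) flattens the list of token lists into one stream of tokens, which set() then deduplicates. A minimal sketch with two toy documents:

import itertools

docs_words = [["it", "is", "a", "good", "day"], ["I", "am", "bob"]]
vocab = set(itertools.chain(*docs_words))
# {'it', 'is', 'a', 'good', 'day', 'I', 'am', 'bob'}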

2. Counter.most_common([n])

Returns a list of the n most common elements and their counts, ordered from most common to least. If n is omitted or None, most_common() returns all elements in the counter. Elements with equal counts are ordered in the order they were first encountered:

Counter('abracadabra').most_common(3)
[('a', 5), ('b', 2), ('r', 2)]
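
In get_tf() above, counter.most_common(1)[0][1] is the count of the most frequent word in a document, which is used to normalize the raw counts. A small sketch of that normalization:

from collections import Counter

counter = Counter("I like coffee I like book and I like apple".split(" "))
max_count = counter.most_common(1)[0][1]             # 3 ("I" and "like" both occur 3 times)
tf = {w: c / max_count for w, c in counter.items()}  # e.g. tf["like"] == 1.0, tf["coffee"] == 1/3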

II. Implementation with sklearn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


docs = [
    "it is a good day, I like to stay here",
    "I am happy to be here",
    "I am bob",
    "it is sunny today",
    "I have a party today",
    "it is a dog and that is a cat",
    "there are dog and cat on the tree",
    "I study hard this morning",
    "today is a good day",
    "tomorrow will be a good day",
    "I like coffee, I like book and I like apple",
    "I do not like it",
    "I am kitty, I like bob",
    "I do not care who like bob, but I like kitty",
    "It is coffee time, bring your cup",
]

vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(docs)
print("idf: ", [(n, idf) for idf, n in zip(vectorizer.idf_, vectorizer.get_feature_names())])
print("v2i: ", vectorizer.vocabulary_)


q = "I get a coffee cup"
qtf_idf = vectorizer.transform([q])
res = cosine_similarity(tf_idf, qtf_idf)
res = res.ravel().argsort()[-3:]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in res[::-1]]))


i2v = {i: v for v, i in vectorizer.vocabulary_.items()}
dense_tfidf = tf_idf.todense()

Result:

idf:  [('am', 2.386294361119891), ('and', 2.386294361119891), ('apple', 3.0794415416798357), ('are', 3.0794415416798357), ('be', 2.6739764335716716), ('bob', 2.386294361119891), ('book', 3.0794415416798357), ('bring', 3.0794415416798357), ('but', 3.0794415416798357), ('care', 3.0794415416798357), ('cat', 2.6739764335716716), ('coffee', 2.6739764335716716), ('cup', 3.0794415416798357), ('day', 2.386294361119891), ('do', 2.6739764335716716), ('dog', 2.6739764335716716), ('good', 2.386294361119891), ('happy', 3.0794415416798357), ('hard', 3.0794415416798357), ('have', 3.0794415416798357), ('here', 2.6739764335716716), ('is', 1.9808292530117262), ('it', 1.9808292530117262), ('kitty', 2.6739764335716716), ('like', 1.9808292530117262), ('morning', 3.0794415416798357), ('not', 2.6739764335716716), ('on', 3.0794415416798357), ('party', 3.0794415416798357), ('stay', 3.0794415416798357), ('study', 3.0794415416798357), ('sunny', 3.0794415416798357), ('that', 3.0794415416798357), ('the', 3.0794415416798357), ('there', 3.0794415416798357), ('this', 3.0794415416798357), ('time', 3.0794415416798357), ('to', 2.6739764335716716), ('today', 2.386294361119891), ('tomorrow', 3.0794415416798357), ('tree', 3.0794415416798357), ('who', 3.0794415416798357), ('will', 3.0794415416798357), ('your', 3.0794415416798357)]
v2i:  {'it': 22, 'is': 21, 'good': 16, 'day': 13, 'like': 24, 'to': 37, 'stay': 29, 'here': 20, 'am': 0, 'happy': 17, 'be': 4, 'bob': 5, 'sunny': 31, 'today': 38, 'have': 19, 'party': 28, 'dog': 15, 'and': 1, 'that': 32, 'cat': 10, 'there': 34, 'are': 3, 'on': 27, 'the': 33, 'tree': 40, 'study': 30, 'hard': 18, 'this': 35, 'morning': 25, 'tomorrow': 39, 'will': 42, 'coffee': 11, 'book': 6, 'apple': 2, 'do': 14, 'not': 26, 'kitty': 23, 'care': 9, 'who': 41, 'but': 8, 'time': 36, 'bring': 7, 'your': 43, 'cup': 12}

top 3 docs for 'I get a coffee cup':
['It is coffee time, bring your cup', 'I like coffee, I like book and I like apple', 'I do not care who like bob, but I like kitty']

Process finished with exit code 0
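
The last two lines of the snippet build i2v and a dense version of the tf-idf matrix but do not use them (in the repo they presumably feed the same show_tfidf plot as before). As a minimal sketch, not part of the original code, they can also be used to print per-document keywords in the same way as get_keywords() in the direct implementation:

import numpy as np

for c in range(3):
    col = np.asarray(dense_tfidf[c]).ravel()   # tf-idf vector of document c (documents are rows here)
    idx = np.argsort(col)[-2:]                 # indices of the two largest tf-idf values
    print("doc{}, top2 keywords {}".format(c, [i2v[i] for i in idx]))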

 

1. numpy.ravel()

numpy's ravel() and flatten() functions

Both do the same thing (reduce a multi-dimensional array to one dimension), as the names suggest: ravel (unravel, untangle) and flatten (make flat). The difference is whether they return a copy or a view: ndarray.flatten() returns a copy, so modifying the result does not affect the original array, while ndarray.ravel() returns a view whenever possible (somewhat like a C/C++ reference), so modifying the result does affect the original array.

In [14]: x=np.array([[1,2],[3,4]])

# flatten() and ravel() default to row-major (C) order when flattening
In [15]: x.flatten()
Out[15]: array([1, 2, 3, 4])

In [17]: x.ravel()
Out[17]: array([1, 2, 3, 4])

# pass 'F' for column-major (Fortran) order
In [18]: x.flatten('F')
Out[18]: array([1, 3, 2, 4])

In [19]: x.ravel('F')
Out[19]: array([1, 3, 2, 4])

# reshape(-1) also flattens an array to one dimension
In [21]: x.reshape(-1)
Out[21]: array([1, 2, 3, 4])
# x.T is the transpose of x
In [22]: x.T.reshape(-1)
Out[22]: array([1, 3, 2, 4])
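
In cosine_similarity() above, .ravel() turns the [n_doc, 1] column of similarity scores into a flat 1-D array so it can later be sorted with argsort(). A tiny sketch of that step:

import numpy as np

scores_col = np.array([[0.1], [0.7], [0.3]])   # shape (3, 1): one similarity per document
scores = scores_col.ravel()                    # shape (3,): array([0.1, 0.7, 0.3])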

2. numpy.argsort()

argsort() returns the indices that would sort the elements of x in ascending order:

import numpy as np

x = np.array([1, 4, 3, -1, 6, 9])
y = np.argsort(x)
# y = array([3, 0, 2, 1, 4, 5])
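
The retrieval code above uses the same idea to pick the three highest-scoring documents: scores.argsort()[-3:][::-1] takes the indices of the three largest scores and reverses them so the best document comes first:

import numpy as np

scores = np.array([0.1, 0.7, 0.3, 0.9, 0.0])
top3 = scores.argsort()[-3:][::-1]   # array([3, 1, 2]): indices of the 3 largest scores, best first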

 
