1 什么是TF-IDF?
词频逆词频模型(TF-IDF)的出现主要是为了解决BOW仅考虑了词频而忽略了词的重要性的问题。TF-IDF是基于统计来评估文本中词对于语料库中的一份文本的重要程度的方法。
TF-IDF使得文本内的高频率词语及其在整个文件集合中的低频率文件可以得到高权重的TF-IDF。在TF-IDF中,词语的重要性随着它在文件中出现的次数成正比增加,但同时会随着它在语料库中出现的频率成反比下降,这从侧面反映TF-IDF倾向于保留重要的词语,过滤掉常见的词语。
举个栗子:
想象你是新手房产中介,摆在面前的有100个楼盘的文档资料,当客户来咨询楼盘的时候,你必须快速定位到某一篇文档,你会怎么做?我们当然会想要使用某种方法将其归类,什么方法合适呢? 你会不会去找每篇文档中的关键词?那些在某篇文档中出现频率很高的词,比如每篇文档中基本都会有谈论租房或新房或二手房这样的字眼,这些高频的字眼其实就代表着这篇文章的属性, 我们大概也能通过这些字眼判断这是不是客户关心的问题。
但是还有一个问题,很多语气词,没有代表意义的词在一篇文档中同样频率很高,比如我,中介,楼盘,和这种词,几乎每篇文档中都会存在,而且提及很多次。 它们很明显,虽然词频高,但是不具有区分力,用上面的方法,这些词也会被误认为很重要。所以学者很聪明,他们知道光看局部信息(某篇文档中的词频TF)会带来统计偏差, 它们就引入了一个全局参数(IDF),来判断这个词在所有文档中,是不是垃圾信息。很明显,我,中介,和这种词在全量文档中就是这样的垃圾信息, 而租房或新房或二手房是在全局下有区分力的词。所以如果我们把局部(TF)和全局(IDF)的信息都整合起来一起看的时候,我们就能快速定位到具体的文档啦。
2 TF-IDF的数学表达形式
其实它就是一个庞大的矩阵,用词语的数字向量来代表一篇文档,当比较文档时,就是在比较这些向量的相似性。
如果把向量展示在图上,就会有类似下面图案的样子,加上一个计算向量相似度的方式,比如cosine距离,我们就能判段哪些文档在这个三维空间上比较相似了。 图中两个蓝色向量离得越近,就代表他们越像。
3 实现
最简单的一个搜索引擎,就是计算好所有的文档向量,然后每次来一个搜索问题,就将这个问题转换成同样的向量,找到那个和问题向量最相近的文档。 这就是很多搜索引擎的本质了。
首先我用一句话代替一篇文档,总共15句话代表了15篇文档,当然文档很长,为了方便举例,用一句话代表一篇文档。如下代码所示:
下面展示一些 内联代码片
。
import numpy as np
from collections import Counter
import itertools
docs = [
"it is a good day, I like to stay here",
"I am happy to be here",
"I am bob",
"it is sunny today",
"I have a party today",
"it is a dog and that is a cat",
"there are dog and cat on the tree",
"I study hard this morning",
"today is a good day",
"tomorrow will be a good day",
"I like coffee, I like book and I like apple",
"I do not like it",
"I am kitty, I like bob",
"I do not care who like bob, but I like kitty",
"It is coffee time, bring your cup",
]
将文档的单词转换成ID形式,这样便于后续通过ID进行统计。
docs_words = [d.replace(",", "").split(" ") for d in docs]
vocab = set(itertools.chain(*docs_words))
v2i = {v: i for i, v in enumerate(vocab)}
i2v = {i: v for v, i in v2i.items()}
词w 的IDF本质计算 IDF=log(所有文档数/所有文档中 词w 数), 当然还有很多种变异的计算方式。代码如下:
idf_methods = {
"log": lambda x: 1 + np.log(len(docs) / (x+1)),
"prob": lambda x: np.maximum(0, np.log((len(docs) - x) / (x+1))),
"len_norm": lambda x: x / (np.sum(np.square(x))+1),
}
def get_idf(method="log"):
# inverse document frequency: low idf for a word appears in more docs, mean less important
df = np.zeros((len(i2v), 1))
for i in range(len(i2v)):
d_count = 0
for d in docs_words:
d_count += 1 if i2v[i] in d else 0
df[i, 0] = d_count
idf_fn = idf_methods.get(method, None)
if idf_fn is None:
raise ValueError
return idf_fn(df) # [n_vocab, 1]
词w 在 文档d 的TF本质计算 TF=文档d 中 词w 总数。
tf_methods = {
"log": lambda x: np.log(1+x),
"augmented": lambda x: 0.5 + 0.5 * x / np.max(x, axis=1, keepdims=True),
"boolean": lambda x: np.minimum(x, 1),
"log_avg": lambda x: (1 + safe_log(x)) / (1 + safe_log(np.mean(x, axis=1, keepdims=True))),
}
def get_tf(method="log"):
# term frequency: how frequent a word appears in a doc
_tf = np.zeros((len(vocab), len(docs)), dtype=np.float64) # [n_vocab, n_doc]
for i, d in enumerate(docs_words):
counter = Counter(d)
for v in counter.keys():
_tf[v2i[v], i] = counter[v] / counter.most_common(1)[0][1]
weighted_tf = tf_methods.get(method, None)
if weighted_tf is None:
raise ValueError
return weighted_tf(_tf) # [n_vocab, n_doc]
TF 是用所有词组成的每篇文档向量表示,shape=[n_vocab, n_doc], 通过TF-IDF的乘积,就能得到每篇文档的TF-IDF向量表示了。
tf = get_tf() # [n_vocab, n_doc]
idf = get_idf() # [n_vocab, 1]
tf_idf = tf * idf # [n_vocab, n_doc]
最终计算出来的TF-IDF实际是一个词语和文章的矩阵,代表着用词语向量表示的文章。
接着,有了这些向量表示,当我们进行搜索时,只需要将搜索的问题向量化,把搜索向量在文档向量上进行距离计算,就能算出来哪些文档更贴近搜索向量了。 我们尝试搜索 I get a coffee cup, 并返回15篇文档当中最像这句搜索的前3篇文档。
def cosine_similarity(q, _tf_idf):
unit_q = q / np.sqrt(np.sum(np.square(q), axis=0, keepdims=True))
unit_ds = _tf_idf / np.sqrt(np.sum(np.square(_tf_idf), axis=0, keepdims=True))
similarity = unit_ds.T.dot(unit_q).ravel()
return similarity
def docs_score(q, len_norm=False):
q_words = q.replace(",", "").split(" ")
# add unknown words
unknown_v = 0
for v in set(q_words):
if v not in v2i:
v2i[v] = len(v2i)
i2v[len(v2i)-1] = v
unknown_v += 1
if unknown_v > 0:
_idf = np.concatenate((idf, np.zeros((unknown_v, 1), dtype=np.float)), axis=0)
_tf_idf = np.concatenate((tf_idf, np.zeros((unknown_v, tf_idf.shape[1]), dtype=np.float)), axis=0)
else:
_idf, _tf_idf = idf, tf_idf
counter = Counter(q_words)
q_tf = np.zeros((len(_idf), 1), dtype=np.float) # [n_vocab, 1]
for v in counter.keys():
q_tf[v2i[v], 0] = counter[v]
q_vec = q_tf * _idf # [n_vocab, 1]
q_scores = cosine_similarity(q_vec, _tf_idf)
if len_norm:
len_docs = [len(d) for d in docs_words]
q_scores = q_scores / np.array(len_docs)
return q_scores
def get_keywords(n=2):
for c in range(3):
col = tf_idf[:, c]
idx = np.argsort(col)[-n:]
print("doc{}, top{} keywords {}".format(c, n, [i2v[i] for i in idx]))
tf = get_tf() # [n_vocab, n_doc]
idf = get_idf() # [n_vocab, 1]
tf_idf = tf * idf # [n_vocab, n_doc]
print("tf shape(vecb in each docs): ", tf.shape)
print("\ntf samples:\n", tf[:2])
print("\nidf shape(vecb in all docs): ", idf.shape)
print("\nidf samples:\n", idf[:2])
print("\ntf_idf shape: ", tf_idf.shape)
print("\ntf_idf sample:\n", tf_idf[:2])
完整的代码如下所示:
import numpy as np
from collections import Counter
import itertools
from visual import show_tfidf
docs = [
"it is a good day, I like to stay here",
"I am happy to be here",
"I am bob",
"it is sunny today",
"I have a party today",
"it is a dog and that is a cat",
"there are dog and cat on the tree",
"I study hard this morning",
"today is a good day",
"tomorrow will be a good day",
"I like coffee, I like book and I like apple",
"I do not like it",
"I am kitty, I like bob",
"I do not care who like bob, but I like kitty",
"It is coffee time, bring your cup",
]
docs_words = [d.replace(",", "").split(" ") for d in docs]
vocab = set(itertools.chain(*docs_words))
v2i = {v: i for i, v in enumerate(vocab)}
i2v = {i: v for v, i in v2i.items()}
def safe_log(x):
mask = x != 0
x[mask] = np.log(x[mask])
return x
tf_methods = {
"log": lambda x: np.log(1+x),
"augmented": lambda x: 0.5 + 0.5 * x / np.max(x, axis=1, keepdims=True),
"boolean": lambda x: np.minimum(x, 1),
"log_avg": lambda x: (1 + safe_log(x)) / (1 + safe_log(np.mean(x, axis=1, keepdims=True))),
}
idf_methods = {
"log": lambda x: 1 + np.log(len(docs) / (x+1)),
"prob": lambda x: np.maximum(0, np.log((len(docs) - x) / (x+1))),
"len_norm": lambda x: x / (np.sum(np.square(x))+1),
}
def get_tf(method="log"):
# term frequency: how frequent a word appears in a doc
_tf = np.zeros((len(vocab), len(docs)), dtype=np.float64) # [n_vocab, n_doc]
for i, d in enumerate(docs_words):
counter = Counter(d)
for v in counter.keys():
_tf[v2i[v], i] = counter[v] / counter.most_common(1)[0][1]
weighted_tf = tf_methods.get(method, None)
if weighted_tf is None:
raise ValueError
return weighted_tf(_tf)
def get_idf(method="log"):
# inverse document frequency: low idf for a word appears in more docs, mean less important
df = np.zeros((len(i2v), 1))
for i in range(len(i2v)):
d_count = 0
for d in docs_words:
d_count += 1 if i2v[i] in d else 0
df[i, 0] = d_count
idf_fn = idf_methods.get(method, None)
if idf_fn is None:
raise ValueError
return idf_fn(df)
def cosine_similarity(q, _tf_idf):
unit_q = q / np.sqrt(np.sum(np.square(q), axis=0, keepdims=True))
unit_ds = _tf_idf / np.sqrt(np.sum(np.square(_tf_idf), axis=0, keepdims=True))
similarity = unit_ds.T.dot(unit_q).ravel()
return similarity
def docs_score(q, len_norm=False):
q_words = q.replace(",", "").split(" ")
# add unknown words
unknown_v = 0
for v in set(q_words):
if v not in v2i:
v2i[v] = len(v2i)
i2v[len(v2i)-1] = v
unknown_v += 1
if unknown_v > 0:
_idf = np.concatenate((idf, np.zeros((unknown_v, 1), dtype=np.float)), axis=0)
_tf_idf = np.concatenate((tf_idf, np.zeros((unknown_v, tf_idf.shape[1]), dtype=np.float)), axis=0)
else:
_idf, _tf_idf = idf, tf_idf
counter = Counter(q_words)
q_tf = np.zeros((len(_idf), 1), dtype=np.float) # [n_vocab, 1]
for v in counter.keys():
q_tf[v2i[v], 0] = counter[v]
q_vec = q_tf * _idf # [n_vocab, 1]
q_scores = cosine_similarity(q_vec, _tf_idf)
if len_norm:
len_docs = [len(d) for d in docs_words]
q_scores = q_scores / np.array(len_docs)
return q_scores
def get_keywords(n=2):
for c in range(3):
col = tf_idf[:, c]
idx = np.argsort(col)[-n:]
print("doc{}, top{} keywords {}".format(c, n, [i2v[i] for i in idx]))
tf = get_tf() # [n_vocab, n_doc]
idf = get_idf() # [n_vocab, 1]
tf_idf = tf * idf # [n_vocab, n_doc]
print("tf shape(vecb in each docs): ", tf.shape)
print("\ntf samples:\n", tf[:2])
print("\nidf shape(vecb in all docs): ", idf.shape)
print("\nidf samples:\n", idf[:2])
print("\ntf_idf shape: ", tf_idf.shape)
print("\ntf_idf sample:\n", tf_idf[:2])
# test
get_keywords()
q = "I get a coffee cup"
scores = docs_score(q)
d_ids = scores.argsort()[-3:][::-1]
print("\ntop 3 docs for '{}':\n{}".format(q, [docs[i] for i in d_ids]))
show_tfidf(tf_idf.T, [i2v[i] for i in range(tf_idf.shape[0])], "tfidf_matrix")