Understanding and Applying TFIDF

御用厨师

Category: Natural Language Processing. Tags: natural language processing, nlp, artificial intelligence

Article link: https://blog.csdn.net/qq_45520647/article/details/123803653

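Recently my teacher assigned a task:

Compute the TFIDF value of every word in a German corpus and an English corpus.
Using these values, take a sentence as input and output its language class (English or German).

So here is a quick first look at TFIDF.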

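1 Understanding

This article explains it very well: TF-IDF算法详解 (a detailed explanation of the TF-IDF algorithm). Below are my brief thoughts as they apply to my problem: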

1.1 TF

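TF is term frequency. If we want to know which words can represent my English (or German) corpus, the first requirement is that the word occurs frequently enough.
But does a high frequency in one language's corpus mean the word is representative of that language? If it also occurs frequently in the other language's corpus, it clearly cannot represent the first one.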
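A minimal sketch of the TF idea, assuming two tiny made-up corpora (the sentences and the helper name `term_frequency` are illustrative, not the assignment data). Like the script in section 2, it counts a word at most once per line:

```python
from collections import Counter


def term_frequency(lines):
    """Return {word: share of per-line occurrences} for one corpus."""
    counts = Counter()
    for line in lines:
        # set() so that a word repeated within one line is counted once
        counts.update(set(line.split()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical toy corpora, for illustration only
en_lines = ["the cat sat in the house", "the dog is in the garden"]
de_lines = ["die Katze sitzt in dem Haus", "der Hund ist in dem Garten"]

tf_en = term_frequency(en_lines)
tf_de = term_frequency(de_lines)
print(tf_en["the"], tf_en["in"], tf_de["in"])
```

In this toy data "in" has a high TF in both corpora, which is exactly the case where TF alone cannot tell the two languages apart.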

1.2 IDF

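To address this problem we introduce IDF (inverse document frequency): if a word occurs frequently in the corpora of all languages, it is given a low IDF value.
With the formula below, we can then use the size of a word's TFIDF value to judge whether it is representative of a language.
$\mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF}$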
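As a rough illustration of how IDF pulls down words shared by both languages, here is a small sketch on made-up sentences. The helper names `line_counts` and `tfidf` are hypothetical, and the IDF uses the same smoothed form as the script in section 2, log(total lines / (lines containing the word + 1)):

```python
import math
from collections import Counter


def line_counts(lines):
    """Number of lines containing each word (repeats in a line count once)."""
    counts = Counter()
    for line in lines:
        counts.update(set(line.split()))
    return counts


def tfidf(own_lines, other_lines):
    """TF comes from the own corpus only; IDF from both corpora together."""
    own = line_counts(own_lines)
    both = own + line_counts(other_lines)
    total_lines = len(own_lines) + len(other_lines)
    own_total = sum(own.values())
    return {
        word: (n / own_total) * math.log(total_lines / (both[word] + 1))
        for word, n in own.items()
    }


# Hypothetical toy corpora, for illustration only
en_lines = ["the cat sat in the house", "the dog is in the garden"]
de_lines = ["die Katze sitzt in dem Haus", "der Hund ist in dem Garten"]

scores_en = tfidf(en_lines, de_lines)
print(scores_en["the"], scores_en["in"])
```

Both words have the same TF in the English toy corpus, but "in" also appears in every German line, so its score ends up below that of "the". With the +1 in the denominator, a word that occurs in every line even gets a slightly negative IDF.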

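1.3 Summary

IDF does not depend on the class: it is computed over all corpora.
TF depends on the class: it is computed only on that class's own corpus.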
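Written out for the two-language setting, in the simplified line-based form used by the script in section 2.1, the two quantities look as follows. The symbols are notation introduced here for illustration: $n_c(w)$ is the number of lines of corpus $c \in \{\mathrm{en}, \mathrm{de}\}$ that contain word $w$, and $N_c$ is the number of lines in corpus $c$.

```latex
\mathrm{TF}_c(w) = \frac{n_c(w)}{\sum_{w'} n_c(w')}
\qquad
\mathrm{IDF}(w) = \log\frac{N_{\mathrm{en}} + N_{\mathrm{de}}}{n_{\mathrm{en}}(w) + n_{\mathrm{de}}(w) + 1}
\qquad
\mathrm{TFIDF}_c(w) = \mathrm{TF}_c(w) \times \mathrm{IDF}(w)
```

Only TF carries the subscript $c$; IDF is the same number whichever language is being scored.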

2 Code implementation

If your TFIDF computation is very slow, the problem is almost certainly in your code!

2.1 Code for computing `TFIDF`:

import math
import sys
from collections import defaultdict
from tqdm import tqdm


def handle(srcf, srcf2):
    # Count the lines first so the progress bar knows its total
    count = 0
    with open(srcf, "rb") as frd:
        for line in frd:
            count += 1
    pbar = tqdm(total=count)
    pbar.set_description("Processing 1")

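    # Prepare the counts needed for TF and IDF
    # TF = occurrences of a word / total occurrences (simplified: counted per line, so a word repeated within one line counts once)
    vocab_doc = defaultdict(int)  # number of lines (sentences) in srcf that contain each word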
    with open(srcf, "rb") as frd:
        for line in frd:
            tmp = line.strip()  # strip surrounding whitespace, including the trailing "\r\n"
            if tmp:
                tmp = tmp.decode("utf-8")
                for word in set(tmp.split()):
                    vocab_doc[word] += 1
            pbar.update(1)
    pbar.close()

    # Count the lines of the second file for its progress bar
    count2 = 0
    with open(srcf2, "rb") as frd:
        for line in frd:
            count2 += 1
    pbar = tqdm(total=count2)
    pbar.set_description("Processing 2")

    # IDF = log(total number of lines in both files / (number of lines containing the word + 1))
    vocab_doc2 = vocab_doc.copy()
    with open(srcf2, "rb") as frd:
        for line in frd:
            tmp = line.strip()  # strip surrounding whitespace, including the trailing "\r\n"
            if tmp:
                tmp = tmp.decode("utf-8")
                for word in set(tmp.split()):
                    vocab_doc2[word] += 1
            pbar.update(1)
    pbar.close()

    # Compute TF * IDF
    word_tfidf = {}
    vocab_sum = sum(vocab_doc.values())
    count_sum = count + count2
    for i in vocab_doc:
        tmp_tf = vocab_doc[i] / vocab_sum
        tmp_idf = math.log(count_sum / (vocab_doc2[i] + 1))
        word_tfidf[i] = tmp_tf * tmp_idf

    return word_tfidf


def save(fname, obj):
    with open(fname, "wb") as fwrt:
        fwrt.write(repr(obj).encode("utf-8"))


if __name__ == "__main__":
    print(sys.argv)
    save(sys.argv[3], handle(sys.argv[1], sys.argv[2]))     # save(output path, handle(TF corpus path, additional corpus path for IDF))
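To sanity-check a saved result, a sketch like the following could list the highest-scoring words. It assumes the file was written by `save` above, that is, the `repr` of a plain dict. The function name `top_words` and the demo file are hypothetical. It reads the file with `ast.literal_eval`, which only accepts Python literals, rather than `eval`:

```python
import ast
import os
import tempfile


def top_words(fname, k=10):
    """Load a {word: tfidf} dict saved as repr() text and return the k best."""
    with open(fname, "rb") as frd:
        scores = ast.literal_eval(frd.read().decode("utf-8"))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Demo with a made-up result file; real files come from the script above
demo = {"the": 0.05, "in": -0.04, "cat": 0.03}
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as fwrt:
    fwrt.write(repr(demo).encode("utf-8"))
    path = fwrt.name

best = top_words(path, k=2)
print(best)
os.remove(path)
```

Since `handle` takes the corpus to score as its first argument and the other corpus as its second, the script presumably has to be run twice with the two corpus paths swapped to produce one result file per language.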

2.2 Code for predicting the language

def handle(s):
    # Sum the TFIDF values of the sentence's words under each language's table;
    # a word missing from a table contributes 0
    en_sum = 0
    de_sum = 0
    for word in s.split():
        en_sum += dic_en.get(word, 0)
        de_sum += dic_de.get(word, 0)
    return en_sum, de_sum


def load(fname):
    with open(fname, "rb") as frd:
        tmp = frd.read().strip()
        rs = eval(tmp.decode("utf-8"))      # repr() serializes a Python object to text; eval() turns that text back into the object (only use it on files you wrote yourself)
        return rs


if __name__ == "__main__":
    sentence = input()
    dic_en = load("TFIDF/result_en.txt")
    # print(1)
    dic_de = load("TFIDF/result_de.txt")
    # print(1)
    tmp = handle(sentence)
    print("en: %f" % tmp[0])
    print("de: %f" % tmp[1])
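The script above prints the two sums but leaves the final decision to the reader. One possible way to finish the classification is to pick the language with the larger sum, sketched below. The function `classify`, the tie-break rule and the toy score tables are assumptions for illustration, not part of the original script:

```python
def classify(sentence, dic_en, dic_de):
    """Return "en" or "de", whichever language gives the larger TFIDF sum."""
    en_sum = sum(dic_en.get(word, 0.0) for word in sentence.split())
    de_sum = sum(dic_de.get(word, 0.0) for word in sentence.split())
    # Ties go to English; an arbitrary choice for this sketch
    return "en" if en_sum >= de_sum else "de"


# Made-up scores, for illustration only
toy_en = {"the": 0.05, "cat": 0.03, "house": 0.03}
toy_de = {"die": 0.05, "Katze": 0.03, "Haus": 0.03}

print(classify("the cat", toy_en, toy_de))
print(classify("die Katze", toy_en, toy_de))
```

In real use the two dictionaries would be the ones returned by `load` for `TFIDF/result_en.txt` and `TFIDF/result_de.txt`.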