02 Key Information Extraction

capus = ['He is a boy',
         'She is a girl, good girl']

# Build the vocabulary: every distinct word in the corpus
word_set = set()
for text in capus:
    for word in text.strip().split(" "):
        word_set.add(word.strip(','))

print("vocabulary", word_set)

# Map each word in the vocabulary to an index
index_word = dict()
for index, word in enumerate(word_set):
    index_word[word] = index

print("index_word", index_word)


# Build the count vector of each document
text_count = []
for text in capus:
    count_list = [0 for _ in range(len(word_set))]
    for word in text.strip().split(" "):
        count_list[index_word[word.strip(',')]] += 1

    text_count.append(count_list)

print("count vector", text_count)

Output (check it against your own run):

vocabulary {'boy', 'She', 'a', 'good', 'He', 'girl', 'is'}

index_word {'boy': 0, 'She': 1, 'a': 2, 'good': 3, 'He': 4, 'girl': 5, 'is': 6}

count vector [[1, 0, 1, 0, 1, 0, 1], [0, 1, 1, 1, 0, 2, 1]]

Note: the order of the words may differ between your run and mine, because a Python set has no fixed iteration order. Both results are correct; reading through the code logic makes this clear.
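If the run-to-run ordering is inconvenient, one option is to sort the vocabulary before assigning indices. The sketch below only illustrates that idea; the helper names `tokenize` and `count_vectors` are hypothetical and are not part of the code above.

```python
def tokenize(text):
    """Split on spaces and strip commas, mirroring the hand-written code."""
    return [word.strip(',') for word in text.strip().split(' ')]


def count_vectors(docs):
    """Return (vocabulary, count vectors) with a reproducible word order."""
    # Sorting the set fixes the index of every word across runs.
    vocab = sorted({word for doc in docs for word in tokenize(doc)})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        counts = [0] * len(vocab)
        for word in tokenize(doc):
            counts[index[word]] += 1
        vectors.append(counts)
    return vocab, vectors


vocab, vectors = count_vectors(['He is a boy', 'She is a girl, good girl'])
print(vocab)
print(vectors)
```

Capitalized words sort before lowercase ones here, because Python compares strings by code point.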

1.3.2 Using the ready-made sklearn API

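# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "He is a boy.",
    "She is a girl, good girl."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X.toarray())
print("vocabulary:", vectorizer.vocabulary_)
# With the default settings (lowercasing, tokens of at least two characters,
# so "a" is dropped), the features are ['boy' 'girl' 'good' 'he' 'is' 'she']
# and X.toarray() is:
# [[1 0 0 1 1 0]
#  [0 2 1 0 1 1]]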

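1.3 Pros and cons of the count vector

1.3.1 Pros

Simple, easy to use, and easy to understand.

1.3.2 Cons

(1) It considers only raw word counts and does not normalize them, which makes vectors from documents of different lengths hard to compare.

(2) It carries no semantic or positional information.

For example:

Example 1:

Sentence 1: 今天真开心 ("really happy today") -> vector 1: [1, 1, 1, 0]

Sentence 2: 今天不开心 ("not happy today") -> vector 2: [1, 0, 1, 1]

Vocabulary 1: {今天 (today), 真 (really), 开心 (happy), 不 (not)}

The two sentences have opposite meanings, yet their vectors differ in only two positions.

Example 2:

Sentence 1: 红房子 ("red house") -> vector 1: [1, 1]

Sentence 2: 房子红 ("the house is red") -> vector 2: [1, 1]

Vocabulary 2: {红 (red), 房子 (house)}

The word order differs, but the two vectors are identical.

(3) When the vocabulary is large, the count vector is necessarily sparse; its dimension is also large, so the computation becomes expensive.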
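Drawback (2) is easy to check in a few lines. The sketch below uses English glosses of Example 2 as tokens (an assumption made for readability), and the helper `bag_of_words` is a hypothetical name.

```python
from collections import Counter


def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


vocab = ['red', 'house']
v1 = bag_of_words(['red', 'house'], vocab)   # "red house"
v2 = bag_of_words(['house', 'red'], vocab)   # "the house is red"

# The word order is different, but the count vectors cannot tell them apart.
print(v1, v2, v1 == v2)
```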

1.3.3 Improvement

Build the vocabulary from only the n most frequent words, which effectively reduces its size.
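A minimal sketch of this idea with the standard library is shown below. The toy documents and the helper name `top_n_vocabulary` are assumptions for illustration. In scikit-learn, the `max_features` parameter of CountVectorizer serves a similar purpose.

```python
from collections import Counter


def top_n_vocabulary(docs, n):
    """Keep only the n words with the highest total count across all documents."""
    totals = Counter(word for doc in docs for word in doc.split())
    return [word for word, _ in totals.most_common(n)]


docs = ['a b a c', 'a b d']
small_vocab = top_n_vocabulary(docs, 2)
print(small_vocab)
```

Words outside the reduced vocabulary are simply ignored when the count vectors are built, so every vector has length n.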

2. TF-IDF

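2.1 How TF-IDF works

TF-IDF (term frequency-inverse document frequency): if a word appears frequently in a document A but appears in few documents of the corpus, then that word is very important to document A.

2.1.1 TF

The count vector is in fact also a way of computing the TF (term frequency) of the words in a document.

However, to make values comparable, we usually normalize them. Suppose there are 100 articles in total, the first with 10000 words and the second with 10 words. The count vector of the first article and that of the second are simply not comparable, because their bases differ.

So TF is usually computed in one of two ways:

(1) number of times the word appears in the article / total number of words in the article

(2) number of times the word appears in the article / number of times the most frequent word appears in the article

From the figure above, isn't it clear how the words "house" and "beautiful" benefit from the normalized TF computation?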
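The two normalizations in (1) and (2) can be sketched as follows. The function names `tf_by_length` and `tf_by_max` are hypothetical, and the token list is the second sentence of the earlier toy corpus.

```python
from collections import Counter


def tf_by_length(tokens):
    """Formula (1): count of the word / total number of words in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def tf_by_max(tokens):
    """Formula (2): count of the word / count of the most frequent word."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


tokens = ['She', 'is', 'a', 'girl', 'good', 'girl']
tf1 = tf_by_length(tokens)
tf2 = tf_by_max(tokens)
print(tf1['girl'], tf2['girl'], tf2['is'])
```

Both variants keep the ranking of words within one document; they differ only in the scale that makes documents of different lengths comparable.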

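2.1.2 IDF

Inverse document frequency (IDF) measures how important a word is across the whole corpus.

idf = log(total number of documents in the corpus / (number of documents containing the word + 1))

The +1 ensures that the denominator is never 0.

The log keeps the idf value from growing without bound (think of the shape of the log function), for example when a very important word appears in only one document.

(1) Why use IDF

On top of the term frequency, IDF gives each word a weight that further reflects its importance.

For example, words such as {的 (of), 是 (is), ...} appear many times in every document, which can give the false impression that they are "important". Adding idf removes this misconception: because these words occur in almost every document, their idf is very small, so on top of the term frequency they are given a very small weight.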
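To see the effect on a common word, here is a small sketch of the idf formula above. It assumes the natural logarithm (the formula does not fix a base), and the four toy documents and the function name `idf_weights` are made up for illustration.

```python
import math


def idf_weights(docs):
    """idf = log(number of documents / (number of documents containing the word + 1))."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):          # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / (df + 1)) for word, df in doc_freq.items()}


docs = [['the', 'cat'], ['the', 'dog'], ['the', 'bird'], ['the', 'fish']]
weights = idf_weights(docs)
print(weights['the'], weights['cat'])
```

With the +1 in the denominator, a word that occurs in every document ends up with a slightly negative weight, while a word that occurs in one document out of four gets log(2).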

2.1.3 TF-IDF

TF-IDF = TF * IDF

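2.2 Pros and cons of TF-IDF

2.2.1 Pros

(1) Simple, easy to understand, and easy to use.

(2) It works well when "frequent words are the useful ones" actually holds.

2.2.2 Cons

(1) It fixes the count vector's lack of normalization, but not the underlying problem: it still looks only at word frequency.

(2) It carries no semantic or positional information.

Positional information means, for example, whether a word appears at the beginning, the end, or the middle of an article. The beginning and the end are usually the more important places.

(3) When the vocabulary is large, the tf-idf vector is necessarily sparse as well; its dimension is also large, so the computation becomes expensive.

(4) In a more special case, the important words do not appear many times either. TF-IDF then fails.

2.3 TF-IDF in practice

2.3.1 By hand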

import math

capus = ['He is a boy', 'She is a girl, good girl']

# Number of documents
n = len(capus)

textA = capus[0].strip().split(' ')
textA = [e.strip(',') for e in textA]

textB = capus[1].strip().split(' ')
textB = [e.strip(',') for e in textB]

# Vocabulary: every distinct word in the corpus
capus_set = set(textA).union(set(textB))
print("vocabulary", capus_set)

# Size of the vocabulary
capus_n = len(capus_set)

# Indexed vocabulary: word as key, index as value
capus_index = dict()
for index, word in enumerate(capus_set):
    capus_index[word] = index

print("indexed vocabulary", capus_index)

# Inverse index: index as key, word as value
inverse_capus_index = dict()
for word, index in capus_index.items():
    inverse_capus_index[index] = word

# Compute the term frequency
def tf(text_list):
    """
    text_list: list of words in one document
    """
    tf_list = [0 for _ in range(len(capus_set))]
    # Total number of words in the document
    len_text = len(text_list)
    for word in text_list:
        tf_list[capus_index[word.strip(',')]] += 1

    tf_list = [e / len_text for e in tf_list]
    return tf_list


# Print the tf of textA and textB
tf_textA = tf(textA)
print("tf_textA", tf_textA)
tf_textB = tf(textB)
print("tf_textB", tf_textB)


# Compute the inverse document frequency
def idf(capus):
    # One set of distinct words per document
    text_list = []
    for text in capus:
        word_list = text.strip().split(' ')
        word_set = set(e.strip(',') for e in word_list)
        text_list.append(word_set)

    print("text_list", text_list)

    # Document frequency: number of documents that contain each word
    idf_dict = dict.fromkeys(capus_set, 0)
    idf_list = [0 for _ in range(len(idf_dict))]
    for text in text_list:
        for word in text:
            idf_dict[word] += 1
    print("idf_dict", idf_dict)

    for word in idf_dict.keys():
        # With the +1 in the denominator, a word that appears in every
        # document gets a negative idf: log10(n / (n + 1)) < 0
        idf_list[capus_index[word]] = math.log10(n / (idf_dict[word] + 1))
        # idf_list[capus_index[word]] = math.log10(n / (idf_dict[word]))

    return idf_list


idf_list = idf(capus)
print("idf_list", idf_list)


# Compute tf-idf
def tf_idf(texts):
    """
    texts: list whose elements are themselves lists of words
    """
    tf_idf_list = []
    for text in texts:
        tf_text = tf(text)
        tf_idf_text = [0 for _ in range(capus_n)]
        for i in range(capus_n):
            tf_idf_text[i] = tf_text[i] * idf_list[i]
        tf_idf_list.append(tf_idf_text)

    return tf_idf_list


tf_idf_list = tf_idf([textA, textB])
print(tf_idf_list)

2.3.2 Using an existing library

from sklearn.feature_extraction.text import TfidfVectorizer


corpus = ['He is a boy', 'She is a girl, good girl']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X)
"""
  (0, 0)	0.6316672017376245 # sentence 0: tf-idf value of the word whose vocabulary index is 0
  (0, 4)	0.4494364165239821
  (0, 3)	0.6316672017376245
  (1, 2)	0.3920440146223274 # sentence 1: tf-idf value of the word whose vocabulary index is 2
  (1, 1)	0.7840880292446548
  (1, 5)	0.3920440146223274
  (1, 4)	0.2789425453258252
"""
print("tfidf\n", X.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("feature names:", vectorizer.get_feature_names_out())
print("vocabulary:", vectorizer.vocabulary_)

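Now compare the code in 2.3.2 with the code in 1.3.2: can you spot what is the same and what differs?

The API is the same, and vectorizer.fit_transform() returns a result with the same structure. Once you understand one of them, you understand the others.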
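The values printed by TfidfVectorizer differ from the hand-written version in 2.3.1 because its defaults use a smoothed idf, ln((1 + n) / (1 + df)) + 1, and then scale each row to unit L2 length. The sketch below reproduces those defaults with the standard library only. It assumes the tokens are already lowercased and that the one-letter word "a" has been dropped, as the default tokenizer does; the helper name `tfidf_row` is hypothetical.

```python
import math

docs = [['he', 'is', 'boy'],
        ['she', 'is', 'girl', 'good', 'girl']]

vocab = sorted({word for doc in docs for word in doc})
n = len(docs)

# Document frequency and smoothed idf for every vocabulary word
df = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: math.log((1 + n) / (1 + df[word])) + 1 for word in vocab}


def tfidf_row(tokens):
    """Raw count times idf, then L2-normalized."""
    raw = [tokens.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(x, 4) for x in row])
```

The non-zero entries line up with the sparse output shown in 2.3.2, for example index 0 ('boy') in sentence 0 and index 1 ('girl') in sentence 1.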

There are other related APIs worth trying as well.

For example:

TfidfTransformer

HashingVectorizer

2.4 Jieba Tool

Jieba is a Chinese word segmentation tool. Besides segmenting Chinese text, what else can it do?

Clone the code locally: git clone https://github.com/fxsjy/jieba.git

Jieba -> test -> test.py

Note: idf.txt is provided directly.

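3. Similarity

Computing the similarity of texts is a very common everyday task. To keep things simple, we start with sentences. This time we use a Chinese example.

Sentence A: 这只皮靴号码大了。那只号码合适 ("This boot is too big. That one is the right size.")

Sentence B: 这只皮靴号码不小，那只更合适 ("This boot is not small in size; that one fits better.")

How do we measure how similar these two sentences are?

Basic idea: the more similar the wording of two sentences, the more similar their content should be. So we can compute their similarity from the Count Vector, the TF-IDF vector, and so on, described above.

Which similarity measure to use is up to you; read the explanations below and pick the one that suits your case.

3.1 Euclidean distance/similarity

3.1.1 Theory

Definition: the Euclidean distance (or Euclidean metric) is the "ordinary" straight-line distance between two points in Euclidean space.

3.1.2 A quick try

Count vector of sentence A: [1 1 2 1 1 1 0 0 0]
Count vector of sentence B: [1 1 1 0 1 1 1 1 1]

Euclidean distance:

d(A, B) = sqrt((a_1 - b_1)^2 + ... + (a_n - b_n)^2) = sqrt(5) ≈ 2.236

Let's implement it in code:

By hand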

import numpy as np


# Euclidean distance
def euclidean(u, v):
    assert len(u) == len(v)
    # Squared difference of each pair of components
    squared_diffs = [np.power(a - b, 2) for a, b in zip(u, v)]
    return np.power(np.sum(squared_diffs), 0.5)



count_vector_1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
count_vector_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]

euclidean_dist = euclidean(count_vector_1, count_vector_2)
print("euclidean dist: ", euclidean_dist)

Existing API

from scipy.spatial.distance import euclidean

count_vector_1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
count_vector_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]

euclidean_dist = euclidean(count_vector_1, count_vector_2)
print("euclidean dist: ", euclidean_dist)
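If neither NumPy nor SciPy is available, the standard library offers the same computation through `math.dist` (Python 3.8 or later). The section title also mentions similarity; the conversion `1 / (1 + distance)` used below is only one common convention and is an assumption here, not something defined in this article.

```python
import math

count_vector_1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
count_vector_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]

# math.dist returns the Euclidean distance between two points
dist = math.dist(count_vector_1, count_vector_2)

# Assumed convention: map the distance into a similarity in (0, 1]
similarity = 1 / (1 + dist)

print("euclidean dist: ", dist)
print("similarity: ", similarity)
```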

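3.2 Cosine distance/similarity

3.2.1 Theory

Cosine similarity uses the cosine of the angle between two vectors in a vector space to measure how different two items are. The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are. This is what is called "cosine similarity".

Cosine similarity

In a triangle with sides a, b, c, where θ is the angle between a and b (law of cosines):

cos(θ) = (a^2 + b^2 - c^2) / (2ab)

In two dimensions, for vectors A = (x_1, y_1) and B = (x_2, y_2):

cos(θ) = (x_1 x_2 + y_1 y_2) / (sqrt(x_1^2 + y_1^2) * sqrt(x_2^2 + y_2^2))

Extended to n dimensions, for A = (a_1, ..., a_n) and B = (b_1, ..., b_n):

cos(θ) = (a_1 b_1 + ... + a_n b_n) / (sqrt(a_1^2 + ... + a_n^2) * sqrt(b_1^2 + ... + b_n^2))

Cosine similarity reflects the relationship between the directions (angle) of the vectors, not their magnitudes. If magnitude matters, compute the Euclidean distance instead. Intuitively, Euclidean distance measures the straight-line distance between points in space, while cosine distance measures the difference in their directions.

Cosine distance: 1-cos(θ)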
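The difference between direction and magnitude can be seen on a toy vector. In the sketch below (the vector and the helper name `cosine_similarity` are assumptions for illustration), scaling a vector by 10 leaves its cosine similarity with the original at 1, while the Euclidean distance between the two becomes large.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1, 2, 3]
scaled = [10 * x for x in u]          # same direction, ten times the length

cos_sim = cosine_similarity(u, scaled)
euclid = math.dist(u, scaled)

print("cosine similarity:", cos_sim)  # direction is unchanged
print("euclidean distance:", euclid)  # magnitude differs a lot
```

For text, this means a document and a copy of it repeated ten times look identical under cosine similarity but far apart under Euclidean distance.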

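3.2.2 A quick try

Count vector of sentence A: [1 1 2 1 1 1 0 0 0]
Count vector of sentence B: [1 1 1 0 1 1 1 1 1]

Cosine distance: 1-cos(θ), where:

cos(θ) = 6 / (sqrt(9) * sqrt(8)) ≈ 0.707, so the cosine distance is about 0.293

Let's implement it in code:

By hand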

import numpy as np


# Cosine distance
def cosine(u, v):
    u_array = np.array(u)
    v_array = np.array(v)

    dot_product = np.sum(u_array * v_array)
    module_u = np.power(np.sum(u_array * u_array), 0.5)
    module_v = np.power(np.sum(v_array * v_array), 0.5)

    return 1 - dot_product / (module_u * module_v)



count_vector_1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
count_vector_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]

cosine_dist = cosine(count_vector_1, count_vector_2)
print("cosine dist: ", cosine_dist)

Existing API

from scipy.spatial.distance import cosine

count_vector_1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
count_vector_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]


cosine_dist = cosine(count_vector_1, count_vector_2)
print("cosine dist: ", cosine_dist)

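3.3 Jaccard distance/similarity

3.3.1 Theory

The Jaccard distance is used to compare the similarity and difference between finite sample sets.

Jaccard similarity coefficient:

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard distance:

d_J(A, B) = 1 - J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|

Uses:

Comparing text similarity, for plagiarism checking and deduplication;

Computing the distance between objects, for data clustering or for measuring how distinct two sets are.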
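On plain word sets the two formulas translate directly into Python set operations. The following is a minimal sketch; the word lists are made up, the function name `jaccard_similarity` is hypothetical, and treating two empty sets as identical is an assumed convention.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


words_a = ['he', 'is', 'a', 'boy']
words_b = ['she', 'is', 'a', 'girl']

sim = jaccard_similarity(words_a, words_b)
dist = 1 - sim
print("jaccard similarity:", sim)
print("jaccard distance:", dist)
```

Here the two sentences share 2 words ('is', 'a') out of 6 distinct words in total.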

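3.3.2 A quick try

Count vector of sentence A: [1 1 2 1 1 1 0 0 0]
Count vector of sentence B: [1 1 1 0 1 1 1 1 1]

Jaccard similarity: 4/9 ≈ 0.444 (4 of the 9 positions that are non-zero in at least one vector hold equal values)

Jaccard distance: 5/9 ≈ 0.556

Note: the third position of the vectors, the pair (2, 1), is not counted as a match, because the two values differ.

Let's implement it in code:

By hand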

import numpy as np


# Jaccard distance
def jaccard(u, v):
    assert len(u) == len(v)
    c_tt = 0
    c_tf_ft = 0  # c_tf + c_ft

    u_v = zip(u, v)
    for sub_u, sub_v in u_v:
        if sub_u == 0 and sub_v == 0:
            continue
        elif sub_u == sub_v:
            c_tt += 1
        else:
            c_tf_ft += 1

    return c_tf_ft / (c_tf_ft + c_tt)


count_vector_1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
count_vector_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]


jaccard_dist = jaccard(count_vector_1, count_vector_2)
print("jaccard dist: ", jaccard_dist)

Existing API

from scipy.spatial.distance import jaccard

count_vector_1 = [1, 1, 2, 1, 1, 1, 0, 0, 0]
count_vector_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1]


jaccard_dist = jaccard(count_vector_1, count_vector_2)
print("jaccard dist: ", jaccard_dist)

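4. Hands-on project

4.1 Extracting key information from articles

Flowchart for computing the idf of each word in a set of articles, following the MapReduce idea.

Where:

Note: the steps are chained with shell pipes, where the stdout of one command becomes the stdin of the next.

(1) Count how many files are in input_tfidf_dir

ls input_tfidf_dir | wc -l

(2) Sort the lines of the input file by a given rule

sort -k1 -nr

-k1: the -k option sets the sort key to start at column 1

-n: sort by numeric value (ascending)

-r: reverse the sort order (descending)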
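To make the effect of -n and -r concrete, here is a rough Python analogue of `sort -k1 -nr` on tab-separated lines. It is only an approximation of the shell command (it assumes that every line starts with a number), and the sample lines are made up.

```python
lines = ['3\tapple', '10\tbanana', '2\tcherry']

# Plain text sort compares character by character, so '10' comes before '2'.
text_sorted = sorted(lines)

# Numeric, descending sort on the first column, in the spirit of `sort -k1 -nr`.
numeric_desc = sorted(lines,
                      key=lambda line: float(line.split('\t')[0]),
                      reverse=True)

print(text_sorted)
print(numeric_desc)
```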

(3) convert.py merges multiple files into a single file

import os
import sys

# Directory that contains the input files, one article per file
file_path_dir = sys.argv[1]


def read_file_lines(path):
    """Return the stripped lines of a file, closing it afterwards."""
    with open(path, 'r') as fd:
        return [line.strip() for line in fd]


# Emit one line per file: "<file index>\t<file content joined by spaces>"
for file_index, name in enumerate(os.listdir(file_path_dir)):
    file_path = os.path.join(file_path_dir, name)
    content_list = read_file_lines(file_path)
    print('\t'.join([str(file_index), ' '.join(content_list)]))

(4) map.py lists every distinct word of each file on its own line, followed by a 1.

import sys


for line in sys.stdin:
    # Each input line is "<file index>\t<file content>"
    ss = line.strip().split('\t')
    if len(ss) != 2:
        continue

    file_name, file_content = ss
    word_list = file_content.strip().split(' ')

    word_set = set(word_list)

    for word in word_set:
        print('\t'.join([word, '1']))

(5) reduce.py computes the idf of each word (note: the output of map.py must first be sorted by word, so that identical words end up on adjacent lines)

import sys
import math

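current_word = None
count = 0

# Total number of documents (hard-coded; step (1) shows how to count them)
docs_cnt = 508

for line in sys.stdin:
    ss = line.strip().split('\t')
    if len(ss) != 2:
        continue

    word, val = ss
    if current_word is None:
        current_word = word

    # A new word starts: emit the idf of the previous word and reset the counter
    if current_word != word:
        idf = math.log(float(docs_cnt) / (float(count) + 1.0))
        print('\t'.join([current_word, str(idf)]))
        current_word = word
        count = 0

    count += int(val)

# Emit the last word, unless the input was empty
if current_word is not None:
    idf = math.log(float(docs_cnt) / (float(count) + 1.0))
    print('\t'.join([current_word, str(idf)]))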
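The whole map, sort, reduce chain can also be tried in a single process before running it through shell pipes. The sketch below is an illustration under assumptions: the three toy records, the function names `map_phase` and `reduce_phase`, and the use of `itertools.groupby` in place of the manual `current_word` bookkeeping are not part of the scripts above.

```python
import math
from itertools import groupby


def map_phase(records):
    """Emit (word, 1) once for every document that contains the word."""
    for _doc_id, content in records:
        for word in set(content.split(' ')):
            yield word, 1


def reduce_phase(sorted_pairs, docs_cnt):
    """Sum the counts per word (input must be sorted by word) and compute idf."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        doc_freq = sum(count for _, count in group)
        yield word, math.log(docs_cnt / (doc_freq + 1.0))


records = [('0', 'the cat sat'), ('1', 'the dog sat'), ('2', 'the bird flew')]

# sorted() plays the role of the `sort` step between map.py and reduce.py
pairs = sorted(map_phase(records))
idf = dict(reduce_phase(pairs, docs_cnt=len(records)))

for word, value in sorted(idf.items()):
    print(word, value)
```

Without the sort step, groupby (like the `current_word` logic in reduce.py) would see the same word in several separate runs and emit it more than once.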