[Spark NLP] Chapter 6: Information Retrieval

Sonhhxg_柒

Last modified 2022-10-31 16:48:06


First published 2022-10-29 10:08:27

Article link: https://blog.csdn.net/sikh_0529/article/details/127567691


Contents

Inverted Indexes

Building an Inverted Index

Step 1

Step 2

Step 3

Step 4

Vector Space Model

Stop-Word Removal

Inverse Document Frequency

In Spark

Exercises

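In the previous chapter, we ran into common words that made it hard to characterize a corpus. This is a problem for many kinds of NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.

Earlier, we talked about how much text data exists, and more is generated every day. We need some way to manage and search this data. If there is an ID or a title, we can of course index the data by those, but how do we search by content? With structured data, we can create logical expressions and retrieve all the rows that satisfy them. This can also be done with text, though less exactly.

The foundations of information retrieval predate computers. Information retrieval focuses on how to find specific pieces of information in a larger set of information, especially information in text data. The most common type of task in information retrieval is search, that is, document search.

The following are the components of document search:

Query q: a logical statement describing the document, or kind of document, you are looking for

Query term q_t: a term in the query, generally a token

Corpus of documents D: a collection of documents

Document d: a document in D, described by its terms t_d

Ranking function r(q, D): a function that ranks the documents in D by their relevance to the query q

Result R: the ranked list of documents

Before we discuss how to implement these components, we need to consider a technical problem. How can we quickly access documents based on the information in them? If we have to scan every document, we cannot search large collections of documents. To solve this problem, we use an inverted index.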

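Inverted Indexes

Originally, indexing was a means of organizing and labeling information in a way that made retrieval easier. For example, libraries use indexing to organize and find books. The Dewey Decimal Classification system is a way to index books by subject matter. We can also have indexes based on title, author, publication date, and so on. Another kind of index can often be found at the back of a book: a list of the concepts in the book and the pages on which to find them.

The index in an inverted index is slightly different from a traditional index; instead, it takes inspiration from the mathematical notion of indexing, that is, assigning indices to the elements of a set. Recall our set of documents D. We can assign a number to each document, creating a mapping from integers to documents, i -> d.

Let's create this index for our DataFrame. Normally, we would store an inverted index in a data store that allows quick lookups. Spark DataFrames are not designed for quick lookups. We will introduce the tools used for search.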

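Building an Inverted Index

Let's look at how we can build an inverted index in Spark. Here are the steps we will follow:

1. Load the data.
2. Create the index: i -> d*
   - Since we are using Spark, we will generate this index on the rows.
3. Process the text.
4. Create the inverted index from terms to documents: t_d -> i*

Step 1

We will create an inverted index for the mini_newsgroups data set.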

import os

from pyspark.sql.types import *
from pyspark.sql.functions import collect_set
from pyspark.sql import Row
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

spark = sparknlp.start()

path = os.path.join('data', 'mini_newsgroups', '*')
texts = spark.sparkContext.wholeTextFiles(path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()

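Step 2

Now we need to create the index. Spark assumes the data is distributed, so to assign an index we need to use the lower-level RDD API. zipWithIndex orders the elements by partition, and then by position within each partition, and assigns consecutive indices.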

rows_w_indexed = texts.rdd.zipWithIndex()
(path, text), i = rows_w_indexed.first()

print(i)
print(path)
print(text[:200])

0
file:/home/alext/projects/spark-nlp-book/data/mini_...
Xref: cantaloupe.srv.cs.cmu.edu sci.astro:35223 sci.space:61404
Newsgroups: sci.astro,sci.space
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!...

Now that we have created the index, we need to create a DataFrame as we did before, except that now we need to add our index to each row. Table 6-1 shows the result.

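# Emit plain tuples in the same field order as the schema (path, text, index),
# so the result does not depend on how a given Spark version orders Row fields.
indexed = rows_w_indexed.map(
    lambda row_index: (
        row_index[0]['path'],
        row_index[0]['text'],
        row_index[1])
)
(path, text, i) = indexed.first()

# Build a new schema rather than calling schema.add, which would also
# modify the original schema in place.
indexed_schema = StructType(
    schema.fields + [StructField('index', IntegerType())])

indexed = spark.createDataFrame(indexed, schema=indexed_schema)\
    .persist()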


indexed.limit(10).toPandas()

Table 6-1. Indexed documents
	path	text	index
0	file:/.../spark-nlp-book/data/m...	Newsgroups: rec.motorcycles\nFrom: lisa@alex.c...	0
1	file:/.../spark-nlp-book/data/m...	Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...	1
2	file:/.../spark-nlp-book/data/m...	Newsgroups: rec.motorcycles\nPath: cantaloupe....	2
3	file:/.../spark-nlp-book/data/m...	Xref: cantaloupe.srv.cs.cmu.edu rec.motorcycle...	3
4	file:/.../spark-nlp-book/data/m...	Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...	4
5	file:/.../spark-nlp-book/data/m...	Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...	5
6	file:/.../spark-nlp-book/data/m...	Newsgroups: rec.motorcycles\nPath: cantaloupe....	6
7	file:/.../spark-nlp-book/data/m...	Newsgroups: rec.motorcycles\nPath: cantaloupe....	7
8	file:/.../spark-nlp-book/data/m...	Path: cantaloupe.srv.cs.cmu.edu!rochester!udel...	8
9	file:/.../spark-nlp-book/data/m...	Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv....	9

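Each document d is a collection of terms, t_d. So our index is a mapping from integers to collections of terms.

An inverted index, on the other hand, is a mapping from terms t_d to integers, inv-index: t_d -> i, j, k, .... This allows us to quickly look up the documents that contain a given term.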
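The difference between the two mappings can be seen on a corpus small enough to read at a glance. The following is a minimal sketch in plain Python, not the Spark implementation used in this chapter; the three documents and the names docs, build_inverted_index, and toy_inverted are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: the keys play the role of the index i,
# the values are the terms t_d of each document.
docs = {
    0: ["the", "cat", "sat"],
    1: ["the", "dog", "sat"],
    2: ["a", "cat", "ran"],
}


def build_inverted_index(forward_index):
    """Map each term to the sorted list of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}


toy_inverted = build_inverted_index(docs)
print(toy_inverted["cat"])  # documents that contain "cat"
print(toy_inverted["sat"])  # documents that contain "sat"
```

Looking up a term is now a single dictionary access instead of a scan over every document.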

Step 3

Now let's process the text (see Table 6-2).

from sparknlp.pretrained import PretrainedPipeline

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')
tokenizer = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['token'])\
    .setOutputCol('lemma')
normalizer = Normalizer()\
    .setInputCols(['lemma'])\
    .setOutputCol('normalized')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normalized'])\
    .setOutputCols(['normalized'])\
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, tokenizer, 
    lemmatizer, normalizer, finisher
]).fit(indexed)

indexed_w_tokens = pipeline.transform(indexed)

indexed_w_tokens.limit(10).toPandas()

Table 6-2. Documents with normalized tokens


	path	text	index	normalized
0	file:/.../spark-nlp-book/data/m...	...	0	[newsgroups, recmotorcycles, from, lisaalexcom...
1	file:/.../spark-nlp-book/data/m...	...	1	[path, cantaloupesrvcscmuedudasnewsharvardedun...
2	file:/.../spark-nlp-book/data/m...	...	2	[newsgroups, recmotorcycles, path, cantaloupes...
3	file:/.../spark-nlp-book/data/m...	...	3	[xref, cantaloupesrvcscmuedu, recmotorcyclesha...
4	file:/.../spark-nlp-book/data/m...	...	4	[path, cantaloupesrvcscmuedudasnewsharvardeduo...
5	file:/.../spark-nlp-book/data/m...	...	5	[path, cantaloupesrvcscmuedumagnesiumclubcccmu...
6	file:/.../spark-nlp-book/data/m...	...	6	[newsgroups, recmotorcycles, path, cantaloupes...
7	file:/.../spark-nlp-book/data/m...	...	7	[newsgroups, recmotorcycles, path, cantaloupes...
8	file:/.../spark-nlp-book/data/m...	...	8	[path, cantaloupesrvcscmuedurochesterudelbogus...
9	file:/.../spark-nlp-book/data/m...	...	9	[path, cantaloupesrvcscmueducrabapplesrvcscmue...

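Since we are using a small data set, we will move out of Spark for the purposes of this example. We will collect our data into pandas and use our index field as the index of the pandas DataFrame.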

doc_index = indexed_w_tokens.select('index', 'path', 'text').toPandas()
doc_index = doc_index.set_index('index')

Step 4

Now, let's create the inverted index. We will use Spark SQL to do this. The result is shown in Table 6-3.

SELECT term, collect_set(index) AS documents
FROM (
    SELECT index, explode(normalized) AS term
    FROM indexed_w_tokens
)
GROUP BY term
ORDER BY term

inverted_index = indexed_w_tokens\
    .selectExpr('index', 'explode(normalized) AS term')\
    .distinct()\
    .groupBy('term').agg(collect_set('index').alias('documents'))\
    .persist()

inverted_index.show(10)

Table 6-3. Inverted index (mapping from term to document index)
	term	documents
0	aaangel.qdeck.com	[198]
1	accumulation	[857]
2	adventists	[314]
3	aecfb.student.cwru.edu	[526]
4	again...hmmm	[1657]
5	alt.binaries.pictures	[815]
6	amplifier	[630, 624, 654]
7	antennae	[484, 482]
8	apr..gordian.com	[1086]
9	apr..lokkur.dexter.mi.us	[292]

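This is our inverted index. We can see that the term "amplifier" occurs in documents 630, 624, and 654. With this information, we can quickly find all the documents that contain a particular term.

Another benefit is that the size of this inverted index depends on the size of our vocabulary, not on the amount of text in our corpus, so it is not big data. The inverted index grows only with new terms and document indices. For very large corpora, this can still be a large amount of data for a single machine. In the case of the mini_newsgroups data set, however, it is easily manageable.

Let's see how large our inverted index is.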

inverted_index.count()

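For us, since we have such a small number of documents, the inverted index has more entries than the index. Word frequencies follow Zipf's law: the frequency of a word is inversely proportional to its rank when the words are sorted by frequency. As a result, the most-used English words are already in our inverted index. The index can be constrained further by not tracking words that do not appear at least a certain number of times.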
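A minimal sketch of that kind of pruning, assuming the threshold is applied to the number of documents a term occurs in rather than to its total count. The toy index and the helper prune_rare_terms are hypothetical and are not part of the chapter's code.

```python
# Hypothetical toy inverted index: term -> set of document ids.
toy_index = {
    "the": {0, 1, 2, 3},
    "cat": {0, 2},
    "zyzzyva": {3},
    "dog": {1},
}


def prune_rare_terms(index, min_docs):
    """Keep only the terms that occur in at least `min_docs` documents."""
    return {term: ids for term, ids in index.items() if len(ids) >= min_docs}


pruned = prune_rare_terms(toy_index, min_docs=2)
print(sorted(pruned))
```

The cost of pruning is that queries for the dropped terms can no longer be answered from the index.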

inverted_index = {
    term: set(docs) 
    for term, docs in inverted_index.collect()
}

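Now we can begin with our most basic ranking function: simple Boolean search. In this case, let's look up all the documents that contain the word "language" or the word "information".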

lang_docs = inverted_index['language']
print('docs', ('{}, ' * 10).format(*list(lang_docs)[:10]), '...')
print('number of docs', len(lang_docs))

docs 1926, 1937, 1171, 1173, 1182, 1188, 1830, 808, 1193, 433,  ...
number of docs 44

info_docs = inverted_index['information']
print('docs', ('{}, ' * 10).format(*list(info_docs)[:10]), '...')
print('number of docs', len(info_docs))

docs 516, 519, 520, 1547, 1035, 1550, 1551, 17, 1556, 22,  ...
number of docs 215

filter_set = list(lang_docs | info_docs)
print('number of docs in filter set', len(filter_set))

number of docs in filter set 246

intersection = list(lang_docs & info_docs)
print('number of docs in intersection set', len(intersection))

number of docs in intersection set 13

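Let's print out lines from the filter set. Here, the filter set is the result set, but generally the filter set is ranked by r(q, D), which produces the result set.

Let's look at the lines in which the terms occur, to get an idea of our result set.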

k = 1
for i in filter_set:
    path, text = doc_index.loc[i]
    lines = text.split('\n')
    print(path.split('/')[-1], 'length:', len(text))
    for line_number, line in enumerate(lines):
        if 'information' in line or 'language' in line:
            print(line_number, line)
    print()
    k += 1
    if k > 5:
        break

178813 length: 1783
14 >>    Where did you get this information?  The FBI stated ...

104863 length: 2795
14 of information that I received, but as my own bad mouthing) ...

104390 length: 2223
51 ... appropriate disclaimer to outgoing public information,

178569 length: 11735
60  confidential information obtained illegally from law ...
64  ... to allegations of obtaining confidential information from
86  employee and have said they simply traded information with ...
91  than truthful" in providing information during an earlier ...
125  and Department of Motor Vehicles information such as ...
130  Bullock also provided information to the South African ...
142  information.
151  exchanged information with the FBI and worked with ...
160  information in Los Angeles, San Francisco, New York, ...
168  some confidential information in the Anti-Defamation ...
182  police information on citizens and groups.
190  ... spying operations, which collected information on more than
209  information to police, journalists, academics, government ...
211  information illegally, he said.
215  identity of any source of information," Foxman said.

104616 length: 1846
45 ... an appropriate disclaimer to outgoing public information,

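Now that we have our result set, how should we rank the results? We could simply count the occurrences of the search terms, but that would be biased toward long documents. Also, what happens if our query includes a very common word like "the"? If we use only counts, common words like "the" will dominate our results. In our result set, the text with the most occurrences of the query terms is also the longest. We could say that the more query terms found in a document, the more relevant the document is, but this has problems too. What do we do with one-term queries? In our example, only 13 of the 246 documents in the filter set contain both terms. Similarly, if our query contains common words, for example "the cat in the hat", should "the" and "in" have the same importance as "cat" and "hat"? To solve this problem, we need a more flexible model for our documents and queries.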
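The bias toward long documents and common words is easy to reproduce. The sketch below is illustrative only: the two documents and the function count_score are made up, and the score is nothing more than the raw number of query-term occurrences.

```python
from collections import Counter

# Hypothetical documents, already tokenized.
toy_docs = {
    "short": "the cat sat".split(),
    "long": ("the dog and the bird and the fish saw the cat "
             "near the house by the river").split(),
}


def count_score(query_terms, tokens):
    """Score a document by the raw number of query-term occurrences."""
    counts = Counter(tokens)
    return sum(counts[t] for t in query_terms)


query = ["the", "cat"]
scores = {name: count_score(query, toks) for name, toks in toy_docs.items()}
print(scores)
```

Both documents mention "cat" exactly once, yet the long document scores 7 against 2 for the short one, and six of its seven points come from "the".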

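Vector Space Model

In the previous chapter, we introduced the concept of vectorizing documents. We discussed creating binary vectors, where 1 means that the word is present in the document. We can also use counts.

When we convert a corpus into a collection of vectors, we are implicitly modeling our language as a vector space. In this vector space, each dimension represents one term. This has benefits and drawbacks. It is a simple way to represent our text in a form that machine learning algorithms can work with, and it allows us to represent the vectors sparsely. On the other hand, we lose the information contained in word order. The process also creates high-dimensional data sets, which can be problematic for some algorithms.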
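A small sketch of what this representation keeps and what it discards, assuming a three-word vocabulary chosen for illustration. The helper names to_count_vector and to_binary_vector are hypothetical, and the vectors are dense only to keep the example short.

```python
# Hypothetical vocabulary; each term gets one dimension.
toy_vocab = ["bites", "dog", "man"]
term_to_dim = {term: i for i, term in enumerate(toy_vocab)}


def to_count_vector(tokens):
    """Count vector over toy_vocab (one dimension per term)."""
    vec = [0] * len(toy_vocab)
    for token in tokens:
        vec[term_to_dim[token]] += 1
    return vec


def to_binary_vector(tokens):
    """1 if the term occurs in the document, 0 otherwise."""
    return [1 if c > 0 else 0 for c in to_count_vector(tokens)]


v1 = to_count_vector("dog bites man".split())
v2 = to_count_vector("man bites dog".split())
v3 = to_count_vector("dog bites dog".split())
print(v1, v2, v3)
```

"dog bites man" and "man bites dog" map to the same vector, which is the loss of word order described above, while "dog bites dog" differs from both only through its counts.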

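Let's calculate the vectors for our data set. In the previous chapter, we used CountVectorizer for this. Here we will build the vectors in Python, but the way we build them will help us understand how libraries implement vectorization.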

class SparseVector(object):
    
    def __init__(self, indices, values, length):
        # if the indices are not in ascending order, we need 
        # to sort them
        is_ascending = True
        for i in range(len(indices) - 1):
            is_ascending = is_ascending and indices[i] < indices[i+1]
        if not is_ascending:
            pairs = zip(indices, values)
            sorted_pairs = sorted(pairs, key=lambda x: x[0])
            indices, values = zip(*sorted_pairs)
        self.indices = indices
        self.values = values
        self.length = length
        
    def __getitem__(self, index):
        try:
            return self.values[self.indices.index(index)]
        except ValueError:
            return 0.0
        
    def dot(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res = 0
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res += self.values[i] * other.values[j]
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return res
    
    def hadamard(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res_indices = []
        res_values = []
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res_indices.append(self.indices[i])
                res_values.append(self.values[i] * other.values[j])
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return SparseVector(res_indices, res_values, self.length)
    
    def sum(self):
        return sum(self.values)
    
    def __repr__(self):
        return 'SparseVector({}, {})'.format(
            dict(zip(self.indices, self.values)), self.length)

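We need to make two passes over all the documents. In the first pass, we get our vocabulary and the counts. In the second pass, we construct the vectors.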

from collections import Counter

vocabulary = set()
vectors = {}

for row in indexed_w_tokens.toLocalIterator():
    counts = Counter(row['normalized'])
    vocabulary.update(counts.keys())
    vectors[row['index']] = counts
    
vocabulary = list(sorted(vocabulary))
inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)}
vocab_len = len(vocabulary)

Now that we have this information, we need to go back over our word counts and construct the actual vectors.

for index in vectors:
    terms, values = zip(*vectors[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors[index] = SparseVector(indices, values, vocab_len)

vectors[42]

SparseVector({56: 1, 630: 1, 678: 1, 937: 1, 952: 1, 1031: 1, 1044: 1,
1203: 1, 1348: 1, 1396: 5, 1793: 1, 2828: 1, 3264: 3, 3598: 3, 3753: 1,
4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 8532: 1, 9570: 1, 11031: 1,
11731: 1, 12509: 1, 13555: 1, 13772: 1, 14918: 1, 15205: 1, 15350: 1,
15475: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 17627: 1, 17798: 1,
17931: 2, 18178: 1, 18329: 2, 18505: 1, 18730: 3, 18776: 1, 19346: 1,
19620: 1, 20381: 1, 20475: 1, 20594: 1, 20782: 1, 21831: 1, 21856: 1,
21907: 1, 22560: 1, 22565: 2, 22717: 1, 23714: 1, 23813: 1, 24145: 1,
24965: 3, 25937: 1, 26437: 1, 26438: 1, 26592: 1, 26674: 1, 26679: 1,
27091: 1, 27109: 1, 27491: 2, 27500: 1, 27670: 1, 28583: 1, 28864: 1,
29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1,
34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35635: 1, 35724: 1,
36136: 1, 36596: 1, 36672: 1, 37048: 1, 37854: 1, 37867: 3, 37872: 1,
37876: 3, 37891: 1, 37907: 1, 37949: 1, 38002: 1, 38224: 1, 38225: 2,
38226: 3, 38317: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 41239: 1,
41240: 1, 41276: 1, 41292: 1, 41507: 1, 41731: 1, 42384: 2}, 42624)

Let's look at some of the words that occur the most.

vocabulary[3598]

'be'

vocabulary[37876]

'the'

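As we discussed previously, there are many drawbacks to using only counts for search. The concern is that words that are generally common in English will have more impact than the less common words. There are a few strategies for addressing this. First, let's look at the simplest solution: removing the common words.

Stop-Word Removal

These common words that we want to remove are called stop words. The term was coined in the 1950s by Hans Peter Luhn, a pioneer in information retrieval. Default stop-word lists are available, but it is often necessary to modify generic stop-word lists for different tasks.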

from pyspark.ml.feature import StopWordsRemover

sw_remover = StopWordsRemover() \
    .setInputCol("normalized") \
    .setOutputCol("filtered") \
    .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))

filtered = sw_remover.transform(indexed_w_tokens)

from collections import Counter

vocabulary_filtered = set()
vectors_filtered = {}

for row in filtered.toLocalIterator():
    counts = Counter(row['filtered'])
    vocabulary_filtered.update(counts.keys())
    vectors_filtered[row['index']] = counts
    
vocabulary_filtered = list(sorted(vocabulary_filtered))
inv_vocabulary_filtered = {
    term: ix 
    for ix, term in enumerate(vocabulary_filtered)
}
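# The filtered vectors reuse the index space of the unfiltered vocabulary,
# so their indices can be looked up in `vocabulary` and compared with `vectors`.
vocab_len_filtered = len(vocabulary)

for index in vectors_filtered:
    terms, values = zip(*vectors_filtered[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors_filtered[index] = \
        SparseVector(indices, values, vocab_len_filtered)

vectors_filtered[42]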

SparseVector({630: 1, 678: 1, 952: 1, 1031: 1, 1044: 1, 1203: 1, 1348: 1,
1793: 1, 2828: 1, 3264: 3, 4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 
8532: 1, 9570: 1, 11031: 1, 11731: 1, 12509: 1, 13555: 1, 13772: 1, 
14918: 1, 15205: 1, 15350: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 
17627: 1, 17798: 1, 17931: 2, 18178: 1, 18505: 1, 18776: 1, 20475: 1, 
20594: 1, 20782: 1, 21831: 1, 21856: 1, 21907: 1, 22560: 1, 22565: 2, 
22717: 1, 23714: 1, 23813: 1, 24145: 1, 25937: 1, 26437: 1, 26438: 1, 
26592: 1, 26674: 1, 26679: 1, 27109: 1, 27491: 2, 28583: 1, 28864: 1, 
29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1, 
34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35724: 1, 36136: 1, 
36596: 1, 36672: 1, 37048: 1, 37872: 1, 37891: 1, 37949: 1, 38002: 1, 
38224: 1, 38225: 2, 38226: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 
41239: 1, 41240: 1, 41276: 1, 41731: 1}, 42624)

vocabulary[3264]

'bake'

vocabulary[38226]

'timmons'

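The words "bake" and "timmons" seem more informative. You should explore your data when determining which words to include in the stop-word list.

Listing all the words we do not want can seem like a daunting task. However, recalling what we discussed about morphology, we can narrow down what we want to remove. We want to remove unbound function morphemes.

A fluent speaker of a language who knows these basics of morphology can create a reasonably good list. However, this still leaves two concerns. What if we need to keep some common words? What if we want to remove some common lexical morphemes? You can modify the list to handle both, but a further concern remains. How do we handle queries like "fictional cats"? The word "fictional" is less common than "cats", so it makes sense that the former should matter more in determining which documents are returned. Let's look at how we can implement this using our data.

Inverse Document Frequency

Instead of manually editing our vocabulary, we can try to weight the words. We need some way to weight words by their "commonness". One way to define "commonness" is by counting the number of documents in our corpus that contain the word. This is generally called document frequency. We want words with high document frequency to be down-weighted, so we are interested in using inverse document frequency (IDF).

We multiply these values by the term frequency, which is the frequency of a word in a given document. The result of multiplying inverse document frequency by term frequency is TF.IDF.

The most common variant of TF.IDF uses smoothed logarithms.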
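One commonly used smoothed-log form can be written as follows. The notation is an assumption introduced here for illustration: f_{t,d} is the number of times term t occurs in document d, n_t is the number of documents that contain t, and N is conventionally the number of documents in the corpus.

```latex
\mathrm{tf}(t, d) = \log\bigl(1 + f_{t,d}\bigr), \qquad
\mathrm{idf}(t) = \log\frac{N}{1 + n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

The added 1 in tf keeps a single occurrence from being mapped to zero, and the added 1 in idf avoids division by zero for a term that occurs in no document.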

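Let's calculate it with our vectors. We actually already have the term frequencies, so all we need to do is calculate the idf, transform the values with log, and then multiply tf and idf.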

idf = Counter()

for vector in vectors.values():
    idf.update(vector.indices)

for ix, count in idf.most_common(20):
    print('{:5d} {:20s} {:d}'.format(ix, vocabulary[ix], count))

11031 date                 2000
15475 from                 2000
23813 messageid            2000
26438 newsgroups           2000
28583 path                 2000
36672 subject              2000
21907 lines                1993
27897 organization         1925
37876 the                  1874
 1793 apr                  1861
 3598 be                   1837
38317 to                   1767
27500 of                   1756
   56 a                    1730
16266 gmt                  1717
18329 i                    1708
18730 in                   1695
 1396 and                  1674
15166 for                  1474
17238 have                 1459

We can now make idf a SparseVector. We know it contains all the words, so it will not actually be sparse, but this will help us implement the next steps.

indices, values = zip(*idf.items())
idf = SparseVector(indices, values, vocab_len)

from math import log

for index, vector in vectors.items():
    vector.values = list(map(lambda v: log(1+v), vector.values))
    
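# Smoothed log IDF. Note that the numerator here is the vocabulary size
# (vocab_len); the textbook definition uses the number of documents instead.
idf.values = list(map(lambda v: log(vocab_len / (1+v)), idf.values))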

tfidf = {index: tf.hadamard(idf) for index, tf in vectors.items()}

tfidf[42]

SparseVector({56: 2.2206482367540246, 630: 5.866068667810157, 
678: 5.793038323439593, 937: 2.7785503981772224, 952: 5.157913986067814, 
..., 
41731: 2.4998956290056062, 42384: 3.8444034764394415}, 42624)
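The same arithmetic can be checked by hand on a corpus of three short documents. This is a minimal sketch with made-up data and names (toy_corpus, doc_freq, tfidf_weights); unlike the listing above, it assumes the number of documents as the numerator of the IDF.

```python
from collections import Counter
from math import log

# Hypothetical toy corpus, already tokenized.
toy_corpus = {
    0: "the cat sat on the mat".split(),
    1: "the dog sat".split(),
    2: "the hausmann letter".split(),
}

n_docs = len(toy_corpus)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter()
for tokens in toy_corpus.values():
    doc_freq.update(set(tokens))


def tfidf_weights(tokens):
    """Smoothed-log TF times smoothed-log IDF for each term of one document."""
    counts = Counter(tokens)
    return {
        term: log(1 + count) * log(n_docs / (1 + doc_freq[term]))
        for term, count in counts.items()
    }


weights = tfidf_weights(toy_corpus[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print('{:5s} {: .4f}'.format(term, weight))
```

In document 0, "the" is the most frequent term, but because it occurs in all three documents its weight drops below zero, while "cat", "mat", and "on", each found in only one document, rank highest.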

Let's look at the TF.IDF values for "be" and "the". Let's also look at one of the terms whose TF.IDF is higher than that of these common words.

tfidf[42][3598] # be

4.358148273729854

tfidf[42][37876] # the

4.3305185461380855

vocabulary[17236], tfidf[42][17236]

('hausmann', 10.188396765921954)

Let's look at the document to get an idea of why this word is so important.

print(doc_index.loc[42]['text'])

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard...
From: timmbake@mcl.ucsb.edu (Bake Timmons)
Newsgroups: alt.atheism
Subject: Re: Amusing atheists and agnostics
Message-ID: <timmbake.735285604@mcl>
Date: 20 Apr 93 06:00:04 GMT
Sender: news@ucsbcsl.ucsb.edu
Lines: 32

Maddi Hausmann chirps:

>timmbake@mcl.ucsb.edu (Bake Timmons) writes: >

...

>"Killfile" Keith Allen Schneider = Frank "Closet Theist" O'Dwyer = ...

= Maddi "The Mad Sound-O-Geek" Hausmann

...whirrr...click...whirrr

--
Bake Timmons, III
...

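We can see that the document is talking about a person named "Maddi Hausmann".

In Spark

Spark has stages for calculating TF.IDF in MLlib. If you have a column that contains arrays of strings, you can use either CountVectorizer, with which we are already familiar, or HashingTF to get the tf values. HashingTF uses the hashing trick, in which you decide on the size of the vector space beforehand and then hash the words into that vector space. If there is a collision, the colliding words are counted as the same term. This lets you trade off memory efficiency against accuracy. As you make the predetermined vector space larger, the output vectors become larger, but the chance of collisions decreases.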
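The hashing trick itself fits in a few lines of plain Python. The sketch below is not Spark's HashingTF: the function hashing_tf is hypothetical, and zlib.crc32 is a stand-in chosen only because it gives the same result on every run.

```python
import zlib


def hashing_tf(tokens, num_features):
    """Count terms into a fixed-size vector, indexed by hash(term) % size."""
    vec = [0] * num_features
    for token in tokens:
        # crc32 is used only because it is deterministic across runs;
        # it is a stand-in, not the hash function Spark uses.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vec[bucket] += 1
    return vec


tokens = "the cat sat on the mat".split()
small = hashing_tf(tokens, num_features=4)
large = hashing_tf(tokens, num_features=1024)

print(small)
print(sum(1 for v in large if v), "non-empty buckets out of", len(large))
```

With five distinct words and only four buckets, at least two words must share a bucket, so their counts are merged. The larger vector leaves more room to keep them apart, at the cost of more dimensions. No vocabulary has to be built or stored in either case.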

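Now that we know how to turn a document into a vector, in the next chapter we can explore how to use that vector in classic machine learning tasks.

Exercises

Now that we have calculated the TF.IDF values, let's build a search function. First, we need a function to process the query.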

def process_query(query, pipeline):
    data = spark.createDataFrame([(query,)], ['text'])
    return pipeline.transform(data).first()['normalized']

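Then we need a function to get the filter set.

def get_filter_set(processed_query):
    filter_set = set()
    # find all the documents that contain any of the terms
    return filter_set

Next, we need a function that computes the score for a document.

def get_score(index, terms):
    return # return a single score

We also need a function to display the results.

def display(index, score, terms):
    # keep only the query terms that are in the vocabulary and have
    # a positive TF.IDF weight in this document
    hits = [
        term for term in terms
        if term in inv_vocabulary
        and tfidf[index][inv_vocabulary[term]] > 0.
    ]
    print('terms', terms, 'hits', hits)
    print('score', score)
    print('path', doc_index.loc[index]['path'])
    print('length', len(doc_index.loc[index]['text']))

Finally, we are ready for our search function.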

def search(query, pipeline, k=5):
    processed_query = process_query(query, pipeline)
    filter_set = get_filter_set(processed_query)
    scored = {index: get_score(index, processed_query) for index in filter_set}
    display_list = list(sorted(filter_set, key=scored.get, reverse=True))[:k]
    for index in display_list:
        display(index, scored[index], processed_query)

search('search engine', pipeline)

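You should be able to implement get_filter_set and get_score easily using the examples in this chapter. Try out a few queries. You will likely notice two big limitations: there is no N-gram support, and the ranker is biased toward longer documents. What could you modify to fix these problems?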
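As one possible direction for the length bias, the sketch below divides each score by the L2 norm of the document's weight vector. It is an assumption about how the exercise could be approached, not the chapter's solution; the weights in toy_tfidf and the helpers toy_filter_set and toy_score are made up and work on plain dictionaries rather than on the tfidf vectors built earlier.

```python
from math import sqrt

# Hypothetical TF.IDF weights per document: doc id -> {term: weight}.
toy_tfidf = {
    0: {"search": 1.2, "engine": 0.8},
    1: {"search": 1.2, "engine": 0.8, "car": 2.0, "oil": 2.0, "repair": 2.0},
    2: {"car": 2.0},
}


def toy_filter_set(terms):
    """Ids of all documents that contain at least one query term."""
    return {i for i, weights in toy_tfidf.items()
            if any(t in weights for t in terms)}


def toy_score(index, terms):
    """Sum of query-term weights divided by the document vector's L2 norm."""
    weights = toy_tfidf[index]
    norm = sqrt(sum(w * w for w in weights.values()))
    return sum(weights.get(t, 0.0) for t in terms) / norm


terms = ["search", "engine"]
ranked = sorted(toy_filter_set(terms),
                key=lambda i: toy_score(i, terms), reverse=True)
print(ranked)
```

Documents 0 and 1 carry identical weights for the query terms, so an unnormalized sum would tie them. After normalization, document 0, which is only about the query terms, ranks ahead of the longer document 1.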
