NLP中如何使用预训练的embeddings

最新推荐文章于 2024-08-08 08:06:03 发布

BruceCheen

最新推荐文章于 2024-08-08 08:06:03 发布

阅读量1.5k

点赞数 3

分类专栏： NLP

NLP 专栏收录该内容

21 篇文章 5 订阅

订阅专栏

本文探讨了如何在NLP任务中有效利用预训练的词嵌入，如GoogleNews，来提高模型性能。文章指出在使用预训练嵌入时不应进行词干提取或删除停用词，以保留有价值的信息。通过改进预处理步骤，如处理标点符号、数字哈希以及常见的拼写错误，可以显著提高嵌入覆盖率，从而提升模型对文本的理解能力。

摘要由CSDN通过智能技术生成

接上一篇

问题描述：

今天任何一个主要网站的存在问题是如何处理有毒（toxic）和分裂（divisive）的内容。 Quora希望正面（head-on）解决（tackle）这个问题，让他们的平台成为用户可以安全地与世界分享知识的地方。

Quora是一个让人们相互学习的平台。在Quora上，人们可以提出问题，并与提供独特见解和质量回答（unique insights and quality answers）的其他人联系。一个关键的挑战是淘汰（weed out）虚假的问题 - 那些建立在虚假前提（false premises）下的问题，或者打算发表声明而不是寻求有用答案的问题。

在本次比赛中，Kagglers将开发识别和标记虚假问题（flag insincere questions）的模型。到目前为止（To date），Quora已经使用机器学习和人工审查（manual review）来解决这个问题（address this problem）。在您的帮助下，他们可以开发更具可扩展性的方法（develop more scalable methods）来检测有毒和误导性内容（detect toxic and misleading content）。

这是你大规模对抗在线巨魔（combat online trolls at scale）的机会。帮助Quora坚持（uphold）“善良，尊重”（Be Nice, Be Respectful）的政策，继续成为分享和发展世界知识的地方。

Important Note：（注意）
请注意，这是作为a Kernels Only Competition运行，要求所有submissions都通过Kernel output进行。请仔细阅读内核常见问题解答和数据页面，以充分了解其设计方法。

Data Description（数据描述）

在本次比赛中，您将预测Quora上提出的问题是否真诚（sincere）。

一个虚伪的（insincere）问题被定义为一个旨在发表声明而不是寻找有用答案的问题。一些可以表明问题虚伪（insincere）的特征：

具有非中性语气（Has a non-neutral tone）
- 夸张的语气（exaggerated tone）强调了一群人的观点
- 是修辞（rhetorical）的，意味着暗示（meant to imply）关于一群人的陈述
是贬低（disparaging）或煽动性的（inflammatory）
- 建议针对受保护阶层的人提出歧视性（discriminatory）观点，或寻求确认陈规定型观念（confirmation of a stereotype）
- 对特定的人或一群人进行贬低（disparaging）的攻击/侮辱（attacks/insults）
- 基于关于一群人的古怪前提（outlandish premise）
- 贬低（Disparages）不可修复（fixable）且无法衡量（measurable）的特征
不是基于现实（Isn’t grounded in reality）
- 基于虚假信息（false information），或包含荒谬的假设（absurd assumptions）
使用性内容（乱伦incest，兽交bestiality，恋童癖pedophilia）来获得震撼价值，而不是寻求真正的（genuine）答案

训练数据包括被询问的问题（question that was asked），以及是否被识别为真诚的（insincere）（target=1）。真实（ground-truth）标签包含一些噪音：它们不能保证是完美的。

请注意，数据集中问题的分布不应被视为代表Quora上提出的问题的分布。部分原因是由于采样程序和已应用于最终数据集的消毒（sanitization）措施的组合。

Data fields（数据域描述）

qid - 唯一的问题标识符
question_text - Quora问题文本
target - 标记为“insincere”的问题的值为1，否则为0

这是仅限内核的比赛（Kernels-only competition）。此数据部分中的文件可供下载，以便在阶段1中参考。阶段2文件仅在内核中可用且无法下载。

比赛的第二阶段会有什么？

在比赛的第二阶段，我们将重新运行您选择的内核。以下文件将与新数据交换：

test.csv - 这将与完整的公共和私有测试数据集交换。该文件在阶段1中具有_{56k行，在阶段2中具有}376k行。两个版本的公共页首数据保持相同。文件名将相同（均为test.csv）以确保您的代码将运行。
sample_submission.csv - 类似于test.csv，这将从第1阶段的_{56k变为第2阶段的}376k行。文件名将保持不变。

Embeddings

本次比赛不允许使用外部数据源。但是，我们提供了许多字嵌入以及可以在模型中使用的数据集。这些如下：

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

####################################################

在这个kernel中，我想说明在构建深度学习NLP模型时我是如何进行有意义的预处理的（meaningful preprocessing）。

我从两条黄金法则（golden rules）开始：

当您有预先训练好的嵌入（pre-trained embeddings）时，不要使用标准的预处理步骤，如词干（stemming）或删除词（stopword removal）

你们中的一些人在进行基于单词计数的特征提取（例如TFIDF）时可能会使用标准的预处理步骤，例如删除停用词，词干等（removing stopwords, stemming etc）。原因很简单：你丢失了有价值的信息，这有助于你的NN解决问题。

使您的词汇尽可能接近嵌入（embeddings）

我将专注于这款笔记本，如何实现这一目标（how to achieve that）。举一个例子，我采用GoogleNews预训练嵌入，没有更深层次的理由进行这种选择（there is no deeper reason for this choice）。

我们从一个巧妙的小技巧（neat little trick）开始，使我们能够在将函数应用于pandas Dataframe时看到进度条（progressbar ）。

import pandas as pd
from tqdm import tqdm
tqdm.pandas()

首先加载数据

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
print("Train shape : ",train.shape)
print("Test shape : ",test.shape)

在这里插入图片描述
我将使用以下函数来跟踪我们的训练词汇（track our training vocabulary），它贯穿（goes through）我们的所有文本并计算所包含单词的出现。（counts the occurance of the contained words）

def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

因此，让我们填充词汇表（populate the vocabulary）并显示前5个元素（the first 5 elements）及其计数（their count）。请注意，现在我们可以使用progess_apply查看进度条（progress bar）。

sentences = train["question_text"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)
print({k: vocab[k] for k in list(vocab)[:5]})

在这里插入图片描述
接下来，我们稍后将导入我们想要在我们的模型中使用的嵌入。为了说明（For illustration），我在这里使用GoogleNews。

from gensim.models import KeyedVectors

news_path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embeddings_index = KeyedVectors.load_word2vec_format(news_path, binary=True)

接下来我定义一个函数来检查我们的词汇表和嵌入之间的交集（the intersection between our vocabulary and the embeddings）。它将输出一个词汇表（oov）单词列表，我们可以用它来改进我们的预处理。

import operator 

def check_coverage(vocab,embeddings_index):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except:

            oov[word] = vocab[word]
            i += vocab[word]
            pass

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

oov = check_coverage(vocab,embeddings_index)

在这里插入图片描述

哎哟（Ouch ），只有24％的词汇会有嵌入，使21％的数据或多或少无用。所以让我们来看看并开始改进。为此，我们可以轻松查看顶级单词。（have a look at the top oov words）。

oov[:10]

在这里插入图片描述
首先是“to”。为什么？只是因为GoogleNews嵌入式培训时删除了“to”。（when the GoogleNews Embeddings were trained）我们稍后会修复此问题，因为现在我们会关注标点符号（punctuation）的拆分，因为这似乎也是一个问题。但是我们如何处理标点符号（punctuation） - 我们是要删除还是将其视为标记（token）？我会说：这取决于。如果标记（token）具有嵌入（embedding），请保留它，如果不存在，我们就不再需要它了。所以让我们检查：

'?' in embeddings_index

在这里插入图片描述

'&' in embeddings_index

在这里插入图片描述
有趣。虽然“＆”在Google新闻嵌入中，“？” 不是。所以我们基本上定义了一个分解“＆”并删除其他标点符号的函数。

def clean_text(x):

    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x

train["question_text"] = train["question_text"].progress_apply(lambda x: clean_text(x))
sentences = train["question_text"].apply(lambda x: x.split())
vocab = build_vocab(sentences)

在这里插入图片描述

oov = check_coverage(vocab,embeddings_index)

在这里插入图片描述

太好了！通过处理点样（ handling punctiation），我们能够将嵌入率（embeddings ratio）从24％提高到57％。好的，让我们检查一下其他的oov的词。

oov[:10]

在这里插入图片描述
嗯好像数字也是一个问题。让我们检查前10个嵌入以获得线索。（Lets check the top 10 embeddings to get a clue.）

for i in range(10):
    print(embeddings_index.index2entity[i])

在这里插入图片描述
嗯，为什么“##”在那里？仅仅因为作为重新处理（reprocessing），所有大于9的数字都被哈希替换。即15变为##而123变为###或15.80€变为##。##€。因此，让我们模仿（mimic）这个预处理步骤，以进一步改善我们的嵌入覆盖率（improve our embeddings coverage）。

import re

def clean_numbers(x):

    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

train["question_text"] = train["question_text"].progress_apply(lambda x: clean_numbers(x))
sentences = train["question_text"].progress_apply(lambda x: x.split())
vocab = build_vocab(sentences)

在这里插入图片描述

oov = check_coverage(vocab,embeddings_index)

在这里插入图片描述

太好了！再增加3％。现在和处理标点符号（puntuation）一样多，但每一点都有帮助。我们再来看看其他的oov词

oov[:20]

在这里插入图片描述
好了，现在我们处理（take care of）常见的（common）拼写错误（misspellings），当使用美国/英国词汇并用“社交媒体”替换一些“现代”单词来执行此任务时，我使用了一段时间以前在stack overflow时发现的多正则表达式（multi regex）脚本。此外，我们将简单地删除单词“a”，“to”，“and”和“of”，因为在训练GoogleNews嵌入时，这些单词显然已被下采样（downsampled）。

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


mispell_dict = {'colour':'color',
                'centre':'center',
                'didnt':'did not',
                'doesnt':'does not',
                'isnt':'is not',
                'shouldnt':'should not',
                'favourite':'favorite',
                'travelling':'traveling',
                'counselling':'counseling',
                'theatre':'theater',
                'cancelled':'canceled',
                'labour':'labor',
                'organisation':'organization',
                'wwii':'world war 2',
                'citicise':'criticize',
                'instagram': 'social medium',
                'whatsapp': 'social medium',
                'snapchat': 'social medium'

                }
mispellings, mispellings_re = _get_mispell(mispell_dict)

def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)

train["question_text"] = train["question_text"].progress_apply(lambda x: replace_typical_misspell(x))
sentences = train["question_text"].progress_apply(lambda x: x.split())
to_remove = ['a','to','of','and']
sentences = [[word for word in sentence if not word in to_remove] for sentence in tqdm(sentences)]
vocab = build_vocab(sentences)

在这里插入图片描述

oov = check_coverage(vocab,embeddings_index)

在这里插入图片描述
我们看到虽然我们将所有文本的嵌入量从89％提高到99％。让我们再次检查oov单词。

oov[:20]

在这里插入图片描述
看起来不错。没有明显的oov词我们可以快速解决。谢谢你的阅读和快乐的kaggling。

BruceCheen

关注

3
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录