Training Word Vectors on English Wikipedia with Gensim

I have recently been running relation classification experiments on SemEval 2010 Task 8, mainly reimplementing the model from the paper A neural network framework for relation extraction: Learning entity semantic and relation pattern. Unfortunately my results are about two points below those reported in the paper, which is rather depressing, and I suspect the word vectors might be to blame. The paper uses word vectors trained on the English Wikipedia corpus, so there was nothing for it but to train my own...


1. Data Preparation

First, download the latest English Wikipedia dump from wiki dumps. I downloaded it on 2017-01-17; the compressed XML archive is 12.7 GB.

1.1 Converting to Plain Text

I had previously seen an experiment that processed the dump with process_wiki.py plus Gensim, but Gensim does not split sentences by default, so after processing this way you end up with extremely long "sentences", which may hurt the quality of the word vectors.
So I made a few changes to the source code. Gensim's WikiCorpus module lives in the following file:
python3.5/dist-packages/gensim/corpora/wikicorpus.py
Copy wikicorpus.py out of that directory; the changes are mainly to the process_article function and the WikiCorpus class. NLTK's sent_tokenize module is used for sentence splitting, and ARTICLE_MIN_WORDS is changed to 10 (after the modification it actually acts as the minimum sentence length). The modified process_wiki.py and wikicorpus.py are given in Appendix 1 and Appendix 2 respectively.
Once that is done, run python3 process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text. The result is a plain-text file wiki.en.text of roughly 13 GB with one sentence per line (it took about 100 minutes on a 24-core machine). As the sample below shows, this is much nicer than the output without sentence splitting, but all punctuation is gone; I have not figured out how to keep it, so suggestions are welcome (the short sketch after the sample shows where the punctuation is dropped).

Anarchism is a political philosophy that advocates self governed societies based on voluntary institutions
These are often described as stateless societies although several authors have defined them more specifically as institutions based on non hierarchical free associations
Anarchism holds the state to be undesirable unnecessary and harmful
While anti statism is central anarchism entails opposing authority or hierarchical organisation in the conduct of all human relations including but not limited to the state system
Anarchism does not offer a fixed body of doctrine from a single particular world view instead fluxing and flowing as a philosophy
Many types and traditions of anarchism exist not all of which are mutually exclusive
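
To make the key change concrete, here is a minimal sketch of the sentence-splitting step that the modified process_article performs (the sample text is only for illustration, and NLTK's punkt model must be downloaded first with nltk.download('punkt')). It also shows where the punctuation goes: each sentence is afterwards passed through tokenize(), which keeps only alphabetic tokens.

# -*- encoding:utf-8 -*-
# Minimal sketch of the sentence-splitting step added to process_article.
# Requires the NLTK punkt model: nltk.download('punkt'). The sample text is
# only for illustration.
from nltk.tokenize import sent_tokenize
from gensim import utils

article = ("Anarchism is a political philosophy that advocates self-governed societies "
           "based on voluntary institutions. These are often described as stateless societies.")

for sentence_str in sent_tokenize(article):
    # The modified wikicorpus.tokenize() wraps gensim's utils.tokenize, which keeps
    # only alphabetic tokens -- this is the step where the punctuation disappears.
    tokens = list(utils.tokenize(sentence_str, lower=False, errors='ignore'))
    print(tokens)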

2. Training Word Vectors

Once the wiki corpus has been converted to plain text, we can start training word vectors with Gensim's Word2Vec module, using the script train_word2vec_model.py below. Run python3 train_word2vec_model.py wiki.en.text model.bin. Training takes quite a while; it had only reached 2.05% by the time I wrote this post.

# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing
from gensim.models.word2vec import LineSentence
from time import time
from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()

    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python3 train_word2vec_model.py <input_corpus> <output_model>")
        sys.exit(1)
    inp, outp = sys.argv[1:3]  # corpus path and path to save model

    # sg=0 -> CBOW; 300-dimensional vectors, window 5, min_count 5, 35 training iterations
    model = Word2Vec(sg=0, sentences=LineSentence(inp), size=300, window=5, min_count=5,
            workers=16, iter=35)

    # trim unneeded model memory = use(much) less RAM
    #model.init_sims(replace=True)
    model.save_word2vec_format(outp, binary=True)

    print('done in %ds!' % (time()-t0))
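
After training completes, the binary model can be loaded back with Gensim for a quick sanity check, for example (a small sketch using the same pre-1.0 Gensim API as the scripts above; 'king' is just an example query word):

# -*- coding: utf-8 -*-
# Quick sanity check of the trained vectors (same old-style Gensim API as above).
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('model.bin', binary=True)
print(model['king'].shape)                 # (300,) with the settings used above
print(model.most_similar('king', topn=5))  # nearest neighbours in the vector space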

3. Saving the Word Vectors in pkl Format

Loading word vectors from a txt file is painfully slow. The bin file produced by Gensim is much faster, but it still has to be loaded through Gensim, which is inconvenient. When I use word vectors in Python, I usually keep them in a dictionary whose keys are words and whose values are the corresponding word vectors (numpy arrays). If that dictionary is dumped directly to a file in binary form, loading becomes much more convenient and quite a bit faster. The pickle module is used for this; the program is as follows:

# -*- encoding:utf-8 -*-
import pickle
import numpy as np
from time import time
from gensim.models import Word2Vec

if __name__ == '__main__':
    t0 = time()

    word_weights = {}
    model = Word2Vec.load_word2vec_format('model.bin', binary=True)
    for word in model.vocab:
        word_weights[word] = model[word]
    with open('model.pkl','wb') as file:
        pickle.dump(word_weights, file)

    print('Done in %ds!' % (time()-t0))
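
Loading the resulting pkl file later only takes a couple of lines, for example (assuming model.pkl was produced by the script above):

# -*- encoding:utf-8 -*-
# Load the pickled {word: vector} dictionary produced by the script above.
import pickle

with open('model.pkl', 'rb') as f:
    word_weights = pickle.load(f)

print(len(word_weights))           # vocabulary size
print(word_weights['king'][:5])    # first five dimensions of one word vector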

4. Summary

The main change here is to the gensim module: while processing the wiki text, NLTK is used to split each article into sentences.

Appendix 1: process_wiki.py

# -*- encoding:utf-8 -*-
import logging
import os.path
import sys

from wikicorpus import WikiCorpus  # note: import the modified local module, not gensim.corpora.wikicorpus

# added by ljx: decode the utf8 bytestrings produced by wikicorpus.tokenize back to str
def decode_text(text):
    words = []
    for w in text:
        words.append(w.decode('utf-8'))
    return words

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python3 process_wiki.py <wiki_dump.xml.bz2> <output_text_file>")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(decode_text(text)) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " sentences")

    output.close()
    logger.info("Finished Saved " + str(i) + " sentences")

Appendix 2: wikicorpus.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html


"""
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

If you have the `pattern` package installed, this module will use a fancy
lemmatization to get a lemma of each token (instead of plain alphabetic
tokenizer). The package is available at https://github.com/clips/pattern .

See scripts/process_wiki.py for a canned (example) script based on this
module.
"""


import bz2
import logging
import re
from xml.etree.cElementTree import iterparse  # LXML isn't faster, so let's go with the built-in solution
import multiprocessing

from gensim import utils

# cannot import whole gensim.corpora, because that imports wikicorpus...
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.textcorpus import TextCorpus

from nltk.tokenize import sent_tokenize, word_tokenize

logger = logging.getLogger('gensim.corpora.wikicorpus')

# ignore sentences shorter than ARTICLE_MIN_WORDS words (after full preprocessing)
ARTICLE_MIN_WORDS = 10


RE_P0 = re.compile('<!--.*?-->', re.DOTALL | re.UNICODE)  # comments
RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)  # footnotes
RE_P2 = re.compile("(\n\[\[[a-z][a-z][\w-]*:[^:\]]+\]\])+$", re.UNICODE)  # links to languages
RE_P3 = re.compile("{{([^}{]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P4 = re.compile("{{([^}]*)}}", re.DOTALL | re.UNICODE)  # template
RE_P5 = re.compile('\[(\w+):\/\/(.*?)(( (.*?))|())\]', re.UNICODE)  # remove URL, keep description
RE_P6 = re.compile("\[([^][]*)\|([^][]*)\]", re.DOTALL | re.UNICODE)  # simplify links, keep description
RE_P7 = re.compile('\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of images
RE_P8 = re.compile('\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)  # keep description of files
RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL | re.UNICODE)  # outside links
RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL | re.UNICODE)  # math content
RE_P11 = re.compile('<(.*?)>', re.DOTALL | re.UNICODE)  # all other tags
RE_P12 = re.compile('\n(({\|)|(\|-)|(\|}))(.*?)(?=\n)', re.UNICODE)  # table formatting
RE_P13 = re.compile('\n(\||\!)(.*?\|)*([^|]*?)', re.UNICODE)  # table cell formatting
RE_P14 = re.compile('\[\[Category:[^][]*\]\]', re.UNICODE)  # categories
# Remove File and Image template
RE_P15 = re.compile('\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)

# MediaWiki namespaces (https://www.mediawiki.org/wiki/Manual:Namespace) that
# ought to be ignored
IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template',
                      'MediaWiki', 'User', 'Help', 'Book', 'Draft',
                      'WikiProject', 'Special', 'Talk']


def filter_wiki(raw):
    """
    Filter out wiki mark-up from `raw`, leaving only text. `raw` is either unicode
    or utf-8 encoded string.
    """
    # parsing of the wiki markup is not perfect, but sufficient for our purposes
    # contributions to improving this code are welcome :)
    text = utils.to_unicode(raw, 'utf8', errors='ignore')
    text = utils.decode_htmlentities(text)  # '&amp;nbsp;' --> '\xa0'
    return remove_markup(text)


def remove_markup(text):
    text = re.sub(RE_P2, "", text)  # remove the last list (=languages)
    # the wiki markup is recursive (markup inside markup etc)
    # instead of writing a recursive grammar, here we deal with that by removing
    # markup in a loop, starting with inner-most expressions and working outwards,
    # for as long as something changes.
    text = remove_template(text)
    text = remove_file(text)
    iters = 0
    while True:
        old, iters = text, iters + 1
        text = re.sub(RE_P0, "", text)  # remove comments
        text = re.sub(RE_P1, '', text)  # remove footnotes
        text = re.sub(RE_P9, "", text)  # remove outside links
        text = re.sub(RE_P10, "", text)  # remove math content
        text = re.sub(RE_P11, "", text)  # remove all remaining tags
        text = re.sub(RE_P14, '', text)  # remove categories
        text = re.sub(RE_P5, '\\3', text)  # remove urls, keep description
        text = re.sub(RE_P6, '\\2', text)  # simplify links, keep description only
        # remove table markup
        text = text.replace('||', '\n|')  # each table cell on a separate line
        text = re.sub(RE_P12, '\n', text)  # remove formatting lines
        text = re.sub(RE_P13, '\n\\3', text)  # leave only cell content
        # remove empty mark-up
        text = text.replace('[]', '')
        if old == text or iters > 2:  # stop if nothing changed between two iterations or after a fixed number of iterations
            break

    # the following is needed to make the tokenizer see '[[socialist]]s' as a single word 'socialists'
    # TODO is this really desirable?
    text = text.replace('[', '').replace(']', '')  # promote all remaining markup to plain text
    return text


def remove_template(s):
    """Remove template wikimedia markup.

    Return a copy of `s` with all the wikimedia markup template removed. See
    http://meta.wikimedia.org/wiki/Help:Template for wikimedia templates
    details.

    Note: Since templates can be nested, it is difficult to remove them using
    regular expressions.
    """

    # Find the start and end position of each template by finding the opening
    # '{{' and closing '}}'
    n_open, n_close = 0, 0
    starts, ends = [], []
    in_template = False
    prev_c = None
    for i, c in enumerate(iter(s)):
        if not in_template:
            if c == '{' and c == prev_c:
                starts.append(i - 1)
                in_template = True
                n_open = 1
        if in_template:
            if c == '{':
                n_open += 1
            elif c == '}':
                n_close += 1
            if n_open == n_close:
                ends.append(i)
                in_template = False
                n_open, n_close = 0, 0
        prev_c = c

    # Remove all the templates
    s = ''.join([s[end + 1:start] for start, end in
                 zip(starts + [None], [-1] + ends)])

    return s


def remove_file(s):
    """Remove the 'File:' and 'Image:' markup, keeping the file caption.

    Return a copy of `s` with all the 'File:' and 'Image:' markup replaced by
    their corresponding captions. See http://www.mediawiki.org/wiki/Help:Images
    for the markup details.
    """
    # The regex RE_P15 matches a File: or Image: markup
    for match in re.finditer(RE_P15, s):
        m = match.group(0)
        caption = m[:-2].split('|')[-1]
        s = s.replace(m, caption, 1)
    return s


def tokenize(content):  # modified: keep case and single-character tokens
    """
    Tokenize a piece of text from wikipedia. The input string `content` is assumed
    to be mark-up free (see `filter_wiki()`).

    Return list of tokens as utf8 bytestrings. Ignore words longer than
    15 characters (not bytes!).
    """
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
    return [token.encode('utf8') for token in utils.tokenize(content, lower=False, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]


def get_namespace(tag):
    """Returns the namespace of tag."""
    m = re.match("^{(.*?)}", tag)
    namespace = m.group(1) if m else ""
    if not namespace.startswith("http://www.mediawiki.org/xml/export-"):
        raise ValueError("%s not recognized as MediaWiki dump namespace"
                         % namespace)
    return namespace
_get_namespace = get_namespace


def extract_pages(f, filter_namespaces=False):
    """
    Extract pages from a MediaWiki database dump = open file-like object `f`.

    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.

    """
    elems = (elem for _, elem in iterparse(f, events=("end",)))

    # We can't rely on the namespace for database dumps, since it's changed
    # every time a small modification to the format is made. So, determine
    # those from the first element we find, which will be part of the metadata,
    # and construct element paths.
    elem = next(elems)
    namespace = get_namespace(elem.tag)
    ns_mapping = {"ns": namespace}
    page_tag = "{%(ns)s}page" % ns_mapping
    text_path = "./{%(ns)s}revision/{%(ns)s}text" % ns_mapping
    title_path = "./{%(ns)s}title" % ns_mapping
    ns_path = "./{%(ns)s}ns" % ns_mapping
    pageid_path = "./{%(ns)s}id" % ns_mapping

    for elem in elems:
        if elem.tag == page_tag:
            title = elem.find(title_path).text
            text = elem.find(text_path).text

            if filter_namespaces:
                ns = elem.find(ns_path).text
                if ns not in filter_namespaces:
                    text = None

            pageid = elem.find(pageid_path).text
            yield title, text or "", pageid     # empty page will yield None

            # Prune the element tree, as per
            # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            # except that we don't need to prune backlinks from the parent
            # because we don't use LXML.
            # We do this only for <page>s, since we need to inspect the
            # ./revision/text element. The pages comprise the bulk of the
            # file, so in practice we prune away enough.
            elem.clear()
_extract_pages = extract_pages  # for backward compatibility


def process_article(args):  # modified: split the article into sentences with NLTK
    """
    Parse a wikipedia article, returning its content as a list of sentences,
    each a list of tokens (utf8-encoded strings).
    """
    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    sentences = []
    sentences_str = sent_tokenize(text)
    for sentence_str in sentences_str:
        sentences.append(tokenize(sentence_str))
    return sentences, title, pageid


class WikiCorpus(TextCorpus):  # modified: get_texts() yields sentences instead of whole articles
    """
    Treat a wikipedia articles dump (\*articles.xml.bz2) as a (read-only) corpus.

    The documents are extracted on-the-fly, so that the whole (massive) dump
    can stay compressed on disk.

    >>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
    >>> MmCorpus.serialize('wiki_en_vocab200k.mm', wiki) # another 8h, creates a file in MatrixMarket format plus file with id->word

    """
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None, filter_namespaces=('0',)):
        """
        Initialize the corpus. Unless a dictionary is provided, this scans the
        corpus once, to determine its vocabulary.

        If `pattern` package is installed, use fancier shallow parsing to get
        token lemmas. Otherwise, use simple regexp tokenization. You can override
        this automatic logic by forcing the `lemmatize` parameter explicitly.

        """
        self.fname = fname
        self.filter_namespaces = filter_namespaces
        self.metadata = False
        if processes is None:
            processes = max(1, multiprocessing.cpu_count() - 1)
        self.processes = processes
        self.lemmatize = lemmatize
        if dictionary is None:
            self.dictionary = Dictionary(self.get_texts())
        else:
            self.dictionary = dictionary

    def get_texts(self):
        """
        Iterate over the dump, returning each sentence of each article as a list
        of tokens.

        Only sentences with at least ARTICLE_MIN_WORDS tokens are returned; pages
        in ignored namespaces, redirects etc. are skipped.

        Note that this iterates over the **texts**; if you want vectors, just use
        the standard corpus interface instead of this function::

        >>> for vec in wiki_corpus:
        >>>     print(vec)
        """
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        # process the corpus in smaller chunks of docs, because multiprocessing.Pool
        # is dumb and would load the entire input into RAM at once...
        for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
            for sentences, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                articles_all += 1
                positions_all += len(sentences)
                # article redirects and short stubs are pruned here
                if any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                    continue
                for sentence in sentences:
                    if len(sentence) < ARTICLE_MIN_WORDS:
                        continue
                    articles += 1
                    positions += len(sentence)
                    yield sentence
        pool.terminate()

        logger.info(
            "finished iterating over Wikipedia corpus of %i documents with %i positions"
            " (total %i articles, %i positions before pruning articles shorter than %i words)",
            articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
        self.length = articles  # cache corpus length
# endclass WikiCorpus
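
To check that the modified WikiCorpus really yields one sentence at a time, something like the following can be run against the dump (a sketch; it only prints the first few sentences and then stops):

# -*- encoding:utf-8 -*-
# Sketch: print the first few sentences produced by the modified WikiCorpus
# to verify that the sentence splitting works as intended.
from itertools import islice
from wikicorpus import WikiCorpus  # the modified module from Appendix 2

wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})
for sentence in islice(wiki.get_texts(), 5):
    # tokens are utf8 bytestrings, so decode them before printing
    print(b' '.join(sentence).decode('utf-8'))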