2021SC@SDUSC
Introduction
This article analyzes the process_data data-processing module.
The read_input_file method
This method reads a file. Besides checking that the path exists, note the second argument to decode, "ignore", which tells the decoder to silently drop byte sequences that are not valid UTF-8; without it, decode raises a UnicodeDecodeError when it encounters such bytes.
import os
import codecs

def read_input_file(this_file):
    if os.path.exists(this_file):
        # read the raw bytes, then decode, skipping undecodable sequences
        with codecs.open(this_file, "rb") as f:
            b = f.read()
        text = b.decode('utf-8', 'ignore')
    else:
        text = None
    return text
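As a quick illustration of what "ignore" does, here is a minimal sketch with a deliberately corrupted byte string (the bytes are invented for the example):

broken = b'abc\xff\xfedef'                 # \xff and \xfe are not valid UTF-8
print(broken.decode('utf-8', 'ignore'))    # -> 'abcdef', the bad bytes are dropped
# broken.decode('utf-8')                   # would raise UnicodeDecodeError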
The read_gold_file method
This method reads the gold-standard keyword annotation file, collecting the keywords into the list gold_list.
def read_gold_file(this_gold):
    if os.path.exists(this_gold):
        # the with statement closes the file, so no explicit close is needed
        with codecs.open(this_gold, "rb") as f:
            b_list = f.readlines()
        gold_list = []
        # decode each line, again ignoring undecodable bytes
        for b in b_list:
            s = b.decode('utf-8', 'ignore')
            gold_list.append(s)
    else:
        gold_list = None
    return gold_list
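A brief usage sketch (the file name and contents are hypothetical): given a gold file with one keyword per line, the function returns the lines as a list; note that the trailing newline characters are kept:

# contents of gold/doc1.key (hypothetical):
#   neural network
#   deep learning
gold = read_gold_file('gold/doc1.key')
# -> ['neural network\n', 'deep learning\n'], or None if the path does not exist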
The word_tokenize method
This method performs word tokenization based on the NLTK tokenizer, with a selectable language.
A brief introduction to NLTK:
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
In short, NLTK is a platform for building Python natural-language-processing programs, providing classification, tokenization, stemming, tagging, parsing, semantic reasoning, and more.
# sent_tokenize and _treebank_word_tokenizer are defined in nltk.tokenize
def word_tokenize(text, language="english", preserve_line=False):
    """
    text may be a single sentence or a longer passage

    :param text: the source text
    :type text: str
    :param language: the Punkt corpus model to use for sentence splitting
    :type language: str
    :param preserve_line: if True, skip sentence splitting and tokenize the text as one line
    :type preserve_line: bool
    """
    # split into sentences first, unless preserve_line is set
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]
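For reference, a typical call looks like this (the sentence mirrors the example in NLTK's documentation; the Punkt sentence models must be downloaded first):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
print(word_tokenize("Good muffins cost $3.88 in New York."))
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']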
The filter_candidates method
This method filters candidate tokens against multiple criteria.
def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'):
Parameters
tokens: the collection of tokens to be filtered
stopwords_file: path to a stopword file; if None, NLTK's English stopword list is used
min_word_length: candidates shorter than this length are filtered out
valid_punctuation: tokens containing punctuation outside this set are filtered out; by default only the hyphen "-" is valid
encoding='utf-8': the encoding used when reading the stopword file
Detailed analysis
import re
import codecs
from string import punctuation

def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'):
    # if no stopword file is provided, load the stopwords from the NLTK corpus
    stopwords_list = []
    if stopwords_file is None:
        from nltk.corpus import stopwords
        stopwords_list = set(stopwords.words('english'))
    else:
        with codecs.open(stopwords_file, 'rb', encoding='utf-8') as f:
            # add the stopwords from the file to the stopwords_list container
            for line in f:
                stopwords_list.append(line.strip())
    # collect the indices of the tokens to be removed
    indices = []
    # iterate with both index and content
    for i, c in enumerate(tokens):
        # index of a stopword
        if c in stopwords_list:
            indices.append(i)
        # index of a token that fails the length requirement
        elif len(c) < min_word_length:
            indices.append(i)
        # Penn Treebank escape tokens for brackets
        elif c in ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']:
            indices.append(i)
        else:
            # index of a token made entirely of punctuation, or containing
            # characters other than letters, digits, and the valid punctuation
            letters_set = set(c)
            if letters_set.issubset(punctuation):
                indices.append(i)
            elif re.match(r'^[a-zA-Z0-9%s]*$' % valid_punctuation, c):
                pass
            else:
                indices.append(i)
    dels = 0
    # dels re-aligns the indices as earlier elements are deleted
    for index in indices:
        offset = index - dels
        del tokens[offset]
        dels += 1
    return tokens
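A small usage sketch (the token list is invented; NLTK's English stopwords must be available, e.g. via nltk.download('stopwords')):

tokens = ['the', 'state-of-the-art', 'neural', 'network',
          '-lrb-', 'cnn', '-rrb-', 'a', '!!', 'c++']
print(filter_candidates(tokens))
# -> ['state-of-the-art', 'neural', 'network', 'cnn']
# 'the' and 'a' are stopwords ('a' is also shorter than min_word_length),
# the bracket tokens and '!!' are pure punctuation, and 'c++' contains '+',
# which is not in valid_punctuation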
The MyCorpus class
Introduction
Iterates over the document collection under the given path and yields each document's token list as a bag-of-words vector (via dictionary.doc2bow).
Parameters
path_to_data: path to the document collection
dictionary: the mapping between words and ids
length: the number of documents
import itertools

class MyCorpus(object):

    def __init__(self, path_to_data, dictionary, length=None, encoding='utf-8'):
        """Initialize the parameters."""
        self.path_to_data = path_to_data
        self.dictionary = dictionary
        self.length = length
        self.encoding = encoding
        self.index_filename = {}

    def __iter__(self):
        index = 0
        # iter_data is a helper defined elsewhere in process_data; it yields
        # (filename, text, tokens) triples for each document under the path
        for filename, text, tokens in itertools.islice(iter_data(self.path_to_data, self.encoding), self.length):
            self.index_filename[index] = filename
            index += 1
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        if self.length is None:
            self.length = sum(1 for doc in self)
        return self.length
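A hedged usage sketch, assuming gensim is installed and that iter_data (defined elsewhere in process_data) yields (filename, text, tokens) triples; the path is hypothetical:

from gensim.corpora import Dictionary

# build the word-to-id mapping from the same token stream
dictionary = Dictionary(tokens for _, _, tokens in iter_data('data/docs', 'utf-8'))
corpus = MyCorpus('data/docs', dictionary)

for bow in corpus:
    # each document arrives as a sparse list of (token_id, count) pairs
    print(bow)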