2021SC@SDUSC
Introduction
This article analyzes the process_data data-processing module.
The read_input_file method
This method reads a file. Besides checking that the path exists, note the second argument to decode, "ignore", which tells the decoder to silently drop byte sequences that are not valid UTF-8; without it, decode raises a UnicodeDecodeError when it encounters such bytes.
import os
import codecs

def read_input_file(this_file):
    if os.path.exists(this_file):
        # read the raw bytes, then decode, skipping undecodable sequences
        with codecs.open(this_file, "rb") as f:
            b = f.read()
        text = b.decode('utf-8', 'ignore')
    else:
        text = None
    return text
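As a quick illustration of what "ignore" does, here is a minimal sketch with a deliberately corrupted byte string (the bytes are invented for the example):

broken = b'abc\xff\xfedef'                 # \xff and \xfe are not valid UTF-8
print(broken.decode('utf-8', 'ignore'))    # -> 'abcdef', the bad bytes are dropped
# broken.decode('utf-8')                   # would raise UnicodeDecodeError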
The read_gold_file method
This method reads the gold-standard keyword annotation file, collecting the keywords into the list gold_list.
def read_gold_file(this_gold):
    if os.path.exists(this_gold):
        # the with statement closes the file, so no explicit close is needed
        with codecs.open(this_gold, "rb") as f:
            b_list = f.readlines()
        gold_list = []
        # decode each line, again ignoring undecodable bytes
        for b in b_list:
            s = b.decode('utf-8', 'ignore')
            gold_list.append(s)
    else:
        gold_list = None
    return gold_list
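A brief usage sketch (the file name and contents are hypothetical): given a gold file with one keyword per line, the function returns the lines as a list; note that the trailing newline characters are kept:

# contents of gold/doc1.key (hypothetical):
#   neural network
#   deep learning
gold = read_gold_file('gold/doc1.key')
# -> ['neural network\n', 'deep learning\n'], or None if the path does not exist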
The word_tokenize method
This method performs word tokenization based on the NLTK tokenizer, with a selectable language.
A brief introduction to NLTK:
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
In short, NLTK is a platform for building Python natural-language-processing programs, providing classification, tokenization, stemming, tagging, parsing, semantic reasoning, and more.
# sent_tokenize and _treebank_word_tokenizer are defined in nltk.tokenize
def word_tokenize(text, language="english", preserve_line=False):
    """
    text may be a single sentence or a longer passage

    :param text: the source text
    :type text: str
    :param language: the Punkt corpus model to use for sentence splitting
    :type language: str
    :param preserve_line: if True, skip sentence splitting and tokenize the text as one line
    :type preserve_line: bool
    """
    # split into sentences first, unless preserve_line is set
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [
        token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
    ]
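For reference, a typical call looks like this (the sentence mirrors the example in NLTK's documentation; the Punkt sentence models must be downloaded first):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
print(word_tokenize("Good muffins cost $3.88 in New York."))
# ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']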
The filter_candidates method
This method filters candidate tokens against multiple criteria.
def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'):
Parameters
tokens: the collection of tokens to be filtered
stopwords_file: path to a stopword file; if None, NLTK's English stopword list is used
min_word_length: candidates shorter than this length are filtered out
valid_punctuation: tokens containing punctuation outside this set are filtered out; by default only the hyphen "-" is valid
encoding='utf-8': the encoding used when reading the stopword file
Detailed analysis
import re
import codecs
from string import punctuation

def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'):
    # if no stopword file is provided, load the stopwords from the NLTK corpus
    stopwords_list = []
    if stopwords_file is None:
        from nltk.corpus import stopwords
        stopwords_list = set(stopwords.words('english'))
    else:
        with codecs.open(stopwords_file, 'rb', encoding='utf-8') as f:
            # add the stopwords from the file to the stopwords_list container
            for line in f:
                stopwords_list.append(line.strip())
    # collect the indices of the tokens to be removed
    indices = []
    # iterate with both index and content
    for i, c in enumerate(tokens):
        # index of a stopword
        if c in stopwords_list:
            indices.append(i)
        # index of a token that fails the length requirement
        elif len(c) < min_word_length:
            indices.append(i)
        # Penn Treebank escape tokens for brackets
        elif c in ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']:
            indices.append(i)
        else:
            # index of a token made entirely of punctuation, or containing
            # characters other than letters, digits, and the valid punctuation
            letters_set = set(c)
            if letters_set.issubset(punctuation):
                indices.append(i)
            elif re.match(r'^[a-zA-Z0-9%s]*$' % valid_punctuation, c):
                pass
            else:
                indices.append(i)
    dels = 0
    # dels re-aligns the indices as earlier elements are deleted
    for index in indices:
        offset = index - dels
        del tokens[offset]
        dels += 1
    return tokens
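A small usage sketch (the token list is invented; NLTK's English stopwords must be available, e.g. via nltk.download('stopwords')):

tokens = ['the', 'state-of-the-art', 'neural', 'network',
          '-lrb-', 'cnn', '-rrb-', 'a', '!!', 'c++']
print(filter_candidates(tokens))
# -> ['state-of-the-art', 'neural', 'network', 'cnn']
# 'the' and 'a' are stopwords ('a' is also shorter than min_word_length),
# the bracket tokens and '!!' are pure punctuation, and 'c++' contains '+',
# which is not in valid_punctuation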
The MyCorpus class
Introduction
Iterates over the document collection under the given path and yields each document's token list as a bag-of-words vector (via dictionary.doc2bow).
Parameters
path_to_data: path to the document collection
dictionary: the mapping between words and ids
length: the number of documents
import itertools

class MyCorpus(object):

    def __init__(self, path_to_data, dictionary, length=None, encoding='utf-8'):
        """Initialize the parameters."""
        self.path_to_data = path_to_data
        self.dictionary = dictionary
        self.length = length
        self.encoding = encoding
        self.index_filename = {}

    def __iter__(self):
        index = 0
        # iter_data is a helper defined elsewhere in process_data; it yields
        # (filename, text, tokens) triples for each document under the path
        for filename, text, tokens in itertools.islice(iter_data(self.path_to_data, self.encoding), self.length):
            self.index_filename[index] = filename
            index += 1
            yield self.dictionary.doc2bow(tokens)

    def __len__(self):
        if self.length is None:
            self.length = sum(1 for doc in self)
        return self.length
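A hedged usage sketch, assuming gensim is installed and that iter_data (defined elsewhere in process_data) yields (filename, text, tokens) triples; the path is hypothetical:

from gensim.corpora import Dictionary

# build the word-to-id mapping from the same token stream
dictionary = Dictionary(tokens for _, _, tokens in iter_data('data/docs', 'utf-8'))
corpus = MyCorpus('data/docs', dictionary)

for bow in corpus:
    # each document arrives as a sparse list of (token_id, count) pairs
    print(bow)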