blog8 Processing text into a list of (word, tag) tuples

gh冲

Article tags: deep learning, machine learning, natural language processing

Article link: https://blog.csdn.net/qq_46765753/article/details/121540295

2021SC@ SDUSC

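In many natural language processing tasks, part-of-speech (POS) information is an indispensable feature. To obtain it, we rely on an external POS tagging toolkit, stanford-postagger.
For example: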

        >>from sentkp.preprocessing import postagger

        >>pt = postagger.PosTagger()

        >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.')

Output:

[
            [('Write', 'VB'), ('your', 'PRP$'), ('python', 'NN'),
            ('code', 'NN'), ('in', 'IN'), ('a', 'DT'), ('.', '.'), ('py', 'NN'), ('file', 'NN'), ('.', '.')
            ],
            [('Thank', 'VB'), ('you', 'PRP'), ('.', '.')]
        ]
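
The nested result can be consumed with ordinary list operations. Here is a minimal sketch, assuming the output shown above is already in memory. The variable name `tagged` and the helper `words_with_tag_prefix` are illustrative only and are not part of EmbedRank.

```python
# Output copied from the example above: one inner list per sentence.
tagged = [
    [('Write', 'VB'), ('your', 'PRP$'), ('python', 'NN'),
     ('code', 'NN'), ('in', 'IN'), ('a', 'DT'), ('.', '.'),
     ('py', 'NN'), ('file', 'NN'), ('.', '.')],
    [('Thank', 'VB'), ('you', 'PRP'), ('.', '.')],
]


def words_with_tag_prefix(sentences, prefix):
    """Collect every word whose POS tag starts with `prefix` (hypothetical helper)."""
    return [word
            for sentence in sentences
            for word, tag in sentence
            if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, 'NN')
verbs = words_with_tag_prefix(tagged, 'VB')
print(nouns)  # ['python', 'code', 'py', 'file']
print(verbs)  # ['Write', 'Thank']
```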

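Next, we look at how the EmbedRank method tags text with parts of speech:
1. The function def pos_tag_raw_text(self, text, as_tuple_list=True) processes a single piece of raw text.
Parameters:
text: the string to POS-tag;
as_tuple_list: return the result as a list of lists of (word, Pos_tag) tuples. When it is set to False, the result is returned as a single string instead.
For example: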

 >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.', as_tuple_list=False)

Output:

        'Write/VB your/PRP$ python/NN code/NN in/IN a/DT ./.[ENDSENT]py/NN file/NN ./.[ENDSENT]Thank/VB you/PRP
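
The two output formats carry the same information, so one can be converted into the other. The sketch below assumes, based only on the example output above, that tokens are written as word/TAG, separated by spaces, and that sentences are separated by [ENDSENT]. The function names `parse_tagged_string` and `to_tagged_string` are hypothetical and do not come from EmbedRank.

```python
def parse_tagged_string(tagged_text, sentence_separator='[ENDSENT]'):
    """Turn 'word/TAG word/TAG[ENDSENT]...' into a list of lists of (word, tag)."""
    sentences = []
    for chunk in tagged_text.split(sentence_separator):
        pairs = []
        for token in chunk.split():
            # Split on the LAST slash so that a token such as './.' becomes ('.', '.').
            word, _, tag = token.rpartition('/')
            pairs.append((word, tag))
        if pairs:  # skip empty chunks, e.g. after a trailing separator
            sentences.append(pairs)
    return sentences


def to_tagged_string(sentences, sentence_separator='[ENDSENT]'):
    """Inverse of parse_tagged_string for well-formed input."""
    return sentence_separator.join(
        ' '.join(word + '/' + tag for word, tag in sentence)
        for sentence in sentences
    )


sample = 'file/NN ./.[ENDSENT]Thank/VB you/PRP ./.'
parsed = parse_tagged_string(sample)
print(parsed)
print(to_tagged_string(parsed) == sample)
```

Splitting on the last slash is a design choice made for this sketch: it keeps words that themselves contain a slash intact, at the cost of assuming that tags never contain one.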

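2. The function def pos_tag_file(self, input_path, output_path=None):
POS-tags a file. The result is either a list of lists (for each sentence, a list of (word, tag) tuples) or a file containing the POS-tagged text.
Note: line breaks are kept only for readability. When a tagged file is read back, sent_tokenize is used again to find the sentence boundaries.
Parameters:
input_path: path of the source file;
output_path: if set, the POS-tagged text is written to this path with separators (self.pos_tag_raw_text with as_tuple_list=False); if not set, a list of lists of tuples is returned (self.pos_tag_raw_text with as_tuple_list=True);
Returns: the resulting POS-tagged text as a list of lists of tuples, or nothing if output_path is set.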

def pos_tag_file(self, input_path, output_path=None):
    original_text = read_file(input_path)

    if output_path is not None:
        tagged_text = self.pos_tag_raw_text(original_text, as_tuple_list=False)
        # Write to the output the POS-Tagged text.
        write_string(tagged_text, output_path)
    else:
        return self.pos_tag_raw_text(original_text, as_tuple_list=True)
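
The method above calls two helpers, read_file and write_string, whose definitions are not shown in this post. A minimal sketch of what they might look like follows. The UTF-8 encoding and the overwrite behaviour are assumptions, not confirmed details of the EmbedRank code.

```python
import os
import tempfile


def read_file(input_path):
    """Return the whole file as one string (assumed behaviour)."""
    with open(input_path, 'r', encoding='utf-8') as handle:
        return handle.read()


def write_string(string, output_path):
    """Write `string` to `output_path`, replacing any existing file (assumed behaviour)."""
    with open(output_path, 'w', encoding='utf-8') as handle:
        handle.write(string)


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text1_POS')
    write_string('Thank/VB you/PRP ./.', path)
    round_trip = read_file(path)

print(round_trip)  # Thank/VB you/PRP ./.
```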

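3. The function def pos_tag_and_write_corpora(self, list_of_path, suffix):
POS-tags a list of files and writes each resulting file into the same directory as its source, with the same name plus the suffix.
For example: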

pos_tag_and_write_corpora(['/Users/user1/text1', '/Users/user1/direct/text2'], suffix='_POS')

This will create:

 /Users/user1/text1_POS
 /Users/user1/direct/text2_POS

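Parameters:
list_of_path: a list containing the path (as a string) of each file to POS-tag;
suffix: the suffix appended to the end of the original file name to form the name of the resulting POS-tagged file.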

def pos_tag_and_write_corpora(self, list_of_path, suffix):
    # Requires `import os` and `import warnings` at module level.
    for path in list_of_path:
        output_file_path = path + suffix
        if os.path.isfile(path):
            self.pos_tag_file(path, output_file_path)
        else:
            # Report the missing input file, not the output path.
            warnings.warn('file ' + path + ' does not exist')
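
The same loop can be tried without a Stanford tagger installed. The sketch below is a standalone version in which the tagging step is passed in as a function. `tag_existing_files` and `fake_tag_file` are hypothetical names, and `fake_tag_file` only upper-cases the text as a stand-in for real tagging.

```python
import os
import tempfile
import warnings


def tag_existing_files(list_of_path, suffix, tag_file):
    """Call tag_file(path, path + suffix) for each existing file; warn otherwise."""
    written = []
    for path in list_of_path:
        output_file_path = path + suffix
        if os.path.isfile(path):
            tag_file(path, output_file_path)
            written.append(output_file_path)
        else:
            warnings.warn('file ' + path + ' does not exist')
    return written


def fake_tag_file(input_path, output_path):
    # Stand-in for pos_tag_file: copies the text in upper case instead of tagging it.
    with open(input_path, encoding='utf-8') as src:
        text = src.read()
    with open(output_path, 'w', encoding='utf-8') as dst:
        dst.write(text.upper())


with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, 'text1')
    missing = os.path.join(tmp_dir, 'text2')  # never created on purpose
    with open(existing, 'w', encoding='utf-8') as handle:
        handle.write('Thank you.')

    # Record warnings instead of printing them so they can be inspected.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        written = tag_existing_files([existing, missing], '_POS', fake_tag_file)

    created = [os.path.basename(p) for p in written]
    warning_count = len(caught)

print(created)        # ['text1_POS']
print(warning_count)  # 1
```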

Implementations with language-specific tokenization. The tag and tag_sents methods perform tokenization in the specific language:

English:

class EnglishStanfordPOSTagger(StanfordPOSTagger):

    @property
    def _cmd(self):
        return ['edu.stanford.nlp.tagger.maxent.MaxentTagger',
                '-model', self._stanford_model, '-textFile', self._input_file_path,
                '-outputFormatOptions', 'keepEmptySentences']

French:

class FrenchStanfordPOSTagger(StanfordPOSTagger):
    """
    Taken from github mhkuu/french-learner-corpus
    Extends the StanfordPosTagger with a custom command that calls the FrenchTokenizerFactory.
    """

    @property
    def _cmd(self):
        return ['edu.stanford.nlp.tagger.maxent.MaxentTagger',
                '-model', self._stanford_model, '-textFile',
                self._input_file_path, '-tokenizerFactory',
                'edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory',
                '-outputFormatOptions', 'keepEmptySentences']

German:

class GermanStanfordPOSTagger(StanfordPOSTagger):
    """ Use english tokenizer for german """

    @property
    def _cmd(self):
        return ['edu.stanford.nlp.tagger.maxent.MaxentTagger',
                '-model', self._stanford_model, '-textFile', self._input_file_path,
                '-outputFormatOptions', 'keepEmptySentences']
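
Comparing the three classes, the only difference is that the French command adds a -tokenizerFactory argument, while the German class reuses the English command unchanged. The sketch below makes that difference explicit with a plain function. `build_tagger_cmd` is a hypothetical helper, and the model and input file names are placeholders rather than real paths.

```python
TAGGER_MAIN_CLASS = 'edu.stanford.nlp.tagger.maxent.MaxentTagger'
FRENCH_TOKENIZER_FACTORY = (
    'edu.stanford.nlp.international.french.process.'
    'FrenchTokenizer$FrenchTokenizerFactory'
)


def build_tagger_cmd(model_path, input_file_path, tokenizer_factory=None):
    """Build the argument list in the same order as the _cmd properties above."""
    cmd = [TAGGER_MAIN_CLASS, '-model', model_path, '-textFile', input_file_path]
    if tokenizer_factory is not None:
        # Only the French tagger in this post passes a custom tokenizer factory.
        cmd += ['-tokenizerFactory', tokenizer_factory]
    cmd += ['-outputFormatOptions', 'keepEmptySentences']
    return cmd


# Placeholder file names, for illustration only.
english_cmd = build_tagger_cmd('english.tagger', 'input.txt')
french_cmd = build_tagger_cmd('french.tagger', 'input.txt', FRENCH_TOKENIZER_FACTORY)

print('-tokenizerFactory' in english_cmd)  # False
print('-tokenizerFactory' in french_cmd)   # True
```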