Splitting English Paragraphs into Sentences

Latest recommended article published on 2024-06-04 17:52:45

Maann

Category: NLP. Tags: natural language processing, artificial intelligence, nlp

Article link: https://blog.csdn.net/weixin_43815222/article/details/123074052

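When doing NLP, the data is often an entire article or a long passage of text. Before any further processing, the text usually needs to be cleaned and segmented (removing extra characters and special symbols, splitting into sentences, and tokenizing), or at least split into sentences so that later steps can treat the sentence as the smallest unit. So how do we split text into sentences?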
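As a rough illustration of the cleanup step mentioned above, the sketch below normalizes a raw paragraph before it is split. The helper name `clean_text` and the specific rules (NFKC normalization, a space after a comma that is glued to a letter, whitespace collapsing) are assumptions chosen for this example, not something the original post prescribes.

```python
import re
import unicodedata


def clean_text(text):
    """Illustrative helper: normalize a raw paragraph before sentence splitting."""
    # NFKC folds full-width forms such as the full-width comma (U+FF0C)
    # into their ASCII equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Insert a space after a comma or semicolon that is glued to a letter
    # (digits are left alone so that "1,000" stays intact).
    text = re.sub(r"([,;])(?=[A-Za-z])", r"\1 ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("the earlier you begin，the more likely\n\n  you will succeed."))
# the earlier you begin, the more likely you will succeed.
```

Which characters count as "extra" depends on the corpus, so the rules here would normally be adjusted to the data at hand.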

For example, take the following passage:
First, it takes time to accomplish a task - the earlier you begin, the more likely you will reach your goal earlier. Otherwise you can never be sure of your success. Second, when diligence becomes a habit, nothing will be difficult to a determined and persistent person. For example, you will never feel bored and tired at doing social investigation if you really enjoy it. Third, looking at the matter from another perspective, we will find that social resources are always limited and opportunities are always for those prepared minds.

How can it be split into sentences? Two methods are described below:

1. Rule-based matching

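In most cases the split can be done quickly with Python's split and similar functions. The main features to split on are:

the sentence-ending punctuation marks (. ? !);
splitting on these marks yields a new list of pieces, as in the function below.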

import re

def sentence_split(str_centence):
    # Split on any of the sentence terminators . ? !
    list_ret = list()
    for s_str in re.split(r'[.?!]', str_centence):
        # Drop surrounding whitespace and skip empty pieces
        s_str = s_str.strip()
        if s_str:
            list_ret.append(s_str)
    return list_ret

The result is as follows:
[Image: output of sentence_split on the sample text]
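A plain split discards the punctuation it splits on. The following sketch shows one possible variation that keeps each terminator attached to its sentence by splitting on the whitespace that follows it. The helper name `split_keep_punct` is hypothetical and the regular expression is only one of several reasonable choices.

```python
import re

# Split on whitespace that directly follows ., ? or !, so the terminator
# stays with its sentence and a number like "3.14" is not cut in half.
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def split_keep_punct(text):
    """Hypothetical helper: rule-based splitter that keeps end punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


print(split_keep_punct("Is it 3.14? Yes! Dr. Smith agrees."))
# ['Is it 3.14?', 'Yes!', 'Dr.', 'Smith agrees.']
```

The example output also shows the typical weakness of purely rule-based splitting: the abbreviation "Dr." is treated as the end of a sentence.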

2. Sentence splitting with sent_tokenize from nltk

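The second method relies on the fact that a basic need like this is usually already covered by an existing library. We can use the sent_tokenize function from nltk to split sentences, as shown in the following code: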

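from nltk.tokenize import sent_tokenize

# sent_tokenize relies on NLTK's Punkt model data, which must be downloaded once
# (for example with nltk.download('punkt'); recent NLTK releases use 'punkt_tab')
def sentence_token_nltk(text):
    sent_tokenize_list = sent_tokenize(text)
    return sent_tokenize_list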

The result is as follows:
[Image: output of sentence_token_nltk on the sample text]
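In practice nltk or its Punkt model data may not be present on every machine. The sketch below is one way to combine the two methods: it tries sent_tokenize first and falls back to a simple regular-expression rule otherwise. The wrapper name `split_sentences` and the fallback rule are assumptions made for this example, not part of nltk.

```python
import re

try:
    from nltk.tokenize import sent_tokenize
except ImportError:  # nltk is not installed
    sent_tokenize = None


def split_sentences(text):
    """Hypothetical wrapper: prefer nltk, fall back to a simple regex rule."""
    if sent_tokenize is not None:
        try:
            return sent_tokenize(text)
        except LookupError:
            # Raised when the Punkt model data has not been downloaded yet.
            pass
    # Fallback: split on whitespace that follows ., ? or !
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


print(split_sentences("Hello world. How are you? Fine!"))
```

The two paths can disagree on harder inputs such as abbreviations, so the fallback is best seen as a stopgap and not as an equivalent of the trained tokenizer.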

The complete code for both methods is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/8/11 3:18 PM
# @Author  : yizhen
# @Site    : 
# @File    : split_sentence.py
# @Software: PyCharm
import re

from nltk.tokenize import sent_tokenize


def sentence_token_nltk(text):
    # Split the text into sentences with nltk's Punkt-based tokenizer
    sent_tokenize_list = sent_tokenize(text)
    return sent_tokenize_list


def sentence_split(str_centence):
    # Rule-based split on . ? ! (whitespace is trimmed, empty pieces are skipped)
    list_ret = list()
    for s_str in re.split(r'[.?!]', str_centence):
        s_str = s_str.strip()
        if s_str:
            list_ret.append(s_str)
    return list_ret


def main():
    # test.txt is expected to contain the paragraph to split
    with open('test.txt', 'r', encoding='utf-8') as fp:
        text = fp.read().strip()

    sentence_str = sentence_token_nltk(text)
    sentence_str_2 = sentence_split(text)
    print(sentence_str)
    print(sentence_str_2)


if __name__ == '__main__':
    main()