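In NLP work, the input is often a whole article or a long passage. Before any other processing, the text has to be cut up and cleaned (removing redundant characters and special symbols, splitting into sentences, tokenizing), or split into sentences so that later steps treat the sentence as the smallest unit. So how do we split text into sentences?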
Take the following passage as an example: First, it takes time to accomplish a task —— the earlier you begin,the more likely you will reach your goal earlier. Otherwise you call never be sure of your success. Second, when diligence becomes a habit, nothing will be difficult to a determined and persistent person. For example, you will never feel bored and tired at doing social investigation if you really enjoy it. Third, looking at the matter from another perspective, we will find that social resources are always limited and opportunities are always for those prepared minds.
Two approaches to sentence splitting are described below:
1. Rule-based matching
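In most cases, Python's split and similar functions are enough to do the splitting quickly. The main cues are the sentence-ending delimiters, roughly these: . ? !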
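As a rough illustration of splitting on these delimiters, here is a minimal sketch that uses re.split from the standard library instead of str.split, so that all three delimiters are handled in one pass and each sentence keeps its punctuation. The function name split_sentences_sketch and the short sample string are made up for this example and are not part of the original post.

```python
import re

def split_sentences_sketch(text):
    """Hypothetical helper: split after '.', '?' or '!' when whitespace follows."""
    # The lookbehind (?<=[.?!]) keeps the delimiter attached to its sentence;
    # \s+ consumes the whitespace that separates two sentences.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    # Drop the empty string that an empty input would produce.
    return [p for p in parts if p]

sample = "Second, is it hard? No! Keep going. Third, look again."
for sentence in split_sentences_sketch(sample):
    print(sentence)
```

A rule this simple also cuts after abbreviations such as "e.g." and does not split when a delimiter is not followed by whitespace, so it should be treated as a starting point only.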
Splitting with the split function returns a new list, as in the following function:
def sentence_split(str_