Splitting long English sentences with Python + stanfordcorenlp + dependency_parse (dependency parsing)

六七～

Category: Algorithms. Tags: algorithms, python, machine learning, natural language processing

Article link: https://blog.csdn.net/qq_41626059/article/details/114658310

1. Problem introduction: splitting long English sentences

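English sentences also contain multi-word phrases. For example, in: I traveled to New York last year
New York can be treated as a single phrase, and treating it that way may improve performance in sentiment analysis and other tasks.
Splitting long English sentences is also a research direction in its own right; see, for example, the paper: Neural Text Segmentation and Its Application to Sentiment Analysis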

2. How do we solve it?

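We use the word-to-word dependency relations produced by a dependency tree (that is, by dependency parsing) to group the words of a sentence, and split the long sentence along those groups.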
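To make the grouping idea concrete, here is a minimal sketch of the data it works on. It assumes the parser returns (relation, head index, dependent index) triples with 1-based token indices and head index 0 for ROOT, which is the shape the stanfordcorenlp wrapper's dependency_parse returns. The triples below are hand-written for illustration and are not real parser output, and the helper name children_by_head is hypothetical.

```python
from collections import defaultdict

# Hand-written triples for "I traveled to New York last year" (illustrative only,
# not actual CoreNLP output): (relation, head index, dependent index).
sample_parse = [
    ("ROOT", 0, 2),
    ("nsubj", 2, 1),
    ("case", 5, 3),
    ("compound", 5, 4),
    ("obl", 2, 5),
    ("amod", 7, 6),
    ("obl:tmod", 2, 7),
]


def children_by_head(parse):
    """Group dependent indices under the index of their head (hypothetical helper)."""
    groups = defaultdict(list)
    for _relation, head, dependent in parse:
        if head != 0:  # skip the ROOT entry, whose head is the artificial index 0
            groups[head].append(dependent)
    return {head: sorted(deps) for head, deps in groups.items()}


groups = children_by_head(sample_parse)
print(groups)
```

In this sample, tokens 3 and 4 ("to", "New") hang off token 5 ("York"), which is the kind of contiguous group the segmentation code tries to keep together.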

3. Code and comments

from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('./stanford-corenlp-4.2.0')

sentence = "Steers turns in a snappy screenplay that curls at the edges ; it 's so clever you want to hate it ."

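# dependency_parse returns (relation, head index, dependent index) triples;
# token indices are 1-based and the ROOT entry has head index 0.
result = nlp.dependency_parse(sentence)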


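def seg_text(result):
    # Map each dependent index to its head index (result[0] is the ROOT entry, so it is skipped)
    d = {}
    for i in range(1, len(result)):
        d[result[i][2]] = result[i][1]

    # Collect every head index that appears in result (called "roots" below)
    roots = []
    for i in range(1, len(result)):
        roots.append(result[i][1])

    # Drop duplicate heads and sort them by position
    roots = list(set(roots))
    roots.sort()
    # Holds the resulting segments
    segs = []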
    
    for root in roots:
        # Tokens assigned to the current head
        children_root = []

        # Skip this head if it already landed in another segment as a child of another head
        flag = False
        for seg in segs:
            if root in seg:
                flag = True
                break
        if flag:
            continue
        children_root.append(root)
        
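        # Walk over the candidate tokens for root and check them against the patterns defined below
        for re in result:

            # Skip the ROOT entry, whose head index is 0
            if re[1] == 0:
                continue

            # Skip tokens that already belong to a segment
            f1 = False
            for seg in segs:
                if re[2] in seg:
                    f1 = True
            if f1:
                continue

            # At this point re[2] is an unassigned token to test against the current head
            if root > re[2]:
                # Token re[2] is to the left of root: look for the two patterns between them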
                # Pattern 1: every token between re[2] and root is a child of root
                pattern = False
                for i in range(re[2], root):
                    try:
                        if d[i] != root:
                            pattern = True
                            break
                    except KeyError:
                        continue

                if pattern:  # some token between re[2] and root is not a child of root
                    # Pattern 2: each token's head is the token immediately to its right
                    flag = True
                    for i in range(re[2], root):
                        try:
                            if d[i] != i+1:
                                flag = False
                                break
                        except KeyError:
                            continue

                    if flag:  # every index in between can join the current segment
                        for ch in range(re[2], root):
                            children_root.append(ch)

                else:  # everything between re[2] and root is a child of root, so add it all
                    for ch in range(re[2], root):
                        children_root.append(ch)
            else:
                # root is at or to the left of re[2]: look for the same two patterns
                # Pattern 1: every token between root and re[2] is a child of root
                pattern = False
                for i in range(root + 1, re[2] + 1):
                    try:
                        if d[i] != root:
                            pattern = True
                            break
                    except KeyError:
                        continue

                if pattern:  # some token in between is not a child of root
                    # Pattern 2: each token's head is the token immediately to its left
                    flag = True
                    for i in range(re[2], root, -1):
                        try:
                            if d[i] != i - 1:
                                flag = False
                                break
                        except KeyError:
                            continue

                    if flag:  # pattern 2 holds
                        for ch in range(root, re[2] + 1):
                            children_root.append(ch)

                else:  # everything in between is a child of root, so add it to the current segment
                    for ch in range(root, re[2] + 1):
                        children_root.append(ch)
                        
        children_root = list(set(children_root))                
        segs.append(children_root)
        
        
    # Merge segments (EDUs) of length 1 or 2 into a neighbouring segment
    def merge(segs):
        # Fold the first short segment (at most 2 tokens) into its right neighbour, then repeat
        while True:
            short = None
            for i in range(len(segs) - 1):
                if len(segs[i]) <= 2:
                    short = i
                    break
            if short is None:
                break
            segs[short + 1].extend(segs[short])
            # Drop the segment that was just merged
            segs.pop(short)

        # Finally, fold a short last segment into the one before it, if there is one
        if len(segs) > 1 and len(segs[-1]) <= 2:
            segs[-2].extend(segs[-1])
            segs.pop(-1)

        return segs

    segs = merge(segs)
    
    
    # Attach any token that is still unassigned to a segment holding one of its neighbours
    l1 = [re[2] for re in result]
    l2 = []
    for seg in segs:
        l2.extend(seg)

    other = sorted(set(l1) - set(l2))
    for oth in other:
        for i in range(len(segs)):
            if (oth - 1 in segs[i]) or (oth + 1 in segs[i]):
                segs[i].append(oth)
                break

    # Sort each segment in place (list.sort() returns None, so its result is not collected)
    for seg in segs:
        seg.sort()

    return segs
if __name__ == "__main__":
    seg = seg_text(result)
    print(seg)
    nlp.close()  # shut down the CoreNLP backend started by the wrapper
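
seg_text returns lists of 1-based token indices rather than text. A minimal sketch of turning them back into readable pieces might look like the following. The helper name segments_to_text is hypothetical, and the tokens and segments are hand-written for illustration. In practice the tokens could come from nlp.word_tokenize(sentence), assuming its tokenization lines up with the indices used by dependency_parse.

```python
def segments_to_text(tokens, segs):
    """Turn lists of 1-based token indices into strings (hypothetical helper)."""
    pieces = []
    for seg in segs:
        # Sort so words come out in sentence order; index 1 maps to tokens[0].
        words = [tokens[i - 1] for i in sorted(seg) if 1 <= i <= len(tokens)]
        pieces.append(" ".join(words))
    return pieces


# Hand-written tokens and segments, for illustration only.
tokens = ["I", "traveled", "to", "New", "York", "last", "year"]
example_segs = [[1, 2], [3, 4, 5], [6, 7]]
pieces = segments_to_text(tokens, example_segs)
print(pieces)
```

Indices outside the token range are silently dropped here, which is a design choice for the sketch; raising an error instead would surface a mismatch between tokenizer and parser earlier.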
