ccmt2019: Parsing SGM Files

Icoding_F2014

Category: Natural Language Processing. Tags: ccmt2019, SGM file parsing

Article link: https://blog.csdn.net/jmh1996/article/details/90155137

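The CCMT 2019 development and test sets are distributed as SGM files. An SGM file is essentially an XML file, so it can be parsed with an XML library such as Python's lxml; the script below uses the standard library's xml.dom.minidom instead.
Pitfalls:

The official files are not well-formed XML: some special characters, such as &, are left unescaped. Before parsing, replace every bare & with &amp;.
Check both docid and seg id when pairing sentences: some segments exist in the source file but are missing from the target file, which is easy to overlook.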
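The first pitfall can be handled in code rather than by hand. Below is a minimal sketch, assuming the only well-formedness problem is bare `&` characters. `escape_bare_ampersands` and `parse_sgm_text` are hypothetical helper names that are not part of the original script, and the sample string is made up for illustration.

```python
import re
from xml.dom.minidom import parseString

# Matches an "&" that does not already start one of the predefined XML
# entities (&amp; &lt; &gt; &quot; &apos;) or a numeric character reference.
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace every unescaped '&' with '&amp;' (hypothetical helper)."""
    return _BARE_AMP.sub("&amp;", text)


def parse_sgm_text(text):
    """Escape bare ampersands, then parse the text as XML with minidom."""
    return parseString(escape_bare_ampersands(text))


# Made-up sample: two bare ampersands and one that is already escaped.
sample = '<doc docid="d1"><seg id="1">AT&T &amp; R&D</seg></doc>'
dom = parse_sgm_text(sample)
print(dom.getElementsByTagName("seg")[0].firstChild.data)
```

The negative lookahead leaves existing entities untouched, so the function can be run more than once on the same text without double-escaping.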

__author__ = 'jmh081701'

from xml.dom.minidom import parse

def get_sentence(file):
    """Parse an SGM file into a nested dict: {docid: {seg_id: sentence}}."""
    rst = {}
    dom_tree = parse(file)

    for doc in dom_tree.getElementsByTagName("doc"):
        docid = doc.getAttribute("docid")
        segs = rst.setdefault(docid, {})
        for seg in doc.getElementsByTagName("seg"):
            # Join all text nodes so that an empty <seg> yields "" instead of
            # raising IndexError on seg.childNodes[0].
            text = "".join(node.data for node in seg.childNodes
                           if node.nodeType == node.TEXT_NODE)
            segs.setdefault(seg.getAttribute("id"), text)
    return rst

en_src = get_sentence(file=r"E:\TempWorkStation\ccmt2019\dataset\NJU-newsdev2017-enzh\NJU-newsdev2017-enzh\newsdev2017-enzh-src.en.sgm")
cn_ref = get_sentence(file=r"E:\TempWorkStation\ccmt2019\dataset\NJU-newsdev2017-enzh\NJU-newsdev2017-enzh\newsdev2017-enzh-ref.zh.sgm")

# Write only the segments present in both files so the outputs stay line-aligned.
# encoding="utf-8" avoids platform-dependent defaults when writing Chinese text.
with open("./corpus/en_dev.txt", "w", encoding="utf-8") as en_fp, \
        open("./corpus/cn_dev.txt", "w", encoding="utf-8") as cn_fp:
    for docid in en_src:
        for seg_id in en_src[docid]:
            if docid in cn_ref and seg_id in cn_ref[docid]:
                en_fp.write(en_src[docid][seg_id] + "\n")
                cn_fp.write(cn_ref[docid][seg_id] + "\n")
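
The script above silently skips any source segment that has no reference. To see exactly which segments are affected, the mismatches can be listed before the files are written. Below is a minimal sketch, assuming both inputs have the nested `{docid: {seg_id: sentence}}` shape returned by `get_sentence`. `find_missing_segments` is a hypothetical helper, and the sample dictionaries are made-up data, not taken from the CCMT files.

```python
def find_missing_segments(src, ref):
    """Return (docid, seg_id) pairs that are in src but have no entry in ref."""
    missing = []
    for docid, segs in src.items():
        ref_segs = ref.get(docid, {})  # a whole document may be absent from ref
        for seg_id in segs:
            if seg_id not in ref_segs:
                missing.append((docid, seg_id))
    return missing


# Made-up data for illustration only.
src = {"doc1": {"1": "Hello.", "2": "Bye."}, "doc2": {"1": "Only in source."}}
ref = {"doc1": {"1": "你好。"}}

for docid, seg_id in find_missing_segments(src, ref):
    print(f"no reference for docid={docid} seg id={seg_id}")
```

Calling the function a second time with the arguments swapped would also report segments that exist only in the reference file.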