Goal:
Run constituency parsing over the sentences in the Quora dataset with Stanford CoreNLP. The parser is invoked from the command line, which produces an output file containing the full annotations; from that file we only need the Sentence text and the Constituency parse for each sentence.
Format of the generated train.txt.out file:
Document: ID=train.txt (127587 sentences, 1428493 tokens)

Sentence #1 (6 tokens):
how do i memorize faster ?

Tokens:
[Text=how CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=WRB]
[Text=do CharacterOffsetBegin=4 CharacterOffsetEnd=6 PartOfSpeech=VBP]
[Text=i CharacterOffsetBegin=7 CharacterOffsetEnd=8 PartOfSpeech=PRP]
[Text=memorize CharacterOffsetBegin=9 CharacterOffsetEnd=17 PartOfSpeech=VBP]
[Text=faster CharacterOffsetBegin=18 CharacterOffsetEnd=24 PartOfSpeech=RBR]
[Text=? CharacterOffsetBegin=25 CharacterOffsetEnd=26 PartOfSpeech=.]

Constituency parse:
(ROOT (SBARQ (WHADVP (WRB how)) (SQ (VBP do) (NP (PRP i)) (VP (VBP memorize) (ADVP (RBR faster)))) (. ?)))

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, memorize-4)
advmod(memorize-4, how-1)
aux(memorize-4, do-2)
nsubj(memorize-4, i-3)
advmod(memorize-4, faster-5)
punct(memorize-4, ?-6)

Sentence #2 (15 tokens):
what is the process to rent a land to host mobile tower in india ?

Tokens:
[Text=what CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=WP]
[Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ]
[Text=the CharacterOffsetBegin=35 CharacterOffsetEnd=38 PartOfSpeech=DT]
[Text=process CharacterOffsetBegin=39 CharacterOffsetEnd=46 PartOfSpeech=NN]
[Text=to CharacterOffsetBegin=47 CharacterOffsetEnd=49 PartOfSpeech=TO]
[Text=rent CharacterOffsetBegin=50 CharacterOffsetEnd=54 PartOfSpeech=VB]
[Text=a CharacterOffsetBegin=55 CharacterOffsetEnd=56 PartOfSpeech=DT]
[Text=land CharacterOffsetBegin=57 CharacterOffsetEnd=61 PartOfSpeech=NN]
[Text=to CharacterOffsetBegin=62 CharacterOffsetEnd=64 PartOfSpeech=TO]
[Text=host CharacterOffsetBegin=65 CharacterOffsetEnd=69 PartOfSpeech=VB]
[Text=mobile CharacterOffsetBegin=70 CharacterOffsetEnd=76 PartOfSpeech=JJ]
[Text=tower CharacterOffsetBegin=77 CharacterOffsetEnd=82 PartOfSpeech=NN]
[Text=in CharacterOffsetBegin=83 CharacterOffsetEnd=85 PartOfSpeech=IN]
[Text=india CharacterOffsetBegin=86 CharacterOffsetEnd=91 PartOfSpeech=NNP]
[Text=? CharacterOffsetBegin=92 CharacterOffsetEnd=93 PartOfSpeech=.]

Constituency parse:
(ROOT (SBARQ (WHNP (WP what)) (SQ (VBZ is) (NP (NP (DT the) (NN process)) (SBAR (S (VP (TO to) (VP (VB rent) (NP (NP (DT a) (NN land)) (S (VP (TO to) (VP (VB host) (NP (JJ mobile) (NN tower)) (PP (IN in) (NP (NNP india))))))))))))) (. ?)))

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, what-1)
cop(what-1, is-2)
det(process-4, the-3)
nsubj(what-1, process-4)
mark(rent-6, to-5)
acl:to(process-4, rent-6)
det(land-8, a-7)
obj(rent-6, land-8)
mark(host-10, to-9)
acl:to(land-8, host-10)
amod(tower-12, mobile-11)
obj(host-10, tower-12)
case(india-14, in-13)
obl:in(host-10, india-14)
punct(what-1, ?-15)

Sentence #3 (5 tokens):
why do rapists rape ?

Tokens:
[Text=why CharacterOffsetBegin=94 CharacterOffsetEnd=97 PartOfSpeech=WRB]
[Text=do CharacterOffsetBegin=98 CharacterOffsetEnd=100 PartOfSpeech=VBP]
[Text=rapists CharacterOffsetBegin=101 CharacterOffsetEnd=108 PartOfSpeech=NNS]
[Text=rape CharacterOffsetBegin=109 CharacterOffsetEnd=113 PartOfSpeech=NN]
[Text=? CharacterOffsetBegin=114 CharacterOffsetEnd=115 PartOfSpeech=.]

Constituency parse:
(ROOT (SBARQ (WHADVP (WRB why)) (SQ (VBP do) (NP (NNS rapists)) (NP (NN rape))) (. ?)))

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, do-2)
advmod(do-2, why-1)
nsubj(do-2, rapists-3)
dep(do-2, rape-4)
punct(do-2, ?-5)
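Each bracketed token entry above is a flat run of key=value fields. The extraction task below only needs the sentence text and the parse tree, but if the token attributes were ever wanted, they could be split out with a short helper (an illustrative sketch; `parse_token` is a made-up name, not part of CoreNLP):

```python
import re

def parse_token(tok):
    """Turn one '[Key=value ...]' token entry into a dict of its fields."""
    # keys are word characters followed by '='; values run to the next space
    return dict(re.findall(r'(\w+)=(\S+)', tok.strip('[]')))

fields = parse_token('[Text=how CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=WRB]')
# fields['Text'] is 'how', fields['PartOfSpeech'] is 'WRB'
```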
The generated file is then processed with the extract_parses function to obtain the desired sentence/parse pairs.
The code is as follows:
import h5py

def extract_parses(fname, output_file):
    """Scan a CoreNLP .out file, collect each sentence's text and its
    constituency parse, and store both lists in an HDF5 file."""
    with open(fname, "r", encoding="utf-8") as f:
        lines = f.readlines()
    new_sent = False   # currently inside a sentence-text block
    new_parse = False  # currently inside a constituency-parse block
    sent = ""
    parse = ""
    count = 0          # number of sentences seen so far
    sents = []
    synts = []
    for line in lines:
        if line.startswith('Sentence #'):
            # a new sentence record starts; its text follows on the next line(s)
            new_sent = True
            new_parse = False
            count += 1
        elif new_sent:
            # accumulate the sentence text until the 'Tokens:' header
            if line.startswith('Tokens:'):
                new_sent = False
                sent = " ".join(sent.split())  # collapse newlines/extra spaces
                sents.append(sent)
                sent = ""
            else:
                sent = sent + line
        elif line.startswith('Constituency parse:'):
            # the parse tree follows on the next line(s)
            new_parse = True
        elif new_parse:
            # accumulate the parse until the 'Dependency Parse' header
            if line.startswith('Dependency Parse'):
                new_parse = False
                parse = " ".join(parse.split())
                synts.append(parse)
                parse = ""
            else:
                parse = parse + line
    # write both lists to HDF5 as variable-length strings
    dtype = h5py.special_dtype(vlen=str)
    sents = [tmp.encode('utf8') for tmp in sents]
    synts = [tmp.encode('utf8') for tmp in synts]
    with h5py.File(output_file, 'w') as h5f:
        h5f.create_dataset("sents", dtype=dtype, data=sents)
        h5f.create_dataset("synts", dtype=dtype, data=synts)
    return sents, synts
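As a quick sanity check, the scanning logic can be exercised on a small inline sample. The sketch below mirrors the loop in extract_parses but leaves out the h5py output step, so it runs without any external files (`scan_parses` is a name introduced here for illustration):

```python
def scan_parses(lines):
    """Same state machine as extract_parses, minus the HDF5 output."""
    sents, synts = [], []
    new_sent = new_parse = False
    sent = parse = ""
    for line in lines:
        if line.startswith('Sentence #'):
            new_sent, new_parse = True, False
        elif new_sent:
            if line.startswith('Tokens:'):
                new_sent = False
                sents.append(" ".join(sent.split()))
                sent = ""
            else:
                sent += line
        elif line.startswith('Constituency parse:'):
            new_parse = True
        elif new_parse:
            if line.startswith('Dependency Parse'):
                new_parse = False
                synts.append(" ".join(parse.split()))
                parse = ""
            else:
                parse += line
    return sents, synts

# a trimmed excerpt of the train.txt.out format shown above
sample = """Sentence #1 (6 tokens):
how do i memorize faster ?
Tokens:
[Text=how CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=WRB]
Constituency parse:
(ROOT (SBARQ (WHADVP (WRB how)) (SQ (VBP do) (NP (PRP i)) (VP (VBP memorize) (ADVP (RBR faster)))) (. ?)))
Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, memorize-4)
""".splitlines(keepends=True)

sents, synts = scan_parses(sample)
# sents[0] is the sentence text, synts[0] the one-line parse tree
```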