Goal:
Run constituency parsing over the sentences in the Quora dataset with Stanford CoreNLP. The parser is invoked from the command line, which produces an output file containing the full annotations; from that file we only need the Sentence text and the Constituency parse for each sentence.
Format of the generated train.txt.out file:
Document: ID=train.txt (127587 sentences, 1428493 tokens)

Sentence #1 (6 tokens):
how do i memorize faster ?

Tokens:
[Text=how CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=WRB]
[Text=do CharacterOffsetBegin=4 CharacterOffsetEnd=6 PartOfSpeech=VBP]
[Text=i CharacterOffsetBegin=7 CharacterOffsetEnd=8 PartOfSpeech=PRP]
[Text=memorize CharacterOffsetBegin=9 CharacterOffsetEnd=17 PartOfSpeech=VBP]
[Text=faster CharacterOffsetBegin=18 CharacterOffsetEnd=24 PartOfSpeech=RBR]
[Text=? CharacterOffsetBegin=25 CharacterOffsetEnd=26 PartOfSpeech=.]

Constituency parse:
(ROOT (SBARQ (WHADVP (WRB how)) (SQ (VBP do) (NP (PRP i)) (VP (VBP memorize) (ADVP (RBR faster)))) (. ?)))

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, memorize-4)
advmod(memorize-4, how-1)
aux(memorize-4, do-2)
nsubj(memorize-4, i-3)
advmod(memorize-4, faster-5)
punct(memorize-4, ?-6)

Sentence #2 (15 tokens):
what is the process to rent a land to host mobile tower in india ?

Tokens:
[Text=what CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=WP]
[Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ]
[Text=the CharacterOffsetBegin=35 CharacterOffsetEnd=38 PartOfSpeech=DT]
[Text=process CharacterOffsetBegin=39 CharacterOffsetEnd=46 PartOfSpeech=NN]
[Text=to CharacterOffsetBegin=47 CharacterOffsetEnd=49 PartOfSpeech=TO]
[Text=rent CharacterOffsetBegin=50 CharacterOffsetEnd=54 PartOfSpeech=VB]
[Text=a CharacterOffsetBegin=55 CharacterOffsetEnd=56 PartOfSpeech=DT]
[Text=land CharacterOffsetBegin=57 CharacterOffsetEnd=61 PartOfSpeech=NN]
[Text=to CharacterOffsetBegin=62 CharacterOffsetEnd=64 PartOfSpeech=TO]
[Text=host CharacterOffsetBegin=65 CharacterOffsetEnd=69 PartOfSpeech=VB]
[Text=mobile CharacterOffsetBegin=70 CharacterOffsetEnd=76 PartOfSpeech=JJ]
[Text=tower CharacterOffsetBegin=77 CharacterOffsetEnd=82 PartOfSpeech=NN]
[Text=in CharacterOffsetBegin=83 CharacterOffsetEnd=85 PartOfSpeech=IN]
[Text=india CharacterOffsetBegin=86 CharacterOffsetEnd=91 PartOfSpeech=NNP]
[Text=? CharacterOffsetBegin=92 CharacterOffsetEnd=93 PartOfSpeech=.]

Constituency parse:
(ROOT (SBARQ (WHNP (WP what)) (SQ (VBZ is) (NP (NP (DT the) (NN process)) (SBAR (S (VP (TO to) (VP (VB rent) (NP (NP (DT a) (NN land)) (S (VP (TO to) (VP (VB host) (NP (JJ mobile) (NN tower)) (PP (IN in) (NP (NNP india))))))))))))) (. ?)))

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, what-1)
cop(what-1, is-2)
det(process-4, the-3)
nsubj(what-1, process-4)
mark(rent-6, to-5)
acl:to(process-4, rent-6)
det(land-8, a-7)
obj(rent-6, land-8)
mark(host-10, to-9)
acl:to(land-8, host-10)
amod(tower-12, mobile-11)
obj(host-10, tower-12)
case(india-14, in-13)
obl:in(host-10, india-14)
punct(what-1, ?-15)

Sentence #3 (5 tokens):
why do rapists rape ?

Tokens:
[Text=why CharacterOffsetBegin=94 CharacterOffsetEnd=97 PartOfSpeech=WRB]
[Text=do CharacterOffsetBegin=98 CharacterOffsetEnd=100 PartOfSpeech=VBP]
[Text=rapists CharacterOffsetBegin=101 CharacterOffsetEnd=108 PartOfSpeech=NNS]
[Text=rape CharacterOffsetBegin=109 CharacterOffsetEnd=113 PartOfSpeech=NN]
[Text=? CharacterOffsetBegin=114 CharacterOffsetEnd=115 PartOfSpeech=.]

Constituency parse:
(ROOT (SBARQ (WHADVP (WRB why)) (SQ (VBP do) (NP (NNS rapists)) (NP (NN rape))) (. ?)))

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, do-2)
advmod(do-2, why-1)
nsubj(do-2, rapists-3)
dep(do-2, rape-4)
punct(do-2, ?-5)
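Each bracketed token entry above is a flat run of key=value fields. The extraction task below only needs the sentence text and the parse tree, but if the token attributes were ever wanted, they could be split out with a short helper (an illustrative sketch; `parse_token` is a made-up name, not part of CoreNLP):

```python
import re

def parse_token(tok):
    """Turn one '[Key=value ...]' token entry into a dict of its fields."""
    # keys are word characters followed by '='; values run to the next space
    return dict(re.findall(r'(\w+)=(\S+)', tok.strip('[]')))

fields = parse_token('[Text=how CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=WRB]')
# fields['Text'] is 'how', fields['PartOfSpeech'] is 'WRB'
```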
The generated file is then processed with the extract_parses function to obtain the desired sentence/parse pairs.
The code is as follows:
import h5py

def extract_parses(fname, output_file):
    """Scan a CoreNLP .out file, collect each sentence's text and its
    constituency parse, and store both lists in an HDF5 file."""
    with open(fname, "r", encoding="utf-8") as f:
        lines = f.readlines()
    new_sent = False   # currently inside a sentence-text block
    new_parse = False  # currently inside a constituency-parse block
    sent = ""
    parse = ""
    count = 0          # number of sentences seen so far
    sents = []
    synts = []
    for line in lines:
        if line.startswith('Sentence #'):
            # a new sentence record starts; its text follows on the next line(s)
            new_sent = True
            new_parse = False
            count += 1
        elif new_sent:
            # accumulate the sentence text until the 'Tokens:' header
            if line.startswith('Tokens:'):
                new_sent = False
                sent = " ".join(sent.split())  # collapse newlines/extra spaces
                sents.append(sent)
                sent = ""
            else:
                sent = sent + line
        elif line.startswith('Constituency parse:'):
            # the parse tree follows on the next line(s)
            new_parse = True
        elif new_parse:
            # accumulate the parse until the 'Dependency Parse' header
            if line.startswith('Dependency Parse'):
                new_parse = False
                parse = " ".join(parse.split())
                synts.append(parse)
                parse = ""
            else:
                parse = parse + line
    # write both lists to HDF5 as variable-length strings
    dtype = h5py.special_dtype(vlen=str)
    sents = [tmp.encode('utf8') for tmp in sents]
    synts = [tmp.encode('utf8') for tmp in synts]
    with h5py.File(output_file, 'w') as h5f:
        h5f.create_dataset("sents", dtype=dtype, data=sents)
        h5f.create_dataset("synts", dtype=dtype, data=synts)
    return sents, synts
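As a quick sanity check, the scanning logic can be exercised on a small inline sample. The sketch below mirrors the loop in extract_parses but leaves out the h5py output step, so it runs without any external files (`scan_parses` is a name introduced here for illustration):

```python
def scan_parses(lines):
    """Same state machine as extract_parses, minus the HDF5 output."""
    sents, synts = [], []
    new_sent = new_parse = False
    sent = parse = ""
    for line in lines:
        if line.startswith('Sentence #'):
            new_sent, new_parse = True, False
        elif new_sent:
            if line.startswith('Tokens:'):
                new_sent = False
                sents.append(" ".join(sent.split()))
                sent = ""
            else:
                sent += line
        elif line.startswith('Constituency parse:'):
            new_parse = True
        elif new_parse:
            if line.startswith('Dependency Parse'):
                new_parse = False
                synts.append(" ".join(parse.split()))
                parse = ""
            else:
                parse += line
    return sents, synts

# a trimmed excerpt of the train.txt.out format shown above
sample = """Sentence #1 (6 tokens):
how do i memorize faster ?
Tokens:
[Text=how CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=WRB]
Constituency parse:
(ROOT (SBARQ (WHADVP (WRB how)) (SQ (VBP do) (NP (PRP i)) (VP (VBP memorize) (ADVP (RBR faster)))) (. ?)))
Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, memorize-4)
""".splitlines(keepends=True)

sents, synts = scan_parses(sample)
# sents[0] is the sentence text, synts[0] the one-line parse tree
```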