StanfordNLP in Practice (Python)

Dataset and Goal

Task goal

The goal of this task is to parse text with the StanfordNLP tools and obtain analysis results for follow-up research.

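Dataset

HTML pages of news about 8 countries were collected from a news website. The body text and publication date of each article were extracted and saved to txt files.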

Sample news content (left in the original Chinese, since it is the pipeline's input):

“本报伦敦3月20日电(记者邢雪)英国最大工会组织“工会联盟”近期发布报告称,英国社会男女收入不平等,女性员工工资平均比男性员工少15.4%。英国还需要30多年才能弥合这一差距。英国财政研究所此前发布的报告也显示,过去25年英国男女收入差距“几乎没有任何变化”。
据报道,在英国不同行业,男女收入差距的严重程度不一。在金融和保险领域,男女收入差距达到32.2%,相当于女性一年内少获得将近4个月的工资。即使是在教育、医疗护理等以女性为主的行业领域,女性收入仍旧普遍低于男性。
英国财政研究所研究部门副主任科斯塔·迪亚斯认为,英国在就业、工资、工作时间等方面,依然存在较大性别差距。英国政府的政策缺乏一套“连贯性”激励机制,以确保男女社会责任均等、实现职场性别平等。”

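Data classification

Because the results of this task feed into follow-up research, the news needs to be grouped first.

  1. First, group the news data into weeks by publication date
  2. Then cluster each week's news with a hierarchical clustering algorithm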
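The first step above can be sketched as follows. This is only a minimal illustration: the article does not say how a week is delimited, so ISO calendar weeks are assumed here, and `group_by_week` and the `(publish_date, text)` pair layout are hypothetical names chosen for the example.

```python
from collections import defaultdict
from datetime import date

def group_by_week(articles):
    """Group (publish_date, text) pairs by ISO year and ISO week number."""
    weeks = defaultdict(list)
    for publish_date, text in articles:
        iso_year, iso_week, _ = publish_date.isocalendar()
        weeks[(iso_year, iso_week)].append(text)
    return dict(weeks)

# Placeholder articles, not taken from the real dataset
articles = [
    (date(2020, 4, 13), "article A"),  # Monday of ISO week 16
    (date(2020, 4, 15), "article B"),  # same week
    (date(2020, 4, 20), "article C"),  # Monday of the following week
]
weekly = group_by_week(articles)
```

Each value of `weekly` is the list of articles that would then be passed to the clustering step.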

The core hierarchical clustering code is as follows:

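import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# get_word_vector, cos_dist and stop_word are defined elsewhere in the project (not shown here).
def cluster(article_list):
    num = len(article_list)
    # The silhouette score needs between 2 and num - 1 clusters,
    # so fewer than 3 articles are returned as a single cluster.
    if num < 3:
        return np.zeros(num, dtype=int), 1
    # Pairwise distance matrix: 1 - cosine similarity
    dist = np.zeros((num, num))
    for i in range(num):
        for j in range(i + 1, num):
            v1, v2 = get_word_vector(article_list[i], article_list[j], stop_word)
            dist[i, j] = dist[j, i] = 1 - cos_dist(v1, v2)
    n_clusters = min(11, num)
    scores = []
    labels = []
    for k in range(2, n_clusters):
        # scikit-learn >= 1.2 names this parameter metric=; older releases call it affinity=
        model = AgglomerativeClustering(n_clusters=k, metric='precomputed', linkage='average')
        label = model.fit_predict(dist)
        # dist is already a distance matrix, so the silhouette coefficient must treat it as precomputed
        scores.append(silhouette_score(dist, label, metric='precomputed'))
        labels.append(label)
    # Keep the k with the highest silhouette coefficient (k starts at 2)
    best = int(np.argmax(scores))
    return labels[best], best + 2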
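The helpers `get_word_vector` and `cos_dist` are not shown in the article. The sketch below is one possible shape for them, not the author's implementation: it assumes each article has already been segmented into a list of tokens, builds bag-of-words count vectors over the shared vocabulary, and treats `cos_dist` as returning cosine similarity (which is what the `1 - val` in `cluster` suggests).

```python
import math
from collections import Counter

def get_word_vector(tokens1, tokens2, stop_words):
    """Return count vectors for two token lists over their shared vocabulary."""
    c1 = Counter(t for t in tokens1 if t not in stop_words)
    c2 = Counter(t for t in tokens2 if t not in stop_words)
    vocab = sorted(set(c1) | set(c2))
    return [c1[w] for w in vocab], [c2[w] for w in vocab]

def cos_dist(v1, v2):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Hypothetical stop-word set and pre-segmented articles
stop_word = {"的"}
v1, v2 = get_word_vector(["英国", "的", "疫情"], ["英国", "疫情", "蔓延"], stop_word)
similarity = cos_dist(v1, v2)
```

With these definitions, two articles that share most of their vocabulary get a similarity close to 1 and therefore a distance close to 0 in the matrix built by `cluster`.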

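Using stanfordnlp

Detailed installation and usage tutorials are available elsewhere, and there are plenty on CSDN.

Open a command line in the model folder and run the following command: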

java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -fileList ../filelist.txt -outputFormat json -outputDirectory ../output/England/

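filelist.txt lists the input files, one path per line. stanfordnlp works through it line by line, and here each listed file holds one news article.
[Figure: example contents of filelist.txt]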
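A file list like this could be generated with a few lines of Python. This is a sketch under assumptions: `write_filelist` is a hypothetical helper, and it assumes every article is a `.txt` file in a single folder.

```python
from pathlib import Path

def write_filelist(text_dir, filelist_path):
    """Write the path of every .txt file in text_dir to filelist_path, one per line."""
    paths = sorted(Path(text_dir).glob("*.txt"))
    Path(filelist_path).write_text(
        "".join(f"{p}\n" for p in paths), encoding="UTF-8")
    return len(paths)
```

For example, `write_filelist("../data/England", "../filelist.txt")` would list the articles of one country; both paths are placeholders. Keep the file list outside the article folder so that it is not picked up as an input on the next run.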
A sample of the resulting JSON file is given at the end because of its length.

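Coreference resolution

Using the stanfordnlp analysis results, coreference resolution is applied to the original text with the following code: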

import json

def get_coref(file):
    # Load a stanfordnlp JSON result and return its sentences and coreference chains
    with open(file, 'r', encoding='UTF-8') as f:
        result = json.load(f)
    return [result['sentences'], result['corefs']]

def deal_origin(file, res):
    sentences, corefs = res
    for mentions in corefs.values():
        # The first mention of each chain is used as the replacement text.
        # sentNum and startIndex are 1-based; endIndex is exclusive.
        origin = mentions[0]
        origin_tokens = sentences[origin['sentNum'] - 1]['tokens']
        origin_word = "".join(
            origin_tokens[j]["originalText"]
            for j in range(origin['startIndex'] - 1, origin['endIndex'] - 1))
        for mention in mentions[1:]:
            tgt_tokens = sentences[mention['sentNum'] - 1]['tokens']
            tgt_start = mention['startIndex'] - 1
            tgt_end = mention['endIndex'] - 1
            for k in range(tgt_start, tgt_end):
                # Put the antecedent on the first token of the mention and blank out the rest
                tgt_tokens[k]["originalText"] = origin_word if k == tgt_start else ""
    # Chinese text has no spaces between tokens, so plain concatenation rebuilds it
    text = "".join(tok["originalText"] for s in sentences for tok in s['tokens'])

    # file is the path the rewritten text is saved to
    with open(file, 'w', encoding='UTF-8') as f:
        f.write(text)
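The index arithmetic in `deal_origin` is easy to get wrong, so here is a small sketch of it in isolation. The data is hand-made in the shape of the JSON output, not real pipeline output, and `mention_text` is a hypothetical helper. It assumes, as the loop bounds above imply, that `sentNum` and `startIndex` count from 1 and that `endIndex` is exclusive.

```python
def mention_text(sentences, mention):
    """Return the surface text of one coreference mention."""
    tokens = sentences[mention["sentNum"] - 1]["tokens"]  # sentNum is 1-based
    start = mention["startIndex"] - 1                     # 1-based to 0-based
    end = mention["endIndex"] - 1                         # exclusive end
    return "".join(tok["originalText"] for tok in tokens[start:end])

# Hand-made example with only the fields the function reads
sentences = [
    {"tokens": [{"originalText": "英国"}, {"originalText": "政府"},
                {"originalText": "发布"}, {"originalText": "报告"}]},
    {"tokens": [{"originalText": "它"}, {"originalText": "表示"}]},
]
chain = [
    {"sentNum": 1, "startIndex": 1, "endIndex": 3},  # first two tokens of sentence 1
    {"sentNum": 2, "startIndex": 1, "endIndex": 2},  # first token of sentence 2
]
antecedent = mention_text(sentences, chain[0])
pronoun = mention_text(sentences, chain[1])
```

In this toy chain, `deal_origin` would overwrite the second mention with the text of the first one.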

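Running get_coref and deal_origin produces a new version of the source text. Process it with stanfordnlp once more. This run needs a new filelist, so remember to regenerate it.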

java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -fileList ../filelist.txt -outputFormat json -outputDirectory ../output/England/

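This yields the final stanfordnlp output for the text. Adjust the command-line options to your needs; the supported options and output formats are documented on the official stanfordnlp site.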

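Problems encountered

  • While learning, I found that the stanfordnlp Python library raised errors. Several fixes and version changes did not resolve them. The official stanfordnlp site also says the team has released Stanza, a Python NLP library. For these reasons I tried Stanza.
  • Stanza had no coreference resolution module at the time of writing, but it can access the stanfordnlp (CoreNLP) interface. After some study I called the coreference interface through it and converted the text.
  • Calling stanfordnlp from the command line is also convenient and worth considering; that is the approach used in this article.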
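For readers who want to try the Stanza route mentioned in the second point, the sketch below shows roughly what such a call might look like. It is not the code used for this article. It assumes `stanza` is installed, CoreNLP and its Chinese models are downloaded, Java is available, and the `CORENLP_HOME` environment variable points at the CoreNLP folder; `coref_chains` is a hypothetical name, and the memory and timeout values are arbitrary.

```python
def coref_chains(text, memory="6G"):
    """Annotate text through a CoreNLP server started by Stanza and return the 'corefs' dict.

    Assumption: the default Chinese properties include the coref annotator.
    """
    # Imported lazily because it only works with Java and CoreNLP available
    from stanza.server import CoreNLPClient

    with CoreNLPClient(properties="chinese", memory=memory,
                       timeout=60000, be_quiet=True) as client:
        # output_format="json" asks for the same JSON layout the command line writes
        result = client.annotate(text, output_format="json")
    return result["corefs"]
```

The returned dictionary has the same layout as the `corefs` field of the command-line output, so it can be passed to the coreference replacement code shown earlier together with the `sentences` field.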

output.json

{
  "docId": "2020-04-15_all_3.txt",
  "sentences": [
    {
      "index": 0,
      "parse": "(ROOT\r\n  (IP\r\n    (VP\r\n      (VCD (VV 延伸) (VV 阅读))\r\n      (PU :)\r\n      (IP\r\n        (NP\r\n          (NP\r\n            (NP (NN 全球) (NN 疫情) (NN 要览))\r\n            (PRN (PU ()\r\n              (NP (NT 4月) (NT 16日))\r\n              (PU ))))\r\n          (NP\r\n            (NP (NR 亚欧))\r\n            (NP (NN 地区)))\r\n          (NP (NN 疫情)))\r\n        (VP (VV 持续)\r\n          (VP\r\n            (VP (VV 蔓延)\r\n              (NP\r\n                (NP (NR 德国))\r\n                (NP (NN 社交) (NN 限制) (NN 措施))))\r\n            (VP (VV 延长))))))))",
      "basicDependencies": [
        {
          "dep": "ROOT",
          "governor": 0,
          "governorGloss": "ROOT",
          "dependent": 1,
          "dependentGloss": "延伸"
        },
        {
          "dep": "compound:vc",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 2,
          "dependentGloss": "阅读"
        },
        {
          "dep": "punct",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 3,
          "dependentGloss": ":"
        },
        {
          "dep": "compound:nn",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 4,
          "dependentGloss": "全球"
        },
        {
          "dep": "compound:nn",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 5,
          "dependentGloss": "疫情"
        },
        {
          "dep": "compound:nn",
          "governor": 13,
          "governorGloss": "疫情",
          "dependent": 6,
          "dependentGloss": "要览"
        },
        {
          "dep": "punct",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 7,
          "dependentGloss": "("
        },
        {
          "dep": "compound:nn",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 8,
          "dependentGloss": "4月"
        },
        {
          "dep": "parataxis:prnmod",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 9,
          "dependentGloss": "16日"
        },
        {
          "dep": "punct",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 10,
          "dependentGloss": ")"
        },
        {
          "dep": "nmod:assmod",
          "governor": 12,
          "governorGloss": "地区",
          "dependent": 11,
          "dependentGloss": "亚欧"
        },
        {
          "dep": "compound:nn",
          "governor": 13,
          "governorGloss": "疫情",
          "dependent": 12,
          "dependentGloss": "地区"
        },
        {
          "dep": "nsubj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 13,
          "dependentGloss": "疫情"
        },
        {
          "dep": "xcomp",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 14,
          "dependentGloss": "持续"
        },
        {
          "dep": "ccomp",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 15,
          "dependentGloss": "蔓延"
        },
        {
          "dep": "nmod:assmod",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 16,
          "dependentGloss": "德国"
        },
        {
          "dep": "compound:nn",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 17,
          "dependentGloss": "社交"
        },
        {
          "dep": "compound:nn",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 18,
          "dependentGloss": "限制"
        },
        {
          "dep": "dobj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 19,
          "dependentGloss": "措施"
        },
        {
          "dep": "conj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 20,
          "dependentGloss": "延长"
        }
      ],
      "enhancedDependencies": [
        {
          "dep": "ROOT",
          "governor": 0,
          "governorGloss": "ROOT",
          "dependent": 1,
          "dependentGloss": "延伸"
        },
        {
          "dep": "compound:vc",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 2,
          "dependentGloss": "阅读"
        },
        {
          "dep": "punct",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 3,
          "dependentGloss": ":"
        },
        {
          "dep": "compound:nn",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 4,
          "dependentGloss": "全球"
        },
        {
          "dep": "compound:nn",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 5,
          "dependentGloss": "疫情"
        },
        {
          "dep": "compound:nn",
          "governor": 13,
          "governorGloss": "疫情",
          "dependent": 6,
          "dependentGloss": "要览"
        },
        {
          "dep": "punct",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 7,
          "dependentGloss": "("
        },
        {
          "dep": "compound:nn",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 8,
          "dependentGloss": "4月"
        },
        {
          "dep": "parataxis:prnmod",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 9,
          "dependentGloss": "16日"
        },
        {
          "dep": "punct",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 10,
          "dependentGloss": ")"
        },
        {
          "dep": "nmod:assmod",
          "governor": 12,
          "governorGloss": "地区",
          "dependent": 11,
          "dependentGloss": "亚欧"
        },
        {
          "dep": "compound:nn",
          "governor": 13,
          "governorGloss": "疫情",
          "dependent": 12,
          "dependentGloss": "地区"
        },
        {
          "dep": "nsubj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 13,
          "dependentGloss": "疫情"
        },
        {
          "dep": "xcomp",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 14,
          "dependentGloss": "持续"
        },
        {
          "dep": "ccomp",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 15,
          "dependentGloss": "蔓延"
        },
        {
          "dep": "nmod:assmod",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 16,
          "dependentGloss": "德国"
        },
        {
          "dep": "compound:nn",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 17,
          "dependentGloss": "社交"
        },
        {
          "dep": "compound:nn",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 18,
          "dependentGloss": "限制"
        },
        {
          "dep": "dobj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 19,
          "dependentGloss": "措施"
        },
        {
          "dep": "conj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 20,
          "dependentGloss": "延长"
        }
      ],
      "enhancedPlusPlusDependencies": [
        {
          "dep": "ROOT",
          "governor": 0,
          "governorGloss": "ROOT",
          "dependent": 1,
          "dependentGloss": "延伸"
        },
        {
          "dep": "compound:vc",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 2,
          "dependentGloss": "阅读"
        },
        {
          "dep": "punct",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 3,
          "dependentGloss": ":"
        },
        {
          "dep": "compound:nn",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 4,
          "dependentGloss": "全球"
        },
        {
          "dep": "compound:nn",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 5,
          "dependentGloss": "疫情"
        },
        {
          "dep": "compound:nn",
          "governor": 13,
          "governorGloss": "疫情",
          "dependent": 6,
          "dependentGloss": "要览"
        },
        {
          "dep": "punct",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 7,
          "dependentGloss": "("
        },
        {
          "dep": "compound:nn",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 8,
          "dependentGloss": "4月"
        },
        {
          "dep": "parataxis:prnmod",
          "governor": 6,
          "governorGloss": "要览",
          "dependent": 9,
          "dependentGloss": "16日"
        },
        {
          "dep": "punct",
          "governor": 9,
          "governorGloss": "16日",
          "dependent": 10,
          "dependentGloss": ")"
        },
        {
          "dep": "nmod:assmod",
          "governor": 12,
          "governorGloss": "地区",
          "dependent": 11,
          "dependentGloss": "亚欧"
        },
        {
          "dep": "compound:nn",
          "governor": 13,
          "governorGloss": "疫情",
          "dependent": 12,
          "dependentGloss": "地区"
        },
        {
          "dep": "nsubj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 13,
          "dependentGloss": "疫情"
        },
        {
          "dep": "xcomp",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 14,
          "dependentGloss": "持续"
        },
        {
          "dep": "ccomp",
          "governor": 1,
          "governorGloss": "延伸",
          "dependent": 15,
          "dependentGloss": "蔓延"
        },
        {
          "dep": "nmod:assmod",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 16,
          "dependentGloss": "德国"
        },
        {
          "dep": "compound:nn",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 17,
          "dependentGloss": "社交"
        },
        {
          "dep": "compound:nn",
          "governor": 19,
          "governorGloss": "措施",
          "dependent": 18,
          "dependentGloss": "限制"
        },
        {
          "dep": "dobj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 19,
          "dependentGloss": "措施"
        },
        {
          "dep": "conj",
          "governor": 15,
          "governorGloss": "蔓延",
          "dependent": 20,
          "dependentGloss": "延长"
        }
      ],
      "entitymentions": [
        {
          "docTokenBegin": 7,
          "docTokenEnd": 9,
          "tokenBegin": 7,
          "tokenEnd": 9,
          "text": "4月16日",
          "characterOffsetBegin": 14,
          "characterOffsetEnd": 19,
          "ner": "DATE",
          "normalizedNER": "XXXX-04-16",
          "nerConfidences": {
            "DATE": -1
          }
        },
        {
          "docTokenBegin": 10,
          "docTokenEnd": 11,
          "tokenBegin": 10,
          "tokenEnd": 11,
          "text": "亚欧",
          "characterOffsetBegin": 20,
          "characterOffsetEnd": 22,
          "ner": "LOCATION",
          "nerConfidences": {
            "LOCATION": 0.48412511863581
          }
        },
        {
          "docTokenBegin": 15,
          "docTokenEnd": 16,
          "tokenBegin": 15,
          "tokenEnd": 16,
          "text": "德国",
          "characterOffsetBegin": 30,
          "characterOffsetEnd": 32,
          "ner": "COUNTRY",
          "nerConfidences": {
            "GPE": 0.9540884277315
          }
        }
      ],
      "tokens": [
        {
          "index": 1,
          "word": "延伸",
          "originalText": "延伸",
          "lemma": "延伸",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 2,
          "pos": "VV",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 2,
          "word": "阅读",
          "originalText": "阅读",
          "lemma": "阅读",
          "characterOffsetBegin": 2,
          "characterOffsetEnd": 4,
          "pos": "VV",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 3,
          "word": ":",
          "originalText": ":",
          "lemma": ":",
          "characterOffsetBegin": 4,
          "characterOffsetEnd": 5,
          "pos": "PU",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 4,
          "word": "全球",
          "originalText": "全球",
          "lemma": "全球",
          "characterOffsetBegin": 7,
          "characterOffsetEnd": 9,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 5,
          "word": "疫情",
          "originalText": "疫情",
          "lemma": "疫情",
          "characterOffsetBegin": 9,
          "characterOffsetEnd": 11,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 6,
          "word": "要览",
          "originalText": "要览",
          "lemma": "要览",
          "characterOffsetBegin": 11,
          "characterOffsetEnd": 13,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 7,
          "word": "(",
          "originalText": "(",
          "lemma": "(",
          "characterOffsetBegin": 13,
          "characterOffsetEnd": 14,
          "pos": "PU",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 8,
          "word": "4月",
          "originalText": "4月",
          "lemma": "4月",
          "characterOffsetBegin": 14,
          "characterOffsetEnd": 16,
          "pos": "NT",
          "ner": "DATE",
          "normalizedNER": "XXXX-04-16",
          "speaker": "PER0"
        },
        {
          "index": 9,
          "word": "16日",
          "originalText": "16日",
          "lemma": "16日",
          "characterOffsetBegin": 16,
          "characterOffsetEnd": 19,
          "pos": "NT",
          "ner": "DATE",
          "normalizedNER": "XXXX-04-16",
          "speaker": "PER0"
        },
        {
          "index": 10,
          "word": ")",
          "originalText": ")",
          "lemma": ")",
          "characterOffsetBegin": 19,
          "characterOffsetEnd": 20,
          "pos": "PU",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 11,
          "word": "亚欧",
          "originalText": "亚欧",
          "lemma": "亚欧",
          "characterOffsetBegin": 20,
          "characterOffsetEnd": 22,
          "pos": "NR",
          "ner": "LOCATION",
          "speaker": "PER0"
        },
        {
          "index": 12,
          "word": "地区",
          "originalText": "地区",
          "lemma": "地区",
          "characterOffsetBegin": 22,
          "characterOffsetEnd": 24,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 13,
          "word": "疫情",
          "originalText": "疫情",
          "lemma": "疫情",
          "characterOffsetBegin": 24,
          "characterOffsetEnd": 26,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 14,
          "word": "持续",
          "originalText": "持续",
          "lemma": "持续",
          "characterOffsetBegin": 26,
          "characterOffsetEnd": 28,
          "pos": "VV",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 15,
          "word": "蔓延",
          "originalText": "蔓延",
          "lemma": "蔓延",
          "characterOffsetBegin": 28,
          "characterOffsetEnd": 30,
          "pos": "VV",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 16,
          "word": "德国",
          "originalText": "德国",
          "lemma": "德国",
          "characterOffsetBegin": 30,
          "characterOffsetEnd": 32,
          "pos": "NR",
          "ner": "COUNTRY",
          "speaker": "PER0"
        },
        {
          "index": 17,
          "word": "社交",
          "originalText": "社交",
          "lemma": "社交",
          "characterOffsetBegin": 32,
          "characterOffsetEnd": 34,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 18,
          "word": "限制",
          "originalText": "限制",
          "lemma": "限制",
          "characterOffsetBegin": 34,
          "characterOffsetEnd": 36,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 19,
          "word": "措施",
          "originalText": "措施",
          "lemma": "措施",
          "characterOffsetBegin": 36,
          "characterOffsetEnd": 38,
          "pos": "NN",
          "ner": "O",
          "speaker": "PER0"
        },
        {
          "index": 20,
          "word": "延长",
          "originalText": "延长",
          "lemma": "延长",
          "characterOffsetBegin": 38,
          "characterOffsetEnd": 40,
          "pos": "VV",
          "ner": "O",
          "speaker": "PER0"
        }
      ]
    }
  ],
  "corefs": {
  }
}
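For the follow-up analysis, a result file like the one above can be read back with the standard `json` module. The sketch below collects the named entities. `entity_mentions` is a hypothetical helper, and the embedded JSON is an abridged copy of the sample that keeps only the fields the function reads.

```python
import json

def entity_mentions(result):
    """Collect (text, ner) pairs from every sentence of a parsed result."""
    return [(m["text"], m["ner"])
            for sentence in result["sentences"]
            for m in sentence.get("entitymentions", [])]

# Abridged copy of the sample output above
sample = json.loads("""
{"sentences": [{"index": 0, "entitymentions": [
    {"text": "4月16日", "ner": "DATE"},
    {"text": "亚欧", "ner": "LOCATION"},
    {"text": "德国", "ner": "COUNTRY"}]}],
 "corefs": {}}
""")
mentions = entity_mentions(sample)
```

For a real file, replace the embedded string with `json.load` on the opened output file.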
