Using Stanford CoreNLP in Python

1. Make sure a Java environment is available: download and install JDK 1.8 or later.

2. Download the Stanford CoreNLP package and unzip it.

3. Stanford CoreNLP handles English by default. To process other languages, download the corresponding model jar.

4. After downloading a model jar for another language, be sure to move it into the directory unpacked in step 2 and rename it to the format stanford-<language>-corenlp-yyyy-mm-dd-models.jar (the yyyy-mm-dd part can be any date you like).

For example: stanford-chinese-corenlp-2024-04-22-models.jar

mv /path/to/stanford-corenlp-4.5.6-models-french.jar /path/to/stanford-corenlp-4.5.6

Tip: only the jars whose filenames contain "models" are language model jars. A quick way to confirm the jar is in place is sketched below.
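Before moving on, it can save debugging time to confirm that the model jar really sits in the CoreNLP directory under the expected name, since the rename in step 4 is what lets the Python wrapper find it later. The check below is only a sketch: the directory path is a placeholder, and the filename pattern simply follows the stanford-<language>-corenlp-yyyy-mm-dd-models.jar convention from step 4.

import glob
import os

corenlp_dir = r'/path/to/stanford-corenlp-4.5.6'  # placeholder: the directory unpacked in step 2

# Look for a Chinese model jar named according to the convention in step 4.
pattern = os.path.join(corenlp_dir, 'stanford-chinese-corenlp-*-models.jar')
matches = glob.glob(pattern)
if matches:
    print('Found model jar:', matches[0])
else:
    print('No stanford-chinese-corenlp-*-models.jar found in', corenlp_dir)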

5. Configure the Stanford CoreNLP classpath (this step is required). There are two options:

1) Add all of the jars:

export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-4.5.6/*

2) Add only the jar(s) you actually use:

export CLASSPATH=/path/to/stanford-corenlp-4.5.6/stanford-corenlp-4.5.6.jar
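Note that the export only affects the shell it is run in, so the Python process must be started from that same shell (or have CLASSPATH set in some other way). A minimal check from inside Python:

import os

# Print whatever CLASSPATH the current Python process actually sees.
classpath = os.environ.get('CLASSPATH', '')
print('CLASSPATH =', classpath)
print('CoreNLP jars on classpath:', 'stanford-corenlp' in classpath)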

6. Verify that Stanford CoreNLP works correctly.

Method 1:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -file input.txt
Method 2:
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

# Output
Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
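The same check can be driven from Python with subprocess, which is also how the CNN/Daily Mail tokenizer in section 3) below invokes PTBTokenizer. This is only a sketch and assumes the CLASSPATH from step 5 is already set in the environment.

import subprocess

# Pipe a sentence into PTBTokenizer; requires the CLASSPATH from step 5.
result = subprocess.run(
    ['java', 'edu.stanford.nlp.process.PTBTokenizer'],
    input='Please tokenize this text.',
    capture_output=True,
    text=True,
)
print(result.stdout)   # one token per line, as in the output above
print(result.stderr)   # the "tokenized N tokens ..." status line is written to stderr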

Method 3: call it from Python

1) Install the stanfordcorenlp package

pip install stanfordcorenlp

2) Call stanfordcorenlp from Python to tokenize and annotate text

# English example
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27')
 
sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))   # constituency parse tree
print('Dependency Parsing:', nlp.dependency_parse(sentence))   # dependency parse
nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.
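Because forgetting nlp.close() leaves the Java backend running and holding memory, one option is to guarantee the call with contextlib.closing (or try/finally). A minimal sketch using only the wrapper methods shown above; the path is a placeholder:

from contextlib import closing
from stanfordcorenlp import StanfordCoreNLP

# closing() calls nlp.close() even if an annotation call raises an exception.
with closing(StanfordCoreNLP(r'/path/to/stanford-corenlp-4.5.6')) as nlp:
    print(nlp.word_tokenize('Guangdong University of Foreign Studies is located in Guangzhou.'))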



# Chinese example: be sure to download the Chinese model jar and set lang='zh'
nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27', lang='zh')
sentence = '清华大学位于北京。'
print(nlp.word_tokenize(sentence))
print(nlp.pos_tag(sentence))
print(nlp.ner(sentence))
print(nlp.parse(sentence))
print(nlp.dependency_parse(sentence))
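The wrapper can also connect to a CoreNLP server that is already running, instead of launching a new JVM for every script. The host/port form of the constructor below is taken from the stanfordcorenlp package's documentation, so treat this as a sketch and check your installed version; the server itself is started separately with the java command shown in the comment.

from stanfordcorenlp import StanfordCoreNLP

# Start the server in another terminal first, for example:
#   java -mx4g -cp "/path/to/stanford-corenlp-4.5.6/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
nlp = StanfordCoreNLP('http://localhost', port=9000)
print(nlp.word_tokenize('Stanford CoreNLP is running as a server.'))
# The external server keeps running until you stop it yourself.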

3) A tokenizer routine from the CNN/Daily Mail dataset preprocessing code (GitHub - abisee/cnn-dailymail: Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization)

import os
import subprocess

def tokenize_stories(stories_dir, tokenized_stories_dir):
  """Maps a whole directory of .story files to a tokenized version using Stanford CoreNLP Tokenizer"""
  print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
  stories = os.listdir(stories_dir)
  # make IO list file
  print("Making list of files to tokenize...")
  with open("mapping.txt", "w") as f:
    for s in stories:
      f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s)))
  command = ['java', 'edu.stanford.nlp.process.PTBTokenizer', '-ioFileList', '-preserveLines', 'mapping.txt']
  print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
  subprocess.run(command)
  print("Stanford CoreNLP Tokenizer has finished.")
  os.remove("mapping.txt")

  # Check that the tokenized stories directory contains the same number of files as the original directory
  num_orig = len(os.listdir(stories_dir))
  num_tokenized = len(os.listdir(tokenized_stories_dir))
  if num_orig != num_tokenized:
    raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
  print("Successfully finished tokenizing %s to %s.\n" % (stories_dir, tokenized_stories_dir))




# java is the command that launches the Java runtime.
# edu.stanford.nlp.process.PTBTokenizer is the fully qualified name of the PTBTokenizer class in the Stanford CoreNLP package; this class implements the tokenizer.
# -ioFileList tells PTBTokenizer to read the input/output file pairs from the listed mapping file.
# -preserveLines is an optional flag that keeps the line structure of the original text.
# mapping.txt is the file created by the code above; it maps each original file to the path of its tokenized output.
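For completeness, a minimal driver for tokenize_stories might look like the following. The two directory names are hypothetical, and the CLASSPATH from step 5 must already include the CoreNLP jars so the java call inside the function can find PTBTokenizer.

# Hypothetical paths; point these at your own .story files.
stories_dir = 'cnn/stories'
tokenized_stories_dir = 'cnn_stories_tokenized'

if not os.path.isdir(tokenized_stories_dir):
    os.makedirs(tokenized_stories_dir)

tokenize_stories(stories_dir, tokenized_stories_dir)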

References:

python中stanfordCorenlp使用教程 - CSDN博客

Overview - CoreNLP (stanfordnlp.github.io)
