1. Make sure a Java environment is available: download and install JDK 1.8 or later.
2. Download the Stanford CoreNLP package and unzip it.
3. Stanford CoreNLP handles English by default; to process another language, download the corresponding models jar.
4. After downloading a language jar, move it into the directory unzipped in step 2 and rename it to the form stanford-<language>-corenlp-yyyy-mm-dd-models.jar (the yyyy-mm-dd part can be chosen freely),
e.g. stanford-chinese-corenlp-2024-04-22-models.jar.
For the French models, moving and renaming can be done in one command:
mv /path/to/stanford-corenlp-4.5.6-models-french.jar /path/to/stanford-corenlp-4.5.6/stanford-french-corenlp-2024-04-22-models.jar
Tip: only jars whose names contain "models" are language model jars.
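The renaming rule above is easy to get wrong, so a quick sanity check of the filename can help. The regex below is a hypothetical check derived from the naming rule described in step 4 (it is not the wrapper's own validation code):

```python
import re

# Hypothetical check of the models-jar naming rule from step 4:
# stanford-<language>-corenlp-yyyy-mm-dd-models.jar
MODELS_JAR_RE = re.compile(
    r"^stanford-(?P<lang>[a-z]+)-corenlp-\d{4}-\d{2}-\d{2}-models\.jar$"
)

def is_valid_models_jar(filename: str) -> bool:
    """Return True if the filename follows the naming rule from step 4."""
    return MODELS_JAR_RE.match(filename) is not None

print(is_valid_models_jar("stanford-chinese-corenlp-2024-04-22-models.jar"))  # True
print(is_valid_models_jar("stanford-corenlp-4.5.6-models-french.jar"))        # False: not yet renamed
```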
5. Configure the Stanford CoreNLP classpath (required). There are two options:
1) Add every jar in the unzipped directory:
export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-4.5.6/*
2) Add only the jars you actually use:
export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-4.5.6/stanford-corenlp-4.5.6.jar
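If the later checks fail, a misconfigured CLASSPATH is the usual cause. This is a small sketch (the function name is my own) that lists which CoreNLP jars the CLASSPATH entries actually reach; expanding a trailing '*' entry with glob approximates how the JVM expands directory wildcards:

```python
import glob
import os

def corenlp_jars_on_classpath():
    """List stanford-corenlp jars reachable via the CLASSPATH entries.

    A trailing '*' entry (option 1 above) is expanded with glob, which
    approximates the JVM's directory-wildcard expansion.
    """
    jars = []
    for entry in os.environ.get("CLASSPATH", "").split(os.pathsep):
        if not entry:
            continue
        paths = glob.glob(entry) if entry.endswith("*") else [entry]
        jars += [p for p in paths
                 if "stanford-corenlp" in os.path.basename(p) and p.endswith(".jar")]
    return jars

print(corenlp_jars_on_classpath())  # empty list if no CoreNLP jar is reachable
```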
6. Check that Stanford CoreNLP works.
Method 1: run the full pipeline on a file:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -file input.txt
Method 2: pipe text through the tokenizer:
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer
# Output
Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
Method 3: call it from Python.
1) Install the stanfordcorenlp wrapper:
pip install stanfordcorenlp
2) Use stanfordcorenlp in Python to tokenize text:
# English example
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27')
sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))  # parse tree
print('Dependency Parsing:', nlp.dependency_parse(sentence))  # dependency parse
nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.
# Chinese example: remember to download the Chinese models jar first, and pass lang='zh'
nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27', lang='zh')
sentence = '清华大学位于北京。'
print(nlp.word_tokenize(sentence))
print(nlp.pos_tag(sentence))
print(nlp.ner(sentence))
print(nlp.parse(sentence))
print(nlp.dependency_parse(sentence))
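Forgetting nlp.close() leaves a memory-hungry Java backend server running, so one option is to let contextlib.closing call close() automatically, even when an exception is raised. The sketch below uses a StandInNLP class of my own so it runs without the CoreNLP jars; in real use you would pass StanfordCoreNLP(path) to closing() instead:

```python
from contextlib import closing

class StandInNLP:
    """Stand-in for stanfordcorenlp.StanfordCoreNLP, so this sketch runs
    without the CoreNLP jars. It only mimics word_tokenize() and close()."""
    def __init__(self):
        self.closed = False
    def word_tokenize(self, sentence):
        return sentence.split()
    def close(self):
        self.closed = True

# closing() guarantees close() is called when the with-block exits,
# even if an exception is raised inside it.
with closing(StandInNLP()) as nlp:
    tokens = nlp.word_tokenize("Please tokenize this text .")
print(tokens)  # ['Please', 'tokenize', 'this', 'text', '.']
```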
3) A tokenizer helper from the CNN/DailyMail dataset preprocessing, from GitHub - abisee/cnn-dailymail: Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization:
import os
import subprocess

def tokenize_stories(stories_dir, tokenized_stories_dir):
    """Maps a whole directory of .story files to a tokenized version using Stanford CoreNLP Tokenizer"""
    print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
    stories = os.listdir(stories_dir)
    # make IO list file
    print("Making list of files to tokenize...")
    with open("mapping.txt", "w") as f:
        for s in stories:
            f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s)))
    command = ['java', 'edu.stanford.nlp.process.PTBTokenizer', '-ioFileList', '-preserveLines', 'mapping.txt']
    print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
    subprocess.run(command)
    print("Stanford CoreNLP Tokenizer has finished.")
    os.remove("mapping.txt")
    # Check that the tokenized stories directory contains the same number of files as the original directory
    num_orig = len(os.listdir(stories_dir))
    num_tokenized = len(os.listdir(tokenized_stories_dir))
    if num_orig != num_tokenized:
        raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
    print("Successfully finished tokenizing %s to %s.\n" % (stories_dir, tokenized_stories_dir))
# java is the Java runtime command.
# edu.stanford.nlp.process.PTBTokenizer is the fully qualified name of the PTBTokenizer class in the Stanford CoreNLP package, which implements the tokenizer.
# -ioFileList tells PTBTokenizer to read the input/output file mapping from the given file.
# -preserveLines is an optional flag that keeps the line structure of the original text.
# mapping.txt is the file created in the code above; it maps each original file to the desired path of its tokenized output.
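The mapping-file and command-building steps can also be separated from the actual java call, which makes them testable without the CoreNLP jars. This is a sketch of my own (build_tokenize_command is a hypothetical helper, not part of the cnn-dailymail repo); it writes the same '%s \t %s' mapping format as tokenize_stories above:

```python
import os

def build_tokenize_command(stories_dir, tokenized_dir, mapping_path):
    """Write the -ioFileList mapping file and return the PTBTokenizer command.

    Each mapping line pairs an input path with an output path, in the same
    format tokenize_stories writes. Running the returned command still
    requires the CoreNLP jars on the CLASSPATH.
    """
    stories = sorted(os.listdir(stories_dir))
    with open(mapping_path, "w") as f:
        for s in stories:
            f.write("%s \t %s\n" % (os.path.join(stories_dir, s),
                                    os.path.join(tokenized_dir, s)))
    return ["java", "edu.stanford.nlp.process.PTBTokenizer",
            "-ioFileList", "-preserveLines", mapping_path]

# Usage (directory names are placeholders):
# import subprocess
# subprocess.run(build_tokenize_command("stories", "tokenized", "mapping.txt"), check=True)
```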