1. Make sure a Java environment is available: download and install JDK 1.8 or later.
2. Download the Stanford CoreNLP package and unzip it.
3. Stanford CoreNLP handles English by default; to process another language, download the corresponding models jar.
4. After downloading a language jar, move it into the directory unzipped in step 2 and rename it to the form stanford-<language>-corenlp-yyyy-mm-dd-models.jar (the yyyy-mm-dd part can be chosen freely),
e.g. stanford-chinese-corenlp-2024-04-22-models.jar.
For the French models, moving and renaming can be done in one command:
mv /path/to/stanford-corenlp-4.5.6-models-french.jar /path/to/stanford-corenlp-4.5.6/stanford-french-corenlp-2024-04-22-models.jar
Tip: only jars whose names contain "models" are language model jars.
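The renaming rule above is easy to get wrong, so a quick sanity check of the filename can help. The regex below is a hypothetical check derived from the naming rule described in step 4 (it is not the wrapper's own validation code):

```python
import re

# Hypothetical check of the models-jar naming rule from step 4:
# stanford-<language>-corenlp-yyyy-mm-dd-models.jar
MODELS_JAR_RE = re.compile(
    r"^stanford-(?P<lang>[a-z]+)-corenlp-\d{4}-\d{2}-\d{2}-models\.jar$"
)

def is_valid_models_jar(filename: str) -> bool:
    """Return True if the filename follows the naming rule from step 4."""
    return MODELS_JAR_RE.match(filename) is not None

print(is_valid_models_jar("stanford-chinese-corenlp-2024-04-22-models.jar"))  # True
print(is_valid_models_jar("stanford-corenlp-4.5.6-models-french.jar"))        # False: not yet renamed
```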
5. Configure the Stanford CoreNLP classpath (required). There are two options:
1) Add every jar in the unzipped directory:
export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-4.5.6/*
2) Add only the jars you actually use:
export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-4.5.6/stanford-corenlp-4.5.6.jar
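If the later checks fail, a misconfigured CLASSPATH is the usual cause. This is a small sketch (the function name is my own) that lists which CoreNLP jars the CLASSPATH entries actually reach; expanding a trailing '*' entry with glob approximates how the JVM expands directory wildcards:

```python
import glob
import os

def corenlp_jars_on_classpath():
    """List stanford-corenlp jars reachable via the CLASSPATH entries.

    A trailing '*' entry (option 1 above) is expanded with glob, which
    approximates the JVM's directory-wildcard expansion.
    """
    jars = []
    for entry in os.environ.get("CLASSPATH", "").split(os.pathsep):
        if not entry:
            continue
        paths = glob.glob(entry) if entry.endswith("*") else [entry]
        jars += [p for p in paths
                 if "stanford-corenlp" in os.path.basename(p) and p.endswith(".jar")]
    return jars

print(corenlp_jars_on_classpath())  # empty list if no CoreNLP jar is reachable
```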
6. Check that Stanford CoreNLP works.
Method 1: run the full pipeline on a file:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -file input.txt
Method 2: pipe text through the tokenizer:
echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer
# Output
Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
Method 3: call it from Python.
1) Install the stanfordcorenlp wrapper:
pip install stanfordcorenlp
2) Use stanfordcorenlp in Python to tokenize text:
# English example
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27')
sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))  # parse tree
print('Dependency Parsing:', nlp.dependency_parse(sentence))  # dependency parse
nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.
# Chinese example: remember to download the Chinese models jar first, and pass lang='zh'
nlp = StanfordCoreNLP(r'D:\python3\stanford-corenlp-full-2018-02-27\stanford-corenlp-full-2018-02-27', lang='zh')
sentence = '清华大学位于北京。'
print(nlp.word_tokenize(sentence))
print(nlp.pos_tag(sentence))
print(nlp.ner(sentence))
print(nlp.parse(sentence))
print(nlp.dependency_parse(sentence))
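Forgetting nlp.close() leaves a memory-hungry Java backend server running, so one option is to let contextlib.closing call close() automatically, even when an exception is raised. The sketch below uses a StandInNLP class of my own so it runs without the CoreNLP jars; in real use you would pass StanfordCoreNLP(path) to closing() instead:

```python
from contextlib import closing

class StandInNLP:
    """Stand-in for stanfordcorenlp.StanfordCoreNLP, so this sketch runs
    without the CoreNLP jars. It only mimics word_tokenize() and close()."""
    def __init__(self):
        self.closed = False
    def word_tokenize(self, sentence):
        return sentence.split()
    def close(self):
        self.closed = True

# closing() guarantees close() is called when the with-block exits,
# even if an exception is raised inside it.
with closing(StandInNLP()) as nlp:
    tokens = nlp.word_tokenize("Please tokenize this text .")
print(tokens)  # ['Please', 'tokenize', 'this', 'text', '.']
```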
3) A tokenizer helper from the CNN/DailyMail dataset preprocessing, from GitHub - abisee/cnn-dailymail: Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization:
import os
import subprocess

def tokenize_stories(stories_dir, tokenized_stories_dir):
    """Maps a whole directory of .story files to a tokenized version using Stanford CoreNLP Tokenizer"""
    print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
    stories = os.listdir(stories_dir)
    # make IO list file
    print("Making list of files to tokenize...")
    with open("mapping.txt", "w") as f:
        for s in stories:
            f.write("%s \t %s\n" % (os.path.join(stories_dir, s), os.path.join(tokenized_stories_dir, s)))
    command = ['java', 'edu.stanford.nlp.process.PTBTokenizer', '-ioFileList', '-preserveLines', 'mapping.txt']
    print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
    subprocess.run(command)
    print("Stanford CoreNLP Tokenizer has finished.")
    os.remove("mapping.txt")
    # Check that the tokenized stories directory contains the same number of files as the original directory
    num_orig = len(os.listdir(stories_dir))
    num_tokenized = len(os.listdir(tokenized_stories_dir))
    if num_orig != num_tokenized:
        raise Exception("The tokenized stories directory %s contains %i files, but it should contain the same number as %s (which has %i files). Was there an error during tokenization?" % (tokenized_stories_dir, num_tokenized, stories_dir, num_orig))
    print("Successfully finished tokenizing %s to %s.\n" % (stories_dir, tokenized_stories_dir))
# java is the Java runtime command.
# edu.stanford.nlp.process.PTBTokenizer is the fully qualified name of the PTBTokenizer class in the Stanford CoreNLP package, which implements the tokenizer.
# -ioFileList tells PTBTokenizer to read the input/output file mapping from the given file.
# -preserveLines is an optional flag that keeps the line structure of the original text.
# mapping.txt is the file created in the code above; it maps each original file to the desired path of its tokenized output.
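The mapping-file and command-building steps can also be separated from the actual java call, which makes them testable without the CoreNLP jars. This is a sketch of my own (build_tokenize_command is a hypothetical helper, not part of the cnn-dailymail repo); it writes the same '%s \t %s' mapping format as tokenize_stories above:

```python
import os

def build_tokenize_command(stories_dir, tokenized_dir, mapping_path):
    """Write the -ioFileList mapping file and return the PTBTokenizer command.

    Each mapping line pairs an input path with an output path, in the same
    format tokenize_stories writes. Running the returned command still
    requires the CoreNLP jars on the CLASSPATH.
    """
    stories = sorted(os.listdir(stories_dir))
    with open(mapping_path, "w") as f:
        for s in stories:
            f.write("%s \t %s\n" % (os.path.join(stories_dir, s),
                                    os.path.join(tokenized_dir, s)))
    return ["java", "edu.stanford.nlp.process.PTBTokenizer",
            "-ioFileList", "-preserveLines", mapping_path]

# Usage (directory names are placeholders):
# import subprocess
# subprocess.run(build_tokenize_command("stories", "tokenized", "mapping.txt"), check=True)
```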