Processing Chinese with the Stanford NLP Chinese JAR toolkit

Latest recommended article published 2024-08-13 08:16:01

暖花_

Category: Natural Language Processing. Tags: Stanford NLP, Chinese

Article link: https://blog.csdn.net/qq_40562912/article/details/90021812

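First, download Stanford CoreNLP and the Chinese support package (the Chinese models jar) from https://stanfordnlp.github.io/CoreNLP/

Individual tools are also available at https://nlp.stanford.edu/software/. If you only need part of the functionality, such as word segmentation, you can download just the corresponding package from that page.

With the downloads done, here are my notes on processing Chinese with the Stanford tools.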

Unzip stanford-corenlp-full-2018-10-05.zip.

Add every .jar file in the extracted folder to your project, and add stanford-chinese-corenlp-2018-10-05-models.jar as well. I won't walk through the IDE steps here.
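Before going further, it can help to confirm that the Chinese models jar really ended up on the classpath. The sketch below is my own addition and not part of CoreNLP: the class name `ClasspathCheck` and the method `isOnClasspath` are hypothetical, and it assumes the models jar ships `StanfordCoreNLP-chinese.properties` at its root, which is the file name the code later in this post loads.

```java
public class ClasspathCheck {

    // Returns true if the named resource can be found on the system classpath.
    static boolean isOnClasspath(String resource) {
        return ClassLoader.getSystemResource(resource) != null;
    }

    public static void main(String[] args) {
        // Assumption: this file sits at the root of the Chinese models jar.
        String props = "StanfordCoreNLP-chinese.properties";
        if (isOnClasspath(props)) {
            System.out.println("Found " + props);
        } else {
            System.out.println("Missing " + props + ": add the Chinese models jar to the project");
        }
    }
}
```

If the file is reported missing, the usual cause is that only the core jars were added and the models jar was left out.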

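I ran into one problem: java.lang.OutOfMemoryError: GC overhead limit exceeded.

To fix it in Eclipse, right-click in the code, choose "Run As" -> "Run Configurations", and enter the following in the "VM arguments:" box on the Arguments tab:

-Xms1024m -Xmx4096m -Xss1024K -XX:PermSize=512m -XX:MaxPermSize=2048m

These flags set the heap size (-Xms, -Xmx) and the thread stack size (-Xss). The two PermSize flags only matter on Java 7 and earlier; PermGen was removed in Java 8, so they have no effect there.

In my tests the program needs at least 3 GB to run, and that is before the Chinese package is loaded.

I later ran it from IntelliJ IDEA with the same VM arguments, and it worked.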
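To check that the VM arguments were actually picked up by the run configuration, one option is to print the maximum heap from inside the program. This is only a sketch: `HeapCheck`, `maxHeapMiB`, and `hasHeap` are hypothetical names, and the 3 GB threshold is simply the figure reported above, not a documented CoreNLP requirement.

```java
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in MiB (this reflects -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // True if the configured maximum heap is at least requiredMiB.
    static boolean hasHeap(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }

    public static void main(String[] args) {
        long required = 3L * 1024L; // roughly 3 GB, the figure from the tests above
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        if (!hasHeap(required)) {
            System.out.println("Heap looks too small; consider raising -Xmx");
        }
    }
}
```

The reported value can be slightly below the -Xmx setting, depending on the garbage collector, so comparing against a threshold is more reliable than expecting an exact match.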

Here is my code:

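	// Stanford API
	// Imports required at the top of the enclosing class file:
	import java.io.StringReader;
	import java.util.List;
	import edu.stanford.nlp.ling.CoreAnnotations;
	import edu.stanford.nlp.ling.CoreLabel;
	import edu.stanford.nlp.pipeline.Annotation;
	import edu.stanford.nlp.pipeline.StanfordCoreNLP;
	import edu.stanford.nlp.process.CoreLabelTokenFactory;
	import edu.stanford.nlp.process.PTBTokenizer;
	import edu.stanford.nlp.process.WordToSentenceProcessor;
	import edu.stanford.nlp.util.CoreMap;

	/**
	 * @Description: Stanford PTBTokenizer sentence splitting, followed by Chinese segmentation with the CoreNLP pipeline
	 * @author wangk
	 * @param text text to tokenize and split into sentences
	 * @date: 2019-05-09 14:45:50
	*/
	public void pTBTokenizer(String text) {

		// PTBTokenizer is the Penn Treebank (English) tokenizer; an empty options string means default settings
		PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(new StringReader(text), new CoreLabelTokenFactory(), "");
		// process() groups the tokens produced by the PTBTokenizer instance into a list of sentences
		WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
		List<List<CoreLabel>> sents = wtsp.process(ptb.tokenize());

		for (List<CoreLabel> sent : sents) {
			System.out.println(sent);
		}

		// Load the Chinese pipeline configuration from the Chinese models jar on the classpath
		String props = "StanfordCoreNLP-chinese.properties";
		StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

		Annotation document = pipeline.process("第一个括号子表达式捕获 Web 地址的协议部分。 该子表达式匹配在冒号和两个正斜杠前面的任何单词。");
		List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
		// Join the segmented words of every sentence with spaces
		StringBuilder result = new StringBuilder();
		for (CoreMap sentence : sentences) {
			for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
				String word = token.get(CoreAnnotations.TextAnnotation.class);
				result.append(word).append(" ");
			}
		}
		System.out.println(result.toString());

		// Output for the Chinese text: 第一 个 括号 子 表达式 捕获 Web 地址 的 协议 部分 。 该 子 表达式 匹配 在 冒号 和 两 个 正 斜杠 前面 的 任何 单词 。
	}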
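If only word segmentation is needed, running fewer annotators might cut load time and memory use. The sketch below is an assumption on my part, not something taken from the code above: it copies a set of properties and overrides the `annotators` key with `tokenize, ssplit`. The class `SegmentOnlyProps` and its methods are hypothetical names, and it assumes the bundled Chinese properties file configures those two annotators. The resulting `Properties` object could then be passed to the `StanfordCoreNLP(Properties)` constructor.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class SegmentOnlyProps {

    // Copies base and overrides "annotators" so only tokenize and ssplit run.
    static Properties segmentOnly(Properties base) {
        Properties copy = new Properties();
        for (String key : base.stringPropertyNames()) {
            copy.setProperty(key, base.getProperty(key));
        }
        copy.setProperty("annotators", "tokenize, ssplit");
        return copy;
    }

    // Loads a properties file from the classpath; returns empty Properties if it is absent.
    static Properties loadFromClasspath(String name) throws IOException {
        Properties props = new Properties();
        try (InputStream in = ClassLoader.getSystemResourceAsStream(name)) {
            if (in != null) {
                props.load(in);
            }
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the Chinese models jar provides this file on the classpath.
        Properties base = loadFromClasspath("StanfordCoreNLP-chinese.properties");
        Properties props = segmentOnly(base);
        System.out.println("annotators = " + props.getProperty("annotators"));
        // With CoreNLP on the classpath, the properties can be handed to the pipeline:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Copying the properties rather than editing them in place keeps the original configuration available if the full pipeline is needed later.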