Parsing word-segmented data with the Stanford Parser

candice廷

Article link: https://blog.csdn.net/xietingcandice/article/details/23271891

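Parsing English data:

(1) First download the Stanford parser package from the official site:

http://nlp.stanford.edu/software/lex-parser.shtml#Download

It can also be found on third-party download sites, but the official site is recommended.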

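(2) Before starting, check your Java version:

The current version of the parser requires Java 6 (JDK1.6) or later

Installing the latest release is enough. I used version 1.8 and it tested fine.

When installing the JDK, remember to configure the environment variables so that java.exe can be run directly from the command line.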
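As a rough illustration of the version check in step (2), the sketch below extracts the major Java version from the banner text that `java -version` prints and compares it with the Java 6 minimum quoted above. The function names `java_major_version` and `is_supported` are hypothetical helpers invented for this example, not part of the parser or the JDK.

```python
import re


def java_major_version(banner: str) -> int:
    """Extract the major Java version from `java -version` banner text."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if m is None:
        raise ValueError("no version string found in banner")
    first, second = int(m.group(1)), m.group(2)
    # Old numbering: "1.6.0_45" means Java 6. New numbering: "11.0.2" means Java 11.
    if first == 1 and second is not None:
        return int(second)
    return first


def is_supported(banner: str, minimum: int = 6) -> bool:
    """True if the banner reports at least the minimum major version."""
    return java_major_version(banner) >= minimum
```

In practice the banner could be captured with `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.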

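(3) Testing on English data:

Parameter description:

① -mx1g: the maximum memory allocated to the Java virtual machine, here 1g (choose whatever size you need).

② -cp: sets the classpath so that the parser jar, stanford-parser-2011-04-20.jar, can be loaded. LexicalizedParser is the parser class.

③ -maxLength: the maximum sentence length in words, here 100.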

④ -outputFormat: the format in which parsed sentences are written. The possible values are:

oneline: the constituency parse as a bracketed tree, one sentence per line.
penn: the constituency parse as an indented, multi-line tree. This is the default.
latexTree: a format similar to penn.
words: the segmented words only, e.g. 继续播报详细的新闻内容。
wordsAndTags: the segmented words with their tags, e.g. 继续/VV 播报/VV 详细/VA 的/DEC 新闻/NN 内容/NN 。/PU
rootSymbolOnly: only the ROOT node.
typedDependencies: the dependency parse, e.g.

    mmod(播报-2, 继续-1)
    rcmod(内容-6, 详细-3)
    cpm(详细-3, 的-4)
    nn(内容-6, 新闻-5)
    dobj(播报-2, 内容-6)

conllStyleDependencies, conll2008: CoNLL format (one word per line, ten fields per word), e.g.

    1 继续 _ VV _ _ 2 _ _ _
    2 播报 _ VV _ _ 0 _ _ _
    3 详细 _ VA _ _ 4 _ _ _
    4 的 _ DEC _ _ 6 _ _ _
    5 新闻 _ NN _ _ 6 _ _ _
    6 内容 _ NN _ _ 2 _ _ _
    7 。 _ PU _ _ 2 _ _ _

⑤ -escaper: character normalization (for example, the English "(" is rewritten as "-LRB-"; this conversion happens by default). The escaper for English is edu.stanford.nlp.process.PTBEscapingProcessor. For Chinese it is edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper. Example:

    java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -escaper edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper -sentences newline chineseFactored.ser.gz chinese-onesent > chinese-onesent.stp

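⑥ -sentences: how sentence boundaries are marked. The usual value is newline, meaning the sentences in the input file are separated by line breaks. The parser then receives one sentence per line and parses them one at a time.

⑦ -encoding: the character set of the input and output files (the default for Chinese is GB18030).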

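⑧ -outputFormatOptions: finer control over the output of the various -outputFormat choices (in effect, sub-options of -outputFormat). When -outputFormat is typedDependencies, -outputFormatOptions accepts the following values (the default is collapsed dependencies):

basicDependencies: the basic format.
treeDependencies: collapsed dependencies stored as a tree (some edges of the dependency graph are removed so that a tree remains).
collapsedDependencies: collapsed dependencies (not necessarily a tree). For example,

    cc(makes-11, and-12)
    conj(makes-11, distributes-13)

becomes

    conj_and(makes-11, distributes-13)

CCPropagatedDependencies: collapsed dependencies with propagation of conjunct dependencies.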

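⑨ -writeOutputFiles: writes one output file per input file. The output file has the same name as the input file with the suffix ".stp" appended. -outputFilesExtension: sets the extension of the output files; the default is ".stp".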

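⑩ -outputFilesDirectory: the directory for output files; the default is the current directory.

The parser class used in this section is parser.lexparser.LexicalizedParser. It can produce both phrase-structure constituency trees (output format penn or oneline) and dependency trees (output format typedDependencies). The next class we use is trees.EnglishGrammaticalStructure, which converts existing constituency trees (Penn Treebank-style trees) into dependency structures. These constituency trees may come from the Stanford parser or from other parsers such as the Berkeley parser or the Charniak parser.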
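To make the typedDependencies output and the collapsing step more concrete, here is a minimal sketch that reads lines of the form `relation(governor-index, dependent-index)` and applies a toy version of the single cc/conj rule shown in the options above. The names `Dep`, `parse_dep` and `collapse_conj` are hypothetical and written for this illustration. The parser's real collapsing logic covers many more cases, so this should not be read as its actual implementation.

```python
import re
from typing import List, NamedTuple


class Dep(NamedTuple):
    rel: str
    gov: str
    gov_idx: int
    dep: str
    dep_idx: int


# One typed dependency per line, e.g. "nn(内容-6, 新闻-5)".
_DEP_RE = re.compile(r"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dep(line: str) -> Dep:
    """Parse one typedDependencies line into a Dep record."""
    m = _DEP_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a typed dependency: {line!r}")
    rel, gov, gov_idx, dep, dep_idx = m.groups()
    return Dep(rel, gov, int(gov_idx), dep, int(dep_idx))


def collapse_conj(deps: List[Dep]) -> List[Dep]:
    """Toy collapsing rule: cc(g, w) + conj(g, d) becomes conj_w(g, d)."""
    # Map each governor position to the conjunction word attached to it by cc.
    cc_word = {d.gov_idx: d.dep for d in deps if d.rel == "cc"}
    out = []
    for d in deps:
        if d.rel == "cc":
            continue  # the cc edge is folded into the relation name
        if d.rel == "conj" and d.gov_idx in cc_word:
            d = d._replace(rel="conj_" + cc_word[d.gov_idx])
        out.append(d)
    return out
```

Applied to the two lines `cc(makes-11, and-12)` and `conj(makes-11, distributes-13)`, `collapse_conj` returns a single `conj_and` record linking `makes-11` to `distributes-13`.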

The input document is:

Scores of properties are under extreme fire threat as a huge blaze
continues to advance through Sydney's north-western suburbs. Fires
have also shut down the major road and rail links between Sydney and
Gosford.
The promotional stop in Sydney was everything to be expected for a
Hollywood blockbuster - phalanxes of photographers, a stretch limo to
a hotel across the Quay - but with one difference. A line-up of
masseurs was waiting to take the media in hand. Never has the term
"massaging the media" seemed so accurate.

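A command line for parsing a file like this one can be assembled from the options described earlier. The sketch below only builds the argument list and does not run anything. The input file name `input.txt`, the jar name `stanford-parser.jar` and the grammar file `englishPCFG.ser.gz` are assumptions made for illustration, and the post does not show the exact command the author used. `build_parser_command` and `expected_output_name` are hypothetical helper names.

```python
from typing import List


def build_parser_command(
    input_file: str,
    jar: str = "stanford-parser.jar",      # assumed jar name
    grammar: str = "englishPCFG.ser.gz",   # assumed English grammar file
    memory: str = "1g",
    max_length: int = 100,
    output_format: str = "penn",
    write_output_files: bool = True,
) -> List[str]:
    """Assemble a LexicalizedParser command line as an argument list."""
    cmd = [
        "java", f"-mx{memory}",
        "-cp", jar,
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-maxLength", str(max_length),
        "-outputFormat", output_format,
    ]
    if write_output_files:
        cmd.append("-writeOutputFiles")
    # Grammar and input file come last, as in the ChineseEscaper example.
    cmd += [grammar, input_file]
    return cmd


def expected_output_name(input_file: str, extension: str = "stp") -> str:
    """Output name under -writeOutputFiles: input name plus the extension."""
    return f"{input_file}.{extension}"
```

With Java and the parser files in place, `subprocess.run(build_parser_command("input.txt"), check=True)` would execute the command, and by the description of -writeOutputFiles the result would be written to `input.txt.stp`.
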
The output document is: