Stanford NER in Python: An Intro to Stanford's CoreNLP and Java for Python Programmers

Hello there! I’m back and I want this to be the first of a series of posts on Stanford’s CoreNLP library. In this article I will focus on the installation of the library and an introduction to its basic features for Java newbies like myself. I will first go through the installation steps and a couple of tests from the command line. I will later walk you through two very simple Java scripts that you will be able to easily incorporate into your Python NLP pipeline. You can find the complete code on GitHub!

CoreNLP is a toolkit with which you can generate a quite complete NLP pipeline with only a few lines of code. The library includes pre-built methods for all the main NLP procedures, such as Part of Speech (POS) tagging, Named Entity Recognition (NER), Dependency Parsing or Sentiment Analysis. It also supports other languages apart from English, more specifically Arabic, Chinese, German, French, and Spanish.

I am a big fan of the library, mainly because of HOW COOL its Sentiment Analysis model is ❤ (I will talk more about it in the next post). However, I can see why most people would rather use other libraries like NLTK or SpaCy, as CoreNLP can be a bit of an overkill. The reality is that coreNLP can be much more computationally expensive than other libraries, and for shallow NLP processes the results are not even significantly better. Plus it’s written in Java, and getting started with it is a bit of a pain for Python users (however it is doable, as you will see below, and it also has a Python API if you can’t be bothered).

  • CoreNLP Pipeline and Basic Annotators

The basic building block of coreNLP is the coreNLP pipeline. The pipeline takes an input text, processes it and outputs the results of this processing in the form of a coreDocument object. A coreNLP pipeline can be customised and adapted to the needs of your NLP project. The properties object allows you to do this customisation by adding, removing or editing annotators.

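To make this concrete, here is a minimal sketch of what such a customisation could look like in Java (the class name MiniPipeline is just a placeholder of mine; the full default setup is shown later in the article):

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class MiniPipeline {
    public static void main(String[] args) {
        // a reduced pipeline: only tokenization, sentence splitting and POS tagging
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}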

That was a lot of jargon, so let’s break it down with an example. All the information and figures were extracted from the official coreNLP page.

[Figure: the basic coreNLP pipeline, from the coreNLP site]

In the figure above we have a basic coreNLP Pipeline, the one that runs by default when you first use the coreNLP Pipeline class without changing anything. At the very left we have the input text entering the pipeline; this will usually be a plain .txt file. The pipeline itself is composed of six annotators. Each of these annotators processes the input text sequentially, with the intermediate outputs of the processing sometimes used as inputs by other annotators. If we wanted to change this pipeline by adding or removing annotators, we would use the properties object. The final output is a set of annotations in the form of a coreDocument object.

We will be working with this basic pipeline throughout the article. The nature of the objects will become clearer later on when we look at an example. For the moment let’s note down what each of the annotators does:

  • Annotator 1: Tokenization → turns raw text into tokens.

  • Annotator 2: Sentence Splitting → divides raw text into sentences.

  • Annotator 3: Part of Speech (POS) Tagging → assigns part of speech labels to tokens, such as whether they are verbs or nouns. Each token in the text will be given a tag.

[Figure: POS tagging example, from the coreNLP site]

  • Annotator 4: Lemmatization → converts every word into its lemma, its dictionary form. For example the word “was” is mapped to “be”.

  • Annotator 5: Named Entity Recognition (NER) → Recognises when an entity (a person, country, organization etc…) is named in a text. It also recognises numerical entities such as dates.

[Figure: NER example, from the coreNLP site]

  • Annotator 6: Dependency Parsing → Will parse the text and highlight dependencies between words.

[Figure: dependency parsing example, from the coreNLP site]

Lastly, all the outputs from the 6 annotators are organised into a CoreDocument. These are basically data objects that contain annotation information in a structured way. CoreDocuments make our lives easier since, as you will see later on, they store all the information so that we can access it with a simple API.

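As a small taste of that API, here is a hedged sketch of how one could read annotations back from an annotated CoreDocument (document is built as in the Java examples later in the article; posTags() and nerTags() are convenience accessors on CoreSentence):

// sketch: reading annotations back from an annotated CoreDocument
for (CoreSentence sentence : document.sentences()) {
    System.out.println("POS tags: " + sentence.posTags());  // one tag per token, e.g. [NNP, VBD, ...]
    System.out.println("NER tags: " + sentence.nerTags());  // e.g. PERSON, LOCATION, DATE, or O
}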

[Figure: the CoreDocument structure, from the coreNLP site]

  • Installation

You will need to have Java installed. You can download the latest version here. For downloading CoreNLP I followed the official guide:

  1. Downloading the CoreNLP zip file using curl or wget

curl -O -L http://nlp.stanford.edu/software/stanford-corenlp-latest.zip

  2. Unzip the file

unzip stanford-corenlp-latest.zip

  3. Move into the newly created directory

cd stanford-corenlp-4.1.0

Let’s now go through a couple of examples to make sure everything works.

  • Example using the command line and an input.txt file

For this example, we will first open the terminal and create a test file that we will use as input. The code was adapted from coreNLP’s official site. You can use the following command:

echo "the quick brown fox jumped over the lazy dog" > test.txt

echo writes the sentence "the quick brown fox jumped over the lazy dog" to the test.txt file.

Let’s now run a default coreNLP pipeline on the test sentence.

java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat xml -file test.txt

This is a Java command that loads and runs the coreNLP pipeline from the class edu.stanford.nlp.pipeline.StanfordCoreNLP. Since we have not changed anything in that class, the settings are left at their defaults. The pipeline takes the test.txt file as input and outputs an XML file.

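Incidentally, the same class also accepts pipeline settings as command-line flags, so you could run a reduced set of annotators without touching any code. A variant along these lines (the annotator list here is just an example):

java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat xml -file test.txt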

Once you run the command the pipeline will start annotating the text. You will notice it takes a while… (around 20 seconds for a 9-word sentence 🙄). The output will be a file named test.txt.xml. As a side product, this process will also automatically generate an XSLT stylesheet (CoreNLP-to-HTML.xsl), which will convert the XML into HTML if you open it in a browser.

[Figure: the test.txt.xml file displayed in Firefox]

Seems that everything is working fine!! We see the standard pipeline is actually quite complex. It includes all the annotators we saw in the section above: tokenization, sentence splitting, lemmatization, POS tagging, NER tagging and dependency parsing.

Note: I displayed it using Firefox; however, it took me ages to figure out how to do this because apparently in 2019 Firefox stopped allowing it. One can get around this by going to the about:config page and changing the privacy.file_unique_origin setting to False. If that doesn’t work for you, you can choose json as the outputFormat or open the XML file with a text editor.

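For instance, the same run with JSON output (only the -outputFormat flag changes) should produce a test.txt.json file instead:

java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file test.txt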

  • Example using the interactive shell mode

For our second example we will also work exclusively in the terminal. CoreNLP has a cool interactive shell mode that you can enter by running the following command.

java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP

Once you enter this interactive mode, you just have to type a sentence or group of sentences and they will be processed by the basic annotators on the fly! Below you can see an example of how the sentence “Hello my name is Laura” is analysed.

[Figure: interactive shell analysis of the sentence “Hello my name is Laura”]

We can see the same annotations we saw in the XML file, now printed in the terminal in a different format! You can also try it out with longer texts.

  • Example using very simple Java code

Now let’s go through a couple of Java code examples! We will basically create and tune the pipeline using Java, and then we will output the results onto a .txt file that can then be incorporated into our Python or R NLP pipeline. The code was adapted from coreNLP’s official site.

Example 1

Find the complete code on my GitHub. I will first walk you through the coreNLP_pipeline1_LBP.java file. We start the file by importing all the needed dependencies. Then we define an example text that we will use for our analysis. You can change this to any other example:

public static String text = "Marie was born in Paris."; 

Now we set up the pipeline, create a document object and annotate it using the following lines:

// set up pipeline properties
Properties props = new Properties();
// set the list of annotators to run
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse");
// build the pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// create a document object and annotate it
// (processToCoreDocument builds the CoreDocument and runs the annotators in one step)
CoreDocument document = pipeline.processToCoreDocument(text);

The rest of the lines of the file print several checks to the terminal to make sure the pipeline worked fine. For instance, we first get the list of sentences of the input document.

// get sentences of the document
List<CoreSentence> sentences = document.sentences();
System.out.println("Sentences of the document");
System.out.println(sentences);
System.out.println();

Notice that we get the list of sentences by calling the method .sentences() on the document object. Similarly, we get the list of tokens of a sentence by calling .tokens() on the sentence object, and the individual word and lemma by calling .word() and .lemma() on the token object tok.

List<CoreLabel> tokens = sentence.tokens();
System.out.println("Tokens of the sentence:");
for (CoreLabel tok : tokens) {
    System.out.println("Token: " + tok.word());
    System.out.println("Lemma: " + tok.lemma());
}

To run the file, you only need to save it in your stanford-corenlp-4.1.0 directory and use the command

java -cp "*" coreNLP_pipeline1_LBP.java

The results should look like:

[Figure: terminal output of coreNLP_pipeline1_LBP.java]

Example 2

The second example, coreNLP_pipeline2_LBP.java, is slightly different, since it reads the file coreNLP_input.txt as its input document and outputs the results to a coreNLP_output.txt file.

We used as the input text the short story of The Fox and the Grapes. It is a document with 2 paragraphs and 6 sentences. The processing will be similar to the one in the example above, except that this time we will also keep track of the paragraph and sentence numbers (see the sketch after the reading snippet below).

The biggest changes concern reading the input and writing the final output. The bit of code below will create the output file (if it doesn’t exist yet) and print the column names using PrintWriter…

File file = new File("coreNLP_output.txt");
// create the file if it doesn't exist
if (!file.exists()) {
    file.createNewFile();
}
PrintWriter out = new PrintWriter(file);
// print column names on the output document
out.println("par_id;sent_id;words;lemmas;posTags;nerTags;depParse");

…and this other bit will read the input document using Scanner. Each line of the input document will be saved as a String text that we can use just like the one in Example 1.

Scanner myReader = new Scanner(myObj);
while (myReader.hasNextLine()) {
    String text = myReader.nextLine();
    // ... process text with the pipeline, as in Example 1
}
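
Putting the two bits together, here is a rough sketch of how the paragraph and sentence numbering could work, assuming each line of the input file is one paragraph (the exact bookkeeping lives in the GitHub file; tokensAsStrings() is a CoreSentence convenience method):

// rough sketch: one paragraph per input line, numbering paragraphs and sentences
int parId = 0;
while (myReader.hasNextLine()) {
    parId++;
    CoreDocument paragraph = pipeline.processToCoreDocument(myReader.nextLine());
    int sentId = 0;
    for (CoreSentence sentence : paragraph.sentences()) {
        sentId++;
        // List.toString() prints a bracketed word list; the real file formats each column
        out.println(parId + ";" + sentId + ";" + sentence.tokensAsStrings());
    }
}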

Once the file coreNLP_pipeline2_LBP.java has been run and the output generated, one can open it as a dataframe using the following Python code:

import pandas as pd

df = pd.read_csv('coreNLP_output.txt', delimiter=';', header=0)

The resulting dataframe will look like this, and can be used for further analysis!

[Figure: the resulting coreNLP DataFrame]

  • Conclusions

As you have seen, coreNLP can be very easy to use and easily incorporated into a Python NLP pipeline! You could also print the results directly to a .csv file and use other delimiters, but I was having some annoying parsing problems… Hope you enjoyed the post anyway, and remember the complete code is available on GitHub.

In the following post we will start talking about the Recursive Sentiment Analysis model and how to use it with coreNLP and Java. Stay posted to learn more about coreNLP ✌🏻

  • Bibliography

Translated from: https://towardsdatascience.com/intro-to-stanfords-corenlp-and-java-for-python-programmers-c2586215aab6
