Java Stanford NLP: Part of Speech labels?

Question

The Stanford NLP, demo'd here, gives an output like this:

Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./.

What do the Part of Speech tags mean? I am unable to find an official list. Is it Stanford's own system, or are they using universal tags? (What is JJ, for instance?)

Also, when I am iterating through the sentences, looking for nouns, for instance, I end up doing something like checking to see if the tag .contains('N'). This feels pretty weak. Is there a better way to programmatically search for a certain part of speech?

Answer 1:

These are Penn Treebank Project tags. See the project's part-of-speech tagging guide (the "Part-of-speech tagging" PostScript file).

JJ is an adjective, NNS is a plural noun, VBP is a present-tense verb (non-3rd person singular), and RB is an adverb.

That's for English. For Chinese, it's the Penn Chinese Treebank, and for German it's the NEGRA corpus.

CC Coordinating conjunction

CD Cardinal number

DT Determiner

EX Existential there

FW Foreign word

IN Preposition or subordinating conjunction

JJ Adjective

JJR Adjective, comparative

JJS Adjective, superlative

LS List item marker

MD Modal

NN Noun, singular or mass

NNS Noun, plural

NNP Proper noun, singular

NNPS Proper noun, plural

PDT Predeterminer

POS Possessive ending

PRP Personal pronoun

PRP$ Possessive pronoun

RB Adverb

RBR Adverb, comparative

RBS Adverb, superlative

RP Particle

SYM Symbol

TO to

UH Interjection

VB Verb, base form

VBD Verb, past tense

VBG Verb, gerund or present participle

VBN Verb, past participle

VBP Verb, non-3rd person singular present

VBZ Verb, 3rd person singular present

WDT Wh-determiner

WP Wh-pronoun

WP$ Possessive wh-pronoun

WRB Wh-adverb
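Going by the list above, every noun tag (NN, NNS, NNP, NNPS) starts with "NN" and every verb tag (VB, VBD, VBG, VBN, VBP, VBZ) starts with "VB", whereas a check like .contains('N') also matches IN and VBN. A minimal sketch of a prefix check follows; the class and method names (PosPrefix, isNoun, isVerb) are made up for illustration and are not part of any Stanford API.

```java
// Hypothetical helper: classifies Penn Treebank tags by their prefix.
final class PosPrefix {
    private PosPrefix() {}

    /** True for NN, NNS, NNP and NNPS, which all start with "NN". */
    static boolean isNoun(String tag) {
        return tag != null && tag.startsWith("NN");
    }

    /** True for VB, VBD, VBG, VBN, VBP and VBZ. MD (modal) is a separate tag and is not matched. */
    static boolean isVerb(String tag) {
        return tag != null && tag.startsWith("VB");
    }

    public static void main(String[] args) {
        // Compare the substring check from the question with the prefix check.
        String[] tags = {"NNS", "IN", "VBN", "NNP"};
        for (String tag : tags) {
            System.out.println(tag + ": contains('N')=" + tag.contains("N")
                    + ", isNoun=" + isNoun(tag));
        }
    }
}
```

The prefix check relies only on the naming pattern visible in the tag list, so it would need revisiting for tag sets used with other languages.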

Answer 2:

Explanation of each tag, from the documentation:

CC: conjunction, coordinating

& 'n and both but either et for less minus neither nor or plus so

therefore times v. versus vs. whether yet

CD: numeral, cardinal

mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-seven

1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025

fifteen 271,124 dozen quintillion DM2,000 ...

DT: determiner

all an another any both del each either every half la many much nary

neither no some such that the them these this those

EX: existential there

there

FW: foreign word

gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous

lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte

terram fiche oui corporis ...

IN: preposition or conjunction, subordinating

astride among uppon whether out inside pro despite on by throughout

below within for towards near behind atop around if like until below

next into if beside ...

JJ: adjective or numeral, ordinal

third ill-mannered pre-war regrettable oiled calamitous first separable

ectoplasmic battery-powered participatory fourth still-to-be-named

multilingual multi-disciplinary ...

JJR: adjective, comparative

bleaker braver breezier briefer brighter brisker broader bumper busier

calmer cheaper choosier cleaner clearer closer colder commoner costlier

cozier creamier crunchier cuter ...

JJS: adjective, superlative

calmest cheapest choicest classiest cleanest clearest closest commonest

corniest costliest crassest creepiest crudest cutest darkest deadliest

dearest deepest densest dinkiest ...

LS: list item marker

A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005

SP-44007 Second Third Three Two * a b c d first five four one six three

two

MD: modal auxiliary

can cannot could couldn't dare may might must need ought shall should

shouldn't will would

NN: noun, common, singular or mass

common-carrier cabbage knuckle-duster Casino afghan shed thermostat

investment slide humour falloff slick wind hyena override subhumanity

machinist ...

NNS: noun, common, plural

undergraduates scotches bric-a-brac products bodyguards facets coasts

divestitures storehouses designs clubs fragrances averages

subjectivists apprehensions muses factory-jobs ...

NNP: noun, proper, singular

Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos

Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA

Shannon A.K.C. Meltex Liverpool ...

NNPS: noun, proper, plural

Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists

Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques

Apache Apaches Apocrypha ...

PDT: pre-determiner

all both half many quite such sure this

POS: genitive marker

' 's

PRP: pronoun, personal

hers herself him himself hisself it itself me myself one oneself ours

ourselves ownself self she thee theirs them themselves they thou thy us

PRP$: pronoun, possessive

her his mine my our ours their thy your

RB: adverb

occasionally unabatingly maddeningly adventurously professedly

stirringly prominently technologically magisterially predominately

swiftly fiscally pitilessly ...

RBR: adverb, comparative

further gloomier grander graver greater grimmer harder harsher

healthier heavier higher however larger later leaner lengthier

less-perfectly lesser lonelier longer louder lower more ...

RBS: adverb, superlative

best biggest bluntest earliest farthest first furthest hardest

heartiest highest largest least less most nearest second tightest worst

RP: particle

aboard about across along apart around aside at away back before behind

by crop down ever fast for forth from go high i.e. in into just later

low more off on open out over per pie raising start teeth that through

under unto up up-pp upon whole with you

SYM: symbol

% & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***

TO: "to" as preposition or infinitive marker

to

UH: interjection

Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen

huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly

man baby diddle hush sonuvabitch ...

VB: verb, base form

ask assemble assess assign assume atone attention avoid bake balkanize

bank begin behold believe bend benefit bevel beware bless boil bomb

boost brace break bring broil brush build ...

VBD: verb, past tense

dipped pleaded swiped regummed soaked tidied convened halted registered

cushioned exacted snubbed strode aimed adopted belied figgered

speculated wore appreciated contemplated ...

VBG: verb, present participle or gerund

telegraphing stirring focusing angering judging stalling lactating

hankerin' alleging veering capping approaching traveling besieging

encrypting interrupting erasing wincing ...

VBN: verb, past participle

multihulled dilapidated aerosolized chaired languished panelized used

experimented flourished imitated reunifed factored condensed sheared

unsettled primed dubbed desired ...

VBP: verb, present tense, not 3rd person singular

predominate wrap resort sue twist spill cure lengthen brush terminate

appear tend stray glisten obtain comprise detest tease attract

emphasize mold postpone sever return wag ...

VBZ: verb, present tense, 3rd person singular

bases reconstructs marks mixes displeases seals carps weaves snatches

slumps stretches authorizes smolders pictures emerges stockpiles

seduces fizzes uses bolsters slaps speaks pleads ...

WDT: WH-determiner

that what whatever which whichever

WP: WH-pronoun

that what whatever whatsoever which who whom whosoever

WP$: WH-pronoun, possessive

whose

WRB: Wh-adverb

how however whence whenever where whereby whereever wherein whereof why

Answer 3:

The accepted answer above is missing the following information:

There are also 9 punctuation tags defined (which are not listed in some references, see here). These are:

#

$

'' (used for all forms of closing quote)

( (used for all forms of opening parenthesis)

) (used for all forms of closing parenthesis)

,

. (used for all sentence-ending punctuation)

: (used for colons, semicolons and ellipses)

`` (used for all forms of opening quote)
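When filtering tagged tokens, it can help to separate these punctuation tags from word tags. The sketch below simply copies the nine tags listed above into a set; the class name PunctuationTags is hypothetical, and if your tagger emits different spellings for some of these tags the set would need adjusting.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: membership test against the nine punctuation tags listed above.
final class PunctuationTags {
    private static final Set<String> TAGS = Collections.unmodifiableSet(new HashSet<>(
            Arrays.asList("#", "$", "''", "(", ")", ",", ".", ":", "``")));

    private PunctuationTags() {}

    /** Returns true if the tag is one of the nine punctuation tags. */
    static boolean isPunctuation(String tag) {
        return TAGS.contains(tag);
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation("."));   // the sentence-ending punctuation tag
        System.out.println(isPunctuation("NNS")); // a word tag, for comparison
    }
}
```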

Answer 4:

Here is a more complete list of tags for the Penn Treebank (posted here for the sake of completeness):

http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html

It also includes tags for clause and phrase levels.

Clause Level

- S

- SBAR

- SBARQ

- SINV

- SQ

Phrase Level

- ADJP

- ADVP

- CONJP

- FRAG

- INTJ

- LST

- NAC

- NP

- NX

- PP

- PRN

- PRT

- QP

- RRC

- UCP

- VP

- WHADJP

- WHADVP

- WHNP

- WHPP

- X

(descriptions in the link)
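These clause- and phrase-level labels show up as node labels in bracketed parse trees. As a rough sketch, the snippet below pulls the labels out of a bracketed string and keeps the clause-level ones from the list above. The class name BracketLabels is hypothetical, and the bracketed sentence is hand-written for illustration rather than copied from parser output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads node labels out of a bracketed tree string.
final class BracketLabels {
    // Clause-level labels from the list above.
    private static final Set<String> CLAUSE_LEVEL =
            new HashSet<>(Arrays.asList("S", "SBAR", "SBARQ", "SINV", "SQ"));

    // A label is whatever follows "(" up to the next whitespace or parenthesis.
    private static final Pattern LABEL = Pattern.compile("\\(([^\\s()]+)");

    private BracketLabels() {}

    /** Returns every node label in the order it appears. */
    static List<String> labels(String bracketed) {
        List<String> out = new ArrayList<>();
        Matcher m = LABEL.matcher(bracketed);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    /** Returns only the clause-level labels. */
    static List<String> clauseLabels(String bracketed) {
        List<String> out = new ArrayList<>();
        for (String label : labels(bracketed)) {
            if (CLAUSE_LEVEL.contains(label)) {
                out.add(label);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written bracketing, for illustration only.
        String tree = "(S (NP (JJ Colorless) (JJ green) (NNS ideas)) "
                + "(VP (VBP sleep) (ADVP (RB furiously))) (. .))";
        System.out.println(labels(tree));
        System.out.println(clauseLabels(tree));
    }
}
```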

Answer 5:

Just in case you were wanting to code it...

/**

* Represents the English parts-of-speech, encoded using the

* de facto Penn Treebank

* Project standard.

*

* @see Penn Treebank Specification

*/

public enum PartOfSpeech {

ADJECTIVE( "JJ" ),

ADJECTIVE_COMPARATIVE( ADJECTIVE + "R" ),

ADJECTIVE_SUPERLATIVE( ADJECTIVE + "S" ),

/* This category includes most words that end in -ly as well as degree

* words like quite, too and very, posthead modifiers like enough and

* indeed (as in good enough, very well indeed), and negative markers like

* not, n't and never.

*/

ADVERB( "RB" ),

/* Adverbs with the comparative ending -er but without a strictly comparative

* meaning, like later in We can always come by later, should

* simply be tagged as RB.

*/

ADVERB_COMPARATIVE( ADVERB + "R" ),

ADVERB_SUPERLATIVE( ADVERB + "S" ),

/* This category includes how, where, why, etc.

*/

ADVERB_WH( "W" + ADVERB ),

/* This category includes and, but, nor, or, yet (as in Yet it's cheap,

* cheap yet good), as well as the mathematical operators plus, minus, less,

* times (in the sense of "multiplied by") and over (in the sense of "divided

* by"), when they are spelled out. For in the sense of "because" is

* a coordinating conjunction (CC) rather than a subordinating conjunction.

*/

CONJUNCTION_COORDINATING( "CC" ),

CONJUNCTION_SUBORDINATING( "IN" ),

CARDINAL_NUMBER( "CD" ),

DETERMINER( "DT" ),

/* This category includes which, as well as that when it is used as a

* relative pronoun.

*/

DETERMINER_WH( "W" + DETERMINER ),

EXISTENTIAL_THERE( "EX" ),

FOREIGN_WORD( "FW" ),

LIST_ITEM_MARKER( "LS" ),

NOUN( "NN" ),

NOUN_PLURAL( NOUN + "S" ),

NOUN_PROPER_SINGULAR( NOUN + "P" ),

NOUN_PROPER_PLURAL( NOUN + "PS" ),

PREDETERMINER( "PDT" ),

POSSESSIVE_ENDING( "POS" ),

PRONOUN_PERSONAL( "PRP" ),

PRONOUN_POSSESSIVE( "PRP$" ),

/* This category includes the wh-word whose.

*/

PRONOUN_POSSESSIVE_WH( "WP$" ),

/* This category includes what, who and whom.

*/

PRONOUN_WH( "WP" ),

PARTICLE( "RP" ),

/* This tag should be used for mathematical, scientific and technical symbols

* or expressions that aren't English words. It should not be used for any and

* all technical expressions. For instance, the names of chemicals, units of

* measurements (including abbreviations thereof) and the like should be

* tagged as nouns.

*/

SYMBOL( "SYM" ),

TO( "TO" ),

/* This category includes my (as in My, what a gorgeous day), oh, please,

* see (as in See, it's like this), uh, well and yes, among others.

*/

INTERJECTION( "UH" ),

VERB( "VB" ),

VERB_PAST_TENSE( VERB + "D" ),

VERB_PARTICIPLE_PRESENT( VERB + "G" ),

VERB_PARTICIPLE_PAST( VERB + "N" ),

VERB_SINGULAR_PRESENT_NONTHIRD_PERSON( VERB + "P" ),

VERB_SINGULAR_PRESENT_THIRD_PERSON( VERB + "Z" ),

/* This category includes all verbs that don't take an -s ending in the

* third person singular present: can, could, (dare), may, might, must,

* ought, shall, should, will, would.

*/

VERB_MODAL( "MD" ),

/* Stanford.

*/

SENTENCE_TERMINATOR( "." );

private final String tag;

private PartOfSpeech( String tag ) {

this.tag = tag;

}

/**

* Returns the encoding for this part-of-speech.

*

* @return A string representing a Penn Treebank encoding for an English

* part-of-speech.

*/

@Override

public String toString() {

return getTag();

}

protected String getTag() {

return this.tag;

}

public static PartOfSpeech get( String value ) {

for( PartOfSpeech v : values() ) {

if( value.equals( v.getTag() ) ) {

return v;

}

}

throw new IllegalArgumentException( "Unknown part of speech: '" + value + "'." );

}

}
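One thing to watch with the enum above: get() throws IllegalArgumentException for any tag that is not listed, and of the punctuation tags only "." is included. Below is a minimal sketch of an alternative lookup that returns an Optional instead of throwing and uses a map rather than a linear scan. MiniPartOfSpeech is a hypothetical, trimmed-down stand-in for the full enum, not a replacement for it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical trimmed-down enum; the full tag list is omitted for brevity.
enum MiniPartOfSpeech {
    ADJECTIVE("JJ"),
    NOUN("NN"),
    NOUN_PLURAL("NNS"),
    VERB("VB"),
    ADVERB("RB");

    // Built once, after all constants exist, so lookups avoid scanning values().
    private static final Map<String, MiniPartOfSpeech> BY_TAG = new HashMap<>();

    static {
        for (MiniPartOfSpeech pos : values()) {
            BY_TAG.put(pos.tag, pos);
        }
    }

    private final String tag;

    MiniPartOfSpeech(String tag) {
        this.tag = tag;
    }

    String getTag() {
        return tag;
    }

    /** Returns the matching constant, or an empty Optional for an unknown tag. */
    static Optional<MiniPartOfSpeech> find(String tag) {
        return Optional.ofNullable(BY_TAG.get(tag));
    }
}
```

A caller could then write something like MiniPartOfSpeech.find(tag).ifPresent(...) and skip unknown tags without a try/catch.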

Answer 6:

Here is the whole list, along with a reference link:

1. CC Coordinating conjunction

2. CD Cardinal number

3. DT Determiner

4. EX Existential there

5. FW Foreign word

6. IN Preposition or subordinating conjunction

7. JJ Adjective

8. JJR Adjective, comparative

9. JJS Adjective, superlative

10. LS List item marker

11. MD Modal

12. NN Noun, singular or mass

13. NNS Noun, plural

14. NNP Proper noun, singular

15. NNPS Proper noun, plural

16. PDT Predeterminer

17. POS Possessive ending

18. PRP Personal pronoun

19. PRP$ Possessive pronoun

20. RB Adverb

21. RBR Adverb, comparative

22. RBS Adverb, superlative

23. RP Particle

24. SYM Symbol

25. TO to

26. UH Interjection

27. VB Verb, base form

28. VBD Verb, past tense

29. VBG Verb, gerund or present participle

30. VBN Verb, past participle

31. VBP Verb, non-3rd person singular present

32. VBZ Verb, 3rd person singular present

33. WDT Wh-determiner

34. WP Wh-pronoun

35. WP$ Possessive wh-pronoun

36. WRB Wh-adverb

You can find the whole list of part-of-speech tags here.

Answer 7:

Regarding your second question, about finding words or chunks tagged with a particular POS (e.g., noun), here is sample code you can follow.

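import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// NounFinder is a placeholder class name so that the method compiles on its own.
public class NounFinder {
    public static void main(String[] args) {
        Properties properties = new Properties();
        // The POS pattern below only needs tokenize, ssplit and pos.
        properties.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
        String input = "Colorless green ideas sleep furiously.";
        Annotation annotation = pipeline.process(input);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        List<String> output = new ArrayList<>();
        String regex = "([{pos:/NN|NNS|NNP/}])"; // Noun
        // Compile the pattern once rather than once per sentence.
        TokenSequencePattern pattern = TokenSequencePattern.compile(regex);
        for (CoreMap sentence : sentences) {
            List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
            TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
            while (matcher.find()) {
                output.add(matcher.group());
            }
        }
        System.out.println("Input: " + input);
        System.out.println("Output: " + output);
    }
}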

The output is:

Input: Colorless green ideas sleep furiously.

Output: [ideas]
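The pattern in the sample lists NN, NNS and NNP but not NNPS. If you only have the slash-tagged text shown in the question (word/TAG pairs), the same filtering can be sketched with plain java.util.regex and no CoreNLP dependency. The class name TaggedTextFilter is hypothetical; the expression NNP?S? is meant to match exactly the four noun tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: filters "word/TAG word/TAG ..." text by tag.
final class TaggedTextFilter {
    // Matches exactly NN, NNS, NNP and NNPS.
    private static final Pattern NOUN_TAG = Pattern.compile("NNP?S?");

    private TaggedTextFilter() {}

    /** Collects the words whose tag fully matches the given pattern. */
    static List<String> wordsWithTag(String tagged, Pattern tagPattern) {
        List<String> words = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/'); // the tag follows the last slash
            if (slash <= 0) {
                continue; // not in word/TAG form
            }
            String word = item.substring(0, slash);
            String tag = item.substring(slash + 1);
            if (tagPattern.matcher(tag).matches()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Convenience wrapper for the four noun tags. */
    static List<String> nouns(String tagged) {
        return wordsWithTag(tagged, NOUN_TAG);
    }

    public static void main(String[] args) {
        String tagged = "Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./.";
        System.out.println(nouns(tagged));
    }
}
```

Splitting on the last slash keeps words that themselves contain a slash intact, and matches() requires the whole tag to match, so IN and VBN are not picked up.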

Answer 8:

They look like Brown Corpus tags; the Penn Treebank tag set was derived from the Brown Corpus tag set.

Answer 9:

Stanford CoreNLP tags for other languages: French, Spanish, German ...

I see you use the parser for English, which is the default model.

You can use the parser for other languages (French, Spanish, German ...), but be aware that both the tokenizers and the part-of-speech taggers differ for each language. To do that, you must download the model for that language (using a build tool such as Maven, for example) and then set the model you want to use.

More information about that is available here.

Here are lists of tags for different languages:

Stanford CoreNLP POS Tags for Spanish

Stanford CoreNLP POS Tagger for German uses the Stuttgart-Tübingen Tag Set (STTS)

Stanford CoreNLP POS tagger for French uses the following tags:

TAGS FOR FRENCH:

Part of Speech Tags for French

A (adjective)

Adv (adverb)

CC (coordinating conjunction)

Cl (weak clitic pronoun)

CS (subordinating conjunction)

D (determiner)

ET (foreign word)

I (interjection)

NC (common noun)

NP (proper noun)

P (preposition)

PREF (prefix)

PRO (strong pronoun)

V (verb)

PONCT (punctuation mark)

Phrasal Categories Tags for French:

AP (adjectival phrases)

AdP (adverbial phrases)

COORD (coordinated phrases)

NP (noun phrases)

PP (prepositional phrases)

VN (verbal nucleus)

VPinf (infinitive clauses)

VPpart (nonfinite clauses)

SENT (sentences)

Sint, Srel, Ssub (finite clauses)

Syntactic Functions for French:

SUJ (subject)

OBJ (direct object)

ATS (predicative complement of a subject)

ATO (predicative complement of a direct object)

MOD (modifier or adjunct)

A-OBJ (indirect complement introduced by à)

DE-OBJ (indirect complement introduced by de)

P-OBJ (indirect complement introduced by another preposition)

Source: https://stackoverflow.com/questions/1833252/java-stanford-nlp-part-of-speech-labels
