Porting your code to NLTK 3.0

liuha511

Original link: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0

NLTK 3.0 contains a number of interface changes. These are being incorporated into a new version of the NLTK book, updated for Python 3 and NLTK 3.

The way NLTK works with unicode has changed: NLTK 3 requires all text input to be unicode and always returns text as unicode. Previously, some functions and classes worked on unicode and others required encoded bytestrings. Make sure you pass unicode to NLTK and expect unicode output from it; existing code that assumes bytestrings may start to fail.
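
One way to meet the unicode requirement is to decode bytes once, at the boundary where data enters your program. The sketch below uses only the standard library; `to_text` is a hypothetical helper name, and UTF-8 is an assumed default encoding that may not match your data.

```python
def to_text(data, encoding="utf-8"):
    """Return data as a unicode string, decoding it first if it is bytes.

    Hypothetical helper, not part of NLTK. The default encoding is an
    assumption; pass the real encoding of your input.
    """
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Bytes as they might arrive from a file opened in binary mode or a socket.
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Decode before handing the text to any NLTK function.
text = to_text(raw)

# Strings that are already unicode pass through unchanged.
same = to_text("already text")
```

Decoding at the edge keeps the rest of the code free of encoding checks.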

Here are some changes you may need to make:

grammar: ContextFreeGrammar → CFG, WeightedGrammar → PCFG, StatisticalDependencyGrammar → ProbabilisticDependencyGrammar, WeightedProduction → ProbabilisticProduction
draw.tree: TreeSegmentWidget.node() → TreeSegmentWidget.label(), TreeSegmentWidget.set_node() → TreeSegmentWidget.set_label()
parsers: nbest_parse() → parse()
ccg.parse.chart: EdgeI.next() → EdgeI.nextsym()
Chunk parser: top_node → root_label; chunk_node → chunk_label
WordNet properties are now access methods, e.g. Synset.definition → Synset.definition()
sem.relextract: mk_pairs() → _tree2semi_rel(), mk_reldicts() → semi_rel2reldict(), show_clause() → clause(), show_raw_rtuple() → rtuple()
corpusname.tagged_words(simplify_tags=True) → corpusname.tagged_words(tagset='universal')
util.clean_html() → BeautifulSoup.get_text(). clean_html() has been dropped; install and use BeautifulSoup or another HTML parser instead.
util.ibigrams() → util.bigrams()
util.ingrams() → util.ngrams()
util.itrigrams() → util.trigrams()
metrics.windowdiff → metrics.segmentation.windowdiff(); metrics.windowdiff.demo() was removed.
parse.generate2 was re-written and merged into parse.generate
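
A few of the renames above can be seen together in a small sketch. It assumes NLTK 3 is installed; the toy grammar and sentence are made up for illustration. Note that parse() yields trees lazily, so code that indexed the list returned by nbest_parse() needs an explicit list().

```python
from nltk import CFG, ChartParser
from nltk.util import bigrams

# NLTK 2 built this with parse_cfg(); NLTK 3 uses the CFG class.
# The grammar itself is a toy example.
grammar = CFG.fromstring("""
S -> NP VP
NP -> 'I' | 'him'
VP -> V NP
V -> 'saw'
""")

parser = ChartParser(grammar)
tokens = ["I", "saw", "him"]

# parse() replaces nbest_parse() and returns an iterator of trees.
trees = list(parser.parse(tokens))

# Trees expose their root category through label() in NLTK 3.
root_labels = [tree.label() for tree in trees]

# bigrams() replaces ibigrams(); wrap it in list() if you need a list.
pairs = list(bigrams(tokens))
```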

Creating objects from strings:

Many objects now support a fromstring() method
tree.Tree.parse() → tree.Tree.fromstring()
tree.Tree() → tree.Tree.fromstring()
chunk.RegexpChunkRule.parse() → chunk.RegexpChunkRule.fromstring()
grammar.parse_cfg() → CFG.fromstring() (same for other types of grammar)
sem.LogicParser.parse() → sem.Expression.fromstring()
sem.DrtParser.parse() → sem.DrtExpression.fromstring()
sem.parse_valuation() → sem.Valuation.fromstring()
sem.parse_type() → sem.Type.fromstring()
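
As an illustration of the fromstring() pattern, the sketch below builds a tree and a logic expression from strings. It assumes NLTK 3 is installed; the bracketed tree and the formula are made-up inputs.

```python
from nltk.tree import Tree
from nltk.sem import Expression

# NLTK 2: Tree.parse("(S ...)") or Tree("(S ...)")
# NLTK 3: Tree.fromstring("(S ...)")
tree = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")
root = tree.label()     # category of the root node
leaves = tree.leaves()  # the words, left to right

# NLTK 2: LogicParser().parse("...")
# NLTK 3: Expression.fromstring("...")
expr = Expression.fromstring("all x.(man(x) -> mortal(x))")
```

In each case the parsing entry point now lives on the class being built rather than on a separate parser object or module-level function.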

Operations on lists of sentences or other items:

tokenize.batch_tokenize() → tokenize.tokenize_sents()
tag.batch_tag() → tag.tag_sents()
parse.batch_parse() → parse.parse_sents()
classify.batch_classify() → classify.classify_many()
sem.batch_interpret() → sem.interpret_sents()
sem.batch_evaluate() → sem.evaluate_sents()
chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
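
The renamed methods take a list of inputs and return one result per input. A minimal sketch, assuming NLTK 3 is installed; WhitespaceTokenizer and DefaultTagger are used only because they need no downloaded models, and the sentences are invented.

```python
from nltk.tokenize import WhitespaceTokenizer
from nltk.tag import DefaultTagger

sentences = ["NLTK 3 is here", "Port your code"]

# NLTK 2: tokenizer.batch_tokenize(sentences)
tokenizer = WhitespaceTokenizer()
token_lists = tokenizer.tokenize_sents(sentences)

# NLTK 2: tagger.batch_tag(token_lists)
# DefaultTagger assigns the same tag to every token; it stands in
# for a real tagger here.
tagger = DefaultTagger("NN")
tagged = tagger.tag_sents(token_lists)
```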

Changes in probability.FreqDist:

fdist.keys() → sorted(fdist)
fdist.inc(x) → fdist[x] += 1
fdist.samples() → sorted(fdist.keys())
fdist.Nr(r) → fdist.Nr()[r]
fdist.Nr_nonzero() → fdist.Nr().items()
cfdist.conditions() → sorted(cfdist.conditions())
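
A short sketch of the FreqDist replacements, assuming NLTK 3 is installed and using a made-up word list. Be aware that sorted(fdist) orders samples by key, not by count; if your old code relied on frequency order, most_common() (inherited from collections.Counter and not listed above) is one option.

```python
from nltk.probability import FreqDist

words = ["b", "a", "b", "c", "b", "a"]

fdist = FreqDist()
for word in words:
    # NLTK 2: fdist.inc(word)
    fdist[word] += 1

# Samples in sorted key order.
samples = sorted(fdist)

# Samples with counts, most frequent first (from collections.Counter).
top_two = fdist.most_common(2)

total = fdist.N()  # total number of counted outcomes
```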

Porter stemmer changes:

adjust_case(), cons(), cvc(), doublec(), m(), step1ab(), step1c(), step2(), step3(), step4(), step5(), vowelinstem() made private
ends(), r(), setto() removed
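
Code that called the individual Porter steps directly will need to go through the public stem() method instead. A minimal sketch, assuming NLTK 3 is installed; the word list is arbitrary.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# stem() runs the whole algorithm; the per-step helpers are now private.
stems = [stemmer.stem(word) for word in ["running", "caresses"]]
```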

Removed modules, classes and functions:

classify.svm was removed. For classification based on support vector machines (SVMs) use classify.scikitlearn or scikit-learn directly. See https://github.com/nltk/nltk/issues/450.
probability.GoodTuringProbDist class was removed. See https://github.com/nltk/nltk/issues/381.
HiddenMarkovModelTaggerTransformI and its subclasses were removed. See https://github.com/nltk/nltk/issues/374.
classify.maxent no longer supports algorithms backed by scipy.maxentropy. See https://github.com/nltk/nltk/issues/321.
misc.babelfish was removed. See https://github.com/nltk/nltk/issues/265.
sourcedstring was removed. See https://github.com/nltk/nltk/issues/322.
yamltags was removed. JSON is now preferred instead. See https://github.com/nltk/nltk/issues/540
mallet was removed, including the tag.crf module. See https://github.com/nltk/nltk/issues/104
tag.simplify was removed. See https://github.com/nltk/nltk/issues/483
model was removed. See https://github.com/nltk/nltk/issues?labels=model
corpus.reader.wordnet._lcs_by_depth was removed. See https://github.com/nltk/nltk/issues/422.

Miscellaneous changes:

probability.ConditionalProbDist.default_factory now inherits from dict instead of defaultdict
probability.ConditionalProbDistI.default_factory now inherits from dict instead of defaultdict
probability.DictionaryConditionalProbDist.default_factory now inherits from dict instead of defaultdict

Environment variables for third-party software: