NLTK Library Package Descriptions

Nltk2-ccg

"""

Combinatory Categorial Grammar.

For more information see nltk/doc/contrib/ccg/ccg.pdf

"""

Nltk2-inference

"""

Classes and interfaces for theorem proving and model building.
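
For illustration, here is a minimal sketch of proving a simple entailment with the pure-Python resolution prover (other provers, such as Prover9, require external binaries to be installed)::

    from nltk.sem import Expression
    from nltk.inference import ResolutionProver

    read_expr = Expression.fromstring
    # "All men are mortal" and "Socrates is a man" should entail "Socrates is mortal".
    assumptions = [read_expr('all x.(man(x) -> mortal(x))'),
                   read_expr('man(socrates)')]
    goal = read_expr('mortal(socrates)')
    print(ResolutionProver().prove(goal, assumptions))  # True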

"""

Nltk2-metric

"""

NLTK Metrics

Classes and methods for scoring processing modules.
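
For illustration, a minimal sketch using two of the provided metrics (the example values are made up)::

    from nltk.metrics import accuracy, edit_distance

    # Levenshtein edit distance between two strings.
    print(edit_distance('intention', 'execution'))            # 5

    # Fraction of positions where a test sequence matches a reference.
    print(accuracy(['NN', 'VB', 'DT'], ['NN', 'VB', 'NN']))   # 0.666...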

"""

Nltk2-parse

"""

NLTK Parsers

Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as “parsing” the text, and the resulting tree structures are called the text’s “parses”. Typically, the text is a single sentence, and the tree structure represents the syntactic structure of the sentence. However, parsers can also be used in other domains. For example, parsers can be used to derive the morphological structure of the morphemes that make up a word, or to derive the discourse structure for a set of utterances.

Sometimes, a single piece of text can be represented by more than one tree structure. Texts represented by more than one tree structure are called “ambiguous” texts. Note that there are actually two ways in which a text can be ambiguous:

- The text has multiple correct parses.

- There is not enough information to decide which of several

  candidate parses is correct.

However, the parser module does not distinguish these two types of ambiguity.

The parser module defines ParserI, a standard interface for parsing texts; and two simple implementations of that interface, ShiftReduceParser and RecursiveDescentParser (a usage sketch follows the list below). It also contains sub-modules for specialized kinds of parsing:

  • nltk.parser.chart defines chart parsing, which uses dynamic programming to efficiently parse texts.

  • nltk.parser.probabilistic defines probabilistic parsing, which associates a probability with each parse.
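
A minimal sketch of using one of these parsers with a toy grammar (the grammar and sentence below are made up for this example)::

    from nltk import CFG
    from nltk.parse import RecursiveDescentParser

    # A toy grammar, written as one production per line.
    grammar = CFG.fromstring(
        "S -> NP VP\n"
        "NP -> Det N\n"
        "VP -> V NP\n"
        "Det -> 'the' | 'a'\n"
        "N -> 'dog' | 'cat'\n"
        "V -> 'chased'")
    parser = RecursiveDescentParser(grammar)
    for tree in parser.parse('the dog chased a cat'.split()):
        print(tree)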

"""

Nltk2-tag

"""

NLTK Taggers

This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.

A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part-of-speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

>>> from nltk import pos_tag, word_tokenize

>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))

[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),

("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

A Russian tagger is also available if you specify lang="rus". It uses

the Russian National Corpus tagset:

>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    # doctest: +SKIP

[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),

('бумажку', 'S'), ('.', 'NONLEX')]

This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown

>>> from nltk.tag import UnigramTagger

>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])

>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']

>>> for word, tag in tagger.tag(sent):

...     print(word, '->', tag)

Mitchell -> NP

decried -> None

the -> AT

high -> JJ

rate -> NN

of -> IN

unemployment -> None

Note that words that the tagger has not seen during training receive a tag of None.

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])

0.73...

For more information, please consult chapter 5 of the NLTK Book.

"""

Nltk2-tbl

"""

Transformation Based Learning

A general purpose package for Transformation Based Learning, currently used by nltk.tag.BrillTagger.

"""

Nltk2-chunk

"""

Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This task is called “chunk parsing” or “chunking”, and the identified groups are called “chunks”. The chunked text is represented using a shallow tree called a “chunk structure.” A chunk structure is a tree containing tokens and chunks, where each chunk is a subtree containing only tokens. For example, the chunk structure for base noun phrase chunks in the sentence “I saw the big dog on the hill” is::

    (SENTENCE:
      (NP: <I>)
      <saw>
      (NP: <the> <big> <dog>)
      <on>
      (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the chunk structure’s leaves() method.

This module defines ChunkParserI, a standard interface for chunking texts; and RegexpChunkParser, a regular-expression based implementation of that interface. It also defines ChunkScore, a utility class for scoring chunk parsers.

RegexpChunkParser

=================

RegexpChunkParser is an implementation of the chunk parser interface that uses regular-expressions over tags to chunk a text. Its parse() method first constructs a ChunkString, which encodes a particular chunking of the input text. Initially, nothing is chunked. parse.RegexpChunkParser then applies a sequence of RegexpChunkRule rules to the ChunkString, each of which modifies the chunking that it encodes. Finally, the ChunkString is transformed back into a chunk structure, which is returned.

RegexpChunkParser can only be used to chunk a single kind of phrase. For example, you can use a RegexpChunkParser to chunk the noun phrases in a text, or the verb phrases in a text; but you cannot use it to simultaneously chunk both noun phrases and verb phrases in the same text. (This is a limitation of RegexpChunkParser, not of chunk parsers in general.)

RegexpChunkRules

================


A RegexpChunkRule is a transformational rule that updates the chunking of a text by modifying its ChunkString. Each RegexpChunkRule defines the apply() method, which modifies the chunking encoded by a ChunkString. The

RegexpChunkRule class itself can be used to implement any transformational rule based on regular expressions. There are also a number of subclasses, which can be used to implement simpler types of rules:

- ``ChunkRule`` chunks anything that matches a given regular expression.

- ``ChinkRule`` chinks anything that matches a given regular expression.

- ``UnChunkRule`` will un-chunk any chunk that matches a given regular expression.

- ``MergeRule`` can be used to merge two contiguous chunks.

- ``SplitRule`` can be used to split a single chunk into two smaller chunks.

- ``ExpandLeftRule`` will expand a chunk to incorporate new unchunked material on the left.

- ``ExpandRightRule`` will expand a chunk to incorporate new unchunked material on the right.

Tag Patterns

============


A ``RegexpChunkRule`` uses a modified version of regular expression patterns, called "tag patterns".  Tag patterns are

used to match sequences of tags.  Examples of tag patterns are::

     r'(<DT>|<JJ>|<NN>)+'

     r'<NN>+'

     r'<NN.*>'

The differences between regular expression patterns and tag patterns are:

    - In tag patterns, ``'<'`` and ``'>'`` act as parentheses; so ``'<NN>+'`` matches one or more repetitions of ``'<NN>'``, not

      ``'<NN'`` followed by one or more repetitions of ``'>'``.

    - Whitespace in tag patterns is ignored.  So ``'<DT> | <NN>'`` is equivalent to ``'<DT>|<NN>'``.

    - In tag patterns, ``'.'`` is equivalent to ``'[^{}<>]'``; so ``'<NN.*>'`` matches any single tag starting with ``'NN'``.

The function ``tag_pattern2re_pattern`` can be used to transform a tag pattern to an equivalent regular expression pattern.
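
For illustration, a tag pattern like the ones above can be used with ``nltk.RegexpParser``, which builds chunkers of this kind from a grammar string (a minimal sketch; the tagged sentence is made up)::

    import nltk

    # One chunk rule: an optional determiner, any number of adjectives, then a noun.
    grammar = r'NP: {<DT>?<JJ>*<NN>}'
    cp = nltk.RegexpParser(grammar)
    sentence = [('the', 'DT'), ('big', 'JJ'), ('dog', 'NN'),
                ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
    print(cp.parse(sentence))
    # (S (NP the/DT big/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))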

Efficiency

----------

Preliminary tests indicate that ``RegexpChunkParser`` can chunk at a rate of about 300 tokens/second, with a moderately complex rule set.

There may be problems if ``RegexpChunkParser`` is used with more than 5,000 tokens at a time.  In particular, evaluation of some regular expressions may cause the Python regular expression engine to exceed its maximum recursion depth.  We have attempted to minimize these problems, but it is impossible to avoid them completely.  We therefore recommend that you apply the chunk parser to a single sentence at a time.

Emacs Tip

---------

If you evaluate the following elisp expression in emacs, it will colorize a ``ChunkString`` when you use an interactive python shell with emacs or xemacs ("C-c !")::

    (let ()

      (defconst comint-mode-font-lock-keywords

        '(("<[^>]+>" 0 'font-lock-reference-face)

          ("[{}]" 0 'font-lock-function-name-face)))

      (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer, placing the cursor after the last close parenthesis, and typing "``C-x C-e``".  You should evaluate it before running the interactive session.  The change will last until you close emacs.

Unresolved Issues

-----------------

If we use the ``re`` module for regular expressions, Python's regular expression engine generates "maximum recursion depth exceeded" errors when processing very large texts, even for regular expressions that should not require any recursion.  We therefore use the ``pre`` module instead.  But note that ``pre`` does not include Unicode support, so this module will not work with unicode strings.  Note also that ``pre`` regular expressions are not quite as advanced as ``re`` ones (e.g., no leftward zero-length assertions).

:type CHUNK_TAG_PATTERN: regexp

:var CHUNK_TAG_PATTERN: A regular expression to test whether a tag

     pattern is valid.

"""

Nltk2-classify

"""

Classes and interfaces for labeling tokens with category labels (or "class labels").  Typically, labels are represented with strings (such as ``'health'`` or ``'sports'``).  Classifiers can be used to perform a wide range of classification tasks.  For example, classifiers can be used...

- to classify documents by topic

- to classify ambiguous words by which word sense is intended

- to classify acoustic signals by which phoneme they represent

- to classify sentences by their author

Features

========

In order to decide which category label is appropriate for a given token, classifiers examine one or more 'features' of the token.  These "features" are typically chosen by hand, and indicate which aspects of the token are relevant to the classification decision.  For example, a document classifier might use a separate feature for each word, recording how often that word occurred in the document.

Featuresets

===========

The features describing a token are encoded using a "featureset", which is a dictionary that maps from "feature names" to "feature values".  Feature names are unique strings that indicate what aspect of the token is encoded by the feature.  Examples include ``'prevword'``, for a feature whose value is the previous word; and ``'contains-word(library)'`` for a feature that is true when a document contains the word ``'library'``.  Feature values are typically booleans, numbers, or strings, depending on which feature they describe.

Featuresets are typically constructed using a "feature detector" (also known as a "feature extractor").  A feature detector is a function that takes a token (and sometimes information about its context) as its input, and returns a featureset describing that token.

For example, the following feature detector converts a document (stored as a list of words) to a featureset describing the set of words included in the document:

    >>> # Define a feature detector function.

    >>> def document_features(document):

    ...     return dict([('contains-word(%s)' % w, True) for w in document])

Feature detectors are typically applied to each token before it is fed to the classifier:

    >>> # Classify each Gutenberg document.

    >>> from nltk.corpus import gutenberg

    >>> for fileid in gutenberg.fileids(): # doctest: +SKIP

    ...     doc = gutenberg.words(fileid) # doctest: +SKIP

    ...     print(fileid, classifier.classify(document_features(doc))) # doctest: +SKIP

The parameters that a feature detector expects will vary, depending on the task and the needs of the feature detector.  For example, a feature detector for word sense disambiguation (WSD) might take as its input a sentence, and the index of a word that should be classified, and return a featureset for that word.  The following feature detector for WSD includes features describing the left and right contexts of the target word:

    >>> def wsd_features(sentence, index):

    ...     featureset = {}

    ...     for i in range(max(0, index-3), index):

    ...         featureset['left-context(%s)' % sentence[i]] = True

    ...     for i in range(index, min(index+3, len(sentence))):

    ...         featureset['right-context(%s)' % sentence[i]] = True

    ...     return featureset

Training Classifiers

====================

Most classifiers are built by training them on a list of hand-labeled examples, known as the "training set".  Training sets are represented as lists of ``(featuredict, label)`` tuples.
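
For illustration, a minimal sketch of training and using a classifier on a toy training set (the feature names and labels below are made up)::

    from nltk import NaiveBayesClassifier

    # Each training example is a (featureset, label) pair.
    train_set = [
        ({'contains-word(goal)': True, 'contains-word(score)': True}, 'sports'),
        ({'contains-word(virus)': True, 'contains-word(doctor)': True}, 'health'),
        ({'contains-word(match)': True, 'contains-word(goal)': True}, 'sports'),
    ]
    classifier = NaiveBayesClassifier.train(train_set)
    print(classifier.classify({'contains-word(goal)': True}))  # expected: 'sports'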

"""

Nltk2-cluster

"""

This module contains a number of basic clustering algorithms. Clustering describes the task of discovering groups of similar items within a large collection. It is also described as unsupervised machine learning, as the data from which it learns is not annotated with class information, unlike the data used for supervised learning. Annotated data is difficult and expensive to obtain in the quantities required for the majority of supervised learning algorithms.

This problem, the knowledge acquisition bottleneck, is common to most natural language processing tasks, thus fueling the need for quality unsupervised approaches.

This module contains a k-means clusterer, E-M clusterer and a group average agglomerative clusterer (GAAC). All these clusterers involve finding good cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means, then allocates each vector to the cluster with the closest mean. It then recalculates the means of each cluster as the centroid of the vectors in the cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm which may converge to a local maximum. Hence the clustering is often repeated with random initial means and the most commonly occurring output means are chosen.

The GAAC clusterer starts with each of the *N* vectors as singleton clusters. It then iteratively merges pairs of clusters which have the closest centroids.

This continues until there is only one cluster. The order of merges gives rise to a dendrogram - a tree with the earlier merges lower than later merges. The membership of a given number of clusters *c*, *1 <= c <= N*, can be found by cutting the dendrogram at depth *c*.
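
A minimal sketch of the agglomerative clusterer, assuming numpy is installed (compare the k-means usage example later in this section)::

    from numpy import array
    from nltk.cluster import GAAClusterer

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
    clusterer = GAAClusterer(2)                      # request two clusters
    assignments = clusterer.cluster(vectors, True)   # True => also return cluster assignments
    print(assignments)
    print(clusterer.classify(array([3, 3])))         # cluster of a new vector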

The Gaussian EM clusterer models the vectors as being produced by a mixture of k Gaussian sources. The parameters of these sources (prior probability, mean and covariance matrix) are then found to maximise the likelihood of the given data. This is done with the expectation maximisation algorithm. It starts with k arbitrarily chosen means, priors and covariance matrices. It then calculates the membership probabilities for each vector in each of the clusters - this is the 'E' step. The cluster parameters are then updated in the 'M' step using the maximum likelihood estimate from the cluster membership

probabilities. This process continues until the likelihood of the data does not significantly increase.

They all extend the ClusterI interface, which defines common operations available with each clusterer. These operations include:

   - cluster: clusters a sequence of vectors

   - classify: assign a vector to a cluster

   - classification_probdist: give the probability distribution over cluster memberships

The current existing clusterers also extend cluster.VectorSpace, an abstract class which allows for singular value decomposition (SVD) and vector normalisation. SVD is used to reduce the dimensionality of the vector space in

such a manner as to preserve as much of the variation as possible, by reparameterising the axes in order of variability and discarding all bar the first d dimensions. Normalisation ensures that vectors fall in the unit hypersphere.

Usage example (see also demo())::

    from nltk import cluster

    from nltk.cluster import euclidean_distance

    from numpy import array

    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)

    clusterer = cluster.KMeansClusterer(2, euclidean_distance)

    clusterer.cluster(vectors, True)

    # classify a new vector

    print(clusterer.classify(array([3, 3])))

Note that the vectors must use numpy array-like objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for efficiency when required.

"""

Nltk2-misc

Nltk2-sem

"""

NLTK Semantic Interpretation Package

This package contains classes for representing semantic structure in formulas of first-order logic and for evaluating such formulas in set-theoretic models.

    >>> from nltk.sem import logic

    >>> logic._counter._value = 0

The package has two main components:

 - ``logic`` provides support for analyzing expressions of First Order Logic (FOL).

 - ``evaluate`` allows users to recursively determine truth in a model for formulas of FOL.

A model consists of a domain of discourse and a valuation function, which assigns values to non-logical constants. We assume that entities in the domain are represented as strings such as ``'b1'``, ``'g1'``, etc. A ``Valuation`` is initialized with a list of (symbol, value) pairs, where values are entities, sets of entities or sets of tuples of entities.

The domain of discourse can be inferred from the valuation, and a model is then created with the domain and valuation as parameters.

    >>> from nltk.sem import Valuation, Model

    >>> v = [('adam', 'b1'), ('betty', 'g1'), ('fido', 'd1'),

    ... ('girl', set(['g1', 'g2'])), ('boy', set(['b1', 'b2'])),

    ... ('dog', set(['d1'])),

    ... ('love', set([('b1', 'g1'), ('b2', 'g2'), ('g1', 'b1'), ('g2', 'b1')]))]

    >>> val = Valuation(v)

    >>> dom = val.domain

    >>> m = Model(dom, val)
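
Given the model, truth can then be checked for a formula of first-order logic, relative to an assignment of values to individual variables (a minimal continuation of the example above; ``Assignment`` is provided by ``nltk.sem``):

    >>> from nltk.sem import Assignment
    >>> g = Assignment(dom)
    >>> m.evaluate('love(adam, betty)', g)
    True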

"""

Nltk2-stem

"""

NLTK Stemmers

Interfaces used to remove morphological affixes from words, leaving only the word stem.  Stemming algorithms aim to remove those affixes required for, e.g., grammatical role, tense, or derivational morphology, leaving only the stem of the word.  This is a difficult problem due to irregular words (e.g. common verbs in English), complicated morphological rules, and part-of-speech and sense ambiguities (e.g. ``ceil-`` is not the stem of ``ceiling``).

StemmerI defines a standard interface for stemmers.
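
For example, the Porter stemmer is one implementation of this interface (a minimal sketch)::

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ['caresses', 'ponies', 'running']:
        print(word, '->', stemmer.stem(word))
    # caresses -> caress, ponies -> poni, running -> run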

"""

Nltk2-tokenize

"""

NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings.  For example, tokenizers can be used to find the words and punctuation in a string:

    >>> from nltk.tokenize import word_tokenize

    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me

    ... two of them.\n\nThanks.'''

    >>> word_tokenize(s)

    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',

    'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:

    >>> from nltk.tokenize import wordpunct_tokenize

    >>> wordpunct_tokenize(s)

    ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

We can also operate at the level of sentences, using the sentence tokenizer directly as follows:

    >>> from nltk.tokenize import sent_tokenize, word_tokenize

    >>> sent_tokenize(s)

    ['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']

    >>> [word_tokenize(t) for t in sent_tokenize(s)]

    [['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],

    ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with ``s.decode("utf8")``).

NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers.  (These methods are implemented as generators.)

    >>> from nltk.tokenize import WhitespaceTokenizer

    >>> list(WhitespaceTokenizer().span_tokenize(s))

    [(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),

    (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]

There are numerous ways to tokenize text.  If you need more control over tokenization, see the other methods provided in this package.

For further information, please see Chapter 3 of the NLTK book.

"""

Nltk2-translate

"""

Experimental features for machine translation.

These interfaces are prone to change.

"""

Nltk-collocations

"""

Tools to identify collocations --- words that often appear consecutively --- within corpora. They may also be used to find other associations between word occurrences.

See Manning and Schutze ch. 5 at http://nlp.stanford.edu/fsnlp/promo/colloc.pdf and the Text::NSP Perl package at http://ngram.sourceforge.net

Finding collocations requires first calculating the frequencies of words and their appearance in the context of other words. Often the collection of words will then require filtering to only retain useful content terms. Each ngram of words may then be scored according to some association measure, in order to determine the relative likelihood of each ngram being a collocation.

The ``BigramCollocationFinder`` and ``TrigramCollocationFinder`` classes provide these functionalities, dependent on being provided a function which scores an ngram given appropriate frequency counts. A number of standard association measures are provided in bigram_measures and trigram_measures.
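
A minimal sketch, assuming the ``genesis`` corpus has been downloaded::

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.corpus import genesis

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(genesis.words('english-web.txt'))
    finder.apply_freq_filter(3)                    # drop bigrams seen fewer than 3 times
    print(finder.nbest(bigram_measures.pmi, 10))   # ten highest-scoring bigrams by PMI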

"""

Nltk-data

"""

Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects.  Resource files are identified using URLs, such as ``nltk:corpora/abc/rural.txt`` or ``http://nltk.org/sample/toy.cfg``.  The following URL protocols are supported:

  - ``file:path``: Specifies the file whose path is *path*. Both relative and absolute paths may be used.

  - ``http://host/path``: Specifies the file stored on the web server *host* at path *path*.

  - ``nltk:path``: Specifies the file stored in the NLTK data package at *path*.  NLTK will search for these files in the directories specified by ``nltk.data.path``.

If no protocol is specified, then the default protocol ``nltk:`` will be used.

This module provides two functions that can be used to access a resource file, given its URL: ``load()`` loads a given resource, and adds it to a resource cache; and ``retrieve()`` copies a given resource to a local file.
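
A minimal sketch, assuming the ``abc`` corpus has already been downloaded into one of the directories listed in ``nltk.data.path``::

    import nltk.data

    # Load a raw text resource using the default "nltk:" protocol.
    text = nltk.data.load('corpora/abc/rural.txt', format='text')
    print(text[:60])

    # Locate the resource on disk without loading it.
    print(nltk.data.find('corpora/abc/rural.txt'))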

"""

Nltk-decorators

"""

Decorator module by Michele Simionato <michelesimionato@libero.it>

Copyright Michele Simionato, distributed under the terms of the BSD License (see below).

http://www.phyast.pitt.edu/~micheles/python/documentation.html

Included in NLTK for its support of a nice memoization decorator.

"""

Nltk-downloader

"""

The NLTK corpus and module downloader.  This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.

Downloading Packages

====================

If called with no arguments, ``download()`` will display an interactive interface which can be used to download and install new packages.

If Tkinter is available, then a graphical interface will be shown, otherwise a simple text interface will be provided.

Individual packages can be downloaded by calling the ``download()`` function with a single argument, giving the package identifier for the package that should be downloaded:

    >>> download('treebank') # doctest: +SKIP

    [nltk_data] Downloading package 'treebank'...

    [nltk_data]   Unzipping corpora/treebank.zip.

NLTK also provides a number of "package collections", consisting of a group of related packages.  To download all packages in a collection, simply call ``download()`` with the collection's identifier:

    >>> download('all-corpora') # doctest: +SKIP

    [nltk_data] Downloading package 'abc'...

    [nltk_data]   Unzipping corpora/abc.zip.

    [nltk_data] Downloading package 'alpino'...

    [nltk_data]   Unzipping corpora/alpino.zip.

      ...

    [nltk_data] Downloading package 'words'...

    [nltk_data]   Unzipping corpora/words.zip.

Download Directory

==================

By default, packages are installed in either a system-wide directory (if Python has sufficient access to write to it) or in the current user's home directory.  However, the ``download_dir`` argument may be used to specify a different installation target, if desired.

See ``Downloader.default_download_dir()`` for a more detailed description of how the default download directory is chosen.
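
For example (a minimal sketch; the target directory below is made up)::

    import nltk

    # Install the Punkt tokenizer models into a custom directory.
    nltk.download('punkt', download_dir='/tmp/nltk_data')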

NLTK Download Server

====================

Before downloading any packages, the corpus and module downloader contacts the NLTK download server, to retrieve an index file describing the available packages.  

By default, this index file is loaded from ``https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml``.

If necessary, it is possible to create a new ``Downloader`` object, specifying a different URL for the package index file.

Usage::

    python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or::

    python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

"""

# ----------------------------------------------------------------------

"""


Notes

=====

Handling data files..  Some questions:

* Should the data files be kept zipped or unzipped?  I say zipped.

* Should the data files be kept in svn at all?  Advantages: history; automatic version numbers; 'svn up' could be used rather than the downloader to update the corpora.  Disadvantages: they're big, which makes working from svn a bit of a pain.  And we're planning to potentially make them much bigger.  I don't think we want people to have to download 400MB corpora just to use nltk from svn.

* Compromise: keep the data files in trunk/data rather than in trunk/nltk.  That way you can check them out in svn if you want to; but you don't need to, and you can use the downloader instead.

* Also: keep models in mind.  When we change the code, we'd potentially like the models to get updated.  This could require a little thought.

* So.. let's assume we have a trunk/data directory, containing a bunch of packages.  The packages should be kept as zip files, because we really shouldn't be editing them much (well -- we may edit models more, but they tend to be binary-ish files anyway, where diffs aren't that helpful).  So we'll have trunk/data, with a bunch of files like abc.zip and treebank.zip and propbank.zip.  For each package we could also have eg treebank.xml and propbank.xml, describing the contents of the package (name, copyright, license, etc).  Collections would also have .xml files.  Finally, we would pull all these together to form a single index.xml file.  Some directory structure wouldn't hurt.  So how about::

    /trunk/data/ ....................... root of data svn

      index.xml ........................ main index file

      src/ ............................. python scripts

      packages/ ........................ dir for packages

        corpora/ ....................... zip & xml files for corpora

        grammars/ ...................... zip & xml files for grammars

        taggers/ ....................... zip & xml files for taggers

        tokenizers/ .................... zip & xml files for tokenizers

        etc.

      collections/ ..................... xml files for collections

  Where the root (/trunk/data) would contain a makefile; and src/ would contain a script to update the info.xml file.  It could also contain scripts to rebuild some of the various model files.  The script that builds index.xml should probably check that each zip file expands entirely into a single subdir, whose name matches the package's uid.

Changes I need to make:

  - in index: change "size" to "filesize" or "compressed-size"

  - in index: add "unzipped-size"

  - when checking status: check both compressed & uncompressed size.

    uncompressed size is important to make sure we detect a problem if something got partially unzipped.  define new status values to differentiate stale vs corrupt vs corruptly-uncompressed??

    (we shouldn't need to re-download the file if the zip file is ok but it didn't get uncompressed fully.)

  - add other fields to the index: author, license, copyright, contact, etc.

the current grammars/ package would become a single new package (eg toy-grammars or book-grammars).

xml file should have:

  - authorship info

  - license info

  - copyright info

  - contact info

  - info about what type of data/annotation it contains?

  - recommended corpus reader?

collections can contain other collections.  they can also contain multiple package types (corpora & models).  Have a single 'basics' package that includes everything we talk about in the book?

n.b.: there will have to be a fallback to the punkt tokenizer, in case they didn't download that model.

default: unzip or not?

"""

Nltk-featstruct

"""

Basic data classes for representing feature structures, and for performing basic operations on those feature structures.  A feature structure is a mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure.  There are two types of feature structure, implemented by two subclasses of ``FeatStruct``:

    - feature dictionaries, implemented by ``FeatDict``, act like Python dictionaries.  Feature identifiers may be strings or

      instances of the ``Feature`` class.

    - feature lists, implemented by ``FeatList``, act like Python lists.  Feature identifiers are integers.

Feature structures are typically used to represent partial information about objects.  A feature identifier that is not mapped to a value stands for a feature whose value is unknown (*not* a feature without a value).  Two feature structures that represent (potentially overlapping) information about the same object can be combined by unification.  When two inconsistent feature structures are unified, the unification fails and returns None.
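
For example, unifying two compatible feature dictionaries combines their information, while unifying inconsistent ones fails (a minimal sketch; the feature names are made up)::

    from nltk.featstruct import FeatStruct

    fs1 = FeatStruct(number='singular', person=3)
    fs2 = FeatStruct(number='singular', gender='feminine')
    print(fs1.unify(fs2))   # combined structure with gender, number and person

    fs3 = FeatStruct(number='plural')
    print(fs1.unify(fs3))   # None -- the two structures are inconsistent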

Features can be specified using "feature paths", or tuples of feature identifiers that specify path through the nested feature structures to a value.  Feature structures may contain reentrant feature values.  A "reentrant feature value" is a single feature value that can be accessed via multiple feature paths.  Unification preserves the reentrance relations imposed by both of the unified feature structures.  In the feature structure resulting from unification, any modifications to a reentrant feature value will be visible using any of its feature paths.

Feature structure variables are encoded using the ``nltk.sem.Variable`` class.  The variables' values are tracked using a bindings dictionary, which maps variables to their values.  When two feature structures are unified, a fresh bindings dictionary is created to track their values; and before unification completes, all bound variables are replaced by their values.  Thus, the bindings dictionaries are usually strictly internal to the unification process. However, it is possible to track the bindings of variables if you choose to, by supplying your own initial bindings dictionary to the ``unify()`` function.

When unbound variables are unified with one another, they become aliased.  This is encoded by binding one variable to the other.

Lightweight Feature Structures

==============================

Many of the functions defined by ``nltk.featstruct`` can be applied directly to simple Python dictionaries and lists, rather than to full-fledged ``FeatDict`` and ``FeatList`` objects.  In other words, Python ``dicts`` and ``lists`` can be used as "light-weight" feature structures.

    >>> from nltk.featstruct import unify

    >>> unify(dict(x=1, y=dict()), dict(a='a', y=dict(b='b')))  # doctest: +SKIP

    {'y': {'b': 'b'}, 'x': 1, 'a': 'a'}

However, you should keep in mind the following caveats:

  - Python dictionaries & lists ignore reentrance when checking for equality between values.  But two FeatStructs with different reentrances are considered nonequal, even if all their base values are equal.

  - FeatStructs can be easily frozen, allowing them to be used as keys in hash tables.  Python dictionaries and lists can not.

  - FeatStructs display reentrance in their string representations; Python dictionaries and lists do not.

  - FeatStructs may *not* be mixed with Python dictionaries and lists (e.g., when performing unification).

  - FeatStructs provide a number of useful methods, such as ``walk()`` and ``cyclic()``, which are not available for Python dicts and lists.

In general, if your feature structures will contain any reentrances, or if you plan to use them as dictionary keys, it is strongly recommended that you use full-fledged ``FeatStruct`` objects.

"""

Nltk-grammar

"""

Basic data classes for representing context free grammars.  A "grammar" specifies which trees can represent the structure of a given text.  Each of these trees is called a "parse tree" for the text (or simply a "parse").  In a "context free" grammar, the set of parse trees for any piece of a text can depend only on that piece, and not on the rest of the text (i.e., the piece's context).  Context free grammars are often used to find possible syntactic structures for sentences.  In this context, the leaves of a parse tree are word tokens; and the node values are phrasal categories, such as ``NP`` and ``VP``.

The ``CFG`` class is used to encode context free grammars.  Each ``CFG`` consists of a start symbol and a set of productions.

The "start symbol" specifies the root node value for parse trees.  For example, the start symbol for syntactic parsing is usually ``S``.  Start symbols are encoded using the ``Nonterminal`` class, which is discussed below.

A Grammar's "productions" specify what parent-child relationships a parse tree can contain.  Each production specifies that a particular node can be the parent of a particular set of children.  For example, the production ``<S> -> <NP> <VP>`` specifies that an ``S`` node can be the parent of an ``NP`` node and a ``VP`` node.

Grammar productions are implemented by the ``Production`` class.

Each ``Production`` consists of a left hand side and a right hand side.  The "left hand side" is a ``Nonterminal`` that specifies the node type for a potential parent; and the "right hand side" is a list that specifies allowable children for that parent.  This list consists of ``Nonterminals`` and text types: each ``Nonterminal`` indicates that the corresponding child may be a ``TreeToken`` with the specified node type; and each text type indicates that the corresponding child may be a ``Token`` with that type.

The ``Nonterminal`` class is used to distinguish node values from leaf values.  This prevents the grammar from accidentally using a leaf value (such as the English word "A") as the node of a subtree.  Within a ``CFG``, all node values are wrapped in the ``Nonterminal`` class. Note, however, that the trees that are specified by the grammar do *not* include these ``Nonterminal`` wrappers.

Grammars can also be given a more procedural interpretation.  According to this interpretation, a Grammar specifies any tree structure *tree* that can be produced by the following procedure:

| Set tree to the start symbol

| Repeat until tree contains no more nonterminal leaves:

|   Choose a production prod with whose left hand side

|     lhs is a nonterminal leaf of tree.

|   Replace the nonterminal leaf with a subtree, whose node

|     value is the value wrapped by the nonterminal lhs, and

|     whose children are the right hand side of prod.

The operation of replacing the left hand side (*lhs*) of a production with the right hand side (*rhs*) in a tree (*tree*) is known as "expanding" *lhs* to *rhs* in *tree*.
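
For illustration, a minimal sketch of building a small grammar from a string and parsing with it (the grammar and sentence are made up)::

    from nltk import CFG, ChartParser

    grammar = CFG.fromstring(
        "S -> NP VP\n"
        "NP -> 'John' | 'Mary'\n"
        "VP -> V NP\n"
        "V -> 'saw'")
    print(grammar.start())          # S
    print(grammar.productions())    # the list of Production objects
    for tree in ChartParser(grammar).parse(['John', 'saw', 'Mary']):
        print(tree)                 # (S (NP John) (VP (V saw) (NP Mary)))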

"""

Nltk-help

"""

Provide structured access to documentation.

"""

Nltk-internals

Nltk-jsontags

"""

Register JSON tags, so the nltk data loader knows what module and class to look for.

NLTK uses simple '!' tags to mark the types of objects, but the fully-qualified "tag:nltk.org,2011:" prefix is also accepted in case anyone ends up using it.

"""

Nltk-lazyimport

""" Helper to enable simple lazy module import.

    'Lazy' means the actual import is deferred until an attribute is requested from the module's namespace. This has the advantage of allowing all imports to be done at the top of a script (in a prominent and visible place) without having a great impact on startup time.

    Copyright (c) 1999-2005, Marc-Andre Lemburg; mailto:mal@lemburg.com

    See the documentation for further information on copyrights, or contact the author. All Rights Reserved.

"""

Nltk-probability

"""

Classes for representing and processing probabilistic information.

The ``FreqDist`` class is used to encode "frequency distributions", which count the number of times that each outcome of an experiment occurs.

The ``ProbDistI`` class defines a standard interface for "probability distributions", which encode the probability of each outcome for an experiment.  There are two types of probability distribution:

  - "derived probability distributions" are created from frequency distributions.  They attempt to model the probability distribution that generated the frequency distribution.

  - "analytic probability distributions" are created directly from parameters (such as variance).

The ``ConditionalFreqDist`` class and ``ConditionalProbDistI`` interface are used to encode conditional distributions.  Conditional probability distributions can be derived or analytic; but currently the only implementation of the ``ConditionalProbDistI`` interface is ``ConditionalProbDist``, a derived distribution.
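
A minimal sketch of a frequency distribution and a probability distribution derived from it::

    from nltk import FreqDist
    from nltk.probability import MLEProbDist

    fd = FreqDist('abracadabra')    # counts each character of the string
    print(fd['a'], fd.N())          # 5 11
    pd = MLEProbDist(fd)            # maximum-likelihood estimate derived from fd
    print(pd.prob('a'))             # 5/11 = 0.4545...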

"""

Nltk-text

"""

This module brings together a variety of NLTK functionality for text analysis, and provides simple, interactive interfaces.

Functionality includes: concordancing, collocation discovery, regular expression search over tokenized strings, and distributional similarity.
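
A minimal sketch, assuming the ``gutenberg`` corpus has been downloaded::

    from nltk.corpus import gutenberg
    from nltk.text import Text

    moby = Text(gutenberg.words('melville-moby_dick.txt'))
    moby.concordance('monstrous', lines=5)   # concordance view of a word
    moby.similar('monstrous')                # distributionally similar words
    print(moby.count('whale'))               # raw frequency of a word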

"""

Nltk-tree

"""

Class for representing hierarchical language structures, such as syntax trees and morphological trees.
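
A minimal sketch::

    from nltk import Tree

    t = Tree.fromstring('(S (NP I) (VP (V saw) (NP him)))')
    print(t.label())    # S
    print(t.leaves())   # ['I', 'saw', 'him']
    print(t[1])         # the second child: (VP (V saw) (NP him))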

"""

Nltk-treetransforms

"""

A collection of methods for tree (grammar) transformations used in parsing natural language.

Although many of these methods are technically grammar transformations (i.e., Chomsky Normal Form), when working with treebanks it is much more natural to visualize these modifications in a tree structure.  Hence, we will do all transformations directly to the tree itself.

Transforming the tree directly also allows us to do parent annotation.

A grammar can then be simply induced from the modified tree.

The following is a short tutorial on the available transformations.

 1. Chomsky Normal Form (binarization)

    It is well known that any grammar has a Chomsky Normal Form (CNF) equivalent grammar where CNF is defined by every production having either two non-terminals or one terminal on its right hand side.

    When we have hierarchically structured data (i.e. a treebank), it is natural to view this in terms of productions where the root of every subtree is the head (left hand side) of the production and all of its children are the right hand side constituents.  In order to convert a tree into CNF, we simply need to ensure that every subtree has either two subtrees as children (binarization), or one leaf node (a terminal).  In order to binarize a subtree with more than two children, we must introduce artificial nodes.

    There are two popular methods to convert a tree into CNF: left factoring and right factoring.  The following example demonstrates the difference between them.  Example::

     Original       Right-Factored     Left-Factored
          A              A                      A
        / | \          /   \                  /   \
       B  C  D   ==>  B    A|<C-D>   OR   A|<B-C>  D
                            /  \          /  \
                           C    D        B    C

 2. Parent Annotation

    In addition to binarizing the tree, there are two standard modifications to node labels we can do in the same traversal: parent annotation and Markov order-N smoothing (or sibling smoothing).

    The purpose of parent annotation is to refine the probabilities of productions by adding a small amount of context.  With this simple addition, a CYK (inside-outside, dynamic programming chart parse) can improve from 74% to 79% accuracy.  A natural generalization from parent annotation is to grandparent annotation and beyond.  The tradeoff becomes accuracy gain vs. computational complexity.  We must also keep in mind data sparsity issues.  Example::

     Original       Parent Annotation
          A                A^<?>
        / | \             /   \
       B  C  D   ==>  B^<A>    A|<C-D>^<?>     where ? is the
                                 /  \          parent of A
                             C^<A>   D^<A>

 3. Markov order-N smoothing

    Markov smoothing combats data sparsity issues as well as decreasing computational requirements by limiting the number of children included in artificial nodes.  In practice, most people use an order 2 grammar.  Example::

      Original       No Smoothing       Markov order 1   Markov order 2   etc.
       __A__            A                      A                A
      / /|\ \         /   \                  /   \            /   \
     B C D E F  ==>  B    A|<C-D-E-F>  ==>  B   A|<C>  ==>   B  A|<C-D>
                            /   \               /   \            /   \
                           C    ...            C    ...         C    ...

    Annotation decisions can be thought about in the vertical direction (parent, grandparent, etc) and the horizontal direction (number of siblings to keep).  Parameters to the following functions specify these values.  For more information see:

Dan Klein and Chris Manning (2003) "Accurate Unlexicalized Parsing",

ACL-03.  http://www.aclweb.org/anthology/P03-1054

 4. Unary Collapsing

    Collapse unary productions (i.e. subtrees with a single child) into a new non-terminal (Tree node).  This is useful when working with algorithms that do not allow unary productions, yet you do not wish to lose the parent information.  Example::

       A
       |
       B   ==>   A+B
      / \        / \
     C   D      C   D
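
These transformations are available as methods on ``Tree`` objects (and as functions in ``nltk.treetransforms``); a minimal sketch::

    from nltk import Tree

    t = Tree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP (VBD barked)))')
    t.chomsky_normal_form(horzMarkov=2)   # binarize in place, Markov order 2
    print(t)
    t.un_chomsky_normal_form()            # undo the binarization

    t2 = Tree.fromstring('(S (A (B (C c) (D d))) (E e))')
    t2.collapse_unary()                   # collapse the unary A -> B production
    print(t2)                             # (S (A+B (C c) (D d)) (E e))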

"""

Nltk-util

Nltk-wsd