2.3. NLTK Toolkit Installation, Tokenization, Text Objects, Stopwords, Filtering Stopwords, POS Tagging, Chunking, Named Entity Recognition, Data Cleaning Example, Reference Articles

2.3. NLTK Toolkit Installation
2.3.1. Tokenization
2.3.2. Text Object
2.3.3. Stopwords
2.3.4. Filtering Stopwords
2.3.5. POS Tagging
2.3.6. Chunking
2.3.7. Named Entity Recognition
2.3.8. Data Cleaning Example
2.3.9. Reference Articles

2.3. NLTK Toolkit Installation

NLTK is a very practical text-processing toolkit, mainly used for English data, and it has been around for a long time.

(base) C:\Users\toto>pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: nltk in d:\installed\anaconda3\lib\site-packages (3.5)
Requirement already satisfied: joblib in d:\installed\anaconda3\lib\site-packages (from nltk) (0.17.0)
Requirement already satisfied: tqdm in d:\installed\anaconda3\lib\site-packages (from nltk) (4.50.2)
Requirement already satisfied: regex in d:\installed\anaconda3\lib\site-packages (from nltk) (2020.10.15)
Requirement already satisfied: click in d:\installed\anaconda3\lib\site-packages (from nltk) (7.1.2)

(base) C:\Users\toto>

The most troublesome part of NLTK is that it needs some fairly large data packages. If you trust your network speed, you can switch to the target environment, start the interpreter with the python command, and enter:

import nltk
nltk.download()

Then download the packages in the GUI window that pops up.
However, this approach is not only slow but also prone to all kinds of download problems, so you can instead download the data packages directly from the nltk_data repository on GitHub: https://github.com/nltk/nltk_data
After downloading, the files need to be placed in one of the directories that NLTK scans; those search paths are listed in NLTK's lookup error message and can also be printed as shown in the sketch below.
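A minimal sketch for checking those search paths programmatically (the exact directories will differ on your machine):

import nltk

# NLTK looks for data packages in these directories, in order
for p in nltk.data.path:
    print(p)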
The solution is to put the contents of the packages folder from the GitHub download into D:\installed\Anaconda\nltk_data.
One thing to note, though: inside the archive downloaded from GitHub, some subfolders still contain zipped content. For example, tokenizing a sentence with word_tokenize():

import nltk

sen = 'hello, how are you?'
res = nltk.word_tokenize(sen)
print(res)

may raise an error instead (it did for me), and the error message shows that the punkt data was not found:

  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

For errors like this, if you go to the search path (the place where we put the data packages above), you can in fact find punkt under the tokenizers folder; the cause is simply that it has not been unzipped. Unzip punkt.zip into that folder and the tokenization code runs without problems. Some other data packages are organized the same way, so if NLTK reports that some package cannot be found, it is worth checking this first.
After unzipping, running the code again succeeds.
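If you would rather not unzip resources by hand, a small sketch like the following checks whether a resource is visible to NLTK and falls back to the downloader (punkt is just the example resource here):

import nltk

try:
    # raises LookupError if the resource is not found on nltk.data.path
    nltk.data.find('tokenizers/punkt')
    print('punkt is available')
except LookupError:
    # fetch only this package instead of opening the GUI downloader
    nltk.download('punkt')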

2.3.1. Tokenization

import nltk

from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]
print(tokens)
'''
Output:
['today', "'s", 'weather', 'is', 'good', ',', 'very', 'windy', 'and', 'sunny', ',', 'we', 'have', 'no', 'classes', 'in', 'the', 'afternoon', ',', 'we', 'have', 'to', 'play', 'basketball', 'tomorrow', '.']
'''

print(tokens[:5])
'''
Output:
['today', "'s", 'weather', 'is', 'good']
'''
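Besides word_tokenize(), NLTK also ships a sentence splitter that relies on the same punkt data; a minimal sketch:

from nltk.tokenize import sent_tokenize

text = "Today's weather is good. We have no classes in the afternoon. We have to play basketball tomorrow."
# expected to split the string into one element per sentence
print(sent_tokenize(text))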

2.3.2. Text Object

import nltk

# from nltk.tokenize import word_tokenize
from nltk.text import Text

help(nltk.text)

Output:

D:\installed\Anaconda\python.exe E:/workspace/nlp/nltk/demo.py
Help on module nltk.text in nltk:

NAME
    nltk.text

DESCRIPTION
    This module brings together a variety of NLTK functionality for
    text analysis, and provides simple, interactive interfaces.
    Functionality includes: concordancing, collocation discovery,
    regular expression search over tokenized strings, and
    distributional similarity.

CLASSES
    builtins.object
        ConcordanceIndex
        ContextIndex
        Text
            TextCollection
        TokenSearcher
    
    class ConcordanceIndex(builtins.object)
     |  ConcordanceIndex(tokens, key=<function ConcordanceIndex.<lambda> at 0x000002602C7FA280>)
     |  
     |  An index that can be used to look up the offset locations at which
     |  a given word occurs in a document.
     |  
     |  Methods defined here:
     |  
     |  __init__(self, tokens, key=<function ConcordanceIndex.<lambda> at 0x000002602C7FA280>)
     |      Construct a new concordance index.
     |      
     |      :param tokens: The document (list of tokens) that this
     |          concordance index was created from.  This list can be used
     |          to access the context of a given word occurrence.
     |      :param key: A function that maps each token to a normalized
     |          version that will be used as a key in the index.  E.g., if
     |          you use ``key=lambda s:s.lower()``, then the index will be
     |          case-insensitive.
     |  
     |  __repr__(self)
     |      Return repr(self).
     |  
     |  find_concordance(self, word, width=80)
     |      Find all concordance lines given the query word.
     |  
     |  offsets(self, word)
     |      :rtype: list(int)
     |      :return: A list of the offset positions at which the given
     |          word occurs.  If a key function was specified for the
     |          index, then given word's key will be looked up.
     |  
     |  print_concordance(self, word, width=80, lines=25)
     |      Print concordance lines given the query word.
     |      :param word: The target word
     |      :type word: str
     |      :param lines: The number of lines to display (default=25)
     |      :type lines: int
     |      :param width: The width of each line, in characters (default=80)
     |      :type width: int
     |      :param save: The option to save the concordance.
     |      :type save: bool
     |  
     |  tokens(self)
     |      :rtype: list(str)
     |      :return: The document that this concordance index was
     |          created from.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class ContextIndex(builtins.object)
     |  ContextIndex(tokens, context_func=None, filter=None, key=<function ContextIndex.<lambda> at 0x000002602C7F4EE0>)
     |  
     |  A bidirectional index between words and their 'contexts' in a text.
     |  The context of a word is usually defined to be the words that occur
     |  in a fixed window around the word; but other definitions may also
     |  be used by providing a custom context function.
     |  
     |  Methods defined here:
     |  
     |  __init__(self, tokens, context_func=None, filter=None, key=<function ContextIndex.<lambda> at 0x000002602C7F4EE0>)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  common_contexts(self, words, fail_on_unknown=False)
     |      Find contexts where the specified words can all appear; and
     |      return a frequency distribution mapping each context to the
     |      number of times that context was used.
     |      
     |      :param words: The words used to seed the similarity search
     |      :type words: str
     |      :param fail_on_unknown: If true, then raise a value error if
     |          any of the given words do not occur at all in the index.
     |  
     |  similar_words(self, word, n=20)
     |  
     |  tokens(self)
     |      :rtype: list(str)
     |      :return: The document that this context index was
     |          created from.
     |  
     |  word_similarity_dict(self, word)
     |      Return a dictionary mapping from words to 'similarity scores,'
     |      indicating how often these two words occur in the same
     |      context.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class Text(builtins.object)
     |  Text(tokens, name=None)
     |  
     |  A wrapper around a sequence of simple (string) tokens, which is
     |  intended to support initial exploration of texts (via the
     |  interactive console).  Its methods perform a variety of analyses
     |  on the text's contexts (e.g., counting, concordancing, collocation
     |  discovery), and display the results.  If you wish to write a
     |  program which makes use of these analyses, then you should bypass
     |  the ``Text`` class, and use the appropriate analysis function or
     |  class directly instead.
     |  
     |  A ``Text`` is typically initialized from a given document or
     |  corpus.  E.g.:
     |  
     |  >>> import nltk.corpus
     |  >>> from nltk.text import Text
     |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
     |  
     |  Methods defined here:
     |  
     |  __getitem__(self, i)
     |  
     |  __init__(self, tokens, name=None)
     |      Create a Text object.
     |      
     |      :param tokens: The source text.
     |      :type tokens: sequence of str
     |  
     |  __len__(self)
     |  
     |  __repr__(self)
     |      Return repr(self).
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  collocation_list(self, num=20, window_size=2)
     |      Return collocations derived from the text, ignoring stopwords.
     |      
     |          >>> from nltk.book import text4
     |          >>> text4.collocation_list()[:2]
     |          [('United', 'States'), ('fellow', 'citizens')]
     |      
     |      :param num: The maximum number of collocations to return.
     |      :type num: int
     |      :param window_size: The number of tokens spanned by a collocation (default=2)
     |      :type window_size: int
     |      :rtype: list(tuple(str, str))
     |  
     |  collocations(self, num=20, window_size=2)
     |      Print collocations derived from the text, ignoring stopwords.
     |      
     |          >>> from nltk.book import text4
     |          >>> text4.collocations() # doctest: +ELLIPSIS
     |          United States; fellow citizens; four years; ...
     |      
     |      :param num: The maximum number of collocations to print.
     |      :type num: int
     |      :param window_size: The number of tokens spanned by a collocation (default=2)
     |      :type window_size: int
     |  
     |  common_contexts(self, words, num=20)
     |      Find contexts where the specified words appear; list
     |      most frequent common contexts first.
     |      
     |      :param words: The words used to seed the similarity search
     |      :type words: str
     |      :param num: The number of words to generate (default=20)
     |      :type num: int
     |      :seealso: ContextIndex.common_contexts()
     |  
     |  concordance(self, word, width=79, lines=25)
     |      Prints a concordance for ``word`` with the specified context window.
     |      Word matching is not case-sensitive.
     |      
     |      :param word: The target word
     |      :type word: str
     |      :param width: The width of each line, in characters (default=80)
     |      :type width: int
     |      :param lines: The number of lines to display (default=25)
     |      :type lines: int
     |      
     |      :seealso: ``ConcordanceIndex``
     |  
     |  concordance_list(self, word, width=79, lines=25)
     |      Generate a concordance for ``word`` with the specified context window.
     |      Word matching is not case-sensitive.
     |      
     |      :param word: The target word
     |      :type word: str
     |      :param width: The width of each line, in characters (default=80)
     |      :type width: int
     |      :param lines: The number of lines to display (default=25)
     |      :type lines: int
     |      
     |      :seealso: ``ConcordanceIndex``
     |  
     |  count(self, word)
     |      Count the number of times this word appears in the text.
     |  
     |  dispersion_plot(self, words)
     |      Produce a plot showing the distribution of the words through the text.
     |      Requires pylab to be installed.
     |      
     |      :param words: The words to be plotted
     |      :type words: list(str)
     |      :seealso: nltk.draw.dispersion_plot()
     |  
     |  findall(self, regexp)
     |      Find instances of the regular expression in the text.
     |      The text is a list of tokens, and a regexp pattern to match
     |      a single token must be surrounded by angle brackets.  E.g.
     |      
     |      >>> print('hack'); from nltk.book import text1, text5, text9
     |      hack...
     |      >>> text5.findall("<.*><.*><bro>")
     |      you rule bro; telling you bro; u twizted bro
     |      >>> text1.findall("<a>(<.*>)<man>")
     |      monied; nervous; dangerous; white; white; white; pious; queer; good;
     |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
     |      pale; furious; better; certain; complete; dismasted; younger; brave;
     |      brave; brave; brave
     |      >>> text9.findall("<th.*>{3,}")
     |      thread through those; the thought that; that the thing; the thing
     |      that; that that thing; through these than through; them that the;
     |      through the thick; them that they; thought that the
     |      
     |      :param regexp: A regular expression
     |      :type regexp: str
     |  
     |  generate(self, length=100, text_seed=None, random_seed=42)
     |      Print random text, generated using a trigram language model.
     |      See also `help(nltk.lm)`.
     |      
     |      :param length: The length of text to generate (default=100)
     |      :type length: int
     |      
     |      :param text_seed: Generation can be conditioned on preceding context.
     |      :type text_seed: list(str)
     |      
     |      :param random_seed: A random seed or an instance of `random.Random`. If provided,
     |      makes the random sampling part of generation reproducible. (default=42)
     |      :type random_seed: int
     |  
     |  index(self, word)
     |      Find the index of the first occurrence of the word in the text.
     |  
     |  plot(self, *args)
     |      See documentation for FreqDist.plot()
     |      :seealso: nltk.prob.FreqDist.plot()
     |  
     |  readability(self, method)
     |  
     |  similar(self, word, num=20)
     |      Distributional similarity: find other words which appear in the
     |      same contexts as the specified word; list most similar words first.
     |      
     |      :param word: The word used to seed the similarity search
     |      :type word: str
     |      :param num: The number of words to generate (default=20)
     |      :type num: int
     |      :seealso: ContextIndex.similar_words()
     |  
     |  vocab(self)
     |      :seealso: nltk.prob.FreqDist
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class TextCollection(Text)
     |  TextCollection(source)
     |  
     |  A collection of texts, which can be loaded with list of texts, or
     |  with a corpus consisting of one or more texts, and which supports
     |  counting, concordancing, collocation discovery, etc.  Initialize a
     |  TextCollection as follows:
     |  
     |  >>> import nltk.corpus
     |  >>> from nltk.text import TextCollection
     |  >>> print('hack'); from nltk.book import text1, text2, text3
     |  hack...
     |  >>> gutenberg = TextCollection(nltk.corpus.gutenberg)
     |  >>> mytexts = TextCollection([text1, text2, text3])
     |  
     |  Iterating over a TextCollection produces all the tokens of all the
     |  texts in order.
     |  
     |  Method resolution order:
     |      TextCollection
     |      Text
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self, source)
     |      Create a Text object.
     |      
     |      :param tokens: The source text.
     |      :type tokens: sequence of str
     |  
     |  idf(self, term)
     |      The number of texts in the corpus divided by the
     |      number of texts that the term appears in.
     |      If a term does not appear in the corpus, 0.0 is returned.
     |  
     |  tf(self, term, text)
     |      The frequency of the term in text.
     |  
     |  tf_idf(self, term, text)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from Text:
     |  
     |  __getitem__(self, i)
     |  
     |  __len__(self)
     |  
     |  __repr__(self)
     |      Return repr(self).
     |  
     |  __str__(self)
     |      Return str(self).
     |  
     |  collocation_list(self, num=20, window_size=2)
     |      Return collocations derived from the text, ignoring stopwords.
     |      
     |          >>> from nltk.book import text4
     |          >>> text4.collocation_list()[:2]
     |          [('United', 'States'), ('fellow', 'citizens')]
     |      
     |      :param num: The maximum number of collocations to return.
     |      :type num: int
     |      :param window_size: The number of tokens spanned by a collocation (default=2)
     |      :type window_size: int
     |      :rtype: list(tuple(str, str))
     |  
     |  collocations(self, num=20, window_size=2)
     |      Print collocations derived from the text, ignoring stopwords.
     |      
     |          >>> from nltk.book import text4
     |          >>> text4.collocations() # doctest: +ELLIPSIS
     |          United States; fellow citizens; four years; ...
     |      
     |      :param num: The maximum number of collocations to print.
     |      :type num: int
     |      :param window_size: The number of tokens spanned by a collocation (default=2)
     |      :type window_size: int
     |  
     |  common_contexts(self, words, num=20)
     |      Find contexts where the specified words appear; list
     |      most frequent common contexts first.
     |      
     |      :param words: The words used to seed the similarity search
     |      :type words: str
     |      :param num: The number of words to generate (default=20)
     |      :type num: int
     |      :seealso: ContextIndex.common_contexts()
     |  
     |  concordance(self, word, width=79, lines=25)
     |      Prints a concordance for ``word`` with the specified context window.
     |      Word matching is not case-sensitive.
     |      
     |      :param word: The target word
     |      :type word: str
     |      :param width: The width of each line, in characters (default=80)
     |      :type width: int
     |      :param lines: The number of lines to display (default=25)
     |      :type lines: int
     |      
     |      :seealso: ``ConcordanceIndex``
     |  
     |  concordance_list(self, word, width=79, lines=25)
     |      Generate a concordance for ``word`` with the specified context window.
     |      Word matching is not case-sensitive.
     |      
     |      :param word: The target word
     |      :type word: str
     |      :param width: The width of each line, in characters (default=80)
     |      :type width: int
     |      :param lines: The number of lines to display (default=25)
     |      :type lines: int
     |      
     |      :seealso: ``ConcordanceIndex``
     |  
     |  count(self, word)
     |      Count the number of times this word appears in the text.
     |  
     |  dispersion_plot(self, words)
     |      Produce a plot showing the distribution of the words through the text.
     |      Requires pylab to be installed.
     |      
     |      :param words: The words to be plotted
     |      :type words: list(str)
     |      :seealso: nltk.draw.dispersion_plot()
     |  
     |  findall(self, regexp)
     |      Find instances of the regular expression in the text.
     |      The text is a list of tokens, and a regexp pattern to match
     |      a single token must be surrounded by angle brackets.  E.g.
     |      
     |      >>> print('hack'); from nltk.book import text1, text5, text9
     |      hack...
     |      >>> text5.findall("<.*><.*><bro>")
     |      you rule bro; telling you bro; u twizted bro
     |      >>> text1.findall("<a>(<.*>)<man>")
     |      monied; nervous; dangerous; white; white; white; pious; queer; good;
     |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
     |      pale; furious; better; certain; complete; dismasted; younger; brave;
     |      brave; brave; brave
     |      >>> text9.findall("<th.*>{3,}")
     |      thread through those; the thought that; that the thing; the thing
     |      that; that that thing; through these than through; them that the;
     |      through the thick; them that they; thought that the
     |      
     |      :param regexp: A regular expression
     |      :type regexp: str
     |  
     |  generate(self, length=100, text_seed=None, random_seed=42)
     |      Print random text, generated using a trigram language model.
     |      See also `help(nltk.lm)`.
     |      
     |      :param length: The length of text to generate (default=100)
     |      :type length: int
     |      
     |      :param text_seed: Generation can be conditioned on preceding context.
     |      :type text_seed: list(str)
     |      
     |      :param random_seed: A random seed or an instance of `random.Random`. If provided,
     |      makes the random sampling part of generation reproducible. (default=42)
     |      :type random_seed: int
     |  
     |  index(self, word)
     |      Find the index of the first occurrence of the word in the text.
     |  
     |  plot(self, *args)
     |      See documentation for FreqDist.plot()
     |      :seealso: nltk.prob.FreqDist.plot()
     |  
     |  readability(self, method)
     |  
     |  similar(self, word, num=20)
     |      Distributional similarity: find other words which appear in the
     |      same contexts as the specified word; list most similar words first.
     |      
     |      :param word: The word used to seed the similarity search
     |      :type word: str
     |      :param num: The number of words to generate (default=20)
     |      :type num: int
     |      :seealso: ContextIndex.similar_words()
     |  
     |  vocab(self)
     |      :seealso: nltk.prob.FreqDist
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from Text:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class TokenSearcher(builtins.object)
     |  TokenSearcher(tokens)
     |  
     |  A class that makes it easier to use regular expressions to search
     |  over tokenized strings.  The tokenized string is converted to a
     |  string where tokens are marked with angle brackets -- e.g.,
     |  ``'<the><window><is><still><open>'``.  The regular expression
     |  passed to the ``findall()`` method is modified to treat angle
     |  brackets as non-capturing parentheses, in addition to matching the
     |  token boundaries; and to have ``'.'`` not match the angle brackets.
     |  
     |  Methods defined here:
     |  
     |  __init__(self, tokens)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  findall(self, regexp)
     |      Find instances of the regular expression in the text.
     |      The text is a list of tokens, and a regexp pattern to match
     |      a single token must be surrounded by angle brackets.  E.g.
     |      
     |      >>> from nltk.text import TokenSearcher
     |      >>> print('hack'); from nltk.book import text1, text5, text9
     |      hack...
     |      >>> text5.findall("<.*><.*><bro>")
     |      you rule bro; telling you bro; u twizted bro
     |      >>> text1.findall("<a>(<.*>)<man>")
     |      monied; nervous; dangerous; white; white; white; pious; queer; good;
     |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
     |      pale; furious; better; certain; complete; dismasted; younger; brave;
     |      brave; brave; brave
     |      >>> text9.findall("<th.*>{3,}")
     |      thread through those; the thought that; that the thing; the thing
     |      that; that that thing; through these than through; them that the;
     |      through the thick; them that they; thought that the
     |      
     |      :param regexp: A regular expression
     |      :type regexp: str
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)

DATA
    __all__ = ['ContextIndex', 'ConcordanceIndex', 'TokenSearcher', 'Text'...

FILE
    d:\installed\anaconda\lib\site-packages\nltk\text.py

Create a Text object to make the subsequent operations easier:

import nltk

from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]

t = Text(tokens)
print(t.count('good'))
'''
Output:
1
'''

print(t.index('good'))
'''
Output:
4
'''

t.plot(8)   # frequency plot of the 8 most common tokens (requires matplotlib)

(figure: frequency plot of the 8 most frequent tokens)
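A couple of the other Text methods listed in the help output above, as a quick sketch on the same token list:

# keyword-in-context view of a word (matching is case-insensitive)
t.concordance('have')

# frequency distribution over all tokens (a FreqDist object)
print(t.vocab().most_common(5))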

2.3.3. Stopwords

You can take a look at the description in the corpus README:

import nltk
from nltk.corpus import stopwords
print(stopwords.readme().replace('\n', ' '))

Output:

Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# print(stopwords.readme().replace('\n', ' '))

print(stopwords.fileids())
'''
Output:
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
'''

print(stopwords.raw('english').replace('\n', ' '))
'''
Output:
i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't 
'''

'''
Prepare the data
'''
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]

test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)

print(test_words_set)
'''
Output:
{'no', 'good', 'windy', 'in', 'afternoon', 'very', '.', 'have', 'to', 'basketball', 'classes', 'and', 'the', 'we', 'weather', 'tomorrow', 'is', ',', 'today', "'s", 'play', 'sunny'}
'''

'''
Get the stopwords that occur in test_words_set
'''
print(test_words_set.intersection(set(stopwords.words('english'))))
'''
{'no', 'to', 'and', 'is', 'very', 'the', 'we', 'have', 'in'}
'''

2.3.4. Filtering Stopwords

filtered = [w for w in test_words_set if(w not in stopwords.words('english'))]
print(filtered)
'''
Output:
['.', 'play', 'windy', 'tomorrow', 'today', 'weather', 'afternoon', 'classes', 'sunny', 'good', "'s", 'basketball', ',']
'''
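Note that test_words_set is a set, so the order above is arbitrary. If the original word order matters, filter the token list instead; a small sketch:

stop_set = set(stopwords.words('english'))   # build the set once for faster lookups
filtered_in_order = [w for w in tokens if w not in stop_set]
print(filtered_in_order)   # e.g. ['today', "'s", 'weather', 'good', ',', ...] in the original order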

2.3.5. POS Tagging

nltk.download()  # in the downloader GUI, fetch the POS tagger data (averaged_perceptron_tagger)
'''
Output:
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
'''
from nltk import pos_tag

# tag the lower-cased token list produced in section 2.3.1
tags = pos_tag(tokens)
print(tags)
'''
Output:
[('today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), (',', ','), ('very', 'RB'), ('windy', 'JJ'), ('and', 'CC'), ('sunny', 'JJ'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('classes', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('afternoon', 'NN'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('basketball', 'NN'), ('tomorrow', 'NN'), ('.', '.')]
'''

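To look up what a tag such as NN or VBZ means, NLTK has a built-in help function (it needs the tagsets package from nltk_data); a small sketch:

import nltk

# prints the definition and examples for the NN tag
nltk.help.upenn_tagset('NN')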

2.3.6. Chunking

from nltk.chunk import RegexpParser

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('died', 'VBD')]
grammar = "MY_NP: {<DT>?<JJ>*<NN>}"   # chunk rule: optional determiner, any number of adjectives, then a noun
cp = RegexpParser(grammar)            # build the chunker from the rule
result = cp.parse(sentence)           # chunk the tagged sentence
print(result)

result.draw()                         # draw the chunk tree in a window (uses tkinter)
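The parse result is an nltk.Tree, so the MY_NP chunks can also be pulled out programmatically; a brief sketch:

# iterate over the subtrees produced by our MY_NP rule and print their words
for subtree in result.subtrees(filter=lambda t: t.label() == 'MY_NP'):
    print(subtree.leaves())   # e.g. [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]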

2.3.7. Named Entity Recognition

nltk.download()
# in the downloader GUI, fetch these two packages:
# maxent_ne_chunker
# words

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

sentence = "Edison went to Tsinghua University today"
print(ne_chunk(pos_tag(word_tokenize(sentence))))
'''
Output:
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN)
'''
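ne_chunk() also returns a Tree, so the recognized entities can be collected as (text, label) pairs; a small sketch using the same imports as above:

tree = ne_chunk(pos_tag(word_tokenize(sentence)))

entities = []
for node in tree:
    # entity chunks are subtrees labelled e.g. PERSON or ORGANIZATION; plain tokens are (word, tag) tuples
    if hasattr(node, 'label'):
        entities.append((' '.join(tok for tok, tag in node.leaves()), node.label()))
print(entities)   # e.g. [('Edison', 'PERSON'), ('Tsinghua University', 'ORGANIZATION')]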

2.3.8. Data Cleaning Example

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Input data
s = '    RT @Amila #Test\nTom\'s newly listed Co  &amp; Mary\'s unlisted     Group to supply tech for nlTK.\nh $TSLA $AAPL https:// t.co/x34afsfQsh'

# English stopword list
cache_english_stopwords = stopwords.words('english')


def text_clean(text):
    print('Raw data:', text, '\n')

    # Remove special tags: HTML entities (e.g. &amp;), hashtags and @mentions
    text_no_special_entities = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    print('After removing special tags:', text_no_special_entities, '\n')

    # Remove ticker symbols (e.g. $TSLA)
    text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
    print('After removing ticker symbols:', text_no_tickers, '\n')

    # Remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
    print('After removing hyperlinks:', text_no_hyperlinks, '\n')

    # Remove short abbreviations, i.e. words with only 1-2 letters
    text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
    print('After removing short abbreviations:', text_no_small_words, '\n')

    # Collapse extra whitespace
    text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(' ')
    print('After removing extra whitespace:', text_no_whitespace, '\n')

    # Tokenize
    tokens = word_tokenize(text_no_whitespace)
    print('Tokenization result:', tokens, '\n')

    # Remove stopwords
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print('After removing stopwords:', list_no_stopwords, '\n')

    # Join the filtered tokens back into a string
    text_filtered = ' '.join(list_no_stopwords)  # ''.join() would join without spaces between words.
    print('Filtered text:', text_filtered)


text_clean(s)

Output:

D:\installed\Anaconda\python.exe E:/workspace/nlp/nltk/demo2.py
Raw data:     RT @Amila #Test
Tom's newly listed Co  &amp; Mary's unlisted     Group to supply tech for nlTK.
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing special tags:     RT
Tom's newly listed Co   Mary's unlisted     Group to supply tech for nlTK.
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing ticker symbols:     RT
Tom's newly listed Co   Mary's unlisted     Group to supply tech for nlTK.
h   https:// t.co/x34afsfQsh 

After removing hyperlinks:     RT
Tom's newly listed Co   Mary's unlisted     Group to supply tech for nlTK.
h    

After removing short abbreviations:
Tom' newly listed    Mary' unlisted     Group  supply tech for nlTK.
    

After removing extra whitespace: Tom' newly listed Mary' unlisted Group supply tech for nlTK.

Tokenization result: ['Tom', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'for', 'nlTK', '.']

After removing stopwords: ['Tom', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'nlTK', '.']

Filtered text: Tom ' newly listed Mary ' unlisted Group supply tech nlTK .

Process finished with exit code 0

2.3.9. Reference Articles

https://pypi.org/project/nltk/#files
https://blog.csdn.net/sinat_34328764/article/details/94830948