Natural Language Processing with Python | Writing Structured Programs

Claire_chen_jia

Tags: python, natural language processing

Article link: https://blog.csdn.net/Claire_chen_jia/article/details/113794423

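Questions this chapter addresses

How can you write well-structured, readable programs that you and others will be able to reuse easily?
How do the fundamental building blocks work, such as loops, functions, and assignment?
What are some of the pitfalls of Python programming, and how can you avoid them?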

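1 Back to the Basics

1.1 Assignment

The key is to understand the difference between modifying an object through an object reference and overwriting an object reference.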
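For structured objects such as lists, assignment copies a reference to the object, not the object itself. So when a list is modified through any variable that refers to it, every variable that refers to the same list sees the change.
To copy a structure without copying any object references, use copy.deepcopy(), which requires import copy.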

# When a list is modified through one variable, every variable that refers to it sees the change.
foo = ["1", "2"]
bar = foo
foo[1] = "3"
print(bar)

empty = []
nested = [empty, empty, empty]
print(nested)
nested[1].append("3")
print(nested)

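# The key is the difference between modifying an object through a reference and overwriting the reference
nes = [[]] * 3
nes[1].append("3")
print(nes)
nes[1] = ["2"]                # overwriting this reference does not affect the other elements
print(nes)

# To copy a structure without copying any object references, use copy.deepcopy() (requires import copy).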
import copy
new2 = copy.deepcopy(nested)
print(new2)
new2[2] = ["new2"]
print(new2)
print(nested)
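
The difference between copying references and copying the whole structure can be checked directly. The following is a minimal sketch added for illustration; the variable names are made up for this example. It contrasts a slice copy, which copies only the outer list, with copy.deepcopy().

```python
import copy

inner = ["a"]
original = [inner, inner]

shallow = original[:]            # new outer list, same inner objects
deep = copy.deepcopy(original)   # new outer list and new inner objects

inner.append("b")                # modify the shared inner list

print(original)  # both slots show the change
print(shallow)   # the slice copy still points at the same inner list
print(deep)      # the deep copy is unaffected

shallow_shares_inner = shallow[0] is inner
deep_shares_inner = deep[0] is inner
print(shallow_shares_inner, deep_shares_inner)
```

A slice copy is enough when the items are immutable (strings, numbers); the deep copy matters once the items themselves can be modified.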

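1.2 Equality

The is operator tests whether two expressions refer to the same object.
The == operator only compares values, whereas is tests object identity: two distinct lists with equal contents are == but not is.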

snake_nest = [["Python"]] * 5
snake_nest[2] = ['Python']

print(snake_nest)
print(snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4])
print(snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4])

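1.3 Conditionals: if and elif

In the condition part of an if statement, a non-empty string or list evaluates to true, while an empty string or list evaluates to false.
In an if ... elif chain, the elif branch runs only when the if condition is false and the elif condition is true. If the first if condition is true, its branch runs and the elif is skipped.
The functions all() and any() can be applied to a list (or any other sequence) to check whether all items, or any item, satisfy some condition.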

animals = ["cat", "dog"]
if "cat" in animals:
    print(1)
if "dog" in animals:
    print(2)

animals = ["cat", "dog"]
if "cat" in animals:
    print(1)
elif "dog" in animals:
    print(2)

sent =['No', 'good', 'fish', 'goes', 'anywhere','without', 'a', 'porpoise', '.']
print(all(len(w) > 4 for w in sent))  # False: every item must satisfy the condition
print(any(len(w) > 4 for w in sent))  # True: one matching item is enough

2 Sequences

2.1 Operations on Sequence Types

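We can convert between these sequence types.
tuple(s) converts any kind of sequence into a tuple.
list(s) converts any kind of sequence into a list.
join() converts a list of strings into a single string.
A FreqDist object can also be converted into a sequence with list(), and it supports iteration.
zip() takes the items of two or more sequences and "zips" them together into a sequence of tuples.
Given a sequence s, enumerate(s) returns pairs consisting of an index and the item at that index.
We can also choose the position at which to cut a sequence, for example to split data into two portions.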

# A tuple is written with commas, usually enclosed in parentheses; it can be indexed and sliced, and it has a length
t = "walk", "fem", 3
print(t)
print(t[0])
print(t[1:])
print(len(t))

# Elements can be swapped in place with tuple assignment
words = ["I", "turned", "off", "the", "spectroroute"]
words[1], words[4] = words[4], words[1]
print(words)

# sorted(), reversed(), zip(), and enumerate()
print("\n",words)
print(sorted(words))

print("\n",words)
print(reversed(words))
print(list(reversed(words)))

print("\n",words)
print(zip(words, range(len(words))))
print(list(zip(words, range(len(words)))))

print("\n",words)
print(enumerate(words))
print(list(enumerate(words)))
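
The conversions listed above that the code does not exercise (tuple(), list(), join(), and cutting a sequence at a chosen position) can be sketched as follows. The 90%/10% split and the variable names are assumptions made for this illustration.

```python
words = ["I", "turned", "off", "the", "spectroroute"]

as_tuple = tuple(words)          # list -> tuple
as_list = list("abc")            # string -> list of characters
sentence = " ".join(words)       # list of strings -> one string

# Cut a sequence at a chosen position, here a 90% / 10% split.
data = list(range(20))
cut = int(0.9 * len(data))
first_part, held_out = data[:cut], data[cut:]

print(as_tuple)
print(as_list)
print(sentence)
print(len(first_part), len(held_out))
```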

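2.2 Combining Different Sequence Types

The underscore is just an ordinary Python variable; by convention we use it for a value we do not intend to use.
A list is typically a sequence of objects that all have the same type, and it can be of any length.
A tuple is typically a collection of objects of different types, with a fixed length. We often use a tuple to hold a record: a collection of different fields relating to some entity.
In Python, lists are mutable and tuples are immutable: a list can be modified, but a tuple cannot.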

words = "I turned off the spectroroute".split()
print (words)

wordlens = [(len(word), word) for word in words]
print(wordlens)

wordlens.sort()
print (wordlens)

print(" ".join(w for (_, w) in wordlens))

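2.3 Generator Expressions

List comprehensions are useful because they make text-processing code compact and readable.
When a list comprehension is passed as the sole argument of a function call, the square brackets can be omitted to form a generator expression. The generator expression avoids building the whole list in memory, so it is often more efficient.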

# Use a list comprehension: it keeps text-processing code compact and readable
from nltk import *
text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
"it means just what I choose it to mean - neither more nor less."'''
print([w.lower() for w in word_tokenize(text)])

"""
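The second line, [2], uses a generator expression. This is more than a notational convenience: in many language processing situations, generator expressions are more efficient.
In [1], storage for the list object must be allocated before the value of max() is computed. If the text is very large, this can be slow.
In [2], the data is streamed to the calling function. Since the calling function only needs to find the maximum value (the word that comes last in lexicographic order), it can process the stream
without storing anything more than the maximum value seen so far.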
"""


print(max([w.lower() for w in word_tokenize(text)]))  # [1]
print (max(w.lower() for w in word_tokenize(text)))  # [2]
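
The memory difference between the two forms can be made visible without a corpus. This is a rough sketch, assuming that sys.getsizeof() is an acceptable proxy for the storage held by the container itself; the numbers stand in for tokens.

```python
import sys

numbers = range(100_000)

squares_list = [n * n for n in numbers]   # builds all 100,000 values up front
squares_gen = (n * n for n in numbers)    # produces values one at a time on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)

# Both forms give the same answer when passed to max().
same_max = max(squares_list) == max(squares_gen)
print(same_max)
```

Note that a generator can be consumed only once; after the call to max(), squares_gen is exhausted.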

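3 Questions of Style

3.1 Python Coding Style

1. Use 4 spaces per indentation level. Avoid tabs, since different text editors interpret them differently and mixing them with spaces causes confusion.
2. Keep lines under 80 characters. If necessary, you can break a line inside parentheses, brackets, or braces, because Python can detect that the line continues onto the next one.
3. If you need to break a line outside parentheses, brackets, or braces, you can often add extra parentheses, and you can always add a backslash at the end of the line that is broken.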
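
The line-breaking rules can be illustrated with a small sketch. The values and names below are made up for the example.

```python
# Breaking inside brackets: Python sees the open bracket and keeps reading.
cv_word_pairs = [(cv, w) for w in ["cat", "dog", "eel"]
                 for cv in ("C", "V")]

# Breaking outside brackets: add extra parentheses around the expression...
total = (1 + 2 + 3 +
         4 + 5 + 6)

# ...or end the broken line with a backslash.
same_total = 1 + 2 + 3 + \
             4 + 5 + 6

print(len(cv_word_pairs), total, same_total)
```

Extra parentheses are usually the safer choice, since a stray space after a backslash is a syntax error.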

3.2 Procedural vs. Declarative Style

"""
A program that computes the average length of words in the Brown Corpus
"""
# Procedural style
import nltk
tokens = nltk.corpus.brown.words(categories='news')
count = 0
total = 0
for token in tokens:
    count += 1
    total += len(token)
print(total / count)

# Declarative style
total = sum(len(w) for w in tokens)
print(total / len(tokens))

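# Other examples of declarative style
# Find the longest words using two comprehensions
text = nltk.corpus.gutenberg.words('milton-paradise.txt')
maxlen = max(len(word) for word in text)
print([word for word in text if len(word) == maxlen])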

# Use enumerate() to walk through the frequency distribution in rank order
fd = nltk.FreqDist(nltk.corpus.brown.words())
cumulative = 0.0
most_common_words = [word for (word, count) in fd.most_common()]
for rank, word in enumerate(most_common_words):
    cumulative += fd.freq(word)
    print("%3d %10.2f%% %10s" % (rank + 1, fd.freq(word) * 100, word))
    if cumulative > 0.25:
        break

3.3 Some Legitimate Uses for Counters

# Extract successive overlapping n-grams from a sentence using a loop variable
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
n = 3
[sent[i:i+n] for i in range(len(sent) - n +1)]

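4 Functions: The Foundation of Structured Programming

4.1 Function Inputs and Outputs

We pass information to a function through its parameters, the parenthesized list of variables and constants that follows the function name in the function definition.
A function usually returns its result to the calling program with a return statement.
As a rule, a function should either modify the contents of a parameter or return a value, not both.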

def repeat(msg,num):
    return ' '.join([msg] *num)
    
monty = 'Monty Python'
repeat(monty,3)

def monty():
    return "Monty Python"

monty()
repeat(monty(),3)
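
The "modify or return, not both" rule can be sketched with two hypothetical helpers (the function names are invented for this example): one changes its argument and returns nothing, the other leaves its argument alone and returns a new list.

```python
def normalize_in_place(tokens):
    """Modify the list passed in; return nothing."""
    for i, token in enumerate(tokens):
        tokens[i] = token.lower()

def normalized(tokens):
    """Leave the argument alone; return a new list."""
    return [token.lower() for token in tokens]

sent = ["The", "Dog", "Barked"]
lowered = normalized(sent)           # sent is unchanged at this point
unchanged_copy = list(sent)
result = normalize_in_place(sent)    # sent is changed, result is None
print(lowered, unchanged_copy, sent, result)
```

The built-in pair sorted() and list.sort() follows the same convention.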

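4.2 Parameter Passing

It helps to be clear about the difference between passing a parameter by value and passing it by reference. Python passes the value of an object reference: rebinding a parameter inside a function does not affect the caller, but modifying a mutable object through the parameter does.
You can use the id() function and the is operator to check whether an object's identity changes after each statement executes.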

def set_up(word, properties):
    word = 'lolcat'
    properties.append('noun')
    properties = 5

w = ''
p = []
set_up(w, p)
print(w)
print(p)

"""
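w was not changed by the function. When we called set_up(w, p), the value of w (an empty string) was assigned to a new variable word. Inside the function, the value of word was modified.
However, that change did not propagate to w. This parameter passing is identical to the following sequence of assignments: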
"""
w = ''
word = w
word = 'lolcat'
print(w)

"""
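When we called set_up(w, p), the value of p (a reference to an empty list) was assigned to a new local variable properties, so both variables now refer to the same memory location.
The function modifies properties, and this change is also reflected in the value of p, as we saw. The function also assigned a new value (the number 5) to properties;
this did not modify the contents at that memory location, it only rebound the local name properties to a different object.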
"""
p = []
properties = p
properties.append('noun')
properties = 5
print(p)
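
The identity checks suggested above can be written out as a small sketch. It is a variant of set_up() instrumented for illustration: the comparisons against the caller's variable p and the returned tuple are additions, not part of the original function.

```python
def set_up(word, properties):
    same_before = properties is p        # the parameter refers to the caller's list
    properties.append('noun')            # modifies that shared list
    same_after_append = properties is p  # still the same object
    properties = 5                       # rebinds the local name only
    same_after_rebind = properties is p
    return same_before, same_after_append, same_after_rebind

w = ''
p = []
id_before = id(p)
checks = set_up(w, p)
same_id = id(p) == id_before
print(checks)     # identity checks made inside the function
print(same_id)    # whether p is still the same list object
print(p)
```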

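4.3 Variable Scope

Python resolves names with the LGB rule: first local, then global, then built-in. (For nested functions, Python also searches the enclosing function scopes between local and global, which is why the rule is often written LEGB.)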
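
A short sketch of name resolution, with made-up function names. It shows a local name shadowing a global one, a fall-through to the global and built-in scopes, and a nested function reaching a variable in its enclosing function.

```python
count = 10                      # global name

def show_scopes():
    count = 1                   # local name shadows the global one
    return count, len("abc")    # len is found in the built-in scope

def read_global():
    return count                # no local 'count', so the global is used

def make_counter():
    total = 0                   # lives in the enclosing function's scope
    def increment():
        nonlocal total          # rebind the enclosing variable, not a new local
        total += 1
        return total
    return increment

counter = make_counter()
first, second = counter(), counter()
print(show_scopes(), read_global(), first, second)
```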

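4.4 Checking Parameter Types

We can use assert to check that a condition holds. If the assert statement fails, it raises an error that cannot be ignored and halts the program. This is a form of defensive programming.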

def tag(word):
    assert isinstance(word, str), "argument to tag() must be a string"
    if word in ['a', 'the', 'all']:
        return 'det'
    else:
        return 'noun'

print(tag(["'Tis", 'but', 'a', 'scratch']))  # raises AssertionError: the argument is a list, not a string

print(tag('hello'))

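4.5 Functional Decomposition

When we use functions, the main program can be written at a higher level of abstraction, making its structure transparent. For example:

# A poorly designed function to compute frequent words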
from nltk import *
from urllib import request
from bs4 import BeautifulSoup

def freq_words(url, freqdist, n):
    html = request.urlopen(url).read().decode('utf8')
    raw = BeautifulSoup(html).get_text()
    for word in word_tokenize(raw):
        freqdist[word.lower()] += 1
    result = []
    for word, count in freqdist.most_common(n):
        result = result + [word]
    print(result)
constitution = "http://www.archives.gov/national-archives-experience" \
"/charters/constitution_transcript.html"
fd = nltk.FreqDist()
print([w for (w, _) in fd.most_common(20)])
freq_words(constitution, fd, 20)
print("\n",[w for (w, _) in fd.most_common(30)])

"""
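This function has several problems. It has two side effects: it modifies the contents of its second parameter, and it prints a selection of the results it has computed.
The function would be easier to understand and to reuse elsewhere if we initialized the FreqDist() object inside the function (in the same place it is populated),
and if we moved the selection and display of results to the calling program.
Given that its task is to identify frequent words, it should just return a list, not the whole frequency distribution.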
"""

# Improved version
from urllib import request
from bs4 import BeautifulSoup

def freq_words(url, n):
    html = request.urlopen(url).read().decode('utf8')
    text = BeautifulSoup(html).get_text()
    freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text))
    return [word for (word, _) in freqdist.most_common(n)]

constitution = "http://www.archives.gov/national-archives-experience" \
"/charters/constitution_transcript.html"
print(freq_words(constitution, 20))

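4.6 Documenting Functions

A docstring should be enclosed in triple quotes.
A docstring can include a doctest block that illustrates how the function is used and what output is expected. These blocks can be tested automatically with Python's doctest module.
A docstring should document the type of each parameter and the return type of the function.
A tutorial on the doctest module: https://www.maixj.net/ict/python-doctest-19001. With doctest, writing the docstring produces both the documentation and a unit test at the same time.
Official documentation: https://docs.python.org/3/library/doctest.html
See also: Using doctest in Python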

def accuracy(reference, test):
    """
    Calculate the fraction of test items that equal the corresponding reference items.

    Given a list of reference values and a corresponding list of test values,
    return the fraction of corresponding values that are equal.
    In particular, return the fraction of indexes
    {0<=i<len(test)} such that C{test[i] == reference[i]}.

        >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
        0.5

    :param reference: An ordered list of reference values
    :type reference: list
    :param test: A list of values to compare against the corresponding
        reference values
    :type test: list
    :return: the accuracy score
    :rtype: float
    :raises ValueError: If reference and test do not have the same length
    """

    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    num_correct = 0
    for x, y in zip(reference, test):
        if x == y:
            num_correct += 1
    return float(num_correct) / len(reference)
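
One way to run the doctest block embedded in a docstring is sketched below. It uses a shortened copy of accuracy() so that the example is self-contained, and it drives doctest through DocTestFinder and DocTestRunner so the pass/fail counts are available as a value; in everyday use, doctest.testmod() on the whole module is the more common entry point.

```python
import doctest

def accuracy(reference, test):
    """
    Return the fraction of corresponding values that are equal.

    >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
    0.5
    """
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    num_correct = sum(1 for x, y in zip(reference, test) if x == y)
    return num_correct / len(reference)

# Collect and run the examples found in the docstring of accuracy().
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for doctest_case in finder.find(accuracy, "accuracy", globs={"accuracy": accuracy}):
    runner.run(doctest_case)
results = runner.summarize(verbose=False)
print(results)
```

If the expected output in the docstring ever drifts from what the function returns, the runner reports the expected and actual values.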

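5 Doing More with Functions

5.1 Functions as Arguments

We can pass the built-in function len() or a user-defined function last_letter() as an argument to another function.
Python gives us another way to define functions to be passed as arguments, the so-called lambda expression.
We can also pass a function to sorted(). In Python 3, sorted() no longer supports the cmp argument; its signature is sorted(iterable, key=None, reverse=False). key takes a function of one argument that is applied to each element (default None), and reverse is a Boolean that sorts in descending order when True (default False).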

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
def extract_property(prop):
    return [prop(word) for word in sent]

print(extract_property(len))

def last_letter(word):
    return word[-1]

print(extract_property(last_letter))

extract_property(lambda w: w[-1])

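# Pass a function to sorted(). In Python 3, sorted() no longer supports cmp.
# Signature: sorted(iterable, key=None, reverse=False); key takes a function of one argument (default None);
# reverse is a Boolean; when True the result is in descending order (default False).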
print(sorted(sent))
print(sorted(sent, key = lambda x:x[-1]))

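5.2 Accumulative Functions

Using a yield statement turns a function that would build and return a whole sequence into a generator, which produces items one at a time and is often more efficient.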

import nltk

def search1(substring, words):
    result = []
    for word in words:
        if substring in word:
            result.append(word)
    return result

def search2(substring, words):
    for word in words:
        if substring in word:
            yield word
            
for item in search1('fizzled', nltk.corpus.brown.words()):
    print (item)

for item in search2('Grizzlies', nltk.corpus.brown.words()):
    print (item)

# A more complex generator that produces all permutations of a list of words. To force permutations() to generate all its output, we wrap it in a call to list()
def permutations(seq):
    if len(seq) <= 1:
        yield seq
    else:
        for perm in permutations(seq[1:]):
            for i in range(len(perm)+1):
                yield perm[:i] + seq[0:1] + perm[i:]

list(permutations(['police', 'fish', 'buffalo']))

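5.3 Higher-Order Functions

filter(): we pass a function as the first argument of filter(), which applies it to each item of the sequence given as the second argument and keeps only the items for which the function returns True.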

def is_content_word(word):
    return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
list(filter(is_content_word, sent))

map(): applies a function to every item in a sequence.

lengths = list(map(len, nltk.corpus.brown.sents(categories='news')))
print(sum(lengths) / len(lengths))


lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories='news')]
print(sum(lengths) / len(lengths))

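lambda expressions: anonymous functions.
Solutions based on list comprehensions are usually more readable than solutions based on higher-order functions, so the book favors the former throughout.

# Two equivalent examples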
list(map(lambda w: len(list(filter(lambda c: c.lower() in "aeiou", w))), sent))

[len([c for c in w if c.lower() in "aeiou"]) for w in sent]

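5.4 Named Arguments

We can refer to parameters by name, and even give them default values to use when the calling program does not supply that argument. These are called keyword arguments.
If named and unnamed arguments are mixed, the unnamed ones must come first, because unnamed arguments are matched by position.
We can define a function that accepts an arbitrary number of unnamed and named arguments and access them through a tuple *args and a dictionary **kwargs.
When *args appears as a function parameter, it collects all the unnamed arguments of the call.
Take care not to use a mutable object as the default value of a parameter.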

# Another effect of named arguments is that they let us supply arguments selectively
def repeat(msg='<empty>', num=1):
    return msg * num
print(repeat(num=3))

print(repeat(msg='Alice'))

print(repeat(num=5, msg='Alice'))

# The * syntax in a call: *song unpacks the list into separate arguments to zip()
song = [['four', 'calling', 'birds'],
        ['three', 'French', 'hens'],
        ['two', 'turtle', 'doves']]
print(song[0])
print(list(zip(song[0], song[1], song[2])))

print(list(zip(*song)))

# Another common use of optional arguments is as a flag.
# Here is a revised version of the same function that reports its progress if the verbose flag is set:
def freq_words(file, min=1, num=10, verbose=False):
    freqdist = FreqDist()
    if verbose: print("Opening", file)
    with open(file) as f:
        text = f.read()
        if verbose: print("Read in %d characters" % len(text))
        for word in word_tokenize(text):
            if len(word) >= min:
                freqdist[word] += 1
                if verbose and freqdist.N() % 100 == 0: print(".", sep="",end = " ")
        if verbose: print()
        return freqdist.most_common(num)
fw = freq_words("test.html", 4 ,10, True)
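
The examples above show the star syntax only on the calling side. A minimal sketch of the defining side follows; describe() is a hypothetical function written for this illustration.

```python
def describe(*args, **kwargs):
    """Return the unnamed arguments as a tuple and the named ones as a dict."""
    return args, kwargs

positional, named = describe(1, 'two', colour='red', size=3)
print(positional)
print(named)

# The same star syntax unpacks arguments in a call.
song = [['four', 'calling', 'birds'],
        ['three', 'French', 'hens']]
same_result = list(zip(*song)) == list(zip(song[0], song[1]))
print(same_result)
```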

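6 Program Development

6.1 Structure of a Python Module

The purpose of a program module is to bring logically related definitions and functions together, to make them easier to reuse and to abstract over.
A module can contain code for creating and manipulating a particular data structure, or for performing a particular processing task.
You can use the variable __file__ to locate the code of any NLTK module on your system.
Some variables and functions in a module are only used within the module. Their names should start with an underscore, and such names are not imported by from module import *.
A module can optionally list its externally accessible names in a special built-in variable: __all__ = ['edit_distance', 'jaccard_distance']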
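
The effect of __all__ and of leading underscores can be tried out without touching NLTK. The sketch below builds a throwaway module in memory; the module name distance_demo and the placeholder function bodies are invented for the illustration and are not NLTK's implementations. The standard json module stands in for an installed package when showing __file__.

```python
import json
import sys
import types

# Source of a small throwaway module (the name 'distance_demo' is made up).
source = '''
__all__ = ['edit_distance']

def edit_distance(a, b):
    return abs(len(a) - len(b))   # placeholder body, not a real edit distance

def jaccard_distance(a, b):
    return 0.0                    # placeholder body

def _helper():
    return "internal"
'''
module = types.ModuleType("distance_demo")
exec(source, module.__dict__)
sys.modules["distance_demo"] = module

# Run a star-import into a fresh namespace and see which names arrive.
namespace = {}
exec("from distance_demo import *", namespace)
imported = sorted(name for name in namespace if not name.startswith("__"))
print(imported)

# __file__ tells us where an installed module lives on disk.
print(json.__file__)
```

Only the name listed in __all__ is imported; without __all__, the star-import would bring in both public functions and still skip _helper.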

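6.2 Multi-Module Programs

Some programs bring together a variety of tasks, such as loading data from a corpus, performing some analysis on it, and then visualizing it. We may already have stable modules that load the data and produce the visualizations. Our work might then consist of coding up the analysis task and just calling functions from the existing modules.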
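6.3 Sources of Error

The input data may contain unexpected characters.
A supplied function may not behave as expected.
Our understanding of Python's semantics may be at fault.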

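print("%s.%s.%02d" % "ph.d.", "n", 1)  # TypeError; wrap the arguments in parentheses to fix it

print("%s.%s.%02d" % ("ph.d.", "n", 1))  # corrected

# Do not use a mutable object such as a list as the default value of a named parameter. The program does not behave as expected, because we wrongly assumed that the default value is created each time the function is called. In fact it is created only once, when the Python interpreter loads the function definition. That one list object is reused whenever no explicit value is supplied to the function.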
def find_words(text, wordlength, result=[]):
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) )

print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 2, ['ur']) )

print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) )  # clearly wrong: the default list still holds the results of the first call
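
A common way around the shared default is to use None as the default and create the list inside the function. The sketch below applies that idiom to find_words(); it is one conventional fix, not the only possible one.

```python
def find_words(text, wordlength, result=None):
    if result is None:        # create a fresh list on every call
        result = []
    for word in text:
        if len(word) == wordlength:
            result.append(word)
    return result

words = ['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat']
first = find_words(words, 3)
second = find_words(words, 2, ['ur'])
third = find_words(words, 3)
print(first, second, third)
```

The first and third calls now return equal but separate lists.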

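6.4 Debugging Techniques

Since most code errors result from the programmer making incorrect assumptions, the first thing to do when hunting a bug is to check your assumptions. Localize the problem by adding print statements to the program that show the values of important variables and how far the program has progressed.
When a run-time error occurs, the interpreter prints a stack trace that pinpoints where the program was executing at the time of the error.
Python provides a debugger (pdb) that lets you monitor the execution of your program, specify line numbers where execution will stop (breakpoints), step through sections of code, and inspect the values of variables.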

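6.5 Defensive Programming

Consider adding assert statements to your code that specify properties of a variable, e.g., assert(isinstance(text, list)). If the value of text later becomes a string when your code is used in some larger context, this will raise an AssertionError and you will be notified of the problem immediately.
Once you think you have found the bug, treat your solution as a hypothesis. Try to predict the effect of your fix before re-running the program. If the bug is not fixed, do not fall into the trap of blindly changing the code in the hope that it will magically start working again. Instead, for each change, try to articulate a hypothesis about what is wrong and why the change will fix the problem. Then undo the change if the problem was not resolved.
As you develop your program, extending its functionality and fixing bugs, it helps to maintain a suite of test cases. This is called regression testing, since it is meant to detect situations where the code "regresses": where a change to the code has the unintended side effect of breaking something that used to work. Python provides a simple regression-testing framework in the form of the doctest module. This module searches a file of code or documentation for blocks of text that look like an interactive Python session, of the form you have already seen many times in the book. It executes the Python commands it finds and tests that their output matches the output supplied in the original file. Whenever there is a mismatch, it reports the expected and actual values. For details, see the doctest documentation at http://docs.python.org/library/doctest.html. Apart from its value for regression testing, the doctest module helps ensure that your software documentation stays in sync with your code.
Perhaps the most important defensive programming strategy is to set out your code clearly, choose meaningful variable and function names, and simplify the code wherever possible by decomposing it into functions and modules with well-documented interfaces.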

7 Algorithm Design

7.1 Recursion

https://blog.csdn.net/Claire_chen_jia/article/details/105757449

7.2 Space-Time Tradeoffs

from timeit import Timer
vocab_size = 100000
setup_list = "import random; vocab = list(range(%d))" % vocab_size    #[1]
setup_set = "import random; vocab = set(range(%d))" % vocab_size   #[2]
statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)     #[3]
print(Timer(statement, setup_list).timeit(1000))

print(Timer(statement, setup_set).timeit(1000))
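
A variant of the timing code above, assuming Python 3, adds a third container to the comparison. In Python 3 a range object answers integer membership tests arithmetically instead of scanning, so it behaves more like the set than the list; a real list has to be built with list(range(...)). The smaller vocabulary size is chosen only to keep the sketch fast.

```python
from timeit import Timer

vocab_size = 10000   # smaller than in the code above so the sketch runs quickly
setup_list = "import random; vocab = list(range(%d))" % vocab_size
setup_set = "import random; vocab = set(range(%d))" % vocab_size
setup_range = "import random; vocab = range(%d)" % vocab_size
statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)

list_time = Timer(statement, setup_list).timeit(1000)
set_time = Timer(statement, setup_set).timeit(1000)
range_time = Timer(statement, setup_range).timeit(1000)
print("list: %.4f s, set: %.4f s, range: %.4f s" % (list_time, set_time, range_time))
```

The absolute times depend on the machine; the point is the gap between the list and the other two.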

"""
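We can test the claim that membership tests on a set are much faster than on a list using the timeit module. The Timer class takes two parameters: a statement that is executed multiple times, and setup code that is executed once at the beginning.
We simulate a vocabulary of 100,000 items using a list [1] or a set [2] of integers.
The test statement generates a random item that has a 50% chance of being in the vocabulary [3].

In the reported run, performing 1000 list membership tests takes a total of 2.8 seconds, whereas the equivalent tests on a set take a mere 0.0037 seconds, or three orders of magnitude faster!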
"""

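7.3 Dynamic Programming

Dynamic programming is a general technique for designing algorithms that is widely used in natural language processing.
The term "programming" is used in a different sense from what you might expect: it means planning or scheduling. Dynamic programming is used when a problem contains overlapping subproblems. Instead of computing solutions to these subproblems repeatedly, we simply store them in a lookup table.
Pingala was an Indian author who lived around the 5th century B.C. and wrote a treatise on Sanskrit prosody called the Chandas Shastra. Virahanka extended this work around the 6th century A.D., studying the number of ways of combining short and long syllables to create a meter of length n. Short syllables, marked S, take up one unit of length, while long syllables, marked L, take two. Pingala found, for example, that there are five ways to construct a meter of length 4: V4 = {LL, SSL, SLS, LSS, SSSS}.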

"""
Four ways to compute Sanskrit meter:
(i) plain recursion;
(ii) bottom-up dynamic programming;
(iii) top-down dynamic programming;
(iv) built-in memoization.
"""
def virahanka1(n):
    """
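    Method 1: plain recursion.
    When computing V4, V2 is computed twice. This might not seem like a significant problem, but it turns out to be rather wasteful as n gets large: to compute V20 with this recursive technique,
    we would compute V2 4,181 times; and for V40 we would compute V2
    63,245,986 times!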
    """
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l

def virahanka2(n):
    """
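    Method 2: dynamic programming (bottom-up).
    The function virahanka2() implements a dynamic programming approach to the problem. It works by filling up a table (called lookup) with solutions to all smaller instances of the problem,
    stopping as soon as we reach the value we are interested in. At that point we read off the value and return it. Crucially, each subproblem is only ever solved once.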
    """
    lookup = [[""], ["S"]]
    for i in range(n-1):
        s = ["S" + prosody for prosody in lookup[i+1]]
        l = ["L" + prosody for prosody in lookup[i]]
        lookup.append(s + l)
    return lookup[n]

             
def virahanka3(n, lookup={0:[""], 1:["S"]}):
    """
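    Method 3: dynamic programming (top-down).
    Notice that the approach taken in virahanka2() is to solve smaller problems on the way to solving larger problems. Accordingly, this is known as the bottom-up approach to dynamic programming.
    Unfortunately it turns out to be quite wasteful for some applications, since it may compute solutions to subproblems that are never needed for solving the main problem.
    This wasted computation can be avoided using the top-down approach to dynamic programming, as shown by the function virahanka3() (Example 4-9 in the book).
    Unlike the bottom-up approach, this approach is recursive. It avoids the huge wastage of virahanka1() by checking whether it has previously stored the result.
    If not, it computes the result recursively and stores it in the table. The last step is to return the stored result.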
    """
    if n not in lookup:
        s = ["S" + prosody for prosody in virahanka3(n - 1)]
        l = ["L" + prosody for prosody in virahanka3(n - 2)]
        lookup[n] = s + l
    return lookup[n]

from nltk import memoize
@memoize
def virahanka4(n):
    """
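    Method 4: built-in memoization.
    virahanka4() uses a Python "decorator" called memoize, which takes care of the housekeeping work done by virahanka3() without cluttering up the program.
    This "memoized" function stores the result of each previous call along with the arguments that were used.
    If the function is subsequently called with the same arguments, it returns the stored result instead of recalculating it. (This aspect of Python syntax is beyond the scope of the book.)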
    """
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka4(n - 1)]
        l = ["L" + prosody for prosody in virahanka4(n - 2)]
        return s + l

print(virahanka1(4))

print(virahanka2(4))

print(virahanka3(4))

print(virahanka4(4))
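
The same memoization idea is available in the standard library as functools.lru_cache. The sketch below is a swapped-in variant, not one of the four functions above: it counts the meters of length n instead of listing them, which keeps large n tractable. The function name count_meters is made up for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_meters(n):
    """Number of meters of length n built from S (1 unit) and L (2 units)."""
    if n <= 1:
        return 1
    return count_meters(n - 1) + count_meters(n - 2)

print(count_meters(4))    # 5, matching the five meters in V4
print(count_meters(40))
```

Because each count_meters(k) is computed once and then read from the cache, the call for n = 40 finishes immediately, whereas the plain recursion would repeat the small subproblems millions of times.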

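8 A Sample of Python Libraries

8.1 Matplotlib

See also: Python data analysis | A comprehensive introduction to Matplotlib

# Example 4-10. Frequency of modal verbs in different sections of the Brown Corpus.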

from numpy import arange
from matplotlib import pyplot

colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    ind = arange(len(words))
    width = 1 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = pyplot.bar(ind+c*width, counts[categories[c]], width,
                         color=colors[c % len(colors)])
        bar_groups.append(bars)
    pyplot.xticks(ind+width, words)
    pyplot.legend([b[0] for b in bar_groups], categories, loc='upper left')
    pyplot.ylabel('Frequency')
    pyplot.title('Frequency of Six Modal Verbs by Genre')
    pyplot.show()
    
genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfdist = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in genres
    for word in nltk.corpus.brown.words(categories=genre)
    if word in modals)

counts = {}
for genre in genres:
    counts[genre] = [cfdist[genre][word] for word in modals]
bar_chart(genres, modals, counts)

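8.2 NetworkX

See also: Basic operations in networkx (graph theory)
NetworkX official website

# Import the module
import networkx as nx

# Create a directed graph
G = nx.DiGraph()
G.add_edge(2, 3)
G.add_edge(3, 2)
H = G.to_undirected()  # returns a new undirected copy; G itself is unchanged
print(H.edges)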

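8.3 The csv Module

import csv

# Open in text mode; newline="" is what the csv module expects for file objects
with open("lexicon.csv", newline="") as input_file:
    for row in csv.reader(input_file):
        print(row)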
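
Since lexicon.csv is not included with these notes, here is a self-contained sketch that reads the same kind of data from an in-memory string. The two rows are invented sample data; the point is that csv.reader() handles the quoted field containing spaces and returns each row as a list of strings.

```python
import csv
import io

# Stand-in for a file such as lexicon.csv; the rows are made up for the example.
lexicon_text = (
    'sleep,sli:p,v.i,"a condition of body and mind"\n'
    'walk,wo:k,v.intr,"progress by lifting and setting down each foot"\n'
)

rows = list(csv.reader(io.StringIO(lexicon_text)))
for word, pronunciation, pos, gloss in rows:
    print(word, pos, gloss)
```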

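8.4 NumPy

The NumPy package provides substantial support for numerical processing in Python. NumPy has a multidimensional array object that is easy to initialize and access.
See also: Python data analysis | A comprehensive introduction to NumPy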

9 Summary

10 Exercises
