NLP Course Notes 03

Latest recommended article published on 2021-02-23 17:32:06

罗小海

Category: NLP

Article link: https://blog.csdn.net/qq_32562429/article/details/98170958

Lesson-03

Dynamic Programming

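What is dynamic programming?
https://blog.csdn.net/mengmengdastyle/article/details/81809103
https://www.jianshu.com/p/69669c7bd69e

Basic idea:
Split a complex problem into stages, decomposing it into smaller local subproblems; then, following the recurrence relation between the subproblems, make a sequence of decisions until the whole problem reaches its overall optimum.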

Dynamic programming involves three key concepts:

Optimal substructure
Boundary (base cases)
State transition equation

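The general steps for solving a problem are:

Identify the properties of an optimal solution and characterize its structure and optimal substructure;
Define the optimal value recursively, describing how the solution of the original problem relates to the solutions of its subproblems;
Compute the optimal values of the subproblems and of the original problem bottom-up, avoiding repeated computation of subproblems;
Construct an optimal solution from the information recorded while computing the optimal value.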

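Signs that dynamic programming applies:

The problem asks for an optimal solution
The problem can be decomposed into subproblems, which in turn share overlapping smaller subproblems
The optimal solution of the whole problem depends on the optimal solutions of the subproblems (the state transition equation)
Analyze the problem top-down, solve it bottom-up
Work out the boundary cases at the lowest level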

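Task requirements:

Task 1:

The price table for steel bar lengths is as follows:

Length	1	2	3	4	5	6	7	8	9	10	11
Price	1	5	8	9	10	17	17	20	24	30	35

Given a steel bar of length N, find the best way to cut it so that revenue is maximized.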

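Task 2:

String editing: Edit Distance

Compute the minimum number of edits needed to turn one word into another, for example turning intention into execution.
At least 5 edits are needed, so Edit Distance = 5.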

I	N	T	E	*	N	T	I	O	N
*	E	X	E	C	U	T	I	O	N

Three operations:

Insertion
Deletion
Substitution

STEP1:

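$r_n = \max(p_n,\ r_1+r_{n-1},\ r_2+r_{n-2},\ \cdots,\ r_{n-1}+r_1)\tag{3.1}$

Here $p_n$ is the table price of an uncut bar of length n and $r_n$ is the best revenue for length n. Enumerate all cases and take the maximum; the code is as follows: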

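from collections import defaultdict

original_price = [1, 5, 8, 9, 10, 17, 17, 20, 24, 30, 35]
price = defaultdict(int)
for i, p in enumerate(original_price):
    price[i + 1] = p
# defaultdict(int) returns 0 for any key (length) that is not in the table

def r(n):
    # best revenue for length n: sell the bar uncut, or split it into i and n - i and recurse
    return max([price[n]] + [r(i) + r(n - i) for i in range(1, n)])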

IN: r(5)
OUT: 13

STEP2:

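STEP1 finds the maximum revenue but not the cuts that achieve it. We now modify the code so that it records them: solution stores the best split for each length.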

solution ={}
def r(n):
    max_price, max_split = max([(price[n],(0, n))] + [(r(i)+r(n-i),(i, n-i)) for i in range(1, n)], key=lambda x:x[0])
    
    solution[n] = (max_price, max_split)
    return max_price

IN: r(5)
OUT: 13
IN: solution
OUT:
{1: (1, (0, 1)),
2: (5, (0, 2)),
3: (8, (0, 3)),
4: (10, (2, 2)),
5: (13, (2, 3)),
6: (17, (0, 6))}
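
The solution table stores only one split per length, so the full list of pieces still has to be read back recursively. A minimal sketch of how that could look, assuming the price table and the split-recording r(n) from above (both repeated so the block runs on its own); the helper name parse_solution is hypothetical and not part of the original notes:

```python
from collections import defaultdict

original_price = [1, 5, 8, 9, 10, 17, 17, 20, 24, 30, 35]
price = defaultdict(int)
for i, p in enumerate(original_price):
    price[i + 1] = p

solution = {}

def r(n):
    # Same recurrence as STEP2: best revenue for length n, remembering the split.
    max_price, max_split = max(
        [(price[n], (0, n))] + [(r(i) + r(n - i), (i, n - i)) for i in range(1, n)],
        key=lambda x: x[0],
    )
    solution[n] = (max_price, max_split)
    return max_price

def parse_solution(n):
    # Hypothetical helper: expand the recorded split of n into final piece lengths.
    left, right = solution[n][1]
    if left == 0:  # (0, n) means "sell the piece of length n uncut"
        return [right]
    return parse_solution(left) + parse_solution(right)

best = r(8)
pieces = parse_solution(8)
print(best, pieces)
```

The pieces returned always add up to the original length, and their table prices add up to the revenue reported by r(n).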

STEP3:

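Here is the problem: both STEP1 and STEP2 repeat a large amount of computation. Writing $T(n)$ for the cost of evaluating r(n), the time complexity is:
$\begin{aligned} T(n)&=2\,(T(1)+T(2)+T(3)+\cdots+T(n-1))\\ T(n-1)&=2\,(T(1)+T(2)+T(3)+\cdots+T(n-2))\\ &\ \ \vdots \\ T(2)&=2\,T(1)\\ T(1)&=c \quad \rightarrow \text{base case} \end{aligned}$

$O(3^n)\tag{3.3}$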
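
Where the factor 3 comes from can be seen with a short derivation. Here $T(n)$ stands for the exact number of calls made while evaluating r(n), counting the call itself as one unit (a refinement assumed for this sketch). Each call to r(n) makes two recursive calls for every split point i from 1 to n-1:

```latex
\begin{aligned}
T(1) &= 1\\
T(n) &= 1 + 2\sum_{i=1}^{n-1} T(i)\\
T(n) - T(n-1) &= 2\,T(n-1) \;\Rightarrow\; T(n) = 3\,T(n-1) = 3^{\,n-1}
\end{aligned}
```

For n = 10 this gives $3^9 = 19683$ calls in total.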

Let us count the calls to r(n) on a concrete example. To avoid modifying r(n) itself, we define a new function and use it as a decorator for r(n):

from functools import wraps
called_time_with_arg = defaultdict(int)
def get_call_time(f):
    @wraps(f)
    def wrap(n):
        result = f(n)
        called_time_with_arg[(f.__name__, n)] += 1
        return result
    return wrap

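@wraps is used so that the decorated function keeps its original __name__.
For details see: https://www.jianshu.com/p/5df1769e562e

We then decorate r(n) with @get_call_time and run it. For a steel bar of length 10 we get the following call counts: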

defaultdict(int,
            {('r', 1): 13122,
             ('r', 2): 4374,
             ('r', 3): 1458,
             ('r', 4): 486,
             ('r', 5): 162,
             ('r', 6): 54,
             ('r', 7): 18,
             ('r', 8): 6,
             ('r', 9): 2,
             ('r', 10): 1})

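Clearly far too much work is repeated. As n grows, the running time grows exponentially, so we need to cut down the number of repeated computations.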
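
The decoration step described above is not shown as code. A self-contained sketch of it, assuming the same price table, decorator, and plain recursive r(n) as earlier in these notes, could look like this:

```python
from collections import defaultdict
from functools import wraps

original_price = [1, 5, 8, 9, 10, 17, 17, 20, 24, 30, 35]
price = defaultdict(int)
for i, p in enumerate(original_price):
    price[i + 1] = p

called_time_with_arg = defaultdict(int)

def get_call_time(f):
    @wraps(f)
    def wrap(n):
        result = f(n)
        called_time_with_arg[(f.__name__, n)] += 1
        return result
    return wrap

@get_call_time
def r(n):
    # Recursive calls resolve to the decorated r, so each one is counted.
    return max([price[n]] + [r(i) + r(n - i) for i in range(1, n)])

r(10)
total_calls = sum(called_time_with_arg.values())
print(called_time_with_arg[('r', 1)], total_calls)  # → 13122 19683
```

The total of 19683 equals 3 to the power 9, which is the exponential growth the notes warn about.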

STEP4:

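To reduce the repeated work, store every computed result in a dictionary. On each call, look in the dictionary first: if the result is there, return it directly; otherwise compute it. The code is as follows: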

def memo(f):
    memo.already_computed = {}
    @wraps(f)
    def wrap(n):
        if n not in memo.already_computed:
            result = f(n)
            memo.already_computed[n]=result
            return result
        else:
            return memo.already_computed[n]
    return wrap

Then decorate the earlier r(n) function with @memo:

solution ={}
@memo
def r(n):
    max_price, max_split = max([(price[n],(0, n))] + [(r(i)+r(n-i),(i, n-i)) for i in range(1, n)], key=lambda x:x[0])
    
    solution[n] = (max_price, max_split)
    return max_price

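Now r(n) runs remarkably fast. (With %%timeit, every loop after the first call is answered from the cache, so this figure mostly measures a dictionary lookup.)
IN: %%timeit
IN: r(400)
OUT: 339 ns ± 3.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Before the optimization:
IN: %%timeit
IN: r(10)
OUT: 55.6 ms ± 3.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)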
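
The memoized recursion is a top-down solution. The bottom-up order mentioned in the general steps can be sketched as a table filled from length 1 upward. This sketch is not from the original notes: the function name cut_rod_bottom_up is hypothetical, and it uses the equivalent "first piece plus best remainder" form of the recurrence instead of splitting into two recursive halves.

```python
original_price = [1, 5, 8, 9, 10, 17, 17, 20, 24, 30, 35]

def cut_rod_bottom_up(n, prices=original_price):
    # best[k] is the best revenue for length k; best[0] = 0 is the boundary case.
    best = [0] * (n + 1)
    first_cut = [0] * (n + 1)
    for k in range(1, n + 1):
        # Try every table length as the first piece sold from a bar of length k.
        for piece in range(1, min(k, len(prices)) + 1):
            candidate = prices[piece - 1] + best[k - piece]
            if candidate > best[k]:
                best[k] = candidate
                first_cut[k] = piece
    # Walk the recorded first cuts to rebuild one optimal list of pieces.
    pieces = []
    k = n
    while k > 0:
        pieces.append(first_cut[k])
        k -= first_cut[k]
    return best[n], pieces

print(cut_rod_bottom_up(5))  # → (13, [2, 3])
```

Because it uses loops instead of recursion, this variant is not affected by Python's recursion limit for large n.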

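Let us test the call counts again:

@get_call_time
@memo
def r(n):
    ...  # same body as the r(n) above

Running r(500) gives the called_time_with_arg shown below. Why is each level not run exactly once? The cause is the double decoration: for example, ('r', 1): 998 does not mean that the body of r(1) executed 998 times. get_call_time is the outer decorator, so it counts every call to r(1), including the 997 calls that the memo wrapper answers from its cache.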

defaultdict(int,
            {('r', 1): 998,
             ('r', 2): 996,
             ('r', 3): 994,
             ('r', 4): 992,
             ('r', 5): 990,
             ('r', 6): 988,
             ('r', 7): 986,
             #....
             ('r', 500): 1
            })
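
To count how often the body of r(n) really executes, the two decorators can be applied in the opposite order. The sketch below assumes the same price table and get_call_time as above; its memo keeps the cache in a closure variable instead of a function attribute, which is a small variation on the version in these notes.

```python
from collections import defaultdict
from functools import wraps

original_price = [1, 5, 8, 9, 10, 17, 17, 20, 24, 30, 35]
price = defaultdict(int)
for i, p in enumerate(original_price):
    price[i + 1] = p

called_time_with_arg = defaultdict(int)

def get_call_time(f):
    @wraps(f)
    def wrap(n):
        result = f(n)
        called_time_with_arg[(f.__name__, n)] += 1
        return result
    return wrap

def memo(f):
    already_computed = {}  # one private cache per decorated function
    @wraps(f)
    def wrap(n):
        if n not in already_computed:
            already_computed[n] = f(n)
        return already_computed[n]
    return wrap

@memo            # outermost: answers repeated calls from the cache
@get_call_time   # innermost: only reached when the body really executes
def r(n):
    return max([price[n]] + [r(i) + r(n - i) for i in range(1, n)])

r(50)
print(set(called_time_with_arg.values()))  # → {1}
```

With this order every length from 1 to 50 is computed exactly once.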

STEP_1:

String editing: Edit Distance, find the minimum edit distance

Turn intention into execution

I	N	T	E	*	N	T	I	O	N
*	E	X	E	C	U	T	I	O	N

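Analysis: string a has length n, string b has length m.

Looking at the last positions of the two strings a and b, three cases can occur:

The last characters of a and b are aligned with each other, with two sub-cases: a[-1]==b[-1] (no cost) and a[-1]!=b[-1] (one substitution)
a is aligned with the first m-1 characters of b (the last character of b is inserted)
The first n-1 characters of a are aligned with b (the last character of a is deleted)

The case with the smallest distance among the three is the answer.
(Each of the three cases leads back to the same computation on new, shorter a and b.)
$\begin{aligned} \text{Boundary: }\ & D(i,0) = i,\quad D(0,j) = j\\ \text{State transition: }\ & D(i,j) = \min \begin{cases} D(i-1,j) + 1\\ D(i,j-1) + 1\\ D(i-1,j-1) + 1 \ \ \text{if}\ X[i]\neq Y[j]\ \text{else}\ D(i-1,j-1) \end{cases} \end{aligned}\tag{3.1}$
The same analysis can also start from the first characters of the strings; the cases are identical.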

The pseudocode is as follows:

edit_distance:
Input: two strings x of length n , y of length m
Output: min distance and its path
1:if n=0 then return m //base case
2:if m=0 then return n //base case
3:x_1 = 1 to n-1 element of x
4:y_1 = 1 to m-1 element of y
5:candidates =
edit_distance(x_1, y) + 1
edit_distance(x, y_1) + 1
edit_distance(x_1, y_1) + 1 if x[n]!=y[m] else edit_distance(x_1, y_1)
6:min of candidates
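
The state transition equation can also be evaluated bottom-up by filling the table D row by row. The following is a minimal sketch under that assumption; the function name edit_distance_table is hypothetical, and substitution costs 1, as in the equation above.

```python
def edit_distance_table(x, y):
    n, m = len(x), len(y)
    # D[i][j] = edit distance between the first i chars of x and the first j chars of y.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all i characters
    for j in range(m + 1):
        D[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution_cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,                      # deletion
                D[i][j - 1] + 1,                      # insertion
                D[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return D[n][m]

print(edit_distance_table('intention', 'execution'))  # → 5
```

The table has (n+1)(m+1) cells and each cell is computed once from three neighbours, so the running time is proportional to n times m.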

STEP_2:

Implementation:

from functools import lru_cache
solution = {}
@lru_cache(maxsize=2**10)  # cache results so that subproblems are not recomputed
def edit_distance_start_0(string1, string2):
    '''Analyze starting from the first characters.'''
    if len(string1)==0 : return len(string2)  #Base case
    if len(string2)==0 : return len(string1)  #Base case
    
    head_s1 = string1[0]
    head_s2 = string2[0]
    
    candidates = [
        (edit_distance_start_0(string1[1:], string2)+1 , 'DEL {}'.format(head_s1)),  # delete head_s1, then match string1[1:] against string2
        (edit_distance_start_0(string1, string2[1:])+1 , 'ADD {}'.format(head_s2))  # insert head_s2, then match string1 against string2[1:]
    ]
    
    if head_s1==head_s2:
        candidates.append((edit_distance_start_0(string1[1:], string2[1:])+ 0 , 'No Actions'))
    else:
        candidates.append((edit_distance_start_0(string1[1:], string2[1:])+1 , 'SUB {} => {}'.format(head_s1, head_s2)))
        
                        
    min_distance, steps = min(candidates, key = lambda x:x[0])
    solution[(string1, string2)] = steps 
    
    return min_distance

IN: edit_distance_start_0('intention', 'execution')
OUT: 5
Five edits are needed to complete the change.

solution =

{('n', 'n'): 'No Actions',
 ('n', 'on'): 'ADD o',
 ('n', 'ion'): 'ADD i',
 ('n', 'tion'): 'ADD t',
 ('n', 'ution'): 'ADD u',
 ('n', 'cution'): 'ADD c',
 ('n', 'ecution'): 'ADD e',
 ('n', 'xecution'): 'ADD x',
 ('n', 'execution'): 'ADD e',
 ('on', 'n'): 'DEL o',
 ('on', 'on'): 'No Actions',
 ('on', 'ion'): 'ADD i',
 #.....
 ('intention', 'execution'): 'DEL i'}

To recover the sequence of edits, we define a function that walks through solution:

def edit_distance_find_path(solution, string1, string2):
    current = string1, string2
    paths = []
    while(current in solution):
        current_action = solution[current]
        
        if current_action.startswith('ADD'):
            paths.append((current, current_action))
            current = current[0], current[1][1:]     
            
        elif current_action.startswith('DEL'):
            paths.append((current, current_action))
            current = current[0][1:], current[1]
            
        else :
            paths.append((current, current_action))
            current = current[0][1:], current[1][1:]
    
    return paths

IN: edit_distance_find_path(solution, 'intention', 'execution')
OUT:

[(('intention', 'execution'), 'DEL i'),
 (('ntention', 'execution'), 'SUB n => e'),
 (('tention', 'xecution'), 'SUB t => x'),
 (('ention', 'ecution'), 'No Actions'),
 (('ntion', 'cution'), 'ADD c'),
 (('ntion', 'ution'), 'SUB n => u'),
 (('tion', 'tion'), 'No Actions'),
 (('ion', 'ion'), 'No Actions'),
 (('on', 'on'), 'No Actions'),
 (('n', 'n'), 'No Actions')]

This shows exactly how the 5 edits are made: every step other than 'No Actions' is one edit.

Dynamic Programming Homework

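Task requirements:

Given n points, pick any one of them as the starting point and find the shortest route that starts there and passes through all the points.

Harder version: given n points, pick several of them as starting points and find routes from these starting points that together cover all the points with the shortest total distance.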

STEP1:

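Analysis:

Suppose there are n points numbered 1 to n, and we start from an arbitrarily chosen point j.
Let d(i,j) be the distance between points i and j, and let D(S,j) be the length of the shortest route that starts at j and visits every point of the set S (with j in S).
$\begin{aligned} D(\{j,k\},j) &= d(j,k)\\ D(S,j) &= \min_{i \in S,\ i \neq j}\bigl[\,d(j,i) + D(S\setminus\{j\},\,i)\,\bigr] \end{aligned}$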

STEP2:

Generate the point set:

import random

latitudes = [random.randint(-100, 100) for _ in range(20)]
longitude = [random.randint(-100, 100) for _ in range(20)]
chosen_p = (-50, 10)
point_location = {}
for i in range(len(latitudes)):
    point_location[str(i+1)] = (latitudes[i], longitude[i])

# the chosen starting point becomes point '21'
point_location[str(len(latitudes)+1)] = chosen_p

Define the function d(i,j) from STEP1:

import math
def distance_calcu(point1, point2):
    return math.sqrt((point1[0]-point2[0])**2 + (point1[1]- point2[1])**2)

STEP3:

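Write the search function based on the analysis in STEP1:
The point set is passed in as a string because @lru_cache uses the arguments of each recursive call as a dictionary key and the return value as the cached value, so the arguments must be hashable. A list is not hashable and cannot serve as a key; a tuple or frozenset of strings would also work.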

solution_path = {}
@lru_cache(maxsize=2**30)
def min_way(string, i):
    ''' string: the set of points, as a space-separated string so that it can be cached
        i: the starting point, as a string
    '''
    array_n = string.split(' ')
    
    if len(array_n) == 2:  # Base case (boundary condition)
        # pick the other point by comparison; string.replace(i, '') would also strip the digits of i from ids such as '12'
        other = array_n[1] if array_n[0] == i else array_n[0]
        solution_path[(string, i)] = (i, other)
        return distance_calcu(point_location[i], point_location[other])
    
    array_n.remove(i)
    string_new = ' '.join(array_n)
       
    # candidate state transitions: go from i to j, then solve the rest starting at j
    candidates = [(distance_calcu(point_location[i], point_location[j])+ min_way(string_new, j),(i,j)) for j in array_n]
    
    # pick the best next point
    min_distance, way = min(candidates, key = lambda x:x[0])
    
    # record the current best choice in solution_path
    solution_path[(string,i)] = way
    
    return min_distance

Instead of @lru_cache, you can also define your own cache decorator memo:

def memo(f):
    memo.already_calcu = {}
    @wraps(f)
    def wrap(string, i):
        if (string, i) not in memo.already_calcu:
            distance = f(string, i)
           #print('test')
            memo.already_calcu[(string, i)] = distance
            return distance
        else:
            return memo.already_calcu[(string, i)]
    return wrap
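
Since the cache only needs hashable arguments, the point set can also be passed as a frozenset instead of a space-separated string. The sketch below is an illustrative variant, not the code from these notes: the names shortest_tour, dist, and points are hypothetical, the sample coordinates are made up, and the state is "points still to visit" (excluding the current point), so the boundary case is the empty set.

```python
import math
from functools import lru_cache

# Hypothetical sample points, chosen only for illustration.
points = {'1': (0, 0), '2': (1, 0), '3': (3, 0), '4': (3, 4)}

def dist(a, b):
    (x1, y1), (x2, y2) = points[a], points[b]
    return math.hypot(x1 - x2, y1 - y2)

@lru_cache(maxsize=None)
def shortest_tour(remaining, start):
    # remaining: frozenset of unvisited point ids, not including start.
    if not remaining:
        return 0.0, (start,)
    best = None
    for nxt in remaining:
        length, path = shortest_tour(remaining - {nxt}, nxt)
        candidate = (dist(start, nxt) + length, (start,) + path)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best

length, path = shortest_tour(frozenset(points) - {'1'}, '1')
print(length, path)  # → 7.0 ('1', '2', '3', '4')
```

Returning the path together with the length removes the need for a separate solution dictionary, at the cost of storing a tuple per cached state.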

STEP4:

Test whether the function from STEP3 works:

string = ' '.join(point_location)
#string = '1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21'

%%time
min_way(string, '21')
#Wall time: 6min 36s
#675.9963624776407

Write a function that extracts the route we need from the solution table of optimal choices:

def find_path(solution, string, i):
    connection = {}
    current = string, i
   # print(current)
    while current in solution:
        from_, to_ = solution[current]
        connection[from_] = [to_.strip()]
        
        temp = current[0].split(' ')
        temp.remove(from_)
        
        current =  ' '.join(str(i) for i in temp), to_   
    
    return connection

# find the route and store it in nn
nn  = find_path(solution_path, string, '21')
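
find_path returns a dictionary of from -> [to] links, not an ordered route. A small sketch of how such a dictionary could be turned into an ordered list and its length checked follows; the helper names and the sample data are hypothetical, and math.dist requires Python 3.8 or later.

```python
import math

def route_from_connection(connection, start):
    # Follow from -> [to] links until a point has no outgoing link.
    route = [start]
    while route[-1] in connection:
        route.append(connection[route[-1]][0])
    return route

def route_length(route, locations):
    # Sum of Euclidean distances between consecutive points on the route.
    return sum(math.dist(locations[a], locations[b]) for a, b in zip(route, route[1:]))

# Hypothetical data in the same shape as nn and point_location.
sample_connection = {'3': ['1'], '1': ['2']}
sample_locations = {'1': (0, 0), '2': (0, 5), '3': (4, 0)}

route = route_from_connection(sample_connection, '3')
print(route, route_length(route, sample_locations))  # → ['3', '1', '2'] 9.0
```

Applied to nn and point_location, the computed length should agree with the value returned by min_way.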

Draw the route that was found as a graph:

import networkx as nx 
nx.draw(nx.Graph(nn), point_location, with_labels = True, node_size = 10)

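For example, the plot of the original point set:

STEP5:

I compared this with other students' algorithms for the same problem. Some use two nested loops and always take the shortest distance from the current point back to the starting point, which resembles a shortest-path search. But this problem specifies no end point; it only requires visiting every point. Approaches that greedily minimize the current distance are therefore limited: they are short-sighted (they only see the current step), so the result is not guaranteed to be optimal. The algorithm here, in contrast, effectively considers every possible ordering. With 20 points left to visit after the start there are 20! orderings, so the amount of computation grows quickly with the number of points. Even with caching it takes a while to run, and without caching it does not finish at all.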