## Introduction to Algorithms
An algorithm is an effective method that can be expressed within a finite amount of space and time, in a well-defined formal language, for calculating a function.
- Binary search:
Binary search code for the number-guessing game:
```python
def binary_search(list, item):
    # low and high keep track of which part of the list you'll search in.
    low = 0
    high = len(list) - 1
    # While you haven't narrowed it down to one element ...
    while low <= high:
        # ... check the middle element.
        mid = (low + high) // 2
        guess = list[mid]
        # Found the item.
        if guess == item:
            return mid
        # The guess was too high.
        if guess > item:
            high = mid - 1
        # The guess was too low.
        else:
            low = mid + 1
    # Item doesn't exist.
    return None

my_list = [1, 3, 5, 7, 9]
print(binary_search(my_list, 3))  # => 1
# 'None' means nil in Python. We use it to indicate that the item wasn't found.
print(binary_search(my_list, -1))  # => None
```
Some common Big O running times:
- O(log n): logarithmic time, e.g. binary search;
- O(n): linear time, e.g. simple search;
- O(n * log n): e.g. quicksort;
- O(n^2): slower sorting algorithms, such as selection sort;
- O(n!): e.g. the brute-force solution to the traveling salesperson problem.
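To get a rough feel for how these classes diverge, a small sketch (assuming, hypothetically, one unit of work per operation) that tabulates the operation counts at a few input sizes:

```python
import math

# Rough operation counts for each Big O class at a few input sizes,
# assuming (hypothetically) one operation per unit of work.
for n in [10, 100, 1000]:
    ops = {
        "O(log n)": round(math.log2(n)),
        "O(n)": n,
        "O(n log n)": round(n * math.log2(n)),
        "O(n^2)": n ** 2,
    }
    print(n, ops)
```

Even at n = 1000, O(n^2) already needs about a hundred times more operations than O(n * log n), which is why the choice of algorithm matters long before the inputs get huge.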
Some takeaways:
- An algorithm's speed isn't measured in time, but in how fast its number of operations grows;
- When we talk about an algorithm's speed, we mean how its running time grows as the input grows;
- Algorithm running times are expressed in Big O notation.
## Selection Sort
How memory works: when you store data in memory, you ask the computer for storage space and it gives you an address where the data can be stored. When you need to store multiple items, there are two basic ways to do so: arrays and linked lists.
- Running times of common array and linked-list operations:

| | Array | Linked list |
| --- | --- | --- |
| Reading | O(1) | O(n) |
| Insertion | O(n) | O(1) |
| Deletion | O(n) | O(1) |

Note: deletion runs in O(1) only when you can instantly access the element to delete; this is why we usually keep track of a linked list's first and last elements.
- Random access vs. sequential access: linked lists allow only sequential access, while arrays allow both. Because arrays support random access, they are faster to read from and are used in more scenarios.
- Selection sort implementation:
```python
# Finds the index of the smallest value in an array.
def findSmallest(arr):
    # Stores the smallest value.
    smallest = arr[0]
    # Stores the index of the smallest value.
    smallest_index = 0
    for i in range(1, len(arr)):
        if arr[i] < smallest:
            smallest = arr[i]
            smallest_index = i
    return smallest_index

# Sorts an array.
def selectionSort(arr):
    newArr = []
    for i in range(len(arr)):
        # Finds the smallest element in the array and adds it to the new array.
        smallest = findSmallest(arr)
        newArr.append(arr.pop(smallest))
    return newArr

print(selectionSort([5, 3, 6, 2, 10]))  # => [2, 3, 5, 6, 10]
```
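The array vs. linked-list trade-off above can be felt directly in Python, where `list` is a dynamic array and `collections.deque` behaves like a linked list for insertions at the ends. A rough sketch (absolute times depend on your machine; only the relative gap matters):

```python
import timeit
from collections import deque

n = 50_000

# Inserting at the front of a Python list (a dynamic array) is O(n)
# per insert: every existing element has to shift over by one slot.
def front_insert_list():
    items = []
    for i in range(n):
        items.insert(0, i)

# Inserting at the front of a deque (linked-list-like) is O(1) per insert.
def front_insert_deque():
    items = deque()
    for i in range(n):
        items.appendleft(i)

t_list = timeit.timeit(front_insert_list, number=1)
t_deque = timeit.timeit(front_insert_deque, number=1)
print(f"list: {t_list:.4f}s, deque: {t_deque:.4f}s")
```

On any realistic machine the deque version finishes orders of magnitude faster, which is exactly the O(n)-per-insert vs. O(1)-per-insert gap in the table.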
## Recursion
If you use a loop, your program may perform better; if you use recursion, it may be easier to understand.
- Every recursive function has two cases: the base case (where it doesn't call itself) and the recursive case (where it does).
- Call stack & recursive call stack: the order in which the computer's memory is used; every function call goes onto the call stack.
- Stack: internally, the computer uses a stack called the call stack.
- A stack supports two operations: push and pop.
- A deep call stack can take up a lot of memory; you can optimize by using tail recursion, using a loop, or rewriting the code.
- Example code:
```python
# Recursive count.
def count(list):
    if list == []:
        return 0
    return 1 + count(list[1:])

# Recursive max.
def max(list):
    if len(list) == 1:
        return list[0]
    sub_max = max(list[1:])
    return list[0] if list[0] > sub_max else sub_max

# Factorial.
def fact(x):
    if x == 1:
        return 1
    else:
        return x * fact(x - 1)

print(fact(5))  # => 120
```
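The note above about rewriting deep recursion as a loop can be sketched with `fact`: the iterative version below (a hypothetical name, not from the original) computes the same result in constant stack space.

```python
def fact_iter(x):
    # Same result as the recursive fact, but no growing call stack:
    # the running product lives in a single local variable.
    result = 1
    while x > 1:
        result *= x
        x -= 1
    return result

print(fact_iter(5))  # => 120
```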
## Quicksort
D & C (divide and conquer): a recursive technique for solving problems; quicksort is a classic example of it.
- Quicksort code (the recursion is O(log n) levels deep on average, each level takes O(n) time, so the algorithm runs in O(n * log n)):
```python
def quicksort(array):
    if len(array) < 2:
        # Base case: arrays with 0 or 1 element are already "sorted".
        return array
    else:
        # Recursive case.
        pivot = array[0]
        # Sub-array of all the elements less than or equal to the pivot.
        less = [i for i in array[1:] if i <= pivot]
        # Sub-array of all the elements greater than the pivot.
        greater = [i for i in array[1:] if i > pivot]
        return quicksort(less) + [pivot] + quicksort(greater)

print(quicksort([10, 5, 2, 3]))  # => [2, 3, 5, 10]
```
- A quick review of Big O notation (the comparison assumes 10 operations per second, just to build a rough intuition).
- Average (best) case and worst case: quicksort's performance depends heavily on the pivot you choose, which is where its best and worst cases come from.
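Since the worst case comes from consistently picking a bad pivot (e.g. always taking the first element of an already-sorted array), one standard variation, not from the text above but worth sketching, is to pick the pivot at random, which makes the expected running time O(n * log n) on any input:

```python
import random

def quicksort_random(array):
    if len(array) < 2:
        return array
    # A random pivot makes the pathological O(n^2) case very unlikely,
    # regardless of how the input happens to be ordered.
    pivot = array[random.randrange(len(array))]
    less = [i for i in array if i < pivot]
    equal = [i for i in array if i == pivot]
    greater = [i for i in array if i > pivot]
    return quicksort_random(less) + equal + quicksort_random(greater)

print(quicksort_random([10, 5, 2, 3]))  # => [2, 3, 5, 10]
```

Note the separate `equal` bucket: it also keeps duplicates of the pivot from being dropped or recursed on forever.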
## Hash Tables
- Hash function: maps an input to a number. It always maps the same input to the same number, and ideally maps different inputs to different numbers.
- Use cases for hash tables: modeling mappings (relationships), preventing duplicates, caching data, and so on.
- The hash function matters: an ideal hash function distributes keys evenly across the hash table.
- Example usage:
```python
voted = {}

def check_voter(name):
    if voted.get(name):
        print("kick them out!")
    else:
        voted[name] = True
        print("let them vote!")

check_voter("tom")   # => let them vote!
check_voter("mike")  # => let them vote!
check_voter("mike")  # => kick them out!
```
## Breadth-First Search
- Breadth-first search: an algorithm for the shortest-path problem; it runs in O(V + E), where V is the number of vertices and E is the number of edges.
- Graph: a graph is made up of nodes and edges. A node can be directly connected to many other nodes (its neighbors).
- Queue: a first-in, first-out (FIFO) data structure; a stack, by contrast, is last-in, first-out (LIFO).
- Example:
```python
from collections import deque

def person_is_seller(name):
    return name[-1] == 'm'

graph = {}
graph["you"] = ["alice", "bob", "claire"]
graph["bob"] = ["anuj", "peggy"]
graph["alice"] = ["peggy"]
graph["claire"] = ["thom", "jonny"]
graph["anuj"] = []
graph["peggy"] = []
graph["thom"] = []
graph["jonny"] = []

def search(name):
    search_queue = deque()
    search_queue += graph[name]
    # This list is how you keep track of which people you've searched before.
    searched = []
    while search_queue:
        person = search_queue.popleft()
        # Only search this person if you haven't already searched them.
        if person not in searched:
            if person_is_seller(person):
                print(person + " is a mango seller!")
                return True
            else:
                search_queue += graph[person]
                # Mark this person as searched.
                searched.append(person)
    return False

search("you")  # => thom is a mango seller!
```
## Dijkstra's Algorithm
Breadth-first search finds the path with the fewest segments, which is not necessarily the fastest path. Dijkstra's algorithm solves the problem of finding the fastest path.

The four steps:
- Find the "cheapest" node, i.e. the one you can reach in the least time;
- For that node's neighbors, check whether there is a cheaper path to them through this node; if so, update their costs;
- Repeat until you've done this for every node in the graph;
- Calculate the final path.

Weights: in this algorithm, every edge of the graph has a number associated with it.

(Un)weighted graph: a graph with(out) weights. To find the shortest path in an unweighted graph, use breadth-first search; in a weighted graph, use Dijkstra's algorithm.

Example:
```python
# The graph: each node maps to its neighbors and the edge weights.
graph = {}
graph["start"] = {}
graph["start"]["a"] = 6
graph["start"]["b"] = 2
graph["a"] = {}
graph["a"]["fin"] = 1
graph["b"] = {}
graph["b"]["a"] = 3
graph["b"]["fin"] = 5
graph["fin"] = {}

# The costs table: cheapest known cost to reach each node from the start.
infinity = float("inf")
costs = {}
costs["a"] = 6
costs["b"] = 2
costs["fin"] = infinity

# The parents table: the previous node on the cheapest known path.
parents = {}
parents["a"] = "start"
parents["b"] = "start"
parents["fin"] = None

processed = []

def find_lowest_cost_node(costs):
    lowest_cost = float("inf")
    lowest_cost_node = None
    # Go through each node.
    for node in costs:
        cost = costs[node]
        # If it's the lowest cost so far and hasn't been processed yet...
        if cost < lowest_cost and node not in processed:
            # ... set it as the new lowest-cost node.
            lowest_cost = cost
            lowest_cost_node = node
    return lowest_cost_node

# Find the lowest-cost node that you haven't processed yet.
node = find_lowest_cost_node(costs)
# If you've processed all the nodes, this while loop is done.
while node is not None:
    cost = costs[node]
    # Go through all the neighbors of this node.
    neighbors = graph[node]
    for n in neighbors.keys():
        new_cost = cost + neighbors[n]
        # If it's cheaper to get to this neighbor by going through this node...
        if costs[n] > new_cost:
            # ... update the cost for this node.
            costs[n] = new_cost
            # This node becomes the new parent for this neighbor.
            parents[n] = node
    # Mark the node as processed.
    processed.append(node)
    # Find the next node to process, and loop.
    node = find_lowest_cost_node(costs)

print("Cost from the start to each node:")
print(costs)  # => {'a': 5, 'b': 2, 'fin': 6}
```
## Greedy Algorithms
- Greedy algorithm: at each step, pick the locally optimal move, in the hope of ending up with a globally optimal (or at least good enough) solution. Simple and easy to implement.
- Approximation algorithm: when computing the exact solution would take too long, you can use an approximation algorithm instead; it is judged on two criteria: how fast it is and how close it gets to the optimal solution.
- NP-complete problems: problems for which no fast algorithm is known. To solve the set-covering problem exactly, you would have to compute every possible subset.
- How to tell a problem may be NP-complete (e.g. the traveling salesperson problem):
  - The algorithm runs quickly with a handful of items, but slows way down as the number of items grows;
  - It involves "all combinations of X";
  - You can't break the problem into smaller subproblems; you have to consider every possible case;
  - It involves a permutation, a set of combinations, or a set-covering problem and is hard to solve.
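The set-covering problem mentioned above has a classic greedy approximation: keep picking whichever set covers the most still-uncovered items. A sketch with made-up radio stations and the states they cover:

```python
# States we need to cover, and which states each (hypothetical) station covers.
states_needed = {"mt", "wa", "or", "id", "nv", "ut", "ca", "az"}
stations = {
    "kone": {"id", "nv", "ut"},
    "ktwo": {"wa", "id", "mt"},
    "kthree": {"or", "nv", "ca"},
    "kfour": {"nv", "ut"},
    "kfive": {"ca", "az"},
}

final_stations = set()
while states_needed:
    best_station = None
    states_covered = set()
    # Greedy step: pick the station covering the most uncovered states.
    for station, states_for_station in stations.items():
        covered = states_needed & states_for_station
        if len(covered) > len(states_covered):
            best_station = station
            states_covered = covered
    states_needed -= states_covered
    final_stations.add(best_station)

print(final_stations)
```

Checking every subset of stations would take O(2^n) time; this greedy pass runs in O(n^2) and is guaranteed to be within a logarithmic factor of the optimal cover, even though it may not be exactly optimal.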
## Dynamic Programming
How it works: solve the subproblems first, then work up to the big problem.
- Takeaways:
  - Dynamic programming finds an optimal solution under given constraints;
  - It's worth considering when the problem can be broken into independent subproblems;
  - Every dynamic-programming solution involves a grid;
  - The values in the cells are usually what you're trying to optimize;
  - Each cell is a subproblem, so think about how to divide your problem into subproblems; that helps you figure out what the axes of the grid should be.
- The Feynman algorithm:
  - Write down the problem;
  - Think real hard;
  - Write down the solution.
- Applications:
  - Biologists use the longest common subsequence to find similarities in DNA strands;
  - The implementation of the git diff algorithm;
  - String similarity (edit distance);
  - Hyphenation and word-wrap features in Microsoft Word, and so on.
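The longest-common-subsequence application above can be sketched with the usual grid: `cell[i][j]` holds the LCS length of the first `i` characters of one word and the first `j` characters of the other, and each cell is filled from its already-solved neighbors.

```python
def lcs_length(a, b):
    # cell[i][j] = length of the longest common subsequence of a[:i] and b[:j].
    # Row 0 and column 0 stay 0: the LCS with an empty string is empty.
    cell = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # The letters match: extend the diagonal subproblem by one.
                cell[i][j] = cell[i - 1][j - 1] + 1
            else:
                # No match: take the best of the two neighboring subproblems.
                cell[i][j] = max(cell[i - 1][j], cell[i][j - 1])
    return cell[len(a)][len(b)]

print(lcs_length("fish", "fosh"))  # => 3
```

The axes of the grid are the two strings, and the answer to the big problem sits in the bottom-right cell, exactly the "divide into subproblems, read off the grid" pattern described above.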
## K-Nearest Neighbors
- Cosine similarity: a distance formula;
- Classification: sorting items into groups;
- Regression: predicting a value;
- Feature extraction: converting an item into a list of numbers that can be compared.
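Putting feature extraction and cosine similarity together, a minimal sketch (the user names and rating vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors:
    # 1.0 means they point the same way, values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical ratings (extracted features) for three users.
priyanka = [3, 4, 4, 1, 4]
justin = [4, 3, 5, 1, 5]
morpheus = [2, 5, 1, 3, 1]

print(cosine_similarity(priyanka, justin))    # close to 1: similar taste
print(cosine_similarity(priyanka, morpheus))  # lower: less similar
```

In KNN you would compute this similarity against every known user and let the k most similar "neighbors" vote (classification) or be averaged (regression).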
## Where to Go Next
- Trees: binary search trees, B-trees, red-black trees;
- Inverted indexes: how search engines work (build a hash table from page contents, where each key is a word and each value is the set of pages containing that word);
- The Fourier transform;
- Parallel algorithms;
- MapReduce;
- Bloom filters;
- SHA algorithms;
- Locality-sensitive hashing;
- Diffie-Hellman key exchange;
- Linear programming.