Example:
word_list = ["你","我","他","你们","我们","他们","它们"]
word_frequency = [50, 10, 8, 7, 6, 3, 2]
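1 What the nodes mean
Leaf nodes: the boxes in the figure, i.e. the individual words.
Internal nodes: the circles in the figure. They do not represent actual words; each one is created by merging two nodes.
Root node: the circle labeled 86 in the figure, which is the last internal node created.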
2 Code walkthrough
Create nodes (the leaf nodes) -> merge nodes (each merge creates an internal node) -> build the tree -> obtain the codes
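Before looking at the node classes, it may help to see the merge order on the bare numbers. The sketch below is only an illustration that runs heapq on the example frequencies; the names heap, merges and root_frequency are made up for this snippet and are not part of the code that follows.

```python
import heapq

word_frequency = [50, 10, 8, 7, 6, 3, 2]

# Treat the plain frequencies as a min-heap; node objects are not needed
# just to see the order in which the merges happen.
heap = list(word_frequency)
heapq.heapify(heap)

merges = []  # one (smaller, larger, merged) tuple per merge step
while len(heap) > 1:
    a = heapq.heappop(heap)  # smallest remaining frequency
    b = heapq.heappop(heap)  # second smallest remaining frequency
    heapq.heappush(heap, a + b)
    merges.append((a, b, a + b))

for a, b, total in merges:
    print(f"{a} + {b} -> {total}")

root_frequency = heap[0]
print("root frequency:", root_frequency)  # 86, the root circle in the figure
```

Seven leaves need six merges, and the last merge (36 + 50) produces the root with frequency 86.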
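2.1 Creating nodes
Nodes are created with a class. Each node has these attributes:
data: the node's payload. For a leaf node it is the word the leaf represents; for an internal node it is the id assigned at creation.
frequency: for a leaf node, the frequency of its word; for an internal node, the sum of its children's frequencies.
left_child, right_child: leaf nodes have none, so both default to None.
father: the parent node. The root has none, so it defaults to None.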
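class HuffmanNode:
    def __init__(self, data, frequency):
        self.data = data            # word for a leaf node, integer id for an internal node
        self.frequency = frequency  # word frequency, or the sum of the children's frequencies
        self.left_child = None      # stays None for a leaf node
        self.right_child = None     # stays None for a leaf node
        self.father = None          # stays None for the root node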
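As a quick check of these defaults, the sketch below builds one leaf and one internal node by hand. It repeats the class with the attribute name data from the list above so that it runs on its own; is_leaf is a hypothetical helper added only for this illustration.

```python
class HuffmanNode:
    def __init__(self, data, frequency):
        self.data = data
        self.frequency = frequency
        self.left_child = None
        self.right_child = None
        self.father = None


def is_leaf(node):
    """Hypothetical helper: a node with no children is a leaf."""
    return node.left_child is None and node.right_child is None


# A leaf node carries a word and its frequency.
leaf = HuffmanNode("你", 50)

# An internal node carries an id (7 here) and the sum of its children's frequencies.
internal = HuffmanNode(7, 2 + 3)
internal.left_child = HuffmanNode("它们", 2)
internal.right_child = HuffmanNode("他们", 3)

print(is_leaf(leaf), is_leaf(internal))
```

The same "no children" test is what the encoding step later relies on to recognize a leaf.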
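2.2 Merging nodes
Create all leaf nodes -> merge nodes -> append each new internal node to the list that starts with the leaf nodes -> every node ends up with an id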
class HuffmanTree:
    def __init__(self, word_list, word_frequency):
        # Original leaf nodes, kept for iteration
        self.leaf_node = [HuffmanNode(data, frequency) for data, frequency in zip(word_list, word_frequency)]
        # Starts with the same leaf nodes; internal nodes are appended as merges happen
        self.all_node = list(self.leaf_node)

    # Merge two nodes
    def merge_node(self, node1, node2):
        sum_frequency = node1.frequency + node2.frequency
        mid_node_id = len(self.all_node)  # internal nodes are numbered after the leaf nodes
        father_node = HuffmanNode(mid_node_id, sum_frequency)  # create the internal node
        # Put the node with the larger frequency on the right
        if node1.frequency <= node2.frequency:
            father_node.left_child = node1
            father_node.right_child = node2
        else:
            father_node.left_child = node2
            father_node.right_child = node1  # fixed typo: was right_chile
        # Add father_node to the all_node list
        # The new father_node has an id, a frequency, a left_child and a right_child
        self.all_node.append(father_node)
        # Return the newly created father_node
        return father_node
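To see what a single merge produces, here is a minimal standalone sketch. Node and merge are simplified stand-ins written for this illustration (not the classes above); they follow the same rules: the id is the current length of the node list, and the larger frequency goes on the right.

```python
class Node:
    """Simplified stand-in for HuffmanNode, for illustration only."""

    def __init__(self, data, frequency):
        self.data = data
        self.frequency = frequency
        self.left_child = None
        self.right_child = None


def merge(all_node, node1, node2):
    # The new id continues the numbering after the nodes already in the list.
    parent = Node(len(all_node), node1.frequency + node2.frequency)
    # sorted() is stable, so on a tie node1 stays on the left.
    left, right = sorted((node1, node2), key=lambda n: n.frequency)
    parent.left_child, parent.right_child = left, right
    all_node.append(parent)
    return parent


word_list = ["你", "我", "他", "你们", "我们", "他们", "它们"]
word_frequency = [50, 10, 8, 7, 6, 3, 2]
all_node = [Node(w, f) for w, f in zip(word_list, word_frequency)]

# Merge the two least frequent words: 他们 (3) and 它们 (2).
parent = merge(all_node, all_node[5], all_node[6])
print(parent.data, parent.frequency, parent.left_child.data, parent.right_child.data)
```

The seven leaves occupy positions 0 to 6 in the list, so the first internal node gets id 7.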
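2.3 Building the tree
This step uses Python's heapq module to build a priority queue. Each round takes the two lowest-frequency nodes from the queue and merges them.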
import heapq

class HuffmanTree:
    def __init__(self, word_list, word_frequency):
        # Original leaf nodes, kept for iteration
        self.leaf_node = [HuffmanNode(data, frequency) for data, frequency in zip(word_list, word_frequency)]
        # Starts with the same leaf nodes; internal nodes are appended as merges happen
        self.all_node = list(self.leaf_node)
        # Stays None until build_tree() has run
        self.root = None

    def build_tree(self):
        # Initialize the priority queue
        # heapq turns the list into a min-heap: the entry with the smallest frequency is always at the front
        # The index i breaks ties, so entries with equal frequency never compare the node objects themselves
        priority_queue = [(node.frequency, i, node) for i, node in enumerate(self.all_node)]
        heapq.heapify(priority_queue)
        while len(priority_queue) > 1:
            # Pop the two nodes with the smallest frequency
            # heappop removes and returns the smallest entry (like list.pop, but always the minimum)
            _, _, node1 = heapq.heappop(priority_queue)
            _, _, node2 = heapq.heappop(priority_queue)
            # Merge the two nodes
            new_node = self.merge_node(node1, node2)
            # Push the merged node back; its id (new_node.data) is unique, so it serves as the tie-breaker
            heapq.heappush(priority_queue, (new_node.frequency, new_node.data, new_node))
        # The loop repeats until only one entry is left in the queue
        # That entry is the root: index 0 is the frequency, index 1 the tie-breaker, index 2 the node
        if priority_queue:
            self.root = heapq.heappop(priority_queue)[2]

    # Merge two nodes
    def merge_node(self, node1, node2):
        sum_frequency = node1.frequency + node2.frequency
        mid_node_id = len(self.all_node)  # internal nodes are numbered after the leaf nodes
        father_node = HuffmanNode(mid_node_id, sum_frequency)  # create the internal node
        # Put the node with the larger frequency on the right
        if node1.frequency <= node2.frequency:
            father_node.left_child = node1
            father_node.right_child = node2
        else:
            father_node.left_child = node2
            father_node.right_child = node1  # fixed typo: was right_chile
        # Add father_node to the all_node list
        # The new father_node has an id, a frequency, a left_child and a right_child
        self.all_node.append(father_node)
        # Return the new parent node; after all merges, all_node holds every node and therefore every id
        return father_node
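One detail worth knowing when a heap holds tuples: Python compares tuples element by element, so if two entries have the same frequency the comparison moves on to the next element. If that element is a node object with no ordering defined, the comparison raises TypeError. The example frequencies never tie during the merges, but other inputs can. The sketch below shows the failure and one common workaround, an increasing counter placed before the node; Node here is a throwaway class for the demonstration.

```python
import heapq
import itertools


class Node:
    """Throwaway class for the demonstration; it defines no ordering."""

    def __init__(self, data, frequency):
        self.data = data
        self.frequency = frequency


a, b = Node("a", 5), Node("b", 5)

# Equal frequencies force Python to compare the Node objects themselves.
try:
    (a.frequency, a) < (b.frequency, b)
    tie_failed = False
except TypeError:
    tie_failed = True
print("plain (frequency, node) tuples fail on a tie:", tie_failed)

# Workaround: a unique, increasing counter as the second element.
# The comparison is always decided before it reaches the node.
counter = itertools.count()
heap = [(n.frequency, next(counter), n) for n in (a, b)]
heapq.heapify(heap)
first = heapq.heappop(heap)[2]
print("first node popped:", first.data)
```

With the counter in place, nodes of equal frequency come out in insertion order.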
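2.4 Obtaining the codes
The encoding logic lives in the HuffmanNode class, so it is a method that every node has.
If a node has no children, it is a leaf node, and it returns its data (the word) together with the code accumulated so far.
If it is not a leaf node, the method recurses, appending 0 for the left branch and 1 for the right branch, until it reaches nodes with no children.
(Here left_child and right_child are themselves node instances, so both have the generate_code method.)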
class HuffmanNode:
    def __init__(self, data, frequency):
        self.data = data
        self.frequency = frequency
        self.left_child = None
        self.right_child = None
        self.father = None

    def generate_code(self, code=""):
        if self.left_child is None and self.right_child is None:
            # Leaf node: return the word and the code built up along the path
            return {self.data: code}
        codes = {}
        if self.left_child is not None:
            # dict.update merges the key-value pairs from the left subtree
            codes.update(self.left_child.generate_code(code + "0"))
        if self.right_child is not None:
            codes.update(self.right_child.generate_code(code + "1"))
        return codes
Then the HuffmanTree class calls the node's generate_code method:
    def get_codes(self):
        # If self.root is None, the Huffman tree has not been built, so return an empty dict
        if self.root is not None:
            return self.root.generate_code()
        else:
            return {}
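The left-0, right-1 rule can also be applied without recursion. The sketch below is an alternative written for illustration, not the method used in this article: it walks the tree with an explicit stack, and to stay short it represents an internal node as a (left, right) tuple and a leaf as a plain string. The function name generate_codes_iterative is made up for this example.

```python
def generate_codes_iterative(root):
    """Collect {word: code} using an explicit stack instead of recursion."""
    codes = {}
    stack = [(root, "")]  # each entry: (node, code accumulated so far)
    while stack:
        node, code = stack.pop()
        if isinstance(node, tuple):
            # Internal node: the left branch appends "0", the right branch "1".
            left, right = node
            stack.append((right, code + "1"))
            stack.append((left, code + "0"))
        else:
            # Leaf node: the accumulated path is the word's code.
            codes[node] = code
    return codes


# The subtree holding the three least frequent words of the example:
# 它们 (2) and 他们 (3) merge first, then the result merges with 我们 (6).
subtree = (("它们", "他们"), "我们")
codes = generate_codes_iterative(subtree)
print(codes)
```

The codes printed here are relative to the subtree's own root; in the full tree they are preceded by the path from the real root down to this subtree.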
3 Complete code
import heapq

class HuffmanNode:
    def __init__(self, data, frequency):
        self.data = data
        self.frequency = frequency
        self.left_child = None
        self.right_child = None
        self.father = None

    def generate_code(self, code=""):
        if self.left_child is None and self.right_child is None:
            # Leaf node: return the word and the code built up along the path
            return {self.data: code}
        codes = {}
        if self.left_child is not None:
            codes.update(self.left_child.generate_code(code + "0"))
        if self.right_child is not None:
            codes.update(self.right_child.generate_code(code + "1"))
        return codes

class HuffmanTree:
    def __init__(self, word_list, word_frequency):
        # Original leaf nodes, kept for iteration
        self.leaf_node = [HuffmanNode(data, frequency) for data, frequency in zip(word_list, word_frequency)]
        # Starts with the same leaf nodes; internal nodes are appended as merges happen
        self.all_node = list(self.leaf_node)
        # Stays None if there is nothing to build
        self.root = None
        self.build_tree()

    def build_tree(self):
        # Initialize the priority queue
        # heapq turns the list into a min-heap: the entry with the smallest frequency is always at the front
        # The index i breaks ties, so entries with equal frequency never compare the node objects themselves
        priority_queue = [(node.frequency, i, node) for i, node in enumerate(self.all_node)]
        heapq.heapify(priority_queue)
        while len(priority_queue) > 1:
            # Pop the two nodes with the smallest frequency
            # heappop removes and returns the smallest entry (like list.pop, but always the minimum)
            _, _, node1 = heapq.heappop(priority_queue)
            _, _, node2 = heapq.heappop(priority_queue)
            # Merge the two nodes
            new_node = self.merge_node(node1, node2)
            # Push the merged node back; its id (new_node.data) is unique, so it serves as the tie-breaker
            heapq.heappush(priority_queue, (new_node.frequency, new_node.data, new_node))
        # The loop repeats until only one entry is left in the queue
        # That entry is the root: index 0 is the frequency, index 1 the tie-breaker, index 2 the node
        if priority_queue:
            self.root = heapq.heappop(priority_queue)[2]

    # Merge two nodes
    def merge_node(self, node1, node2):
        sum_frequency = node1.frequency + node2.frequency
        mid_node_id = len(self.all_node)  # internal nodes are numbered after the leaf nodes
        father_node = HuffmanNode(mid_node_id, sum_frequency)  # create the internal node
        # Put the node with the larger frequency on the right
        if node1.frequency <= node2.frequency:
            father_node.left_child = node1
            father_node.right_child = node2
        else:
            father_node.left_child = node2
            father_node.right_child = node1  # fixed typo: was right_chile
        # Add father_node to the all_node list
        # The new father_node has an id, a frequency, a left_child and a right_child
        self.all_node.append(father_node)
        # Return the new parent node; after all merges, all_node holds every node and therefore every id
        return father_node

    # Get the Huffman codes
    def get_codes(self):
        # If self.root is None, the Huffman tree has not been built, so return an empty dict
        if self.root is not None:
            return self.root.generate_code()
        else:
            return {}

word_list = ["你","我","他","你们","我们","他们","它们"]
word_frequency = [50, 10, 8, 7, 6, 3, 2]
tree = HuffmanTree(word_list, word_frequency)
huffman_codes = tree.get_codes()
for word in word_list:
    code = huffman_codes.get(word, "no code")
    print(f"word:{word}, HuffmanCode:{code}")
4 Output
word:你, HuffmanCode:1
word:我, HuffmanCode:010
word:他, HuffmanCode:001
word:你们, HuffmanCode:000
word:我们, HuffmanCode:0111
word:他们, HuffmanCode:01101
word:它们, HuffmanCode:01100
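Two properties of this output can be checked with a few lines: no code is a prefix of another, and frequent words get shorter codes. The sketch below copies the codes printed above into a dict; is_prefix_free, total_bits and average_length are names introduced only for this check.

```python
codes = {
    "你": "1", "我": "010", "他": "001", "你们": "000",
    "我们": "0111", "他们": "01101", "它们": "01100",
}
frequency = {
    "你": 50, "我": 10, "他": 8, "你们": 7,
    "我们": 6, "他们": 3, "它们": 2,
}


def is_prefix_free(codes):
    """Return True if no code is a prefix of a different code."""
    values = list(codes.values())
    return not any(
        i != j and values[j].startswith(values[i])
        for i in range(len(values))
        for j in range(len(values))
    )


# Total bits needed to encode every occurrence, and the frequency-weighted average.
total_bits = sum(frequency[word] * len(code) for word, code in codes.items())
average_length = total_bits / sum(frequency.values())

print("prefix-free:", is_prefix_free(codes))
print("total bits:", total_bits)
print(f"average code length: {average_length:.3f}")
```

The total is 174 bits for 86 word occurrences, about 2.02 bits per word, whereas a fixed-length code for seven words would need 3 bits per word.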
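5 Summary
Nodes are the core: the nodes come first, then the tree.
A node's attributes tell you where it sits in the tree (whether it is a leaf node, and so on), which gives the path from the root and, from that path, the code.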