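Background
We need to collect the contents of some configuration files on a schedule, compare each collection with the previous one, and persist the line-level changes to a database.
The benefit is that we can review these change records at any time, see which changes were made and when, and more easily work out which change affected the normal operation of a service.
The rest of this post implements this requirement with the difflib module.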
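Introduction to difflib
Official documentation: https://docs.python.org/3/library/difflib.html
Chinese version: https://docs.python.org/zh-cn/3/library/difflib.html
difflib is a module in Python's standard library. Its classes and functions compare sequences and can produce the differences as plain text or as an HTML comparison page.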
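As a quick illustration of the two output forms mentioned above, here is a minimal sketch that uses difflib.unified_diff for text output and difflib.HtmlDiff for an HTML page. The sample lines and the file names old.conf and new.conf are made up for this example.

```python
import difflib

# Hypothetical sample data: two versions of a tiny config file.
old = ["timeout = 30\n", "retries = 3\n"]
new = ["timeout = 60\n", "retries = 3\n"]

# unified_diff yields the lines of a unified-format diff as strings.
unified = list(
    difflib.unified_diff(old, new, fromfile="old.conf", tofile="new.conf")
)
print("".join(unified), end="")

# HtmlDiff.make_file returns a complete HTML page as a single string.
html_page = difflib.HtmlDiff().make_file(old, new, "old.conf", "new.conf")
print(len(html_page) > 0)
```

Both forms are meant for people to read, which is why the rest of this post looks for something easier to store as records.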
Using the Differ class
First, use the Differ class to compare two sequences of text lines.
Code example
from difflib import Differ

# Split both texts into lists of lines, keeping the line endings.
text1 = ''' 1. Beautiful is better than ugly.
 2. Explicit is better than implicit.
 3. Simple is better than complex.
 4. Complex is better than complicated.
'''.splitlines(keepends=True)
text2 = ''' 1. Beautiful is better than ugly.
 3. Simple is better than complex.
 4. Complicated is better than complicated.
 5. Flat is better than nested.
'''.splitlines(keepends=True)

differ = Differ()
# compare() yields one output line at a time; each already ends in '\n'.
for i in differ.compare(text1, text2):
    print(i, end='')
Output
   1. Beautiful is better than ugly.
-  2. Explicit is better than implicit.
   3. Simple is better than complex.
-  4. Complex is better than complicated.
?           ^
+  4. Complicated is better than complicated.
?          ++++ ^
+  5. Flat is better than nested.
This output covers both line-level and intraline differences. We do not actually care about intraline differences, and the format is hard to parse.
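To show why this format is awkward to parse, here is a rough sketch that classifies Differ output by its two-character prefix. The sample lines are invented for illustration. Note that a modified line comes back as a separate delete and insert, so the caller would have to work out that the two belong together.

```python
from difflib import Differ

# Hypothetical sample data: one line modified, one line added.
old = ["timeout = 30\n", "retries = 3\n"]
new = ["timeout = 60\n", "retries = 3\n", "verbose = true\n"]

records = []
for line in Differ().compare(old, new):
    code, content = line[:2], line[2:]
    if code == "- ":
        records.append(("delete", content))
    elif code == "+ ":
        records.append(("insert", content))
    # "  " (unchanged) and "? " (intraline hint) lines are skipped.

for record in records:
    print(record)
```

The change from 30 to 60 shows up as two unrelated records here, which is not the shape we want to store.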
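Using the SequenceMatcher class
The get_opcodes method of SequenceMatcher returns a list of 5-tuples (tag, i1, i2, j1, j2) describing how to turn a into b, where tag is one of 'replace', 'delete', 'insert', or 'equal'.
Code example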
from difflib import SequenceMatcher

# text1 and text2 are the line lists from the Differ example.
matcher = SequenceMatcher(None, text1, text2)
for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
    if tag == 'replace':
        print('replace\n{}\n{}'.format(text1[alo:ahi], text2[blo:bhi]))
    elif tag == 'delete':
        print('delete\n{}'.format(text1[alo:ahi]))
    elif tag == 'insert':
        print('insert\n{}'.format(text2[blo:bhi]))
    elif tag == 'equal':
        print('equal\n{}\n{}'.format(text1[alo:ahi], text2[blo:bhi]))
Output
equal
[' 1. Beautiful is better than ugly.\n']
[' 1. Beautiful is better than ugly.\n']
delete
[' 2. Explicit is better than implicit.\n']
equal
[' 3. Simple is better than complex.\n']
[' 3. Simple is better than complex.\n']
replace
[' 4. Complex is better than complicated.\n']
[' 4. Complicated is better than complicated.\n', ' 5. Flat is better than nested.\n']
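One way to read the opcodes is that replaying them against a should rebuild b. The sketch below checks this on made-up single-letter lines rather than the data used in this post.

```python
from difflib import SequenceMatcher

# Hypothetical sample data.
a = ["a\n", "b\n", "c\n", "d\n"]
b = ["a\n", "c\n", "x\n", "e\n"]

rebuilt = []
for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    if tag == "equal":
        rebuilt.extend(a[i1:i2])  # unchanged lines are copied from a
    elif tag in ("replace", "insert"):
        rebuilt.extend(b[j1:j2])  # new lines are taken from b
    # 'delete': a[i1:i2] is dropped, so nothing is appended

print(rebuilt == b)
```

The indices i1:i2 always refer to a and j1:j2 to b, which is why the example code above slices text1 with the first pair and text2 with the second.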
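Splitting changes into single-line records
The result from SequenceMatcher is already close to what we want. It would be better still if each change block were split into single-line changes.
The function below attempts that.
Code example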
def diff(text1, text2):
    """Return a flat list of single-line changes between two line lists."""
    change_list = []
    matcher = SequenceMatcher(None, text1, text2)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'replace':
            l1, l2 = text1[i1:i2], text2[j1:j2]
            # Pair old and new lines by position; zip stops at the shorter side.
            change_list.extend((tag, x, y) for x, y in zip(l1, l2))
            # Leftover lines on the longer side become deletes or inserts.
            if len(l1) > len(l2):
                change_list.extend(('delete', line) for line in l1[len(l2):])
            elif len(l1) < len(l2):
                change_list.extend(('insert', line) for line in l2[len(l1):])
        elif tag == 'delete':
            change_list.extend(('delete', line) for line in text1[i1:i2])
        elif tag == 'insert':
            change_list.extend(('insert', line) for line in text2[j1:j2])
        # 'equal' blocks contain no changes and are skipped.
    return change_list

for change in diff(text1, text2):
    print(change)
Output
('delete', ' 2. Explicit is better than implicit.\n')
('replace', ' 4. Complex is better than complicated.\n', ' 4. Complicated is better than complicated.\n')
('insert', ' 5. Flat is better than nested.\n')
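This result can already be parsed and persisted.
However, it does not produce a correct comparison in some special cases.
Finding a problem
Testing showed that with the following data, the old and new lines are paired incorrectly.
Code example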
text1 = ''' 1. Beautiful is better than ugly.
 2. Explicit is better than implicit.
 3. Simple is better than complex.
 4. Complex is better than complicated.
'''.splitlines(keepends=True)
text2 = ''' 1. Beautiful is better than ugly.
 3. Simple is better than complexed.
 4. Complicated is better than complicated.
 5. Flat is better than nested.
'''.splitlines(keepends=True)

for change in diff(text1, text2):
    print(change)
Output
('replace', ' 2. Explicit is better than implicit.\n', ' 3. Simple is better than complexed.\n')
('replace', ' 3. Simple is better than complex.\n', ' 4. Complicated is better than complicated.\n')
('replace', ' 4. Complex is better than complicated.\n', ' 5. Flat is better than nested.\n')
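The cause is visible in the opcodes themselves. In this sketch (old and new are the same lines as above, written out as lists), SequenceMatcher reports the three non-matching lines on each side as a single replace block, so pairing them by position matches unrelated lines.

```python
from difflib import SequenceMatcher

old = [
    " 1. Beautiful is better than ugly.\n",
    " 2. Explicit is better than implicit.\n",
    " 3. Simple is better than complex.\n",
    " 4. Complex is better than complicated.\n",
]
new = [
    " 1. Beautiful is better than ugly.\n",
    " 3. Simple is better than complexed.\n",
    " 4. Complicated is better than complicated.\n",
    " 5. Flat is better than nested.\n",
]

# Only the first line is identical on both sides, so everything after it
# falls into one replace block covering old[1:4] and new[1:4].
opcodes = SequenceMatcher(None, old, new).get_opcodes()
for op in opcodes:
    print(op)
```

Because SequenceMatcher only matches lines that are exactly equal, changing "complex." to "complexed." removes the anchor that kept the earlier example aligned.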
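Defining a CustomDiffer class
The approach above turns out to be unreliable too, so I decided to go back to the Differ class.
Internally, Differ also uses SequenceMatcher, and it breaks up each replace block by searching for the best-matching pair of lines, which handles the problem we just ran into.
I then defined a CustomDiffer class that inherits from Differ and overrides the parent's formatting methods, mainly to give Differ's results a uniform format.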
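The best-match idea can be approximated with SequenceMatcher.ratio(), which scores the similarity of two strings between 0 and 1. The sketch below only illustrates the idea and is not Differ's actual code; the helper name similarity is hypothetical.

```python
from difflib import SequenceMatcher

old_line = " 3. Simple is better than complex.\n"
candidates = [
    " 3. Simple is better than complexed.\n",
    " 4. Complicated is better than complicated.\n",
    " 5. Flat is better than nested.\n",
]

def similarity(x, y):
    """Hypothetical helper: character-level similarity of two lines."""
    return SequenceMatcher(None, x, y).ratio()

# Pick the new line that is most similar to the old line.
best = max(candidates, key=lambda line: similarity(old_line, line))
print(repr(best))
```

Differ only pairs lines whose score clears an internal cutoff (roughly 0.75 in CPython's source, which is an implementation detail and not a documented guarantee); lines that match nothing well enough are reported as plain deletes and inserts.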
Code
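from difflib import Differ

class CustomDiffer(Differ):
    # _dump and _qformat are private Differ hooks, so this relies on CPython internals.
    def _dump(self, tag, x, lo, hi):
        # Yield whole-line inserts and deletes; unchanged lines yield nothing.
        if tag == '+':
            kind = 'insert'
        elif tag == '-':
            kind = 'delete'
        else:
            return
        for i in range(lo, hi):
            yield kind, x[i]

    def _qformat(self, aline, bline, atags, btags):
        # A matched pair of similar lines becomes one replace record.
        yield 'replace', aline, bline

# Usage with text1 and text2 from the previous section.
for change in CustomDiffer().compare(text1, text2):
    print(change)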
Output
('delete', ' 2. Explicit is better than implicit.\n')
('replace', ' 3. Simple is better than complex.\n', ' 3. Simple is better than complexed.\n')
('replace', ' 4. Complex is better than complicated.\n', ' 4. Complicated is better than complicated.\n')
('insert', ' 5. Flat is better than nested.\n')
Each change is now a tuple, which is much easier to parse and process.
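To close the loop with the original requirement, here is a minimal sketch of persisting such tuples. The post does not say which database is used, so SQLite, the config_change table, its columns, the save_changes function, and the path /etc/example.conf are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone

def save_changes(conn, file_path, changes):
    """Store change tuples in a hypothetical config_change table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS config_change (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               file_path TEXT NOT NULL,
               change_type TEXT NOT NULL,
               old_line TEXT,
               new_line TEXT,
               collected_at TEXT NOT NULL
           )"""
    )
    now = datetime.now(timezone.utc).isoformat()
    rows = []
    for change in changes:
        kind = change[0]
        if kind == "replace":
            _, old_line, new_line = change
        elif kind == "delete":
            old_line, new_line = change[1], None
        elif kind == "insert":
            old_line, new_line = None, change[1]
        else:
            raise ValueError(f"unknown change type: {kind!r}")
        rows.append((file_path, kind, old_line, new_line, now))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO config_change "
            "(file_path, change_type, old_line, new_line, collected_at) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)

# In-memory database and sample tuples in the shape shown above.
conn = sqlite3.connect(":memory:")
sample = [
    ("delete", " 2. Explicit is better than implicit.\n"),
    ("replace", " 3. Simple is better than complex.\n",
     " 3. Simple is better than complexed.\n"),
    ("insert", " 5. Flat is better than nested.\n"),
]
saved = save_changes(conn, "/etc/example.conf", sample)
print(saved)
```

Storing the old and new line in separate nullable columns keeps all three change types in one table, so a query by file and time range returns the full history of a file.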