Compressing Integer Sequences

Latest recommended article published on 2021-07-13 15:17:05

djskl

Article link: https://blog.csdn.net/djskl/article/details/44874399

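If all the integers are greater than 0, each one can be treated directly as a Unicode code point and converted to a single character, so that a number written with several digits becomes one character.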
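As a minimal sketch of this idea, the conversion in both directions might look like the following. It is written for Python 3, where chr and ord work on Unicode code points; the function names ints_to_text and text_to_ints are hypothetical. It also assumes every value lies in 1..0x10FFFF and outside the surrogate range 0xD800..0xDFFF, since surrogates cannot be encoded as UTF-8.

```python
def ints_to_text(values):
    """Map each positive integer to the character with that code point."""
    for v in values:
        # Assumption: only valid, UTF-8-encodable code points are allowed.
        if not (0 < v <= 0x10FFFF) or 0xD800 <= v <= 0xDFFF:
            raise ValueError("not an encodable code point: %d" % v)
    return "".join(chr(v) for v in values)


def text_to_ints(text):
    """Inverse mapping: one integer per character."""
    return [ord(ch) for ch in text]


values = [122, 560, 25105]
packed = ints_to_text(values)
print(len(packed))                     # one character per integer
print(text_to_ints(packed) == values)  # the round trip is lossless
```

Because each character stands for exactly one integer, no separator between values is needed.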

For example, to store 1000 copies of 122:

>>> f=open('1.txt','w')
>>> lst=['122' for i in range(1000)]
>>> s="".join(lst)
>>> f.write(s)
>>> f.close()

ls -l 1.txt:

-rw-r--r-- 1 root root 3000 Apr 4 18:10 1.txt

Storing 1000 copies of 122 as digits takes 3000 bytes, about 3 KB.

In Unicode, the character with code point 122 is z, which UTF-8 encodes as a single byte. Storing z instead:

>>> f=open('2.txt','w')
>>> lst=['z' for i in range(1000)]
>>> s="".join(lst)
>>> f.write(s)
>>> f.close()

ls -l 2.txt:

-rw-r--r-- 1 root root 1000 Apr 4 18:11 2.txt

Storing 1000 copies of z takes only 1000 bytes, about 1 KB, which is one third of the original size.
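The one-third ratio is specific to 122. UTF-8 uses 1 to 4 bytes per code point depending on its value, so the saving varies with the size of the integers. The sketch below (Python 3; the helper name sizes is hypothetical) compares the two representations for a few values, writing the digits without separators as in the example above.

```python
def sizes(value, count=1000):
    """Return (bytes as decimal digits, bytes as UTF-8 characters)."""
    as_digits = (str(value) * count).encode("utf-8")
    as_chars = (chr(value) * count).encode("utf-8")
    return len(as_digits), len(as_chars)


# 122 fits in 1 UTF-8 byte, 560 in 2, 25105 in 3, 100000 in 4.
for v in (122, 560, 25105, 100000):
    print(v, sizes(v))
```

In practice, digits stored as text would also need a delimiter between values, which makes the digit form larger still.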

By default, json.dumps writes non-ASCII characters as Unicode escape sequences, for example:

>>> lst=['我']
>>> import json
>>> json.dumps(lst)
'["\\u6211"]'

What was originally one character has become six ('我' <---> '\u6211').

To avoid this, set ensure_ascii=False, which tells json.dumps to output non-ASCII characters as they are instead of escaping them.

>>> print(json.dumps(lst, ensure_ascii=False))
["我"]
>>> print(json.dumps(lst))
["\u6211"]
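To make the size difference concrete, here is a small Python 3 sketch that measures both outputs. The variable names escaped and raw are only for illustration.

```python
import json

lst = [chr(25105)]  # the character '我', code point 0x6211

escaped = json.dumps(lst)                  # non-ASCII written as \uXXXX
raw = json.dumps(lst, ensure_ascii=False)  # non-ASCII kept as-is

print(len(escaped), len(raw))              # lengths in characters
print(len(raw.encode("utf-8")))            # size of the raw form in bytes
# Both forms decode back to the same list.
print(json.loads(escaped) == json.loads(raw) == lst)
```

The escaped form is pure ASCII, so its byte size equals its character count, while the unescaped form pays only the UTF-8 cost of each character.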

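If some of the integers are negative but every absolute value is less than 1000, they can first be mapped to non-negative integers like this: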

Positive number:

# encode
>>> x = 560
>>> (x << 1) ^ (x >> 10)
1120
# decode
>>> xx = 1120
>>> (xx >> 1) ^ -(xx & 1)
560

Negative number:

# encode
>>> x = -560
>>> (x << 1) ^ (x >> 10)
1119
# decode
>>> xx = 1119
>>> (xx >> 1) ^ -(xx & 1)
-560

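x << 1: equivalent to multiplying by 2, so the result is always even.

x >> 10: as long as |x| is less than 2**10 (1024), shifting a non-negative integer right by 10 bits gives 0, and shifting a negative integer right by 10 bits gives -1.

XORing (^) a positive even number with 0 leaves it unchanged. XORing a negative even number 2x with -1 flips every bit and gives 2|x| - 1, so the negative even number becomes a positive odd number.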
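The mapping above can be wrapped into a pair of functions. This is a sketch in Python 3 with hypothetical names (zigzag_encode, zigzag_decode, SHIFT); the scheme resembles what is often called ZigZag encoding, and the shift of 10 relies on the stated assumption that |x| < 2**10. The last lines combine it with the code point trick from the beginning of the article.

```python
SHIFT = 10  # assumption: every |x| is below 2**10


def zigzag_encode(x):
    """Map a signed integer to a non-negative one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (x << 1) ^ (x >> SHIFT)


def zigzag_decode(n):
    """Undo zigzag_encode: even values were non-negative, odd ones negative."""
    return (n >> 1) ^ -(n & 1)


values = [560, -560, 0, -1, 999, -999]
encoded = [zigzag_encode(v) for v in values]
print(encoded)

# Store each encoded value as one character, then read it back.
packed = "".join(chr(n) for n in encoded)
restored = [zigzag_decode(ord(ch)) for ch in packed]
print(restored == values)
```

Note that 0 encodes to 0, which corresponds to the NUL character; adding 1 before calling chr (and subtracting it after ord) would avoid that if it is a concern.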

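Note: in Python 2, which the examples here use, unicode strings containing non-ASCII characters cannot be written directly to a file opened with the built-in open. Open the file with codecs instead:

import codecs
f = codecs.open('1.txt', 'w', encoding='utf-8')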
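For comparison, in Python 3 the built-in open accepts an encoding argument, so codecs is not required there. The sketch below writes a few packed integers to a file and reads them back; the temporary directory and the sample values are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

text = "".join(chr(v) for v in [122, 560, 25105])

# A temporary directory keeps the sketch from overwriting real files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    size = os.path.getsize(path)  # bytes on disk: 1 + 2 + 3
    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()

print(size)
print(restored == text)
```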
