Introduction:
In the IT world, advanced implementations that combine big-data security with cryptography are hard to come by. A simple example: there are many implementations of inverted indexes, but implementations that build an inverted index on top of encryption and support search over ciphertext are rare. This post implements ciphertext search based on symmetric encryption.
Dataset
Real-world dataset:
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Enron Emails, NIPS full papers, NYTimes news articles. The keyword set W is used to build an index over the ciphertexts so that they can be searched. For the Enron Emails collection, D = 39,861 is the number of documents, W = 28,102 the number of distinct words, and N = 6,400,000 (approx.) the total number of words.
Problem description
Implement the encryption scheme described below.
Environment
Python 3.7
PyCharm Professional
cryptography (a Python cryptography library)
cryptography is a third-party library that makes cryptographic functionality convenient for Python users. Officially, cryptography's goal is to be "cryptography for humans", just as requests is "HTTP for humans". The idea is to let you build simple, secure, easy-to-use encryption schemes. If needed, you can also use the low-level cryptographic primitives (the "hazmat" layer), but that requires knowing far more detail, and getting it wrong produces insecure constructions.
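As a quick taste of the high-level API, here is a minimal sketch of symmetric encryption with Fernet, the recipe this post relies on (the plaintext string is just an illustrative placeholder):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # urlsafe base64-encoded 32-byte key
f = Fernet(key)
token = f.encrypt(b"hello index")  # authenticated ciphertext ("token")
plain = f.decrypt(token)           # raises InvalidToken if the data was tampered with
print(plain)                       # b'hello index'
```

Fernet bundles AES encryption with an HMAC, so decryption fails loudly on any modification of the ciphertext.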
Reference links:
Official documentation:
https://cryptography.io/en/latest/
Python 3 encryption tutorial:
https://linux.cn/article-7676-1.html
Installation:
pip install cryptography
Imports (only the packages needed in this post):
from cryptography.fernet import Fernet #used for the symmetric key generation
from cryptography.hazmat.backends import default_backend #used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
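The last three imports exist to derive a Fernet key from a password instead of generating a random one. A minimal sketch of that derivation (the password and message here are hypothetical placeholders; the salt must be stored alongside the ciphertext so the same key can be re-derived later):

```python
import base64
import os
from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

password = b"my secret passphrase"  # hypothetical user password
salt = os.urandom(16)               # random salt, stored with the data

kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    length=32,                      # Fernet requires a 32-byte key
    salt=salt,
    iterations=390000,
    backend=default_backend(),
)
# Fernet expects the key urlsafe-base64-encoded
key = base64.urlsafe_b64encode(kdf.derive(password))
f = Fernet(key)
token = f.encrypt(b"searchable data")
```

Because PBKDF2 is deterministic given (password, salt, iterations), anyone holding the password and the stored salt can re-derive the same key and decrypt the token.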
Dataset
This experiment uses the bag-of-words datasets from the University of California, Irvine: Enron Emails, NIPS full papers, and NYTimes news articles. The keyword set W is used to build an index over the ciphertexts so that they can be searched. D = 39,861 is the number of documents, W = 28,102 the number of distinct words, and N = 6,400,000 (approx.) the total number of words.
Taking the Enron Emails data as the first example, the data file presents its contents in the following form:
The first number on each line is the document ID, the second is the word ID, and the third is the word frequency. Data in this form is in fact very convenient for building an inverted index.
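Because each line is already a (docID, wordID, count) triple, an inverted index over word IDs can be built directly from the lines without any tokenization. A minimal sketch with a few hypothetical sample lines in that format:

```python
from collections import defaultdict

# Hypothetical lines in the UCI docword format: "docID wordID count"
lines = [
    "1 118 2",
    "1 285 1",
    "5 285 3",
    "3 1229 1",
]

inverted = defaultdict(list)
for line in lines:
    doc_id, word_id, count = line.split()
    inverted[word_id].append(doc_id)  # the count is available but unused here
for word_id in inverted:
    inverted[word_id].sort()          # keep posting lists sorted

print(dict(inverted))  # {'118': ['1'], '285': ['1', '5'], '1229': ['3']}
```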
Building the inverted index word_dict
This part builds a basic inverted index over the dataset, using the traditional construction. The core code is as follows:
for idx, val in enumerate(filenames):  # val is the name of the file
    cnt = Counter()
    for line in open(filenames[idx], 'r'):
        word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
        for word in word_list:
            cnt[word] += 1
    filedata.append((val, cnt))  # (filename, word-frequency Counter)
for i in allwords:
    word_dict[i]  # touching the key creates an empty list in the defaultdict
    for idx, val in enumerate(filedata):
        if i in val[1]:  # val[1] is the word Counter of the file
            word_dict[i].append(val[0])  # val[0] is the name of the file
    word_dict[i].sort()
First we verify the construction with two simple test files:
example1:
File1: "hello this is a test data file data file file"
File2: "also file data is a test file"
With these two files as input, the resulting inverted index word_dict is:
defaultdict(<class 'list'>, {'file': ['simple.txt', 'simple2.txt'], 'data': ['simple.txt', 'simple2.txt'], 'hello': ['simple.txt'], 'this': ['simple.txt'], 'is': ['simple.txt', 'simple2.txt'], 'also': ['simple2.txt'], 'a': ['simple.txt', 'simple2.txt']})
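The example above can be reproduced without file I/O, a minimal self-contained sketch using the same two sample texts (note that, as in the full code below, only the 5 most common words of each document are kept as distinct words, which is why 'test' is absent from the output):

```python
from collections import Counter, defaultdict

# The two sample documents from example1, kept in memory instead of files.
docs = {
    "simple.txt": "hello this is a test data file data file file",
    "simple2.txt": "also file data is a test file",
}

filedata = [(name, Counter(text.lower().split())) for name, text in docs.items()]

# Keep only the 5 most common words of each document as the distinct words.
allwords = []
for name, cnt in filedata:
    for word, _ in cnt.most_common(5):
        if word not in allwords:
            allwords.append(word)

word_dict = defaultdict(list)
for word in allwords:
    for name, cnt in filedata:
        if word in cnt:
            word_dict[word].append(name)
    word_dict[word].sort()

print(word_dict["file"])  # ['simple.txt', 'simple2.txt']
```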
example2:
With these working, we build the word_dict for the Enron Emails dataset. Since that dataset is large and its full output is hard to inspect, we run on the data for only the first 10 documents, which yields the following word_dict:
defaultdict(<class 'list'>, {'118': ['1'], '285': ['1', '5'], '1229': ['1', '3'], '1688': ['1'], '2068': ['1', '2'], '5511': ['2', '5'], '19675': ['2'], '1197': ['2'], '9458': ['2'], '2233': ['2', '6'], '14050': ['3'], '26050': ['3'], '1976': ['3'], '3328': ['3'], '536': ['2', '3'], '22690': ['4'], '9404': ['4'], '4802': ['2', '4'], '19497': ['4'], '23690': ['4'], '19640': ['5'], '3182': ['2', '5'], '24409': ['5'], '25181': ['5'], '16151': ['6'], '1599': ['6'], '6993': ['2', '3', '6'], '13091': ['5', '6', '8'], '15091': ['6'], '6964': ['7'], '9464': ['7'], '10636': ['7'], '12107': ['7'], '14325': ['4', '7'], '4813': ['8'], '15088': ['10', '6', '8'], '25519': ['8'], '15291': ['8'], '1503': ['8'], '9970': ['9'], '22771': ['9'], '1267': ['9'], '4402': ['9'], '10258': ['9'], '6623': ['10', '8'], '13104': ['10', '3'], '19117': ['10', '6'], '171': ['10'], '5680': ['10']})
Full code for building the index:
import itertools
from itertools import permutations, combinations # used for permutations
from cryptography.fernet import Fernet # used for the symmetric key generation
from collections import Counter # used to count most common word
from collections import defaultdict # used to build the distinct word list
from llist import dllist, dllistnode # python linked list library
import base64 # used for base 64 encoding
import os
from cryptography.hazmat.backends import default_backend # used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
import random # to select random key
import sys
import re
import bitarray # for lookup table
def main():
    word_dict = intialization()  # builds the basic plaintext inverted index
    print(word_dict)
    word_dict = intialization2()
    print(word_dict)

############################################################################################

def intialization():
    '''
    Prompts the user for the documents to be encrypted and collects the distinct
    words in each. Returns a dictionary 'word_dict' mapping each distinct word
    to the documents that contain it.
    '''
    filenames = []
    x = input("Please enter the name of a file you want to encrypt: ")  # filename
    filenames.append(x)
    while True:
        x = input("\nEnter another file name or press enter if done: ")
        if not x:
            break
        filenames.append(x)
    # count the occurrences of each word in each file
    filedata = []
    for idx, val in enumerate(filenames):  # val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # iterating a file object yields it line by line
            print(line)
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for word in word_list:
                cnt[word] += 1
        filedata.append((val, cnt))  # (filename, word-frequency Counter)
    print(filedata)
    # take the 5 most common words of each document as the distinct words
    # (in fact, this restriction is not necessary)
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(5):
            if value not in allwords:
                allwords.append(value)
    print(allwords)
    # build a dictionary keyed by distinct word, with a list of filenames as the value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]  # touching the key creates an empty list in the defaultdict
        for idx, val in enumerate(filedata):
            if i in val[1]:  # val[1] is the word Counter of the file
                word_dict[i].append(val[0])  # val[0] is the name of the file