Introduction:
In the IT world, advanced implementations that combine big-data security with cryptography are hard to come by. A simple example: there are many implementations of inverted indexes, but implementations that build an inverted index on top of encryption and support search over ciphertext are rare. This post implements ciphertext search based on symmetric encryption.
Dataset
Real-world dataset:
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Enron Emails, NIPS full papers, NYTimes news articles. The keyword set W is used to build an index over the ciphertexts so that they can be searched. For the Enron Emails collection, D = 39,861 is the number of documents, W = 28,102 the number of distinct words, and N = 6,400,000 (approx.) the total number of words.
Problem description
Implement the encryption scheme described below.
Environment
Python 3.7
PyCharm Professional
cryptography (a Python cryptography library)
cryptography is a third-party library that makes cryptographic functionality convenient for Python users. Officially, cryptography's goal is to be "cryptography for humans", just as requests is "HTTP for humans". The idea is to let you build simple, secure, easy-to-use encryption schemes. If needed, you can also use the low-level cryptographic primitives (the "hazmat" layer), but that requires knowing far more detail, and getting it wrong produces insecure constructions.
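As a quick taste of the high-level API, here is a minimal sketch of symmetric encryption with Fernet, the recipe this post relies on (the plaintext string is just an illustrative placeholder):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # urlsafe base64-encoded 32-byte key
f = Fernet(key)
token = f.encrypt(b"hello index")  # authenticated ciphertext ("token")
plain = f.decrypt(token)           # raises InvalidToken if the data was tampered with
print(plain)                       # b'hello index'
```

Fernet bundles AES encryption with an HMAC, so decryption fails loudly on any modification of the ciphertext.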
Reference links:
Official documentation:
https://cryptography.io/en/latest/
Python 3 encryption tutorial:
https://linux.cn/article-7676-1.html
Installation:
pip install cryptography
Imports (only the packages needed in this post):
from cryptography.fernet import Fernet #used for the symmetric key generation
from cryptography.hazmat.backends import default_backend #used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
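The last three imports exist to derive a Fernet key from a password instead of generating a random one. A minimal sketch of that derivation (the password and message here are hypothetical placeholders; the salt must be stored alongside the ciphertext so the same key can be re-derived later):

```python
import base64
import os
from cryptography.fernet import Fernet
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

password = b"my secret passphrase"  # hypothetical user password
salt = os.urandom(16)               # random salt, stored with the data

kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    length=32,                      # Fernet requires a 32-byte key
    salt=salt,
    iterations=390000,
    backend=default_backend(),
)
# Fernet expects the key urlsafe-base64-encoded
key = base64.urlsafe_b64encode(kdf.derive(password))
f = Fernet(key)
token = f.encrypt(b"searchable data")
```

Because PBKDF2 is deterministic given (password, salt, iterations), anyone holding the password and the stored salt can re-derive the same key and decrypt the token.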
Dataset
This experiment uses the bag-of-words datasets from the University of California, Irvine: Enron Emails, NIPS full papers, and NYTimes news articles. The keyword set W is used to build an index over the ciphertexts so that they can be searched. D = 39,861 is the number of documents, W = 28,102 the number of distinct words, and N = 6,400,000 (approx.) the total number of words.
Taking the Enron Emails data as the first example, the data file presents its contents in the following form:
The first number on each line is the document ID, the second is the word ID, and the third is the word frequency. Data in this form is in fact very convenient for building an inverted index.
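Because each line is already a (docID, wordID, count) triple, an inverted index over word IDs can be built directly from the lines without any tokenization. A minimal sketch with a few hypothetical sample lines in that format:

```python
from collections import defaultdict

# Hypothetical lines in the UCI docword format: "docID wordID count"
lines = [
    "1 118 2",
    "1 285 1",
    "5 285 3",
    "3 1229 1",
]

inverted = defaultdict(list)
for line in lines:
    doc_id, word_id, count = line.split()
    inverted[word_id].append(doc_id)  # the count is available but unused here
for word_id in inverted:
    inverted[word_id].sort()          # keep posting lists sorted

print(dict(inverted))  # {'118': ['1'], '285': ['1', '5'], '1229': ['3']}
```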
Building the inverted index word_dict
This part builds a basic inverted index over the dataset, using the traditional construction. The core code is as follows:
for idx, val in enumerate(filenames):  # val is the name of the file
    cnt = Counter()
    for line in open(filenames[idx], 'r'):
        word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
        for word in word_list:
            cnt[word] += 1
    filedata.append((val, cnt))  # (filename, word-frequency Counter)
for i in allwords:
    word_dict[i]  # touching the key creates an empty list in the defaultdict
    for idx, val in enumerate(filedata):
        if i in val[1]:  # val[1] is the word Counter of the file
            word_dict[i].append(val[0])  # val[0] is the name of the file
    word_dict[i].sort()
First we verify the construction with two simple test files:
example1:
File1: "hello this is a test data file data file file"
File2: "also file data is a test file"
With these two files as input, the resulting inverted index word_dict is:
defaultdict(<class 'list'>, {'file': ['simple.txt', 'simple2.txt'], 'data': ['simple.txt', 'simple2.txt'], 'hello': ['simple.txt'], 'this': ['simple.txt'], 'is': ['simple.txt', 'simple2.txt'], 'also': ['simple2.txt'], 'a': ['simple.txt', 'simple2.txt']})
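The example above can be reproduced without file I/O, a minimal self-contained sketch using the same two sample texts (note that, as in the full code below, only the 5 most common words of each document are kept as distinct words, which is why 'test' is absent from the output):

```python
from collections import Counter, defaultdict

# The two sample documents from example1, kept in memory instead of files.
docs = {
    "simple.txt": "hello this is a test data file data file file",
    "simple2.txt": "also file data is a test file",
}

filedata = [(name, Counter(text.lower().split())) for name, text in docs.items()]

# Keep only the 5 most common words of each document as the distinct words.
allwords = []
for name, cnt in filedata:
    for word, _ in cnt.most_common(5):
        if word not in allwords:
            allwords.append(word)

word_dict = defaultdict(list)
for word in allwords:
    for name, cnt in filedata:
        if word in cnt:
            word_dict[word].append(name)
    word_dict[word].sort()

print(word_dict["file"])  # ['simple.txt', 'simple2.txt']
```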
example2:
With these working, we build the word_dict for the Enron Emails dataset. Since that dataset is large and its full output is hard to inspect, we run on the data for only the first 10 documents, which yields the following word_dict:
defaultdict(<class 'list'>, {'118': ['1'], '285': ['1', '5'], '1229': ['1', '3'], '1688': ['1'], '2068': ['1', '2'], '5511': ['2', '5'], '19675': ['2'], '1197': ['2'], '9458': ['2'], '2233': ['2', '6'], '14050': ['3'], '26050': ['3'], '1976': ['3'], '3328': ['3'], '536': ['2', '3'], '22690': ['4'], '9404': ['4'], '4802': ['2', '4'], '19497': ['4'], '23690': ['4'], '19640': ['5'], '3182': ['2', '5'], '24409': ['5'], '25181': ['5'], '16151': ['6'], '1599': ['6'], '6993': ['2', '3', '6'], '13091': ['5', '6', '8'], '15091': ['6'], '6964': ['7'], '9464': ['7'], '10636': ['7'], '12107': ['7'], '14325': ['4', '7'], '4813': ['8'], '15088': ['10', '6', '8'], '25519': ['8'], '15291': ['8'], '1503': ['8'], '9970': ['9'], '22771': ['9'], '1267': ['9'], '4402': ['9'], '10258': ['9'], '6623': ['10', '8'], '13104': ['10', '3'], '19117': ['10', '6'], '171': ['10'], '5680': ['10']})
Full code for building the index:
import itertools
from itertools import permutations, combinations # used for permutations
from cryptography.fernet import Fernet # used for the symmetric key generation
from collections import Counter # used to count most common word
from collections import defaultdict # used to build the distinct word list
from llist import dllist, dllistnode # python linked list library
import base64 # used for base 64 encoding
import os
from cryptography.hazmat.backends import default_backend # used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
import random # to select random key
import sys
import re
import bitarray # for lookup table
def main():
    word_dict = intialization()  # builds the basic plaintext inverted index
    print(word_dict)
    word_dict = intialization2()
    print(word_dict)

############################################################################################

def intialization():
    '''
    Prompts the user for the documents to be encrypted and collects the distinct
    words in each. Returns a dictionary 'word_dict' mapping each distinct word
    to the documents that contain it.
    '''
    filenames = []
    x = input("Please enter the name of a file you want to encrypt: ")  # filename
    filenames.append(x)
    while True:
        x = input("\nEnter another file name or press enter if done: ")
        if not x:
            break
        filenames.append(x)
    # count the occurrences of each word in each file
    filedata = []
    for idx, val in enumerate(filenames):  # val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # iterating a file object yields it line by line
            print(line)
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for word in word_list:
                cnt[word] += 1
        filedata.append((val, cnt))  # (filename, word-frequency Counter)
    print(filedata)
    # take the 5 most common words of each document as the distinct words
    # (in fact, this restriction is not necessary)
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(5):
            if value not in allwords:
                allwords.append(value)
    print(allwords)
    # build a dictionary keyed by distinct word, with a list of filenames as the value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]  # touching the key creates an empty list in the defaultdict
        for idx, val in enumerate(filedata):
            if i in val[1]:  # val[1] is the word Counter of the file
                word_dict[i].append(val[0])  # val[0] is the name of the file