Circuit Design of the SM4 Encryption/Decryption Algorithm

I. Introduction to SM4

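SM4 is a Chinese commercial cryptographic algorithm used mainly for data encryption and decryption. Like DES and AES, it is a block cipher. It takes a plaintext and a key and outputs a ciphertext, or takes a ciphertext and a key and outputs a plaintext. The plaintext is the information to be encrypted, which in SM4 is a 128-bit binary value; the key is the secret parameter that takes part in the computation and is also 128 bits wide. Encryption yields a 128-bit ciphertext, and decryption yields a 128-bit plaintext.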

For example, with

plaintext OText=128'h01234567_89abcdef_fedcba98_76543210 and

key MKey=128'h01234567_89abcdef_fedcba98_76543210,

the ciphertext after encryption is PCText=128'h681edf34_d206965e_86b3e94f_536e4246.

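Encryption consists of three steps: 1. generate the 32 round keys; 2. apply 32 iterations of the round operation to the plaintext; 3. split the result of step 2 into four 32-bit words and reverse their order to obtain the final ciphertext.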

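For decryption, if the round keys have already been generated, key generation need not be repeated and the 32 round operations can be applied to the ciphertext directly. The result is word-reversed on output just as in the last step of encryption. The only difference is that the round keys are used in reverse order: counting from 1, round 5 of encryption uses the 5th round key, whereas round 5 of decryption uses the 28th.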
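The reversed key order comes down to a small index calculation. The sketch below is illustrative only: the module `rk_index_demo` and its signals are not part of the design (which obtains the reversal from the storage order of the round keys rather than from a subtraction). Rounds are numbered from 0 here, so the 1-based "round 5 uses key 28" corresponds to round index 4 using key index 27.

```verilog
`timescale 1ns/1ns
// Illustrative only: which round key (0..31) a given round (0..31) uses.
module rk_index_demo;
    reg  [4:0] round;    // 0-based round number
    reg        decrypt;  // 0: encryption, 1: decryption

    // Encryption uses key[round]; decryption walks the keys backwards.
    wire [4:0] key_idx = decrypt ? (5'd31 - round) : round;

    initial begin
        round   = 5'd4;
        decrypt = 1'b0;
        #1 $display("enc round %0d -> key %0d", round, key_idx); // key 4
        decrypt = 1'b1;
        #1 $display("dec round %0d -> key %0d", round, key_idx); // key 27
        $finish;
    end
endmodule
```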

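This design contains two round functions, one that generates the round keys and one that encrypts the plaintext or decrypts the ciphertext. The data path connects them so that, during encryption, each round key is used for one round on the text as soon as it is generated. The round functions therefore run for 32 clock cycles in total; adding the extra states of the controller's state machine, one encryption or decryption takes 35 rising clock edges. Note that for a given key, one encryption must be performed before the first decryption in order to generate the 32 round keys (stored in flip-flops inside the circuit); otherwise decryption is not possible.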

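II. Circuit Structure

(I) Top-Level Interface

The interface of the top-level module is shown in the figure below.

The interface signals are described in the table below.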

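| Signal | Width | Direction | Description |
|---|---|---|---|
| clk | 1 | input | Clock |
| rst_n | 1 | input | Global asynchronous reset, active low |
| en_or_de | 1 | input | Mode select: 0 = encryption, 1 = decryption |
| start | 1 | input | Start of encryption/decryption, active high |
| K_valid | 1 | input | Valid flag for the key input MKey, active high |
| X_valid | 1 | input | Valid flag for the text input OText, active high |
| MKey | 128 | input | Key |
| OText | 128 | input | Input plaintext or ciphertext |
| rkey_q | 1024 | output | The 32 round keys, 32 bits each |
| PCText | 128 | output | Output ciphertext or plaintext |
| finish | 1 | output | Operation-complete flag, active high |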

The Verilog code of the top-level module is as follows.

/* Module:SM4
 * top
*/
module SM4(
    //global
    input           clk,
    input           rst_n,

    //start,high active
    input           start,

    //de-encryption mode select
    input           en_or_de,

    //input data valid,active high
    input           K_valid,
    input           X_valid,

    //Key input
    input  [127:0]  MKey,

    //Text input
    input  [127:0]  OText,

    //roundkey output
    output [1023:0] rkey_q,

    //text output after de-encryption
    output [127:0]  PCText,

    //finish
    output          finish
);
    //internal signals
    wire       loadK,loadX;
    wire       flopK_en,flopX_en;
    wire       write_rk_en;
    wire [4:0] cntK,cntX;

    //CtrlPath
    Controller CtrlPath_x(
        .clk         (clk),
        .rst_n       (rst_n),
        .start       (start),
        .en_or_de    (en_or_de),
        .K_valid     (K_valid),
        .X_valid     (X_valid),
        .loadK       (loadK),
        .loadX       (loadX),
        .flopK_en    (flopK_en),
        .flopX_en    (flopX_en),
        .write_rk_en (write_rk_en),
        .cntK        (cntK),
        .cntX        (cntX),
        .finish      (finish)
    );

    //DataPath
    DataPath DataPath_x(
        .clk         (clk),
        .rst_n       (rst_n),
        .en_or_de    (en_or_de),
        .MKey        (MKey),
        .OText       (OText),
        .rkey_q      (rkey_q),
        .PCText      (PCText),
        .loadK       (loadK),
        .loadX       (loadX),
        .flopK_en    (flopK_en),
        .flopX_en    (flopX_en),
        .write_rk_en (write_rk_en),
        .cntK        (cntK),
        .cntX        (cntX)
    );
endmodule

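(II) Control Path

1. Interface

The overall control path interface is shown in the figure below.

Its interface signals are described in the table below.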

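| Signal | Width | Direction | Description |
|---|---|---|---|
| clk | 1 | input | Clock |
| rst_n | 1 | input | Global asynchronous reset, active low |
| start | 1 | input | Start of encryption/decryption, active high |
| en_or_de | 1 | input | Mode select: 0 = encryption, 1 = decryption |
| K_valid | 1 | input | Valid flag for the key input MKey, active high |
| X_valid | 1 | input | Valid flag for the text input OText, active high |
| loadK | 1 | output | Loads MKey into the data path for the first round, active high |
| loadX | 1 | output | Loads OText into the data path for the first round, active high |
| flopK_en | 1 | output | Enable of the flip-flop storing the RoundFunc1 output KeyNew, active high |
| flopX_en | 1 | output | Enable of the flip-flop storing the RoundFunc2 output XNew, active high |
| write_rk_en | 1 | output | Enable of the flip-flops storing the 32 round keys, active high |
| cntK | 5 | output | Count output of counterK |
| cntX | 5 | output | Count output of counterX |
| finish | 1 | output | Operation-complete flag, active high |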

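2. State Machine

The state transition diagram and the signals of the two counters are shown below.

The states and their output signals are described in the tables below.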

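| State | Description |
|---|---|
| IDLE | Idle |
| WAIT | Waits for MKey and OText to become valid |
| LOAD | Loads MKey and OText into the data path and performs the first round operation |
| DEALK | Processes the key. During encryption, exits once the counterK carry co_cntK goes high; during decryption, takes only one clock cycle, which writes the first RoundFunc2 result into the flip-flop |
| DEALX | Processes the text; exits once the counterX carry co_cntX goes high |
| FINISH | Done; pulls finish high to signal that encryption or decryption has completed |

| State | loadK | loadX | flopK_en | flopX_en |
|---|---|---|---|---|
| IDLE | 0 | 0 | 0 | 0 |
| WAIT | K_valid | X_valid | 0 | 0 |
| LOAD | 1 | 1 | !en_or_de | 0 |
| DEALK | 0 | cntX==0 | !en_or_de | 1 |
| DEALX | 0 | 0 | 0 | 1 |
| FINISH | 0 | 0 | 0 | 0 |

| State | cntK_en | cntX_en | write_rk_en | finish |
|---|---|---|---|---|
| IDLE | 0 | 0 | 0 | 0 |
| WAIT | 0 | 0 | 0 | 0 |
| LOAD | !en_or_de | 0 | 0 | 0 |
| DEALK | !en_or_de | 1 | !en_or_de | 0 |
| DEALX | 0 | 1 | !en_or_de | 0 |
| FINISH | 0 | 0 | 0 | 1 |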

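The control path consists of one finite state machine and two counters. co_cntK is the carry output of counterK and co_cntX is the carry output of counterX. Both counters count up to 31, i.e. they are modulo-32 counters. Part (III), on the overall data path, shows what each control signal does in the data path.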
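One way to check the latency implied by the state machine is a small monitor that counts rising clock edges per operation. The sketch below is a hypothetical add-on, not part of the original design: the module name `tb_latency` is made up, and it assumes it is simulated as an additional top-level module next to the testbench `tb` from section III, whose signals it reads through hierarchical names. If the 35-edge figure quoted in the introduction holds, each operation should report 35.

```verilog
`timescale 1ns/1ns
// Hypothetical add-on monitor: counts rising clock edges per operation.
// It reads the signals of module tb through hierarchical names.
module tb_latency;
    integer edges;
    reg     running;

    initial begin
        edges   = 0;
        running = 1'b0;
    end

    always @(posedge tb.clk) begin
        if (tb.finish) begin
            // finish is still high on the edge that leaves FINISH
            $display("%0t: finish after %0d rising edges", $time, edges);
            running = 1'b0;
        end else if (running) begin
            edges = edges + 1;
        end else if (tb.rst_n && tb.start) begin
            // the edge on which start is sampled counts as edge 1
            running = 1'b1;
            edges   = 1;
        end
    end
endmodule
```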

The Verilog code of the control path is as follows.

/* Module:Controller
 * 
*/
module Controller(
    //global
    input            clk,
    input            rst_n,
    input            start,//wait for high to start de-encryption
    input            en_or_de,//0 is encryption,1 is decryption
    input            K_valid,//input MKey valid
    input            X_valid,//input Text valid

    //to DataPath
    output reg       loadK,//load MKey in first make-rkey round
    output reg       loadX,//load Text in first de-encryption round
    output reg       flopK_en,//en for flop stores Key every round
    output reg       flopX_en,//en for flop stores Text every round
    output reg       write_rk_en,//en for flop stores rk every round
    output reg [4:0] cntK,//counterK output
    output reg [4:0] cntX,//counterX output

    //finish signal
    output reg       finish//pull high when whole procedure finished
);
    //parameters
    localparam IDLE   = 6'b000001;//nothing to do
    localparam WAIT   = 6'b000010;//wait for the Key and Text valid
    localparam LOAD   = 6'b000100;//load Key and Text
    localparam DEALK  = 6'b001000;//deal with Key
    localparam DEALX  = 6'b010000;//deal with Text
    localparam FINISH = 6'b100000;//finish

    //internal signals
    reg [5:0] cur_state,nxt_state;
    wire      co_cntK;
    wire      co_cntX;
    reg       cntK_en;
    reg       cntX_en;

    //counterK
    assign co_cntK = &cntK;
    always @(posedge clk,negedge rst_n) begin
        if(!rst_n)
            cntK <= 5'b00000;
        else if(cntK_en)
            cntK <= cntK + 1'b1;
        else
            cntK <= cntK;
    end

    //counterX
    assign co_cntX = &cntX;
    always @(posedge clk,negedge rst_n) begin
        if(!rst_n)
            cntX <= 5'b00000;
        else if(cntX_en)
            cntX <= cntX + 1'b1;
        else
            cntX <= cntX;
    end

    //============================================================
    //                        state machine
    //============================================================
    //timing logic
    always @(posedge clk,negedge rst_n) begin
        if(!rst_n)
            cur_state <= IDLE;
        else
            cur_state <= nxt_state;
    end
    //combinational logic
    always @(*) begin
        loadK       = 1'b0;
        loadX       = 1'b0;
        cntK_en     = 1'b0;
        cntX_en     = 1'b0;
        flopK_en    = 1'b0;
        flopX_en    = 1'b0;
        write_rk_en = 1'b0;
        finish      = 1'b0;
        case(cur_state)
        IDLE:begin
            nxt_state = start ? WAIT : IDLE;
        end
        WAIT:begin
            nxt_state = (K_valid && X_valid) ? LOAD : WAIT;
            loadK     = K_valid;
            loadX     = X_valid;
        end
        LOAD:begin
            nxt_state = DEALK;
            loadK     = 1'b1;
            loadX     = 1'b1;
            cntK_en   = !en_or_de;
            flopK_en  = !en_or_de;
        end
        DEALK:begin
            nxt_state   = en_or_de ? DEALX
                                   : (co_cntK ? DEALX : DEALK);
            loadK       = 1'b0;
            loadX       = !(|cntX);//load X in the first counterX cycle
            cntK_en     = !en_or_de;
            cntX_en     = 1'b1;
            flopK_en    = !en_or_de;
            flopX_en    = 1'b1;
            write_rk_en = !en_or_de;
        end
        DEALX:begin
            nxt_state   = co_cntX ? FINISH : DEALX;
            cntX_en     = 1'b1;
            flopX_en    = 1'b1;
            write_rk_en = !en_or_de;
        end
        FINISH:begin
            nxt_state = IDLE;
            finish    = 1'b1;
        end
        default:begin
            nxt_state = IDLE;
        end
        endcase
    end
endmodule

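(III) Data Path

1. Interface

In the interface diagram, the signals on the left and top connect to the top-level global signals, and those on the right connect to the control path. All of them have been described earlier, so they are not repeated here.

2. S-Box

The SBox is the key nonlinear component of the round functions used for both round key generation and text processing. An S-box takes an 8-bit input and produces an 8-bit output; that is, it maps an 8-bit addr to another 8-bit q. The official standard gives the S-box as a 16x16 table (the high nibble of the input selects the row and the low nibble the column), so the design describes it as combinational logic using a case statement.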
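A quick way to gain confidence in a hand-typed lookup table is to spot-check a few entries. The testbench below is a minimal sketch, not part of the original design: the name `tb_sbox_spot` and the `check` task are made up, it has to be compiled together with the SBox module from this section, and the three expected values are simply read off that module's case table.

```verilog
`timescale 1ns/1ns
// Spot-check of the SBox module against three entries of its own table.
module tb_sbox_spot;
    reg  [7:0] addr;
    wire [7:0] q;
    integer    errors;

    SBox u_sbox(.addr(addr), .q(q));

    // Apply one input and compare the output with the expected byte.
    task check(input [7:0] a, input [7:0] expected);
    begin
        addr = a;
        #1;  // let the combinational output settle
        if (q !== expected) begin
            errors = errors + 1;
            $display("MISMATCH: SBox(%h) = %h, expected %h", a, q, expected);
        end
    end
    endtask

    initial begin
        errors = 0;
        check(8'h00, 8'hd6);  // row 0, column 0
        check(8'hef, 8'h84);  // row e, column f
        check(8'hff, 8'h48);  // row f, column f
        $display("SBox spot-check finished, errors = %0d", errors);
        $finish;
    end
endmodule
```

An exhaustive comparison against a table loaded from a file would be stronger; the three entries here only guard against gross errors such as a swapped row.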

The Verilog code of the S-box module is as follows.

/* module:SBox
 * checked
*/
module SBox(
    input      [7:0] addr,
    output reg [7:0] q
);
    always @(*) begin
        case(addr)
        8'h00: q = 8'hd6;    8'h01: q = 8'h90;    8'h02: q = 8'he9;    8'h03: q = 8'hfe;
        8'h04: q = 8'hcc;    8'h05: q = 8'he1;    8'h06: q = 8'h3d;    8'h07: q = 8'hb7;
        8'h08: q = 8'h16;    8'h09: q = 8'hb6;    8'h0a: q = 8'h14;    8'h0b: q = 8'hc2;
        8'h0c: q = 8'h28;    8'h0d: q = 8'hfb;    8'h0e: q = 8'h2c;    8'h0f: q = 8'h05;

        8'h10: q = 8'h2b;    8'h11: q = 8'h67;    8'h12: q = 8'h9a;    8'h13: q = 8'h76;
        8'h14: q = 8'h2a;    8'h15: q = 8'hbe;    8'h16: q = 8'h04;    8'h17: q = 8'hc3;
        8'h18: q = 8'haa;    8'h19: q = 8'h44;    8'h1a: q = 8'h13;    8'h1b: q = 8'h26;
        8'h1c: q = 8'h49;    8'h1d: q = 8'h86;    8'h1e: q = 8'h06;    8'h1f: q = 8'h99;

        8'h20: q = 8'h9c;    8'h21: q = 8'h42;    8'h22: q = 8'h50;    8'h23: q = 8'hf4;
        8'h24: q = 8'h91;    8'h25: q = 8'hef;    8'h26: q = 8'h98;    8'h27: q = 8'h7a;
        8'h28: q = 8'h33;    8'h29: q = 8'h54;    8'h2a: q = 8'h0b;    8'h2b: q = 8'h43;
        8'h2c: q = 8'hed;    8'h2d: q = 8'hcf;    8'h2e: q = 8'hac;    8'h2f: q = 8'h62;

        8'h30: q = 8'he4;    8'h31: q = 8'hb3;    8'h32: q = 8'h1c;    8'h33: q = 8'ha9;
        8'h34: q = 8'hc9;    8'h35: q = 8'h08;    8'h36: q = 8'he8;    8'h37: q = 8'h95;
        8'h38: q = 8'h80;    8'h39: q = 8'hdf;    8'h3a: q = 8'h94;    8'h3b: q = 8'hfa;
        8'h3c: q = 8'h75;    8'h3d: q = 8'h8f;    8'h3e: q = 8'h3f;    8'h3f: q = 8'ha6;

        8'h40: q = 8'h47;    8'h41: q = 8'h07;    8'h42: q = 8'ha7;    8'h43: q = 8'hfc;
        8'h44: q = 8'hf3;    8'h45: q = 8'h73;    8'h46: q = 8'h17;    8'h47: q = 8'hba;
        8'h48: q = 8'h83;    8'h49: q = 8'h59;    8'h4a: q = 8'h3c;    8'h4b: q = 8'h19;
        8'h4c: q = 8'he6;    8'h4d: q = 8'h85;    8'h4e: q = 8'h4f;    8'h4f: q = 8'ha8;

        8'h50: q = 8'h68;    8'h51: q = 8'h6b;    8'h52: q = 8'h81;    8'h53: q = 8'hb2;
        8'h54: q = 8'h71;    8'h55: q = 8'h64;    8'h56: q = 8'hda;    8'h57: q = 8'h8b;
        8'h58: q = 8'hf8;    8'h59: q = 8'heb;    8'h5a: q = 8'h0f;    8'h5b: q = 8'h4b;
        8'h5c: q = 8'h70;    8'h5d: q = 8'h56;    8'h5e: q = 8'h9d;    8'h5f: q = 8'h35;

        8'h60: q = 8'h1e;    8'h61: q = 8'h24;    8'h62: q = 8'h0e;    8'h63: q = 8'h5e;
        8'h64: q = 8'h63;    8'h65: q = 8'h58;    8'h66: q = 8'hd1;    8'h67: q = 8'ha2;
        8'h68: q = 8'h25;    8'h69: q = 8'h22;    8'h6a: q = 8'h7c;    8'h6b: q = 8'h3b;
        8'h6c: q = 8'h01;    8'h6d: q = 8'h21;    8'h6e: q = 8'h78;    8'h6f: q = 8'h87;

        8'h70: q = 8'hd4;    8'h71: q = 8'h00;    8'h72: q = 8'h46;    8'h73: q = 8'h57;
        8'h74: q = 8'h9f;    8'h75: q = 8'hd3;    8'h76: q = 8'h27;    8'h77: q = 8'h52;
        8'h78: q = 8'h4c;    8'h79: q = 8'h36;    8'h7a: q = 8'h02;    8'h7b: q = 8'he7;
        8'h7c: q = 8'ha0;    8'h7d: q = 8'hc4;    8'h7e: q = 8'hc8;    8'h7f: q = 8'h9e;

        8'h80: q = 8'hea;    8'h81: q = 8'hbf;    8'h82: q = 8'h8a;    8'h83: q = 8'hd2;
        8'h84: q = 8'h40;    8'h85: q = 8'hc7;    8'h86: q = 8'h38;    8'h87: q = 8'hb5;
        8'h88: q = 8'ha3;    8'h89: q = 8'hf7;    8'h8a: q = 8'hf2;    8'h8b: q = 8'hce;
        8'h8c: q = 8'hf9;    8'h8d: q = 8'h61;    8'h8e: q = 8'h15;    8'h8f: q = 8'ha1;

        8'h90: q = 8'he0;    8'h91: q = 8'hae;    8'h92: q = 8'h5d;    8'h93: q = 8'ha4;
        8'h94: q = 8'h9b;    8'h95: q = 8'h34;    8'h96: q = 8'h1a;    8'h97: q = 8'h55;
        8'h98: q = 8'had;    8'h99: q = 8'h93;    8'h9a: q = 8'h32;    8'h9b: q = 8'h30;
        8'h9c: q = 8'hf5;    8'h9d: q = 8'h8c;    8'h9e: q = 8'hb1;    8'h9f: q = 8'he3;

        8'ha0: q = 8'h1d;    8'ha1: q = 8'hf6;    8'ha2: q = 8'he2;    8'ha3: q = 8'h2e;
        8'ha4: q = 8'h82;    8'ha5: q = 8'h66;    8'ha6: q = 8'hca;    8'ha7: q = 8'h60;
        8'ha8: q = 8'hc0;    8'ha9: q = 8'h29;    8'haa: q = 8'h23;    8'hab: q = 8'hab;
        8'hac: q = 8'h0d;    8'had: q = 8'h53;    8'hae: q = 8'h4e;    8'haf: q = 8'h6f;

        8'hb0: q = 8'hd5;    8'hb1: q = 8'hdb;    8'hb2: q = 8'h37;    8'hb3: q = 8'h45;
        8'hb4: q = 8'hde;    8'hb5: q = 8'hfd;    8'hb6: q = 8'h8e;    8'hb7: q = 8'h2f;
        8'hb8: q = 8'h03;    8'hb9: q = 8'hff;    8'hba: q = 8'h6a;    8'hbb: q = 8'h72;
        8'hbc: q = 8'h6d;    8'hbd: q = 8'h6c;    8'hbe: q = 8'h5b;    8'hbf: q = 8'h51;

        8'hc0: q = 8'h8d;    8'hc1: q = 8'h1b;    8'hc2: q = 8'haf;    8'hc3: q = 8'h92;
        8'hc4: q = 8'hbb;    8'hc5: q = 8'hdd;    8'hc6: q = 8'hbc;    8'hc7: q = 8'h7f;
        8'hc8: q = 8'h11;    8'hc9: q = 8'hd9;    8'hca: q = 8'h5c;    8'hcb: q = 8'h41;
        8'hcc: q = 8'h1f;    8'hcd: q = 8'h10;    8'hce: q = 8'h5a;    8'hcf: q = 8'hd8;

        8'hd0: q = 8'h0a;    8'hd1: q = 8'hc1;    8'hd2: q = 8'h31;    8'hd3: q = 8'h88;
        8'hd4: q = 8'ha5;    8'hd5: q = 8'hcd;    8'hd6: q = 8'h7b;    8'hd7: q = 8'hbd;
        8'hd8: q = 8'h2d;    8'hd9: q = 8'h74;    8'hda: q = 8'hd0;    8'hdb: q = 8'h12;
        8'hdc: q = 8'hb8;    8'hdd: q = 8'he5;    8'hde: q = 8'hb4;    8'hdf: q = 8'hb0;

        8'he0: q = 8'h89;    8'he1: q = 8'h69;    8'he2: q = 8'h97;    8'he3: q = 8'h4a;
        8'he4: q = 8'h0c;    8'he5: q = 8'h96;    8'he6: q = 8'h77;    8'he7: q = 8'h7e;
        8'he8: q = 8'h65;    8'he9: q = 8'hb9;    8'hea: q = 8'hf1;    8'heb: q = 8'h09;
        8'hec: q = 8'hc5;    8'hed: q = 8'h6e;    8'hee: q = 8'hc6;    8'hef: q = 8'h84;

        8'hf0: q = 8'h18;    8'hf1: q = 8'hf0;    8'hf2: q = 8'h7d;    8'hf3: q = 8'hec;
        8'hf4: q = 8'h3a;    8'hf5: q = 8'hdc;    8'hf6: q = 8'h4d;    8'hf7: q = 8'h20;
        8'hf8: q = 8'h79;    8'hf9: q = 8'hee;    8'hfa: q = 8'h5f;    8'hfb: q = 8'h3e;
        8'hfc: q = 8'hd7;    8'hfd: q = 8'hcb;    8'hfe: q = 8'h39;    8'hff: q = 8'h48;
        endcase
    end
endmodule

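3. Round Functions

As mentioned above, the round function is the core of the round operation. There is one round function for round key generation and one for text processing. Both contain the S-box substitution from the previous section; they differ only in the remaining linear mixing transform.

The figure below shows the round function that generates the round keys. The input Key is the 128-bit key state of each round, CKey is the 32-bit constant of the round, and the output is the 128-bit input of the round function for the next round.

Step 1 is the XOR with the constant. The input Key is first split into K0, K1, K2, K3, i.e. {K0,K1,K2,K3}=Key. Then K1, K2, K3 and CKey are XORed bitwise to give the 32-bit K_before_SBox.

Step 2 is the S-box substitution. K_before_SBox is split into four bytes, which pass through four S-boxes in parallel; the four 8-bit results are concatenated in the original order into the 32-bit K_after_SBox.

Step 3 is the linear mixing transform. K_after_SBox is XORed bitwise with K_after_SBox rotated left by 13 bits (<<<13) and with K_after_SBox rotated left by 23 bits (<<<23). The result is then XORed with K0 from the input Key to give this round's round key KNew3 (named K3new in the code).

Step 4 is the assembly. The original K1 becomes KNew0, K2 becomes KNew1 and K3 becomes KNew2, and the four words are concatenated in order into the 128-bit output KNew={KNew0,KNew1,KNew2,KNew3} (KeyNew in the code). It is stored in a flip-flop and fed back as the round function input in the next clock cycle.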
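Step 3 can be tried in isolation with a tiny worked example. The sketch below is illustrative only: the function names `rol32` and `lin_key` are made up for this example, and the rotation is written with shifts instead of the concatenations used in the design.

```verilog
`timescale 1ns/1ns
module key_linear_demo;
    // Rotate a 32-bit word left by n bits (0 <= n <= 31).
    function [31:0] rol32;
        input [31:0] x;
        input [4:0]  n;
        begin
            rol32 = (x << n) | (x >> (6'd32 - n));
        end
    endfunction

    // Key-schedule linear transform: B ^ (B <<< 13) ^ (B <<< 23).
    function [31:0] lin_key;
        input [31:0] b;
        begin
            lin_key = b ^ rol32(b, 5'd13) ^ rol32(b, 5'd23);
        end
    endfunction

    initial begin
        // 1 ^ (1 << 13) ^ (1 << 23) = 32'h00802001
        $display("lin_key(1) = %h", lin_key(32'h00000001));
        $finish;
    end
endmodule
```

With a single bit set in the input, the output shows exactly three set bits, one per XOR term, which makes the rotation amounts easy to verify by eye.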

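The figure below is the block diagram of the round function for text processing. Apart from the signal names, it differs structurally only in the linear mixing transform.

Here five terms are XORed: the original value and the value rotated left by 2, 10, 18 and 24 bits. The result is again XORed with the high 32 bits of the split input (X0) to give the new text word, which forms the low 32 bits of the output.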
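The five-term XOR can also be checked on its own. The following sketch pulls the transform out into a standalone combinational block, using the same concatenation style as the design; the module names `data_linear` and `data_linear_demo` are made up for this example.

```verilog
`timescale 1ns/1ns
// Text-round linear transform as a standalone combinational block.
module data_linear(
    input  [31:0] b,
    output [31:0] y
);
    assign y = b ^ {b[29:0], b[31:30]}   // <<< 2
                 ^ {b[21:0], b[31:22]}   // <<< 10
                 ^ {b[13:0], b[31:14]}   // <<< 18
                 ^ {b[7:0],  b[31:8]};   // <<< 24
endmodule

module data_linear_demo;
    reg  [31:0] b;
    wire [31:0] y;

    data_linear u_lin(.b(b), .y(y));

    initial begin
        b = 32'h00000001;
        // 1 ^ (1<<2) ^ (1<<10) ^ (1<<18) ^ (1<<24) = 32'h01040405
        #1 $display("L(%h) = %h", b, y);
        $finish;
    end
endmodule
```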

The Verilog code of the two round functions is as follows.

/* Module:RoundFunc1
 * checked
*/
module RoundFunc1(
    input  [127:0] Key,
    input  [31:0]  CKey,
    output [127:0] KeyNew
);
    //internal signals
    wire [31:0] K3,K2,K1,K0;
    wire [31:0] K_before_SBox;
    wire [7:0]  K3addr,K2addr,K1addr,K0addr;
    wire [7:0]  K3q,K2q,K1q,K0q;
    wire [31:0] K_after_SBox;
    wire [31:0] K3new,K2new,K1new,K0new;

    //before SBox
    assign {K0,K1,K2,K3}                 = Key;
    assign K_before_SBox                 = K1 ^ K2 ^ K3 ^ CKey;
    assign {K3addr,K2addr,K1addr,K0addr} = K_before_SBox;

    //four parallel SBoxs
    SBox SBox_k3(
        .addr (K3addr),
        .q    (K3q)
    );
    SBox SBox_k2(
        .addr (K2addr),
        .q    (K2q)
    );
    SBox SBox_k1(
        .addr (K1addr),
        .q    (K1q)
    );
    SBox SBox_k0(
        .addr (K0addr),
        .q    (K0q)
    );

    //after SBox
    assign K_after_SBox = {K3q,K2q,K1q,K0q};

    //new Key
    assign K0new  = K1;
    assign K1new  = K2;
    assign K2new  = K3;
    assign K3new  = K_after_SBox ^ {K_after_SBox[18:0],K_after_SBox[31:19]}//<<<13
                                 ^ {K_after_SBox[8:0],K_after_SBox[31:9]}//<<<23
                                 ^ K0;
    assign KeyNew = {K0new,K1new,K2new,K3new};
endmodule
/* Module:RoundFunc2
 * checked
*/
module RoundFunc2(
    input  [127:0] X,
    input  [31:0]  rkey,
    output [127:0] XNew
);
    //internal signals
    wire [31:0] X3,X2,X1,X0;
    wire [31:0] X_before_SBox;
    wire [7:0]  X3addr,X2addr,X1addr,X0addr;
    wire [7:0]  X3q,X2q,X1q,X0q;
    wire [31:0] X_after_SBox;
    wire [31:0] XNew3,XNew2,XNew1,XNew0;

    //before SBox
    assign {X0,X1,X2,X3}                 = X;
    assign X_before_SBox                 = X3 ^ X2 ^ X1 ^ rkey;
    assign {X3addr,X2addr,X1addr,X0addr} = X_before_SBox;

    //four parallel SBoxs
    SBox SBox_x3(
        .addr (X3addr),
        .q    (X3q)
    );
    SBox SBox_x2(
        .addr (X2addr),
        .q    (X2q)
    );
    SBox SBox_x1(
        .addr (X1addr),
        .q    (X1q)
    );
    SBox SBox_x0(
        .addr (X0addr),
        .q    (X0q)
    );

    //after SBox
    assign X_after_SBox = {X3q,X2q,X1q,X0q};

    //XNew
    assign XNew0 = X1;
    assign XNew1 = X2;
    assign XNew2 = X3;
    assign XNew3 = X_after_SBox ^ {X_after_SBox[29:0],X_after_SBox[31:30]}//<<<2
                                ^ {X_after_SBox[21:0],X_after_SBox[31:22]}//<<<10
                                ^ {X_after_SBox[13:0],X_after_SBox[31:14]}//<<<18
                                ^ {X_after_SBox[7:0],X_after_SBox[31:8]}//<<<24
                                ^ X0;
    assign XNew  = {XNew0,XNew1,XNew2,XNew3};
endmodule

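4. Overall Data Path

The block diagram of the overall data path is shown in the figure below.

The data path is divided into three parts: round key generation (generate rkey), round key storage and selection (store and select rkey), and text generation (generate PCText). The figure marks the corresponding logic with boxes in three different colors.

Part 1. The input key MKey is first XORed bitwise with the fixed constant FKey, and loadK then selects this value as the input of RoundFunc1 for the first round. loadK matters only in the LOAD state (it can also be high in WAIT, but the key flip-flop is not enabled there), which ensures that once the key has been loaded and the first round key has been produced and stored, MKey no longer affects the circuit. Next comes the wiring of the round function: its output drives the D input of the flip-flop, its input comes from the output of the 2-to-1 MUX, and the low 32 bits of its input Key are passed to Part 2 as the generated round key. The constant input of the round function is selected by the count value cntK from the control path.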
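The 32 round constants do not have to be typed in by hand: in the SM4 standard each byte of CK_i is defined as ((4*i + j) * 7) mod 256 for j = 0..3. The sketch below shows this as a Verilog function and prints two values that can be compared with the CK0 and CK31 localparams of the data path. The module name `ck_gen_demo` and the function name `ck` are made up for this example.

```verilog
`timescale 1ns/1ns
module ck_gen_demo;
    // CK_i = {b0, b1, b2, b3} with b_j = ((4*i + j) * 7) mod 256.
    function [31:0] ck;
        input [4:0] i;
        reg   [7:0] b0, b1, b2, b3;
        begin
            // 32-bit arithmetic; assigning to 8 bits keeps the value mod 256.
            b0 = (4 * i + 0) * 7;
            b1 = (4 * i + 1) * 7;
            b2 = (4 * i + 2) * 7;
            b3 = (4 * i + 3) * 7;
            ck = {b0, b1, b2, b3};
        end
    endfunction

    initial begin
        $display("CK0  = %h", ck(5'd0));   // 00070e15, same as localparam CK0
        $display("CK31 = %h", ck(5'd31));  // 646b7279, same as localparam CK31
        $finish;
    end
endmodule
```

Whether a synthesis tool produces less logic from such a function than from the explicit case table would have to be checked for the target flow; the main benefit here is removing a source of typing errors.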

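Part 2. The MUX controlled by en_or_de always selects input 0 during encryption and input 1 during decryption. During encryption, each time a new round key arrives, the 1024-bit rkey_q shifts the stored round keys right by 32 bits and stores the new key in the top 32 bits; this is done with a shift and a bitwise XOR. During decryption, the count value cntX from the control path selects the round key passed to Part 3. Because encryption stores newer round keys in the higher bits, decryption can simply take 32 bits at a time from the high end down to the low end.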
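The shift-and-insert storage scheme is easier to follow on a scaled-down model. The sketch below uses 4 words of 8 bits instead of 32 words of 32 bits; the module name `rk_store_demo` and all its signals are made up for illustration, and the plain blocking assignment stands in for the clocked rkey_q register of the real design.

```verilog
`timescale 1ns/1ns
module rk_store_demo;
    localparam W = 8;   // word width (32 in the real design)
    localparam N = 4;   // number of words (32 in the real design)

    reg  [W*N-1:0] store_q;
    reg  [W-1:0]   word_in;
    // Shift right by one word and place the new word in the top W bits.
    // XOR works as an insert because the shifted-in top bits are zero.
    wire [W*N-1:0] store_d = (store_q >> W) ^ {word_in, {(N-1)*W{1'b0}}};
    integer i;

    initial begin
        store_q = {W*N{1'b0}};
        // "Encryption": write the words 11, 22, 33, 44 one after another.
        for (i = 1; i <= N; i = i + 1) begin
            word_in = i * 8'h11;
            #1 store_q = store_d;
        end
        $display("store_q = %h", store_q);            // 44332211
        // "Decryption": read from the high end down, newest word first.
        for (i = 0; i < N; i = i + 1)
            $display("read %0d -> %h", i, store_q[W*N-1 - i*W -: W]);
        $finish;
    end
endmodule
```

The read loop returns 44, 33, 22, 11, i.e. the words in the reverse of the order in which they were written, which is the order decryption needs.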

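Part 3. The data path contains two stages of flip-flops. Although loading MKey produces the first round key, the enable of the second-stage flip-flop cannot be opened right away, so the text load must be held for one extra clock cycle. This is why loadX is assigned (cntX==0) in the DEALK state in the state output table. The rest of the logic has the same form as in Part 1, followed by the connection of the second round function RoundFunc2. Finally, the Q output of the flip-flop is split into four 32-bit words, which are concatenated in reverse order to give the 128-bit result PCText.

The Verilog code of the data path is as follows.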

/* Module:DataPath
 * 
*/
module DataPath(
    //global
    input               clk,
    input               rst_n,
    input               en_or_de,

    //data inout
    input      [127:0]  MKey,//original key
    input      [127:0]  OText,//original text
    output reg [1023:0] rkey_q,//round key
    output     [127:0]  PCText,//output plain text or cipher text

    //from Controller
    input               loadK,//load MKey in first make-rk round
    input               loadX,//load Text in first de-encryption round
    input               flopK_en,//en for flop stores Key every round
    input               flopX_en,//en for flop stores Text every round
    input               write_rk_en,//en for flop stores rk every round
    input      [4:0]    cntK,//counterK output
    input      [4:0]    cntX//counterX output
);
    //constants for make rkey in RoundFunc1
    localparam FKey = 128'ha3b1bac6_56aa3350_677d9197_b27022dc;
    localparam CK0  = 32'h00070e15;
    localparam CK1  = 32'h1c232a31;
    localparam CK2  = 32'h383f464d;
    localparam CK3  = 32'h545b6269;
    localparam CK4  = 32'h70777e85;
    localparam CK5  = 32'h8c939aa1;
    localparam CK6  = 32'ha8afb6bd;
    localparam CK7  = 32'hc4cbd2d9;
    localparam CK8  = 32'he0e7eef5;
    localparam CK9  = 32'hfc030a11;
    localparam CK10 = 32'h181f262d;
    localparam CK11 = 32'h343b4249;
    localparam CK12 = 32'h50575e65;
    localparam CK13 = 32'h6c737a81;
    localparam CK14 = 32'h888f969d;
    localparam CK15 = 32'ha4abb2b9;
    localparam CK16 = 32'hc0c7ced5;
    localparam CK17 = 32'hdce3eaf1;
    localparam CK18 = 32'hf8ff060d;
    localparam CK19 = 32'h141b2229;
    localparam CK20 = 32'h30373e45;
    localparam CK21 = 32'h4c535a61;
    localparam CK22 = 32'h686f767d;
    localparam CK23 = 32'h848b9299;
    localparam CK24 = 32'ha0a7aeb5;
    localparam CK25 = 32'hbcc3cad1;
    localparam CK26 = 32'hd8dfe6ed;
    localparam CK27 = 32'hf4fb0209;
    localparam CK28 = 32'h10171e25;
    localparam CK29 = 32'h2c333a41;
    localparam CK30 = 32'h484f565d;
    localparam CK31 = 32'h646b7279;

    //internal signals
    wire [127:0]  Key_1;
    wire [127:0]  Key_0;
    wire [127:0]  Key;
    reg  [127:0]  Key_q;
    wire [127:0]  Key_d;
    wire [127:0]  KeyNew;
    reg  [31:0]   CKey;
    wire [31:0]   rkey;
    wire [31:0]   rkey_0;
    reg  [31:0]   rkey_1;
    wire [1023:0] rkey_d;
    wire [31:0]   rkey_part [0:31];//partial rkey to 32 parts
    reg  [127:0]  Text_q;
    wire [127:0]  Text_d;
    wire [127:0]  X;
    wire [127:0]  XNew;
    wire [127:0]  X_0;
    wire [31:0]   PCText0,PCText1,PCText2,PCText3;

    //===========================================================================================================================================
    //                                    generate rkey
    //===========================================================================================================================================
    //assign logic
    assign Key_1  = FKey ^ MKey;
    assign Key_0  = Key_q;
    assign Key    = loadK ? Key_1 : Key_0;
    assign rkey_0 = Key[31:0];//low 32 bits of Key after every round
    assign Key_d  = KeyNew;

    //RoundFunc1
    RoundFunc1 RoundFunc1_x(
        .Key    (Key),
        .CKey   (CKey),
        .KeyNew (KeyNew)
    );

    //flop logic
    always @(posedge clk,negedge rst_n) begin
        if(!rst_n)
            Key_q <= {128{1'b0}};
        else if(flopK_en)
            Key_q <= Key_d;
        else
            Key_q <= Key_q;
    end

    //CKey logic
    always @(*) begin
        case(cntK)
        5'd0:  CKey = CK0;     5'd1:  CKey = CK1;     5'd2:  CKey = CK2;     5'd3:  CKey = CK3;
        5'd4:  CKey = CK4;     5'd5:  CKey = CK5;     5'd6:  CKey = CK6;     5'd7:  CKey = CK7;
        5'd8:  CKey = CK8;     5'd9:  CKey = CK9;     5'd10: CKey = CK10;    5'd11: CKey = CK11;
        5'd12: CKey = CK12;    5'd13: CKey = CK13;    5'd14: CKey = CK14;    5'd15: CKey = CK15;
        5'd16: CKey = CK16;    5'd17: CKey = CK17;    5'd18: CKey = CK18;    5'd19: CKey = CK19;
        5'd20: CKey = CK20;    5'd21: CKey = CK21;    5'd22: CKey = CK22;    5'd23: CKey = CK23;
        5'd24: CKey = CK24;    5'd25: CKey = CK25;    5'd26: CKey = CK26;    5'd27: CKey = CK27;
        5'd28: CKey = CK28;    5'd29: CKey = CK29;    5'd30: CKey = CK30;    5'd31: CKey = CK31;
        default: CKey = {32{1'b0}};
        endcase
    end

    //===========================================================================================================================================
    //                                    store and select rkey
    //===========================================================================================================================================
    //assign logic
    assign rkey   = en_or_de ? rkey_1 : rkey_0;
    assign rkey_d = (rkey_q>>32) ^ {rkey,{31*32{1'b0}}};
    assign {rkey_part[0], rkey_part[1], rkey_part[2], rkey_part[3],
            rkey_part[4], rkey_part[5], rkey_part[6], rkey_part[7],
            rkey_part[8], rkey_part[9], rkey_part[10],rkey_part[11],
            rkey_part[12],rkey_part[13],rkey_part[14],rkey_part[15],
            rkey_part[16],rkey_part[17],rkey_part[18],rkey_part[19],
            rkey_part[20],rkey_part[21],rkey_part[22],rkey_part[23],
            rkey_part[24],rkey_part[25],rkey_part[26],rkey_part[27],
            rkey_part[28],rkey_part[29],rkey_part[30],rkey_part[31]} = rkey_q;

    //flop logic
    always @(posedge clk,negedge rst_n) begin
        if(!rst_n)
            rkey_q <= {1024{1'b0}};
        else if(write_rk_en)
            rkey_q <= rkey_d;
        else
            rkey_q <= rkey_q;
    end

    //rkey_1 logic
    always @(*) begin
        case(cntX)
        5'd0:  rkey_1 = rkey_part[0];     5'd1:  rkey_1 = rkey_part[1];     5'd2:  rkey_1 = rkey_part[2];     5'd3:  rkey_1 = rkey_part[3];
        5'd4:  rkey_1 = rkey_part[4];     5'd5:  rkey_1 = rkey_part[5];     5'd6:  rkey_1 = rkey_part[6];     5'd7:  rkey_1 = rkey_part[7];
        5'd8:  rkey_1 = rkey_part[8];     5'd9:  rkey_1 = rkey_part[9];     5'd10: rkey_1 = rkey_part[10];    5'd11: rkey_1 = rkey_part[11];
        5'd12: rkey_1 = rkey_part[12];    5'd13: rkey_1 = rkey_part[13];    5'd14: rkey_1 = rkey_part[14];    5'd15: rkey_1 = rkey_part[15];
        5'd16: rkey_1 = rkey_part[16];    5'd17: rkey_1 = rkey_part[17];    5'd18: rkey_1 = rkey_part[18];    5'd19: rkey_1 = rkey_part[19];
        5'd20: rkey_1 = rkey_part[20];    5'd21: rkey_1 = rkey_part[21];    5'd22: rkey_1 = rkey_part[22];    5'd23: rkey_1 = rkey_part[23];
        5'd24: rkey_1 = rkey_part[24];    5'd25: rkey_1 = rkey_part[25];    5'd26: rkey_1 = rkey_part[26];    5'd27: rkey_1 = rkey_part[27];
        5'd28: rkey_1 = rkey_part[28];    5'd29: rkey_1 = rkey_part[29];    5'd30: rkey_1 = rkey_part[30];    5'd31: rkey_1 = rkey_part[31];
        default: rkey_1 = {32{1'b0}};
        endcase
    end

    //===========================================================================================================================================
    //                                    generate PCText
    //===========================================================================================================================================
    //assign logic
    assign Text_d                            = XNew;
    assign X_0                               = Text_q;
    assign X                                 = loadX ? OText : X_0;
    assign {PCText0,PCText1,PCText2,PCText3} = Text_q;
    assign PCText                            = {PCText3,PCText2,PCText1,PCText0};//make reverse

    //RoundFunc2
    RoundFunc2 RoundFunc2_x(
        .X    (X),
        .rkey (rkey),
        .XNew (XNew)
    );

    //flop logic
    always @(posedge clk,negedge rst_n) begin
        if(!rst_n)
            Text_q <= {128{1'b0}};
        else if(flopX_en)
            Text_q <= Text_d;
        else
            Text_q <= Text_q;
    end
endmodule

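III. Simulation

(I) testbench

The Verilog testbench is given below. In the testbench the output PCText is split into four 32-bit words text0, text1, text2, text3, i.e. {text0,text1,text2,text3}=PCText. The waveform does not show PCText itself; observe these four signals instead.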

`timescale 1ns/1ns
/* Module:tb
 * testbench for SM4
*/
module tb;
reg           clk;
reg           rst_n;
reg           start;
reg           en_or_de;
reg           K_valid;
reg           X_valid;
reg  [127:0]  MKey,OText;
wire          finish;
wire [31:0]   rkey;
wire [31:0]   text0,text1,text2,text3;
wire [127:0]  PCText;
wire [1023:0] rkey_q;

//register
reg [127:0] q;
reg en;
always @(posedge clk,negedge rst_n) begin
  if(!rst_n)
    q <= {128{1'b0}};
  else if(en)
    q <= PCText;
  else
    q <= q;
end

//assign logic
assign rkey                      = rkey_q[1023:1024-32];//every round key in encryption mode
assign {text0,text1,text2,text3} = PCText;

//initialize
initial begin
  clk      = 0;
  rst_n    = 0;
  start    = 0;
  en_or_de = 0;
  K_valid  = 0;
  X_valid  = 0;
  MKey     = 128'h01234567_89abcdef_fedcba98_76543210;
  OText    = 128'h01234567_89abcdef_fedcba98_76543210;
  //OText    = 128'h11111111_22222222_33333333_44444444;
  en =1;
end

//generate clk
initial begin
  forever #1 clk=~clk;
end

//other signals
initial fork
  #96  en       = 0;

  #4   rst_n    = 1;

  #6   start    = 1;
  #10  start    = 0;
  #100 start    = 1;
  #104 start    = 0;

  #100 en_or_de = 1;

  #100 OText    = q;

  #6   K_valid  = 1;
  #6   X_valid  = 1;
join

//dut

  SM4 dut(
    .clk      (clk),
    .rst_n    (rst_n),
    .start    (start),
    .en_or_de (en_or_de),
    .K_valid  (K_valid),
    .X_valid  (X_valid),
    .MKey     (MKey),
    .OText    (OText),
    .rkey_q   (rkey_q),
    .PCText   (PCText),
    .finish   (finish)
  );
endmodule

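The testbench uses a 128-bit register to store the ciphertext produced by encryption. After encryption completes, this value is fed back to the SM4 module as the ciphertext input while en_or_de is pulled high to run decryption. In the two cases below, the 200 ns of simulation therefore perform an encryption immediately followed by a decryption, so the final output should equal the original plaintext.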

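(II) two cases

1. case0

The simulation result of case0 is shown in the figure below.

The input key is 128'h01234567_89abcdef_fedcba98_76543210,

and the input plaintext is 128'h11111111_22222222_33333333_44444444.

The final data matches the input plaintext. The intermediate ciphertext can be checked against a software SM4 implementation written in a high-level language; many such implementations are available online, and they confirm that the result is correct.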

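2. case1

The simulation result of case1 is shown in the figure below.

The input key is 128'h01234567_89abcdef_fedcba98_76543210,

and the input plaintext is 128'h01234567_89abcdef_fedcba98_76543210.

The ciphertext matches the example given at the beginning of this article, and the data obtained after encrypting and then decrypting matches the original input.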
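Instead of comparing waveforms by eye, the case1 check can be automated. The checker below is a hypothetical add-on, not part of the original testbench: the module name `tb_check_case1` is made up, it assumes it is simulated as a second top-level module next to `tb`, and the two expected values are the case1 plaintext and ciphertext quoted in this article.

```verilog
`timescale 1ns/1ns
// Hypothetical add-on checker for case1; reads the signals of module tb
// through hierarchical names, so it has to be simulated together with tb.
module tb_check_case1;
    localparam [127:0] PLAIN  = 128'h01234567_89abcdef_fedcba98_76543210;
    localparam [127:0] CIPHER = 128'h681edf34_d206965e_86b3e94f_536e4246;

    // finish is high for one clock cycle; sample in the middle of that cycle.
    always @(negedge tb.clk) begin
        if (tb.finish) begin
            if (!tb.en_or_de && tb.PCText === CIPHER)
                $display("%0t: encryption matches the expected ciphertext", $time);
            else if (tb.en_or_de && tb.PCText === PLAIN)
                $display("%0t: decryption restores the plaintext", $time);
            else
                $display("%0t: MISMATCH, PCText = %h", $time, tb.PCText);
        end
    end
endmodule
```

For case0 the two constants would need to be replaced by the case0 plaintext and a ciphertext obtained from a trusted software implementation.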

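IV. Summary

Based on the SM4 standard documents, this article implements the SM4 encryption/decryption algorithm in digital logic, with a simple testbench written in Verilog. The circuit was simulated with QuestaSim to verify its functional correctness.

Because the 32 round keys are generated one after another and each round of encryption needs only the current round key, two round functions were designed to handle round key generation and text encryption separately: each round key is passed to the encryption part as soon as it is generated. Compared with first generating all 32 round keys and then encrypting, this reduces the cost from 64 clock cycles to 32, roughly halving the computation time.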

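The design also implements encryption and decryption in the same circuit, with the input en_or_de selecting the mode. This signal must be held constant until the operation finishes. The start signal only needs to be sampled high on one rising clock edge and can then be pulled low. The two data-valid flags K_valid and X_valid must stay high for one more cycle after start becomes valid. It is advisable to hold the key MKey and the input text unchanged once the operation has started, so that the result is not corrupted.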
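These handshake rules can be wrapped in a testbench task so that every operation is driven the same way. The following is a sketch of an alternative testbench, not the one used for the simulations above: the names `tb_handshake` and `run_op` are made up, it must be compiled with the SM4 design, and it holds the valid flags for the whole operation, which is longer than strictly required.

```verilog
`timescale 1ns/1ns
// Hypothetical alternative testbench illustrating the handshake rules.
module tb_handshake;
    reg           clk, rst_n, start, en_or_de, K_valid, X_valid;
    reg  [127:0]  MKey, OText, cipher;
    wire [1023:0] rkey_q;
    wire [127:0]  PCText;
    wire          finish;

    SM4 dut(
        .clk(clk), .rst_n(rst_n), .start(start), .en_or_de(en_or_de),
        .K_valid(K_valid), .X_valid(X_valid), .MKey(MKey), .OText(OText),
        .rkey_q(rkey_q), .PCText(PCText), .finish(finish)
    );

    always #1 clk = ~clk;

    // Run one operation: mode 0 = encrypt, 1 = decrypt.
    task run_op(input mode, input [127:0] key, input [127:0] text);
    begin
        @(negedge clk);
        en_or_de = mode;   // held constant until finish
        MKey     = key;    // held constant until finish
        OText    = text;   // held constant until finish
        K_valid  = 1'b1;
        X_valid  = 1'b1;
        start    = 1'b1;   // sampled on the next rising edge
        @(negedge clk);
        start    = 1'b0;   // one sampled edge is enough
        @(posedge finish);
        @(negedge clk);    // PCText is stable while finish is high
        K_valid  = 1'b0;
        X_valid  = 1'b0;
    end
    endtask

    initial begin
        clk = 0; rst_n = 0; start = 0; en_or_de = 0;
        K_valid = 0; X_valid = 0; MKey = 0; OText = 0;
        #4 rst_n = 1;
        // Encrypt first so that the round keys exist, then decrypt.
        run_op(1'b0, 128'h01234567_89abcdef_fedcba98_76543210,
                     128'h01234567_89abcdef_fedcba98_76543210);
        cipher = PCText;
        $display("cipher = %h", cipher);
        run_op(1'b1, 128'h01234567_89abcdef_fedcba98_76543210, cipher);
        $display("plain  = %h", PCText);
        $finish;
    end
endmodule
```

If the design behaves as described in case1, the first line printed should show the ciphertext from the introduction and the second line the original plaintext.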

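One last point: looking at the data path and the two round functions, they differ only in the linear mixing transform, while the XOR with the constant and the S-box substitution are identical. Resource sharing is therefore possible: the common part of the two round functions can be reused to reduce circuit area. Interested readers can follow this idea and modify the data path, splitting the round functions to share the XOR gates and S-boxes and adding some MUXes as selection logic.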
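One possible shape for such a shared round function is sketched below. It is only a starting point, not a drop-in replacement: the module name `RoundFuncShared` and the port `key_mode` are made up, and it relies on the SBox module from section II. Since the current data path runs both round functions in the same clock cycle during encryption, using a single shared instance would presumably mean alternating between key and text rounds, trading clock cycles for area.

```verilog
// Hypothetical shared round function: one XOR stage and four S-boxes serve
// both the key schedule and the text rounds; only the linear transform is muxed.
// Requires the SBox module defined earlier in this article.
module RoundFuncShared(
    input  [127:0] D,        // Key (key schedule) or X (text round)
    input  [31:0]  RK,       // CKey (key schedule) or rkey (text round)
    input          key_mode, // 1: key schedule, 0: text round
    output [127:0] DNew
);
    wire [31:0] D0, D1, D2, D3;
    wire [31:0] pre, post;
    wire [31:0] lin_key, lin_text;

    // shared: XOR of three words with the round constant or round key
    assign {D0, D1, D2, D3} = D;
    assign pre = D1 ^ D2 ^ D3 ^ RK;

    // shared: four parallel S-boxes, one per byte
    SBox SBox_3(.addr(pre[31:24]), .q(post[31:24]));
    SBox SBox_2(.addr(pre[23:16]), .q(post[23:16]));
    SBox SBox_1(.addr(pre[15:8]),  .q(post[15:8]));
    SBox SBox_0(.addr(pre[7:0]),   .q(post[7:0]));

    // mode-specific linear transforms
    assign lin_key  = post ^ {post[18:0], post[31:19]}   // <<<13
                           ^ {post[8:0],  post[31:9]};   // <<<23
    assign lin_text = post ^ {post[29:0], post[31:30]}   // <<<2
                           ^ {post[21:0], post[31:22]}   // <<<10
                           ^ {post[13:0], post[31:14]}   // <<<18
                           ^ {post[7:0],  post[31:8]};   // <<<24

    // select the transform, XOR with the oldest word, shift the word window
    assign DNew = {D1, D2, D3, (key_mode ? lin_key : lin_text) ^ D0};
endmodule
```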
