Curve25519 Field域2^255-19内的快速运算

mutourend

已于 2024-01-11 17:19:30 修改

阅读量1.4k

点赞数 3

分类专栏：基础理论文章标签：基础理论

于 2019-07-19 13:44:58 首次发布

本文链接：https://blog.csdn.net/mutourend/article/details/96476213

版权

基础理论专栏收录该内容

135 篇文章 29 订阅

订阅专栏

1. 引言

对于Curve25519，其Field域内的module F_p = 2²⁵⁵-19。
若采用常规的Montgomery reduce算法，其运算性能并不是最优的。

如要求某整数 u mod (2^255-19)，可将u整数用多项式做如下表示：
$u=\sum_{i}^{}u_ix^i，其中，u_i=n*2^{\left \lceil 25.5i \right \rceil}，n \in N$
设置x=1，通过对u多项式求值即可代表域F_p内的值。
u多项式的选取有两个限制条件：

1）多项式的阶应小，来限制多项式乘法时所需系数乘法数量的次数。通常，reduced-degree多项式的阶最高为9；
2）为了提升系数乘法的效率，每个系数 $u_i$ 均应为 $2^{\left \lceil 25.5i \right \rceil}$ 的整数倍。通常，reduced-coefficient多项式的系数满足 $u_i/2^{\left \lceil 25.5i \right \rceil} \in \{-2^{25}, -2^{25}+1,...,-1,0,1,...,2^{25}-1, 2^{25}\}$

结合以上两个显示条件，一个reduced-degree reduced-coefficient多项式可表示为：
$u_0+u_1x+...+u_9x^9，其中u_0/2^0,u_1/2^{26},u_2/2^{51},u_3/2^{77},u_4/2^{102},u_5/2^{128},u_6/2^{153},u_7/2^{179},u_8/2^{204},u_9/2^{230}均\in \{-2^{25}, -2^{25}+1,...,-1,0,1,...,2^{25}-1, 2^{25}\}$

该多项式所代表的值即为（x=1时）， $u_0+u_1+...+u_9$ 。
该值只在计算最后阶段才转换为最小值，不需要直接转换。

2. F_p=2²⁵⁵-19域内的加法和减法运算

将两个整数分别表示为u多项式和v多项式，则这两个整数的加法运算可表示为多项式加法运算u+v，减法运算可表示为多项式减法运算u-v。
若u和v均为reduced-degree reduced-coefficient多项式，则u+v或u-v分别有10次fp(64-bit floating-point)数字加法或减法系数运算。

3. F_p=2²⁵⁵-19域内的乘法运算

注意：对于多项式 $a=2^{255}x^{10}-19$ ，当取x=1时， $a=2^{255}-19$ ，对应的 $\quad mod \quad F_p =0$ ，因此可用多项式 $a=2^{255}x^{10}-19$ 代表域F_p内的零值。
进一步推论：
$\quad 2^{255}x^{10}-19\equiv 0$
$\Leftrightarrow 2^{255}x^{10}\equiv 19$
$\Leftrightarrow x^{10}\equiv 2^{-255}*19$

将两个整数分别表示为u多项式和v多项式，且u和v均为reduced-degree reduced-coefficient多项式。乘法运算会转换为多项式乘法运算b=u*v，多项式的阶最大为18。可利用上面的推论结论 $x^{10}\equiv 2^{-255}*19$ ，对系数 $x^{10},x^{11},...,x^{18}$ 分别进行替换，如取b多项式部分内容，有：
$b_8x^8+b_{18}x^{18}=b_8x^8+b_{18}x^{8}x^{10}\equiv(b_8+b_{18}(2^{-255}*19))x^8$

实际计算机执行时,充分利用64位存储器的性能，上述表示可进一步优化，参见论文《High-speed high-security signatures》和《Sandy2x: New Curve25519 Speed Records》，以51位标识，对于F_p域内的元素f，可表示为 $(f_0,f_1,f_2,f_3,f_4)，f=\sum_{i=0}^{4}f_i2^{51i}=f_0+2^{51}f_1+2^{102}f_2+2^{153}f_3+2^{204}f_4，f_i \in [0,...,2^{51}-1]$

注意x=1时有：
$x^{10}\equiv 2^{-255}*19$
$\Leftrightarrow 2^{-255}*19\equiv 1$
$\Leftrightarrow 2^{255}\equiv 19$

则f和g多项式乘积h=f*g module $2^{255} -19$ 可表示为：
$h_0=f_0g_0+19f_1g_4+19f_2g_3+19f_3g_2+19f_4g_1$
$h_1=f_0g_1+f_1g_0+19f_2g_4+19f_3g_3+19f_4g_2$
$h_2=f_0g_2+f_1g_1+f_2g_0+19f_3g_4+19f_4g_3$
$h_3=f_0g_3+f_1g_2+f_2g_1+f_3g_0+19f_4g_4$
$h_4=f_0g_4+f_1g_3+f_2g_2+f_3g_1+f_4g_0$

若f=g，则为求f²。curver25519-dalek中的pow2k即为求 $x^{2^k}$ ，代码中对进位的处理很赞：

/// Given `k > 0`, return `self^(2^k)`.
    pub fn pow2k(&self, mut k: u32) -> FieldElement51 {

        debug_assert!( k > 0 );

        /// Multiply two 64-bit integers with 128 bits of output.
        #[inline(always)]
        fn m(x: u64, y: u64) -> u128 { (x as u128) * (y as u128) }

        let mut a: [u64; 5] = self.0;

        loop {
            // Precondition: assume input limbs a[i] are bounded as
            //
            // a[i] < 2^(51 + b)
            //
            // where b is a real parameter measuring the "bit excess" of the limbs.

            // Precomputation: 64-bit multiply by 19.
            //
            // This fits into a u64 whenever 51 + b + lg(19) < 64.
            //
            // Since 51 + b + lg(19) < 51 + 4.25 + b
            //                       = 55.25 + b,
            // this fits if b < 8.75.
            let a3_19 = 19 * a[3];
            let a4_19 = 19 * a[4];

            // Multiply to get 128-bit coefficients of output.
            //
            // The 128-bit multiplications by 2 turn into 1 slr + 1 slrd each,
            // which doesn't seem any better or worse than doing them as precomputations
            // on the 64-bit inputs.
            let     c0: u128 = m(a[0],  a[0]) + 2*( m(a[1], a4_19) + m(a[2], a3_19) );
            let mut c1: u128 = m(a[3], a3_19) + 2*( m(a[0],  a[1]) + m(a[2], a4_19) );
            let mut c2: u128 = m(a[1],  a[1]) + 2*( m(a[0],  a[2]) + m(a[4], a3_19) );
            let mut c3: u128 = m(a[4], a4_19) + 2*( m(a[0],  a[3]) + m(a[1],  a[2]) );
            let mut c4: u128 = m(a[2],  a[2]) + 2*( m(a[0],  a[4]) + m(a[1],  a[3]) );

            // Same bound as in multiply:
            //    c[i] < 2^(102 + 2*b) * (1+i + (4-i)*19)
            //         < 2^(102 + lg(1 + 4*19) + 2*b)
            //         < 2^(108.27 + 2*b)
            //
            // The carry (c[i] >> 51) fits into a u64 when
            //    108.27 + 2*b - 51 < 64
            //    2*b < 6.73
            //    b < 3.365.
            //
            // So we require b < 3 to ensure this fits.
            debug_assert!(a[0] < (1 << 54));
            debug_assert!(a[1] < (1 << 54));
            debug_assert!(a[2] < (1 << 54));
            debug_assert!(a[3] < (1 << 54));
            debug_assert!(a[4] < (1 << 54));

            const LOW_51_BIT_MASK: u64 = (1u64 << 51) - 1;

            // Casting to u64 and back tells the compiler that the carry is bounded by 2^64, so
            // that the addition is a u128 + u64 rather than u128 + u128.
            c1 += ((c0 >> 51) as u64) as u128;
            a[0] = (c0 as u64) & LOW_51_BIT_MASK;

            c2 += ((c1 >> 51) as u64) as u128;
            a[1] = (c1 as u64) & LOW_51_BIT_MASK;

            c3 += ((c2 >> 51) as u64) as u128;
            a[2] = (c2 as u64) & LOW_51_BIT_MASK;

            c4 += ((c3 >> 51) as u64) as u128;
            a[3] = (c3 as u64) & LOW_51_BIT_MASK;

            let carry: u64 = (c4 >> 51) as u64;
            a[4] = (c4 as u64) & LOW_51_BIT_MASK;

            // To see that this does not overflow, we need a[0] + carry * 19 < 2^64.
            //
            // c4 < a2^2 + 2*a0*a4 + 2*a1*a3 + (carry from c3)
            //    < 2^(102 + 2*b + lg(5)) + 2^64.
            //
            // When b < 3 we get
            //
            // c4 < 2^110.33  so that carry < 2^59.33
            //
            // so that
            //
            // a[0] + carry * 19 < 2^51 + 19 * 2^59.33 < 2^63.58
            //
            // and there is no overflow.
            a[0] = a[0] + carry * 19;

            // Now a[1] < 2^51 + 2^(64 -51) = 2^51 + 2^13 < 2^(51 + epsilon).
            a[1] += a[0] >> 51;
            a[0] &= LOW_51_BIT_MASK;

            // Now all a[i] < 2^(51 + epsilon) and a = self^(2^k).

            k = k - 1;
            if k == 0 {
                break;
            }
        }

        FieldElement51(a)
    }

4. F_p=2²⁵⁵-19域内的倒数运算

curver25519-dalek中的实际实现主要时基于：
在这里插入图片描述

/// Given a nonzero field element, compute its inverse.
    ///
    /// The inverse is computed as self^(p-2), since
    /// x^(p-2)x = x^(p-1) = 1 (mod p).
    ///
    /// This function returns zero on input zero.
    pub fn invert(&self) -> FieldElement {
        // The bits of p-2 = 2^255 -19 -2 are 11010111111...11.
        //
        //                                 nonzero bits of exponent
        let (t19, t3) = self.pow22501();   // t19: 249..0 ; t3: 3,1,0
        let t20 = t19.pow2k(5);            // 254..5
        let t21 = &t20 * &t3;              // 254..5,3,1,0

        t21
    }

5. F_p=2²⁵⁵-19域内的module求模运算

实际本文第1~4节的代码，并没有做module运算，实际的mod Fp代码见reduce和to_bytes()代码：

/// Given 64-bit input limbs, reduce to enforce the bound 2^(51 + epsilon).
    #[inline(always)]
    fn reduce(mut limbs: [u64; 5]) -> FieldElement51 {
        const LOW_51_BIT_MASK: u64 = (1u64 << 51) - 1;

        // Since the input limbs are bounded by 2^64, the biggest
        // carry-out is bounded by 2^13.
        //
        // The biggest carry-in is c4 * 19, resulting in
        //
        // 2^51 + 19*2^13 < 2^51.0000000001
        //
        // Because we don't need to canonicalize, only to reduce the
        // limb sizes, it's OK to do a "weak reduction", where we
        // compute the carry-outs in parallel.

        let c0 = limbs[0] >> 51;
        let c1 = limbs[1] >> 51;
        let c2 = limbs[2] >> 51;
        let c3 = limbs[3] >> 51;
        let c4 = limbs[4] >> 51;

        limbs[0] &= LOW_51_BIT_MASK;
        limbs[1] &= LOW_51_BIT_MASK;
        limbs[2] &= LOW_51_BIT_MASK;
        limbs[3] &= LOW_51_BIT_MASK;
        limbs[4] &= LOW_51_BIT_MASK;

        limbs[0] += c4 * 19;
        limbs[1] += c0;
        limbs[2] += c1;
        limbs[3] += c2;
        limbs[4] += c3;

        FieldElement51(limbs)
    }


	 /// Serialize this `FieldElement51` to a 32-byte array.  The
    /// encoding is canonical.
    pub fn to_bytes(&self) -> [u8; 32] {
        // Let h = limbs[0] + limbs[1]*2^51 + ... + limbs[4]*2^204.
        //
        // Write h = pq + r with 0 <= r < p.
        //
        // We want to compute r = h mod p.
        //
        // If h < 2*p = 2^256 - 38,
        // then q = 0 or 1,
        //
        // with q = 0 when h < p
        //  and q = 1 when h >= p.
        //
        // Notice that h >= p <==> h + 19 >= p + 19 <==> h + 19 >= 2^255.
        // Therefore q can be computed as the carry bit of h + 19.

        // First, reduce the limbs to ensure h < 2*p.
        let mut limbs = FieldElement51::reduce(self.0).0;

        let mut q = (limbs[0] + 19) >> 51;
        q = (limbs[1] + q) >> 51;
        q = (limbs[2] + q) >> 51;
        q = (limbs[3] + q) >> 51;
        q = (limbs[4] + q) >> 51;

        // Now we can compute r as r = h - pq = r - (2^255-19)q = r + 19q - 2^255q

        limbs[0] += 19*q;

        // Now carry the result to compute r + 19q ...
        let low_51_bit_mask = (1u64 << 51) - 1;
        limbs[1] +=  limbs[0] >> 51;
        limbs[0] = limbs[0] & low_51_bit_mask;
        limbs[2] +=  limbs[1] >> 51;
        limbs[1] = limbs[1] & low_51_bit_mask;
        limbs[3] +=  limbs[2] >> 51;
        limbs[2] = limbs[2] & low_51_bit_mask;
        limbs[4] +=  limbs[3] >> 51;
        limbs[3] = limbs[3] & low_51_bit_mask;
        // ... but instead of carrying (limbs[4] >> 51) = 2^255q
        // into another limb, discard it, subtracting the value
        limbs[4] = limbs[4] & low_51_bit_mask;

        // Now arrange the bits of the limbs.
        let mut s = [0u8;32];
        s[ 0] =   limbs[0]        as u8;
        s[ 1] =  (limbs[0] >>  8) as u8;
        s[ 2] =  (limbs[0] >> 16) as u8;
        s[ 3] =  (limbs[0] >> 24) as u8;
        s[ 4] =  (limbs[0] >> 32) as u8;
        s[ 5] =  (limbs[0] >> 40) as u8;
        s[ 6] = ((limbs[0] >> 48) | (limbs[1] << 3)) as u8;
        s[ 7] =  (limbs[1] >>  5) as u8;
        s[ 8] =  (limbs[1] >> 13) as u8;
        s[ 9] =  (limbs[1] >> 21) as u8;
        s[10] =  (limbs[1] >> 29) as u8;
        s[11] =  (limbs[1] >> 37) as u8;
        s[12] = ((limbs[1] >> 45) | (limbs[2] << 6)) as u8;
        s[13] =  (limbs[2] >>  2) as u8;
        s[14] =  (limbs[2] >> 10) as u8;
        s[15] =  (limbs[2] >> 18) as u8;
        s[16] =  (limbs[2] >> 26) as u8;
        s[17] =  (limbs[2] >> 34) as u8;
        s[18] =  (limbs[2] >> 42) as u8;
        s[19] = ((limbs[2] >> 50) | (limbs[3] << 1)) as u8;
        s[20] =  (limbs[3] >>  7) as u8;
        s[21] =  (limbs[3] >> 15) as u8;
        s[22] =  (limbs[3] >> 23) as u8;
        s[23] =  (limbs[3] >> 31) as u8;
        s[24] =  (limbs[3] >> 39) as u8;
        s[25] = ((limbs[3] >> 47) | (limbs[4] << 4)) as u8;
        s[26] =  (limbs[4] >>  4) as u8;
        s[27] =  (limbs[4] >> 12) as u8;
        s[28] =  (limbs[4] >> 20) as u8;
        s[29] =  (limbs[4] >> 28) as u8;
        s[30] =  (limbs[4] >> 36) as u8;
        s[31] =  (limbs[4] >> 44) as u8;

        // High bit should be zero.
        debug_assert!((s[31] & 0b1000_0000u8) == 0u8);

        s
    }

6. F_p=2²⁵⁵-19域内的batch_invert运算

总体的算法原理可参见curve25519-dalek中Scalar的Montgomery inversion及batch_invert算法中的第四节batch_invert算法。只是在求倒数的过程中，scalar域中采用的是通用的montgomery算法，而2²⁵⁵-19域采用的是更快的求倒数算法（见本文第四节内容）。

	/// Given a slice of public `FieldElements`, replace each with its inverse.
    ///
    /// All input `FieldElements` **MUST** be nonzero.
    #[cfg(feature = "alloc")]
    pub fn batch_invert(inputs: &mut [FieldElement]) {
        // Montgomery’s Trick and Fast Implementation of Masked AES
        // Genelle, Prouff and Quisquater
        // Section 3.2

        let n = inputs.len();
        let mut scratch = vec![FieldElement::one(); n];

        // Keep an accumulator of all of the previous products
        let mut acc = FieldElement::one();

        // Pass through the input vector, recording the previous
        // products in the scratch space
        for (input, scratch) in inputs.iter().zip(scratch.iter_mut()) {
            *scratch = acc;
            acc = &acc * input;
        }

        // Compute the inverse of all products
        acc = acc.invert();

        // Pass through the vector backwards to compute the inverses
        // in place
        for (input, scratch) in inputs.iter_mut().rev().zip(scratch.into_iter().rev()) {
            let tmp = &acc * input;
            *input = &acc * &scratch;
            acc = tmp;
        }
    }

参考资料：
[1] 论文《Curve25519: new Diffie-Hellman speed records》
[2] 论文《Verifying Curve25519 Software》
[3] 论文《High-speed high-security signatures》
[4] 论文《Sandy2x: New Curve25519 Speed Records》
[5] https://briansmith.org/ecc-inversion-addition-chains-01#curve25519_field_inversion