1. 引言
对于Curve25519,其Field域内的module Fp = 2255-19。
若采用常规的Montgomery reduce算法,其运算性能并不是最优的。
如要求某整数 u mod (2^255-19)
,可将u整数用多项式做如下表示:
u
=
∑
i
u
i
x
i
,其中,
u
i
=
n
∗
2
⌈
25.5
i
⌉
,
n
∈
N
u=\sum_{i}^{}u_ix^i,其中,u_i=n*2^{\left \lceil 25.5i \right \rceil},n \in N
u=∑iuixi,其中,ui=n∗2⌈25.5i⌉,n∈N
设置x=1,通过对u多项式求值即可代表域Fp内的值。
u多项式的选取有两个限制条件:
- 1)多项式的阶应小,来限制多项式乘法时所需系数乘法数量的次数。通常,reduced-degree多项式的阶最高为9;
- 2)为了提升系数乘法的效率,每个系数 u i u_i ui均应为 2 ⌈ 25.5 i ⌉ 2^{\left \lceil 25.5i \right \rceil} 2⌈25.5i⌉的整数倍。通常,reduced-coefficient多项式的系数满足 u i / 2 ⌈ 25.5 i ⌉ ∈ { − 2 25 , − 2 25 + 1 , . . . , − 1 , 0 , 1 , . . . , 2 25 − 1 , 2 25 } u_i/2^{\left \lceil 25.5i \right \rceil} \in \{-2^{25}, -2^{25}+1,...,-1,0,1,...,2^{25}-1, 2^{25}\} ui/2⌈25.5i⌉∈{−225,−225+1,...,−1,0,1,...,225−1,225}
结合以上两个显示条件,一个reduced-degree reduced-coefficient多项式可表示为:
u
0
+
u
1
x
+
.
.
.
+
u
9
x
9
,其中
u
0
/
2
0
,
u
1
/
2
26
,
u
2
/
2
51
,
u
3
/
2
77
,
u
4
/
2
102
,
u
5
/
2
128
,
u
6
/
2
153
,
u
7
/
2
179
,
u
8
/
2
204
,
u
9
/
2
230
均
∈
{
−
2
25
,
−
2
25
+
1
,
.
.
.
,
−
1
,
0
,
1
,
.
.
.
,
2
25
−
1
,
2
25
}
u_0+u_1x+...+u_9x^9,其中u_0/2^0,u_1/2^{26},u_2/2^{51},u_3/2^{77},u_4/2^{102},u_5/2^{128},u_6/2^{153},u_7/2^{179},u_8/2^{204},u_9/2^{230}均\in \{-2^{25}, -2^{25}+1,...,-1,0,1,...,2^{25}-1, 2^{25}\}
u0+u1x+...+u9x9,其中u0/20,u1/226,u2/251,u3/277,u4/2102,u5/2128,u6/2153,u7/2179,u8/2204,u9/2230均∈{−225,−225+1,...,−1,0,1,...,225−1,225}
该多项式所代表的值即为(x=1时),
u
0
+
u
1
+
.
.
.
+
u
9
u_0+u_1+...+u_9
u0+u1+...+u9。
该值只在计算最后阶段才转换为最小值,不需要直接转换。
2. Fp=2255-19域内的加法和减法运算
将两个整数分别表示为u多项式和v多项式,则这两个整数的加法运算可表示为多项式加法运算u+v,减法运算可表示为多项式减法运算u-v。
若u和v均为reduced-degree reduced-coefficient多项式,则u+v或u-v分别有10次fp(64-bit floating-point)数字加法或减法系数运算。
3. Fp=2255-19域内的乘法运算
注意:对于多项式
a
=
2
255
x
10
−
19
a=2^{255}x^{10}-19
a=2255x10−19,当取x=1时,
a
=
2
255
−
19
a=2^{255}-19
a=2255−19, 对应的
a
m
o
d
F
p
=
0
a \quad mod \quad F_p =0
amodFp=0,因此可用多项式
a
=
2
255
x
10
−
19
a=2^{255}x^{10}-19
a=2255x10−19代表域Fp内的零值。
进一步推论:
2
255
x
10
−
19
≡
0
\quad 2^{255}x^{10}-19\equiv 0
2255x10−19≡0
⇔
2
255
x
10
≡
19
\Leftrightarrow 2^{255}x^{10}\equiv 19
⇔2255x10≡19
⇔
x
10
≡
2
−
255
∗
19
\Leftrightarrow x^{10}\equiv 2^{-255}*19
⇔x10≡2−255∗19
将两个整数分别表示为u多项式和v多项式,且u和v均为reduced-degree reduced-coefficient多项式。乘法运算会转换为多项式乘法运算b=u*v,多项式的阶最大为18。可利用上面的推论结论
x
10
≡
2
−
255
∗
19
x^{10}\equiv 2^{-255}*19
x10≡2−255∗19,对系数
x
10
,
x
11
,
.
.
.
,
x
18
x^{10},x^{11},...,x^{18}
x10,x11,...,x18分别进行替换,如取b多项式部分内容,有:
b
8
x
8
+
b
18
x
18
=
b
8
x
8
+
b
18
x
8
x
10
≡
(
b
8
+
b
18
(
2
−
255
∗
19
)
)
x
8
b_8x^8+b_{18}x^{18}=b_8x^8+b_{18}x^{8}x^{10}\equiv(b_8+b_{18}(2^{-255}*19))x^8
b8x8+b18x18=b8x8+b18x8x10≡(b8+b18(2−255∗19))x8
实际计算机执行时,充分利用64位存储器的性能,上述表示可进一步优化,参见论文《High-speed high-security signatures》和《Sandy2x: New Curve25519 Speed Records》,以51位标识,对于F_p域内的元素f,可表示为 ( f 0 , f 1 , f 2 , f 3 , f 4 ) , f = ∑ i = 0 4 f i 2 51 i = f 0 + 2 51 f 1 + 2 102 f 2 + 2 153 f 3 + 2 204 f 4 , f i ∈ [ 0 , . . . , 2 51 − 1 ] (f_0,f_1,f_2,f_3,f_4),f=\sum_{i=0}^{4}f_i2^{51i}=f_0+2^{51}f_1+2^{102}f_2+2^{153}f_3+2^{204}f_4,f_i \in [0,...,2^{51}-1] (f0,f1,f2,f3,f4),f=∑i=04fi251i=f0+251f1+2102f2+2153f3+2204f4,fi∈[0,...,251−1]
注意x=1时有:
x
10
≡
2
−
255
∗
19
x^{10}\equiv 2^{-255}*19
x10≡2−255∗19
⇔
2
−
255
∗
19
≡
1
\Leftrightarrow 2^{-255}*19\equiv 1
⇔2−255∗19≡1
⇔
2
255
≡
19
\Leftrightarrow 2^{255}\equiv 19
⇔2255≡19
则f和g多项式乘积h=f*g module
2
255
−
19
2^{255} -19
2255−19可表示为:
h
0
=
f
0
g
0
+
19
f
1
g
4
+
19
f
2
g
3
+
19
f
3
g
2
+
19
f
4
g
1
h_0=f_0g_0+19f_1g_4+19f_2g_3+19f_3g_2+19f_4g_1
h0=f0g0+19f1g4+19f2g3+19f3g2+19f4g1
h
1
=
f
0
g
1
+
f
1
g
0
+
19
f
2
g
4
+
19
f
3
g
3
+
19
f
4
g
2
h_1=f_0g_1+f_1g_0+19f_2g_4+19f_3g_3+19f_4g_2
h1=f0g1+f1g0+19f2g4+19f3g3+19f4g2
h
2
=
f
0
g
2
+
f
1
g
1
+
f
2
g
0
+
19
f
3
g
4
+
19
f
4
g
3
h_2=f_0g_2+f_1g_1+f_2g_0+19f_3g_4+19f_4g_3
h2=f0g2+f1g1+f2g0+19f3g4+19f4g3
h
3
=
f
0
g
3
+
f
1
g
2
+
f
2
g
1
+
f
3
g
0
+
19
f
4
g
4
h_3=f_0g_3+f_1g_2+f_2g_1+f_3g_0+19f_4g_4
h3=f0g3+f1g2+f2g1+f3g0+19f4g4
h
4
=
f
0
g
4
+
f
1
g
3
+
f
2
g
2
+
f
3
g
1
+
f
4
g
0
h_4=f_0g_4+f_1g_3+f_2g_2+f_3g_1+f_4g_0
h4=f0g4+f1g3+f2g2+f3g1+f4g0
若f=g,则为求f2。curver25519-dalek中的pow2k
即为求
x
2
k
x^{2^k}
x2k,代码中对进位的处理很赞:
/// Given `k > 0`, return `self^(2^k)`.
pub fn pow2k(&self, mut k: u32) -> FieldElement51 {
debug_assert!( k > 0 );
/// Multiply two 64-bit integers with 128 bits of output.
#[inline(always)]
fn m(x: u64, y: u64) -> u128 { (x as u128) * (y as u128) }
let mut a: [u64; 5] = self.0;
loop {
// Precondition: assume input limbs a[i] are bounded as
//
// a[i] < 2^(51 + b)
//
// where b is a real parameter measuring the "bit excess" of the limbs.
// Precomputation: 64-bit multiply by 19.
//
// This fits into a u64 whenever 51 + b + lg(19) < 64.
//
// Since 51 + b + lg(19) < 51 + 4.25 + b
// = 55.25 + b,
// this fits if b < 8.75.
let a3_19 = 19 * a[3];
let a4_19 = 19 * a[4];
// Multiply to get 128-bit coefficients of output.
//
// The 128-bit multiplications by 2 turn into 1 slr + 1 slrd each,
// which doesn't seem any better or worse than doing them as precomputations
// on the 64-bit inputs.
let c0: u128 = m(a[0], a[0]) + 2*( m(a[1], a4_19) + m(a[2], a3_19) );
let mut c1: u128 = m(a[3], a3_19) + 2*( m(a[0], a[1]) + m(a[2], a4_19) );
let mut c2: u128 = m(a[1], a[1]) + 2*( m(a[0], a[2]) + m(a[4], a3_19) );
let mut c3: u128 = m(a[4], a4_19) + 2*( m(a[0], a[3]) + m(a[1], a[2]) );
let mut c4: u128 = m(a[2], a[2]) + 2*( m(a[0], a[4]) + m(a[1], a[3]) );
// Same bound as in multiply:
// c[i] < 2^(102 + 2*b) * (1+i + (4-i)*19)
// < 2^(102 + lg(1 + 4*19) + 2*b)
// < 2^(108.27 + 2*b)
//
// The carry (c[i] >> 51) fits into a u64 when
// 108.27 + 2*b - 51 < 64
// 2*b < 6.73
// b < 3.365.
//
// So we require b < 3 to ensure this fits.
debug_assert!(a[0] < (1 << 54));
debug_assert!(a[1] < (1 << 54));
debug_assert!(a[2] < (1 << 54));
debug_assert!(a[3] < (1 << 54));
debug_assert!(a[4] < (1 << 54));
const LOW_51_BIT_MASK: u64 = (1u64 << 51) - 1;
// Casting to u64 and back tells the compiler that the carry is bounded by 2^64, so
// that the addition is a u128 + u64 rather than u128 + u128.
c1 += ((c0 >> 51) as u64) as u128;
a[0] = (c0 as u64) & LOW_51_BIT_MASK;
c2 += ((c1 >> 51) as u64) as u128;
a[1] = (c1 as u64) & LOW_51_BIT_MASK;
c3 += ((c2 >> 51) as u64) as u128;
a[2] = (c2 as u64) & LOW_51_BIT_MASK;
c4 += ((c3 >> 51) as u64) as u128;
a[3] = (c3 as u64) & LOW_51_BIT_MASK;
let carry: u64 = (c4 >> 51) as u64;
a[4] = (c4 as u64) & LOW_51_BIT_MASK;
// To see that this does not overflow, we need a[0] + carry * 19 < 2^64.
//
// c4 < a2^2 + 2*a0*a4 + 2*a1*a3 + (carry from c3)
// < 2^(102 + 2*b + lg(5)) + 2^64.
//
// When b < 3 we get
//
// c4 < 2^110.33 so that carry < 2^59.33
//
// so that
//
// a[0] + carry * 19 < 2^51 + 19 * 2^59.33 < 2^63.58
//
// and there is no overflow.
a[0] = a[0] + carry * 19;
// Now a[1] < 2^51 + 2^(64 -51) = 2^51 + 2^13 < 2^(51 + epsilon).
a[1] += a[0] >> 51;
a[0] &= LOW_51_BIT_MASK;
// Now all a[i] < 2^(51 + epsilon) and a = self^(2^k).
k = k - 1;
if k == 0 {
break;
}
}
FieldElement51(a)
}
4. Fp=2255-19域内的倒数运算
curver25519-dalek中的实际实现主要时基于:
/// Given a nonzero field element, compute its inverse.
///
/// The inverse is computed as self^(p-2), since
/// x^(p-2)x = x^(p-1) = 1 (mod p).
///
/// This function returns zero on input zero.
pub fn invert(&self) -> FieldElement {
// The bits of p-2 = 2^255 -19 -2 are 11010111111...11.
//
// nonzero bits of exponent
let (t19, t3) = self.pow22501(); // t19: 249..0 ; t3: 3,1,0
let t20 = t19.pow2k(5); // 254..5
let t21 = &t20 * &t3; // 254..5,3,1,0
t21
}
5. Fp=2255-19域内的module求模运算
实际本文第1~4节的代码,并没有做module运算,实际的mod Fp
代码见reduce和to_bytes()代码:
/// Given 64-bit input limbs, reduce to enforce the bound 2^(51 + epsilon).
#[inline(always)]
fn reduce(mut limbs: [u64; 5]) -> FieldElement51 {
const LOW_51_BIT_MASK: u64 = (1u64 << 51) - 1;
// Since the input limbs are bounded by 2^64, the biggest
// carry-out is bounded by 2^13.
//
// The biggest carry-in is c4 * 19, resulting in
//
// 2^51 + 19*2^13 < 2^51.0000000001
//
// Because we don't need to canonicalize, only to reduce the
// limb sizes, it's OK to do a "weak reduction", where we
// compute the carry-outs in parallel.
let c0 = limbs[0] >> 51;
let c1 = limbs[1] >> 51;
let c2 = limbs[2] >> 51;
let c3 = limbs[3] >> 51;
let c4 = limbs[4] >> 51;
limbs[0] &= LOW_51_BIT_MASK;
limbs[1] &= LOW_51_BIT_MASK;
limbs[2] &= LOW_51_BIT_MASK;
limbs[3] &= LOW_51_BIT_MASK;
limbs[4] &= LOW_51_BIT_MASK;
limbs[0] += c4 * 19;
limbs[1] += c0;
limbs[2] += c1;
limbs[3] += c2;
limbs[4] += c3;
FieldElement51(limbs)
}
/// Serialize this `FieldElement51` to a 32-byte array. The
/// encoding is canonical.
pub fn to_bytes(&self) -> [u8; 32] {
// Let h = limbs[0] + limbs[1]*2^51 + ... + limbs[4]*2^204.
//
// Write h = pq + r with 0 <= r < p.
//
// We want to compute r = h mod p.
//
// If h < 2*p = 2^256 - 38,
// then q = 0 or 1,
//
// with q = 0 when h < p
// and q = 1 when h >= p.
//
// Notice that h >= p <==> h + 19 >= p + 19 <==> h + 19 >= 2^255.
// Therefore q can be computed as the carry bit of h + 19.
// First, reduce the limbs to ensure h < 2*p.
let mut limbs = FieldElement51::reduce(self.0).0;
let mut q = (limbs[0] + 19) >> 51;
q = (limbs[1] + q) >> 51;
q = (limbs[2] + q) >> 51;
q = (limbs[3] + q) >> 51;
q = (limbs[4] + q) >> 51;
// Now we can compute r as r = h - pq = r - (2^255-19)q = r + 19q - 2^255q
limbs[0] += 19*q;
// Now carry the result to compute r + 19q ...
let low_51_bit_mask = (1u64 << 51) - 1;
limbs[1] += limbs[0] >> 51;
limbs[0] = limbs[0] & low_51_bit_mask;
limbs[2] += limbs[1] >> 51;
limbs[1] = limbs[1] & low_51_bit_mask;
limbs[3] += limbs[2] >> 51;
limbs[2] = limbs[2] & low_51_bit_mask;
limbs[4] += limbs[3] >> 51;
limbs[3] = limbs[3] & low_51_bit_mask;
// ... but instead of carrying (limbs[4] >> 51) = 2^255q
// into another limb, discard it, subtracting the value
limbs[4] = limbs[4] & low_51_bit_mask;
// Now arrange the bits of the limbs.
let mut s = [0u8;32];
s[ 0] = limbs[0] as u8;
s[ 1] = (limbs[0] >> 8) as u8;
s[ 2] = (limbs[0] >> 16) as u8;
s[ 3] = (limbs[0] >> 24) as u8;
s[ 4] = (limbs[0] >> 32) as u8;
s[ 5] = (limbs[0] >> 40) as u8;
s[ 6] = ((limbs[0] >> 48) | (limbs[1] << 3)) as u8;
s[ 7] = (limbs[1] >> 5) as u8;
s[ 8] = (limbs[1] >> 13) as u8;
s[ 9] = (limbs[1] >> 21) as u8;
s[10] = (limbs[1] >> 29) as u8;
s[11] = (limbs[1] >> 37) as u8;
s[12] = ((limbs[1] >> 45) | (limbs[2] << 6)) as u8;
s[13] = (limbs[2] >> 2) as u8;
s[14] = (limbs[2] >> 10) as u8;
s[15] = (limbs[2] >> 18) as u8;
s[16] = (limbs[2] >> 26) as u8;
s[17] = (limbs[2] >> 34) as u8;
s[18] = (limbs[2] >> 42) as u8;
s[19] = ((limbs[2] >> 50) | (limbs[3] << 1)) as u8;
s[20] = (limbs[3] >> 7) as u8;
s[21] = (limbs[3] >> 15) as u8;
s[22] = (limbs[3] >> 23) as u8;
s[23] = (limbs[3] >> 31) as u8;
s[24] = (limbs[3] >> 39) as u8;
s[25] = ((limbs[3] >> 47) | (limbs[4] << 4)) as u8;
s[26] = (limbs[4] >> 4) as u8;
s[27] = (limbs[4] >> 12) as u8;
s[28] = (limbs[4] >> 20) as u8;
s[29] = (limbs[4] >> 28) as u8;
s[30] = (limbs[4] >> 36) as u8;
s[31] = (limbs[4] >> 44) as u8;
// High bit should be zero.
debug_assert!((s[31] & 0b1000_0000u8) == 0u8);
s
}
6. Fp=2255-19域内的batch_invert运算
总体的算法原理可参见curve25519-dalek中Scalar的Montgomery inversion及batch_invert算法中的第四节batch_invert算法。只是在求倒数的过程中,scalar域中采用的是通用的montgomery算法,而2255-19域采用的是更快的求倒数算法(见本文第四节内容)。
/// Given a slice of public `FieldElements`, replace each with its inverse.
///
/// All input `FieldElements` **MUST** be nonzero.
#[cfg(feature = "alloc")]
pub fn batch_invert(inputs: &mut [FieldElement]) {
// Montgomery’s Trick and Fast Implementation of Masked AES
// Genelle, Prouff and Quisquater
// Section 3.2
let n = inputs.len();
let mut scratch = vec![FieldElement::one(); n];
// Keep an accumulator of all of the previous products
let mut acc = FieldElement::one();
// Pass through the input vector, recording the previous
// products in the scratch space
for (input, scratch) in inputs.iter().zip(scratch.iter_mut()) {
*scratch = acc;
acc = &acc * input;
}
// Compute the inverse of all products
acc = acc.invert();
// Pass through the vector backwards to compute the inverses
// in place
for (input, scratch) in inputs.iter_mut().rev().zip(scratch.into_iter().rev()) {
let tmp = &acc * input;
*input = &acc * &scratch;
acc = tmp;
}
}
参考资料:
[1] 论文《Curve25519: new Diffie-Hellman speed records》
[2] 论文《Verifying Curve25519 Software》
[3] 论文《High-speed high-security signatures》
[4] 论文 《Sandy2x: New Curve25519 Speed Records》
[5] https://briansmith.org/ecc-inversion-addition-chains-01#curve25519_field_inversion