
Stephen R. Dussé

Burton S. Kaliski Jr.

RSA Data Security Inc.

Redwood City, CA

 

 

3.1 Montgomery's method

We now outline Montgomery's method for modular arithmetic. Readers familiar with the topic may skip to Sec. 3.2.

In Montgomery's method we represent residue classes in an unusual way and redefine modular arithmetic within this representation. Specifically, let N be an integer (the modulus) and let R be an integer relatively prime to N. We represent the residue class A mod N as AR mod N and redefine modular multiplication as

$$A' \times_N B' = A'B'R^{-1} \bmod N.$$

It is not hard to verify that Montgomery multiplication in the new representation is isomorphic to modular multiplication in the ordinary one:

$$(AR \bmod N) \times_N (BR \bmod N) = (AB)R \bmod N.$$

We can similarly redefine modular exponentiation as repeated Montgomery multiplication. This "Montgomery exponentiation" can be computed with all the usual modular exponentiation speedups. To compute the ordinary modular exponentiation $C = M^E \bmod N$, we compute M' = MR mod N (ordinary modular reduction), $C' = (M')^E$ (Montgomery exponentiation), and $C = C'R^{-1} \bmod N$ (Montgomery reduction).
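The three steps can be sketched in Python as follows. This is only an illustration of the definitions: the function names are invented here, the Montgomery product is computed directly from its definition with an explicit inverse of R (which a real implementation avoids), and Python 3.8 or later is assumed for `pow(r, -1, n)`.

```python
def to_montgomery(a, n, r):
    """Map the residue a mod n to its Montgomery representation a*r mod n."""
    return (a * r) % n


def montgomery_multiply(a_m, b_m, n, r):
    """Reference definition of the Montgomery product: a' * b' * r^-1 mod n."""
    r_inv = pow(r, -1, n)  # modular inverse; needs gcd(r, n) == 1
    return (a_m * b_m * r_inv) % n


def montgomery_exponentiate(m, e, n, r):
    """Compute m**e mod n (e >= 0) using only Montgomery products."""
    m_m = to_montgomery(m, n, r)   # ordinary modular reduction
    c_m = to_montgomery(1, n, r)   # Montgomery form of 1
    for bit in bin(e)[2:]:         # left-to-right square-and-multiply
        c_m = montgomery_multiply(c_m, c_m, n, r)
        if bit == "1":
            c_m = montgomery_multiply(c_m, m_m, n, r)
    # Montgomery reduction: a Montgomery product with 1 strips the factor r.
    return montgomery_multiply(c_m, 1, n, r)
```

For example, `montgomery_exponentiate(5, 117, 221, 256)` agrees with Python's built-in `pow(5, 117, 221)`.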

The practicality of Montgomery's method rests on the following nice theorem, which leads directly to an algorithm for Montgomery multiplication.

Theorem 1 (Montgomery, 1985)

Let N and R be relatively prime integers, and let $N' = -N^{-1} \bmod R$. Then for all integers T, (T+MN)/R is an integer satisfying

$$(T + MN)/R \equiv TR^{-1} \pmod{N} \qquad (1)$$

where M = TN' mod R.

Proof. Equation 1 is straightforward, since MN is a multiple of N. The fact that (T+MN)/R is an integer can be shown by substituting M: T + MN is congruent to T + TN'N, and hence to T - T = 0, modulo R.

If we choose the right R (say, a power of the base in which we represent multiple-precision integers), then division by R and reduction modulo R are trivial. With such an R, Montgomery reduction is no more expensive than two multiple-precision products, and we can make it even easier.
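As a small worked instance of Theorem 1 (the numbers are chosen here purely for illustration and do not appear in the original), take N = 7, R = 10 and T = 25:

```latex
\begin{aligned}
N^{-1} \bmod R &= 3 \quad (7 \cdot 3 = 21 \equiv 1 \pmod{10}), & N' &= -3 \bmod 10 = 7,\\
M &= TN' \bmod R = 175 \bmod 10 = 5, & (T + MN)/R &= (25 + 35)/10 = 6,\\
R^{-1} \bmod N &= 5 \quad (10 \cdot 5 = 50 \equiv 1 \pmod{7}), & TR^{-1} \bmod N &= 125 \bmod 7 = 6.
\end{aligned}
```

Both routes give 6, and the only division performed on the left-hand route is the exact division by R.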

3.2 Computing the Montgomery product

We now describe our algorithm for the Montgomery product. For the discussion we will let b be the base in which multiple-precision integers are represented. That is, we will represent an integer A as a sequence of digits $(a_{n-1}, \ldots, a_1, a_0)$ where

$$A = \sum_{i=0}^{n-1} a_i b^i. \qquad (2)$$

We will further require that all inputs to our algorithms can be represented in n base-b digits, and that $R = b^n$. In Sec. 3.3 we determine limitations on the individual digits $a_i$.

We derive our algorithm by successive improvements, beginning with the following algorithm taken directly from Theorem 1. (We note that our algorithm does not "normalize" its output to the range [0, N-1]. Sec. 3.3 shows why.)
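A minimal Python sketch of this starting point, transcribing Theorem 1 step by step. It assumes Python 3.8+ big integers in place of digit arrays; the function name and variable names are illustrative and are not taken from the original listing.

```python
def montgomery_product_basic(a, b, n, r):
    """Return (a*b + m*n) / r, which is congruent to a*b*r^-1 modulo n.

    The result is deliberately not normalized to the range [0, n-1].
    """
    n_prime = (-pow(n, -1, r)) % r   # N' = -N^-1 mod R
    t = a * b                        # full multiple-precision product
    m = (t * n_prime) % r            # M = T N' mod R
    u = t + m * n                    # now a multiple of R by Theorem 1
    assert u % r == 0
    return u // r                    # exact division by R
```

With R a power of the digit base, the `% r` and `// r` steps amount to discarding or keeping digits, which is what makes the method attractive.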

 

Improvement 1. Instead of computing all of M at once, let us compute one digit $m_i$ at a time, add $m_i N b^i$ to T, and repeat. The resulting T may not be the same as in the original algorithm, but the effect of adding multiples of N will be the same: namely, to make T a multiple of R. This is essentially the approach Montgomery gives for multiple-precision integers. We note that this change allows us to compute $n_0' = -n_0^{-1} \bmod b$ instead of N'.
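The digit-at-a-time reduction might look like the sketch below. It makes several assumptions that are not in the original: unsigned digits, a nonnegative T held as a Python integer, and a single-digit constant taken to be the negated inverse of the low digit of N modulo b. All names are hypothetical.

```python
def montgomery_reduce_by_digits(t, n, b, num_digits):
    """Return t / b**num_digits modulo n, clearing one digit of t per step."""
    n0_prime = (-pow(n % b, -1, b)) % b    # single-digit constant, assumed form
    for i in range(num_digits):
        t_i = (t // b**i) % b              # current digit i of T
        m_i = (t_i * n0_prime) % b         # one digit of M
        t += m_i * n * b**i                # makes digit i of T zero
    return t // b**num_digits              # exact: T is now a multiple of R
```

Each pass leaves the already-cleared low digits untouched, so after the loop T is divisible by R = b^n even though the individual multiples of N differ from those in the one-shot version.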

 

Improvement 2. Now let us interleave multiplication and reduction. We note that Montgomery reduction is intrinsically a right-to-left procedure. That is, $m_i$ depends only on the digits $t_0, \ldots, t_i$. So we can begin adding the multiple $m_i N$ to T as soon as those digits are known. This results in the following algorithm:
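A hedged Python sketch of the interleaved form, again with unsigned digits and Python integers standing in for digit arrays. The names `b_val` (the multiplicand B) and `base` (the digit base b) are chosen here only to avoid a clash between the two uses of the letter b.

```python
def montgomery_product_interleaved(a, b_val, n, base, num_digits):
    """Interleave the partial products a_i * B with single-digit reductions.

    Returns a value congruent to a * b_val * base**(-num_digits) modulo n,
    without normalizing it to [0, n-1]. Requires 0 <= a < base**num_digits.
    """
    n0_prime = (-pow(n % base, -1, base)) % base
    t = 0
    for i in range(num_digits):
        a_i = (a // base**i) % base            # digit i of A
        t += a_i * b_val                       # add the partial product
        m_i = ((t % base) * n0_prime) % base   # depends only on the low digit
        t = (t + m_i * n) // base              # clear low digit, shift right
    return t
```

Each iteration performs one shift-and-add step, so the low digit of T is consumed as soon as it is final.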

 

Improvement 3. At this point we can begin to observe a potential difficulty for the DSP56000. The operation $T \leftarrow (T + a_i B + m_i N)/b$, the basic shift-and-add operation, is likely to break down into the following single-precision operations:

4.1    do   

4.2         for  to n-1

4.3               do  

4.4                   

4.5                        -right shift

4.6                          -(initial)

These operations involve not only n single-precision multiplications but also n right shifts. On many processors the "high part" and the "low part" of accumulators are separately addressable and the right shift can be accomplished with move instructions.

This is true also on the DSP56000, but such shifting takes longer than a multiplication on the DSP56000, which motivates us to minimize the number of right shifts. Happily, there is a good way to avoid right shifts, and that is with the convolution-sum method of multiplication. In this method, instead of performing operations like $T \leftarrow (T + a_i B)/b$, we perform operations like $t_k \leftarrow \sum_{i=0}^{k} a_i b_{k-i}$. These involve k+1 multiplications but only one shift. The fact that Montgomery reduction is intrinsically right-to-left helps us again, and leads to our final algorithm.
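One way to picture a convolution-sum Montgomery product is the column-by-column sketch below. It is an illustration under stated assumptions rather than the original listing: digits are unsigned, the wide accumulator is a plain Python integer, and the helper `to_digits` and all other names are hypothetical.

```python
def to_digits(x, base, num_digits):
    """Little-endian list of unsigned base-`base` digits of x."""
    return [(x // base**i) % base for i in range(num_digits)]


def montgomery_product_convolution(a, b_val, n, base, num_digits):
    """Column-wise Montgomery product with one right shift per column."""
    a_d = to_digits(a, base, num_digits)
    b_d = to_digits(b_val, base, num_digits)
    n_d = to_digits(n, base, num_digits)
    n0_prime = (-pow(n_d[0], -1, base)) % base
    m_d = [0] * num_digits
    t_d = [0] * num_digits
    acc = 0                                    # wide accumulator
    for k in range(num_digits):                # low columns: choose m_k
        for i in range(k + 1):
            acc += a_d[i] * b_d[k - i]
        for i in range(k):
            acc += m_d[i] * n_d[k - i]
        m_d[k] = ((acc % base) * n0_prime) % base
        acc += m_d[k] * n_d[0]                 # low digit of acc is now zero
        acc //= base                           # the only shift in this column
    for k in range(num_digits, 2 * num_digits):  # high columns: store digits
        for i in range(k - num_digits + 1, num_digits):
            acc += a_d[i] * b_d[k - i] + m_d[i] * n_d[k - i]
        t_d[k - num_digits] = acc % base
        acc //= base
    low = sum(d * base**i for i, d in enumerate(t_d))
    return low + acc * base**num_digits        # keep any final carry
```

The sketch performs 2n shifts and n stores of result digits in total, in line with the linear counts discussed for the final algorithm, while the number of digit multiplications stays quadratic.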

 

We expect that our final algorithm will generally be faster than the interleaved shift-and-add version on most processors, because our algorithm has fewer right shifts, $O(n)$ versus $O(n^2)$. It also has fewer stores, again $O(n)$ versus $O(n^2)$. (The number of other operations, namely fetches, multiplies, and adds, is essentially the same for both algorithms.)

However, we note that our algorithm has more complex loop control. We also note that our algorithm accumulates intermediate results that are a factor of 2n larger in magnitude than those in the shift-and-add algorithm, so we need a larger accumulator and addition instructions that can handle the larger accumulator.

On most processors we can implement the larger accumulator with multiple registers and the additions involving it with add-with-carry instructions. The DSP56000 is especially well suited since its accumulator is eight bits longer than the largest product its ALU can produce. Thus even without multiple registers or add-with-carry instructions the DSP56000 can handle the intermediate results for n up to 128.

The extent to which our algorithm is faster depends mostly on the relative speeds of multiplication and shifting. If multiplication is relatively slow then changes in the number of shifts will have an insignificant effect on total execution time. For example, on the Intel 80386 multiplication is an order of magnitude slower than shifting and we have observed what appears to be at best a 10 percent improvement in execution time. But on the DSP56000 the improvement is manifold.

We conclude with a couple of remarks. First, we can derive a Montgomery squaring algorithm in the usual way that is asymptotically 25 percent faster than the alternative.

Second, we can precompute $n_0'$ once during a Montgomery exponentiation since it depends only on the modulus N. Computing $n_0'$ by a general modular inverse algorithm such as extended Euclidean GCD would not be all that expensive, since b is small. We have found instead (or rediscovered?) a very nice way to compute the modular inverse in the special case that $n_0$ is odd and b is a power of 2:

MODULAR-INVERSE (x, b)

1        -Computes $x^{-1} \bmod b$ for x odd and b a power of 2.

 

The correctness of MODULAR-INVERSE can be verified by induction on i with the hypothesis that the intermediate value $y_i$ satisfies $x y_i \equiv 1 \pmod{2^i}$.
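A bit-by-bit inverse of this kind can be sketched in Python as below. The body is a reconstruction from the induction idea (extend an inverse modulo 2^(i-1) to one modulo 2^i by conditionally setting one more bit), so it should be read as an assumption about the procedure rather than a transcription of it; the function name is hypothetical.

```python
def modular_inverse_pow2(x, b):
    """Return x^-1 mod b for odd x and b a power of two (b >= 2)."""
    if x % 2 == 0 or b < 2 or b & (b - 1):
        raise ValueError("x must be odd and b a power of two")
    lg_b = b.bit_length() - 1
    y = 1                                   # x * y == 1 (mod 2) since x is odd
    for i in range(2, lg_b + 1):
        # Invariant: x * y == 1 (mod 2**(i-1)), so x * y mod 2**i is either
        # 1 or 1 + 2**(i-1). In the second case, set bit i-1 of y.
        if (x * y) % (1 << i) > (1 << (i - 1)):
            y += 1 << (i - 1)
    return y
```

The constant needed by the reduction would then be obtained by negating this inverse modulo b, for example `(-modular_inverse_pow2(n0, b)) % b` for an odd low digit `n0`.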

3.3 Representation of multiple-precision integers

We have not yet defined "base b representation" for the DSP56000, so we do so now.

The DSP56000 has a signed multiply instruction that multiplies two 24-bit two's complement integers and adds their product to a 56-bit accumulator. Thus the logical choices for "base b representation" are a sequence of 24-bit signed digits and a sequence of 23-bit unsigned digits. A sequence of 24-bit unsigned digits is rather awkward with a signed multiply instruction. Given that the 23-bit unsigned representation of an integer would generally be longer than the 24-bit signed representation, we chose the signed representation.

Thus the digits $a_i$ in Eq. 2 satisfy $-2^{23} \le a_i < 2^{23}$.
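To make the signed-digit form concrete, the following Python sketch converts between an integer and a little-endian list of two's complement digits. The helper names and the `bits` parameter are invented for this illustration; 24 is the DSP56000 word size described above.

```python
def to_signed_digits(x, num_digits, bits=24):
    """Split x into num_digits signed digits, each in [-2**(bits-1), 2**(bits-1))."""
    base = 1 << bits
    half = base >> 1
    digits = []
    for _ in range(num_digits):
        d = x % base
        if d >= half:
            d -= base              # reinterpret the digit as two's complement
        digits.append(d)
        x = (x - d) // base        # exact: x - d is a multiple of base
    if x != 0:
        raise OverflowError("value does not fit in the requested digits")
    return digits


def from_signed_digits(digits, bits=24):
    """Inverse of to_signed_digits: sum of d_i * 2**(bits*i)."""
    return sum(d << (bits * i) for i, d in enumerate(digits))
```

With 4-bit digits and two digits, for instance, the representable range is -136 to 119, so the largest positive value stays below R/2 = 128.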

We now prove our claim that MONTGOMERY-PRODUCT need not adjust its result to the range [0, N-1] by showing that the redundant range [-N, N-1] can be maintained through all intermediate calculations.

Theorem 2

Let $R = b^n$ where b, n > 0, and let A, B and N be n-digit, multiple-precision integers in the signed representation, where N > 0. If A and B are in the range [-N, N-1] then for all n-digit multiple-precision integers M, (AB+MN)/R is in the range [-N, N-1].

Proof: We begin by proving two inequalities:

N + M < R

-N + M > -R

The first follows from the observation that the largest positive n-digit integer in the signed representation is less than R/2. The second follows from the fact that the largest positive n-digit integer and the smallest negative n-digit integer differ by less than R.

The theorem follows, since

$$(AB + MN)/R \le (N + M)N/R < N$$

$$(AB + MN)/R \ge (-N + M)N/R > -N$$

We note that a similar property holds in the unsigned representation, but it requires the further condition that N < R/4.
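The bound of Theorem 2 can be checked exhaustively for tiny parameters. The sketch below is only a sanity check under assumptions made here: an even base b, signed digits in the range -b/2 to b/2 - 1, and hypothetical function names. It tests the strict inequalities -NR < AB + MN < NR used in the proof.

```python
from itertools import product


def signed_range(b, n):
    """Smallest and largest n-digit integers with signed base-b digits."""
    scale = sum(b**i for i in range(n))
    return -(b // 2) * scale, (b // 2 - 1) * scale


def check_theorem2(b, n):
    """Return True if -N*R < A*B + M*N < N*R for every admissible A, B, M, N."""
    r = b**n
    lo, hi = signed_range(b, n)
    for big_n in range(1, hi + 1):                       # modulus N > 0
        for a, b_val in product(range(-big_n, big_n), repeat=2):
            for m in range(lo, hi + 1):                  # any n-digit M
                t = a * b_val + m * big_n
                if not (-big_n * r < t < big_n * r):
                    return False
    return True
```

Running `check_theorem2(4, 2)` covers every case with two base-4 signed digits; it says nothing about larger sizes, for which the proof is needed.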

 

 

 
